All Questions
89 questions
0 votes · 2 answers · 104 views
How can I return average of array for each row in a PySpark dataframe?
Say I have data like the following:
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType, StructField, StructType, LongType
spark = SparkSession.builder.appName(&...
0 votes · 0 answers · 44 views
Pyspark convert the structure of the data
I have a BigQuery table. Sample data in JSON format is
[{"cust_id": 12345,
"Seq_column": [1,2,3,4,5]}]
When I fetch the data using PySpark and save it back to the BigQuery table, the ...
1 vote · 1 answer · 403 views
array(struct) to array(map)—PySpark
I have a df with the following schema,
g_hut: string
date: date
arr_data: array
    element: struct
        Id: string
        Q_Id: string
        Q_Type: string
I want to convert the arr_data ...
5 votes · 2 answers · 124 views
How to join two tables with aggregation
I have two pyspark dataframes:
one is:
name start end
bob 1 3
john 5 8
and second is:
day outcome
1 a
2 c
3 d
4 a
5 e
6 c
7 u
8 l
And I need to concat days ...
1 vote · 1 answer · 168 views
Use java_method in Spark to obtain values from Java array
Using Spark's java_method (and getISOCountries from java.util.Locale), I try to access the list of all countries. I get no error, but the returned value looks like [Ljava.lang.String;@5a68a908. When I ...
0 votes · 1 answer · 821 views
How To Unnest And Pivot Multiple JSON-like Structures Inside A PySpark DataFrame
I'm trying to transform raw "Event" data coming from a Google Analytics account using PySpark. Each "Event" record has a field called "event_params", which contains sub-...
2 votes · 1 answer · 310 views
PySpark: How to split the array based on value in a PySpark dataframe, and also reflect the split in the corresponding array-type column
I have a PySpark dataframe:
ids                                        names
[1, 1, 2, 3, 1, 2, 3, 7, 5]                [a, b, c, l, s, o, c, d, e]
[3, 8, 9, 3, 9, 0, 0, 6, 7, 8]             [s, l, h, p, q, g, c, d, p, s]
[9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7]    [q, ...
0 votes · 1 answer · 776 views
Max value from array of string in pyspark
I am very new to Spark and am trying to find the max value from an array of strings, but I am getting errors. I tried a couple of things like creating a dataframe/split/using lit, but I am facing further errors. Can anyone please ...
0 votes · 1 answer · 1k views
Check if an array contains values from a list and add list as columns
I have a data_frame as below,
Id    Col1
1     [["A", "B", "E", "F"]]
2     [["A", "D", "E"]]
I have a list as ["A", "B", ...
1 vote · 1 answer · 875 views
How to extract data from a JSON key/value pair, if the key also has the actual value
I am trying to parse/flatten JSON data using PySpark DataFrame API.
The challenge is, I need to extract one data element ('Id') from the actual Key/Attribute, and also filter only the rows having '...
0 votes · 1 answer · 446 views
PySpark: How to find the top n most frequently occurring values in an array column?
For the sample data below, wondering how I can find out the most frequently occurring value in the column colour. The data type of colour is WrappedArray. There could be n number of elements in the ...
1 vote · 1 answer · 435 views
Convert dataframe array type column to string without losing element names/schema
In my dataframe, I need to convert an array type column to string without losing the element names/schema for the data in the column.
My dataframe schema:
root
|-- accountId: string (nullable = true)
...
2 votes · 3 answers · 3k views
Check if an array of array contains an array
Here are two columns of my dataframe (df):
A        B
["a"]    [["a"], ["b"]]
["c"]    [["a"], ["b"]]
I want to create an array that tells whether the ...
0
votes
1
answer
76
views
Spark dataframe create explode with order
I have data like below
Input Df
+----------+-------------+--------+--------+--------+
|SALES_NO  |SALE_LINE_NUM| CODE_1 | CODE_3 | CODE_2 |
+----------+---------------------------...
1 vote · 1 answer · 210 views
Array of string using pandas_udf
This function returns array of int:
from pyspark.sql import functions as F
import pandas as pd
@F.pandas_udf('array<int>')
def pudf(x: pd.Series, y: pd.Series) -> pd.Series:
return pd....