
All Questions

0 votes · 2 answers · 104 views

How can I return average of array for each row in a PySpark dataframe?

Say I have data like the following: from pyspark.sql import SparkSession from pyspark.sql.types import ArrayType, DoubleType, StructField, StructType, LongType spark = SparkSession.builder.appName(...
T_d · 63
0 votes · 0 answers · 44 views

PySpark: convert the structure of the data

I have a BigQuery table. Sample data in JSON format is [{"cust_id": 12345, "Seq_column": [1,2,3,4,5]}] When I fetch the data using pyspark and save it back to the BigQuery table, the ...
Shravan K · 105
1 vote · 1 answer · 403 views

array(struct) to array(map)—PySpark

I have a df with the following schema, g_hut: string date: date arr_data:array element:struct Id:string Q_Id:string Q_Type:string I want to convert the arr_data ...
i.n.n.m · 3,046
5 votes · 2 answers · 124 views

How to join two tables with aggregation

I have two pyspark dataframes: one is: name start end bob 1 3 john 5 8 and second is: day outcome 1 a 2 c 3 d 4 a 5 e 6 c 7 u 8 l And I need to concat days ...
user453575457
1 vote · 1 answer · 168 views

Use java_method in Spark to obtain values from Java array

Using Spark's java_method (and getISOCountries from java.util.Locale), I try to access the list of all countries. I get no error, but the returned value looks like [Ljava.lang.String;@5a68a908. When I ...
ZygD · 24.6k
0 votes · 1 answer · 821 views

How To Unnest And Pivot Multiple JSON-like Structures Inside A PySpark DataFrame

I'm trying to transform raw "Event" data coming from a Google Analytics account using PySpark. Each "Event" record has a field called "event_params", which contains sub-...
aknickel
2 votes · 1 answer · 310 views

PySpark: How to split an array based on value in a PySpark dataframe, and reflect the split in a corresponding array-type column

I have a Pyspark dataframe : ids names [1, 1, 2, 3, 1, 2, 3, 7, 5] [a, b, c, l, s, o, c, d, e] [3, 8, 9, 3, 9, 0, 0, 6, 7, 8] [s, l, h, p, q, g, c, d, p, s] [9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7] [q, ...
praveen kumar
0 votes · 1 answer · 776 views

Max value from array of string in pyspark

I am very new to spark, trying to find max value from array of string but getting errors. Tried a couple of things like creating dataframe/split/using lit but facing further errors. Can anyone please ...
Shailendra Kumar
0 votes · 1 answer · 1k views

Check if an array contains values from a list and add list as columns

I have a data_frame as below, Id Col1 1 [["A", "B", "E", "F"]] 2 [["A", "D", "E"]] I have a list as ["A", "B", ...
Jessie · 313
1 vote · 1 answer · 875 views

How to extract data from a JSON key/value pair, if the key also has the actual value

I am trying to parse/flatten JSON data using PySpark DataFrame API. The challenge is, I need to extract one data element ('Id') from the actual Key/Attribute, and also filter only the rows having '...
Kris · 109
0 votes · 1 answer · 446 views

PySpark: How to find the top n most frequently occurring values in an array column?

For the sample data below, wondering how I can find out the most frequently occurring value in the column colour. The data type of colour is WrappedArray. There could be n number of elements in the ...
user4046073
1 vote · 1 answer · 435 views

Convert dataframe array type column to string without losing element names/schema

In my dataframe, I need to convert an array type column to string without losing the element names/schema for the data in the column. My dataframe schema: root |-- accountId: string (nullable = true) ...
bda · 422
2 votes · 3 answers · 3k views

Check if an array of array contains an array

Here are two columns of my dataframe (df): A B ["a"] [["a"], ["b"]] ["c"] [["a"], ["b"]] I want to create an array that tells whether the ...
hexagon · 73
0 votes · 1 answer · 76 views

Spark dataframe create explode with order

I have a data like below Input Df +----------+-----------------------------------+--------------| |SALES_NO |SALE_LINE_NUM | CODE_1 | CODE_3 | CODE_2 | +----------+---------------------------...
Meclier.023
1 vote · 1 answer · 210 views

Array of string using pandas_udf

This function returns array of int: from pyspark.sql import functions as F import pandas as pd @F.pandas_udf('array<int>') def pudf(x: pd.Series, y: pd.Series) -> pd.Series: return pd....
ZygD · 24.6k
