How can I return average of array for each row in a PySpark dataframe?

Question

Say I have data like the following:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType, StructField, StructType, LongType

spark = SparkSession.builder.appName("AveragingArray").getOrCreate()

# Define schema
schema = StructType([
    StructField("id", LongType(), True),
    StructField("arrays", ArrayType(DoubleType()), True)
])

# Sample data
data = [
    (1, [18.2, 50.9]),
    (2, [1.0, 3.4, 5.5]),
    (3, [10.0, 20.1, 30.0])
]

# Create DataFrame
df = spark.createDataFrame(data, schema=schema)

I want a simple way to convert the arrays item to be the average of the values in the array. For example, the first row would be 34.55.

I had tried something like this:

from pyspark.sql import functions as F

def average_arrays(df):
    # Explode the array to individual elements
    exploded_df = df.withColumn("arrays", F.explode("arrays"))
    
    # Group by the original identifier and calculate the average then return as int
    result_df = exploded_df.groupBy("id").agg(F.avg("arrays").alias("arrays"))
    df = df.withColumn('arrays', col('arrays')[0].cast('int')) 
    
    return result_df

However, it still just returns the original array. I'd hugely appreciate any help here, thanks in advance.

lihao · Accepted Answer · 2024-08-26 14:02:05Z

0

I can't understand your mean. I try to run your code. I find that it returns the average value of the array. My results are as follows.

+---+------------------+
| id|            arrays|
+---+------------------+
|  1|             34.55|
|  3|20.033333333333335|
|  2|3.3000000000000003|
+---+------------------+

edited Aug 26, 2024 at 14:02

lihao

7834 silver badges11 bronze badges

answered Aug 26, 2024 at 2:08

A code sheep

1

1

This should be a comment, not an answer.
– President James K. Polk
Commented Aug 26, 2024 at 14:02

Add a comment |

mrsrinivas · Accepted Answer · 2024-08-28 14:50:25Z

The return value in average_arrays() is incorect

The code should be

def average_arrays(df):
    # Explode the array to individual elements
    exploded_df = df.withColumn("arrays", F.explode("arrays"))
    
    # Group by the original identifier and calculate the average then return as int
    result_df = exploded_df.groupBy("id").agg(F.avg("arrays").alias("arrays"))

    return result_df

Aditional tip: I will rewrite the job in the following way for simplicity.


df.createOrReplaceTempView("array_table")


# Execute the SQL query
result_df = spark.sql("""
    SELECT
        id,
        values,
        -- Calculate the average of the array elements
        aggregate(values, 0.0D, (acc, x) -> acc + x) / size(values) AS average
    FROM array_table
""")

result_df.show(truncate=False)

aggregate doc

aggregate(expr, start, merge, finish) - Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.

Collectives™ on Stack Overflow

How can I return average of array for each row in a PySpark dataframe?

2 Answers 2

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Related