All Questions
96 questions
1 vote · 1 answer · 97 views
How to Exclude Rows Based on a Dynamic Condition in a PySpark Window Function?
I am working with PySpark and need to create a window function that calculates the median of the previous 5 values in a column. However, I want to exclude rows where a specific column feature is True. ...
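A possible starting point (an untested sketch for Spark 3.1+, assuming an ordering column "ts" and a value column "value" alongside the boolean "feature" column from the question; note it takes the approximate median of the non-excluded rows among the previous five rows, which may differ from the previous five non-excluded values):

from pyspark.sql import functions as F, Window

# previous five rows, excluding the current one
w = Window.orderBy("ts").rowsBetween(-5, -1)

df = df.withColumn(
    "median_prev_5",
    F.percentile_approx(
        # rows with feature == True become NULL and are ignored by the aggregate
        F.when(~F.col("feature"), F.col("value")),
        0.5,
    ).over(w),
)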
0 votes · 1 answer · 415 views
Count distinct operation for a window with partition by, order by and range between in Spark Scala?
I have a dataframe which looks like this -
+----------------------------------------------------+
|DeviceId | TimeString | UnixTimestamp | Group |
+-----------------------------------------------...
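Spark does not support COUNT(DISTINCT ...) over a window, so a common workaround is size(collect_set(...)). A minimal sketch, assuming UnixTimestamp is in seconds and a trailing 300-second range is wanted (the actual range is not visible in the excerpt):

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("DeviceId")
            .orderBy("UnixTimestamp")
            .rangeBetween(-300, 0))  # trailing 5 minutes, adjust as needed

# number of distinct Group values seen inside the range frame
df = df.withColumn("distinct_groups", F.size(F.collect_set("Group").over(w)))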
0 votes · 1 answer · 85 views
Max interval between occurrences of two values - Spark
I have a Spark dataframe in the format below. I wanted to calculate the max travel streak to other countries from India before returning to India.
I have created a flag to indicate if India was ...
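One way to compute such streaks (a rough sketch under assumed column names "traveller", "trip_date" and "country"; only "India" comes from the question): start a new streak id at every India visit with a running sum, then count the non-India rows per streak.

from pyspark.sql import functions as F, Window

w = Window.partitionBy("traveller").orderBy("trip_date")

streaks = (df
    .withColumn("is_india", (F.col("country") == "India").cast("int"))
    # every India row increments the streak id for the rows that follow it
    .withColumn("streak_id", F.sum("is_india").over(w)))

max_streak = (streaks
    .filter(F.col("country") != "India")
    .groupBy("traveller", "streak_id").count()
    .groupBy("traveller").agg(F.max("count").alias("max_streak_abroad")))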
0 votes · 0 answers · 141 views
Find min price within a date range in a PySpark dataframe
I have a DataFrame with the following data (simulated).
+-----+----------+-------------+------------+
|Price|Date      |InvoiceNumber|Product_Type|
+-----+----------+-------------+------------+
|12.65|12/30/2021|INV_19984    |AXN UN1234  |
|18.78|1/23/2022 |INV_200174   |AXN UN1234  |
|11.78|1/25/2022 |INV_200173   |...
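A typical pattern for a date-ranged minimum (a sketch only; the trailing 30-day window and the per-Product_Type partition are assumptions, since the excerpt does not state the range):

from pyspark.sql import functions as F, Window

# rangeBetween needs a numeric ordering column, so convert Date to epoch seconds
df = df.withColumn("date_sec", F.to_timestamp("Date", "M/d/yyyy").cast("long"))

w = (Window.partitionBy("Product_Type")
            .orderBy("date_sec")
            .rangeBetween(-30 * 86400, 0))  # trailing 30 days, inclusive

df = df.withColumn("min_price_30d", F.min("Price").over(w))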
0 votes · 2 answers · 349 views
Keep only modified rows in PySpark
I need to clean a dataset by filtering only modified rows (compared to the previous one) based on certain fields (in the example below we only consider city and sport, for each id), keeping only the ...
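The usual approach is lag() plus a filter (a minimal sketch, assuming an ordering column "order_col" and tracked fields "city" and "sport" per "id"):

from pyspark.sql import functions as F, Window

w = Window.partitionBy("id").orderBy("order_col")

df_clean = (df
    .withColumn("prev_city", F.lag("city").over(w))
    .withColumn("prev_sport", F.lag("sport").over(w))
    # keep the first row per id (lag is NULL) and rows where a tracked field changed
    .filter(
        F.col("prev_city").isNull()
        | (F.col("city") != F.col("prev_city"))
        | (F.col("sport") != F.col("prev_sport"))
    )
    .drop("prev_city", "prev_sport"))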
1 vote · 2 answers · 101 views
Create column that shows previous earned salary (groupedBy)
I'm trying to build a replacement for pandas' .shift() function. I'm pretty close, but need the final touch to make everything work, which would be implementing the right form of groupBy.
I have a ...
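The PySpark equivalent of a grouped shift is lag() over a partitioned window (a sketch with assumed column names "employee_id", "year" and "salary"):

from pyspark.sql import functions as F, Window

w = Window.partitionBy("employee_id").orderBy("year")

# previous earned salary within each employee's history
df = df.withColumn("previous_salary", F.lag("salary", 1).over(w))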
1 vote · 1 answer · 181 views
Speeding up a custom aggregate for window function
I am looking for alternatives to this code that would run faster:
inp = spark.createDataFrame([
    ["1", "A", 7, 2],
    ["1", "A", 14, 3],
    ["1",...
1 vote · 2 answers · 1k views
Using rangeBetween considering months rather than days in PySpark
I'm looking for a way to translate this chunk of SQL code into PySpark syntax.
SELECT MEAN(some_value) OVER (
    ORDER BY yyyy_mm_dd
    RANGE BETWEEN INTERVAL 3 MONTHS PRECEDING AND CURRENT ROW
) AS ...
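rangeBetween only accepts numeric offsets, so one workaround is to order by a month index instead of the date itself (a sketch, accurate to calendar-month granularity rather than the exact day-level interval of the SQL version; "yyyy_mm_dd" is assumed to be a date column):

from pyspark.sql import functions as F, Window

df = df.withColumn(
    "month_idx",
    F.year("yyyy_mm_dd") * 12 + F.month("yyyy_mm_dd"),
)

# 3 calendar months preceding through the current row
w = Window.orderBy("month_idx").rangeBetween(-3, 0)

df = df.withColumn("mean_some_value", F.avg("some_value").over(w))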
1 vote · 2 answers · 468 views
PySpark - assigning group id based on group member count
I have a dataframe where I want to assign an id for each window partition and for every 5 rows. Meaning, the id should increase/change when the partition has a different value or when the number of rows in ...
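A common pattern is row_number() plus integer division (a sketch; "part" and "order_col" are assumed partition/ordering columns, and the second step that makes the id change across partitions is optional):

from pyspark.sql import functions as F, Window

w = Window.partitionBy("part").orderBy("order_col")

df = df.withColumn(
    "sub_group",
    F.floor((F.row_number().over(w) - 1) / 5),  # 0,0,0,0,0,1,1,... per partition
)

# turn (partition value, sub_group) into one increasing id across partitions
df = df.withColumn(
    "group_id",
    F.dense_rank().over(Window.orderBy("part", "sub_group")),
)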
0 votes · 1 answer · 534 views
Conditional forward fill values in Spark dataframe
I have a Spark dataframe where I need to create a window partition column ("desired_output").
I simply want this conditional column to equal the "flag" column (0) until the first ...
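Since the condition is cut off in the excerpt, here is only the generic forward-fill building block such answers usually start from (a sketch with assumed partition/order columns "id" and "date"):

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("id").orderBy("date")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# carry the last non-null flag value forward
df = df.withColumn("desired_output", F.last("flag", ignorenulls=True).over(w))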
1 vote · 1 answer · 1k views
Backfill nulls with non-null values in Spark dataframe
I have a Spark dataframe where I need to create a window partition column ("desired_output"). This column needs to be backfilled with non-null values.
I am looking to backfill the first non-null ...
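Backfilling is usually done with first(..., ignorenulls=True) over a frame that looks forward (a sketch; "id", "date" and "value" are assumed column names):

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("id").orderBy("date")
            .rowsBetween(Window.currentRow, Window.unboundedFollowing))

# take the first non-null value at or after the current row
df = df.withColumn("desired_output", F.first("value", ignorenulls=True).over(w))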
1 vote · 1 answer · 575 views
How to retain the preceding updated row values in PySpark and use it in the next row calculation?
The conditions below need to be applied to the RANK and RANKA columns.
Input table:
Condition for RANK column:
IF RANK == 0: then RANK = previous RANK value + 1;
else: RANK = RANK
Condition ...
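The RANK rule is recursive (each updated value feeds the next row), which lag() alone cannot express. One workaround is to start a group at every non-zero RANK and add the row offset inside the group (a sketch with an assumed ordering column "seq"; RANKA, whose condition is cut off, would need its own analogous step):

from pyspark.sql import functions as F, Window

w = Window.orderBy("seq")
w_grp = Window.partitionBy("grp").orderBy("seq")

df = (df
    # each non-zero RANK starts a new group of rows
    .withColumn("grp", F.sum((F.col("RANK") != 0).cast("int")).over(w))
    .withColumn("offset", F.row_number().over(w_grp) - 1)
    # the non-zero RANK that opened the group is the base value
    .withColumn("base", F.first(F.when(F.col("RANK") != 0, F.col("RANK"))).over(w_grp))
    .withColumn("RANK_updated", F.coalesce(F.col("base"), F.lit(0)) + F.col("offset"))
    .drop("grp", "offset", "base"))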
1 vote · 1 answer · 217 views
Cumulative count of a specified status by ID
I have a dataframe in PySpark, similar to this:
+---+------+-----------+
|id |status|date |
+---+------+-----------+
|1 |1 |01-01-2022 |
|1 |0 |02-01-2022 |
|1 |0 |03-01-2022 |
|1 ...
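A running conditional count does the job (a sketch; the date format in the sample looks like dd-MM-yyyy, which is an assumption):

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("id")
            .orderBy(F.to_date("date", "dd-MM-yyyy"))
            .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# cumulative number of status == 1 rows per id
df = df.withColumn("status_count", F.sum((F.col("status") == 1).cast("int")).over(w))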
0 votes · 1 answer · 555 views
Count occurrences of a filtered value in a window
I have the following dataframe:
| Timestamp | info |
+-------------------+----------+
|2016-01-01 17:54:30| 8 |
|2016-02-01 12:16:18| 2 |
|2016-03-01 12:17:57| 1 |
|...
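The standard trick is a conditional sum over the window (a sketch; the target value 8 and the cumulative frame are assumptions, since the excerpt does not show them):

from pyspark.sql import functions as F, Window

target = 8
w = (Window.orderBy("Timestamp")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# running count of rows whose info equals the target value
df = df.withColumn(
    "target_count",
    F.sum(F.when(F.col("info") == target, 1).otherwise(0)).over(w),
)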
4 votes · 3 answers · 833 views
How to create a map column with rolling window aggregates per each key
Problem description
I need help with a pyspark.sql function that will create a new variable aggregating records over a specified Window() into a map of key-value pairs.
Reproducible Data
df = spark....
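One way to build such a map column (a sketch under assumed column names "user", "ts" and "key"; adjust the frame bounds for a bounded rolling window, e.g. rowsBetween(-2, 0)):

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("user").orderBy("ts")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df = (df
    # every key seen so far in the window ...
    .withColumn("keys_so_far", F.collect_list("key").over(w))
    # ... folded into a map of key -> occurrence count
    .withColumn(
        "key_counts",
        F.map_from_entries(F.expr("""
            transform(
                array_distinct(keys_so_far),
                k -> struct(k, size(filter(keys_so_far, x -> x = k)))
            )
        """)),
    )
    .drop("keys_so_far"))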