All Questions

1 vote · 1 answer · 97 views

How to Exclude Rows Based on a Dynamic Condition in a PySpark Window Function?

I am working with PySpark and need to create a window function that calculates the median of the previous 5 values in a column. However, I want to exclude rows where a specific column feature is True. ...
user29963762
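One possible approach (a sketch, not from the thread; the id/ts/value/feature column names and Spark ≥ 3.1 for percentile_approx are assumptions): null out the excluded values so the aggregate, which ignores nulls, skips them inside the frame.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1, 10.0, False), (1, 2, 20.0, True), (1, 3, 30.0, False),
     (1, 4, 40.0, False), (1, 5, 50.0, False), (1, 6, 60.0, False)],
    ["id", "ts", "value", "feature"],
)

# the previous 5 rows, excluding the current one
w = Window.partitionBy("id").orderBy("ts").rowsBetween(-5, -1)

out = (df
    # rows with feature == True become null and are ignored by the aggregate
    .withColumn("masked", F.when(~F.col("feature"), F.col("value")))
    .withColumn("median_prev5", F.percentile_approx("masked", 0.5).over(w)))

Note that the frame still spans five physical rows; excluded rows shrink the sample rather than extending the lookback further.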
0 votes · 1 answer · 415 views

Count distinct operation for a window with partitionBy, orderBy and rangeBetween in Spark Scala?

I have a dataframe which looks like this - +----------------------------------------------------+ |DeviceId | TimeString | UnixTimestamp | Group | +-----------------------------------------------...
aamirr · 181
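countDistinct is not supported over a window, so the usual workaround is collect_set plus size. A minimal sketch in PySpark (the Scala version is line-for-line analogous; the 600-second range width is an assumption):

from pyspark.sql import functions as F, Window

# distinct Group values per device over a trailing 10-minute range
w = (Window.partitionBy("DeviceId")
     .orderBy("UnixTimestamp")
     .rangeBetween(-600, 0))

df = df.withColumn("distinct_groups", F.size(F.collect_set("Group").over(w)))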
0 votes · 1 answer · 85 views

Max interval between occurrences of two values - Spark

I have a Spark dataframe in the format below. I want to calculate the max travel streak to other countries from India before returning to India. I have created a flag to indicate if India was ...
Abishek · 857
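A common pattern for this kind of streak problem, sketched with assumed country/date columns: a running sum over the India flag segments the rows into trips, and the longest non-India segment is the answer.

from pyspark.sql import functions as F, Window

w = Window.orderBy("date")  # add partitionBy(...) if there are several travellers

flagged = (df
    .withColumn("is_india", (F.col("country") == "India").cast("int"))
    # every India row opens a new segment
    .withColumn("trip_id", F.sum("is_india").over(w)))

max_streak = (flagged
    .filter(F.col("country") != "India")
    .groupBy("trip_id").count()
    .agg(F.max("count").alias("max_travel_streak")))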
0 votes · 0 answers · 141 views

PySpark: find min price within a date range in a dataframe

I have a DataFrame with the following data (simulated). Price Date InvoiceNumber Product_Type 12.65 12/30/2021 INV_19984 AXN UN1234 18.78 1/23/2022 INV_200174 AXN UN1234 11.78 1/25/2022 INV_200173 ...
Ramaraju.d · 1,365
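If the goal is the minimum Price per Product_Type over a trailing date window, a rangeBetween frame keyed on a day number works. A sketch; the 30-day width is an assumption, and the date format matches the sample above:

from pyspark.sql import functions as F, Window

df = (df
    .withColumn("date", F.to_date("Date", "M/d/yyyy"))
    # express the order key in days so rangeBetween can take day offsets
    .withColumn("day_nr", F.datediff("date", F.lit("1970-01-01").cast("date"))))

w = (Window.partitionBy("Product_Type")
     .orderBy("day_nr")
     .rangeBetween(-30, 0))  # the current date and the 30 days before it

df = df.withColumn("min_price_30d", F.min("Price").over(w))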
0 votes · 2 answers · 349 views

Keep only modified rows in PySpark

I need to clean a dataset filtering only modified rows (compared to the previous one) based on certain fields (in the example below we only consider cities and sports, for each id), keeping only the ...
Jresearcher
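A sketch of the standard lag-and-compare idiom, assuming columns id, ts, city, and sport (only city/sport decide whether a row counts as modified):

from pyspark.sql import functions as F, Window

w = Window.partitionBy("id").orderBy("ts")

prev_city = F.lag("city").over(w)
prev_sport = F.lag("sport").over(w)

df_clean = (df
    .withColumn("changed",
        prev_city.isNull()                    # first row per id is always kept
        | (F.col("city") != prev_city)
        | (F.col("sport") != prev_sport))
    .filter("changed")
    .drop("changed"))

If city or sport can themselves be null, replace != with a negated eqNullSafe comparison, since != treats nulls as unknown.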
1 vote · 2 answers · 101 views

Create column that shows previous earned salary (groupedBy)

I'm trying to build a replacement for pandas' .shift() function. I'm pretty close, but need the final touch to make everything work. This would be implementing the right form of groupBy. I have a ...
DataDude · 156
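The direct PySpark counterpart of a grouped .shift() is lag over a partitioned, ordered window; a minimal sketch with assumed employee/year/salary columns:

from pyspark.sql import functions as F, Window

w = Window.partitionBy("employee").orderBy("year")

# null on each employee's first row, exactly like .shift()
df = df.withColumn("prev_salary", F.lag("salary", 1).over(w))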
1 vote · 1 answer · 181 views

Speeding up a custom aggregate for window function

I am looking for alternatives to this code which would run faster: inp = spark.createDataFrame([ ["1", "A", 7, 2], ["1", "A", 14, 3], ["1",...
ironv · 1,058
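Without the full snippet it is hard to say what the custom aggregate computes, but the usual speed-up is to replace a Python UDF over a collected frame with Spark's built-in higher-order functions (F.aggregate needs Spark ≥ 3.1), which stay in the JVM. A sketch of that pattern with placeholder names:

from pyspark.sql import functions as F, Window

w = Window.partitionBy("grp").orderBy("ts").rowsBetween(-3, -1)

df = df.withColumn("frame_vals", F.collect_list("x").over(w))
df = df.withColumn(
    "rolling_custom",
    # fold the frame natively instead of shipping it to a Python UDF
    F.aggregate("frame_vals", F.lit(0).cast("long"),
                lambda acc, v: acc + v))  # swap in the custom combine step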
1 vote · 2 answers · 1k views

Using rangeBetween considering months rather than days in PySpark

I'm looking for how to translate this chunk of SQL code into PySpark syntax: SELECT MEAN(some_value) OVER ( ORDER BY yyyy_mm_dd RANGE BETWEEN INTERVAL 3 MONTHS PRECEDING AND CURRENT ROW ) AS ...
Tinkerbell
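The DataFrame API's rangeBetween only accepts numeric offsets, but Spark SQL allows calendar intervals in a range frame when the ORDER BY column is a date, so the query can run unchanged through spark.sql. A sketch, assuming yyyy_mm_dd is a date column rather than a string:

df.createOrReplaceTempView("t")

out = spark.sql("""
    SELECT *,
           MEAN(some_value) OVER (
               ORDER BY yyyy_mm_dd
               RANGE BETWEEN INTERVAL 3 MONTHS PRECEDING AND CURRENT ROW
           ) AS mean_3m
    FROM t
""")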
1 vote · 2 answers · 468 views

PySpark - assigning group id based on group member count

I have a dataframe where I want to assign an id for each window partition and for each 5 rows. Meaning, the id should increase/change when the partition has a different value or the number of rows in ...
ArVe · 13
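One way to sketch this: number the rows inside each partition, then integer-divide by 5 so the block changes every five rows and at every partition boundary (the grp/ts names are assumptions):

from pyspark.sql import functions as F, Window

w = Window.partitionBy("grp").orderBy("ts")

df = (df
    .withColumn("rn", F.row_number().over(w))
    # rows 1-5 -> block 0, rows 6-10 -> block 1, ... within each partition
    .withColumn("block", F.floor((F.col("rn") - 1) / 5)))

# one running id over (grp, block); the unpartitioned window is fine for
# small data but funnels everything through a single task on large data
df = df.withColumn("id", F.dense_rank().over(Window.orderBy("grp", "block")))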
0 votes · 1 answer · 534 views

Conditional forward fill values in Spark dataframe

I have a Spark dataframe where I need to create a window partition column ("desired_output"). I simply want this conditional column to equal the "flag" column (0) until the first ...
bbal20 · 133
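Reading the truncated description as "desired_output follows flag until flag first becomes 1, then stays 1", a running max over an expanding frame does it; a sketch with assumed id/ts columns:

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("id").orderBy("ts")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# with flag in {0, 1}, the running max carries the first 1 forward
df = df.withColumn("desired_output", F.max("flag").over(w))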
1 vote · 1 answer · 1k views

Back fill nulls with non-null values in Spark dataframe

I have a Spark dataframe where I need to create a window partition column ("desired_output"). This column needs to be back filled with non-null values. I am looking to backfill the first non-null ...
bbal20 · 133
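The usual backfill idiom is first(..., ignorenulls=True) over a forward-looking frame; a minimal sketch with assumed id/ts/value columns:

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("id").orderBy("ts")
     .rowsBetween(Window.currentRow, Window.unboundedFollowing))

# for each row, the first non-null value at or after it
df = df.withColumn("desired_output", F.first("value", ignorenulls=True).over(w))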
1 vote · 1 answer · 575 views

How to retain the preceding updated row values in PySpark and use it in the next row calculation?

The conditions below need to be applied to the RANK and RANKA columns. Input table: Condition for RANK column: IF RANK == 0: then RANK = previous RANK value + 1; else: RANK = RANK. Condition ...
Shirin · 45
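A window function only sees the original column, never the values it has already updated, so the recurrence has to be rewritten in closed form: after the most recent non-zero RANK, every zero adds 1. A sketch for RANK (grp/seq are assumed partitioning/ordering columns; RANKA would follow the same pattern):

from pyspark.sql import functions as F, Window

w = Window.partitionBy("grp").orderBy("seq")
wcum = w.rowsBetween(Window.unboundedPreceding, Window.currentRow)

df = (df
    .withColumn("rn", F.row_number().over(w))
    # position and value of the most recent non-zero RANK (the reset point)
    .withColumn("nz_rn", F.last(F.when(F.col("RANK") != 0, F.col("rn")),
                                ignorenulls=True).over(wcum))
    .withColumn("nz_val", F.last(F.when(F.col("RANK") != 0, F.col("RANK")),
                                 ignorenulls=True).over(wcum))
    # each zero after the reset adds 1 to the carried value
    .withColumn("RANK_new",
        F.when(F.col("RANK") != 0, F.col("RANK"))
         .otherwise(F.coalesce("nz_val", F.lit(0))
                    + F.col("rn") - F.coalesce("nz_rn", F.lit(0))))
    .drop("rn", "nz_rn", "nz_val"))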
1 vote · 1 answer · 217 views

Cumulative count specified status by ID

I have a dataframe in PySpark, similar to this: +---+------+-----------+ |id |status|date | +---+------+-----------+ |1 |1 |01-01-2022 | |1 |0 |02-01-2022 | |1 |0 |03-01-2022 | |1 ...
Ana Beatriz
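A conditional running sum gives the cumulative count of a given status per id; a sketch (the dd-MM-yyyy date format is a guess from the sample, and status 1 is the counted value):

from pyspark.sql import functions as F, Window

df = df.withColumn("date", F.to_date("date", "dd-MM-yyyy"))

w = (Window.partitionBy("id").orderBy("date")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df = df.withColumn(
    "cum_status1",
    F.sum(F.when(F.col("status") == 1, 1).otherwise(0)).over(w))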
0 votes · 1 answer · 555 views

Count occurrences of a filtered value in a window

I have the following dataframe: | Timestamp | info | +-------------------+----------+ |2016-01-01 17:54:30| 8 | |2016-02-01 12:16:18| 2 | |2016-03-01 12:17:57| 1 | |...
Babbara · 488
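A conditional sum over a time-range frame counts only the matching rows; in this sketch the target value 8 and the 30-day width are illustrative assumptions:

from pyspark.sql import functions as F, Window

# seconds since the epoch, so rangeBetween can use second offsets
df = df.withColumn("ts_sec", F.col("Timestamp").cast("timestamp").cast("long"))

w = Window.orderBy("ts_sec").rangeBetween(-30 * 86400, 0)  # trailing 30 days

df = df.withColumn("n_matching", F.sum((F.col("info") == 8).cast("int")).over(w))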
4 votes · 3 answers · 833 views

How to create a map column with rolling window aggregates per each key

Problem description: I need help with a pyspark.sql function that will create a new variable aggregating records over a specified Window() into a map of key-value pairs. Reproducible data: df = spark....
Adrian · 108
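One sketch, assuming the data is (or has been exploded to) long format with one row per (id, ts, key, value): run the rolling aggregate per key, then fold the results back into a map with map_from_entries:

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("id", "key").orderBy("ts")
     .rowsBetween(-2, 0))  # e.g. the current row plus the two before it

keyed = df.withColumn("key_agg", F.sum("value").over(w))

result = (keyed
    .groupBy("id", "ts")
    .agg(F.map_from_entries(
        F.collect_list(F.struct("key", "key_agg"))).alias("rolling_map")))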
