All Questions
96 questions
1 vote · 1 answer · 97 views
How to Exclude Rows Based on a Dynamic Condition in a PySpark Window Function?
I am working with PySpark and need to create a window function that calculates the median of the previous 5 values in a column. However, I want to exclude rows where a specific column feature is True. ...
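A possible starting point (an untested sketch for Spark 3.1+, assuming an ordering column "ts" and a value column "value" alongside the boolean "feature" column from the question; note it takes the approximate median of the non-excluded rows among the previous five rows, which may differ from the previous five non-excluded values):

from pyspark.sql import functions as F, Window

# previous five rows, excluding the current one
w = Window.orderBy("ts").rowsBetween(-5, -1)

df = df.withColumn(
    "median_prev_5",
    F.percentile_approx(
        # rows with feature == True become NULL and are ignored by the aggregate
        F.when(~F.col("feature"), F.col("value")),
        0.5,
    ).over(w),
)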
0 votes · 1 answer · 415 views
Count distinct operation for a window with partition by, order by and range between in Spark Scala?
I have a dataframe which looks like this -
+----------------------------------------------------+
|DeviceId | TimeString | UnixTimestamp | Group |
+-----------------------------------------------...
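Spark does not support COUNT(DISTINCT ...) over a window, so a common workaround is size(collect_set(...)). A minimal sketch, assuming UnixTimestamp is in seconds and a trailing 300-second range is wanted (the actual range is not visible in the excerpt):

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("DeviceId")
            .orderBy("UnixTimestamp")
            .rangeBetween(-300, 0))  # trailing 5 minutes, adjust as needed

# number of distinct Group values seen inside the range frame
df = df.withColumn("distinct_groups", F.size(F.collect_set("Group").over(w)))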
0 votes · 1 answer · 85 views
Max interval between occurrences of two values - Spark
I have a Spark dataframe in the format below. I wanted to calculate the max travel streak to other countries from India before returning to India.
I have created a flag to indicate if India was ...
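One way to compute such streaks (a rough sketch under assumed column names "traveller", "trip_date" and "country"; only "India" comes from the question): start a new streak id at every India visit with a running sum, then count the non-India rows per streak.

from pyspark.sql import functions as F, Window

w = Window.partitionBy("traveller").orderBy("trip_date")

streaks = (df
    .withColumn("is_india", (F.col("country") == "India").cast("int"))
    # every India row increments the streak id for the rows that follow it
    .withColumn("streak_id", F.sum("is_india").over(w)))

max_streak = (streaks
    .filter(F.col("country") != "India")
    .groupBy("traveller", "streak_id").count()
    .groupBy("traveller").agg(F.max("count").alias("max_streak_abroad")))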
0 votes · 0 answers · 141 views
Find min price within a date range in a PySpark dataframe
I have a DataFrame with the following data (simulated).
+-----+----------+-------------+------------+
|Price|Date      |InvoiceNumber|Product_Type|
+-----+----------+-------------+------------+
|12.65|12/30/2021|INV_19984    |AXN UN1234  |
|18.78|1/23/2022 |INV_200174   |AXN UN1234  |
|11.78|1/25/2022 |INV_200173   |...
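A typical pattern for a date-ranged minimum (a sketch only; the trailing 30-day window and the per-Product_Type partition are assumptions, since the excerpt does not state the range):

from pyspark.sql import functions as F, Window

# rangeBetween needs a numeric ordering column, so convert Date to epoch seconds
df = df.withColumn("date_sec", F.to_timestamp("Date", "M/d/yyyy").cast("long"))

w = (Window.partitionBy("Product_Type")
            .orderBy("date_sec")
            .rangeBetween(-30 * 86400, 0))  # trailing 30 days, inclusive

df = df.withColumn("min_price_30d", F.min("Price").over(w))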
0 votes · 2 answers · 349 views
Keep only modified rows in PySpark
I need to clean a dataset by filtering only modified rows (compared to the previous one) based on certain fields (in the example below we only consider city and sport, for each id), keeping only the ...
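The usual approach is lag() plus a filter (a minimal sketch, assuming an ordering column "order_col" and tracked fields "city" and "sport" per "id"):

from pyspark.sql import functions as F, Window

w = Window.partitionBy("id").orderBy("order_col")

df_clean = (df
    .withColumn("prev_city", F.lag("city").over(w))
    .withColumn("prev_sport", F.lag("sport").over(w))
    # keep the first row per id (lag is NULL) and rows where a tracked field changed
    .filter(
        F.col("prev_city").isNull()
        | (F.col("city") != F.col("prev_city"))
        | (F.col("sport") != F.col("prev_sport"))
    )
    .drop("prev_city", "prev_sport"))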
1 vote · 2 answers · 101 views
Create column that shows previous earned salary (groupedBy)
I'm trying to build a replacement for pandas' .shift() function. I'm pretty close, but need the final touch to make everything work, which would be implementing the right form of groupBy.
I have a ...
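The PySpark equivalent of a grouped shift is lag() over a partitioned window (a sketch with assumed column names "employee_id", "year" and "salary"):

from pyspark.sql import functions as F, Window

w = Window.partitionBy("employee_id").orderBy("year")

# previous earned salary within each employee's history
df = df.withColumn("previous_salary", F.lag("salary", 1).over(w))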
1 vote · 1 answer · 181 views
Speeding up a custom aggregate for window function
I am looking for alternatives to this code that would run faster:
inp = spark.createDataFrame([
    ["1", "A", 7, 2],
    ["1", "A", 14, 3],
    ["1",...
1 vote · 2 answers · 1k views
Using rangeBetween considering months rather than days in PySpark
I'm looking for a way to translate this chunk of SQL code into PySpark syntax.
SELECT MEAN(some_value) OVER (
    ORDER BY yyyy_mm_dd
    RANGE BETWEEN INTERVAL 3 MONTHS PRECEDING AND CURRENT ROW
) AS ...
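rangeBetween only accepts numeric offsets, so one workaround is to order by a month index instead of the date itself (a sketch, accurate to calendar-month granularity rather than the exact day-level interval of the SQL version; "yyyy_mm_dd" is assumed to be a date column):

from pyspark.sql import functions as F, Window

df = df.withColumn(
    "month_idx",
    F.year("yyyy_mm_dd") * 12 + F.month("yyyy_mm_dd"),
)

# 3 calendar months preceding through the current row
w = Window.orderBy("month_idx").rangeBetween(-3, 0)

df = df.withColumn("mean_some_value", F.avg("some_value").over(w))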
1 vote · 2 answers · 468 views
PySpark - assigning group id based on group member count
I have a dataframe where I want to assign an id for each window partition and for every 5 rows. Meaning, the id should increase/change when the partition has a different value or when the number of rows in ...
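A common pattern is row_number() plus integer division (a sketch; "part" and "order_col" are assumed partition/ordering columns, and the second step that makes the id change across partitions is optional):

from pyspark.sql import functions as F, Window

w = Window.partitionBy("part").orderBy("order_col")

df = df.withColumn(
    "sub_group",
    F.floor((F.row_number().over(w) - 1) / 5),  # 0,0,0,0,0,1,1,... per partition
)

# turn (partition value, sub_group) into one increasing id across partitions
df = df.withColumn(
    "group_id",
    F.dense_rank().over(Window.orderBy("part", "sub_group")),
)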
0 votes · 1 answer · 534 views
Conditional forward fill values in Spark dataframe
I have a Spark dataframe where I need to create a window partition column ("desired_output").
I simply want this conditional column to equal the "flag" column (0) until the first ...
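Since the condition is cut off in the excerpt, here is only the generic forward-fill building block such answers usually start from (a sketch with assumed partition/order columns "id" and "date"):

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("id").orderBy("date")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# carry the last non-null flag value forward
df = df.withColumn("desired_output", F.last("flag", ignorenulls=True).over(w))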
1 vote · 1 answer · 1k views
Backfill nulls with non-null values in Spark dataframe
I have a Spark dataframe where I need to create a window partition column ("desired_output"). This column needs to be backfilled with non-null values.
I am looking to backfill the first non-null ...
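Backfilling is usually done with first(..., ignorenulls=True) over a frame that looks forward (a sketch; "id", "date" and "value" are assumed column names):

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("id").orderBy("date")
            .rowsBetween(Window.currentRow, Window.unboundedFollowing))

# take the first non-null value at or after the current row
df = df.withColumn("desired_output", F.first("value", ignorenulls=True).over(w))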
1 vote · 1 answer · 575 views
How to retain the preceding updated row values in PySpark and use it in the next row calculation?
The conditions below need to be applied to the RANK and RANKA columns.
Input table:
Condition for RANK column:
IF RANK == 0: then RANK = previous RANK value + 1;
else: RANK = RANK
Condition ...
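The RANK rule is recursive (each updated value feeds the next row), which lag() alone cannot express. One workaround is to start a group at every non-zero RANK and add the row offset inside the group (a sketch with an assumed ordering column "seq"; RANKA, whose condition is cut off, would need its own analogous step):

from pyspark.sql import functions as F, Window

w = Window.orderBy("seq")
w_grp = Window.partitionBy("grp").orderBy("seq")

df = (df
    # each non-zero RANK starts a new group of rows
    .withColumn("grp", F.sum((F.col("RANK") != 0).cast("int")).over(w))
    .withColumn("offset", F.row_number().over(w_grp) - 1)
    # the non-zero RANK that opened the group is the base value
    .withColumn("base", F.first(F.when(F.col("RANK") != 0, F.col("RANK"))).over(w_grp))
    .withColumn("RANK_updated", F.coalesce(F.col("base"), F.lit(0)) + F.col("offset"))
    .drop("grp", "offset", "base"))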
1 vote · 1 answer · 217 views
Cumulative count of a specified status by ID
I have a dataframe in PySpark, similar to this:
+---+------+-----------+
|id |status|date |
+---+------+-----------+
|1 |1 |01-01-2022 |
|1 |0 |02-01-2022 |
|1 |0 |03-01-2022 |
|1 ...
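A running conditional count does the job (a sketch; the date format in the sample looks like dd-MM-yyyy, which is an assumption):

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("id")
            .orderBy(F.to_date("date", "dd-MM-yyyy"))
            .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# cumulative number of status == 1 rows per id
df = df.withColumn("status_count", F.sum((F.col("status") == 1).cast("int")).over(w))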
0 votes · 1 answer · 555 views
Count occurrences of a filtered value in a window
I have the following dataframe:
| Timestamp | info |
+-------------------+----------+
|2016-01-01 17:54:30| 8 |
|2016-02-01 12:16:18| 2 |
|2016-03-01 12:17:57| 1 |
|...
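The standard trick is a conditional sum over the window (a sketch; the target value 8 and the cumulative frame are assumptions, since the excerpt does not show them):

from pyspark.sql import functions as F, Window

target = 8
w = (Window.orderBy("Timestamp")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# running count of rows whose info equals the target value
df = df.withColumn(
    "target_count",
    F.sum(F.when(F.col("info") == target, 1).otherwise(0)).over(w),
)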
4 votes · 3 answers · 833 views
How to create a map column with rolling window aggregates per each key
Problem description
I need help with a pyspark.sql function that will create a new variable aggregating records over a specified Window() into a map of key-value pairs.
Reproducible Data
df = spark....
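One way to build such a map column (a sketch under assumed column names "user", "ts" and "key"; adjust the frame bounds for a bounded rolling window, e.g. rowsBetween(-2, 0)):

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("user").orderBy("ts")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df = (df
    # every key seen so far in the window ...
    .withColumn("keys_so_far", F.collect_list("key").over(w))
    # ... folded into a map of key -> occurrence count
    .withColumn(
        "key_counts",
        F.map_from_entries(F.expr("""
            transform(
                array_distinct(keys_so_far),
                k -> struct(k, size(filter(keys_so_far, x -> x = k)))
            )
        """)),
    )
    .drop("keys_so_far"))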