
All Questions

0 votes · 1 answer · 79 views

Monotonically increasing id order

The spec of monotonically_increasing_id says: "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive." So I assume there is some ordering ...
asked by BelowZero
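
A minimal sketch of the behavior being asked about (session and column names here are illustrative): the generated ID packs the partition index into the upper bits and a per-partition counter into the lower bits, so values rise with row order inside each partition and jump at partition boundaries.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Three partitions: IDs increase within each partition and leap between
    # partitions (upper 31 bits = partition index, lower 33 bits = counter).
    df = spark.range(0, 10, numPartitions=3).withColumn(
        "mono_id", F.monotonically_increasing_id()
    )
    df.show()
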
-1 votes · 1 answer · 46 views

How to use LIMIT ALL with DataFrame

When using Spark SQL I can use LIMIT ALL to return all rows. Is there an equivalent when using the DataFrame API so that I can do something like df.limit("ALL")?
asked by David
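
As far as I know there is no df.limit("ALL") in the DataFrame API; LIMIT ALL is effectively a no-op, so the equivalent is simply not calling .limit(). A sketch with a hypothetical helper for a limit decided at runtime:

    from pyspark.sql import DataFrame

    def limit_or_all(df: DataFrame, n=None) -> DataFrame:
        # Hypothetical helper: pass an int to limit, or None/"ALL" to keep
        # every row, mirroring SQL's LIMIT ALL no-op.
        return df if n in (None, "ALL") else df.limit(n)
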
0 votes · 0 answers · 151 views

How to validate nested Spark DataFrame with Pandera?

Is there any possibility to validate a nested Spark DataFrame with pandera.pyspark? This is an example with StructType, but it could be similar with ArrayType. from pandera.pyspark import DataFrameModel, ...
asked by matt91t
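
For context, a minimal flat pandera.pyspark model looks like the sketch below, following the library's documented pattern; whether the same annotations extend to nested StructType/ArrayType columns is exactly the open question here, so treat nested support as unconfirmed.

    import pyspark.sql.types as T
    from pandera.pyspark import DataFrameModel, Field

    class FlatModel(DataFrameModel):
        # Each annotation is a Spark SQL type; pandera checks the dataframe's
        # schema and Field constraints against it on validate().
        id: T.IntegerType() = Field()
        name: T.StringType() = Field()

    # validated = FlatModel.validate(df)
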
0 votes · 0 answers · 30 views

Index Error when generating a Data Quality and Insights Report due to array column

I'm using AWS's Data Wrangler service to prepare some data to train an ML model. I have a very simple CSV file which has 3 columns and 4 rows: State,Current,History 1,2.045301,[2.045236##2.045129##2....
asked by user27717733
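
The report failure itself happens inside the AWS service, but one workaround (an assumption on my part, not a documented fix) is to unpack the '##'-delimited History column in PySpark before handing the data over, so every column is scalar:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("data.csv", header=True)  # hypothetical file name

    # Strip the surrounding brackets, then split on '##' to get a real
    # array column instead of one packed string.
    df = df.withColumn(
        "History",
        F.split(F.regexp_replace("History", r"[\[\]]", ""), "##"),
    )
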
0 votes · 1 answer · 95 views

PySpark: error "'Column' object is not callable" when using .count()

I'm working with a PySpark DataFrame and trying to count the number of null values in each column. The following expression: [col(c).isNull().count() for c in df.columns] throws an error: ----> ...
asked by aroyc
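
The error comes from calling a Column as if it were a function; .count() is a DataFrame action or aggregate, not a Column method. A sketch of a working per-column null count using standard functions:

    from pyspark.sql import functions as F

    # One aggregate per column: count only the rows where that column is null.
    null_counts = df.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    )
    null_counts.show()
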
0 votes · 1 answer · 90 views

Insert column at specified position

How to insert a column at a specified position without listing all the existing column names? I have this dataframe: from pyspark.sql import functions as F df = spark.range(1).select( F.lit(11)....
asked by ZygD
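
One way to do this without typing the existing names (a sketch; insert_at is a hypothetical helper) is to splice the new column into df.columns and re-select:

    from pyspark.sql import functions as F

    def insert_at(df, pos, name, col):
        # Rebuild the projection with the new column spliced in at `pos`.
        cols = df.columns
        return df.select(*cols[:pos], col.alias(name), *cols[pos:])

    # e.g. make a literal the second column:
    # df2 = insert_at(df, 1, "new_col", F.lit(0))
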
-1 votes · 1 answer · 47 views

PySpark DataFrame not returning rows whose values have more than 8 digits

I have created a sample DataFrame in PySpark, and the ID column contains a few values with more than 8 digits. But it returns only those rows whose ID values have fewer than 8 digits. Can ...
asked by Deveshwari Devi
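
One plausible cause (an assumption; the question doesn't show the schema) is that ID was declared as IntegerType, whose 32-bit range tops out around 2.1 billion, so larger values overflow to null and drop out of filters. Declaring the column as LongType avoids that:

    from pyspark.sql import SparkSession
    from pyspark.sql import types as T

    spark = SparkSession.builder.getOrCreate()

    schema = T.StructType([T.StructField("ID", T.LongType(), True)])
    df = spark.createDataFrame([(123456789012,)], schema=schema)
    df.show()  # the 12-digit value survives as a 64-bit long
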
0 votes · 1 answer · 51 views

PySpark select after join raises an ambiguity error, but the column should only be present in one of the dataframes

I'm doing a join on two dataframes that come from the same original dataframe. These then undergo some aggregations, and the columns selected are not equal except for the ones used in the join. So ...
asked by Miguel Rodrigues
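
A common way around lineage ambiguity when both sides descend from the same dataframe (a sketch with hypothetical names agg1/agg2 and join key "key") is to alias each side and qualify the select:

    from pyspark.sql import functions as F

    left = agg1.alias("l")    # agg1/agg2: the two aggregated dataframes
    right = agg2.alias("r")

    joined = left.join(right, on="key")
    # Qualifying with the alias tells Spark which lineage each column
    # comes from, even when the name exists on only one side.
    result = joined.select("key", F.col("l.value1"), F.col("r.value2"))
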
1 vote · 1 answer · 77 views

How to apply an expression from a column to another column in pyspark dataframe?

I would like to know if it is possible to apply an expression stored in one column to another column. For example, I have this table: new_feed_dt | regex_to_apply | expr_to_apply 053021 | _(\d+) | date_format(to_date(new_feed_dt, '...
asked by Tomás Jullier
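
F.expr() only accepts a plan-time string, so a column can't feed it directly. One workaround sketch: collect the distinct expression strings and fold them into a chained CASE WHEN:

    from functools import reduce
    from pyspark.sql import functions as F

    # Each distinct expression string becomes one branch of a conditional.
    exprs = [
        r["expr_to_apply"]
        for r in df.select("expr_to_apply").distinct().collect()
    ]
    result = reduce(
        lambda acc, e: F.when(F.col("expr_to_apply") == e, F.expr(e)).otherwise(acc),
        exprs,
        F.lit(None),
    )
    df = df.withColumn("applied", result)
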
2 votes · 1 answer · 58 views

PySpark: retrieve the value of the field dynamically specified in another field of the same DataFrame

I'm working with PySpark and have a challenging scenario where I need to dynamically retrieve the value of a field specified in another field of the same DataFrame. I then need to compare this ...
asked by Piotr Wojcik
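
One sketch (the column holding the target field's name is assumed here to be called "pointer") builds a literal name-to-value map over the data columns and indexes it per row:

    from itertools import chain
    from pyspark.sql import functions as F

    data_cols = [c for c in df.columns if c != "pointer"]

    # create_map(lit(name1), col(name1), lit(name2), col(name2), ...)
    name_to_value = F.create_map(
        *chain.from_iterable((F.lit(c), F.col(c)) for c in data_cols)
    )
    # Look up, per row, the column named by the `pointer` field.
    df = df.withColumn("resolved", name_to_value[F.col("pointer")])
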
0 votes · 0 answers · 207 views

PySpark DAGScheduler: failed to update accumulator because of a PySpark UDF?

When I run a UDF in PySpark, I get this on the console all the time. It hasn't failed any unit test yet, which prompts me to question whether this is something I need to attend to. But this is my first time ...
asked by Daniel Koh
1 vote · 0 answers · 36 views

Create a sparse vector from a PySpark dataframe maintaining the index

I have pyspark df like this: +--------------------+-------+----------+----------+----------+----------+--------+ | user_id|game_id|3mon_views|3mon_carts|3mon_trans|3mon_views| dt| +---...
asked by Chris_007
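
A sketch (assuming a fixed, known ordering of the feature columns, so each column keeps a stable index in the vector) that packs them into a SparseVector via a UDF:

    from pyspark.ml.linalg import SparseVector, VectorUDT
    from pyspark.sql import functions as F

    feature_cols = ["3mon_views", "3mon_carts", "3mon_trans"]

    @F.udf(returnType=VectorUDT())
    def to_sparse(*vals):
        # Keep only non-null, non-zero entries; each column's position in
        # feature_cols is preserved as its index in the vector.
        entries = {i: float(v) for i, v in enumerate(vals) if v not in (None, 0)}
        return SparseVector(len(vals), entries)

    df = df.withColumn("features", to_sparse(*[F.col(c) for c in feature_cols]))
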
0 votes · 2 answers · 69 views

PySpark equivalent of Spark sliding() function

I have a multiline flat file which I wish to convert, via PySpark, to a 4-column dataframe or an RDD array. The Spark Scala code is: #from pyspark.sql import SparkSession # Scala ...
asked by M__
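
Scala's RDDFunctions.sliding isn't exposed in PySpark, but an equivalent can be sketched with zipWithIndex: emit each element into every window it belongs to, then group, filter, and sort:

    def sliding(rdd, n):
        # Element at index i belongs to the windows starting at i-n+1 .. i;
        # windows shorter than n (at the edges) are dropped.
        indexed = rdd.zipWithIndex()
        return (
            indexed
            .flatMap(lambda xi: [(xi[1] - k, (xi[1], xi[0])) for k in range(n)])
            .groupByKey()
            .filter(lambda kv: len(kv[1]) == n)
            .sortByKey()
            .map(lambda kv: [v for _, v in sorted(kv[1])])
        )

    # sliding(sc.parallelize([1, 2, 3, 4, 5]), 4).collect()
    # -> [[1, 2, 3, 4], [2, 3, 4, 5]]
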
2 votes · 2 answers · 107 views

Aggregate (sum) consecutive rows where the number of consecutive rows is defined in a dataframe column

Initial dataframe: every "id" has the same "range" value. I have to execute the following aggregation: grouping on column "id" a dynamic range of consecutive rows (col "...
asked by csbr
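
A sketch using a window (column names taken from the excerpt; an explicit ordering column is assumed, since "consecutive rows" needs a deterministic order): number the rows per id, bucket them into groups of "range" rows, and sum each bucket:

    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("id").orderBy("order_col")  # order_col: assumed

    # Integer-divide the 0-based row number by the per-id range to get a
    # bucket id, then aggregate each bucket.
    df = df.withColumn(
        "grp", F.floor((F.row_number().over(w) - 1) / F.col("range"))
    )
    result = df.groupBy("id", "grp").agg(F.sum("value").alias("value_sum"))
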
0 votes · 2 answers · 42 views

How to change a row's value based on the value of the previous row, in a dataframe ordered by date for a unique id?

I need insights on how to do this in Spark. My dataframe is this:
ID | DATE       | State
X  | 20-01-2023 | N
X  | 21-01-2023 | S
X  | 22-01-2023 | S
X  | 23-01-2023 | ...
asked by Ilyas
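
A sketch of the usual single-step pattern: lag() over a per-ID window ordered by date, then a conditional rewrite (the concrete rule and replacement value below are placeholders for the question's actual logic):

    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("ID").orderBy("DATE")

    df = (
        df.withColumn("prev_state", F.lag("State").over(w))
          .withColumn(
              "State",
              F.when(F.col("prev_state") == "S", F.lit("S"))
               .otherwise(F.col("State")),
          )
          .drop("prev_state")
    )

Note that if the rewrite is meant to cascade row after row (each change feeding the next row's condition), a single lag() won't propagate it; that usually calls for an iterative or aggregate-based approach instead.
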
