All Questions
1,036 questions
0 votes · 1 answer · 79 views
Monotonically increasing id order
The documentation of monotonically_increasing_id says:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
So I assume there is some ordering ...
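A minimal sketch (using a small illustrative DataFrame built here just for the example) shows where that ordering comes from: the current implementation packs the partition index into the upper 31 bits and a per-partition record counter into the lower 33 bits, so IDs increase within each partition but are not consecutive:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data: 8 rows spread over 3 partitions.
df = spark.range(8).repartition(3)

# Partition index in the upper bits, per-partition counter in the lower bits,
# hence increasing but not consecutive IDs.
df.withColumn("mono_id", F.monotonically_increasing_id()).orderBy("mono_id").show()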
-1 votes · 1 answer · 46 views
How to use LIMIT ALL with DataFrame
When using Spark SQL I can use LIMIT ALL to return all rows. Is there an equivalent when using the DataFrame API so that I can do something like df.limit("ALL")?
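There is no df.limit("ALL") in the DataFrame API; a hedged workaround is simply to skip the limit call, for example behind a small helper (the helper name is made up here):

from pyspark.sql import DataFrame

def limit_or_all(df: DataFrame, n=None) -> DataFrame:
    # SQL's LIMIT ALL is the same as applying no limit, so only call
    # .limit() when an actual row count is given.
    return df if n is None else df.limit(n)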
0 votes · 0 answers · 151 views
How to validate nested Spark DataFrame with Pandera?
Is it possible to validate a nested Spark DataFrame with pandera.pyspark? Below is an example with StructType, but the same question applies to ArrayType.
from pandera.pyspark import DataFrameModel, ...
0 votes · 0 answers · 30 views
Index Error when generating a Data Quality and Insights Report due to array column
I'm using AWS's Data Wrangler service to prepare some data to train a ML model.
I have a very simple CSV file which has 3 columns and 4 rows:
State,Current,History
1,2.045301,[2.045236##2.045129##2....
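One hedged workaround is to flatten the "##"-delimited History column into plain numeric columns before importing the CSV, since the report appears to choke on the array-like field; the column name and delimiter come from the excerpt, while the file names below are assumptions:

import pandas as pd

df = pd.read_csv("input.csv")  # file name is an assumption

# Split "[2.045236##2.045129##...]" into separate float columns.
history = (
    df["History"]
    .str.strip("[]")
    .str.split("##", expand=True)
    .astype(float)
    .add_prefix("History_")
)

df.drop(columns=["History"]).join(history).to_csv("flattened.csv", index=False)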
0 votes · 1 answer · 95 views
PySpark: Throwing error 'Column' object is not callable while using .count()
I'm working with a PySpark DataFrame and trying to count the number of null values in each column. I tried the following expression:
[col(c).isNull().count() for c in df.columns]
which throws the error:
----&...
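The error comes from calling .count() on a Column (isNull() returns a Column, not a count). A hedged sketch that counts nulls per column in one aggregation instead:

from pyspark.sql import functions as F

null_counts = df.select(
    # Count only the rows where the column is null.
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
)
null_counts.show()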
0 votes · 1 answer · 90 views
Insert column at specified position
How to insert a column at a specified position without listing all the existing column names?
I have this dataframe:
from pyspark.sql import functions as F
df = spark.range(1).select(
F.lit(11)....
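A minimal sketch, assuming the goal is just to splice the new column into df.columns at a given index (the helper name and position are illustrative):

from pyspark.sql import functions as F

def insert_col(df, pos, name, col):
    # Rebuild the select list around the new column instead of
    # listing every existing column by hand.
    cols = df.columns
    return df.select(*cols[:pos], col.alias(name), *cols[pos:])

df2 = insert_col(df, 1, "new_col", F.lit(99))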
-1 votes · 1 answer · 47 views
PySpark DataFrame not returning rows whose values have more than 8 digits
I have created a sample DataFrame in PySpark, and the ID column contains a few values with more than 8 digits. But it returns only the rows whose ID values have fewer than 8 digits. Can ...
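A likely cause (an assumption, since the schema isn't shown) is that the ID column was declared or inferred as IntegerType, which cannot hold values above 2,147,483,647; declaring it as LongType keeps the large IDs. A sketch with made-up sample rows:

from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
    StructField("ID", LongType(), True),      # LongType instead of IntegerType
    StructField("name", StringType(), True),
])

df = spark.createDataFrame(
    [(123456789012, "a"), (1234567, "b")],    # sample rows, purely illustrative
    schema=schema,
)
df.show()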
0 votes · 1 answer · 51 views
PySpark select after join raises ambiguity, but the column should only be present in one of the dataframes
I'm doing a join on two dataframes that come from the same original dataframe. Each then goes through some aggregations, and the selected columns differ except for the ones used in the join.
So ...
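When both sides descend from the same original DataFrame, Spark can see the same column lineage on both sides even if only one side still selects it. One hedged workaround is to alias each side and qualify every reference (the names agg_one, agg_two, key, and the metric columns are placeholders):

from pyspark.sql import functions as F

left = agg_one.alias("l")
right = agg_two.alias("r")

joined = left.join(right, on=F.col("l.key") == F.col("r.key"), how="inner")

# Qualify the selection so Spark knows which lineage each column belongs to.
result = joined.select(F.col("l.key"), F.col("l.metric_a"), F.col("r.metric_b"))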
1 vote · 1 answer · 77 views
How to apply an expression from a column to another column in pyspark dataframe?
I would like to know if it is possible to apply an expression stored in one column to another column.
for example, I have this table:
new_feed_dt | regex_to_apply | expr_to_apply
053021      | _(\d+)         | date_format(to_date(new_feed_dt, '...
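Spark cannot evaluate a different SQL expression per row directly, but if the set of distinct expressions is small, one hedged approach is to collect them and chain a WHEN branch per expression, applying each through F.expr (the derived column name is made up):

from pyspark.sql import functions as F

# Collect the distinct expression strings (assumes this set is small).
exprs = [r[0] for r in df.select("expr_to_apply").distinct().collect()]

applied = None
for e in exprs:
    cond = F.col("expr_to_apply") == e
    applied = F.when(cond, F.expr(e)) if applied is None else applied.when(cond, F.expr(e))

result = df.withColumn("derived", applied)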
2 votes · 1 answer · 58 views
Pyspark - Retrieve the value from the field dynamically specified in other field of the same data frame
I'm working with PySpark and have a challenging scenario where I need to dynamically retrieve the value of a field specified in another field of the same DataFrame. I then need to compare this ...
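A hedged sketch of one way to do this: build a chained WHEN over the candidate column names so each row reads the column its pointer field names. The pointer column field_name, the candidate list, and the output column name are all assumptions:

from functools import reduce
from pyspark.sql import functions as F

candidates = ["col_a", "col_b", "col_c"]   # columns the pointer can refer to

resolved = reduce(
    lambda acc, c: acc.when(F.col("field_name") == c, F.col(c)),
    candidates[1:],
    F.when(F.col("field_name") == candidates[0], F.col(candidates[0])),
)

df = df.withColumn("resolved_value", resolved)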
0 votes · 0 answers · 207 views
Pyspark DAGScheduler: Failed to update accumulator because of Pyspark UDF?
When I run a UDF in PySpark I see this on the console all the time. It hasn't failed any unit test yet, which makes me question whether this is something I need to attend to, but this is my first time ...
1 vote · 0 answers · 36 views
Create a sparse vector from a PySpark dataframe maintaining the index
I have a PySpark df like this:
+--------------------+-------+----------+----------+----------+----------+--------+
| user_id|game_id|3mon_views|3mon_carts|3mon_trans|3mon_views| dt|
+---...
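A hedged sketch of one interpretation: build one SparseVector per user_id, using game_id as the position in the vector and 3mon_views as the value at that position (whether game_id really is a small non-negative integer index is an assumption):

from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql import functions as F

size = df.agg(F.max("game_id")).first()[0] + 1   # vector length, assuming integer ids

@F.udf(returnType=VectorUDT())
def to_sparse(pairs):
    # pairs is a list of (game_id, 3mon_views) structs; SparseVector needs sorted indices.
    pairs = sorted((p["game_id"], float(p["3mon_views"])) for p in pairs)
    return SparseVector(size, [i for i, _ in pairs], [v for _, v in pairs])

vectors = (
    df.groupBy("user_id")
      .agg(F.collect_list(F.struct("game_id", "3mon_views")).alias("pairs"))
      .withColumn("features", to_sparse("pairs"))
)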
0 votes · 2 answers · 69 views
PySpark equivalent of Spark sliding() function
I have a multiline flat file which I wish to convert to a 4-column DataFrame (or RDD array) via PySpark. The Spark Scala code is:
#from pyspark.sql import SparkSession # Scala ...
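PySpark has no rdd.sliding(); a hedged equivalent, assuming every record spans exactly four consecutive lines (and with an illustrative file path), is to number the lines and regroup them four at a time:

lines = spark.sparkContext.textFile("multiline.txt")

records = (
    lines.zipWithIndex()
         .map(lambda x: (x[1] // 4, (x[1] % 4, x[0])))   # (record number, (position, line))
         .groupByKey()
         .map(lambda kv: tuple(v for _, v in sorted(kv[1])))
)

df = records.toDF(["col1", "col2", "col3", "col4"])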
2 votes · 2 answers · 107 views
Aggregate (sum) consecutive rows where the number of consecutive rows is defined in a dataframe column
Initial Dataframe:
Every "id" has the same "range" value, I have to execute the following aggregation:
grouping on column "id" a dynamic range of consecutive rows (col &...
0 votes · 2 answers · 42 views
How to change a row's value based on the value of the previous row, in a dataframe ordered by date for each unique id?
I need insight into how to do this in Spark:
My dataframe is this
|ID | DATE | State
|X | 20-01-2023 | N
|X | 21-01-2023 | S
|X | 22-01-2023 | S
|X | 23-01-2023 | ...
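A hedged sketch of the typical approach: look at the previous row's State per ID (ordered by DATE) with lag, and rewrite the current row when a condition on that previous value holds; the specific rule below is only an example, not taken from the question:

from pyspark.sql import Window, functions as F

w = Window.partitionBy("ID").orderBy(F.to_date("DATE", "dd-MM-yyyy"))

result = df.withColumn(
    "State",
    F.when(F.lag("State").over(w) == "N", F.lit("S"))   # example rule, an assumption
     .otherwise(F.col("State")),
)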