
All Questions

0 votes
0 answers
30 views

java.io.EOFException PySpark Py4JJavaError always occurring when using user defined function

I'm doing data preprocessing on a CSV file of 1 million rows and hoping to shrink it down to 600,000 rows. However, I always run into trouble when applying a function to a column in the ...
Mig Rivera Cueva
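A Py4JJavaError wrapping java.io.EOFException during a UDF usually means the Python worker process died mid-stream, most often from memory pressure rather than a logic error. A spark-defaults.conf fragment with the settings most commonly tuned for this (the values are placeholders, not recommendations):

```
spark.executor.memory          8g
spark.executor.memoryOverhead  2g
spark.python.worker.memory     1g
```

Where possible, replacing the row-wise Python UDF with built-in column functions avoids the Python worker entirely.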
2 votes
1 answer
87 views

Update the column value if the Id in the dataframe is in the list

I am working on a very large dataset that has over 800,000 records and I'm trying to update the column value based on the ID column. If # Imports from pyspark.sql import SparkSession spark = ...
Sarah
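In PySpark the standard pattern for this is `F.when(F.col("id").isin(id_list), new_value).otherwise(F.col("status"))` inside `withColumn`. A stdlib-only sketch of that logic (column names and values are hypothetical):

```python
# Plain-Python sketch of the when/isin update (hypothetical column names).
rows = [
    {"id": 1, "status": "old"},
    {"id": 2, "status": "old"},
    {"id": 3, "status": "old"},
]
ids_to_update = {1, 3}  # a set makes the membership test O(1)

updated = [
    {**row, "status": "updated" if row["id"] in ids_to_update else row["status"]}
    for row in rows
]
print([r["status"] for r in updated])  # ['updated', 'old', 'updated']
```

For very large ID lists, `isin` embeds every value in the query plan; a broadcast join against a small DataFrame of IDs usually scales better.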
0 votes
1 answer
79 views

Monotonically increasing id order

The spec of monotonically_increasing_id says: "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive." So I assume there is some ordering ...
BelowZero
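Per the function's documentation, the ordering comes from a simple bit layout: the partition ID goes in the upper 31 bits and the record's position within its partition in the lower 33 bits, so IDs increase with row order inside each partition and jump between partitions. A sketch of that documented layout:

```python
def monotonic_id(partition_id: int, row_in_partition: int) -> int:
    # monotonically_increasing_id() packs the partition ID into the upper
    # 31 bits and the record position within the partition into the lower 33.
    return (partition_id << 33) | row_in_partition

print(monotonic_id(0, 0))  # 0
print(monotonic_id(0, 1))  # 1
print(monotonic_id(1, 0))  # 8589934592  (= 2**33)
```

So the IDs follow the existing partition order; they do not impose a global sort of their own.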
-2 votes
1 answer
89 views

How to select/filter data in a PySpark DataFrame?

I have worked with pandas DataFrames and need to do some basic selecting/filtering of data, but in a PySpark DataFrame. I'm running the script as an AWS Glue job. Do I need to convert the PySpark DataFrame ...
nohardfeelings
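No conversion to pandas is needed in a Glue job: `df.filter(...)` (alias `where`) and `df.select(...)` operate on the PySpark DataFrame directly. A plain-Python analogue of what those two calls do (field names are hypothetical):

```python
# Plain-Python analogue: PySpark's df.filter(...).select(...) is the
# row-filter plus column-projection below, run lazily and distributed.
rows = [{"name": "a", "age": 31}, {"name": "b", "age": 17}]

selected = [{"name": r["name"]} for r in rows if r["age"] >= 18]
print(selected)  # [{'name': 'a'}]
```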
0 votes
1 answer
33 views

PySpark: New column with uppercase name is dropped unexpectedly

I am trying to add a new column CHANNEL_ID to my PySpark DataFrame based on conditional logic using pyspark.sql.functions.when, and after that remove the old column channel_id, which is no longer ...
Dani
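A plausible cause, assuming default settings: Spark resolves column names case-insensitively (`spark.sql.caseSensitive` is `false` by default), so a drop of `channel_id` can also match `CHANNEL_ID`. A stdlib sketch of that resolution rule:

```python
def drop_case_insensitive(columns, name):
    # With spark.sql.caseSensitive=false (the default), column resolution
    # ignores case, so dropping "channel_id" also matches "CHANNEL_ID".
    return [c for c in columns if c.lower() != name.lower()]

cols = ["id", "channel_id", "CHANNEL_ID"]
print(drop_case_insensitive(cols, "channel_id"))  # ['id']
```

Renaming so the old and new columns differ by more than case sidesteps the ambiguity entirely.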
0 votes
1 answer
72 views

Which is the best way to convert JSON into a DataFrame? [closed]

I have a question about the best way to convert this JSON to a Dataframe: JSON data: { "myschema": { "accounts": { "load_type": "daily", ...
Julio
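One workable approach for nested config JSON like this is to flatten it into one record per table before handing it to `spark.createDataFrame`. A stdlib sketch, with the JSON reconstructed from the truncated excerpt (the `orders` entry is invented for illustration):

```python
import json

# Hypothetical reconstruction of the nested config; flatten it into rows
# that spark.createDataFrame (or pandas) can ingest directly.
raw = '{"myschema": {"accounts": {"load_type": "daily"}, "orders": {"load_type": "hourly"}}}'

data = json.loads(raw)
rows = [
    {"schema": schema, "table": table, **attrs}
    for schema, tables in data.items()
    for table, attrs in tables.items()
]
print(rows)
```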
1 vote
1 answer
51 views

How to convert a JSON file into a DataFrame with Spark?

One of my tasks today is to read a simple JSON file, convert it into a DataFrame, loop over the DataFrame, and do some validations, etc. This is part of my code: bucket_name = 'julio-s3' ...
Julio
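`spark.read.json(path)` expects JSON Lines (one object per line) by default; pass `multiLine=True` for a single pretty-printed document, and iterate with `collect()` or `toLocalIterator()` for row-level validations. A stdlib-only sketch of the read-and-validate loop (field names are hypothetical):

```python
import json
import os
import tempfile

# Write a small JSON Lines file, read it back, and validate each row.
records = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": -3.0}]

path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

with open(path) as f:
    rows = [json.loads(line) for line in f]

invalid = [r["id"] for r in rows if r["amount"] < 0]
print(invalid)  # [2]
```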
0 votes
0 answers
62 views

How to create a new column or update a column inside a DataFrame?

Good morning everyone. I have a question today that I don't know exactly how to approach. Given a DataFrame, I need to create columns dynamically, and those columns will contain a set of validations that I have ...
Julio
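One way to sketch this: keep the validations in a dict of name → check, then add one column per entry. In PySpark the same loop would call `df.withColumn(name, expr)` per validation, or build a single `select()` from a list of expressions. The names and rules below are hypothetical:

```python
# Hypothetical validations; in PySpark each entry would become a
# df.withColumn(name, expr) call (or one select() over a list of exprs).
validations = {
    "amount_positive": lambda r: r["amount"] > 0,
    "id_present":      lambda r: r.get("id") is not None,
}

rows = [{"id": 1, "amount": 5}, {"id": None, "amount": -2}]
checked = [
    {**r, **{name: check(r) for name, check in validations.items()}}
    for r in rows
]
print(checked[1])
```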
0 votes
1 answer
58 views

How to convert a list into multiple columns and a DataFrame?

I have a challenge today: given a list of S3 paths, split them and get a DataFrame with one column for the path and a new column with just the name of the folder. My list has the ...
Julio
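The folder name is just the last non-empty path segment, so the core is a split; in PySpark the equivalent column expression would be `F.element_at(F.split(F.col("path"), "/"), -1)` after trimming the trailing slash. A stdlib sketch with hypothetical paths:

```python
# Hypothetical paths; the folder name is the last non-empty path segment.
paths = [
    "s3://my-bucket/raw/accounts/",
    "s3://my-bucket/raw/orders/",
]

rows = [{"path": p, "folder": p.rstrip("/").rsplit("/", 1)[-1]} for p in paths]
print([r["folder"] for r in rows])  # ['accounts', 'orders']
```

`spark.createDataFrame(rows)` would then yield the two-column DataFrame directly.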
0 votes
0 answers
48 views

How to Dynamically Determine the Repartition Count for Loading Large CSV Files into PostgreSQL Using PySpark?

I need to load 5 million records from a CSV file into a PostgreSQL table as quickly as possible using PySpark. The performance and speed of the operation are critical for me. I often run my code from ...
Purushottam Nawale
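A common heuristic (a rule of thumb, not a number from the Spark docs): size partitions at roughly 128 MB of input each, with a floor tied to the write parallelism you want against PostgreSQL. A sketch with placeholder defaults:

```python
import math

def repartition_count(file_size_bytes: int,
                      target_partition_bytes: int = 128 * 1024 * 1024,
                      min_partitions: int = 8) -> int:
    """Aim for ~128 MB per partition, but never fewer partitions
    than the minimum parallelism wanted for the JDBC write."""
    return max(math.ceil(file_size_bytes / target_partition_bytes), min_partitions)

print(repartition_count(5 * 1024**3))    # 5 GiB  -> 40 partitions
print(repartition_count(100 * 1024**2))  # 100 MiB -> floor of 8
```

The resulting count can feed both `df.repartition(n)` and the JDBC writer's `numPartitions` option, which also caps concurrent database connections.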
0 votes
0 answers
54 views

PySpark job failing when trying to write; says worker versions are different but the versions match

The job seems to be failing when it gets to the write command; before that it is able to process the pandas DataFrame, and it seems to be creating and applying methods fine. The error ...
Joseph W
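A version-mismatch error that persists even when the versions "match" is usually the driver and executors resolving different Python interpreters on disk. The common fix is pinning both explicitly in the environment (the paths below are placeholders for whatever interpreter the cluster actually ships):

```
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
```

The mismatch typically only surfaces at `write` because that is the first action that forces the executors to spin up Python workers.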
1 vote
1 answer
42 views

avg() over a whole dataframe causing different output

I see that dataframe.agg(avg(col)) works fine, but when I calculate avg() over a window spanning the whole column (not using any partition), I see different results depending on which column I use with orderBy. ...
anurag86
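This is documented window behavior rather than a bug: once a window has an `orderBy`, the default frame becomes "unbounded preceding to current row", so `avg()` turns into a running average whose per-row values depend on the sort column. Without `orderBy`, the frame is the whole partition. A plain-Python illustration:

```python
# With orderBy and no explicit frame, Spark's default window frame is
# "unbounded preceding to current row", i.e. a running average -- not
# the whole-column average you get without orderBy.
values = [10, 20, 30, 40]

whole_avg = sum(values) / len(values)                               # frame = entire partition
running = [sum(values[: i + 1]) / (i + 1) for i in range(len(values))]

print(whole_avg)  # 25.0
print(running)    # [10.0, 15.0, 20.0, 25.0]
```

To average the whole column while keeping `orderBy`, set the frame explicitly with `.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`.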
-1 votes
1 answer
47 views

PySpark DataFrame not returning rows having values of more than 8 digits

I have created a sample DataFrame in PySpark, and the ID column contains a few values with more than 8 digits. But it returns only those rows whose ID values have fewer than 8 digits. Can ...
Deveshwari Devi
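One common cause when large IDs disappear (an assumption, since the schema isn't shown): the ID column was declared `IntegerType`, which is a 32-bit signed integer, so values outside ±2,147,483,647 can't be represented and surface as null or missing rows. Declaring the column `LongType` avoids this. The boundary check:

```python
# 32-bit signed range used by Spark's IntegerType; larger IDs need LongType.
INT_MAX = 2**31 - 1

ids = [12345678, 987654321, 12345678901]
fits = [n <= INT_MAX for n in ids]
print(INT_MAX)  # 2147483647
print(fits)     # [True, True, False]
```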
0 votes
0 answers
53 views

Error converting a pandas DataFrame into a Spark DataFrame

I'm encountering an issue in Jupyter Notebook when working with Pandas and Spark on Kubernetes (k8s). Here's the sequence of steps I follow: Create a Pandas DataFrame. Create a Spark session ...
harshwardhan Singh Dodiya
0 votes
0 answers
89 views

Spark SQL query returns a column that has 0 length but is non-null

I have a Spark DataFrame for a Parquet file. The column is of string type. spark.sql("select col_a, length(col_a) from df where col_a is not null") +-------------------+------------------------...
Dozel
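An empty string is a non-null value, so `col_a is not null` keeps it while `length(col_a)` reports 0; in Spark SQL the filter would need `col_a is not null and col_a != ''`. A plain-Python illustration:

```python
# An empty string is not NULL: it passes "is not null" but has length 0.
values = ["abc", "", None]

non_null = [v for v in values if v is not None]
print([len(v) for v in non_null])  # [3, 0]

non_empty = [v for v in non_null if v != ""]
print(non_empty)                   # ['abc']
```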
