All Questions
988 questions
0 votes · 0 answers · 30 views
java.io.EOFException PySpark Py4JJavaError always occurring when using a user-defined function
I'm doing data preprocessing on a CSV file of 1 million rows, hoping to shrink it down to 600,000 rows. However, I always run into trouble when applying a function to a column in the ...
2 votes · 1 answer · 87 views
Update the column value if the ID in the DataFrame is in the list
I am working on a very large dataset with over 800,000 records, and I'm trying to update a column value based on the ID column. If
# Imports
from pyspark.sql import SparkSession
spark = ...
0 votes · 1 answer · 79 views
Monotonically increasing id order
The spec of monotonically_increasing_id says:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
So I assume there is some ordering ...
-2 votes · 1 answer · 89 views
How to select/filter data in a PySpark DataFrame?
I have worked with pandas DataFrames and need to do some basic selecting/filtering of data, but in a PySpark DataFrame. I'm running the script as an AWS Glue job. Do I need to convert the PySpark DataFrame ...
0 votes · 1 answer · 33 views
PySpark: New column with uppercase name is dropped unexpectedly
I am trying to add a new column CHANNEL_ID to my PySpark DataFrame based on conditional logic using pyspark.sql.functions.when, and after that, remove the old column channel_id, which is no longer ...
0 votes · 1 answer · 72 views
Which is the best way to convert JSON into a DataFrame? [closed]
I have a question about the best way to convert this JSON to a DataFrame:
JSON data:
{
"myschema": {
"accounts": {
"load_type": "daily",
...
1 vote · 1 answer · 51 views
How to convert a JSON file into a DataFrame with Spark?
One of my tasks today is to read a simple JSON file, convert it into a DataFrame, loop over the DataFrame, and do some validations, etc...
This is part of my code:
bucket_name = 'julio-s3'
...
0 votes · 0 answers · 62 views
How to create a new column or update a column inside a DataFrame?
Good morning everyone,
I have a question today that I don't know exactly how to approach.
Given a DataFrame, I need to create columns dynamically, and those columns will contain a set of validations that I have ...
0 votes · 1 answer · 58 views
How to convert a list into multiple columns and a DataFrame?
I have a challenge today:
Given a list of S3 paths, split it and get a DataFrame with one column containing the path and a new column with just the name of the folder.
My list has the ...
0 votes · 0 answers · 48 views
How to Dynamically Determine the Repartition Count for Loading Large CSV Files into PostgreSQL Using PySpark?
I need to load 5 million records from a CSV file into a PostgreSQL table as quickly as possible using PySpark. The performance and speed of the operation are critical for me. I often run my code from ...
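A simple heuristic (a sketch, not a definitive sizing rule) is to derive the partition count from the input file size, capped so the parallel JDBC writes don't overwhelm PostgreSQL:

```python
def partition_count(size_bytes: int, target_mb: int = 128, max_parts: int = 64) -> int:
    """One partition per ~target_mb of input, capped at max_parts.

    Each output partition becomes one concurrent JDBC connection during
    df.write.jdbc(...), so the cap protects the database from a
    connection storm. target_mb and max_parts are assumed defaults;
    tune them against the actual cluster and PostgreSQL limits.
    """
    return max(1, min(max_parts, size_bytes // (target_mb * 1024 * 1024) + 1))

# Hypothetical usage (file path, URL, and table name are assumptions):
# import os
# n = partition_count(os.path.getsize("/data/big.csv"))
# df = spark.read.csv("/data/big.csv", header=True).repartition(n)
# df.write.jdbc(url, "target_table", mode="append", properties=props)
```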
0 votes · 0 answers · 54 views
PySpark job failing when trying to write; says worker versions are different but the versions match
The write seems to be failing when it gets to the write command; before that it is able to pass the pandas DataFrame, and it seems to be creating and applying methods fine. The error ...
1 vote · 1 answer · 42 views
avg() over a whole DataFrame causing different output
I see that dataframe.agg(avg(col)) works fine, but when I calculate avg() over a window spanning the whole column (not using any partition), I see different results based on which column I use with orderBy.
...
-1 votes · 1 answer · 47 views
PySpark DataFrame not returning rows having values with more than 8 digits
I have created a sample DataFrame in PySpark and the ID column contains a few values with more than 8 digits. But it returns only those rows having values with fewer than 8 digits in the ID field. Can ...
0 votes · 0 answers · 53 views
Error in converting pandas dataframe into spark dataframe
I'm encountering an issue in Jupyter Notebook when working with Pandas and Spark on Kubernetes (k8s). Here's the sequence of steps I follow:
Create a Pandas DataFrame.
Create a Spark session ...
0 votes · 0 answers · 89 views
Spark SQL query returns column that has 0 length but is non-null
I have a Spark DataFrame backed by a parquet file. The column is string type.
spark.sql("select col_a, length(col_a) from df where col_a is not null")
+-------------------+------------------------...