All Questions
988 questions
0 votes · 0 answers · 30 views
java.io.EOFException PySpark Py4JJavaError always occurring when using a user-defined function
I'm doing data preprocessing on a CSV file of 1 million rows, hoping to shrink it down to 600,000 rows. However, I always run into trouble when applying a function to a column in the ...
2 votes · 1 answer · 87 views
Update the column value if the ID in the DataFrame is in the list
I am working on a very large dataset with over 800,000 records, and I'm trying to update a column value based on the ID column. If
# Imports
from pyspark.sql import SparkSession
spark = ...
0 votes · 1 answer · 79 views
Monotonically increasing id order
The spec of monotonically_increasing_id says:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
So I assume there is some ordering ...
-2 votes · 1 answer · 89 views
How to select/filter data in a PySpark DataFrame?
I have worked with pandas DataFrames and need to do some basic selecting/filtering of data, but in a PySpark DataFrame. I'm running the script as an AWS Glue job. Do I need to convert the PySpark DataFrame ...
0 votes · 1 answer · 33 views
PySpark: New column with uppercase name is dropped unexpectedly
I am trying to add a new column CHANNEL_ID to my PySpark DataFrame based on conditional logic using pyspark.sql.functions.when, and after that, remove the old column channel_id, which is no longer ...
0 votes · 1 answer · 72 views
Which is the best way to convert JSON into a DataFrame? [closed]
I have a question about the best way to convert this JSON to a DataFrame:
JSON data:
{
"myschema": {
"accounts": {
"load_type": "daily",
...
1 vote · 1 answer · 51 views
How to convert a JSON file into a DataFrame with Spark?
One of my tasks today is to read a simple JSON file, convert it into a DataFrame, loop over the DataFrame, and do some validations, etc...
This is part of my code:
bucket_name = 'julio-s3'
...
0 votes · 0 answers · 62 views
How to create a new column or update a column inside a DataFrame?
Good morning everyone,
I have a question today that I don't know exactly how to approach.
Given a DataFrame, I need to create columns dynamically, and those columns will contain a set of validations that I have ...
0 votes · 1 answer · 58 views
How to convert a list into multiple columns and a DataFrame?
I have a challenge today:
Given a list of S3 paths, split it and get a DataFrame with one column containing the path and a new column with just the name of the folder.
My list has the ...
0 votes · 0 answers · 48 views
How to Dynamically Determine the Repartition Count for Loading Large CSV Files into PostgreSQL Using PySpark?
I need to load 5 million records from a CSV file into a PostgreSQL table as quickly as possible using PySpark. The performance and speed of the operation are critical for me. I often run my code from ...
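A simple heuristic (a sketch, not a definitive sizing rule) is to derive the partition count from the input file size, capped so the parallel JDBC writes don't overwhelm PostgreSQL:

```python
def partition_count(size_bytes: int, target_mb: int = 128, max_parts: int = 64) -> int:
    """One partition per ~target_mb of input, capped at max_parts.

    Each output partition becomes one concurrent JDBC connection during
    df.write.jdbc(...), so the cap protects the database from a
    connection storm. target_mb and max_parts are assumed defaults;
    tune them against the actual cluster and PostgreSQL limits.
    """
    return max(1, min(max_parts, size_bytes // (target_mb * 1024 * 1024) + 1))

# Hypothetical usage (file path, URL, and table name are assumptions):
# import os
# n = partition_count(os.path.getsize("/data/big.csv"))
# df = spark.read.csv("/data/big.csv", header=True).repartition(n)
# df.write.jdbc(url, "target_table", mode="append", properties=props)
```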
0 votes · 0 answers · 54 views
PySpark job failing when trying to write; says worker versions are different but the versions match
The write seems to be failing when it gets to the write command; before that it is able to pass the pandas DataFrame, and it seems to be creating and applying methods fine. The error ...
1 vote · 1 answer · 42 views
avg() over a whole DataFrame causing different output
I see that dataframe.agg(avg(col)) works fine, but when I calculate avg() over a window spanning the whole column (not using any partition), I see different results based on which column I use with orderBy.
...
-1 votes · 1 answer · 47 views
PySpark DataFrame not returning rows having values with more than 8 digits
I have created a sample DataFrame in PySpark and the ID column contains a few values with more than 8 digits. But it returns only those rows having values with fewer than 8 digits in the ID field. Can ...
0 votes · 0 answers · 53 views
Error in converting pandas dataframe into spark dataframe
I'm encountering an issue in Jupyter Notebook when working with Pandas and Spark on Kubernetes (k8s). Here's the sequence of steps I follow:
Create a Pandas DataFrame.
Create a Spark session ...
0 votes · 0 answers · 89 views
Spark SQL query returns column that has 0 length but is non-null
I have a Spark DataFrame backed by a parquet file. The column is string type.
spark.sql("select col_a, length(col_a) from df where col_a is not null")
+-------------------+------------------------...