4,063 questions
1 vote · 1 answer · 111 views
How to properly recalculate Spark DataFrame statistics after checkpoint?
Here is a minimal example using the default data in Databricks (Spark 3.4):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
sc....
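A minimal PySpark sketch of the same situation (the question itself is in Scala, and the checkpoint directory and column name here are assumptions): checkpoint the DataFrame and then inspect the statistics the optimizer actually uses with the cost explain mode.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # hypothetical checkpoint location

df = spark.range(1_000_000).withColumnRenamed("id", "value")
df_cp = df.checkpoint()         # eager by default: materializes the data and truncates the lineage
df_cp.explain(mode="cost")      # prints the optimized plan with size/row-count statistics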
0 votes · 2 answers · 140 views
Pyspark mapPartition evaluates the function more times than expected
I'm working with PySpark to process large amounts of data. However, I noticed that the function called by mapPartitions is executed one more time than expected. For instance, in the following code ...
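A minimal sketch (variable names are illustrative) of how to count the actual invocations with an accumulator; one common reason for extra evaluations is that each action recomputes an uncached RDD, which caching avoids.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
calls = sc.accumulator(0)

def per_partition(rows):
    calls.add(1)                    # one increment per partition evaluation
    yield sum(1 for _ in rows)

rdd = sc.parallelize(range(100), 4).mapPartitions(per_partition).cache()
rdd.count()                         # first action materializes and caches the result
rdd.collect()                       # second action reads the cache, no extra calls
print(calls.value)                  # expect 4 (one per partition), not 8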
0 votes · 1 answer · 30 views
Find common data among two RDD in spark execution
I have RDD1
col1  col2
A     x123
B     y123
C     z123
and RDD2
col1
A
C
I want to run an intersection of the two RDDs and find the common elements, i.e. for the items that are in RDD2, what is the data of ...
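A hedged sketch of one way to do this (column and variable names are assumed from the question): key both RDDs by col1 and join them, so only keys present in both survive.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
rdd1 = sc.parallelize([("A", "x123"), ("B", "y123"), ("C", "z123")])
rdd2 = sc.parallelize(["A", "C"]).map(lambda k: (k, None))   # key-only RDD

common = rdd1.join(rdd2).map(lambda kv: (kv[0], kv[1][0]))
print(common.collect())   # [('A', 'x123'), ('C', 'z123')] (order may vary)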
0 votes · 1 answer · 4k views
RDD is not implemented error on pyspark.sql.connect.dataframe.Dataframe
I have a DataFrame on Databricks on which I would like to use the RDD API. The type of the DataFrame is pyspark.sql.connect.dataframe.DataFrame after reading from the catalog. I found out that this ...
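Spark Connect DataFrames do not expose the RDD API, so a common workaround (a sketch under that assumption, not necessarily the asker's fix; the table and column names are hypothetical) is to stay in the DataFrame API or, for small data, collect rows to the driver instead of calling .rdd.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("catalog.schema.some_table")      # hypothetical catalog table
df2 = df.withColumn("doubled", F.col("value") * 2)      # DataFrame-native transform instead of .rdd.map
rows = df2.limit(100).collect()                         # local list[Row] for small results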
0 votes · 1 answer · 66 views
unpacking nested tuples after Spark RDD join
The resources for this are scarce, and I'm not sure there's a solution to this issue.
Suppose you have three simple RDDs, or more specifically three PairRDDs.
val rdd1: RDD[(Int, Int)] = sc.parallelize(...
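A PySpark sketch of the nested-tuple shape produced by chained joins and one way to flatten it (the question is in Scala, but the structure is the same; the values are made up for illustration).
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
rdd1 = sc.parallelize([(1, 10), (2, 20)])
rdd2 = sc.parallelize([(1, 100), (2, 200)])
rdd3 = sc.parallelize([(1, 1000), (2, 2000)])

joined = rdd1.join(rdd2).join(rdd3)      # shape: (k, ((v1, v2), v3))
flat = joined.map(lambda kv: (kv[0], kv[1][0][0], kv[1][0][1], kv[1][1]))
print(flat.collect())                    # [(1, 10, 100, 1000), (2, 20, 200, 2000)] (order may vary)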
0 votes · 0 answers · 156 views
Py4JJavaError in a Jupyter notebook when using a simple .count with PySpark
While using the following code:
import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.types import Row
from datetime ...
-1 votes · 1 answer · 350 views
pySpark RDD whitelisted Class issues
I was using the code below in an Azure Databricks notebook before enabling a Unity Catalog cluster, but after changing to a shared-access-mode cluster I am no longer able to use this logic. How should we achieve ...
1 vote · 1 answer · 54 views
avg() over a whole dataframe causing different output
I see that dataframe.agg(avg(col)) works fine, but when I calculate avg() over a window spanning the whole column (without any partition), I see different results depending on which column I use with orderBy.
...
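A sketch of the likely cause, assuming a DataFrame df with columns some_col and value: once a window has an orderBy, Spark defaults its frame to rangeBetween(unboundedPreceding, currentRow), so avg() becomes a running average that depends on the ordering column; an explicit full frame restores the whole-column average.
from pyspark.sql import Window, functions as F

w_running = Window.orderBy("some_col")                     # default frame: running average
w_full = Window.orderBy("some_col").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing)  # whole-column average

df = df.withColumn("running_avg", F.avg("value").over(w_running)) \
       .withColumn("overall_avg", F.avg("value").over(w_full))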
3 votes · 1 answer · 89 views
Casting RDD to a different type (from float64 to double)
I have code like the snippet below, which uses PySpark.
test_truth_value = RDD.
test_predictor_rdd = RDD.
valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[...
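A hedged sketch of casting the zipped values explicitly to Python floats before downstream use; the RDD and model names follow the question, but the lambda body is an assumption since the original is cut off.
valuesAndPred = (
    test_truth_value
    .zip(lasso_model.predict(test_predictor_rdd))
    .map(lambda x: (float(x[0]), float(x[1])))   # numpy.float64 -> plain Python float
)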
1 vote · 1 answer · 61 views
Saving and Loading RDD (pyspark) to pickle file is changing order of SparseVectors
I trained tf-idf on a pre-tokenized (unigram tokenizer) dataset that I converted from list[list(token1, token2, token3, ...)] to an RDD using pyspark's HashingTF and IDF implementations. I tried to ...
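A sketch of one way to make the ordering explicit (assuming the issue is that saveAsPickleFile/pickleFile does not guarantee the original RDD order): attach an index before saving and sort by it after loading. The RDD name and paths are hypothetical.
indexed = tfidf_rdd.zipWithIndex().map(lambda vi: (vi[1], vi[0]))   # (index, vector)
indexed.saveAsPickleFile("/tmp/tfidf_indexed")                      # hypothetical output path

restored = sc.pickleFile("/tmp/tfidf_indexed").sortByKey().values()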
1 vote · 1 answer · 622 views
Why is my PySpark row_number column messed up when applying a schema?
I want to apply a schema to specific non-technical columns of a Spark DataFrame. Beforehand, I add an artificial ID using Window and row_number so that I can later join some other technical columns to ...
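A hedged sketch of one way to keep the artificial ID stable, assuming the cause is that row_number over a window is recomputed per action and can shift when the ordering is not deterministic: order by a stable key and materialize the result once before joining. Column names are illustrative.
from pyspark.sql import Window, functions as F

w = Window.orderBy("some_stable_key")                     # deterministic sort key
df_id = df.withColumn("row_id", F.row_number().over(w)).cache()
df_id.count()                                             # materialize once before the joins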
0 votes · 0 answers · 44 views
PySpark with RDD - How to calculate and compare averages?
I need to solve a problem where a company wants to offer k different users free use (a kind of coupon) of their application for two months. The goal is to identify users who are likely to churn (leave ...
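A minimal RDD sketch of the average comparison (the real schema is not shown, so (user_id, usage) pairs are an assumption): compute a per-user average with aggregateByKey and compare it to the overall mean.
pairs = usage_rdd.map(lambda r: (r[0], float(r[1])))      # hypothetical (user_id, usage) records

sums_counts = pairs.aggregateByKey(
    (0.0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]))
user_avgs = sums_counts.mapValues(lambda s: s[0] / s[1])

overall_avg = pairs.values().mean()
below_avg = user_avgs.filter(lambda kv: kv[1] < overall_avg)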
0 votes · 1 answer · 258 views
Order PySpark Dataframe by applying a function/lambda
I have a PySpark DataFrame which needs ordering on a column ("Reference").
The values in the column typically look like:
["AA.1234.56", "AA.1101.88", "AA.904.33"...
-1 votes · 1 answer · 62 views
Problem with pyspark mapping - Index out of range after split
When trying to map our 6-column PySpark RDD into a 4-tuple, we get a list index out of range error for any list element besides 0, which returns the normal result.
The dataset is structured like this:
X,Y,FID,...
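A sketch of a common cause and fix (an assumption, since the full data is not shown): a header line or short/blank rows split into fewer than six fields, so only index 0 is always valid; dropping the header and filtering short rows before indexing avoids the error. The path and selected columns are hypothetical.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
lines = sc.textFile("/path/to/data.csv")                  # hypothetical input path
header = lines.first()
rows = (lines.filter(lambda l: l != header)
             .map(lambda l: l.split(","))
             .filter(lambda f: len(f) >= 6))              # keep only complete rows
tuples = rows.map(lambda f: (f[0], f[1], f[2], f[5]))     # an example 4-tuple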
0 votes · 1 answer · 97 views
Save text files as binary format using saveAsPickleFile with pyspark
I have around 613 text files stored in Azure Data Lake Gen2 at this path, e.g. '/rawdata/no=/.txt'. I want to read all the text files and un-base64 them, as they are all base64 encoded. But ...
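A hedged sketch of the decode-and-save step (paths are hypothetical placeholders, and line-per-record base64 content is an assumption): read the text files, base64-decode each line, and persist the decoded bytes with saveAsPickleFile.
import base64
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
raw = sc.textFile("/mnt/rawdata/input")                   # hypothetical input location
decoded = raw.map(lambda line: base64.b64decode(line))    # bytes per record
decoded.saveAsPickleFile("/mnt/rawdata/decoded_pickle")   # hypothetical output location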