4,063 questions
1 vote · 1 answer · 111 views
How to properly recalculate Spark DataFrame statistics after checkpoint?
Here is a minimal example using the default data in Databricks (Spark 3.4):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
sc....
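A minimal PySpark sketch of the same situation (the question itself is in Scala, and the checkpoint directory and column name here are assumptions): checkpoint the DataFrame and then inspect the statistics the optimizer actually uses with the cost explain mode.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # hypothetical checkpoint location

df = spark.range(1_000_000).withColumnRenamed("id", "value")
df_cp = df.checkpoint()         # eager by default: materializes the data and truncates the lineage
df_cp.explain(mode="cost")      # prints the optimized plan with size/row-count statistics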
0 votes · 2 answers · 140 views
Pyspark mapPartition evaluates the function more times than expected
I'm working with PySpark to process large amounts of data. However, I noticed that the function called by mapPartitions is executed one more time than expected. For instance, in the following code ...
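A minimal sketch (variable names are illustrative) of how to count the actual invocations with an accumulator; one common reason for extra evaluations is that each action recomputes an uncached RDD, which caching avoids.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
calls = sc.accumulator(0)

def per_partition(rows):
    calls.add(1)                    # one increment per partition evaluation
    yield sum(1 for _ in rows)

rdd = sc.parallelize(range(100), 4).mapPartitions(per_partition).cache()
rdd.count()                         # first action materializes and caches the result
rdd.collect()                       # second action reads the cache, no extra calls
print(calls.value)                  # expect 4 (one per partition), not 8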
0 votes · 1 answer · 30 views
Find common data among two RDD in spark execution
I have RDD1
col1  col2
A     x123
B     y123
C     z123
and RDD2
col1
A
C
I want to run an intersection of the two RDDs and find the common elements, i.e. for the items that are in RDD2, what is the data of ...
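A hedged sketch of one way to do this (column and variable names are assumed from the question): key both RDDs by col1 and join them, so only keys present in both survive.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
rdd1 = sc.parallelize([("A", "x123"), ("B", "y123"), ("C", "z123")])
rdd2 = sc.parallelize(["A", "C"]).map(lambda k: (k, None))   # key-only RDD

common = rdd1.join(rdd2).map(lambda kv: (kv[0], kv[1][0]))
print(common.collect())   # [('A', 'x123'), ('C', 'z123')] (order may vary)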
0 votes · 1 answer · 4k views
RDD is not implemented error on pyspark.sql.connect.dataframe.Dataframe
I have a DataFrame on Databricks on which I would like to use the RDD API. The type of the DataFrame is pyspark.sql.connect.dataframe.DataFrame after reading from the catalog. I found out that this ...
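Spark Connect DataFrames do not expose the RDD API, so a common workaround (a sketch under that assumption, not necessarily the asker's fix; the table and column names are hypothetical) is to stay in the DataFrame API or, for small data, collect rows to the driver instead of calling .rdd.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("catalog.schema.some_table")      # hypothetical catalog table
df2 = df.withColumn("doubled", F.col("value") * 2)      # DataFrame-native transform instead of .rdd.map
rows = df2.limit(100).collect()                         # local list[Row] for small results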
0 votes · 1 answer · 66 views
unpacking nested tuples after Spark RDD join
The resources for this are scarce, and I'm not sure there's a solution to this issue.
Suppose you have three simple RDDs, or more specifically three PairRDDs.
val rdd1: RDD[(Int, Int)] = sc.parallelize(...
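A PySpark sketch of the nested-tuple shape produced by chained joins and one way to flatten it (the question is in Scala, but the structure is the same; the values are made up for illustration).
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
rdd1 = sc.parallelize([(1, 10), (2, 20)])
rdd2 = sc.parallelize([(1, 100), (2, 200)])
rdd3 = sc.parallelize([(1, 1000), (2, 2000)])

joined = rdd1.join(rdd2).join(rdd3)      # shape: (k, ((v1, v2), v3))
flat = joined.map(lambda kv: (kv[0], kv[1][0][0], kv[1][0][1], kv[1][1]))
print(flat.collect())                    # [(1, 10, 100, 1000), (2, 20, 200, 2000)] (order may vary)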
0 votes · 0 answers · 156 views
Py4JJavaError in a Jupyter notebook when using a simple .count with PySpark
While using the following code:
import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.types import Row
from datetime ...
-1 votes · 1 answer · 350 views
pySpark RDD whitelisted Class issues
I was using the code below in an Azure Databricks notebook before enabling a Unity Catalog cluster, but after changing to a shared-access-mode cluster I am no longer able to use this logic. How should we achieve ...
1 vote · 1 answer · 54 views
avg() over a whole dataframe causing different output
I see that dataframe.agg(avg(col)) works fine, but when I calculate avg() over a window spanning the whole column (without any partition), I see different results depending on which column I use with orderBy.
...
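A sketch of the likely cause, assuming a DataFrame df with columns some_col and value: once a window has an orderBy, Spark defaults its frame to rangeBetween(unboundedPreceding, currentRow), so avg() becomes a running average that depends on the ordering column; an explicit full frame restores the whole-column average.
from pyspark.sql import Window, functions as F

w_running = Window.orderBy("some_col")                     # default frame: running average
w_full = Window.orderBy("some_col").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing)  # whole-column average

df = df.withColumn("running_avg", F.avg("value").over(w_running)) \
       .withColumn("overall_avg", F.avg("value").over(w_full))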
3 votes · 1 answer · 89 views
Casting RDD to a different type (from float64 to double)
I have code like the snippet below, which uses PySpark.
test_truth_value = RDD.
test_predictor_rdd = RDD.
valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[...
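A hedged sketch of casting the zipped values explicitly to Python floats before downstream use; the RDD and model names follow the question, but the lambda body is an assumption since the original is cut off.
valuesAndPred = (
    test_truth_value
    .zip(lasso_model.predict(test_predictor_rdd))
    .map(lambda x: (float(x[0]), float(x[1])))   # numpy.float64 -> plain Python float
)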
1 vote · 1 answer · 61 views
Saving and Loading RDD (pyspark) to pickle file is changing order of SparseVectors
I trained tf-idf on a pre-tokenized (unigram tokenizer) dataset that I converted from list[list(token1, token2, token3, ...)] to an RDD using pyspark's HashingTF and IDF implementations. I tried to ...
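A sketch of one way to make the ordering explicit (assuming the issue is that saveAsPickleFile/pickleFile does not guarantee the original RDD order): attach an index before saving and sort by it after loading. The RDD name and paths are hypothetical.
indexed = tfidf_rdd.zipWithIndex().map(lambda vi: (vi[1], vi[0]))   # (index, vector)
indexed.saveAsPickleFile("/tmp/tfidf_indexed")                      # hypothetical output path

restored = sc.pickleFile("/tmp/tfidf_indexed").sortByKey().values()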
1 vote · 1 answer · 622 views
Why is my PySpark row_number column messed up when applying a schema?
I want to apply a schema to specific non-technical columns of a Spark DataFrame. Beforehand, I add an artificial ID using Window and row_number so that I can later join some other technical columns to ...
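A hedged sketch of one way to keep the artificial ID stable, assuming the cause is that row_number over a window is recomputed per action and can shift when the ordering is not deterministic: order by a stable key and materialize the result once before joining. Column names are illustrative.
from pyspark.sql import Window, functions as F

w = Window.orderBy("some_stable_key")                     # deterministic sort key
df_id = df.withColumn("row_id", F.row_number().over(w)).cache()
df_id.count()                                             # materialize once before the joins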
0 votes · 0 answers · 44 views
PySpark with RDD - How to calculate and compare averages?
I need to solve a problem where a company wants to offer k different users free use (a kind of coupon) of their application for two months. The goal is to identify users who are likely to churn (leave ...
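A minimal RDD sketch of the average comparison (the real schema is not shown, so (user_id, usage) pairs are an assumption): compute a per-user average with aggregateByKey and compare it to the overall mean.
pairs = usage_rdd.map(lambda r: (r[0], float(r[1])))      # hypothetical (user_id, usage) records

sums_counts = pairs.aggregateByKey(
    (0.0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]))
user_avgs = sums_counts.mapValues(lambda s: s[0] / s[1])

overall_avg = pairs.values().mean()
below_avg = user_avgs.filter(lambda kv: kv[1] < overall_avg)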
0 votes · 1 answer · 258 views
Order PySpark Dataframe by applying a function/lambda
I have a PySpark DataFrame which needs ordering on a column ("Reference").
The values in the column typically look like:
["AA.1234.56", "AA.1101.88", "AA.904.33"...
-1 votes · 1 answer · 62 views
Problem with pyspark mapping - Index out of range after split
When trying to map our 6-column PySpark RDD into a 4-tuple, we get a list index out of range error for any list element besides 0, which returns the normal result.
The dataset is structured like this:
X,Y,FID,...
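A sketch of a common cause and fix (an assumption, since the full data is not shown): a header line or short/blank rows split into fewer than six fields, so only index 0 is always valid; dropping the header and filtering short rows before indexing avoids the error. The path and selected columns are hypothetical.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
lines = sc.textFile("/path/to/data.csv")                  # hypothetical input path
header = lines.first()
rows = (lines.filter(lambda l: l != header)
             .map(lambda l: l.split(","))
             .filter(lambda f: len(f) >= 6))              # keep only complete rows
tuples = rows.map(lambda f: (f[0], f[1], f[2], f[5]))     # an example 4-tuple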
0 votes · 1 answer · 97 views
Save text files as binary format using saveAsPickleFile with pyspark
I have around 613 text files stored in Azure Data Lake Gen2 at this path, e.g. '/rawdata/no=/.txt'. I want to read all the text files and un-base64 them, as they are all base64 encoded. But ...
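A hedged sketch of the decode-and-save step (paths are hypothetical placeholders, and line-per-record base64 content is an assumption): read the text files, base64-decode each line, and persist the decoded bytes with saveAsPickleFile.
import base64
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
raw = sc.textFile("/mnt/rawdata/input")                   # hypothetical input location
decoded = raw.map(lambda line: base64.b64decode(line))    # bytes per record
decoded.saveAsPickleFile("/mnt/rawdata/decoded_pickle")   # hypothetical output location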