1 vote
1 answer
111 views

Here is a minimal example using default data in Databricks (Spark 3.4): import org.apache.spark.sql.functions.col import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types._ sc....
Igor Railean
0 votes
2 answers
140 views

I'm working with PySpark to process large amounts of data. However, I noticed that the function called by mapPartitions is executed one more time than expected. For instance, in the following code ...
sebenitezg
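
The excerpt cuts off before the code, but an "extra" invocation is often just Spark recomputing partitions: without caching, every action re-runs the mapPartitions function. A minimal sketch (all names here are illustrative) that makes the invocation count visible with an accumulator:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("mapPartitionsDemo").getOrCreate()
sc = spark.sparkContext

calls = sc.accumulator(0)

def process(partition):
    calls.add(1)              # count how often the function body actually runs
    for x in partition:
        yield x * 2

rdd = sc.parallelize(range(10), 2).mapPartitions(process)

rdd.count()                   # first action: runs process once per partition
rdd.collect()                 # second action: recomputes and runs it again

print(calls.value)            # 4 (2 partitions x 2 actions), not 2; rdd.cache() avoids the recompute
```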
0 votes
1 answer
30 views

I have RDD1 (col1, col2): (A, x123), (B, y123), (C, z123) and RDD2 (col1): A, C. I want to run an intersection of the two RDDs and find the common elements, i.e., items that are in RDD2; what is the data of ...
Sachin Shrivastava
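
A minimal sketch of one way to do this, assuming the goal is to keep the rows of RDD1 whose key appears in RDD2: give RDD2's keys a dummy value so both sides are pair RDDs, then join on the key.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[1]").getOrCreate().sparkContext

rdd1 = sc.parallelize([("A", "x123"), ("B", "y123"), ("C", "z123")])
rdd2 = sc.parallelize(["A", "C"])

# join keeps only keys present on both sides; drop the dummy value afterwards
common = rdd1.join(rdd2.map(lambda k: (k, None))).mapValues(lambda v: v[0])

print(common.collect())   # [('A', 'x123'), ('C', 'z123')] (order may vary)
```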
0 votes
1 answer
4k views

I have a DataFrame on Databricks on which I would like to use the RDD API. The type of the DataFrame is pyspark.sql.connect.dataframe.DataFrame after reading from the catalog. I found out that this ...
imawful • 135
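
pyspark.sql.connect.dataframe.DataFrame is the Spark Connect client class, and the RDD API is not available through Spark Connect. A hedged sketch of one common substitute, mapInPandas, which covers many per-row use cases (the table and logic here are illustrative; on Databricks, spark is the session provided by the notebook):

```python
# Stand-in for the real catalog read, e.g. df = spark.read.table("catalog.schema.table")
df = spark.range(10)

def double_id(batches):
    # batches is an iterator of pandas DataFrames, one chunk at a time
    for pdf in batches:
        pdf["id"] = pdf["id"] * 2
        yield pdf

result = df.mapInPandas(double_id, schema="id long")
result.show()
```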
0 votes
1 answer
66 views

Resources on this are scarce and I'm not sure there's a solution to this issue. Suppose you have three simple RDDs, or more specifically three PairRDDs: val rdd1: RDD[(Int, Int)] = sc.parallelize(...
Nizar • 763
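
The excerpt cuts off before stating the goal; assuming the common case of combining three pair RDDs by key, PySpark's groupWith (cogroup over multiple RDDs) is one option. A sketch with made-up data (sc is an existing SparkContext):

```python
rdd1 = sc.parallelize([(1, 10), (2, 20)])
rdd2 = sc.parallelize([(1, 100), (3, 300)])
rdd3 = sc.parallelize([(2, 2000), (3, 3000)])

# groupWith cogroups any number of pair RDDs by key
grouped = rdd1.groupWith(rdd2, rdd3)

for key, (v1, v2, v3) in sorted(grouped.collect()):
    print(key, list(v1), list(v2), list(v3))
# 1 [10] [100] []
# 2 [20] [] [2000]
# 3 [] [300] [3000]
```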
0 votes
0 answers
156 views

While using the following code: import pyspark from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import SparkSession from pyspark.sql.types import Row from datetime ...
aemilius89
-1 votes
1 answer
350 views

I used the code below before enabling a Unity Catalog cluster in an Azure Databricks notebook, but after switching to a shared-access-mode cluster I can no longer use this logic. How should we achieve ...
Developer Rajinikanth
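
The code in the question is truncated, but on Unity Catalog clusters in shared access mode the RDD API and spark.sparkContext are restricted, so RDD-based logic generally has to be rewritten against the DataFrame API. An illustrative before/after sketch (the actual logic from the question is unknown):

```python
# Before (fails on a shared-access-mode cluster):
# rdd = spark.sparkContext.parallelize([("A", 1), ("B", 2)])
# df = rdd.toDF(["col1", "col2"])

# After: build the DataFrame directly, no SparkContext needed
df = spark.createDataFrame([("A", 1), ("B", 2)], schema="col1 string, col2 int")
df.show()
```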
1 vote
1 answer
54 views

I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window spanning the whole column (not using any partition), I see different results depending on which column I use with orderBy. ...
anurag86 • 1,707
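
This is likely the window's default frame at work: when a window has an orderBy, Spark's default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so avg() becomes a running average whose per-row values depend on the sort column. A sketch contrasting the two frames:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["id", "val"])

# Default frame with orderBy: a running average that varies row by row
running = F.avg("val").over(Window.orderBy("id"))

# Explicit unbounded frame: the same whole-column average on every row
overall = F.avg("val").over(
    Window.orderBy("id").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

df.select("id", "val", running.alias("running_avg"), overall.alias("overall_avg")).show()
```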
3 votes
1 answer
89 views

I have code like the below, using PySpark. test_truth_value = RDD. test_predictor_rdd = RDD. valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[...
Inkyu Kim • 175
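
The truncated code follows the classic MLlib pattern, and zip() requires both RDDs to have the same partitioning and the same number of elements per partition, which a model's prediction output often violates. A hedged workaround is to pair by index instead (test_truth_value, test_predictor_rdd, and lasso_model are the names from the question):

```python
preds = lasso_model.predict(test_predictor_rdd)

# zipWithIndex yields (element, index); flip to (index, element) and join on the index
truth_by_idx = test_truth_value.zipWithIndex().map(lambda x: (x[1], x[0]))
pred_by_idx = preds.zipWithIndex().map(lambda x: (x[1], x[0]))

values_and_pred = truth_by_idx.join(pred_by_idx).values()  # RDD of (truth, prediction)
```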
1 vote
1 answer
61 views

I trained tf-idf on a pre-tokenized (unigram tokenizer) dataset that I converted from list[list(token1, token2, token3, ...)] to an RDD, using PySpark's HashingTF and IDF implementations. I tried to ...
Caden • 65
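
The excerpt cuts off at "I tried to ...", but for reference the usual RDD-based flow with pyspark.mllib looks like the sketch below (tokens stands in for the pre-tokenized dataset; sc is an existing SparkContext):

```python
from pyspark.mllib.feature import HashingTF, IDF

# tokens: RDD of token lists converted from the pre-tokenized dataset
tokens = sc.parallelize([["token1", "token2"], ["token2", "token3", "token3"]])

tf = HashingTF().transform(tokens)   # term-frequency SparseVectors

tf.cache()                           # IDF.fit and transform each pass over tf
tfidf = IDF().fit(tf).transform(tf)  # RDD of tf-idf vectors

print(tfidf.first())
```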
1 vote
1 answer
622 views

I want to apply a schema to specific non-technical columns of a Spark DataFrame. Beforehand, I add an artificial ID using Window and row_number so that I can later join some other technical columns to ...
stats_guy • 717
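
For reference, a common sketch of that artificial-ID step (df is the DataFrame from the question); note that a row_number window without partitionBy funnels all rows through a single partition, so monotonically_increasing_id() is often preferred when the IDs only need to be unique, not consecutive:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Consecutive IDs via row_number (single-partition shuffle; fine for modest data)
w = Window.orderBy(F.monotonically_increasing_id())
df_ids = df.withColumn("artificial_id", F.row_number().over(w))

# Unique-but-gappy IDs without the shuffle
df_ids2 = df.withColumn("artificial_id", F.monotonically_increasing_id())
```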
0 votes
0 answers
44 views

I need to solve a problem where a company wants to offer k different users free use of their application (a kind of coupon) for two months. The goal is to identify users who are likely to churn (leave ...
Yoel Ha
0 votes
1 answer
258 views

I have a PySpark DataFrame which needs ordering on a column ("Reference"). The values in the column typically look like: ["AA.1234.56", "AA.1101.88", "AA.904.33"...
pymat • 1,192
-1 votes
1 answer
62 views

When trying to map our 6-column PySpark RDD into a 4-tuple, we get a "list index out of range" error for any list element besides 0, which returns the normal result. The dataset is structured like this: X,Y,FID,...
Toxicone 7
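
An "index out of range" for every element except 0 usually means split() produced a single field for some lines, e.g. because a header line or a different delimiter slipped through. A defensive sketch (the column count comes from the excerpt; the file path is illustrative):

```python
lines = sc.textFile("data.csv")
header = lines.first()

rows = (lines.filter(lambda l: l != header)             # drop the header row
             .map(lambda l: l.split(","))
             .filter(lambda f: len(f) >= 6)             # skip malformed/short rows
             .map(lambda f: (f[0], f[1], f[2], f[3])))  # 4-tuple from the 6 columns
```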
0 votes
1 answer
97 views

I have around 613 text files stored in Azure Data Lake Gen2 at this path, e.g. '/rawdata/no=/.txt'. I want to read all the text files and base64-decode them, as they are base64 encoded. But ...
Rushank Patil
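
One sketch: spark.read.text loads all matching files at once, and the built-in unbase64 function decodes the payload without dropping to RDDs (the glob path below is illustrative; the real one in the question is partially elided):

```python
from pyspark.sql import functions as F

df = spark.read.text("/rawdata/*/*.txt")   # one row per line across all files

decoded = df.select(F.unbase64("value").cast("string").alias("decoded"))
decoded.show(truncate=False)
```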
