All Questions
7,520 questions
1 vote · 0 answers · 47 views
Why does a subquery without matching column names still work in Spark SQL? [duplicate]
I have the following two datasets in Spark SQL:
person view:
person = spark.createDataFrame([
    (0, "Bill Chambers", 0, [100]),
    (1, "Matei Zaharia", 1, [500, 250, 100]),
    (2, "...
0 votes · 0 answers · 37 views
Disable printing info when running spark-sql
I'm running SQL commands with spark-sql. I have set rootLogger.level = off in my log4j2.properties file, but I'm still getting some info messages:
Spark Web UI available at http://computer:4040
Spark ...
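A hedged note on this one: log4j2 only governs messages routed through the logging framework, and some spark-sql banner lines appear to be written straight to the console, so they can survive rootLogger.level = off. For PySpark sessions (not the CLI), the runtime-level equivalent looks like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Raise the threshold after the context is up; lines emitted before this
# call, or printed directly to stderr, will still appear.
spark.sparkContext.setLogLevel("ERROR")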
1 vote · 0 answers · 30 views
PySpark: writing a DataFrame to an Oracle database table using JDBC
I am new to PySpark and need a few clarifications on writing a DataFrame to an Oracle database table using JDBC.
As part of the requirement I need to read the data from an Oracle table and perform ...
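The core write path the question is asking about looks like the sketch below. The URL, credentials, and table name are placeholders, not values from the question, and the Oracle JDBC driver jar (e.g. ojdbc8) must be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "abc")], ["id", "name"])  # placeholder data

(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")  # hypothetical
   .option("dbtable", "MYSCHEMA.TARGET_TABLE")                 # hypothetical
   .option("user", "scott")
   .option("password", "tiger")
   .option("driver", "oracle.jdbc.OracleDriver")
   .mode("append")   # or "overwrite"
   .save())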
0 votes · 0 answers · 30 views
Outer join multiple tables with sort merge join in PySpark without intermediate resorting [closed]
I want to outer join multiple large dataframes in a memory-constrained environment and store the result in an S3 bucket.
My plan was to use sort merge join. I nicely bucketed the dataframes based on ...
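One way to get what the question wants, sketched under the assumption that a metastore is available: write both sides as bucketed, pre-sorted tables on the join key, so the sort-merge join can skip the shuffle and re-sort. Table names, bucket count, and key are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df_a = spark.range(100).withColumnRenamed("id", "join_key")  # stand-ins for
df_b = spark.range(50).withColumnRenamed("id", "join_key")   # the real tables

for name, df in [("a_bucketed", df_a), ("b_bucketed", df_b)]:
    (df.write
       .bucketBy(64, "join_key")   # bucket counts must match across tables
       .sortBy("join_key")
       .mode("overwrite")
       .saveAsTable(name))

joined = spark.table("a_bucketed").join(
    spark.table("b_bucketed"), "join_key", "outer")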
0 votes · 0 answers · 36 views
Spark with availableNow trigger doesn't archive sources
I use Spark to read JSON files that appear in a folder every day with the path pattern yyyy/mm/dd, and convert them into Iceberg format. Both the JSON and Iceberg folders are in an S3 bucket, on different paths.
...
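For reference, the archiving knobs live on the file source, not on the trigger; a sketch with hypothetical paths, schema, and table name follows. Archiving also requires a checkpoint, and Spark only moves a file some batches after the one that read it commits:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (spark.readStream
    .format("json")
    .schema("id LONG, payload STRING")                     # placeholder schema
    .option("cleanSource", "archive")
    .option("sourceArchiveDir", "s3://bucket/archive/")    # hypothetical
    .load("s3://bucket/json/*/*/*/"))                      # hypothetical

query = (stream.writeStream
    .format("iceberg")
    .option("checkpointLocation", "s3://bucket/checkpoints/json2iceberg")
    .trigger(availableNow=True)
    .toTable("db.events"))                                 # hypothetical table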
0 votes · 0 answers · 12 views
How to resolve a time parser error when running a PySpark script as an EMR step
I am getting this error when running an EMR job from a notebook, passing some dates:
An error occurred: An error occurred while calling o236.showString.
: org.apache.spark.SparkException: Job aborted due ...
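Since the stack trace is cut off, this is an assumption, but the usual culprit for time parser errors on Spark 3 (EMR 6+) is the stricter new datetime parser raising SparkUpgradeException; falling back to the legacy parser is a common mitigation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Either fix the datetime pattern to Spark 3's syntax, or fall back:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")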
1 vote · 1 answer · 97 views
How to Exclude Rows Based on a Dynamic Condition in a PySpark Window Function?
I am working with PySpark and need to create a window function that calculates the median of the previous 5 values in a column. However, I want to exclude rows where a specific column, feature, is True. ...
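A sketch of one way to do this: keep the flagged rows in the window frame but null their values inside the aggregate, since percentile_approx skips nulls. The ordering column ts and the sample data are invented; value and feature come from the question.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 10.0, False), (2, 20.0, True), (3, 30.0, False), (4, 40.0, False),
     (5, 50.0, True), (6, 60.0, False), (7, 70.0, False)],
    ["ts", "value", "feature"],
)

w = Window.orderBy("ts").rowsBetween(-5, -1)   # the previous 5 rows
df = df.withColumn(
    "median_prev5",
    # Flagged rows become null and are ignored by the aggregate:
    F.percentile_approx(F.when(~F.col("feature"), F.col("value")), 0.5).over(w),
)
df.show()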
1 vote · 0 answers · 54 views
Spark groups broadcast hash joins into a single task
We have a job in Spark (Databricks) in which we join ~60 tables.
The job starts by joining the main table with some other tables using SortMergeJoin. This is working fine.
The last step of the ...
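Some hedged levers for this situation, since the excerpt cuts off before the last step: a chain of broadcast hash joins pipelines into one stage, so its task count is set by the probe side's partitioning. Raising that, or disabling auto-broadcast so Spark falls back to sort-merge join, spreads the work. The data below is synthetic.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

big = spark.range(10_000_000).withColumn("k", F.col("id") % 1000)
small = spark.range(1000).withColumnRenamed("id", "k")

big = big.repartition(200, "k")             # more tasks for the pipelined stage
joined = big.join(F.broadcast(small), "k")  # keep broadcast where it helps

# ...or stop small tables from being broadcast at all:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)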
0 votes · 1 answer · 67 views
PySpark: find columns with mismatched data
I have a Pyspark dataframe df1.
It has columns like col1, col2, col3, col4, col5.
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A |A |X |Y |Y |
|B |C |...
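On one reading of "mismatched" (an assumption, as the excerpt is truncated): flag rows where columns that should agree differ. The pairings below are purely illustrative.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(
    [("A", "A", "X", "Y", "Y"), ("B", "C", "X", "X", "Y")],
    ["col1", "col2", "col3", "col4", "col5"],
)

pairs = [("col1", "col2"), ("col4", "col5")]   # hypothetical pairings
for a, b in pairs:
    df1 = df1.withColumn(f"{a}_vs_{b}_mismatch", F.col(a) != F.col(b))
df1.show()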
1 vote · 1 answer · 54 views
Will repartition() always shuffle, even before an action is triggered?
I read that repartition() will be lazily evaluated as it is a transformation, and transformations are only triggered on actions.
However, I imagine that all the data must be loaded by Spark first ...
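This is easy to check empirically: repartition() only adds an Exchange to the plan, and nothing (including the load) runs until an action fires.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

repartitioned = df.repartition(10)
repartitioned.explain()   # plan shows Exchange RoundRobinPartitioning(10); no job yet
repartitioned.count()     # the shuffle (and upstream work) happens here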
1 vote · 1 answer · 68 views
How to efficiently join two directories that are already partitioned
Suppose I have two different data sets A and B, and they are both already partitioned by joinKey and laid out in the filesystem like A/joinKey/<files> and B/joinKey/<files> in the ...
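A hedged sketch, assuming the layout can be written Hive-style (A/joinKey=1/..., B/joinKey=1/...): partition discovery then surfaces joinKey as a column and prunes directories, though the equi-join itself still shuffles unless the data is also bucketed. Paths are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

a = spark.read.option("basePath", "s3://bucket/A").parquet("s3://bucket/A")
b = spark.read.option("basePath", "s3://bucket/B").parquet("s3://bucket/B")

joined = a.join(b, "joinKey")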
0 votes · 0 answers · 23 views
Apache Spark SQL Query Returns No Results for a Column in Azure Synapse Notebook
I'm running an Apache Spark SQL query in an Azure Synapse Notebook to retrieve data from an Azure Synapse table.
%%pyspark
df = spark.sql("""
SELECT scheduledend
FROM `...
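Since the query is truncated, a first diagnostic (the table name below is a placeholder; spark is predefined in Synapse notebooks) is to confirm the column's type and how many of its values are actually non-null:

df = spark.sql("SELECT scheduledend FROM my_table")
df.printSchema()
spark.sql(
    "SELECT count(*) AS total_rows, count(scheduledend) AS non_null "
    "FROM my_table"
).show()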
0 votes · 1 answer · 79 views
Monotonically increasing id order
The spec of monotonically_increasing_id says:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
So I assume there is some ordering ...
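There is indeed an ordering, but only within a partition: the implementation packs the partition index into the upper 31 bits and a per-partition row counter into the lower 33 bits. A quick demonstration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = (spark.range(6).repartition(3)
      .withColumn("mid", F.monotonically_increasing_id())
      .withColumn("part", F.spark_partition_id()))
df.show()
# Within partition p, ids are p * 2**33, p * 2**33 + 1, ... -- increasing
# with row order there, but saying nothing about order across partitions.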
2 votes · 1 answer · 263 views
Does Spark exceptAll() require both DataFrames' columns to be in the same order?
I have wasted a considerable amount of time trying to make the PySpark exceptAll() function work, and as far as I understood it was failing (not recognizing rows existing in the target table) due to the fact that both ...
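Short answer: yes. exceptAll (like SQL's EXCEPT ALL) resolves columns by position, not by name, so re-align the second DataFrame first. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
df2 = spark.createDataFrame([("a", 1)], ["name", "id"])  # same columns, swapped

# Reorder df2's columns to df1's order before the set operation:
df1.exceptAll(df2.select(*df1.columns)).show()   # leaves only (2, "b")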
0 votes · 0 answers · 64 views
Why are two jobs created for one action in PySpark?
Below is the data used in my CSV file:
empid,empname,empsal,empdept,empblock
1,abc,2000,cse,A
2,def,1000,ece,C
3,ghi,8000,eee,D
4,jkl,4000,ece,B
5,mno,3000,itd,F
6,pqr,6000,mec,C
1) Running below ...
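A common (hedged, since the code is truncated) explanation for this one: reading a CSV with inferSchema triggers an extra scan of the file to work out column types, so a single action can surface as two jobs in the UI. The file name below is assumed from the data shown.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("header", True)
      .option("inferSchema", True)   # extra pass over the file -> extra job
      .csv("emp.csv"))               # hypothetical path
df.show()

# Supplying the schema up front avoids the inference job:
df2 = (spark.read
       .option("header", True)
       .schema("empid INT, empname STRING, empsal INT, "
               "empdept STRING, empblock STRING")
       .csv("emp.csv"))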