All Questions
Tagged with pyspark apache-spark-sql
10,924 questions
0 votes · 0 answers · 10 views
Why does a subquery without matching column names still work in Spark SQL?
I have the following two datasets in Spark SQL:
person view:
person = spark.createDataFrame([
(0, "Bill Chambers", 0, [100]),
(1, "Matei Zaharia", 1, [500, 250, 100]),
(2, "...
0 votes · 0 answers · 37 views
Disable printing INFO messages when running spark-sql
I'm running SQL commands with spark-sql. I have put rootLogger.level = off in the log4j2.properties file, but I'm still getting some INFO messages:
Spark Web UI available at http://computer:4040
Spark ...
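The question concerns the spark-sql shell, but in a PySpark session the same suppression can also be done programmatically; a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Raises the Log4j threshold for this SparkContext; banner lines printed
# before the context exists are out of its reach.
spark.sparkContext.setLogLevel("ERROR")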
1 vote · 0 answers · 29 views
PySpark: writing a dataframe to an Oracle database table using JDBC
I am new to PySpark and need a few clarifications on writing a dataframe to an Oracle database table using JDBC.
As part of the requirement I need to read the data from an Oracle table and perform ...
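A minimal sketch of the usual JDBC writer setup, with hypothetical connection details (URL, table, credentials) and assuming the Oracle ojdbc jar is on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "val"])  # placeholder data

(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")  # hypothetical
   .option("dbtable", "SCHEMA.TARGET_TABLE")                   # hypothetical
   .option("user", "scott")
   .option("password", "tiger")
   .option("driver", "oracle.jdbc.OracleDriver")
   .mode("append")
   .save())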
0 votes · 0 answers · 30 views
Outer join multiple tables with sort merge join in PySpark without intermediate resorting [closed]
I want to outer join multiple large dataframes in a memory-constrained environment and store the result in an S3 bucket.
My plan was to use a sort merge join. I bucketed the dataframes based on ...
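A sketch of the bucketing approach under discussion, with hypothetical table names, paths, and bucket count: persisting each input bucketed and sorted on the join key lets a later sort-merge join reuse that layout instead of reshuffling and resorting.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Persist each input bucketed and sorted on the join key.
for name in ("a", "b", "c"):  # hypothetical table names
    df = spark.read.parquet(f"s3://bucket/{name}")  # hypothetical paths
    (df.write
       .bucketBy(64, "key")
       .sortBy("key")
       .mode("overwrite")
       .saveAsTable(f"{name}_bucketed"))

result = (spark.table("a_bucketed")
          .join(spark.table("b_bucketed"), "key", "outer")
          .join(spark.table("c_bucketed"), "key", "outer"))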
0 votes · 0 answers · 36 views
Spark with availableNow trigger doesn't archive sources
I use Spark to read JSON files that appear in a folder every day with the path pattern yyyy/mm/dd and convert them into Iceberg format. Both the JSON and Iceberg folders are in an S3 bucket under different paths.
...
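For reference, a sketch of how source archiving is usually wired up (paths, schema, and table name are placeholders); note the Spark docs describe cleanSource as best-effort, so archiving can lag behind processing:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("id", StringType())])  # placeholder schema

stream = (spark.readStream
          .format("json")
          .schema(schema)
          .option("cleanSource", "archive")
          .option("sourceArchiveDir", "s3://bucket/archive/")  # hypothetical
          .load("s3://bucket/json/"))                          # hypothetical

query = (stream.writeStream
         .format("iceberg")
         .outputMode("append")
         .option("checkpointLocation", "s3://bucket/checkpoints/")
         .trigger(availableNow=True)
         .toTable("db.events"))  # hypothetical table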
0 votes · 0 answers · 42 views
Why does ydata-profiling not detect missing values in PySpark DataFrame when using None?
I'm using ydata-profiling to generate profiling reports from a large PySpark DataFrame without converting it to Pandas (to avoid memory issues on large datasets).
Some columns contain the string "...
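A plausible explanation the question hints at: the literal string "None" is not a NULL, and profilers count only true NULLs. A sketch that normalizes such strings before profiling (the column handling is an assumption):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("None",), (None,)], ["col1"])

# Replace the literal string "None" with a real NULL in every column so
# missing-value detection can see it.
cleaned = df.select(
    *[F.when(F.col(c) == "None", None).otherwise(F.col(c)).alias(c)
      for c in df.columns])
cleaned.show()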
0 votes · 0 answers · 12 views
How to resolve a time parser error when using EMR with a PySpark script as a step
I am getting this error when running an EMR notebook and passing some dates:
An error occurred: An error occurred while calling o236.showString.
: org.apache.spark.SparkException: Job aborted due ...
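Without the full stack trace this is a guess, but a frequent culprit for date-parsing failures on Spark 3 (which recent EMR releases run) is the stricter datetime parser; the legacy behaviour can be restored per session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Restores Spark 2.x datetime parsing for this session only; an assumption
# worth testing against the actual stack trace.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")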
0 votes · 0 answers · 50 views
Databricks: JVM Heap Leak When Iterating Over a Large Number of Tables Using DESCRIBE DETAIL
Problem:
I'm trying to generate a consolidated metadata table for all tables within a Databricks database (I do not have admin privileges). The process works fine for the first few thousand tables, ...
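A sketch of one mitigation to try (the database name is hypothetical): keep only plain Python rows on the driver and periodically clear cached relations, rather than accumulating DataFrame references across thousands of tables.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = []
for i, t in enumerate(spark.sql("SHOW TABLES IN my_db").collect()):
    # collect() returns plain Row objects; no DataFrame reference is retained.
    rows.extend(spark.sql(f"DESCRIBE DETAIL my_db.{t.tableName}").collect())
    if i % 1000 == 999:
        # Assumption to verify: dropping cached tables between batches may
        # release some driver-side state.
        spark.catalog.clearCache()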
0 votes · 0 answers · 58 views
Generating a Parquet file with a bloom filter
I am looking to partition raw Parquet data in AWS S3 by YYYYMMDD and to enable bloom filters on a high-cardinality column (say, QID).
import boto3
from pyspark.sql import SparkSession
from pyspark....
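A sketch of the writer options involved (data, path, and NDV figure are placeholders); these are parquet-mr properties that the Spark Parquet writer passes through, keyed per column with a # suffix:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(20240101, "q1")], ["dt", "QID"])  # placeholder data

(df.write
   .partitionBy("dt")  # YYYYMMDD-style partition column
   .option("parquet.bloom.filter.enabled#QID", "true")
   .option("parquet.bloom.filter.expected.ndv#QID", "1000000")  # placeholder NDV
   .parquet("s3://bucket/out/"))  # hypothetical path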
1 vote · 1 answer · 97 views
How to Exclude Rows Based on a Dynamic Condition in a PySpark Window Function?
I am working with PySpark and need to create a window function that calculates the median of the previous 5 values in a column. However, I want to exclude rows where a specific column, feature, is True. ...
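A sketch of one approach (column names assumed from the question): null out the excluded rows so the windowed aggregate ignores them. Caveat: the frame still spans the previous five physical rows, not the previous five valid values.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 10.0, False), (2, 20.0, True), (3, 30.0, False),
     (4, 40.0, False), (5, 50.0, False), (6, 60.0, False)],
    ["id", "value", "feature"])

w = Window.orderBy("id").rowsBetween(-5, -1)
# Rows with feature == True become NULL and are ignored by the aggregate.
masked = F.when(~F.col("feature"), F.col("value"))
df.withColumn("median_prev5", F.percentile_approx(masked, 0.5).over(w)).show()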
1 vote · 0 answers · 54 views
Spark groups broadcast hash joins into a single task
We have a Spark job (on Databricks) that joins ~60 tables.
The job starts by joining the main table with some other tables using SortMergeJoin. This works fine.
The last step of the ...
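For context, a sketch of one mitigation to try (sizes and partition count hypothetical): broadcast hash joins pipeline into the probe side's existing partitioning, so re-spreading the probe side before the chain is one lever to pull.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
main = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.range(100).withColumnRenamed("id", "key")

# A chain of broadcast joins runs in the probe side's current partitions;
# if those collapsed to one, every stage after it is a single task.
main = main.repartition(200, "key")
result = main.join(F.broadcast(small), "key", "left")
result.count()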
0 votes · 1 answer · 67 views
PySpark: find columns with mismatched data
I have a Pyspark dataframe df1.
It has columns like col1, col2, col3, col4, col5.
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A |A |X |Y |Y |
|B |C |...
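A sketch of one way to surface the mismatches (the pairing of columns below is an assumption; adjust to the real comparison rules):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(
    [("A", "A", "X", "Y", "Y"), ("B", "C", "X", "X", "X")],
    ["col1", "col2", "col3", "col4", "col5"])

pairs = [("col1", "col2"), ("col4", "col5")]  # hypothetical pairing
df1 = df1.withColumn(
    "mismatched",
    F.filter(
        F.array(*[F.when(F.col(a) != F.col(b), F.lit(f"{a}<>{b}"))
                  for a, b in pairs]),
        lambda x: x.isNotNull()))
df1.show(truncate=False)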
1 vote · 1 answer · 54 views
Will repartition() always shuffle, even before an action is triggered?
I read that repartition() will be lazily evaluated as it is a transformation, and transformations are only triggered on actions.
However, I imagine that all the data must be loaded by Spark first ...
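A quick way to see both halves of the answer: the shuffle is planned immediately but executed only on an action.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

# No job runs here: repartition only adds an Exchange to the plan.
df2 = df.repartition(10)
df2.explain()  # shows the planned Exchange, still without executing
df2.count()    # the action: only now is the data read and shuffled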
1 vote · 1 answer · 68 views
How to efficiently join two directories that are already partitioned
Suppose I have two different data sets A and B, and they are both already partitioned by joinKey and laid out in the filesystem like A/joinKey/<files> and B/joinKey/<files> in the ...
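A sketch assuming a Hive-style key=value layout (A/joinKey=<value>/<files>), where partition discovery exposes joinKey as a column without scanning file contents. Note that the directory layout alone does not let Spark skip the join-time shuffle; pre-bucketed tables would be needed for that.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical paths; joinKey is discovered as a partition column.
a = spark.read.parquet("s3://bucket/A/")
b = spark.read.parquet("s3://bucket/B/")
joined = a.join(b, "joinKey", "inner")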
1 vote · 1 answer · 60 views
Unexpected PySpark filter behaviour
I want to filter out the rows where CID (string type) is '-' and trait_diff is null. The code I have provided filters out the rows where both of them are null even if I do not use isNull for ...
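A sketch of the likely explanation and fix: comparisons against NULL yield NULL rather than False, so a negated filter silently drops NULL rows; coalescing the condition makes the intent explicit.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("-", None), ("c1", None), (None, None), ("c2", "x")],
    ["CID", "trait_diff"])

drop_cond = (F.col("CID") == "-") & F.col("trait_diff").isNull()
# NULL CIDs make drop_cond NULL; coalesce to False so those rows are kept.
df.filter(~F.coalesce(drop_cond, F.lit(False))).show()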