
All Questions

0 votes
0 answers
10 views

Why does a subquery without matching column names still work in Spark SQL?

I have the following two datasets in Spark SQL: person view: person = spark.createDataFrame([ (0, "Bill Chambers", 0, [100]), (1, "Matei Zaharia", 1, [500, 250, 100]), (2, "...
DumbCoder • 475
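The behaviour this question touches on follows from SQL semantics: an IN subquery is matched by value, never by column name. A plain-Python model of that rule (the `grad_ids` set and the third row below are made-up stand-ins, not the question's actual data):

```python
# Plain-Python model of SQL's IN-subquery semantics: the outer query
# compares *values*; the subquery's column name never participates.
person = [
    {"id": 0, "name": "Bill Chambers", "program": 0},
    {"id": 1, "name": "Matei Zaharia", "program": 1},
    {"id": 2, "name": "Someone Else", "program": 7},
]
# Subquery result: just a bag of values; whatever the subquery's
# column was called is irrelevant once the values are produced.
grad_ids = {0, 1}

# WHERE program IN (subquery): match by value only.
matched = [p["name"] for p in person if p["program"] in grad_ids]
print(matched)  # ['Bill Chambers', 'Matei Zaharia']
```

Spark SQL behaves the same way: the subquery's SELECT list produces values, so the outer and inner column names never need to match.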
0 votes
0 answers
37 views

Disable printing info when running spark-sql

I'm running SQL commands with spark-sql. I have put rootLogger.level = off in the log4j2.properties file, but I'm still getting some info messages: Spark Web UI available at http://computer:4040 Spark ...
IGRACH • 3,643
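A hedged sketch of one way this is often approached (exact keys and paths vary by Spark version, and banner lines such as `Spark Web UI available at …` may be printed by the spark-sql CLI directly rather than through log4j, in which case no log4j setting removes them): besides `rootLogger.level = off`, the console appender can carry its own threshold filter in `conf/log4j2.properties`:

```properties
# Turn the root logger off entirely
rootLogger.level = off
rootLogger.appenderRef.stdout.ref = console

# Console appender with its own threshold filter; raising the
# threshold drops residual INFO lines routed straight to it.
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
appender.console.filter.threshold.type = ThresholdFilter
appender.console.filter.threshold.level = error
```

If the installed version supports it, launching with `spark-sql -S` (silent mode, inherited from the Hive CLI) is another option worth trying for any remaining banner output.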
1 vote
0 answers
29 views

PySpark: writing a dataframe to an Oracle database table using JDBC

I am new to PySpark and have a few questions about writing a dataframe to an Oracle database table using JDBC. As part of the requirement I need to read the data from an Oracle table and perform ...
Siva • 11
0 votes
0 answers
30 views

Outer join multiple tables with sort merge join in PySpark without intermediate resorting [closed]

I want to outer join multiple large dataframes in a memory constrained environment and store the result on an s3 bucket. My plan was to use sort merge join. I nicely bucketed the dataframes based on ...
user824276
0 votes
0 answers
36 views

Spark with availableNow trigger doesn't archive sources

I use Spark to read JSON files that appear in a folder every day under a yyyy/mm/dd path pattern and convert them into Iceberg format. Both the JSON and Iceberg folders are in an S3 bucket, on different paths. ...
Alex • 1,018
0 votes
0 answers
42 views

Why does ydata-profiling not detect missing values in PySpark DataFrame when using None?

I'm using ydata-profiling to generate profiling reports from a large PySpark DataFrame without converting it to Pandas (to avoid memory issues on large datasets). Some columns contain the string "...
hexxetexxeh
0 votes
0 answers
12 views

How to fix a time parser error when using EMR with a PySpark script as a step

I am getting this error when running an EMR notebook step with some dates passed in: An error occurred: An error occurred while calling o236.showString. : org.apache.spark.SparkException: Job aborted due ...
gcj • 298
0 votes
0 answers
50 views

Databricks: JVM Heap Leak When Iterating Over Large Number of Tables Using DESCRIBE DETAIL

Problem: I'm trying to generate a consolidated metadata table for all tables within a Databricks database (I do not have admin privileges). The process works fine for the first few thousand tables, ...
UnnamedChunk
0 votes
0 answers
58 views

Generating parquet file with bloom filter

I am looking to partition raw Parquet data in AWS S3 by YYYYMMDD and to enable bloom filters on a high-cardinality column (let's say QID). import boto3 from pyspark.sql import SparkSession from pyspark....
JollyRoger
1 vote
1 answer
97 views

How to Exclude Rows Based on a Dynamic Condition in a PySpark Window Function?

I am working with PySpark and need to create a window function that calculates the median of the previous 5 values in a column. However, I want to exclude rows where a specific column feature is True. ...
user29963762
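The logic this question describes — a trailing median over the previous 5 values, skipping rows where a flag is set — can be prototyped in plain Python before translating it to a PySpark window (the translation typically needs a wider lookback frame plus a condition inside it, since excluded rows still occupy frame slots; the data below is made up):

```python
from statistics import median

rows = [
    {"value": 10.0, "feature": False},
    {"value": 20.0, "feature": True},   # excluded from later medians
    {"value": 30.0, "feature": False},
    {"value": 40.0, "feature": False},
    {"value": 50.0, "feature": False},
]

out = []
for i, row in enumerate(rows):
    # previous rows only, newest first, keep non-flagged, cap at 5
    prev = [r["value"] for r in reversed(rows[:i]) if not r["feature"]][:5]
    out.append(median(prev) if prev else None)

print(out)  # [None, 10.0, 10.0, 20.0, 30.0]
```

Note the current row never contributes, and the flagged row 20.0 is invisible to every later median.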
1 vote
0 answers
54 views

Spark groups broadcast hash join in a single task

We have a job in Spark (Databricks) that joins ~60 tables. The job starts by joining the main table with some other tables using SortMergeJoin. This is working fine. The last step of the ...
kostas pats
0 votes
1 answer
67 views

PySpark: find columns with mismatched data

I have a Pyspark dataframe df1. It has columns like col1, col2, col3, col4, col5. +----+----+----+----+----+ |col1|col2|col3|col4|col5| +----+----+----+----+----+ |A |A |X |Y |Y | |B |C |...
Suraj Pandey
1 vote
1 answer
54 views

Will repartition() always shuffle, even before an action is triggered?

I read that repartition() will be lazily evaluated as it is a transformation, and transformations are only triggered on actions. However, I imagine that all the data must be loaded by Spark first ...
detcle • 79
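The laziness the question asks about can be modelled outside Spark with plain-Python generators (a loose analogy, not Spark itself): building the plan does no work, and only the terminal "action" pulls data through — at which point the load and the repartition-like step both run.

```python
log = []

def source():
    # the "read": nothing here runs until the pipeline is consumed
    for x in [3, 1, 2]:
        log.append(f"read {x}")
        yield x

# Build the "transformation": like repartition(), this is only a plan.
pipeline = (x * 10 for x in source())
assert log == []          # nothing has been read yet

result = list(pipeline)   # the "action" triggers the whole pipeline
print(result)             # [30, 10, 20]
print(log)                # ['read 3', 'read 1', 'read 2']
```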
1 vote
1 answer
68 views

How to efficiently join two directories that are already partitioned

Suppose I have two different data sets A and B, and they are both already partitioned by joinKey and laid out in the filesystem like A/joinKey/&lt;files&gt; and B/joinKey/&lt;files&gt; in the ...
detcle • 79
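A plain-Python sketch of why co-partitioned layouts allow a cheap join (the buckets below are made-up stand-ins for the per-joinKey directories): when both sides are already grouped by the key, each bucket pair joins independently and no row ever crosses buckets — which is exactly the shuffle a bucketed sort-merge join avoids.

```python
# Hypothetical pre-partitioned datasets: key -> rows, mirroring a
# directory-per-joinKey layout on the filesystem.
a_buckets = {"k1": [("k1", "a1"), ("k1", "a2")], "k2": [("k2", "a3")]}
b_buckets = {"k1": [("k1", "b1")], "k3": [("k3", "b2")]}

# Inner join bucket-by-bucket: only keys present on both sides matter,
# and each bucket can be processed in isolation (even in parallel).
joined = []
for key in sorted(a_buckets.keys() & b_buckets.keys()):
    for _, av in a_buckets[key]:
        for _, bv in b_buckets[key]:
            joined.append((key, av, bv))

print(joined)  # [('k1', 'a1', 'b1'), ('k1', 'a2', 'b1')]
```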
1 vote
1 answer
60 views

Unexpected PySpark filter behaviour

I want to filter out the rows where CID (string type) is '-' and trait_diff is null. The code I have provided filters out the rows where both of them are null even if I do not put isNull for ...
S. Nasir
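The surprise described here usually traces back to SQL's three-valued logic: in Spark, a comparison like `CID != '-'` evaluates to NULL (not True) when CID is null, and filter keeps only rows where the predicate is strictly True. A tiny plain-Python model of that rule (NULL represented as None; the rows are made up):

```python
def neq(a, b):
    # SQL three-valued logic: any comparison involving NULL yields NULL
    if a is None or b is None:
        return None
    return a != b

rows = [{"CID": "-"}, {"CID": "x"}, {"CID": None}]

# filter() keeps a row only when the predicate is strictly True, so
# the NULL row is dropped even without an explicit isNull check.
kept = [r for r in rows if neq(r["CID"], "-") is True]
print(kept)  # [{'CID': 'x'}]
```

This is why a null-safe predicate (an explicit isNull branch, or Spark's null-safe equality operator) changes the result.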
