26,919 questions
0
votes
0
answers
43
views
Can't SELECT anything in an AWS Glue Data Catalog view due to invalid view text: <REDACTED VIEW TEXT>
I created a Glue view through a Glue job like this:
CREATE OR REPLACE PROTECTED MULTI DIALECT VIEW risk_models_output.vw_behavior_special_limit_score
SECURITY DEFINER AS
[query ...
0
votes
0
answers
41
views
Spark job fails with UnsafeExternalSorter OOM when using groupBy + collect_list + sort – how to optimize?
How to replace groupBy + collect_list + array_sort with a more memory-efficient approach in Spark SQL?
I have a Spark (Java) batch job that processes large telecom event data.
The job is failing with `...
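A minimal PySpark sketch of one common workaround, assuming hypothetical names (events, user_id, event_ts, payload): let the shuffle machinery do the ordering, which can spill to disk, instead of building one huge array per group and sorting it in memory.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.table("events")  # hypothetical source table

ordered = (events
    .repartition("user_id")                       # co-locate each group in one partition
    .sortWithinPartitions("user_id", "event_ts")  # spillable shuffle-time sort
    .groupBy("user_id")
    .agg(F.collect_list("payload").alias("payload_seq")))

Note that collect_list preserving this order relies on each group arriving sorted in a single partition, which Spark does not contractually guarantee, so validate on a sample before relying on it.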
0
votes
1
answer
60
views
Why are 2 tables bucketed by col1 and joined on (col1, col2) shuffled?
// Enable all bucketing optimizations
spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false")
spark.conf.set("spark.sql.sources.bucketing.enabled&...
1
vote
0
answers
46
views
How to optimize a special array_intersect in Hive SQL executed by the Spark engine?
buckets is a column of type array<string>. The logic is similar to array_intersect, except only the prefix of each string in buckets (before the first -) is compared. How can I optimize the ...
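One hedged way to express "intersect on the prefix before the first -" in Spark SQL (table and the second column name are hypothetical; requires Spark 2.4+ higher-order functions): normalize both arrays to their prefixes first, so a plain array_intersect applies.

result = spark.sql("""
  SELECT array_intersect(
           transform(buckets,       x -> split(x, '-')[0]),
           transform(other_buckets, x -> split(x, '-')[0])
         ) AS common_prefixes
  FROM my_table
""")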
Best practices
0
votes
5
replies
103
views
Pushing down filters in RDBMS with Java Spark
I have been working as a Data Engineer and ran into this issue.
I came across a use case where I have a view (let's call it inputView) which is created by reading data from some source.
Now somewhere ...
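A hedged PySpark sketch of getting the filter to run inside the RDBMS (the question is Java Spark; the URL, table, and predicate here are hypothetical). Either embed the predicate in the query Spark sends over JDBC, or filter the DataFrame before any action and let JDBC predicate pushdown handle simple comparisons.

pushed = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("query", "SELECT * FROM events WHERE event_date >= '2024-01-01'")
    .option("user", "app").option("password", "secret")
    .load())

# or: plain table read plus a DataFrame filter that Spark pushes down
lazy = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("dbtable", "events")
    .option("user", "app").option("password", "secret")
    .load()
    .filter("event_date >= '2024-01-01'"))
lazy.explain()  # PushedFilters in the scan node confirms the pushdown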
Advice
0
votes
6
replies
163
views
PySpark SQL: How to do a GROUP BY with a specific WHERE condition
So I am doing some SQL aggregation transformations of a dataset, and there is a certain condition that I would like to apply, but I am not sure how.
Here is a basic code block:
le_test = spark.sql("""...
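A minimal sketch of conditional aggregation, with hypothetical table and column names: instead of a WHERE clause that would drop rows from every aggregate, filter inside the individual aggregates (Spark 3.0+ also accepts the SQL FILTER clause).

le_test = spark.sql("""
  SELECT customer_id,
         SUM(amount)                                    AS total_amount,
         SUM(CASE WHEN status = 'OPEN' THEN amount END) AS open_amount,
         COUNT(*) FILTER (WHERE status = 'OPEN')        AS open_rows
  FROM transactions
  GROUP BY customer_id
""")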
0
votes
0
answers
91
views
How to Check if a Query Touches Data Files or just Uses Manifests and Metadata in Iceberg
I created a table as follows:
CREATE TABLE IF NOT EXISTS raw_data.civ (
date timestamp,
marketplace_id int,
... some more columns
)
USING ICEBERG
PARTITIONED BY (
marketplace_id,
...
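A hedged way to check this for the table in the excerpt (everything beyond the table name is an assumption): EXPLAIN the query and inspect the scan node, and compare against Iceberg's metadata tables, which are answered from manifests and snapshots without touching data files.

spark.sql("EXPLAIN SELECT count(*) FROM raw_data.civ WHERE marketplace_id = 1").show(truncate=False)

# Iceberg metadata tables read only metadata, never data files:
spark.sql("SELECT snapshot_id, summary FROM raw_data.civ.snapshots").show(truncate=False)
spark.sql("SELECT file_path, record_count FROM raw_data.civ.files LIMIT 10").show(truncate=False)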
3
votes
1
answer
155
views
How to collect multiple metrics with observe in PySpark without triggering multiple actions
I have a PySpark job that reads data from table a, performs some transformations and filters, and then writes the result to table b.
Here’s a simplified version of the code:
import pyspark.sql....
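A minimal PySpark sketch of the pattern, with hypothetical metric and table names: the Observation helper captures all metrics as a side effect of the single write action, so no extra count() or collect() is triggered.

from pyspark.sql import SparkSession, Observation, functions as F

spark = SparkSession.builder.getOrCreate()
obs = Observation("job_metrics")

df = (spark.table("a")
      .filter("amount > 0")
      .observe(obs,
               F.count(F.lit(1)).alias("rows_written"),
               F.sum("amount").alias("total_amount")))

df.write.mode("append").saveAsTable("b")  # the one and only action
print(obs.get)  # dict of all observed metrics in one pass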
0
votes
0
answers
131
views
Unexpected Write Behavior when using MERGE INTO/INSERT INTO Iceberg Spark Queries
I am observing different write behaviors when executing queries on an EMR Notebook (correct behavior) vs. when using spark-submit to submit a Spark application to an EMR cluster (incorrect behavior).
When I ...
0
votes
0
answers
48
views
Spark: VSAM file read issue with special characters
We have a scenario where we read a VSAM file directly, along with a copybook to understand the column lengths; we were using the Cobrix library as part of the Spark read.
However, we can see the same is not properly ...
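A hedged sketch of the kind of Cobrix read being described (paths and option values are assumptions; the exact option set depends on the record format). Garbled special characters usually trace back to the EBCDIC code page chosen for decoding.

df = (spark.read.format("cobol")               # Cobrix data source
      .option("copybook", "s3://bucket/schemas/record.cpy")
      .option("encoding", "ebcdic")
      .option("ebcdic_code_page", "cp037")     # assumption: match the source system's code page
      .load("s3://bucket/data/vsam_extract"))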
0
votes
0
answers
50
views
How to link Spark event log stages to PySpark code or query?
I'm analyzing Spark event logs and have already retrieved the SparkListenerStageSubmitted and SparkListenerTaskEnd events to collect metrics such as spill, skew ratio, memory, and CPU usage.
However, ...
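A hedged sketch of one way to make that link explicit from the PySpark side (group ids and descriptions hypothetical): tag each logical step before its action, and the tag travels into the job and stage properties of the event log, so SparkListenerStageSubmitted events can be mapped back to the code.

sc = spark.sparkContext

sc.setJobGroup("step_1_aggregate", "groupBy customer_id and count")
spark.table("events").groupBy("customer_id").count().write.mode("overwrite").saveAsTable("agg")

sc.setJobGroup("step_2_export", "export aggregate to parquet")
spark.table("agg").write.mode("overwrite").parquet("s3://bucket/out/agg")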
0
votes
0
answers
64
views
Scala Spark: Why does a DataFrame.transform that calls another transform hang?
I have a Scala (v2.12.15) Spark (v3.5.1) job that works correctly and looks something like this:
import org.apache.spark.sql.DataFrame
...
val myDataFrame = myReadDataFunction(...)
....
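For reference, a minimal PySpark illustration of the pattern in the title (function names hypothetical): transform only applies a function to the DataFrame lazily, so a transform that itself calls transform should compose without blocking until an action runs.

from pyspark.sql import functions as F

def with_flag(df):
    return df.withColumn("flag", F.lit(1))

def pipeline(df):
    return df.transform(with_flag).withColumn("doubled", F.col("id") * 2)

out = spark.range(10).transform(pipeline)  # lazy; nothing executes here
out.show()                                 # the action that triggers work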
0
votes
0
answers
60
views
Spatial join without Apache Sedona
Currently I'm working with a specific version of Apache Spark (3.1.1) that I cannot upgrade. Because of that I can't use Apache Sedona, and version 1.3.1 is too slow. My problem is the following code that ...
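A hedged sketch of a common Sedona-free approach on plain Spark (column names, grid size, and the final predicate are assumptions to adapt): bin both sides into coarse grid cells and equi-join on the cell id, so only nearby pairs reach the exact geometric test instead of a full cross join.

from pyspark.sql import functions as F

cell = 0.01  # grid size in degrees; tune to the search radius

def with_cell(df):
    return (df.withColumn("cx", F.floor(F.col("lon") / cell))
              .withColumn("cy", F.floor(F.col("lat") / cell)))

left = with_cell(spark.table("points_a"))
right = with_cell(spark.table("points_b"))

candidates = left.join(right, ["cx", "cy"])  # shuffle-friendly equi-join
# apply the exact distance/containment predicate on candidates; for radius
# queries you would also join against the 8 neighbouring cells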
1
vote
4
answers
116
views
How to pass an array of structs as a parameter to a UDF in Spark 4
Does anybody know what I am doing wrong? The following is a reduced code snippet that works in spark-3.x but doesn't work in spark-4.x.
In my use case I need to pass a complex data structure to a UDF (let's say ...
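A minimal PySpark sketch of passing an array of structs into a UDF (the question itself is about the Scala/Java API in Spark 4; field names here are hypothetical). In Python the struct elements arrive inside the UDF as a list of Row objects.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.getOrCreate()

elem = StructType([StructField("name", StringType()),
                   StructField("score", IntegerType())])

@F.udf(returnType=StringType())
def best(items):
    # items is a list of Row objects
    return max(items, key=lambda r: r["score"])["name"] if items else None

df = spark.createDataFrame(
    [([("a", 1), ("b", 5)],)],
    StructType([StructField("items", ArrayType(elem))]))
df.select(best("items").alias("top")).show()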
0
votes
1
answer
166
views
Problem reading the _last_checkpoint file from the _delta_log directory of a delta lake table on s3
I am trying to read the _delta_log folder of a delta lake table via spark to export some custom metrics. I have configured how to get some metrics from history and description but I have problem ...
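A hedged sketch of reading the pointer file directly (bucket and path hypothetical): _last_checkpoint is a single-line JSON document, so it can be loaded without the Delta reader, and the checkpoint it points at is plain Parquet.

last_cp = spark.read.json("s3://bucket/table/_delta_log/_last_checkpoint")
last_cp.show(truncate=False)  # fields such as version and size

# the checkpoint file itself is Parquet and can be inspected the same way:
cp = spark.read.parquet("s3://bucket/table/_delta_log/00000000000000000010.checkpoint.parquet")
cp.printSchema()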