0 votes
0 answers
43 views

I created a Glue view through a Glue job like this: CREATE OR REPLACE PROTECTED MULTI DIALECT VIEW risk_models_output.vw_behavior_special_limit_score SECURITY DEFINER AS [query ...
Paloma Raissa
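
A minimal sketch of issuing that DDL from a Glue job's Spark session; the SELECT body, column names, and source table below are placeholders, since the original query is truncated above.

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    spark.sql("""
        CREATE OR REPLACE PROTECTED MULTI DIALECT VIEW risk_models_output.vw_behavior_special_limit_score
        SECURITY DEFINER
        AS SELECT customer_id, behavior_score        -- hypothetical columns
           FROM risk_models_output.behavior_scores   -- hypothetical source table
    """)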
0 votes
0 answers
41 views

How to replace groupBy + collect_list + array_sort with a more memory-efficient approach in Spark SQL? I have a Spark (Java) batch job that processes large telecom event data. The job is failing with `...
Thịnh Nguyễn
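
For reference, a sketch of the pattern the question describes (table and column names are assumptions): every subscriber's events are collected into one in-memory array and then sorted, which is what tends to exhaust executor memory.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.table("telecom_events")  # hypothetical source table

    per_subscriber = (
        events.groupBy("subscriber_id")
              .agg(F.array_sort(F.collect_list(F.struct("event_ts", "event_type")))
                    .alias("ordered_events"))
    )

A common alternative is to repartition by the key and use sortWithinPartitions (or a window function) so ordering happens without materialising each whole group as an array.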
0 votes
1 answer
60 views

// Enable all bucketing optimizations spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false") spark.conf.set("spark.sql.sources.bucketing.enabled"...
user2417458
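
A hedged sketch combining the configuration from the snippet above with a bucketed write; the table, bucket count, and key column are assumptions, and saveAsTable against a Hive-compatible catalog is needed for the bucketing metadata to be recorded.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # settings from the snippet above
    spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false")
    spark.conf.set("spark.sql.sources.bucketing.enabled", "true")

    (spark.table("orders")                    # hypothetical source table
          .write
          .bucketBy(64, "customer_id")        # hypothetical bucket count and key
          .sortBy("customer_id")
          .mode("overwrite")
          .saveAsTable("orders_bucketed"))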
1 vote
0 answers
46 views

buckets is a column of type array<string>. The logic is similar to array_intersect, except only the prefix of each string in buckets (before the first -) is compared. How can I optimize the ...
Dong Ye · 11
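
One possible approach, under the assumption that two array<string> columns are being compared: strip everything after the first '-' with transform, then apply array_intersect to the resulting prefixes.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(["a-1", "b-2"], ["b-9", "c-3"])],
        "buckets array<string>, other_buckets array<string>",  # second column is an assumption
    )

    prefixes = lambda arr: F.transform(arr, lambda s: F.split(s, "-").getItem(0))
    result = df.withColumn(
        "common_prefixes",
        F.array_intersect(prefixes("buckets"), prefixes("other_buckets")),
    )
    result.show(truncate=False)  # common_prefixes -> ["b"]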
Best practices
0 votes
5 replies
103 views

I have been working as a Data Engineer and ran into this issue. I came across a use case where I have a view (let's name it inputView) that is created by reading data from some source. Now somewhere ...
Parth Sarthi Roy
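
A minimal sketch of the setup described, with the source path invented since the question only says "some source": a DataFrame is read, registered as a temporary view named inputView, and later SQL references it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.read.parquet("s3://bucket/input/").createOrReplaceTempView("inputView")  # hypothetical source

    downstream = spark.sql("SELECT * FROM inputView")  # later stages reuse the view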
Advice
0 votes
6 replies
163 views

So I am doing some SQL aggregation transformations on a dataset, and there is a certain condition that I would like to apply, but I'm not sure how. Here is a basic code block: le_test = spark.sql("""...
BeaverFever
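
One common way to express a condition inside an aggregation is a CASE WHEN wrapped in the aggregate function; a hedged sketch in the same spark.sql style, with the table and columns invented:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    le_test = spark.sql("""
        SELECT customer_id,
               SUM(CASE WHEN status = 'ACTIVE' THEN amount ELSE 0 END) AS active_amount,
               COUNT(*) AS row_count
        FROM   transactions          -- hypothetical table
        GROUP BY customer_id
    """)
    le_test.show()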
0 votes
0 answers
91 views

I created a table as follows: CREATE TABLE IF NOT EXISTS raw_data.civ ( date timestamp, marketplace_id int, ... some more columns ) USING ICEBERG PARTITIONED BY ( marketplace_id, ...
shiva · 2,781
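
The truncated DDL above, reconstructed as a hedged sketch issued through spark.sql; the remaining columns and the second partition expression are assumptions, and an Iceberg-enabled catalog is assumed to be configured.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.sql("""
        CREATE TABLE IF NOT EXISTS raw_data.civ (
            date           timestamp,
            marketplace_id int,
            views          bigint      -- placeholder for the remaining columns
        )
        USING ICEBERG
        PARTITIONED BY (marketplace_id, days(date))  -- days(date) is an assumption
    """)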
3 votes
1 answer
155 views

I have a PySpark job that reads data from table a, performs some transformations and filters, and then writes the result to table b. Here’s a simplified version of the code: import pyspark.sql....
עומר אמזלג
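
A minimal sketch of the job shape the question describes (read table a, transform and filter, write to table b); the filter and the added column are placeholders.

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    result = (
        spark.table("a")
             .filter(F.col("event_date") >= "2024-01-01")       # hypothetical filter
             .withColumn("amount_usd", F.col("amount") * 1.1)   # hypothetical transformation
    )
    result.write.mode("overwrite").saveAsTable("b")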
0 votes
0 answers
131 views

I am observing different write behaviors when executing queries on an EMR Notebook (correct behavior) vs when using spark-submit to submit a Spark application to an EMR cluster (incorrect behavior). When I ...
shiva · 2,781
0 votes
0 answers
48 views

We have a scenario where we read a VSAM file directly along with a copybook to determine the column lengths; we were using the Cobrix library as part of the Spark read. However, we could see the same is not properly ...
Rocky1989 · 409
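
A hedged sketch of reading a mainframe file plus copybook with the Cobrix (spark-cobol) data source mentioned in the question; paths and the package version are examples, and additional options are usually needed depending on the record format.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # assumes the za.co.absa.cobrix spark-cobol package is available (example version)
             .config("spark.jars.packages", "za.co.absa.cobrix:spark-cobol_2.12:2.6.9")
             .getOrCreate())

    df = (spark.read.format("cobol")
               .option("copybook", "s3://bucket/copybooks/layout.cpy")   # hypothetical copybook path
               .load("s3://bucket/mainframe/vsam_extract"))              # hypothetical data path
    df.printSchema()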
0 votes
0 answers
50 views

I'm analyzing Spark event logs and have already retrieved the SparkListenerStageSubmitted and SparkListenerTaskEnd events to collect metrics such as spill, skew ratio, memory, and CPU usage. However, ...
Carol C
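
For context, a sketch of pulling those listener events out of an event log read as JSON lines; the log path is a placeholder and the field names are the ones used by Spark's event log format.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.read.json("s3://bucket/spark-events/application_1234_0001")  # hypothetical path

    stage_submitted = events.filter(F.col("Event") == "SparkListenerStageSubmitted")
    task_end        = events.filter(F.col("Event") == "SparkListenerTaskEnd")

    task_end.select(
        F.col("Stage ID"),
        F.col("Task Metrics").getField("Memory Bytes Spilled").alias("memory_spilled"),
    ).show()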
0 votes
0 answers
64 views

I have a Scala (v. 2.12.15) / Spark (v. 3.5.1) job that works correctly and looks something like this: import org.apache.spark.sql.DataFrame ... val myDataFrame = myReadDataFunction(...) ....
jd_sa · 1
0 votes
0 answers
60 views

Currently I'm working with a specific version of Apache Spark (3.1.1) that I cannot upgrade. Because of that, I can't use Apache Sedona, and version 1.3.1 is too slow. My problem is the following code, which ...
matdlara · 149
1 vote
4 answers
116 views

Does anybody know what I am doing wrong? The following is a reduced code snippet that works in Spark 3.x but doesn't work in Spark 4.x. In my use case I need to pass a complex data structure to a UDF (let's say ...
Jiri Humpolicek
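
Since the snippet is truncated, here is an illustrative PySpark sketch of the general pattern of passing a struct ("complex") column into a UDF; it does not reproduce the Spark 3.x vs 4.x difference the question is about.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()

    @F.udf(returnType=StringType())
    def describe(point):
        # a struct column arrives in a Python UDF as a Row object
        return f"x={point.x}, y={point.y}"

    df = (spark.createDataFrame([(1, 2)], ["x", "y"])
               .withColumn("point", F.struct("x", "y")))
    df.select(describe("point").alias("description")).show()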
0 votes
1 answer
166 views

I am trying to read the _delta_log folder of a Delta Lake table via Spark to export some custom metrics. I have figured out how to get some metrics from the history and description, but I have a problem ...
Melika Ghiasi
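
A sketch of reading the transaction log directly as JSON, which is one way to get at details that history/description do not expose; the table path is a placeholder.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # each _delta_log/*.json file is a set of JSON-line actions (commitInfo, add, remove, ...)
    log = spark.read.json("s3://bucket/tables/my_table/_delta_log/*.json")  # hypothetical path

    commits = (log.filter(F.col("commitInfo").isNotNull())
                  .select("commitInfo.timestamp", "commitInfo.operation"))
    commits.show(truncate=False)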
