
All Questions

1 vote · 0 answers · 33 views

Why does a subquery without matching column names still work in Spark SQL?

I have the following two datasets in Spark SQL. The person view:
person = spark.createDataFrame([
    (0, "Bill Chambers", 0, [100]),
    (1, "Matei Zaharia", 1, [500, 250, 100]),
    (2, "...
DumbCoder • 485
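This behaviour is not Spark-specific: an uncorrelated IN subquery compares the *values* produced by the subquery's single output column against the outer expression, so the column names never have to match. A plain-Python analogy (the lists below are made-up stand-ins for the person and graduateProgram data):

```python
# An IN subquery matches on values, not on column names:
# SELECT * FROM person WHERE graduate_program IN (SELECT id FROM graduateProgram)
person_graduate_program = [0, 1, 1]   # values of the outer column
graduate_program_id = [0, 1, 2]       # values produced by the subquery

# The names "graduate_program" and "id" differ; only the values are compared.
matching = [v for v in person_graduate_program if v in graduate_program_id]
print(matching)  # -> [0, 1, 1]
```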
0 votes · 0 answers · 37 views

Disable printing info when running spark-sql

I'm running SQL commands with spark-sql. I have set rootLogger.level = off in the log4j2.properties file, but I'm still getting some info messages: Spark Web UI available at http://computer:4040 Spark ...
IGRACH • 3,643
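For reference, a minimal log4j2.properties sketch that silences Spark's loggers (assuming the file sits on spark-sql's classpath, e.g. in conf/). Note that spark-sql also accepts a silent flag (-S), and some startup lines may be printed directly to stderr rather than through log4j, so they can survive logger configuration:

```properties
rootLogger.level = off
rootLogger.appenderRef.stdout.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Silence the chattier Spark/Hive loggers explicitly as well
logger.spark.name = org.apache.spark
logger.spark.level = off
logger.hive.name = org.apache.hadoop.hive
logger.hive.level = off
```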
1 vote · 0 answers · 30 views

PySpark: writing a dataframe to an Oracle database table using JDBC

I am new to PySpark and have a few questions about writing a dataframe to an Oracle database table using JDBC. As part of the requirement I need to read the data from an Oracle table and perform ...
Siva • 11
0 votes · 0 answers · 30 views

Outer join multiple tables with sort merge join in PySpark without intermediate resorting [closed]

I want to outer join multiple large dataframes in a memory-constrained environment and store the result in an S3 bucket. My plan was to use sort merge join. I nicely bucketed the dataframes based on ...
user824276
0 votes · 0 answers · 36 views

Spark with availableNow trigger doesn't archive sources

I use Spark to read JSON files that appear in a folder every day with the path pattern yyyy/mm/dd and convert them into Iceberg format. Both the JSON and Iceberg folders are in an S3 bucket under different paths. ...
Alex • 1,018
0 votes · 0 answers · 12 views

How to fix a time parser error when running a PySpark script as an EMR step

I am getting this error when running an EMR notebook that passes some dates: An error occurred: An error occurred while calling o236.showString. : org.apache.spark.SparkException: Job aborted due ...
gcj • 298
1 vote · 1 answer · 97 views

How to Exclude Rows Based on a Dynamic Condition in a PySpark Window Function?

I am working with PySpark and need to create a window function that calculates the median of the previous 5 values in a column. However, I want to exclude rows where a specific column, feature, is True. ...
user29963762
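Before wiring this into a PySpark Window, the intended logic can be prototyped in plain Python (a sketch under the assumption that rows arrive in window order and the flag marks rows to skip; all names here are illustrative):

```python
from statistics import median

def median_of_prev5_excluding_flagged(rows):
    """rows: list of (value, flagged) tuples in window order.
    Returns, per row, the median of the previous 5 non-flagged values."""
    results, kept = [], []
    for value, flagged in rows:
        window = kept[-5:]
        results.append(median(window) if window else None)
        if not flagged:
            kept.append(value)  # flagged rows never enter later windows
    return results

print(median_of_prev5_excluding_flagged(
    [(10, False), (20, True), (30, False), (40, False)]
))  # -> [None, 10, 10, 20.0]
```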
1 vote · 0 answers · 54 views

Spark groups broadcast hash joins into a single task

We have a Spark job (Databricks) that joins ~60 tables. The job starts by joining the main table with some other tables using SortMergeJoin. This is working fine. The last step of the ...
kostas pats
0 votes · 1 answer · 67 views

PySpark: find columns with mismatched data

I have a PySpark dataframe df1 with columns col1, col2, col3, col4, col5:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A   |A   |X   |Y   |Y   |
|B   |C   |...
Suraj Pandey
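The excerpt is cut off, so the exact pairing rule is unknown, but the general pattern of flagging per-row mismatches between column pairs can be sketched in plain Python first (the column pairs below are assumptions, not from the question):

```python
def mismatch_flags(row, pairs):
    # For each (left, right) column pair, record whether the values differ
    return {f"{left}<>{right}": row[left] != row[right] for left, right in pairs}

row = {"col1": "B", "col2": "C", "col3": "X", "col4": "Y", "col5": "Y"}
print(mismatch_flags(row, [("col1", "col2"), ("col4", "col5")]))
# -> {'col1<>col2': True, 'col4<>col5': False}
```

In PySpark the same per-pair comparison is typically expressed with F.when(F.col("col1") != F.col("col2"), ...) over each pair of columns.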
1 vote · 1 answer · 54 views

Will repartition() always shuffle, even before an action is triggered?

I read that repartition() will be lazily evaluated since it is a transformation, and transformations are only triggered by actions. However, I imagine that all the data must be loaded by Spark first ...
detcle • 79
1 vote · 1 answer · 68 views

How to efficiently join two directories that are already partitioned

Suppose I have two different data sets A and B, both already partitioned by joinKey and laid out in the filesystem as A/joinKey/&lt;files&gt; and B/joinKey/&lt;files&gt; in the ...
detcle • 79
0 votes · 0 answers · 23 views

Apache Spark SQL Query Returns No Results for a Column in Azure Synapse Notebook

I'm running an Apache Spark SQL query in an Azure Synapse Notebook to retrieve data from an Azure Synapse table.
%%pyspark
df = spark.sql("""
SELECT scheduledend
FROM `...
harinath reddy
0 votes · 1 answer · 79 views

Monotonically increasing id order

The spec of monotonically_increasing_id says: The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. So I assume there is some ordering ...
BelowZero • 1,393
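The documented encoding explains why the IDs increase but are not consecutive: the upper 31 bits hold the partition ID and the lower 33 bits hold the record number within the partition. A plain-Python sketch of that layout:

```python
def monotonically_increasing_id(partition_id, row_in_partition):
    # Upper 31 bits: partition ID; lower 33 bits: record number in partition
    return (partition_id << 33) | row_in_partition

print(monotonically_increasing_id(0, 0))  # -> 0
print(monotonically_increasing_id(0, 2))  # -> 2
print(monotonically_increasing_id(1, 0))  # -> 8589934592 (jumps at a partition boundary)
```

So the ordering follows partition index first, then row order within each partition; it has no relationship to any sort order of the data itself.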
2 votes · 1 answer · 263 views

Does Spark exceptAll() require both dataframes' columns to be in the same order?

I have wasted a considerable amount of time trying to make the PySpark exceptAll() function work, and as far as I understood it was failing (not recognizing rows existing in the target table) due to the fact that both ...
David Sánchez
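exceptAll resolves columns by position, not by name, which a plain-Python multiset difference makes easy to see (the rows below are illustrative):

```python
from collections import Counter

def except_all(rows_a, rows_b):
    # Multiset difference over whole rows, compared positionally
    return list((Counter(rows_a) - Counter(rows_b)).elements())

a = [("x", 1), ("y", 2)]
b_swapped = [(1, "x"), (2, "y")]       # same data, columns in the other order
print(except_all(a, b_swapped))        # -> [('x', 1), ('y', 2)]  (nothing matches)

b_aligned = [(r[1], r[0]) for r in b_swapped]
print(except_all(a, b_aligned))        # -> []
```

In PySpark, assuming both dataframes share the same column names, aligning the order before the call is typically df1.exceptAll(df2.select(df1.columns)).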
0 votes · 0 answers · 64 views

Why are two jobs created for one action in PySpark?

Below is the data used in my CSV file:
empid,empname,empsal,empdept,empblock
1,abc,2000,cse,A
2,def,1000,ece,C
3,ghi,8000,eee,D
4,jkl,4000,ece,B
5,mno,3000,itd,F
6,pqr,6000,mec,C
1) Running the below ...
Cassius Clay
