
All Questions

1 vote · 0 answers · 33 views

Why does a subquery without matching column names still work in Spark SQL?

I have the following two datasets in Spark SQL. The person view:
person = spark.createDataFrame([
    (0, "Bill Chambers", 0, [100]),
    (1, "Matei Zaharia", 1, [500, 250, 100]),
    (2, "...
DumbCoder • 485
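This behaviour is not Spark-specific: an uncorrelated IN subquery compares the *values* produced by the subquery's single output column against the outer expression, so the column names never have to match. A plain-Python analogy (the lists below are made-up stand-ins for the person and graduateProgram data):

```python
# An IN subquery matches on values, not on column names:
# SELECT * FROM person WHERE graduate_program IN (SELECT id FROM graduateProgram)
person_graduate_program = [0, 1, 1]   # values of the outer column
graduate_program_id = [0, 1, 2]       # values produced by the subquery

# The names "graduate_program" and "id" differ; only the values are compared.
matching = [v for v in person_graduate_program if v in graduate_program_id]
print(matching)  # -> [0, 1, 1]
```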
0 votes · 0 answers · 37 views

Disable printing info when running spark-sql

I'm running SQL commands with spark-sql. I have set rootLogger.level = off in the log4j2.properties file, but I'm still getting some info messages: Spark Web UI available at http://computer:4040 Spark ...
IGRACH • 3,643
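For reference, a minimal log4j2.properties sketch that silences Spark's loggers (assuming the file sits on spark-sql's classpath, e.g. in conf/). Note that spark-sql also accepts a silent flag (-S), and some startup lines may be printed directly to stderr rather than through log4j, so they can survive logger configuration:

```properties
rootLogger.level = off
rootLogger.appenderRef.stdout.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Silence the chattier Spark/Hive loggers explicitly as well
logger.spark.name = org.apache.spark
logger.spark.level = off
logger.hive.name = org.apache.hadoop.hive
logger.hive.level = off
```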
1 vote · 0 answers · 30 views

PySpark: writing a dataframe to an Oracle database table using JDBC

I am new to PySpark and have a few questions about writing a dataframe to an Oracle database table using JDBC. As part of the requirement I need to read the data from an Oracle table and perform ...
Siva • 11
0 votes · 0 answers · 30 views

Outer join multiple tables with sort merge join in PySpark without intermediate resorting [closed]

I want to outer join multiple large dataframes in a memory-constrained environment and store the result in an S3 bucket. My plan was to use sort merge join. I nicely bucketed the dataframes based on ...
user824276
0 votes · 0 answers · 36 views

Spark with availableNow trigger doesn't archive sources

I use Spark to read JSON files that appear in a folder every day with the path pattern yyyy/mm/dd and convert them into Iceberg format. Both the JSON and Iceberg folders are in an S3 bucket under different paths. ...
Alex • 1,018
0 votes · 0 answers · 12 views

How to fix a time parser error when running a PySpark script as an EMR step

I am getting this error when running an EMR notebook that passes some dates: An error occurred: An error occurred while calling o236.showString. : org.apache.spark.SparkException: Job aborted due ...
gcj • 298
1 vote · 1 answer · 97 views

How to Exclude Rows Based on a Dynamic Condition in a PySpark Window Function?

I am working with PySpark and need to create a window function that calculates the median of the previous 5 values in a column. However, I want to exclude rows where a specific column, feature, is True. ...
user29963762
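Before wiring this into a PySpark Window, the intended logic can be prototyped in plain Python (a sketch under the assumption that rows arrive in window order and the flag marks rows to skip; all names here are illustrative):

```python
from statistics import median

def median_of_prev5_excluding_flagged(rows):
    """rows: list of (value, flagged) tuples in window order.
    Returns, per row, the median of the previous 5 non-flagged values."""
    results, kept = [], []
    for value, flagged in rows:
        window = kept[-5:]
        results.append(median(window) if window else None)
        if not flagged:
            kept.append(value)  # flagged rows never enter later windows
    return results

print(median_of_prev5_excluding_flagged(
    [(10, False), (20, True), (30, False), (40, False)]
))  # -> [None, 10, 10, 20.0]
```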
1 vote · 0 answers · 54 views

Spark groups broadcast hash joins into a single task

We have a Spark job (Databricks) that joins ~60 tables. The job starts by joining the main table with some other tables using SortMergeJoin. This is working fine. The last step of the ...
kostas pats
0 votes · 1 answer · 67 views

PySpark: find columns with mismatched data

I have a PySpark dataframe df1 with columns col1, col2, col3, col4, col5:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A   |A   |X   |Y   |Y   |
|B   |C   |...
Suraj Pandey
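The excerpt is cut off, so the exact pairing rule is unknown, but the general pattern of flagging per-row mismatches between column pairs can be sketched in plain Python first (the column pairs below are assumptions, not from the question):

```python
def mismatch_flags(row, pairs):
    # For each (left, right) column pair, record whether the values differ
    return {f"{left}<>{right}": row[left] != row[right] for left, right in pairs}

row = {"col1": "B", "col2": "C", "col3": "X", "col4": "Y", "col5": "Y"}
print(mismatch_flags(row, [("col1", "col2"), ("col4", "col5")]))
# -> {'col1<>col2': True, 'col4<>col5': False}
```

In PySpark the same per-pair comparison is typically expressed with F.when(F.col("col1") != F.col("col2"), ...) over each pair of columns.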
1 vote · 1 answer · 54 views

Will repartition() always shuffle, even before an action is triggered?

I read that repartition() will be lazily evaluated since it is a transformation, and transformations are only triggered by actions. However, I imagine that all the data must be loaded by Spark first ...
detcle • 79
1 vote · 1 answer · 68 views

How to efficiently join two directories that are already partitioned

Suppose I have two different data sets A and B, both already partitioned by joinKey and laid out in the filesystem as A/joinKey/&lt;files&gt; and B/joinKey/&lt;files&gt; in the ...
detcle • 79
0 votes · 0 answers · 23 views

Apache Spark SQL Query Returns No Results for a Column in Azure Synapse Notebook

I'm running an Apache Spark SQL query in an Azure Synapse Notebook to retrieve data from an Azure Synapse table.
%%pyspark
df = spark.sql("""
SELECT scheduledend
FROM `...
harinath reddy
0 votes · 1 answer · 79 views

Monotonically increasing id order

The spec of monotonically_increasing_id says: The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. So I assume there is some ordering ...
BelowZero • 1,393
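The documented encoding explains why the IDs increase but are not consecutive: the upper 31 bits hold the partition ID and the lower 33 bits hold the record number within the partition. A plain-Python sketch of that layout:

```python
def monotonically_increasing_id(partition_id, row_in_partition):
    # Upper 31 bits: partition ID; lower 33 bits: record number in partition
    return (partition_id << 33) | row_in_partition

print(monotonically_increasing_id(0, 0))  # -> 0
print(monotonically_increasing_id(0, 2))  # -> 2
print(monotonically_increasing_id(1, 0))  # -> 8589934592 (jumps at a partition boundary)
```

So the ordering follows partition index first, then row order within each partition; it has no relationship to any sort order of the data itself.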
2 votes · 1 answer · 263 views

Does Spark exceptAll() require both dataframes' columns to be in the same order?

I have wasted a considerable amount of time trying to make the PySpark exceptAll() function work, and as far as I understood it was failing (not recognizing rows existing in the target table) due to the fact that both ...
David Sánchez
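exceptAll resolves columns by position, not by name, which a plain-Python multiset difference makes easy to see (the rows below are illustrative):

```python
from collections import Counter

def except_all(rows_a, rows_b):
    # Multiset difference over whole rows, compared positionally
    return list((Counter(rows_a) - Counter(rows_b)).elements())

a = [("x", 1), ("y", 2)]
b_swapped = [(1, "x"), (2, "y")]       # same data, columns in the other order
print(except_all(a, b_swapped))        # -> [('x', 1), ('y', 2)]  (nothing matches)

b_aligned = [(r[1], r[0]) for r in b_swapped]
print(except_all(a, b_aligned))        # -> []
```

In PySpark, assuming both dataframes share the same column names, aligning the order before the call is typically df1.exceptAll(df2.select(df1.columns)).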
0 votes · 0 answers · 64 views

Why are two jobs created for one action in PySpark?

Below is the data used in my CSV file:
empid,empname,empsal,empdept,empblock
1,abc,2000,cse,A
2,def,1000,ece,C
3,ghi,8000,eee,D
4,jkl,4000,ece,B
5,mno,3000,itd,F
6,pqr,6000,mec,C
1) Running the below ...
Cassius Clay
