All Questions
Tagged with pyspark apache-spark-sql
10,924 questions
0 votes · 0 answers · 10 views
Why does a subquery without matching column names still work in Spark SQL?
I have the following two datasets in Spark SQL:
person view:
person = spark.createDataFrame([
(0, "Bill Chambers", 0, [100]),
(1, "Matei Zaharia", 1, [500, 250, 100]),
(2, "...
0 votes · 0 answers · 37 views
Disable printing INFO messages when running spark-sql
I'm running SQL commands with spark-sql. I have put rootLogger.level = off in the log4j2.properties file, but I'm still getting some INFO messages:
Spark Web UI available at http://computer:4040
Spark ...
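The question concerns the spark-sql shell, but in a PySpark session the same suppression can also be done programmatically; a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Raises the Log4j threshold for this SparkContext; banner lines printed
# before the context exists are out of its reach.
spark.sparkContext.setLogLevel("ERROR")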
1 vote · 0 answers · 29 views
PySpark: writing a dataframe to an Oracle database table using JDBC
I am new to PySpark and need a few clarifications on writing a dataframe to an Oracle database table using JDBC.
As part of the requirement I need to read the data from an Oracle table and perform ...
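A minimal sketch of the usual JDBC writer setup, with hypothetical connection details (URL, table, credentials) and assuming the Oracle ojdbc jar is on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "val"])  # placeholder data

(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")  # hypothetical
   .option("dbtable", "SCHEMA.TARGET_TABLE")                   # hypothetical
   .option("user", "scott")
   .option("password", "tiger")
   .option("driver", "oracle.jdbc.OracleDriver")
   .mode("append")
   .save())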
0 votes · 0 answers · 30 views
Outer join multiple tables with sort merge join in PySpark without intermediate resorting [closed]
I want to outer join multiple large dataframes in a memory-constrained environment and store the result in an S3 bucket.
My plan was to use a sort merge join. I bucketed the dataframes based on ...
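A sketch of the bucketing approach under discussion, with hypothetical table names, paths, and bucket count: persisting each input bucketed and sorted on the join key lets a later sort-merge join reuse that layout instead of reshuffling and resorting.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Persist each input bucketed and sorted on the join key.
for name in ("a", "b", "c"):  # hypothetical table names
    df = spark.read.parquet(f"s3://bucket/{name}")  # hypothetical paths
    (df.write
       .bucketBy(64, "key")
       .sortBy("key")
       .mode("overwrite")
       .saveAsTable(f"{name}_bucketed"))

result = (spark.table("a_bucketed")
          .join(spark.table("b_bucketed"), "key", "outer")
          .join(spark.table("c_bucketed"), "key", "outer"))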
0 votes · 0 answers · 36 views
Spark with availableNow trigger doesn't archive sources
I use Spark to read JSON files that appear in a folder every day with the path pattern yyyy/mm/dd and convert them into Iceberg format. Both the JSON and Iceberg folders are in an S3 bucket under different paths.
...
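For reference, a sketch of how source archiving is usually wired up (paths, schema, and table name are placeholders); note the Spark docs describe cleanSource as best-effort, so archiving can lag behind processing:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("id", StringType())])  # placeholder schema

stream = (spark.readStream
          .format("json")
          .schema(schema)
          .option("cleanSource", "archive")
          .option("sourceArchiveDir", "s3://bucket/archive/")  # hypothetical
          .load("s3://bucket/json/"))                          # hypothetical

query = (stream.writeStream
         .format("iceberg")
         .outputMode("append")
         .option("checkpointLocation", "s3://bucket/checkpoints/")
         .trigger(availableNow=True)
         .toTable("db.events"))  # hypothetical table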
0 votes · 0 answers · 42 views
Why does ydata-profiling not detect missing values in PySpark DataFrame when using None?
I'm using ydata-profiling to generate profiling reports from a large PySpark DataFrame without converting it to Pandas (to avoid memory issues on large datasets).
Some columns contain the string "...
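A plausible explanation the question hints at: the literal string "None" is not a NULL, and profilers count only true NULLs. A sketch that normalizes such strings before profiling (the column handling is an assumption):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("None",), (None,)], ["col1"])

# Replace the literal string "None" with a real NULL in every column so
# missing-value detection can see it.
cleaned = df.select(
    *[F.when(F.col(c) == "None", None).otherwise(F.col(c)).alias(c)
      for c in df.columns])
cleaned.show()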
0 votes · 0 answers · 12 views
How to resolve a time parser error when using EMR with a PySpark script as a step
I am getting this error when running an EMR notebook and passing some dates:
An error occurred: An error occurred while calling o236.showString.
: org.apache.spark.SparkException: Job aborted due ...
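Without the full stack trace this is a guess, but a frequent culprit for date-parsing failures on Spark 3 (which recent EMR releases run) is the stricter datetime parser; the legacy behaviour can be restored per session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Restores Spark 2.x datetime parsing for this session only; an assumption
# worth testing against the actual stack trace.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")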
0 votes · 0 answers · 50 views
Databricks: JVM Heap Leak When Iterating Over a Large Number of Tables Using DESCRIBE DETAIL
Problem:
I'm trying to generate a consolidated metadata table for all tables within a Databricks database (I do not have admin privileges). The process works fine for the first few thousand tables, ...
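A sketch of one mitigation to try (the database name is hypothetical): keep only plain Python rows on the driver and periodically clear cached relations, rather than accumulating DataFrame references across thousands of tables.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = []
for i, t in enumerate(spark.sql("SHOW TABLES IN my_db").collect()):
    # collect() returns plain Row objects; no DataFrame reference is retained.
    rows.extend(spark.sql(f"DESCRIBE DETAIL my_db.{t.tableName}").collect())
    if i % 1000 == 999:
        # Assumption to verify: dropping cached tables between batches may
        # release some driver-side state.
        spark.catalog.clearCache()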
0 votes · 0 answers · 58 views
Generating a Parquet file with a bloom filter
I am looking to partition raw Parquet data in AWS S3 by YYYYMMDD and to enable bloom filters on a high-cardinality column (say, QID).
import boto3
from pyspark.sql import SparkSession
from pyspark....
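A sketch of the writer options involved (data, path, and NDV figure are placeholders); these are parquet-mr properties that the Spark Parquet writer passes through, keyed per column with a # suffix:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(20240101, "q1")], ["dt", "QID"])  # placeholder data

(df.write
   .partitionBy("dt")  # YYYYMMDD-style partition column
   .option("parquet.bloom.filter.enabled#QID", "true")
   .option("parquet.bloom.filter.expected.ndv#QID", "1000000")  # placeholder NDV
   .parquet("s3://bucket/out/"))  # hypothetical path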
1 vote · 1 answer · 97 views
How to Exclude Rows Based on a Dynamic Condition in a PySpark Window Function?
I am working with PySpark and need to create a window function that calculates the median of the previous 5 values in a column. However, I want to exclude rows where a specific column, feature, is True. ...
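A sketch of one approach (column names assumed from the question): null out the excluded rows so the windowed aggregate ignores them. Caveat: the frame still spans the previous five physical rows, not the previous five valid values.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 10.0, False), (2, 20.0, True), (3, 30.0, False),
     (4, 40.0, False), (5, 50.0, False), (6, 60.0, False)],
    ["id", "value", "feature"])

w = Window.orderBy("id").rowsBetween(-5, -1)
# Rows with feature == True become NULL and are ignored by the aggregate.
masked = F.when(~F.col("feature"), F.col("value"))
df.withColumn("median_prev5", F.percentile_approx(masked, 0.5).over(w)).show()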
1 vote · 0 answers · 54 views
Spark groups broadcast hash joins into a single task
We have a Spark job (on Databricks) that joins ~60 tables.
The job starts by joining the main table with some other tables using SortMergeJoin. This works fine.
The last step of the ...
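For context, a sketch of one mitigation to try (sizes and partition count hypothetical): broadcast hash joins pipeline into the probe side's existing partitioning, so re-spreading the probe side before the chain is one lever to pull.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
main = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.range(100).withColumnRenamed("id", "key")

# A chain of broadcast joins runs in the probe side's current partitions;
# if those collapsed to one, every stage after it is a single task.
main = main.repartition(200, "key")
result = main.join(F.broadcast(small), "key", "left")
result.count()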
0 votes · 1 answer · 67 views
PySpark: find columns with mismatched data
I have a Pyspark dataframe df1.
It has columns like col1, col2, col3, col4, col5.
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A |A |X |Y |Y |
|B |C |...
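A sketch of one way to surface the mismatches (the pairing of columns below is an assumption; adjust to the real comparison rules):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(
    [("A", "A", "X", "Y", "Y"), ("B", "C", "X", "X", "X")],
    ["col1", "col2", "col3", "col4", "col5"])

pairs = [("col1", "col2"), ("col4", "col5")]  # hypothetical pairing
df1 = df1.withColumn(
    "mismatched",
    F.filter(
        F.array(*[F.when(F.col(a) != F.col(b), F.lit(f"{a}<>{b}"))
                  for a, b in pairs]),
        lambda x: x.isNotNull()))
df1.show(truncate=False)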
1 vote · 1 answer · 54 views
Will repartition() always shuffle, even before an action is triggered?
I read that repartition() will be lazily evaluated as it is a transformation, and transformations are only triggered on actions.
However, I imagine that all the data must be loaded by Spark first ...
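A quick way to see both halves of the answer: the shuffle is planned immediately but executed only on an action.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

# No job runs here: repartition only adds an Exchange to the plan.
df2 = df.repartition(10)
df2.explain()  # shows the planned Exchange, still without executing
df2.count()    # the action: only now is the data read and shuffled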
1 vote · 1 answer · 68 views
How to efficiently join two directories that are already partitioned
Suppose I have two different data sets A and B, and they are both already partitioned by joinKey and laid out in the filesystem like A/joinKey/<files> and B/joinKey/<files> in the ...
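A sketch assuming a Hive-style key=value layout (A/joinKey=<value>/<files>), where partition discovery exposes joinKey as a column without scanning file contents. Note that the directory layout alone does not let Spark skip the join-time shuffle; pre-bucketed tables would be needed for that.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical paths; joinKey is discovered as a partition column.
a = spark.read.parquet("s3://bucket/A/")
b = spark.read.parquet("s3://bucket/B/")
joined = a.join(b, "joinKey", "inner")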
1 vote · 1 answer · 60 views
Unexpected PySpark filter behaviour
I want to filter out the rows where CID (string type) is '-' and trait_diff is null. The code I have provided filters out the rows where both of them are null even if I do not use isNull for ...
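A sketch of the likely explanation and fix: comparisons against NULL yield NULL rather than False, so a negated filter silently drops NULL rows; coalescing the condition makes the intent explicit.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("-", None), ("c1", None), (None, None), ("c2", "x")],
    ["CID", "trait_diff"])

drop_cond = (F.col("CID") == "-") & F.col("trait_diff").isNull()
# NULL CIDs make drop_cond NULL; coalesce to False so those rows are kept.
df.filter(~F.coalesce(drop_cond, F.lit(False))).show()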