41,169 questions
0 votes · 0 answers · 19 views
Write the partition columns to the S3 files too
I’m writing to a Glue table that has (country and state) as partition columns.
But if I read directly from the S3 bucket (the base of the Athena table), I don’t see these partition columns ( country ...
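A minimal sketch of one common workaround, assuming Parquet output and the column names from the question: partitionBy() encodes the partition columns only in the S3 directory path and drops them from the data files, so duplicating them keeps a copy inside each file. The paths and the duplicated column names are assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path; replace with the real source of the Glue table.
df = spark.read.parquet("s3://example-bucket/input/")

# Duplicate the partition columns so they also survive inside the data files,
# since partitionBy() only writes them into the directory structure.
(df.withColumn("country_value", F.col("country"))
   .withColumn("state_value", F.col("state"))
   .write.mode("overwrite")
   .partitionBy("country", "state")
   .parquet("s3://example-bucket/output/"))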
0 votes · 0 answers · 23 views
How to get the list of all URLs that an AWS Glue job calls while reading a BigQuery table?
I am facing an issue while reading data from a BigQuery table to S3 using an AWS Glue PySpark job. Under normal settings, the job works fine.
But when I attach a VPC connection and use a ...
0 votes · 0 answers · 11 views
java.lang.NoSuchMethodError: org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream in a Databricks notebook
import os
import pyspark
from pyspark.sql import SparkSession
# file directory
DATA_DIR = "dbfs:/FileStore/shared_uploads/[email protected]"
path = os.path.join(DATA_DIR, "...
-3 votes · 0 answers · 25 views
How to read files dynamically from different storage accounts [closed]
I am writing Synapse Spark code to dynamically read files from different storage accounts. I don't want it to be hard-coded, as the Spark code will be attached to a pipeline - see the image below. NB: the ...
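A minimal sketch of the parameterised approach, assuming Parquet files and hypothetical values; in Synapse the storage account and container would typically arrive as notebook parameters from the pipeline rather than as literals.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical values; in practice these come from pipeline/notebook parameters.
storage_account = "mystorageaccount"
container = "raw"
relative_path = "sales/2024/"

# Build the abfss URI from the parameters instead of hard-coding it.
source = f"abfss://{container}@{storage_account}.dfs.core.windows.net/{relative_path}"
df = spark.read.parquet(source)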
-1 votes · 1 answer · 75 views
Slow performance and timeout when writing 15GB of data from ADLS to Azure SQL DB using Databricks
We have a daily ETL process where we write Parquet data (~15GB) stored in Azure Data Lake Storage (ADLS) into a table in Azure SQL Database. The target table is truncated and reloaded each day.
...
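A minimal sketch of the JDBC write tuning usually tried first, assuming the built-in JDBC writer and hypothetical connection details: more partitions give more parallel connections, a larger batch size cuts round trips, and the truncate option clears the table without dropping it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source path and connection string.
df = spark.read.parquet("abfss://data@examplelake.dfs.core.windows.net/daily/")

(df.repartition(16)                       # parallel JDBC connections
   .write.format("jdbc")
   .option("url", "jdbc:sqlserver://example.database.windows.net:1433;database=exampledb")
   .option("dbtable", "dbo.target_table")
   .option("user", "etl_user")
   .option("password", "***")
   .option("batchsize", 10000)            # rows per insert batch
   .option("truncate", "true")            # keep the table, clear the rows
   .mode("overwrite")
   .save())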
0 votes · 0 answers · 40 views
Unable to use pyarrow optimization in AWS Glue
In my AWS Glue job (version 4.0, which supports Spark 3.3), I am trying to optimize by setting:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
but it gives me a warning
/...
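For context, a minimal sketch of where that setting actually takes effect, assuming pyarrow is available in the job's Python environment: Arrow only changes the Spark-to-pandas conversion path, so it is exercised by calls such as toPandas() and pandas UDFs rather than ordinary reads and writes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(1_000_000)
pdf = df.toPandas()   # conversion step that Arrow accelerates when it is installed
print(len(pdf))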
-2 votes · 0 answers · 28 views
Are there any benchmarking tools or reference guides available for evaluating the performance of Spark 3.x.x running on Kubernetes? [closed]
I'm running the Spark Connect framework on Kubernetes using Spark version 3.5.1, and I need to benchmark its performance.
0 votes · 0 answers · 37 views
Disable printing info when running spark-sql
I'm running SQL commands with spark-sql. I have put rootLogger.level = off in the log4j2.properties file, but I'm still getting some info messages:
Spark Web UI available at http://computer:4040
Spark ...
0 votes · 0 answers · 33 views
Error PySparkRuntimeError: [JAVA_GATEWAY_EXITED] in a script to upload to Redshift
I need help with an error I get when running a local PySpark notebook in VS Code with Miniforge. I have installed:
VS Code
Java 8 + Java SDK 11
Downloaded Spark 3.4.4 into c:/spark, and created the folder c:/...
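A minimal sketch of the usual first check for JAVA_GATEWAY_EXITED, assuming a local setup and hypothetical install paths: the error generally means the JVM behind Py4J never started, so pointing the notebook at one compatible JDK and the Spark install before building the session is a common starting point.

import os
from pyspark.sql import SparkSession

# Hypothetical paths; adjust to the actual Java 11 and Spark 3.4.4 locations.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-11"
os.environ["SPARK_HOME"] = r"C:\spark\spark-3.4.4-bin-hadoop3"

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)   # if this prints, the gateway started correctly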
0 votes · 0 answers · 27 views
java.io.EOFException PySpark Py4JJavaError always occurring when using a user-defined function
I'm doing data preprocessing on a CSV file of 1 million rows and hoping to shrink it down to 600,000 rows. However, I always have trouble when applying a function to a column in the ...
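A minimal toy sketch of an isolated UDF test, with hypothetical data: a plain Python UDF runs in a separate worker process, and an EOFException on the JVM side often means that worker died (memory pressure or an environment mismatch are frequent causes), so reproducing with a tiny self-contained UDF helps narrow it down.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])  # toy data

@F.udf(returnType=StringType())
def upper_name(s):
    # Keep the UDF self-contained; closures over large objects are a common culprit.
    return s.upper() if s is not None else None

df.withColumn("name_upper", upper_name("name")).show()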
0 votes · 1 answer · 62 views
Does rdd.getNumPartitions() always return the right partition count before an action?
Spark is lazily evaluated, so how does rdd.getNumPartitions() return the correct partition value BEFORE the action is called?
df1 = read_file('s3file1')
df2 = read_file('file2')
print('df1 ...
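A minimal sketch illustrating the point at issue, assuming a toy DataFrame in place of the S3 reads: getNumPartitions() only inspects partitioning metadata that Spark knows once the lineage is defined, so it does not need an action to return the right value.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(0, 1000)            # stand-in for read_file('s3file1')
print(df1.rdd.getNumPartitions())     # metadata lookup, no action triggered

df2 = df1.repartition(8)
print(df2.rdd.getNumPartitions())     # 8, still before any action runs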
0 votes · 1 answer · 61 views
I checked online and found that Python 3.13 doesn't have "typing" in it, so how do I bypass this to start pyspark?
PS C:\spark-3.4.4-bin-hadoop3\bin> pyspark
Python 3.13.3 (tags/v3.13.3:6280bb5, Apr 8 2025, 14:47:33) [MSC v.1943 64 bit (AMD64)] on win32
Type "help", "copyright", "...
0 votes · 0 answers · 34 views
Unable to launch Apache Spark via the command prompt
I am installing Apache Spark for the first time on my Windows 11 machine and I am getting an error while launching Spark via the command prompt.
I am using:
Java Version 17.0.0.1
Spark version - 3.5.5
Hadoop version - ...
1 vote · 0 answers · 29 views
PySpark: writing a DataFrame to an Oracle database table using JDBC
I am new to PySpark and have a few clarifications on writing a DataFrame to an Oracle database table using JDBC.
As part of the requirement, I need to read the data from an Oracle table and perform ...
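A minimal sketch of the read-transform-write round trip, assuming hypothetical table names and connection details; the Oracle JDBC driver jar (e.g. ojdbc8.jar) has to be on the Spark classpath, for instance via --jars.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical connection details.
jdbc_url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"
props = {"user": "app_user", "password": "***", "driver": "oracle.jdbc.OracleDriver"}

df = spark.read.jdbc(jdbc_url, "SRC_TABLE", properties=props)            # read source table
out = df.filter("STATUS = 'ACTIVE'")                                     # placeholder transformation
out.write.jdbc(jdbc_url, "TGT_TABLE", mode="append", properties=props)   # write to target table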
-1 votes · 1 answer · 37 views
Does the Python activity in ADF support abfss paths?
We are migrating to a new Unity Catalog workspace, and we are trying to run ADF pipelines with scripts in ADLS. Previously, with the old Databricks workspace, we used to call the scripts using a DBFS path, but ...