41,169 questions
0 votes · 0 answers · 19 views
Write the partition columns to the S3 files too
I’m writing to a Glue table that has (country and state) as partition columns.
But if I read directly from the S3 bucket (the base of the Athena table), I don’t see these partition columns ( country ...
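A minimal sketch of one common workaround, assuming Parquet output and the column names from the question: partitionBy() encodes the partition columns only in the S3 directory path and drops them from the data files, so duplicating them keeps a copy inside each file. The paths and the duplicated column names are assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path; replace with the real source of the Glue table.
df = spark.read.parquet("s3://example-bucket/input/")

# Duplicate the partition columns so they also survive inside the data files,
# since partitionBy() only writes them into the directory structure.
(df.withColumn("country_value", F.col("country"))
   .withColumn("state_value", F.col("state"))
   .write.mode("overwrite")
   .partitionBy("country", "state")
   .parquet("s3://example-bucket/output/"))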
0 votes · 0 answers · 23 views
How to get the list of all URLs that an AWS Glue job calls while reading a BigQuery table?
I am facing an issue while reading data from a BigQuery table to S3 using an AWS Glue PySpark job. Under normal settings, the job works fine.
But when I attach a VPC connection and use a ...
0 votes · 0 answers · 11 views
java.lang.NoSuchMethodError: org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream in a Databricks notebook
import os
import pyspark
from pyspark.sql import SparkSession
# file directory
DATA_DIR = "dbfs:/FileStore/shared_uploads/[email protected]"
path = os.path.join(DATA_DIR, "...
-3 votes · 0 answers · 25 views
How to read files dynamically from different storage accounts [closed]
I am writing Synapse Spark code to dynamically read files from different storage accounts. I don't want it to be hard-coded, as the Spark code will be attached to a pipeline - see the image below. NB: the ...
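A minimal sketch of the parameterised approach, assuming Parquet files and hypothetical values; in Synapse the storage account and container would typically arrive as notebook parameters from the pipeline rather than as literals.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical values; in practice these come from pipeline/notebook parameters.
storage_account = "mystorageaccount"
container = "raw"
relative_path = "sales/2024/"

# Build the abfss URI from the parameters instead of hard-coding it.
source = f"abfss://{container}@{storage_account}.dfs.core.windows.net/{relative_path}"
df = spark.read.parquet(source)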
-1 votes · 1 answer · 75 views
Slow performance and timeout when writing 15GB of data from ADLS to Azure SQL DB using Databricks
We have a daily ETL process where we write Parquet data (~15GB) stored in Azure Data Lake Storage (ADLS) into a table in Azure SQL Database. The target table is truncated and reloaded each day.
...
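A minimal sketch of the JDBC write tuning usually tried first, assuming the built-in JDBC writer and hypothetical connection details: more partitions give more parallel connections, a larger batch size cuts round trips, and the truncate option clears the table without dropping it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source path and connection string.
df = spark.read.parquet("abfss://data@examplelake.dfs.core.windows.net/daily/")

(df.repartition(16)                       # parallel JDBC connections
   .write.format("jdbc")
   .option("url", "jdbc:sqlserver://example.database.windows.net:1433;database=exampledb")
   .option("dbtable", "dbo.target_table")
   .option("user", "etl_user")
   .option("password", "***")
   .option("batchsize", 10000)            # rows per insert batch
   .option("truncate", "true")            # keep the table, clear the rows
   .mode("overwrite")
   .save())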
0 votes · 0 answers · 40 views
Unable to use pyarrow optimization in AWS Glue
In my AWS Glue job (version 4.0, which supports Spark 3.3), I am trying to optimize by setting:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
but it gives me a warning
/...
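For context, a minimal sketch of where that setting actually takes effect, assuming pyarrow is available in the job's Python environment: Arrow only changes the Spark-to-pandas conversion path, so it is exercised by calls such as toPandas() and pandas UDFs rather than ordinary reads and writes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(1_000_000)
pdf = df.toPandas()   # conversion step that Arrow accelerates when it is installed
print(len(pdf))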
-2 votes · 0 answers · 28 views
Are there any benchmarking tools or reference guides available for evaluating the performance of Spark 3.x.x running on Kubernetes? [closed]
I'm running the Spark Connect framework on Kubernetes using Spark version 3.5.1, and I need to benchmark its performance.
0 votes · 0 answers · 37 views
Disable printing info when running spark-sql
I'm running SQL commands with spark-sql. I have put rootLogger.level = off in the log4j2.properties file, but I'm still getting some info messages:
Spark Web UI available at http://computer:4040
Spark ...
0 votes · 0 answers · 33 views
Error PySparkRuntimeError: [JAVA_GATEWAY_EXITED] in a script to upload to Redshift
I need help with an error I get when running a local PySpark notebook in VS Code with Miniforge. I have installed:
VS Code
Java 8 + Java SDK 11
Downloaded Spark 3.4.4 into c:/spark, and created the folder c:/...
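A minimal sketch of the usual first check for JAVA_GATEWAY_EXITED, assuming a local setup and hypothetical install paths: the error generally means the JVM behind Py4J never started, so pointing the notebook at one compatible JDK and the Spark install before building the session is a common starting point.

import os
from pyspark.sql import SparkSession

# Hypothetical paths; adjust to the actual Java 11 and Spark 3.4.4 locations.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-11"
os.environ["SPARK_HOME"] = r"C:\spark\spark-3.4.4-bin-hadoop3"

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)   # if this prints, the gateway started correctly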
0 votes · 0 answers · 27 views
java.io.EOFException PySpark Py4JJavaError always occurring when using a user-defined function
I'm doing data preprocessing on a CSV file of 1 million rows and hoping to shrink it down to 600,000 rows. However, I always have trouble when applying a function to a column in the ...
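A minimal toy sketch of an isolated UDF test, with hypothetical data: a plain Python UDF runs in a separate worker process, and an EOFException on the JVM side often means that worker died (memory pressure or an environment mismatch are frequent causes), so reproducing with a tiny self-contained UDF helps narrow it down.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])  # toy data

@F.udf(returnType=StringType())
def upper_name(s):
    # Keep the UDF self-contained; closures over large objects are a common culprit.
    return s.upper() if s is not None else None

df.withColumn("name_upper", upper_name("name")).show()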
0 votes · 1 answer · 62 views
Does rdd.getNumPartitions() always return the right partition count before an action?
Spark is lazily evaluated, so how does rdd.getNumPartitions() return the correct partition value BEFORE the action is called?
df1 = read_file('s3file1')
df2 = read_file('file2')
print('df1 ...
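A minimal sketch illustrating the point at issue, assuming a toy DataFrame in place of the S3 reads: getNumPartitions() only inspects partitioning metadata that Spark knows once the lineage is defined, so it does not need an action to return the right value.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(0, 1000)            # stand-in for read_file('s3file1')
print(df1.rdd.getNumPartitions())     # metadata lookup, no action triggered

df2 = df1.repartition(8)
print(df2.rdd.getNumPartitions())     # 8, still before any action runs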
0 votes · 1 answer · 61 views
I checked online and found that Python 3.13 doesn't have "typing" in it, so how do I bypass this to start pyspark?
PS C:\spark-3.4.4-bin-hadoop3\bin> pyspark
Python 3.13.3 (tags/v3.13.3:6280bb5, Apr 8 2025, 14:47:33) [MSC v.1943 64 bit (AMD64)] on win32
Type "help", "copyright", "...
0 votes · 0 answers · 34 views
Unable to launch Apache Spark via the command prompt
I am installing Apache Spark for the first time on my Windows 11 machine and I am getting an error while launching Spark via the command prompt.
I am using:
Java Version 17.0.0.1
Spark version - 3.5.5
Hadoop version - ...
1 vote · 0 answers · 29 views
PySpark: writing a DataFrame to an Oracle database table using JDBC
I am new to PySpark and have a few clarifications on writing a DataFrame to an Oracle database table using JDBC.
As part of the requirement, I need to read the data from an Oracle table and perform ...
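A minimal sketch of the read-transform-write round trip, assuming hypothetical table names and connection details; the Oracle JDBC driver jar (e.g. ojdbc8.jar) has to be on the Spark classpath, for instance via --jars.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical connection details.
jdbc_url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"
props = {"user": "app_user", "password": "***", "driver": "oracle.jdbc.OracleDriver"}

df = spark.read.jdbc(jdbc_url, "SRC_TABLE", properties=props)            # read source table
out = df.filter("STATUS = 'ACTIVE'")                                     # placeholder transformation
out.write.jdbc(jdbc_url, "TGT_TABLE", mode="append", properties=props)   # write to target table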
-1 votes · 1 answer · 37 views
Does the Python activity in ADF support abfss paths?
We are migrating to a new Unity Catalog workspace, and we are trying to run ADF pipelines with scripts in ADLS. Previously, with the old Databricks workspace, we used to call the scripts using a DBFS path, but ...