0 votes
0 answers
19 views

Write partitioned col in s3 file too

I’m writing to a Glue table that has country and state as partition columns. But if I read directly from the S3 bucket (the base of the Athena table), I don’t see these partition columns (country ...
Ashish Jangra
0 votes
0 answers
23 views

How to get the list of all URLs that an AWS Glue job calls while reading a BigQuery table?

I am facing an issue while reading data from a BigQuery table to S3 using an AWS Glue PySpark job. Under normal settings, the job works fine. But when I attach a VPC connection and use a ...
Abhinav S J
0 votes
0 answers
11 views

java.lang.NoSuchMethodError: org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream in Databricks notebook

import os import pyspark from pyspark.sql import SparkSession # file directory DATA_DIR = "dbfs:/FileStore/shared_uploads/[email protected]" path = os.path.join(DATA_DIR, "...
Mig Rivera Cueva
-3 votes
0 answers
25 views

How to read files dynamically from different storage accounts [closed]

I am writing Synapse Spark code to dynamically read files from different storage accounts. I don't want it to be hard-coded, as the Spark notebook will be attached to a pipeline - see the image below. NB: the ...
bruce shavhani
-1 votes
1 answer
75 views

Slow performance and timeout when writing 15GB of data from ADLS to Azure SQL DB using Databricks

We have a daily ETL process where we write Parquet data (~15GB) stored in Azure Data Lake Storage (ADLS) into a table in Azure SQL Database. The target table is truncated and reloaded each day. ...
Harish J
0 votes
0 answers
40 views

Unable to use pyarrow optimization in AWS Glue

In my AWS Glue job (Glue 4.0, which supports Spark 3.3), I am trying to optimize by using this: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true"), but it gives me a warning /...
Karim Baig
-2 votes
0 answers
28 views

Are there any benchmarking tools or reference guides available for evaluating the performance of Spark 3.x.x running on Kubernetes? [closed]

I'm running the Spark Connect framework on Kubernetes using Spark version 3.5.1, and I need to benchmark its performance.
Illusion_2001
0 votes
0 answers
37 views

Disable printing info when running spark-sql

I'm running SQL commands with spark-sql. I have put rootLogger.level = off in the log4j2.properties file, but I'm still getting some info messages: Spark Web UI available at http://computer:4040 Spark ...
IGRACH
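Setting only `rootLogger.level` is usually not enough: the root logger also needs an appender reference, and Spark's shipped template additionally scopes down the noisiest namespaces. A sketch of a `conf/log4j2.properties` modeled on that template (note that a few banner lines are printed directly to the console by the shell itself, outside Log4j, and may persist regardless):

```properties
# conf/log4j2.properties - console appender at ERROR level
rootLogger.level = error
rootLogger.appenderRef.stdout.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Quiet the chattiest namespace explicitly
logger.spark.name = org.apache.spark
logger.spark.level = error
```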
0 votes
0 answers
33 views

Error PySparkRuntimeError: [JAVA_GATEWAY_EXITED] in script to upload to Redshift

I need help with an error I get when running a local PySpark notebook in VS Code with Miniforge. I have installed: VS Code, Java 8 + Java JDK 11. I downloaded Spark 3.4.4 into c:/spark and created the folder c:/...
lecarusin
0 votes
0 answers
27 views

java.io.EOFException PySpark Py4JJavaError always occurring when using a user-defined function

I'm preprocessing a CSV file of 1 million rows, hoping to shrink it down to 600,000 rows. However, I always run into trouble when applying a function to a column in the ...
Mig Rivera Cueva
0 votes
1 answer
62 views

Does rdd.getNumPartitions() always have the right repartition number before an action?

Spark is lazily evaluated, so how does rdd.getNumPartitions() return the correct partition count BEFORE an action is called? df1 = read_file('s3file1') df2 = read_file('file2') print('df1 ...
kyl
0 votes
1 answer
61 views

I checked online and found that Python 3.13 doesn't have "typing" in it, so how do I bypass this to start pyspark?

PS C:\spark-3.4.4-bin-hadoop3\bin> pyspark Python 3.13.3 (tags/v3.13.3:6280bb5, Apr 8 2025, 14:47:33) [MSC v.1943 64 bit (AMD64)] on win32 Type "help", "copyright", "...
digi store
0 votes
0 answers
34 views

Unable to launch Apache Spark via command prompt

I am installing Apache Spark for the first time on my Windows 11 machine and I am getting an error while launching Spark via the command prompt. I am using: Java version 17.0.0.1, Spark version 3.5.5, Hadoop version ...
Upendra Dwivedi
1 vote
0 answers
29 views

PySpark: writing a DataFrame to an Oracle database table using JDBC

I am new to PySpark and need a few clarifications on writing a DataFrame to an Oracle database table using JDBC. As part of the requirement, I need to read data from an Oracle table and perform ...
Siva
-1 votes
1 answer
37 views

Does the Python activity in ADF support abfss paths?

We are migrating to a new Unity Catalog workspace, and we are trying to run ADF pipelines with scripts in ADLS. Previously, with the old Databricks workspace, we used to call the scripts using a DBFS path, but ...
shrinivas madras
