41,021 questions
0 votes · 0 answers · 36 views
Pyspark - Flatten nested structure
I have MongoDB collections forms and submissions where forms define dynamic UI components (textfield, checkbox, radio, selectboxes, columns, tables, datagrids) and submissions contain the user data in ...
2 votes · 0 answers · 28 views
How to log a model in MLflow using Spark Connect
I have the following setup:
Kubernetes cluster with Spark Connect 4.0.1 and
MLflow tracking server 3.5.0
The MLflow tracking server should serve all artifacts and is configured this way:
--backend-store-...
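For reference, a tracking server that proxies all artifact traffic is typically started with flags along these lines (the store URI, bucket, and ports below are placeholders, not taken from the question):

```shell
mlflow server \
  --backend-store-uri postgresql://user:pass@db:5432/mlflow \
  --serve-artifacts \
  --artifacts-destination s3://mlflow-artifacts/ \
  --host 0.0.0.0 --port 5000
```

With `--serve-artifacts`, clients upload and download artifacts through the tracking server itself rather than talking to the object store directly.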
0 votes · 1 answer · 51 views
Handle corrupted files in Spark load()
I have a Spark job that runs daily to load data from S3.
The data consists of thousands of gzip files. However, in some cases there are one or two corrupted files in S3, which causes the whole ...
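One common way to keep a daily load from failing on a handful of bad gzip files is Spark's `spark.sql.files.ignoreCorruptFiles` setting, which skips corrupted files instead of aborting the job. A sketch, assuming an existing SparkSession named `spark` (the bucket path is a placeholder):

```python
# session-wide: skip corrupted files instead of failing the whole job
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# the same flag can also be passed per-read
df = (spark.read
      .option("ignoreCorruptFiles", "true")
      .json("s3://my-bucket/daily/2024-01-01/"))   # placeholder path
```

Skipped files are dropped silently, so pairing this with a comparison of input file count versus loaded record count is advisable.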
0 votes · 0 answers · 25 views
Why do I get a "list index out of range" error when writing a SharePoint list to Azure Delta Lake using PySpark on Azure Databricks?
Writing a SharePoint list to Delta file format, I get this error: list index out of range. I have included all the required columns to be fetched from SharePoint and checked the datatype when writing ...
-1 votes · 2 answers · 46 views
Connectivity issues in standalone Spark 4.0
In an Azure VM, I have installed standalone Spark 4.0. On the same VM I have Python 3.11 with Jupyter deployed. In my notebook I submitted the following program:
from pyspark.sql import SparkSession
...
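For a standalone master running on the same VM, the session would typically point at the standalone master URL; a sketch, where the host and port are the standalone defaults (assumed, not taken from the question):

```python
from pyspark.sql import SparkSession

# connect to a local standalone master (default standalone port is 7077)
spark = (SparkSession.builder
         .master("spark://localhost:7077")
         .appName("connectivity-check")
         .getOrCreate())
```

A mismatch between the URL here and the host/port the master actually bound to (shown in the master's web UI) is a frequent source of connection failures.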
1 vote · 1 answer · 91 views
PicklingError: Could not serialize object: RecursionError in pyspark code in Jupyter Notebook
I am very new to Spark (I have just started learning it), and I have encountered a recursion error in a very simple piece of code.
Background:
Spark Version 3.5.7
Java Version 11.0.29 (Eclipse ...
2 votes · 1 answer · 110 views
Spark with Delta Lake and S3A: NumberFormatException "60s" and request for working Docker image/config
I’m trying to create a Delta Lake table in MinIO using Spark 4.0.0 inside a Docker container. I’ve added the required JARs:
delta-spark_2.13-4.0.0.jar
delta-storage-4.0.0.jar
hadoop-aws-3.3.6.jar
aws-...
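The `NumberFormatException: "60s"` typically means an S3A timeout property carries a time-unit suffix that the Hadoop 3.3.x S3A code parses as a plain integer. A common workaround is to override the offending properties with bare millisecond values; a sketch with placeholder MinIO endpoint and settings:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-minio-sketch")
         # bare millisecond values; suffixed values like "60s" trigger the parse error
         .config("spark.hadoop.fs.s3a.connection.timeout", "60000")
         .config("spark.hadoop.fs.s3a.connection.establish.timeout", "60000")
         # placeholder MinIO endpoint; path-style access is required for MinIO
         .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())
```

Which exact property carries the "60s" value can be confirmed from the full stack trace before overriding.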
3 votes · 1 answer · 95 views
Spark-Redis write loses rows when writing large DataFrame to Redis
I’m experiencing data loss when writing a large DataFrame to Redis using the Spark-Redis connector.
Details:
I have a DataFrame with millions of rows.
Writing to Redis works correctly for small ...
0 votes · 1 answer · 39 views
py4j.protocol.Py4JJavaError on a Windows machine running PySpark code when executing the df.show() or df.count() command
I am new to PySpark.
I have installed Java 17 and made sure it works:
C:\Windows\System32>java -version
java version "17.0.12" 2024-07-16 LTS
Installed Python 3.9 and made sure it works:
C:\...
0 votes · 0 answers · 44 views
PySpark 3.5.5 CharType in read.csv schema definition
I'm using a PySpark notebook inside of Azure Synapse.
This is my schema definition
qcew_schema = StructType([
StructField( 'area_fips', dataType = CharType(5), ...
1 vote · 1 answer · 68 views
Spark JDBC reading wrong character encoding from PostgreSQL with server_encoding = SQL_ASCII
I'm reading data from a PostgreSQL 8.4 database into PySpark using the JDBC connector.
The database's server_encoding is SQL_ASCII.
When I query the table directly in pgAdmin, names like SÉRGIO or ...
2 votes · 0 answers · 56 views
PySpark/MongoDB Connector DataException: dataType 'struct' is invalid for 'BsonArray' during ETL
I am running a data ingestion ETL pipeline orchestrated by Airflow using PySpark to read data from MongoDB (using the MongoDB Spark Connector) and load it into a Delta Lake table. The pipeline is ...
0 votes · 0 answers · 26 views
Cross-subscription Synapse Spark query to a dedicated SQL Pool: how to?
I want to query a SQL Pool in a different subscription using Spark. Can I just use the same syntax, or is additional configuration necessary, and if so, how?
df = spark.read.option(Constants.SERVER, "&...
0 votes · 0 answers · 20 views
Can I update fs.s3a credentials in hadoop config on existing executors?
I have an application using EKS in AWS that runs a spark session that can run multiple workloads. In each workload, I need to access data from S3 in another AWS account, for which I have STS ...
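Setting new credentials on the driver's Hadoop configuration only helps if executors are not still serving cached `FileSystem` instances built with the old keys. A common pattern is per-bucket credentials plus disabling the S3A filesystem cache; a sketch, assuming an existing SparkSession `spark` and fresh STS credentials in a dict `creds` (bucket name and dict keys are placeholders):

```python
hconf = spark.sparkContext._jsc.hadoopConfiguration()

# per-bucket overrides scope the temporary credentials to the cross-account bucket
hconf.set("fs.s3a.bucket.other-account-bucket.access.key", creds["AccessKeyId"])
hconf.set("fs.s3a.bucket.other-account-bucket.secret.key", creds["SecretAccessKey"])
hconf.set("fs.s3a.bucket.other-account-bucket.session.token", creds["SessionToken"])

# without this, executors may reuse FileSystem objects built with expired keys
hconf.set("fs.s3a.impl.disable.cache", "true")
```

Disabling the cache trades connection reuse for correctness, so it is worth measuring the overhead on workloads with many small reads.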
0 votes · 1 answer · 102 views
PySpark - Multithreading in Python
I have a use case like this: I have a list of many queries, and I am running multi-threading with PySpark, with each thread submitting some SQL.
There are some queries that report success but the final ...
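A frequent cause of "reports success but the result is wrong" in this pattern is that exceptions raised inside a worker thread are silently lost unless each future's result is checked. A minimal sketch (the `run_query` callable is a stand-in for whatever submits `spark.sql`):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(queries, run_query, max_workers=4):
    """Run queries concurrently; collect results and surface per-query errors."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_query, q): q for q in queries}
        for fut in as_completed(futures):
            q = futures[fut]
            try:
                results[q] = fut.result()   # re-raises any exception from the thread
            except Exception as exc:
                errors[q] = exc
    return results, errors

# usage sketch with Spark:
# results, errors = run_all(sql_list, lambda q: spark.sql(q).collect())
```

Checking `fut.result()` (or `errors` afterward) is what distinguishes a query that truly succeeded from one whose thread died quietly.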