82,647 questions
2
votes
0
answers
28
views
How to log a model in MLflow using Spark Connect
I have the following setup:
Kubernetes cluster with Spark Connect 4.0.1 and
MLflow tracking server 3.5.0
The MLflow tracking server should serve all artifacts and is configured this way:
--backend-store-...
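A minimal configuration sketch of the client side, assuming the tracking server runs with `--serve-artifacts` so artifacts are proxied through it. The host names, ports, and the toy pipeline are placeholders, and Spark Connect support for `pyspark.ml` varies by 4.x release:

```python
# Hypothetical sketch: log a Spark ML model from a Spark Connect session to a
# remote MLflow tracking server. URIs and the model are placeholders.
import mlflow
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.remote("sc://spark-connect-host:15002").getOrCreate()
mlflow.set_tracking_uri("http://mlflow-host:5000")  # server started with --serve-artifacts

train = spark.createDataFrame([(1.0, 2.0), (2.0, 4.0)], ["x", "y"])
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x"], outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="y"),
])
model = pipeline.fit(train)

with mlflow.start_run():
    # With --serve-artifacts the client uploads through the tracking server,
    # so it needs no direct access to the backing artifact store.
    # MLflow 3.x takes name=; older releases use artifact_path=.
    mlflow.spark.log_model(model, name="model")
```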
0
votes
1
answer
51
views
Handle corrupted files in spark load()
I have a Spark job that runs daily to load data from S3.
The data consist of thousands of gzip files. However, in some cases there are one or two corrupted files in S3, and that causes the whole ...
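A configuration sketch of the usual workaround, assuming the files are JSON lines (paths and format are hypothetical). `spark.sql.files.ignoreCorruptFiles` is a core Spark option; the trade-off is that bad files are dropped silently, so reconcile file counts separately if completeness matters:

```python
# Configuration sketch: let Spark skip files it cannot decompress instead of
# failing the whole daily job.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily-s3-load")
    # Core Spark option: silently drop files that throw during read/decompression.
    .config("spark.sql.files.ignoreCorruptFiles", "true")
    .getOrCreate()
)

df = spark.read.json("s3a://my-bucket/daily/*.gz")  # corrupted members are skipped
```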
-1
votes
2
answers
46
views
Connectivity issues in standalone Spark 4.0
On an Azure VM, I have installed standalone Spark 4.0. On the same VM, I have Python 3.11 with Jupyter deployed. In my notebook I submitted the following program:
from pyspark.sql import SparkSession
...
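A configuration sketch of the session builder for this setup, with the usual pitfalls noted in comments (host name and ports are the standalone defaults; adjust to the actual master URL shown in the master's Web UI). The PySpark minor version in the notebook must also match the cluster (4.0.x):

```python
# Configuration sketch: connect a local Jupyter kernel to the standalone master
# running on the same VM.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://localhost:7077")          # the spark:// URL, not the Web UI port (8080)
    .appName("notebook")
    .config("spark.driver.host", "localhost")  # helps when the VM has multiple interfaces
    .getOrCreate()
)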
1
vote
1
answer
91
views
PicklingError: Could not serialize object: RecursionError in pyspark code in Jupyter Notebook
I am very new to Spark (I have just started learning), and I have encountered a recursion error in a very simple piece of code.
Background:
Spark Version 3.5.7
Java Version 11.0.29 (Eclipse ...
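PySpark serializes a UDF together with everything it closes over, and the usual cause of this `PicklingError`/`RecursionError` is a function that accidentally captures a driver-side handle such as the `SparkSession` or a DataFrame. A PySpark-free sketch of the difference (the class here is a hypothetical stand-in):

```python
import pickle

class DriverHandle:
    """Stand-in for a driver-only object such as SparkSession or a DataFrame."""
    def __getstate__(self):
        raise TypeError("driver-side handle; cannot be shipped to executors")

handle = DriverHandle()

def bad(x):
    # Closes over `handle`; serializing this function drags `handle` along.
    return handle

def good(x):
    # Self-contained: touches only its argument, so it pickles cleanly.
    return x + 1

# The driver handle itself refuses to pickle, mirroring the UDF failure:
try:
    pickle.dumps(handle)
    failed = False
except TypeError:
    failed = True
```

The fix in real code is the same shape: make the UDF body depend only on its arguments and small, picklable values.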
-1
votes
0
answers
71
views
Implementing Incremental Data Quality Validation in Large-Scale ETL Pipelines with Schema Evolution
I'm working on a large-scale ETL pipeline processing ~500GB daily across multiple data sources. We're currently using Great Expectations for data quality validation, but facing performance bottlenecks ...
2
votes
1
answer
110
views
Spark with Delta Lake and S3A: NumberFormatException "60s" and request for working Docker image/config
I’m trying to create a Delta Lake table in MinIO using Spark 4.0.0 inside a Docker container. I’ve added the required JARs:
delta-spark_2.13-4.0.0.jar
delta-storage-4.0.0.jar
hadoop-aws-3.3.6.jar
aws-...
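A configuration sketch of a common explanation and workaround (bucket, endpoint, and credentials are placeholders): Spark 4.0 ships a Hadoop 3.4.x line whose defaults use duration strings like `"60s"`, which `hadoop-aws` 3.3.6 parses as plain longs. Matching `hadoop-aws` to the Hadoop version Spark bundles is the cleaner fix; overriding the timeouts with unit-less milliseconds is a workaround:

```python
# Configuration sketch for Delta on MinIO via S3A.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    # Older S3A parses these as plain longs; "60s"-style values then throw
    # NumberFormatException. Plain milliseconds avoid that.
    .config("spark.hadoop.fs.s3a.connection.establish.timeout", "60000")
    .config("spark.hadoop.fs.s3a.connection.timeout", "60000")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)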
0
votes
0
answers
25
views
Large variation in spark runtimes
Long story short, my team was hired to take over some legacy code, and it was running in around 5 hours. We began making some minor changes that shouldn't have affected the runtimes in any significant ...
3
votes
1
answer
95
views
Spark-Redis write loses rows when writing large DataFrame to Redis
I’m experiencing data loss when writing a large DataFrame to Redis using the Spark-Redis connector.
Details:
I have a DataFrame with millions of rows.
Writing to Redis works correctly for small ...
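A sketch of the most common explanation, assuming the spark-redis data source: rows are stored as Redis hashes keyed by `table:key`, so rows sharing a key silently overwrite each other, which looks like data loss at scale. The table and column names below are hypothetical, and `df` is assumed to exist:

```python
# Fragment: write with an explicitly unique key column so rows cannot collide.
(
    df.write
    .format("org.apache.spark.sql.redis")
    .option("table", "events")
    .option("key.column", "event_id")  # must be unique per row, or rows overwrite
    .mode("append")
    .save()
)
```

Comparing `df.select("event_id").distinct().count()` with `df.count()` before writing is a quick way to confirm or rule out key collisions.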
Best practices
0
votes
5
replies
84
views
Pushing down filters in RDBMS with Java Spark
I have been working as a Data Engineer and ran into this issue.
I came across a use case where I have a view (let's call it inputView) which is created by reading data from some source.
Now somewhere ...
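One sketch of forcing the filter into the database (shown in PySpark for brevity, though the question uses Java; connection details and table names are placeholders): once data is materialized as a Spark-side view, later filters run in Spark, but a filter embedded in the JDBC read itself executes in the RDBMS:

```python
# Fragment, assuming an existing `spark` session: read a subquery so the
# predicate runs inside the database rather than after the fact in Spark.
pushed = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb")
    .option("query", "SELECT * FROM input_table WHERE status = 'ACTIVE'")
    .option("user", "app")
    .option("password", "secret")
    .load()
)
```

Filters applied directly to a JDBC-backed DataFrame before any other transformation are also pushed down automatically; `pushed.explain()` shows `PushedFilters` when that happens.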
1
vote
0
answers
67
views
Databricks always loads built-in BigQuery connector (0.22.2), can’t override with 0.43.x
I am using Databricks Runtime 15.4 (Spark 3.5 / Scala 2.12) on AWS.
My goal is to use the latest Google BigQuery connector because I need the direct write method (BigQuery Storage Write API):
option(&...
0
votes
0
answers
44
views
PySpark 3.5.5 CharType in read.csv schema definition
I'm using a PySpark notebook inside of Azure Synapse.
This is my schema definition
qcew_schema = StructType([
StructField( 'area_fips', dataType = CharType(5), ...
2
votes
0
answers
56
views
PySpark/MongoDB Connector DataException: dataType 'struct' is invalid for 'BsonArray' during ETL
I am running a data ingestion ETL pipeline orchestrated by Airflow using PySpark to read data from MongoDB (using the MongoDB Spark Connector) and load it into a Delta Lake table. The pipeline is ...
1
vote
0
answers
67
views
Trying to read a BigQuery array column and pass it as columns to fetch from a Spark DataFrame
I have a BigQuery table with an array column named "column_list "
ALTER TABLE `test-project.TEST_DATASET.TEST_TABLE`
ADD COLUMN column_list ARRAY<STRING>;
update `test-project....
0
votes
1
answer
88
views
col function error type mismatch: found string required Int
I am attempting to programmatically remove specific columns/fields from a dataframe (anything that starts with _), whether the field is in the root or in a struct, using the dropFields method.
For ...
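The type-mismatch hint in the title usually means a positional `Int` was handed to an API that wants a column *name*: `col`, `drop`, and `dropFields` all take strings. A minimal sketch of computing the names to drop (the column names here are hypothetical):

```python
# Sketch with hypothetical column names: select top-level columns to drop by
# name (strings, not positional Ints).
columns = ["id", "_ingest_ts", "payload", "_raw"]
to_drop = [c for c in columns if c.startswith("_")]
to_keep = [c for c in columns if not c.startswith("_")]
# Top level:  df = df.drop(*to_drop)
# In a struct: df.withColumn("payload", col("payload").dropFields("_meta"))
```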
0
votes
1
answer
102
views
PySpark: Multithreading in Python
I have a use case like this: I have a list of many queries, and I am running multithreading with PySpark, with each thread submitting some SQL.
There are some queries that report success, but the final ...
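A sketch of the fan-out pattern this describes, assuming the queries are independent. `run_query` here is a pure-Python stand-in for `spark.sql(q)` plus an action; forcing an action (a `count()` or the write itself) inside the thread is what makes a per-query "success" report trustworthy, since `spark.sql` alone is lazy:

```python
# Sketch: run independent queries across driver threads and collect results.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_query(q):
    # In real code: return spark.sql(q).count()  (the action forces execution)
    return f"ok:{q}"

queries = ["q1", "q2", "q3"]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(run_query, q): q for q in queries}
    results = {futures[f]: f.result() for f in as_completed(futures)}
```

The `SparkSession` is safe to share across driver threads; `f.result()` re-raises any exception from the thread, so silent failures don't masquerade as success.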