2 votes
0 answers
28 views

I have the following setup: a Kubernetes cluster with Spark Connect 4.0.1 and an MLflow tracking server 3.5.0. The MLflow tracking server should serve all artifacts and is configured this way: --backend-store-...
hage • 6,213
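A minimal sketch of the client side, assuming the server was started with MLflow's --backend-store-uri / --artifacts-destination / --serve-artifacts flags; the service URL and run contents below are placeholders:

```python
import mlflow

# Hypothetical in-cluster service URL; replace with your tracking server address.
mlflow.set_tracking_uri("http://mlflow-tracking:5000")

with mlflow.start_run(run_name="spark-connect-demo"):
    mlflow.log_param("spark.version", "4.0.1")
    # With --serve-artifacts enabled, artifact I/O is proxied through the
    # tracking server, so the client needs no credentials for the backing store.
    mlflow.log_text("hello from the driver", "notes.txt")
```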
0 votes
1 answer
51 views

I have a Spark job that runs daily to load data from S3. The data consist of thousands of gzip files. However, in some cases there are one or two corrupted files in S3, and that causes the whole ...
Nakeuh • 1,933
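A hedged sketch of the usual mitigation, Spark's ignoreCorruptFiles switch (the path and format below are placeholders; truncated gzip members are typically skipped under it, though behavior can vary by source):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-s3-load").getOrCreate()

# Session-wide switch: skip unreadable files instead of failing the whole job.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# The same behavior can be requested per read; the S3 path is a placeholder.
df = (spark.read
      .option("ignoreCorruptFiles", "true")
      .json("s3a://my-bucket/events/*.json.gz"))
```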
-1 votes
2 answers
46 views

In an Azure VM, I have installed standalone Spark 4.0. On the same VM I have Python 3.11 with Jupyter deployed. In my notebook I submitted the following program: from pyspark.sql import SparkSession ...
Ziggy • 43
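For reference, a minimal sketch of connecting a notebook to a standalone master, assuming the default spark://<host>:7077 URL (host and app name are placeholders):

```python
from pyspark.sql import SparkSession

# spark://<host>:7077 is the standalone master's default URL; adjust the host.
spark = (SparkSession.builder
         .master("spark://localhost:7077")
         .appName("jupyter-notebook")
         .getOrCreate())

spark.range(10).show()  # quick check that executors are reachable
```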
1 vote
1 answer
91 views

I am very new to Spark (I have just started learning it), and I have encountered a recursion error in very simple code. Background: Spark version 3.5.7, Java version 11.0.29 (Eclipse ...
GINzzZ100
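A minimal local smoke test can separate environment problems from code problems; everything below (master URL, sample data) is illustrative, and raising the recursion limit is only a diagnostic aid:

```python
import sys
from pyspark.sql import SparkSession

# The default limit of 1000 is normally plenty; hitting it in trivial code
# usually points at the environment (e.g., a Python/PySpark version mismatch)
# rather than the DataFrame logic. Raising it is only a diagnostic aid.
sys.setrecursionlimit(3000)

spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"]).show()
spark.stop()
```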
-1 votes
0 answers
71 views

I'm working on a large-scale ETL pipeline processing ~500GB daily across multiple data sources. We're currently using Great Expectations for data quality validation, but we're facing performance bottlenecks ...
Vijay Savaliya
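Where Great Expectations is too slow, one commonly suggested alternative (plain Spark aggregations, not GE's API) is to fold several checks into a single scan; the column names and path below are made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3a://my-bucket/source_a/")  # placeholder path

# Fold several expectations into one aggregation pass instead of one full
# scan per expectation; column names here are illustrative.
checks = df.agg(
    F.count(F.when(F.col("id").isNull(), 1)).alias("null_ids"),
    F.count(F.when(F.col("amount") < 0, 1)).alias("negative_amounts"),
    F.count("*").alias("row_count"),
).first()

assert checks["null_ids"] == 0, "id column contains nulls"
assert checks["negative_amounts"] == 0, "amount column has negative values"
```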
2 votes
1 answer
110 views

I’m trying to create a Delta Lake table in MinIO using Spark 4.0.0 inside a Docker container. I’ve added the required JARs: delta-spark_2.13-4.0.0.jar, delta-storage-4.0.0.jar, hadoop-aws-3.3.6.jar, aws-...
Tutu ツ • 155
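A hedged sketch of the session wiring usually needed for Delta on MinIO over s3a (the endpoint and credentials are placeholders; a hadoop-aws version that does not match Spark's bundled Hadoop is a common failure mode):

```python
from pyspark.sql import SparkSession

# Endpoint and credentials are placeholders for a local MinIO container.
spark = (SparkSession.builder
         .appName("delta-minio")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
         .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
         .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

spark.range(5).write.format("delta").mode("overwrite").save("s3a://bucket/tables/demo")
```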
0 votes
0 answers
25 views

Long story short, my team was hired to take on some legacy code that was running around 5 hours. We began making some minor changes that shouldn't have affected the runtimes in any significant ...
Ben Fuqua
3 votes
1 answer
95 views

I’m experiencing data loss when writing a large DataFrame to Redis using the Spark-Redis connector. Details: I have a DataFrame with millions of rows. Writing to Redis works correctly for small ...
gianfranco de siena
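For orientation, a sketch of the spark-redis write path with its table / key.column options; the table name and data are placeholders, and server-side eviction (maxmemory policies) is one possible cause of rows going missing:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "key")  # placeholder data

# "table" becomes the Redis key prefix; "key.column" supplies the key suffix.
(df.write
   .format("org.apache.spark.sql.redis")
   .option("table", "people")
   .option("key.column", "key")
   .mode("append")
   .save())
```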
Best practices
0 votes
5 replies
84 views

I have been working as a Data Engineer and ran into this issue. I came across a use case where I have a view (let's call it inputView) that is created by reading data from some source. Now somewhere ...
Parth Sarthi Roy
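A short sketch of the usual pattern, caching the view's data once so later references reuse it; the source path is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder source; a temp view re-runs its plan on every reference
# unless the underlying data is cached.
spark.read.parquet("s3a://bucket/source/").createOrReplaceTempView("inputView")

spark.catalog.cacheTable("inputView")
spark.sql("SELECT COUNT(*) FROM inputView").show()  # first action fills the cache
```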
1 vote
0 answers
67 views

I am using Databricks Runtime 15.4 (Spark 3.5 / Scala 2.12) on AWS. My goal is to use the latest Google BigQuery connector because I need the direct write method (BigQuery Storage Write API): option("...
Thilina • 157
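For reference, the connector's direct write path looks roughly like this (table name and data are placeholders; on Databricks, a built-in copy of the connector can shadow a user-supplied JAR, which may be the real obstacle):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "val"])  # placeholder data

# "direct" routes writes through the BigQuery Storage Write API,
# so no temporary GCS bucket is needed.
(df.write
   .format("bigquery")
   .option("writeMethod", "direct")
   .option("table", "my-project.my_dataset.my_table")
   .mode("append")
   .save())
```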
0 votes
0 answers
44 views

I'm using a PySpark notebook inside Azure Synapse. This is my schema definition: qcew_schema = StructType([StructField('area_fips', dataType=CharType(5), ...
Vijay Tripathi
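CharType does exist in pyspark.sql.types, but a frequently suggested workaround when it is rejected in a schema is to declare StringType and enforce the width separately; this sketch assumes that substitution:

```python
from pyspark.sql.types import StructType, StructField, StringType

# Declare the column as a plain string and validate the fixed width
# separately, instead of CharType(5).
qcew_schema = StructType([
    StructField("area_fips", StringType(), nullable=True),
])
```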
2 votes
0 answers
56 views

I am running a data ingestion ETL pipeline orchestrated by Airflow using PySpark to read data from MongoDB (using the MongoDB Spark Connector) and load it into a Delta Lake table. The pipeline is ...
Tavakoli • 1,433
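A condensed sketch of such a pipeline using the v10.x connector's mongodb format (connection details and paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Connection details are placeholders; "mongodb" is the v10.x connector's
# short format name (the older 3.x releases used "mongo").
df = (spark.read.format("mongodb")
      .option("connection.uri", "mongodb://mongo:27017")
      .option("database", "app")
      .option("collection", "events")
      .load())

df.write.format("delta").mode("append").save("s3a://bucket/tables/events")
```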
1 vote
0 answers
67 views

I have a BigQuery table with an array column named "column_list ": ALTER TABLE `test-project.TEST_DATASET.TEST_TABLE` ADD COLUMN column_list ARRAY<STRING>; update `test-project....
Vikrant Singh Rana
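For context, reading that table back into Spark with the spark-bigquery connector would look roughly like this; the ARRAY<STRING> column surfaces as ArrayType(StringType):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("bigquery")
      .option("table", "test-project.TEST_DATASET.TEST_TABLE")
      .load())

# The BigQuery ARRAY<STRING> column arrives as ArrayType(StringType).
df.select(F.explode("column_list").alias("item")).show()
```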
0 votes
1 answer
88 views

I am attempting to programmatically remove specific columns/fields from a dataframe (anything that starts with _), whether the field is in the root or in a struct, using the dropFields method. For ...
Andrew • 8,893
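A sketch of one way to do it for root columns plus one level of struct nesting, using drop and Column.dropFields (deeper nesting would need recursion; the sample schema is made up):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "x", ("y", "z"))],
    "id INT, _tmp STRING, s STRUCT<keep: STRING, _drop: STRING>")

# Root-level columns are handled by a plain drop ...
df = df.drop(*[c for c in df.columns if c.startswith("_")])

# ... and matching fields inside each struct column by dropFields.
for field in df.schema.fields:
    if isinstance(field.dataType, StructType):
        bad = [f.name for f in field.dataType.fields if f.name.startswith("_")]
        if bad:
            df = df.withColumn(field.name, F.col(field.name).dropFields(*bad))

df.printSchema()  # id and s.keep remain
```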
0 votes
1 answer
102 views

I have a use case like this. I have a list of many queries. I am running multi-threading with PySpark, with each thread submitting some SQL. There are some queries that report success, but the final ...
user31827888
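A hedged sketch of the usual pattern: force execution inside each thread and capture per-query failures explicitly (the queries below are placeholders; spark.newSession() per thread gives isolated SQL configs if needed):

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
queries = ["SELECT 1 AS a", "SELECT 2 AS b"]  # placeholders

def run(sql):
    # collect() forces execution inside this thread, so failures surface
    # here instead of appearing later as a "successful" lazy plan.
    try:
        return sql, spark.sql(sql).collect(), None
    except Exception as exc:
        return sql, None, exc

with ThreadPoolExecutor(max_workers=4) as pool:
    for sql, rows, err in pool.map(run, queries):
        print(sql, "OK" if err is None else f"FAILED: {err}")
```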
