44,206 questions
0
votes
0
answers
20
views
Can I update fs.s3a credentials in hadoop config on existing executors?
I have an application on EKS in AWS that runs a Spark session that can run multiple workloads. In each workload, I need to access data from S3 in another AWS account, for which I have STS ...
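A minimal sketch of the per-bucket route, with placeholder bucket and credential values: S3A reads credentials from the Hadoop configuration when it instantiates a FileSystem, and it caches those instances, so whether already-running executors pick up new values usually hinges on that cache.

```python
# Sketch: refresh per-bucket S3A session credentials on a live SparkSession.
# Bucket name and credential values are placeholders, not from the question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("refresh-s3a-creds").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

def set_bucket_credentials(bucket, access_key, secret_key, session_token):
    """Point one bucket at a fresh set of STS credentials."""
    prefix = f"fs.s3a.bucket.{bucket}."
    hconf.set(prefix + "aws.credentials.provider",
              "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    hconf.set(prefix + "access.key", access_key)
    hconf.set(prefix + "secret.key", secret_key)
    hconf.set(prefix + "session.token", session_token)

# S3A caches FileSystem instances; one created with the old credentials keeps
# them, so forcing recreation (at a connection-reuse cost) may be needed too:
hconf.set("fs.s3a.impl.disable.cache", "true")

set_bucket_credentials("other-account-bucket", "ASIA...", "...", "...")
df = spark.read.parquet("s3a://other-account-bucket/some/prefix/")
```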
0
votes
0
answers
105
views
PySpark error py4j.protocol.Py4JJavaError
I keep running into this issue when running PySpark.
I was able to connect to my database and retrieve data, but whenever I try to do operations like .show() or .count(), or when I try to save a Spark ...
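Worth noting for errors of this shape: Spark transformations are lazy, so .show() and .count() are simply the first points where the executors actually touch the database. A small sketch (JDBC URL, table, and driver are hypothetical) of where the work really happens:

```python
# Sketch: why a PySpark job can "connect fine" and then fail at .show().
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")  # placeholder URL
      .option("dbtable", "public.some_table")                # placeholder table
      .option("driver", "org.postgresql.Driver")
      .load())          # the driver fetches only the schema here; no rows yet

filtered = df.filter(df["id"] > 0)   # transformation: still lazy

# Executors first open their own JDBC connections here, so a driver jar
# missing from the executor classpath, or a database unreachable from the
# workers, surfaces as Py4JJavaError at the action rather than at .load().
filtered.show()
```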
1
vote
1
answer
80
views
Apache Hive Docker container: HiveServer2 fails to bind on port 10000 (Connection refused in Beeline)
I am running Apache Hive 4.0.0 inside Docker on Ubuntu 22.04.
The container starts, but HiveServer2 never binds to the port.
When I try to connect with Beeline:
sudo docker exec -it hive4 beeline -u ...
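"Connection refused" means nothing accepted the TCP connection at all, which separates "HiveServer2 never bound the port" (check the container logs) from Hive-level failures. A quick check, assuming the container publishes port 10000 to the host:

```python
# Sketch: verify whether anything is listening on HiveServer2's port.
# Host and port are assumptions (docker run ... -p 10000:10000).
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("hiveserver2 reachable:", port_open("localhost", 10000))
```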
0
votes
3
answers
145
views
How to connect to S3 without the large AWS SDK v2 bundle?
I'm trying to read some files from S3 with PySpark 4.0.1 and the S3AFileSystem.
The standard configuration using hadoop-aws 3.4.1 works, but it requires the AWS SDK Bundle. This single dependency is ...
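One direction people take, as a sketch under assumptions rather than a supported configuration (upstream only validates hadoop-aws against the full bundle), is to depend on the individual SDK v2 modules S3A actually uses:

```python
# Sketch: pull hadoop-aws plus individual AWS SDK v2 modules instead of the
# full aws-sdk-bundle. The artifact list and the 2.24.x version are
# assumptions; match whatever SDK version the hadoop-aws 3.4.1 POM declares.
from pyspark.sql import SparkSession

packages = ",".join([
    "org.apache.hadoop:hadoop-aws:3.4.1",
    "software.amazon.awssdk:s3:2.24.6",
    "software.amazon.awssdk:sts:2.24.6",                  # assumed roles / STS
    "software.amazon.awssdk:s3-transfer-manager:2.24.6",  # used for uploads
])

spark = (SparkSession.builder
         .appName("s3a-slim-deps")
         .config("spark.jars.packages", packages)
         .getOrCreate())

spark.read.text("s3a://some-bucket/some-key.txt").show(5)  # placeholder path
```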
0
votes
0
answers
68
views
Data Migration query
I have a Hive table emp1 with 100 partitions in Text format.
I want Spark to read the emp table partition by partition and write it to EMP2 in Parquet format. How to achieve 1) 10 partitions read from ...
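A minimal sketch of the batched migration, assuming a single partition column named dt (the real column name isn't in the excerpt); filtering on the partition column lets Spark prune the scan to just the 10 partitions per pass:

```python
# Sketch: migrate a partitioned Hive text table to Parquet, 10 partitions at
# a time. Assumes one partition column named dt; adjust to the real schema.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("emp1-to-emp2")
         .enableHiveSupport()
         .getOrCreate())

# SHOW PARTITIONS returns strings like "dt=2024-01-01".
parts = [r[0].split("=", 1)[1]
         for r in spark.sql("SHOW PARTITIONS emp1").collect()]

BATCH = 10  # 10 partitions per read/write pass, per the question
for i in range(0, len(parts), BATCH):
    batch = parts[i:i + BATCH]
    # Filtering on the partition column prunes the read to this batch only.
    df = spark.table("emp1").where(col("dt").isin(batch))
    (df.write
       .mode("append")
       .format("parquet")
       .partitionBy("dt")
       .saveAsTable("EMP2"))  # first pass creates EMP2, later passes append
```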
0
votes
1
answer
71
views
distcp creating a file in the GCP bucket instead of a file inside a directory
Context:
using distcp, I am trying to copy an HDFS directory, including its files, to a GCP bucket.
I am using
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://$JCEKS_FILE hdfs://nameservice1/...
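One likely culprit, inferred from the symptom: DistCp renames the source's contents onto the target path when the target does not exist, and only copies the source directory into the target when the target already exists. A sketch of guarding against that, with placeholder paths:

```python
# Sketch: ensure the destination directory exists so the source directory is
# copied *into* it instead of being flattened onto the target path.
import os
import subprocess

SRC = "hdfs://nameservice1/data/mydir"  # placeholder source
DST = "gs://my-bucket/backups"          # placeholder destination

jceks = os.environ["JCEKS_FILE"]        # the question's $JCEKS_FILE

# If DST doesn't exist, DistCp renames SRC's contents onto DST itself;
# if DST exists, SRC arrives as DST/mydir.
subprocess.run(["hadoop", "fs", "-mkdir", "-p", DST], check=True)

subprocess.run([
    "hadoop", "distcp",
    f"-Dhadoop.security.credential.provider.path=jceks://{jceks}",
    SRC, DST,
], check=True)
```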
0
votes
0
answers
76
views
How to package a PySpark + Delta Lake script into an EXE with PyInstaller
I’m trying to convert my PySpark script into an executable (.exe) file using PyInstaller.
The script runs fine in Python, but after converting it to an EXE and executing it, I get the following error:
'...
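The usual failure mode (an assumption here, since the error is truncated) is that PyInstaller doesn't collect pyspark's jars/ directory into the bundle. A sketch of a frozen-entry-point shim, assuming the bundle was built with --collect-all pyspark:

```python
# Sketch: runtime shim at the top of the frozen entry script. Assumes the
# pyspark package, including its jars/ directory, was collected into the
# bundle (e.g. pyinstaller --collect-all pyspark my_script.py).
import os
import sys

if getattr(sys, "frozen", False):
    # One-file bundles unpack to a temp dir exposed as sys._MEIPASS.
    bundle_dir = getattr(sys, "_MEIPASS", os.path.dirname(sys.executable))
    # Point Spark at the bundled pyspark install so it can find its jars.
    os.environ.setdefault("SPARK_HOME", os.path.join(bundle_dir, "pyspark"))

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("frozen-demo")
         .getOrCreate())
spark.range(5).show()
```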
0
votes
1
answer
114
views
Cannot expire snapshots with retain_last property
I have 67 snapshots in a single table, but when I use CALL
iceberg_catalog.system.expire_snapshots(
table => 'iceberg_catalog.default.test_7',
retain_last => 5
);
it doesn't delete any snapshots. ...
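This is usually behaviour, not a bug: retain_last is only a floor on how many snapshots to keep, while the procedure deletes snapshots older than older_than, which defaults to five days ago. With 67 recent snapshots, nothing qualifies. Passing an explicit older_than (the timestamp below is illustrative) makes retain_last bite:

```python
# Sketch: expire everything except the newest five snapshots by moving the
# older_than cutoff forward; catalog and table names are from the question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("expire-snapshots").getOrCreate()

spark.sql("""
    CALL iceberg_catalog.system.expire_snapshots(
        table       => 'iceberg_catalog.default.test_7',
        older_than  => TIMESTAMP '2030-01-01 00:00:00',  -- illustrative cutoff
        retain_last => 5
    )
""").show()
```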
1
vote
1
answer
43
views
Failed to find datanode (scope="" excludedScope="/rack0")
I built a Hadoop cluster (version 3.3.6) with Docker Swarm. I have 3 machines: one runs the namenode, and all three run datanodes. After everything started I checked, and the namenode is healthy, the datanodes are healthy, ...
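The excludedScope="/rack0" part hints that every datanode landed in the same rack, so the placement policy had no second rack to choose from. If a topology script is in play, mapping nodes to distinct racks is one way out; a sketch of a net.topology.script.file.name script with placeholder addresses:

```python
#!/usr/bin/env python3
# Sketch: a rack-awareness script for net.topology.script.file.name. Hadoop
# calls it with datanode IPs/hostnames as arguments and expects one rack path
# per argument on stdout. Addresses below are placeholders.
import sys

RACKS = {
    "10.0.1.11": "/rack0",
    "10.0.1.12": "/rack1",
    "10.0.1.13": "/rack2",
}

print(" ".join(RACKS.get(arg, "/default-rack") for arg in sys.argv[1:]))
```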
0
votes
0
answers
116
views
How to prevent expired tokens with the AWS S3 credentials provider?
I am building a Spring Boot 3.2.5 application that retrieves data from Parquet files on an AWS S3 bucket. This data is then converted into CSV and loaded into a Postgres database.
This operation works ...
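The question is about Spring Boot, but the refresh pattern is SDK-agnostic: re-assume the role shortly before the temporary credentials lapse instead of holding one session for the whole export. A sketch of the idea (Python stands in for the Java SDK here; role ARN and names are placeholders):

```python
# Sketch: refresh assumed-role credentials when they are close to expiry.
import datetime
import boto3

ROLE_ARN = "arn:aws:iam::123456789012:role/parquet-reader"  # hypothetical
_sts = boto3.client("sts")
_creds = None

def s3_client():
    """Return an S3 client, re-assuming the role if the current credentials
    expire within five minutes."""
    global _creds
    now = datetime.datetime.now(datetime.timezone.utc)
    if _creds is None or _creds["Expiration"] - now < datetime.timedelta(minutes=5):
        _creds = _sts.assume_role(
            RoleArn=ROLE_ARN,
            RoleSessionName="parquet-to-csv",
            DurationSeconds=3600,
        )["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=_creds["AccessKeyId"],
        aws_secret_access_key=_creds["SecretAccessKey"],
        aws_session_token=_creds["SessionToken"],
    )
```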
0
votes
2
answers
105
views
Spark unit tests fail under Maven but pass in IntelliJ
I'm working on a Scala project using Spark (with Hive support in some tests) and running unit and integration tests via both IntelliJ and Maven Surefire.
I have a shared test session setup like this:
...
0
votes
1
answer
133
views
Hive 4.0.1 doesn't work because JAR files are not found
Hive 4.0.1 doesn't work because JAR files are not found. I want to use Hive integrated with Hadoop 3.4.1 to query data on Apache Spark.
I tried typing ./hive/bin/hive and expected it to return >...
0
votes
0
answers
18
views
Hadoop: upload data using the balancer to evenly distribute it across all nodes
I have a 3-node Hadoop cluster (version 3.4.1) with JAVA_HOME pointing to Java 8 on each node.
I want to evenly distribute the uploaded data across all nodes when I type the following:
hdfs ...
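Two points usually matter here (assumptions from the symptom, since the command is truncated): an HDFS client running on a datanode always places the first replica locally, so uploading from one node skews that node; uploading from a machine outside the cluster spreads writes, and the balancer evens things out after the fact:

```python
# Sketch: rebalance after upload. The threshold is the allowed deviation, in
# percentage points, from mean cluster utilisation (default 10).
import subprocess

subprocess.run(["hdfs", "balancer", "-threshold", "5"], check=True)
```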
1
vote
0
answers
44
views
Spark cluster fails with NoSuchFileException on temporary connection files
I have a Python Celery application utilising Apache Spark for large-scale processing. Everything was going fine until today, when I received:
Exception in thread "main" java.nio.file....
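A common cause for vanished temp files under a long-lived worker (an assumption here; the exception is truncated) is a tmp cleaner such as systemd-tmpfiles purging Spark's scratch space under /tmp. A sketch that moves the scratch space elsewhere, with placeholder paths:

```python
# Sketch: keep Spark's scratch files out of /tmp so external tmp cleaners
# can't delete them under a long-running Celery worker.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("celery-spark")
         .config("spark.local.dir", "/var/lib/myapp/spark-tmp")
         .getOrCreate())
```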
0
votes
1
answer
104
views
Cannot read from S3 with AssumedRoleCredentialProvider after upgrading from EMR Serverless 6.9 to 7.5
I have a PySpark script that reads data from S3 in a different AWS account using AssumedRoleCredentialProvider. It works on EMR Serverless 6.9, but when I upgrade to EMR Serverless 7.5 it fails ...
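EMR Serverless 7.x moves S3A onto the AWS SDK v2, so any credential provider configured by an SDK v1 class name (com.amazonaws.auth.*) stops resolving; that is one common cause of exactly this upgrade break, though the truncated error may say otherwise. A sketch using Hadoop's own provider classes, with placeholder ARN and bucket:

```python
# Sketch: assumed-role read expressed with Hadoop's own provider classes,
# which exist on both SDK generations. ARN and bucket are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("assumed-role-read")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
    .config("spark.hadoop.fs.s3a.assumed.role.arn",
            "arn:aws:iam::123456789012:role/cross-account-read")
    # Base credentials used to call AssumeRole itself; must also be a class
    # that resolves against SDK v2 on EMR 7.x.
    .config("spark.hadoop.fs.s3a.assumed.role.credentials.provider",
            "org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider")
    .getOrCreate())

df = spark.read.parquet("s3a://cross-account-bucket/path/")
df.show(5)
```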