44,152 questions
Tooling
0
votes
0
replies
27
views
Locating official Hadoop documentation that explains that credentials are not shared between a backup and the production server being backed up
My boss, who mistakenly believes that the backup and the production server being backed up share credentials, will not give me access to the backup of some data I need for work. I have got people setting up ...
Advice
0
votes
5
replies
121
views
Java 17 for Hadoop and Java 24
I currently have Java 24 installed on my system and I use it for my personal projects. However, for my college work with Hadoop, I need to run it on Java 17. How can I set up Hadoop to use Java 17 ...
0
votes
0
answers
79
views
Teradata ETL view Migration from Hadoop
We have been using the TDCH approach to load data from Hadoop into Teradata, but we are now looking to load into a Teradata view from Hadoop CSV tables. I've tried a batch insert using TDCH, but that is failing as ...
1
vote
2
answers
118
views
Difference between org.apache.hadoop.io.compress.CompressionCodec and org.apache.spark.io.CompressionCodec
I want to use compression in big data processing, but there are two compression codecs.
Does anyone know the difference?
2
votes
1
answer
53
views
Can I update fs.s3a credentials in hadoop config on existing executors?
I have an application on EKS in AWS that runs a Spark session that can run multiple workloads. In each workload, I need to access data from S3 in another AWS account, for which I have STS ...
0
votes
0
answers
198
views
Pyspark error py4j.protocol.Py4JJavaError
I keep running into this issue when running PySpark.
I was able to connect to my database and retrieve data, but whenever I try to do operations like .show() or .count(), or when I try to save a Spark ...
0
votes
1
answer
171
views
Apache Hive Docker container: HiveServer2 fails to bind on port 10000 (Connection refused in Beeline)
I am running Apache Hive 4.0.0 inside Docker on Ubuntu 22.04.
The container starts, but HiveServer2 never binds to the port.
When I try to connect with Beeline:
sudo docker exec -it hive4 beeline -u ...
0
votes
3
answers
350
views
How to connect to S3 without the large AWS SDK v2 bundle?
I'm trying to read a file from S3 with PySpark 4.0.1 and the S3AFileSystem.
The standard configuration using hadoop-aws 3.4.1 works, but it requires the AWS SDK Bundle. This single dependency is ...
0
votes
0
answers
70
views
Data Migration query
I have a Hive table emp1 with 100 partitions in text format.
I want Spark to read the emp table partition by partition and write to EMP2 in Parquet format. How can I achieve 1) 10 Partition Read from ...
0
votes
1
answer
82
views
distcp creating file in GCP bucket instead of file inside directory
Context:
Using distcp, I am trying to copy an HDFS directory, including its files, to a GCP bucket.
I am using
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://$JCEKS_FILE hdfs://nameservice1/...
0
votes
0
answers
80
views
How to package a PySpark + Delta Lake script into an EXE with PyInstaller
I’m trying to convert my PySpark script into an executable (.exe) file using PyInstaller.
The script runs fine in Python, but after converting to an EXE and executing it, I get the following error:
'...
-1
votes
1
answer
186
views
Cannot expire snapshots with the retain_last property
I have 67 snapshots in a single table, but when I use CALL
iceberg_catalog.system.expire_snapshots(
table => 'iceberg_catalog.default.test_7',
retain_last => 5
);
it doesn't delete any snapshots. ...
1
vote
1
answer
48
views
Failed to find datanode (scope="" excludedScope="/rack0")
I am building a Hadoop cluster (version 3.3.6) with Docker Swarm. I have 3 machines: 1 runs the NameNode, and all of them run DataNodes. After everything started, I checked everything: the NameNode is healthy, the DataNode is healthy, ...
0
votes
2
answers
113
views
Spark unit test fails under Maven test but passes in IntelliJ
I'm working on a Scala project using Spark (with Hive support in some tests) and running unit and integration tests via both IntelliJ and Maven Surefire.
I have a shared test session setup like this:
...
0
votes
1
answer
162
views
Hive 4.0.1 doesn't work because JAR files are not found
Hive 4.0.1 doesn't work because JAR files are not found. I want to use Hive integrated with Hadoop 3.4.1 to query data on Apache Spark.
I tried running ./hive/bin/hive and expected it to return >...