44,206 questions
0
votes
0
answers
20
views
Can I update fs.s3a credentials in hadoop config on existing executors?
I have an application on EKS in AWS that runs a Spark session that can run multiple workloads. In each workload, I need to access data from S3 in another AWS account, for which I have STS ...
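A minimal sketch of the per-bucket route, with placeholder bucket and credential values: S3A reads credentials from the Hadoop configuration when it instantiates a FileSystem, and it caches those instances, so whether already-running executors pick up new values usually hinges on that cache.

```python
# Sketch: refresh per-bucket S3A session credentials on a live SparkSession.
# Bucket name and credential values are placeholders, not from the question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("refresh-s3a-creds").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

def set_bucket_credentials(bucket, access_key, secret_key, session_token):
    """Point one bucket at a fresh set of STS credentials."""
    prefix = f"fs.s3a.bucket.{bucket}."
    hconf.set(prefix + "aws.credentials.provider",
              "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    hconf.set(prefix + "access.key", access_key)
    hconf.set(prefix + "secret.key", secret_key)
    hconf.set(prefix + "session.token", session_token)

# S3A caches FileSystem instances; one created with the old credentials keeps
# them, so forcing recreation (at a connection-reuse cost) may be needed too:
hconf.set("fs.s3a.impl.disable.cache", "true")

set_bucket_credentials("other-account-bucket", "ASIA...", "...", "...")
df = spark.read.parquet("s3a://other-account-bucket/some/prefix/")
```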
0
votes
0
answers
105
views
PySpark error py4j.protocol.Py4JJavaError
I keep running into this issue when running PySpark.
I was able to connect to my database and retrieve data, but whenever I try to do operations like .show() or .count(), or when I try to save a Spark ...
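Worth noting for errors of this shape: Spark transformations are lazy, so .show() and .count() are simply the first points where the executors actually touch the database. A small sketch (JDBC URL, table, and driver are hypothetical) of where the work really happens:

```python
# Sketch: why a PySpark job can "connect fine" and then fail at .show().
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")  # placeholder URL
      .option("dbtable", "public.some_table")                # placeholder table
      .option("driver", "org.postgresql.Driver")
      .load())          # the driver fetches only the schema here; no rows yet

filtered = df.filter(df["id"] > 0)   # transformation: still lazy

# Executors first open their own JDBC connections here, so a driver jar
# missing from the executor classpath, or a database unreachable from the
# workers, surfaces as Py4JJavaError at the action rather than at .load().
filtered.show()
```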
1
vote
1
answer
80
views
Apache Hive Docker container: HiveServer2 fails to bind on port 10000 (Connection refused in Beeline)
I am running Apache Hive 4.0.0 inside Docker on Ubuntu 22.04.
The container starts, but HiveServer2 never binds to the port.
When I try to connect with Beeline:
sudo docker exec -it hive4 beeline -u ...
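"Connection refused" means nothing accepted the TCP connection at all, which separates "HiveServer2 never bound the port" (check the container logs) from Hive-level failures. A quick check, assuming the container publishes port 10000 to the host:

```python
# Sketch: verify whether anything is listening on HiveServer2's port.
# Host and port are assumptions (docker run ... -p 10000:10000).
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("hiveserver2 reachable:", port_open("localhost", 10000))
```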
0
votes
3
answers
145
views
How to connect to S3 without the large AWS SDK v2 bundle?
I'm trying to read some files from S3 with PySpark 4.0.1 and the S3AFileSystem.
The standard configuration using hadoop-aws 3.4.1 works, but it requires the AWS SDK Bundle. This single dependency is ...
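One direction people take, as a sketch under assumptions rather than a supported configuration (upstream only validates hadoop-aws against the full bundle), is to depend on the individual SDK v2 modules S3A actually uses:

```python
# Sketch: pull hadoop-aws plus individual AWS SDK v2 modules instead of the
# full aws-sdk-bundle. The artifact list and the 2.24.x version are
# assumptions; match whatever SDK version the hadoop-aws 3.4.1 POM declares.
from pyspark.sql import SparkSession

packages = ",".join([
    "org.apache.hadoop:hadoop-aws:3.4.1",
    "software.amazon.awssdk:s3:2.24.6",
    "software.amazon.awssdk:sts:2.24.6",                  # assumed roles / STS
    "software.amazon.awssdk:s3-transfer-manager:2.24.6",  # used for uploads
])

spark = (SparkSession.builder
         .appName("s3a-slim-deps")
         .config("spark.jars.packages", packages)
         .getOrCreate())

spark.read.text("s3a://some-bucket/some-key.txt").show(5)  # placeholder path
```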
0
votes
0
answers
68
views
Data Migration query
I have a Hive table emp1 with 100 partitions in Text format.
I want Spark to read the emp table partition by partition and write it to EMP2 in Parquet format. How to achieve 1) 10 partitions read from ...
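A minimal sketch of the batched migration, assuming a single partition column named dt (the real column name isn't in the excerpt); filtering on the partition column lets Spark prune the scan to just the 10 partitions per pass:

```python
# Sketch: migrate a partitioned Hive text table to Parquet, 10 partitions at
# a time. Assumes one partition column named dt; adjust to the real schema.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("emp1-to-emp2")
         .enableHiveSupport()
         .getOrCreate())

# SHOW PARTITIONS returns strings like "dt=2024-01-01".
parts = [r[0].split("=", 1)[1]
         for r in spark.sql("SHOW PARTITIONS emp1").collect()]

BATCH = 10  # 10 partitions per read/write pass, per the question
for i in range(0, len(parts), BATCH):
    batch = parts[i:i + BATCH]
    # Filtering on the partition column prunes the read to this batch only.
    df = spark.table("emp1").where(col("dt").isin(batch))
    (df.write
       .mode("append")
       .format("parquet")
       .partitionBy("dt")
       .saveAsTable("EMP2"))  # first pass creates EMP2, later passes append
```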
0
votes
1
answer
71
views
distcp creating a file in the GCP bucket instead of a file inside a directory
Context:
using distcp, I am trying to copy an HDFS directory, including its files, to a GCP bucket.
I am using
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://$JCEKS_FILE hdfs://nameservice1/...
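One likely culprit, inferred from the symptom: DistCp renames the source's contents onto the target path when the target does not exist, and only copies the source directory into the target when the target already exists. A sketch of guarding against that, with placeholder paths:

```python
# Sketch: ensure the destination directory exists so the source directory is
# copied *into* it instead of being flattened onto the target path.
import os
import subprocess

SRC = "hdfs://nameservice1/data/mydir"  # placeholder source
DST = "gs://my-bucket/backups"          # placeholder destination

jceks = os.environ["JCEKS_FILE"]        # the question's $JCEKS_FILE

# If DST doesn't exist, DistCp renames SRC's contents onto DST itself;
# if DST exists, SRC arrives as DST/mydir.
subprocess.run(["hadoop", "fs", "-mkdir", "-p", DST], check=True)

subprocess.run([
    "hadoop", "distcp",
    f"-Dhadoop.security.credential.provider.path=jceks://{jceks}",
    SRC, DST,
], check=True)
```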
0
votes
0
answers
76
views
How to package a PySpark + Delta Lake script into an EXE with PyInstaller
I’m trying to convert my PySpark script into an executable (.exe) file using PyInstaller.
The script runs fine in Python, but after converting it to an EXE and executing it, I get the following error:
'...
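The usual failure mode (an assumption here, since the error is truncated) is that PyInstaller doesn't collect pyspark's jars/ directory into the bundle. A sketch of a frozen-entry-point shim, assuming the bundle was built with --collect-all pyspark:

```python
# Sketch: runtime shim at the top of the frozen entry script. Assumes the
# pyspark package, including its jars/ directory, was collected into the
# bundle (e.g. pyinstaller --collect-all pyspark my_script.py).
import os
import sys

if getattr(sys, "frozen", False):
    # One-file bundles unpack to a temp dir exposed as sys._MEIPASS.
    bundle_dir = getattr(sys, "_MEIPASS", os.path.dirname(sys.executable))
    # Point Spark at the bundled pyspark install so it can find its jars.
    os.environ.setdefault("SPARK_HOME", os.path.join(bundle_dir, "pyspark"))

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("frozen-demo")
         .getOrCreate())
spark.range(5).show()
```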
0
votes
1
answer
114
views
Cannot expire snapshots with retain_last property
I have 67 snapshots in a single table, but when I use CALL
iceberg_catalog.system.expire_snapshots(
table => 'iceberg_catalog.default.test_7',
retain_last => 5
);
it doesn't delete any snapshots. ...
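This is usually behaviour, not a bug: retain_last is only a floor on how many snapshots to keep, while the procedure deletes snapshots older than older_than, which defaults to five days ago. With 67 recent snapshots, nothing qualifies. Passing an explicit older_than (the timestamp below is illustrative) makes retain_last bite:

```python
# Sketch: expire everything except the newest five snapshots by moving the
# older_than cutoff forward; catalog and table names are from the question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("expire-snapshots").getOrCreate()

spark.sql("""
    CALL iceberg_catalog.system.expire_snapshots(
        table       => 'iceberg_catalog.default.test_7',
        older_than  => TIMESTAMP '2030-01-01 00:00:00',  -- illustrative cutoff
        retain_last => 5
    )
""").show()
```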
1
vote
1
answer
43
views
Failed to find datanode (scope="" excludedScope="/rack0")
I built a Hadoop cluster (version 3.3.6) with Docker Swarm. I have 3 machines: one runs the namenode, and all three run datanodes. After everything started I checked, and the namenode is healthy, the datanodes are healthy, ...
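The excludedScope="/rack0" part hints that every datanode landed in the same rack, so the placement policy had no second rack to choose from. If a topology script is in play, mapping nodes to distinct racks is one way out; a sketch of a net.topology.script.file.name script with placeholder addresses:

```python
#!/usr/bin/env python3
# Sketch: a rack-awareness script for net.topology.script.file.name. Hadoop
# calls it with datanode IPs/hostnames as arguments and expects one rack path
# per argument on stdout. Addresses below are placeholders.
import sys

RACKS = {
    "10.0.1.11": "/rack0",
    "10.0.1.12": "/rack1",
    "10.0.1.13": "/rack2",
}

print(" ".join(RACKS.get(arg, "/default-rack") for arg in sys.argv[1:]))
```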
0
votes
0
answers
116
views
How to prevent expired tokens with the AWS S3 credentials provider?
I am building a Spring Boot 3.2.5 application that retrieves data from Parquet files on an AWS S3 bucket. This data is then converted into CSV and loaded into a Postgres database.
This operation works ...
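The question is about Spring Boot, but the refresh pattern is SDK-agnostic: re-assume the role shortly before the temporary credentials lapse instead of holding one session for the whole export. A sketch of the idea (Python stands in for the Java SDK here; role ARN and names are placeholders):

```python
# Sketch: refresh assumed-role credentials when they are close to expiry.
import datetime
import boto3

ROLE_ARN = "arn:aws:iam::123456789012:role/parquet-reader"  # hypothetical
_sts = boto3.client("sts")
_creds = None

def s3_client():
    """Return an S3 client, re-assuming the role if the current credentials
    expire within five minutes."""
    global _creds
    now = datetime.datetime.now(datetime.timezone.utc)
    if _creds is None or _creds["Expiration"] - now < datetime.timedelta(minutes=5):
        _creds = _sts.assume_role(
            RoleArn=ROLE_ARN,
            RoleSessionName="parquet-to-csv",
            DurationSeconds=3600,
        )["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=_creds["AccessKeyId"],
        aws_secret_access_key=_creds["SecretAccessKey"],
        aws_session_token=_creds["SessionToken"],
    )
```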
0
votes
2
answers
105
views
Spark unit tests fail under Maven but pass in IntelliJ
I'm working on a Scala project using Spark (with Hive support in some tests) and running unit and integration tests via both IntelliJ and Maven Surefire.
I have a shared test session setup like this:
...
0
votes
1
answer
133
views
Hive 4.0.1 doesn't work because JAR files are not found
Hive 4.0.1 doesn't work because JAR files are not found. I want to use Hive integrated with Hadoop 3.4.1 to query data on Apache Spark.
I tried typing ./hive/bin/hive and expected it to return >...
0
votes
0
answers
18
views
Hadoop: upload data using the balancer to evenly distribute it across all nodes
I have a 3-node Hadoop cluster (version 3.4.1) with JAVA_HOME pointing to Java 8 on each node.
I want to evenly distribute the uploaded data across all nodes when I type the following:
hdfs ...
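Two points usually matter here (assumptions from the symptom, since the command is truncated): an HDFS client running on a datanode always places the first replica locally, so uploading from one node skews that node; uploading from a machine outside the cluster spreads writes, and the balancer evens things out after the fact:

```python
# Sketch: rebalance after upload. The threshold is the allowed deviation, in
# percentage points, from mean cluster utilisation (default 10).
import subprocess

subprocess.run(["hdfs", "balancer", "-threshold", "5"], check=True)
```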
1
vote
0
answers
44
views
Spark cluster fails with NoSuchFileException on temporary connection files
I have a Python Celery application utilising Apache Spark for large-scale processing. Everything was going fine until today, when I received:
Exception in thread "main" java.nio.file....
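A common cause for vanished temp files under a long-lived worker (an assumption here; the exception is truncated) is a tmp cleaner such as systemd-tmpfiles purging Spark's scratch space under /tmp. A sketch that moves the scratch space elsewhere, with placeholder paths:

```python
# Sketch: keep Spark's scratch files out of /tmp so external tmp cleaners
# can't delete them under a long-running Celery worker.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("celery-spark")
         .config("spark.local.dir", "/var/lib/myapp/spark-tmp")
         .getOrCreate())
```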
0
votes
1
answer
104
views
Cannot read from S3 with AssumedRoleCredentialProvider after upgrading from EMR Serverless 6.9 to 7.5
I have a PySpark script that reads data from S3 in a different AWS account using AssumedRoleCredentialProvider. It works on EMR Serverless 6.9, but when I upgrade to EMR Serverless 7.5 it fails ...
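EMR Serverless 7.x moves S3A onto the AWS SDK v2, so any credential provider configured by an SDK v1 class name (com.amazonaws.auth.*) stops resolving; that is one common cause of exactly this upgrade break, though the truncated error may say otherwise. A sketch using Hadoop's own provider classes, with placeholder ARN and bucket:

```python
# Sketch: assumed-role read expressed with Hadoop's own provider classes,
# which exist on both SDK generations. ARN and bucket are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("assumed-role-read")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
    .config("spark.hadoop.fs.s3a.assumed.role.arn",
            "arn:aws:iam::123456789012:role/cross-account-read")
    # Base credentials used to call AssumeRole itself; must also be a class
    # that resolves against SDK v2 on EMR 7.x.
    .config("spark.hadoop.fs.s3a.assumed.role.credentials.provider",
            "org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider")
    .getOrCreate())

df = spark.read.parquet("s3a://cross-account-bucket/path/")
df.show(5)
```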