0 votes
0 answers
20 views

I have an application using EKS in AWS that runs a spark session that can run multiple workloads. In each workload, I need to access data from S3 in another AWS account, for which I have STS ...
asked by md12345
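One common way to wire up the cross-account access described above (a sketch, not necessarily the asker's setup): S3A ships an `AssumedRoleCredentialProvider` that calls STS for you, driven entirely by Hadoop configuration properties. The role ARN and the inner credentials provider below are illustrative assumptions; on EKS the inner provider is often the web-identity (IRSA) one.

```python
# Sketch: Spark/Hadoop S3A properties for reading S3 in another AWS account
# by assuming a role through STS. The role ARN is a placeholder, and the
# inner credentials provider (used for the sts:AssumeRole call itself) is
# an assumption.
CROSS_ACCOUNT_ROLE = "arn:aws:iam::222222222222:role/S3ReadRole"  # hypothetical

def s3a_assume_role_conf(role_arn: str) -> dict:
    """Return Spark conf entries that make S3A assume `role_arn` via STS."""
    return {
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
        "spark.hadoop.fs.s3a.assumed.role.arn": role_arn,
        "spark.hadoop.fs.s3a.assumed.role.credentials.provider":
            "com.amazonaws.auth.WebIdentityTokenCredentialsProvider",
    }

conf = s3a_assume_role_conf(CROSS_ACCOUNT_ROLE)
# Each entry would be applied via SparkSession.builder.config(key, value).
```
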
0 votes
0 answers
105 views

I keep running into this issue when running PySpark. I was able to connect to my database and retrieve data, but whenever I try to do operations like .show() or .count(), or when I try to save a Spark ...
asked by Siva Indukuri
1 vote
1 answer
80 views

I am running Apache Hive 4.0.0 inside Docker on Ubuntu 22.04. The container starts, but HiveServer2 never binds to the port. When I try to connect with Beeline: sudo docker exec -it hive4 beeline -u ...
asked by user31562336
0 votes
3 answers
145 views

I'm trying to read some file from S3 with PySpark 4.0.1 and the S3AFileSystem. The standard configuration using hadoop-aws 3.4.1 works, but it requires the AWS SDK Bundle. This single dependency is ...
asked by RobinFrcd (5,714)
0 votes
0 answers
68 views

I have a Hive table emp1 with 100 partitions in text format. I want Spark to read the emp1 table partition by partition and write to EMP2 in Parquet format. How to achieve: 1) a 10-partition read from ...
asked by Rishabh Joshi
0 votes
1 answer
71 views

Context: using distcp, I am trying to copy an HDFS directory, including its files, to a GCP bucket. I am using hadoop distcp -Dhadoop.security.credential.provider.path=jceks://$JCEKS_FILE hdfs://nameservice1/...
asked by Jhon (49)
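Keeping with Python for the examples on this page, the distcp invocation from the excerpt can be sketched as an argv list; note that `-D` options must come before the positional source/destination paths. The JCEKS URI and both paths below are placeholders, since the real ones are elided in the excerpt.

```python
# Sketch: assembling the `hadoop distcp` command line from the question.
# The jceks:// URI and the source/destination URIs are hypothetical.
def distcp_cmd(jceks_uri: str, src: str, dest: str) -> list:
    """Build the distcp argv; -D options must precede the path arguments."""
    return [
        "hadoop", "distcp",
        "-Dhadoop.security.credential.provider.path=" + jceks_uri,
        src,
        dest,
    ]

cmd = distcp_cmd("jceks://hdfs/user/etl/gcs.jceks",   # hypothetical
                 "hdfs://nameservice1/data/export",   # hypothetical
                 "gs://my-bucket/data/export")        # hypothetical
# Would be launched with subprocess.run(cmd, check=True).
```
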
0 votes
0 answers
76 views

I’m trying to convert my PySpark script into an executable (.exe) file using PyInstaller. The script runs fine in Python, but after converting it to an EXE and executing it, I get the following error: '...
asked by userr (11)
0 votes
1 answer
114 views

I have 67 snapshots in a single table, but when I use CALL iceberg_catalog.system.expire_snapshots( table => 'iceberg_catalog.default.test_7', retain_last => 5 ); it doesn't delete any snapshots. ...
asked by Sơn Bùi
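A likely explanation for the behavior above, sketched hedgedly: Iceberg's expire_snapshots only removes snapshots *older than* the `older_than` timestamp, which defaults to roughly 5 days before now; `retain_last` only raises the retention floor and never forces deletion of newer snapshots. Passing an explicit `older_than` is one way to expire them. The timestamp below is a placeholder.

```python
# Sketch: building the CALL statement with an explicit older_than, since
# retain_last alone does not expire snapshots newer than the default
# 5-day threshold. Table name follows the question; timestamp is a
# placeholder.
def expire_snapshots_sql(table: str, older_than: str, retain_last: int) -> str:
    """Build the CALL statement for Spark's Iceberg expire_snapshots procedure."""
    return (
        "CALL iceberg_catalog.system.expire_snapshots("
        "table => '{t}', "
        "older_than => TIMESTAMP '{ts}', "
        "retain_last => {n})"
    ).format(t=table, ts=older_than, n=retain_last)

sql = expire_snapshots_sql("iceberg_catalog.default.test_7",
                           "2025-06-01 00:00:00", 5)
# Would be executed as spark.sql(sql).
```
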
1 vote
1 answer
43 views

I am building a Hadoop cluster (version 3.3.6) with Docker Swarm. I have 3 machines: 1 runs the namenode, and all 3 run datanodes. After everything starts, I checked everything: the namenode is healthy, the datanodes are healthy, ...
asked by jcyan (71)
0 votes
0 answers
116 views

I am building a Spring Boot 3.2.5 application that retrieves data from Parquet files on an AWS S3 bucket. This data is then converted into CSV and loaded into a Postgres database. This operation works ...
asked by Timbuck (423)
0 votes
2 answers
105 views

I'm working on a Scala project using Spark (with Hive support in some tests) and running unit and integration tests via both IntelliJ and Maven Surefire. I have a shared test session setup like this: ...
asked by M06H (1,813)
0 votes
1 answer
133 views

Hive 4.0.1 doesn't work because JAR files are not found. I want to use Hive integrated with Hadoop 3.4.1 to query data on Apache Spark. I typed ./hive/bin/hive and expected it to return >...
asked by vinhdiesal
0 votes
0 answers
18 views

I have a 3-node Hadoop cluster (version 3.4.1) with JAVA_HOME pointing to version 8 on each node. I want to evenly distribute the uploaded data across all nodes when I type the following: hdfs ...
asked by vinhdiesal
1 vote
0 answers
44 views

I have a Python celery application utilising Apache Spark for large-scale processing. Everything was going fine until today, when I received: Exception in thread "main" java.nio.file....
asked by digital_monk
0 votes
1 answer
104 views

I have a PySpark script that reads data from S3 in a different AWS account using AssumedRoleCredentialProvider. It works on EMR Serverless 6.9, but when I upgrade to EMR Serverless 7.5 it fails ...
asked by Sayed (11)
