0 votes
0 answers
20 views

Write partitioned col in s3 file too

I’m writing to a Glue table that has country and state as partition columns. But if I read directly from the S3 bucket (the base of the Athena table), I don’t see these partition columns (country ...
Ashish Jangra
0 votes
0 answers
29 views

AWS MSK and Glue integration for processing batches of messages

I need to batch-process the messages of a low-throughput topic, so ideally, instead of a service running 24/7, the job would run a few times a day. The topic is in an MSK cluster ...
Francisco Chacón Rubio
0 votes
0 answers
42 views

Unable to use pyarrow optimization in AWS Glue

In my AWS Glue job (Glue 4.0, which supports Spark 3.3), I am trying to optimize by using this: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true") but it gives me a warning /...
Karim Baig
-1 votes
1 answer
39 views

Data Warehousing with Spark and Redshift [closed]

A question to those who have done some data warehousing where Spark/Glue is the transformation engine, bonus if the data warehouse is Redshift. This is my first time putting a data warehouse in place, ...
Eya
  • 11
1 vote
1 answer
38 views

Duplicate Records in Parquet (Processed) Table after AWS Glue Job execution

We have an AWS Glue pipeline where: A crawler populates a raw database table from partitioned JSON files in S3. S3 structure:
raw/
├── org=21/
│   └── 221.json
└── org=23/
    └── 654.json
...
Max Manitskov
1 vote
1 answer
81 views

Read incremental data from iceberg tables using Spark SQL

I am trying to read incremental data between two snapshots. I have the last processed snapshot (my day0 load), and below is my code snippet to read incremental data: incremental_df = spark.read.format("...
Abhi5421
0 votes
0 answers
44 views

Unable to configure the exact number of DPUs for the Glue Pyspark job

I have 20 million records, which comprise around 1.5 to 10 GB, as per the information I received. I can't access the source system to get the exact size of this table. I am just reading it from the ...
RushHour
  • 613
1 vote
1 answer
27 views

Are the Data Catalog and a Crawler mandatory for Glue?

I am reading about the use of AWS Glue for ETL. https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html In Data Discovery and cataloging, AWS talks about creating a Crawler for Data cataloging. ...
Kul
  • 121
-1 votes
0 answers
19 views

Unable to create JDBC source connection with AWS Glue

There seems to be little to no documentation (or at least I can't find any meaningful guides) that can help me establish a successful connection with a MySQL source. Either getting this VPC endpoint or ...
Hassaan Ali
0 votes
0 answers
50 views

Does Glue connect to SQL Server?

I haven't found a single tutorial that shows how to connect Glue to a SQL Server or Azure DB instance, so that's why I'm here. I'm having issues connecting AWS Glue to a SQL Server instance in a ...
fdkgfosfskjdlsjdlkfsf
0 votes
0 answers
21 views

aws list_findings parameters changed in request

I am currently using boto3 list findings to return all findings for various AWS accounts. I am sporadically getting the following error (Service: MandoFindings, Status Code: 400,) Pagination token ...
em456
  • 443
0 votes
1 answer
38 views

Adding jar files to AWS Glue script versus notebook

I've noticed something odd in AWS Glue: when you create a Spark notebook and pass some magic commands to set up the notebook, it has no problems. For example: %idle_timeout 2880 %worker_type G.1X %...
geo_coder
  • 753
0 votes
1 answer
20 views

AWS Athena is not processing any data from glue table if partition projection is enabled

I have a Glue table that is fed by partitioned data in S3. The issue is in Athena: if partition projection is turned off and I run MSCK REPAIR TABLE <my table>; and SELECT * ...
Raisin
  • 21
1 vote
0 answers
21 views

PyIceberg with AWS Glue Creates Unwanted Nested Directories in S3 Tables

I'm using PyIceberg with AWS Glue REST catalog to insert data into an Iceberg table stored in S3. The data insertion works fine, but I noticed that PyIceberg creates unwanted nested directories in S3 ...
Tharanesh Balaji
0 votes
0 answers
52 views

AWS Glue 5.0 "Installation of Python modules timed out after 10 minutes"

I have an AWS Glue 5.0 job where I am specifying --additional-python-modules s3://my-dev/other-dependencies /MyPackage-0.1.1-py3-none-any.whl in my job options. My Glue job itself is just a print("...
Martin
  • 1,598
