Newest 'aws-glue' Questions

0 votes

0 answers

20 views

Write partitioned col in s3 file too

I’m writing to a Glue table that has country and state as partition columns. But if I read directly from the S3 bucket (the base of the Athena table), I’m not seeing these partition columns (country ...

Ashish Jangra

35

asked 12 hours ago

0 votes

0 answers

29 views

AWS MSK and Glue integration for processing batches of messages

I need to batch-process the messages of a low-throughput topic, so ideally, instead of a service running 24/7, the job is executed a few times a day. The topic is in an MSK cluster ...

Francisco Chacón Rubio

11

asked yesterday

0 votes

0 answers

42 views

Unable to use pyarrow optimization in AWS Glue

In my AWS Glue job (4.0, which supports Spark 3.3), I am trying to optimize by using this: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true") but it gives me a warning /...

Karim Baig

1

asked 2 days ago

-1 votes

1 answer

39 views

Data Warehousing with Spark and Redshift [closed]

A question to those who have done some data warehousing where Spark/Glue is the transformation engine, bonus if the data warehouse is Redshift. This is my first time putting a data warehouse in place, ...

Eya

11

asked Apr 18 at 8:29

1 vote

1 answer

38 views

Duplicate Records in Parquet (Processed) Table after AWS Glue Job execution

We have an AWS Glue pipeline where a crawler populates a raw database table from partitioned JSON files in S3. S3 structure:
raw/
├── org=21/
│   └── 221.json
└── org=23/
    └── 654.json
...

Max Manitskov

11

asked Apr 16 at 15:53

1 vote

1 answer

81 views

Read incremental data from iceberg tables using Spark SQL

I am trying to read incremental data between two snapshots. I have the last processed snapshot (my day0 load), and below is my code snippet to read incremental data: incremental_df = spark.read.format("...

Abhi5421

23

asked Apr 16 at 8:26

0 votes

0 answers

44 views

Unable to configure the exact number of DPUs for the Glue Pyspark job

I have 20 million records, which comprise around 1.5 to 10 GB, as per the information I received. I can't access the source system to get the exact size of this table. I am just reading it from the ...

RushHour

613

asked Apr 15 at 7:42

1 vote

1 answer

27 views

Are the Data Catalog and Crawler mandatory for Glue?

I am reading about the use of AWS Glue for ETL. https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html In Data Discovery and cataloging, AWS talks about creating a Crawler for Data cataloging. ...

Kul

121

asked Apr 14 at 3:42

-1 votes

0 answers

19 views

Unable to create JDBC source connection with AWS Glue

There seems to be little to no documentation (or at least I can't find any meaningful guides) that can help me establish a successful connection with a MySQL source. Either getting this VPC endpoint or ...

Hassaan Ali

9

asked Apr 13 at 20:59

0 votes

0 answers

50 views

Does Glue connect to SQL Server?

I haven't found a single tutorial that shows how to connect Glue to a SQL Server or Azure DB instance, so that's why I'm here. I'm having issues connecting AWS Glue to a SQL Server instance in a ...

fdkgfosfskjdlsjdlkfsf

3,311

asked Apr 10 at 21:06

0 votes

0 answers

21 views

aws list_findings parameters changed in request

I am currently using boto3 list findings to return all findings for various AWS accounts. I am sporadically getting the following error: (Service: MandoFindings, Status Code: 400,) Pagination token ...

em456

443

asked Apr 10 at 9:12

0 votes

1 answer

38 views

Adding jar files to AWS Glue script versus notebook

I've noticed something odd in AWS Glue: when you create a Spark notebook and pass some magic commands to set up the notebook, it has no problems. For example: %idle_timeout 2880 %worker_type G.1X %...

geo_coder

753

asked Apr 4 at 19:13

0 votes

1 answer

20 views

AWS Athena is not processing any data from glue table if partition projection is enabled

I have a Glue table that is fed by partitioned data in S3. The issue is in Athena: if partition projection is turned off, and I run MSCK REPAIR TABLE <my table>; and SELECT * ...

Raisin

21

asked Apr 3 at 12:43

1 vote

0 answers

21 views

PyIceberg with AWS Glue Creates Unwanted Nested Directories in S3 Tables

I'm using PyIceberg with AWS Glue REST catalog to insert data into an Iceberg table stored in S3. The data insertion works fine, but I noticed that PyIceberg creates unwanted nested directories in S3 ...

Tharanesh Balaji

11

asked Apr 2 at 5:32

0 votes

0 answers

52 views

AWS Glue 5.0 "Installation of Python modules timed out after 10 minutes"

I have an AWS Glue 5.0 job where I am specifying --additional-python-modules s3://my-dev/other-dependencies /MyPackage-0.1.1-py3-none-any.whl in my job options. My glue job itself is just a print(&...

Martin

1,598

asked Mar 27 at 19:27
