4,249 questions
0
votes
0
answers
20
views
Write partitioned col in s3 file too
I’m writing to a Glue table that has country and state as partition columns.
But if I read directly from the S3 bucket (the base of the Athena table), I’m not seeing these partition columns (country ...
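With Hive-style partitioned writes, the partition values are normally encoded in the object key (for example `country=US/state=CA/`) rather than stored inside each data file, which would explain why they are missing when the files are read directly. The sketch below assumes that layout and recovers the values from a key. The function name `partition_values` and the sample key are hypothetical, not taken from the question.

```python
from urllib.parse import unquote


def partition_values(s3_key: str) -> dict:
    """Return the Hive-style partition values encoded in an object key."""
    values = {}
    # Every path segment except the file name may be a `name=value` pair.
    for segment in s3_key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name:
            # Partition values in paths may be percent-encoded.
            values[name] = unquote(value)
    return values


# Hypothetical key; the prefix and file name are made up for illustration.
key = "sales/country=US/state=CA/part-00000.parquet"
print(partition_values(key))
```

Another option, if the columns must exist inside the files themselves, is to write a duplicate of each partition column under a different name before partitioning; whether that fits depends on the rest of the job, which the excerpt does not show.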
0
votes
0
answers
29
views
AWS MSK and Glue integration for processing batches of messages
I need to batch-process the messages of a low-throughput topic, so ideally, instead of a service running 24/7, a job would run a few times a day. The topic is in an MSK cluster ...
0
votes
0
answers
42
views
Unable to use pyarrow optimization in AWS Glue
In my AWS Glue job (Glue 4.0, which supports Spark 3.3), I am trying to optimize by setting this:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
but it gives me a warning
/...
-1
votes
1
answer
39
views
Data Warehousing with Spark and Redshift [closed]
A question to those who have done some data warehousing where Spark/Glue is the transformation engine, bonus if the data warehouse is Redshift.
This is my first time putting a data warehouse in place, ...
1
vote
1
answer
38
views
Duplicate Records in Parquet (Processed) Table after AWS Glue Job execution
We have an AWS Glue pipeline where:
A crawler populates a raw database table from partitioned JSON files in S3.
S3 structure:
raw/
├── org=21/
│ └── 221.json
└── org=23/
└── 654.json
...
1
vote
1
answer
81
views
Read incremental data from iceberg tables using Spark SQL
I am trying to read incremental data between two snapshots.
I have the last processed snapshot (my day-0 load), and below is my code snippet to read incremental data:
incremental_df = spark.read.format("...
0
votes
0
answers
44
views
Unable to configure the exact number of DPUs for the Glue Pyspark job
I have 20 million records, which comprise around 1.5 to 10 GB, as per the information I received. I can't access the source system to get the exact size of this table. I am just reading it from the ...
1
vote
1
answer
27
views
Are the Data Catalog and a Crawler mandatory for Glue?
I am reading about the use of AWS Glue for ETL.
https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
In Data Discovery and cataloging, AWS talks about creating a Crawler for Data cataloging.
...
-1
votes
0
answers
19
views
Unable to create JDBC source connection with AWS Glue
There seems to be little to no documentation (or at least I can't find any meaningful guides) that can help me establish a successful connection with a MySQL source.
Either getting this VPC endpoint or ...
0
votes
0
answers
50
views
Does Glue connect to SQL Server?
I haven't found a single tutorial that shows how to connect Glue to a SQL Server or Azure DB instance, so that's why I'm here.
I'm having issues connecting AWS Glue to a SQL Server instance in a ...
0
votes
0
answers
21
views
aws list_findings parameters changed in request
I am currently using boto3 list_findings to return all findings for various AWS accounts.
I am getting the following error sporadically:
(Service: MandoFindings, Status Code: 400,) Pagination token ...
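The error text above is truncated, but the title suggests the service rejected a pagination token because the request parameters differed from those of the call that issued the token. A minimal sketch of one way to avoid that, assuming the response uses `findings` and `nextToken` keys (an assumption; key names vary between AWS services): snapshot the parameters once and reuse them unchanged on every page. `list_all_findings` and `fake_list_page` are hypothetical names used only for illustration.

```python
def list_all_findings(list_page, **request_params):
    """Collect every finding, sending identical parameters on each page."""
    frozen = dict(request_params)  # snapshot taken once, never mutated
    findings, token = [], None
    while True:
        page_args = dict(frozen)
        if token is not None:
            # Only the token changes between calls (assumed key name).
            page_args["nextToken"] = token
        response = list_page(**page_args)
        findings.extend(response.get("findings", []))
        token = response.get("nextToken")
        if not token:
            return findings


def fake_list_page(nextToken=None, **filters):
    """Stand-in for a real client call, serving two fixed pages."""
    pages = {None: (["f1", "f2"], "t1"), "t1": (["f3"], None)}
    items, token = pages[nextToken]
    response = {"findings": items}
    if token:
        response["nextToken"] = token
    return response


print(list_all_findings(fake_list_page, maxResults=2))
```

With a real boto3 client, the bound method (for example `client.list_findings`) would be passed in place of `fake_list_page`. If filters are built from values such as the current time, computing them once before the first call keeps them stable across pages.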
0
votes
1
answer
38
views
Adding jar files to AWS Glue script versus notebook
I've noticed something odd in AWS Glue: when you create a Spark notebook and pass some magic commands to set it up, it has no problems.
For example:
%idle_timeout 2880
%worker_type G.1X
%...
0
votes
1
answer
20
views
AWS Athena is not processing any data from glue table if partition projection is enabled
I have a Glue table that is fed by partitioned data in S3. The issue is in Athena: if partition projection is turned off, and I run MSCK REPAIR TABLE <my table>; and SELECT * ...
1
vote
0
answers
21
views
PyIceberg with AWS Glue Creates Unwanted Nested Directories in S3 Tables
I'm using PyIceberg with AWS Glue REST catalog to insert data into an Iceberg table stored in S3. The data insertion works fine, but I noticed that PyIceberg creates unwanted nested directories in S3 ...
0
votes
0
answers
52
views
AWS Glue 5.0 "Installation of Python modules timed out after 10 minutes"
I have an AWS Glue 5.0 job where I am specifying --additional-python-modules s3://my-dev/other-dependencies/MyPackage-0.1.1-py3-none-any.whl in my job options.
My Glue job itself is just a print("...