4,870 questions
0
votes
0
answers
11
views
Can I output Salesforce object data as csv to S3 bucket using AWS Glue zero ETL?
I've been looking at better ways to extract Salesforce data for our organization and found the announcement that AWS Glue zero ETL now uses the Salesforce Bulk API, and the performance results sound ...
Advice
0
votes
0
replies
19
views
Applying a Single AWS Glue Data Quality Ruleset to Multiple Glue Jobs with Dynamic Column Input
Team,
We are implementing a new requirement to integrate Data Quality (DQ) rules within AWS Glue Studio. We have successfully created DQ rules using the DQDL builder, leveraging built-in rulesets, and ...
3
votes
0
answers
87
views
How to convert epoch to datetime in Datadog dashboard?
I have a Datadog dashboard displaying the metrics we get for our AWS Glue Zero-ETL integrations. One of those is lastSyncTimestamp, the epoch timestamp up to which the source has been synced to the target.
I ...
0
votes
0
answers
50
views
Is it possible to update script section for AWS Glue ETL or Glue streaming Jobs using AWS CLI?
Version my python script for each change and push to S3 with new version
aws s3 cp aws_glue_script_v1.0.3_1.py s3://mytestcicdglue/glue-scripts/aws_glue_script_v1.0.3_1.py
I have skeleton json of ...
1
vote
0
answers
32
views
Version control Athena queries
I'm running a data pipeline through Glue notebooks that references Athena saved queries and runs them sequentially. The pipeline is working well, but there is no version control for the Athena queries. ...
0
votes
0
answers
43
views
AWS Glue and IAM conditional access
How to write an AWS IAM Policy document such that it does the following:
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "ec2:...
0
votes
0
answers
24
views
Unable to connect my S3 bucket to my data source for my AWS glue ETL job
I am trying to start an ETL job in the AWS Glue visual editor, which is fairly intuitive. However, as my very first step, I wanted to connect to my S3 bucket as my data source. So my first step was to ...
0
votes
0
answers
115
views
How to configure AWS Glue to trust custom SSL certificate for SAP OData connection?
Code I’m running:
connection_type="sapodata",
connection_options={
"ENABLE_CDC": "false",
"connectionName": "sapodata-connection...
1
vote
1
answer
73
views
How to avoid full table scan in Glue's create_dynamic_frame.from_options for dynamodb
My DynamoDB table has both a PK and an SK, and it holds a huge data set (500 GB).
I'm using the syntax below to query data by PK in Glue, but it does a full table scan, which leads to a Glue timeout. Have ...
0
votes
0
answers
147
views
Athena is appending UTC to Iceberg timestamp results. Why, and how to fix it?
I am storing a simple datetime value (e.g. 2025-01-24 13:58:14.000) from SQL to an Iceberg table using the Glue catalog. I don't want anything with timezones. We only work in EST so all our datetimes don't ...
0
votes
1
answer
49
views
Unable to register database/table in aws glue when hudi job is submitted from emrserverless
I am using EMR 6.15 and Hudi 0.14.
I submitted the following Hudi job, which should create a database and a table in AWS Glue. The IAM role assigned to EMR Serverless has all necessary permissions for S3 and ...
0
votes
0
answers
48
views
Using AWS Glue, how can I process different file types in a folder to their own Glue table
I am new to AWS Glue, Apache Spark and all things big data.
I have files being delivered to S3 with the following structure.
s3://raw-data/dd-mm-yyyy/<source>/<product>/<reportType>/[...
1
vote
0
answers
43
views
Manually add Oracle Procedures as a 'data job' nodes in DataHub lineage models
We're trialling DataHub for the first time, and have used AWS Glue Data Catalog to connect to our Oracle database, and then connected DataHub to our Glue Data Catalog to pull the table/column metadata ...
0
votes
0
answers
56
views
Return value from Glue job to Xcom
Is there any way to return a value from a Glue ETL job to the Airflow task (XCom) that triggers that Glue job?
Thanks
0
votes
0
answers
22
views
Glue Job Fails When Exporting from DocumentDB to Azure Blob Storage Due to Mongo Spark Connector Schema Inference
I'm using AWS Glue 4.0 to export data from AWS DocumentDB to Azure Blob Storage. The job is written in PySpark and uses the MongoDB Spark Connector. Below are the jars added to the Glue job:
mongo-...
-1
votes
1
answer
214
views
AnalysisException: This Delta operation requires the SparkSession to be configured [closed]
I have a PySpark script using Glue 4.0 which reads Parquet and writes to Delta Lake. It works well.
Here is my PySpark script:
import logging
import os
import sys
from awsglue.context import GlueContext
...
0
votes
1
answer
44
views
duplicate removal from grouped and merged data frame fails generating duplicates in final JSON
I have two dataframes as below:
DataFrame 1: df1
UniqueId  VendorId  Fname  LName  VendorAccNo
001       12        ABC    XYZ    8787888
002       13        XYZ    FFF    8787888
003       14        PQR    ZZZ    8787888
005       16        MMM    TTT    5432100
006       17        BBB    XXX    ...
0
votes
1
answer
44
views
Data transformation in AWS
I have two years of IoT telemetry data in an S3 bucket (JSON format). I want to transform it with Glue, as described below, and write it to another S3 bucket in the data lake.
The structure is: year, month, day, hour, ...
0
votes
1
answer
62
views
Write partitioned col in s3 file too
I’m writing to a Glue table that has country and state as partition columns.
But if I read directly from the S3 bucket (the base of the Athena table), I’m not seeing these partition columns (country ...
0
votes
0
answers
81
views
Unable to use pyarrow optimization in AWS Glue
In my AWS Glue job (Glue 4.0, which supports Spark 3.3), I am trying to optimize by using this:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
but it gives me a warning
/...
2
votes
1
answer
653
views
Read incremental data from iceberg tables using Spark SQL
I am trying to read incremental data between two snapshots.
I have the last processed snapshot (my day0 load), and below is my code snippet to read incremental data:
incremental_df = spark.read.format("...
1
vote
1
answer
91
views
Are the Data Catalog and a Crawler mandatory for Glue?
I am reading about the use of AWS Glue for ETL.
https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
In Data Discovery and cataloging, AWS talks about creating a Crawler for Data cataloging.
...
0
votes
0
answers
29
views
aws list_findings parameters changed in request
I am currently using boto3 list_findings to return all findings for various AWS accounts.
I am getting the following error sporadically:
(Service: MandoFindings, Status Code: 400,) Pagination token ...
0
votes
1
answer
124
views
AWS Athena is not processing any data from glue table if partition projection is enabled
I have a Glue table that is fed by partitioned data in S3. The issue is in Athena: if partition projection is turned off and I run MSCK REPAIR TABLE <my table>; and SELECT * ...
0
votes
0
answers
309
views
AWS Glue 5.0 "Installation of Python modules timed out after 10 minutes"
I have an AWS Glue 5.0 job where I am specifying --additional-python-modules s3://my-dev/other-dependencies /MyPackage-0.1.1-py3-none-any.whl in my job options.
My Glue job itself is just a print(...