All Questions
327 questions
0
votes
0
answers
17
views
Disable auto scaling for templated jobs
In Dataflow, you can run jobs without autoscaling. This is typically achieved by setting a pipeline_option called autoscaling_algorithm to NONE. Attempting the equivalent on Templated Dataflow Jobs ...
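As background, a minimal sketch of how that pipeline option is typically set on a regular (non-templated) Python job; project, region, and bucket are placeholders:
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: disable autoscaling on a non-templated Dataflow job.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
    autoscaling_algorithm="NONE",        # the option the question refers to
    num_workers=3,                       # fixed worker count with autoscaling off
)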
0
votes
0
answers
33
views
Apache Beam Dataflow logging formatter
I have a pipeline that uses some of my modules, and in those modules I use logging. The problem is with logging: I use log level overrides to set the log level, but what about the formatter? I am using loggers ...
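One hedged approach (not necessarily what the asker tried) is to re-attach a custom formatter to whatever handlers the Dataflow worker already installed, for example from DoFn.setup(); the format string is illustrative:
import logging
import apache_beam as beam

class FormattedLogDoFn(beam.DoFn):
    def setup(self):
        # Sketch: attach a custom formatter to the handlers the Dataflow
        # worker installed on the root logger.
        fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        for handler in logging.getLogger().handlers:
            handler.setFormatter(fmt)

    def process(self, element):
        logging.getLogger(__name__).info("processing %r", element)
        yield element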
0
votes
0
answers
31
views
NameError: name 'MyschemaName' is not defined [while running 'Attaching the schema-ptransform-67'] while deploying an Apache Beam pipeline in Dataflow
How does one effectively define a schema so that all Dataflow workers have access to it? Below is the section of my code that is failing because the schema name cannot be found.
I have ...
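A common pattern for this class of NameError is to define the schema at module level, so every worker can resolve it, and register a coder for it; the fields below are illustrative:
import typing
import apache_beam as beam

# Define the schema at module level, not inside run() or main(), so it can
# be pickled and resolved on every Dataflow worker.
class MyschemaName(typing.NamedTuple):
    id: int
    name: str

beam.coders.registry.register_coder(MyschemaName, beam.coders.RowCoder)

# In the pipeline, attach the schema explicitly:
# | beam.Map(lambda d: MyschemaName(**d)).with_output_types(MyschemaName)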
0
votes
1
answer
126
views
Beam Dataflow Reshuffle Failing
We have a batch GCP Dataflow job that is failing on a Reshuffle() step with the following error:
ValueError: Error decoding input stream with coder WindowedValueCoder[TupleCoder[LengthPrefixCoder[...
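The full trace is truncated above; as a hedged illustration only, explicit type hints on the step feeding Reshuffle let Beam pick a deterministic coder instead of falling back to pickling:
import typing
import apache_beam as beam

# Sketch: pin the element type flowing into Reshuffle. Keys and values
# here are illustrative.
with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{"key": "k", "value": "v"}])
        | beam.Map(lambda r: (r["key"], r["value"])).with_output_types(
            typing.Tuple[str, str])
        | beam.Reshuffle()
    )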
0
votes
1
answer
187
views
"TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union" while Writing to BigQuery table on a Dataflow pipeline using Apache Beam
I am working on a Dataflow pipeline in Python using apache-beam==2.57.0 and google-cloud-bigquery==3.26.0 to read data from a Cloud SQL database and write it to a BigQuery table. The script runs into ...
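For comparison, a minimal self-contained WriteToBigQuery call on those library versions; the table and schema are placeholders, not the asker's code:
import apache_beam as beam

# Sketch: write dicts to BigQuery. Table and schema are placeholders.
with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{"id": 1, "name": "a"}])
        | beam.io.WriteToBigQuery(
            table="my-project:my_dataset.my_table",
            schema="id:INTEGER,name:STRING",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )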
0
votes
0
answers
164
views
Error Syncing Pod Python Dataflow Workers
I have built a custom Python image for a streaming Dataflow job and tested locally that it works. When deployed to GCP it pulls the image, the job logs initialize the minimum of 1 worker, and the job is in state ...
0
votes
0
answers
116
views
How to design a Dataflow pipeline requiring several external api calls and database operations per flow?
The pipeline I am building specifically uses Dataflow with the Apache Beam Python SDK; however, I just need opinions on Beam's concepts, so examples in Java, for instance, are perfectly fine with me.
In ...
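For readers weighing the design, the usual Beam building block is a DoFn that creates its clients once per worker in setup(); a rough sketch, where the URL and field names are made up:
import requests
import apache_beam as beam

class EnrichViaApi(beam.DoFn):
    """Hypothetical DoFn showing the common pattern for per-element API calls."""

    def setup(self):
        # Create expensive clients once per worker, not once per element.
        self._session = requests.Session()

    def process(self, element):
        # One external call per element; consider beam.BatchElements upstream
        # if the API supports bulk requests.
        resp = self._session.get(
            "https://api.example.com/lookup", params={"id": element["id"]})
        element["enrichment"] = resp.json()
        yield element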
0
votes
0
answers
38
views
GCP Dataflow workers can't find my Python modules
I've written a Python script and built a folder structure for a small ETL application that is meant to run as a Google Cloud Dataflow job.
It works locally and I'm able to load, transform, and offload ...
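A common remedy for this symptom is to package the local modules and run the job with --setup_file; a minimal sketch where the package name is illustrative:
# setup.py at the project root; run the pipeline with --setup_file=./setup.py
import setuptools

setuptools.setup(
    name="my_etl",                        # illustrative package name
    version="0.1.0",
    packages=setuptools.find_packages(),  # picks up the local module folders
)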
0
votes
0
answers
25
views
Why is windowing in Apache Beam not waiting for the window to finish?
I am writing an Apache Beam streaming pipeline where I want to read messages from Pub/Sub and iterate over all the messages present in a window after the window time elapses, in this case 1 minute.
But, although ...
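One point worth noting: WindowInto alone does not buffer elements; an aggregation such as GroupByKey is what actually waits for the window to close. A minimal sketch with placeholder resource names, assuming a streaming pipeline p:
import apache_beam as beam
from apache_beam import window

# Sketch: the GroupByKey is what waits for the 1-minute window's watermark;
# WindowInto by itself passes elements straight through.
grouped = (
    p
    | beam.io.ReadFromPubSub(
        subscription="projects/my-project/subscriptions/my-sub")  # placeholder
    | beam.WindowInto(window.FixedWindows(60))
    | beam.Map(lambda msg: ("all", msg))
    | beam.GroupByKey()  # emits one group per window once it closes
)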
0
votes
0
answers
119
views
Pip install failed for package: -r while running a Dataflow Flex template with requirements.txt (NewConnectionError)
Getting the following errors after launching the Dataflow template: code execution starts, and once pipeline graph generation starts, this appears:
"subprocess.CalledProcessError: Command '['/usr/...
1
vote
0
answers
57
views
How to specify Python dependency versions for Dataflow when Cloud NAT is disabled and a custom container can't be used
I am running Python Dataflow jobs and I deploy the Dataflow template to GCS from GitLab. I am using --requirements_file=requirement.txt when I deploy my Python template to GCS. Cloud NAT is disabled in ...
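A hedged workaround when workers cannot reach PyPI is to download the wheels somewhere with connectivity and stage them with the job via extra_packages; all paths and names below are placeholders:
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: wheels pre-fetched with `pip download -d ./wheels -r requirement.txt`
# are staged with the job, so workers never need to reach PyPI.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
    extra_packages=["./wheels/somelib-1.0-py3-none-any.whl"],  # placeholder
)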
0
votes
0
answers
57
views
How to add delay to Apache Beam WriteToJdbc for MySQL Database
I am trying to stream data from Google Cloud Pub/Sub to a MySQL cloud database for a uni project. The data contains a series of values for various attributes, some of which need to go into one table ...
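For context, a rough sketch of the cross-language WriteToJdbc transform with the stream windowed first to pace the writes; every connection detail below is a placeholder, and rows stands for the upstream PCollection:
import apache_beam as beam
from apache_beam import window
from apache_beam.io.jdbc import WriteToJdbc

# Sketch: batch the stream into 10-second windows before the JDBC write.
_ = (
    rows
    | beam.WindowInto(window.FixedWindows(10))
    | WriteToJdbc(
        table_name="my_table",                         # placeholder
        driver_class_name="com.mysql.cj.jdbc.Driver",
        jdbc_url="jdbc:mysql://10.0.0.1:3306/my_db",   # placeholder
        username="user",                               # placeholder
        password="secret",                             # placeholder
    )
)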
0
votes
0
answers
73
views
Configure message retention duration of PubSub subscription created by Dataflow
I have a Dataflow pipeline that ingests messages from Pub/Sub. This automatically creates a subscription; however, the retention duration is 7 days. I like that it creates the subscription so I don't ...
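A hedged alternative is to pre-create the subscription with the desired retention using the Pub/Sub client and point ReadFromPubSub at it; resource names below are placeholders:
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2

# Sketch: create the subscription with a 24h retention, then pass it to
# ReadFromPubSub(subscription=...) instead of the topic.
subscriber = pubsub_v1.SubscriberClient()
subscriber.create_subscription(
    request={
        "name": "projects/my-project/subscriptions/my-sub",   # placeholder
        "topic": "projects/my-project/topics/my-topic",       # placeholder
        "message_retention_duration": duration_pb2.Duration(seconds=24 * 3600),
    }
)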
0
votes
0
answers
199
views
Dataflow streaming pipeline from Pub/Sub to BigQuery stalls (Python Beam SDK, streaming inserts)
I am trying to set up a streaming pipeline using the Python Beam SDK (2.54.0).
The pipeline works fine but does not scale well when there is a significant increase in input Pub/Sub messages (when Pub/Sub ...
2
votes
1
answer
394
views
How to use logging with custom jsonPayload in GCP Dataflow
By default, GCP Dataflow 'just works' with the standard Python logging library, so:
import logging
logging.info('hello')
yields a log message in Google Logging while the Dataflow job runs in GCP. ...
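A hedged option for structured payloads, separate from the standard logging integration, is the Cloud Logging client's log_struct; the logger name here is illustrative:
from google.cloud import logging as cloud_logging

# Sketch: write a structured entry directly; it appears in Cloud Logging
# with a jsonPayload rather than a textPayload.
client = cloud_logging.Client()
logger = client.logger("dataflow-custom")  # illustrative logger name
logger.log_struct({"event": "row_processed", "count": 42})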