I have built a custom Python image for a streaming Dataflow job and tested locally that it works. When deployed to GCP, the image is pulled, the job logs show the minimum of 1 worker being initialized, and the job is in state RUNNING. I also see the log lines I print when the pipeline is initialized. However, when I publish Pub/Sub messages to my stream, nothing happens: the data freshness metric just keeps increasing, and the workers show strange errors (marked with INFO severity) about flags that I do not control:
flag provided but not defined: -logging_endpoint
Usage of /opt/google/dataflow/python_template_launcher:
...
When looking into the dataflow_step logs in GCP (these are not shown in the Dataflow worker console), I constantly see the following (not sure whether the two are related):
"Error syncing pod, skipping" err="failed to \"StartContainer\" for \"sdk-0-0\" with CrashLoopBackOff: \"back-off 10s restarting failed container=sdk-0-0 pod=df-myjob-she-09170618-0m8p-harness-vmzx_default(ce159539391472095ac26571c8a6ceb2)\"" pod="default/df-myjob-she-09170618-0m8p-harness-vmzx" podUID=ce159539391472095ac26571c8a6ceb2
Here is the Dockerfile (the commented-out lines were there before, taken from the Google docs, and I tried them as well):
FROM gcr.io/dataflow-templates-base/python311-template-launcher-base:latest
WORKDIR /template
COPY . /template/
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="src/dataflow.py"
RUN apt-get update \
&& apt-get install -y libffi-dev git \
&& rm -rf /var/lib/apt/lists/*
# Upgrade pip and install the requirements.
# && pip install --upgrade pip \
# && pip install -e .
# && pip install --no-cache-dir -r "/template/requirements-pypi.txt" \
# Download the requirements to speed up launching the Dataflow job.
# && pip download --no-cache-dir --dest /tmp/dataflow-requirements-cache -r "/template/requirements-pypi.txt"
ENTRYPOINT ["/opt/google/dataflow/python_template_launcher"]
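To give an idea of what the launcher runs, src/dataflow.py boils down to a streaming pipeline along these lines (a simplified sketch rather than the exact file; the transform names and the logging step are illustrative):

import argparse
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    # pubsub_topic is the template parameter passed via --parameters at launch.
    parser.add_argument("--pubsub_topic", required=True)
    known_args, pipeline_args = parser.parse_known_args(argv)

    options = PipelineOptions(pipeline_args)
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=known_args.pubsub_topic)
            | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
            | "Log" >> beam.Map(lambda element: logging.info("element: %s", element))
        )


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run()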
Here is how I launch the job:
gcloud dataflow flex-template run "myjob" \
--project="xxx" \
--region="europe-north1" \
--template-file-gcs-location="gs://xxx/templates/myjob-template.json" \
--parameters=^::^pubsub_topic=projects/xxx/topics/senior-cash-transactions::staging_location=gs://xxx/temp::sdk_container_image=europe-north1-docker.pkg.dev/xxx/xxx/xxx:latest \
--temp-location="gs://xxx/temp" \
--additional-experiments="enable_secure_boot" \
--disable-public-ips \
--network="xxx" \
--subnetwork="https://www.googleapis.com/compute/v1/projects/xxx/regions/europe-north1/subnetworks/xxx-europe-north1" \
--worker-machine-type="e2-standard-8" \
--service-account-email="[email protected]"
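For completeness, publishing a test message to the topic looks roughly like this (a minimal sketch; the payload is a placeholder, and the project and topic match the pubsub_topic parameter above):

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("xxx", "senior-cash-transactions")

# Publish a dummy payload and block until Pub/Sub returns the message id.
future = publisher.publish(topic_path, data=b'{"test": true}')
print("published message id:", future.result())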
The service account has Pub/Sub Admin, Dataflow Worker, Dataflow Admin, Storage (GCS) Admin, Network User, and Logs Writer, plus anything else I could think of while debugging.
(The pod-sync error above is what I see when filtering the logs by dataflow_step and the job_id.) I have reviewed several related topics, such as "Why did I encounter an 'Error syncing pod' with Dataflow pipeline?" and the entry in the Google Cloud docs for this error message (https://cloud.google.com/dataflow/docs/guides/common-errors#error-syncing-pod). So far I have tried:
- Changing Apache Beam versions
- As suggested in another thread, removing requirements.txt, declaring the dependencies in a setup.py file instead, and rebuilding the image (a sketch of that setup.py is shown after this list)
- Removing the requirements and setup steps from the Dockerfile entirely and rebuilding the image, since those dependencies appear to be in the base image already and I wanted to avoid conflicts
- Checking the firewall settings in my VPC and the GCP firewall rules; nothing suspicious there
- Checking all IAM permissions for the service account
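For reference, the setup.py from the second attempt looked roughly like this (the package name and the pinned Beam version are placeholders, not the exact values from my project):

import setuptools

# Dependencies moved here from requirements.txt; the version pin is illustrative.
setuptools.setup(
    name="myjob",
    version="0.1.0",
    packages=setuptools.find_packages(),
    install_requires=[
        "apache-beam[gcp]==2.50.0",
    ],
)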