I am running Python Dataflow jobs, and I deploy the Dataflow template to GCS from GitLab. I pass --requirements_file=requirement.txt when deploying my Python template to GCS. Cloud NAT is disabled in my project, which prevents the workers from downloading packages from PyPI.
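For context, the deploy step looks roughly like this (a sketch: the project, bucket, and `my_pipeline` module names are placeholders, and flags other than `--requirements_file` may differ in your setup):

```shell
# Sketch of the classic-template deploy step (placeholder names throughout).
# --requirements_file tells Beam which packages the workers should install.
python -m my_pipeline \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --staging_location=gs://my-bucket/staging \
  --temp_location=gs://my-bucket/temp \
  --template_location=gs://my-bucket/templates/my_template \
  --requirements_file=requirement.txt
```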
The initial requirement.txt:
- gcloud
- google-cloud-logging==1.15.0
- google-cloud-core==1.4.1
- google-cloud-datastore==1.8.0
- httplib2
- google-resumable-media==2.1.0
- google-cloud-storage
- google-cloud-bigquery
- google-cloud
- apache-beam[gcp]==2.39.0
- google-api-python-client
My Dataflow job failed because it tried to download some packages from the Internet.
The requirement.txt was then modified to:
- gcloud
- google-cloud-logging==3.1.2
- google-cloud-core==1.7.2
- google-cloud-datastore==1.8.0
- httplib2==0.19.1
- google-resumable-media==2.3.3
- google-cloud-storage==1.44.0
- google-cloud-bigquery==2.34.4
- google-cloud==0.34.0
- apache-beam[gcp]==2.39.0
- google-api-python-client==2.51.0
- google-cloud-appengine-logging==0.1.0
- google-cloud-audit-log==0.1.0
- pyyaml
After that, there were no more downloads at Dataflow runtime. What was the reason for my initial error? How can I ensure that the correct dependency versions are provided, so that no packages are downloaded at runtime? I cannot use the custom container option due to some restrictions.
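One thing I considered for keeping the file fully pinned: generating the version list from a build environment that already has the pipeline installed, instead of pinning by hand. A minimal sketch of that idea (a stand-in for `pip freeze`, assuming the build environment mirrors the worker environment; `pinned_requirements` is my own helper name):

```python
# Sketch: emit one "name==version" line per installed distribution, so
# every direct and transitive dependency is pinned and workers never
# need to resolve versions against PyPI at runtime.
from importlib import metadata


def pinned_requirements():
    # Collect "name==version" for each distribution in the environment.
    return sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
    )


if __name__ == "__main__":
    for line in pinned_requirements():
        print(line)
```

Redirecting this output (or `pip freeze`) into requirement.txt would at least guarantee every entry is pinned, though it does not by itself explain why the first file triggered downloads.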