I have a GCP Dataflow pipeline configured with a SQL SELECT query that reads specific rows from a Postgres table and automatically inserts them into a table in a BigQuery dataset. The pipeline is scheduled to run daily at 12:00 AM UTC.
Each job runs successfully and copies the desired rows. However, the next job copies the same set of rows into the BigQuery table again, which results in duplicate data.
Is there a way to truncate the BigQuery table before each pipeline run? This seems like a common problem, so I'm looking for an easy solution that doesn't require writing a custom Dataflow template.
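
To make the goal concrete, the effect I'm after is roughly what the sketch below would do if it ran just before each job. This is only an illustration, not something wired into my pipeline: the function names `build_truncate_statement` and `truncate_table` are made up for the example, the project, dataset, and table names are placeholders, and it assumes the `google-cloud-bigquery` client library and default credentials are available.

```python
def build_truncate_statement(project: str, dataset: str, table: str) -> str:
    """Return a BigQuery TRUNCATE TABLE statement for the given table.

    The fully qualified name is wrapped in backticks so that project IDs
    containing hyphens are accepted.
    """
    return f"TRUNCATE TABLE `{project}.{dataset}.{table}`"


def truncate_table(project: str, dataset: str, table: str) -> None:
    """Run the TRUNCATE TABLE statement and wait for it to finish.

    Assumes google-cloud-bigquery is installed and credentials are configured.
    """
    # Imported lazily so build_truncate_statement has no external dependencies.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    # .result() blocks until the query job completes (or raises on failure).
    client.query(build_truncate_statement(project, dataset, table)).result()


# Placeholder names, for illustration only:
# truncate_table("my-project", "my_dataset", "my_table")
```

Ideally I'd get this behavior from the pipeline configuration itself rather than from a separate script that has to be scheduled ahead of the job.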