Truncate existing BigQuery table before DataFlow job runs

Question

I have a GCP DataFlow pipeline configured with a select SQL query that selects specific rows from a Postgres table and then inserts these rows automatically into the BigQuery dataset. This pipeline is configured to run daily at 12am UTC.

When the pipeline initiates a job, it runs successfully and copies the desired rows. However, when the next job runs, it copies the same set of rows again into the BigQuery table, hence resulting in data duplication.

Is there a way to truncate the BigQuery table before the pipeline runs? This seems like a common problem, so I'm looking for an easy solution that doesn't require writing a custom DataFlow template.

Do you mean delete all records from the table? Can you use a SQL query for that - cloud.google.com/bigquery/docs/reference/standard-sql/… — al-dann, Commented Feb 8, 2023 at 14:12
Do you mean - delete and create a table? Can you do that using API, command line, SQL? before loading data? — al-dann, Commented Feb 8, 2023 at 14:13
Yes, that's what I meant to truncate the whole table so that when the DataFlow job runs again, I don't get duplicate rows. I've now used a scheduled query in the BQ that truncates the table 5min before the Dataflow job runs. — Vaibhav Rathore, Commented Feb 10, 2023 at 7:21

Bruno Volpato · Accepted Answer · 2023-02-08 14:13:28Z

BigQueryIO has an option called WriteDisposition, where you can use WRITE_TRUNCATE.

The documentation for WRITE_TRUNCATE says:

Specifies that write should replace a table.

The replacement may occur in multiple steps - for instance by first removing the existing table, then creating a replacement, then filling it in. This is not an atomic operation, and external programs may see the table in any of these intermediate steps.
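As a rough illustration, here is how the same disposition might be set in a pipeline written with the Beam Python SDK (the answer refers to the Java BigQueryIO; Python is used here only for brevity). This is a sketch, not a drop-in fix for a Google-provided template: the table name and the write_options helper are hypothetical, and the string values are assumed to match the constants beam.io.BigQueryDisposition.WRITE_TRUNCATE and CREATE_NEVER.

```python
# Sketch only: the table spec and helper names below are hypothetical.
TARGET_TABLE = "my-project:my_dataset.my_table"


def write_options(table: str) -> dict:
    """Keyword arguments for beam.io.WriteToBigQuery that replace the table."""
    return {
        "table": table,
        # Replace the existing rows instead of appending to them.
        # Assumed equal to beam.io.BigQueryDisposition.WRITE_TRUNCATE.
        "write_disposition": "WRITE_TRUNCATE",
        # Assumes the destination table already exists.
        "create_disposition": "CREATE_NEVER",
    }


def run(rows, pipeline_args=None):
    """Write `rows` (an iterable of dicts) to TARGET_TABLE, replacing its contents."""
    # Imported lazily so write_options() can be inspected without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline:
        (
            pipeline
            | "CreateRows" >> beam.Create(rows)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(**write_options(TARGET_TABLE))
        )
```

In a real job the rows would come from the Postgres read step rather than beam.Create, and a batch write to BigQuery typically also needs a temp location passed in the pipeline arguments.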

If your use case cannot afford the table being unavailable during the operation, a common pattern is to load the data into a secondary / staging table, and then use an atomic operation in BigQuery to replace the original table (e.g., using CREATE OR REPLACE TABLE).
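A minimal sketch of the staging-table swap, assuming the Dataflow job has already written into a staging table. The helper below only builds the SQL text; the table names are hypothetical, and the statement would be run separately, for example as a BigQuery scheduled query after the job finishes.

```python
import re

# Accepts identifiers of the form project.dataset.table (a deliberately strict check).
_TABLE_ID = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_-]+$")


def _quoted(table_id: str) -> str:
    """Backtick-quote a fully qualified table id after validating its shape."""
    if not _TABLE_ID.match(table_id):
        raise ValueError(f"unexpected table id: {table_id!r}")
    return f"`{table_id}`"


def swap_sql(target: str, staging: str) -> str:
    """Build a statement that replaces `target` with the contents of `staging`."""
    return (
        f"CREATE OR REPLACE TABLE {_quoted(target)} "
        f"AS SELECT * FROM {_quoted(staging)}"
    )


# Hypothetical table names, for illustration only.
print(swap_sql("my-project.my_dataset.orders", "my-project.my_dataset.orders_staging"))
```

Note that a plain CREATE OR REPLACE TABLE ... AS SELECT does not carry over settings such as partitioning or clustering unless they are restated in the statement, so the real query may need more clauses than this sketch shows.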
