
Description:

We are currently using Google Cloud Datastream to replicate data from a Cloud SQL (MySQL) instance into BigQuery in near real time. Replication works reliably for INSERT and UPDATE operations, and all changes are reflected in BigQuery.

However, we are facing a major issue:
Whenever a DELETE or TRUNCATE operation is performed on the source MySQL database, the corresponding rows are also deleted from BigQuery. This results in the loss of historical data, which we want to retain.


What we've tried:

  1. APPEND-ONLY mode in Datastream
    We explored Datastream's append-only write mode, which retains every version of each record along with metadata such as the operation type. However, this approach introduces a lot of redundancy: each UPDATE creates a new row, so the table grows rapidly and separating the current state from historical states requires extra query logic (a sketch of that query logic follows this list).

  2. Disabling binary logging around deletes (SET SQL_LOG_BIN=0)
    We also tried suppressing binary logging around delete statements. However, SET SQL_LOG_BIN=0 disables binary logging for the entire session, and our deletions are initiated programmatically and vary across scenarios, so they cannot be reliably wrapped at the session or statement level.
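
    For illustration, this is roughly the query logic append-only mode forces on us to recover the current state. It is a minimal sketch against a hypothetical orders table with primary key id; the datastream_metadata field names follow Datastream's documented append-only schema, but they should be verified against the actual stream output:

        from google.cloud import bigquery

        client = bigquery.Client()

        # Hypothetical project/dataset/table names. The view keeps only the
        # latest non-deleted version of each row in the append-only table.
        view_sql = """
        CREATE OR REPLACE VIEW `my-project.replica.orders_current` AS
        SELECT * EXCEPT (rn)
        FROM (
          SELECT
            *,
            ROW_NUMBER() OVER (
              PARTITION BY id  -- source table primary key
              ORDER BY datastream_metadata.source_timestamp DESC
            ) AS rn
          FROM `my-project.replica.orders_append_only`
          -- Exclude delete-style change records before ranking, so the last
          -- pre-delete version of each row survives in the view.
          WHERE datastream_metadata.change_type NOT IN ('DELETE', 'UPDATE-DELETE')
        )
        WHERE rn = 1
        """
        client.query(view_sql).result()

    This works, but every consumer has to query through such a view, and the underlying table still grows with every UPDATE, which is exactly the redundancy we want to avoid.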


Requirement:

We are looking for a no-code or low-code solution to prevent delete and truncate operations on the source database from being propagated to BigQuery. Ideally, we would like to retain only the latest state of data for updates but completely ignore deletes during replication.


Question:

  • Is there any configuration in Datastream or BigQuery that can help achieve this behavior?
  • Can we build a Dataflow pipeline or transformation layer that listens to Datastream's change records and filters out DELETE/TRUNCATE operations before writing to BigQuery?
  • Are there any other Google Cloud-native solutions that support selective replication or change filtering?

We are trying to avoid managing this manually or using complex ETL transformations, so a low-maintenance, scalable solution would be preferred.

Any guidance or recommendations would be highly appreciated!


1 Answer


Let me offer some insights on each of your questions:

  • Currently, there is no built-in configuration in Datastream or BigQuery to selectively prevent DELETE or TRUNCATE operations from being replicated. In the default merge write mode, Datastream applies deletes so that the BigQuery table stays consistent with the source; append-only mode is the only built-in way to retain deleted rows.

  • Yes, a Dataflow pipeline between Datastream and BigQuery gives you much more control over data transformation. Google also provides a Datastream to BigQuery Dataflow template that you can use as a starting point; the Dataflow documentation covers it in more detail. A minimal Beam sketch that drops delete events follows this list.

  • Besides Dataflow, another Google Cloud-native option is a Cloud Function that processes Datastream's output. Datastream writes to BigQuery or Cloud Storage rather than directly to Pub/Sub, so in practice the function would be triggered when Datastream writes a file to a Cloud Storage bucket; it would filter out DELETE/TRUNCATE events and write the remaining rows to BigQuery (see the second sketch below). However, for high-volume data, Dataflow is generally more scalable and recommended.
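
Here is a minimal Apache Beam (Python) sketch of such a filtering pipeline. It assumes the change events arrive as JSON messages on a Pub/Sub subscription (Google's own template instead reads the files from Cloud Storage via bucket notifications); the subscription, target table, and field names (source_metadata.change_type, payload) are assumptions based on Datastream's documented event format, so verify them against your stream's actual output:

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Change types treated as deletes; the names follow Datastream's event
    # format but should be checked against your stream's output.
    DELETE_TYPES = ('DELETE', 'UPDATE-DELETE')

    def is_not_delete(event):
        # Keep every change record except delete-style ones.
        return event.get('source_metadata', {}).get('change_type') not in DELETE_TYPES

    def run():
        options = PipelineOptions(streaming=True)
        with beam.Pipeline(options=options) as p:
            (
                p
                # Hypothetical subscription carrying Datastream change events.
                | 'ReadEvents' >> beam.io.ReadFromPubSub(
                    subscription='projects/my-project/subscriptions/datastream-events')
                | 'Parse' >> beam.Map(json.loads)
                | 'DropDeletes' >> beam.Filter(is_not_delete)
                # The row data itself sits under the event's payload field.
                | 'ExtractPayload' >> beam.Map(lambda event: event['payload'])
                | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                    'my-project:replica.orders',  # hypothetical target table
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
            )

    if __name__ == '__main__':
        run()

Note that this appends rows rather than upserting, so keeping only the latest state per key still requires either a periodic MERGE into a final table or a deduplicating view like the one sketched in the question.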
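
And here is a sketch of the Cloud Functions variant, triggered by a Cloud Storage event when Datastream finalizes an output file. The table name and field names are again assumptions, as is the newline-delimited JSON layout of each file:

    import json

    import functions_framework
    from google.cloud import bigquery, storage

    bq = bigquery.Client()
    gcs = storage.Client()
    TARGET_TABLE = 'my-project.replica.orders'  # hypothetical target table
    DELETE_TYPES = ('DELETE', 'UPDATE-DELETE')

    @functions_framework.cloud_event
    def filter_changes(cloud_event):
        # Triggered when Datastream finalizes a JSON file in the bucket.
        data = cloud_event.data
        blob = gcs.bucket(data['bucket']).blob(data['name'])

        rows = []
        # Assumes each file holds newline-delimited JSON change events.
        for line in blob.download_as_text().splitlines():
            event = json.loads(line)
            # Skip delete-style records so history survives in BigQuery.
            if event.get('source_metadata', {}).get('change_type') in DELETE_TYPES:
                continue
            rows.append(event.get('payload', {}))

        if rows:
            errors = bq.insert_rows_json(TARGET_TABLE, rows)
            if errors:
                raise RuntimeError('BigQuery insert failed: %s' % errors)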
