
I am looking to create an application which will have the following architecture:

  1. Site database (which will run on-prem or in the cloud). If there are multiple sites, there will be a separate database instance for each site; in other words, the site is the tenant.
  2. Central cloud database, which will retrieve and push all data from the sites and will be used for company-wide reporting, etc.

Now the concerns and issues I am facing are below:

  1. How to avoid duplication of data in the cloud when merging from sites?
  2. How to avoid conflicts that may occur? Should I have the site ID as a GUID in all tables?
  3. How to avoid transferring shared data to a site where it shouldn't be?

For example: Products A and B are assigned to Site X but not to Site Y. How can I prevent the replication subscription from pushing Products A and B to Site Y, and then push them once they are assigned?

Please advise on the above scenario; I will be using SQL Server or Postgres, so please advise for both databases. The application will be built on the Blazor platform.

  • It seems to me that your question boils down to how one synchronizes data between two databases. Whether it's tenanted or not seems irrelevant to answering that core question. Commented Jan 19, 2024 at 4:37
  • Yes, please can you advise? Commented Jan 19, 2024 at 5:02
  • Just elaborating on the closure reason: my previous comment was indicating that you should focus on the boiled-down question, which includes first doing your own research and then, if needed, posting a direct and concrete question on that topic, instead of a much broader and vaguer question like you did here. Commented Jan 21, 2024 at 22:51

2 Answers


You were vague on details, so I will make some up.

Suppose each site has an inventory and a sales table. And centrally we maintain a sales_trends table which does global reporting on recent sales across all sites.

site setup

  1. Site Database (Which will run on-prem or cloud)

If there were no on-prem requirement, then I wouldn't even use multiple databases -- just make the site_id GUID the first part of a compound PK for both tables.

You might still choose to do that for your cloud customers.
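
For concreteness, a minimal sketch of that compound-key layout might look like the following (Postgres syntax; the table and column names are assumptions matching the inventory/sales example above, and SQL Server would use uniqueidentifier and datetime2 instead):

    -- Illustrative schema only: site_id leads the compound primary key.
    CREATE TABLE inventory (
        site_id  uuid        NOT NULL,
        sku      text        NOT NULL,
        quantity integer     NOT NULL,
        updated  timestamptz NOT NULL DEFAULT now(),
        PRIMARY KEY (site_id, sku)
    );

    CREATE TABLE sales (
        site_id  uuid        NOT NULL,
        sale_id  uuid        NOT NULL,
        sku      text        NOT NULL,
        amount   numeric     NOT NULL,
        updated  timestamptz NOT NULL DEFAULT now(),
        PRIMARY KEY (site_id, sale_id)
    );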

on-prem customer

First, produce the string "secret" + roll_new_guid(), and store that somewhere secure in the cloud, perhaps in an AWS S3 bucket. The prefix is simply a reminder to staff that this global secret should never be revealed to any tenant.

Next, to enroll a new customer, roll a (timestamped) UUIDv1 GUID for them, which will serve as their site_id. Also compute validator = sha3(global_secret + site_id). Have the customer's on-prem DB store the (site_id, validator) tuple. (Or use an HMAC function to compute the validator, for a similar result.) The customer should store the tuple in a one-row table.

During routine updates, the customer presents the (site_id, validator) tuple, and you proceed with the request only if it is valid. The idea is to prevent an attacker from rolling random GUIDs, and to let your various system components validate requests without always needing online access to some central registration table that has a UNIQUE index on site_id.
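
A sketch of what enrollment could look like on the Postgres side, assuming the uuid-ossp and pgcrypto extensions and an HMAC-SHA256 validator (the global secret shown inline would really come from your secure cloud storage; on SQL Server you would reach for NEWSEQUENTIALID() and HASHBYTES instead):

    CREATE EXTENSION IF NOT EXISTS "uuid-ossp";   -- provides uuid_generate_v1()
    CREATE EXTENSION IF NOT EXISTS pgcrypto;      -- provides hmac()

    -- One-row table kept in the customer's on-prem database.
    CREATE TABLE site_identity (
        site_id   uuid  PRIMARY KEY,
        validator bytea NOT NULL
    );

    -- Enrollment, run once per customer.
    INSERT INTO site_identity (site_id, validator)
    SELECT new_id,
           hmac(new_id::text, 'secret-<global secret goes here>', 'sha256')
    FROM (SELECT uuid_generate_v1() AS new_id) AS t;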

indexes

Central cloud-based tables will use the customer's site_id as part of their primary key.

Tables maintained on-prem for a given customer can follow the same convention, if you find that convenient. Equivalently, a very simple CREATE VIEW can prepend site_id to each row when you're sending updates to a cloud server.
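
For example, assuming the on-prem sales table does not itself carry a site_id column (unlike the cloud-side sketch earlier), a view along these lines could prepend it for upload (names are illustrative):

    -- site_identity holds exactly one row, so the cross join stamps site_id
    -- onto every outgoing sales row without duplicating any data.
    CREATE VIEW sales_for_upload AS
    SELECT si.site_id, s.*
    FROM sales AS s
    CROSS JOIN site_identity AS si;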

The inventory and sales tables will have an updated timestamp column, and it will be indexed, to support efficient queries on recent changes.

A central table, with N rows for your N customers, will maintain summary timestamps that describe "synchronized up through timestamp T1" for each customer.
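
In Postgres terms, that might amount to something like the following (index and table names are assumptions):

    -- Support efficient "what changed recently?" queries.
    CREATE INDEX idx_inventory_updated ON inventory (updated);
    CREATE INDEX idx_sales_updated     ON sales (updated);

    -- Cloud-side watermark table: one row per customer.
    CREATE TABLE sync_watermark (
        site_id      uuid        PRIMARY KEY,
        synced_until timestamptz NOT NULL   -- "synchronized up through T1"
    );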

Feel free to add created timestamp columns to your tables, if you find that useful.

synchronizing

An on-prem customer will connect to a cloud server and send recently changed rows.

Naïvely we might SELECT ... WHERE updated > T1. But we're racing with other customer transactions, so a "consistent reads" SQL isolation level is going to matter. Postgres makes this easy, since it's the default. I don't know offhand what the appropriate SQL Server isolation setting would be.

A better query would pick a T2 in the recent past, say ten seconds behind current wallclock time, and then do SELECT ... WHERE T1 < updated <= T2. This assumes that most INSERT / UPDATE transactions happen "quickly", within a few seconds. Such a query also accounts for the fact that timestamps have limited granularity, so we want to wait for a few ticks of the clock to give events a chance to show up as "historic" events, rather than in-progress "current" events.
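
A sketch of that windowed pull, with :t1 standing in for the stored watermark and a ten-second lag for T2 (both the bind-parameter style and the interval are assumptions):

    SELECT *
    FROM sales
    WHERE updated >  :t1                             -- last synced watermark (T1)
      AND updated <= now() - interval '10 seconds';  -- T2, trailing the wall clock
    -- After the batch is applied centrally, advance synced_until to T2.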

If some rows are very hot, they will always have an updated value that matches wallclock time, preventing us from incorporating a small delay in the query. If that is the case, you will need to rely on isolation and PostgreSQL's MVCC. Avoid this if feasible, as it introduces porting issues if you plan to use more than one backend database vendor, and it complicates the system testing you must do to verify there are no unfortunate races.

It's easy to synchronize the log of sales transactions; since it's append-only, every row is cold, and has a GUID PK. Depending on your use case, you may find it convenient to break out an append-only inventory_history table. Add another row to it, snapshotting an updated inventory row, each time a SKU participates in a sales transaction.
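
If you do break out such a table, it could be as simple as the following (structure and names are assumptions; gen_random_uuid() needs Postgres 13+ or pgcrypto, and the INSERT would run inside the sale's transaction):

    CREATE TABLE inventory_history (
        history_id uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
        site_id    uuid        NOT NULL,
        sku        text        NOT NULL,
        quantity   integer     NOT NULL,
        recorded   timestamptz NOT NULL DEFAULT now()
    );

    -- Snapshot the affected inventory row whenever a SKU is sold.
    INSERT INTO inventory_history (site_id, sku, quantity)
    SELECT site_id, sku, quantity
    FROM inventory
    WHERE site_id = :site_id AND sku = :sku;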

Every now and again, perhaps daily, you will want to audit synchronization to verify it is working as intended. Pick a pair of timestamps (T1, T2), perhaps separated by 24 hours. On both the customer and cloud databases, minimally compute COUNT(*) for that interval. Ideally you would also compute a hash of those ordered rows. We expect to routinely find a perfect match. Log an error if a mismatch is found.
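
One way to phrase that audit in Postgres, hashing only the ordered primary keys for brevity (hashing full rows would just mean concatenating more columns into the aggregate):

    -- Run on both sides for the same (T1, T2] interval and compare the results.
    SELECT count(*) AS row_count,
           md5(string_agg(sale_id::text, ',' ORDER BY sale_id)) AS row_hash
    FROM sales
    WHERE updated > :t1 AND updated <= :t2;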


  1. How to avoid duplication .. when merging ?

Tag each row with site_id, as you proposed.

If you find that customers sometimes set their NTP-synchronized clock backwards, then you will need to enforce monotonic timestamps at the app level. Let's hope it doesn't come to that. After you locally record that we're synced through timestamp T2, the customer is only allowed to commit timestamps > T2.
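
If it did come to that, one possible database-level variant of the guard (a hypothetical plpgsql trigger; the local_sync_state table is an assumption) would reject rows stamped at or before the recorded watermark:

    CREATE TABLE local_sync_state (synced_until timestamptz NOT NULL);

    CREATE FUNCTION enforce_monotonic_updated() RETURNS trigger AS $$
    BEGIN
        IF NEW.updated <= (SELECT synced_until FROM local_sync_state) THEN
            RAISE EXCEPTION 'updated timestamp % is not past the sync watermark',
                            NEW.updated;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER sales_monotonic_updated
    BEFORE INSERT OR UPDATE ON sales
    FOR EACH ROW EXECUTE FUNCTION enforce_monotonic_updated();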

  1. How to avoid conflicts that may occur ? Should I have site id as guid in all tables ?

See above. And "yes".

  1. How to avoid transferring shared data to site where it shouldn't be ?

Different customers will have different site IDs. Only send a customer rows bearing their own site ID.
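
For the Product/Site example from the question, that filtering could look like this (an assignment table plus a per-site push query; the product table and all names are assumptions):

    CREATE TABLE product_site (
        product_id uuid NOT NULL,
        site_id    uuid NOT NULL,
        PRIMARY KEY (product_id, site_id)
    );

    -- Push to a site only the products currently assigned to it; once Product A
    -- is assigned (a new product_site row appears), the next sync picks it up.
    SELECT p.*
    FROM product AS p
    JOIN product_site AS ps ON ps.product_id = p.product_id
    WHERE ps.site_id = :target_site_id;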


This should be fairly simple as long as you follow some rules:

  1. You won't be able to use out-of-the-box replication. You will have to write your own sync code.

  2. Use GUIDs as PKs, NEVER auto-incrementing ints.

Using your own sync code allows you to avoid all the "I want this set of data but not that" issues.

Using GUIDs avoids any key collisions for new data created on the tenant databases, while allowing you to keep the schema the same for both the tenant and central databases.
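
As a sketch of that shared schema (Postgres syntax; SQL Server would use uniqueidentifier DEFAULT NEWID(), and the table is just an example):

    -- The same table definition works on both tenant and central databases;
    -- rows created on any tenant get globally unique keys.
    CREATE TABLE product (
        product_id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        site_id    uuid NOT NULL,
        name       text NOT NULL
    );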

  • Given that the second store seems to be readonly, the "never use autoincrementing ints" advice somewhat misses its intended mark. As long as there's only one writer, that writer can be consistent about its usage of ints. Commented Jan 21, 2024 at 22:54
  • "Cloud central database -> which will retrieve and push" Commented Jan 21, 2024 at 22:59
