You were vague on details, so I will make some up.
Suppose each site has an inventory and a sales table.
And centrally we maintain a sales_trends table
which does global reporting on recent sales across all sites.
site setup
- Site Database (Which will run on-prem or cloud)
If there were no on-prem requirement,
then I wouldn't even use multiple databases -- just
make the site_id GUID the first part of a compound
PK for both tables.
You might still choose to do that for your cloud customers.
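For concreteness, here is a minimal sketch of that layout; the table and column names are made up, so adjust to taste.

```sql
-- Hypothetical schema sketch: site_id leads every compound primary key.
CREATE TABLE inventory (
    site_id   uuid          NOT NULL,
    sku       text          NOT NULL,
    quantity  integer       NOT NULL,
    updated   timestamptz   NOT NULL DEFAULT now(),
    PRIMARY KEY (site_id, sku)
);

CREATE TABLE sales (
    site_id   uuid          NOT NULL,
    sale_id   uuid          NOT NULL,
    sku       text          NOT NULL,
    quantity  integer       NOT NULL,
    amount    numeric(12,2) NOT NULL,
    updated   timestamptz   NOT NULL DEFAULT now(),
    PRIMARY KEY (site_id, sale_id)
);
```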
on-prem customer
First, produce the string "secret" + roll_new_guid(),
and store that somewhere secure in the cloud, perhaps in an AWS S3 bucket.
The prefix is simply a reminder to staff that this global secret
should never be revealed to any tenant.
Next, to enroll a new customer, roll a (timestamped) UUIDv1 guid for them
which will serve as their site_id.
Also compute validator = sha3(global_secret + site_id).
Have the customer's on-prem DB store the (site_id, validator) tuple.
(Or use an HMAC function to compute the validator, for a similar result.)
The customer should store the tuple in a one-row table.
During routine updates, the customer presents the (site_id, validator) tuple,
and you will proceed with the request only if it's valid.
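Here is a minimal sketch of enrollment, assuming Postgres with the pgcrypto and uuid-ossp extensions; the secret shown inline is just a placeholder for whatever you fetch from your secure store, and sha256 is used for the HMAC (sha3 works the same way).

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Roll a timestamped UUIDv1 for the new customer.
SELECT uuid_generate_v1() AS site_id;

-- validator = HMAC(site_id, global_secret); both values below are placeholders.
SELECT encode(
         hmac('11111111-2222-1ec9-8000-0123456789ab',  -- the new site_id, as text
              'secret-<value fetched from S3>',        -- global secret, never shown to tenants
              'sha256'),
         'hex') AS validator;

-- The customer stores the pair in a one-row table on-prem.
CREATE TABLE site_identity (
    site_id   uuid NOT NULL PRIMARY KEY,
    validator text NOT NULL
);

-- To validate a request, recompute the HMAC from the presented site_id
-- and compare it to the presented validator.
```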
The idea is to prevent an attacker from rolling random GUIDs,
and to let your various system components validate requests
without always needing online access to some central registration table
that has a UNIQUE index on site_id.
indexes
Central cloud-based tables will use customer's site_id as
part of their primary key.
Tables maintained on-prem for a given customer can follow the same convention,
if you find that convenient.
Equivalently, a very simple CREATE VIEW can prepend site_id to each row
when you're sending updates to a cloud server.
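For example, if the local tables are keyed without site_id, a view like this (reusing the one-row site_identity table from above; the names are made up) can add it on the way out:

```sql
-- Hypothetical on-prem table that carries no site_id of its own.
CREATE TABLE local_sales (
    sale_id   uuid          NOT NULL PRIMARY KEY,
    sku       text          NOT NULL,
    quantity  integer       NOT NULL,
    amount    numeric(12,2) NOT NULL,
    updated   timestamptz   NOT NULL DEFAULT now()
);

-- Prepend this site's id to every row on its way to the cloud.
CREATE VIEW sales_for_upload AS
SELECT si.site_id, s.*
FROM   local_sales s
CROSS JOIN site_identity si;   -- site_identity holds exactly one row
```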
The inventory and sales tables will have an updated timestamp column,
and it will be indexed, to support efficient queries on recent changes.
A central table, with N rows for your N customers, will maintain summary
timestamps that describe "synchronized up through timestamp T1" for each customer.
Feel free to add created timestamp columns to your tables,
if you find that useful.
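Concretely, something along these lines; the names are placeholders.

```sql
-- Index the change timestamp so "recent changes" queries stay cheap.
CREATE INDEX inventory_updated_idx ON inventory (updated);
CREATE INDEX sales_updated_idx     ON sales (updated);

-- Central table, one row per customer, recording the sync watermark:
-- "synchronized up through timestamp synced_through".
CREATE TABLE sync_state (
    site_id        uuid        NOT NULL PRIMARY KEY,
    synced_through timestamptz NOT NULL DEFAULT '-infinity'
);
```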
synchronizing
An on-prem customer will connect to a cloud server and send
recently changed rows.
Naïvely we might SELECT ... WHERE updated > T1.
But we're racing with other customer transactions,
so a "consistent reads" sql isolation level is going to matter.
Postgres makes this easy, since it's the default.
I don't know offhand what the appropriate SQL Server isolation setting would be.
A better query would pick a T2 in the recent past,
say ten seconds behind current wallclock time, and
then do SELECT ... WHERE T1 < updated <= T2.
This assumes that most INSERT / UPDATE transactions
happen "quickly", within a few seconds.
Such a query also accounts for the fact that
timestamps have limited granularity, so we
want to wait for a few ticks of the clock
to give events a chance to show up as "historic"
events, rather than in-progress "current" events.
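A sketch of that windowed pull; the T1 literal is a placeholder for whatever watermark you last recorded.

```sql
-- One sync batch: rows changed in (T1, T2], where T2 lags wallclock by ten seconds.
SELECT *
FROM   sales
WHERE  updated >  timestamptz '2024-06-01 00:00:00+00'   -- T1, the recorded watermark
  AND  updated <= now() - interval '10 seconds'          -- T2
ORDER BY updated;
```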
If some rows are very hot,
they will always carry an updated value that tracks wallclock time,
so the small delay in the query won't let them settle into the past.
If that is the case, you will need to rely on isolation and PostgreSQL's
MVCC.
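In Postgres that means, at minimum, running the whole pull inside one snapshot:

```sql
-- Take the entire pull within a single snapshot so hot rows are read consistently.
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- ... run the windowed SELECTs from above here ...
COMMIT;
```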
Avoid this if feasible, as it introduces porting issues if you
plan to use more than one backend database vendor, and it
complicates the system testing you must do to verify
there are no unfortunate races.
It's easy to synchronize the log of sales transactions;
since it's append-only, every row is cold and has a GUID PK.
Depending on your use case, you may find it convenient to
break out an append-only inventory_history table.
Add another row to it, snapshotting an updated inventory row,
each time a SKU participates in a sales transaction.
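One way to do that, sketched below, is a trigger that snapshots an inventory row whenever it changes; that is slightly broader than snapshotting only at sale time, but simpler, and it assumes the hypothetical inventory table from earlier.

```sql
-- Append-only history: one snapshot row per change to an inventory row.
CREATE TABLE inventory_history (
    site_id   uuid        NOT NULL,
    sku       text        NOT NULL,
    quantity  integer     NOT NULL,
    updated   timestamptz NOT NULL
);
CREATE INDEX inventory_history_updated_idx ON inventory_history (updated);

CREATE FUNCTION snapshot_inventory() RETURNS trigger AS $$
BEGIN
    INSERT INTO inventory_history (site_id, sku, quantity, updated)
    VALUES (NEW.site_id, NEW.sku, NEW.quantity, NEW.updated);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER inventory_snapshot
AFTER INSERT OR UPDATE ON inventory
FOR EACH ROW EXECUTE FUNCTION snapshot_inventory();
```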
Every now and again, perhaps daily, you will want
to audit synchronization to verify it is working as intended.
Pick a pair of timestamps (T1, T2), perhaps separated by 24 hours.
On both customer and cloud database, minimally compute COUNT(*)
for that interval. Ideally you would also compute a hash of those ordered rows.
We expect to routinely find a perfect match.
Log an error if mismatch is found.
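A sketch of that audit, run identically on both databases and compared; the timestamps and site_id are placeholders, and md5 over the ordered rows serves only as a cheap fingerprint.

```sql
-- Audit one (T1, T2] interval: row count plus a fingerprint of the ordered rows.
SELECT count(*)                                       AS row_count,
       md5(string_agg(s::text, ',' ORDER BY sale_id)) AS fingerprint
FROM   sales s
WHERE  site_id  = '11111111-2222-1ec9-8000-0123456789ab'
  AND  updated  > timestamptz '2024-06-01 00:00:00+00'   -- T1
  AND  updated <= timestamptz '2024-06-02 00:00:00+00';  -- T2
```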
- How to avoid duplication .. when merging ?
Tag each row with site_id, as you proposed.
If you find that customers sometimes set their NTP-synchronized clocks backwards,
then you will need to enforce monotonic timestamps at the app level.
Let's hope it doesn't come to that.
After you record that a site is synced through timestamp T2,
that customer is only allowed to commit rows with timestamps > T2.
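If it does come to that, the guard can live in the app's write path or, as sketched here, in a trigger; this assumes the on-prem database keeps its own copy of the sync_state watermark row.

```sql
-- Force updated to land strictly after the recorded watermark,
-- even if the wall clock has stepped backwards.
CREATE FUNCTION enforce_monotonic_updated() RETURNS trigger AS $$
DECLARE
    watermark timestamptz;
BEGIN
    SELECT synced_through INTO watermark
    FROM   sync_state
    WHERE  site_id = NEW.site_id;

    NEW.updated := greatest(now(), watermark + interval '1 microsecond');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sales_monotonic_updated
BEFORE INSERT OR UPDATE ON sales
FOR EACH ROW EXECUTE FUNCTION enforce_monotonic_updated();
```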
- How to avoid conflicts that may occur ? Should I have site id as guid in all tables ?
See above. And "yes".
- How to avoid transferring shared data to site where it shouldn't be ?
Different customers will have different site IDs.
Only send a customer rows bearing their own site ID.