We are designing a coupon system that handles ~300 redemption requests per second, where updating the global and per-user coupon counters directly in the database within a transaction causes contention and latency,
e.g. - Updating counters such as offer_budget becomes the bottleneck, because many users may be claiming the same offer at once, which serialises on row locks when we use transactions for safe updates. The per-(user_id, coupon_id) counters are not really the problem, since it's unlikely that the same user redeems the same offer at the same time. The client-side latency budget is 200-300ms per redeem call.
Because of this, our design is -
- Enforce coupon budgets (global and per-user limits) using Redis counters for low-latency checks and updates (see the redemption sketch after this list).
- Write each coupon redemption as an immutable log record in the database.
- Once step 2 (the log write) succeeds, we treat the redemption as successful and return success to the client at that point.
- Periodically, a background job reconciles/syncs the counters back to the database by scanning the log table since the last checkpoint (the checkpoint records how far we have scanned, so we don't have to rescan the entire table to rebuild counters every time). A sketch of this job also follows the list.
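To make the budget check race-free at ~300 rps, steps 1-3 could look like the following minimal sketch (Python with redis-py; the key names, limits, and the insert_redemption_log helper are placeholders for illustration, not our actual code):

```python
import redis

r = redis.Redis()  # placeholder connection; adjust for your deployment

# The Lua script runs atomically inside Redis, so the budget checks and
# both increments happen as one step and concurrent redeems can't oversell.
REDEEM_LUA = """
local g = tonumber(redis.call('GET', KEYS[1]) or '0')
if g >= tonumber(ARGV[1]) then return 'GLOBAL_EXHAUSTED' end
local u = tonumber(redis.call('GET', KEYS[2]) or '0')
if u >= tonumber(ARGV[2]) then return 'USER_LIMIT' end
redis.call('INCR', KEYS[1])
redis.call('INCR', KEYS[2])
return 'OK'
"""
redeem_script = r.register_script(REDEEM_LUA)

def redeem(offer_id: str, user_id: str, offer_budget: int, per_user_limit: int) -> bool:
    # Step 1: atomic check-and-increment of both counters in Redis.
    result = redeem_script(
        keys=[f"offer:{offer_id}:used",
              f"offer:{offer_id}:user:{user_id}:used"],
        args=[offer_budget, per_user_limit],
    )
    if result != b"OK":  # redis-py returns bytes by default
        return False
    # Step 2: append an immutable redemption row; insert_redemption_log
    # is a hypothetical helper standing in for the DB write.
    insert_redemption_log(offer_id, user_id)
    # Step 3: caller returns success to the client here.
    return True
```

If the log insert in step 2 fails after the Redis increment succeeded, we would decrement the counters back (or let the reconciliation job correct the drift), since the log table, not Redis, is the source of truth.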
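The reconciliation job itself is roughly this shape (the table, column, and db-handle names are illustrative, and the upsert is Postgres-style):

```python
def reconcile_counters(db) -> None:
    # Resume from the last checkpoint so only the tail of the log is scanned.
    (last_id,) = db.fetch_one("SELECT last_log_id FROM reconcile_checkpoint")
    rows = db.fetch_all(
        "SELECT id, offer_id, user_id, delta FROM coupon_log "
        "WHERE id > %s ORDER BY id",
        (last_id,),
    )
    for log_id, offer_id, user_id, delta in rows:
        # delta is +1 for a redemption row, -1 for a refund row.
        db.execute(
            "UPDATE offer_counters SET used = used + %s WHERE offer_id = %s",
            (delta, offer_id),
        )
        db.execute(
            "INSERT INTO user_offer_counters (user_id, offer_id, used) "
            "VALUES (%s, %s, %s) "
            "ON CONFLICT (user_id, offer_id) "
            "DO UPDATE SET used = user_offer_counters.used + EXCLUDED.used",
            (user_id, offer_id, delta),
        )
        last_id = log_id
    # Advance the checkpoint in the same transaction as the counter updates.
    db.execute("UPDATE reconcile_checkpoint SET last_log_id = %s", (last_id,))
    db.commit()
```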
This works for redemptions, but refunds introduce complexity.
On order cancellation or refund, what is the correct and safest approach to handling coupon refunds? Specifically, how should the system restore the global offer budget and the per-user (user_id, offer_id) counters, and log these changes?
Should refunds: (a) restore the Redis counters and write a refund log record, OR (b) move both redemption and refund logic entirely into database transactions (updating counters and logs atomically), accepting higher latency in exchange for strong consistency, given ~300 requests per second? A sketch of option (a) as we imagine it follows.
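For option (a), a minimal sketch of what we have in mind (same placeholder names as the redemption sketch; since our Redis counters track usage, restoring budget is a DECR, and the unique index that makes the insert idempotent is an assumed, illustrative schema detail):

```python
REFUND_LUA = """
-- Give budget back, but never drive a usage counter below zero.
local g = tonumber(redis.call('GET', KEYS[1]) or '0')
if g > 0 then redis.call('DECR', KEYS[1]) end
local u = tonumber(redis.call('GET', KEYS[2]) or '0')
if u > 0 then redis.call('DECR', KEYS[2]) end
return 'OK'
"""

def refund(db, r, offer_id: str, user_id: str, redemption_id: str) -> None:
    # Durable first: record the refund as an immutable log row. The
    # ON CONFLICT DO NOTHING (backed by the assumed unique index on
    # refund rows keyed by redemption_id) makes retried or duplicated
    # refund events idempotent.
    row = db.fetch_one(
        "INSERT INTO coupon_log (redemption_id, offer_id, user_id, delta) "
        "VALUES (%s, %s, %s, -1) ON CONFLICT DO NOTHING RETURNING id",
        (redemption_id, offer_id, user_id),
    )
    if row is None:
        return  # refund already processed; don't decrement Redis twice
    db.commit()
    # Only after the refund row is durable do we hand budget back in Redis.
    r.register_script(REFUND_LUA)(
        keys=[f"offer:{offer_id}:used",
              f"offer:{offer_id}:user:{user_id}:used"],
    )
```

The asymmetry here (a crash after the commit but before the DECR leaves Redis over-counting, i.e. slightly too strict) seems like the safe direction to fail in, and a periodic rebuild of the Redis counters from the log could repair it, but we'd like to confirm that's what practitioners actually do.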
In practice, what approach do large-scale systems use to balance safety, latency, and correctness in this case?