
I’m designing a coupon redemption system where each coupon is active for a time range and provides a discount on the cart/product value. When a user places an order with the coupon, they redeem that discount.

Each coupon has a few limit counters (a global redemption limit, a per-user limit, a per-card limit, etc.). A redemption request must check these counters and update them atomically so the limits cannot be exceeded.

On each redemption, every counter attached to that coupon is validated; if all are within their limits, we update them and allow the redemption to succeed when the order is placed.

We must validate and atomically update several counters:

total_redemptions_for_offer -= 1
user_redemptions_for_offer -= 1
card_redemptions_for_offer -= 1

Live counters being tracked:

coupon_id_count : 32323  // total number of times this coupon has been redeemed
user_id * coupon_id : 2  // this user has redeemed this coupon 2 times
card_num * coupon_id : 1 // this card has redeemed this coupon 1 time
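
For illustration, here is a minimal single-process sketch of the check-and-decrement semantics I want (names and structure are placeholders, not code from our system); my question is how to get these semantics at scale:

    import threading

    _lock = threading.Lock()

    def try_redeem(counters: dict, keys: list) -> bool:
        """Check every limit counter attached to this coupon and
        decrement all of them, or none. `counters` maps a counter key
        to its remaining allowance; `keys` are the counters for the
        coupon (offer-wide, per-user, per-card)."""
        with _lock:  # one process-wide lock stands in for real atomicity
            if any(counters.get(k, 0) <= 0 for k in keys):
                return False  # some limit already exhausted
            for k in keys:
                counters[k] -= 1
            return True

E.g. try_redeem(counters, ["offer:42", "user:7:42", "card:99:42"]) either consumes all three allowances or leaves them all untouched.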

Requests per second for coupon redemption are ~500, and multiple redemptions may happen on the same coupon concurrently.

I’m looking for guidance on designing this kind of counter-based limit enforcement at scale, maintaining these counters with good availability and correctness.

  • I removed some words from the question which seem to trigger certain community members to downvote and close-vote it. Asking for a standard or recommended industry approach is not fundamentally different from asking for a "best practice" or "the most popular approach". See softwareengineering.meta.stackexchange.com/a/6968 for more explanation of why this kind of wording should be avoided. Commented Jan 3 at 7:06
  • Thanks @DocBrown. Commented Jan 3 at 8:05
  • Question is the subject of a meta discussion. Commented Jan 4 at 6:44

1 Answer


inputs

OK, you're handing out inexpensive coupon cards, which might have a GUID QR code printed on them. A card might be displayed in a store window, where many users could access it.

Devices, like Android phones, have GUIDs. It's not quite clear if you have a user_device_id, or a user_id with some many-to-many relationship with devices. For simplicity I will assume the former.

So we have several entity IDs having per-ID quota limits:

  • offer
  • card
  • user_device

I assume there is a distribution over entity activity, perhaps Zipfian, where some entities are far more popular or active than other entities. In addition to quota values, each entity should also store a recently seen rate of redemption events, to help us assess risk of imminent quota violation. The fact that some offers are more popular than others, and can be handled differently, is essential to keeping the "at risk" query rate down.
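
One cheap way to keep that recently seen rate is an exponentially weighted moving average per entity; a minimal sketch (the class name and the tau decay constant are my own assumptions):

    import math
    import time

    class EwmaRate:
        """Exponentially weighted moving average of events per second."""

        def __init__(self, tau: float = 60.0):
            self.tau = tau                    # decay time constant, seconds
            self.rate = 0.0                   # estimated events/second
            self.last_ts = time.monotonic()

        def observe(self, events: int = 1) -> float:
            now = time.monotonic()
            dt = max(now - self.last_ts, 1e-9)
            alpha = math.exp(-dt / self.tau)  # how much old estimate survives
            self.rate = alpha * self.rate + (1.0 - alpha) * (events / dt)
            self.last_ts = now
            return self.rate

Popular (hot) entities show a high rate and get scrutinized; the long tail stays near zero and stays boring.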

fast path

Independent of the backend technology you choose for this (kafka, redis, rdbms), I recommend you apply a boolean label to each entity: "fast" vs "slow" path for processing. The fast-path entities are "boring"; they are far from their limit so there is little danger of soon exceeding their quotas.
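
A sketch of that labeling decision, assuming we know each entity's remaining quota and its recent rate (the horizon and safety factor are assumed tuning knobs, not from any spec):

    def is_slow_path(remaining_quota: int, rate_per_s: float,
                     horizon_s: float = 10.0, safety: float = 2.0) -> bool:
        """An entity is slow-path if, at its recent redemption rate,
        it could plausibly exhaust its remaining quota within the
        horizon (padded by a safety factor)."""
        projected = safety * rate_per_s * horizon_s
        return projected >= remaining_quota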

audit

Each redemption transaction event shall be appended to a common event log. This is easy to do, cheaply and reliably, even during a network partition. We append to a local log, which later will show up in an eventually consistent log. Thanks to transaction guids, logging an event is idempotent. Such logging is highly available.
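
A sketch of such an append, assuming JSON lines into a local file (the field names are mine; any durable local sink works):

    import json
    import time

    def log_redemption(local_log, txn_guid: str, entity_keys: list) -> None:
        """Append one redemption event to a local, append-only log.
        The txn_guid is what makes downstream processing idempotent:
        the same event replicated or replayed twice is counted once."""
        event = {
            "txn_guid": txn_guid,        # unique per redemption attempt
            "entity_keys": entity_keys,  # offer, user, card counter keys
            "ts": time.time(),
        }
        local_log.write(json.dumps(event) + "\n")
        local_log.flush()                # cheap local durability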

At "slow" intervals (every 100ms, every 10s, whatever) those events will be rolled up into counters. And we will evaluate the related entity to see if it is in danger of "soon" violating any quota. If so, we update that boolean attribute, turning it into a slow-path entity.

rate of change

So to recap, our sharded backend infrastructure is publishing a highly available mapping from entity ID to a boolean attribute. For most entities it reports "fast-path". The update rate, promoting from fast- to slow-path, is very low.

chasing 9's

Depending on your backend stability, and where you set the "transition to slow path" thresholds, it is possible for a partition to happen, a flash mob to arrive, and then some entity gets hammered on both sides of the partition, violating its quota. The CAP theorem tells us: that's life. Pick your tradeoffs. Fortunately the quality of the logged records is very good, so you can monitor quota violations and intelligently match thresholds and business goals against what your infrastructure is actually observed to do.

slow path

All nodes have access to (a possibly stale) entity --> path mapping. If the entity is marked "slow path", then we forward to a smallish number of specialized nodes which do the transaction "carefully", meaning they acquire locks. Personally, I would choose a postgres RDBMS to do that, but there's lots of other ways to acquire distributed locks.
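
A sketch of that careful transaction in Python against postgres (the counter_remaining and redemption_log tables, the composite key format, and psycopg2 are all my assumptions):

    import psycopg2

    def redeem_slow_path(dsn: str, txn_guid: str, coupon_id: str,
                         user_device_id: str, card_id: str) -> bool:
        """Check and decrement every quota row under row locks in a
        single transaction; either all counters are consumed or none."""
        keys = [f"offer:{coupon_id}",
                f"user:{user_device_id}:{coupon_id}",
                f"card:{card_id}:{coupon_id}"]
        conn = psycopg2.connect(dsn)
        try:
            with conn:  # commit on success, rollback on exception
                with conn.cursor() as cur:
                    # Lock the counter rows in a fixed order to avoid deadlock.
                    cur.execute(
                        """SELECT entity_id, remaining
                           FROM counter_remaining
                           WHERE entity_id = ANY(%s)
                           ORDER BY entity_id
                           FOR UPDATE""",
                        (keys,))
                    rows = dict(cur.fetchall())
                    # A missing row is treated as an exhausted quota.
                    if any(rows.get(k, 0) <= 0 for k in keys):
                        return False  # locks released on commit
                    cur.execute(
                        """UPDATE counter_remaining
                           SET remaining = remaining - 1
                           WHERE entity_id = ANY(%s)""",
                        (keys,))
                    # Idempotent audit record, keyed on txn_guid.
                    cur.execute(
                        """INSERT INTO redemption_log (txn_guid, entity_keys)
                           VALUES (%s, %s)
                           ON CONFLICT (txn_guid) DO NOTHING""",
                        (txn_guid, keys))
                    return True
        finally:
            conn.close()

Because only the small slow-path population ever reaches this code, the row contention the OP worries about is confined to a handful of genuinely hot rows.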

time passes

The OP mentioned a "time range" associated with each entity. So a fast-path entity that heated up may be promoted to slow-path; then time passes, we see the entity has cooled down, so we COMMIT an RDBMS transaction and then make a separate update to mark the entity as fast-path again. There's hysteresis on the entity's thresholds, to ensure the write rate for such path transitions stays low.
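
A sketch of that hysteresis (both thresholds are assumed values; the point is only that the promote and demote triggers must differ):

    def update_path_label(is_slow: bool, remaining: int, rate_per_s: float,
                          horizon_s: float = 10.0,
                          promote_factor: float = 1.0,
                          demote_factor: float = 0.25) -> bool:
        """Promote when projected demand over the horizon reaches the
        remaining quota; demote only once it falls well below. The gap
        between the two thresholds keeps the label from flapping, so
        path-transition writes stay rare."""
        projected = rate_per_s * horizon_s
        if not is_slow and projected >= promote_factor * remaining:
            return True   # heated up: promote to slow path
        if is_slow and projected < demote_factor * remaining:
            return False  # cooled down: demote to fast path
        return is_slow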


leases

Think of fast-path as an optimistic lock for an entity, which grants all worker nodes permission to log redemption events against that entity.

If a proposed redemption involves any slow-path entity, then the request is forwarded to a smaller set of nodes where we make different CAP tradeoffs, in the interest of safety and consistency.

  • Updated the question to clarify the user_ids and counters example. Commented Dec 12, 2025 at 20:36
  • Wouldn't updating high-contention live redemption counters (such as the offer count/budget) in an RDBMS cause too much serialization/locking of rows, such that the next query cannot proceed until the previous transaction ends? And here the rows we update are the redemption counters themselves, and with every redemption we might also have to update a log table. Commented Dec 12, 2025 at 21:08
  • I heard "QPS == 500". I offered a "low write rate" approach which is read mostly plus local logging, reducing the "difficult" (locking) case by perhaps two orders of magnitude for a given distribution, so maybe 5 QPS on a limited set of your compute nodes. // "Updating ... too much" -- quantify what "too much" means. Does it exceed the capacity of a specific AWS compute node? "might update the log" no, it is essential that every customer interaction shall append a log record, which becomes eventually consistent, it cannot be might. This is in accordance with your CAP requirements. Commented Dec 12, 2025 at 23:35
  • My question is mainly about whether a transactional DB (Postgres/MySQL) would realistically become a bottleneck. If each redemption runs a transaction that updates ~4–5 rows (offer, user, card counters and a log record table) and peak QPS is ~500, would row-level locking on all hot rows (same coupon) introduce noticeable latency or contention in the redeem call? Trying to understand if transactional Postgres starts to struggle for this pattern, assuming proper indexing/InnoDB row locks? Commented Dec 13, 2025 at 8:04
  • Such that we have to move to some cache-counter approach - softwareengineering.stackexchange.com/questions/384237/… with some write-back strategy to the DB, serving reads of coupon availability by checking counter state in the cache and updating counter state in the cache on redemption, and later syncing it to the DB in case the cache goes down. Commented Dec 13, 2025 at 8:10
