Is a single worker thread sufficient to keep up with ~500 RPS in real-world systems?
There is no way for us to know: there are simply no details to test against. Maybe you're running the thing on a Raspberry Pi 3, with the database file on an old USB flash drive. It would probably still work fine, but who knows.
There is, however, a definitive way for you to answer your own question.
Do the test.
How can this be safely scaled to multiple workers without double-counting (i.e., ensuring workers don’t process the same log records)?
Sharding. Or communication between workers. Or a message queue system.
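Sharding, for instance, could be as simple as statically assigning each counter to exactly one worker. A hypothetical sketch (the function name and parameters are mine, not anything from your system):

def is_mine(counter_id: int, worker_index: int, worker_count: int) -> bool:
    # Illustrative only: each counter belongs to exactly one worker,
    # so no two workers ever touch the same log records.
    return counter_id % worker_count == worker_index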
But before considering those approaches, test whether the most basic system works. What happens when you process your 500 requests by changing the counters in the database?
That's literally a few dozen lines of code to write.
I believe we should still run multiple (3–4) workers for availability, while guaranteeing that only one worker updates the counters and the checkpoint at a time. How do we ensure this safely?
Define safely.
Sooner or later, you will face the situation where you'll have to choose between:
- Not receiving a message.
- Receiving the same message multiple times.
Depending on the business case, it may make sense to lose messages. The counter doesn't get changed; that's it.
Or maybe it is critical to have the counter changed no matter what. In that case, you'll have to implement an additional feature where the requester keeps sending the same message until the server acknowledges it. This means you'll probably need to de-duplicate messages, for instance by storing their identifiers in the database so you know whether you already processed them.
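A minimal sketch of that de-duplication, assuming a processed_messages table keyed by a unique message_id column and a psycopg connection (the table, column, and function names here are hypothetical):

def apply_change_once(conn, message_id: str, counter_id: int, delta: int) -> None:
    # Names are illustrative; the point is the pattern, not the exact schema.
    with conn:  # single transaction: both statements commit or roll back together
        with conn.cursor() as cur:
            # Record the message id first; a duplicate means the change was already applied.
            cur.execute(
                'insert into processed_messages (message_id) values (%(mid)s)'
                ' on conflict (message_id) do nothing',
                {'mid': message_id},
            )
            if cur.rowcount == 0:
                return  # already processed: acknowledge, but don't touch the counter
            cur.execute(
                'update counters set value = value + %(delta)s where id = %(id)s',
                {'delta': delta, 'id': counter_id},
            )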
Actual test
Following the advice above, here's what you get.
The server has two routes:
import flask

app = flask.Flask(__name__)

@app.route('/<int:counter>/increment', methods=['POST'])
def increment(counter: int) -> flask.Response:
    ...

@app.route('/<int:counter>/decrement', methods=['POST'])
def decrement(counter: int) -> flask.Response:
    ...
Each route executes two SQL queries. The first one adds an entry in the log table:
insert into operations
(counter_id, original_value, increment)
values (%(id)s, (select value from counters where id = %(id)s), %(increment)s)
The second one adjusts the actual counter:
update counters set value = value + %(delta)s where id = %(id)s
As both queries run in a single transaction, there can be no inconsistency between the two tables if something goes wrong.
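In Python terms, a route body boils down to something like this, assuming a psycopg connection; the connection handling and the helper name are illustrative, not necessarily my exact code:

def apply_delta(conn, counter_id: int, delta: int) -> None:
    # Sketch only: wraps the two queries above in one transaction.
    with conn:  # both statements commit together, or neither does
        with conn.cursor() as cur:
            # Log the operation, capturing the value the counter had before the change.
            cur.execute(
                'insert into operations (counter_id, original_value, increment)'
                ' values (%(id)s, (select value from counters where id = %(id)s), %(increment)s)',
                {'id': counter_id, 'increment': delta},
            )
            # Adjust the counter itself.
            cur.execute(
                'update counters set value = value + %(delta)s where id = %(id)s',
                {'id': counter_id, 'delta': delta},
            )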
On the client side, a simple script creates a given number of changes to run (generate_changes). A change affects one of five counters, weighted so that the first counter is picked most of the time, to test the “high contention” scenario you mentioned. Here's the distribution I picked:
indexes = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 3, 4, 4]
Each change either increments or decrements a counter, with the arbitrary rule that a counter never goes below zero.
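generate_changes itself isn't shown here; a possible sketch, using the indexes list above and assuming a small Change structure with the counter and increment fields used below:

import dataclasses
import random

@dataclasses.dataclass
class Change:
    counter: int        # which of the five counters to touch
    increment: bool     # True to increment, False to decrement

def generate_changes(count: int) -> list[Change]:
    # Illustrative implementation; the original may differ in details.
    changes = []
    expected = [0] * 5  # expected value of each counter, to keep them non-negative
    for _ in range(count):
        counter = random.choice(indexes)
        # Only decrement when the counter is expected to stay at or above zero.
        increment = expected[counter] == 0 or random.random() < 0.5
        expected[counter] += 1 if increment else -1
        changes.append(Change(counter, increment))
    return changes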
The client application starts by clearing all the data, then runs the changes in parallel. The duration of this second step is measured and reported:
import multiprocessing
import time

import requests

def process_change(c: Change):
    action = 'increment' if c.increment else 'decrement'
    uri = f'http://localhost:7492/{c.counter}/{action}'
    requests.post(uri).raise_for_status()

start_time = time.time()
changes = generate_changes(count_changes)
arguments = ((c,) for c in changes)
with multiprocessing.Pool() as pool:
    pool.starmap(process_change, arguments)
end_time = time.time()
print(f'Ran {count_changes} changes in {(end_time - start_time):.3f} s.')
The whole thing takes about thirty minutes to draft (or less; I was being particularly slow on this one).
On an Intel i5-10600 (3.30 GHz) with a Force MP510 SSD, PostgreSQL 15, and Python 3.13.5, running 2,000 changes takes 2.27 s. That's about 880 changes per second. When I change the test to perform 10,000 or 50,000 changes, I get the very same rate of about 880 changes/s.
If I don't reset the data between runs, i.e. let it accumulate, the speed seems to remain the same. I haven't tested what happens once the database contains millions of rows.
During the runs, all CPU cores sit at around 65% usage. Here's a view of two consecutive runs:

[screenshot: CPU usage across all cores during two consecutive runs]
Although I haven't measured it, my impression is that most of the CPU load comes from the client script, not the server or the database. As a matter of fact, removing parallelism on the client side makes the process take ten times as long as the parallel variant.
Conclusion:
A naive implementation, with no optimizations, can easily reach a rate of 880 changes per second to the counters on “ordinary” hardware, and it is not impossible that the actual rate is much higher if the slowness in my test turns out to be on the client side.
The implementation is so basic that it is easy to build as a proof of concept, check the actual performance on production hardware, and, if needed, act accordingly by using one of the most common approaches:
- Distributing the work to multiple servers (horizontal scalability).
- Moving to better hardware (vertical scalability).
- Optimizing.