implement something similar to classic rate limiting
The shared state reflects the sum of the states of all individual replicas.
Here is the problem statement I heard.
You have web client Request Types A, B, ..., Z,
each having a weight reflecting the typical elapsed CPU + I/O time
to service the request.
For example A might take 400 msec and B might take 600 msec.
We see (client_id, server_id, request_type, "begin") events arriving,
and matching "end" events,
via Kafka or a similar pub/sub system.
So we increment a server metric or a client metric for "begin",
and decrement it for "end".
Each event is labeled with a unique GUID, which makes a good
PK
for an RDBMS that logs events.
And presumably we use that for making cloud scale-up decisions,
and for rejecting or queueing certain client requests.
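
A minimal sketch of that bookkeeping (Python; the weight table, key names, and
fallback weight are my own illustrative assumptions, and the events could
arrive from any Kafka-style consumer loop):

```python
import redis

# Hypothetical weight table: typical msec of CPU + I/O per request type.
WEIGHTS = {"A": 400, "B": 600}

r = redis.Redis()

def handle_event(event: dict) -> None:
    """event: {"guid": ..., "client_id": ..., "server_id": ...,
               "request_type": ..., "phase": "begin" or "end"}"""
    weight = WEIGHTS.get(event["request_type"], 500)  # fallback weight, assumed
    delta = weight if event["phase"] == "begin" else -weight
    # Bump both the per-server and the per-client metric.
    r.incrby(f"load:server:{event['server_id']}", delta)
    r.incrby(f"load:client:{event['client_id']}", delta)
```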
failure behavior
During ugly events like network partition or server reboot,
we choose to "fail open", so we permissively accept all client requests.
terminology
I feel you're getting hung up on the term "replica",
which suggests they are all similar to one another.
It sounds like what you want is to take a hash of client_id,
modulo N servers,
to decide which "shard" will exclusively serve that client.
(This occasionally changes when servers come and go.)
Now each server can make purely local throttling decisions,
without worrying about what its peers are up to.
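
A sketch of that mapping (using sha1 rather than Python's per-process
hash(), so every router computes the same answer):

```python
import hashlib

def shard_for(client_id: str, num_servers: int) -> int:
    # Stable across processes, unlike the builtin hash().
    digest = hashlib.sha1(client_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_servers
```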
eventually consistent
Servers log events to an RDBMS which survives reboot events.
Log records get replicated so all servers eventually learn about all history.
The duration of each event is from its "begin" to its "end" entry.
If the "end" entry is missing then we impose some arbitrary timeout,
so it ends after e.g. 60 seconds.
The realtime metrics maintained by redis are simply a cache
of recent log records.
The redis TTL is a convenient way of making ancient records gracefully age out.
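
One way to get that aging behavior (a variant of the plain counters sketched
above: a per-server sorted set scored by begin time and trimmed at sixty
seconds, rather than a per-key TTL; names are illustrative):

```python
import time
import redis

TIMEOUT_SECONDS = 60
r = redis.Redis()

def on_begin(guid: str, server_id: str) -> None:
    r.zadd(f"inflight:{server_id}", {guid: time.time()})

def on_end(guid: str, server_id: str) -> None:
    r.zrem(f"inflight:{server_id}", guid)

def current_load(server_id: str) -> int:
    key = f"inflight:{server_id}"
    # Entries whose "end" never arrived fall out of the window here.
    r.zremrangebyscore(key, "-inf", time.time() - TIMEOUT_SECONDS)
    return r.zcard(key)
```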
recovery
When routers and servers come and go, the system propagates "recent"
log entries so they are visible to most servers.
A simple SELECT query can be used to restore metrics maintained on redis.
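
Roughly like this, for example (sqlite3 stands in for whatever RDBMS you
actually use; table and column names are assumptions):

```python
import sqlite3
import redis

r = redis.Redis()

def rebuild_metrics(db_path: str = "events.db") -> None:
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT server_id,
               SUM(CASE WHEN phase = 'begin' THEN weight ELSE -weight END)
        FROM log
        WHERE timestamp > datetime('now', '-60 seconds')
        GROUP BY server_id
        """
    ).fetchall()
    pipe = r.pipeline()
    for server_id, load in rows:
        pipe.set(f"load:server:{server_id}", max(0, int(load or 0)))
    pipe.execute()
```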
EDIT
- Currently the scheduling system schedules requests to a few types of services, according to the user request, each type of service being independent and having its own "scheduling stats", which makes it a bit harder to implement your proposed solution.
I have no idea why you'd say that.
Combine requests for Type A .. Z into a single unified log stream
(single RDBMS table, with an enum request_type
column).
Maintain Stat A through Stat Z in realtime,
and rebuild them from the log when a reboot / recovery operation happens.
Sum up Stats A + B + C in realtime when making admission decisions
if e.g. the aggregate load is meaningful in the Business Domain,
perhaps because a server devoting CPU cycles to any of those
predicts slower response times for any other requests that arrive.
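
For example, an admission check that sums per-type stats could look like this
(assuming hypothetical stat:&lt;type&gt; keys holding each type's weighted
in-flight load):

```python
import redis

r = redis.Redis()

def aggregate_load(types=("A", "B", "C")) -> int:
    values = r.mget([f"stat:{t}" for t in types])
    return sum(int(v) for v in values if v is not None)

def admit(threshold: int = 10_000) -> bool:
    # Reject (or queue) once the combined A+B+C load crosses the threshold.
    return aggregate_load() < threshold
```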
- Would it make sense in your opinion to deploy a separate scheduler for each type of service?
As explained, "no."
- Could you please clarify the role of redis in your solution a bit more?
Events happen.
There are two event sinks: the log and the redis metrics,
which are essentially a cache that reflects recently logged activity.
During normal operation the two are in sync.
That is, each "begin" redis increment is paired with an "end"
redis decrement, and the currently reported redis metric matches up
with SELECT ... FROM log WHERE ... AND timestamp > sixty_seconds_ago.
To make scheduling decisions, normally you only consult redis.
However, sometimes Bad Things happen, such as network partition
or server failure / reboot.
In which case we're left wondering: What is the Source of Truth?
Simple, it's in the RDBMS.
Quickly run a report with that SELECT query,
re-populate the redis metrics from that,
and continue on your merry way as though nothing had happened.
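
If it helps, here is that "matches up" claim written as a check you could
actually run (same hypothetical table, columns, and keys as the rebuild sketch
above):

```python
import sqlite3
import redis

r = redis.Redis()

def metric_matches_log(server_id: str, db_path: str = "events.db") -> bool:
    conn = sqlite3.connect(db_path)
    (from_log,) = conn.execute(
        """
        SELECT COALESCE(SUM(CASE WHEN phase = 'begin' THEN weight
                                 ELSE -weight END), 0)
        FROM log
        WHERE server_id = ?
          AND timestamp > datetime('now', '-60 seconds')
        """,
        (server_id,),
    ).fetchone()
    from_redis = int(r.get(f"load:server:{server_id}") or 0)
    return from_redis == int(from_log)
```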
- Would you recommend implementing the hash mechanism using nginx?
Sure, that is a possibility.
Typically in a High Availability situation people will use
dedicated F5 load balancer equipment, roll their own HA proxies,
contract with AWS for load balancing services, or do something similar.
The key is that local health checks across the LAN keep the load balancer
aware of server status, and upon noticing an unresponsive server
the balancer will immediately stop sending it traffic.
So e.g. if you had five servers and one goes belly up, the balancer
might switch from hash(ip) % 5
to hash(ip) % 4
when assigning
requests to a shard.
On a slower timescale, reports of "server died!" or "queues are big!"
will trigger horizontal scale-out.
Plan on it taking half a minute or a minute to spin up a new VM server
if you didn't have any on standby.
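
The routing itself is the same hashing idea as before, applied to whatever
list of servers currently passes health checks (a sketch, not any particular
balancer's config):

```python
import hashlib

def pick_server(client_ip: str, healthy_servers: list[str]) -> str:
    # When a server fails its health check, it drops out of the list and
    # the modulus shrinks from e.g. 5 to 4 automatically.
    digest = hashlib.sha1(client_ip.encode()).digest()
    return healthy_servers[int.from_bytes(digest[:8], "big") % len(healthy_servers)]
```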
- Or should I use built-in Kubernetes components to do so?
Why are you asking random internet strangers these things?
You know your own use case, cost-of-downtime, budget, and capabilities
far better than I do, so you decide.
K8s is a wonderful technology for scaling resilient services up and down.
But it is complex, with many config knobs, and it offers less than 100% uptime.
Sometimes hosts run out of RAM or suffer from config errors or versionitis,
and simple restarts are not sufficient to restore user-visible service.
Diagnosing the root cause can be complex.
Only go the k8s route if you have sufficient in-house expertise
to deal with the hiccups.
Based on the current set of questions,
I'm skeptical that k8s would be an excellent fit for your business needs.
- Also, this means that each client can have only 1 replica to handle it; couldn't this become a bottleneck later on?
Again I have no idea why you'd say that.
I hope you read the "modulo num_servers" part?
A tacit assumption here is that a single server can support
the requests of at least K clients.
If it's the other way around, then your question was far too vague.
It did not occur to me that we'd expect to see a single client
saturate the capacity of multiple servers, nor that we would
wish to support such behavior.
Suppose you have a hundred concurrent clients visiting the site,
and in steady state there are four servers.
Now the load balancer is using hash(ip) % 4
when routing requests to servers.
Then load increases a little, queue depth increases,
and we cross a hysteresis threshold which triggers
spinning up a fifth server.
Everyone sees a "server five became healthy!" heartbeat
and invalidates their stats, starting them anew from zero
(or perhaps in your use case you prefer to let them
age out in the usual way after one minute).
The load balancer starts using hash(ip) % 5.
Life is good, and settles down into a new steady state.
There can be a transition interval, so immediately after
the new server becomes available we roll a random number
and 90% of the time we use "mod 4" and just 10% of the
time we send the new guy traffic with "mod 5".
Over the course of a minute or so we ramp up the
new guy's traffic to 20% of offered load, 50%, 90%,
then we stop using random, declare arrival at steady state,
and just always use "mod 5".
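
A sketch of that ramp (the fractions, the window length, and the assumption
that the new server sits last in the list are all illustrative):

```python
import hashlib
import random
import time

def pick_modulus(old_n: int, new_n: int, ramp_start: float,
                 ramp_seconds: float = 60.0) -> int:
    """Use old_n early in the ramp, new_n more and more often as time passes."""
    fraction_new = min(1.0, max(0.0, (time.time() - ramp_start) / ramp_seconds))
    return new_n if random.random() < fraction_new else old_n

def route(client_ip: str, servers: list[str], ramp_start: float) -> str:
    # Assumes the newly added server is the last entry, so "mod old_n"
    # never selects it and "mod new_n" can.
    n = pick_modulus(len(servers) - 1, len(servers), ramp_start)
    digest = hashlib.sha1(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % n]
```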