Skip to main content
2 of 2
added 413 characters in body

Fault tolerance in aggregated distributed state

I have a scheduling system that is horizontally scaled, and stores shared state in a redis key.

The purpose of the system is to implement something similar to classic rate limiting, but a bit different in that I want to track the amount of currently active requests, and the sum of the "weights" of each request (as some requests are heavier than others in my case). The reason to use redis is to ensure minimal footprint of the system on requests latency.

The shared state reflects the sum of the states of all individual replicas (both active requests and sum of the weights), meaning that each replica individually increases and decreases the shared state.

Is there a known solution for fault tolerance in this scenario? If a replica suddenly crashes, it will never "clear" it's value from the shared state.

One possible solution is to develop another component which watches for replica failures, and decreases from the shared state when a replica crashes.

The problem with this solution is that the system relies upon external components to be hermetic. Is there any other way I can model my shared state to be more resilient to such problems?