How Alibaba Cloud deploys eBPF for large-scale load balancing
The eBPF Foundation has released a case study analysing how Alibaba Cloud uses an eBPF-powered solution to address growing performance and scalability challenges in its Layer 7 load-balancing systems.
The company's internal framework, Hermes, reportedly processes more than 10 million requests per second while aiming to reduce operational overhead and improve reliability for millions of cloud tenants.
Load balancing pressure
The case study found that as Alibaba Cloud's global infrastructure expanded, traditional Linux I/O event mechanisms such as epoll and SO_REUSEPORT became limiting factors. These mechanisms led to fairness issues, imbalances between workers, and poor visibility into worker states, ultimately affecting tenant performance and system stability.
The company identified the need for an approach that would both extend kernel scheduling behaviour and incorporate real-time data from userspace without introducing the operational risks of modifying the Linux kernel at scale.
In response, Alibaba Cloud developed Hermes, a userspace-directed I/O event notification framework underpinned by eBPF. Hermes allows the Linux kernel to make scheduling decisions informed by real-time feedback from application workers.
Userspace processes continuously publish metrics on worker availability, the number of pending events, and active connections to shared memory. An eBPF program reads and processes this data, allowing adaptive and granular connection management.
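The feedback path described above can be illustrated with a minimal userspace simulation in Python. It is a sketch under stated assumptions, not Hermes code: the `WorkerStatus` record, the `availability_bitmap` function, and both thresholds are hypothetical names and values chosen for illustration, and the case study does not publish the real data layout or limits.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the case study does not state real values.
MAX_PENDING_EVENTS = 64
MAX_ACTIVE_CONNECTIONS = 10_000


@dataclass
class WorkerStatus:
    """One worker's self-reported state (assumed layout, for illustration)."""
    available: bool
    pending_events: int
    active_connections: int


def availability_bitmap(statuses):
    """Return an int whose bit i is set when worker i may take new connections."""
    bitmap = 0
    for i, s in enumerate(statuses):
        if (s.available
                and s.pending_events < MAX_PENDING_EVENTS
                and s.active_connections < MAX_ACTIVE_CONNECTIONS):
            bitmap |= 1 << i
    return bitmap
```

In this sketch a worker that reports itself unavailable, or that exceeds either threshold, simply has its bit cleared, so a reader of the bitmap never needs to inspect the individual metrics.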
The Hermes framework's architecture includes closed-loop scheduling between userspace and kernel, two-stage load distribution, and lock-free synchronisation. Coordination between userspace and kernel leverages atomic operations on eBPF array maps, designed to avoid typical lock contention issues and reduce system overhead.
In production, Hermes uses bitwise operations to minimise CPU costs and implements safety checks to prevent programming errors from disrupting service, both of which help keep operation stable.
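To show why bitwise operations keep selection cheap, the following Python sketch picks a worker from a bitmap of eligible workers using only bit tricks. It is an illustrative assumption about how such a selection could work, not the documented Hermes algorithm: `pick_worker`, the bitmap encoding, and the use of a connection hash are all hypothetical. The empty-bitmap check stands in for the kind of safety check the case study mentions.

```python
def pick_worker(bitmap, conn_hash):
    """Return the index of the (conn_hash mod popcount)-th set bit in bitmap.

    Returns None when no bit is set, so the caller can fall back to its
    default behaviour instead of selecting an invalid worker.
    """
    count = bin(bitmap).count("1")  # number of eligible workers
    if count == 0:
        return None
    n = conn_hash % count
    for _ in range(n):
        bitmap &= bitmap - 1  # clear the lowest set bit
    # Isolate the lowest remaining set bit and convert it to an index.
    return (bitmap & -bitmap).bit_length() - 1
```

For example, with bits 0, 2, 3 and 5 set (`0b101101`), a hash of 1 selects worker 2 and a hash of 7 selects worker 5, because 7 mod 4 is 3 and the fourth set bit is bit 5.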
Operational outcomes
Hermes has delivered a number of measurable results for Alibaba Cloud. The company reports a 19 per cent reduction in infrastructure unit costs, driven by an increase in safe CPU utilisation from 30 per cent to 40 per cent. Daily worker hangs, measured as health probes that exceed latency thresholds, have fallen by 99.8 per cent. The load-balancing platform maintains consistent throughput and latency across varied workload types, and has proven stable at high scale, running on 100,000 CPU cores and handling over 10 million requests per second for more than two years.
Hermes' ability to coordinate kernel scheduling with userspace worker status has also contributed to enhanced isolation and fairness in multi-tenant environments, supporting diverse application requirements and reducing the risk of tenant performance imbalances.
eBPF justification
Alibaba Cloud chose eBPF for its ability to extend kernel behaviour safely and efficiently without modifying kernel code. eBPF's sandboxing and its portability through CO-RE (Compile Once - Run Everywhere) provide compatibility across kernel versions and protect production systems from the faults that custom kernel patches could cause.
Lock-free synchronisation between userspace and the kernel is achieved using eBPF array maps, while atomic operations support rapid updates to scheduling data without contention.
Expansion plans
The company plans to generalise the Hermes framework for broader epoll-based workloads, potentially making it available to software such as Redis, Envoy, and Nginx via a shared SDK. It also intends to introduce group-based scheduling models that could support cache-aware balancing and further improve system efficiency.