Optimized Implementation of ML-KEM on ARMv9-A with SVE2 and SME

Paper 2026/093

Hanyu Wei, Fudan University, Shanghai, China

Wenqian Li, Fudan University, Shanghai, China

Shiyu Shen, City University of Hong Kong, Hong Kong, China

Hao Yang, City University of Hong Kong, Hong Kong, China

Yunlei Zhao, Fudan University, Shanghai, China

Abstract

As quantum computing continues to advance, traditional public-key cryptosystems face increasing vulnerability, necessitating a global transition toward post-quantum cryptography (PQC). A primary challenge for both cryptographers and system architects is the efficient integration of PQC into high-performance computing platforms. ARM, a dominant processor architecture, has recently introduced ARMv9-A to accelerate modern workloads such as artificial intelligence and cloud computing. Leveraging its Scalable Vector Extension 2 (SVE2) and Scalable Matrix Extension (SME), ARMv9-A provides sophisticated hardware support for high-performance computing. This architectural evolution motivates the need for efficient implementations of PQC schemes on the new architecture. In this work, we present a highly optimized implementation of ML-KEM, the post-quantum key encapsulation mechanism (KEM) standardized by NIST as FIPS 203, on the ARMv9-A architecture. We redesign the polynomial computation pipeline to achieve deep alignment with the vector and matrix execution units. Our optimizations encompass refined modular arithmetic and highly vectorized polynomial operations. Specifically, we propose two NTT variants tailored to the architectural features of SVE2 and SME: the vector-based NTT (VecNTT) and the matrix-based NTT (MatNTT), which effectively utilize layer fusion and optimized data access patterns. Experimental results on the Apple M4 Pro processor demonstrate that VecNTT and MatNTT achieve performance improvements of up to $7.18\times$ and $7.77\times$, respectively, compared to the reference implementation. Furthermore, the matrix-vector polynomial multiplication, which is the primary computational bottleneck of ML-KEM, is accelerated by up to $5.27\times$. Our full ML-KEM implementation achieves a 52.47% to 60.09% speedup in key encapsulation across all security levels. 
To the best of our knowledge, this is the first work to implement and evaluate ML-KEM leveraging SVE2 and SME on real ARMv9-A hardware, providing a practical foundation for future PQC deployments on next-generation ARM platforms.
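
For orientation, the transform that VecNTT and MatNTT accelerate can be sketched in scalar form. The code below is a minimal sketch that follows the NTT and inverse NTT as specified in FIPS 203 (modulus 3329, root of unity 17); it is not the authors' SVE2 or SME code, and the names `ntt`, `intt`, `bitrev7`, and `ZETAS` are illustrative choices rather than identifiers from the paper. Each pass of the outer loop is one butterfly layer, which is the unit that a layer-fusion optimization would merge so that coefficients stay in registers across several layers.

```python
Q = 3329        # ML-KEM modulus (FIPS 203)
ZETA = 17       # primitive 256th root of unity modulo Q (FIPS 203)
N_INV = 3303    # 128^-1 mod Q, the final scaling of the inverse transform


def bitrev7(i):
    """Reverse the 7-bit binary representation of i."""
    return int(f"{i:07b}"[::-1], 2)


# Twiddle factors in bit-reversed order, as used by both transforms.
ZETAS = [pow(ZETA, bitrev7(i), Q) for i in range(128)]


def ntt(f):
    """Forward NTT of a 256-coefficient polynomial (seven butterfly layers)."""
    a = list(f)
    i = 1
    length = 128
    while length >= 2:                      # one iteration per layer
        for start in range(0, 256, 2 * length):
            z = ZETAS[i]
            i += 1
            for j in range(start, start + length):
                t = z * a[j + length] % Q   # modular multiplication
                a[j + length] = (a[j] - t) % Q
                a[j] = (a[j] + t) % Q
        length //= 2
    return a


def intt(f_hat):
    """Inverse NTT, undoing the layers of ntt() in reverse order."""
    a = list(f_hat)
    i = 127
    length = 2
    while length <= 128:
        for start in range(0, 256, 2 * length):
            z = ZETAS[i]
            i -= 1
            for j in range(start, start + length):
                t = a[j]
                a[j] = (t + a[j + length]) % Q
                a[j + length] = z * (a[j + length] - t) % Q
        length *= 2
    return [x * N_INV % Q for x in a]
```

A round trip such as `intt(ntt(f)) == f` for any list `f` of 256 integers in the range 0 to 3328 is a quick sanity check. The inner `j` loop has no dependencies between iterations, which is what makes it a natural target for vector execution.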

Metadata

Available format(s): PDF
Category: Implementation
Publication info: Preprint.
Keywords: Post-Quantum Cryptography, ML-KEM, NTT, SVE2, SME, ARMv9-A
Contact author(s): hywei24 @ m fudan edu cn
liwq24 @ m fudan edu cn
crypto @ sher1e dev
crypto @ d4rk dev
ylzhao @ fudan edu cn
History: 2026-01-23: approved; 2026-01-20: received
Short URL: https://ia.cr/2026/093
License: CC BY

BibTeX

@misc{cryptoeprint:2026/093,
      author = {Hanyu Wei and Wenqian Li and Shiyu Shen and Hao Yang and Yunlei Zhao},
      title = {Optimized Implementation of {ML}-{KEM} on {ARMv9}-A with {SVE2} and {SME}},
      howpublished = {Cryptology {ePrint} Archive, Paper 2026/093},
      year = {2026},
      url = {https://eprint.iacr.org/2026/093}
}