Environment
- MySQL 8.0.36, Community Edition
- 256 GB RAM, innodb_buffer_pool_size = 192G
- Data volume ~1.2 TB (SSD, NVMe)
- Master-slave replication (GTID + ROW format), used for planned failover
Background & Research
We conduct a planned master-slave failover drill every week. After switching, the new master's Buffer Pool is "cold", causing a noticeable rise in query latency for the first 10–15 minutes, with some complex reporting queries even timing out. I reviewed the MySQL official documentation on innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup, and performed the following tests:
- Enabled innodb_buffer_pool_dump_at_shutdown = ON and innodb_buffer_pool_dump_pct = 80 on the old master
- After failover, the new master's ib_buffer_pool file was about 2.1 GB
- The startup log showed InnoDB: Buffer pool(s) load completed at 240303 03:12:45, but the high latency persisted; the affected window only shrank from ~15 minutes to ~8 minutes
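
For reference, the dump and load can also be triggered by hand instead of only at shutdown and startup. The sketch below only assembles the statement sequence as strings and executes nothing. The helper name manual_dump_load_statements is hypothetical; the system and status variable names are the stock MySQL 8.0 ones as far as I know, so please correct me if any of them behave differently in 8.0.36.

```python
def manual_dump_load_statements(dump_pct: int = 80) -> dict:
    """Build the SQL one might run by hand; nothing is executed here.

    The helper itself is hypothetical. The variable names are assumed to be
    the standard MySQL 8.0 ones.
    """
    if not 1 <= dump_pct <= 100:
        raise ValueError("innodb_buffer_pool_dump_pct must be between 1 and 100")
    return {
        # Run on the server whose buffer pool state should be recorded.
        "dump": [
            f"SET GLOBAL innodb_buffer_pool_dump_pct = {dump_pct};",
            "SET GLOBAL innodb_buffer_pool_dump_now = ON;",
            "SHOW STATUS LIKE 'Innodb_buffer_pool_dump_status';",
        ],
        # Run on the server that should read the recorded pages back in.
        "load": [
            "SET GLOBAL innodb_buffer_pool_load_now = ON;",
            "SHOW STATUS LIKE 'Innodb_buffer_pool_load_status';",
        ],
    }


if __name__ == "__main__":
    for phase, statements in manual_dump_load_statements(80).items():
        print(f"-- {phase}")
        for sql in statements:
            print(sql)
```

Polling Innodb_buffer_pool_load_status until it reports completion would be one way to tell when the load has finished without restarting the server and reading the startup log.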
I also consulted Percona blog posts about --innodb-buffer-pool-load-abort and manually triggering LOAD, but could not find a systematic comparison between dump/restore and natural warm-up (no dump, letting traffic gradually fill the pool) specifically for failover scenarios.
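
In case it helps frame such a comparison, this is roughly how I would quantify "cold" versus "warm": sample the Innodb_buffer_pool_read_requests and Innodb_buffer_pool_reads status counters at two points in time and compute the fraction of logical reads that had to go to disk in between. This is a minimal sketch; PoolCounters and miss_ratio are hypothetical names, and fetching the counters from the server is left out.

```python
from dataclasses import dataclass


@dataclass
class PoolCounters:
    """One snapshot of two InnoDB status counters (both cumulative)."""
    read_requests: int  # Innodb_buffer_pool_read_requests: logical read requests
    reads: int          # Innodb_buffer_pool_reads: reads that had to go to disk


def miss_ratio(before: PoolCounters, after: PoolCounters) -> float:
    """Fraction of logical reads between two snapshots that missed the pool."""
    delta_requests = after.read_requests - before.read_requests
    delta_reads = after.reads - before.reads
    if delta_requests <= 0:
        # No traffic in the interval (or the counters were reset).
        return 0.0
    return delta_reads / delta_requests


if __name__ == "__main__":
    # Made-up numbers purely for illustration, not measurements.
    t0 = PoolCounters(read_requests=1_000, reads=100)
    t1 = PoolCounters(read_requests=11_000, reads=600)
    print(f"miss ratio in interval: {miss_ratio(t0, t1):.2%}")
```

Sampling this once per minute after the switch, once with dump/restore and once with natural warm-up, would give two curves that can be compared directly.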
Specific Questions
- In MySQL 8.0, does innodb_buffer_pool_dump/restore only recover the "address mapping" of pages rather than the actual page content? If so, does the first access to these pages after failover still trigger physical IO, meaning the latency spike is merely "dispersed" rather than eliminated?
- For a failover scenario, are there better warm-up strategies (e.g., running SELECT /*+ JOIN_ORDER(...) */ against hot tables on the slave before switching; or using Percona's innodb_buffer_pool_load_now combined with specific SQL warm-up)? If yes, are there any best practices or benchmark data?
- When the Buffer Pool size (192G) is much larger than the number of pages the dump file can describe, does MySQL 8.0's LRU algorithm rapidly evict pages that were just loaded during restore, thereby weakening the restore effect?
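
To put numbers on the last question, here is a back-of-the-envelope sketch of how many pages are involved. It assumes the default 16 KiB innodb_page_size (I have not stated the page size above), treats the whole pool as data pages, and ignores that innodb_buffer_pool_dump_pct is applied per buffer pool instance, so the results are upper bounds rather than exact figures. The function names are hypothetical.

```python
GIB = 1024 ** 3


def pool_pages(pool_bytes: int, page_size: int = 16 * 1024) -> int:
    """Upper bound on the number of pages a buffer pool of this size holds."""
    return pool_bytes // page_size


def dumped_pages(total_pages: int, dump_pct: int) -> int:
    """Upper bound on pages recorded by a dump at the given percentage."""
    return total_pages * dump_pct // 100


if __name__ == "__main__":
    total = pool_pages(192 * GIB)      # 192G as configured above
    recorded = dumped_pages(total, 80) # innodb_buffer_pool_dump_pct = 80
    print(total)     # 12582912
    print(recorded)  # 10066329
    # Volume that would have to be read from disk if every recorded page
    # is fetched during the load.
    print(f"{recorded * 16 * 1024 / GIB:.1f} GiB")
```

So under these assumptions a full load means bringing on the order of ten million pages back from disk, which is the scale I have in mind when asking whether the restored pages survive in the LRU list.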
Expectations
I would like to understand how to minimize cold-cache latency after failover in a large Buffer Pool scenario. Source-level explanations (e.g., logic in buf0dump.cc or buf0lru.cc) are also very welcome.