- The 64-bit integer is read from a fetch buffer - think of it as L0 D-cache, not RAM.
- If the item is not in the fetch buffer, then it’s read from L1 D-cache. If not there, L2 and L3 caches will be checked.
- If none of the caches contains the cache line with the integer in it, then one or more cache lines will be fetched from RAM in parallel to fill the caches.
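The lookup order in the list above can be sketched as a toy model. This is only an illustration under loud assumptions: the names `load`, `line_of` and `levels` are made up here, each level is a plain set of line numbers with no capacity limit or eviction, and the fetch buffer is treated as if it held whole lines. Real hardware is far more involved.

```python
LINE_SIZE = 64  # bytes per cache line, as described in the text


def line_of(address):
    """Return the number of the cache line that contains this byte address."""
    return address // LINE_SIZE


def load(address, levels):
    """Look the address up level by level, nearest level first.

    `levels` is a list of (name, set_of_line_numbers) pairs (hypothetical model).
    Returns the name of the level that served the load, or "RAM" on a full miss,
    and copies the line into every level nearer to the core than the one that hit.
    """
    line = line_of(address)
    for i, (name, lines) in enumerate(levels):
        if line in lines:
            served_by = name
            break
    else:
        # No level had the line: it has to come from RAM.
        served_by = "RAM"
        i = len(levels)
    # Fill all the levels that missed, so the next access hits early.
    for _, lines in levels[:i]:
        lines.add(line)
    return served_by


levels = [("fetch buffer", set()), ("L1", set()), ("L2", set()), ("L3", set())]
print(load(0x1000, levels))  # RAM: nothing is cached yet
print(load(0x1008, levels))  # fetch buffer: same 64-byte line as 0x1000
```

The second load lands in the same line as the first one, so it never gets anywhere near RAM.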
Access to RAM is not really done by the computing part of the CPU. It is done by the cache system, independently from what the CPU functional units are doing.
RAM has not been accessed in 8-bit widths in PCs for decades now. A DDR4 module has a 64-bit-wide data path (a DDR5 module splits it into two independent 32-bit subchannels). But data is not fetched into the cache 64 bits at a time. It is fetched 64 bytes at a time. So, with a single 64-bit-wide module, it takes 8 successive 64-bit transfers to fill one cache line.
The CPU sees memory in units of cache lines. It doesn’t see individual bytes until the data gets into fetch buffers. The “word size” of the memory hierarchy is 64 bytes or 512 bits.
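The arithmetic behind this is short enough to write down. A minimal sketch, assuming 64-byte lines as above; the function names are made up for illustration and are not any real API.

```python
LINE_SIZE = 64  # bytes per cache line


def split_address(address):
    """Split a byte address into (cache line number, offset within the line)."""
    return address // LINE_SIZE, address % LINE_SIZE


def transfers_per_line(bus_width_bits):
    """How many bus transfers it takes to move one full cache line."""
    return (LINE_SIZE * 8) // bus_width_bits


def lines_touched(address, size):
    """How many cache lines an access of `size` bytes at `address` spans."""
    first = address // LINE_SIZE
    last = (address + size - 1) // LINE_SIZE
    return last - first + 1


print(split_address(0x1008))     # (64, 8): line 64, byte 8 within it
print(transfers_per_line(64))    # 8 transfers over a 64-bit data path
print(lines_touched(0x1000, 8))  # 1: a naturally aligned 64-bit integer
print(lines_touched(0x103C, 8))  # 2: a misaligned one straddles two lines
```

The last case is why natural alignment matters: a 64-bit integer that straddles a line boundary needs two cache lines instead of one.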
You will want to read the rest of WEPSKAM ("What Every Programmer Should Know About Memory") to get a better big picture. All this is covered there, IIRC.
That document has to be read multiple times. At first don’t stress over details. Just make sure you read the whole thing. Then you can read it again, focusing more on details.
One “big idea” is that accessing one byte of memory costs the same as accessing (naturally aligned) 64 bytes. Reading randomly distributed 64-bit integers consumes about 8x as much memory bandwidth as reading them sequentially, and randomly distributed single bytes up to 64x, since for nearly every access a whole cache line is fetched.
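To put numbers on that, here is a rough sketch that counts how many bytes have to be fetched versus how many are actually used. It assumes 64-byte lines and that each distinct line is fetched exactly once (so no evictions and no prefetcher), and it uses a fixed 4096-byte stride as a deterministic stand-in for "random". The function names are hypothetical.

```python
LINE_SIZE = 64  # bytes per cache line


def bytes_fetched(addresses, item_size):
    """Bytes moved from memory if every distinct line touched is fetched once."""
    lines = set()
    for a in addresses:
        first = a // LINE_SIZE
        last = (a + item_size - 1) // LINE_SIZE
        lines.update(range(first, last + 1))
    return len(lines) * LINE_SIZE


def amplification(addresses, item_size):
    """Ratio of bytes fetched to bytes the program actually asked for."""
    addresses = list(addresses)
    useful = len(addresses) * item_size
    return bytes_fetched(addresses, item_size) / useful


sequential = range(0, 8 * 1024, 8)         # 1024 packed 64-bit integers
scattered = range(0, 4096 * 1024, 4096)    # 1024 items, each on its own line

print(amplification(sequential, 8))  # 1.0: every fetched byte is used
print(amplification(scattered, 8))   # 8.0: 64 bytes fetched per 8 used
print(amplification(scattered, 1))   # 64.0: 64 bytes fetched per 1 used
```

So the penalty depends on the item size: a factor of 8 for scattered 64-bit integers, and a factor of 64 for scattered single bytes.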
Another big idea is that the memory is hierarchical, and each level farther from the core is roughly an order of magnitude slower than the one before it.
Finally, modern CPUs have extensive built-in performance measurement systems. When you run your code in a profiler, you can see exactly how many cache misses there were at various levels of the cache, how many mispredicted branches there were, how many speculated results were discarded (a waste of energy!), how much cache latency there was, what the instruction throughput was, etc. A modern Intel CPU has way more transistors dedicated to performance monitoring than there were transistors in an entire 80386.