Linux::VM.Internals virtual memory · mmu · tlb · cache hierarchy · ram · swap · disk · page faults
kernel 6.8.0-generic · x86_64 · 4 KiB pages
[Diagram: path of one memory access. User-space process (PID 2741, ring 3) executes mov eax, [rbx+0x18] and issues a VA → L1 dTLB (per-core, 64 entries, ~1 cycle) → L2 STLB (unified, 1.5K entries, ~7 cycles) → MMU / hardware page walker (cr3, 4-level walk PML4→PDPT→PD→PT, ~100–250 cycles, raises #PF on a non-present PTE) → L1d/L1i (32–48 KiB, ~4 cycles) → L2 (unified, 256 KiB – 1 MiB, ~12 cycles) → L3/LLC (shared, 8–64 MiB, ~40 cycles) → DRAM (~80–120 ns) → swap on an NVMe block device (/swapfile on ext4, /dev/dm-1, ~50–150 µs per 4K block). Sample PTEs: B210·0480 → frame 0x12E4 (P|RW|U); B210·0488 → frame 0x12E5 (P|RW|U|D); B210·04A0 → not present (anon); B210·04B8 → swap_entry 0x008·1A2.]

// real-world walkthroughs

▸ example 1 · heap write — *a = 10
int *a = malloc(sizeof(int));   // glibc carves this from address space it already got via brk()/mmap()
*a = 10;                        // first ever write to this address

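malloc(4) returns a virtual address like 0x55a3·c4f0·2a80 almost instantly. glibc had already asked the kernel for address space: brk() for the main heap, or mmap(MAP_ANONYMOUS|MAP_PRIVATE) for large requests and extra arenas. The kernel recorded a VMA in mm_struct but allocated zero physical RAM. This is "lazy" allocation: pages only exist once you touch them. (The walkthrough assumes the page holding a is still untouched; in practice glibc's own chunk-header write may already have faulted it in.)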
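
Lazy allocation can be observed from user space. The sketch below is one possible way to do it, assuming Linux and a direct mmap() of one anonymous page rather than malloc(); the helper names page_is_resident and demo_lazy_allocation are made up for this illustration. It asks mincore() whether the page is resident before and after the first write.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page starting at page-aligned `addr` is resident in RAM,
 * 0 if it is not, and -1 if mincore() fails. */
int page_is_resident(void *addr)
{
    unsigned char vec = 0;
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || mincore(addr, (size_t)page, &vec) != 0)
        return -1;
    return vec & 1;   /* only the low bit is defined */
}

/* Maps one anonymous page, records residency before and after the first
 * write, then unmaps it. Returns 0 on success, -1 on failure. */
int demo_lazy_allocation(int *before, int *after)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    volatile char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *before = page_is_resident((void *)p);  /* VMA exists, no frame yet */
    p[0] = 10;                              /* first touch: minor fault */
    *after = page_is_resident((void *)p);   /* frame now backs the page */

    munmap((void *)p, (size_t)page);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

On a typical Linux system, before is expected to come back as 0 and after as 1, although sandboxed or unusual kernels may report residency differently.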

*a = 10: the CPU issues a store to that VA. The MMU checks the L1 TLB: ⚡ TLB miss, never seen this address. The page walker reads the PTE: P=0 (not present). ⚠ #PF minor fault: trap to exc_page_fault() (do_page_fault() in older kernels), which ends up in handle_mm_fault().

Kernel verifies a valid VMA exists (it does). Calls __alloc_pages_node(), and the buddy allocator hands over a free 4 KiB frame (PFN 0x12F1). Calls clear_page() to zero it (so your program never sees stale data from a previous process). Writes PTE: 0x55a3·c4f0·2000 → PFN 0x12F1 | P|RW|US|A. Returns to user space. On x86 the kernel does not load the TLB itself; the hardware does that on the next walk.

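CPU retries the store. Now: ⚡ TLB miss once more (not-present entries are never cached) → page walk finds P=1 and fills the TLB → PA = 0x12F1·000 + 0xa80 = 0x12F1·A80 → ✓ L1d cache write: 10 is stored. DRAM is not written yet (write-back cache). ⏱ ~10 µs total 💾 zero disk I/O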
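
The PTE and address arithmetic in this example can be checked with a few lines of C. This is a minimal sketch, assuming the standard x86-64 4 KiB PTE bit positions (P=0, RW=1, US=2, A=5, D=6, frame number in bits 12..51); make_pte and translate are illustrative names, not kernel functions.

```c
#include <stdint.h>

#define PAGE_SHIFT   12
#define PTE_P        (1ULL << 0)   /* present */
#define PTE_RW       (1ULL << 1)   /* writable */
#define PTE_US       (1ULL << 2)   /* user-accessible */
#define PTE_A        (1ULL << 5)   /* accessed */
#define PTE_D        (1ULL << 6)   /* dirty */
#define PTE_PFN_MASK 0x000FFFFFFFFFF000ULL  /* bits 12..51 */

/* Builds a PTE value from a page frame number and flag bits. */
uint64_t make_pte(uint64_t pfn, uint64_t flags)
{
    return ((pfn << PAGE_SHIFT) & PTE_PFN_MASK) | flags;
}

/* Translates `va` through a single leaf PTE.
 * Returns 0 and stores the physical address in *pa, or returns -1 when
 * P=0, which is the case where real hardware would raise #PF. */
int translate(uint64_t pte, uint64_t va, uint64_t *pa)
{
    if (!(pte & PTE_P))
        return -1;
    *pa = (pte & PTE_PFN_MASK) | (va & ((1ULL << PAGE_SHIFT) - 1));
    return 0;
}
```

With PFN 0x12F1 and flags P|RW|US|A, make_pte yields 0x12F1027, and translating VA 0x55a3c4f02a80 through it gives PA 0x12F1A80, the same frame base plus the 0xa80 page offset.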

▸ example 2 · file read — fread(buf, 1, 4096, f) (cold cache)
FILE *f = fopen("data.bin", "r");
char buf[4096];
fread(buf, 1, 4096, f);   // very first read of this file

fread calls read(fd, buf, 4096) — a syscall. The kernel's VFS asks the page cache: "is offset 0 of data.bin already in RAM?" Answer: no, page cache miss.

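No page fault is involved on the read() path: filemap_read() finds no cached page, a page-cache frame is allocated, and readahead calls submit_bio() to queue a read of at least 4 KiB to the NVMe block driver. The calling process sleeps in the D state (uninterruptible apart from fatal signals) and cannot run until the disk responds. The ⚠ #PF major fault path through filemap_fault() does the same work when the file is mmap()'d and touched instead of read().

~50–150 µs later: the NVMe controller finishes DMA and raises an IRQ. I/O completion marks the frame PG_uptodate and wakes the process. The scheduler re-queues it. No PTE is installed for read(); the frame lives in the page cache only.

The kernel then copies the data out with copy_to_user(): ⚡ L1/L2/L3 typically all miss (freshly DMA'd frame, never cached) → DRAM serves each 64-byte cache line → the bytes land in buf and fread returns. ⏱ ~100 µs first read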

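Now rewind and call fread again on the same offset: ✓ page cache hit, no disk I/O. The call still pays for a syscall and a 4 KiB copy (on the order of a microsecond), but the source frame is probably still warm in the CPU caches. This is why repeatedly reading the same file is dramatically faster: the OS page cache acts as a read-through DRAM cache for disk. ⚡ typically 100× or more faster on repeat read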

▸ example 3 · swap-in — array access after eviction
// System has 8 GiB RAM. Your program allocated 8 GiB.
// kswapd evicted large_array[i..i+1023] to /swapfile.
printf("%d\n", large_array[i]); // touches the evicted page

Free memory dropped below the low watermark. kswapd (a kernel thread) woke up, walked the inactive LRU list, picked cold anonymous pages, wrote them to /swapfile via the swap subsystem, and replaced each PTE with a non-present entry that encodes a swap entry (type=0, offset=0x1A2) in the remaining bits instead of a frame number.

large_array[i]: the CPU emits the VA → ⚡ L1/L2 TLB miss → page walk reads the PTE: it encodes a swap entry, P=0. ⚡ #PF swap fault → do_swap_page() is called.

Kernel looks up the swap cache (in-memory index of recently swapped pages): cache miss. Calls swap_readpage() (swap_read_folio() in newer kernels) → submit_bio() → reads 4 KiB from /swapfile at offset 0x1A2 × 4096. Process sleeps. NVMe completes → frame restored → PTE rewritten with a real frame number and P=1.
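
The offset arithmetic above (slot 0x1A2 × 4096 bytes) is easy to verify. The sketch below also packs the swap entry into a non-present PTE using a deliberately simplified layout (bit 0 = P, bits 1–5 = type, bits 6 and up = offset). That layout is an assumption made for illustration; the real x86-64 encoding in the kernel differs.

```c
#include <stdint.h>

#define SWP_PAGE_SIZE 4096ULL

struct swap_entry {
    unsigned type;    /* which swap area (0 = first swapfile) */
    uint64_t offset;  /* page-sized slot index inside that area */
};

/* Illustrative layout only, NOT the kernel's:
 * bit 0 = P (left at 0), bits 1..5 = type, bits 6.. = offset. */
uint64_t pack_swap_pte(struct swap_entry e)
{
    return (e.offset << 6) | ((uint64_t)(e.type & 0x1F) << 1);
}

struct swap_entry unpack_swap_pte(uint64_t pte)
{
    struct swap_entry e;
    e.type = (unsigned)((pte >> 1) & 0x1F);
    e.offset = pte >> 6;
    return e;
}

/* Byte position of the slot inside the swap area. */
uint64_t swap_byte_offset(struct swap_entry e)
{
    return e.offset * SWP_PAGE_SIZE;
}
```

For type=0, offset=0x1A2 the byte offset is 0x1A2000 (1,712,128 bytes into the swap area), and the packed value keeps bit 0 clear so the MMU still sees the page as not present.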

CPU retries → cold cache path → DRAM → printf finally runs. ⏱ 100–500 µs 🔥 roughly 100,000× slower than an L1 hit, or worse
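
To see where a figure like 100,000× comes from, convert the L1 latency from cycles to nanoseconds and divide. The sketch below assumes a 4 GHz core clock, which is not stated in the walkthrough; at lower clock speeds the ratio shrinks proportionally.

```c
/* Converts a latency in core cycles to nanoseconds at `ghz` GHz. */
double cycles_to_ns(double cycles, double ghz)
{
    return cycles / ghz;
}

/* How many times slower an event of `latency_us` microseconds is than an
 * L1 hit that costs `l1_cycles` cycles on a core clocked at `ghz` GHz. */
double slowdown_vs_l1_hit(double latency_us, double l1_cycles, double ghz)
{
    return (latency_us * 1000.0) / cycles_to_ns(l1_cycles, ghz);
}
```

Under that assumption a 4-cycle L1 hit takes 1 ns, so a 100 µs swap-in is about 100,000× slower and a 500 µs one about 500,000×.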

This is why swap usage is a serious performance warning sign on servers. Monitor with: vmstat 1 → watch the si (swap-in) and so (swap-out) columns. Any non-zero si means your workload is hitting evicted pages — add RAM or reduce memory pressure.
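
The same counters can be read programmatically. On Linux, /proc/vmstat exposes cumulative pswpin and pswpout page counts as "name value" lines. The helper below is a small sketch of a parser for that text format; vmstat_counter is an illustrative name, and reading the file into a buffer is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/* Looks for a line "key value" in vmstat-style text.
 * Returns 0 and stores the value in *out, or -1 if the key is absent. */
int vmstat_counter(const char *text, const char *key, unsigned long long *out)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Require a space after the key so "pswp" does not match "pswpin". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ' ')
            return sscanf(line + klen, "%llu", out) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;   /* step past the newline to the next line */
    }
    return -1;
}
```

Sampling pswpin twice, one second apart, and subtracting gives pages swapped in per second, which is the quantity behind the si column.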

// the journey

A user-space process issues a virtual address (VA). The CPU never addresses physical RAM directly: every load and store goes through the MMU, which translates VA → PA using the process's page tables rooted at %cr3.

// l1 tlb · per-core, ~1 cycle

To avoid walking page tables on every access, the MMU consults a small L1 TLB sitting next to each core (separate iTLB and dTLB, ~64 entries each). On hit, the physical address is produced almost for free.

// l2 stlb · unified, ~7 cycles

If L1 misses, the larger L2 STLB (~1.5K entries, shared between instructions and data) is checked. Out-of-order cores overlap this lookup with other in-flight work, which hides part of the cost.
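
A useful way to read these entry counts is "TLB reach": how much memory the TLB can map at once. A minimal sketch, taking "1.5K entries" to mean 1536 (an assumption) and using the 4 KiB page size stated for this system:

```c
#include <stdint.h>

/* Bytes of address space covered when every TLB entry maps one page. */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}
```

With 4 KiB pages, 64 L1 entries cover 256 KiB and 1536 STLB entries cover 6 MiB. The same 1536 entries with 2 MiB huge pages would cover 3 GiB, which is the usual argument for huge pages on large working sets.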

// hardware page walk · ~100–250 cycles

Both TLBs miss → the HW page walker traverses the four levels (PML4 → PDPT → PD → PT) by reading PTEs from RAM. Those reads go through the data caches, so a "warm" walk is much cheaper than a cold one.
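
The walker picks its index into each table straight out of the virtual address: with 4-level paging and 4 KiB pages, bits 47..39, 38..30, 29..21 and 20..12 select the PML4, PDPT, PD and PT entries, and bits 11..0 are the byte offset in the page. A short sketch of that split (struct va_parts and split_va are illustrative names):

```c
#include <stdint.h>

struct va_parts {
    unsigned pml4;    /* bits 47..39 */
    unsigned pdpt;    /* bits 38..30 */
    unsigned pd;      /* bits 29..21 */
    unsigned pt;      /* bits 20..12 */
    unsigned offset;  /* bits 11..0  */
};

/* Splits a 48-bit x86-64 virtual address into its four 9-bit table
 * indices and the 12-bit page offset. */
struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.pml4   = (unsigned)((va >> 39) & 0x1FF);
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    p.pd     = (unsigned)((va >> 21) & 0x1FF);
    p.pt     = (unsigned)((va >> 12) & 0x1FF);
    p.offset = (unsigned)(va & 0xFFF);
    return p;
}
```

For the heap address 0x55a3c4f02a80 from example 1 this gives PML4=171, PDPT=143, PD=39, PT=258, offset=0xa80.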

// minor fault · #pf, no I/O

The PTE has P=0 (or, for copy-on-write after fork(), is present but read-only) and a valid VMA exists, e.g. malloc()'d memory on first touch. The kernel allocates a zeroed frame, or copies the shared page for COW, installs the PTE, and the access is retried. No disk touched.
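
Minor faults are counted per process and can be read back with getrusage(). The sketch below, which assumes Linux and uses a made-up helper name, touches a number of fresh anonymous pages and reports how much ru_minflt grew.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Maps `npages` anonymous pages, writes one byte to each, and returns the
 * number of minor faults the process took while doing so (-1 on error). */
long minor_faults_for_first_touch(size_t npages)
{
    struct rusage before, after;
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0)
        return -1;

    size_t len = npages * (size_t)page;
    volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (getrusage(RUSAGE_SELF, &before) != 0) {
        munmap((void *)p, len);
        return -1;
    }
    for (size_t i = 0; i < npages; i++)
        p[i * (size_t)page] = 1;           /* first touch of each page */
    int rc = getrusage(RUSAGE_SELF, &after);

    munmap((void *)p, len);
    return rc != 0 ? -1 : after.ru_minflt - before.ru_minflt;
}
```

The result is usually close to npages. Transparent huge pages can make it much smaller, because one fault may then map 2 MiB at once.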

// major fault · #pf, disk

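Page is file-backed but not yet in the page cache, e.g. the first touch of a cold mmap()'d file. (A cold read() or fread() waits on the same disk I/O inside the syscall, without a page fault.) Kernel issues block I/O, waits, fills a frame, updates the PTE. The slow path: ~50–150 µs.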

// swap-in · #pf, swap_entry

PTE encodes a swap entry — page was evicted by kswapd when RAM was full. Kernel reads it back from /swapfile via the swap cache. Rewriting the PTE restores normal access.

// caches, in parallel

Once a PA is in hand, the load probes L1d → L2 → L3 in turn. A miss in all three goes to DRAM. The L1d lookup (VIPT) often runs in parallel with the TLB lookup, and only the tag comparison waits on the PA.
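
The effect of the hierarchy on average latency can be estimated as a weighted sum: the fraction of loads served at each level times that level's latency. The sketch below treats each quoted figure as the total load-to-use latency for a hit at that level and assumes DRAM costs about 300 cycles (100 ns at 3 GHz); both are simplifying assumptions, and the hit fractions in the note are invented for illustration.

```c
/* Expected load latency in cycles.
 * frac[i]   = fraction of all loads served by level i (should sum to 1)
 * cycles[i] = total latency of a load served by level i
 * Levels: 0 = L1d, 1 = L2, 2 = L3, 3 = DRAM. */
double expected_load_cycles(const double frac[4], const double cycles[4])
{
    double total = 0.0;
    for (int i = 0; i < 4; i++)
        total += frac[i] * cycles[i];
    return total;
}
```

For example, with 50% of loads hitting L1d (4 cycles), 25% L2 (12), 12.5% L3 (40) and 12.5% DRAM (300), the average is 47.5 cycles, dominated by the small share of loads that reach DRAM.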

// reclaim · keeping ram free

When free memory falls below watermarks, kswapd walks active/inactive LRU lists, writes dirty pages back, and frees clean ones. Under extreme pressure the OOM killer terminates a process.
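
The watermark logic can be pictured as a small decision function. This is a heavily simplified model, not kernel code: it assumes three per-zone thresholds (min < low < high), with kswapd woken below low, the allocating task reclaiming directly below min, and kswapd going back to sleep once free memory reaches high. The names are illustrative.

```c
enum reclaim_action {
    RECLAIM_NONE,         /* enough free pages, allocate normally */
    RECLAIM_WAKE_KSWAPD,  /* below low: background reclaim starts */
    RECLAIM_DIRECT        /* below min: the allocating task reclaims itself */
};

/* Decides what an allocation should trigger given the free page count. */
enum reclaim_action reclaim_decision(unsigned long free_pages,
                                     unsigned long wmark_min,
                                     unsigned long wmark_low)
{
    if (free_pages < wmark_min)
        return RECLAIM_DIRECT;
    if (free_pages < wmark_low)
        return RECLAIM_WAKE_KSWAPD;
    return RECLAIM_NONE;
}

/* kswapd keeps reclaiming until free memory is back at the high watermark. */
int kswapd_can_sleep(unsigned long free_pages, unsigned long wmark_high)
{
    return free_pages >= wmark_high;
}
```

The gap between low and high gives the system hysteresis, so kswapd does not wake and sleep on every allocation near the threshold.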