*a = 10
int *a = malloc(sizeof(int)); // glibc hands out address space it got from the kernel via brk() or mmap()
*a = 10;                      // first ever write to this address
malloc(sizeof(int)) returns a virtual address like 0x55a3·c4f0·2a80 almost instantly. glibc had already reserved a chunk of address space from the kernel, either by growing the heap with brk() or, for large requests, with mmap(MAP_ANONYMOUS|MAP_PRIVATE). The kernel recorded a VMA in mm_struct but allocated zero physical RAM. This is "lazy" allocation: physical pages only exist once you touch them.
*a = 10 — CPU issues a store to that VA. The MMU checks the L1 TLB: ⚡ TLB miss — never seen this address. Page walker reads the PTE: P=0 (not present). ⚠ #PF minor fault — trap to do_page_fault().
Kernel verifies a valid VMA exists (it does). Calls __alloc_pages_node(): the buddy allocator hands over a free 4 KiB frame (PFN 0x12F1). Calls clear_page() to zero it (so your program never sees stale data from a previous process). Writes PTE: 0x55a3…2000 → PFN 0x12F1 | P|RW|US|A. Returns to user space.
CPU retries the store. Now the page walk finds P=1 and the hardware fills the TLB → PA = 0x12F1·000 + 0xa80 = 0x12F1·a80 → ✓ L1d cache write → 10 is stored. DRAM not written yet (write-back cache). ⏱ ~10 µs total 💾 zero disk I/O
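The lazy allocation described above can be observed from user space. The following is a minimal sketch, assuming Linux and getrusage(); the helper name count_first_touch_faults is hypothetical. It maps anonymous memory directly with mmap() rather than malloc(), so that no allocator bookkeeping touches the pages first, then writes one byte per page and reports how many minor faults the writes caused.

```c
#define _GNU_SOURCE            /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of minor faults taken while
 * touching `npages` fresh anonymous pages, or -1 on error. */
long count_first_touch_faults(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0)
        return -1;
    size_t len = npages * (size_t)page;

    /* Reserves address space only: the kernel records a VMA, no frames yet. */
    volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    struct rusage before, after;
    if (getrusage(RUSAGE_SELF, &before) != 0) {
        munmap((void *)p, len);
        return -1;
    }

    /* First write to each page: PTE not present -> minor fault -> zeroed frame. */
    for (size_t i = 0; i < npages; i++)
        p[i * (size_t)page] = 10;

    int rc = getrusage(RUSAGE_SELF, &after);
    munmap((void *)p, len);
    if (rc != 0)
        return -1;

    return after.ru_minflt - before.ru_minflt;
}
```

On a typical Linux system with 4 KiB pages, count_first_touch_faults(64) should come out close to 64, although transparent huge pages or unrelated kernel activity can shift the number.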
fread(buf, 1, 4096, f) (cold cache)
FILE *f = fopen("data.bin", "r");
char buf[4096];
fread(buf, 1, 4096, f); // very first read of this file
fread calls read(fd, buf, 4096) — a syscall. The kernel's VFS asks the page cache: "is offset 0 of data.bin already in RAM?" Answer: no, page cache miss.
⚠ Page cache miss, the slow path: filemap_read() finds nothing cached, so the kernel allocates a page-cache frame, calls submit_bio(), and queues a 4 KiB read request to the NVMe block driver. The calling process is put to sleep in uninterruptible (D) state; it truly cannot run until the disk responds. Because this is a read() syscall, no #PF is raised. Touching an mmap()'d file instead reaches the same I/O path through filemap_fault() and is counted as a major fault.
~50–150 µs later: NVMe controller finishes DMA, raises an IRQ. Block layer marks the frame PG_uptodate, unlocks it, and wakes the process. The scheduler re-queues it.
read() resumes: the kernel copies the 4096 bytes from the page-cache frame into buf. ⚡ L1/L2/L3 all miss (cold frame, never cached) → DRAM serves each 64-byte cache line → fread returns. ⏱ ~100 µs first read
Now rewind and call fread again on the same offset: ✓ page cache hit → no block I/O, just a copy out of RAM that typically completes in about a microsecond. This is why repeatedly reading the same file is dramatically faster: the OS page cache acts as a read-through DRAM cache for disk. ⚡ roughly 100× faster on repeat read
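The cold-versus-warm difference can be measured with a sketch like the one below, assuming Linux. The helper names are hypothetical. It uses pread() instead of fread() so that stdio buffering does not hide the second read, and it uses posix_fadvise(POSIX_FADV_DONTNEED) as a best-effort way to ask the kernel to drop the file's cached pages first. That call is advisory: dirty or in-use pages may stay resident, so the "cold" read is not guaranteed to reach the disk.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

static long long elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1000000000LL + (b->tv_nsec - a->tv_nsec);
}

/* Hypothetical helper: time one pread() of the first 4 KiB of fd.
 * Returns nanoseconds, or -1 if fewer than 4096 bytes were read. */
static long long timed_pread_ns(int fd)
{
    char buf[4096];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = pread(fd, buf, sizeof buf, 0);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return n == (ssize_t)sizeof buf ? elapsed_ns(&t0, &t1) : -1;
}

/* Hypothetical helper: read the same 4 KiB twice and report both timings.
 * Returns 0 on success, -1 on error. */
int cold_warm_read_ns(const char *path, long long *cold_ns, long long *warm_ns)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Best effort: ask the kernel to evict this file from the page cache. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    *cold_ns = timed_pread_ns(fd);  /* may have to wait for the disk */
    *warm_ns = timed_pread_ns(fd);  /* should be served from the page cache */

    close(fd);
    return (*cold_ns < 0 || *warm_ns < 0) ? -1 : 0;
}
```

Calling cold_warm_read_ns("data.bin", &cold, &warm) on a file of at least 4 KiB that has not been written recently should show the first number well above the second; the exact ratio depends on the storage device and on whether the eviction hint took effect.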
// System has 8 GiB RAM. Your program allocated 8 GiB.
// kswapd evicted large_array[i..i+1023] to /swapfile.
printf("%d\n", large_array[i]); // touches the evicted page
Free memory fell below the low watermark. kswapd (a kernel thread) woke up, walked the inactive LRU list, picked cold anonymous pages, wrote them to /swapfile via the swap subsystem, and replaced each PTE with a non-present swap entry: (type=0, offset=0x1A2) is encoded in bits 1–63 of the PTE instead of a frame number, and P stays 0.
large_array[i] — CPU emits VA → ⚡ L1/L2 TLB miss → page walk reads PTE: it encodes a swap entry, P=0. ⚡ #PF swap fault — do_swap_page() is called.
Kernel looks up the swap cache (in-memory index of recently swapped pages) — cache miss. Calls swap_readpage() → submit_bio() → reads 4 KiB from /swapfile at offset 0x1A2 × 4096. Process sleeps. NVMe completes → frame restored → PTE rewritten with real frame number + P=1.
CPU retries → cold cache path → DRAM → printf finally runs. ⏱ 100–500 µs 🔥 up to 100,000× slower than L1 hit
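To make the swap-entry idea concrete, here is a toy encoding in C. The bit layout is an assumption chosen for readability (bit 0 is P, a 5-bit swap type follows, and the offset takes the remaining bits). It is not the layout the x86-64 kernel actually uses, and all toy_ names are hypothetical. The sketch only shows how a non-present PTE can carry enough information to locate the page in the swap file.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PTE_PRESENT   0x1ULL   /* bit 0: P */
#define TOY_SWP_TYPE_BITS 5        /* assumed width of the swap-area index */
#define TOY_SWP_TYPE_MASK 0x1fULL
#define TOY_PAGE_SIZE     4096ULL

/* Pack (type, offset) into bits 1..63 and leave P = 0. */
uint64_t toy_swp_encode(unsigned type, uint64_t offset)
{
    return ((offset << TOY_SWP_TYPE_BITS) | (type & TOY_SWP_TYPE_MASK)) << 1;
}

/* A non-present, non-zero PTE is treated as a swap entry in this toy model.
 * (An all-zero PTE means "nothing mapped here at all".) */
bool toy_pte_is_swap(uint64_t pte)
{
    return !(pte & TOY_PTE_PRESENT) && pte != 0;
}

unsigned toy_swp_type(uint64_t pte)
{
    return (unsigned)((pte >> 1) & TOY_SWP_TYPE_MASK);
}

uint64_t toy_swp_offset(uint64_t pte)
{
    return pte >> (1 + TOY_SWP_TYPE_BITS);
}

/* Byte position of the page inside the swap file: slot number times page size. */
uint64_t toy_swp_file_offset(uint64_t pte)
{
    return toy_swp_offset(pte) * TOY_PAGE_SIZE;
}
```

With the values from the walkthrough, toy_swp_encode(0, 0x1A2) decodes back to type 0 and offset 0x1A2, and toy_swp_file_offset() gives 0x1A2 × 4096 = 0x1A2000, the byte position the kernel would read from.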
This is why swap usage is a serious performance warning sign on servers. Monitor with: vmstat 1 → watch the si (swap-in) and so (swap-out) columns. Any non-zero si means your workload is hitting evicted pages — add RAM or reduce memory pressure.
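If you would rather poll swap activity from your own code than parse vmstat output, the kernel exposes cumulative counters in /proc/vmstat, including pswpin and pswpout (pages swapped in and out since boot). The sketch below assumes Linux; the function name read_vmstat_counter is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns the value of counter `name` from /proc/vmstat,
 * or -1 if the file cannot be read or the counter is not present. */
long long read_vmstat_counter(const char *name)
{
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f)
        return -1;

    char key[64];
    long long value;
    long long result = -1;

    /* Each line has the form "<name> <value>". */
    while (fscanf(f, "%63s %lld", key, &value) == 2) {
        if (strcmp(key, name) == 0) {
            result = value;
            break;
        }
    }

    fclose(f);
    return result;
}
```

Sampling read_vmstat_counter("pswpin") once per second and printing the difference between consecutive samples gives a per-second swap-in rate in pages; a rate that stays above zero points to the same problem as a non-zero si column.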
A user-space process issues a virtual address (VA). The CPU never addresses physical RAM directly: every load and store is filtered by the MMU, which translates VA → PA using the process's page tables, rooted at the physical address held in %cr3.
To avoid walking page tables on every access, the MMU consults a small L1 TLB sitting next to each core (separate iTLB and dTLB, ~64 entries each). On hit, the physical address is produced almost for free.
If L1 misses, the larger L2 STLB (~1.5K entries, shared between instructions and data) is checked. Out-of-order execution lets the core keep working on independent instructions during this lookup, which hides much of its cost.
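The hit-or-miss behavior of a TLB can be imitated with a toy model. The sketch below is a deliberate simplification: it is direct-mapped with 64 entries, whereas real TLBs are set-associative and their replacement policies are not public in detail. All names (toy_tlb, fake_walk) are hypothetical, and fake_walk stands in for the hardware page walker with an arbitrary mapping.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_TLB_ENTRIES 64   /* roughly the L1 dTLB size quoted above */
#define TOY_PAGE_SHIFT  12   /* 4 KiB pages */

struct toy_tlb {
    uint64_t vpn[TOY_TLB_ENTRIES];   /* virtual page number (the tag) */
    uint64_t pfn[TOY_TLB_ENTRIES];   /* cached physical frame number  */
    bool     valid[TOY_TLB_ENTRIES];
    unsigned long hits, misses;
};

void toy_tlb_reset(struct toy_tlb *t)
{
    memset(t, 0, sizeof *t);
}

/* Stand-in for the page walker: an arbitrary, made-up VPN -> PFN mapping. */
uint64_t fake_walk(uint64_t vpn)
{
    return vpn + 0x1000;
}

/* Translate va. On a miss, ask `walk` for the frame and cache the answer. */
uint64_t toy_tlb_translate(struct toy_tlb *t, uint64_t va,
                           uint64_t (*walk)(uint64_t vpn))
{
    uint64_t vpn  = va >> TOY_PAGE_SHIFT;
    unsigned slot = (unsigned)(vpn % TOY_TLB_ENTRIES);

    if (t->valid[slot] && t->vpn[slot] == vpn) {
        t->hits++;
    } else {
        t->misses++;
        t->vpn[slot]   = vpn;
        t->pfn[slot]   = walk(vpn);
        t->valid[slot] = true;
    }
    return (t->pfn[slot] << TOY_PAGE_SHIFT) | (va & 0xfff);
}
```

Touching 64 consecutive pages twice produces 64 misses on the first pass and 64 hits on the second. Touching 128 pages in a loop evicts every entry before it is reused, which is the toy version of a working set that exceeds the TLB's reach.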
Both TLBs miss → the HW page walker traverses the four levels (PML4 → PDPT → PD → PT) by reading PTEs from RAM. Those reads go through the data caches, so a "warm" walk is much cheaper than a cold one.
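With 4-level paging and 4 KiB pages, each level of the walk consumes 9 bits of the virtual address and the low 12 bits are the offset within the page. The sketch below shows that split; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Indices the page walker uses at each level, plus the in-page offset. */
struct va_parts {
    unsigned pml4;    /* bits 47..39 */
    unsigned pdpt;    /* bits 38..30 */
    unsigned pd;      /* bits 29..21 */
    unsigned pt;      /* bits 20..12 */
    unsigned offset;  /* bits 11..0  */
};

struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.pml4   = (unsigned)((va >> 39) & 0x1ff);
    p.pdpt   = (unsigned)((va >> 30) & 0x1ff);
    p.pd     = (unsigned)((va >> 21) & 0x1ff);
    p.pt     = (unsigned)((va >> 12) & 0x1ff);
    p.offset = (unsigned)(va & 0xfff);
    return p;
}
```

For the address from the malloc walkthrough, split_va(0x55a3c4f02a80) gives PML4 index 171, PDPT index 143, PD index 39, PT index 258, and offset 0xa80. The walker reads one 8-byte entry at each of those indices, four memory reads in total.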
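PTE has P=0 but a valid VMA exists, e.g. malloc()'d memory on first touch. Kernel allocates a zeroed frame, installs the PTE, retries. A copy-on-write page after fork() is handled the same way, except that the PTE is present but read-only and the kernel copies the existing page instead of zeroing a new one. No disk touched.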
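The copy-on-write case can be seen with a short fork() experiment. This is a minimal sketch assuming a POSIX system with MAP_ANONYMOUS; the function name cow_demo is hypothetical. Parent and child start out sharing one physical frame, and the child's write triggers a fault that gives the writer a private copy, so the parent's value is unchanged.

```c
#define _GNU_SOURCE            /* for MAP_ANONYMOUS on glibc */
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 on success and reports the value each
 * process observed after the child wrote to the shared-until-written page. */
int cow_demo(int *parent_sees, int *child_sees)
{
    int *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *p = 10;                      /* first touch: minor fault, zeroed frame */

    pid_t pid = fork();           /* both processes now map the frame read-only */
    if (pid < 0) {
        munmap(p, 4096);
        return -1;
    }
    if (pid == 0) {
        *p = 42;                  /* write fault: kernel copies the page */
        _exit(*p);                /* report the child's view via exit status */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        munmap(p, 4096);
        return -1;
    }

    *child_sees  = WEXITSTATUS(status);
    *parent_sees = *p;            /* parent's frame was never modified */
    munmap(p, 4096);
    return 0;
}
```

After a successful call the child has seen 42 while the parent still sees 10, even though both used the same virtual address.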
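Page is file-backed but not yet in the page cache, e.g. the first touch of an mmap()'d cold file. Kernel issues block I/O, waits, fills a frame, updates PTE. The slow path: ~50–150 µs. A cold read() or fread() waits on the same disk I/O, but inside the syscall rather than through a fault.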
PTE encodes a swap entry — page was evicted by kswapd when RAM was full. Kernel reads it back from /swapfile via the swap cache. Rewriting the PTE restores normal access.
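Once a PA is in hand, the load probes L1d → L2 → L3 in turn. A miss in all three goes to DRAM. L1d lookup (VIPT) often runs in parallel with TLB lookup: the set index comes from page-offset bits, which are identical in the VA and the PA, so only the tag comparison waits on the PA.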
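Why the VIPT trick works can be checked with a little arithmetic. The sketch below assumes a 32 KiB, 8-way L1d with 64-byte lines; the size and associativity are assumptions for illustration and are not taken from the text above. That geometry gives 64 sets, so the set index is address bits 6 to 11, all of which lie inside the 12-bit page offset.

```c
#include <stdint.h>

#define L1D_SIZE  (32u * 1024u)   /* assumed cache size          */
#define L1D_WAYS  8u              /* assumed associativity       */
#define LINE_SIZE 64u             /* cache line size in bytes    */
#define L1D_SETS  (L1D_SIZE / (L1D_WAYS * LINE_SIZE))   /* = 64 */

/* Set index for an address: drop the 6 line-offset bits, keep the next 6. */
unsigned l1d_set_index(uint64_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % L1D_SETS);
}
```

Using the addresses from the malloc walkthrough, the virtual address 0x55a3c4f02a80 and the physical address 0x12F1A80 both map to set 42, because they share the page offset 0xa80. The cache can therefore select the set from the VA while the TLB is still producing the PA needed for the tag check.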
When free memory falls below watermarks, kswapd walks active/inactive LRU lists, writes dirty pages back, and frees clean ones. Under extreme pressure the OOM killer terminates a process.