Understanding Virtual Memory and NX Bit Simulation: A Journey into x86 32-bit
Virtual memory is at the heart of modern operating systems, providing each program with an isolated and flexible memory space. Orchestrated by the MMU and page tables, it ensures both efficiency and security. This article explores paged memory on x86 32-bit, the role of TLBs, and a key technique: software simulation of the NX bit via W^X, a clever workaround for a hardware limitation that helped counter code-injection malware.
1. Virtual Memory, MMU, and Paging
Virtual memory is a hardware and software mechanism that provides each process with an independent logical (virtual) memory space, often much larger than the available physical memory. This system relies on a complex interaction between the hardware, via the MMU (Memory Management Unit), and the software, via page management structures.
1.1. Fundamentals and the Role of the MMU
Virtual memory allows each process to operate as if it had a contiguous and exclusive address space, isolated from other processes. This virtual space is mapped to physical memory (DRAM) or, if necessary, to disk storage (swap), by a key component: the MMU.
The MMU is a hardware component located between the processor and physical memory. It intervenes at every memory access to translate a virtual address into a physical address on the fly. The MMU also manages:
- Access rights: Read (R), Write (W), or Execute (X) permissions.
- Error signaling: When accessing an absent or forbidden page, the MMU generates a page fault.
- Status bits: The Accessed (A) and Dirty (D) bits allow the operating system to track page usage.
On x86 32-bit architectures, the translation relies on a two-level paging scheme: a Page Directory and Page Tables.
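To make these page-table entries concrete, here is a minimal sketch of the flag bits defined by the x86 32-bit architecture for a 4 KiB PDE/PTE; the bit positions are architectural, but the macro and helper names are mine, chosen purely for illustration:

    #include <stdint.h>

    /* Flag bits of an x86 32-bit page-table entry (4 KiB pages). */
    #define PTE_PRESENT   (1u << 0)   /* P: the page is mapped in physical memory */
    #define PTE_WRITABLE  (1u << 1)   /* R/W: writes are allowed                  */
    #define PTE_USER      (1u << 2)   /* U/S: accessible from user mode           */
    #define PTE_ACCESSED  (1u << 5)   /* A: set by the MMU on any access          */
    #define PTE_DIRTY     (1u << 6)   /* D: set by the MMU on a write             */

    /* Bits 31..12 hold the physical base address of the 4 KiB page frame. */
    static inline uint32_t pte_frame_base(uint32_t pte) {
        return pte & 0xFFFFF000u;
    }

Note that there is no Execute bit in this layout, a gap that section 3 returns to.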
1.2. Advantages of Virtual Memory
Process isolation: Each program has its own virtual address space. Isolation is guaranteed at the hardware level by the MMU, which uses separate page tables for each process (pointed to by CR3 on x86).
Fine-grained memory protection: The MMU allows defining precise permissions for each page (readable, writable, executable, or forbidden).
Flexible allocation: A process can request a large block of virtual memory without it being physically contiguous in RAM.
Optimized memory usage: Rarely used pages can be swapped out to disk storage (swap).
1.3. Address Translation and Breakdown
In 32-bit systems, the standard memory page size is 4 KiB (4096 bytes). A 32-bit virtual address is divided as follows:
[31 ........ 22][21 ........ 12][11 ........ 0]
    10 bits          10 bits        12 bits
   PDE index        PTE index       Offset
- PDE index (bits 31-22): Index into the Page Directory, selecting one of 1024 entries.
- PTE index (bits 21-12): Index into the Page Table, selecting one of 1024 entries.
- Offset (bits 11-0): Position within the physical page (0 to 4095 bytes).
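As a quick illustration (the macro names are mine, not part of any standard API), the three fields can be extracted with simple shifts and masks:

    #include <stdint.h>

    /* Split a 32-bit virtual address into its three translation fields. */
    #define PDE_INDEX(va)   (((uint32_t)(va) >> 22) & 0x3FFu)  /* bits 31..22 */
    #define PTE_INDEX(va)   (((uint32_t)(va) >> 12) & 0x3FFu)  /* bits 21..12 */
    #define PAGE_OFFSET(va) ((uint32_t)(va) & 0xFFFu)          /* bits 11..0  */

    /* Example: for va = 0x00403204, PDE_INDEX = 1, PTE_INDEX = 3, PAGE_OFFSET = 0x204. */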
The translation process (page walk):
- Extract the fields from the virtual address.
- Read the PDE from the Page Directory, whose physical base address is held in CR3.
- Read the PTE from the Page Table whose base address is given by the PDE.
- Compute the physical address: page frame base (from the PTE) + Offset.
The page walk requires at minimum two additional memory accesses (PDE + PTE), which can represent 100-200 cycles if the structures are not cached.
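Below is a minimal sketch of that walk, written against a simulated physical memory rather than real hardware; the function and array names are assumptions made for the example, and details such as permission checks and 4 MiB pages are left out:

    #include <stdint.h>

    #define PAGE_PRESENT 0x1u

    /* Simulated physical memory, addressed in 32-bit words. */
    extern uint32_t phys_mem[];

    static uint32_t read_phys(uint32_t paddr) {
        return phys_mem[paddr / 4];
    }

    /* Two-level page walk: returns 1 and fills *paddr on success, 0 on a page fault. */
    int page_walk(uint32_t cr3, uint32_t vaddr, uint32_t *paddr) {
        uint32_t pde_addr = (cr3 & 0xFFFFF000u) + ((vaddr >> 22) & 0x3FFu) * 4;
        uint32_t pde = read_phys(pde_addr);               /* 1st extra memory access */
        if (!(pde & PAGE_PRESENT))
            return 0;                                     /* page fault: PDE absent */

        uint32_t pte_addr = (pde & 0xFFFFF000u) + ((vaddr >> 12) & 0x3FFu) * 4;
        uint32_t pte = read_phys(pte_addr);               /* 2nd extra memory access */
        if (!(pte & PAGE_PRESENT))
            return 0;                                     /* page fault: page not mapped */

        *paddr = (pte & 0xFFFFF000u) | (vaddr & 0xFFFu);  /* frame base + offset */
        return 1;
    }

The two read_phys calls correspond exactly to the two additional memory accesses mentioned above.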
2. Translation Lookaside Buffers (TLBs)
TLBs are specialized hardware caches integrated into the MMU. Their role is to speed up the translation of virtual addresses to physical addresses, by avoiding the high cost of repeated accesses to page tables.
2.1. How They Work
When a virtual address is generated, the MMU checks the TLB:
- TLB hit: The translation is immediate, with no access to the page tables.
- TLB miss: The MMU performs a page walk and stores the result in the TLB for future accesses.
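The hit/miss logic can be sketched with a toy, fully associative TLB; the structure and function names below are illustrative, not those of any real MMU:

    #include <stdint.h>

    #define TLB_ENTRIES 64

    struct tlb_entry {
        uint32_t vpn;     /* virtual page number (vaddr >> 12) */
        uint32_t pfn;     /* physical frame number             */
        uint8_t  perms;   /* R/W/X permission bits             */
        uint8_t  valid;
    };

    static struct tlb_entry tlb[TLB_ENTRIES];
    static unsigned next_victim;              /* naive round-robin replacement */

    /* Returns 1 on a TLB hit (pfn filled in), 0 on a miss. */
    int tlb_lookup(uint32_t vaddr, uint32_t *pfn) {
        uint32_t vpn = vaddr >> 12;
        for (unsigned i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *pfn = tlb[i].pfn;            /* hit: translation is immediate */
                return 1;
            }
        }
        return 0;                             /* miss: a page walk is required */
    }

    /* Called after a page walk to cache the translation for future accesses. */
    void tlb_fill(uint32_t vaddr, uint32_t paddr, uint8_t perms) {
        struct tlb_entry *e = &tlb[next_victim];
        next_victim = (next_victim + 1) % TLB_ENTRIES;
        e->vpn   = vaddr >> 12;
        e->pfn   = paddr >> 12;
        e->perms = perms;
        e->valid = 1;
    }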
2.2. Structure
TLBs are small caches (from a few dozen to a few hundred entries). Each entry contains: virtual address, physical address, permissions, and status bits.
On x86 32-bit architectures, TLBs are often separated into:
- i-TLB (instruction TLB): For accesses related to instruction execution.
- d-TLB (data TLB): For data read/write accesses.
This separation plays a key role in NX bit simulation.
2.3. Performance Impact
- TLB Hit: A few cycles.
- TLB Miss: Can cost up to three DRAM accesses (PDE, PTE, data), amounting to several hundred cycles.
To minimize TLB misses, modern systems use multi-level TLBs, large pages (4 MiB on x86 32-bit with PSE, or 2 MiB with PAE), and optimize reference locality.
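A back-of-the-envelope calculation shows why the hit rate dominates; the cycle counts and the 1% miss rate below are assumptions chosen for illustration, not measurements:

    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;     /* cycles for a TLB hit (assumed)        */
        double miss_penalty = 200.0;   /* cycles for a full page walk (assumed) */
        double miss_rate    = 0.01;    /* 1% of accesses miss the TLB (assumed) */

        /* Effective translation cost = hit_time + miss_rate * miss_penalty */
        double avg = hit_time + miss_rate * miss_penalty;
        printf("average translation cost: %.1f cycles\n", avg);   /* 3.0 cycles */
        return 0;
    }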
3. NX Bit Simulation: the W^X Policy
In x86 32-bit processors from the 1990s and early 2000s, the lack of a hardware mechanism to prevent code execution in data pages exposed systems to code injection attacks. To address this limitation, a software solution based on the W^X (Write XOR Execute) policy was developed, leveraging TLB separation.
3.1. The Problem
On x86 32-bit, all pages present in memory were implicitly executable. This limitation allowed attackers to inject malicious code into data pages (via a buffer overflow, for example) and execute it.
The NX (No-eXecute) bit was introduced later by AMD in the x86-64 architecture. On RISC-V Sv32, an explicit X (Executable) bit exists in each page table entry.
3.2. The W^X Principle
The W^X policy mandates that a page cannot be both writable and executable at the same time:
- A data page is marked writable in the d-TLB but non-executable in the i-TLB.
- A code page is marked executable in the i-TLB but non-writable in the d-TLB.
- Any violation triggers a page fault; the handler retrieves the faulting address from the CR2 register.
3.3. TLB Manipulation
- For a write: The system loads the page into the d-TLB with { R=1, W=1 } and ensures that the i-TLB has no entry for this page (selective flush).
- For an execution: The system loads the page into the i-TLB and flushes the corresponding entry in the d-TLB.
Simplified pseudo-code:
page_fault_handler(virtual_address, operation_type) {
    pte = get_PTE(virtual_address);

    if (operation_type == WRITE) {
        /* Data access: populate only the d-TLB. */
        dTLB_add(virtual_address, pte.physical_address, R=1, W=1);
        /* Make sure the page cannot be fetched as code. */
        flush_iTLB(virtual_address);
    } else if (operation_type == EXECUTE) {
        /* Instruction fetch: populate only the i-TLB. */
        iTLB_add(virtual_address, pte.physical_address);
        /* Make sure the page can no longer be written through the d-TLB. */
        flush_dTLB(virtual_address);
    }
}
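On real x86 hardware there is no instruction that writes a TLB entry directly: a TLB can only be loaded as a side effect of an access, and invalidated with invlpg or a CR3 reload. One way to prime only the d-TLB, reminiscent of techniques such as PaX PAGEEXEC but not necessarily what any particular product did, is sketched below; pte_of() and the overall flow are assumptions for illustration, and real code would also have to handle SMP shootdowns and races:

    #include <stdint.h>

    #define PTE_PRESENT  0x1u
    #define PTE_ACCESSED (1u << 5)
    #define PTE_DIRTY    (1u << 6)

    /* Hypothetical helper: returns a pointer to the PTE that maps vaddr. */
    extern volatile uint32_t *pte_of(uint32_t vaddr);

    /* Invalidate any cached translation (i-TLB and d-TLB) for one page. */
    static inline void invlpg(uint32_t vaddr) {
        __asm__ volatile("invlpg (%0)" :: "r"(vaddr) : "memory");
    }

    /* Called from the page-fault handler on a legitimate data access to vaddr.
     * The page is normally kept non-present, so instruction fetches keep faulting. */
    void allow_data_access(uint32_t vaddr) {
        volatile uint32_t *pte = pte_of(vaddr);

        invlpg(vaddr);                      /* start from a clean state in both TLBs */

        /* Pre-set A/D so the MMU will not re-walk the tables on the real access. */
        *pte |= PTE_PRESENT | PTE_ACCESSED | PTE_DIRTY;

        (void)*(volatile uint8_t *)vaddr;   /* dummy read: the walk loads the d-TLB */

        *pte &= ~PTE_PRESENT;               /* back to non-present: an instruction
                                               fetch misses the i-TLB, walks, faults */
    }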
3.4. My Experience at LANDesk
I worked on this technique at LANDesk (now Ivanti), where the intrusion prevention system used W^X to protect millions of Windows machines. A major challenge: on the early Pentiums, the TLB had only 8 entries. An instruction like push [esi] could require up to eleven memory accesses if the address was not aligned or straddled two pages.
The solution I designed: a disassembler to identify these critical cases, then an emulator to work around the fact that the early Pentium processors did not have enough TLB entries. The whole thing was deployed on 10 million machines without a single issue.
3.5. Limitations
- Performance: Each page fault triggers software handling. Frequent TLB flushes degrade performance.
- Compatibility: Some legitimate applications (JIT compilers) require pages that are both writable and executable (a common workaround is sketched after this list).
- Complexity: Managing separate TLBs requires fine-grained synchronization between hardware and software.
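For the JIT case mentioned above, the usual way to coexist with W^X is to never hold both permissions at the same time: emit code into a read-write buffer, then flip it to read-execute before running it. A minimal POSIX sketch (error handling trimmed, and assuming len fits in one 4 KiB page):

    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    typedef int (*jit_fn)(void);

    /* Emit machine code under a W^X policy: write while RW, execute only after
     * the write permission has been dropped. */
    jit_fn emit_code(const unsigned char *code, size_t len) {
        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;

        memcpy(buf, code, len);             /* page is writable, not executable */

        if (mprotect(buf, 4096, PROT_READ | PROT_EXEC) != 0)
            return NULL;                    /* now executable, no longer writable */

        return (jit_fn)buf;
    }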
Conclusion
Although W^X was an ingenious solution to compensate for the lack of the NX bit on x86 32-bit, it remained costly in terms of performance and complex to implement. The introduction of the hardware NX bit in x86-64, followed by its widespread adoption in modern architectures (ARM and RISC-V), rendered this simulation obsolete, providing more efficient native protection against code injection attacks.