Understanding Virtual Memory and NX Bit Simulation: A Journey into x86 32-bit

Virtual memory is at the heart of modern operating systems, providing each program with an isolated and flexible memory space. Orchestrated by the MMU and page tables, it ensures both efficiency and security. This article explores paged memory on x86 32-bit, the role of TLBs, and a key technique: software simulation of the NX bit via W^X, a clever solution to hardware limitations for countering malware.

1. Virtual Memory, MMU, and Paging

Virtual memory is a hardware and software mechanism that provides each process with an independent logical (virtual) memory space, often much larger than the available physical memory. This system relies on a complex interaction between the hardware, via the MMU (Memory Management Unit), and the software, via page management structures.

1.1. Fundamentals and the Role of the MMU

Virtual memory allows each process to operate as if it had a contiguous and exclusive address space, isolated from other processes. This virtual space is mapped to physical memory (DRAM) or, if necessary, to disk storage (swap), by a key component: the MMU.

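The MMU is a hardware component that sits logically between the processor core and physical memory. It intervenes on every memory access to translate a virtual address into a physical address on the fly. It also enforces the access permissions recorded in the page tables and caches recent translations in the TLB.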

On x86 32-bit architectures, the translation relies on a two-level paging scheme: a Page Directory and Page Tables.

1.2. Advantages of Virtual Memory

  1. Process isolation: Each program has its own virtual address space. Isolation is guaranteed at the hardware level by the MMU, which uses separate page tables for each process (pointed to by CR3 on x86).

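  2. Fine-grained memory protection: The MMU enforces permissions on each page (readable, writable, executable, or forbidden). Which permissions exist depends on the architecture: as section 3 explains, classic x86 32-bit page table entries have no execute permission.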

  3. Flexible allocation: A process can request a large block of virtual memory without it being physically contiguous in RAM.

  4. Optimized memory usage: Rarely used pages can be swapped out to disk storage (swap).

1.3. Address Translation and Breakdown

In 32-bit systems, the standard memory page size is 4 KiB (4096 bytes). A 32-bit virtual address is divided as follows:

     31          22 21          12 11           0
    +--------------+--------------+--------------+
    |  PDE index   |  PTE index   |    Offset    |
    |   10 bits    |   10 bits    |   12 bits    |
    +--------------+--------------+--------------+
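As a rough illustration of this layout, the three fields can be pulled out of a 32-bit address with shifts and masks. The helper names (pde_index, pte_index, page_offset) are made up for this sketch and do not come from any real API.

```c
#include <stdint.h>

/* Hypothetical helpers: split a 32-bit virtual address (4 KiB pages). */
static inline uint32_t pde_index(uint32_t va)   { return va >> 22; }            /* bits 31..22 */
static inline uint32_t pte_index(uint32_t va)   { return (va >> 12) & 0x3FFu; } /* bits 21..12 */
static inline uint32_t page_offset(uint32_t va) { return va & 0xFFFu; }         /* bits 11..0  */
```

For the address 0x00403025, these helpers give PDE index 1, PTE index 3, and offset 0x025.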

The translation process (page walk):

  1. Extract the fields from the virtual address.
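  2. Read the PDE at position "PDE index" in the Page Directory, whose physical base address is held in CR3.
  3. Read the PTE at position "PTE index" in the Page Table whose base address is given by the PDE.
  4. Compute the physical address: page frame base (from the PTE) + Offset.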

The page walk adds two memory accesses (PDE + PTE) to the access being translated, which can cost on the order of 100-200 cycles when neither structure is cached.
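The walk can be modeled in a few lines of C. This is only a sketch under simplifying assumptions: "physical memory" is a small array, only the Present bit of each entry is checked, and the names phys_mem, phys_read32, phys_write32, and page_walk are invented for the example.

```c
#include <stdint.h>
#include <stdbool.h>

#define ENTRIES     1024u         /* 32-bit words per 4 KiB frame */
#define NUM_FRAMES  8u            /* toy physical memory: 8 frames = 32 KiB */
#define PTE_PRESENT 0x1u          /* bit 0 of a PDE/PTE: entry is valid */
#define FRAME_MASK  0xFFFFF000u   /* bits 31..12: physical frame base */

/* Toy physical memory, indexed by physical address / 4 (no bounds checks). */
static uint32_t phys_mem[NUM_FRAMES * ENTRIES];

static uint32_t phys_read32(uint32_t pa)              { return phys_mem[pa / 4u]; }
static void     phys_write32(uint32_t pa, uint32_t v) { phys_mem[pa / 4u] = v; }

/* Two-level walk. Returns false where real hardware would raise a page fault. */
static bool page_walk(uint32_t cr3, uint32_t va, uint32_t *pa_out)
{
    /* First access: PDE = PageDirectory[PDE index] */
    uint32_t pde = phys_read32((cr3 & FRAME_MASK) + (va >> 22) * 4u);
    if (!(pde & PTE_PRESENT))
        return false;

    /* Second access: PTE = PageTable[PTE index] */
    uint32_t pte = phys_read32((pde & FRAME_MASK) + ((va >> 12) & 0x3FFu) * 4u);
    if (!(pte & PTE_PRESENT))
        return false;

    /* Physical address = frame base + offset */
    *pa_out = (pte & FRAME_MASK) | (va & 0xFFFu);
    return true;
}
```

With a page directory at 0x1000, a page table at 0x2000, and a data frame at 0x5000, mapping PDE index 1 and PTE index 3 makes page_walk(0x1000, 0x00403025, &pa) succeed with pa equal to 0x5025.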

2. Translation Lookaside Buffers (TLBs)

TLBs are specialized hardware caches integrated into the MMU. Their role is to speed up the translation of virtual addresses to physical addresses, by avoiding the high cost of repeated accesses to page tables.

2.1. How They Work

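When a virtual address is generated, the MMU checks the TLB first. On a hit, the cached translation is used immediately and no page table access is needed. On a miss, the MMU performs the page walk described in section 1.3 and stores the result in the TLB for later accesses.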

2.2. Structure

TLBs are small caches (from a few dozen to a few hundred entries). Each entry holds a virtual page number, the matching physical frame number, permissions, and status bits.
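A software model of such a cache might look like the sketch below. It is deliberately simplified: the lookup is a linear search and replacement is round-robin, whereas real TLBs are hardware structures with their own associativity and replacement policy. The size of 8 entries and all names (tlb_entry, tlb_lookup, tlb_insert) are assumptions made for the example.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define TLB_ENTRIES 8u   /* assumed size for the sketch */

struct tlb_entry {
    bool     valid;      /* status bit: entry holds a translation */
    uint32_t vpn;        /* virtual page number   (va >> 12) */
    uint32_t pfn;        /* physical frame number (pa >> 12) */
    bool     writable;   /* permission copied from the PTE */
};

struct tlb {
    struct tlb_entry e[TLB_ENTRIES];
    size_t next;         /* round-robin replacement cursor */
};

/* Hit: fill *pa_out and return true. Miss: return false (a page walk follows). */
static bool tlb_lookup(const struct tlb *t, uint32_t va, uint32_t *pa_out)
{
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].vpn == (va >> 12)) {
            *pa_out = (t->e[i].pfn << 12) | (va & 0xFFFu);
            return true;
        }
    }
    return false;
}

/* Install a translation obtained from a page walk, evicting round-robin. */
static void tlb_insert(struct tlb *t, uint32_t va, uint32_t pa, bool writable)
{
    struct tlb_entry *slot = &t->e[t->next];
    t->next = (t->next + 1u) % TLB_ENTRIES;
    slot->valid    = true;
    slot->vpn      = va >> 12;
    slot->pfn      = pa >> 12;
    slot->writable = writable;
}
```

A first lookup of an address misses; after tlb_insert for its page, every address within the same 4 KiB page hits without touching the page tables.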

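On x86 32-bit architectures, TLBs are often split into an instruction TLB (i-TLB), used for instruction fetches, and a data TLB (d-TLB), used for data reads and writes.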

This separation plays a key role in NX bit simulation.

2.3. Performance Impact

To minimize TLB misses, modern systems use multi-level TLBs and large pages (2 MiB or 4 MiB on x86 32-bit), and software is written to improve locality of reference.
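The benefit of large pages can be seen through "TLB reach", the amount of address space the TLB covers without a miss. The function below is a trivial sketch, and the entry count used in the note after it is an arbitrary example, not a figure for any specific processor.

```c
#include <stdint.h>

/* TLB reach in bytes: number of entries times the page size each entry maps. */
static uint64_t tlb_reach_bytes(uint32_t entries, uint64_t page_size)
{
    return (uint64_t)entries * page_size;
}
```

With 64 entries, 4 KiB pages give a reach of 256 KiB, while 4 MiB pages give 256 MiB with the same number of entries.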

3. NX Bit Simulation: the W^X Policy

In x86 32-bit processors from the 1990s and early 2000s, the lack of a hardware mechanism to prevent code execution in data pages exposed systems to code injection attacks. To address this limitation, a software solution based on the W^X (Write XOR Execute) policy was developed, leveraging TLB separation.

3.1. The Problem

On x86 32-bit, all pages present in memory were implicitly executable. This limitation allowed attackers to inject malicious code into data pages (via a buffer overflow, for example) and execute it.

The NX (No-eXecute) bit was introduced later by AMD in the x86-64 architecture. On RISC-V Sv32, an explicit X (Executable) bit exists in each page table entry.

3.2. The W^X Principle

The W^X policy mandates that a page is never writable and executable at the same time: at any given moment it is either writable (treated as data) or executable (treated as code).
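Stated as a predicate, the rule is a one-line bit test. The PERM_* flags and the wx_ok name are hypothetical and exist only for this sketch; they do not correspond to x86 page table bits.

```c
#include <stdbool.h>

/* Hypothetical permission flags for the sketch. */
#define PERM_R 0x1u
#define PERM_W 0x2u
#define PERM_X 0x4u

/* True when a permission set respects W^X: never writable and executable together. */
static bool wx_ok(unsigned perms)
{
    return (perms & (PERM_W | PERM_X)) != (PERM_W | PERM_X);
}
```

Read+write and read+execute both pass the check; read+write+execute is rejected.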

3.3. TLB Manipulation

  1. For a write: The system loads the page into the d-TLB with { R=1, W=1 } and ensures that the i-TLB has no entry for this page (selective flush).
  2. For an execution: The system loads the page into the i-TLB and flushes the corresponding entry in the d-TLB.

Simplified pseudo-code:

    page_fault_handler(virtual_address, operation_type) {
        pte = get_PTE(virtual_address);
        if (operation_type == WRITE) {
            dTLB_add(virtual_address, pte.physical_address, R=1, W=1);
            flush_iTLB(virtual_address);
        } else if (operation_type == EXECUTE) {
            iTLB_add(virtual_address, pte.physical_address);
            flush_dTLB(virtual_address);
        }
    }
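The bookkeeping in this pseudo-code can be exercised with a toy model in C. The sketch assumes one entry per TLB and invented names (split_tlb, wx_fault, in_both). On real x86 hardware the TLBs are filled by the processor itself, so dTLB_add and iTLB_add stand for whatever sequence an implementation uses to get the entry loaded; the model only shows the invariant that a page never sits in both TLBs at once.

```c
#include <stdint.h>
#include <stdbool.h>

enum access { ACCESS_WRITE, ACCESS_EXECUTE };

/* Toy single-entry i-TLB and d-TLB, enough to show the mutual exclusion. */
struct split_tlb {
    bool     itlb_valid, dtlb_valid;
    uint32_t itlb_vpn,   dtlb_vpn;
};

static void flush_itlb(struct split_tlb *t, uint32_t vpn)
{
    if (t->itlb_valid && t->itlb_vpn == vpn)
        t->itlb_valid = false;
}

static void flush_dtlb(struct split_tlb *t, uint32_t vpn)
{
    if (t->dtlb_valid && t->dtlb_vpn == vpn)
        t->dtlb_valid = false;
}

/* Mirrors the pseudo-code: load the page into one TLB, drop it from the other. */
static void wx_fault(struct split_tlb *t, uint32_t va, enum access op)
{
    uint32_t vpn = va >> 12;
    if (op == ACCESS_WRITE) {
        t->dtlb_valid = true;
        t->dtlb_vpn   = vpn;
        flush_itlb(t, vpn);
    } else {
        t->itlb_valid = true;
        t->itlb_vpn   = vpn;
        flush_dtlb(t, vpn);
    }
}

/* The condition the policy must never allow: the page present in both TLBs. */
static bool in_both(const struct split_tlb *t, uint32_t va)
{
    uint32_t vpn = va >> 12;
    return t->itlb_valid && t->dtlb_valid &&
           t->itlb_vpn == vpn && t->dtlb_vpn == vpn;
}
```

After a write fault on a page, only the d-TLB holds it; a later execute fault on the same page moves it to the i-TLB and removes it from the d-TLB, so in_both stays false throughout.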

3.4. My Experience at LANDesk

I worked on this technique at LANDesk (now Ivanti), where the intrusion prevention system used W^X to protect millions of Windows machines. A major challenge: on the early Pentiums, the TLB had only 8 entries. An instruction like push [esi] could require up to eleven memory accesses if the address was not aligned or straddled two pages.

The solution I designed: a disassembler to identify these critical cases, then an emulator to work around the fact that the early Pentium processors did not have enough TLB entries. The whole thing was deployed on 10 million machines without a single issue.
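The page-straddling condition mentioned above is easy to test for once an operand's address and size are known. The helper below is a generic sketch with an invented name (pages_touched); it is not the disassembler or emulator described here, it does not attempt to reproduce the count of eleven accesses, and it ignores wrap-around at the top of the address space.

```c
#include <stdint.h>

/* Number of 4 KiB pages covered by an access of `size` bytes at `va` (size >= 1). */
static uint32_t pages_touched(uint32_t va, uint32_t size)
{
    uint32_t first = va >> 12;
    uint32_t last  = (va + size - 1u) >> 12;
    return last - first + 1u;
}
```

A 4-byte access at 0x1FFC stays within one page, while the same access at 0x1FFE straddles two and therefore needs two translations.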

3.5. Limitations

Conclusion

Although W^X was an ingenious solution to compensate for the lack of the NX bit on x86 32-bit, it remained costly in terms of performance and complex to implement. The introduction of the hardware NX bit in x86-64, followed by its widespread adoption in modern architectures (ARM and RISC-V), rendered this simulation obsolete, providing more efficient native protection against code injection attacks.