There was a time when a computer could run only one application at a time. Nowadays even smartphones try to run multiple programs in parallel. The next logical step is to run multiple operating systems in parallel on the same machine. A modern PC has enough resources and speed to do it efficiently. Welcome to the era of virtual machines.

VM technology has entered the mainstream in recent years, with Intel and AMD building support for it into their popular chips. If you have a computer that is less than five years old, there is a good chance that it supports virtualization in hardware. Even if you don’t feel the urge to run Linux and Windows at the same time on your laptop, virtualization has many other interesting uses that are only now being explored. Since Corensic’s own product, Jinx, uses virtualization for debugging concurrent programs, in this series of blog posts we will share our knowledge of VM technology with you.

The idea of a virtual machine is not hard to grasp. There is a perfect analogy for it: the relationship between processes and the operating system.

Imagine that you are a process: You live in virtual reality in which you have exclusive access to an almost arbitrary number of processors on which to run your threads. You also have access to a large swath of contiguous memory, limited only by your addressing capabilities (32- or 64-bit). And for all you know you are the only process in the universe.

The same way the OS can fool processes, a hypervisor can fool the OS. A hypervisor creates the illusion of a machine on which the OS operates–the virtual machine. And it does it in a very similar way.

In this series I will discuss different areas of virtualization, first by describing how they are implemented for processes by the OS, and then for the OS by the hypervisor. In this installment, I’ll talk about how the operating system virtualizes memory.

Virtual Memory

There are three major aspects of virtual memory:

  1. Providing a process with the illusion of large contiguous memory space
  2. Allowing multiple processes to run in parallel
  3. Hiding the operating system code and data from processes and hiding processes from each other

The illusion is created by the operating system with the help of hardware.

It’s the OS that creates a process that executes a program. It provides the process with means to access its 32-bit or 64-bit address space. It does it by mapping each address used by the program (a.k.a., virtual address) into an address in computer memory (a.k.a., physical address). So when the program dereferences a pointer, this pointer is translated into an actual address in memory.

Address Translation

Address translation is done using a table. Conceptually it’s a table that maps every virtual address to its corresponding physical address. In practice, the address space is divided into larger chunks called pages and the page table maps virtual pages into physical pages. It would still be a lot of data to map the whole virtual address space, so page tables are organized in a tree structure and only some branches of it are present in memory at any given time.
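To get a feel for why a flat table is "a lot of data," here is a back-of-the-envelope calculation. The numbers are my assumptions rather than anything fixed by the discussion so far: 4 KiB pages and 4-byte table entries, as on a classic 32-bit x86. The function names are made up for this sketch.

```python
# Assumed parameters (classic 32-bit x86 without extensions):
ADDRESS_BITS = 32      # width of a virtual address
PAGE_SIZE = 4 * 1024   # bytes per page
ENTRY_SIZE = 4         # bytes per table entry


def flat_table_bytes(address_bits=ADDRESS_BITS, page_size=PAGE_SIZE,
                     entry_size=ENTRY_SIZE):
    """Size of a single-level table with one entry per virtual page."""
    num_pages = (1 << address_bits) // page_size
    return num_pages * entry_size


def per_address_table_bytes(address_bits=ADDRESS_BITS, entry_size=ENTRY_SIZE):
    """Size of the purely conceptual table with one entry per virtual address."""
    return (1 << address_bits) * entry_size


print(flat_table_bytes() // 2**20, "MiB for a flat page table, per process")
print(per_address_table_bytes() // 2**30, "GiB for a per-address table")
```

Under these assumptions the flat page table costs 4 MiB for every process, whether or not the process touches most of its address space. That is the waste the tree organization avoids.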

Let’s see how this works on a 32-bit x86. There is a top-level page directory table with entries that point to lower-level page tables. To translate a virtual address, the x86 takes the top 10 bits of the address and uses them as an index into the directory table. How does it find the directory table? It keeps a pointer to it in one of the special control registers, called CR3. An entry in the directory table contains a pointer to the corresponding page table. The x86 uses the next 10 bits of the virtual address as an index into that table. Finally, the entry in the page table contains the address of the physical page. The lowest 12 bits of the virtual address are then used as the offset into that page.

[Figure: Virtual address decoding]
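The 10/10/12 split described above can be sketched in a few lines of Python. This is only a model: nested dictionaries stand in for the directory and page tables, a missing key stands in for a "not present" entry, and the names `split_virtual_address`, `translate`, and `PageFault` are mine, not anything defined by the hardware.

```python
class PageFault(Exception):
    """Raised when the walk reaches an entry that is not present."""


def split_virtual_address(vaddr):
    """Split a 32-bit virtual address into (directory index, table index, offset)."""
    directory_index = (vaddr >> 22) & 0x3FF  # top 10 bits
    table_index = (vaddr >> 12) & 0x3FF      # next 10 bits
    offset = vaddr & 0xFFF                   # lowest 12 bits
    return directory_index, table_index, offset


def translate(page_directory, vaddr):
    """Walk a two-level table modeled as dict -> dict -> physical page base."""
    directory_index, table_index, offset = split_virtual_address(vaddr)
    page_table = page_directory.get(directory_index)
    if page_table is None:
        raise PageFault(hex(vaddr))
    page_base = page_table.get(table_index)
    if page_base is None:
        raise PageFault(hex(vaddr))
    return page_base | offset


# A hypothetical directory (what CR3 would point to) with one mapped page.
page_directory = {1: {3: 0x0009A000}}
print(hex(translate(page_directory, 0x00403025)))  # prints 0x9a025
```

Only the branches that are actually populated take up space here, which mirrors the point about keeping just some branches of the tree in memory.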

This kind of table walking is relatively expensive because it involves multiple memory accesses, so a caching mechanism is used to speed it up. The x86 has a special Translation Lookaside Buffer, or TLB, in which the most recently used page translations are stored. If the upper 20 bits of a given address are present in the TLB, address translation is very fast. In the case of a TLB miss, the processor does the table walk and stores the new translation in the TLB. On the x86, table walking is implemented in hardware, which, in retrospect, turned out to be not such a great idea. We’ll talk about it more in the context of virtualization of virtual memory.
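Here is a toy model of the TLB as a small cache keyed by the upper 20 bits of the address. The capacity and the least-recently-used replacement policy are assumptions I picked for illustration; real TLBs differ in size and policy. The `walk` callback stands in for the hardware table walk.

```python
from collections import OrderedDict


class TLB:
    """A tiny translation cache: virtual page number -> physical page base."""

    def __init__(self, walk, capacity=4):
        self.walk = walk              # called on a miss: vpn -> physical page base
        self.capacity = capacity      # assumed size, for illustration only
        self.entries = OrderedDict()  # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> 12             # upper 20 bits of a 32-bit address
        offset = vaddr & 0xFFF
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)         # mark as recently used
        else:
            self.misses += 1
            self.entries[vpn] = self.walk(vpn)    # the expensive table walk
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used
        return self.entries[vpn] | offset


def fake_walk(vpn):
    """Stand-in for a table walk: an arbitrary made-up mapping."""
    return (vpn + 0x100) << 12


tlb = TLB(fake_walk, capacity=2)
tlb.translate(0x1000)   # miss: walks the tables
tlb.translate(0x1004)   # hit: same page, no walk
print(tlb.hits, tlb.misses)
```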

Page Fault

At this point I would like to say that virtual address space is much larger than the amount of physical memory normally available on a PC, but that’s no longer true. I have 4GB of memory in my laptop, which is the maximum amount addressable by a 32-bit processor. But then again, my laptop runs on a 64-bit processor, which can address exabytes of memory. In any case, virtual memory was designed to allow the addressing of more memory than is physically available to a particular process.

One way to look at it is that physical memory is nothing but a large cache for the main storage medium: the hard disk. Indeed, when a process asks the OS for memory, the OS reserves a chunk of disk space inside the swap file and creates some dummy entries for it in the page table. When the program first tries to access this memory, the processor jumps through the usual hoops: it looks the address up in the TLB (nothing there), then it starts walking the tables and finds the dummy entry. At this point it gives up and faults. Now it is up to the operating system to process the page fault. It has previously registered its handler with the processor, and now this handler is executed.

This is what the page fault handler does:

  1. It allocates a page of physical memory. If there aren’t any free pages, it swaps an existing one to disk and invalidates its page entry.
  2. If the page had been previously swapped out, the handler reads the data from the swap file into the newly allocated physical page.
  3. It updates the page table and the TLB
  4. It restarts the faulting instruction
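The four steps can be played out in a small simulation. It is heavily simplified and rests on assumptions of mine: each page holds a single Python value, the victim page is chosen first-in-first-out, a dictionary plays the role of the swap file, and the TLB is left out. The class and method names are made up.

```python
class Memory:
    """Toy demand-paged memory with a handful of physical frames."""

    def __init__(self, num_frames):
        self.free_frames = list(range(num_frames))
        self.frames = {}       # frame number -> page contents
        self.page_table = {}   # virtual page -> frame, resident pages only
        self.swap = {}         # virtual page -> contents saved on "disk"
        self.resident = []     # resident pages in load order (FIFO victims)
        self.faults = 0

    def handle_page_fault(self, vpn):
        self.faults += 1
        # Step 1: find a physical frame, swapping out a victim if none is free.
        if not self.free_frames:
            victim = self.resident.pop(0)
            victim_frame = self.page_table.pop(victim)  # invalidate its entry
            self.swap[victim] = self.frames[victim_frame]
            self.free_frames.append(victim_frame)
        frame = self.free_frames.pop()
        # Step 2: bring the contents back if the page was swapped out earlier;
        # a page touched for the first time starts out as 0.
        self.frames[frame] = self.swap.pop(vpn, 0)
        # Step 3: update the page table.
        self.page_table[vpn] = frame
        self.resident.append(vpn)

    def read(self, vpn):
        while vpn not in self.page_table:  # Step 4: retry the faulting access
            self.handle_page_fault(vpn)
        return self.frames[self.page_table[vpn]]

    def write(self, vpn, value):
        while vpn not in self.page_table:
            self.handle_page_fault(vpn)
        self.frames[self.page_table[vpn]] = value


memory = Memory(num_frames=2)
memory.write(0, "a")
memory.write(1, "b")
memory.write(2, "c")     # no free frame left: page 0 goes to swap
print(memory.read(0))    # faults, evicts page 1, reloads "a"
```

The caller of `read` and `write` never sees any of this; it just gets its data back, which is the transparency the handler is supposed to provide.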

Notice that all this is totally transparent to the running process. This kind of trap and emulate mechanism forms the basis of all virtualization.

Isolation

Virtual memory is also used to run multiple processes on the same machine. Processes are defined in terms of their address spaces: unlike threads, processes can’t access each other’s memory. Every process has its own virtual address space with its own private page table. One process cannot peek at another process’s memory simply because there is no way it can address it.
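A minimal sketch of this idea: two processes share one physical memory, but each translates through its own private page table. The `Process` class, the frame numbers, and the single-level table are all simplifications invented for this example.

```python
class Unmapped(Exception):
    """Raised when a process touches a virtual page it has no mapping for."""


class Process:
    def __init__(self, ram, page_table):
        self.ram = ram                # shared physical memory: address -> value
        self.page_table = page_table  # private: virtual page -> physical frame

    def _physical(self, vaddr):
        vpn, offset = vaddr >> 12, vaddr & 0xFFF
        if vpn not in self.page_table:
            raise Unmapped(hex(vaddr))
        return (self.page_table[vpn] << 12) | offset

    def store(self, vaddr, value):
        self.ram[self._physical(vaddr)] = value

    def load(self, vaddr):
        return self.ram.get(self._physical(vaddr))


ram = {}
a = Process(ram, {0x400: 7})  # virtual page 0x400 -> physical frame 7
b = Process(ram, {0x400: 9})  # same virtual page -> a different frame

a.store(0x400010, "secret")
print(b.load(0x400010))       # prints None: b's address names another frame
```

Process `b` can use any virtual address it likes, but none of them translates to frame 7, so it has no way to name the memory that belongs to `a`.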

At the hardware level, when the operating system assigns an x86 processor (or core) to a given process, it sets its CR3 register to point to that process’s page directory. In a multitasking environment this assignment changes with every context switch.

Setting one control register is not an expensive operation, but on an x86 it must be accompanied by a TLB flush, which forces new table walks to refill the TLB, and that can be quite expensive. This is why both Intel and AMD have recently added Address Space Identifiers (ASIDs) to TLB entries in their newest processors. A TLB with ASIDs doesn’t have to be flushed; the processor simply ignores the entries with the wrong ASID.
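To see what the ASID tag buys, the sketch below counts table walks for two toy TLBs while two processes take turns on one processor. Both models are assumptions of mine: they have unlimited capacity and count nothing but misses. The class names and the `run` helper are invented.

```python
class FlushingTLB:
    """Untagged TLB: every context switch throws away all translations."""

    def __init__(self):
        self.entries = {}
        self.misses = 0

    def switch_to(self, asid):
        self.entries.clear()  # the flush that accompanies loading CR3

    def lookup(self, asid, vpn, walk):
        if vpn not in self.entries:
            self.misses += 1
            self.entries[vpn] = walk(asid, vpn)
        return self.entries[vpn]


class TaggedTLB:
    """TLB whose entries carry an ASID, so a switch needs no flush."""

    def __init__(self):
        self.entries = {}
        self.misses = 0

    def switch_to(self, asid):
        pass  # nothing to do: other processes' entries simply never match

    def lookup(self, asid, vpn, walk):
        key = (asid, vpn)
        if key not in self.entries:
            self.misses += 1
            self.entries[key] = walk(asid, vpn)
        return self.entries[key]


def fake_walk(asid, vpn):
    """Stand-in for a table walk in the given address space."""
    return asid * 1000 + vpn


def run(tlb, rounds=3, pages=4):
    """Alternate between two processes, each touching the same few pages."""
    for _ in range(rounds):
        for asid in (1, 2):
            tlb.switch_to(asid)
            for vpn in range(pages):
                tlb.lookup(asid, vpn, fake_walk)
    return tlb.misses


print(run(FlushingTLB()), "table walks with flushing")  # 24
print(run(TaggedTLB()), "table walks with ASIDs")       # 8
```

With flushing, every time slice starts cold and pays for all four walks again. With tags, each process pays once and then keeps hitting.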

One of the important aspects of virtual memory is that it provides isolation between processes: one process can’t modify, or even see, another process’s memory. This not only prevents one buggy program from trashing the whole system (a common occurrence in DOS or the first versions of Windows), but also improves the overall security of the system.

Levels of Privilege

Full isolation would be impossible if a user program could directly manipulate page tables. This is something that only the trusted operating system should be able to do. For instance, a user program should not be able to set the value of CR3. This prohibition has to be enforced at the processor level, which designates certain instructions as privileged. Such instructions must be accessible from the kernel of the operating system, but cause a fault when a user process tries to execute them. The x86 must therefore be able to distinguish these two modes of operation: kernel (a.k.a., supervisor) and user. For historical reasons, the x86 offers four levels of protection called rings: ring 0 corresponding to kernel mode and rings 1, 2, and 3 to user mode. In practice, only rings 0 and 3 are used, except for some implementations of virtualization.

When the operating system boots, it starts in kernel mode, so it can set up the page tables and do other kernel-only things. When it starts an application, it creates a ring 3 process for it, complete with separate page tables. The application may drop back into the kernel only through a trap, for instance when a page fault occurs. A trap saves the state of the processor, switches the processor to kernel mode, and executes the handler that was registered by the OS.
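The interplay of rings, privileged instructions, and traps can be modeled roughly as follows. This is a cartoon, not the x86: the `CPU` class, its methods, and the `PrivilegeFault` exception are invented, and the only "processor state" saved across a trap is the current ring.

```python
class PrivilegeFault(Exception):
    """Raised when user-mode code attempts a privileged operation."""


class CPU:
    def __init__(self):
        self.ring = 0       # the processor starts out in kernel mode
        self.cr3 = None
        self.handlers = {}

    def _require_kernel(self, what):
        if self.ring != 0:
            raise PrivilegeFault(what)

    def set_cr3(self, value):
        self._require_kernel("set_cr3")  # privileged: kernel only
        self.cr3 = value

    def register_handler(self, name, handler):
        self._require_kernel("register_handler")
        self.handlers[name] = handler

    def enter_user_mode(self):
        self.ring = 3

    def trap(self, name, *args):
        saved_ring = self.ring           # save (a sliver of) processor state
        self.ring = 0                    # switch to kernel mode
        try:
            return self.handlers[name](self, *args)
        finally:
            self.ring = saved_ring       # return to the interrupted code


def page_fault_handler(cpu, new_directory):
    cpu.set_cr3(new_directory)           # allowed: we are in ring 0 here
    return cpu.ring


cpu = CPU()
cpu.set_cr3(0x1000)                      # the booting kernel sets up paging
cpu.register_handler("page_fault", page_fault_handler)
cpu.enter_user_mode()

try:
    cpu.set_cr3(0x2000)                  # a user program trying to cheat
except PrivilegeFault as fault:
    print("fault on", fault)

print(cpu.trap("page_fault", 0x3000))    # prints 0: the handler ran in ring 0
print(cpu.ring)                          # prints 3: back in user mode
```

The point to notice is that the only way from ring 3 into ring 0 is through a handler that the kernel itself registered while it was still in charge.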

In the next installment I’ll explain how the hypervisor fools the operating system by virtualizing virtual memory.
