| An overview of memory management in QEMU: |
| |
| I. RAM Management: |
| ================== |
| |
| I.1. RAM Address space: |
| ----------------------- |
| |
| All pages of virtual RAM used by QEMU at runtime are allocated from |
| contiguous blocks in a specific abstract "RAM address space". |
| |ram_addr_t| is the type of block addresses in this space. |
| |
| A single block of contiguous RAM is allocated with 'qemu_ram_alloc()', which |
| takes a size in bytes, and allocates the pages through mmap() in the QEMU |
| host process. It also sets up the corresponding KVM / Xen / HAX mappings, |
| depending on each accelerator's specific needs. |
| |
| Each block has a name, which is used for snapshot support. |
| |
| 'qemu_ram_alloc_from_ptr()' can also be used to allocated a new RAM |
| block, by passing its content explicitly (can be useful for pages of |
| ROM). |
| |
| 'qemu_get_ram_ptr()' will translate a 'ram_addr_t' into the corresponding |
| address in the QEMU host process. 'qemu_ram_addr_from_host()' does the |
| opposite (i.e. translates a host address into a ram_addr_t if possible, |
| or return an error). |
| |
| Note that ram_addr_t addresses are an internal implementation detail of |
| QEMU, i.e. the virtual CPU never sees their values directly; it relies |
| instead of addresses in its virtual physical address space, described |
| in section II. below. |
| |
| As an example, when emulating an Android/x86 virtual device, the following |
| RAM space is being used: |
| |
| 0x0000_0000 ... 0x1000_0000 "pc.ram" |
| 0x1000_0000 ... 0x1002_0000 "bios.bin" |
| 0x1002_0000 ... 0x1004_0000 "pc.rom" |
| |
| |
| I.2. RAM Dirty tracking: |
| ------------------------ |
| |
| QEMU also associates with each RAM page an 8-bit 'dirty' bitmap. The |
| main idea is that whenever a page is written to, the value 0xff is |
| written to the page's 'dirty' bitmap. Various clients can later inspect |
| some of the flags and clear them. I.e.: |
| |
| VGA_DIRTY_FLAG (0x1) is typically used by framebuffer drivers to detect |
| which pages of video RAM were touched since the latest VSYNC. The driver |
| typically copies the pixel values to the real QEMU output, then clears |
| the bits. This is very useful to avoid needless copies if nothing |
| changed in the framebuffer. |
| |
| MIGRATION_DIRTY_FLAG (0x8) is used to tracked modified RAM pages during |
| live migration (i.e. moving a QEMU virtual machine from one host to |
| another) |
| |
| CODE_DIRTY_FLAG (0x2) is a bit more special, and is used to support |
| self-modifying code properly. More on this later. |
| |
| |
| II. The physical address space: |
| =============================== |
| |
| Represents the address space that the virtual CPU can read from / write to. |
| |hwaddr| is the type of addresses in this space, which is decomposed |
| into 'pages'. Each page in the address space is either unassigned, or |
| mapped to a specific kind of memory region. |
| |
| See |phys_page_find()| and |phys_page_find_alloc()| in translate-all.c for |
| the implementation details. |
| |
| |
| II.1. Memory region types: |
| -------------------------- |
| |
| There are several memory region types: |
| |
| - Regions of RAM pages. |
| - Regions of ROM pages (similar to RAM, but cannot be written to). |
| - Regions of I/O pages, used to communicate with virtual hardware. |
| |
| Virtual devices can register a new I/O region type by calling |
| |cpu_register_io_memory()|. This function allows them to provide |
| callbacks that will be invoked every time the virtual CPU reads from |
| or writes to any page of the corresponding type. |
| |
| The memory region type of a given page is encoded using PAGE_BITS bits |
| in the following format: |
| |
| +-------------------------------+ |
| | mem_type_index | flags | |
| +-------------------------------+ |
| |
| Where |mem_type_index| is a unique value identifying a given memory |
| region type, and |flags| is a 3-bit bitmap used to store flags that are |
| only relevant for I/O pages. |
| |
| The following memory region type values are important: |
| |
| IO_MEM_RAM (mem_type_index=0, flags=0): |
| Used for regular RAM pages, always all zero on purpose. |
| |
| IO_MEM_ROM (mem_type_index=1, flags=0): |
| Used for ROM pages. |
| |
| IO_MEM_UNASSIGNED (mem_type_index=2, flags=0): |
| Used to identify unassigned pages of the physical address space. |
| |
| IO_MEM_NOTDIRTY (mem_type_index=3, flags=0): |
| Used to implement tracking of dirty RAM pages. This is essentially |
| used for RAM pages that have not been written to yet. |
| |
| Any mem_type_index value of 4 or higher corresponds to a device-specific |
| I/O memory region type (i.e. with custom read/write callbaks, a |
| corresponding 'opaque' value), and can also use the following bits |
| in |flags|: |
| |
| IO_MEM_ROMD (0x1): |
| Used for ROM-like I/O pages, i.e. they are backed by a page from |
| the RAM address space, but writing to them triggers a device-specific |
| write callback (instead of being ignored or faulting the CPU). |
| |
| IO_MEM_SUBPAGE (0x02) |
| Used to indicate that not all addresses in this page map to the same |
| I/O region type / callbacks. |
| |
| IO_MEM_SUBWIDTH (0x04) |
| Probably obsolete. Set to indicate that the corresponding I/O region |
| type doesn't support reading/writing values of all possible sizes |
| (1, 2 and 4 bytes). This seems to be never used by the current code. |
| |
| Note that cpu_register_io_memory() returns a new memory region type value. |
| |
| II.2. Physical address map: |
| --------------------------- |
| |
| QEMU maintains for each assigned page in the physical address space |
| two values: |
| |
| |phys_offset|, a combination of ram address and memory region type. |
| |
| |region_offset|, an optional offset into the region backing the |
| page. This is only useful for I/O pages. |
| |
| The |phys_offset| value has many interesting encoding which require |
| further clarification: |
| |
| - Generally speaking, a phys_offset value is decomposed into |
| the following bit fields: |
| |
| +-----------------------------------------------------+ |
| | high_addr | mem_type | |
| +-----------------------------------------------------+ |
| |
| where |mem_type| is a PAGE_BITS memory region type as described |
| previously, and |high_addr| may contain the high bits of a |
| ram_addr_t address for RAM-backed pages. |
| |
| More specifically: |
| |
| - Unassigned pages always have the special value IO_MEM_UNASSIGNED |
| (high_addr=0, mem_type=IO_MEM_UNASSIGNED) |
| |
| - RAM pages have mem_type=0 (i.e. IO_MEM_RAM) while high_addr are |
| the high bits of the corresponding ram_addr_t. Hence, a simple call to |
| qemu_get_ram_ptr(phys_offset) will return the corresponding |
| address in host QEMU memory. |
| |
| This is the reson why IO_MEM_RAM is always 0: |
| |
| RAM page phys_offset value: |
| +-----------------------------------------------------+ |
| | high_addr | 0 | |
| +-----------------------------------------------------+ |
| |
| |
| - ROM pages are like RAM pages, but have mem_type=IO_MEM_ROM. |
| QEMU ensures that writing to such a page is a no-op, except on |
| some target architectures, like Sparc, this may cause a CPU fault. |
| |
| ROM page phys_offset value: |
| +-----------------------------------------------------+ |
| | high_addr | IO_MEM_ROM | |
| +-----------------------------------------------------+ |
| |
| - Dirty RAM page tracking is implemented by using special |
| phys_offset values with mem_type=IO_MEM_NOTDIRTY. Note that these |
| values do not appear directly in the physical page map, but in |
| the CPU TLB cache (explained later). |
| |
| non-dirty RAM page phys_offset value (CPU TLB cache only): |
| +-----------------------------------------------------+ |
| | high_addr | IO_MEM_NOTDIRTY | |
| +-----------------------------------------------------+ |
| |
| - Other pages are I/O pages, and their high_addr value will |
| be 0 / ignored: |
| |
| I/O page phys_offset value: |
| +----------------------------------------------------------+ |
| | 0 | mem_type_index | flags | |
| +----------------------------------------------------------+ |
| |
| Note that when reading from or writing to I/O pages, the lowest |
| PAGE_BITS bits of the corresponding hwaddr value will be added |
| to the page's |region_offset| value. This new address is passed |
| to the read/write callback as the 'i/o address' for the operation. |
| |
| - As a special exception, if the I/O page's IO_MEM_ROMD flag is |
| set, then high_addr is not 0, but the high bits of the corresponding |
| ram_addr_t backing the page's contents on reads. On write operations |
| though, the I/O region type's write callback will be called instead. |
| |
| ROMD I/O page phys_offset value: |
| +----------------------------------------------------------+ |
| | high_addr | mem_type_index | flags | |
| +----------------------------------------------------------+ |
| |
| Note that |region_offset| is ignored when reading from such pages, |
| it's only used when writing to the I/O page. |