|  |  | 
|  | Device Specification for Inter-VM shared memory device | 
|  | ------------------------------------------------------ | 
|  |  | 
|  | The Inter-VM shared memory device is designed to share a memory region (created | 
|  | on the host via the POSIX shared memory API) between multiple QEMU processes | 
|  | running different guests. In order for all guests to be able to pick up the | 
|  | shared memory area, it is modeled by QEMU as a PCI device exposing said memory | 
|  | to the guest as a PCI BAR. | 
|  | The memory region does not belong to any guest, but is a POSIX memory object on | 
|  | the host. The host can access this shared memory if needed. | 
|  |  | 
|  | The device also provides an optional communication mechanism between guests | 
|  | sharing the same memory object. More details about that in the section 'Guest to | 
|  | guest communication' section. | 
|  |  | 
|  |  | 
|  | The Inter-VM PCI device | 
|  | ----------------------- | 
|  |  | 
|  | From the VM point of view, the ivshmem PCI device supports three BARs. | 
|  |  | 
|  | - BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is | 
|  | not used. | 
|  | - BAR1 is used for MSI-X when it is enabled in the device. | 
|  | - BAR2 is used to access the shared memory object. | 
|  |  | 
|  | It is your choice how to use the device but you must choose between two | 
|  | behaviors : | 
|  |  | 
|  | - basically, if you only need the shared memory part, you will map BAR2. | 
|  | This way, you have access to the shared memory in guest and can use it as you | 
|  | see fit (memnic, for example, uses it in userland | 
|  | http://dpdk.org/browse/memnic). | 
|  |  | 
|  | - BAR0 and BAR1 are used to implement an optional communication mechanism | 
|  | through interrupts in the guests. If you need an event mechanism between the | 
|  | guests accessing the shared memory, you will most likely want to write a | 
|  | kernel driver that will handle interrupts. See details in the section 'Guest | 
|  | to guest communication' section. | 
|  |  | 
|  | The behavior is chosen when starting your QEMU processes: | 
|  | - no communication mechanism needed, the first QEMU to start creates the shared | 
|  | memory on the host, subsequent QEMU processes will use it. | 
|  |  | 
|  | - communication mechanism needed, an ivshmem server must be started before any | 
|  | QEMU processes, then each QEMU process connects to the server unix socket. | 
|  |  | 
|  | For more details on the QEMU ivshmem parameters, see qemu-doc documentation. | 
|  |  | 
|  |  | 
|  | Guest to guest communication | 
|  | ---------------------------- | 
|  |  | 
|  | This section details the communication mechanism between the guests accessing | 
|  | the ivhsmem shared memory. | 
|  |  | 
|  | *ivshmem server* | 
|  |  | 
|  | This server code is available in qemu.git/contrib/ivshmem-server. | 
|  |  | 
|  | The server must be started on the host before any guest. | 
|  | It creates a shared memory object then waits for clients to connect on a unix | 
|  | socket. All the messages are little-endian int64_t integer. | 
|  |  | 
|  | For each client (QEMU process) that connects to the server: | 
|  | - the server sends a protocol version, if client does not support it, the client | 
|  | closes the communication, | 
|  | - the server assigns an ID for this client and sends this ID to him as the first | 
|  | message, | 
|  | - the server sends a fd to the shared memory object to this client, | 
|  | - the server creates a new set of host eventfds associated to the new client and | 
|  | sends this set to all already connected clients, | 
|  | - finally, the server sends all the eventfds sets for all clients to the new | 
|  | client. | 
|  |  | 
|  | The server signals all clients when one of them disconnects. | 
|  |  | 
|  | The client IDs are limited to 16 bits because of the current implementation (see | 
|  | Doorbell register in 'PCI device registers' subsection). Hence only 65536 | 
|  | clients are supported. | 
|  |  | 
|  | All the file descriptors (fd to the shared memory, eventfds for each client) | 
|  | are passed to clients using SCM_RIGHTS over the server unix socket. | 
|  |  | 
|  | Apart from the current ivshmem implementation in QEMU, an ivshmem client has | 
|  | been provided in qemu.git/contrib/ivshmem-client for debug. | 
|  |  | 
|  | *QEMU as an ivshmem client* | 
|  |  | 
|  | At initialisation, when creating the ivshmem device, QEMU first receives a | 
|  | protocol version and closes communication with server if it does not match. | 
|  | Then, QEMU gets its ID from the server then makes it available through BAR0 | 
|  | IVPosition register for the VM to use (see 'PCI device registers' subsection). | 
|  | QEMU then uses the fd to the shared memory to map it to BAR2. | 
|  | eventfds for all other clients received from the server are stored to implement | 
|  | BAR0 Doorbell register (see 'PCI device registers' subsection). | 
|  | Finally, eventfds assigned to this QEMU process are used to send interrupts in | 
|  | this VM. | 
|  |  | 
|  | *PCI device registers* | 
|  |  | 
|  | From the VM point of view, the ivshmem PCI device supports 4 registers of | 
|  | 32-bits each. | 
|  |  | 
|  | enum ivshmem_registers { | 
|  | IntrMask = 0, | 
|  | IntrStatus = 4, | 
|  | IVPosition = 8, | 
|  | Doorbell = 12 | 
|  | }; | 
|  |  | 
|  | The first two registers are the interrupt mask and status registers.  Mask and | 
|  | status are only used with pin-based interrupts.  They are unused with MSI | 
|  | interrupts. | 
|  |  | 
|  | Status Register: The status register is set to 1 when an interrupt occurs. | 
|  |  | 
|  | Mask Register: The mask register is bitwise ANDed with the interrupt status | 
|  | and the result will raise an interrupt if it is non-zero.  However, since 1 is | 
|  | the only value the status will be set to, it is only the first bit of the mask | 
|  | that has any effect.  Therefore interrupts can be masked by setting the first | 
|  | bit to 0 and unmasked by setting the first bit to 1. | 
|  |  | 
|  | IVPosition Register: The IVPosition register is read-only and reports the | 
|  | guest's ID number.  The guest IDs are non-negative integers.  When using the | 
|  | server, since the server is a separate process, the VM ID will only be set when | 
|  | the device is ready (shared memory is received from the server and accessible | 
|  | via the device).  If the device is not ready, the IVPosition will return -1. | 
|  | Applications should ensure that they have a valid VM ID before accessing the | 
|  | shared memory. | 
|  |  | 
|  | Doorbell Register:  To interrupt another guest, a guest must write to the | 
|  | Doorbell register.  The doorbell register is 32-bits, logically divided into | 
|  | two 16-bit fields.  The high 16-bits are the guest ID to interrupt and the low | 
|  | 16-bits are the interrupt vector to trigger.  The semantics of the value | 
|  | written to the doorbell depends on whether the device is using MSI or a regular | 
|  | pin-based interrupt.  In short, MSI uses vectors while regular interrupts set | 
|  | the status register. | 
|  |  | 
|  | Regular Interrupts | 
|  |  | 
|  | If regular interrupts are used (due to either a guest not supporting MSI or the | 
|  | user specifying not to use them on startup) then the value written to the lower | 
|  | 16-bits of the Doorbell register results is arbitrary and will trigger an | 
|  | interrupt in the destination guest. | 
|  |  | 
|  | Message Signalled Interrupts | 
|  |  | 
|  | An ivshmem device may support multiple MSI vectors.  If so, the lower 16-bits | 
|  | written to the Doorbell register must be between 0 and the maximum number of | 
|  | vectors the guest supports.  The lower 16 bits written to the doorbell is the | 
|  | MSI vector that will be raised in the destination guest.  The number of MSI | 
|  | vectors is configurable but it is set when the VM is started. | 
|  |  | 
|  | The important thing to remember with MSI is that it is only a signal, no status | 
|  | is set (since MSI interrupts are not shared).  All information other than the | 
|  | interrupt itself should be communicated via the shared memory region.  Devices | 
|  | supporting multiple MSI vectors can use different vectors to indicate different | 
|  | events have occurred.  The semantics of interrupt vectors are left to the | 
|  | user's discretion. |