Moshe Malka
Rethinking the I/O Memory
Management Unit (IOMMU)
Technion - Computer Science Department - M.Sc. Thesis MSC-2015-10 - 2015
Research Thesis
Moshe Malka
Some results in this thesis have been published as articles by the author and research collaborators in conferences and journals during the course of the author's research period, the most up-to-date versions of which are:
1. Moshe Malka, Nadav Amit, Muly Ben-Yehuda and Dan Tsafrir. rIOMMU: Efficient IOMMU for I/O Devices that Employ Ring Buffers. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2015).
2. Moshe Malka, Nadav Amit, and Dan Tsafrir. Efficient Intra-Operating System Protection Against Harmful DMAs. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST 2015).
Acknowledgements
I would like to thank my advisor Dan Tsafrir for his devoted guidance and help, my
research team Nadav Amit and Muli Ben-Yehuda, my parents and my friends.
Contents
List of Figures
Abstract
1 Introduction
2 Background
2.1 Virtual Memory
2.1.1 Physical and Virtual Addressing
2.1.2 Address Spaces
2.1.3 Page Table
2.1.4 Virtual Memory as a Tool for Memory Protection
2.1.5 Address Translation
2.2 Direct Memory Access
2.2.1 Transferring Data from the Memory to the Device
2.2.2 Transferring Data from the Device to the Memory
2.3 Adding Virtual Memory to I/O Transactions
6 Conclusions
6.1 rIOMMU
6.2 eIOVAR
6.3 Reducing the IOTLB Miss Overhead
Hebrew Abstract
List of Figures
Abstract
Processes are encapsulated with virtual memory spaces and access the memory via
virtual addresses (VAs) to ensure, among other things, that they access only those
memory parts they have been explicitly granted access to. These memory spaces are
created and maintained by the OS, and the translation from virtual to physical addresses
(PAs) is done by the MMU. Analogously, I/O devices can be encapsulated within I/O virtual memory spaces and access the memory using I/O virtual addresses (IOVAs), which are translated to physical addresses by the I/O memory management unit (IOMMU).
This encapsulation increases system availability and reliability, since it prevents
devices from overwriting any part of the memory, including memory that might be used
by other entities. It also prevents rogue devices from performing errant or malicious
access to the memory and ensures that buggy devices will not lose important data.
Chip makers understood the importance of this and added IOMMUs to the chipsets of
all servers and some PCs. However, this protection comes at the cost of performance
degradation, which depends on the IOMMU design, the way it is programmed, and the
workload. We found that Intel's IOMMU degrades the throughput of I/O-intensive workloads by up to an order of magnitude.
We investigate the possible causes of overhead in the IOMMU and its driver and suggest a solution for each. First we identify that the complexity of the kernel
subsystem in charge of IOVA allocation is linear in the number of allocated IOVAs and
thus a major source of overhead. We optimize the allocation in a manner that ensures
that the complexity is typically constant and never worse than logarithmic, and we
improve the performance of the Netperf, Apache, and Memcached benchmarks by up
to 4.6x. Observing that the IOTLB miss rate can be as high as 50%, we then suggest
hiding the IOTLB misses with a prefetcher. We extend some state-of-the-art prefetchers to the IOTLB and compare them. In our experiments we achieve a hit rate
of up to 99% on some configurations and workloads. Finally, we observe that many
devices such as network and disk controllers typically interact with the OS via circular
ring buffers that induce a sequential, completely predictable workload. We design a ring
IOMMU (rIOMMU) that leverages this characteristic by replacing the virtual memory
page table hierarchy with a circular, flat table. Using standard networking benchmarks,
we show that rIOMMU provides up to 7.56x higher throughput relative to the baseline
IOMMU, and that it is within 0.77-1.00x the throughput of a system without IOMMU.
NIC : Network Interface Controller
OS : Operating System
PA : Physical Address
Chapter 1
Introduction
I/O transactions, such as reading data from a hard disk or a network card, constitute a major part of the workloads in today's computer systems, especially in servers. They are, therefore, an important component of system performance. To reduce CPU utilization, most of today's peripheral devices can bypass the processor and directly read from and write to the main memory using a direct memory access (DMA) hardware unit [23, 17].
This unit is described in section 2.2.
Although accessing the memory directly gives performance advantages, it can also
lead to stability and security problems. A peripheral device with direct memory access can be programmed to overwrite any part of the system's memory or to cause bugs in the system; it can also be made vulnerable to infection with malicious software [15].
These disadvantages of direct access have received a lot of attention recently. How-
ever, similar problems also existed for CPU processes back before the virtual memory
mechanism was added to CPUs. Virtual memory encapsulates the processes with an
abstraction layer that prevents direct access to the memory. Many of the problems solved
by the virtual memory closely resemble problems that emerged due to direct memory
access. Hardware designers have noticed this similarity and duplicated the virtual
memory mechanism for I/O devices, calling it I/O virtual memory. The hardware unit
in charge of I/O virtual memory is called the I/O memory management unit (IOMMU).
Duplicating the mechanism was natural because virtual memory is a major part of almost every modern computer.
In chapter 2 we expand on the required background regarding virtual memory, the
direct memory access mechanism, how the I/O virtual memory mechanism fits into the
picture, and the differences between processes and I/O devices in the context of virtual
memory.
Although I/O virtual memory solves the problems mentioned above, hardware
designers did not take into account the difference between the workload of I/O devices
and processes. As a result, systems that perform I/O transactions using I/O virtual
memory experience a significant reduction in performance. The goal of this work
is to study all the causes for the performance reduction and suggest new designs
and algorithms to reduce the performance overhead. We carefully investigated the
workload experienced by the virtual memory and found that hardware overheads are
not exclusively to blame for its high cost. Rather, the cost is amplified by software due
Chapter 2
Background
The concept of virtual memory was first developed by German physicist Fritz-Rudolf Güntsch at the Technische Universität Berlin in 1956 in his doctoral thesis, Logical Design of a Digital Computer with Multiple Asynchronous Rotating Drums and Automatic High Speed Memory Operation. In 1961, the Burroughs Corporation independently released the first commercial computer with virtual memory, the B5000, which used segmentation rather than paging. The first minicomputer to introduce virtual memory was the Norwegian NORD-1; during the 1970s, other minicomputers implemented virtual memory, notably VAX models running VMS. Before 1982, all Intel CPUs were designed to work in Real Mode. Real Mode, also called real address mode, is characterized by unlimited direct software access to all memory, I/O addresses, and peripheral hardware, but provides no support for memory protection. This situation continued until Protected Mode was added to the x86 architecture in 1982. Introduced with the release of Intel's 80286 (286) processor, Protected Mode was later extended, with the release of the 80386 in 1985, to include features such as virtual memory paging and safe multi-tasking.
Virtual memory is an abstraction layer over main memory that separates processes from the physical memory. It is a combination of operating system software, disk files, hardware address translation, hardware exceptions, and main memory that provides processes with a large private address space without any intervention from the application programmer. Virtual memory provides three main capabilities:
1. It provides programmers with a large uniform address space and lets them focus
on designing the program rather than dealing with managing the memory used by
the processes.
2. It uses main memory efficiently by treating it as a cache for process data stored on disk, keeping only the active areas in main memory and swapping data between disk and memory as needed.
3. It protects the address space of each process from corruption by other processes.
2.1 Virtual Memory
The goal of this subsection is to explain how virtual memory works, with a focus on:
1. The functionality relevant to I/O: address spaces, address translation, and so forth;
2. Protecting the system from devices that try to gain unauthorized access to the memory.
The reader is referred to the references on which this subsection is based for more in-depth information about virtual memory [18].
Main memory is organized as an array of M contiguous byte-size cells, where the first byte has an address of 0, the next byte has an address of 1, and so on.

Figure 2.1: A system that uses physical addressing. The CPU generates physical address 4 and main memory returns the data word stored there.
As the organization is simple, physical addressing is the most natural way for
a CPU to access memory. Figure 2.1 shows an example of physical addressing for
a load instruction that reads the word starting at physical address 4. An effective
physical address is created when the CPU executes the load instruction. It then
passes it to main memory over the memory bus. The main memory fetches the 4-byte
word that starts at physical address 4 and returns it to the CPU, which stores it
in a register. Early personal computers used physical addressing, and systems such as digital signal processors, embedded microcontrollers, and Cray supercomputers continue to use it.
Modern processors no longer use physical addressing. Instead, they use virtual
addressing, as shown in Figure 2.2. In virtual addressing, the CPU accesses the main
memory by generating a virtual address (or VA, for short), which is converted to the
appropriate physical address before it is sent to the memory. Converting a virtual
address into a physical address is called address translation. Address translation, like
exception handling, requires the CPU hardware and the operating system to work
together closely.
Figure 2.2: A system that uses virtual addressing. The CPU generates virtual address 4100, which the MMU translates to physical address 4 before it reaches main memory.
Virtual addresses are translated on the fly by the memory management unit (MMU),
which is dedicated hardware on the CPU chip that uses a look-up table stored in main
memory. The look-up table contains the translations for the mapped virtual addresses
and is managed by the operating system.
This distinction allows us to generalize, and also allows each data object to have multiple
independent addresses, each chosen from a different address space. The basic idea of
virtual memory is that each byte of main memory has a virtual address chosen from the
virtual address space, and a physical address chosen from the physical address space.
The mapping between virtual and physical pages is recorded in a data structure known as a page table, which is stored in physical memory and maps virtual pages to physical pages. The address translation hardware reads the page table each time it converts a virtual address to a physical address. The operating system maintains the contents of the page table and transfers pages back and forth between disk and DRAM. The basic organization of a page table is shown in Figure 2.3.
Figure 2.3: A page table for a system with eight virtual pages and four physical pages. PTEs for VP 1, VP 2, VP 4, and VP 7 are valid and point to pages resident in DRAM; PTEs for VP 3 and VP 6 hold disk addresses; PTEs 0 and 5 are null.
A page table is an array of page table entries (PTEs). Each page in the virtual
address space has a page table entry at a fixed offset in the page table. For our purposes,
we will assume that each page table entry consists of a valid bit and an n-bit address
field. The valid bit tells us whether the virtual page is currently cached in DRAM. If the
valid bit is set, the address field indicates the start of the corresponding physical page
in DRAM where the virtual page is cached. If the valid bit is not set, a null address
tells us that the virtual page has not yet been allocated. Otherwise, the address points
to the start of the virtual page on disk.
Figure 2.3 shows a page table for a system with eight virtual pages and four physical
pages. The four virtual pages, VP 1, VP 2, VP 4, and VP 7, are currently cached in
DRAM. Two pages, VP 0 and VP 5, are yet to be allocated, while pages VP 3 and VP
6 have been allocated - but are not currently cached. Because the DRAM cache is fully
associative, any physical page can contain any virtual page.
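The three PTE states described above can be sketched as a small helper. The pte_t layout and the function name are illustrative, not taken from any real MMU:

```c
#include <assert.h>
#include <stdint.h>

/* A minimal sketch of the PTE format described above: a valid bit plus
 * an address field. Illustrative layout, not a real system's. */

typedef struct {
    int      valid;  /* 1 if the virtual page is cached in DRAM */
    uint64_t addr;   /* physical page start (valid = 1), disk address
                        (valid = 0, addr != 0), or 0 if unallocated */
} pte_t;

typedef enum { PAGE_RESIDENT, PAGE_ON_DISK, PAGE_UNALLOCATED } page_state;

page_state classify_pte(pte_t pte)
{
    if (pte.valid)
        return PAGE_RESIDENT;    /* e.g., VP 1, VP 2, VP 4, VP 7 */
    if (pte.addr != 0)
        return PAGE_ON_DISK;     /* e.g., VP 3, VP 6 */
    return PAGE_UNALLOCATED;     /* null address: VP 0, VP 5 */
}
```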
When the operating system allocates a new page of virtual memory, for example as a result of calling malloc, this affects the page table, as shown in the example in Figure 2.4. In the example, VP 5 is allocated by creating room on disk and updating page table entry 5 to point to the newly created page on disk.
Figure 2.4: The page table after a new virtual page is allocated. The kernel creates room on disk for VP 5 and updates PTE 5 to point to that location.
Unless all parties expressly allow it, for example by calling specific inter-process communication system calls, modification of any virtual pages shared with other processes should be forbidden.
Having separate virtual address spaces makes it easy to isolate the private memories
of different processes. However, the address translation mechanism can be extended by
adding some additional permission bits to the page table entries to provide even greater
access control. This is possible because the address translation hardware reads a page
table entry each time the CPU generates an address. Figure 2.5 shows how this can be
done.
In this example we have added three permission bits to each page table entry:
The SUP bit indicates whether processes must be running in kernel (supervisor)
mode to access the page. Processes running in kernel mode can access any page,
while processes running in user mode can access pages for which SUP is 0.
The read and write bits control read and write access to the page. For
example, if process i is running in user mode, then it has permission to read VP 0
and to read or write VP 1, but it cannot access VP 2.
Figure 2.5: Page tables with permission bits. For process i:
VP 0: SUP=No, READ=Yes, WRITE=No, ADDRESS=PP 6
VP 1: SUP=No, READ=Yes, WRITE=Yes, ADDRESS=PP 4
VP 2: SUP=Yes, READ=Yes, WRITE=Yes, ADDRESS=PP 2
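The permission check implied by Figure 2.5 can be sketched as follows. The struct layout and function name are ours; real PTEs pack these bits into a single word:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the SUP/READ/WRITE check described above. Illustrative
 * types and names; real hardware performs this check during the walk. */

struct pte_perm { bool sup, read, write; };

/* May a process in the given mode perform the given access? */
bool access_ok(struct pte_perm p, bool kernel_mode, bool is_write)
{
    if (p.sup && !kernel_mode)
        return false;                 /* supervisor-only page */
    return is_write ? p.write : p.read;
}
```

With the Figure 2.5 entries, a user-mode process may read VP 0, read or write VP 1, and not touch VP 2 at all.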
The memory management unit uses the virtual page number to select the appropriate
page table entry. For example, VPN 0 selects PTE 0, VPN 1 selects PTE 1, and so on.
The corresponding physical address is the concatenation of the physical page number
(PPN) from the page table entry and the virtual page offset from the virtual address.
As both the physical and virtual pages are P bytes, the physical page offset (PPO) is identical to the virtual page offset (VPO).
Figure 2.6: Address translation with a page table. The page table base register (PTBR) points to the page table; the VPN selects a PTE; if valid = 0, the page is not in memory (page fault); otherwise the PPN from the PTE is concatenated with the page offset to form the physical address.
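The concatenation step can be expressed directly in code. This sketch assumes 4 KB pages (p = 12), and the helper names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the MMU's final step: the physical address is the PPN from
 * the PTE concatenated with the unchanged page offset. Assumes p = 12
 * (4 KB pages); helper names are ours. */

#define P_BITS 12

uint64_t vpn_of(uint64_t va) { return va >> P_BITS; }
uint64_t vpo_of(uint64_t va) { return va & ((1ull << P_BITS) - 1); }

uint64_t make_pa(uint64_t ppn, uint64_t va)
{
    return (ppn << P_BITS) | vpo_of(va);  /* PPN ++ page offset */
}
```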
Figure 2.7: Operational view of a page hit.
1. The processor generates a virtual address and sends it to the memory management unit.
2. The memory management unit generates the page table entry address and requests it from the cache/main memory.
3. The cache and/or main memory return the page table entry to the memory management unit.
4. The memory management unit constructs the physical address and sends it to main memory.
5. The main memory returns the requested data word to the processor.
Page hits are handled entirely by hardware. Handling a page fault requires coopera-
tion between hardware and the operating system kernel. As it is not relevant to our
work, we do not explain the process here.
Each time the CPU generates a virtual address, the memory management unit must refer to a page table entry to translate the virtual address into a physical address. This requires an additional fetch from memory that takes tens or even hundreds of cycles. If the page table entry is cached in L1, then the overhead is reduced to only one or two cycles. However, even this low cost can be further reduced or even eliminated by including a small cache of page table entries in the memory management unit. This cache is called a translation lookaside buffer (or TLB for short).
A TLB is a small, virtually addressed cache where each line holds a block consisting
of a single page table entry. A translation lookaside buffer usually has a high degree
of associativity. This is shown in Figure 2.8, in which the index and tag fields used
for set selection and line matching are extracted from the virtual page number in the
virtual address. If the translation lookaside buffer has T = 2^t sets, then the translation lookaside buffer index (TLBI) consists of the t least significant bits of the virtual page number, and the translation lookaside buffer tag (TLBT) consists of the remaining bits in the virtual page number.
Figure 2.8: Components of a virtual address that are used to access the TLB.
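The TLBI/TLBT split can be sketched as two bit operations. Here t = 2 (four sets), matching the small memory system used later in this chapter; the function names are ours:

```c
#include <assert.h>

/* Sketch of the split shown in Figure 2.8 for a TLB with T = 2^t sets
 * (t = 2 here). Illustrative helpers, not hardware. */

#define T_BITS 2

unsigned tlbi(unsigned vpn) { return vpn & ((1u << T_BITS) - 1); } /* t LSBs   */
unsigned tlbt(unsigned vpn) { return vpn >> T_BITS; }              /* the rest */
```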
The steps involved when there is a translation lookaside buffer hit (the usual case)
are shown in Figure 2.9. The important point is that all of the address translation steps
are performed inside the on-chip memory management unit. Because of this, they are
performed very quickly.
Figure 2.9: Operational view of a TLB hit.

1. The processor generates a virtual address and sends it to the memory management unit.
2. The memory management unit checks whether the translation exists in the translation lookaside buffer.
3. The memory management unit fetches the appropriate page table entry from the
translation lookaside buffer.
4. The memory management unit translates the virtual address into a physical
address, and then sends it to the main memory.
5. The main memory returns the requested data word to the CPU.
If there is a translation lookaside buffer miss, the memory management unit must fetch the page table entry from the L1 cache. This is shown in Figure 2.10. The newly fetched page table entry is stored in the translation lookaside buffer and may possibly overwrite an existing entry.
Until now we have assumed that the system uses a single page table for address
translation. If, however, we had a 32-bit address space, 4 KB pages, and a 4-byte page
table entry, we would require a 4 MB page table resident in memory at all times, even if
the application referenced only a small part of the virtual address space. This problem
is compounded for systems with 64-bit address spaces.
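The sizing argument above is simple arithmetic: one PTE per virtual page. The helper below, with an illustrative name, reproduces it:

```c
#include <assert.h>
#include <stdint.h>

/* A flat table needs one PTE per virtual page, so for a 32-bit space
 * with 4 KB pages and 4-byte PTEs: 2^(32-12) entries * 4 B = 4 MB. */

uint64_t flat_table_bytes(unsigned va_bits, unsigned page_bits,
                          unsigned pte_bytes)
{
    uint64_t entries = 1ull << (va_bits - page_bits);
    return entries * pte_bytes;
}
```

The same formula shows why 64-bit address spaces make a single flat table untenable.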
Figure 2.10: Operational view of a TLB miss.
A commonly used method of compacting the page table is to use a hierarchy of page
tables. The idea is best explained using an example in which a 32-bit virtual address
space is partitioned into 4 KB pages, with page table entries that are 4 bytes each, and
for which the virtual address space has the following form:
The first 2K pages of memory are allocated for code and data;
Figure 2.11 shows how a two-level page table hierarchy for this virtual address space
might be constructed.
Each page table entry in the level-1 table is responsible for mapping a 4 MB segment
of the virtual address space, in which each segment consists of 1024 contiguous pages.
For example, page table entry 0 maps the first segment, page table entry 1 the next
segment, and so on. As the address space is 4 GB, 1024 page table entries are sufficient
to cover the entire space.
If every page in segment i is unallocated, level 1 page table entry i will be null. For example, in Figure 2.11, segments 2 to 7 are unallocated. However, if at least one page in segment i is allocated, level 1 page table entry i will point to the base of a level 2 page table. This is shown in Figure 2.11, where all or portions of segments 0, 1, and 8 are allocated, so their level 1 page table entries point to level 2 page tables.
17
Virtual
memory
Level 2 VP 0
Technion - Computer Science Department - M.Sc. Thesis MSC-2015-10 - 2015
Page table
Level 1
Page table PTE 0
2K allocated VM
VP 1023
PTE 0 .. pages for code and
data
VP 1024
PTE 1 PTE 1023
PTE 2 (NULL)
VP 2047
PTE 3 (NULL) PTE 0
PTE 4 (NULL) ..
PTE 5 (NULL) PTE 1023
6 k allocated VM
PTE 6 (NULL) Gap
pages
PTE 7 (NULL)
Figure 2.11: A two-level page table hierarchy. Notice that addresses increase from top
to bottom.
Each page table entry in a level 2 page table maps a 4 KB page of virtual memory, just as in a single-level page table. With 4-byte page table entries, each level 1 and level 2 page table is 4K bytes, which, conveniently, is the same size as a page.
Moreover, only the level 1 table needs to be in main memory all the time. The level
2 page tables can be created and paged in and out by the virtual memory system as
they are needed. This further reduces demand on the main memory. Only the most
frequently used level 2 page tables need be cached in the main memory.
Figure 2.12 summarizes address translation with a k-level page table hierarchy.
1. The virtual address is partitioned into k virtual page numbers and a virtual page offset.
2. Each page table entry in a level-j table, 1 ≤ j ≤ k − 1, points to the base of some page table at level j + 1.
3. Each page table entry in a level-k table contains either the physical page number of some physical page, or the address of a disk block.
To create the physical address, the memory management unit must access k page table entries before it can determine the physical page number. As with a single-level hierarchy, the physical page offset is identical to the virtual page offset.
Figure 2.12: Address translation with a k-level page table hierarchy.
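The k-level walk above can be sketched as follows. The toy parameters (four entries per level, 64-byte pages) and the type and function names are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the k-level walk summarized above. Levels 1..k-1 hold
 * pointers to the next-level table; level k holds the PPN. */

#define LEVEL_BITS 2   /* toy tables: 4 entries per level */
#define P_BITS     6   /* toy pages: 64 bytes */
#define MASK       ((1u << LEVEL_BITS) - 1)

typedef struct table {
    struct table *next[1 << LEVEL_BITS]; /* levels 1..k-1: child tables */
    uint64_t      ppn[1 << LEVEL_BITS];  /* level k: physical page numbers */
} table_t;

uint64_t walk(table_t *root, uint64_t va, int k)
{
    uint64_t vpn = va >> P_BITS;
    table_t *t = root;
    for (int j = 1; j < k; j++)          /* consume VPN 1 .. VPN k-1 */
        t = t->next[(vpn >> ((k - j) * LEVEL_BITS)) & MASK];
    uint64_t ppn = t->ppn[vpn & MASK];   /* VPN k indexes the leaf */
    return (ppn << P_BITS) | (va & ((1u << P_BITS) - 1)); /* PPN ++ VPO */
}

/* Self-check: a 2-level hierarchy mapping VPN 0x5 to PPN 0x9. */
uint64_t demo_translate(void)
{
    static table_t leaf, root;
    leaf.ppn[0x5 & MASK] = 0x9;           /* level-2 entry */
    root.next[0x5 >> LEVEL_BITS] = &leaf; /* level-1 entry */
    return walk(&root, (0x5u << P_BITS) | 0x2A, 2);
}
```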
At first glance, accessing k page table entries appears to be expensive and impractical.
However, the translation lookaside buffer compensates for this by caching page table
entries from the page tables at the different levels, the effect of which is that address
translation with multi-level page tables is not significantly slower than with single-level
page tables.
We put it all together in this subsection with an example of end-to-end address translation
on a small system with a translation lookaside buffer and L1 d-cache. For simplicity, we
make the following assumptions:
Figure 2.13: Addressing for the small memory system. Assume 14-bit virtual addresses (n = 14), 12-bit physical addresses (m = 12), and 64-byte pages (P = 64). The virtual address divides into an 8-bit virtual page number (VPN) and a 6-bit virtual page offset (VPO); the physical address into a 6-bit physical page number (PPN) and a 6-bit physical page offset (PPO).
The L1 d-cache is physically addressed and direct mapped, with a 4-byte line size
and a total of 16 sets.
The formats of the virtual and physical addresses are shown in Figure 2.13. Since each page is composed of 2^6 = 64 bytes, the low-order 6 bits of the virtual and physical addresses serve as the virtual page offset and physical page offset respectively. The high-order 8 bits of the virtual address serve as the virtual page number. The high-order 6 bits of the physical address serve as the physical page number.
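These field boundaries can be sketched as bit operations (the function names are ours):

```c
#include <assert.h>

/* Field extraction for the small memory system: 6 offset bits, 8-bit
 * VPN, 6-bit PPN. Illustrative helpers. */

unsigned vpo(unsigned va) { return va & 0x3F; } /* low 6 bits  */
unsigned vpn(unsigned va) { return va >> 6; }   /* high 8 bits */
unsigned ppo(unsigned pa) { return pa & 0x3F; }
unsigned ppn(unsigned pa) { return pa >> 6; }
```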
Figure 2.14(a) shows a snapshot of this memory system, including the translation
lookaside buffer; Figure 2.14(b) shows a portion of the page table; and Figure 2.14(c)
shows the L1 cache.
We have also shown how the bits of the virtual and physical addresses are partitioned
by the hardware as it accesses these devices. This can be seen above the figures for the
translation lookaside buffer and cache.
Page table: The page table is a single-level design with a total of 2^8 = 256 page table entries. However, we are only interested in the first sixteen of these. For
The TLB is four-way set associative with four sets. Its contents, as (Tag, PPN, Valid) triples, are:
Set 0: (03, –, 0) (09, 0D, 1) (00, –, 0) (07, 02, 1)
Set 1: (03, 2D, 1) (02, –, 0) (04, –, 0) (0A, –, 0)
Set 2: (02, –, 0) (08, –, 0) (06, –, 0) (03, –, 0)
Set 3: (07, –, 0) (03, 0D, 1) (0A, 34, 1) (02, –, 0)
Figure 2.14: TLB, page table, and cache for small memory system. All values in the
TLB, page table, and cache are in hexadecimal notation.
convenience, we have labeled each page table entry with the virtual page number
that indexes it. Keep in mind, however, that these virtual page numbers are not
part of the page table and not stored in memory. Also keep in mind that the
physical page number of each invalid page table entry is marked with a dash or
minus sign to emphasize that the bit values stored there are not meaningful.
Cache. The direct-mapped cache is addressed by the fields in the physical address.
As each block is 4 bytes, the low-order 2 bits of the physical address serve as the
block offset (CO) and, since there are 16 sets, the next 4 bits serve as the set
index (CI). The remaining 6 bits serve as the tag (CT).
What happens when the CPU executes a load instruction that reads the byte at
address 0x03d4? Recall that the hypothetical CPU reads one-byte words, not four-byte
words. In starting a manual simulation such as this, it is helpful to:
Identify the various fields we will need; and
Because the tag in Set 0x5 matches the cache tag, the cache detects a hit, reads
out the data byte (0x36) at offset CO, and returns it to the memory management unit,
which then passes it back to the CPU.
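The arithmetic behind this walkthrough can be checked mechanically. In the sketch below the function names are ours, and PPN 0x0D is the value the TLB supplies in Figure 2.14:

```c
#include <assert.h>

/* Address arithmetic for the 0x03d4 walkthrough. TLB fields come from
 * the VPN (t = 2); cache fields come from the physical address
 * (4-byte blocks, 16 sets). Illustrative helpers. */

unsigned w_vpn(unsigned va)  { return va >> 6; }
unsigned w_tlbi(unsigned va) { return w_vpn(va) & 0x3; }
unsigned w_tlbt(unsigned va) { return w_vpn(va) >> 2; }

unsigned w_pa(unsigned ppn, unsigned va) { return (ppn << 6) | (va & 0x3F); }

unsigned w_co(unsigned pa) { return pa & 0x3; }         /* block offset */
unsigned w_ci(unsigned pa) { return (pa >> 2) & 0xF; }  /* set index    */
unsigned w_ct(unsigned pa) { return pa >> 6; }          /* cache tag    */
```

For VA 0x03d4, the TLB lookup uses set 0x3 with tag 0x03, yielding PPN 0x0D; the resulting physical address 0x354 maps to cache set 0x5, consistent with the hit described above.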
It is possible to take other routes through the translation. For example, if the translation lookaside buffer misses, the memory management unit has to fetch the physical page number from a page table entry in the page table. If the resulting page table entry is invalid, a page fault has occurred and the kernel must page in the missing page and rerun the load instruction. Another possibility is that the page table entry is valid, but the necessary memory block misses in the cache.
2.2 Direct Memory Access

Besides reducing CPU utilization, DMA can also significantly improve the throughput of memory transactions from the device to the memory and vice versa.
There are two types of DMA: a software transfer of the data (by calling functions such as read) or a hardware transfer of the data, known as Rx when the data is transferred from the device to the memory, and Tx when it is transferred from the memory to the device. We will focus on hardware DMA transactions, which are done asynchronously. That is, the device transfers the data at its own rate without the involvement of the CPU, and the CPU continues its execution simultaneously. Those transactions are done using DMA descriptors, which include the memory location to access.
A DMA descriptor is a data structure that contains all the information the hardware needs to execute its operations, such as read or write. The descriptor is prepared by the OS in advance, and its location in memory is known to the device. Once the hardware becomes available to execute the next DMA command, it reads the descriptor, executes the relevant command, advances to read the next descriptor, and so on until it reaches an empty descriptor.
Consider an example of a case in which the device reads data from the memory, as
happens when a packet is sent to the NIC. On driver registration, the driver allocates a
set of descriptors, and, once there is data to transfer to the device, chooses a descriptor
belonging to the OS, updates it to point to the data buffer, writes in it the size of the
buffer, marks the descriptor as belonging to the device, and interrupts the device to
wake it up. The device reads the first descriptor that belongs to it, and from that descriptor it reads the pointer and the size of the buffer to be read. The device now knows the number of bytes to read and where to read them from, and it starts the transaction. After finishing the transaction, the device marks the descriptor as belonging to the OS, advances to the next descriptor, and interrupts the OS. The OS detaches the buffer from the descriptor, leaving the descriptor free for the next DMA command.
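The Tx flow above can be sketched as a descriptor ring with an ownership field. The layout, ring size, and names below are illustrative, not a real NIC's:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of the ownership hand-off described above: the driver fills a
 * descriptor it owns, hands it to the device, and reclaims it when the
 * device signals completion. */

#define RING_SIZE 4

enum owner { OWN_OS, OWN_DEVICE };

struct desc {
    void     *buf;   /* data buffer to transmit */
    uint32_t  len;   /* size of the buffer in bytes */
    enum owner own;  /* who may touch this slot */
};

static struct desc ring[RING_SIZE];  /* all slots start owned by the OS */
static unsigned head;                /* next slot the driver will use */

/* Driver side: post a buffer; returns 0 on success, -1 if ring is full. */
int post_tx(void *buf, uint32_t len)
{
    struct desc *d = &ring[head % RING_SIZE];
    if (d->own != OWN_OS)
        return -1;                   /* slot still in flight */
    d->buf = buf;
    d->len = len;
    d->own = OWN_DEVICE;             /* hand ownership to the device */
    head++;                          /* caller would now wake the device */
    return 0;
}

/* Completion side: mark slot i done and return it to the OS. */
void complete_tx(unsigned i)
{
    ring[i % RING_SIZE].own = OWN_OS;
    ring[i % RING_SIZE].buf = NULL;  /* OS detaches the buffer */
}
```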
When the device writes data to the memory, it interrupts the OS to announce that new
data has arrived. Then, the OS uses the interrupt handler to allocate a buffer and tells
the hardware where to transfer its data. After the device writes the data to the buffer
and raises another interrupt, the OS wakes up a relevant process and passes the packet
to it. A network card is a typical example of a device that transfers data asynchronously
to the memory. The OS prepares the descriptor in advance, allocates a buffer, links it
to the descriptor, and marks the descriptor belonging to the device. If no packet arrives,
the descriptor contains a link to an allocated buffer waiting for data to be written to it.
At the arrival of a packet from the network, the network card reads the descriptor in
order to identify the address of the buffer to write the data to, writes the data to the
buffer, and raises an interrupt to the OS. Using the interrupt handler, the OS detaches the buffer from the descriptor, allocates a new buffer, links it to the descriptor, and passes the packet written to the buffer to the network stack. Now the descriptor, with a newly allocated buffer, again belongs to the device and waits until a new packet arrives.
What follows is a brief description of what I/O virtual memory adds to the flow of I/O transactions. Before updating the DMA descriptor with the buffer address, the driver asks the IOMMU driver to map the physical address of the buffer and receives an I/O virtual address (IOVA). This address is inserted into the DMA descriptor instead of the physical address that would have been used had the IOMMU not been enabled. The flow is described in more detail in Figure 2.15.
I/O device transactions work as follows (each number refers to the corresponding
number in Figure 2.15):
1. When a device needs to perform an I/O transaction to/from the memory, its
driver (the piece of kernel software that controls the device) issues a request for an
I/O buffer.
2. The OS updates the lookup table (page table) and returns an IOVA to the device.
3. The device driver directs the device to transfer the data to and/or from a virtual
address via a corresponding DMA unit.
5. The IOMMU translates the IOVA to a physical address, and starts the transaction
to and/or from a physical address.
6. When the transaction ends, the device raises an interrupt to the driver.
8. The OS updates the radix tree (the page table) to indicate that the mapping is no longer available.
There are several strategies for deciding when to map and unmap the I/O buffer.
Strict is the common strategy and the one that offers maximum protection. Before a
DMA transaction takes place, all the memory accessed by the I/O device is mapped
and then unmapped once the transaction is complete (right after steps 3-6). Other
strategies postpone the unmap operation, appending it to a list whose items are later
unmapped together. The longer the unmap operation is delayed, the less the system
is protected from misbehaving device drivers.
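The trade-off between the strict and deferred strategies can be illustrated with a minimal sketch; the function names are hypothetical and the batching threshold is merely illustrative:

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH 250 /* illustrative flush threshold */

static uint64_t pending[BATCH]; /* IOVAs whose unmap is deferred */
static size_t   npending = 0;
static unsigned flushes  = 0;   /* counts IOTLB invalidations    */

/* Strict mode: invalidate the IOTLB entry right away (slow but safe). */
void unmap_strict(uint64_t iova) {
    (void)iova;
    flushes++; /* stands in for a per-entry invalidation */
}

/* Deferred mode: queue the IOVA; flush the whole IOTLB once the
 * batch fills, amortizing the invalidation cost over BATCH unmaps,
 * at the price of a window in which stale entries remain usable. */
void unmap_deferred(uint64_t iova) {
    pending[npending++] = iova;
    if (npending == BATCH) {
        flushes++; /* one global flush for the whole batch */
        npending = 0;
    }
}
```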
Chapter 3
3.1 Introduction
I/O device drivers initiate direct memory accesses (DMAs) to asynchronously move data
from their devices into memory and vice versa. In the past, DMAs used physical memory
addresses. But such unmediated access made systems vulnerable to (1) rogue devices
that might perform errant or malicious DMAs [14, 19, 40, 44, 65], and to (2) buggy
drivers that account for most operating system (OS) failures and might wrongfully trigger
DMAs to arbitrary memory locations [13, 21, 31, 50, 59, 63]. Subsequently, all major
chip vendors introduced I/O memory management units (IOMMUs) [3, 11, 36, 40],
which allow DMAs to execute with I/O virtual addresses (IOVAs). The IOMMU
translates the IOVAs into physical addresses according to I/O page tables that are set up
by the OS. The OS thus protects itself by adding a suitable translation just before
the corresponding DMA, and by removing the translation right after [17, 23, 64]. We
explain in detail how the IOMMU is implemented and used in Section 3.2.
DMA protection comes at a cost that can be substantial in terms of performance
[4, 15, 64], notably for newer, high-throughput I/O devices like 10/40 Gbps network
controllers (NICs), which can deliver millions of packets per second. Our measurements
indicate that using DMA protection with such devices can reduce the throughput by
up to 10x. This penalty has motivated OS developers to trade off some protection
for performance. For example, when employing the deferred IOMMU mode, the
Linux kernel defers IOTLB invalidations for a short while instead of performing them
immediately when necessary, because they are slow. The kernel then processes the
accumulated invalidations en masse by flushing the entire IOTLB, thus amortizing
the overhead at the risk of allowing devices to erroneously utilize stale IOTLB entries.
Figure 3.1: IOMMU is for devices what the MMU is for processes.
While this tradeoff can double the performance relative to the stricter IOMMU mode,
the throughput is still 5x lower than when the IOMMU is disabled. We analyze and
model the overheads associated with using the IOMMU in Section 3.3.
We argue that the degraded performance is largely due to the IOMMU needlessly
replicating the design of the regular MMU, which is based on hierarchical page tables.
Our claim pertains to the most widely used I/O devices, such as NICs and disk drives,
which utilize circular ring buffers in order to interact with the OS. A ring is an array
of descriptors that the OS driver sets when initiating DMAs. Descriptors encapsulate
the DMA details, including the associated IOVAs. Importantly, ring semantics dictate
that (1) the driver work through the ring in order, one descriptor after the other, and
that (2) the I/O device process these descriptors in the same order. Thus, IOVAs are
short-lived and the sequence in which they are used is linearly predictable: each IOVA
is allocated, placed in the ring, used in turn, and deallocated.
We propose a ring IOMMU (rIOMMU) that supports this pervasive sequential model
using flat (1D) page tables that directly correspond to the nature of rings. rIOMMU
has three advantages over the baseline IOMMU that significantly reduce the overhead
of DMA protection. First, building/destroying an IOVA translation in a flat table is
quicker than in a hierarchical structure. Second, the act of (de)allocating IOVAs (the
actual integers serving as virtual addresses) is faster, as IOVAs are indices of flat
tables in our design. Finally, the frequency of IOTLB invalidations is substantially
reduced, because the rIOMMU designates only one IOTLB entry per ring. One is
enough because IOVAs are used sequentially, one after the other. Consequently, every
translation inserted into the IOTLB removes the previous translation, eliminating the
need to explicitly invalidate the latter. And since the OS handles high-throughput I/O
in bursts, explicit invalidations become rare. We describe rIOMMU in Section 3.4.
We evaluate the performance of rIOMMU using standard network benchmarks.
We find that rIOMMU improves the throughput by 1.00-7.56x, shortens latency by
0.80-0.99x, and reduces CPU consumption by 0.36-1.00x relative to the baseline DMA
protection. Our fastest rIOMMU variant is within 0.77-1.00x the throughput, 1.00-1.04x
the latency, and 1.00-1.22x the CPU consumption of a system that disables the IOMMU
entirely. We describe our experimental evaluation in Section 4.6.
3.2 Background
The role the IOMMU plays for I/O devices is similar to the role the regular MMU plays
for processes, as illustrated in Figure 3.1. Processes typically access the memory using
virtual addresses, which are translated to physical addresses by the MMU. Analogously,
I/O devices commonly access the memory via DMAs associated with IOVAs. The
IOVAs are translated to physical addresses by the IOMMU.
The IOMMU provides inter- and intra-OS protection [4, 62, 64, 66]. Inter-OS
protection is applicable in virtual setups. It allows for direct I/O, where the host
assigns a device directly to a guest virtual machine (VM) for its exclusive use, largely
removing itself from the guest's I/O path and thus improving its performance [30, 50].
In this mode of operation, the VM directly programs device DMAs using its notion of
(guest) physical addresses. The host uses the IOMMU to redirect these accesses to
where the VM memory truly resides, thus protecting its own memory and the memory
of the other VMs. With inter-OS protection, IOVAs are mapped to physical memory
locations infrequently, typically only upon such events as VM creation and migration.
Such mappings are therefore denoted static or persistent [64]; they are not the focus of
this paper.
Intra-OS protection allows the OS to defend against the DMAs of errant/malicious
devices [14, 19, 24, 40, 44, 65] and of buggy drivers, which account for most OS failures
[21, 13, 31, 50, 59, 63]. Drivers and their I/O devices can perform DMAs to arbitrary
memory addresses, and IOMMUs allow OSes to protect themselves (and their processes)
against such accesses, by restricting them to specific physical locations. In this mode
of work, map operations (of IOVAs to physical addresses) and unmap operations
(invalidations of previous maps) are frequent and occur within the I/O critical path,
such that each DMA is preceded and followed by the mapping and unmapping of the
corresponding IOVA [44, 52]. Due to their short lifespan, these mappings are denoted
dynamic [17], streaming [23] or single-use [64]. This strategy of IOMMU-based intra-OS
protection is the focus of this paper. It is recommended by hardware vendors [40, 32, 44]
and employed by operating systems [9, 17, 23, 37, 51, 64].¹ It is applicable in non-virtual
setups where the OS has direct control over the IOMMU. It is likewise applicable in
¹ For example, the DMA API of Linux notes that DMA addresses "should be mapped only for the
time they are actually used and unmapped after the DMA transfer" [52]. In particular, "once a buffer
has been mapped, it belongs to the device, not the processor. Until the buffer has been unmapped, the
[OS] driver should not touch its contents in any way. Only after [the unmap of the buffer] has been
called is it safe for the driver to access the contents of the buffer" [23].
[Figure: format of the requester identifier (bus: bits 15-8, device: bits 7-3, function: bits 2-0) and of the IOVA (bits 63-48 zero, four 9-bit table indices in bits 47-12, page offset in bits 11-0).]
Given a target memory buffer of a DMA, the OS associates the physical address (PA) of
the buffer with an IOVA. The OS maps the IOVA to the PA by inserting the IOVA→PA
translation into the IOMMU data structures. Figure 4.1 depicts these structures as
implemented by Intel x86-64 [40]. The PCI protocol dictates that each DMA operation
is associated with a 16-bit requester identifier comprised of a bus-device-function triplet
that uniquely identifies the corresponding I/O device. The IOMMU uses the 8-bit bus
number to index the root table in order to retrieve the physical address of the context
table. It then indexes the context table using the 8-bit concatenation of the device
and function numbers. The result is the physical location of the root of the page table
hierarchy that houses all of the IOVA→PA translations of that I/O device.
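The two-step lookup by requester identifier can be sketched as follows; the helper names are ours, not Intel's:

```c
#include <stdint.h>

/* Split a 16-bit PCI requester identifier into its bus (8 bits),
 * device (5 bits), and function (3 bits) components. */
static inline uint8_t rid_bus (uint16_t rid) { return rid >> 8; }
static inline uint8_t rid_dev (uint16_t rid) { return (rid >> 3) & 0x1f; }
static inline uint8_t rid_func(uint16_t rid) { return rid & 0x7; }

/* The IOMMU lookup: the bus number indexes the 256-entry root table,
 * and the dev:func concatenation (8 bits) indexes the context table,
 * yielding the root of the device's I/O page table hierarchy. */
static inline uint8_t context_index(uint16_t rid) { return rid & 0xff; }
```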
The purpose of the IOMMU page table hierarchy is similar to that of the MMU
hierarchy: recording the mapping from virtual to physical addresses by utilizing a 4-level
radix tree. Each 48-bit (I/O) virtual address is divided into two: the 36 high-order
bits, which constitute the virtual page number, and the 12 low-order bits, which are the
offset within the page. The translation procedure applies to the virtual page number
only, converting it into a physical frame number (PFN) that corresponds to the physical
memory location being addressed. The offset is the same for both physical and virtual
pages.
Let Tj denote a page table in the j-th radix tree level, for j = 1, 2, 3, 4, such that
T1 is the root of the tree. Each Tj is a 4KB page containing up to 2^9 = 512 pointers
to physical locations of next-level Tj+1 tables. Last-level T4 tables contain PFNs of
target buffer locations. Correspondingly, the 36-bit virtual page number is split into a
sequence of four 9-bit indices i1, i2, i3 and i4, such that ij is used to index Tj in order
to find the physical address of the next Tj+1 along the radix tree path. Logically, in C
pointer notation, T1[i1][i2][i3][i4] is the PFN of the target memory location.
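A toy version of this walk, with plain in-memory arrays standing in for the radix tree and the hardware's physical-address indirection omitted, might look like:

```c
#include <stdint.h>

/* Extract the j-th 9-bit index (j = 1..4) from a 48-bit IOVA.
 * i1 occupies bits 47-39, i2 bits 38-30, i3 bits 29-21, i4 bits 20-12. */
static inline unsigned idx(uint64_t iova, int j) {
    return (iova >> (12 + 9 * (4 - j))) & 0x1ff;
}

static inline uint64_t page_offset(uint64_t iova) { return iova & 0xfff; }

/* A toy 4-level walk: each level holds up to 512 pointers, and the
 * last level holds physical frame numbers (PFNs). */
uint64_t walk(uint64_t ****t1, uint64_t iova) {
    uint64_t ***t2 = t1[idx(iova, 1)];
    uint64_t  **t3 = t2[idx(iova, 2)];
    uint64_t   *t4 = t3[idx(iova, 3)];
    uint64_t   pfn = t4[idx(iova, 4)];
    return (pfn << 12) | page_offset(iova); /* physical address */
}
```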
Similarly to the MMU translation lookaside buffer (TLB), the IOMMU caches
translations using an IOTLB, which it fills on-the-fly as follows. Upon an IOTLB
miss, the IOMMU hardware hierarchically walks the page table as described above,
and it inserts the IOVA→PA translation into the IOTLB. IOTLB entries are invalidated
explicitly by the OS as part of the corresponding unmap operation.
An IOMMU table walk fails if a matching translation was not previously established
by the OS, a situation which is logically similar to encountering a null pointer value
during the walk. A walk additionally fails if the DMA being processed conflicts with
the read/write permission bits found within the page table entries along the traversed
radix tree path. We note in passing that, at present, in contrast to MMU memory
accesses, DMAs are typically not restartable. Namely, existing systems usually do not
support I/O page faults, and hence the OS cannot populate the IOMMU page table
hierarchy on demand. Instead, IOVA translations of valid DMAs are expected to be
successful, and the corresponding pages must be pinned to memory. (I/O page fault
standardization does, however, exist [54].)
Many I/O devices, notably NICs and disk drives, deliver their I/O through one or
more producer/consumer ring buffers. A ring is an array shared between the OS device
driver and the associated device, as illustrated in Figure 3.3. The ring is circular in that
the device and driver wrap around to the beginning of the array when they reach its end.
The entries in the ring are called DMA descriptors. Their exact format and content
vary between devices, but they specify at least the address and size of the corresponding
target buffers. Additionally, the descriptors commonly contain status bits that help the
driver and the device to synchronize.
Devices must also know the direction of each requested DMA, namely, whether the
data should be transmitted from memory (into the device) or received (from the device)
into memory. The direction can be specified in the descriptor, as is typical for disk
controllers. Alternatively, the device can employ different rings for receive and transmit
activity, in which case the direction is implied by the ring. The receive and transmit
rings are denoted Rx and Tx, respectively. NICs employ at least one Rx and one Tx
Figure 3.3: A driver drives its device through a ring. With an IOMMU, pointers are
IOVAs (both registers and target buffers).
per port. They may employ multiple Rx/Tx rings per port to promote scalability, as
different rings can be handled concurrently by different cores.
Upon initialization, the OS device driver is responsible for allocating the rings and
configuring the I/O device with their size and base location. For each ring, the device
and driver utilize head and tail pointers to delimit the content of the ring that can be
used by the device: [head, tail). The device iteratively consumes (removes) descriptors
from the head, and it increments the head to point to the descriptor that it will use
next. Similarly, the driver adds descriptors to the tail, and it increments the tail to
point to the entry it will use subsequently.
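The head/tail arithmetic can be sketched as follows; the structure and helper names are illustrative rather than taken from any particular driver:

```c
#include <stdint.h>

#define RING_SIZE 8 /* illustrative; real rings are larger */

struct ring {
    uint32_t head; /* next descriptor the device will consume */
    uint32_t tail; /* next descriptor the driver will fill    */
};

/* Number of descriptors currently available to the device: [head, tail). */
uint32_t ring_avail(const struct ring *r) {
    return (r->tail + RING_SIZE - r->head) % RING_SIZE;
}

/* Driver produces a descriptor at the tail. */
void ring_produce(struct ring *r) { r->tail = (r->tail + 1) % RING_SIZE; }

/* Device consumes the descriptor at the head. */
void ring_consume(struct ring *r) { r->head = (r->head + 1) % RING_SIZE; }
```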
A device asynchronously informs its OS driver that data was transmitted or received
by triggering an interrupt. The device coalesces interrupts when their rate is high.
Upon receiving an interrupt, the driver of a high-throughput device handles the entire
I/O burst. Namely, it sequentially iterates through and processes all the descriptors
whose corresponding DMAs have completed.
This section enumerates the overhead components involved in using the IOMMU in the
Linux/Intel kernel (Section 3.3.1). It experimentally quantifies the overhead of each component
(Section 3.3.2). And it provides and validates a simple performance model that allows us to
understand how the overhead affects performance and to assess the benefits of reducing
it (Section 3.3.3).
3.3.1 Overhead Components
Suppose that a device driver that employs a ring wants to transmit or receive data
from/to a target buffer. Figure 3.4 lists the actions it carries out. First, it allocates the
target buffer, whose physical address is denoted p (1). (For simplicity, let us assume
that p is page aligned.) It pins p to memory and then asks the IOMMU driver to map
the buffer to some IOVA, such that the I/O device would be able to access p (2). The
IOMMU driver invokes its IOVA allocator, which returns a new IOVA v, an integer
that is not associated with any other page currently accessible to the I/O device (3).
The IOMMU driver then inserts the v→p translation into the page table hierarchy of
the I/O device (4), and it returns v to the device driver (5). Finally, when updating
the corresponding ring descriptor, the device driver uses v as the address for the target
buffer of the associated DMA operation (6).
Assume that the latter is a receive DMA. Figure 3.5 details the activity that takes
place when the I/O device gets the data. The device reads the DMA descriptor within
the ring through its head register. As the address held by the head is an IOVA, it is
intercepted by the IOMMU (1). The IOMMU consults its IOTLB to find a translation
for the head IOVA. If the translation is missing, the IOMMU walks the page table
hierarchy of the device to resolve the miss (2). Equipped with the head's physical
address, the IOMMU retrieves the head descriptor for the I/O device (3). The head
descriptor specifies that v (the IOVA defined above) is the address of the target buffer (4),
so the I/O device writes the received data to v (5). The IOMMU intercepts v, walks
the page table if the v→p translation is missing (6), and redirects the received data to
p (7).
Figure 3.6 shows the actions the device driver carries out after the DMA operation
is completed. The device driver asks the IOMMU driver to unmap the IOVA v (1).
In response, the IOMMU driver removes the v→p mapping from the page table
hierarchy (2), purges the mapping from the IOTLB (3), and deallocates v (4). (The
order of these actions is important.) Once the I/O device can no longer access p, it is
safe for the device driver to hand the buffer to higher levels in the software stack for
further processing (5).
We experimentally quantify the overhead components of the map and unmap functions
of the IOMMU driver as outlined in Figures 3.4 and 3.6. To this end, we execute
the standard Netperf TCP stream benchmark, which attempts to maximize network
throughput between two machines over a TCP connection. (The experimental setup is
detailed in Section 4.6.)
Strict Protection We begin by profiling the Linux kernel in its safer IOMMU mode,
denoted strict, which strictly follows the map/unmap procedures described in Section 3.3.1.
Figure 3.4: The I/O device driver maps an IOVA v to a physical target buffer p. It
then assigns v to the DMA descriptor.
Figure 3.5: The I/O device writes the packet it receives to the target buffer through v,
which the IOMMU translates to p.
Figure 3.6: After the DMA completes, the I/O device driver unmaps v and passes p to
a higher-level software layer.
function  component    strict  strict+  defer  defer+
map       iova alloc   3986    92       1674   108
          page table   588     590      533    577
          other        44      45       44     42
Table 3.1: Average cycles breakdown of the map and unmap functions of the IOMMU
driver for different protection modes.
Table 3.1 shows the average duration of the components of these procedures in cycles.
When examining the breakdown of strict/map, we see that its most costly component
is, surprisingly, IOVA allocation (Step 3 in Figure 3.4). Upon further investigation, we
found that the reason for this high cost is a nontrivial pathology in the Linux IOVA
allocator that regularly causes some allocations to be linear in the number of currently
allocated IOVAs. We were able to come up with a more efficient IOVA allocator, which
consistently allocates/frees in constant time [7]. We denote this optimized IOMMU
mode, which is quicker than strict but equivalent to it in terms of safety, as strict+.
Table 3.1 shows that strict+ indeed reduces the allocation time from nearly 4,000 cycles
to less than 100.
The remaining dominant strict(+)/map overhead is the insertion of the IOVA to
the IOMMU page table (Step 4 in Figure 3.4). The 500+ cycles of the insertion are
attributed to explicit memory barriers and cacheline flushes that the driver performs
when updating the hierarchy. Flushes are required, as the I/O page walk is incoherent
with the CPU caches on our system. (This is common nowadays; Intel started shipping
servers with coherent I/O page walks only recently.)
Focusing on the unmap components of strict/strict+, we see that finding the
unmapped IOVA in the allocators data structure is costlier in strict+ mode. The
reason: like the baseline strict, strict+ utilizes a red-black tree to hold the IOVAs. But
the strict+ tree is fuller, so the logarithmic search is longer. Conversely, strict+/free
(Step 4 in Figure 3.6) is done in constant time, rather than logarithmic, so it is quicker.
The other unmap components are: removing the IOVA from the page tables (Step
2 in Figure 3.6) and the IOTLB (Step 3). The removal takes 400+ cycles, which is
comparable to the duration of insertion. IOTLB invalidation is by far the slowest unmap
component at around 2,000 cycles; this result is consistent with previous work [4, 66].
Deferred Protection In order to reduce the high cost of invalidating IOTLB entries,
the Linux deferred protection mode relaxes strictness somewhat, trading off some safety
for performance. Instead of invalidating entries right away, the IOMMU driver queues
the invalidations until 250 freed IOVAs accumulate. It then processes all of them in
bulk by invalidating the entire IOTLB. This approach affects the cost of (un)mapping
in two ways, as shown in Table 3.1 in the defer and defer+ columns. (Defer+ is to
defer what strict+ is to strict.) First, as intended, it eliminates the cost of invalidating
individual IOTLB entries. And second, it reduces the cost of IOVA allocation in the
baseline deferred mode as compared to strict (1,674 vs. 3,986), because deallocating
IOVAs in bulk reduces somewhat the aforementioned linear pathology.
The drawback of deferred protection is that the I/O device might erroneously access
target buffers through stale IOTLB entries after the buffers have already been handed
back to higher software stack levels (Step 5 in Figure 3.6). Notably, at this point, the
buffers could be (re)used for other purposes.
Let C denote the average number of CPU cycles required to process one packet. Figure
3.7 shows C for each of the aforementioned IOMMU modes in our experimental setup.
The bottommost horizontal grid line shows Cnone , which is C when the IOMMU is
turned off. We can see, for example, that Cstrict is nearly 10x higher than Cnone .
Our experimental setup employs a NIC that uses two target buffers per packet:
one for the header and one for the data. Each packet thus requires two map and two
unmap invocations. So the processing of the packet includes: two IOVA (de)allocations;
two page table insertions and deletions; and two invalidations of IOTLB entries. The
corresponding aggregated cycles are respectively depicted as the three top stacked
sub-bars in the figure. The bottom, other sub-bar embodies all the rest of the packet
processing activity, notably TCP/IP and interrupt processing. As noted, the deferred
modes eliminate the IOTLB invalidation overhead, and the + modes reduce the
overhead of IOVA (de)allocation. But even Cdefer+ (the most performant mode, which
introduces a vulnerability window) is still over 3.3x higher than Cnone .
We find that the way the specific value of C affects the overall throughput of Netperf
is simple and intuitive. Specifically, if S denotes the cycles-per-second clock speed of
the core, then S/C is the number of packets the core can handle per second. And
since every Ethernet packet carries 1,500 bytes, the throughput of the system in Gbps
should be Gbps(C) = (1500 byte × 8 bit) × S / C, assuming S is given in GHz. Figure 3.8
shows that this simple model (thick line) is accurate. It coincides with the throughput
obtained when systematically lengthening Cnone using a carefully controlled busy-wait
loop (thin line). It also coincides with the throughput measured under the different
IOMMU modes (cross points).
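The model can be expressed directly in code; this is a sketch of the formula above, not of any measurement infrastructure:

```c
/* Throughput predicted by the model: Gbps(C) = 1500 byte * 8 bit * S / C,
 * where S is the core clock in GHz (so S/C is packets per nanosecond
 * scaled appropriately) and C is the cycles spent per packet. */
double model_gbps(double s_ghz, double c_cycles) {
    return 1500.0 * 8.0 * s_ghz / c_cycles;
}
```

Note how the model captures the 1/C proportionality: doubling the per-packet cycle count halves the predicted throughput.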
Consequences As our model accurately reflects reality, we conclude that the
translation activity carried out by the IOMMU (as depicted in Figure 3.5) does not affect
the performance of the system, even when servicing demanding benchmarks like
Netperf. Instead, the cost of IOMMU protection is entirely determined by the number
of cycles the core spends establishing and destroying IOVA mappings. Consequently,
we can later simulate and accurately assess the expected performance of our proposed
IOMMU by likewise spending cycles; there is no need to simulate the actual IOMMU
hardware circuitry external to the core. A second conclusion rests on the understanding
that throughput is proportional to 1/C. If C is high (right-hand side of Figure 3.8),
incrementally improving it would make little difference. The required change must be
significant enough to bring C to the proximity of Cnone .
3.4 Design
a)
struct rDEVICE {
    u16   size;
    rRING rings[size];
};

b)
struct rRING {
    u18  size;
    rPTE ring[size];
    // SW only
    u18  tail;
    // SW only
    u18  nmapped;
};

c)
struct rPTE {
    u64 phys_addr;
    u30 size;
    u02 dir;
    u01 valid;
    u31 unused;
}; // = 128bit

d)
struct rIOVA {
    u30 offset;
    u18 rentry;
    u16 rid;
}; // = 64bit
e)
struct rIOTLB_entry {
u16 bdf;
u16 rid;
u18 rentry;
rPTE rpte;
rPTE next;
};
Figure 3.9: The rIOMMU data structures. e) is used only by hardware. The last two
fields of rRING are used only by software.
Our goal is to design rIOMMU, an efficient IOMMU for devices that employ rings. We
aim to substantially reduce all IOVA-related overheads: (de)allocation, insertion/deletion
to/from the page table hierarchy, and IOTLB invalidation (see Figure 3.7). We base our
design on the observation that ring semantics (Section 3.2.3) dictate a well-defined memory
access order. The OS sequentially produces ring entries, one after the other. And the
I/O device sequentially consumes these entries, one after the other, making its DMA
pattern predictable.
We contend that the x86 hierarchical structure of page tables is poorly suited for
the ring model. For each DMA, the OS has to walk the hierarchical page table in order
to map the associated IOVA. Then, the device faults on the IOVA and so the IOMMU
must walk the table too. Shortly after, the OS has to walk the table yet again in order
to unmap the IOVA. Contributing to this overhead are the aforementioned memory
barriers and cacheline flushes required for propagating page table changes.
In a nutshell, we propose to replace the table hierarchy with a per-ring flat page
table (1D array) as shown in Figure 3.10. IOVAs would constitute indices of the array.
[Figure 3.10: format of the requester identifier (bus, device, function) and of the rIOVA, which consists of a ring id, a ring entry, and an offset.]
Data Structures Figure 3.9 defines the rIOMMU data structures. The rDEVICE
(Figure 3.9a) is to the rIOMMU what the root page table is to the baseline IOMMU.
It is uniquely associated with a bus-device-function (bdf) triplet and is pointed to by
the context table (Figure 4.1). As noted, each DMA carries with it a bdf, allowing the
rIOMMU to find the corresponding rDEVICE when needed. The rDEVICE consists of
a physical pointer to an array of rRING structures (Figure 3.9b) and a matching size.
Each rRING entry represents a flat page table. It likewise contains the table's physical
address and size. The OS associates with each rRING: (1) a tail pointing to the next
entry to be allocated in the flat table, and (2) the current number of valid mappings
in the table. The latter two are not architected and are unknown to the rIOMMU
hardware. We include them in rRING to simplify the description.
rIOTLB_entry rtable_walk(rDEVICE d, rIOVA iova) {
    // ... validate iova against the rIOMMU data structure limits ...
    rIOTLB_entry e;
    rRING r = d.rings[iova.rid];
    e.bdf = bus_dev_func;
    e.rid = iova.rid;
    e.rentry = iova.rentry;
    e.rpte = r.ring[e.rentry]; // copy
    rprefetch( d, e );
    return e;
}

// async
void rprefetch(rDEVICE d, rIOTLB_entry e) {
    rRING r = d.rings[e.rid];
    u18 next = (e.rentry + 1) % r.size;
    if( r.size > 1 && r.ring[next].valid )
        e.next = r.ring[next]; // copy
}
Figure 3.11: Outline of the rIOMMU logic. All DMAs are carried out with IOVAs
that are translated by the rtranslate routine.
    r.ring[t].phys_addr = pa;
    r.ring[t].size = size;
    r.ring[t].dir = direction;
    r.ring[t].valid = 1;
    sync_mem( & r.ring[t] );
    return pack_iova( 0/*offset*/, t/*rentry*/, rid );
}
Figure 3.12: Outline of the rIOMMU OS driver, implementing map and unmap, which
respectively correspond to Figures 3.4 and 3.6.
Each ring buffer of the I/O device is associated with two rRINGs in the rDEVICE
array. The first corresponds to IOVAs pointing to the device ring buffer (Step 1 of
Figure 3.5 for translating the head register). The second corresponds to IOVAs that
the device finds within its ring descriptors (Step 5 in Figure 3.5 for translating target
buffers). The IOVAs that reside in the first flat table are mapped as part of the I/O
device initialization. They will be unmapped only when the device is brought down, as
the device rings are always accessible to the device. IOVAs residing in the second flat
table are associated with DMA target buffers; they are mapped/unmapped repeatedly
and are valid only while their DMA is in flight.
The flat table pointed to by rRING.ring is an array of rPTE structures (Figure 3.9c).
An rPTE consists of the physical address and size associated with the corresponding
IOVA; two bits that specify the DMA direction, which can be from the device, to it, or
both; and a bit that indicates whether the rPTE (and thus the corresponding IOVA)
are valid. The physical address need not be page aligned and the associated size can
have any value, allowing for fine-grained protection.
The rIOVA structure (Figure 3.9d) defines the format of IOVAs. As noted, every
DMA has a bdf that uniquely identifies its rDEVICE. The rIOVA.rid (ring ID) serves as
an index to the corresponding rDEVICE.rings array, and thus it uniquely identifies the
rRING of the rIOVA. Likewise, rIOVA.rentry serves as an index to the rRING.ring array,
and thus it uniquely identifies the rPTE of the rIOVA. The target address of the rIOVA
is computed by adding rIOVA.offset to rPTE.phys_addr.
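Assuming the bit widths of the rIOVA struct above (30-bit offset, 18-bit rentry, 16-bit rid, low bits first), packing and translation reduce to a few shifts. The helper names besides pack_iova (which appears in Figure 3.12) are ours:

```c
#include <stdint.h>

/* Pack/unpack a 64-bit rIOVA per the field widths of Figure 3.9d:
 * offset in bits 29-0, rentry in bits 47-30, rid in bits 63-48. */
static inline uint64_t pack_iova(uint64_t off, uint64_t rentry, uint64_t rid) {
    return (rid << 48) | (rentry << 30) | off;
}
static inline uint32_t iova_offset(uint64_t v) { return v & ((1u << 30) - 1); }
static inline uint32_t iova_rentry(uint64_t v) { return (v >> 30) & 0x3ffff; }
static inline uint32_t iova_rid   (uint64_t v) { return v >> 48; }

/* Translation: the target physical address is the rPTE's base
 * address plus the rIOVA's offset. */
uint64_t rtranslate_addr(uint64_t pte_phys_addr, uint64_t iova) {
    return pte_phys_addr + iova_offset(iova);
}
```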
The data structures discussed so far are used by both software and hardware. They
are set up by the OS and utilized by the rIOMMU to translate rIOVAs. The last
one (Figure 3.9e) is a hardware-only structure, representing one rIOTLB entry. The
combination of its first two fields (bdf+rid) uniquely identifies a rRING flat page table,
which we denote as T. The rIOTLB utilizes at most one rIOTLB entry per T. The
combination of the first three fields (bdf+rid+rentry) uniquely identifies T's current
rPTE: the PTE associated with the most recently translated rIOVA that belongs to T.
The current rPTE is cached by rIOTLB_entry.rpte (which holds a copy). The rIOTLB_entry.next
field may or may not contain a prefetched copy of T's subsequent rPTE. (Our design
does not depend on the latter field and works just as well without it.)
Hardware The rtranslate routine (Figure 3.11 top/left) outlines how rIOMMU
translates a rIOVA to a physical address. First, it searches for e, the rIOTLB entry of the
rRING that is associated with the rIOVA. (Recall that there is only one such entry per
rRING.) If e is missing from the rIOTLB, rIOMMU walks the table using the data
structures defined above, finds the rPTE, and inserts to the rIOTLB a matching entry.
Doing the table walk ensures that e.rpte is the rPTE of the given rIOVA. However, if e
was initially found in the rIOTLB, then e and the rIOVA might mismatch. rIOMMU
therefore compares the rentry numbers of e and the IOVA, and it updates e if they are
different. Now that e is up-to-date, rIOMMU checks that the direction of the DMA is
permitted according to the rPTE. It also checks that the offset of the rIOVA is in range,
namely, smaller than the associated rPTE.size. Violating these conditions constitutes an
error, causing the rIOMMU to trigger an I/O page fault (IOPF). IOPFs are not expected
to occur (drivers pin target buffers to memory), and OSes typically reinitialize the I/O
device if they do. If no violation is detected, the rIOMMU finally performs the translation
by adding the offset of the rIOVA to rPTE.phys_addr.
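The flow just described can be modeled in software as follows. This is a simplified, hypothetical C rendition of the hardware routine (one TLB entry per ring, prefetching omitted), not the actual code of Figure 3.11:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t phys_addr, size; int direction, valid; } rpte_t;
typedef struct { uint32_t rentry; rpte_t rpte; int valid; } riotlb_entry_t;

/* dir: 1 = from device, 2 = to device, 3 = both (illustrative encoding).
   On error, *iopf is set, standing in for an I/O page fault. */
uint64_t rtranslate(riotlb_entry_t *e, const rpte_t *ring, uint32_t rentry,
                    uint64_t offset, int dir, int *iopf) {
    if (!e->valid || e->rentry != rentry) {   /* miss, or stale cached entry */
        e->rentry = rentry;                   /* flat-table walk: one lookup */
        e->rpte = ring[rentry];               /* caches the rPTE in the entry */
        e->valid = 1;
    }
    const rpte_t *p = &e->rpte;
    if (!p->valid || (p->direction & dir) == 0 || offset >= p->size) {
        *iopf = 1;                            /* invalid, bad direction, or */
        return 0;                             /* out-of-range offset        */
    }
    *iopf = 0;
    return p->phys_addr + offset;             /* the actual translation */
}
```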
The rtable_walk routine (Figure 3.11 top/right) ensures that the rIOVA complies with
the rIOMMU data-structure limits and points to a valid rPTE. Noncompliance might
be the result of, e.g., an errant DMA or a buggy driver. After validation, rtable_walk
initializes the rIOTLB entry in a straightforward manner based on the rIOVA and its
rPTE. It additionally attempts to prefetch the subsequent rPTE by invoking rprefetch
(Figure 3.11 bottom/right), which succeeds if the next rPTE is valid. Prefetching can
be asynchronous.
The riotlb_entry_sync routine (Figure 3.11 bottom/left) is used by rtranslate to syn-
chronize e (the rIOTLB entry) with the current rIOVA. The two become unsynchronized,
e.g., whenever the device handles a new DMA descriptor. The required rPTE can then
be found in e.next if prefetching was previously successful, in which case the routine
assigns e.next to e.rpte. Otherwise, it uses rtable_walk to fetch the needed rPTE. Finally,
it attempts to prefetch the subsequent rPTE.
Software The (un)map functions comprising the rIOMMU OS driver are shown in
Figure 3.12. Their prototypes are logically similar to the associated Linux functions
from the baseline IOMMU OS driver (Figures 3.4 and 3.6), with minor adjustments.
The map flow corresponds to Figure 3.4. It gets a device, a ring ID, a physical address
to be mapped, and the associated size and direction of the DMA. The first part of the
code allocates a ring entry rPTE at the ring's tail and then updates the tail/nmapped
fields accordingly. This allocation, which consists of incrementing two integers, is
analogous to the costly IOVA allocation of baseline Linux.
The second part of map initializes the newly allocated rPTE. When the rPTE is ready,
the map function invokes sync_mem, which ensures that the rPTE memory updates are
visible to the rIOMMU. This part of the code is analogous to walking and updating
the page table hierarchy of the baseline IOMMU, but it is simpler since the page table
is flat. The return statement of the map function packs the rentry index and its ring
ID into an IOVA as dictated by the rIOVA data structure (Figure 3.9d). The offset is
always set to 0 by the rIOMMU driver. Callers of map can later manipulate the
offset as they please, provided they conform to the size constraint encoded into the
corresponding rPTE.
The flow of unmap (Figure 3.12/right) corresponds to Figure 3.6. Unmap gets an
rIOVA, marks the associated rPTE as invalid (analogous to walking the table hierarchy),
decrements the ring's nmapped counter (analogous to IOVA deallocation), and synchro-
nizes the memory to make the rPTE update visible to the rIOMMU. Recall that when
the device driver is notified that its device has finished some DMAs, it loops through
the relevant descriptors and sequentially unmaps their IOVAs (§3.2.3). The driver sets
the end_of_burst parameter of unmap to true at the end of this loop, upon the last
IOVA. One invalidation is sufficient because, by design, each rRING has at most one
rIOTLB entry allocated in the rIOTLB.
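Under the description above, the driver's (un)map pair can be sketched as follows. The ring size and the 64-bit IOVA packing are illustrative assumptions, and sync_mem and the rIOTLB invalidation are reduced to comments; this is not the actual code of Figure 3.12:

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 256  /* illustrative */

typedef struct { uint64_t phys_addr, size; int direction, valid; } rpte_t;
typedef struct { rpte_t ring[RING_SIZE]; unsigned tail, nmapped; } rring_t;

/* Allocation is just two integer increments, replacing the costly IOVA
   allocation of baseline Linux; the packed IOVA has offset = 0. */
uint64_t rmap(rring_t *r, unsigned rid, uint64_t phys, uint64_t size, int dir) {
    unsigned rentry = r->tail;
    r->tail = (r->tail + 1) % RING_SIZE;
    r->nmapped++;
    r->ring[rentry] = (rpte_t){ phys, size, dir, 1 };
    /* sync_mem() would go here: make the rPTE visible to the rIOMMU */
    return ((uint64_t)rid << 32) | ((uint64_t)rentry << 12);
}

void runmap(rring_t *r, uint64_t iova, int end_of_burst) {
    unsigned rentry = (iova >> 12) & (RING_SIZE - 1);
    r->ring[rentry].valid = 0;   /* analogous to walking the hierarchy */
    r->nmapped--;                /* analogous to IOVA deallocation     */
    if (end_of_burst) {
        /* invalidate the ring's single rIOTLB entry, once per burst */
    }
}
```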
In our experimental testbed, our measurements indicate that the average loop length
of a throughput-sensitive workload such as Netperf is 200 iterations. This is long
enough to make the amortized cost of IOTLB invalidations negligible, as in the deferred
mode, but without sacrificing safety. Amortization, however, does not apply to latency-
sensitive workloads. Nonetheless, the invalidation cost is small in comparison to the
overall latency as will shortly be demonstrated.
Finally, we consider the problem of synchronizing the memory between the IOMMU
and its driver. In sync_mem (Figure 3.12 bottom/right), we see support for two hardware
modes, corresponding to whether the IOMMU table walk is coherent with the CPU
caches. The baseline Linux kernel queries the relevant IOMMU capability bit. If it
finds that the two are not in the same coherency domain, it introduces an additional
memory barrier followed by a cacheline flush. In the following section, we experimentally
evaluate two simulated rIOMMU versions corresponding to these two modes.
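A user-space sketch of the two sync_mem modes follows. The cacheline flush is stubbed out (on real hardware it would be a clflush of the rPTE's cache line), and the exact barrier choice is an assumption based on the description above:

```c
#include <assert.h>
#include <stdatomic.h>

static int flushes;  /* counts stubbed cacheline flushes, for illustration */

static void cacheline_flush(void *p) { (void)p; flushes++; }

/* coherent != 0 means the rIOMMU table walk snoops the CPU caches. */
void sync_mem(void *rpte, int coherent) {
    atomic_thread_fence(memory_order_release);      /* order the rPTE stores */
    if (!coherent) {
        atomic_thread_fence(memory_order_seq_cst);  /* additional barrier */
        cacheline_flush(rpte);                      /* push the rPTE to memory */
    }
}
```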
3.5 Evaluation
3.5.1 Methodology
enhances strict with our faster IOVA allocator; (3) defer, which is the Linux variant that
trades off some protection for performance by batching IOTLB invalidations; (4) defer+,
which is defer with our IOVA allocator; (5) riommu- (in lowercase), which is the newly
proposed rIOMMU when assuming no I/O page table coherency; (6) riommu, which
does assume coherent I/O page tables; and (7) none, which turns off the IOMMU.
The five non-rIOMMU modes are executed as is. They constitute full implementa-
tions of working systems and do not require a simulation component. To simulate the
two rIOMMU modes, we start with the none mode as the baseline. We then supplement
the baseline with calls to the (un)map functions, similarly to the way they are called in
the non-simulated IOMMU-enabled modes. But instead of invoking the native functions
of the Linux IOMMU driver (Figures 3.4 and 3.6), we invoke the (un)map functions
that we implement in the simulated rIOMMU driver (Figure 3.12). All the code of
the rIOMMU driver can be, and is, executed, with one exception. Since there is no
real rIOTLB, we must simulate the invalidation of rIOTLB entries. We do so by busy
waiting for 2,150 cycles upon each entry invalidation, in accordance with the measurements
specified in Table 3.1.
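The busy wait can be sketched with the x86 timestamp counter; the routine name is ours, and the 2,150-cycle bound is the measured invalidation cost quoted above:

```c
#include <assert.h>
#include <x86intrin.h>  /* __rdtsc(); x86-only sketch */

#define INVALIDATION_CYCLES 2150ULL  /* measured cost, Table 3.1 */

/* Busy-wait to charge the CPU the cost of one rIOTLB entry invalidation. */
static void simulate_riotlb_invalidation(void) {
    unsigned long long start = __rdtsc();
    while (__rdtsc() - start < INVALIDATION_CYCLES)
        ;  /* spin */
}
```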
Notice that our methodology does not account for differences between the existing
and proposed IOMMU translation mechanisms. Namely, we only account for actions
shown in Figures 3.4 and 3.6 but not those in Figure 3.5. Notably, we ignore the fact
that the IOMMU works harder than the rIOMMU due to IOTLB misses that rIOMMU
avoids via prefetching. We likewise ignore the fact that rIOMMU works harder than the
no-IOMMU mode, since it translates addresses whereas the no-IOMMU mode does not.
We ignore these differences, as the model validated in §3.3.3 shows that throughput
is entirely determined by the number of cycles it takes the core, not the device or
the IOMMU, to process a DMA request, even for the most demanding I/O-intensive
workloads. The system behaves this way probably because the device and IOMMU
operate in parallel to the CPU and are apparently fast enough so as not to constitute a
bottleneck.
We revalidate our methodology and show that it is also applicable for latency-sensitive
workloads by using the standard Netperf UDP request-response (RR) benchmark, which
repeatedly sends one byte to its peer and waits for an identical response. We run RR
under two IOMMU modes: hardware pass-through (HWpt) and software pass-through
(SWpt). With HWpt, the IOMMU is enabled but never experiences IOTLB misses;
instead, it translates each IOVA to an identical physical address without consulting
any page table. SWpt provides an equivalent functionality by using a page table that
maps the entire physical memory and associates each physical page address with an
identical IOVA. Under SWpt, Netperf RR experiences an IOTLB miss on every packet
it sends and receives. Nonetheless, we find that the performance of HWpt and SWpt
is identical, because the network stack and interrupt processing introduce far greater
latencies that hide the IOTLB miss penalty. Moreover, we find that the RR performance
of HWpt/SWpt is identical to that of no-IOMMU.
Throughput performance of Netperf stream with HWpt and SWpt is smaller by
10% relative to no-IOMMU. But here too the difference is entirely caused by the core:
about 200 CPU cycles spent on unrelated kernel abstraction code that executes under
Experimental Setup In an effort to get more general results, we conduct the evalu-
ation using two setups involving two different NICs, as follows.
The Mellanox setup (mlx for short) is comprised of two identical Dell PowerEdge
R210 II Rack Server machines that communicate through Mellanox ConnectX3 40Gbps
NICs. The NICs are connected back to back via a 40Gbps optical fiber and are configured
to use Ethernet. We use one machine as the server and the other as a workload generator
client. Each machine has an 8GB 1333MHz memory and a single-socket 4-core Intel Xeon
E3-1220 CPU running at 3.10GHz. The chipset is Intel C202, which supports VT-d,
Intel's Virtualization Technology that provides IOMMU functionality. We configure
the server to utilize one core only and turn off all power optimizations, namely sleep states
(C-states) and dynamic voltage and frequency scaling (DVFS), to avoid reporting
artifacts caused by nondeterministic events. The machines run Ubuntu 12.04 with the
Linux 3.4.64 kernel. All experimental findings described thus far were obtained with
the mlx setup.
The Broadcom setup (brcm for short) is similar, likewise utilizing two R210 machines.
The differences are that the two machines communicate through Broadcom NetXtreme
II BCM57810 10GbE NICs (connected via a CAT7 10GBASE-T cable for fast Ethernet);
that they are equipped with 16GB memory; and that they run the Linux 3.11.0 kernel.
The mlx and brcm device drivers differ substantially. Notably, mlx utilizes more
ring buffers and allocates more IOVAs (we observed a total of 12K addresses for mlx
and 3K for brcm). The mlx driver uses two target buffers per packet (header and
body) and thus two IOVAs, whereas the brcm driver allocates only one buffer/IOVA
per packet.
[Figure 3.13: Absolute performance numbers of the IOMMU modes when using the
Mellanox (top) and Broadcom (bottom) NICs. For each of netperf stream [Gbps],
netperf rr, apache 1MB, apache 1KB, and memcached [req/sec], the bars show
throughput (higher is better) and CPU consumption [%] under the strict, strict+,
defer, defer+, riommu-, riommu, and none modes.]
3. Apache [26, 27] is a popular HTTP web server. We drive it using ApacheBench
[8], the workload generator distributed with Apache. It measures the number of
requests per second that the server is capable of handling by requesting a static
page of a given size. We run it on the client machine configured to generate
32 concurrent requests. We use two instances of the benchmark, respectively
requesting a smaller (1KB) and a bigger (1MB) file.
3.5.2 Results
Table 3.2: Performance of riommu- and riommu divided by that of the other modes.

                     riommu- divided by                  riommu divided by
NIC   benchmark      strict strict+ defer defer+ none    strict strict+ defer defer+ none
throughput
brcm  stream         2.17   1.00    1.00  1.00   1.00    2.17   1.00    1.00  1.00   1.00
      rr             1.19   1.05    1.04  1.02   0.99    1.21   1.06    1.05  1.03   1.00
      apache 1M      1.20   1.01    1.00  1.00   1.00    1.20   1.01    1.00  1.00   1.00
      apache 1K      1.24   1.13    1.08  1.02   0.89    1.29   1.18    1.13  1.07   0.93
      memcached      1.76   1.35    1.18  1.10   0.78    1.88   1.45    1.27  1.18   0.84
cpu
mlx   stream         1.00   1.00    1.00  1.00   1.00    1.00   1.00    1.00  1.00   1.00
      rr             0.94   0.99    0.98  0.99   1.01    0.93   0.98    0.96  0.98   1.00
      apache 1M      0.99   0.99    1.00  1.00   1.00    0.99   0.99    0.99  1.00   1.00
      apache 1K      0.99   1.00    1.00  1.00   1.00    0.99   1.00    1.00  1.00   1.00
      memcached      1.00   1.00    1.00  1.00   1.00    1.00   1.00    1.00  1.00   1.00
brcm  stream         0.40   0.50    0.64  0.81   1.21    0.36   0.45    0.58  0.73   1.09
      rr             0.86   0.96    0.96  1.00   1.11    0.84   0.93    0.93  0.98   1.08
      apache 1M      0.48   0.49    0.60  0.75   1.41    0.41   0.42    0.52  0.65   1.22
      apache 1K      0.99   0.99    0.99  1.00   1.00    0.99   1.00    1.00  1.00   1.00
      memcached      1.00   1.00    1.00  1.00   1.00    1.00   1.00    1.00  1.00   1.00
We run each benchmark 100 times. Each individual run is configured to take 10
seconds. We treat the first 10 runs as warmup and report the average of the remaining
90 runs. Figure 3.13 shows the resulting throughput and CPU consumption for the
mlx (top) and brcm (bottom) setups. The corresponding normalized performance is
shown in Table 3.2, specifying the relative improvement of the two rIOMMU variants
over the other modes. The top/left plot in Figure 3.13 corresponds to the analysis and
data shown in Figures 3.7 and 3.8.
Let us discuss the results in Figure 3.13, left to right. The greatest observed
improvement by rIOMMU is attained with mlx / Netperf stream (Figure 3.13/top/left).
This result is to be expected considering the model from §3.3.3, which shows that every
cycle shaved off the IOVA (un)mappings translates into increased throughput. CPU
cycles constitute the bottleneck resource, as is evident from the mlx/stream/CPU curve,
which is at 100% for all IOMMU modes. The notable difference between riommu- and
riommu is due to 1.1K cycles that the former adds to the latter, which is the cost
of four additional memory barriers and four additional cacheline flushes, per packet.
(Specifically, a barrier and a cacheline flush in both map and unmap, for the two IOVAs
corresponding to the packet's header and data.) Riommu- and riommu provide 2.90-
7.56x higher throughput relative to the completely safe IOMMU modes strict and
strict+, and 1.74-3.79x higher throughput relative to the deferred modes. The latter,
however, does not constitute an apples-to-apples comparison, since the deferred modes
are vulnerable whereas the rIOMMU modes are safe. Riommu- and riommu deliver
0.52x and 0.77x of the throughput of the unprotected, no-IOMMU optimum.
The brcm/stream results (Figure 3.13/bottom/left) are quantitatively and qualita-
tively different. In particular, all IOMMU modes except strict have enough cycles to
saturate the Broadcom NIC and achieve its line rate of 10 Gbps. The brcm setup
requires fewer cycles per packet because its device driver is more efficient, e.g., due to
utilizing only one IOVA per packet instead of two. In setups of this type, where the
network is saturated, the performance metric of interest becomes the CPU consumption.
Table 3.3: Netperf RR round-trip time [µs] under each mode.

NIC    strict  strict+  defer  defer+  riommu-  riommu  none
mlx    17.3    15.1     14.9   14.4    14.1     13.9    13.4
brcm   41.9    36.7     36.6   35.8    35.1     34.7    34.6
By Table 3.2, we can see that riommu and riommu- consume only 0.36-0.50x as many
CPU cycles as the two strict modes; 0.58-0.81x as many as the deferred modes; and
1.09x and 1.21x as many as the no-IOMMU optimum, respectively.
The improvement by rIOMMU is less pronounced when running RR, in both mlx
and brcm, with 1.02-1.25x higher throughput and 0.84-1.00x the CPU consumption
relative to the strict and deferred variants. It is less pronounced due to RR's ping-pong
nature, which implies that CPU cycles are in low demand, as indicated by the CPU
curves at 28-30% for mlx and at 12-15% for brcm. For this reason, in comparison
to mlx/RR/none, rIOMMU has 4-5% lower throughput and nearly identical CPU
consumption. In comparison to brcm/RR/none, rIOMMU has 8-11% higher CPU
consumption and nearly identical throughput. Although the per-packet processing
time at the core is smaller in brcm, overall, the mlx hardware transmits packets faster,
as indicated by its higher RR throughput. The corresponding round-trip time of the
different modes (which, as noted, is the inverse of the throughput in RR's case) is shown
in Table 3.3.
The results of Apache 1MB are qualitatively identical to those of Netperf stream,
because the benchmark transmits a lot of data per request and is thus throughput
sensitive. Conversely, Apache 1KB is not throughput sensitive. Its smaller 1KB requests
make the performance of mlx and brcm look remarkably similar despite their networking
infrastructure difference. In both cases, the bottleneck is the CPU, while the volume of
the transmitted data is only a small fraction of the NICs capacity. (Both deliver 12K
requests per second of 1KB files, yielding a transfer rate of 0.1 Gbps.) This is because
Apache requires heavy processing for each HTTP request. This overhead is amortized
over hundreds of packets in the case of Apache 1MB, but over only one packet in the
case of 1KB. Consequently, the computational processing dominates the throughput of
Apache 1KB, and so the role of the networking infrastructure is marginalized. Even so,
rIOMMU demonstrates a 1.24x and 2.32x throughput improvement over brcm/strict
and mlx/strict, respectively. It is up to 1.18x higher relative to the other IOMMU-
enabled modes, and 0.9x that of the unprotected optimum.2
The network activity of Apache 1KB is somewhat similar to that of the Memcached
benchmark, because both are configured with 32 concurrent requests, both receive
queries comprised of a few dozens of bytes (file name or key item), and both transmit
1KB responses (file content or data item). The difference is that the Memcached internal
2 We note in passing that our Apache 1KB throughput results coincide with those of Soares et al. [58],
who reported a latency of 22ms for 256 concurrent requests, which translates to 1000/22 × 256 ≈ 12K
requests/second.
logic is simpler, as its purpose is merely to serve as an in-memory LRU cache. For
this reason, it achieves an order of magnitude higher throughput relative to Apache
1KB.3 The shorter per-request processing time makes the differences between the
Our experiments thus far indicated that using the IOMMU affects performance because
it mandates the OS to spend CPU cycles on creating and destroying IOVA mappings.
We were unable to measure the overhead caused by the actual IOMMU translation
activity of walking the page tables upon an IOTLB miss (Figures 4.1 and 3.5). In §4.6.1,
we attributed this inability to the substantially longer latencies induced by interrupt
processing and the TCP/IP stack. In Table 3.3, we specified the round-trip latencies,
whose magnitude (13-42 µs) seems to suggest that the occasional cost of 4 memory
references per table walk is negligible in comparison.
There are, however, high-performance environments that enjoy lower latencies on the
order of a µs [22, 29, 55, 61], which are required, e.g., where a fraction of a microsecond
can make a difference in the value of a transaction [1]. User-level I/O, for example,
might permit applications to (1) utilize raw Ethernet packets to eliminate TCP/IP
overheads, and to (2) poll the I/O device to eliminate interrupt delays.
With the help of the ibverbs library [38, 47], we established such a configuration
on top of the mlx setup. We ran two experiments. The first iteratively and randomly
selects a buffer from a large pool of previously mapped buffers and transmits it, thus
ensuring that the probability for the corresponding IOVA to reside in the IOTLB is
low. The second experiment does the same but with only one buffer, thus ensuring
that the IOTLB always hits. The latency difference, which is the cost of an IOTLB
miss, was 0.3 µs (1013 cycles on average); we believe it is reasonable to assume
that it approximates the benefit of using rIOMMU over the existing IOMMU in high-
performance environments of this type.
rIOMMU is not a prefetcher. Rather, it is a new IOMMU design that allows for
efficient IOVA (un)mappings while minimizing costly IOTLB invalidations, which is
unrelated to prefetching. But rIOMMU does have a prefetching component, since it loads into the
rIOTLB the next IOVA to be used ahead of time. While this component turned out to
be useful only in specialized setups (§3.5.3), it is still interesting to compare this aspect
of our work to previously proposed TLB prefetchers.
3 Our Memcached results are comparable to those of Gordon et al. [30].
For lack of space, we only briefly describe the bottom line. We modified the IOMMU
layer of KVM/QEMU to log the DMAs that its emulated I/O devices perform. We
ran our benchmarks in a VM and generated DMA traces. We fed the traces to three
simulated TLB prefetchers: Markov [43], Recency [56], and Distance [46], as surveyed by
Kandiraju and Sivasubramaniam [45]. We found their baseline versions to be ineffective,
as IOVAs are invalidated immediately after being used. We modified them and allowed
them to store invalidated addresses, but mandated them to walk the page table and check
that their predictions are mapped before making them. Distance was still ineffective.
Recency and Markov, however, were able to predict most accesses, but only if the
number of entries comprising their history data structure grew larger than the ring. In
contrast, rIOTLB requires only two entries per ring and its predictions are always
correct.
Chapter 4
Efficient IOMMU
Intra-Operating System
Protection
4.1 Introduction
The role that the I/O memory management unit (IOMMU) plays for I/O devices is
similar to the role that the regular memory management unit (MMU) plays for processes.
Processes typically access the memory using virtual addresses translated to physical
addresses by the MMU. Likewise, I/O devices commonly access the memory via direct
memory access operations (DMAs) associated with I/O virtual addresses (IOVAs),
which are translated to physical addresses by the IOMMU. Both hardware units are
implemented similarly with a page table hierarchy that the operating system (OS)
maintains and the hardware walks upon an (IO)TLB miss.
The IOMMU can provide inter- and intra-OS protection [4, 62, 64, 66]. Inter
protection is applicable in virtual setups. It allows for direct I/O, where the host
assigns a device directly to a guest virtual machine (VM) for its exclusive use, largely
removing itself from the guest's I/O path and thus improving its performance [30, 50].
In this mode, the VM directly programs device DMAs using its notion of (guest)
physical addresses. The host uses the IOMMU to redirect these accesses to where
the VM memory truly resides, thus protecting its own memory and the memory of the
other VMs. With inter protection, IOVAs are mapped to physical memory locations
infrequently, typically only upon such events as VM creation and migration. Such
mappings are therefore denoted persistent or static [64].
Intra-OS protection allows the OS to defend against errant/malicious devices and
buggy drivers, which account for most OS failures [21, 59]. Drivers and their I/O
devices are able to perform DMAs to arbitrary memory locations, and IOMMUs allow
OSes to protect themselves by restricting these DMAs to specific physical memory
locations. Intra-OS protection is applicable in non-virtual setups where the OS has
direct control over the IOMMU. It is likewise applicable in virtual setups where IOMMU
functionality is exposed to VMs via paravirtualization [15, 50, 57, 64], full emulation
[4], or, recently, hardware support for nested IOMMU translation [3, 40]. In this mode,
IOVA (un)mappings are frequent and occur within the I/O critical path. The OS
programs DMAs using IOVAs rather than physical addresses, such that each DMA
is preceded and followed by the mapping and unmapping of the associated IOVA to
the physical address it represents [44, 52]. For this reason, such mappings are denoted
single-use or dynamic [17]. The context of this chapter is intra-OS protection. We
discuss it in more detail in §4.2.
To do its job, the intra-OS protection mapping layer must allocate IOVA values:
ranges of integer numbers that would serve as page identifiers. IOVA allocation is similar
to regular memory allocation. But it is different enough to merit its own allocator.
One key difference is that regular allocators dedicate much effort to preserving locality
and to combating fragmentation, whereas the IOVA allocator disallows locality and
enjoys a naturally unfragmented workload. This difference makes the IOVA allocator
1-2 orders of magnitude smaller in terms of lines of code. Another key difference
is that, by default, the IOVA subsystem trades off some safety for performance. It
systematically delays the completion of IOVA deallocations while letting the OS believe
that the deallocations have already been processed. Specifically, part of freeing an
IOVA is purging it from the IOTLB such that the associated physical buffer is no longer
accessible to the I/O device. But invalidating IOTLB entries is a costly, slow operation.
So the IOVA subsystem opts for batching the invalidations until enough accumulate.
Then, it invalidates the entire IOTLB, en masse, thus reducing the amortized price.
This default mode is called deferred protection. Users can turn it off at boot time by
instructing the kernel to use strict protection. We discuss the IOVA allocator and its
protection modes in detail in §4.3.
Single-use mappings that stress the IOVA mapping layer are usually associated with
I/O devices that employ ring buffers in order to communicate with their OS drivers in
a producer-consumer manner. The ring buffer is a cyclic memory array whose entries
correspond to DMA requests that the OS initiates and the I/O device must fulfill. The
ring entries contain IOVAs that the mapping layer allocates and frees before and after
the associated DMAs are processed by the device. We carefully analyze the performance
of the IOVA mapping layer and find that its allocation scheme is efficient despite its
simplicity, but only if the device is associated with a single ring. Devices, however,
often employ more rings, in which case our analysis indicates that the IOVA allocator
seriously degrades the performance. We carefully study this deficiency and find that its
root cause is a pathology we call long-lasting ring interference. The pathology occurs
when I/O asynchrony prompts an event that confuses the allocator into migrating an
IOVA from one ring to another, henceforth repetitively destroying the contiguity of the
ring's I/O space upon which the allocator relies for efficiency. We conjecture that this
harmful effect remained hidden thus far because of the well-known slowness associated
with manipulating the IOMMU. The hardware took most of the blame for the high
price of intra-OS protection even though software is equally guilty, as it turns out. We
[Figure 4.1: Translation inputs: a 16-bit requester identifier (bus in bits 15:8, device
in 7:3, function in 2:0) and a 64-bit IOVA (DMA address) whose upper bits (63:39)
are zero, followed by three page-table indexes and a 12-bit page offset.]
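Decoding these fields is plain shifting and masking. The sketch below assumes the standard 9-bit-per-level split implied by the figure; the helper names are ours:

```c
#include <assert.h>
#include <stdint.h>

/* Requester ID: bus in bits 15:8, device in 7:3, function in 2:0. */
unsigned rid_bus (uint16_t rid) { return (rid >> 8) & 0xff; }
unsigned rid_dev (uint16_t rid) { return (rid >> 3) & 0x1f; }
unsigned rid_func(uint16_t rid) { return  rid       & 0x07; }

/* IOVA: three 9-bit page-table indexes above a 12-bit page offset;
   level 2 is the root of the hierarchy, level 0 the leaf. */
unsigned iova_idx(uint64_t iova, int level) {
    return (unsigned)((iova >> (12 + 9 * level)) & 0x1ff);
}
unsigned iova_off(uint64_t iova) { return (unsigned)(iova & 0xfff); }
```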
In the past, I/O devices had to use physical addresses in order to access the main
memory, namely, each DMA descriptor contained a physical address of its target buffer.
Such unmediated DMA activity directed at the memory makes the system vulnerable
to rogue devices performing errant or malicious DMAs [14, 19, 44, 65], or to buggy
drivers that might wrongfully program their devices to overwrite any part of the system
memory [13, 31, 50, 59, 63]. Subsequently, all major chip vendors introduced IOMMUs
[3, 11, 36, 40], alleviating the problem as follows.
The OS associates each DMA target buffer with some IOVA, which it uses instead
of the buffer's physical address when filling out the associated ring descriptor. The
I/O device is oblivious to this change; it processes the DMA using the IOVA as if it
were a physical address. The IOMMU circuitry then translates the IOVA to the physical
address of the target buffer, routing the operation to the appropriate memory location.
Figure 4.1 illustrates the translation process as performed by the Intel IOMMU, which
we use in this chapter. The PCI protocol dictates that each DMA operation is associated
with a 16-bit requester identifier comprised of a bus-device-function triplet, which is
uniquely associated with the corresponding I/O device. The IOMMU uses the 8-bit bus
number to index the root table and thus retrieve the physical address of the context table.
It then indexes the context table using the 8-bit concatenation of the device and function
numbers, yielding the physical location of the root of the page table hierarchy that
houses the device's IOVA translations. Similarly to the MMU, the IOMMU accelerates
translations using an IOTLB.
The functionality of the IOMMU hierarchy is similar to that of the regular MMU: it
will permit an IOVA memory access to go through only if the OS previously inserted
a matching mapping between the IOVA and some physical memory address. The OS
can thus protect itself by allowing a device to access a target buffer just before the
corresponding DMA occurs (inserting a mapping), and by revoking access just after
(removing the mapping), exerting fine-grained control over what portions of memory
may be used in I/O transactions at any given time. This state-of-the-art strategy of
IOMMU-based protection was termed intra-OS protection by Willmann et al. [64]. It
is recommended by hardware vendors [32, 44], and it is used by operating systems
[10, 17, 37, 51]. For example, the DMA API of Linux, which we use in this study, notes
that DMA addresses should be mapped only for the time they are actually used and
unmapped after the DMA transfer [52].
Locality Memory allocators spend much effort in trying to (re)allocate memory chunks
in a way that maximizes reuse of TLB entries and cached content. The IOVA
mapping layer of the OS does the opposite. The numbers it allocates correspond
to whole pages, and they are not allowed to stay warm in hardware caches in
between allocations. Rather, they must be purged from the IOTLB and from
the page table hierarchy immediately after the DMA completes. Moreover, while
purging an IOVA, the mapping layer must flush each cache line that it modifies
in the hierarchy, as the IOMMU and CPU do not reside in the same coherence
domain.1
Complexity Simplicity and compactness matter and are valued within the kernel.
Not having to worry about locality and fragmentation while enjoying a simple
workload, the mapping layer allocation scheme is significantly simpler than regular
memory allocators. In Linux, it is comprised of only a few hundred lines of code
instead of thousands [48, 49] or tens of thousands [16, 33].
1 The Intel IOMMU specification documents a capability bit that indicates whether IOMMU and
CPU coherence can be turned on [40], but we do not own such hardware and believe it is not yet
common.
Safety & Performance Assume a thread T0 frees a memory chunk M, and then
another thread T1 allocates memory. A memory allocator may give M to T1, but
only after it processes the free of T0. Namely, it would never allow T0 and T1 to use
M together. Conversely, the IOVA mapping layer purposely allows T0 (the device)
and T1 (the OS) to access M simultaneously for a short period of time. The reason:
invalidation of IOTLB entries is costly [4, 64]. Therefore, by default, the mapping
layer trades off safety for performance by (1) accumulating up to W unprocessed
free operations and only then (2) freeing those W IOVAs and (3) invalidating
the entire IOTLB en masse. Consequently, target buffers are actively being used
by the OS while the device might still access them through stale IOTLB entries.
This weakened safety mode is called deferred protection. Users can instead employ
strict protection, which processes invalidations immediately, by setting a kernel
command line parameter.
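The two regimes can be contrasted with a few lines of C. This is an illustrative model only, not the kernel's code; the identifiers W, flush_iotlb, and release_iova are ours, and flush_iotlb merely counts the expensive global invalidations that deferred protection amortizes.

```c
#include <assert.h>

#define W 250                 /* high-water mark (a kernel parameter) */

static int pending[W];        /* IOVAs whose invalidation is deferred */
static int npending = 0;
static int iotlb_flushes = 0; /* counts costly global IOTLB invalidations */

static void flush_iotlb(void) { iotlb_flushes++; } /* stand-in for the hardware op */
static void release_iova(int iova) { (void)iova; } /* stand-in for freeing the IOVA */

/* strict protection: one costly invalidation per unmap */
void unmap_strict(int iova) {
    flush_iotlb();
    release_iova(iova);
}

/* deferred protection: until the batched flush below happens, the device
 * may still reach the freed buffer through a stale IOTLB entry */
void unmap_deferred(int iova) {
    pending[npending++] = iova;      /* (1) accumulate unprocessed frees  */
    if (npending == W) {
        flush_iotlb();               /* (3) invalidate the IOTLB en masse */
        for (int i = 0; i < npending; i++)
            release_iova(pending[i]); /* (2) free the W IOVAs in bulk      */
        npending = 0;
    }
}
```

In the model, W deferred unmaps trigger a single flush, whereas W strict unmaps would trigger W of them; the flush is placed before the bulk free because an IOVA must not be reallocatable while a stale translation for it may still reside in the IOTLB.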
Technicalities Memory allocators typically use the memory that their clients free to
store their internal data structures. (For example, a linked list of freed pages
where each next pointer is stored at the beginning of the corresponding page.)
The IOVA mapping layer cannot do that, because the IOVAs that it invents are
pointers to memory that is used by some other entity (the device or the OS).
An IOVA is just an additional identifier for a page, which the mapping layer
does not own. Another difference between the two types of allocators is that
memory allocators running on 64-bit machines use native 64-bit pointers. The
IOVA mapping layer prefers to use 32-bit IOVAs, as utilizing 64-bit addresses for
DMA would force a slower, dual address cycle on the PCI bus [17].
struct range_t {int lo, hi;};
range_t alloc_iova(rbtree_t t, int rngsiz) {
  range_t new_range;
  rbnode_t right = t.cache;
  /* ... the listing continues in Figure 4.2 ... */

R's successor. And (2) when a new range Q is allocated, then C is updated to point to
Q; if Q was the highest free gap prior to its allocation, then C still points higher than
the highest free gap after this allocation.
associated target buffer (assuming strict protection). The unmap frees IOVA=n, thus
updating C to point to n's successor in the red-black tree (free_iova in Figure 4.2). The
OS then immediately re-arms the ring descriptor for future packets, allocating a new
target buffer and associating it with a newly allocated IOVA. The latter will be n, and
it will be allocated in constant time, as C points to n's immediate successor (alloc_iova
in Figure 4.2). The same scenario will cyclically repeat itself for n−1, n−2, ..., 1 and
then again n, ..., 1 and so on as long as the NIC is operational.
Our soon-to-be-described experiments across multiple devices and workloads indicate
that the above description is fairly accurate. IOVA allocation requests are overwhelmingly
for one-page ranges (H = L), and the freed IOVAs are indeed re-allocated shortly
after being freed, enabling, in principle, the allocator in Figure 4.2 to operate in constant
time as described. But the algorithm succeeds in operating in this ideal manner only for
some bounded time. We find that, inevitably, an event occurs and ruins this ideality
thereafter.
The above description assumes there exists only one ring in the I/O virtual address
space. In reality, however, there are often two or more, for example, the Rx (receive)
and Tx (transmit) rings. Nonetheless, even when servicing multiple rings, the IOVA
allocator provides constant-time allocation in many cases, so long as each ring's free_iova
is immediately followed by a matching alloc_iova for the same ring (the common case).
Allocating for one ring and then another indeed causes linear IOVA searches due to
how the cache node C is maintained. But large bursts of I/O activity flowing in one
direction still enjoy constant allocation time.
The aforementioned event that forever eliminates the allocator's ability to accommodate
large I/O bursts with constant time occurs when a free-allocate pair of one ring
is interleaved with that of another. Then, an IOVA from one ring is mapped to another,
ruining the contiguity of the ring's I/O virtual addresses. Henceforth, every cycle of n
allocations would involve one linear search, prompted whenever the noncontiguous IOVA
is freed and reallocated. We call this pathology long-lasting ring interference and note
that its harmful effect increases as additional inter-ring free-allocate interleavings occur.
Table 4.1 illustrates the pathology. Assume that a server mostly receives data and
occasionally transmits. Suppose that Rx activity triggers a Rx.free_iova(L) of address
L (1). Typically, this action would be followed by Rx.alloc_iova, which would then
return L (2). But sometimes a Tx operation sneaks in between. If this Tx operation is
Tx.free_iova(H) such that H > L (3), then the allocator would update the cache node C
to point to H's successor (4). The next Rx.alloc_iova would be satisfied by H (5), but then
the subsequent Rx.alloc_iova would have to iterate through the tree from H (6) to L (7),
inducing a linear overhead. Notably, once H is mapped for Rx, the pathology is repeated
every time H is (de)allocated. This repetitiveness is experimentally demonstrated in
Figure 4.3, showing the per-allocation number of rb_prev invocations. The calls are
invoked in the loop in alloc_iova while searching for a free IOVA.
We show below that the implications of long-lasting ring interference can be dreadful
                       without Tx                with Tx
                       return  C       C        return  C       C
operation              value   before  after    value   before  after
Rx.free(L=151)  (1)            152     152              152     152
Tx.free(H=300)  (3)                                     152     (4) 301

Table 4.1: Illustrating why Rx-Tx interferences cause linearity, following the baseline
allocation algorithm detailed in Figure 4.2. (Assume that all addresses are initially allocated.)
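The scenario of Table 4.1 can be reproduced with a toy model of the baseline allocator. This is an illustrative simplification, not the kernel's code: a bitmap of free addresses stands in for the red-black tree, C is the cache node, and walking the array downward stands in for following rb_prev.

```c
#include <assert.h>

#define NADDR 512

static int is_free[NADDR + 2];  /* 1 if the address is currently free */
static int C;                   /* cache node: successor of the last freed range */
static int search_len;          /* "rb_prev" steps taken by the last alloc */

void reset(void) {              /* all addresses initially allocated */
    for (int a = 0; a <= NADDR + 1; a++) is_free[a] = 0;
    C = NADDR + 1;
}

void free_iova(int a) {         /* free a and set C to a's successor */
    is_free[a] = 1;
    C = a + 1;
}

int alloc_iova(void) {          /* walk downward from C to the first free address */
    search_len = 0;
    for (int a = C - 1; a >= 1; a--) {
        search_len++;
        if (is_free[a]) { is_free[a] = 0; return a; }
    }
    return -1;                  /* no free IOVA below C */
}
```

Without interference, freeing L=151 and reallocating takes one step; with the interleaved Tx.free(H=300), the first allocation returns 300 in one step, but the next one must walk all 150 positions from H down to L, which is the linear search the pathology produces.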
[Figure 4.3 plot: y axis is the length of the search loop (number of rb_prev calls, 0k-3k);
x axis is the allocation serial number (n+0k through n+40k).]
Figure 4.3: The length of each alloc_iova search loop in a 40K (sub)sequence of alloc_iova calls
performed by one Netperf run. One Rx-Tx interference leads to regular linearity.
Suffering from frequent linear allocations, the baseline IOVA allocator is ill-suited for
high-throughput I/O devices that are capable of performing millions of I/O transactions
per second. It is too slow. One could proclaim that this is just another case of a
special-purpose allocator proved inferior to a general-purpose allocator, and argue that
the latter should be favored over the former despite the notable differences between the
two as listed in Section 4.3. We, however, contend that the simple, repetitive, inherently
ring-induced nature of the workload justifies a special-purpose allocator. We further
contend that we can modify the existing allocator to consistently support extremely fast
(de)allocations while introducing only a minimal change.
We propose the EiovaR optimization (Efficient IOVA allocatoR), which rests on the
following observation. I/O devices that stress the intra-OS protection mapping layer
are not like processes, in that the size of their virtual address spaces is relatively small,
inherently bounded by the size of their rings. A typical ring size n is a few hundred
or a few thousand entries. The number of per-device virtual page addresses that
the IOVA allocator must simultaneously support is proportional to the ring size, which
means it is likewise bounded and relatively small. Moreover, unlike regular memory
allocators, the IOVA mapping layer does not allocate real memory pages. Rather, it
allocates integer identifiers for those pages. Thus, it is reasonable to keep O(n) of these
identifiers alive under the hood for quick (de)allocation, without really (de)allocating
them (in the traditional, malloc sense of (de)allocation).
In numerous experiments with multiple devices and workloads, the maximal number
of per-device different IOVAs we have seen is 12K. More relevant is that, across all
experiments, the maximal number of previously-allocated-but-now-free IOVAs has never
exceeded 668 and was 155 on average. EiovaR leverages this workload characteristic
to cache freed IOVAs so as to satisfy future allocations quickly. It further leverages
the fact that, as noted earlier, IOVA allocation requests are overwhelmingly for
one-page ranges (H = L), and that, in the handful of cases where H ≠ L, the size of the
requested range has always been a power of two (H − L + 1 = 2^j). We have never
witnessed a non-power-of-two allocation.
EiovaR is a thin layer that masks the red-black tree, resorting to it only when
EiovaR cannot fulfill an IOVA allocation on its own using previously freed elements. When
configured with enough capacity, all tree allocations that EiovaR is unable to mask
are assured to occur in constant time.
EiovaR's main data structure is called the freelist. It has two components. The
first is an array, farr, which has M entries, such that 2^(M+12) bytes is the upper bound
on the size of the consecutive memory areas that EiovaR supports. (M = 28 is enough,
implying a terabyte of memory.) Entries in farr are linked lists of IOVA ranges. They
are empty upon initialization. When an IOVA range [L, H] whose size is a power
of two is freed, instead of actually freeing it, EiovaR adds it to the linked list of the
corresponding exponent. That is, if H − L + 1 = 2^j, then EiovaR adds the range to
its j-th linked list farr[j]. Ranges comprised of one page (H = L) end up in farr[0].
For completeness, when the size of a freed range is not a power of two, EiovaR
stores it in its second freelist component, frbt, which is a red-black tree. Unlike the
baseline red-black tree, which sorts [L, H] ranges by their L and H values, frbt sorts
ranges by their size (H − L + 1), allowing EiovaR to locate a range of a desirable size
[Figure 4.4 panels: (a) throughput [Gbit/sec]; (b) freelist allocation [percent];
(c) avg freelist size [nodes]; (d) avg search length [rb_prev ops];
curves: eiovar, eiovar512, eiovar64, eiovar8, eiovar1, baseline.]
Figure 4.4: Netperf TCP stream iteratively executed under strict protection. The x axis
shows the iteration number.
in logarithmic time.
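Based on the description above, the farr component can be sketched in a few lines of C. The identifiers (eiovar_free, eiovar_alloc, size_exponent) are ours, and both frbt and the fallback to the baseline red-black tree are elided; this is a sketch of the data structure, not the actual implementation.

```c
#include <assert.h>
#include <stddef.h>

#define M 28   /* farr entries; 2^(M+12) bytes bounds the supported area sizes */

struct range { int lo, hi; struct range *next; };

static struct range *farr[M];  /* farr[j]: linked list of freed 2^j-page ranges */

/* j such that 2^j == npages, or -1 if npages is not a power of two */
static int size_exponent(int npages) {
    int j = 0;
    while ((1 << j) < npages) j++;
    return (1 << j) == npages ? j : -1;
}

void eiovar_free(struct range *r) {
    int j = size_exponent(r->hi - r->lo + 1);
    if (j >= 0 && j < M) {        /* power-of-two range: push onto farr[j] */
        r->next = farr[j];
        farr[j] = r;
    }
    /* else: would be inserted into frbt, sorted by size (omitted) */
}

struct range *eiovar_alloc(int npages) {
    int j = size_exponent(npages);
    if (j >= 0 && j < M && farr[j] != NULL) {  /* constant-time freelist hit */
        struct range *r = farr[j];
        farr[j] = r->next;
        return r;
    }
    return NULL;  /* caller falls back to frbt, then to the baseline tree */
}
```

One-page ranges (H = L) land in farr[0], so the common free-then-allocate cycle of a ring is a push followed by a pop on the same list head, which is what makes both operations constant time.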
EiovaR allocation performs the reverse operation of freeing. If the newly allocated
range has a power-of-two size (i.e., 2^j), then EiovaR tries to satisfy the allocation using
farr[j]. Otherwise, it tries to satisfy the allocation using frbt. EiovaR resorts to
the baseline red-black tree only if a suitable range is not found in the freelist.
When configured with enough capacity, our experiments indicate that, after a
while, all (de)allocations are satisfied by farr in constant time, taking 50-150 cycles per
(de)allocation on average, depending on the configuration. When not imposing a limit
on the freelist capacity, the allocations that EiovaR satisfies by resorting to the baseline
tree are likewise done in constant time, since the tree never observes deallocations,
which means its cache node C always points to its smallest, leftmost node (Figure 4.2).
To bound the size of the freelist, EiovaR has a configurable parameter k that serves
as its maximal capacity. We use the EiovaR_k notation to express this limit, with k = ∞
indicating a limitless freelist.
To understand the behavior and effect of EiovaR, we begin by analyzing five
EiovaR_k variants as compared to the baseline under strict protection, where IOVAs are
(de)allocated immediately before and after the associated DMAs. We use the standard
Netperf stream benchmark that maximizes throughput on one TCP connection. We
initially restart the NIC interface for each allocation variant (thus clearing IOVA
structures), and then we execute the benchmark iteratively. The exact experimental setup is
described in Section 4.6. The results are shown in Figure 4.4.
Figure 4.4a shows that the throughput of all EiovaR variants is similar and is
20%-60% better than the baseline. The baseline gradually decreases except in
the last iteration. Figure 4.4b highlights why even EiovaR_1 is sufficient to provide the
observed benefit. It plots the rate of IOVA allocations that are satisfied by the freelist,
showing that k = 1 is enough to satisfy nearly all allocations. This result indicates that
each call to free_iova is followed by alloc_iova, such that the IOVA freed by the former
is returned by the latter, coinciding with the ideal scenario outlined in Section 4.3. Figure
[Figure 4.5 plots: map latency [cycles, 0-4k] across iterations 1-9 for the baseline (left)
and EiovaR (right), broken down into alloc_iova and all the rest.]
Figure 4.5: Average cycles breakdown of map with Netperf/strict.
[Figure 4.6 plots: unmap latency [cycles, 0-4k] across iterations 1-9 for the baseline (left)
and EiovaR (right), broken down into free_iova and all the rest.]
Figure 4.6: Average cycles breakdown of unmap with Netperf/strict.
4.4c supports this observation by depicting the average size of the freelist. The average
of EiovaR_1 is inevitably 0.5, as every deallocation and subsequent allocation contribute
1 and 0 to the average, respectively. Larger k values are similar, with an average of 2.5
because of two additional (de)allocations that are performed when Netperf starts running
and whose IOVAs remain in the freelist thereafter. Figure 4.4d shows the average length
of the while loop from Figure 4.2, which searches for the next free IOVA. It depicts a rough
mirror image of Figure 4.4a, indicating that throughput is tightly negatively correlated
with the traversal length.
Figure 4.5 (left) shows the time it takes the baseline to map an IOVA, separating
allocation from the other activities. Whereas the latter remains constant, the former
exhibits a trend identical to Figure 4.4d. Conversely, the alloc_iova time of EiovaR
(Figure 4.5, right) is negligible across the board. EiovaR is immune to long-lasting ring
interference, as interfering transactions are absorbed by the freelist and reused in constant
[Figure 4.7 panels: (a) throughput [Gbit/sec]; (b) freelist allocation [percent; log-scaled];
(c) avg freelist size [nodes; log-scaled]; (d) avg search length [rb_prev ops];
curves: eiovar, eiovar512, eiovar64, eiovar8, eiovar1, baseline.]
Figure 4.7: Netperf TCP stream iteratively executed under deferred protection. The x axis
shows the iteration number.
time.
Figure 4.6 is similar to Figure 4.5, but it pertains to the unmap operation rather than
to map. It shows that the duration of free_iova remains stable across iterations with
both EiovaR and the baseline. EiovaR deallocation is still faster, as it is performed in
constant time whereas the baseline is logarithmic. But most of the overhead is not due
to free_iova. Rather, it is due to the costly invalidation that purges the IOVA from the
IOTLB to protect the corresponding target buffer. This is the aforementioned costly
hardware overhead that motivated deferred protection, which amortizes the price by
delaying invalidations until enough IOVAs are accumulated and then processing all of
them together. As noted, deferring the invalidations trades off safety for performance,
because the relevant memory is accessible to the device even though it is already used
by the kernel for other purposes.
Figure 4.7 compares the baseline and the EiovaR variants under deferred
protection. Interestingly, the resulting throughput divides the variants into two groups, with
EiovaR_512 and EiovaR_∞ above 6Gbps and all the rest at around 4Gbps (Figure 4.7a).
We again observe a strong negative correlation between the throughput and the length
of the search to find the next free IOVA (Figure 4.7a vs. 4.7d).
In contrast to the strict setup (Figure 4.4), here we see that EiovaR variants with
smaller k values perform roughly as badly as the baseline. This finding is somewhat
surprising because, e.g., 25% of the allocations of EiovaR_64 are satisfied by the freelist
(Figure 4.7b), which should presumably improve its performance over the baseline. A
finding that helps explain this result is that the average size of the EiovaR_64
freelist is 32 (Figure 4.7c), even though it is allowed to hold up to k = 64 elements.
Notice that EiovaR_∞ holds around 128 elements on average, so we know there are
enough deallocations to fully populate the EiovaR_64 freelist. One might therefore expect
that the latter would be fully utilized, but it is not.
The average size of the EiovaR_64 freelist is 50% of its capacity for the following
reason. Deferred invalidations are aggregated until a high-water mark W (a kernel
parameter) is reached, and then all the W addresses are deallocated in bulk.2 When
k < W, the freelist fills up to hold k elements, which become k−1 after the subsequent
allocation, then k−2 after the next allocation, and so on until zero is reached,
yielding an average size of (1/(k+1)) · Σ_{j=0}^{k} j = k/2, as our measurements show.
Importantly, the EiovaR_k freelist does not have enough capacity to absorb all
the W consecutive deallocations when k < W. The remaining W − k deallocations
would therefore be freed by the baseline free_iova. Likewise, out of the subsequent W
allocations, only k would be satisfied by the freelist, and the remaining W − k would be
serviced by the baseline alloc_iova. It follows that the baseline free_iova and alloc_iova
are regularly invoked in an uncoordinated way despite the freelist. As described in
Section 4.4, the interplay between these two routines in the baseline version makes them
susceptible to long-lasting ring interference that inevitably induces repeated linear searches. In
contrast, when k is big enough (≥ W), the freelist has sufficient capacity to absorb all
W deallocations, which are then used to satisfy the subsequent W allocations, thus
securing the conditions for preventing the harmful effect.
Figure 4.8 experimentally demonstrates this threshold behavior, depicting the
throughput as a function of the maximal allowed freelist size k. As k gets bigger,
the performance slowly improves, because more, but not yet all, (de)allocations are
served by the freelist. When k reaches W = 250, the freelist is finally big enough, and
the throughput suddenly increases by 26%. Figure 4.9 provides further insight into
this result. It shows the per-allocation length of the loop within alloc_iova that iterates
through the red-black tree in search of the next free IOVA (similarly to Figure 4.3).
The three sub-graphs correspond to three points from Figure 4.8 that are associated
with the k values 64, 240, and 250. We see that the smaller k (left) yields more long
searches relative to the bigger k (middle), and that the length of the search becomes
zero when k = W (right).
4.6 Evaluation
4.6.1 Methodology
Experimental Setup We implement EiovaR in the Linux kernel, and we experimentally
evaluate its performance, contrasting it against the baseline IOVA allocation.
In an effort to attain more general results, we conduct the evaluation using two
setups involving two different NICs, with two corresponding device drivers that
generate different workloads for the IOVA allocation layer.
The Mellanox setup is comprised of two identical Dell PowerEdge R210 II Rack
Server machines that communicate through Mellanox ConnectX-3 40Gbps NICs. The
NICs are connected back to back via a 40Gbps optical fiber and are configured to use
2 They cannot be freed before they are purged from the IOTLB, or else they could be re-allocated,
which would be a bug, since their stale mappings might reside in the IOTLB and point somewhere
else.
[Figure 4.8 plot: throughput [Gbit/sec, 4-7] as a function of EiovaR's freelist size (0-250).]
Figure 4.8: Under deferred protection, EiovaR_k eliminates costly linear searches when k
exceeds the high-water mark W.
[Figure 4.9 plots: per-allocation search-loop length (0k-8k) over allocation serial numbers
(n+0k through n+40k), one sub-graph per k value.]
Figure 4.9: Length of the alloc_iova search loop under the EiovaR_k deferred protection
regime for three k values when running Netperf TCP stream. Bigger capacity implies that the
searches become shorter on average. Big enough capacity (k ≥ W = 250) eliminates the
searches altogether.
Ethernet. We use one machine as the server and the other as a workload generator
client. Each machine is furnished with 8GB of 1333MHz memory and a single-socket
4-core Intel Xeon E3-1220 CPU running at 3.10GHz. The chipset is Intel C202, which
supports VT-d, Intel's Virtualization Technology that provides IOMMU functionality.
We configure the server to utilize one core only, and we turn off all power optimizations
[Figure 4.10 panels, per setup: Netperf stream throughput [Gbp/s]; Netperf RR
latency; Apache 1MB requests/sec; Apache 1KB requests/sec; Memcached
transactions/sec; each shown under strict and deferred protection, for the baseline
and EiovaR.]
Figure 4.10: The performance of baseline vs. EiovaR allocation, under strict and deferred
protection regimes for the Mellanox (top) and Broadcom (bottom) setups. Except in the
case of Netperf RR, higher values indicate better performance.
(sleep states, i.e., C-states, and dynamic voltage and frequency scaling, DVFS) to avoid
reporting artifacts caused by nondeterministic events. The two machines run Ubuntu
12.04 and utilize the Linux 3.4.64 kernel.
The Broadcom setup is similar, likewise utilizing two Dell PowerEdge R210 machines.
The differences are that the two machines communicate through Broadcom NetXtreme II
BCM57810 10GbE NICs (connected via a CAT7 10GBASE-T cable for fast Ethernet);
that they are equipped with 16GB of memory; and that they run the Linux 3.11.0 kernel.
The device drivers of the Mellanox and Broadcom NICs differ in many respects.
Notably, the Mellanox NIC utilizes more ring buffers and allocates more IOVAs (we
observed around 12K addresses for Mellanox and 3K for Broadcom). For example,
the Mellanox driver uses two buffers per packet, one for the header and one for the
body, and hence two IOVAs, whereas the Broadcom driver allocates only one buffer
and thus only one IOVA.
3. Apache [26, 27] is a popular HTTP web server. We drive it using ApacheBench
[8] (also called ab), a workload generator that is distributed with Apache.
The goal of ApacheBench is to assess the number of concurrent requests per second
that the server is capable of handling by requesting a static page of a given size
from within several concurrent threads. We run ApacheBench on the client
machine, configured to generate 100 concurrent requests. We use two instances of
the benchmark, requesting a smaller (1KB) and a bigger (1MB) file, respectively.
Logging is disabled to avoid the overhead of writing to disk.
Before running each benchmark, we shut down and bring up the interface of the NIC
using the ifconfig utility, such that the IOVA allocation is redone from scratch using a
clean tree, clearing the impact of previous harmful long-lasting ring interferences. We
then iteratively run the benchmark 150 times, such that individual runs are configured
to take about 20 seconds. We present the corresponding average results.
4.6.2 Results
The resulting average performance is shown in Figure 4.10 for the Mellanox (top) and
Broadcom (bottom) setups. The corresponding normalized performance, which specifies
the relative improvement, is shown in the first part of Tables 4.2 and 4.3. In Figure 4.10,
higher numbers indicate better throughput in all cases but Netperf RR, which depicts
latency (the inverse of the throughput). No such exception is required in (the first part
of) Tables 4.2 and 4.3, which, for consistency, display the normalized throughput for
all applications, including Netperf RR.
                          protect  benchmark       baseline  EiovaR  diff
throughput (normalized)   strict   Netperf stream  1.00      2.37    +137%
                                   Netperf RR      1.00      1.27    +27%
                                   Apache 1MB      1.00      3.65    +265%

Table 4.2: Summary of the results obtained with the Mellanox setup
                          protect  benchmark       baseline  EiovaR  diff
throughput (normalized)   strict   Netperf stream  1.00      2.35    +135%
                                   Netperf RR      1.00      1.07    +7%
                                   Apache 1MB      1.00      1.22    +22%

Table 4.3: Summary of the results obtained with the Broadcom setup
Mellanox Setup Results Let us first examine the results of the Mellanox setup
(Table 4.2). In the topmost part of the table, we see that EiovaR yields throughput that
is 1.07-4.58x better than the baseline, and that the improvements are more pronounced
under strict protection. In the second part of the table, we see that the reason underlying
the improved performance of EiovaR is that it reduces the average IOVA allocation
time by 1-2 orders of magnitude, from up to 50K cycles to around 100-200. It also
reduces the average IOVA deallocation time by about 75%-85%, from around 250-550
cycles to around 65-85 (fourth part of the table).
When comparing the overhead of the IOVA allocation routine to the length of
the search loop within this routine (third part of Table 4.2), we can see that, as
expected, the two quantities are tightly correlated. Roughly, a longer loop implies
a higher allocation overhead. Conversely, notice that there is not necessarily such a
direct correspondence between the throughput improvement of EiovaR (first part of
the table) and the associated IOVA allocation overhead (second part). The reason
is largely that latency-sensitive applications are less affected by the allocation
overhead, because other components in their I/O paths have higher relative
weights. For example, under strict protection, the latency-sensitive Netperf RR has
a higher allocation overhead than the throughput-sensitive Netperf stream
(10,269 cycles vs. 7,656, respectively), yet the throughput improvement of RR is smaller
(1.27x vs. 2.37x). Similarly, the IOVA allocation overhead of Apache/1KB is higher
than that of Apache/1MB (49,981 cycles vs. 17,776), yet its throughput improvement is
lower (2.35x vs. 3.65x).
While there is not necessarily a direct connection between throughput and allocation
overhead when examining only strict safety, the connection becomes apparent when
comparing the strict and deferred protection regimes. Clearly, the benefit of EiovaR in
terms of throughput is greater under strict protection, because the associated baseline
allocation overheads are higher than those of deferred protection (7K-50K cycles for
strict vs. 2K-3K for deferred).
Broadcom Setup Results Let us now examine the results of the Broadcom setup
(Table 4.3). Strict EiovaR yields throughput that is 1.07-2.35x better than the baseline.
Deferred EiovaR, on the other hand, only improves the throughput by up to 10%, and,
in the case of Netperf stream and Apache/1MB, it offers no improvement. Thus, while
still significant, the throughput improvements in this setup are less pronounced. The
reason for this difference is twofold. First, as noted above, the driver of the Mellanox NIC
utilizes more rings and more IOVAs, increasing the load on the IOVA allocation layer
relative to the Broadcom driver and generating more opportunities for ring interference.
This difference is evident when comparing the duration of alloc_iova in the two setups,
which is significantly lower in the Broadcom case. In particular, the average allocation
time in the Mellanox setup across all benchmarks and protection regimes is about 15K
cycles, whereas it is only about 3K cycles in the Broadcom setup.
The second reason for the less pronounced improvements in the Broadcom setup is
that the Broadcom NIC imposes a 10 Gbps upper bound on the bandwidth, which is
reached in some of the benchmarks, notably the aforementioned Netperf stream.
Deferred Baseline vs. Strict EiovaR We explained above that deferred protection
trades off safety for better performance. We now note that, per Figure 4.10, the
performance attained by EiovaR when strict protection is employed is similar to the
performance of the baseline configuration that uses deferred protection (the default in
Linux). Specifically, in the Mellanox setup, on average, strict EiovaR achieves 5% higher
throughput than the deferred baseline, and in the Broadcom setup it achieves 3%
lower throughput. Namely, if strict EiovaR were made the default, it would simultaneously
deliver similar performance and better protection as compared to the current default
configuration.
[Figure 4.11 plots: Netperf stream throughput [Gbit/s, 0-10] and CPU consumption
[%, 0-100] as a function of message size (64B, 256B, 1KB, 4KB, 16KB), under strict and
deferred protection.]
a gradually increasing throughput until the line rate is reached. However, to achieve
the same level of throughput, strict/EiovaR requires more CPU than deferred/baseline,
which in turn requires more CPU than deferred/EiovaR.
[Figure 4.12 plots: Memcached throughput [transactions/sec, 0k-120k] and alloc_iova
duration [cycles, 0k-60k] as a function of the number of clients (1-32, log scale), for
EiovaR and the baseline.]
Figure 4.12: Impact of increased concurrency on Memcached in the Mellanox setup. EiovaR
allows the performance to scale.
Chapter 5
5.1 Introduction
Virtual-to-physical I/O address translation is on the critical path of DMA operations
in computer systems that use IOMMUs, since it is invoked on every memory access
performed by the device doing the DMA. To decrease the translation latency, the IOMMU
uses a cache of recent translations called the I/O Translation Lookaside Buffer (IOTLB).
The I/O virtual-to-physical address translation algorithm that the IOMMU executes is
similar to the algorithm used by the CPU's MMU, and thus studies about TLBs should
be relevant for learning the behavior of the IOTLB [41, 6, 35, 46]. Although many
proposed optimizations were supposed to improve TLB performance by reducing
the miss rate, none of the studies above succeeded in devising a TLB with a miss rate
close to 0%. This is because at least the first access to an address causes a miss in the
TLB. Prefetching can hide some or all of the cost of TLB misses, as has been shown in
several recent studies [12, 56, 53, 46].
Amit et al. were, as far as we know, the first to explore prefetching in the context
of the IOTLB, using two simple prefetching approaches. The first, proposed by Intel, uses
the Address Locality Hints (ALH) mechanism [39]. With this approach, the IOMMU
prefetches adjacent virtual address translations. The second approach is Mapping
Prefetch (MPRE), whereby the operating system provides the IOMMU with explicit
hints to prefetch the first mapping of each group of streaming DMA mappings, where a
group is defined as a scatter-gather list of several contiguous pages that are mapped
consecutively. The evaluation of these prefetchers has been covered elsewhere; we refer
the interested reader to [5].
In general, we can classify prefetchers into two groups: those that capture strided reference patterns and those that decide whether to prefetch on the basis of history. (A strided reference pattern is a sequence of memory reads and writes whose addresses are each separated from the last by a constant interval.) Anand et al. reviewed five main prefetching mechanisms for the data TLB [46, 45], two of which belong to the first class: sequential prefetching and arbitrary stride prefetching. These two cannot be adapted to the context of the IOMMU because one depends on the page table structure and the other uses the program counter register. Consequently, we used only the other three for comparison against our prefetching mechanism, adjusting them to the I/O context.
The first of these is based on limited history and is called Markov prefetching [43]. The second, called recency-based prefetching, is based on a much longer history (up to the previous miss to a page) and is described in [56]. The third is a stride-based technique called distance prefetching, proposed for TLBs in [46].
The hardware of these prefetchers consists of a table called the prediction table, a small cache called the prefetch buffer, and a logic unit. These are depicted in Figure 5.1.
The prediction table summarizes information about recent IOTLB misses and provides the logic unit with the data relevant for deciding whether to prefetch and which translation to prefetch. Each row in this table has a tag and s slots containing data to calculate the next s address translations to prefetch. The size of the table is therefore the number of rows multiplied by s. In our work we used s = 2. In Markov prefetching, the
current missing page is used as the tag, and the table gives two other pages with high
probability of being missed immediately after: these two pages need to be prefetched.
Distance prefetching uses as the tag the current distance (difference in page number
between the current miss and the previous miss), and the table gives two other distances
that occurred immediately after this distance was encountered. Recency prefetching
keeps an LRU stack of TLB misses by maintaining two additional pointers in each page
table entry (kept in memory).
On each IOTLB miss, the logic unit updates the prediction table and decides which
translation to prefetch into the prefetch buffer. The prefetch buffer is as close to the IOMMU as the IOTLB is, and it can be searched concurrently with the IOTLB. Hence,
in the case of a prefetch buffer hit, the IOMMU gets the translation as quickly as if it
were in the IOTLB.
Note that the prefetch mechanism observes the miss reference stream to make deci-
sions and never directly loads the fetched entries into the IOTLB. As a result, prefetching
does not influence the miss rate of the IOTLB; it can only hide the performance cost of
some of the IOTLB misses. Hence the only drawback of the prefetch mechanism is the
additional memory traffic of prefetched and unused translations.
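The lookup path described above can be sketched as follows. This is a minimal illustrative model, not the hardware design; the class and method names are ours, and the toy next-page prefetcher merely stands in for any of the three prefetchers discussed below:

```python
class NextPagePrefetcher:
    """Toy stand-in prefetcher: on a miss, suggest the next IOVA page."""
    def on_miss(self, iova_page):
        return [iova_page + 1]

class PrefetchingIOMMU:
    """Sketch of the translation lookup path: the prefetch buffer is
    searched concurrently with the IOTLB, and the prefetcher only ever
    fills the prefetch buffer, never the IOTLB itself."""

    def __init__(self, prefetcher, page_table):
        self.iotlb = {}            # IOVA page -> physical page
        self.prefetch_buffer = {}  # filled only by the prefetcher
        self.page_table = page_table
        self.prefetcher = prefetcher

    def translate(self, iova_page):
        # A hit in either structure costs the same (searched in parallel).
        if iova_page in self.iotlb:
            return self.iotlb[iova_page]
        if iova_page in self.prefetch_buffer:
            return self.prefetch_buffer[iova_page]
        # Miss: walk the page table, fill the IOTLB, and let the
        # prefetcher observe the miss reference stream.
        phys = self.page_table[iova_page]
        self.iotlb[iova_page] = phys
        for p in self.prefetcher.on_miss(iova_page):
            if p in self.page_table:   # the translation may not exist
                self.prefetch_buffer[p] = self.page_table[p]
        return phys
```

Note that, as in the text, a miss never consults the prefetch decision for the IOTLB itself: prefetching can only hide miss cost, not lower the IOTLB miss rate.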
The Markov prefetcher is based on the Markov chain theorem and the assumption that there is a good chance that history repeats itself. It uses a matrix (X × Y) to represent an approximation of a Markov state transition diagram and builds it dynamically. Markov state transition diagrams answer the question: given a visit to a specific state, which states have the highest probability of being visited right after it?
The diagram contains states denoting the referenced unit (virtual addresses in this
context) and transition arcs denoting the probability to move from one state to the state
pointed to by the arc. The probabilities are tracked from prior references to that state.
To know which states have the highest probability of being visited, one should search the state's arcs. Joseph et al. were the first to propose this mechanism for caching [43]. Kandiraju et al. extended it to work with TLBs and called it the Markov prefetcher [46], and we extended the latter to work with IOTLBs, as will be discussed shortly. We briefly describe the Markov chain theorem for a better understanding of how the prefetcher works.
5.3.1 Markov Chain Theorem
A Markov chain can be used for describing systems that follow a chain of linked events represented as states, S = {s1, s2, ..., sn}. The process starts in one of these states and moves successively from one state to another. These changes of state of the system are called transitions. If the system is currently in state si, the probability that the next state will be sj is denoted by pij, where pij depends only on the current state (si) of the system, not on the sequence of events that preceded it nor on the time t. This can be described mathematically as Pr[Xt+1 = b | Xt = a] = pab (the stationary assumption). The states and transitions are denoted by a Markov state transition diagram, which is represented as a directed graph or a matrix and is usually built using the history of the process.
Example
This can be illustrated by a simple example. Let us assume that the entire World Wide
Web consists of only 3 websites, e1, e2, e3, visited in the sequence e1, e1, e2, e2, e1, e2,
e1, e3, e3, e2, e3. Then the Markov state transition diagram, built according to the history of these visits, is illustrated in Figure 5.2.
       e1    e2    e3
  e1   1/4   2/4   1/4
  e2   2/4   1/4   1/4
  e3    0    1/2   1/2
Figure 5.2: Markov state transition diagram, which is represented as a directed graph
(right) or a matrix (left).
The left is the matrix representation and the right is the directed graph representation.
If we are in state e1 there is a chance of 1/4 to stay in e1, 2/4 to move to e2 and 1/4 to
move to e3. If we are in state e2 there is a chance of 1/4 to stay in e2, 2/4 to move to e1, and 1/4 to move to e3. If we are in state e3 there is a chance of 1/2 to stay in e3, 1/2 to move to e2, and no chance to move to e1.
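Building the transition probabilities from the visit history can be sketched as follows; this is an illustrative snippet, and the function and variable names are ours:

```python
from collections import Counter, defaultdict
from fractions import Fraction

def transition_matrix(visits):
    """Approximate a Markov state transition diagram from a visit history:
    for each state, the fraction of transitions going to each successor."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(visits, visits[1:]):
        counts[cur][nxt] += 1
    return {s: {t: Fraction(c, sum(cnt.values())) for t, c in cnt.items()}
            for s, cnt in counts.items()}

# The visit sequence from the example above.
visits = ["e1", "e1", "e2", "e2", "e1", "e2", "e1", "e3", "e3", "e2", "e3"]
P = transition_matrix(visits)
```

Running this reproduces the probabilities of Figure 5.2, e.g. a 2/4 chance of moving from e1 to e2.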
The prediction table for the Markov prefetcher includes a tag column containing the
virtual addresses that missed on the TLB. Each row of the table has s slots, with each
slot containing a virtual page address that has the highest (approximated) probability
to be accessed after the virtual address denoted by the tag of the row. These slots
therefore correspond to translations to be prefetched the next time a miss occurs. On a TLB miss, the prefetcher looks up the missed address in the table. If it is not found, the address is added as a new entry whose s slots are left empty. If the missed address is found, a prefetch is initiated for the corresponding s slots of this address.
In the TLB extension of the Markov prefetcher [46], the slots contain an entire page
table entry (including the virtual address and the translation). This is in contrast to
our IOMMU extension, where the slots contain only virtual addresses; the translations
corresponding to these addresses are brought by invoking a page walk for each address.
This difference is due to the fact that in the I/O context the translations are not available after a DMA transaction has finished, unlike the CPU context, where the virtual addresses exist until the process is killed or the memory is freed.
In addition to prefetching, the prefetcher's logic unit goes to the entry of the previous virtual address that missed and adds the current missed address into one of its s slots (whichever is free). If all the slots are occupied, one is evicted in accordance with an LRU policy. As a result, the s slots for each entry correspond to different virtual pages
that also missed immediately after this page. Since the table has limited entries, an
entry (row) could itself be replaced because of conflicts. A simplified hardware block
diagram implementation of the Markov prefetcher with s = 2 is given in Figure 5.3.
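The table update and prediction just described can be sketched as follows. This is a minimal model of our IOMMU variant, whose slots hold IOVAs only; the class and method names are ours:

```python
class MarkovTable:
    """Markov prediction table with s slots per row (s = 2 in our work).
    Slots hold only IOVAs; their translations are brought by a page walk."""

    def __init__(self, s=2):
        self.s = s
        self.rows = {}        # tag (missed IOVA) -> predicted IOVAs
        self.prev_miss = None

    def on_miss(self, iova):
        # Record that `iova` missed right after the previous miss.
        if self.prev_miss is not None:
            slots = self.rows.setdefault(self.prev_miss, [])
            if iova in slots:
                slots.remove(iova)   # refresh its LRU position
            elif len(slots) == self.s:
                slots.pop(0)         # evict the least recently used slot
            slots.append(iova)
        self.prev_miss = iova
        # Predict: prefetch the slots recorded for the current miss, if any.
        return list(self.rows.get(iova, []))
```

After the miss sequence 10, 20, 10 the table has learned that 20 tends to follow 10, so the next miss of 10 triggers a prefetch of 20's translation.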
While Markov prefetching schemes were initially proposed for caches, recency prefetching
has been proposed solely for TLBs [56, 46]. This scheme uses the memory accesses (and
misses on the TLB) as points on the time-line and works on the principle that after a
TLB miss of a virtual address with a given recency, the next TLB miss will likely be
of a virtual address with the same recency. The underlying assumption is that if an
application has sequential accesses to a set of data structures, then there is a temporal
ordering between the accesses to these structures, despite the fact that there may be no
ordering in either the virtual or physical address space.
To keep an updated data structure that contains all the page table entries ordered
by their recency, the recency prefetcher builds an LRU stack of page table entries. The
hardware TLB maintains temporal ordering and naturally contains virtual addresses
with contemporary recency: if the TLB has N entries, then it implements the upper
N entries of the recency stack and the rest of the entries are located in the prefetcher
database (the LRU maintained in the main memory).
When accessing a virtual address that does not exist in the TLB (a TLB miss), the
required virtual address translation is removed from the in-memory LRU stack, inserted
into the top of the TLB, and the address has the most contemporary recency at that
point in time. All the following addresses are pushed down. As a result, the last entry is evacuated from the TLB to make room for the inserted entry and is inserted into the in-memory stack. Thus, the only in-memory recency stack manipulations required are those related to TLB misses. In addition to all these updates, the prefetch mechanism prefetches the two entries whose recency is closest to that of the accessed virtual address, namely its predecessor and its successor in the in-memory LRU stack before it was removed.

Figure 5.3: Simplified hardware block diagram of the Markov prefetcher: the missed page is looked up in the tags, its predicted addresses are sent to the prefetch buffer, and the page is inserted into the predicted-address slots of the previous miss.

Figure 5.4: Updating the recency stack on a TLB hit: C is accessed while it is the 3rd TLB entry and moves to the 1st position; A and B move to the 2nd and 3rd positions, respectively.
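The stack mechanics above can be sketched as follows; this is a minimal illustrative model (class and method names are ours, not from the hardware design):

```python
class RecencyStack:
    """Sketch of recency-based prefetching: the TLB implements the top N
    entries of one LRU stack; the rest live in an in-memory stack.  On a
    miss, the missed entry moves to the top and its former in-memory
    neighbours (predecessor and successor) are prefetched."""

    def __init__(self, tlb_size, entries):
        self.n = tlb_size
        self.stack = list(entries)   # position 0 = most recently used

    def access(self, page):
        pos = self.stack.index(page)
        if pos < self.n:
            # TLB hit: just refresh the recency, nothing to prefetch.
            self.stack.insert(0, self.stack.pop(pos))
            return []
        # TLB miss: prefetch the in-memory neighbours of the missed entry.
        neighbours = [self.stack[i] for i in (pos - 1, pos + 1)
                      if self.n <= i < len(self.stack)]
        self.stack.insert(0, self.stack.pop(pos))
        return neighbours
```

Replaying the example of Figure 5.5 (TLB of 4 entries A-D, in-memory stack U-Z, access to Y) yields the prefetch of X and Z and pushes D down to the 5th position.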
Example
Assume we have a TLB with 4 entries, A, B, C, and D, with LRU stack order 1, 2, 3, and 4, respectively, and an in-memory recency stack with 6 entries, U, V, W, X, Y, and Z, with LRU stack order 5, 6, 7, 8, 9, and 10, respectively; see Figure 5.5.
If virtual address Y, which is 9th in the LRU stack order, is accessed, the LRU stack
is updated as follows:
1. Y is removed from the in-memory recency stack and inserted to the first entry in
the TLB.
2. D is evacuated from the TLB and inserted to the first place in the in-memory
stack; namely, it becomes the 5th in the LRU stack order.
Figure 5.5: Updating the recency stack on a TLB miss: Y is accessed while it is the 9th entry in the LRU stack and moves to the first position in the TLB; A, B, and C move to the 2nd, 3rd, and 4th positions, respectively; D moves from the TLB to the 5th position in the in-memory stack; X and Z are prefetched into the prefetch buffer.
3. A, B, and C stay in the TLB but become 2nd, 3rd, and 4th, respectively, in the LRU stack order. Since Y missed in the TLB, prefetching is executed; that is, X and Z are inserted into the prefetch buffer.
As explained before, virtual address translations are available only during DMA transactions and are unmapped right after the transaction finishes, so keeping the translations themselves is useless. Since the virtual address allocator reuses addresses after they are freed, however, the recency prefetcher can likely still predict miss patterns. We thus propose holding the virtual addresses in the stack even if they are unmapped and, when a prefetch is executed, looking up the translation in the page table and prefetching it only if it exists.
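Our proposed adaptation (keep unmapped IOVAs in the stack and validate against the page table at prefetch time) can be sketched as follows; the function name is ours, for illustration only:

```python
def prefetch_candidates(neighbours, page_table):
    """IOMMU adaptation of recency prefetching: IOVAs stay in the recency
    stack after they are unmapped, so each candidate neighbour is
    prefetched only if a live translation currently exists for it."""
    return {iova: page_table[iova] for iova in neighbours
            if iova in page_table}
```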
Assume the following sequence of accessed virtual page numbers: 0, 2, 10, 12, 20, 22, 30, 32. Then the distances of this sequence are: 2, 8, 2, 8, 2, 8, 2. Knowing that a distance of 2 is followed by a distance of 8 and vice versa allows us to predict, after only the third access (address 10), the rest of the sequence. This is the main idea behind distance prefetching.
Figure 5.6: Block diagram of the distance prefetcher: 1) an IOVA misses and the current distance is calculated; 2) the current distance is inserted into the row of the previous distance; 3) the addresses to prefetch are calculated by adding the predicted distances to the current IOVA; 4) the translations of these IOVAs are fetched from the page table.

On each miss, the prefetcher thus computes candidate virtual addresses, looks for their translations in the page table and, if found, fetches them and stores them in the prefetch buffer.
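The distance-prefetching scheme described above can be sketched as follows; this is a minimal illustrative model with s = 2 slots per row, and the class and variable names are ours:

```python
class DistanceTable:
    """Sketch of distance prefetching: rows are tagged by a distance
    (difference in page number between consecutive misses) and predict
    the distances likely to follow it."""

    def __init__(self, s=2):
        self.s = s
        self.rows = {}        # distance -> distances seen right after it
        self.prev_iova = None
        self.prev_dist = None

    def on_miss(self, iova):
        predictions = []
        if self.prev_iova is not None:
            dist = iova - self.prev_iova
            if self.prev_dist is not None:
                # Record that `dist` followed the previous distance.
                slots = self.rows.setdefault(self.prev_dist, [])
                if dist in slots:
                    slots.remove(dist)   # refresh LRU position
                elif len(slots) == self.s:
                    slots.pop(0)         # evict least recently used slot
                slots.append(dist)
            # Predict by adding the learned follow-up distances to the
            # current IOVA.
            predictions = [iova + d for d in self.rows.get(dist, [])]
            self.prev_dist = dist
        self.prev_iova = iova
        return predictions
```

Replaying the miss sequence 0, 2, 10, 12, 20, 22 from the example, the table predicts 20, 22, and 30 from the fourth miss onward.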
5.6 Evaluation
5.6.1 Methodology
5.6.2 Results
In order to understand the results, one should understand which memory accesses are included in a DMA transaction. First the device reads the DMA descriptor, which resides in the Rx or Tx descriptor ring, and gets the buffer address; it then accesses (writes or reads) that memory address. When the IOMMU is enabled, every access requires a translation, and the IOMMU checks whether the translation is cached in the IOTLB or in the prefetch buffer.
Each E1000 device in our virtual machines includes only two rings of 256 descriptors each. One ring takes 4KB of memory, so a single page is sufficient to contain it. Hence, each virtual machine uses only two addresses for the NIC rings, and the maximum possible number of virtual addresses is 514: the 2 ring addresses plus 256 buffers in each of the two rings (Rx and Tx). When the NIC only receives data, there are only 258 virtual addresses: the 2 ring addresses and 256 Rx buffers. Note that in this case the Rx ring contains the maximum number of buffers and the Tx ring is empty.
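The address accounting above can be sanity-checked with simple arithmetic (constants as stated in the text):

```python
# Each E1000 NIC has two descriptor rings (Rx and Tx) of 256 descriptors.
DESCRIPTORS_PER_RING = 256
RINGS = 2  # Rx and Tx

ring_pages = RINGS                     # one 4KB page holds an entire ring
buffer_addresses = RINGS * DESCRIPTORS_PER_RING
max_virtual_addresses = ring_pages + buffer_addresses   # 2 + 512 = 514
rx_only_addresses = ring_pages + DESCRIPTORS_PER_RING   # 2 + 256 = 258
```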
Figure 5.7: Hit rate simulation of Apache benchmarks with message sizes of 1k (top)
and 1M (bottom).
We observed that the 2 addresses of the rings are inserted into the IOTLB and never invalidated: 32 IOTLB entries are sufficient to ensure that these 2 addresses never become the least recently used, so other entries are always evicted before them. Hence there are three types of possible hits in our results: hits of ring addresses in the IOTLB ("IOTLB rings"), hits of buffer addresses in the IOTLB ("IOTLB buffers"), and hits of buffer addresses in the prefetch buffer ("PB buffers").
We present two versions of results for each benchmark, because the granularity of the transferred buffer chunks varies from system to system and affects the miss rate of the IOMMU. The right-hand columns in the figures show the results for systems with a granularity of 64 bytes, which causes the maximum number of accesses per transaction, while the left-hand columns show results for systems whose chunk is the size of the transferred buffer, i.e., the minimum number of accesses per transaction.
Figure 5.8: Hit rate simulation of Netperf stream with message sizes of 1k (top) and
4k (bottom).
Markov prefetcher
The Markov prefetcher begins to be effective only when the prediction table has 256 or more entries. Not coincidentally, 256 is the size of a ring in the E1000, and the ring buffers constitute most of the virtual addresses in use by the NIC.
When sending a 1k buffer, only one Ethernet packet is required, but a 1M buffer is split into about 700 Ethernet packets. Those 700 packets enter the Tx descriptor ring, which has 256 entries, one after the other. The order of sending 256 packets therefore repeats itself about 3 times, so the predictor can learn this pattern after the first 1M buffer is sent, and the results are better than for Apache 1k (Figure 5.7). For the same reason, when sending 1M buffers, 256 entries in the prediction table are sufficient to obtain results as good as those obtained with 512 entries. Any number of entries smaller than 256 results in a zero prefetch buffer hit rate because the periodicity of the rings is 256.
In all the other benchmarks, 256 prediction table entries are sufficient to obtain a prefetch buffer hit rate of about 95% (Figures 5.8, 5.9).
Figure 5.9: Hit rate simulation of Netperf RR (top) and Memcached (bottom).
Recency prefetcher
As expected, all benchmarks obtain a hit rate of almost 100% when the prediction table
contains 512 entries and a lower hit rate when the prediction table contains at most 256
entries.
Because of the IOVA allocation scheme explained in chapter 4, Rx and Tx swap IOVAs. The recency of an IOVA that is accessed by both sending and receiving is therefore always above 256. Hence, when running Netperf RR, 256 prediction table entries are not enough for the prefetcher to predict anything.
As for the Markov prefetcher, here, too, any prediction table with fewer than 256
entries will result in a 0% prefetch buffer hit rate because the periodicity of the rings is
256.
Distance prefetcher
The distance prefetcher is not influenced by the addresses or by the periodicity of the NIC descriptor rings. The distances between accessed addresses are such that only 8 prediction table entries are enough to predict something, and increasing that number does not really improve the prediction. Moreover, in some benchmarks (for example the Netperf stream shown in Figure 5.8), increasing the number of prediction table entries reduces the prefetch buffer hit rate, because the relevance of the access pattern changes over time. Hence, the bigger the table, the more irrelevant data it keeps.
Experimental Setup
We used two identical Dell PowerEdge R210 II rack server machines that communicate through Mellanox ConnectX-3 40Gbps NICs. The NICs are connected back to back and are configured to use Ethernet. We use one machine as the server and the other as a workload generator client. Each machine has 8GB of 1333MHz memory and a single-socket 4-core Intel Xeon E3-1220 CPU running at 3.10GHz. The chipset is Intel C202, which supports VT-d, Intel's virtualization technology that provides IOMMU functionality. We configure the server to utilize one core only and turn off all power optimization sleep states (C-states) and dynamic voltage and frequency scaling (DVFS) to avoid reporting artifacts caused by nondeterministic events. The machines run Ubuntu 12.04 with the Linux 3.11 kernel.
With the help of the ibverbs library [47, 38], we bypassed the entire network stack
and communicated directly with the NIC. This way we could measure the latency more
accurately.
Experimental results
We first measured the IOMMU latency when sending an Ethernet packet and then measured the packet's round-trip time. Each experiment was performed in two versions. In the first, a buffer was iteratively and randomly selected from a large pool of previously mapped buffers and transmitted, thus ensuring that the probability of the corresponding IOVA residing in the IOTLB is low. We call this version the miss-version. The second version does the same but with a pool containing only one buffer, thus ensuring that the IOTLB always hits. This version is called the hit-version. The two types of experiments are detailed below:
The hit-version results are given in the first line of Table 5.1. The system takes about 3300 cycles to send one packet. As expected, the hit-version results are the same with and without the IOMMU. The miss-version shows a difference of about 1000 cycles between running the experiment with the IOMMU enabled and running it with the IOMMU disabled. Thus, we learn that the IOTLB miss latency for sending a packet is about 0.3 microseconds on our machine.
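The conversion from the measured cycle difference to time is straightforward (a quick check, assuming the 3.10GHz clock stated in the experimental setup):

```python
# Converting the measured IOTLB miss penalty from cycles to time on the
# 3.10 GHz Xeon E3-1220 used in the experiments.
cpu_hz = 3.10e9
miss_cycles = 1000                       # miss-version minus hit-version
miss_latency_us = miss_cycles / cpu_hz * 1e6
# about 0.32 microseconds, i.e. roughly the 0.3 us reported above
```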
Table 5.1: Intel IOTLB hit rate, and cycles per packet with the Intel IOMMU disabled and enabled.
Figure 5.10: Difference (in microseconds) between the RTT with the IOMMU enabled and the RTT with the IOMMU disabled, as a function of the number of different buffers sent.
Figure 5.10 shows the difference between the round-trip time (RTT) with the IOMMU enabled and disabled, as a function of the buffer pool size. The first thing we notice is that with more than 32 buffers in the pool, the RTT delta jumps from 0.6 to 1.2 microseconds. From this we learn that a buffer pool larger than 32 causes IOTLB misses, so the size of the IOTLB is between 32 and 64 entries. We also learn that the IOTLB miss penalty for a round trip is 0.6 microseconds. Finally, we learn that there is an additional cost of 0.6 microseconds due to the private data structure of the NIC.
Chapter 6
Conclusions
6.1 rIOMMU
The IOMMU design is similar to that of the regular MMU, despite the inherent differences
in their workloads. This design worked reasonably well when I/O devices were relatively
slow as compared to the CPU. But it hampers the performance of contemporary devices
like 10/40 Gbps NICs. We foresee that this problem will get worse due to the ever
increasing speed of such devices. We thus contend that it makes sense to rearchitect
the IOMMU such that it directly supports the unique characteristics of its workload.
We propose rIOMMU as an example of such a redesign and show that the benefits of using it are substantial.
6.2 eIOVAR
The IOMMU has been falsely perceived as mainly responsible for the significant overheads of intra-OS protection. We find that the OS is equally to blame, suffering from the long-term ring interference pathology that makes its IOVA allocations slow.
We exploit the ring-induced nature of IOVA (de)allocation requests to design a simple
optimization called eIOVAR. The latter eliminates the harmful effect and makes the
baseline allocator orders of magnitude faster. It improves the performance of common
benchmarks by up to 4.6x.
the hit rate by up to 12%. But, as shown in chapter 3, by changing the page table data
structure to a ring and using only two cache entries per ring, it is possible to predict
100% of the accessed addresses, and this is the best solution for devices that use rings.
Bibliography
[1] Dennis Abts and Bob Felderman. A guided tour through data-center networking. ACM Queue, 10(5):10:10–10:23, May 2012.
[2] Brian Aker. Memslap - load testing and benchmarking a server. http://docs.
libmemcached.org/bin/memslap.html. libmemcached 1.1.0 documentation.
[3] AMD Inc. AMD IOMMU architectural specification, rev 2.00. http://
support.amd.com/TechDocs/48882.pdf, Mar 2011.
[4] Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir, and Assaf Schuster. vIOMMU:
efficient IOMMU emulation. In USENIX Ann. Technical Conf. (ATC), pages
73–86, 2011.
[5] Nadav Amit, Muli Ben-Yehuda, and Ben-Ami Yassour. IOMMU: Strategies for mitigating the IOTLB bottleneck. In Proceedings of the 2010 International Conference on Computer Architecture, ISCA'10, pages 256–274, Berlin, Heidelberg, 2012. Springer-Verlag.
[9] Apple Inc. Thunderbolt device driver programming guide: Debugging VT-d
I/O MMU virtualization. https://developer.apple.com/library/mac/
documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/
DebuggingThunderboltDrivers/DebuggingThunderboltDrivers.html,
2013. (Accessed: May 2014).
[10] Apple Inc. Thunderbolt device driver programming guide: Debugging VT-d
I/O MMU virtualization. https://developer.apple.com/library/mac/
documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/
DebuggingThunderboltDrivers/DebuggingThunderboltDrivers.html,
2013. (Accessed: May 2014).
[12] Kavita Bala, M. Frans Kaashoek, and William E. Weihl. Software prefetching
and caching for translation lookaside buffers. In USENIX Ann. Technical
Conf. (ATC), pages 243–253, 1994.
[13] Thomas Ball, Ella Bounimova, Byron Cook, Vladimir Levin, Jakob Lichten-
berg, Con McGarvey, Bohus Ondrusek, Sriram K. Rajamani, and Abdullah
Ustuner. Thorough static analysis of device drivers. In ACM EuroSys, pages
73–85, 2006.
[14] Michael Becher, Maximillian Dornseif, and Christian N. Klein. FireWire: all
your memory are belong to us. In CanSecWest applied security conference,
2005.
[15] Muli Ben-Yehuda, Jimi Xenidis, Michal Ostrowski, Karl Rister, Alexis Bruem-
mer, and Leendert van Doorn. The price of safety: Evaluating IOMMU
performance. In Ottawa Linux Symp. (OLS), pages 9–20, 2007.
[17] James E.J. Bottomley. Dynamic DMA mapping using the generic device.
https://www.kernel.org/doc/Documentation/DMA-API.txt. Linux ker-
nel documentation.
[21] Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson Engler.
An empirical study of operating systems errors. In ACM Symp. on Operating
Systems Principles (SOSP), pages 73–88, 2001.
[23] Jonathan Corbet. Linux Device Drivers, chapter 15: Memory Mapping and
DMA. O'Reilly, 3rd edition, 2005.
[24] John Criswell, Nicolas Geoffray, and Vikram Adve. Memory safety for low-
level software/hardware interactions. In USENIX Security Symp., pages
83–100, 2009.
[27] Roy T. Fielding and Gail Kaiser. The Apache HTTP server project. IEEE
Internet Computing, 1(4):88–90, Jul 1997.
[28] Brad Fitzpatrick. Distributed caching with memcached. Linux J., 2004(124),
Aug 2004.
[30] Abel Gordon, Nadav Amit, Nadav Har'El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster, and Dan Tsafrir. ELI: Bare-metal performance for I/O virtualization. In ACM Int'l Conf. on Architectural Support for Programming Languages & Operating Systems (ASPLOS), pages 411–422, 2012.
[31] Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S.
Tanenbaum. Failure resilience for device drivers. In IEEE/IFIP Ann. Int'l Conf. on Dependable Syst. & Networks (DSN), pages 41–50, 2007.
[32] Brian Hill. Integrating an EDK custom peripheral with a LocalLink interface
into Linux. Technical Report XAPP1129 (v1.0), XILINX, May 2009.
[34] HP Development Company. Family data sheet: Broadcom NetXtreme network adapters for HP ProLiant Gen8 servers. http://www.broadcom.com/docs/features/netxtreme_ethernet_hp_datasheet.pdf, Aug 2013. Rev. 2. (Accessed: May 2014).
[35] Jerry Huck and Jim Hays. Architectural support for translation table man-
agement in large address space machines. In Proceedings of the 20th Annual
International Symposium on Computer Architecture, ISCA, pages 3950, New
York, NY, USA, 1993. ACM.
[37] IBM Corporation. AIX kernel extensions and device support programming con-
cepts. https://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/
com.ibm.aix.kernelext/doc/kernextc/kernextc_pdf.pdf, 2013. (Accessed: May 2014).
[41] Bruce L. Jacob and Trevor N. Mudge. A look at several memory management
units, tlb-refill mechanisms, and page table organizations. In Proceedings of the
Eighth International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS, pages 295–306, New York, NY,
USA, 1998. ACM.
[43] Doug Joseph and Dirk Grunwald. Prefetching using Markov predictors. In
Int'l Symp. on Computer Archit. (ISCA), pages 252–263, 1997.
[44] Asim Kadav, Matthew J. Renzelmann, and Michael M. Swift. Tolerating
hardware device failures in software. In ACM Symp. on Operating Systems
Principles (SOSP), pages 59–72, 2009.
[46] Gokul B. Kandiraju and Anand Sivasubramaniam. Going the distance for
TLB prefetching: An application-driven study. In Int'l Symp. on Computer Archit. (ISCA), pages 195–206, 2002.
[47] Gregory Kerr. Dissecting a small InfiniBand application using the Verbs
API. Computing Research Repository (arxiv), abs/1105.1827, 2011. http:
//arxiv.org/abs/1105.1827.
[50] Joshua LeVasseur, Volkmar Uhlig, Jan Stoess, and Stefan Gotz. Unmodified
device driver reuse and improved system dependability via virtual machines.
In USENIX Symp. on Operating System Design & Implementation (OSDI),
pages 17–30, 2004.
[53] Jang Suk Park and Gwang Seon Ahn. A software-controlled prefetching mechanism for software-managed TLBs. Microprocess. Microprogram., 41(2):121–136, May 1995.
[55] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind
Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The
operating system is the control plane. Technical Report UW-CSE-13-10-01,
University of Washington, Jun 2014.
[56] Ashley Saulsbury, Fredrik Dahlgren, and Per Stenström. Recency-based TLB preloading. In Int'l Symp. on Computer Archit. (ISCA), pages 117–127, 2000.
[57] Arvind Seshadri, Mark Luk, Ning Qu, and Adrian Perrig. SecVisor: A tiny
hypervisor to provide lifetime kernel code integrity for commodity OSes. In
ACM Symp. on Operating Systems Principles (SOSP), pages 335–350, 2007.
[58] Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling
with exception-less system calls. In USENIX Symp. on Operating System
Design & Implementation (OSDI), pages 33–46, 2010.
[59] Michael Swift, Brian Bershad, and Henry Levy. Improving the reliability
of commodity operating systems. ACM Trans. on Comput. Syst. (TOCS),
23(1):77–110, Feb 2005.
[60] Fujita Tomonori. Intel IOMMU (and IOMMU for Virtualization) perfor-
mances. https://lkml.org/lkml/2008/6/5/164.
[62] Carl Waldspurger and Mendel Rosenblum. I/O virtualization. Comm. of the
ACM (CACM), 55(1):66–73, Jan 2012.
[63] Dan Williams, Patrick Reynolds, Kevin Walsh, Emin Gün Sirer, and Fred B. Schneider. Device driver safety through a reference validation mechanism. In USENIX Symp. on Operating System Design & Implementation (OSDI), pages 241–254, 2008.
[64] Paul Willmann, Scott Rixner, and Alan L. Cox. Protection strategies for
direct access to virtualized I/O devices. In USENIX Ann. Technical Conf.
(ATC), pages 15–28, 2008.
[65] Rafal Wojtczuk. Subverting the Xen hypervisor. In Black Hat, 2008.
http://www.blackhat.com/presentations/bh-usa-08/Wojtczuk/BH_
US_08_Wojtczuk_Subverting_the_Xen_Hypervisor.pdf. (Accessed: May
2014).
[66] Ben-Ami Yassour, Muli Ben-Yehuda, and Orit Wasserman. On the DMA
mapping problem in direct device assignment. In ACM Int'l Systems & Storage Conf. (SYSTOR), pages 18:1–18:12, 2010.