
Moshe Malka
Rethinking the I/O Memory
Management Unit (IOMMU)

Rethinking the I/O Memory


Management Unit (IOMMU)

Research Thesis

Submitted in partial fulfillment of the requirements


for the degree of Master of Science in Computer Science

Moshe Malka

Submitted to the Senate


of the Technion - Israel Institute of Technology
Adar 5775, Haifa, March 2015
This research was carried out under the supervision of Prof. Dan Tsafrir, in the Faculty
of Computer Science.

Some results in this thesis have been published as articles by the author and research
collaborators in conferences and journals during the course of the author's research
period, the most up-to-date versions of which are:

1. Moshe Malka, Nadav Amit, Muli Ben-Yehuda and Dan Tsafrir. rIOMMU:
Efficient IOMMU for I/O Devices that Employ Ring Buffers. In Proceedings of the 20th
International Conference on Architectural Support for Programming Languages
and Operating Systems (ASPLOS 2015).

2. Moshe Malka, Nadav Amit, and Dan Tsafrir. Efficient IOMMU Intra-Operating
System Protection. In Proceedings of the 13th USENIX Conference on File and
Storage Technologies (FAST 2015).

Acknowledgements
I would like to thank my advisor Dan Tsafrir for his devoted guidance and help, my
research team Nadav Amit and Muli Ben-Yehuda, my parents and my friends.

The generous financial help of the Technion is gratefully acknowledged.



Contents

List of Figures

Abstract 1

Abbreviations and Notations 3

1 Introduction 5

2 Background 7
2.1 Virtual Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Physical and Virtual Addressing . . . . . . . . . . . . . . . . . . 8
2.1.2 Address Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Page Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Virtual Memory as a Tool for Memory Protection . . . . . . . . 11
2.1.5 Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Direct Memory Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 Transferring Data from the Memory to the Device . . . . . . . . 23
2.2.2 Transferring Data from the Device to the Memory . . . . . . . . 23
2.3 Adding Virtual Memory to I/O Transactions . . . . . . . . . . . . . . . 24

3 rIOMMU: Efficient IOMMU for I/O Devices that Employ Ring


Buffers 27
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 Operating System DMA Protection . . . . . . . . . . . . . . . . 29
3.2.2 IOMMU Design and Implementation . . . . . . . . . . . . . . . . 30
3.2.3 I/O Devices Employing Ring Buffers . . . . . . . . . . . . . . . . 31
3.3 Cost of Safety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Overhead Components . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.2 Protection Modes and Measured Overhead . . . . . . . . . . . . 33
3.3.3 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5.3 When IOTLB Miss Penalty Matters . . . . . . . . . . . . . . . . 50

3.5.4 Comparing to TLB Prefetchers . . . . . . . . . . . . . . . . . . . 50


3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4 Efficient IOMMU Intra-Operating System Protection 53


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Intra-OS Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 IOVA Allocation and Mapping . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Long-Lasting Ring Interference . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 The EiovaR Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5.1 EiovaR with Strict Protection . . . . . . . . . . . . . . . . . . . 63
4.5.2 EiovaR with Deferred Protection . . . . . . . . . . . . . . . . . 65
4.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.6.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5 Reducing the IOTLB Miss Overhead 77


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2 General description of all the prefetchers we explore . . . . . . . . . . . 78
5.3 Markov Prefetcher (MP) . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.1 Markov Chain Theorem . . . . . . . . . . . . . . . . . . . . . . . 80
5.3.2 Prefetching Using the Markov Chain . . . . . . . . . . . . . . . . 81
5.3.3 Extension to IOMMU . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Recency Based Prefetching (RP) . . . . . . . . . . . . . . . . . . . . . . 81
5.4.1 TLB hit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.2 TLB miss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.3 Extension to IOMMU . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5 Distance Prefetching (DP) . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.7 Measuring the cost of an Intel IOTLB miss . . . . . . . . . . . . . . . . 90

6 Conclusions 93
6.1 rIOMMU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 eIOVAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.3 Reducing the IOTLB Miss Overhead . . . . . . . . . . . . . . . . . . . . 93

Hebrew Abstract i

List of Figures

2.1 A system that uses physical addressing. . . . . . . . . . . . . . . . . . . 8


2.2 A system that uses virtual addressing. . . . . . . . . . . . . . . . . . . . 9
2.3 Flat page table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Allocating a new virtual page. . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Using virtual memory to provide page-level memory protection. . . . . . 13
2.6 Address translation with a page table. . . . . . . . . . . . . . . . . . . 14
2.7 Page hit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.8 Components of a virtual address that are used to access the TLB. . . . 15
2.9 TLB hit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.10 TLB miss. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.11 A two-level page table hierarchy. Notice that addresses increase from top
to bottom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.12 Address translation with a k-level page table. . . . . . . . . . . . . . . . 19
2.13 Addressing for small memory system. Assume 14-bit virtual addresses
(n = 14), 12-bit physical addresses (m = 12), and 64-byte pages (P = 64). 20
2.14 TLB, page table, and cache for small memory system. All values in the
TLB, page table, and cache are in hexadecimal notation. . . . . . . . . . 21
2.15 DMA transaction flow with IOMMU sequence diagram. . . . . . . . . . 24

3.1 IOMMU is for devices what the MMU is for processes. . . . . . . . . . . 28


3.2 Intel IOMMU data structures for IOVA translation. . . . . . . . . . . . 30
3.3 A driver drives its device through a ring. With an IOMMU, pointers are
IOVAs (both registers and target buffers). . . . . . . . . . . . . . . . . . 32
3.4 The I/O device driver maps an IOVA v to a physical target buffer p. It
then assigns v to the DMA descriptor. . . . . . . . . . . . . . . . . . . . 34
3.5 The I/O device writes the packet it receives to the target buffer through
v, which the IOMMU translates to p. . . . . . . . . . . . . . . . . . . . . 34
3.6 After the DMA completes, the I/O device driver unmaps v and passes p
to a higher-level software layer. . . . . . . . . . . . . . . . . . . . . . . . 34
3.7 CPU cycles used for processing one packet. The top bar labels are relative
to Cnone =1,816 (bottommost grid line). . . . . . . . . . . . . . . . . . . 36
3.8 Throughput of Netperf TCP stream as a function of the average number
of cycles spent on processing one packet. . . . . . . . . . . . . . . . . . 36
3.9 The rIOMMU data structures. e) is used only by hardware. The last two
fields of rRING are used only by software. . . . . . . . . . . . . . . . . . 38


3.10 rIOMMU data structures for IOVA translation. . . . . . . . . . . . . . 39
3.11 Outline of the rIOMMU logic. All DMAs are carried out with IOVAs
that are translated by the rtranslate routine. . . . . . . . . . . . . . . . 40
3.12 Outline of the rIOMMU OS driver, implementing map and unmap, which
respectively correspond to Figures 3.4 and 3.6. . . . . . . . . . . . . . . 41
3.13 Absolute performance numbers of the IOMMU modes when using the
Mellanox (top) and Broadcom (bottom) NICs. . . . . . . . . . . . . . . 47

4.1 IOVA translation using the Intel IOMMU. . . . . . . . . . . . . . . . . . . . 56


4.2 Pseudo code of the baseline IOVA allocation scheme. The functions rb next and
rb prev return the successor and predecessor of the node they receive, respectively. 59
4.3 The length of each alloc iova search loop in a 40K (sub)sequence of alloc iova
calls performed by one Netperf run. One Rx-Tx interference leads to regular
linearity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Netperf TCP stream iteratively executed under strict protection. The x axis
shows the iteration number. . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5 Average cycles breakdown of map with Netperf/strict. . . . . . . . . . . . . 64
4.6 Average cycles breakdown of unmap with Netperf/strict. . . . . . . . . . . . 64
4.7 Netperf TCP stream iteratively executed under deferred protection. The x axis
shows the iteration number. . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.8 Under deferred protection, EiovaRk eliminates costly linear searches when k
exceeds the high-water mark W . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.9 Length of the alloc iova search loop under the EiovaRk deferred protection
regime for three k values when running Netperf TCP Stream. Bigger capacity
implies that the searches become shorter on average. Big enough capacity
(k ≥ W = 250) eliminates the searches altogether. . . . . . . . . . . . . . 67
4.10 The performance of baseline vs. EiovaR allocation, under strict and deferred
protection regimes for the Mellanox (top) and Broadcom (bottom) setups.
Except for the case of Netperf RR, higher values indicate better performance. 68
4.11 Netperf Stream throughput (top) and used CPU (bottom) for different message
sizes in the Broadcom setup. . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.12 Impact of increased concurrency on Memcached in the Mellanox setup. EiovaR
allows the performance to scale. . . . . . . . . . . . . . . . . . . . . . . . . 75

5.1 General scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78


5.2 Markov state transition diagram, which is represented as a directed graph
(right) or a matrix (left). . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Schematic implementation of the Markov Prefetcher. . . . . . . . . . . . 82
5.4 Schematic depiction of the recency prefetcher on a TLB hit. . . . . . . . 83
5.5 Schematic depiction of the recency prefetcher on a TLB miss. . . . . . . 84

5.6 Schematic depiction of the distance prefetcher on a TLB miss. . . . . . . 85


5.7 Hit rate simulation of Apache benchmarks with message sizes of 1k (top)
and 1M (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.8 Hit rate simulation of Netperf stream with message sizes of 1k (top) and
4k (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.9 Hit rate simulation of Netperf RR (top) and Memcached (bottom). . . . 89
5.10 Subtraction between the RTT when the IOMMU is enabled and the RTT
when the IOMMU is disabled. . . . . . . . . . . . . . . . . . . . . . . . . 91

Abstract

Processes are encapsulated with virtual memory spaces and access the memory via
virtual addresses (VAs) to ensure, among other things, that they access only those
memory parts they have been explicitly granted access to. These memory spaces are
created and maintained by the OS, and the translation from virtual to physical addresses
(PAs) is done by the MMU. Analogously, I/O devices can be encapsulated with I/O
virtual memory spaces and access the memory using I/O virtual addresses (IOVAs),
which are translated by the input/output memory management unit (IOMMU) to physical addresses.
This encapsulation increases system availability and reliability, since it prevents
devices from overwriting any part of the memory, including memory that might be used
by other entities. It also prevents rogue devices from performing errant or malicious
access to the memory and ensures that buggy devices will not lose important data.
Chip makers understood the importance of this and added IOMMUs to the chipsets of
all servers and some PCs. However, this protection comes at the cost of performance
degradation, which depends on the IOMMU design, the way it is programmed, and the
workload. We found that Intel's IOMMU degrades the throughput of I/O-intensive
workloads by up to an order of magnitude.
We investigate all the possible causes of IOMMU overhead and that of its driver
and suggest a solution for each. First we identify that the complexity of the kernel
subsystem in charge of IOVA allocation is linear in the number of allocated IOVAs and
thus a major source of overhead. We optimize the allocation in a manner that ensures
that the complexity is typically constant and never worse than logarithmic, and we
improve the performance of the Netperf, Apache, and Memcached benchmarks by up
to 4.6x. Observing that the IOTLB miss rate can be as high as 50%, we then suggest
hiding the IOTLB misses with a prefetcher. We extend some of the state-of-the-art
prefetchers to IOTLB and compare them. In our experiments we achieve a hit rate
of up to 99% on some configurations and workloads. Finally, we observe that many
devices such as network and disk controllers typically interact with the OS via circular
ring buffers that induce a sequential, completely predictable workload. We design a ring
IOMMU (rIOMMU) that leverages this characteristic by replacing the virtual memory
page table hierarchy with a circular, flat table. Using standard networking benchmarks,
we show that rIOMMU provides up to 7.56x higher throughput relative to the baseline
IOMMU, and that it is within 0.77-1.00x the throughput of a system without IOMMU.



Abbreviations and Notations

ALH : Address Locality Hint


API : Application Programming Interface
CI : Cache Index
CO : Cache Offset
CPU : Central Processing Unit
CT : Cache Tag
DMA : Direct Memory Access
DP : Distance Prefetcher
DRAM : Dynamic Random Access Memory
DS : Data Structure
FIFO : First-In First-Out
GB : Gigabyte
Gbps : Gigabits per second
HW : Hardware
I/O : Input/Output
IOMMU : I/O Memory Management Unit
IOPF : I/O Page Fault
IOTLB : I/O Translation Lookaside Buffer
IOVA : I/O Virtual Address
IP : Internet Protocol
KB : Kilobyte
K : Kilo
KVM : Kernel-based Virtual Machine
LRU : Least Recently Used
MB : Megabyte
MMU : Memory Management Unit
MP : Markov Prefetcher
MPRE : Mapping Prefetch

NIC : Network Interface Controller
OS : Operating System
PA : Physical Address

PCI : Peripheral Component Interconnect


PFN : Physical Frame Number
PPN : Physical Page Number
PPO : Physical Page Offset
PTBR : Page Table Base Register
PTE : Page Table Entry
RAM : Random Access Memory
RDMA : Remote Direct Memory Access
rIOMMU : ring IOMMU
RP : Recency Prefetcher
RR : Request Response
RTT : Round Trip Time
Rx : Receiver
SUP : Supervisor
SW : Software
TCP : Transmission Control Protocol
TLBI : Translation Lookaside Buffer Index
TLB : Translation Lookaside Buffer
TLBT : Translation Lookaside Buffer Tag
Tx : Transmitter
UDP : User Datagram Protocol
VA : Virtual Address
VM : Virtual Machine
VPN : Virtual Page Number
VPO : Virtual Page Offset
VP : Virtual Page
VT-d : Virtualization Technology for Direct I/O
VT : Virtualization Technology


Chapter 1

Introduction

I/O transactions, such as reading data from a hard disk or a network card, constitute a
major part of the workloads in today's computer systems, especially in servers. They are,
therefore, an important factor in system performance. To reduce CPU utilization,
most of today's peripheral devices can bypass the processor and directly read from and
write to the main memory using a direct memory access (DMA) hardware unit [23, 17].
This unit is described in Section 2.2.
Although accessing the memory directly gives performance advantages, it can also
lead to stability and security problems. A peripheral device with direct memory access
can be programmed to overwrite any part of the system's memory or to cause bugs in
the system; it can also be made vulnerable to infection with malicious software [15].
These disadvantages of direct access have received a lot of attention recently. However,
similar problems also existed for CPU processes back before the virtual memory
mechanism was added to CPUs. Virtual memory encapsulates the processes with an
abstraction layer that prevents direct access to the memory. Many of the problems solved
by the virtual memory closely resemble problems that emerged due to direct memory
access. Hardware designers have noticed this similarity and duplicated the virtual
memory mechanism for I/O devices, calling it I/O virtual memory. The hardware unit
in charge of I/O virtual memory is called the I/O memory management unit (IOMMU).
Duplicating the mechanism was natural because virtual memory is a major part of
almost every computer system.
In Chapter 2 we expand on the required background regarding virtual memory, the
direct memory access mechanism, how the I/O virtual memory mechanism fits into the
picture, and the differences between processes and I/O devices in the context of virtual
memory.
Although I/O virtual memory solves the problems mentioned above, hardware
designers did not take into account the difference between the workload of I/O devices
and processes. As a result, systems that perform I/O transactions using I/O virtual
memory experience a significant reduction in performance. The goal of this work
is to study all the causes for the performance reduction and suggest new designs

and algorithms to reduce the performance overhead. We carefully investigated the
workload experienced by the I/O virtual memory and found that hardware overheads are
not exclusively to blame for its high cost. Rather, the cost is amplified by software, due
to the IOMMU driver overhead.


In our primary paper we observe that many devices such as network and disk
controllers typically interact with the OS via circular ring buffers that induce a sequential,
completely predictable workload. We design a ring IOMMU (rIOMMU) that leverages
this characteristic by replacing the virtual memory page table hierarchy with a circular,
flat table. A flat table is adequately supported by exactly one IOTLB entry, making
every new translation an implicit invalidation of the former and thus requiring explicit
invalidations only at the end of I/O bursts. Using standard networking benchmarks,
we show that rIOMMU provides up to 7.56x higher throughput relative to the baseline
IOMMU, and that it is within 0.77-1.00x the throughput of a system without IOMMU
protection. We describe the design and evaluation of our newly proposed rIOMMU in
Chapter 3. A paper reflecting the content of this chapter has been accepted
for publication at the 20th ACM International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS),
2015.
In a second paper we identify the kernel subsystem in charge of IOVA allocation
to be a major source of performance degradation for I/O-intensive workloads, due to
regularly suffering from a costly linear overhead. We suggest an efficient I/O virtual
address allocation mechanism (EIOVAR). Utilizing the characteristics of the load
experienced by IOMMUs, we optimize the allocation in a manner that ensures that
the complexity is typically constant and never worse than logarithmic. Our allocation
scheme is immediately applicable and improves the performance of Netperf, Apache, and
Memcached benchmarks by up to 4.6x. We describe EIOVAR in Chapter 4. A paper
reflecting the content of this chapter has been accepted for publication at the
13th USENIX Conference on File and Storage Technologies (FAST), 2015.
The IOMMU contains a small cache of the virtual memory translations (called the
I/O translation lookaside buffer, or IOTLB). IOTLB misses cause the IOMMU to translate
the virtual address, an action that includes multiple memory accesses. In order to
hide the performance reduction caused by the IOTLB misses, we explored the prefetch
mechanism. This aspect of our research is described in Chapter 5, where we review three
state-of-the-art prefetchers and extend them to work with the IOMMU. We then simulate
the miss rate of an IOMMU that contains these prefetchers.


Chapter 2

Background

The concept of virtual memory was first developed by German physicist Fritz-Rudolf
Guntsch at the Technische Universität Berlin in 1956 in his doctoral thesis, Logical Design
of a Digital Computer with Multiple Asynchronous Rotating Drums and Automatic High
Speed Memory Operation. In 1961, the Burroughs Corporation independently released
the first commercial computer with virtual memory, the B5000, with segmentation rather
than paging. The first minicomputer to introduce virtual memory was the Norwegian
NORD-1; during the 1970s, other minicomputers implemented virtual memory, notably
VAX models running VMS. Before 1982, all Intel CPUs were designed to work in Real
Mode. Real Mode, also called real address mode, is characterized by unlimited direct
software access to all memory, I/O addresses, and peripheral hardware, but provides
no support for memory protection. This situation continued until Protected Mode was
added to the x86 architecture in 1982. Introduced with the release of Intel's 80286
(286) processor, Protected Mode was later extended, with the release of the 80386 in
1985, to include features such as virtual memory paging and safe multitasking.
Virtual memory is an abstraction layer over the main memory that separates the
processes from the physical memory. It is a combination of operating system, disk
files, hardware virtual address translation, hardware exceptions, and main memory that
provides processes with a large private address space without any intervention from the
application programmer. Virtual memory provides three main capabilities:

1. It provides programmers with a large uniform address space and lets them focus
on designing the program rather than dealing with managing the memory used by
the processes.

2. It uses main memory efficiently by treating it as a cache for process data stored
on disk, keeping only the active areas in main memory, and swapping data between disk
and memory as needed.

3. It protects the address space of each process from corruption by other processes.

2.1 Virtual Memory
The goal of this subsection is to explain how virtual memory works, with a focus on:

The functionality relevant to I/O: address spaces, address translation, and so
forth;

The ability to provide encapsulation of the memory used by each device;

Protecting the system from devices that try to gain unauthorized access to the
memory.

The reader is referred to the references on which this subsection is based for more
in-depth information about virtual memory [18].

2.1.1 Physical and Virtual Addressing


Figure 2.1: A system that uses physical addressing.

A computer system's main memory is organized as an array of M contiguous byte-


sized cells. Each byte has a unique physical address (PA), as follows:

The first byte has an address of 0;

The next byte has an address of 1;

The next byte has an address of 2, and so on.



As the organization is simple, physical addressing is the most natural way for
a CPU to access memory. Figure 2.1 shows an example of physical addressing for
a load instruction that reads the word starting at physical address 4. An effective
physical address is created when the CPU executes the load instruction. It then
passes it to main memory over the memory bus. The main memory fetches the 4-byte
word that starts at physical address 4 and returns it to the CPU, which stores it
in a register. Early personal computers used physical addressing, and systems such as
digital signal processors, embedded microcontrollers, and Cray supercomputers still
use physical addressing today.
Modern processors no longer use physical addressing. Instead, they use virtual
addressing, as shown in Figure 2.2. In virtual addressing, the CPU accesses the main
memory by generating a virtual address (or VA, for short), which is converted to the
appropriate physical address before it is sent to the memory. Converting a virtual
address into a physical address is called address translation. Address translation, like
exception handling, requires the CPU hardware and the operating system to work
together closely.


Figure 2.2: A system that uses virtual addressing.

Virtual addresses are translated on the fly by the memory management unit (MMU),

which is dedicated hardware on the CPU chip that uses a look-up table stored in main
memory. The look-up table contains the translations for the mapped virtual addresses
and is managed by the operating system.

2.1.2 Address Spaces


An address space is an ordered set of non-negative integer addresses {0, 1, 2, ...}. In a
system that uses virtual memory, the CPU creates virtual addresses from an address
space of N = 2^n addresses called the virtual address space: {0, 1, 2, ..., N-1}. The
size of an address space is determined by the number of bits required to represent the
largest address. For example, a virtual address space with N = 2^n addresses is called
an n-bit address space. Modern computer systems typically support either 32-bit or
64-bit virtual address spaces.
A system also has a physical address space that corresponds to the M bytes of
physical memory in the system: {0, 1, 2, ..., M-1}. M is not required to be a power of
two, but for our purposes, we will assume that M = 2^m.
The address space is an important concept because it makes a clear distinction
between:

Data objects (bytes); and

Their attributes (addresses).

This distinction allows us to generalize, and also allows each data object to have multiple
independent addresses, each chosen from a different address space. The basic idea of
virtual memory is that each byte of main memory has a virtual address chosen from the
virtual address space, and a physical address chosen from the physical address space.

2.1.3 Page Table


Virtual memory is a tool for caching process data from disk to main memory, memory
management, and memory protection. These capabilities are provided by a combination
of:

The operating system software;

Address translation hardware in the memory management unit; and

A data structure known as a page table. It is stored in physical memory and maps
virtual pages to physical pages. The address translation hardware reads the page
table each time it converts a virtual address to a physical address.

The operating system maintains the contents of the page table and transfers pages back
and forth between disk and DRAM.
The basic organization of a page table is shown in Figure 2.3

Figure 2.3: Flat page table.

A page table is an array of page table entries (PTEs). Each page in the virtual
address space has a page table entry at a fixed offset in the page table. For our purposes,
we will assume that each page table entry consists of a valid bit and an n-bit address
field. The valid bit tells us whether the virtual page is currently cached in DRAM. If the
valid bit is set, the address field indicates the start of the corresponding physical page
in DRAM where the virtual page is cached. If the valid bit is not set, a null address
tells us that the virtual page has not yet been allocated. Otherwise, the address points
to the start of the virtual page on disk.
Figure 2.3 shows a page table for a system with eight virtual pages and four physical
pages. The four virtual pages, VP 1, VP 2, VP 4, and VP 7, are currently cached in
DRAM. Two pages, VP 0 and VP 5, are yet to be allocated, while pages VP 3 and VP
6 have been allocated - but are not currently cached. Because the DRAM cache is fully
associative, any physical page can contain any virtual page.

Allocating Memory and Mapping It to Virtual Space

When the operating system allocates a new page of virtual memory, such as the result
of calling malloc, this will affect the page table, as shown in the example in Figure 2.4.
In the example, VP 5 is allocated by creating room on disk and updating page table
entry 5 to point to the newly created page on disk.

2.1.4 Virtual Memory as a Tool for Memory Protection


Modern computer systems must provide the means by which the operating system
controls access to the memory system. A user process should be prevented from
modifying its read-only text section. Nor should it be allowed to read and/or to modify
any of the code and data structures in the kernel. It should also be prevented from
reading and/or writing to the private memory of other processes. Furthermore, unless
all parties expressly allow it, such as by calling specific inter-process communication
system calls, modification of any virtual pages shared with other processes should be
forbidden.

Figure 2.4: Allocating a new virtual page.
Having separate virtual address spaces makes it easy to isolate the private memories
of different processes. However, the address translation mechanism can be extended by
adding some additional permission bits to the page table entries to provide even greater
access control. This is possible because the address translation hardware reads a page
table entry each time the CPU generates an address. Figure 2.5 shows how this can be
done.
In this example we have added three permission bits to each page table entry:

The SUP bit indicates whether processes must be running in kernel (supervisor)
mode to access the page. Processes running in kernel mode can access any page,
while processes running in user mode can access pages for which SUP is 0.

The read and write bits control read and write access to the page. For
example, if process i is running in user mode, then it has permission to read VP 0
and to read or write VP 1, but it cannot access VP 2.

If an instruction contravenes these permissions, the CPU triggers a general protection


fault that transfers control to an exception handler in the kernel. This exception is
reported as a segmentation fault in Unix shells.

Figure 2.5: Using virtual memory to provide page-level memory protection.
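
To make the permission-bit discussion concrete, the following C sketch shows one plausible encoding of a page table entry with a valid bit and the SUP/READ/WRITE bits described above, together with a small access check. The field widths and the access_ok helper are illustrative assumptions, not the format of any particular architecture.

    #include <stdint.h>

    /* Illustrative page table entry: a valid bit, the permission bits
     * discussed above, and a physical page number. The bit widths are
     * assumptions for this sketch, not a real hardware format. */
    typedef struct {
        uint32_t valid : 1;   /* 1 if the virtual page is cached in DRAM     */
        uint32_t sup   : 1;   /* 1 if kernel (supervisor) mode is required   */
        uint32_t read  : 1;   /* 1 if reads are permitted                    */
        uint32_t write : 1;   /* 1 if writes are permitted                   */
        uint32_t ppn   : 28;  /* physical page number (or disk address)      */
    } pte_t;

    /* Returns 1 if the access is allowed, 0 if it should fault. */
    static int access_ok(pte_t pte, int user_mode, int is_write)
    {
        if (!pte.valid)
            return 0;                 /* page fault path                     */
        if (pte.sup && user_mode)
            return 0;                 /* reported as a protection fault      */
        return is_write ? pte.write : pte.read;
    }
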

2.1.5 Address Translation


In covering the basics of address translation in this section, our goal is to provide an
understanding of the role hardware plays in supporting virtual memory, in sufficient
detail to allow the reader to work through some examples on his or her own. Do bear
in mind that we are omitting a number of details, especially those related to timing.
Although they are important to hardware designers, such details are beyond the scope
of this thesis.
How the memory management unit uses the page table to perform the virtual address
mapping is shown in Figure 2.6.
The page table base register (PTBR), which is a control register in the CPU, points
to the current page table. The n-bit virtual address has two components:

1. A p-bit virtual page offset (VPO); and

2. An (n - p)-bit virtual page number (VPN).

The memory management unit uses the virtual page number to select the appropriate
page table entry. For example, VPN 0 selects PTE 0, VPN 1 selects PTE 1, and so on.
The corresponding physical address is the concatenation of the physical page number
(PPN) from the page table entry and the virtual page offset from the virtual address.
As both the physical and virtual pages are P bytes, the physical page offset (PPO) is
identical to the virtual page offset.

Figure 2.6: Address translation with a page table.
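
A minimal sketch of the arithmetic just described, in C: the virtual address is split into its VPN and VPO, and the physical address is rebuilt from the PPN returned by the page table lookup. The 4 KB page size (p = 12) is an assumption made only for this example.

    #include <stdint.h>

    #define PAGE_SHIFT 12                      /* p = 12, i.e., 4 KB pages (assumed) */
    #define PAGE_SIZE  (1ull << PAGE_SHIFT)

    /* Virtual page number: the high-order (n - p) bits of the virtual address. */
    static uint64_t vpn_of(uint64_t va) { return va >> PAGE_SHIFT; }

    /* Virtual page offset: the low-order p bits of the virtual address. */
    static uint64_t vpo_of(uint64_t va) { return va & (PAGE_SIZE - 1); }

    /* Since the PPO equals the VPO, the physical address is the PPN taken
     * from the selected page table entry concatenated with the VPO. */
    static uint64_t make_pa(uint64_t ppn, uint64_t va)
    {
        return (ppn << PAGE_SHIFT) | vpo_of(va);
    }
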


The steps the CPU hardware performs when there is a page hit are shown in Figure
2.7.

Figure 2.7: Page hit.

1. The processor generates a virtual address and sends it to the memory management
unit.

2. The memory management unit generates the page table entry address, and requests
it from the main memory.

3. The cache and/or main memory return the page table entry to the memory
management unit. The memory management unit creates the physical address
and sends it to main memory.

4. The main memory returns the requested data word to the processor.

Page hits are handled entirely by hardware. Handling a page fault requires coopera-
tion between hardware and the operating system kernel. As it is not relevant to our
work, we do not explain the process here.

Speeding up Address Translation with a Translation Lookaside Buffer (TLB)

Each time the CPU generates a virtual address, the memory management unit must
refer to a page table entry to translate the virtual address into a physical address. This
requires an additional fetch from memory that takes tens or even hundreds of cycles.
If the page table entry is cached in L1, then the overhead is reduced to only one or
two cycles. However, even this low cost can be further reduced or even eliminated by
including a small cache of page table entries in the memory management unit. This is
called a translation lookaside buffer (or TLB for short).
A TLB is a small, virtually addressed cache where each line holds a block consisting
of a single page table entry. A translation lookaside buffer usually has a high degree
of associativity. This is shown in Figure 2.8, in which the index and tag fields used
for set selection and line matching are extracted from the virtual page number in the
virtual address. If the translation lookaside buffer has T = 2^t sets, then the translation
lookaside buffer index (TLBI) consists of the t least significant bits of the virtual page
number, and the translation lookaside buffer tag (TLBT) consists of the remaining bits
in the virtual page number.

Figure 2.8: Components of a virtual address that are used to access the TLB.
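
The index/tag split can be expressed in the same C style. The sketch below assumes 4 KB pages and a TLB with T = 2^t sets, with t = 2; both parameters are illustrative.

    #include <stdint.h>

    #define PAGE_SHIFT   12                    /* assumed page size: 4 KB        */
    #define TLB_SET_BITS 2                     /* t = 2, so T = 4 sets (assumed) */

    /* TLB index: the t least significant bits of the virtual page number. */
    static uint64_t tlb_index(uint64_t va)
    {
        return (va >> PAGE_SHIFT) & ((1u << TLB_SET_BITS) - 1);
    }

    /* TLB tag: the remaining high-order bits of the virtual page number. */
    static uint64_t tlb_tag(uint64_t va)
    {
        return va >> (PAGE_SHIFT + TLB_SET_BITS);
    }
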

The steps involved when there is a translation lookaside buffer hit (the usual case)
are shown in Figure 2.9. The important point is that all of the address translation steps
are performed inside the on-chip memory management unit. Because of this, they are
performed very quickly.

Figure 2.9: TLB hit.

1. The CPU generates a virtual address.

2. The memory management unit checks whether the translation exists in the translation
lookaside buffer.

3. The memory management unit fetches the appropriate page table entry from the
translation lookaside buffer.

4. The memory management unit translates the virtual address into a physical
address, and then sends it to the main memory.

5. The main memory returns the requested data word to the CPU.

The memory management unit must fetch the page table entry from the L1 cache if
there is a translation lookaside buffer miss. This is shown in Figure 2.10. The newly
fetched page table entry is stored in the translation lookaside buffer and may
possibly overwrite an existing entry.

Multi-Level Page Tables

Until now we have assumed that the system uses a single page table for address
translation. If, however, we had a 32-bit address space, 4 KB pages, and a 4-byte page
table entry, we would require a 4 MB page table (2^20 entries of 4 bytes each) resident in memory at all times, even if
the application referenced only a small part of the virtual address space. This problem
is compounded for systems with 64-bit address spaces.

Figure 2.10: TLB miss.

A commonly used method of compacting the page table is to use a hierarchy of page
tables. The idea is best explained using an example in which a 32-bit virtual address
space is partitioned into 4 KB pages, with page table entries that are 4 bytes each, and
for which the virtual address space has the following form:

The first 2K pages of memory are allocated for code and data;

The next 6K pages are unallocated;

The next 1023 pages are also unallocated; and

The next page is allocated for the user stack.

Figure 2.11 shows how a two-level page table hierarchy for this virtual address space
might be constructed.
Each page table entry in the level-1 table is responsible for mapping a 4 MB segment
of the virtual address space, in which each segment consists of 1024 contiguous pages.
For example, page table entry 0 maps the first segment, page table entry 1 the next
segment, and so on. As the address space is 4 GB, 1024 page table entries are sufficient
to cover the entire space.
If every page in segment i is unallocated, level 1 page table entry i will be null. For
example, in Figure 2.11, segments 2 to 7 are unallocated. However, if at least one page
in segment i is allocated, level 1 page table entry i will point to the base of a level 2
page table. This is shown in Figure 2.11, where all or portions of segments 0, 1, and 8
are allocated, so their level 1 page table entries point to level 2 page tables.

Figure 2.11: A two-level page table hierarchy. Notice that addresses increase from top
to bottom.

Each page table entry in a level 2 page table maps a 4 KB page of virtual memory,
just as in a single-level page table. With 4-byte page table entries, each level 1 and level 2
page table is 4 KB, which, conveniently, is the same size as a page.
Moreover, only the level 1 table needs to be in main memory all the time. The level
2 page tables can be created and paged in and out by the virtual memory system as
they are needed. This further reduces demand on the main memory. Only the most
frequently used level 2 page tables need be cached in the main memory.
Figure 2.12 summarizes address translation with a k-level page table hierarchy.

The virtual address is partitioned into k virtual page numbers and a virtual page
offset.

Each virtual page number i, 1 ≤ i ≤ k, is an index of a page table at level i.

Each page table entry in a level-j table, 1 ≤ j ≤ k-1, points to the base of some
page table at level j + 1.

Each page table entry in a level-k table contains either the physical page number
of some physical page, or the address of a disk block.

To create the physical address, the memory management unit must access k page table
entries before it can determine the physical page number. Again, as with a single-level
hierarchy, the physical page offset is identical to the virtual page offset.

Figure 2.12: Address translation with a k-level page table.

At first glance, accessing k page table entries appears to be expensive and impractical.
However, the translation lookaside buffer compensates for this by caching page table
entries from the page tables at the different levels, the effect of which is that address
translation with multi-level page tables is not significantly slower than with single-level
page tables.
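
The k-level lookup can be sketched in C for the two-level layout of Figure 2.11, with a 32-bit virtual address split into a 10-bit VPN1, a 10-bit VPN2, and a 12-bit VPO. The pte_t layout, the PTE_VALID flag, and the assumption that a level 2 table can be addressed directly at its physical address are simplifications made for this example, not the format used by any real MMU.

    #include <stdint.h>

    typedef uint32_t pte_t;
    #define PTE_VALID    0x1u
    #define PTE_PPN(pte) ((pte) >> 12)         /* assumed: PPN in the high bits */

    /* Walks a two-level page table for a 32-bit virtual address and, on
     * success, stores the resulting physical address in *pa.
     * Returns -1 if either level indicates a fault. */
    static int walk_two_level(const pte_t *l1, uint32_t va, uint32_t *pa)
    {
        uint32_t vpn1 = (va >> 22) & 0x3ff;    /* index into the level 1 table */
        uint32_t vpn2 = (va >> 12) & 0x3ff;    /* index into the level 2 table */
        uint32_t vpo  = va & 0xfff;

        pte_t pte1 = l1[vpn1];
        if (!(pte1 & PTE_VALID))
            return -1;                         /* unallocated segment          */

        /* Simplification: assume the level 2 table is directly addressable
         * at its physical address; a real OS would map it first. */
        const pte_t *l2 = (const pte_t *)(uintptr_t)(PTE_PPN(pte1) << 12);

        pte_t pte2 = l2[vpn2];
        if (!(pte2 & PTE_VALID))
            return -1;                         /* page fault                   */

        *pa = (PTE_PPN(pte2) << 12) | vpo;     /* PPO is identical to the VPO  */
        return 0;
    }
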

End-to-end Address Translation (Page Walk)

We put it all together in this subsection with an example of end-to-end address translation
on a small system with a translation lookaside buffer and L1 d-cache. For simplicity, we
make the following assumptions:

The memory is byte addressable.

Memory accesses are to 1-byte words (not 4-byte words).

Virtual addresses are 14 bits wide (n = 14).

Physical addresses are 12 bits wide (m = 12).


Figure 2.13: Addressing for small memory system. Assume 14-bit virtual addresses (n
= 14), 12-bit physical addresses (m = 12), and 64-byte pages (P = 64).

The page size is 64 bytes (P = 64).

The translation lookaside buffer is four-way set associative with a total of 16
entries.

The L1 d-cache is physically addressed and direct mapped, with a 4-byte line size
and a total of 16 sets.

The formats of the virtual and physical addresses are shown in Figure 2.13. Since
each page is composed of 2^6 = 64 bytes, the low-order 6 bits of the virtual and physical
addresses serve as the virtual page offset and physical page offset respectively. The
high-order 8 bits of the virtual address serve as the virtual page number. The high-order
6 bits of the physical address serve as the physical page number.
Figure 2.14(a) shows a snapshot of this memory system, including the translation
lookaside buffer; Figure 2.14(b) shows a portion of the page table; and Figure 2.14(c)
shows the L1 cache.
We have also shown how the bits of the virtual and physical addresses are partitioned
by the hardware as it accesses these devices. This can be seen above the figures for the
translation lookaside buffer and cache.

Translation lookaside buffer: The translation lookaside buffer is virtually addressed


using the bits of the virtual page number. As the translation lookaside buffer has
four sets, the 2 low-order bits of the virtual page number serve as the set index
(the translation lookaside buffer index). The remaining 6 high-order bits serve as
the tag (translation lookaside buffer tag) that distinguishes the different virtual
page numbers that might map to the same translation lookaside buffer set.

Page table: The page table is a single-level design with a total of 2^8 = 256 page
table entries. However, we are only interested in the first sixteen of these. For
convenience, we have labeled each page table entry with the virtual page number
that indexes it. Keep in mind, however, that these virtual page numbers are not
part of the page table and are not stored in memory. Also keep in mind that the
physical page number of each invalid page table entry is marked with a dash or
minus sign to emphasize that the bit values stored there are not meaningful.

Figure 2.14: TLB, page table, and cache for small memory system. All values in the
TLB, page table, and cache are in hexadecimal notation.

Cache: The direct-mapped cache is addressed by the fields in the physical address.
As each block is 4 bytes, the low-order 2 bits of the physical address serve as the
block offset (CO) and, since there are 16 sets, the next 4 bits serve as the set
index (CI). The remaining 6 bits serve as the tag (CT).

What happens when the CPU executes a load instruction that reads the byte at
address 0x03d4? Recall that the hypothetical CPU reads one-byte words, not four-byte
words. In starting a manual simulation such as this, it is helpful to:

Write down the bits in the virtual address;

Identify the various fields we will need; and

Work out their hex values.



A similar task is performed by the hardware when it decodes the address.


The memory management unit extracts the virtual page number 0x0F from the
virtual address. It then checks the translation lookaside buffer to see whether it has
cached a copy of page table entry 0x0F from another, previous, memory reference. The
translation lookaside buffer extracts the translation lookaside buffer index of 0x3 and
the translation lookaside buffer tag of 0x03, hitting on a valid match in the second entry
of Set 0x3. It then returns the cached
physical page number 0x0D to the memory management unit.
The memory management unit would need to fetch the PTE from main memory
if the translation lookaside buffer had missed. However, that did not happen in our
example. Instead, we had a translation lookaside buffer hit. The memory management
unit now has everything required to create the physical address, which it does by
concatenating the physical page number (0x0D) from the page table entry with the
virtual page offset (0x14) from the virtual address. This forms the physical address
0x354.
The memory management unit then sends the physical address to the cache, which
extracts from the physical address:

1. The cache offset (CO) of 0x0;

2. The cache set index (CI) of 0x5; and

3. The cache tag (CT) of 0x0D.

Because the tag in Set 0x5 matches the cache tag, the cache detects a hit, reads
out the data byte (0x36) at offset CO, and returns it to the memory management unit,
which then passes it back to the CPU.
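
The field extraction in this walkthrough can be checked mechanically. The short C program below recomputes every field for address 0x03d4 under the section's parameters (64-byte pages, a 4-set TLB, a 16-set cache with 4-byte blocks), taking the PPN 0x0D returned by the TLB hit as given.

    #include <stdio.h>

    int main(void)
    {
        unsigned int va   = 0x03d4;
        unsigned int vpn  = va >> 6;             /* 0x0F */
        unsigned int vpo  = va & 0x3f;           /* 0x14 */
        unsigned int tlbi = vpn & 0x3;           /* 0x3  */
        unsigned int tlbt = vpn >> 2;            /* 0x03 */

        unsigned int ppn  = 0x0D;                /* PPN returned by the TLB hit */
        unsigned int pa   = (ppn << 6) | vpo;    /* 0x354 */
        unsigned int co   = pa & 0x3;            /* 0x0 */
        unsigned int ci   = (pa >> 2) & 0xf;     /* 0x5 */
        unsigned int ct   = pa >> 6;             /* 0x0D */

        printf("VPN=0x%02X VPO=0x%02X TLBI=0x%X TLBT=0x%02X\n", vpn, vpo, tlbi, tlbt);
        printf("PA=0x%03X CO=0x%X CI=0x%X CT=0x%02X\n", pa, co, ci, ct);
        return 0;
    }
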
It is possible to have other routes through the translation, an example being that if
the translation lookaside buffer misses, the memory management unit has to fetch the
physical page number from a page table entry in the page table. If the resulting page
table entry is invalid, this indicates a page fault and the kernel must reload the missing
page and rerun the load instruction. Another possibility is that the page table entry is
valid, but the necessary memory block misses in the cache.

2.2 Direct Memory Access


Direct memory access (DMA) is the hardware mechanism that allows peripheral compo-
nents to read from the memory or to write to it directly without involving the CPU.
This mechanism not only allows us to free the CPU to execute other commands, but

it can also significantly improve the throughput of the memory transactions from the
device to the memory and vice versa.
There are two types of DMAs: a software transfer of the data (by calling functions
such as read) or a hardware transfer of the data, known as Rx when the data is
transferred from the device to the memory, and Tx when it is transferred from the
memory to the device. We will focus
on the hardware DMA transaction to the memory, which is done asynchronously in this
case. That is, the device transfers the data at its own rate without the involvement of
the CPU, and the CPU continues its execution simultaneously. Those transactions are
done using DMA descriptors, which include the location on the memory to access.
A DMA descriptor is a data structure that contains all the information the hardware
needs to execute its operations such as read or write. The descriptor is prepared by the
OS in advance. Its location in memory is known to the device. Once the hardware
becomes available to execute the next DMA command, it reads the descriptor, executes
the relevant command, advances to read the next descriptor, and so on until it reaches
an empty descriptor.
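
A DMA descriptor of the kind just described might look roughly like the C structure below. The field names and widths are assumptions made for illustration; every device defines its own hardware-specific descriptor layout.

    #include <stdint.h>

    /* Illustrative ring descriptor; real devices define their own layouts. */
    struct dma_descriptor {
        uint64_t buffer_addr;  /* physical address (or IOVA) of the data buffer */
        uint16_t length;       /* size of the buffer in bytes                   */
        uint16_t flags;        /* e.g., an ownership bit: OS vs. device         */
        uint32_t status;       /* written back by the device on completion      */
    };
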

2.2.1 Transferring Data from the Memory to the Device

Consider an example of a case in which the device reads data from the memory, as
happens when a packet is sent to the NIC. On driver registration, the driver allocates a
set of descriptors, and, once there is data to transfer to the device, chooses a descriptor
belonging to the OS, updates it to point to the data buffer, writes in it the size of the
buffer, marks the descriptor as belonging to the device, and notifies the device to
wake it up. The device reads the first descriptor, which belongs to it, and from that
descriptor it reads the pointer and the size of the buffer to be read. The device now
knows the number of bytes to read and where to read from, and it starts the transaction.
After finishing the transaction, the device marks the descriptor as belonging to the OS,
advances to the next descriptor, and interrupts the OS. The OS detaches the buffer
from the descriptor, leaving the descriptor free for the next DMA command.
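
A minimal sketch of this transmit handoff, reusing the illustrative dma_descriptor structure above; DESC_OWNED_BY_DEV and the doorbell notification are assumed names, and wrap-around and error handling are simplified.

    #include <stdint.h>

    #define DESC_OWNED_BY_DEV 0x1                /* assumed ownership flag */

    /* Hands one outgoing buffer to the device: fill the next free descriptor,
     * mark it as owned by the device, and advance the software tail index. */
    static void queue_tx(struct dma_descriptor *ring, unsigned int *tail,
                         unsigned int ring_size, uint64_t buf_addr, uint16_t len)
    {
        struct dma_descriptor *d = &ring[*tail];

        d->buffer_addr = buf_addr;               /* physical address or IOVA       */
        d->length      = len;
        d->status      = 0;
        d->flags       = DESC_OWNED_BY_DEV;      /* device may now DMA this buffer */

        *tail = (*tail + 1) % ring_size;         /* advance to the next descriptor */
        /* notify_device(*tail);                    assumed doorbell write         */
    }
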

2.2.2 Transferring Data from the Device to the Memory

When the device writes data to the memory, it interrupts the OS to announce that new
data has arrived. Then, the OS uses the interrupt handler to allocate a buffer and tells
the hardware where to transfer its data. After the device writes the data to the buffer
and raises another interrupt, the OS wakes up a relevant process and passes the packet
to it. A network card is a typical example of a device that transfers data asynchronously
to the memory. The OS prepares the descriptor in advance, allocates a buffer, links it
to the descriptor, and marks the descriptor as belonging to the device. If no packet arrives,
the descriptor contains a link to an allocated buffer waiting for data to be written to it.
At the arrival of a packet from the network, the network card reads the descriptor in
order to identify the address of the buffer to write the data to, writes the data to the

buffer, and raises an interrupt to the OS. Using the interrupt handler, the OS detaches
the buffer from the descriptor, allocates a new buffer, links it to the descriptor, and
passes the packet written on the buffer to the network stack. Now the descriptor, with
a newly allocated buffer, again belongs to the device and waits until a new packet arrives.

2.3 Adding Virtual Memory to I/O Transactions

What follows is a brief description of what the I/O virtual memory adds to the flow
of I/O transactions. Before updating the DMA descriptor with the buffer address, the
driver asks the IOMMU driver to map the physical address of the buffer and receives a
virtual address. This address is inserted into the DMA descriptor instead of the physical
one, which would be used if the IOMMU were not enabled. The flow is described in more
detail in Figure 2.15.

Figure 2.15: DMA transaction flow with IOMMU sequence diagram.

I/O device transactions work as follows (each number refers to the corresponding
number in Figure 2.15):

1. When a device needs to perform an I/O transaction to/from the memory, its
driver (the piece of kernel software that controls the device) issues a request for an
I/O buffer.

2. The OS updates the lookup table (page table) and returns an IOVA to the driver.

3. The device driver initiates the device to transfer the data to and/or from a virtual
address via a corresponding DMA unit.

4. The device starts the transaction to an IOVA via a DMA unit.

5. The IOMMU translates the IOVA to a physical address, and starts the transaction
to and/or from a physical address.

6. When the transaction ends, the device raises an interrupt to the driver.

7. The driver issues a request for unmapping the I/O buffer.

8. The OS updates the radix tree to indicate that the mapping is no longer available.

There are several strategies for deciding when to map and unmap the I/O buffer.
Strict is the common strategy and the one that offers maximum protection. Before a
DMA transaction takes place, all the memory accessed by the I/O device is mapped
and then unmapped once the transaction is complete (right after steps 3-6). Other
strategies postpone the unmap operation, adding it to a list whose entries are later
unmapped together. The longer the unmap operation is delayed, the less the system
is protected from misbehaving device drivers.
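
As an illustration of the strict strategy, the following sketch wraps a single receive DMA with the Linux DMA API calls dma_map_single and dma_unmap_single; the device-specific helpers (my_desc_set_addr, my_trigger_dma, my_wait_for_dma_irq) are hypothetical placeholders, and error handling is minimal.

#include <linux/dma-mapping.h>

/* Hypothetical device-specific steps, standing in for real driver code: */
extern void my_desc_set_addr(dma_addr_t addr, size_t len);
extern void my_trigger_dma(void);
extern void my_wait_for_dma_irq(void);

static void strict_dma_receive(struct device *dev, void *buf, size_t len)
{
    dma_addr_t iova;

    /* Steps 1-2: map the buffer; with the IOMMU enabled this allocates an
     * IOVA and inserts the IOVA-to-physical translation into the page table. */
    iova = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, iova))
        return;

    /* Steps 3-6: hand the IOVA to the device and let the DMA complete. */
    my_desc_set_addr(iova, len);
    my_trigger_dma();
    my_wait_for_dma_irq();

    /* Steps 7-8: unmap immediately (strict), removing the translation and
     * invalidating the corresponding IOTLB entry before the buffer is reused. */
    dma_unmap_single(dev, iova, len, DMA_FROM_DEVICE);
}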


Chapter 3

rIOMMU: Efficient IOMMU for
I/O Devices that Employ Ring
Buffers

3.1 Introduction

I/O device drivers initiate direct memory accesses (DMAs) to asynchronously move data
from their devices into memory and vice versa. In the past, DMAs used physical memory
addresses. But such unmediated access made systems vulnerable to (1) rogue devices
that might perform errant or malicious DMAs [14, 19, 40, 44, 65], and to (2) buggy
drivers that account for most operating system (OS) failures and might wrongfully trigger
DMAs to arbitrary memory locations [13, 21, 31, 50, 59, 63]. Subsequently, all major
chip vendors introduced I/O memory management units (IOMMUs) [3, 11, 36, 40],
which allow DMAs to execute with I/O virtual addresses (IOVAs). The IOMMU
translates the IOVAs into physical addresses according to I/O page tables that are set up
by the OS. The OS thus protects itself by adding a suitable translation just before
the corresponding DMA, and by removing the translation right after [17, 23, 64]. We
explain in detail how the IOMMU is implemented and used in 3.2.
DMA protection comes at a cost that can be substantial in terms of performance
[4, 15, 64], notably for newer, high-throughput I/O devices like 10/40 Gbps network
controllers (NICs), which can deliver millions of packets per second. Our measurements
indicate that using DMA protection with such devices can reduce the throughput by
up to 10x. This penalty has motivated OS developers to trade off some protection
for performance. For example, when employing the deferred IOMMU mode, the
Linux kernel defers IOTLB invalidations for a short while instead of performing them
immediately when necessary, because they are slow. The kernel then processes the
accumulated invalidations en masse by flushing the entire IOTLB, thus amortizing
the overhead at the risk of allowing devices to erroneously utilize stale IOTLB entries.

Figure 3.1: IOMMU is for devices what the MMU is for processes.

While this tradeoff can double the performance relative to the stricter IOMMU mode,
the throughput is still 5x lower than when the IOMMU is disabled. We analyze and
model the overheads associated with using the IOMMU in 3.3.
We argue that the degraded performance is largely due to the IOMMU needlessly
replicating the design of the regular MMU, which is based on hierarchical page tables.
Our claim pertains to the most widely used I/O devices, such as NICs and disk drives,
which utilize circular ring buffers in order to interact with the OS. A ring is an array
of descriptors that the OS driver sets when initiating DMAs. Descriptors encapsulate
the DMA details, including the associated IOVAs. Importantly, ring semantics dictate
that (1) the driver work through the ring in order, one descriptor after the other, and
that (2) the I/O device process these descriptors in the same order. Thus, IOVAs are
short-lived and the sequence in which they are used is linearly predictable: each IOVA
is allocated, placed in the ring, used in turn, and deallocated.
We propose a ring IOMMU (rIOMMU) that supports this pervasive sequential model
using flat (1D) page tables that directly correspond to the nature of rings. rIOMMU
has three advantages over the baseline IOMMU that significantly reduce the overhead
of DMA protection. First, building/destroying an IOVA translation in a flat table is
quicker than in a hierarchical structure. Second, the act of (de)allocating IOVAs (the
actual integers serving as virtual addresses) is faster, as IOVAs are indices of flat
tables in our design. Finally, the frequency of IOTLB invalidations is substantially
reduced, because the rIOMMU designates only one IOTLB entry per ring. One is
enough because IOVAs are used sequentially, one after the other. Consequently, every
translation inserted to the IOTLB removes the previous translation, eliminating the
need to explicitly invalidate the latter. And since the OS handles high-throughput I/O
in bursts, explicit invalidations become rare. We describe rIOMMU in 3.4.
We evaluate the performance of rIOMMU using standard network benchmarks.
We find that rIOMMU improves the throughput by 1.00–7.56x, shortens latency by
0.80–0.99x, and reduces CPU consumption by 0.36–1.00x relative to the baseline DMA

28
protection. Our fastest rIOMMU variant is within 0.77–1.00x the throughput, 1.00–1.04x
the latency, and 1.00–1.22x the CPU consumption of a system that disables the IOMMU
entirely. We describe our experimental evaluation in 4.6.
Technion - Computer Science Department - M.Sc. Thesis MSC-2015-10 - 2015

3.2 Background

3.2.1 Operating System DMA Protection

The role the IOMMU plays for I/O devices is similar to the role the regular MMU plays
for processes, as illustrated in Figure 3.1. Processes typically access the memory using
virtual addresses, which are translated to physical addresses by the MMU. Analogously,
I/O devices commonly access the memory via DMAs associated with IOVAs. The
IOVAs are translated to physical addresses by the IOMMU.
The IOMMU provides inter- and intra-OS protection [4, 62, 64, 66]. Inter-OS
protection is applicable in virtual setups. It allows for direct I/O, where the host
assigns a device directly to a guest virtual machine (VM) for its exclusive use, largely
removing itself from the guest's I/O path and thus improving its performance [30, 50].
In this mode of operation, the VM directly programs device DMAs using its notion of
(guest) physical addresses. The host uses the IOMMU to redirect these accesses to
where the VM memory truly resides, thus protecting its own memory and the memory
of the other VMs. With inter-OS protection, IOVAs are mapped to physical memory
locations infrequently, typically only upon such events as VM creation and migration.
Such mappings are therefore denoted static or persistent [64]; they are not the focus of
this paper.
Intra-OS protection allows the OS to defend against the DMAs of errant/malicious
devices [14, 19, 24, 40, 44, 65] and of buggy drivers, which account for most OS failures
[21, 13, 31, 50, 59, 63]. Drivers and their I/O devices can perform DMAs to arbitrary
memory addresses, and IOMMUs allow OSes to protect themselves (and their processes)
against such accesses, by restricting them to specific physical locations. In this mode
of work, map operations (of IOVAs to physical addresses) and unmap operations
(invalidations of previous maps) are frequent and occur within the I/O critical path,
such that each DMA is preceded and followed by the mapping and unmapping of the
corresponding IOVA [44, 52]. Due to their short lifespan, these mappings are denoted
dynamic [17], streaming [23] or single-use [64]. This strategy of IOMMU-based intra-OS
protection is the focus of this paper. It is recommended by hardware vendors [40, 32, 44]
and employed by operating systems [9, 17, 23, 37, 51, 64].1 It is applicable in non-virtual
setups where the OS has direct control over the IOMMU. It is likewise applicable in
1
For example, the DMA API of Linux notes that DMA addresses should be mapped only for the
time they are actually used and unmapped after the DMA transfer [52]. In particular, once a buffer
has been mapped, it belongs to the device, not the processor. Until the buffer has been unmapped, the
[OS] driver should not touch its contents in any way. Only after [the unmap of the buffer] has been
called is it safe for the driver to access the contents of the buffer [23].

Figure 3.2: Intel IOMMU data structures for IOVA translation.

virtual setups where IOMMU functionality is exposed to VMs via paravirtualization
[15, 50, 57, 64], full emulation [4], and, more recently, hardware support for nested
IOMMU translation [3, 40].

3.2.2 IOMMU Design and Implementation

Given a target memory buffer of a DMA, the OS associates the physical address (PA) of
the buffer with an IOVA. The OS maps the IOVA to the PA by inserting the IOVA→PA
translation into the IOMMU data structures. Figure 4.1 depicts these structures as
implemented by Intel x86-64 [40]. The PCI protocol dictates that each DMA operation
is associated with a 16-bit request identifier comprised of a bus-device-function triplet
that uniquely identifies the corresponding I/O device. The IOMMU uses the 8-bit bus
number to index the root table in order to retrieve the physical address of the context
table. It then indexes the context table using the 8-bit concatenation of the device
and function numbers. The result is the physical location of the root of the page table
hierarchy that houses all of the IOVA→PA translations of that I/O device.
The purpose of the IOMMU page table hierarchy is similar to that of the MMU
hierarchy: recording the mapping from virtual to physical addresses by utilizing a 4-level
radix tree. Each 48-bit (I/O) virtual address is divided into two: the 36 high-order
bits, which constitute the virtual page number, and the 12 low-order bits, which are the
offset within the page. The translation procedure applies to the virtual page number
only, converting it into a physical frame number (PFN) that corresponds to the physical

30
memory location being addressed. The offset is the same for both physical and virtual
pages.
Let Tj denote a page table in the j-th radix tree level for j = 1, 2, 3, 4, such that
Technion - Computer Science Department - M.Sc. Thesis MSC-2015-10 - 2015

T1 is the root of the tree. Each Tj is a 4KB page containing up to 2^9 = 512 pointers
to physical locations of next-level Tj+1 tables. Last-level T4 tables contain PFNs of
target buffer locations. Correspondingly, the 36-bit virtual page number is split into a
sequence of four 9-bit indices i1, i2, i3 and i4, such that ij is used to index Tj in order
to find the physical address of the next Tj+1 along the radix tree path. Logically, in C
pointer notation, T1[i1][i2][i3][i4] is the PFN of the target memory location.
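
The following C sketch makes the index arithmetic explicit. It is a simplified rendering of the walk just described, not the hardware's actual implementation: permission bits are ignored, and next-level table pointers are dereferenced directly (i.e., an identity physical-to-virtual mapping is assumed).

#include <stdint.h>

#define LEVELS     4
#define IDX_BITS   9
#define IDX_MASK   ((1u << IDX_BITS) - 1)          /* 0x1ff */
#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)        /* 0xfff */

/* Returns the physical address for a 48-bit IOVA, or 0 if a translation
 * is missing (which the IOMMU would report as a failed walk). */
static uint64_t walk(uint64_t *t1, uint64_t iova)
{
    uint64_t *table = t1;
    for (int level = 0; level < LEVELS; level++) {
        /* i1 = bits 47-39, i2 = bits 38-30, i3 = bits 29-21, i4 = bits 20-12 */
        unsigned shift = PAGE_SHIFT + IDX_BITS * (LEVELS - 1 - level);
        uint64_t entry = table[(iova >> shift) & IDX_MASK];
        if (!entry)
            return 0;
        if (level == LEVELS - 1)                   /* T4 entry holds the target PFN */
            return (entry << PAGE_SHIFT) | (iova & PAGE_MASK);
        table = (uint64_t *)(uintptr_t)entry;      /* physical address of T(j+1) */
    }
    return 0; /* not reached */
}
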
Similarly to the MMU translation lookaside buffer (TLB), the IOMMU caches
translations using an IOTLB, which it fills on-the-fly as follows. Upon an IOTLB
miss, the IOMMU hardware hierarchically walks the page table as described above,
and it inserts the IOVA→PA translation into the IOTLB. IOTLB entries are invalidated
explicitly by the OS as part of the corresponding unmap operation.
An IOMMU table walk fails if a matching translation was not previously established
by the OS, a situation which is logically similar to encountering a null pointer value
during the walk. A walk additionally fails if the DMA being processed conflicts with
the read/write permission bits found within the page table entries along the traversed
radix tree path. We note in passing that, at present, in contrast to MMU memory
accesses, DMAs are typically not restartable. Namely, existing systems usually do not
support I/O page faults, and hence the OS cannot populate the IOMMU page table
hierarchy on demand. Instead, IOVA translations of valid DMAs are expected to be
successful, and the corresponding pages must be pinned to memory. (That said, I/O page
fault standardization does exist [54].)

3.2.3 I/O Devices Employing Ring Buffers

Many I/O devices, notably NICs and disk drives, deliver their I/O through one or
more producer/consumer ring buffers. A ring is an array shared between the OS device
driver and the associated device, as illustrated in Figure 3.3. The ring is circular in that
the device and driver wrap around to the beginning of the array when they reach its end.
The entries in the ring are called DMA descriptors. Their exact format and content
vary between devices, but they specify at least the address and size of the corresponding
target buffers. Additionally, the descriptors commonly contain status bits that help the
driver and the device to synchronize.
Devices must also know the direction of each requested DMA, namely, whether the
data should be transmitted from memory (into the device) or received (from the device)
into memory. The direction can be specified in the descriptor, as is typical for disk
controllers. Alternatively, the device can employ different rings for receive and transmit
activity, in which case the direction is implied by the ring. The receive and transmit
rings are denoted Rx and Tx, respectively. NICs employ at least one Rx and one Tx

31
Technion - Computer Science Department - M.Sc. Thesis MSC-2015-10 - 2015

Figure 3.3: A driver drives its device through a ring. With an IOMMU, pointers are
IOVAs (both registers and target buffers).

per port. They may employ multiple Rx/Tx rings per port to promote scalability, as
different rings can be handled concurrently by different cores.
Upon initialization, the OS device driver is responsible for allocating the rings and
configuring the I/O device with their size and base location. For each ring, the device
and driver utilize a head and a tail pointer to delimit the content of the ring that can be
used by the device: [head, tail). The device iteratively consumes (removes) descriptors
from the head, and it increments the head to point to the descriptor that it will use
next. Similarly, the driver adds descriptors to the tail, and it increments the tail to
point to the entry it will use subsequently.
A device asynchronously informs its OS driver that data was transmitted or received
by triggering an interrupt. The device coalesces interrupts when their rate is high.
Upon receiving an interrupt, the driver of a high-throughput device handles the entire
I/O burst. Namely, it sequentially iterates through and processes all the descriptors
whose corresponding DMAs have completed.
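
The head/tail protocol can be summarized by the toy model below. It is a software analogy only: in a real NIC the head and tail are device registers and the device side runs in hardware, and the descriptor format (struct desc) is a placeholder.

#include <stdbool.h>
#include <stdint.h>

struct desc { uint64_t addr; uint32_t len; };      /* placeholder descriptor */

struct ring {
    struct desc *entries;
    uint32_t     size;   /* number of descriptors in the array */
    uint32_t     head;   /* next descriptor the device will consume */
    uint32_t     tail;   /* next descriptor the driver will produce */
};

/* Descriptors in [head, tail) are currently usable by the device. */
static uint32_t ring_in_use(const struct ring *r)
{
    return (r->tail + r->size - r->head) % r->size;
}

/* Driver side: add a descriptor at the tail. */
static bool driver_post(struct ring *r, struct desc d)
{
    if (ring_in_use(r) == r->size - 1)   /* keep one slot free to tell full from empty */
        return false;
    r->entries[r->tail] = d;
    r->tail = (r->tail + 1) % r->size;   /* on real hardware: write the tail register */
    return true;
}

/* Device side: consume the descriptor at the head. */
static bool device_consume(struct ring *r, struct desc *out)
{
    if (r->head == r->tail)              /* [head, tail) is empty */
        return false;
    *out = r->entries[r->head];
    r->head = (r->head + 1) % r->size;   /* on real hardware: the device advances the head */
    return true;
}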

3.3 Cost of Safety

This section enumerates the overhead components involved in using the IOMMU in the
Linux/Intel kernel (3.3.1). It experimentally quantifies the overhead of each component
(3.3.2). And it provides and validates a simple performance model that allows us to
understand how the overhead affects performance and to assess the benefits of reducing
it (3.3.3).

3.3.1 Overhead Components

Suppose that a device driver that employs a ring wants to transmit or receive data
from/to a target buffer. Figure 3.4 lists the actions it carries out. First, it allocates the
target buffer, whose physical address is denoted p (1). (For simplicity, let us assume
that p is page aligned.) It pins p to memory and then asks the IOMMU driver to map
the buffer to some IOVA, such that the I/O device would be able to access p (2). The
IOMMU driver invokes its IOVA allocator, which returns a new IOVA v, an integer
that is not associated with any other page currently accessible to the I/O device (3).
The IOMMU driver then inserts the v→p translation into the page table hierarchy of
the I/O device (4), and it returns v to the device driver (5). Finally, when updating
the corresponding ring descriptor, the device driver uses v as the address for the target
buffer of the associated DMA operation (6).
Assume that the latter is a receive DMA. Figure 3.5 details the activity that takes
place when the I/O device gets the data. The device reads the DMA descriptor within
the ring through its head register. As the address held by the head is an IOVA, it is
intercepted by the IOMMU (1). The IOMMU consults its IOTLB to find a translation
for the head IOVA. If the translation is missing, the IOMMU walks the page table
hierarchy of the device to resolve the miss (2). Equipped with the head's physical
address, the IOMMU fetches the head descriptor for the I/O device (3). The head
descriptor specifies that v (IOVA defined above) is the address of the target buffer (4),
so the I/O device writes the received data to v (5). The IOMMU intercepts v, walks
the page table if the v→p translation is missing (6), and redirects the received data to
p (7).
Figure 3.6 shows the actions the device driver carries out after the DMA operation
is completed. The device driver asks the IOMMU driver to unmap the IOVA v (1).
In response, the IOMMU driver removes the v→p mapping from the page table
hierarchy (2), purges the mapping from the IOTLB (3), and deallocates v (4). (The
order of these actions is important.) Once the I/O device can no longer access p, it is
safe for the device driver to hand the buffer to higher levels in the software stack for
further processing (5).
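
The per-DMA work described above can be summarized by the following schematic pseudocode. The helper names (iova_alloc, pt_insert, iotlb_invalidate, and so on) are stand-ins chosen to mirror the steps of Figures 3.4 and 3.6 and the rows of Table 3.1; they are not the actual Linux function names.

#include <stdint.h>
#include <stddef.h>

typedef uint64_t u64;
struct domain;                                         /* per-device IOMMU state */

extern u64  iova_alloc(struct domain *d, size_t size); /* step 3 of Figure 3.4 */
extern void iova_free(struct domain *d, u64 v);        /* step 4 of Figure 3.6 */
extern void pt_insert(struct domain *d, u64 v, u64 p); /* step 4 of Figure 3.4 */
extern void pt_remove(struct domain *d, u64 v);        /* step 2 of Figure 3.6 */
extern void iotlb_invalidate(struct domain *d, u64 v); /* step 3 of Figure 3.6 */

u64 map(struct domain *d, u64 p, size_t size)
{
    u64 v = iova_alloc(d, size);   /* pick an IOVA not currently in use      */
    pt_insert(d, v, p);            /* walk and update the 4-level page table */
    return v;                      /* the driver places v in the descriptor  */
}

void unmap(struct domain *d, u64 v)
{
    pt_remove(d, v);               /* clear the page-table entry             */
    iotlb_invalidate(d, v);        /* purge the now-stale IOTLB entry        */
    iova_free(d, v);               /* only afterwards may v be reused        */
}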

3.3.2 Protection Modes and Measured Overhead

We experimentally quantify the overhead components of the map and unmap functions
of the IOMMU driver as outlined in Figures 3.4 and 3.6. To this end, we execute
the standard Netperf TCP stream benchmark, which attempts to maximize network
throughput between two machines over a TCP connection. (The experimental setup is
detailed in 4.6.)

Strict Protection We begin by profiling the Linux kernel in its safer IOMMU mode,
denoted strict, which strictly follows the map/unmap procedures described in 3.3.1.


Figure 3.4: The I/O device driver maps an IOVA v to a physical target buffer p. It
then assigns v to the DMA descriptor.

Figure 3.5: The I/O device writes the packet it receives to the target buffer through v,
which the IOMMU translates to p.

Figure 3.6: After the DMA completes, the I/O device driver unmaps v and passes p to
a higher-level software layer.

function  component    strict  strict+  defer  defer+
map       iova alloc     3986       92   1674     108
          page table      588      590    533     577
          other            44       45     44      42
          sum             4618      727   2251     727

unmap     iova find       249      418    263     454
          iova free       159       62    189      57
          page table      438      427    471     504
          iotlb inv      2127     2135      9       9
          other            26       25    205     216
          sum             2999     3067   1137    1240

Table 3.1: Average cycles breakdown of the map and unmap functions of the IOMMU
driver for different protection modes.

Table 3.1 shows the average duration of the components of these procedures in cycles.
When examining the breakdown of strict/map, we see that its most costly component
is, surprisingly, IOVA allocation (Step 3 in Figure 3.4). Upon further investigation, we
found that the reason for this high cost is a nontrivial pathology in the Linux IOVA
allocator that regularly causes some allocations to be linear in the number of currently
allocated IOVAs. We were able to come up with a more efficient IOVA allocator, which
consistently allocates/frees in constant time [7]. We denote this optimized IOMMU
mode, which is quicker than strict but equivalent to it in terms of safety, as strict+.
Table 3.1 shows that strict+ indeed reduces the allocation time from nearly 4,000 cycles
to less than 100.
The remaining dominant strict(+)/map overhead is the insertion of the IOVA to
the IOMMU page table (Step 4 in Figure 3.4). The 500+ cycles of the insertion are
attributed to explicit memory barriers and cacheline flushes that the driver performs
when updating the hierarchy. Flushes are required, as the I/O page walk is incoherent
with the CPU caches on our system. (This is common nowadays; Intel started shipping
servers with coherent I/O page walks only recently.)
Focusing on the unmap components of strict/strict+, we see that finding the
unmapped IOVA in the allocators data structure is costlier in strict+ mode. The
reason: like the baseline strict, strict+ utilizes a red-black tree to hold the IOVAs. But
the strict+ tree is fuller, so the logarithmic search is longer. Conversely strict+/free
(Step 4 in Figure 3.6) is done in constant time, rather than logarithmic, so it is quicker.
The other unmap components are: removing the IOVA from the page tables (Step
2 in Figure 3.6) and the IOTLB (Step 3). The removal takes 400+ cycles, which is
comparable to the duration of insertion. IOTLB invalidation is by far the slowest unmap
component at around 2,000 cycles; this result is consistent with previous work [4, 66].

Deferred Protection In order to reduce the high cost of invalidating IOTLB entries,
the Linux deferred protection mode relaxes strictness somewhat, trading off some safety

Figure 3.7: CPU cycles used for processing one packet. The top bar labels are
relative to Cnone =1,816 (bottommost grid line).



Figure 3.8: Throughput of Netperf TCP stream as a function of the average number
of cycles spent on processing one packet.

for performance. Instead of invalidating entries right away, the IOMMU driver queues
the invalidations until 250 freed IOVAs accumulate. It then processes all of them in
bulk by invalidating the entire IOTLB. This approach affects the cost of (un)mapping
Technion - Computer Science Department - M.Sc. Thesis MSC-2015-10 - 2015

in two ways, as shown in Table 3.1 in the defer and defer+ columns. (Defer+ is to
defer what strict+ is to strict.) First, as intended, it eliminates the cost of invalidating
individual IOTLB entries. And second, it reduces the cost of IOVA allocation in the
baseline deferred mode as compared to strict (1,674 vs. 3,986), because deallocating
IOVAs in bulk reduces somewhat the aforementioned linear pathology.
The drawback of deferred protection is that the I/O device might erroneously access
target buffers through stale IOTLB entries after the buffers have already been handed
back to higher software stack levels (Step 5 in Figure 3.6). Notably, at this point, the
buffers could be (re)used for other purposes.

3.3.3 Performance Model

Let C denote the average number of CPU cycles required to process one packet. Figure
3.7 shows C for each of the aforementioned IOMMU modes in our experimental setup.
The bottommost horizontal grid line shows Cnone , which is C when the IOMMU is
turned off. We can see, for example, that Cstrict is nearly 10x higher than Cnone .
Our experimental setup employs a NIC that uses two target buffers per packet:
one for the header and one for the data. Each packet thus requires two map and two
unmap invocations. So the processing of the packet includes: two IOVA (de)allocations;
two page table insertions and deletions; and two invalidations of IOTLB entries. The
corresponding aggregated cycles are respectively depicted as the three top stacked
sub-bars in the figure. The bottom, other sub-bar embodies all the rest of the packet
processing activity, notably TCP/IP and interrupt processing. As noted, the deferred
modes eliminate the IOTLB invalidation overhead, and the + modes reduce the
overhead of IOVA (de)allocation. But even Cdefer+ (the most performant mode, which
introduces a vulnerability window) is still over 3.3x higher than Cnone.
We find that the way the specific value of C affects the overall throughput of Netperf
is simple and intuitive. Specifically, if S denotes the cycles-per-second clock speed of
the core, then S/C is the number of packets the core can handle per second. And
since every Ethernet packet carries 1,500 bytes, the throughput of the system in Gbps
should be Gbps(C) = 1500 byte * 8 bit/byte * S/C, assuming S is given in GHz. Figure 3.8
shows that this simple model (thick line) is accurate. It coincides with the throughput
obtained when systematically lengthening Cnone using a carefully controlled busy-wait
loop (thin line). It also coincides with the throughput measured under the different
IOMMU modes (cross points).
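
As a quick sanity check of the model, the snippet below evaluates Gbps(C) with S = 3.1 GHz (the clock speed of the core in our experimental setup) and Cnone = 1,816 cycles from Figure 3.7; it predicts roughly 20.5 Gbps with the IOMMU off and roughly 2 Gbps when C is nearly 10x higher (strict), consistent with the measured curves in Figure 3.8.

#include <stdio.h>

/* Gbps(C) = 1500 byte * 8 bit/byte * S / C, with S in GHz (cycles per ns). */
static double gbps(double cycles_per_packet, double clock_ghz)
{
    return 1500.0 * 8.0 * clock_ghz / cycles_per_packet;
}

int main(void)
{
    printf("none:   %.1f Gbps\n", gbps(1816.0, 3.1));        /* ~20.5 Gbps */
    printf("strict: %.1f Gbps\n", gbps(10.0 * 1816.0, 3.1)); /* ~2.0 Gbps  */
    return 0;
}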

Consequences As our model accurately reflects reality, we conclude that the trans-
lation activity carried out by the IOMMU (as depicted in Figure 3.5) does not affect
the performance of the system, even when servicing demanding benchmarks like Net-
perf. Instead, the cost of IOMMU protection is entirely determined by the number
of cycles the core spends establishing and destroying IOVA mappings. Consequently,
we can later simulate and accurately assess the expected performance of our proposed
IOMMU by likewise spending cycles; there is no need to simulate the actual IOMMU
hardware circuitry external to the core. A second conclusion rests on the understanding
that throughput is proportional to 1/C. If C is high (right-hand side of Figure 3.8),
incrementally improving it would make little difference. The required change must be
significant enough to bring C to the proximity of Cnone .

3.4 Design

a) struct rDEVICE {
     u16 size;
     rRING rings[size];
   };

b) struct rRING {
     u18 size;
     rPTE ring[size];
     u18 tail;     // SW only
     u18 nmapped;  // SW only
   };

c) struct rPTE {
     u64 phys_addr;
     u30 size;
     u02 dir;
     u01 valid;
     u31 unused;
   }; // = 128bit

d) struct rIOVA {
     u30 offset;
     u18 rentry;
     u16 rid;
   }; // = 64bit
e)
struct rIOTLB_entry {
u16 bdf;
u16 rid;
u18 rentry;
rPTE rpte;
rPTE next;
};

Figure 3.9: The rIOMMU data structures. e) is used only by hardware. The last two
fields of rRING are used only by software.

Our goal is to design rIOMMU, an efficient IOMMU for devices that employ rings. We
aim to substantially reduce all IOVA-related overheads: (de)allocation, insertion/deletion
to/from the page table hierarchy, and IOTLB invalidation (see Figure 3.7). We base our
design on the observation that ring semantics (3.2.3) dictate a well-defined memory
access order. The OS sequentially produces ring entries, one after the other. And the
I/O device sequentially consumes these entries, one after the other, making its DMA
pattern predictable.
We contend that the x86 hierarchical structure of page tables is poorly suited for
the ring model. For each DMA, the OS has to walk the hierarchical page table in order
to map the associated IOVA. Then, the device faults on the IOVA and so the IOMMU
must walk the table too. Shortly after, the OS has to walk the table yet again in order
to unmap the IOVA. Contributing to this overhead are the aforementioned memory
barriers and cacheline flushes required for propagating page table changes.
In a nutshell, we propose to replace the table hierarchy with a per-ring flat page
table (1D array) as shown in Figure 3.10. IOVAs would constitute indices of the array,

Figure 3.10: rIOMMU data structures for IOVA translation.

thus eliminating IOVA (de)allocation overheads. Not having to walk a hierarchical
structure would additionally reduce the table walk cost. In accordance with Figure 3.9e,
we propose an IOTLB that holds at most one entry per ring, such that each table walk
removes the previous IOVA translation. Consequently, given a burst of unmaps, only
the last IOVA in the sequence requires explicit invalidation. We further discuss this
point later on.
We next describe the rIOMMU design in detail. There are several ways to realize
the rIOMMU concept, and our description should be viewed as an example. Contrary
to the baseline IOMMU, which provides protection at page granularity, our rIOMMU
facilitates protection of any specified size.

Data Structures Figure 3.9 defines the rIOMMU data structures. The rDEVICE
(Figure 3.9a) is to the rIOMMU what the root page table is to the baseline IOMMU.
It is uniquely associated with a bus-device-function (bdf) triplet and is pointed to by
the context table (Figure 4.1). As noted, each DMA carries with it a bdf, allowing the
rIOMMU to find the corresponding rDEVICE when needed. The rDEVICE consists of
a physical pointer to an array of rRING structures (Figure 3.9b) and a matching size.
Each rRING entry represents a flat page table. It likewise contains the table's physical
address and size. The OS associates with each rRING: (1) a tail pointing to the next
entry to be allocated in the flat table, and (2) the current number of valid mappings
in the table. The latter two are not architected and are unknown to the rIOMMU
hardware. We include them in rRING to simplify the description.


u64 rtranslate(u16 bus_dev_func, rIOVA iova, u2 dir) {


rIOTLB_entry e = riotlb_find( bus_dev_func, iova.rid );
if( ! e ) {
e = rtable_walk( bus_dev_func, iova );
riotlb_insert( e );
}
if( e.rentry != iova.rentry )
riotlb_entry_sync( bus_dev_func, iova, e );
if( iova.offset >= e.rpte.size || ! (e.rpte.dir & dir) )
io_page_fault();
return e.rpte.phys_addr + iova.offset;
}

void riotlb_entry_sync(u16 bus_dev_func, rIOVA iova, rIOTLB_entry e) {


rDEVICE d = get_domain( bus_dev_func );
u18 next = (e.rentry + 1) % d.rings[e.rid].size;

if( e.next.valid && (iova.rentry == next) ) {
e.rpte = e.next; e.rentry = next;
e.next.valid = 0;
} else
e = rtable_walk( bus_dev_func, iova );
rprefetch( d, e );
}

rIOTLB_entry rtable_walk(u16 bus_dev_func, rIOVA iova)


{
rDEVICE d = get_domain( bus_dev_func );
if( iova.rid >= d.size ||
iova.rentry >= d.rings[iova.rid].size ||
! d.rings[iova.rid].ring[iova.rentry].valid )
io_page_fault();

rIOTLB_entry e;
rRING r = d.rings[iova.rid];
e.bdf = bus_dev_func;
e.rid = iova.rid;
e.rentry = iova.rentry;
e.rpte = r.ring[e.rentry]; // copy
rprefetch( d, e );
return e;
}

// async
void rprefetch(rDEVICE d, rIOTLB_entry e) {
rRING r = d.rings[e.rid];
u18 next = (e.rentry + 1) % r.size;
if( r.size > 1 && r.ring[next].valid )
e.next = r.ring[next]; // copy
}

Figure 3.11: Outline of the rIOMMU logic. All DMAs are carried out with IOVAs
that are translated by the rtranslate routine.


rIOVA map(rDEVICE d, u16 rid, u64 pa, u30 size, u2 direction)


{
rRING r = d.rings[rid];
locked { if( r.nmapped == r.size ) return OVERFLOW;
u18 t = r.tail;
r.tail = (r.tail + 1) % r.size;
r.nmapped++; }

r.ring[t].phys_addr = pa;
r.ring[t].size = size;
r.ring[t].dir = direction;
r.ring[t].valid = 1;
sync_mem( & r.ring[t] );
return pack_iova( 0/*offset*/, t/*rentry*/, rid );
}

void unmap(rDEVICE d, rIOVA iova, bool end_of_burst) {


rRING r = d.rings[iova.rid];
r.ring[iova.rentry].valid = 0;
locked { r.nmapped--; }
sync_mem( & r.ring[iova.rentry] );
if( end_of_burst )
riotlb_invalidate( bus_dev_func(d), iova.rid );
}

void sync_mem(void * line) {


if( ! riommu_pt_is_coherent() ) {
memory_barrier();
cache_line_flush( line );
}
memory_barrier();
}

Figure 3.12: Outline of the rIOMMU OS driver, implementing map and unmap, which
respectively correspond to Figures 3.4 and 3.6.

Each ring buffer of the I/O device is associated with two rRINGs in the rDEVICE
array. The first corresponds to IOVAs pointing to the device ring buffer (Step 1 of
Figure 3.5 for translating the head register). The second corresponds to IOVAs that
the device finds within its ring descriptors (Step 5 in Figure 3.5 for translating target
buffers). The IOVAs that reside in the first flat table are mapped as part of the I/O
device initialization. They will be unmapped only when the device is brought down, as
the device rings are always accessible to the device. IOVAs residing in the second flat
table are associated with DMA target buffers; they are mapped/unmapped repeatedly
and are valid only while their DMA is in flight.
The flat table pointed to by rRING.ring is an array of rPTE structures (Figure 3.9c).
An rPTE consists of the physical address and size associated with the corresponding
IOVA; two bits that specify the DMA direction, which can be from the device, to it, or
both; and a bit that indicates whether the rPTE (and thus the corresponding IOVA)
are valid. The physical address need not be page aligned and the associated size can
have any value, allowing for fine-grained protection.
The rIOVA structure (Figure 3.9d) defines the format of IOVAs. As noted, every
DMA has a bdf that uniquely identifies its rDEVICE. The rIOVA.rid (ring ID) serves as
an index to the corresponding rDEVICE.rings array, and thus it uniquely identifies the
rRING of the rIOVA. Likewise, rIOVA.rentry serves as an index to the rRING.ring array,
and thus it uniquely identifies the rPTE of the rIOVA. The target address of the rIOVA
is computed by adding rIOVA.offset to rPTE.phys_addr.
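
One possible bit-level encoding, consistent with the rIOVA layout of Figure 3.10 (rid in bits 63-48, rentry in bits 47-30, offset in bits 29-0), is sketched below; the helper names other than pack_iova (which appears in Figure 3.12) are our own.

#include <stdint.h>

typedef uint64_t rIOVA;

/* Pack the three rIOVA fields into one 64-bit value (cf. Figures 3.9d and 3.10). */
static rIOVA pack_iova(uint32_t offset /*30 bits*/, uint32_t rentry /*18 bits*/, uint16_t rid /*16 bits*/)
{
    return ((uint64_t)rid << 48) |
           (((uint64_t)rentry & 0x3ffffu) << 30) |
           (offset & 0x3fffffffu);
}

/* Field extraction; translation then amounts to one flat-table lookup plus an add:
 *   phys = rings[rid].ring[rentry].phys_addr + offset                            */
static uint32_t riova_offset(rIOVA v) { return (uint32_t)(v & 0x3fffffffu); }
static uint32_t riova_rentry(rIOVA v) { return (uint32_t)((v >> 30) & 0x3ffffu); }
static uint16_t riova_rid(rIOVA v)    { return (uint16_t)(v >> 48); }
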
The data structures discussed so far are used by both software and hardware. They
are setup by the OS and utilized by the rIOMMU to translate rIOVAs. The last
one (Figure 3.9e) is a hardware-only structure, representing one rIOTLB entry. The
combination of its first two fields (bdf+rid) uniquely identifies a rRING flat page table,
which we denote as T. The rIOTLB utilizes at most one rIOTLB entry per T. The
combination of the first three fields (bdf+rid+rentry) uniquely identifies T's current
rPTE, the PTE associated with the most recently translated rIOVA that belongs to T.
The current rPTE is cached by rIOTLB_entry.rpte (holds a copy). The rIOTLB_entry.next
field may or may not contain a prefetched copy of T's subsequent rPTE. (Our design
does not depend on the latter field and works just as well without it.)

Hardware The rtranslate routine (Figure 3.11 top/left) outlines how rIOMMU trans-
lates a rIOVA to a physical address. First, it searches for e, the rIOTLB entry of the
rRING that is associated with the rIOVA. (Recall that there is only one such entry per
rRING.) If e is missing from the rIOTLB, rIOMMU walks the table using the data
structures defined above, finds the rPTE, and inserts to the rIOTLB a matching entry.
Doing the table walk ensures that e.rpte is the rPTE of the given rIOVA. However, if e
was initially found in the rIOTLB, then e and the rIOVA might mismatch. rIOMMU
therefore compares the rentry numbers of e and the IOVA, and it updates e if they are
different. Now that e is up-to-date, rIOMMU checks that the direction of the DMA is
permitted according to the rPTE. It also checks that the offset of the IOVA is in range,
namely, smaller than the associated rPTE.size. Violating these conditions constitutes an
error, causing rIOMMU to trigger an I/O page fault (IOPF). IOPFs are not expected
to occur (drivers pin target buffers to memory), and OSes typically reinitialize the I/O
device if they do. If no violation is detected, rIOMMU finally performs the translation
by adding the offset of the IOVA to rPTE.phys_addr.
The rtable_walk routine (Figure 3.11 top/right) ensures that the rIOVA complies with
the rIOMMU data structure limits as well as points to a valid rPTE. Noncompliance might
be the result of, e.g., an errant DMA or a buggy driver. After validation, rtable_walk
initializes the rIOTLB entry in a straightforward manner based on the rIOVA and its
rPTE. It additionally attempts to prefetch the subsequent rPTE by invoking rprefetch
(Figure 3.11 bottom/right), which succeeds if the next rPTE is valid. Prefetching can
be asynchronous.
The riotlb_entry_sync routine (Figure 3.11 bottom/left) is used by rtranslate to syn-
chronize e (the rIOTLB entry) with the current IOVA. The two become unsynchronized,
e.g., whenever the device handles a new DMA descriptor. The required rPTE can then
be found in e.next if prefetching was previously successful, in which case the routine
assigns e.next to e.rpte. Otherwise, it uses rtable_walk to fetch the needed rPTE. Finally,
it attempts to prefetch the subsequent rPTE.

Software The (un)map functions comprising the rIOMMU OS driver are shown in
Figure 3.12. Their prototypes are logically similar to the associated Linux functions
from the baseline IOMMU OS driver (Figures 3.4 and 3.6), with minor adjustments.
The map flow corresponds to Figure 3.4. It gets a device, a ring ID, a physical address
to be mapped, and the associated size and direction of the DMA. The first part of the
code allocates a ring entry rPTE at the ring's tail and then updates the tail/nmapped
fields accordingly. This allocation, which consists of incrementing two integers, is
analogous to the costly IOVA allocation of baseline Linux.
The second part of map initializes the newly allocated rPTE. When the rPTE is ready,
the map function invokes sync_mem, which ensures that the rPTE memory updates are
visible to the rIOMMU. This part of the code is analogous to walking and updating
the page table hierarchy of the baseline IOMMU, but it is simpler since the page table
is flat. The return statement of the map function packs the rentry index and its ring
ID into an IOVA as dictated by the rIOVA data structure (Figure 3.9d). The offset is
always set to be 0 by the rIOMMU driver. Callers of map can later manipulate the
offset as they please, provided they conform to the size constraint encoded into the
corresponding rPTE.
The flow of unmap (Figure 3.12/right) corresponds to Figure 3.6. Unmap gets an
rIOVA, marks the associated rPTE as invalid (analogous to walking the table hierarchy),
decrements the ring's nmapped counter (analogous to IOVA deallocation), and synchro-
nizes the memory to make the rPTE update visible to the rIOMMU. Recall that when
the device driver is notified that its device has finished some DMAs, it loops through
the relevant descriptors and sequentially unmaps their IOVAs (3.2.3). The driver sets
the end_of_burst parameter of unmap to true at the end of this loop, upon the last
IOVA. One invalidation is sufficient because, by design, each rRING has at most one
rIOTLB entry allocated in the rIOTLB.
In our experimental testbed, our measurements indicate that the average loop length
of a throughput-sensitive workload such as Netperf is 200 iterations. This is long
enough to make the amortized cost of IOTLB invalidations negligible, as in the deferred
mode, but without sacrificing safety. Amortization, however, does not apply to latency-
sensitive workloads. Nonetheless, the invalidation cost is small in comparison to the
overall latency as will shortly be demonstrated.
Finally, we consider the problem of synchronizing the memory between the IOMMU
and its driver. In sync_mem (Figure 3.12 bottom/right), we see support for two hardware
modes, corresponding to whether the IOMMU table walk is coherent with the CPU
caches. The baseline Linux kernel queries the relevant IOMMU capability bit. If it
finds that the two are not in the same coherency domain, it introduces an additional
memory barrier followed by a cacheline flush. In the following section, we experimentally
evaluate two simulated rIOMMU versions corresponding to these two modes.

Limitations Let R be a ring of an I/O device. Let D be the number of DMA
descriptors comprising R. Let L be the maximal number of R's live IOVAs whose
DMAs are currently in flight. And let N be the size of the associated rRING. N is
set by the device driver upon startup. Optimally, N ≥ L, or else the driver would
experience overflow (2nd line of map in Figure 3.12). While suboptimal, overflow is
legal as with other devices employing rings; it just means that the driver must slow
down. D is typically hundreds or a few thousands. In some I/O devices, each descriptor
can hold only a constant number of IOVAs (K), in which case setting N = D · K
would prevent overflow. Some devices support scatter-gather lists, whose K might be
large or theoretically unbounded. Developers of device drivers must therefore make a
judicious decision regarding N based on their domain-specific knowledge about L. (In
our experiments, L was at most 8K for all rings.) Alternatively, developers can opt
for using the baseline IOMMU. Importantly, there are devices for which rIOMMU is
unsuitable, notably NICs that implement remote direct memory access (RDMA). We
therefore do not propose to replace the baseline IOMMU, but only to supplement it.

3.5 Evaluation

3.5.1 Methodology

Simulating rIOMMU We experimentally evaluate the seven IOMMU modes defined
in 3.3–3.4: (1) strict, which is the completely safe Linux baseline; (2) strict+, which

enhances strict with our faster IOVA allocator; (3) defer, which is the Linux variant that
trades off some protection for performance by batching IOTLB invalidations; (4) defer+,
which is defer with our IOVA allocator; (5) riommu- (in lowercase), which is the newly
proposed rIOMMU when assuming no I/O page table coherency; (6) riommu, which
does assume coherent I/O page tables; and (7) none, which turns off the IOMMU.
The five non-rIOMMU modes are executed as is. They constitute full implementa-
tions of working systems and do not require a simulation component. To simulate the
two rIOMMU modes, we start with the none mode as the baseline. We then supplement
the baseline with calls to the (un)map functions, similarly to the way they are called in
the non-simulated IOMMU-enabled modes. But instead of invoking the native functions
of the Linux IOMMU driver (Figures 3.4 and 3.6), we invoke the (un)map functions
that we implement in the simulated rIOMMU driver (Figure 3.12). All the code of
the rIOMMU driver can be, and is, executed, with one exception. Since there is no
real rIOTLB, we must simulate the invalidation of rIOTLB entries. We do so by busy
waiting for 2,150 cycles upon each entry invalidation, in accordance with the measurements
specified in Table 3.1.
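
One way to implement this busy wait is shown below; it is a sketch of our simulation hook, using the x86 time-stamp counter (__rdtsc) to burn approximately the measured invalidation cost.

#include <x86intrin.h>
#include <stdint.h>

#define RIOTLB_INV_CYCLES 2150ULL   /* per-entry invalidation cost (Table 3.1) */

/* Stand-in for riotlb_invalidate (Figure 3.12) in the simulated rIOMMU modes. */
static void simulate_riotlb_invalidate(void)
{
    uint64_t start = __rdtsc();
    while (__rdtsc() - start < RIOTLB_INV_CYCLES)
        ;   /* spin for roughly as long as a real IOTLB entry invalidation */
}
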
Notice that our methodology does not account for differences between the existing
and proposed IOMMU translation mechanism. Namely, we only account for actions
shown in Figures 3.4 and 3.6 but not those in Figure 3.5. Notably, we ignore the fact
that the IOMMU works harder than the rIOMMU due to IOTLB misses that rIOMMU
avoids via prefetching. We likewise ignore the fact that rIOMMU works harder than the
no-IOMMU mode, since it translates addresses whereas the no-IOMMU mode does not.
We ignore these differences, as the model validated in 3.3.3 shows that throughput
is entirely determined by the number of cycles it takes the core (not the device or
the IOMMU) to process a DMA request, even for the most demanding I/O-intensive
workloads. The system behaves this way probably because the device and IOMMU
operate in parallel to the CPU and are apparently fast enough so as not to constitute a
bottleneck.
We revalidate our methodology and show that it is also applicable for latency-sensitive
workloads by using the standard Netperf UDP request-response (RR) benchmark, which
repeatedly sends one byte to its peer and waits for an identical response. We run RR
under two IOMMU modes: hardware pass-through (HWpt) and software pass-through
(SWpt). With HWpt, the IOMMU is enabled but never experiences IOTLB misses;
instead, it translates each IOVA to an identical physical address without consulting
any page table. SWpt provides an equivalent functionality by using a page table that
maps the entire physical memory and associates each physical page address with an
identical IOVA. Under SWpt, Netperf RR experiences an IOTLB miss on every packet
it sends and receives. Nonetheless, we find that the performance of HWpt and SWpt
is identical, because the network stack and interrupt processing introduce far greater
latencies that hide the IOTLB miss penalty. Moreover, we find that the RR performance
of HWpt/SWpt is identical to that of no-IOMMU.

Throughput performance of Netperf stream with HWpt and SWpt is smaller by
10% relative to no-IOMMU. But here too the difference is entirely caused by the core:
about 200 CPU cycles spent on unrelated kernel abstraction code that executes under
HWpt/SWpt but not under no-IOMMU.

Experimental Setup In an effort to get more general results, we conduct the evalu-
ation using two setups involving two different NICs, as follows.
The Mellanox setup (mlx for short) is comprised of two identical Dell PowerEdge
R210 II Rack Server machines that communicate through Mellanox ConnectX3 40Gbps
NICs. The NICs are connected back to back via a 40Gbps optical fiber and are configured
to use Ethernet. We use one machine as the server and the other as a workload generator
client. Each machine has an 8GB 1333MHz memory and a single-socket 4-core Intel Xeon
E3-1220 CPU running at 3.10GHz. The chipset is Intel C202, which supports VT-d,
Intel's Virtualization Technology that provides IOMMU functionality. We configure
the server to utilize one core only and turn off all power optimizations, namely sleep states
(C-states) and dynamic voltage and frequency scaling (DVFS), to avoid reporting
artifacts caused by nondeterministic events. The machines run Ubuntu 12.04 with the
Linux 3.4.64 kernel. All experimental findings described thus far were obtained with
the mlx setup.
The Broadcom setup (brcm for short) is similar, likewise utilizing two R210 machines.
The differences are that the two machines communicate through Broadcom NetXtreme
II BCM57810 10GbE NICs (connected via a CAT7 10GBASE-T cable for fast Ethernet);
that they are equipped with 16GB memory; and that they run the Linux 3.11.0 kernel.
The mlx and brcm device drivers differ substantially. Notably, mlx utilizes more
ring buffers and allocates more IOVAs (we observed a total of 12K addresses for mlx
and 3K for brcm). The mlx driver uses two target buffers per packet (header and
body) and thus two IOVAs, whereas the brcm driver allocates only one buffer/IOVA
per packet.

Benchmarks To drive our experiments we utilize the following benchmarks:

1. Netperf TCP stream [42] is a standard tool to measure networking performance
in terms of throughput. It maximizes the amount of data sent over one TCP
connection, simulating an I/O-intensive workload. Its default message size is
16KB. This is the application we used in 3.3.

2. Netperf UDP RR (request-response) is the second canonical configuration of
Netperf. As noted, it models a latency-sensitive workload by repeatedly exchanging
one byte messages in a ping-pong manner. The per message latency can be
calculated as the inverse of the number of messages per second (which we show
later on).

[Figure 3.13 data: for each benchmark (netperf stream [Gbps], netperf rr [req/sec], apache 1MB [req/sec], apache 1KB [req/sec], memcached [req/sec]), the figure plots throughput (higher is better) and CPU [%] (lower is better) under the strict, strict+, defer, defer+, riommu-, riommu, and none modes.]
Figure 3.13: Absolute performance numbers of the IOMMU modes when using the
Mellanox (top) and Broadcom (bottom) NICs.

3. Apache [26, 27] is a popular HTTP web server. We drive it using ApacheBench
[8], the workload generator distributed with Apache. It measures the number of
requests per second that the server is capable of handling by requesting a static
page of a given size. We run it on the client machine configured to generate
32 concurrent requests. We use two instances of the benchmark, respectively
requesting a smaller (1KB) and a bigger (1MB) file.

4. Memcached [28] is a high-performance in-memory key-value storage server. It is
often used to cache slow disk queries. We used the Memslap benchmark [2] (part of
the libmemcached client library), which runs on the client machine and measures
the completion rate of the requests that it generates. By default, Memslap
generates a workload comprised of 90% get and 10% set operations, with 64B and
1KB key and value sizes, respectively. It too is set to use 32 concurrent requests.

3.5.2 Results

Throughput:
                    riommu- divided by                    riommu divided by
NIC   benchmark     strict strict+ defer defer+ none      strict strict+ defer defer+ none
mlx   stream          5.12   2.90   2.57   1.74  0.52       7.56   4.28   3.79   2.57  0.77
      rr              1.23   1.07   1.05   1.02  0.95       1.25   1.09   1.07   1.03  0.96
      apache 1M       5.30   1.62   1.58   1.20  0.76       5.80   1.77   1.73   1.31  0.83
      apache 1K       2.32   1.08   1.07   1.03  0.92       2.32   1.08   1.07   1.03  0.92
      memcached       4.77   1.17   1.25   1.03  0.82       4.88   1.19   1.28   1.05  0.83
brcm  stream          2.17   1.00   1.00   1.00  1.00       2.17   1.00   1.00   1.00  1.00
      rr              1.19   1.05   1.04   1.02  0.99       1.21   1.06   1.05   1.03  1.00
      apache 1M       1.20   1.01   1.00   1.00  1.00       1.20   1.01   1.00   1.00  1.00
      apache 1K       1.24   1.13   1.08   1.02  0.89       1.29   1.18   1.13   1.07  0.93
      memcached       1.76   1.35   1.18   1.10  0.78       1.88   1.45   1.27   1.18  0.84

CPU:
                    riommu- divided by                    riommu divided by
NIC   benchmark     strict strict+ defer defer+ none      strict strict+ defer defer+ none
mlx   stream          1.00   1.00   1.00   1.00  1.00       1.00   1.00   1.00   1.00  1.00
      rr              0.94   0.99   0.98   0.99  1.01       0.93   0.98   0.96   0.98  1.00
      apache 1M       0.99   0.99   1.00   1.00  1.00       0.99   0.99   0.99   1.00  1.00
      apache 1K       0.99   1.00   1.00   1.00  1.00       0.99   1.00   1.00   1.00  1.00
      memcached       1.00   1.00   1.00   1.00  1.00       1.00   1.00   1.00   1.00  1.00
brcm  stream          0.40   0.50   0.64   0.81  1.21       0.36   0.45   0.58   0.73  1.09
      rr              0.86   0.96   0.96   1.00  1.11       0.84   0.93   0.93   0.98  1.08
      apache 1M       0.48   0.49   0.60   0.75  1.41       0.41   0.42   0.52   0.65  1.22
      apache 1K       0.99   0.99   0.99   1.00  1.00       0.99   1.00   1.00   1.00  1.00
      memcached       1.00   1.00   1.00   1.00  1.00       1.00   1.00   1.00   1.00  1.00

Table 3.2: Relative performance numbers.

We run each benchmark 100 times. Each individual run is configured to take 10
seconds. We treat the first 10 runs as warmup and report the average of the remaining
90 runs. Figure 4.10 shows the resulting throughput and CPU consumption for the
mlx (top) and brcm (bottom) setups. The corresponding normalized performance is
shown in Table 3.2, specifying the relative improvement of the two rIOMMU variants
over the other modes. The top/left plot in Figure 4.10 corresponds to the analysis and
data shown in Figures 3.7–3.8.
Let us discuss the results in Figure 4.10, left to right. The greatest observed
improvement by rIOMMU is attained with mlx / Netperf stream (Figure 4.10/top/left).
This result is to be expected considering the model from 3.3.3 showing that every
cycle shaved off the IOVA (un)mappings translates into increased throughput. CPU
cycles constitute the bottleneck resource, as is evident from the mlx/stream/CPU curve,
which is at 100% for all IOMMU modes. The notable difference between riommu- and
riommu is due to 1.1K cycles that the former adds to the latter, which is the cost
of four additional memory barriers and four additional cacheline flushes, per packet.
(Specifically, a barrier and a cacheline flush in both map and unmap for two IOVAs
corresponding to the packet's header and data.) Riommu- and riommu provide 2.90–7.56x
higher throughput relative to the completely safe IOMMU modes strict and
strict+, and 1.74–3.79x higher throughput relative to the deferred modes. The latter,
however, does not constitute an apples-to-apples comparison, since the deferred modes
are vulnerable whereas the rIOMMU modes are safe. Riommu- and riommu deliver
0.52x and 0.77x of the throughput of the unprotected, no-IOMMU optimum, respectively.
The brcm/stream results (Figure 4.10/bottom/left) are quantitatively and qualita-
tively different. In particular, all IOMMU modes except strict have enough cycles to
saturate the Broadcom NIC and achieve its line-rate, which is 10 Gbps. The brcm setup
requires fewer cycles per packet because its device driver is more efficient, e.g., due to
utilizing only one IOVA per packet instead of two. In setups of this type, where the
network is saturated, the performance metric of interest becomes the CPU consumption.

NIC strict strict+ defer defer+ riommu- riommu none
mlx 17.3 15.1 14.9 14.4 14.1 13.9 13.4
brcm 41.9 36.7 36.6 35.8 35.1 34.7 34.6

Table 3.3: Netperf RR round-trip time in microseconds.



By Table 3.2, we can see that riommu and riommu- consume 0.36–0.50x the CPU
cycles of the two strict modes; 0.58–0.81x the cycles of the deferred modes; and
1.09x and 1.21x the cycles of the no-IOMMU optimum, respectively.
The improvement by rIOMMU is less pronounced when running RR, in both mlx
and brcm, with 1.02–1.25x higher throughput and 0.84–1.00x the CPU consumption
relative to the strict and deferred variants. It is less pronounced due to RR's ping-pong
nature, which implies that CPU cycles are in low demand, as indicated by the CPU
curves at 28–30% for mlx and at 12–15% for brcm. For this reason, in comparison
to mlx/RR/none, rIOMMU has 4–5% lower throughput and nearly identical CPU
consumption. In comparison to brcm/RR/none, rIOMMU has 8–11% higher CPU
consumption and nearly identical throughput. Although the per-packet processing
time at the core is smaller in brcm, overall, the mlx hardware transmits packets faster,
as indicated by its higher RR throughput. The corresponding round-trip time of the
different modes (which, as noted, is the inverse of the throughput in RRs case) is shown
in Table 3.3.
The results of Apache 1MB are qualitatively identical to those of Netperf stream,
because the benchmark transmits a lot of data per request and is thus throughput
sensitive. Conversely, Apache 1KB is not throughput sensitive. Its smaller 1KB requests
make the performance of mlx and brcm look remarkably similar despite their networking
infrastructure difference. In both cases, the bottleneck is the CPU, while the volume of
the transmitted data is only a small fraction of the NICs capacity. (Both deliver 12K
requests per second of 1KB files, yielding a transfer rate of 0.1 Gbps.) This is because
Apache requires heavy processing for each http request. This overhead is amortized
over hundreds of packets in the case of Apache 1MB, but over only one packet in the
case of 1KB. Consequently, the computational processing dominates the throughput of
Apache 1KB, and so the role of the networking infrastructure is marginalized. Even so,
rIOMMU demonstrates a 1.24x and 2.32x throughput improvement over brcm/strict
and mlx/strict, respectively. It is up to 1.18x higher relative to the other IOMMU-
enabled modes, and about 0.9x that of the unprotected optimum.2
The network activity of Apache 1KB is somewhat similar to that of the Memcached
benchmark, because both are configured with 32 concurrent requests, both receive
queries comprised of a few dozens of bytes (file name or key item), and both transmit
1KB responses (file content or data item). The difference is that the Memcached internal
2
We note in passing that our Apache 1KB throughput results coincide with that of Soares et al. [58],
who reported a latency of 22ms for 256 concurrent requests, which translates to 1000/22 × 256 ≈ 12K
requests/second.

logic is simpler, as its purpose is merely to serve as an in-memory LRU cache. For
this reason, it achieves an order of magnitude higher throughput relative to Apache
1KB.3 The shorter per-request processing time makes the differences between the
IOMMU configurations more pronounced, with rIOMMU throughput that is 1.17–4.88x
higher than the completely safe modes, 1.03–1.28x higher than the deferred modes, and
0.78–0.84x that of the optimum.

3.5.3 When IOTLB Miss Penalty Matters

Our experiments thus far indicated that using the IOMMU affects performance because
it forces the OS to spend CPU cycles on creating and destroying IOVA mappings.
We were unable to measure the overhead caused by the actual IOMMU translation
activity of walking the page tables upon an IOTLB miss (Figures 4.1 and 3.5). In Section 4.6.1,
we attributed this inability to the substantially longer latencies induced by interrupt
processing and the TCP/IP stack. In Table 3.3, we specified the round-trip latencies,
whose magnitude (13–42 µs) seems to suggest that the occasional cost of 4 memory
references per table walk is negligible in comparison.
There are, however, high performance environments that enjoy lower latencies on the
order of a µs [22, 29, 55, 61], which is required, e.g., where a fraction of a microsecond
can make a difference in the value of a transaction [1]. User-level I/O, for example,
might permit applications to (1) utilize raw Ethernet packets to eliminate TCP/IP
overheads, and to (2) poll the I/O device to eliminate interrupt delays.
With the help of the ibverbs library [38, 47], we established such a configuration
on top of the mlx setup. We ran two experiments. The first iteratively and randomly
selects a buffer from a large pool of previously mapped buffers and transmits it, thus
ensuring that the probability for the corresponding IOVA to reside in the IOTLB is
low. The second experiment does the same but with only one buffer, thus ensuring
that the IOTLB always hits. The latency difference, which is the cost of an IOTLB
miss, was 0.3 µs (1013 cycles on average); we believe it is reasonable to assume
that it approximates the benefit of using rIOMMU over the existing IOMMU in high
performance environments of this type.
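To make the methodology concrete, the following sketch outlines the measurement loop. The post_and_wait() routine is a placeholder for the actual ibverbs transmission (posting a send work request for a pre-mapped buffer and busy-polling its completion), and the pool size and iteration count are illustrative, not the values used in our experiment.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>                      /* __rdtsc() */

#define POOL_SIZE  4096                     /* many pre-mapped buffers => frequent IOTLB misses */
#define ITERATIONS 100000

/* Placeholder: in the real experiment this posts an ibverbs send work request
   for the pre-mapped buffer with the given index and polls its completion.   */
static void post_and_wait(int buffer_index)
{
    (void)buffer_index;
}

/* Average cycles per transmission when buffers are drawn randomly
   from a pool of 'pool_size' pre-mapped buffers.                   */
static uint64_t measure(int pool_size)
{
    uint64_t start = __rdtsc();
    for (int i = 0; i < ITERATIONS; i++)
        post_and_wait(rand() % pool_size);  /* pool_size == 1 => IOTLB always hits */
    return (__rdtsc() - start) / ITERATIONS;
}

int main(void)
{
    uint64_t miss = measure(POOL_SIZE);     /* IOTLB miss on (nearly) every send */
    uint64_t hit  = measure(1);             /* IOTLB hit on every send           */
    printf("approx. IOTLB miss penalty: %llu cycles\n",
           (unsigned long long)(miss - hit));
    return 0;
}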

3.5.4 Comparing to TLB Prefetchers

rIOMMU is not a prefetcher. Rather, it is a new IOMMU design that allows for
efficient IOVA (un)mappings while minimizing costly IOTLB invalidations, which is
unrelated to prefetching. But rIOMMU does have a prefetching component, since it loads
into the rIOTLB the next IOVA to be used ahead of time. While this component turned
out to be useful only in specialized setups (Section 3.5.3), it is still interesting to compare
this aspect of our work to previously proposed TLB prefetchers.
For lack of space, we only briefly describe the bottom line. We modified the IOMMU
layer of KVM/QEMU to log the DMAs that its emulated I/O devices perform. We
ran our benchmarks in a VM and generated DMA traces. We fed the traces to three
simulated TLB prefetchers: Markov [43], Recency [56], and Distance [46], as surveyed by
Kandiraju and Sivasubramaniam [45]. We found their baseline versions to be ineffective,
as IOVAs are invalidated immediately after being used. We modified them and allowed
them to store invalidated addresses, but mandated them to walk the page table and check
that their predictions are mapped before making them. Distance was still ineffective.
Recency and Markov, however, were able to predict most accesses, but only if the
number of entries comprising their history data structure grew larger than the ring. In
contrast, rIOTLB requires only two entries per ring, and its predictions are always
correct.

3.6 Related Work


The overhead of IOMMU mapping and unmapping is a well-known issue. Ben-Yehuda
et al. [15] showed that using the Calgary IOMMU can impose a 30% increase in CPU
utilization. In their work they proposed methods that can reduce the IOMMU mapping
layer overhead, yet these would require significant changes to existing device drivers.
Several studies focused on the overhead of virtual IOMMUs, yet most of their
approaches are applicable to native systems as well. Willmann et al. [64] showed that
sharing mappings among DMA descriptors can reduce the overhead without sacrificing
security, yet, as the authors admitted, the extent of the performance improvement is
workload-dependent and sometimes negligible. Other techniques, such as Willmann's
persistent mappings and validation of DMA buffers, Yassour et al.'s [66] mapping
prefetching and invalidation batching, and Amit et al.'s [4] asynchronous invalidations,
all improve performance at the cost of relaxed IOMMU protection.
Other research works addressed inefficiencies of the I/O virtual address space
allocator. Tomonori [60] proposed to enhance the allocator performance by managing
the I/O virtual address space using bitmaps instead of red-black trees. Cascardo [20]
showed that IOVA allocators suffer from lock contention, and that mitigating this contention
can significantly improve the performance of multicore systems. These studies
do not address the overhead associated with IOTLB invalidations and are therefore
orthogonal to our work.

Chapter 4

Efficient IOMMU Intra-Operating System Protection

4.1 Introduction

The role that the I/O memory management unit (IOMMU) plays for I/O devices is
similar to the role that the regular memory management unit (MMU) plays for processes.
Processes typically access the memory using virtual addresses translated to physical
addresses by the MMU. Likewise, I/O devices commonly access the memory via direct
memory access operations (DMAs) associated with I/O virtual addresses (IOVAs),
which are translated to physical addresses by the IOMMU. Both hardware units are
implemented similarly with a page table hierarchy that the operating system (OS)
maintains and the hardware walks upon an (IO)TLB miss.
The IOMMU can provide inter- and intra-OS protection [4, 62, 64, 66]. Inter-OS
protection is applicable in virtual setups. It allows for direct I/O, where the host
assigns a device directly to a guest virtual machine (VM) for its exclusive use, largely
removing itself from the guest's I/O path and thus improving its performance [30, 50].
In this mode, the VM directly programs device DMAs using its notion of (guest)
physical addresses. The host uses the IOMMU to redirect these accesses to where
the VM memory truly resides, thus protecting its own memory and the memory of the
other VMs. With inter-OS protection, IOVAs are mapped to physical memory locations
infrequently, typically only upon such events as VM creation and migration. Such
mappings are therefore denoted persistent or static [64].
Intra-OS protection allows the OS to defend against errant/malicious devices and
buggy drivers, which account for most OS failures [21, 59]. Drivers and their I/O
devices are able to perform DMAs to arbitrary memory locations, and IOMMUs allow
OSes to protect themselves by restricting these DMAs to specific physical memory
locations. Intra-OS protection is applicable in non-virtual setups where the OS has
direct control over the IOMMU. It is likewise applicable in virtual setups where IOMMU
functionality is exposed to VMs via paravirtualization [15, 50, 57, 64], full emulation
[4], or, recently, hardware support for nested IOMMU translation [3, 40]. In this mode,
IOVA (un)mappings are frequent and occur within the I/O critical path. The OS
programs DMAs using IOVAs rather than physical addresses, such that each DMA
is preceded and followed by the mapping and unmapping of the associated IOVA to
the physical address it represents [44, 52]. For this reason, such mappings are denoted
single-use or dynamic [17]. The context of this chapter is intra-OS protection; we
discuss it in more detail in Section 4.2.
To do its job, the intra-OS protection mapping layer must allocate IOVA values:
ranges of integer numbers that serve as page identifiers. IOVA allocation is similar
to regular memory allocation, but it is different enough to merit its own allocator.
One key difference is that regular allocators dedicate much effort to preserving locality
and to combating fragmentation, whereas the IOVA allocator disallows locality and
enjoys a naturally unfragmented workload. This difference makes the IOVA allocator
1–2 orders of magnitude smaller in terms of lines of code. Another key difference
is that, by default, the IOVA subsystem trades off some safety for performance. It
systematically delays the completion of IOVA deallocations while letting the OS believe
that the deallocations have already been processed. Specifically, part of freeing an
IOVA is purging it from the IOTLB such that the associated physical buffer is no longer
accessible to the I/O device. But invalidating IOTLB entries is a costly, slow operation.
So the IOVA subsystem opts for batching the invalidations until enough accumulate.
Then, it invalidates the entire IOTLB, en masse, thus reducing the amortized price.
This default mode is called deferred protection. Users can turn it off at boot time by
instructing the kernel to use strict protection. We discuss the IOVA allocator and its
protection modes in detail in Section 4.3.
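The batching scheme just described can be sketched as follows; the structure, names, and the value of the high-water mark W are illustrative stand-ins for the corresponding kernel machinery, not the actual Linux implementation.

#include <stdint.h>
#include <stddef.h>

#define HIGH_WATER_MARK 250   /* illustrative value of W */

struct deferred_entry { uint64_t iova; size_t size; };

static struct deferred_entry pending[HIGH_WATER_MARK];
static int npending;

/* Stubs standing in for the real operations. */
static void iotlb_flush_all(void)
{
    /* would issue one global IOTLB invalidation */
}
static void iova_release(uint64_t iova, size_t size)
{
    /* would return the IOVA range to the allocator */
    (void)iova; (void)size;
}

/* Deferred unmap: instead of invalidating the IOTLB entry immediately,
   queue the IOVA and flush the whole IOTLB once W unmaps accumulate.
   Until the flush, the device may still reach the buffer through a stale
   IOTLB entry, which is the safety cost of this mode.                   */
void deferred_unmap(uint64_t iova, size_t size)
{
    pending[npending].iova = iova;
    pending[npending].size = size;
    npending++;

    if (npending == HIGH_WATER_MARK) {
        iotlb_flush_all();                 /* one amortized, global invalidation */
        for (int i = 0; i < npending; i++) /* only now is it safe to recycle     */
            iova_release(pending[i].iova, pending[i].size);
        npending = 0;
    }
}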
Single-use mappings that stress the IOVA mapping layer are usually associated with
I/O devices that employ ring buffers in order to communicate with their OS drivers in
a producer-consumer manner. The ring buffer is a cyclic memory array whose entries
correspond to DMA requests that the OS initiates and the I/O device must fulfill. The
ring entries contain IOVAs that the mapping layer allocates and frees before and after
the associated DMAs are processed by the device. We carefully analyze the performance
of the IOVA mapping layer and find that its allocation scheme is efficient despite its
simplicity, but only if the device is associated with a single ring. Devices, however,
often employ more rings, in which case our analysis indicates that the IOVA allocator
seriously degrades the performance. We carefully study this deficiency and find that its
root cause is a pathology we call long-lasting ring interference. The pathology occurs
when I/O asynchrony prompts an event that confuses the allocator into migrating an
IOVA from one ring to another, henceforth repetitively destroying the contiguity of the
ring's I/O space upon which the allocator relies for efficiency. We conjecture that this
harmful effect remained hidden thus far because of the well-known slowness associated
with manipulating the IOMMU. The hardware took most of the blame for the high
price of intra-OS protection even though, as it turns out, software is equally guilty. We
demonstrate and analyze long-lasting ring interference in Section 4.4.


To resolve the problem, we introduce the EiovaR optimization (Efficient IOVA
allocatoR) to the kernel's mapping subsystem. In designing EiovaR, we exploit the
following two observations: (1) the workload handled by the IOVA mapping layer is
largely comprised of allocation requests for same-size ranges, and (2) since the workload
is ring-induced, the difference D between the cumulative number of allocation and
deallocation requests at any given time is proportional to the size of the ring, which
is relatively small. EiovaR is thus a simple, thin layer on top of the baseline IOVA
allocator that proxies all the (de)allocation requests. It caches previously freed ranges
and reuses them to quickly satisfy subsequent allocations. It is successful because the
requests are similar, it is frugal in terms of memory consumption because D is small,
and it is compact (implementation-wise) because it is mostly an array of free-lists with a
bit of logic. EiovaR entirely eliminates the baseline allocator's aforementioned reliance
on I/O space contiguity, ensuring that all (de)allocations are efficient. We describe EiovaR
and experimentally explore its interaction with strict and deferred protection in Section 4.5.
We evaluate the performance of EiovaR using micro- and macrobenchmarks and
different I/O devices. On average, EiovaR satisfies (de)allocations in about 100 cycles,
and it improves the throughput of the Netperf, Apache, and Memcached benchmarks by
up to 4.58x and 1.71x for strict and deferred protection, respectively. In configurations
that achieve the maximal throughput of the I/O device, EiovaR reduces the CPU
consumption by up to 0.53x. Importantly, EiovaR delivers strict protection with
performance that is similar to that of the baseline system when employing deferred
protection. We conduct the experimental evaluation of EiovaR in Section 4.6.

4.2 Intra-OS Protection


DMA refers to the ability of I/O devices to read from or write to the main memory
without CPU involvement. It is a heavily used mechanism, as it frees the CPU to
continue to do work between the time it programs the DMA and the time the associated
data is sent or received. As noted, drivers of devices that stress the IOVA mapping layer
initiate DMA operations via a ring buffer, which is a circular array in main memory
that constitutes a shared data structure between the driver and its device. Each entry
in the ring contains a DMA descriptor, specifying the address(es) and size(s) of the
corresponding target buffer(s); the I/O device will write/read the data to/from the
latter, at which point it will trigger an interrupt to let the OS know that the DMA has
completed. (Interrupts are coalesced if their rate is high.) I/O devices are commonly
associated with more than one ring, e.g., a receive ring denoted Rx for DMA read
operations, and a transmit ring denoted Tx for DMA write operations.
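To make the producer-consumer interaction concrete, the following user-space sketch models a receive ring of DMA descriptors; the descriptor layout, field names, and sizes are illustrative and do not correspond to any particular NIC or driver.

#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 256                    /* number of descriptors */

/* Illustrative DMA descriptor: the driver fills 'iova' and 'len'; the device
   marks 'done' (via a DMA write) once the buffer has been filled.            */
struct rx_desc {
    uint64_t iova;                       /* I/O virtual address of the target buffer */
    uint32_t len;                        /* size of the target buffer in bytes       */
    uint32_t done;                       /* completion flag written by the device    */
};

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    unsigned       next_to_use;          /* next entry the driver will arm  */
    unsigned       next_to_clean;        /* next entry the driver will reap */
};

/* Driver side: arm one descriptor with a freshly mapped IOVA. */
void ring_arm(struct rx_ring *r, uint64_t iova, uint32_t len)
{
    struct rx_desc *d = &r->desc[r->next_to_use % RING_SIZE];
    d->iova = iova;
    d->len  = len;
    d->done = 0;
    r->next_to_use++;
}

/* Driver side: reap one completed descriptor (returns 0 if none is ready);
   the returned IOVA would then be unmapped and the entry re-armed.          */
int ring_reap(struct rx_ring *r, uint64_t *iova_out)
{
    struct rx_desc *d = &r->desc[r->next_to_clean % RING_SIZE];
    if (!d->done)
        return 0;
    *iova_out = d->iova;
    r->next_to_clean++;
    return 1;
}

int main(void)
{
    static struct rx_ring ring;
    uint64_t iova;

    ring_arm(&ring, 0x1000, 1500);       /* pretend IOVAs for illustration          */
    ring_arm(&ring, 0x2000, 1500);
    ring.desc[0].done = 1;               /* simulate the device completing one DMA  */

    while (ring_reap(&ring, &iova))
        printf("packet received into IOVA 0x%llx\n", (unsigned long long)iova);
    return 0;
}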

Figure 4.1: IOVA translation using the Intel IOMMU. The 16-bit requester identifier
(bus, device, function) indexes the root and context tables, and the IOVA is then translated
by a 3-4 level page table hierarchy into a physical frame number (PFN) and page offset.

In the past, I/O devices had to use physical addresses in order to access the main
memory, namely, each DMA descriptor contained the physical address of its target buffer.
Such unmediated DMA activity directed at the memory makes the system vulnerable
to rogue devices performing errant or malicious DMAs [14, 19, 44, 65], or to buggy
drivers that might wrongfully program their devices to overwrite any part of the system
memory [13, 31, 50, 59, 63]. Subsequently, all major chip vendors introduced IOMMUs
[3, 11, 36, 40], alleviating the problem as follows.
The OS associates each DMA target buffer with some IOVA, which it uses instead
of the buffer's physical address when filling out the associated ring descriptor. The
I/O device is oblivious to this change; it processes the DMA using the IOVA as if it
were a physical address. The IOMMU circuitry then translates the IOVA to the physical
address of the target buffer, routing the operation to the appropriate memory location.
Figure 4.1 illustrates the translation process as performed by the Intel IOMMU, which
we use in this work. The PCI protocol dictates that each DMA operation is associated
with a 16-bit requester identifier comprised of a bus-device-function triplet, which is
uniquely associated with the corresponding I/O device. The IOMMU uses the 8-bit bus
number to index the root table and thus retrieve the physical address of the context table.
It then indexes the context table using the 8-bit concatenation of the device and function
numbers, yielding the physical location of the root of the page table hierarchy that
houses the device's IOVA translations. Similarly to the MMU, the IOMMU accelerates
translations using an IOTLB.
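The indexing just described boils down to simple bit arithmetic. The sketch below decodes a PCI requester identifier and an IOVA into the table indices of Figure 4.1; it assumes a hierarchy of 4KB pages with 9-bit indices per level (as in x86-64 paging), which is one possible IOMMU configuration, and the example values are arbitrary.

#include <stdint.h>
#include <stdio.h>

/* PCI requester ID: bus[15:8], device[7:3], function[2:0]. */
unsigned rid_bus (uint16_t rid) { return (rid >> 8) & 0xff; }
unsigned rid_dev (uint16_t rid) { return (rid >> 3) & 0x1f; }
unsigned rid_func(uint16_t rid) { return  rid       & 0x07; }

/* IOVA: with 4KB pages and 9-bit indices, level 0 is the leaf table. */
unsigned iova_index(uint64_t iova, int level)   /* level = 0..3 */
{
    return (iova >> (12 + 9 * level)) & 0x1ff;
}
unsigned iova_offset(uint64_t iova) { return iova & 0xfff; }

int main(void)
{
    uint16_t rid  = 0x0300;          /* bus 3, device 0, function 0 (example) */
    uint64_t iova = 0xffeff000;      /* an example 32-bit IOVA                */

    /* bus indexes the root table; (device,function) index the context table */
    printf("root idx=%u  context idx=%u\n",
           rid_bus(rid), (rid_dev(rid) << 3) | rid_func(rid));

    for (int level = 3; level >= 0; level--)
        printf("page-table idx (level %d) = %u\n", level, iova_index(iova, level));
    printf("page offset = %u\n", iova_offset(iova));
    return 0;
}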
The functionality of the IOMMU hierarchy is similar to that of the regular MMU: it
will permit an IOVA memory access to go through only if the OS previously inserted

a matching mapping between the IOVA and some physical memory address. The OS
can thus protect itself by allowing a device to access a target buffer just before the
corresponding DMA occurs (inserting a mapping), and by revoking access just after
(removing the mapping), exerting fine-grained control over what portions of memory
may be used in I/O transactions at any given time. This state-of-the-art strategy of
IOMMU-based protection was termed intra-OS protection by Willmann et al. [64]. It
is recommended by hardware vendors [32, 44], and it is used by operating systems
[10, 17, 37, 51]. For example, the DMA API of Linux, which we use in this study, notes
that DMA addresses should be mapped only for the time they are actually used and
unmapped after the DMA transfer [52].
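The following schematic fragment illustrates this map-before/unmap-after discipline using the Linux DMA API; the function, buffer, and descriptor-handling details are placeholders rather than code from any real driver, and the fragment compiles only in a kernel context.

#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Schematic transmit path: map the buffer just before handing it to the
   device, and unmap it as soon as the corresponding DMA completes.
   'dev', 'buf', 'len', and the descriptor posting step are placeholders. */
int xmit_one_buffer(struct device *dev, void *buf, size_t len)
{
    dma_addr_t iova;

    iova = dma_map_single(dev, buf, len, DMA_TO_DEVICE);   /* insert mapping */
    if (dma_mapping_error(dev, iova))
        return -ENOMEM;

    /* ... write 'iova' and 'len' into a ring descriptor and notify the
     *     device; the device DMAs the buffer using 'iova' ...            */

    /* ... upon the completion interrupt for this descriptor: ...         */
    dma_unmap_single(dev, iova, len, DMA_TO_DEVICE);        /* revoke mapping */
    return 0;
}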

4.3 IOVA Allocation and Mapping


The task of generating IOVAs, namely, the actual integer numbers that the OS assigns
to descriptors and that the devices then use, is similar to regular memory allocation. But
it is sufficiently different to merit its own allocator, because it optimizes for different
objectives and because it is required to make different tradeoffs, as follows:

Locality Memory allocators spend much effort in trying to (re)allocate memory chunks
in a way that maximizes reuse of TLB entries and cached content. The IOVA
mapping layer of the OS does the opposite. The numbers it allocates correspond
to whole pages, and they are not allowed to stay warm in hardware caches in
between allocations. Rather, they must be purged from the IOTLB and from
the page table hierarchy immediately after the DMA completes. Moreover, while
purging an IOVA, the mapping layer must flush each cache line that it modifies
in the hierarchy, as the IOMMU and CPU do not reside in the same coherence
domain.1

Fragmentation Memory allocators invest much effort in combating fragmentation,
attempting to eliminate unused memory holes and utilize the memory they
have before requesting the system for more. As we further discuss in Sections 4.4–4.5, it
is trivial for the IOVA mapping layer to avoid fragmentation due to the simple
workload that it services, which is induced by the circular ring buffers, and which
is overwhelmingly comprised of 2^j-page requests.

Complexity Simplicity and compactness matter and are valued within the kernel.
Not having to worry about locality and fragmentation while enjoying a simple
workload, the mapping layer allocation scheme is significantly simpler than regular
memory allocators. In Linux, it is comprised of only a few hundred lines of code
instead of thousands [48, 49] or tens of thousands [16, 33].
1 The Intel IOMMU specification documents a capability bit that indicates whether IOMMU and
CPU coherence can be turned on [40], but we do not own such hardware and believe it is not yet
common.

Safety & Performance Assume a thread T0 frees a memory chunk M, and then
another thread T1 allocates memory. A memory allocator may give M to T1, but
only after it processes the free of T0. Namely, it would never allow T0 and T1 to use
M together. Conversely, the IOVA mapping layer purposely allows T0 (the device)
and T1 (the OS) to access M simultaneously for a short period of time. The reason:
invalidation of IOTLB entries is costly [4, 64]. Therefore, by default, the mapping
layer trades off safety for performance by (1) accumulating up to W unprocessed
free operations and only then (2) freeing those W IOVAs and (3) invalidating
the entire IOTLB en masse. Consequently, target buffers are actively being used
by the OS while the device might still access them through stale IOTLB entries.
This weakened safety mode is called deferred protection. Users can instead employ
strict protection, which processes invalidations immediately, by setting a kernel
command line parameter.

Technicalities Memory allocators typically use the memory that their clients free to
store their internal data structures. (For example, a linked list of freed pages
where each next pointer is stored at the beginning of the corresponding page.)
The IOVA mapping layer cannot do that, because the IOVAs that it invents are
pointers to memory that is used by some other entity (the device or the OS).
An IOVA is just an additional identifier for a page, which the mapping layer
does not own. Another difference between the two types of allocators is that
memory allocators running on 64-bit machines use native 64-bit pointers. The
IOVA mapping layer prefers to use 32-bit IOVAs, as utilizing 64-bit addresses for
DMA would force a slower, dual address cycle on the PCI bus [17].

In accordance with the above, the allocation scheme employed by the Linux/x86
IOVA mapping layer is different from, and independent of, the regular kernel memory
allocation subsystem. The underlying data structure of the IOVA allocator is the generic
Linux kernel red-black tree. The elements of the tree are ranges. A range is a pair
of integer numbers [L, H] that represents a sequence of currently allocated I/O virtual
page numbers L, L+1, ..., H-1, H, such that L ≤ H stand for low and high,
respectively. The ranges in the tree are pairwise disjoint, namely, given two ranges
[L1, H1] ≠ [L2, H2], either H1 < L2 or H2 < L1.
Newly requested IOVA integers are allocated by scanning the tree right-to-left,
from the highest possible value downwards towards zero, in search of a gap that can
accommodate the requested range size. The allocation scheme attempts and, as we
will later see, ordinarily succeeds to allocate the new range from within the highest
gap available in the tree. The allocator begins to scan the tree from a cache node C
that it maintains, iterating from C through the ranges in a descending manner until a
suitable gap is found. C is maintained such that it usually points to a range that is
higher than (to the right of) the highest free gap, as follows. When (1) a range R is
freed and C currently points to a range lower than R, then C is updated to point to
struct range_t { int lo, hi; };

/* Allocate a range of 'rngsiz' pages by scanning the tree right-to-left,
   starting at the cache node, for the first gap that is big enough. */
range_t alloc_iova(rbtree_t t, int rngsiz) {
    range_t new_range;
    rbnode_t right = t.cache;
    rbnode_t left  = rb_prev(right);
    while (right.range.lo - left.range.hi <= rngsiz) {
        right = left;
        left  = rb_prev(left);
    }
    new_range.hi = right.range.lo - 1;
    new_range.lo = right.range.lo - rngsiz;
    t.cache = rb_insert(t, new_range);
    return new_range;
}

/* Free a range; if it lies at or above the cache node, advance the cache
   node to the freed node's successor so future scans start to its right. */
void free_iova(rbtree_t t, rbnode_t d) {
    if (d.range.lo >= t.cache.range.lo)
        t.cache = rb_next(d);
    rb_erase(t, d);
}

Figure 4.2: Pseudo code of the baseline IOVA allocation scheme. The functions
rb_next and rb_prev return the successor and predecessor of the node they receive,
respectively.

R's successor. And (2) when a new range Q is allocated, then C is updated to point to
Q; if Q was the highest free gap prior to its allocation, then C still points higher than
the highest free gap after this allocation.

4.4 Long-Lasting Ring Interference


Figure 4.2 lists the pseudo code of the IOVA allocation scheme just described.
Clearly, the algorithm's worst-case complexity is linear due to the while loop that
scans previously allocated ranges beginning at the cache node C. But when factoring
in the actual workload that this algorithm services, the situation is not so bleak: the
complexity turns out to actually be constant rather than linear, at least conceptually.
Recall that the workload is commonly induced by a circular ring buffer, whereby
IOVAs of DMA target buffers are allocated and freed in a repeated, cyclic manner.
Consider, for example, an Ethernet NIC with an Rx ring of size n, ready to receive packets.
Assume the NIC initially allocates n target buffers, each big enough to hold one packet
(1500 bytes). The NIC then maps the buffers to n newly allocated, consecutive IOVAs
with which it populates the ring descriptors. Assume that the IOVAs are n, n-1, ..., 2, 1.
(The series is descending as IOVAs are allocated from highest to lowest.) The first
mapped IOVA is n, so the NIC stores the first received packet in the memory pointed to
by n, and it triggers an interrupt to let the OS know that it needs to handle the packet.
Upon handling the interrupt, the OS first unmaps the corresponding IOVA, purging
it from the IOTLB and the IOMMU page table to prevent the device from accessing the
associated target buffer (assuming strict protection). The unmap frees IOVA=n, thus
updating C to point to n's successor in the red-black tree (free_iova in Figure 4.2). The
OS then immediately re-arms the ring descriptor for future packets, allocating a new
target buffer and associating it with a newly allocated IOVA. The latter will be n, and
it will be allocated in constant time, as C points to n's immediate successor (alloc_iova
in Figure 4.2). The same scenario will cyclically repeat itself for n-1, n-2, ..., 1, and
then again for n, ..., 1, and so on as long as the NIC is operational.
Our experiments across multiple devices and workloads (described below) indicate
that the above description is fairly accurate. IOVA allocation requests are overwhelm-
ingly for one-page ranges (H = L), and the freed IOVAs are indeed re-allocated shortly
after being freed, enabling, in principle, the allocator in Figure 4.2 to operate in constant
time as described. But the algorithm succeeds to operate in this ideal manner only for
some bounded time. We find that, inevitably, an event occurs and ruins this ideality
thereafter.
The above description assumes there exists only one ring in the I/O virtual address
space. In reality, however, there are often two or more, for example, the Rx and Tx
receive and transmit rings. Nonetheless, even when servicing multiple rings, the IOVA
allocator provides constant-time allocation in many cases, so long as each ring's free_iova
is immediately followed by a matching alloc_iova for the same ring (the common case).
Allocating for one ring and then another indeed causes linear IOVA searches due to
how the cache node C is maintained. But large bursts of I/O activity flowing in one
direction still enjoy constant allocation time.
The aforementioned event that forever eliminates the allocator's ability to accom-
modate large I/O bursts with constant time occurs when a free-allocate pair of one ring
is interleaved with that of another. Then, an IOVA from one ring is mapped to another,
ruining the contiguity of the ring's I/O virtual address space. Henceforth, every cycle of n
allocations would involve one linear search, prompted whenever the noncontiguous IOVA
is freed and reallocated. We call this pathology long-lasting ring interference and note
that its harmful effect increases as additional inter-ring free-allocate interleavings occur.
Table 4.1 illustrates the pathology. Assume that a server mostly receives data and
occasionally transmits. Suppose that Rx activity triggers an Rx.free_iova(L) of address
L (1). Typically, this action would be followed by Rx.alloc_iova, which would then
return L (2). But sometimes a Tx operation sneaks in between. If this Tx operation is
Tx.free_iova(H) such that H > L (3), then the allocator would update the cache node C
to point to H's successor (4). The next Rx.alloc_iova would be satisfied by H (5), but then
the subsequent Rx.alloc_iova would have to iterate through the tree from H (6) to L (7),
inducing a linear overhead. Notably, once H is mapped for Rx, the pathology is repeated
every time H is (de)allocated. This repetitiveness is experimentally demonstrated in
Figure 4.3, which shows the per-allocation number of rb_prev invocations. The calls are
invoked in the loop in alloc_iova while searching for a free IOVA.
We show below that the implications of long-lasting ring interference can be dreadful
                 |       without Tx           |         with Tx
operation        | return  C before   C after | return    C before    C after
-----------------+----------------------------+------------------------------
Rx.free(L=151)   |   -       152        152   | (1)  -       152         152
Tx.free(H=300)   |                            | (3)  -       152     (4) 301
Rx.alloc         | (2) 151   152        151   | (5) 300      301         300
Rx.free(150)     |   -       151        151   |      -       300     (6) 300
Rx.alloc         |   150     151        150   | (7) 151      300         151

Table 4.1: Illustrating why Rx-Tx interferences cause linearity, following the baseline
allocation algorithm detailed in Figure 4.2. (Assume that all addresses are initially allocated.)
Figure 4.3: The length of each alloc_iova search loop (measured in rb_prev calls, y axis)
in a 40K (sub)sequence of alloc_iova calls (allocation serial number, x axis) performed by
one Netperf run. One Rx-Tx interference leads to regular linearity.

in terms of performance. How, then, is it possible that such a deficiency went unnoticed?
We contend that the reason is twofold. The first is that commodity I/O devices were
slow enough in the past that IOVA allocation linearity did not matter. The second
reason is that using the IOMMU hardware is slow and incurs a high price,
motivating the deferred protection safety/performance tradeoff. Being that slow, the
hardware served as a scapegoat, wrongfully held accountable for most of the overhead
penalty and masking the fact that software is equally to blame.

4.5 The EiovaR Optimization

Suffering from frequent linear allocations, the baseline IOVA allocator is ill-suited for
high-throughput I/O devices that are capable of performing millions of I/O transactions
per second. It is too slow. One could proclaim that this is just another case of a special-
purpose allocator proving inferior to a general-purpose allocator and argue that the
latter should be favored over the former despite the notable differences between the two
listed in Section 4.3. We, however, contend that the simple and repetitive, inherently ring-
induced nature of the workload justifies a special-purpose allocator. We further contend
that we are able to modify the existing allocator to consistently support extremely fast
(de)allocations while introducing only a minimal change.
We propose the EiovaR optimization (Efficient IOVA allocatoR), which rests on the
following observation. I/O devices that stress the intra-OS protection mapping layer
are not like processes, in that the size of their virtual address spaces is relatively small,
inherently bounded by the size of their rings. A typical ring size n is a few hundred
or a few thousand entries. The number of per-device virtual page addresses that
the IOVA allocator must simultaneously support is proportional to the ring size, which
means it is likewise bounded and relatively small. Moreover, unlike regular memory
allocators, the IOVA mapping layer does not allocate real memory pages. Rather, it
allocates integer identifiers for those pages. Thus, it is reasonable to keep O(n) of these
identifiers alive under the hood for quick (de)allocation, without really (de)allocating
them (in the traditional, malloc sense of (de)allocation).
In numerous experiments with multiple devices and workloads, the maximal number
of per-device different IOVAs we have seen is 12K. More relevant is that, across all
experiments, the maximal number of previously-allocated-but-now-free IOVAs has never
exceeded 668 and was 155 on average. EiovaR leverages this workload characteristic
to cache freed IOVAs so as to satisfy future allocations quickly. It further leverages
the fact that, as noted earlier, IOVA allocation requests are overwhelmingly for one-
page ranges (H = L), and that, in the handful of cases where H ≠ L, the size of the
requested range has always been a power of two (H - L + 1 = 2^j). We have never
witnessed a non-power-of-two allocation.
EiovaR is a thin layer that masks the red-black tree, resorting to using it only when
EiovaR cannot fulfill an IOVA allocation on its own using previously freed elements. When
configured to have enough capacity, all tree allocations that EiovaR is unable to mask
are assured to occur in constant time.
EiovaR's main data structure is called the freelist. It has two components. The
first is an array, f_arr, which has M entries, such that 2^(M+12) bytes is the upper bound on
the size of the consecutive memory areas that EiovaR supports. (M = 28 is enough,
implying a terabyte of memory.) Entries in f_arr are linked lists of IOVA ranges. They
are empty upon initialization. When an IOVA range [L, H] whose size is a power
of two is freed, instead of actually freeing it, EiovaR adds it to the linked list of the
corresponding exponent. That is, if H - L + 1 = 2^j, then EiovaR adds the range to
its j-th linked list f_arr[j]. Ranges comprised of one page (H = L) end up in f_arr[0].
For completeness, when the size of a freed range is not a power of two, EiovaR
stores it in its second freelist component, f_rbt, which is a red-black tree. Unlike the
baseline red-black tree, which sorts [L, H] ranges by their L and H values, f_rbt sorts
ranges by their size (H - L + 1), allowing EiovaR to locate a range of a desirable size
Figure 4.4: Netperf TCP stream iteratively executed under strict protection. The x axis
shows the iteration number. Panels: (a) throughput [Gbit/sec]; (b) freelist allocation [percent];
(c) avg freelist size [nodes]; (d) avg search length [rb_prev ops]. Curves correspond to the
baseline and to EiovaR_k for k = ∞, 512, 64, 8, and 1.

in logarithmic time.
EiovaR allocation performs the reverse operation of freeing. If the newly allocated
range has a power-of-two size (i.e., 2^j), then EiovaR tries to satisfy the allocation using
f_arr[j]. Otherwise, it tries to satisfy the allocation using f_rbt. EiovaR resorts to utilizing
the baseline red-black tree only if a suitable range is not found in the freelist.
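The following user-space sketch captures the essence of the freelist just described: an array of per-exponent free lists that is consulted first, with a fall-back to the baseline allocator on a miss. The baseline_alloc_iova() stub stands in for the red-black-tree allocator of Figure 4.2, the non-power-of-two f_rbt path is omitted, and all names and constants are illustrative.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define M 28                          /* number of per-exponent free lists */

struct iova_range { uint64_t lo, hi; struct iova_range *next; };

static struct iova_range *f_arr[M];   /* f_arr[j] caches freed ranges of 2^j pages */

/* Stand-in for the baseline allocator of Figure 4.2: hands out fresh,
   descending page numbers instead of scanning a red-black tree. */
static struct iova_range *baseline_alloc_iova(uint64_t npages)
{
    static uint64_t next_page = 0xfffff;      /* illustrative starting point */
    struct iova_range *r = malloc(sizeof(*r));
    r->hi = next_page;
    r->lo = next_page - npages + 1;
    next_page -= npages;
    return r;
}

static int log2_of(uint64_t npages)   /* npages is assumed to be a power of two */
{
    int j = 0;
    while ((1ULL << j) < npages)
        j++;
    return j;
}

/* EiovaR allocation: try the per-exponent freelist first; fall back
   to the baseline allocator only on a freelist miss. */
static struct iova_range *eiovar_alloc(uint64_t npages)
{
    int j = log2_of(npages);
    if (f_arr[j]) {
        struct iova_range *r = f_arr[j];
        f_arr[j] = r->next;
        return r;
    }
    return baseline_alloc_iova(npages);
}

/* EiovaR deallocation: cache the range for reuse instead of freeing it. */
static void eiovar_free(struct iova_range *r)
{
    int j = log2_of(r->hi - r->lo + 1);
    r->next  = f_arr[j];
    f_arr[j] = r;
}

int main(void)
{
    struct iova_range *a = eiovar_alloc(1);   /* miss: served by the baseline stub */
    printf("allocated   [%llx, %llx]\n", (unsigned long long)a->lo,
                                         (unsigned long long)a->hi);
    eiovar_free(a);                           /* cached in f_arr[0]                */
    struct iova_range *b = eiovar_alloc(1);   /* hit: same range, reused           */
    printf("reallocated [%llx, %llx]\n", (unsigned long long)b->lo,
                                         (unsigned long long)b->hi);
    return 0;
}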
When configured with enough capacity, our experiments indicate that, after a
while, all (de)allocations are satisfied by f_arr in constant time, taking 50-150 cycles per
(de)allocation on average, depending on the configuration. When not imposing a limit
on the freelist capacity, the allocations that EiovaR satisfies by resorting to the baseline
tree are likewise done in constant time, since the tree never observes deallocations,
which means its cache node C always points to its smallest, leftmost node (Figure 4.2).
To bound the size of the freelist, EiovaR has a configurable parameter k that serves
as its maximal capacity. We use the EiovaR_k notation to express this limit, with k = ∞
indicating a limitless freelist.

4.5.1 EiovaR with Strict Protection

To understand the behavior and effect of EiovaR, we begin by analyzing five EiovaR_k
variants as compared to the baseline under strict protection, where IOVAs are
(de)allocated immediately before and after the associated DMAs. We use the standard
Netperf stream benchmark that maximizes throughput on one TCP connection. We
initially restart the NIC interface for each allocation variant (thus clearing the IOVA
structures), and then we execute the benchmark iteratively. The exact experimental
setup is described in Section 4.6. The results are shown in Figure 4.4.
Figure 4.4a shows that the throughput of all EiovaR variants is similar and is
20%-60% better than the baseline. The baseline gradually decreases, except in the
last iteration. Figure 4.4b highlights why even EiovaR1 is sufficient to provide the
observed benefit. It plots the rate of IOVA allocations that are satisfied by the freelist,
showing that k = 1 is enough to satisfy nearly all allocations. This result indicates that
each call to free_iova is followed by alloc_iova, such that the IOVA freed by the former
is returned by the latter, coinciding with the ideal scenario outlined in Section 4.3.

Figure 4.5: Average cycle breakdown of map (alloc_iova vs. all the rest) with Netperf/strict,
shown per iteration for the baseline (left) and EiovaR (right).

Figure 4.6: Average cycle breakdown of unmap (free_iova vs. all the rest) with Netperf/strict,
shown per iteration for the baseline (left) and EiovaR (right).

Figure 4.4c supports this observation by depicting the average size of the freelist. The
average of EiovaR1 is inevitably 0.5, as every allocation and deallocation contributes 1
and 0 to the average, respectively. Larger k values are similar, with an average of 2.5
because of two additional (de)allocations that are performed when Netperf starts running
and that remain in the freelist thereafter. Figure 4.4d shows the average length of the while
loop from Figure 4.2, which searches for the next free IOVA. It depicts a rough mirror
image of Figure 4.4a, indicating that throughput is tightly negatively correlated with the
traversal length.
Figure 4.5 (left) shows the time it takes the baseline to map an IOVA, separating
allocation from the other activities. Whereas the latter remains constant, the former
exhibits a trend identical to Figure 4.4d. Conversely, the alloc_iova time of EiovaR
(Figure 4.5, right) is negligible across the board. EiovaR is immune to long-lasting ring
interference, as interfering transactions are absorbed by the freelist and reused in
constant time.

Figure 4.7: Netperf TCP stream iteratively executed under deferred protection. The x axis
shows the iteration number. Panels: (a) throughput [Gbit/sec]; (b) freelist allocation [percent;
log-scaled]; (c) avg freelist size [nodes; log-scaled]; (d) avg search length [rb_prev ops].
Curves correspond to the baseline and to EiovaR_k for k = ∞, 512, 64, 8, and 1.


4.5.2 EiovaR with Deferred Protection

Figure 4.6 is similar to Figure 4.5, but it pertains to the unmap operation rather than
to map. It shows that the duration of free_iova remains stable across iterations with
both EiovaR and the baseline. EiovaR deallocation is still faster, as it is performed in
constant time whereas the baseline is logarithmic. But most of the overhead is not due
to free_iova. Rather, it is due to the costly invalidation that purges the IOVA from the
IOTLB to protect the corresponding target buffer. This is the aforementioned costly
hardware overhead that motivated deferred protection, which amortizes the price by
delaying invalidations until enough IOVAs are accumulated and then processing all of
them together. As noted, deferring the invalidations trades off safety for performance,
because the relevant memory is accessible to the device even though it is already used
by the kernel for other purposes.
Figure 4.7 compares the baseline and the EiovaR variants under deferred
protection. Interestingly, the resulting throughput divides the variants into two groups,
with EiovaR512 and EiovaR∞ above 6Gbps and all the rest at around 4Gbps (Figure 4.7a).
We again observe a strong negative correlation between the throughput and the length
of the search to find the next free IOVA (Figure 4.7a vs. 4.7d).
In contrast to the strict setup (Figure 4.4), here we see that EiovaR variants with
smaller k values perform roughly as badly as the baseline. This finding is somewhat
surprising, because, e.g., 25% of the allocations of EiovaR64 are satisfied by the freelist
(Figure 4.7b), which should presumably improve its performance over the baseline. A
finding that helps explain this result is that the average size of the EiovaR64
freelist is 32 (Figure 4.7c), even though it is allowed to hold up to k = 64 elements.
Notice that EiovaR∞ holds around 128 elements on average, so we know there are
enough deallocations to fully populate the EiovaR64 freelist. One might therefore expect
that the latter would be fully utilized, but it is not.
The average size of the EiovaR64 freelist is 50% of its capacity due to the following
reason. Deferred invalidations are aggregated until a high-water mark W (a kernel
parameter) is reached, and then all the W addresses are deallocated in bulk.2 When
k < W, the freelist fills up to hold k elements, which become k-1 after the subsequent
allocation, and then k-2 after the next allocation, and so on until zero is reached,
yielding an average size of (1/(k+1)) Σ_{j=0}^{k} j = k/2, as our measurements show.
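Spelled out, the freelist occupancy under this regime cycles through k, k-1, ..., 1, 0 between consecutive bulk deallocations, so, assuming each occupancy level is observed equally often, its expected size is

\[
\frac{1}{k+1}\sum_{j=0}^{k} j \;=\; \frac{1}{k+1}\cdot\frac{k(k+1)}{2} \;=\; \frac{k}{2},
\]

which for k = 64 gives 32, matching the measurement in Figure 4.7c.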
Importantly, the EiovaR_k freelist does not have enough capacity to absorb all
the W consecutive deallocations when k < W. The remaining W-k deallocations
would therefore be freed by the baseline free_iova. Likewise, out of the subsequent W
allocations, only k would be satisfied by the freelist, and the remaining W-k would be
serviced by the baseline alloc_iova. It follows that the baseline free_iova and alloc_iova
are regularly invoked in an uncoordinated way despite the freelist. As described in Section 4.4,
the interplay between these two routines in the baseline version makes them susceptible
to long-lasting ring interference that inevitably induces repeated linear searches. In
contrast, when k is big enough (k ≥ W), the freelist has sufficient capacity to absorb all
W deallocations, which are then used to satisfy the subsequent W allocations, thus
securing the conditions for preventing the harmful effect.
Figure 4.8 experimentally demonstrates this threshold behavior, depicting the
throughput as a function of the maximal allowed freelist size k. As k gets bigger,
the performance slowly improves, because more, but not yet all, (de)allocations are
served by the freelist. When k reaches W = 250, the freelist is finally big enough, and
the throughput suddenly increases by 26%. Figure 4.9 provides further insight into
this result. It shows the per-allocation length of the loop within alloc_iova that iterates
through the red-black tree in search of the next free IOVA (similarly to Figure 4.3).
The three sub-graphs correspond to three points from Figure 4.8 that are associated
with the k values 64, 240, and 250. We see that the smaller k (left) yields longer
searches relative to the bigger k (middle), and that the length of the search drops to
zero when k = W (right).

4.6 Evaluation
4.6.1 Methodology
Experimental Setup We implement EiovaR in the Linux kernel, and we experimen-
tally evaluate its performance, contrasting it against the baseline IOVA allocation.
In an effort to attain more general results, we conducted the evaluation using two
setups involving two different NICs with two corresponding device drivers that
generate different workloads for the IOVA allocation layer.
The Mellanox setup is comprised of two identical Dell PowerEdge R210 II Rack
Server machines that communicate through Mellanox ConnectX3 40Gbps NICs. The
NICs are connected back to back via a 40Gbps optical fiber and are configured to use
Ethernet.

2 They cannot be freed before they are purged from the IOTLB, or else they could be re-allocated,
which would be a bug since their stale mappings might reside in the IOTLB and point to somewhere
else.

We use one machine as the server and the other as a workload-generator
client. Each machine is furnished with 8GB of 1333MHz memory and a single-socket
4-core Intel Xeon E3-1220 CPU running at 3.10GHz. The chipset is Intel C202, which
supports VT-d, Intel's Virtualization Technology that provides IOMMU functionality.
We configure the server to utilize one core only, and we turn off all power optimizations,
namely sleep states (C-states) and dynamic voltage and frequency scaling (DVFS), to
avoid reporting artifacts caused by nondeterministic events. The two machines run
Ubuntu 12.04 and utilize the Linux 3.4.64 kernel.

Figure 4.8: Under deferred protection, EiovaR_k eliminates costly linear searches when k
exceeds the high-water mark W. (Netperf stream throughput [Gbit/sec] as a function of
EiovaR's freelist size, 0-250.)

Figure 4.9: Length of the alloc_iova search loop (in rb_prev calls, y axis) per allocation
serial number (x axis) under the EiovaR_k deferred protection regime for three k values
(64, 240, and W=250) when running Netperf TCP Stream. Bigger capacity implies that the
searches become shorter on average. Big enough capacity (k ≥ W = 250) eliminates the
searches altogether.
Figure 4.10: The performance of baseline vs. EiovaR allocation, under strict and deferred
protection regimes for the Mellanox (top) and Broadcom (bottom) setups. Panels (left to
right): Netperf stream throughput [Gbp/s], Netperf RR latency, Apache 1MB [requests/sec],
Apache 1KB [requests/sec], and Memcached [transactions/sec]. Except in the case of
Netperf RR, higher values indicate better performance.
The Broadcom setup is similar, likewise utilizing two Dell PowerEdge R210 machines.
The differences are that the two machines communicate through Broadcom NetXtreme II
BCM57810 10GbE NICs (connected via a CAT7 10GBASE-T cable); that they are
equipped with 16GB of memory; and that they run the Linux 3.11.0 kernel.
The device drivers of the Mellanox and Broadcom NICs differ in many respects.
Notably, the Mellanox NIC utilizes more ring buffers and allocates more IOVAs (we
observed around 12K addresses for Mellanox and 3K for Broadcom). For example,
the Mellanox driver uses two buffers per packet, one for the header and one for the
body, and hence two IOVAs, whereas the Broadcom driver allocates only one buffer
and thus only one IOVA.

Benchmarks To drive our experiments we utilize the following benchmarks:

1. Netperf TCP stream [42] is a standard tool to measure networking performance
in terms of throughput. It attempts to maximize the amount of data sent over one
TCP connection, simulating an I/O-intensive workload. This is the application
we used above when demonstrating long-lasting ring interference and how EiovaR
solves it. Unless otherwise stated, Netperf TCP stream employs its default message
size, which is 16KB.

2. Netperf UDP RR (request-response) is the second canonical configuration of
Netperf. It models a latency-sensitive workload by repeatedly sending a single byte
and waiting for a matching single-byte response. The latency is then calculated
as the inverse of the observed number of transactions per second.
3. Apache [26, 27] is a popular HTTP web server. We drive it using ApacheBench
[8] (also called ab), a workload generator that is distributed with Apache.
The goal of ApacheBench is to assess the number of concurrent requests per second
that the server is capable of handling by requesting a static page of a given size
from within several concurrent threads. We run ApacheBench on the client
machine, configured to generate 100 concurrent requests. We use two instances of
the benchmark, respectively requesting a smaller (1KB) and a bigger (1MB) file.
Logging is disabled to avoid the overhead of writing to disk.

4. Memcached [28] is a high-performance in-memory key-value storage server. It is
used, for example, by websites for caching results of slow database queries, thus
improving the site's overall performance and scalability. We used the Memslap
benchmark [2] (part of the libmemcached client library), which runs on the client
machine and measures the completion rate of the requests that it generates. By
default, Memslap generates a random workload comprised of 90% get and 10% set
operations. Unless otherwise stated, Memslap is set to use 16 concurrent requests.

Before running each benchmark, we shut down and bring up the interface of the NIC
using the ifconfig utility, such that the IOVA allocation is redone from scratch using a
clean tree, clearing the impact of previous harmful long-lasting ring interferences. We
then iteratively run the benchmark 150 times, with individual runs configured to take
about 20 seconds. We report the corresponding averages.

4.6.2 Results
The resulting average performance is shown in Figure 4.10 for the Mellanox (top) and
Broadcom (bottom) setups. The corresponding normalized performance, which specifies
the relative improvement, is shown in the first part of Tables 4.2 and 4.3. In Figure 4.10,
higher numbers indicate better throughput in all cases but Netperf RR, which depicts
latency (the inverse of the throughput). No such exception is required in (the first part
of) Tables 4.2 and 4.3, which, for consistency, display the normalized throughput for
all applications, including Netperf RR.

metric        protect  benchmark        baseline   EiovaR    diff
throughput    strict   Netperf stream       1.00     2.37   +137%
(normalized)           Netperf RR           1.00     1.27    +27%
                       Apache 1MB           1.00     3.65   +265%
                       Apache 1KB           1.00     2.35   +135%
                       Memcached            1.00     4.58   +358%
              defer    Netperf stream       1.00     1.71    +71%
                       Netperf RR           1.00     1.07     +7%
                       Apache 1MB           1.00     1.21    +21%
                       Apache 1KB           1.00     1.11    +11%
                       Memcached            1.00     1.25    +25%
alloc         strict   Netperf stream       7656       88    -99%
(cycles)               Netperf RR          10269      175    -98%
                       Apache 1MB          17776      128    -99%
                       Apache 1KB          49981      204   -100%
                       Memcached           50606      151   -100%
              defer    Netperf stream       2202      103    -95%
                       Netperf RR           2360      183    -92%
                       Apache 1MB           2085      130    -94%
                       Apache 1KB           2642      206    -92%
                       Memcached            3040      171    -94%
search        strict   Netperf stream        153        0   -100%
(length)               Netperf RR            206        0   -100%
                       Apache 1MB            381        0   -100%
                       Apache 1KB           1078        0   -100%
                       Memcached             893        0   -100%
              defer    Netperf stream         32        0   -100%
                       Netperf RR             32        0   -100%
                       Apache 1MB             30        0   -100%
                       Apache 1KB             33        0   -100%
                       Memcached              33        0   -100%
free          strict   Netperf stream        289       66    -77%
(cycles)               Netperf RR            446       87    -81%
                       Apache 1MB            360       70    -81%
                       Apache 1KB            565       85    -85%
                       Memcached             525       73    -86%
              defer    Netperf stream        273       65    -76%
                       Netperf RR            242       66    -73%
                       Apache 1MB            278       65    -76%
                       Apache 1KB            300       66    -78%
                       Memcached             334       65    -80%
cpu           strict   Netperf stream        100      100     +0%
(%)                    Netperf RR             32       29     -8%
                       Apache 1MB            100       99     -0%
                       Apache 1KB             99       98     -1%
                       Memcached             100      100     +0%
              defer    Netperf stream        100      100     +0%
                       Netperf RR             30       29     -5%
                       Apache 1MB             99       99     -0%
                       Apache 1KB             98       98     -0%
                       Memcached             100      100     +0%

Table 4.2: Summary of the results obtained with the Mellanox setup.

metric        protect  benchmark        baseline   EiovaR    diff
throughput    strict   Netperf stream       1.00     2.35   +135%
(normalized)           Netperf RR           1.00     1.07     +7%
                       Apache 1MB           1.00     1.22    +22%
                       Apache 1KB           1.00     1.16    +16%
                       Memcached            1.00     1.40    +40%
              defer    Netperf stream       1.00     1.00     +0%
                       Netperf RR           1.00     1.02     +2%
                       Apache 1MB           1.00     1.00     +0%
                       Apache 1KB           1.00     1.10    +10%
                       Memcached            1.00     1.05     +5%
alloc         strict   Netperf stream      14878       70   -100%
(cycles)               Netperf RR           3359      100    -97%
                       Apache 1MB           1469       74    -95%
                       Apache 1KB           2527      116    -95%
                       Memcached            5797      110    -98%
              defer    Netperf stream       1108       96    -91%
                       Netperf RR           1029      118    -89%
                       Apache 1MB            833       88    -89%
                       Apache 1KB           1104      133    -88%
                       Memcached            1021      130    -87%
search        strict   Netperf stream        345        0   -100%
(length)               Netperf RR             68        0   -100%
                       Apache 1MB             27        0   -100%
                       Apache 1KB             39        0   -100%
                       Memcached             128        0   -100%
              defer    Netperf stream         13        0   -100%
                       Netperf RR              9        0   -100%
                       Apache 1MB              9        0   -100%
                       Apache 1KB              9        0   -100%
                       Memcached               9        0   -100%
free          strict   Netperf stream        294       47    -84%
(cycles)               Netperf RR            282       48    -83%
                       Apache 1MB            250       50    -80%
                       Apache 1KB            425       52    -88%
                       Memcached             342       47    -86%
              defer    Netperf stream        268       47    -82%
                       Netperf RR            273       47    -83%
                       Apache 1MB            234       47    -80%
                       Apache 1KB            279       47    -83%
                       Memcached             276       47    -83%
cpu           strict   Netperf stream        100       53    -49%
(%)                    Netperf RR             13       12    -12%
                       Apache 1MB             99       99     -0%
                       Apache 1KB             98       98     -0%
                       Memcached              99       95     -4%
              defer    Netperf stream         55       44    -21%
                       Netperf RR             12       11     -7%
                       Apache 1MB             91       72    -21%
                       Apache 1KB             98       98     -0%
                       Memcached              93       92     -2%

Table 4.3: Summary of the results obtained with the Broadcom setup.

Mellanox Setup Results Let us first examine the results of the Mellanox setup
(Table 4.2). In the topmost part of the table, we see that EiovaR yields throughput that
is 1.07–4.58x better than the baseline, and that the improvements are more pronounced
under strict protection. In the second part of the table, we see that the reason underlying
the improved performance of EiovaR is that it reduces the average IOVA allocation
time by 1–2 orders of magnitude, from up to 50K cycles to around 100–200. It also
reduces the average IOVA deallocation time by about 75%–85%, from around 250–550
cycles to around 65–85 (fourth part of the table).
When comparing the overhead of the IOVA allocation routine to the length of
the search loop within this routine (third part of Table 4.2), we can see that, as
expected, the two quantities are tightly correlated. Roughly, a longer loop implies
a higher allocation overhead. Conversely, notice that there is not necessarily such a
direct correspondence between the throughput improvement of EiovaR (first part of
the table) and the associated IOVA allocation overhead (second part). The reason
is largely that latency-sensitive applications are less affected by the allocation
overhead, because other components in their I/O paths have higher relative
weights. For example, under strict protection, the latency-sensitive Netperf RR has
a higher allocation overhead than the throughput-sensitive Netperf Stream
(10,269 cycles vs. 7,656, respectively), yet the throughput improvement of RR is smaller
(1.27x vs. 2.37x). Similarly, the IOVA allocation overhead of Apache/1KB is higher
than that of Apache/1MB (49,981 cycles vs. 17,776), yet its throughput improvement is
lower (2.35x vs. 3.65x).
While there is not necessarily a direct connection between throughput and allocation
overhead when examining strict safety only, the connection becomes apparent when
comparing the strict and deferred protection regimes. Clearly, the benefit of EiovaR in
terms of throughput is greater under strict protection because the associated baseline
allocation overheads are higher than those of deferred protection (7K–50K cycles for
strict vs. 2K–3K for deferred).

Broadcom Setup Results Let us now examine the results of the Broadcom setup
(Table 4.3). Strict EiovaR yields throughput that is 1.07–2.35x better than the baseline.
Deferred EiovaR, on the other hand, only improves the throughput by up to 10%, and,
in the case of Netperf Stream and Apache/1MB, it offers no improvement. Thus, while
still significant, the throughput improvements in this setup are less pronounced. The
reason for this difference is twofold. First, as noted above, the driver of the Mellanox NIC
utilizes more rings and more IOVAs, increasing the load on the IOVA allocation layer
relative to the Broadcom driver and generating more opportunities for ring interference.
This difference is evident when comparing the duration of alloc_iova in the two setups,
which is significantly lower in the Broadcom case. In particular, the average allocation
time in the Mellanox setup across all benchmarks and protection regimes is about 15K
cycles, whereas it is only about 3K cycles in the Broadcom setup.

The second reason for the less pronounced improvements in the Broadcom setup is
that the Broadcom NIC imposes a 10 Gbps upper bound on the bandwidth, which is
reached in some of the benchmarks. Specifically, the aforementioned Netperf Stream
and Apache/1MB, which exhibit no throughput improvement under deferred EiovaR,
hit this limit. These benchmarks are already capable of obtaining line rate
(maximal throughput) in the baseline/deferred configuration, so the lack of throughput
improvement in their case should come as no surprise. Importantly, when evaluating I/O
performance in a setting where the I/O channel is saturated, the interesting evaluation
metric ceases to be throughput and becomes CPU usage. Namely, the question becomes
which system is capable of achieving line rate using fewer CPU cycles. The bottom part
of Table 4.3 shows that EiovaR is indeed the more performant alternative, using 21%
fewer CPU cycles in the case of the said Netperf Stream and Apache/1MB under deferred
protection. (In the Mellanox setup, it is the CPU that is saturated in all cases but the
latency-sensitive Netperf RR.)

Deferred Baseline vs. Strict EiovaR We explained above that deferred protection
trades off safety for better performance. We now note that, by Figure 4.10, the
performance attained by EiovaR when strict protection is employed is similar to the
performance of the baseline configuration that uses deferred protection (the default in
Linux). Specifically, in the Mellanox setup, on average, strict EiovaR achieves 5% higher
throughput than the deferred baseline, and in the Broadcom setup it achieves 3%
lower throughput. Namely, if strict EiovaR were made the default, it would simultaneously
deliver similar performance and better protection as compared to the current default
configuration.

Different Message Sizes The default configuration of Netperf Stream utilizes a 16KB message size, which is big enough to optimize throughput. Our next experiment
systematically explores the performance tradeoffs when utilizing smaller message sizes.
Such messages can overwhelm the CPU and thus reduce the throughput. Another issue
that might negatively affect the throughput of small packets is the maximal number of
packets per second (PPS), which NICs commonly impose in conjunction with an upper
bound on the throughput. (For example, the specification of our Broadcom NIC lists a
maximal rate of 5.7 million PPS [34], and a rigorous experimental evaluation of this
NIC reports that a single port in it is capable of delivering less than half that much
[25].)
Figure 4.11 shows the throughput (top) and consumed CPU (bottom) as a function
of the message size for strict (left) and deferred safety (right) using the Netperf Stream
benchmark in the Broadcom setup. Initially, with a 64B message size, the PPS limit
dominates the throughput in all four configurations. Strict/baseline saturates the CPU
with a message size as small as 256B; from that point on it achieves the same throughput
(4Gbps), because the CPU remains its bottleneck. The other three configurations enjoy

Figure 4.11: Netperf Stream throughput (top) and used CPU (bottom) for different message sizes in the Broadcom setup.

a gradually increasing throughput until the line rate is reached. However, to achieve
the same level of throughput, strict/EiovaR requires more CPU than deferred/baseline,
which in turn requires more CPU than deferred/EiovaR.

Concurrency Our final experiment focuses on concurrent I/O streams, as concurrency amplifies harmful long-lasting ring interference. Figure 4.12 depicts the results of running
Memcached in the Mellanox setup with an increasing number of clients. The left sub-
graph reveals that the baseline allocation hampers scalability, whereas EiovaR allows
the benchmark to scale such that it is up to 5.5x more performant than the baseline
(with 32 clients). The right sub-graph highlights why, showing that the baseline IOVA
allocation becomes costlier proportionally to the number of clients, whereas EiovaR
allocation remains negligible across the board.

Figure 4.12: Impact of increased concurrency on Memcached in the Mellanox setup. EiovaR allows the performance to scale.

4.7 Related Work


Several studies recognized the poor performance of the IOMMU for OS protection and
addressed the issue using various techniques.
Willmann et al. [64] studied the poor performance guests exhibit when IOMMU
is used for inter-guest protection. Two of the strategies suggested by the authors to
enhance performance are also applicable to a native OS that uses the IOMMU for OS protection: the shared mappings strategy reuses existing mappings when multiple I/O buffers reside on the same page, yet does not reduce the CPU utilization significantly.
The persistent mappings strategy defers the actual unmap operations to allow the
reuse of mappings when the same I/O buffer is reused later. This strategy improves
performance considerably, yet relaxes the protection IOMMU delivers.
A similar work by Amit et al. [4] presented additional techniques that can be applied to enhance the performance of a native OS that uses IOMMU protection with a short window of vulnerability. Asynchronous IOTLB invalidations and a refinement of the persistent mappings strategy that limits how long stale mappings may be deferred were shown to reduce the overhead considerably. Nonetheless, these techniques still relax the IOMMU protection and may allow invalid DMA transactions to succeed and corrupt system memory.
Tomonori [60] studied the IOMMU performance bottleneck, suggesting that managing the IOVA free space with bitmaps instead of red-black trees can enhance performance by 20% on x86 platforms. Similarly, Cascardo [20] showed that using red-black trees for IOVA free space management on Power PC systems delivers significantly worse performance. Nonetheless, using bitmaps for free space management consumes significantly more memory and often does not scale well. Apparently, Intel refrained from using bitmaps for these reasons.


Chapter 5

Reducing the IOTLB Miss Overhead
5.1 Introduction
Virtual-to-physical I/O address translation is on the critical path of DMA operations in computer systems that use IOMMUs, since it is invoked on every memory access by the device that performs the DMA. To decrease the translation latency, the IOMMU uses a cache of recent translations called the I/O Translation Lookaside Buffer (IOTLB). The I/O virtual-to-physical address translation algorithm that the IOMMU executes is similar to the algorithm used by the CPU's MMU, and thus studies about TLBs should be relevant for learning the behavior of the IOTLB [41, 6, 35, 46]. Although many proposed optimizations were supposed to improve the TLB's performance by reducing the miss rate, none of the studies above succeeded in devising a TLB with a miss rate close to 0%. This is because at least the first access to an address causes a TLB miss. Prefetching can hide some or all of the cost of TLB misses, as has been shown in several recent studies [12, 56, 53, 46].
Amit et al. were, as far as we know, the first to explore prefetching in the context of the IOTLB, and they used two simple prefetching approaches. The first, proposed by Intel, uses the Address Locality Hints (ALH) mechanism [39]. With this approach, the IOMMU
prefetches adjacent virtual address translations. The second approach is Mapping
Prefetch (MPRE), whereby the operating system provides the IOMMU with explicit
hints to prefetch the first mapping of each group of streaming DMA mappings, where a
group is defined as a scatter-gather list of several contiguous pages that are mapped
consecutively. The evaluation of these prefetchers has been covered elsewhere; we refer
the interested reader to [5].
In general, we can classify prefetchers into two groups: those that capture strided reference patterns and those that decide whether to prefetch on the basis of history. (Strided reference patterns are a sequence of memory reads and writes to addresses, each of which is separated from the last by a constant interval.) Anand et al. reviewed five main
prefetching mechanisms for the data TLB [46, 45], two of which belong to the first class:
sequential prefetching and arbitrary stride prefetching. These prefetching mechanisms
cannot be adapted to the context of the IOMMU because one of them depends on the Page Table structure and the second uses the Program Counter register. Consequently, we used only the other three for comparison against our prefetching mechanism and adjusted them to the I/O context.
The first of these is based on limited history and is called Markov prefetching [43]. The
second, called recency-based prefetching, is based on a much larger history (until the
previous miss to a page) and is described in [56]. The third is a stride-based technique
called distance prefetching that was proposed for TLBs in [46].

5.2 General description of all the prefetchers we explore

Figure 5.1: General scheme. ((1) A virtual address translation is required; (2) on an IOTLB miss, the IOVA is translated to a PA; (3) the prediction table is updated; (4) the relevant IOVA translations are prefetched to the prefetch buffer (PB) using the prediction table and the page table; if both the IOTLB and the PB miss, the translation is taken from the page table.)

In a nutshell, the prefetch mechanism wraps the regular translation mechanism, predicts the next IOTLB misses, and brings the translations closer to the IOMMU so they can be supplied rapidly when they are required but not found in the IOTLB. A prefetcher contains three main parts: a database called the prediction table, a special cache called the prefetch buffer, and a logic unit. These are depicted in Figure 5.1.
The prediction table summarizes information about the recent IOTLB misses and
provides the logic unit with relevant data for deciding what translation to prefetch and whether to prefetch. Each row in this table has a tag and s slots containing data to calculate the next s address translations to prefetch. The size of the table is therefore the
number of rows multiplied by s. In our work we used s=2. In Markov prefetching, the
current missing page is used as the tag, and the table gives two other pages with high
probability of being missed immediately after: these two pages need to be prefetched.
Distance prefetching uses as the tag the current distance (difference in page number
between the current miss and the previous miss), and the table gives two other distances
that occurred immediately after this distance was encountered. Recency prefetching
keeps an LRU stack of TLB misses by maintaining two additional pointers in each page
table entry (kept in memory).
On each IOTLB miss, the logic unit updates the prediction table and decides which
translation to prefetch into the prefetch buffer. The prefetch buffer is as close to the IOMMU as the IOTLB, and it can be searched concurrently with the IOTLB. Hence,
in the case of a prefetch buffer hit, the IOMMU gets the translation as quickly as if it
were in the IOTLB.
Note that the prefetch mechanism observes the miss reference stream to make deci-
sions and never directly loads the fetched entries into the IOTLB. As a result, prefetching
does not influence the miss rate of the IOTLB; it can only hide the performance cost of
some of the IOTLB misses. Hence the only drawback of the prefetch mechanism is the
additional memory traffic of prefetched and unused translations.
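To make the interplay between the prediction table, the prefetch buffer, and the logic unit concrete, the following Python sketch models the translation path of Figure 5.1. It is our own illustrative model, not part of the simulator code described later; names such as LruCache, translate, and on_miss are ours, and the concrete prefetchers presented in the next sections plug in by implementing on_miss.

from collections import OrderedDict

class LruCache:
    """Tiny fully associative LRU cache standing in for the IOTLB / prefetch buffer."""
    def __init__(self, entries):
        self.entries, self.data = entries, OrderedDict()
    def lookup(self, iova):
        if iova in self.data:
            self.data.move_to_end(iova)          # refresh the LRU position
            return self.data[iova]
        return None
    def insert(self, iova, pa):
        self.data[iova] = pa
        self.data.move_to_end(iova)
        if len(self.data) > self.entries:
            self.data.popitem(last=False)        # evict the least recently used entry
    def invalidate(self, iova):
        self.data.pop(iova, None)                # drop a stale entry (e.g., on unmap)

def translate(iova, iotlb, pb, prefetcher, page_table):
    """One translation: the IOTLB and the prefetch buffer (PB) are searched (in hardware,
    concurrently); on a double miss the page table is walked and the prefetcher observes
    the miss, updates its prediction table, and fills the PB."""
    pa = iotlb.lookup(iova)
    if pa is not None:
        return pa, "iotlb-hit"
    pa = pb.lookup(iova)
    if pa is not None:
        return pa, "pb-hit"                      # served as quickly as an IOTLB hit
    pa = page_table[iova]                        # page walk in main memory
    iotlb.insert(iova, pa)
    for predicted in prefetcher.on_miss(iova):   # logic unit sees only the miss stream
        if predicted in page_table:              # prefetch only currently mapped IOVAs
            pb.insert(predicted, page_table[predicted])
    return pa, "miss"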

5.3 Markov Prefetcher (MP)

The Markov prefetcher is based on the Markov Chain Theorem and the assumption that
there is a good chance that history repeats itself. It uses an $X \times Y$ matrix to represent an approximation of a Markov state transition diagram and builds it dynamically. Markov state transition diagrams answer the question: given a visit to a specific state, what are the $X - 1$ states with the highest probability of being visited right after it?
The diagram contains states denoting the referenced unit (virtual addresses in this
context) and transition arcs denoting the probability to move from one state to the state
pointed to by the arc. The probabilities are tracked from prior references to that state.
To know which states have the highest probability of being visited, one should search the state's arcs. Joseph et al. were the first to propose this mechanism for caching [43].
Kandiraju et al. extended it to work with TLBs and called it the Markov Prefetcher
[46], and we extended the latter to work with IOTLBs, as will be discussed shortly. We
will briefly describe the Markov Chain theorem for a better understanding of how the
prefetcher works.

5.3.1 Markov Chain Theorem

Let us assume the following definitions: A stochastic process $X = \{X_t : t \in T\}$ is a collection of random variables, representing the evolution of some system of random values over time. We call $X_t$ the state of the process at time $t$.
The Markov property is defined as a memoryless transition between states, namely, the next state depends only on the current state and not on the sequence of events that preceded it.
Let $x_1, x_2, \ldots, x_n$ be random variables. Then
\[
\Pr(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \Pr(X_{n+1} = x \mid X_n = x_n).
\]
A Markov chain is a stochastic process with the Markov property.

The Markov chain can thus be used for describing systems that follow a chain of linked events represented as states, $S = \{s_1, s_2, \ldots, s_n\}$. The process starts in one of these states and moves successively from one state to another. These changes of state of the system are also called transitions. If the system is currently in state $s_i$, the probability that the next state will be $s_j$ is denoted by $p_{ij}$, where $p_{ij}$ depends only on the current state ($s_i$) of the system and not on the sequence of events that preceded it nor on the time $i$. This can be described mathematically as $\Pr[X_{t+1} = b \mid X_t = a] = p_{ab}$ (the stationary assumption). The states and transitions are denoted by a Markov state transition diagram, which is represented as a directed graph or a matrix and is usually built using the history of the process.

Example

This can be illustrated by a simple example. Let us assume that the entire World Wide
Web consists of only 3 websites, e1, e2, e3, visited in the sequence e1, e1, e2, e2, e1, e2,
e1, e3, e3, e2, e3. Then the Markov state transition diagram, built according to the
history of these visits, is illustrated in Figure 5.2.

Figure 5.2: Markov state transition diagram, which is represented as a directed graph (right) or a matrix (left).

The left is the matrix representation and the right is the directed graph representation.
If we are in state e1 there is a chance of 1/4 to stay in e1, 2/4 to move to e2 and 1/4 to
move to e3. If we are in state e2 there is a chance of 1/4 to stay in e2, 2/4 to move to e1 and 1/4 to move to e3, etc.
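As a sanity check, the matrix of Figure 5.2 can be reproduced from the visit history with a few lines of Python (our own illustration, not part of the thesis artifacts; Fraction prints probabilities in lowest terms, so 2/4 appears as 1/2):

from collections import Counter, defaultdict
from fractions import Fraction

visits = ["e1", "e1", "e2", "e2", "e1", "e2", "e1", "e3", "e3", "e2", "e3"]

counts = defaultdict(Counter)
for src, dst in zip(visits, visits[1:]):        # count every observed transition
    counts[src][dst] += 1

for src in sorted(counts):
    total = sum(counts[src].values())
    row = {dst: Fraction(n, total) for dst, n in sorted(counts[src].items())}
    print(src, row)                             # e.g. e1: {e1: 1/4, e2: 1/2, e3: 1/4}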

5.3.2 Prefetching Using the Markov Chain

The prediction table for the Markov prefetcher includes a tag column containing the
virtual addresses that missed on the TLB. Each row of the table has s slots, with each
slot containing a virtual page address that has the highest (approximated) probability
to be accessed after the virtual address denoted by the tag of the row. These slots
therefore correspond to translations to be prefetched the next time a miss occurs. On a
TLB miss, the prefetcher looks up the missed address in the table. If not found, then
this entry is added, and the s slots for this entry are kept empty. If the missed address
is found, then a prefetch is initiated for the corresponding s slots of this address.

5.3.3 Extension to IOMMU

In the TLB extension of the Markov prefetcher [46], the slots contain an entire page
table entry (including the virtual address and the translation). This is in contrast to
our IOMMU extension, where the slots contain only virtual addresses; the translations
corresponding to these addresses are brought by invoking a page walk for each address.
This difference is due to the fact that in the I/O context the translations are not available after a DMA transaction is finished, unlike the CPU context, where the virtual addresses exist until the process is killed or the memory is freed.
In addition to prefetching, the prefetcher's logic unit goes to the entry of the previous
virtual address that missed, and adds the current missed address into one of its s slots
(whichever is free). If all the slots are occupied, then it evicts one in accordance with
LRU policy. As a result, the s slots for each entry correspond to different virtual pages
that also missed immediately after this page. Since the table has limited entries, an
entry (row) could itself be replaced because of conflicts. A simplified hardware block
diagram implementation of the Markov prefetcher with s = 2 is given in Figure 5.3.
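The behavior just described can be summarized by the following Python sketch (our own behavioral model of the logic in Figure 5.3, not of its hardware): the prediction table maps a missed IOVA to s = 2 slots holding the IOVAs that recently missed right after it, rows and slots are replaced in LRU fashion, and the returned predictions are page-walked and placed in the prefetch buffer by the translate() wrapper sketched in Section 5.2.

from collections import OrderedDict

class MarkovPrefetcher:
    """Behavioral model of the Markov prefetcher adapted to the IOMMU (s = 2 slots).
    Slots hold IOVAs only; the caller page-walks each predicted IOVA (Section 5.3.3)."""
    def __init__(self, rows, s=2):
        self.rows, self.s = rows, s
        self.table = OrderedDict()               # tag (missed IOVA) -> up to s slot IOVAs
        self.prev_miss = None

    def on_miss(self, iova):
        # Look up the missed IOVA; an unseen IOVA gets a new row with empty slots.
        if iova not in self.table:
            self.table[iova] = []
            if len(self.table) > self.rows:
                self.table.popitem(last=False)   # rows themselves are replaced on conflict
        self.table.move_to_end(iova)
        predictions = list(self.table[iova])     # IOVAs to prefetch on this miss, if any
        # Record the succession: the current miss followed the previous one.
        if self.prev_miss is not None and self.prev_miss in self.table:
            slots = self.table[self.prev_miss]
            if iova in slots:
                slots.remove(iova)               # refresh the slot's position
            slots.append(iova)
            if len(slots) > self.s:
                slots.pop(0)                     # evict a slot in LRU fashion
        self.prev_miss = iova
        return predictions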

5.4 Recency Based Prefetching (RP)

While Markov prefetching schemes were initially proposed for caches, recency prefetching
has been proposed solely for TLBs [56, 46]. This scheme uses the memory accesses (and
misses on the TLB) as points on the time-line and works on the principle that after a
TLB miss of a virtual address with a given recency, the next TLB miss will likely be
of a virtual address with the same recency. The underlying assumption is that if an
application has sequential accesses to a set of data structures, then there is a temporal

ordering between the accesses to these structures, despite the fact that there may be no
ordering in either the virtual or physical address space.
To keep an updated data structure that contains all the page table entries ordered by their recency, the recency prefetcher builds an LRU stack of page table entries. The
hardware TLB maintains temporal ordering and naturally contains virtual addresses
with contemporary recency: if the TLB has N entries, then it implements the upper
N entries of the recency stack and the rest of the entries are located in the prefetcher
database (the LRU maintained in the main memory).

5.4.1 TLB hit

Assume we have a TLB with 4 entries, A, B, C, and D, corresponding to an LRU stack order of 1, 2, 3, and 4, respectively. If a virtual address C, which is the third address on
the recency stack, is accessed, then only the TLB is updated. C becomes the first in
the LRU order, A becomes second, B becomes third, and the rest of the recency stack,
which is in the main memory, remains the same as in Figure 5.4.

5.4.2 TLB miss

When accessing a virtual address that does not exist in the TLB (a TLB miss), the
required virtual address translation is removed from the in-memory LRU stack, inserted
into the top of the TLB, and the address has the most contemporary recency at that
point in time. All the following addresses are pushed down. As a result, the last entry

Figure 5.3: Schematic implementation of the Markov Prefetcher. ((1) P1 misses on the IOTLB after P0 was missed; (2) P1 is inserted into P0's predicted-address slots; (3) P2 and P3, the addresses predicted for P1, are prefetched to the prefetch buffer.)

Figure 5.4: Schematic depiction of the recency prefetcher on a TLB hit.

is evacuated from the TLB to make room for the inserted entry and inserted to the
in-memory stack. Thus, the only in-memory recency stack manipulations required are
those related to TLB misses. In addition to all these updates, the prefetch mechanism
prefetches 2 LRU entries with similar contemporary recency to the accessed virtual
address, namely the predecessor and the successor of the accessed virtual address before
it was removed from the in memory LRU stack.

Example

Assume we have a TLB with 4 entries, A, B, C, and D, with an LRU stack order of 1,
2, 3, and 4, respectively, and a memory recency stack with 6 entries, U, V, W, X, Y,
and Z, with an LRU stack order of 5, 6, 7, 8, 9, and 10, respectively; see Figure 5.5.
If virtual address Y, which is 9th in the LRU stack order, is accessed, the LRU stack
is updated as follows:

1. Y is removed from the in-memory recency stack and inserted to the first entry in
the TLB.

2. D is evacuated from the TLB and inserted to the first place in the in-memory
stack; namely, it becomes the 5th in the LRU stack order.

Figure 5.5: Schematic depiction of the recency prefetcher on a TLB miss.

3. A, B, and C stay in the TLB but become 2nd, 3rd, and 4th, respectively, in the LRU stack order. Since Y was missed in the TLB, prefetching is executed, that is to say, X and Z are inserted into the prefetch buffer.

5.4.3 Extension to IOMMU

As explained before, virtual address translations are available only during the DMA transactions and are unmapped right after the transaction is finished, so keeping the translations is useless. Since the virtual address allocator reuses the addresses after they are freed, it is probable that the recency prefetcher can predict miss patterns. We thus
propose holding the virtual addresses in the stack even if they are unmapped and, when
a prefetch is executed, looking for the translation in the page table and prefetching it
only if it exists.
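The following Python sketch captures this IOMMU variant of recency prefetching (our own behavioral model, not the pointer-in-page-table-entry scheme of [56]): the recency stack keeps IOVAs even after they are unmapped, a hit merely refreshes recency, and a miss predicts the stack neighbors of the missed IOVA; the translate() wrapper of Section 5.2 then prefetches only the predictions that are currently mapped. For brevity, that wrapper calls only on_miss; a complete model would also call on_hit on IOTLB hits.

class RecencyPrefetcher:
    """Behavioral model of recency-based prefetching adapted to the IOMMU."""
    def __init__(self):
        self.stack = []                          # index 0 = most recently used IOVA

    def on_hit(self, iova):
        # A hit only refreshes the recency of the IOVA (Section 5.4.1).
        if iova in self.stack:
            self.stack.remove(iova)
        self.stack.insert(0, iova)

    def on_miss(self, iova):
        predictions = []
        if iova in self.stack:
            pos = self.stack.index(iova)
            if pos + 1 < len(self.stack):
                predictions.append(self.stack[pos + 1])   # successor recency
            if pos > 0:
                predictions.append(self.stack[pos - 1])   # predecessor recency
            self.stack.pop(pos)
        self.stack.insert(0, iova)               # the missed IOVA becomes most recent
        return predictions                       # unmapped IOVAs are filtered by the caller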

5.5 Distance Prefetching (DP)


The distance prefetcher works on the hypothesis that if we could keep track of differences
(distances) between successive addresses, then we could make more predictions in a
smaller space than the previous prefetchers. Let us define distance as the difference between two addresses and assume that the sequence of accessed addresses is 0, 2, 10, 12, 20, 22, 30, 32. Then the distances of this sequence are: 2, 8, 2, 8, 2, 8, 2. Knowing that a distance of 2 is followed by a distance of 8 and vice versa allows us to predict, after only the third access (address 10), the rest of the sequence. This is the main idea behind the distance prefetcher.


Although the Markov prefetcher [43, 46] and recency prefetcher [56, 46] are well-
suited to learning accessing patterns in a system, they require considerably more space
than DP to hold the data when the domain becomes large (the domain includes more
addresses). In order to detect a pattern of a sequential scan, for example, MP and RP
need considerable space. They also take a while to learn a pattern, since a prefetch
is done only after repetitions in addresses. According to Kandiraju et al. [46], who
proposed it, the distance prefetcher can detect at least as many accessing patterns as
the recency and Markov prefetchers. Unlike the distance prefetcher, the Markov and
recency prefetchers cannot be used to predict the first access to a given address.

Figure 5.6: Schematic depiction of the distance prefetcher on a TLB miss. ((1) An IOVA misses and the current distance is calculated; (2) the current distance is inserted into the previous distance's row; (3) the addresses to prefetch are calculated by adding the predicted distances to the current IOVA; (4) their translations are fetched from the page table into the prefetch buffer.)

The hardware implementation is similar to the Markov prefetcher implementation, except that the prediction table contains distances instead of addresses, and calculations
are required to obtain these distances, as shown in Figure 5.6. The table rows are
indexed with distances and each row contains 2 predicted distances. The tag cells
represent the distance of the current missed virtual address from the last missed virtual
address and the 2 other cells contain the prediction of possible distances from the current
address. When a virtual address is missed on the TLB, the current distance is calculated
from the previous virtual address, which is kept in a register. The previous distance is
also kept in a register, and the current distance is inserted to its row. Then 2 distances
from the current distance row are taken in order to calculate 2 addresses which will be
prefetched to the prefetch buffer. Now that we have possible virtual addresses, we look for their translations in the page table and, if found, fetch them and store them in the prefetch buffer.
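A Python sketch of this logic (our own behavioral model of Figure 5.6, with IOVAs treated as page numbers; as before, it plugs into the translate() wrapper of Section 5.2):

from collections import OrderedDict

class DistancePrefetcher:
    """Behavioral model of the distance prefetcher: rows are indexed by the distance
    between consecutive misses and hold up to s distances that followed it."""
    def __init__(self, rows, s=2):
        self.rows, self.s = rows, s
        self.table = OrderedDict()               # distance -> up to s next distances
        self.prev_miss = None
        self.prev_distance = None

    def on_miss(self, iova):
        predictions = []
        if self.prev_miss is not None:
            distance = iova - self.prev_miss     # current distance (in pages)
            if self.prev_distance is not None:
                # Record that the current distance followed the previous distance.
                slots = self.table.setdefault(self.prev_distance, [])
                if distance in slots:
                    slots.remove(distance)
                slots.append(distance)
                if len(slots) > self.s:
                    slots.pop(0)
                self.table.move_to_end(self.prev_distance)
                if len(self.table) > self.rows:
                    self.table.popitem(last=False)
            # Predict by adding the distances seen after the current one to this IOVA.
            predictions = [iova + d for d in self.table.get(distance, [])]
            self.prev_distance = distance
        self.prev_miss = iova
        return predictions

Calling on_miss successively with 0, 2, 10, 12, 20, 22, 30, 32 reproduces the example: from the fourth call onward, the next page in the sequence is among the returned predictions.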

5.6 Evaluation
5.6.1 Methodology

We designed a simulator to simulate different types of TLBs and prefetchers. The simulator input is a specific IOMMU architecture (TLB and prefetcher) and a stream of events (map memory address, unmap memory address, and access to memory). The output is statistics of the IOMMU's performance. To supply the event stream to the simulator, we implemented an IOMMU emulation on KVM/QEMU to log the DMAs that the emulated I/O devices perform (we inserted print commands at the relevant locations in the QEMU code). While the logs were being written to files, we ran our benchmarks: Netperf TCP stream, Netperf UDP RR, Apache, and Memcached, which are described in Chapters 3 and 4. After the benchmarks finish running, we get a set of files that constitute the recorded benchmark runs and include all the information needed to simulate different IOTLB schemes. Finally, we implemented simulations of all the prefetchers described above. Given a prefetcher and a benchmark log file, the simulator parses the file, creates an event stream, and invokes the corresponding methods of the prefetcher for these events.
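Conceptually (our own illustration; the tuple format below does not match our QEMU log format verbatim), the simulator's core loop replays each recorded event against the translate() wrapper and LruCache sketched in Section 5.2 and counts the hit types:

def replay(events, iotlb, pb, prefetcher):
    """Replay a recorded DMA event stream against one simulated IOTLB configuration.
    `events` yields (kind, iova, pa) tuples, kind being "map", "unmap", or "access"."""
    page_table = {}
    stats = {"iotlb-hit": 0, "pb-hit": 0, "miss": 0}
    for kind, iova, pa in events:
        if kind == "map":
            page_table[iova] = pa
        elif kind == "unmap":
            page_table.pop(iova, None)           # the translation disappears with the DMA
            iotlb.invalidate(iova)               # stale cached entries are dropped as well
            pb.invalidate(iova)
        else:                                    # "access": a DMA read or write
            _, outcome = translate(iova, iotlb, pb, prefetcher, page_table)
            stats[outcome] += 1
    return stats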
The virtual machines of our modified QEMU used Ubuntu 12.04 server images and
were configured to use emulated e1000 NIC devices. Each IOMMU simulation included
a 32-entry fully associative LRU IOTLB and a 32-entry fully associative Prefetch Buffer.
The size of the prediction table of the prefetcher varied as shown in Figures 5.7, 5.8 and
5.9.

5.6.2 Results

In order to understand the results, one should understand what memory accesses are
included in a DMA transaction. First, the device reads the DMA descriptor, which resides in the Rx or Tx descriptor ring, gets the buffer address, and then accesses (writes or reads) that memory address. When the IOMMU is enabled, every access requires a translation, and
the IOMMU checks whether the translation is cached in the IOTLB or in the prefetch
buffer.
Each E1000 device in our virtual machines includes only 2 rings with 256 descriptors each. One ring takes 4KB of memory, and 1 page is sufficient to contain it. Hence, each virtual machine uses only 2 addresses for the NIC rings, and the maximum possible number of virtual addresses is 514: 2 addresses of the rings and 256 buffers in each ring (Rx and
Tx). When the NIC only receives data, there are only 258 virtual addresses: 2 addresses
of the rings and 256 Rx buffers. Note that the Rx ring contains the maximum number
of buffers and the Tx ring is empty.

Figure 5.7: Hit rate simulation of Apache benchmarks with message sizes of 1k (top) and 1M (bottom). (Each panel plots the hit rate [%] against the size of the prediction table for the recency, distance, and markov prefetchers, broken down into IOTLB hits on rings, IOTLB hits on buffers, and prefetch-buffer hits on buffers; the left column assumes the minimum number of accesses per transaction and the right column the maximum.)

We observed that the 2 addresses of the rings are inserted to the IOTLB and are
never invalidated. This is because 32 entries are sufficient for never reaching the point
where those 2 addresses are the least recently used in the IOTLB, and other entries are evicted before them. Hence there are 3 types of possible hits in our results:

1. IOTLB hit caused by accessing a ring descriptor;

2. IOTLB hit caused by accessing a buffer;

3. Prefetch buffer hit caused by accessing a buffer;

We present 2 versions of results for each benchmark, because the granularity of the
transferred buffer chunks varies from system to system and affects the miss rate of the
IOMMU. The right-hand columns in the figures show the results for systems with a
granularity of 64 bytes, which causes the maximum number of accesses per transaction,
while the left-hand columns include results for systems that have a chunk with the size
of the transferred buffer.
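To give a feel for the gap between the two extremes, consider a full-sized 1500-byte Ethernet frame (an illustrative figure of ours, not taken from the logs): with 64-byte chunks the device issues roughly
\[
\lceil 1500 / 64 \rceil = 24
\]
translated buffer accesses per packet, whereas a system that transfers the buffer as a single chunk issues just one buffer access (in addition to the descriptor access).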

Figure 5.8: Hit rate simulation of Netperf stream with message sizes of 1k (top) and 4k (bottom).

Markov prefetcher

The Markov prefetcher begins to be effective only when the prediction table has 256 or
more entries. Not coincidentally, 256 is the size of a ring in the E1000 and accounts for most of the virtual addresses that are in use by the NIC.

When sending a 1k buffer, only one Ethernet packet is required, but when sending a 1M buffer the buffer is split into about 700 Ethernet packets. Those 700 packets are entered one after the other into the Tx descriptor ring, which has 256 entries. The order of sending
256 packets repeats itself about 3 times, so the predictor can learn this pattern after
the first 1M buffer is sent. The results are better than Apache 1k (Figure 5.7). For the
same reason, when sending 1M buffers, 256 entries in the prediction table are sufficient
to obtain results as good as those obtained when 512 entries are used. Any number of
entries smaller than 256 results in a zero prefetch buffer hit rate because the periodicity
of the rings is 256.
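The packet count and the repetition factor follow directly from the standard 1500-byte Ethernet MTU (a back-of-the-envelope check of ours):
\[
\frac{2^{20}\ \mathrm{B}}{1500\ \mathrm{B/packet}} \approx 699 \ \text{packets},
\qquad
\frac{699\ \text{packets}}{256\ \text{descriptors}} \approx 2.7,
\]
so the Tx ring is traversed roughly three times per 1M buffer, which is precisely the periodicity that a 256-entry prediction table can capture.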

In all the other benchmarks, 256 prediction table entries are sufficient to obtain a prefetch buffer hit rate of about 95% (Figures 5.8, 5.9).

Figure 5.9: Hit rate simulation of Netperf RR (top) and Memcached (bottom).

The Recency prefetcher

As expected, all benchmarks obtain a hit rate of almost 100% when the prediction table
contains 512 entries and a lower hit rate when the prediction table contains at most 256
entries.
Because of the IOVA allocation scheme explained in Chapter 4, the Rx and Tx rings swap IOVAs. The recency of an IOVA that is accessed by both sending and receiving is always above 256. Hence, when running Netperf RR, 256 prediction table entries are not enough for the prefetcher to predict anything.
As for the Markov prefetcher, here, too, any prediction table with fewer than 256
entries will result in a 0% prefetch buffer hit rate because the periodicity of the rings is
256.

The Distance prefetcher

The distance prefetcher is not influenced by the addresses or by the periodicity of the
NIC descriptor rings. The distances between accessed addresses are such that only 8 prediction table entries are enough to predict something, and increasing that number does not really improve the prediction. Moreover, in some benchmarks (for example, the Netperf stream shown in Figure 5.8), increasing the number of prediction table entries
reduces the prefetch buffer hit rate, because the relevance of the access pattern changes
over time. Hence, the bigger the table is, the more irrelevant data is kept by it.

5.7 Measuring the cost of an Intel IOTLB miss

Experimental Setup

We used two identical Dell PowerEdge R210 II Rack Server machines that communicate
through Mellanox ConnectX3 40Gbps NICs. The NICs are connected back to back
and are configured to use Ethernet. We use one machine as the server and the other
as a workload generator client. Each machine has an 8GB 1333MHz memory and a single-socket 4-core Intel Xeon E3-1220 CPU running at 3.10GHz. The chipset is Intel C202, which supports VT-d, Intel's Virtualization Technology that provides IOMMU functionality. We configure the server to utilize one core only and turn off all power optimization sleep states (C-states) and dynamic voltage and frequency scaling (DVFS) to
avoid reporting artifacts caused by nondeterministic events. The machines run Ubuntu
12.04 with the Linux 3.11 kernel.
With the help of the ibverbs library [47, 38], we bypassed the entire network stack
and communicated directly with the NIC. This way we could measure the latency more
accurately.

Experimental results

We first measured the IOMMU latency of sending an Ethernet packet and then measured the packet's round-trip time. Each experiment was performed in two versions. In the
first, a buffer was iteratively and randomly selected from a large pool of previously
mapped buffers and transmitted, thus ensuring that the probability for the corresponding
IOVA to reside in the IOTLB is low. We call this version the miss-version. The second
version does the same but with a pool containing only one buffer, thus ensuring that the
IOTLB always hits. This version is called the hit-version. The two types of experiments
are detailed below:

1. Measuring the IOMMU hardware latency of sending an Ethernet packet:


On each iteration of this experiment we sent a packet and polled the NIC status
register until receiving notification that the packet was sent. Note that this was done
without the OS network stack.

The hit-version results are given in the first line of Table 5.1. The system takes about 3300 cycles to send 1 packet. As expected, the hit-version results are the same with and without the IOMMU. The miss-version shows a difference of about 1000 cycles between running the experiment with the IOMMU enabled and running it with the IOMMU disabled. Thus, we can learn that the IOTLB miss latency for sending a packet is 0.3 microseconds in our machine.

Intel IOTLB hit rate   Intel IOMMU disabled   Intel IOMMU enabled
100%                   3343                   3357
0%                     5423                   6399

Table 5.1: Hardware latency of sending an Ethernet packet [cycles]

2. Measuring the round-trip time of an Ethernet packet: We used the application ibv_rc_pingpong, which can be found at openfabrics.org, under the
examples directory. This application sets up a connection between two nodes
running Mellanox adapters and transfers data. The client sends a message and the
server acknowledges it by sending back a corresponding message. Only after the acknowledgement message arrives at the client does it send another message. We repeated this experiment with pools of previously mapped buffers of different sizes
(both on the client and server side).

Figure 5.10: Subtraction between the RTT when the IOMMU is enabled and the RTT when the IOMMU is disabled.

Figure 5.10 shows the difference between the round-trip time (RTT) with the IOMMU enabled and with it disabled, as a function of the buffer pool size. The first thing we notice
is that with more than 32 buffers in the pool, the RTT delta jumps from 0.6 to 1.2
microseconds. From this we learn that a buffer pool size larger than 32 causes IOTLB

misses and the size of the IOTLB is between 32 and 64. We also learn that the IOTLB
miss penalty for a round trip is 0.6 microseconds. Finally, we learn that there is an
additional cost of 0.6 microseconds due to the private data structure of the NIC.
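The per-packet and round-trip measurements are consistent with one another (a back-of-the-envelope check of ours): at the server's 3.10GHz clock,
\[
\frac{\sim 1000\ \text{cycles}}{3.10\ \text{GHz}} \approx 0.32\ \mu\mathrm{s},
\]
which matches the 0.3 microseconds attributed above to a single IOTLB miss on transmit, and two such misses over a round trip would account for the 0.6 microsecond miss component of the RTT delta.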

Chapter 6

Conclusions

6.1 rIOMMU
The IOMMU design is similar to that of the regular MMU, despite the inherent differences
in their workloads. This design worked reasonably well when I/O devices were relatively
slow compared to the CPU, but it hampers the performance of contemporary devices like 10/40 Gbps NICs. We foresee that this problem will get worse due to the ever-increasing speed of such devices. We thus contend that it makes sense to rearchitect the IOMMU such that it directly supports the unique characteristics of its workload. We propose rIOMMU as an example of such a redesign and show that the benefits of
using it are substantial.

6.2 EiovaR
The IOMMU has been falsely perceived as the main responsible party for the significant
overheads of intra-OS protection. We find that the OS is equally to blame, suffering
from the long-term ring interference pathology that makes its IOVA allocations slow.
We exploit the ring-induced nature of IOVA (de)allocation requests to design a simple
optimization called EiovaR. The latter eliminates the harmful effect and makes the
baseline allocator orders of magnitude faster. It improves the performance of common
benchmarks by up to 4.6x.

6.3 Reducing the IOTLB Miss Overhead


Both recency and Markov, in our case, require a prediction table with a minimum size of 256, which is the size of the descriptor ring. This size will increase to thousands if the system includes a 40 Gbps NIC. The hit rate improvement will not be worth this price. In addition, the prefetch buffer hardware can be used as an IOTLB instead of adding a prefetcher to the system, which can be helpful when using multiple devices. The distance prefetcher is a good compromise, and only a few entries per device are enough to improve the hit rate by up to 12%. But, as shown in Chapter 3, by changing the page table data structure to a ring and using only two cache entries per ring, it is possible to predict 100% of the accessed addresses, and this is the best solution for devices that use rings.

Bibliography

[1] Dennis Abts and Bob Felderman. A guided tour through data-center net-
working. ACM Queue, 10(5):10:10-10:23, May 2012.

[2] Brian Aker. Memslap - load testing and benchmarking a server. http://docs.
libmemcached.org/bin/memslap.html. libmemcached 1.1.0 documentation.

[3] AMD Inc. AMD IOMMU architectural specification, rev 2.00. http://
support.amd.com/TechDocs/48882.pdf, Mar 2011.

[4] Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir, and Assaf Schuster. vIOMMU:
efficient IOMMU emulation. In USENIX Ann. Technical Conf. (ATC), pages
73-86, 2011.

[5] Nadav Amit, Muli Ben-Yehuda, and Ben-Ami Yassour. IOMMU: Strategies for mitigating the IOTLB bottleneck. In Proceedings of the 2010 International Conference on Computer Architecture, ISCA'10, pages 256-274, Berlin,
Heidelberg, 2012. Springer-Verlag.

[6] Thomas E. Anderson, Henry M. Levy, Brian N. Bershad, and Edward D.


Lazowska. The interaction of architecture and operating system design. In
Proceedings of the Fourth International Conference on Architectural Support
for Programming Languages and Operating Systems, ASPLOS IV, pages
108-120, New York, NY, USA, 1991. ACM.

[7] Anonymized. Efficient IOVA allocation. Submitted, 2014.

[8] Apachebench. http://en.wikipedia.org/wiki/ApacheBench.

[9] Apple Inc. Thunderbolt device driver programming guide: Debugging VT-d
I/O MMU virtualization. https://developer.apple.com/library/mac/
documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/
DebuggingThunderboltDrivers/DebuggingThunderboltDrivers.html,
2013. (Accessed: May 2014).

[10] Apple Inc. Thunderbolt device driver programming guide: Debugging VT-d
I/O MMU virtualization. https://developer.apple.com/library/mac/

documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/
DebuggingThunderboltDrivers/DebuggingThunderboltDrivers.html,
2013. (Accessed: May 2014).

[11] ARM Holdings. ARM system memory management unit archi-


tecture specification SMMU architecture version 2.0. http:
//infocenter.arm.com/help/topic/com.arm.doc.ihi0062c/IHI0062C_
system_mmu_architecture_specification.pdf, 2013.

[12] Kavita Bala, M. Frans Kaashoek, and William E. Weihl. Software prefetching
and caching for translation lookaside buffers. In USENIX Ann. Technical
Conf. (ATC), pages 243-253, 1994.

[13] Thomas Ball, Ella Bounimova, Byron Cook, Vladimir Levin, Jakob Lichten-
berg, Con McGarvey, Bohus Ondrusek, Sriram K. Rajamani, and Abdullah
Ustuner. Thorough static analysis of device drivers. In ACM EuroSys, pages
73-85, 2006.

[14] Michael Becher, Maximillian Dornseif, and Christian N. Klein. FireWire: all
your memory are belong to us. In CanSecWest applied security conference,
2005.

[15] Muli Ben-Yehuda, Jimi Xenidis, Michal Ostrowski, Karl Rister, Alexis Bruem-
mer, and Leendert van Doorn. The price of safety: Evaluating IOMMU
performance. In Ottawa Linux Symp. (OLS), pages 9-20, 2007.

[16] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R.


Wilson. Hoard: A scalable memory allocator for multithreaded applications.
In ACM Intl Conf. on Architecture Support for Programming Languages &
Operating systems (ASPLOS), pages 117-128, 2000.

[17] James E.J. Bottomley. Dynamic DMA mapping using the generic device.
https://www.kernel.org/doc/Documentation/DMA-API.txt. Linux ker-
nel documentation.

[18] Randal E. Bryant and David R. O'Hallaron. Computer Systems: A Programmer's Perspective. Addison-Wesley Publishing Company, USA, 2nd edition, 2010.

[19] Brian D. Carrier and Joe Grand. A hardware-based memory acquisition


procedure for digital investigations. Digital Investigation, 1(1):50-60, Feb 2004.

[20] Thadeu Cascardo. DMA API Performance and Contention on IOMMU


Enabled Environments. https://events.linuxfoundation.org/images/
stories/slides/lfcs2013_cascardo.pdf.

[21] Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson Engler.
An empirical study of operating systems errors. In ACM Symp. on Operating
Systems Principles (SOSP), pages 73-88, 2001.

[22] Cisco. Understanding switch latency. http://www.cisco.com/c/en/us/


products/collateral/switches/nexus-3000-series-switches/white_
paper_c11-661939.html, Jun 2012. White paper. Accessed: Aug 2014.

[23] Jonathan Corbet. Linux Device Drivers, chapter 15: Memory Mapping and
DMA. O'Reilly, 3rd edition, 2005.

[24] John Criswell, Nicolas Geoffray, and Vikram Adve. Memory safety for low-
level software/hardware interactions. In USENIX Security Symp., pages
83-100, 2009.

[25] Demartek, LLC. QLogic FCoE/iSCSI and IP networking adapter


evaluation (previously: Broadcom BCM957810 10Gb). http:
//www.demartek.com/Reports_Free/Demartek_QLogic_57810S_FCoE_
iSCSI_Adapter_Evaluation_2014-05.pdf, May 2014. (Accessed: May
2014).

[26] The Apache HTTP server project. http://httpd.apache.org.

[27] Roy T. Fielding and Gail Kaiser. The Apache HTTP server project. IEEE
Internet Computing, 1(4):88-90, Jul 1997.

[28] Brad Fitzpatrick. Distributed caching with memcached. Linux J., 2004(124),
Aug 2004.

[29] Brice Goglin. Design and implementation of Open-MX: High-performance


message passing over generic Ethernet hardware. In IEEE Intl Parallel &
Distributed Processing Symp. (IPDPS), 2008.

[30] Abel Gordon, Nadav Amit, Nadav Har'El, Muli Ben-Yehuda, Alex Landau,
Assaf Schuster, and Dan Tsafrir. ELI: Bare-metal performance for I/O
virtualization. In ACM Intl Conf. on Architecture Support for Programming
Languages & Operating systems (ASPLOS), pages 411-422, 2012.

[31] Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S.
Tanenbaum. Failure resilience for device drivers. In IEEE/IFIP Ann. Intl
Conf. on Dependable Syst. & Networks (DSN), pages 41-50, 2007.

[32] Brian Hill. Integrating an EDK custom peripheral with a LocalLink interface
into Linux. Technical Report XAPP1129 (v1.0), XILINX, May 2009.

[33] The hoard memory allocator. http://www.hoard.org/. (Accessed: May


2014).

[34] HP Development Company. Family data sheet: Broadcom NetXtreme network
adapters for HP ProLiant Gen8 servers. http://www.broadcom.com/docs/
features/netxtreme_ethernet_hp_datasheet.pdf, Aug 2013. Rev. 2. (Accessed: May 2014).

[35] Jerry Huck and Jim Hays. Architectural support for translation table man-
agement in large address space machines. In Proceedings of the 20th Annual
International Symposium on Computer Architecture, ISCA, pages 39-50, New
York, NY, USA, 1993. ACM.

[36] IBM Corporation. PowerLinux servers 64-bit DMA concepts.


http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liabm/
liabmconcepts.htm. (Accessed: May 2014).

[37] IBM Corporation. AIX kernel extensions and device support programming con-
cepts. https://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/
com.ibm.aix.kernelext/doc/kernextc/kernextc_pdf.pdf, 2013. (Accessed: May 2014).

[38] ibverbs evaluation. http://www.scalalife.eu/book/export/html/434.


Accessed: Aug 2014.

[39] Intel Corporation. Intel virtualization technology for directed I/O,


architecture specification, rev. 1.2.
http://www.intel.com/content/dam/www/public/us/en/documents/
product-specifications/vt-directed-io-spec.pdf, Sep 2008.

[40] Intel Corporation. Intel virtualization technology for directed I/O,


architecture specification, rev. 2.2.
http://www.intel.com/content/dam/www/public/us/en/documents/
product-specifications/vt-directed-io-spec.pdf, Sep 2013.

[41] Bruce L. Jacob and Trevor N. Mudge. A look at several memory management
units, tlb-refill mechanisms, and page table organizations. In Proceedings of the
Eighth International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS, pages 295-306, New York, NY,
USA, 1998. ACM.

[42] Rick A. Jones. A network performance benchmark (revision 2.0). Tech-


nical report, Hewlett Packard, 1995. http://www.netperf.org/netperf/
training/Netperf.html.

[43] Doug Joseph and Dirk Grunwald. Prefetching using Markov predictors. In
Intl Symp. on Computer Archit. (ISCA), pages 252-263, 1997.

[44] Asim Kadav, Matthew J. Renzelmann, and Michael M. Swift. Tolerating
hardware device failures in software. In ACM Symp. on Operating Systems
Principles (SOSP), pages 59-72, 2009.

[45] Gokul B. Kandiraju and Anand Sivasubramaniam. Characterizing the d-TLB


behavior of SPEC CPU2000 benchmarks. In ACM SIGMETRICS Intl Conf.
on Measurement & Modeling of Comput. Syst., pages 129-139, Jun 2002.

[46] Gokul B. Kandiraju and Anand Sivasubramaniam. Going the distance for
TLB prefetching: An application-driven study. In Intl Symp. on Computer
Archit. (ISCA), pages 195-206, 2002.

[47] Gregory Kerr. Dissecting a small InfiniBand application using the Verbs
API. Computing Research Repository (arxiv), abs/1105.1827, 2011. http:
//arxiv.org/abs/1105.1827.

[48] Doug Lea. A memory allocator. http://g.oswego.edu/dl/html/malloc.


html, 2000.

[49] Doug Lea. malloc.c. ftp://g.oswego.edu/pub/misc/malloc.c, Aug 2012.


(Accessed: May 2014).

[50] Joshua LeVasseur, Volkmar Uhlig, Jan Stoess, and Stefan Gotz. Unmodified
device driver reuse and improved system dependability via virtual machines.
In USENIX Symp. on Operating System Design & Implementation (OSDI),
pages 17-30, 2004.

[51] Vinod Mamtani. DMA directions and Win-


dows. http://download.microsoft.com/download/a/f/d/
afdfd50d-6eb9-425e-84e1-b4085a80e34e/sys-t304_wh07.pptx, 2007.
(Accessed: May 2014).

[52] David S. Miller, Richard Henderson, and Jakub Jelinek. Dynamic


DMA mapping guide. https://www.kernel.org/doc/Documentation/
DMA-API-HOWTO.txt. Linux kernel documentation.

[53] Jang Suk Park and Gwang Seon Ahn. A software-controlled prefetching mech-
anism for software-managed TLBs. Microprocess. Microprogram., 41(2):121-136,
May 1995.

[54] PCI-SIG. Address translation services revision 1.1. https://www.pcisig.


com/specifications/iov/ats, Jan 2009.

[55] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind
Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The
operating system is the control plane. Technical Report UW-CSE-13-10-01,
University of Washington, Jun 2014.

[56] Ashley Saulsbury, Fredrik Dahlgren, and Per Stenstrom. Recency-based TLB
preloading. In Intl Symp. on Computer Archit. (ISCA), pages 117-127, 2000.

[57] Arvind Seshadri, Mark Luk, Ning Qu, and Adrian Perrig. SecVisor: A tiny
hypervisor to provide lifetime kernel code integrity for commodity OSes. In
ACM Symp. on Operating Systems Principles (SOSP), pages 335-350, 2007.

[58] Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling
with exception-less system calls. In USENIX Symp. on Operating System
Design & Implementation (OSDI), pages 33-46, 2010.

[59] Michael Swift, Brian Bershad, and Henry Levy. Improving the reliability
of commodity operating systems. ACM Trans. on Comput. Syst. (TOCS),
23(1):77-110, Feb 2005.

[60] Fujita Tomonori. Intel IOMMU (and IOMMU for Virtualization) perfor-
mances. https://lkml.org/lkml/2008/6/5/164.

[61] Transpacket Fusion Networks. Ultra low latency of 1.2 microsec-


onds for 1G to 10G Ethernet aggregation. http://tinyurl.com/
transpacket-low-latency, Dec 2012. Accessed: Aug 2014.

[62] Carl Waldspurger and Mendel Rosenblum. I/O virtualization. Comm. of the
ACM (CACM), 55(1):66-73, Jan 2012.

[63] Dan Williams, Patrick Reynolds, Kevin Walsh, Emin Gun Sirer, and Fred B.
Schneider. Device driver safety through a reference validation mechanism.
In USENIX Symp. on Operating System Design & Implementation (OSDI),
pages 241-254, 2008.

[64] Paul Willmann, Scott Rixner, and Alan L. Cox. Protection strategies for
direct access to virtualized I/O devices. In USENIX Ann. Technical Conf.
(ATC), pages 15-28, 2008.

[65] Rafal Wojtczuk. Subverting the Xen hypervisor. In Black Hat, 2008.
http://www.blackhat.com/presentations/bh-usa-08/Wojtczuk/BH_
US_08_Wojtczuk_Subverting_the_Xen_Hypervisor.pdf. (Accessed: May
2014).

[66] Ben-Ami Yassour, Muli Ben-Yehuda, and Orit Wasserman. On the DMA
mapping problem in direct device assignment. In ACM Intl Systems &
Storage Conf. (SYSTOR), pages 18:1-18:12, 2010.
