by
Jonathan Simon Kay
Committee in charge:
1995
Copyright
Jonathan Kay, 1994.
All rights reserved.
The dissertation of Jonathan Simon Kay is approved,
and it is acceptable in quality and form for publication
on microfilm:
Chair
To my friends.
Chapter I: Introduction 13
1.1 Introduction 13
1.2 The Importance of Latency 14
1.3 Latency Optimization 16
1.4 Throughput Optimization 16
1.5 Organization 16
Chapter II: Related Work 18
2.1 Introduction 18
2.2 Optimizing Data-Touching Overheads 18
2.2.1 Checksumming 18
2.2.2 Copying From User to Kernel 19
2.2.3 Copying From Kernel to User 20
2.3 Optimizing Non-Data-Touching Overheads 21
2.4 Network Software Structure 22
2.4.1 Modularity 23
2.4.2 Control Flow 23
2.4.3 Buffer Descriptors 24
2.5 Performance Measurements 25
2.6 Workload Measurement 25
2.7 Conclusions 26
Chapter III: Workloads and Overheads 27
3.1 Introduction 27
3.2 Workload 27
3.2.1 LAN Traffic 27
3.2.2 WAN Traffic 30
3.3 Overhead Categories 30
3.4 Experimental Setup 32
3.5 Packet Size Effects 34
3.5.1 Bottlenecks to Minimal Latency 36
3.5.2 The Importance of Non-Data-Touching Overheads 38
3.5.2.1 Difficulty of Optimizing Non-Data-Touching Overheads 39
3.6 In-Depth Analysis 40
3.6.1 Touching Data 41
3.6.2 Protocol-Specific Processing 43
3.6.3 Mbufs 43
3.6.4 Operating System Overheads 44
3.6.5 Error Checking 45
3.6.6 Data Structure Manipulations 45
3.7 Conclusions 47
Chapter IV: Optimizing Non-Data-Touching Overhead 49
4.1 PathIDs 49
4.2 PathID Implementation 50
4.2.1 New Link Layer 50
4.2.2 PathID Fast Lookup 52
4.2.3 Flow of Control 53
4.3 PathID Savings in Functionality 54
4.4 PathID Weaknesses 58
4.5 Performance 59
4.6 Conclusions 60
Chapter V: Optimizing Data-Touching Overhead 61
5.1 Introduction 61
5.2 Checksum Redundancy Avoidance 61
5.2.1 Eliminating Checksum Redundancy 62
5.2.2 UDP Implementation 63
5.2.3 TCP Implementation 63
5.2.4 Performance Improvement 64
5.2.5 Detecting Locality With Certainty 64
5.3 DMA Redirection 65
5.3.1 DMA Redirection 66
5.3.3 Early Packet Detection By Polling From the Idle Loop 69
5.3.4 Results 72
5.4 Eliminating User-to-Kernel Copying 72
5.4.1 Modern Network Data Movement Design 73
5.4.2 Nocopyin 77
5.4.3 Results 82
5.4.4 DMA vs. PIO for High Performance Adapters 84
5.5 Conclusions 85
Chapter VI: Conclusions 86
6.1 Problem Summary 86
6.2 Summary of Results 86
6.3 Future Work 87
6.4 Bibliographical References 89
ACKNOWLEDGEMENTS
Hans-Werner Braun and the National Science Foundation helped by
giving me packet trace data for the NSF backbone and other backbone and
LAN environments around the San Diego Supercomputer Center and UCSD.
This made possible the analysis of WAN and backbone packet processing time
and strengthened the LAN analysis.
Kim Claffy helped greatly with impressively thorough proofreading
jobs. Furthermore, her own research into flow lengths agrees with the implica-
tions of this work that the common mental model of TCP/IP traffic as a series
of long data transfers is woefully inaccurate and in need of update.
Tuong Nguyen assisted by finding out how to use FrameMaker’s more
esoteric features when we were writing our theses at the same time.
Keith Muller was of particular help; he helped with the design and
debugging of the experiments and various of the schemes described in this dis-
sertation. I would also like to thank him for teaching me about the Berkeley
Unix kernel and for providing encouragement when I seemed to be at a dead
end with the logic analyzer.
Vernon Schryver and Kevin Fall have been helpful in finding problems
with checksum redundancy avoidance. Max Okumoto helped me in finding the
locality detection algorithm that lays those problems to rest.
Peter Desnoyers helped in writing the fast MIPS checksum routine by
finding an algorithm that reduced per-word processing from my algorithm’s
five instructions to his algorithm’s four instructions.
Digital Equipment Corporation provided generous support to Project
Sequoia 2000, with which this work is intricately bound.
Bob Querido, Farhad Razavian, and Bob Cardinal, of Loral Test and
Information Systems, provided me with an education in network hardware and
market characteristics rare in a graduate student.
Lastly and most importantly, I’d like to acknowledge the role played by
my advisor, Joe Pasquale. This work was made possible by the manner in
which he provided an atmosphere of ideas, was always supportive, encouraged inde-
pendent thinking, was generous and humane with financial support, and was able to
acquire and make available to us that rarest of resources in an academic environment
- equipment sufficient for advanced operating systems research.
VITA
PUBLICATIONS
J. Kay, “Network Sloth: Where the Time Goes in Network Software,” Technical
Report CS92-238, University of California, San Diego, December 1991.
J. Kay, “An Empirical Study of Using Certification Trails,” Master’s Thesis, The
Johns Hopkins University, August 1990.
FIELDS OF STUDY
Studies in Networking
Professors Joseph Pasquale and George Polyzos
ABSTRACT OF THE DISSERTATION
by
Jonathan Simon Kay
scientific data management and digital video transmission. This dissertation describes
techniques that greatly reduce processing time for large packets, improving throughput,
by largely eliminating all three major CPU-moderated data-touching operations:
checksumming, copying data from kernel to user buffers, and copying data from user
to kernel buffers. Checksumming is largely eliminated through a technique which
avoids checksumming in the frequent case in which it is redundant with the CRC of a
LAN adapter. Copying from kernel to user buffer can be largely eliminated by redirect-
ing DMA pointers concurrently with packet delivery. Copying from user to kernel
buffer can be avoided by arranging for adapters to DMA directly from user buffers.
Chapter I: Introduction
1.1 Introduction
Network performance is important to an increasing number of com-
puter applications. Many applications written without networking in mind are
turned into network applications through access to network file systems
BibRef[62]. Some newer applications are being written with networking in
mind, even designed with the expectation of a certain level of network perfor-
mance, such as multimedia applications BibRef[15], distributed applications
based upon LANs BibRef[5]BibRef[63], and scientific applications using dis-
tributed data BibRef[61]. Similarly, network hardware performance is climb-
ing to meet the demands of these new applications, with the advent of 100-
megabit Ethernet, ATM, and fast processors to drive them.
By contrast, network software performance has not risen to meet the
challenge implied by the new applications and network hardware. Network
software latency is especially disappointing - it was high compared to network
hardware latency even shortly after the invention of the Ethernet BibRef[60],
and has declined comparatively slowly since. Seven years have passed between
the introduction of the Sun 3/60 (1987) and the Alpha 3000/800 (1994). The
Alpha is roughly 40 times faster than a Sun 3/60, implying that CPU speeds
have improved by an average factor of 1.7 each year, yet the network latency
between a pair of the Alphas is only a fifth that between a pair of the Sun 3s,
implying that network latency has only decreased by an average factor of 1.3
each year.
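As a check on the arithmetic, compounding each factor over the seven-year span gives

\[ 40^{1/7} \approx 1.7 \qquad \text{and} \qquad 5^{1/7} \approx 1.26, \]

consistent with the yearly improvement factors quoted above.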
This dissertation introduces PathIDs, a means of reducing network
latency. Network latency poses a barrier to the increasing use of networks for
a wide variety of applications such as cheap parallel processing
BibRef[63]BibRef[66] and scientific computing BibRef[27]. Additionally, high
latency hampers many other aspects of networking, as it causes a tradeoff
between application implementation ease and performance. Programs based
on simple remote procedure calls are notorious for poor performance. Protocols
carefully designed for performance over high-latency links, such as TCP, are
often relatively complex, require large buffer spaces to achieve high through-
puts, and require occasional modifications such as the provisions for increased
window space specified in BibRef[7].
whereas small packets usually carry control information. On both the LAN and
WAN, small packets far outnumber large packets. In fact, even though process-
ing a large packet requires more time, the high proportion of small packets
causes non-data-touching processing time to dominate processing of the overall
workload.
There are two philosophical reasons why network software latency improvement
has not historically been a high priority with the network research community;
however, network measurement and observation bear out neither of them. There
are those who believe that latency is a problem that will
solve itself as machines speed up, whereas throughput will become a worse
problem as data movement becomes more difficult relative to computation
BibRef[50]. However, as noted in the introduction, network software latency is
by no means keeping pace with CPU speed. This is because system software, of
which network software is a subset, exhibits poor memory locality
BibRef[2]BibRef[17] and includes varieties of processing not considered to be
“common cases” by RISC CPU designers such as interrupts and context
switches.
Another reason why network software latency improvement has errone-
ously been considered to be unimportant is that it was believed that communication
patterns are more or less uniformly distributed and the increasing diameter of
the Internet would inevitably lead to higher network delays due to the limita-
tion imposed by the speed of light. As it turns out, there is a great deal of local-
ity within TCP/IP networks today; most packets go no further than the LAN on
which they are generated. A related argument is that the explosion of high-
speed WANs will make wide-area communication much more common than
local-area communication. This ignores the question of whether social and
administrative rather than performance issues produced the locality observed
today.
1.5 Organization
The dissertation is organized as follows. Chapter 2 describes related work.
Chapter 3 describes processing time breakdowns under realistic workloads. Chap-
ter 4 concerns a technique for reducing non-data-touching overheads, PathIDs.
2.2.1 Checksumming
Checksumming is often the most time-consuming BibRef[19] of the data-
touching overheads because it is tricky to implement efficiently, is not as pressing
for an operating system vendor to optimize as other bulk data operations such as
bcopy, and requires architectural carry bit support to perform as quickly as read-
ing data.
There are a variety of strategies for reducing time spent checksumming.
One is to take advantage of processor support to minimize the number of instruc-
tions required. BibRef[8] suggests how this may be done on CISC and vector archi-
tectures, taking advantage of carry bits, and gathering operations into either
processor word sizes or vector lengths on vector architectures. However, optimizing
checksum computations on RISC architectures requires additional techniques,
both to optimize memory accesses, and to reduce the number of instructions
required on RISC architectures lacking carry bits. BibRef[41] describes how check-
summing can be optimized on RISC architectures, taking advantage of pipelines,
load delay slots, and scoreboarding to minimize memory access delays; BibRef[39]
describes an algorithm for checksumming with minimal loss of performance on
architectures lacking carry bits.
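By way of illustration, here is a minimal, deliberately unoptimized C sketch of the folding technique, written from the standard Internet checksum definition rather than taken from the routines of BibRef[39] or BibRef[41]: 16-bit words are summed into a 32-bit accumulator, so no carry bit is needed, and the accumulated carries are folded back in at the end.

    #include <stddef.h>
    #include <stdint.h>

    /* Internet checksum without reliance on an architectural carry bit:
     * carries pile up in the high half of a 32-bit accumulator and are
     * folded back into the low half afterwards. */
    uint16_t inet_checksum(const uint16_t *buf, size_t nwords)
    {
        uint32_t sum = 0;

        while (nwords-- > 0)
            sum += *buf++;                   /* no carry bit needed */

        sum = (sum >> 16) + (sum & 0xffff);  /* fold carries into low half */
        sum += (sum >> 16);                  /* fold any final carry */
        return (uint16_t)~sum;               /* one's complement result */
    }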
An alternate strategy is to implement the Internet checksum on the net-
work adapter BibRef[4]BibRef[38]BibRef[57], but that requires extra adapter
hardware or extra muscle from an onboard CPU. For example, DEC’s FDDI
adapter, which does not support onboard Internet checksum processing, is based
on the relatively simple 16-bit MC68000 processor BibRef[55]BibRef[65]. SGI’s
FDDI adapters, which do support the Internet checksum, require a RISC processor
- the 32-bit AMD29000. The Silicon Graphics adapters further have a reputation
for being expensive, though this could be due to other factors such as Silicon
Graphics’ choices of I/O bus and business model.
Still another scheme is to merge copying and checksumming to reduce traf-
fic between memory and the CPU register file
BibRef[19]BibRef[22]BibRef[35]BibRef[51]. Instead of being implemented as sep-
arate routines, called separately, copying and checksumming are implemented as
a single operation that operates on a word in a buffer by reading it into a register,
adding it to a sum register, and then writing the word out to
some other location. This saves a read operation over performing checksumming
and copying separately. On some machines, such as machines based upon SPARC
and PA-RISC processors, this strategy is sufficient to result in the effective elimi-
nation of checksum overhead BibRef[35], but it does not work as well on other
machines, such as machines based on MIPS and Alpha processors
BibRef[41]BibRef[57]BibRef[16].
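A minimal sketch of the merged loop follows; real implementations operate on full machine words with unrolled loops, so this shows only the structure. Each word is read from memory once, added into the running sum, and written to the destination, saving the extra read that separate copy and checksum passes would cost.

    #include <stddef.h>
    #include <stdint.h>

    /* Merged copy-and-checksum: one read per word instead of two. */
    uint16_t copy_and_checksum(uint16_t *dst, const uint16_t *src,
                               size_t nwords)
    {
        uint32_t sum = 0;

        while (nwords-- > 0) {
            uint16_t w = *src++;   /* single read from the source buffer */
            sum += w;              /* checksum on the way through */
            *dst++ = w;            /* write to the destination buffer */
        }
        sum = (sum >> 16) + (sum & 0xffff);
        sum += (sum >> 16);
        return (uint16_t)~sum;
    }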
include several megabytes of VRAM from which the kernel allocates network buff-
ers. Data is moved between the host and the adapter through the normal copy
operation between user and kernel buffers. The data is in fact later moved within
the adapter from the VRAM to the MAC chip FIFO, but CPU performance is not
affected because only adapter hardware resources are engaged. Since the adapter
memory is effectively dual-ported, there is no performance degradation due to con-
tention for the adapter memory. The disadvantages of this design are increased
cost relative to simpler competitive adapters and the fact that PIO is slow on many
hosts because of bus transaction overheads on each cache line/write buffer move-
ment - for example, DMA is ten times faster than programmed I/O across the DEC-
station I/O bus BibRef[25].
the first packet - by the time processors are fast enough, 100baseT or ATM is likely
to be the dominant LAN medium. Furthermore, the common use of the select call
for synchronization in Berkeley Unix based networking code means that Unix pro-
cesses waiting for data frequently do not specify the buffer in which they want
incoming data to be placed until after the data has arrived.
Another approach is to place a context ID field in the link layer header that
the adapter knows about BibRef[6]. The adapter decides where to DMA incoming
packets based on the ID field, and the ID field corresponds to a transport protocol
session, which in turn can keep track of a list of buffers of waiting processes and
set DMA destinations accordingly.
software subsystem in wide use. BibRef[23] also describes the design of a network
software subsystem, but it is not oriented towards high performance.
Most publications on network software structure concern a specific area of
design. Existing research is particularly strong on a few points of traditional diffi-
culty in networking software: modularity, control flow, and buffer descriptors.
2.4.1 Modularity
The software engineering difficulties associated with network subsystems
are formidable, as network subsystems usually comprise large bodies of code. As
in many other areas of operating system design, modularity has traditionally been
employed to combat these problems BibRef[19]. Sometimes modularity has been
taken to extremes that can damage performance: for example, in Xinu BibRef[23],
each protocol has its own thread and input/output queues. More modestly, streams
BibRef[54] is a system in which each protocol can have its own coroutines and
queues, but does not have to; still, some implementors of streams-based TCP/IP try
to reduce protocol boundary penalties by placing multiple protocols within a single
streams module BibRef[37]. Berkeley Unix largely uses procedure calls, source
code file boundaries, and naming conventions to achieve modularity BibRef[43]. It
has been argued that even this level of modularity impedes performance;
BibRef[19]BibRef[22] suggest violating layer boundaries to combine copying and
checksumming, while BibRef[35] suggests combining layers into fastpaths that
take advantage of each path's limited scope of applicability to reduce the
amount of functionality, and thus processing time, required to process packets
eligible for a fastpath.
In Berkeley Unix, device driver processing occurs in the receive interrupt
thread of control, sockets (the interface between user processes and kernel
networking code) processing occurs in the application process's thread of control,
but protocol processing is largely done in the thread of control of a software inter-
rupt generated during the receive interrupt. The additional software interrupt is
useful because it has a lower priority than network interrupts, so adapters with
small numbers of receive buffers can free those buffers as quickly as possible. How-
ever, most recently designed adapters are equipped with adequate receive buffers,
so recently there have been proposals to eliminate the software network interrupt
BibRef[51].
Some systems have more complicated flows of control - Xinu BibRef[23]
embeds each protocol in its own thread, Mach BibRef[1] moves network processing
to a separate user process, and streams BibRef[54] uses a flexible coroutines mech-
anism.
2.7 Conclusions
In summary, as this chapter has shown, most aspects of network
software performance have been investigated to some degree, but issues in reducing
minimal software latency in general-purpose protocol suites have been examined
less thoroughly than issues in raising maximum software throughput. This
imbalance in effort is probably a contributor to the imbalance between network
throughput and latency. Recently, improvements in network throughput have been
seen from the deployment of adapters supporting hardware checksumming and
onboard kernel buffers. During the same time frame, relatively little improvement
has been seen in network latency, despite the increase in the ratio of
processor speed to memory speed BibRef[50].
3.2 Workload
To determine realistic TCP and UDP packet size distributions, packet
traces of two different FDDI networks, each used for a different purpose, were
obtained. One trace, which reflects wider usage, is of traffic destined for and gen-
erated by a file server on an FDDI LAN of general-purpose workstations at UC
Berkeley between 5 and 9 PM on Sept 11, 1992. The other trace, of wide area net-
work traffic, is from the FDDI network that feeds the NSFnet backbone node at the
San Diego Supercomputer Center. This is the same trace as was described and
analyzed in BibRef[18]; it was taken between 2 and 3 PM on March 23, 1993.
Both traces show strong bimodal behavior. This and the other traffic behav-
iors observed conform to findings of earlier studies
BibRef[12]BibRef[29]BibRef[47].
Packet size is defined to mean the amount of actual user data sent, not
including protocol headers. Figures 1a-b show the packet size distributions for
TCP and UDP packets.
[Histogram: Fraction of Total Messages (0.00-0.35) versus Message Length in Bytes, logarithmic scale]
Figure 1a: Histogram of TCP packet sizes on a LAN. The packet size scale is logarithmic.
Note that almost all TCP packets are small (0.5% of the TCP packets are actually larger than
256 bytes - they are merely invisible in the histogram).
[Histogram: Fraction of Total Messages (0.00-0.35) versus Message Length in Bytes, logarithmic scale]
Figure 1b: Histogram of UDP packet sizes on a LAN, with a logarithmic packet size scale. As
in Figure 1a, most of the packets are small. However, a significant fraction of the packets are
very large.
139,720 packet sizes were collected; 90% of the packets are UDP packets, mostly
generated by NFS. The rest are TCP packets.
Figure 1a shows that almost all the TCP packets observed are small; over
99% are less than 200 bytes long. Figure 1b shows a bimodal distribution of UDP
packet sizes. The great majority of UDP packets are small, but there are some very
large packets: 86% of the UDP packets are less than 200 bytes long, while 9% are
around 8192 bytes long.
The observed packet size distributions matched expectations set by previ-
ous work on Ethernet-based packet traces BibRef[29]BibRef[47]. The median
packet sizes for TCP and UDP packets are 32 and 128 bytes, respectively. The low
UDP median reflects the fact that even in the case of UDP, most packets are small
(e.g. NFS status packets). The reason for the large number of 8 kilobyte UDP pack-
ets is NFS. The scarcity of large TCP packets (there are a few large TCP packets,
as large as 2048 bytes, but not enough to be visible in Figures 1a-b) arises because
it is easier to move data by NFS than by more traditional TCP-based file movement
commands such as FTP and rcp. Examination of data from other LANs supported
these results. There are applications that produce large TCP packets, such as X
Window System image transfer and Network News Transfer Protocol, but these
packets are relatively infrequent.
[Two-panel histograms: Fraction of Total Messages versus Message Length in Bytes, logarithmic scale]
Figure 2a-b: Histograms of TCP (2a) and UDP (2b) packet sizes on a WAN. The packet size
scale is logarithmic. The longest packets in the trace are 1460 bytes long, not quite long
enough for touching data to dominate their processing time. Contrary to the distribution seen
in the LAN trace, 92% of the packets in this trace are TCP packets.
DataMove: There are three categories of data movement: copying data between user
and kernel buffers (Usr-Krnl Cpy), copying data out to the FDDI controller (Device
Copy), and cache coherency maintenance (Cache Clear).
DataStruct: Manipulation of the socket buffer (Socket Buffer), IP defrag-
mentation queue (Defrag Queue), and interface queue (Device Queue) data struc-
tures. Mbuf manipulation is covered by its own category.
ErrorChk: The category of checks for user and system errors, such as
parameter checking on socket system calls.
Mbuf: All network software subsystems require a complex buffer descrip-
tor which allows the headers to be prepended and packets to be defragmented effi-
ciently. Berkeley Unix-based network subsystems buffer network data in a data
structure called an mbuf BibRef[43]. All mbuf operations are part of this category.
Allocation and freeing of mbufs are the most time-consuming mbuf operations.
OpSys: Operating system overhead includes support for sockets, synchro-
nization overhead (sleep/wakeup), and other general operating system support
functions.
ProtSpec: This category is of protocol-specific operations, such as setting
header fields and maintaining protocol state, which are not included in any of the
other categories. This category is a comparatively narrow definition of protocol-
specific processing. For example, although checksumming is a part of TCP, UDP,
and IP, it is placed in a separate category because of its high expense, and because
it is not limited specifically to any one of these protocols.
Other: This final category of overhead includes all the operations which are
too small to measure. An example of this is the symmetric multiprocessing (SMP)
locking mechanism called frequently in DEC’s Ultrix 4.2a, the operating system
measured. SMP locking executes hundreds of times per packet, but each execution
consumes less time than the uncertainty on the execution time of our probes. Thus
it is not possible to tell with certainty how much time is consumed by that mecha-
nism. In the processing times presented, the time due to “Other” is the difference
between the total processing time and the sum of the times of the categories listed
above.
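That is, writing \( T_c \) for the measured time of category \( c \),

\[ T_{\mathrm{Other}} = T_{\mathrm{total}} - \sum_{c} T_c . \]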
[Diagram: two DECstation 5000/200 workstations, each running a process, connected by an isolated FDDI network; the measured region of code is the kernel of the system under test]
Figure 3: The experimental system consisted of two DECstation 5000/200 workstations con-
nected by an FDDI network. A process on one machine sends a packet to a process on the sys-
tem under test, which then sends the same packet back. The lines connecting the processes
show the path of the packet. Measurements are made on that part of executed code located in
the kernel of the system under test, highlighted by the stippled lines.
components of network software when sending and then receiving the same
packet. The experimental system, shown in Figure 3, consists of two
workstations connected by an FDDI network with no other workstations and no
network traffic other than that generated by the experiment. An experiment con-
sists of one workstation sending a packet to the system under test, which then
sends the same packet back. All measurements are made on the system under test,
which is executing a probed kernel and is hooked up to the logic analyzer. Each
experiment is carried out for 40 packet sizes evenly spaced from 1 byte to 8192
bytes. Experiments are repeated 100 times at the same packet size to obtain sta-
tistical significance in the results (the average percentage of standard error over
all categories is less than 5%).
The experiments are designed to only capture CPU time spent processing
network packets, ignoring other sources of delays in network communications. Our
timings do not include network transmission times. Nor is time that a packet is
held up by flow or congestion control counted, except for the processing time
needed to decide not to hold up the packet; the workload does not provoke flow or
congestion controls into operation. Note that the TCP workload design has the
result that acknowledgments are always piggybacked.
[Stacked charts: processing time in Microseconds (0-3000) versus packet size; categories include Checksum]
Figure 4a-b: Breakdown of operation processing times for TCP (4a) and UDP (4b) packet
sizes ranging from very small, 0 bytes, to very large, 8192.
Packets up to 1024 bytes long are held in chains of small mbufs, calling a memory allocation routine for each mbuf. Up to 4096
bytes of packets larger than 1024 bytes are held in a single cluster mbuf; this
causes two calls to the memory allocator, one for the mbuf and another for the
page. The algorithm is repeated to hold the rest of each packet longer than 4096
bytes until enough mbufs have been allocated to hold the entire packet. The bimo-
dality of the mbuf allocation algorithm causes the hump between 1 and 1024 bytes
and part of the hump between 4096 and 5120 bytes, largely because of the result-
ing distribution of numbers of calls to the mbuf allocation routine. Others have
remarked on the bimodality BibRef[11]BibRef[32].
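A sketch of the allocator-call count implied by this algorithm makes the bimodality concrete; the small-mbuf data size below is an assumed constant, and the switch at 1024 bytes produces the two regimes.

    #define MBUF_DATA 112   /* data bytes per small mbuf (assumed value) */
    #define CLUSTER  4096   /* data bytes per cluster mbuf */

    /* Memory-allocator calls needed to buffer a packet of len bytes. */
    int mbuf_alloc_calls(int len)
    {
        if (len <= 1024)
            /* chain of small mbufs: one allocator call per mbuf */
            return (len + MBUF_DATA - 1) / MBUF_DATA;
        /* cluster mbufs: two calls each (one for the mbuf, one for the page) */
        return 2 * ((len + CLUSTER - 1) / CLUSTER);
    }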
The breakdowns of TCP and UDP processing times shown in Figure 4
are very similar. Even TCP protocol-specific processing is only slightly
more expensive than UDP protocol-specific processing. The differences are so
small because even though TCP is the most complicated portion of the TCP/IP
implementation, it is only a relatively small part of the executed layers of network
software. Sending or receiving either a TCP or UDP packet involves executing IP,
the socket layer, the FDDI driver, and numerous support routines.
[Figure 5: percentage breakdown (0-100) of processing time by overhead category (Checksum, Data Move, ProtSpec, Data Struct, Error Chk, OpSys, Mbuf, Other) versus Message Length in Bytes, for TCP and UDP]
the total processing overhead time. Notice that the upper two regions in Figure 5
are due to the data-touching overheads; for large packet sizes these
operations consume approximately 70% of the total processing overhead time.
However, as packets get smaller, the non-data-touching overhead times become
more prominent, as expected. In fact, for single byte packets, data-touching over-
heads contribute only 11% of the total processing overhead time.
[Stacked charts: Microseconds (0-400) versus Message Length in Bytes for small packets; categories include Mbuf, Checksum, Data Move, ProtSpec, OpSys, Error Chk, Other]
Figure 6a-b: Breakdown of operation processing times for small TCP (6a) and UDP (6b)
packet sizes.
[Bar chart: Microseconds (0-400) by Operation: Checksum, Data Move, ProtSpec, Data Struct, Error Chk, OpSys, Mbuf, Other]
Figure 7a: Profile of aggregate LAN TCP operation processing times. Since most TCP packets
are small, more time is spent on protocol overheads than on data movement.
[Bar chart: Microseconds (0-400) by Operation: Checksum, Data Move, ProtSpec, Data Struct, Error Chk, OpSys, Mbuf, Other]
Figure 7b: Profile of aggregate LAN UDP operation processing times. Since there are a num-
ber of large UDP packets, checksumming and copying dominate.
transport protocol (e.g., TCP or UDP) and environment (e.g., LAN or WAN). Thus,
the categories of processing overheads defined in Section 2 have different relative
costs depending upon protocol and environment because of the differing distribu-
tions of packet sizes. These observations lead to the aggregate processing overhead
times based on the packet size distributions given in Figure 1 and Figure 2.
Figure 7 shows aggregate processing overhead times for TCP and UDP on a LAN,
while Figure 8 shows aggregate processing times for a WAN. This dissertation
concentrates on the LAN environment because it represents the more common
overall workload for a computer.
[Bar charts: Microseconds (0-300) by Operation: Checksum, Data Move, ProtSpec, Data Struct, Error Chk, OpSys, Mbuf, Other]
Figure 8a-b: Profiles of aggregate WAN TCP (8a) and UDP (8b) operation processing times. All
of the packets in the trace are too short for data-touching overheads
to dominate packet processing time, so it is unsurprising that copying and checksumming
consume so little time in the overall breakdown. In contrast to the distribution seen in the
LAN trace, 92% of the packets in this trace are TCP packets.
checksums with nearly none of the loss of integrity nor any additional cost in hard-
ware. Specifically, it seems reasonable to avoid the TCP and UDP data checksum
computations when the source and destination are both on the same LAN and that
LAN supports a hardware CRC. Since the overwhelming majority of packets fit
those criteria, our Checksum Redundancy Avoidance algorithm provides a dra-
matic performance improvement for most packets without loss of checksum protec-
tion. Such strategies can almost double throughputs.
Improving overall performance by optimizing non-data-touching operations
is more difficult. For example, because of the lack of large TCP packets, the most
prominent overhead category in the TCP profile in Figure 7a is ProtSpec. ProtSpec
consists of the protocol-specific processing overhead from each protocol layer (TCP,
IP, Link layer, and FDDI driver). The largest component of ProtSpec is that of TCP,
which as mentioned above consumes 13% of the total processing time; however,
TCP protocol-specific processing is actually made up of a large number of smaller
operations, as are the Mbuf and Other categories. Thus, a wide range of improve-
ments would be needed to produce a significant improvement in performance for
the non-data-touching operations. The system described in BibRef[64] could, from
a certain point of view, be considered a successful dramatic improvement in
latency achieved through a complete redesign and rewrite of a UDP-based RPC.
[Bar charts: Microseconds (0-100) by Operation: Cache Clear, Device Copy, Usr-Krnl Cpy]
Figure 9a-b: Aggregate data movement times for TCP (9a) and UDP (9b) packets. Copying
times are much higher for UDP because of the much greater numbers of large UDP packets.
Each copy cost consists of a constant component and a component rising linearly with packet size. Usr-Krnl Copy has a
larger constant component but a smaller linear component than Device Copy, and
thus the reversal in relative sizes is due to the different TCP and UDP packet size
distributions.
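In the usual linear model, copying an \( n \)-byte packet costs \( T(n) = a + bn \); Usr-Krnl Copy has the larger constant term \( a \), and Device Copy the larger per-byte term \( b \).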
[Bar charts: Microseconds (0-150) by Operation: Demux, ifaddr, Arp, (Dis)connect, IP Protl, Link Protl, TCP/UDP Protl, Device Protl]
Figure 10a-b: Aggregate protocol processing (Protls) time under TCP (10a) and UDP (10b)
packets. TCP protocol processing time is large, but less than half of total protocol processing
time. UDP layer overheads are very small.
3.6.3 Mbufs
Mbuf is the second largest of the non-data-touching categories of overhead.
The mbuf data structure supports a number of operations, the most costly of which
is Mbuf Alloc. Figure 11 contains breakdowns of the Mbuf category
into mbuf allocation and deallocation (Mbuf Alloc) and all other mbuf operations
(Mbuf Misc).
[Bar charts: Microseconds (0-140) by Operation: Mbuf Misc, Mbuf Alloc]
Figure 11a-b: Aggregate network buffer (i.e. “mbuf”) management times for TCP (11a) and
UDP (11b) packet sizes. The large cost of the several memory allocation operations per-
formed by the mbuf packages is one of the largest single operation costs.
[Bar charts: Microseconds (0-25) by Operation: Proc Restart, Socket Misc, Sleep, Wakeup, Softint Handler, Sched Soft Intr]
Figure 12a-b: Aggregate operating system overhead times for TCP (12a) and UDP (12b)
packet sizes. Transfer of control is surprisingly inexpensive.
again. Sched Soft Intr is the operation of scheduling a software interrupt to process
the incoming packet, and Softint Handler is the software interrupt handler, which
dequeues incoming packets and calls IP to handle them. Perhaps the most inter-
esting aspect of this category is that the various transfer-of-control operations are
so cheap relative to the other categories of overheads (see Figure 7 and Figure 8).
[Bar charts: Microseconds (0-100) by Operation: Skt Errchk, Syscall Argchk]
Figure 13a-b: Aggregate error checking times for TCP (13a) and UDP (13b) packet sizes.
Checking that a user has specified a correct receive buffer is expensive and imposes a large
penalty on large receive buffers.
[Bar charts: Microseconds (0-40) by Operation: Socket Buffer, Defrag Queue, Device Queue]
Figure 14a-b: Aggregate data structure manipulation times for TCP (14a) and UDP (14b)
packet sizes. TCP makes more extensive use of the socket buffer's properties than UDP does.
Figure 14 shows the breakdown of overheads for this category. The socket buffer is the
data structure in which a limited amount of data is enqueued either for or by the
transport protocol. TCP makes heavier use of this data structure than UDP, which
makes no use of the socket buffer structure for outgoing packets, and only uses it
as a finite-length queue upon reception. In contrast, TCP uses the socket buffer to
implement windowing flow control on both send and receive sides.
The Defrag Queue is the data structure used to defragment IP packets. The
times for Defrag Queue processing require explanation. In general, packets larger
than the FDDI MTU must be sent in multiple pieces (fragmented and defrag-
mented). Fragmentation is implemented in both IP and TCP. UDP packets are
fragmented by IP, but TCP does its own fragmentation specifically to avoid IP frag-
mentation. Thus, it is surprising that the code checking the defragmentation
queue is called at all for TCP packets; this phenomenon reflects a place in the IP
code where a check for matching fragments could have been avoided. Even more
interesting is the fact that the average amount of time spent in Defrag Queue for
UDP is not noticeably greater than the amount of time spent in Defrag Queue by
each unfragmented packet. Less than 10% of UDP packets are large enough to be
fragmented; the extra time spent defragmenting large packets is not sufficient to
noticeably raise the average cost of defragmentation.
The Device Queue is the data structure in which outgoing data is enqueued
by the link layer until the network controller is prepared to process it, and incom-
ing data is enqueued by the device driver until IP is ready to process it by a soft-
ware interrupt. UDP packets spend more time in Device Queue processing because
of fragmentation. If a UDP packet is fragmented into two FDDI packets, then twice
as much work must be performed by the Device Queue to send that packet.
3.7 Conclusions
I measured various categories of processing overhead times of the TCP/IP
and UDP/IP protocol stacks on a DECstation 5000/200, and used the results to
search for bottlenecks affecting throughput and minimal latency. As expected, the
data-touching operations, checksumming and copying, are bottlenecks affecting
throughput.
Non-data-touching overheads turn out to be the bottleneck to minimal
latency. Furthermore, because most packets observed in real networks are small,
common cases. Implementations of this idea range from rearrangement of packet pro-
cessing to be more efficient in common cases to sophisticated schemes involving
multiple processing paths. In the case of transmission, BibRef[35]BibRef[51] sug-
gest creation of different paths through multiple layers of network software -
socket, transport, and network layers - for each type of packet (for example, one
might create a path for TCP packets, a path for UDP packets, and a path for routed
IP packets). This is not too difficult, since simply by sending on a particular socket
one identifies the type of packet one is generating. However, packet reception is
more difficult, since one generally cannot know an incoming packet’s type until
after a fair amount of processing has transpired. By providing a facility for identi-
fying packets rapidly in advance of nearly all other processing, PathIDs makes it
possible to follow a similar strategy for packet reception.
PathIDs is solely intended to optimize receive-side processing. The informa-
tion that PathIDs makes available is already available on transmission to any
implementation, such as Berkeley Unix, that does not limit knowledge of useful
data structures by layer. PathIDs has no advantage in rapid transmission over a
fastpaths-based implementation.
[Packet layout: FDDI Header; PathID Header with fields Verification Pattern, Type, Source PathID, Source Unused, Destination PathID, Destination Unused; IP Header; UDP Header]
Figure 15: New packet header layout with PathIDs. The layout is the same as a normal packet's
except that a new header has been added between the FDDI link layer and IP headers. Names
of individual PathID header fields are on the right hand side of the PathID header.
Destination PathID: The field for which this protocol exists. The contents
of this field are used to look up path and socket information in a table. From there
processing proceeds along the fastpath corresponding to the PathID (see the sketch after these field descriptions).
Source PathID: This field supports the negotiation process for PathIDs.
This is the PathID used to identify the entity that is sending. The destination ses-
sion in a PathID-supporting host that receives this will in future use the contents
of this field as the destination PathID when sending packets to the source of this
packet. Since this negotiation algorithm does not support multicast, it will have to
be changed.
Verification Pattern: A pattern impossible for a conventional FDDI
packet at this header location. This is necessary because few networks are likely
to carry exclusively PathID traffic.
Type: Conventional Ethernet type field, for use when no destination
PathID field is set.
Source/Destination Unused: Extra space for related experiments.
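To make the Destination PathID lookup concrete, here is a hedged C sketch; all names and the table size are hypothetical, standing in for the implementation described in Sections 4.2.2 and 4.2.3.

    #include <stdint.h>

    struct mbuf;     /* Berkeley Unix packet buffer (opaque here) */
    struct socket;   /* destination endpoint (opaque here) */

    /* Per-path state: the destination socket, plus the fastpath routine
     * tailored to this path's exact packet type. */
    struct path {
        struct socket *so;
        void (*fastpath)(struct mbuf *m, struct socket *so);
    };

    #define MAX_PATHS 1024                 /* assumed table size */
    static struct path path_table[MAX_PATHS];

    void slowpath_input(struct mbuf *m);   /* ordinary demultiplexing */

    /* Called from the link layer once the PathID header is recognized. */
    void pathid_input(struct mbuf *m, uint16_t dst_pathid)
    {
        struct path *p;

        if (dst_pathid >= MAX_PATHS) {
            slowpath_input(m);             /* unknown ID: fall back */
            return;
        }
        p = &path_table[dst_pathid];
        if (p->fastpath != NULL)
            p->fastpath(m, p->so);         /* tailored per-path processing */
        else
            slowpath_input(m);
    }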
[Diagram: flow of control. Berkeley Unix: device -> interface queue -> software interrupt (IP, UDP) -> socket queue -> sockets -> user process. PathIDs: device -> socket queue -> user process]
Figure 16: Flow of control under Berkeley Unix and under PathIDs fastpaths. Transport and
network layer processing that is done in the network software interrupt in Berkeley Unix is
done by the user process under PathIDs.
cess address space to be available to the UDP and IP code within the fastpath (in
many operating systems some information concerning each process is only avail-
able from the context of the process in question), which permits more optimiza-
tions, and it means that interrupts are blocked for a shorter length of time.
PathIDs is perfect for solving this sort of design problem: everything is known to
all software along the path, so the transport protocol code simply copies the
address directly into the user process’ buffer. The resulting code is far simpler than
the original.
ICMP Error Notification: A variety of error conditions, such as invalid
transport protocol port and unreachable destination network, require a host to
generate an error packet. Berkeley Unix must go through some preparation for
this event even when no error is found. However, the destination of a packet must
exist to end up on a fastpath at all, so no ICMP error need ever be generated from
a fastpath, which in turn implies that no preparation for such an event is needed.
Checksumming: Software checksumming is usually extremely common.
Some hosts, either those running Checksum Redundancy Avoidance (see Section
5.2) or those whose vendors supply hardware checksumming in most net-
work interfaces, may not see much checksumming. It may make sense to remove
it from the fastpath of such hosts. On the other hand, this may also noticeably
reduce the number of packets to which fastpaths apply or raise the number of fast-
paths in exchange for little gain. The initial implementation places checksum pro-
cessing in the fastpath.
Protocol Options Processing: Most packets do not contain options, so it
is worth removing this from the fastpath. Some designs for the next version of IP
embed a philosophy that options need not be uncommon. If options should become
common, the focus may shift from removing options from the fastpath to assuming
that option sets will usually not change much for any given transport protocol ses-
sion.
IP Length Checking: Since both UDP and TCP perform length checking
of their own, and execution of a UDP or TCP fastpath is evidence that a packet
will be processed by a protocol supporting length checking, checking the IP length
field for consistency is entirely superfluous. The same logic also applies to checks
that each packet is larger than the IP header size.
processing. Like the soft network interrupt, support for the software interface
queue should be moved out of the common path into drivers for which it is useful.
Driver-level buffer allocation: Normally, device drivers allocate net-
work buffers upon packet reception to replace each buffer that is to be sent up the
protocol stack. The buffers are later freed after they are in turn filled with an
incoming packet, processed, and consumed by a user process. However, this alloca-
tion and deallocation can be avoided by merging some of the functionality of hard-
ware adapter queues with the functionality of socket buffers. One could bind each
buffer permanently to a hardware queue buffer descriptor, clearing the buffer
pointer in the hardware queue so that full buffers are not overwritten; not only
would this eliminate the allocation, but it would also sharply reduce the amount of
time spent initializing adapter receive buffer descriptors. However, such a static
binding would have the disadvantage that the adapter would not be able to use the
buffer until the receiving process consumes it. This should only rarely cause
dropped packets for lack of a buffer, since hardware queues are growing quite long.
If it does prove to be a problem, allocation could be accomplished only when the
queue of contiguous, empty buffers grows short.
whether the fix being applied is appropriate; such thought may require a deeper
understanding of the code than a software maintainer has, since such paths
involve every layer of networking code.
Some of the changes proposed here are difficult to port from one operating
system to another. In particular, the implementation of synchronization calls is
often operating-system specific. This may not pose much of a problem for a vendor
implementing PathIDs in its own source code, but it creates problems for a
researcher or a business specializing in enhancing network code. Even a vendor
may face increased difficulties when it wishes to change source code trees (such as
the recent SunOS->Solaris and Ultrix->OSF/1 conversions).
Still another problem potentially limits the ultimate effectiveness of
PathIDs and fastpaths if care is not taken. These techniques optimize by restrict-
ing the domain of situations for which they are effective, in a manner similar to
decisions on RISC architecture design BibRef[31]. The more decisions made in
advance about what sorts of packets are supported, the faster a fastpath is. How-
ever, if each piece of functionality has a probability Pf of being needed by a partic-
ular packet independent of the chance of other pieces of functionality being used,
and there are n pieces of functionality which the fastpath does not support, then
the overall chance of a fastpath being applicable is the product of the terms (1 - Pf). In other
words, the chance that a path can be used for any given packet goes down exponen-
tially with the number of decisions made in advance in the design of the path. For-
tunately, the chances of use of these pieces of functionality are very much mutually
dependent; still, it is a problem that must be watched. PathIDs is superior to sim-
ple fastpaths in that it is easier to create many paths that optimize for disjoint sets
of functionality, although tens of thousands of PathIDs could potentially tax mem-
ory, identifier space, and connection setup time.
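Stated as a formula: if unsupported piece \( i \) is needed with probability \( P_{f_i} \), the chance that a packet needs none of the \( n \) unsupported pieces is

\[ P_{\mathrm{applicable}} = \prod_{i=1}^{n} \bigl(1 - P_{f_i}\bigr), \]

which for roughly equal \( P_{f_i} \approx p \) is \( (1-p)^n \), falling exponentially in \( n \).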
4.5 Performance
The PathIDs implementation exhibits 31% lower minimal latency than
unmodified OSF/1 on an Alpha 3000/400. The test used to derive these results mea-
sures round trip latency by sending a 1-byte UDP packet to another workstation
which returns the packet to its sender, and noting how long it takes before the
packet is returned to its original sender.
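The measurement has the shape of a simple UDP echo timing test. A minimal user-level sketch follows; the peer address and port are placeholders, and a real run would average over many round trips rather than timing a single one.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in peer;
        char byte = 0;
        struct timeval t0, t1;

        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port = htons(7);                     /* placeholder echo port */
        peer.sin_addr.s_addr = inet_addr("10.0.0.2"); /* placeholder address */

        gettimeofday(&t0, NULL);
        sendto(s, &byte, 1, 0, (struct sockaddr *)&peer, sizeof(peer));
        recv(s, &byte, 1, 0);     /* blocks until the peer echoes the byte */
        gettimeofday(&t1, NULL);

        printf("round trip: %ld microseconds\n",
               (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));
        return 0;
    }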
4.6 Conclusions
PathIDs achieves substantial improvements in minimal latency by provid-
ing a mechanism that eases optimization of non-data-touching overheads. PathIDs
adds a field to link layer headers which can be used to immediately recognize the
destination and type of each incoming packet. The importance of this change
derives less from the resulting savings in demultiplexing time than from the fact
that all aspects of a packet are known from the beginning of packet processing.
Processing of each type of packet can be tailored to fit the bare requirements for
that type of packet instead of being designed to support any type of incoming
packet.
The side that wishes to create a connection starts negotiation. TCP uses the
same series of checks as in UDP to decide whether to send an option to turn off
checksums. When server-side TCP responds to a TCP connection request, it
responds with the checksum-avoidance option if it gets a checksum-avoidance
request in the connection setup packet. Each side of a conversation only disables
checksumming if it receives a checksum elimination option from the other host in
the conversation.
As with UDP, a socket option is implemented that causes the implementa-
tion to try to negotiate disabling checksums even when the checksum is not redun-
dant. Since the sender of the option makes the decision as to whether to negotiate
checksums, and receivers knowledgeable of this option automatically agree to any
suggestion of turning off checksums, only one side of a conversation need call the
option.
LAN bridges BibRef[3] and proxy arp BibRef[13] both make it difficult for a host
to be certain that another host is truly on the same network.
It is possible to decide with certainty whether a remote host is truly on the
same network as a local host. The key is the IEEE spanning tree protocol used for
routing among bridges BibRef[3]. Bridges are not allowed to pass packets using
this protocol, and IP routers ignore such packets. Thus, reception of a spanning
tree protocol packet from a host is good evidence that the host is on the same physical
network as the local host.
Hosts implementing this scheme would broadcast a spanning tree packet
right after responding to arp requests. The contents of the spanning tree packets
would be set in such a manner that no bridge will expect a host to actually bridge
MAC packets. Any host from which such a spanning tree packet is seen can be
assumed not to be on the other side of a bridge.
The key idea of DMA Redirection is to begin processing packets as they arrive. As soon as each packet's header arrives in system mem-
ory, the CPU processes the headers and changes adapter DMA pointers to point
directly to user receive buffers.
[Diagram: adapter DMAs incoming packets into kernel buffers; the kernel later copies the data out to the user buffer]
Figure 17: Normal incoming packet DMA. For each packet, there are a number of receive
descriptors, each pointing to a kernel buffer.
and a pointer. The length specifies how many bytes to place in the buffer pointed
to by the descriptor in question. If an incoming packet contains more bytes than
the descriptor specifies, the next descriptor is used and the algorithm repeated.
Device driver writers are expected to provide enough bytes in the various descrip-
tors to hold worst-case large packets.
Once DMA into host memory is complete, the adapter generates an inter-
rupt and packet processing begins. The incoming packet is later copied into user
buffers as one of the last steps in packet processing.
kernel buffer, and initially the second descriptor points to a kernel buffer long
enough to hold the rest of an incoming packet’s data. As soon as enough informa-
tion (i.e. the protocol header) has arrived to decide to which user buffer a packet is
bound, the second descriptor’s pointer is redirected to point to that user buffer. It
takes a certain amount of time to decide where a packet is bound; the buffer
pointed to by the first descriptor continues to fill while the packet is arriving. After
DMA redirection is complete, the bulk of the packet is placed directly into the
user’s buffer with no need for additional copying.
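A hedged sketch of the two-descriptor arrangement follows; the descriptor layout and all names are hypothetical, and real adapters add ownership and status bits.

    #include <stddef.h>

    struct rx_desc {
        char  *buf;   /* where the adapter DMAs bytes */
        size_t len;   /* capacity of this buffer */
    };

    /* The first descriptor always points at a kernel buffer that absorbs
     * the header; the second initially points at a fallback kernel buffer. */
    void setup_receive(struct rx_desc d[2], char *hdr_kbuf, char *body_kbuf,
                       size_t waitlen, size_t maxpkt)
    {
        d[0].buf = hdr_kbuf;   d[0].len = waitlen;
        d[1].buf = body_kbuf;  d[1].len = maxpkt - waitlen;
    }

    /* Called once the header has arrived and been parsed: if a waiting
     * process has supplied a buffer, aim the second descriptor at it
     * before the adapter exhausts the first one. */
    void redirect(struct rx_desc d[2], char *user_buf)
    {
        if (user_buf != NULL)
            d[1].buf = user_buf; /* bulk of packet bypasses the kernel copy */
    }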
The byte length contained in the first descriptor, waitlen, is a function of
the speed of the network and the amount of time required to change the descriptor.
On the DECstation, the early implementation requires a waitlen value of 2048
bytes in order to have enough time to reliably process the header and set the sec-
ond descriptor’s pointer. Since this implementation uses an FDDI network, the
amount of required processing time can be computed as follows:
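In outline, FDDI delivers data at 100 Mb/s, so a rough bound on the time the host gains while the first waitlen bytes arrive is

\[ t = \frac{8 \times \mathit{waitlen}}{10^{8}\ \mathrm{b/s}} = \frac{8 \times 2048}{10^{8}}\ \mathrm{s} \approx 164\ \mu\mathrm{s}, \]

which bounds the header processing and descriptor update that must complete before the adapter reaches the second descriptor.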
Since the second buffer initially points to a kernel buffer, if DMA is not redirected
for any reason the packet will still be received, though through the “normal” slower
path, with the extra copy operation.
unusual bit pattern at byte 64, confusing the algorithm, the worst that can happen
is that the packet is processed in the usual, slower manner.
[Timeline: packet arrivals (packets 1-3), network processing (polling, normal receive processing), and user buffer availability over time; the user buffer is filled by packet 1 and replenished at the read() call]
Figure 19: Closely spaced packets cause problems for DMA Redirection with a normal pro-
cessing order. The user buffer must be available to do DMA Redirection. After packet 1
arrives, no buffer is available until network software is complete. Thus, DMA Redirection
cannot be performed for packet 2. DMA Redirection could be performed for packet 3.
With normal receive processing, software processing begins only after the packet has fully arrived, so the delays add:
receive latency = network transmission time + network software processing time (EQ 2)
With early packet detection, most processing overlaps the packet's arrival, so only the larger of the two delays matters:
receive latency = max(network transmission time, network software processing time) (EQ 3)
Processing would necessarily be split into two parts: the bulk of the process-
ing to be performed immediately on early packet detection, and a second part that
would confirm that no errors have been detected and awaken the receiving user
process if appropriate. This is similar in some ways to a split described in
BibRef[35]BibRef[51].
The early processing of a packet due to DMA Redirection can also improve
the likelihood that DMA Redirection can be used to process a given packet. Con-
sider the sequence of events when there is a sequence of closely spaced packets
bound for a single process, as illustrated in Figure 19. A user buffer
from a waiting process must be available for DMA Redirection to be performed. Yet
the user buffer can only be available for DMA Redirection if a process is already
waiting for a packet. The user process cannot begin waiting for the next packet
until it has received the previous one; this can only occur after packet processing
of the previous packet is complete. Normally, packet processing does not begin
until a packet has finished arriving, and may not complete until well after the next
packet has begun arriving; thus, the user process does not have an opportunity to
provide a buffer for the next packet. If one begins packet processing upon early
packet detection, packet processing would complete relatively shortly after the
packet finished arriving. This would allow a process which is simply interested in
receiving packets as fast as possible to execute and immediately wait for the next
packet, switching execution to the idle loop if no other computation is ongoing,
which in turn allows DMA Redirection on the next packet.
5.3.4 Results
Our results indicate that processing time for UDP packets improved as a
result of our implementation of DMA Redirection on a DECstation 5000/200
equipped with a DEFTA FDDI adapter. Table 3 shows the resultant improvements
in round trip times for 4000-byte packets. In particular, round trip time was
reduced by approximately 15% when DMA Redirection was applied after having
already applied Checksum Redundancy Avoidance.
[Diagram: host descriptor queue, each descriptor pointing into host memory, feeding the adapter]
Figure 20: Output side of a DMA-based adapter. The adapter uses DMA to fill its internal
transmission buffer directly from kernel buffer memory. A queue of descriptors tells the
adapter where to fetch memory from.
host memory and adapter, whereas with a PIO-based adapter, the host CPU itself
copies the data.
Confusingly, both types of adapters often have an internal DMA engine, so
DMA-based adapters are described first. A diagram of the transmission side of
such an adapter is shown in Figure 20. The adapter and device driver
jointly manage a queue of buffer descriptors in host memory. Each buffer descrip-
tor points to a location in host memory. The adapter contains a relatively small
amount of memory in an onboard transmission buffer, which holds data to be
placed on the network medium. When the transmission buffer approaches empti-
ness, the adapter transfers data from the buffer pointed to by the first descriptor
in the queue; that buffer is in the main memory of the host.
Many PIO-based LAN adapters, especially LAN adapters designed around
common MAC chipsets, are similar to DMA-based adapters, as shown in Figure 21.
In fact, PIO-based adapters often contain what amounts to an
internal DMA-based adapter, often a cheap MAC chip produced in volume. The dif-
ference is that the PIO-based adapter also contains a relatively large internal
memory (typically up to a couple of megabytes), and the adapter transfers data to
[Diagram: descriptor queue with pointers into adapter memory]
Figure 21: Output side of a PIO-based adapter. This is the same as a DMA-based adapter
except for an extra buffer layer of memory in the adapter. Transfers to the transmission
buffer are satisfied from the adapter’s onboard memory. The CPU copies data into the
adapter’s memory.
the transmission buffer from the adapter memory instead of directly from main
memory. The CPU is expected to copy data from main memory into the adapter
memory. Examples of this may be found in several of the workstation Ethernet
adapters based on the AMD LANCE Ethernet MAC chipset BibRef[26], the DEC
DEFZA FDDI adapter BibRef[55], and the HP Afterburner and Medusa adapters
BibRef[4]BibRef[24]. The LANCE chipset itself implements a scatter/gather DMA
engine similar to that shown in Figure 20. On many Sun workstations, the LANCE
chipset is interfaced to the host through the addition of I/O bus DMA support cir-
cuitry which allows the LANCE to fetch data directly from main memory. On DEC-
stations, the LANCE is interfaced to the host through a PIO-based interface: the
LANCE fetches data from the adapter memory, and the CPU copies data into
adapter memory. The reason often given for the extra layer of memory is that,
because the LANCE's FIFO is very small, the adapter risks waiting for the I/O bus
long enough for the FIFO to either fill, if receiving a packet, or empty, if transmit-
ting a packet, resulting in incorrect packet transfer. Alternatively, when designing
adapters for hosts with new, low-volume, or complicated I/O bus interfaces it may
be cheaper to put the memory onboard with an ad hoc internal DMA interface than
to support I/O bus DMA.
The best choices in adapter and operating system design depend upon a
number of system parameters. One parameter is whether adapter DMA is able to
transfer data more rapidly than the CPU, both within the system and across the I/
O bus. DMA is often, though by no means always, faster than copying across the I/
O bus, because DMA typically is done in large bursts that amortize the overhead
of acquiring the bus over a large number of transferred bus words, while CPUs typ-
ically transfer data in small bursts if they use bursts at all. DMA is often faster
than CPU copying internal to a host because designers try to build I/O busses to
have cycle times comparable to other host circuitry, and thus if DMA is fast relative
to other I/O bus transfers it is often fast relative to internal data transfers.
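A back-of-the-envelope model makes the burst argument concrete. The sketch
below is illustrative only; the parameters are assumptions, not measurements of
any particular bus.

/* Total bus time = one acquisition per burst, plus one cycle per word. */
double bus_time(double words, double burst, double acq_cost, double word_cost)
{
    double acquisitions = words / burst;   /* assume full bursts throughout */
    return acquisitions * acq_cost + words * word_cost;
}

With, say, 16-word DMA bursts against single-word CPU stores, the acquisition
overhead term shrinks by a factor of sixteen, which is the effect described above.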
Figure 22: Normal send-side data movement with a DMA-based interface. The kernel copies
data from the user process to kernel buffers, and the adapter transfers data from the kernel
buffers; there are two data copies in total.
Some copy-avoidance designs change operating system I/O interfaces due to the
necessity of page-aligning I/O buffers; many make no attempt at compatibility
and require rewriting of applications to gain a performance
advantage. A major goal of all the work in this dissertation is to gain whatever per-
formance is possible without sacrificing compatibility.
5.4.2 Nocopyin
Nocopyin eliminates the copy from user process buffers into kernel buffers.
Strictly speaking, there is no need for this copy operation because data could be
transferred directly from the user process to the adapter. Retention of this copy
operation has resulted in higher performance in the past because Unix I/O seman-
tics allow a process to modify the contents of a buffer as soon as it returns from the
I/O operation.
Figure 23: Normal send-side data movement with a PIO interface. The kernel copies data
from the user process to kernel buffers, then copies data from the kernel buffers to the
adapter, which transfers data from its internal buffers; there are three data copies.
This tradeoff has been changed by the high speed of adapter DMA and the
increasing size of internal adapter transmission buffers.
Many DMA-based adapters [26][33] only contain small inter-
nal transmission buffers, on the order of a hundred bytes in size, not enough to con-
tain an entire packet. Data transmission proceeds as follows: first the adapter
transfers enough data to fill the internal transmission buffer. The adapter uses the
contents of the buffer to transmit to the network; when the buffer runs low, the
adapter repeats the process, transferring another bufferful of data. The process
repeats until the entire packet has been transmitted.
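This transmission process can be sketched as follows; TXBUF and medium_drain
are illustrative stand-ins for the adapter's FIFO size and for waiting on the
network medium.

#include <stddef.h>
#include <string.h>

#define TXBUF 128                       /* on the order of a hundred bytes */

static unsigned char txbuf[TXBUF];      /* the adapter's small FIFO */

/* Stub: returns once the medium has drained the FIFO contents. */
static void medium_drain(const unsigned char *b, size_t n)
{
    (void)b; (void)n;
}

void transmit_packet(const unsigned char *pkt, size_t len)
{
    size_t off = 0;

    while (off < len) {
        size_t chunk = len - off < TXBUF ? len - off : (size_t)TXBUF;
        memcpy(txbuf, pkt + off, chunk);  /* one rapid fill of the FIFO */
        medium_drain(txbuf, chunk);       /* long wait while the FIFO drains */
        off += chunk;
    }
    /* Only now may the host buffer be freed or the user process unfrozen. */
}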
In this process, data transfer is not complete until the adapter has nearly
finished transmitting the packet. Over the duration of packet transmission, on
average, DMA proceeds at the same rate as transmission onto the network. Each
filling of the adapter transmission buffer is rapid, but is succeeded by a relatively
long waiting period during which the transmission buffer drains. Host packet buff-
ers cannot be freed until this relatively slow process is complete, and if a buffer
comes from a user process’ address space, that process must be frozen until trans-
mission is complete. It is faster, from the point of view of a transmitting user process, for
the buffer to be copied than for the process to wait until the relatively slow network
transmission is complete.
More recently designed, high-performance adapters such as the DEFTA
[65] or the Boggs T3 adapter [6] contain tens of kilobytes of transmis-
sion buffer memory. When these adapters try to fill their queues, they can DMA
entire packets continuously. On systems in which adapter DMA is able to move
data more rapidly than PIO can, it is faster to wait for DMA to complete than to
copy the packet. Thus, on such systems, it makes sense to redesign the network
software to let network adapters DMA directly from user buffers, as shown in
Figure 24.
In my design, the choice of data movement is left to the device driver, since the details
of system and adapter are known to each driver writer. This is not unlike the
designs of [4][35], except that my design does not call the driver to do
buffer allocation; it would not help in the highest performance case of the DMA-
based adapter, and it would not help performance much in the (slower) case of PIO-
based adapters. Instead, user buffers are locked into physical memory and mapped
into somewhat modified mbufs; copying is delayed until the driver entry point for
packet transmission is called, as in the normal Berkeley Unix design.
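A minimal sketch of the mapping step, assuming a simplified mbuf-like
structure; M_USERBUF and vm_wire are hypothetical names standing in for the
modified mbuf flag and the page-wiring primitive, not the stock Berkeley Unix
interface.

#include <stddef.h>

#define M_USERBUF 0x8000   /* assumed flag: data still lives in user space */

struct xmbuf {             /* simplified stand-in for a modified mbuf */
    char   *m_data;        /* points directly at the wired user pages */
    size_t  m_len;
    int     m_flags;
};

/* Stub for locking pages into physical memory; assumed to succeed. */
static int vm_wire(void *uaddr, size_t len)
{
    (void)uaddr; (void)len;
    return 1;
}

/* Map a user buffer for transmission without copying it. */
int nocopyin_map(struct xmbuf *m, void *uaddr, size_t len)
{
    if (!vm_wire(uaddr, len))
        return 0;           /* cannot wire: caller falls back to copying */
    m->m_data  = uaddr;
    m->m_len   = len;
    m->m_flags = M_USERBUF; /* the driver must move the data itself */
    return 1;
}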
A driver interface flag has been added so that drivers not modified to know
about the new scheme have the data copied for them in the link layer, while mod-
ified drivers are expected to copy the data in some fashion. The experimental
driver spins until the adapter has finished the DMA; since the DECstation is unable
to accomplish much computation while contending with continuous adapter DMA,
this seems like the best course.
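The flag check itself might look like the sketch below; IFF_NOCOPY and
mbuf_copyin are hypothetical names for the driver interface flag and the
fallback copy routine.

#define IFF_NOCOPY 0x10000  /* assumed: driver understands user-mapped buffers */

struct xifnet { int if_flags; };
struct xmbuf;               /* as in the previous sketch */

/* Stub: replaces the user mapping in m with a kernel copy. */
static void mbuf_copyin(struct xmbuf *m) { (void)m; }

void link_output(struct xifnet *ifp, struct xmbuf *m)
{
    if (!(ifp->if_flags & IFF_NOCOPY))
        mbuf_copyin(m);     /* unmodified driver: copy on its behalf */
    /* ... enqueue m on the driver's transmit queue as usual ... */
}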
Nocopyin should not be used when adapter queues grow long. Even modern
adapter transmission buffers only have room for a small number of packets; once
the transmission buffer is full, the adapter changes to a more LANCE-like strat-
egy, transferring data in relatively small chunks to keep the transmission buffer
full. A driver implementing
Nocopyin should start copying outgoing packets and stop spinning on transmission
once it enqueues enough packets to fill the transmission buffer. This is not a prob-
lem for the experimental system, a DECstation 5000/200 on an FDDI network that
it is unable to saturate.
5.4.2.4 Retransmission
A problem with Nocopyin is that reliable transport protocols such as TCP
require the kernel to keep a copy of each packet in case the need arises to retrans-
mit. Even in the best case, this copy must be kept around long enough for the
packet to be acknowledged, which is typically a large amount of time compared to
the time required to transmit a packet. If the kernel did not copy the packet, it
would be forced to freeze the user process for an unacceptably long time.
Protocols implemented outside the kernel, built on top of UDP, such as the
various Mbone protocols or DNS, have no such requirement. Packet traces from
LANs supporting Mbone suggest that multimedia packets represent an increasing
component of LAN workload.
Copyin can be drastically reduced even under TCP when transmitting over
PIO-based adapters with an onboard, CPU-addressable transmit queue, such as
the DEFZA [55]. The data is copied out to the adapter in much the same
manner as for UDP. However, TCP continues to track,
in its retransmission queue, the packet data onboard the adapter. Even in the
worst (and common) case of a fixed circular array adapter queue, this works
because the data remains undisturbed until all the other packets in the queue
have been used and the buffer in question must be reused to transmit a newer
packet. When the time arrives for the buffer to be reused, the old packet data must
be copied back into the kernel so that TCP can maintain its retransmission queue;
in this circumstance, one may actually take a performance hit if copy speed across
the I/O bus is slow. However, the usual case is likely to be that the packet is
acknowledged before the adapter buffer must be reused.
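The reuse rule can be sketched as follows, assuming a fixed circular array of
CPU-addressable onboard buffers; names and sizes are illustrative, and malloc
stands in for the kernel allocator.

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define NSLOT 16

struct slot {
    unsigned char data[4352];  /* onboard, CPU-addressable packet buffer */
    size_t        len;
    int           acked;       /* set when TCP sees the acknowledgement */
    unsigned char *kcopy;      /* kernel copy, made only when forced */
};

static struct slot q[NSLOT];

/* Claim slot i for a new packet, preserving unacknowledged data first. */
struct slot *slot_reuse(int i)
{
    struct slot *s = &q[i];

    if (!s->acked && s->len > 0) {   /* rare: queue wrapped before the ACK */
        s->kcopy = malloc(s->len);
        if (s->kcopy != NULL)
            memcpy(s->kcopy, s->data, s->len);  /* slow copy back over the bus */
    }
    return s;
}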
As an example, it takes about 4 milliseconds on the DECstation 5000/200
using a PIO-based DEFZA FDDI adapter to send two 4324-byte UDP packets and
for a single-byte packet to return, the typical ACK pattern for TCP sending at max-
imal throughput. Being an FDDI adapter, the DEFZA can transmit no faster than
100 Mbps. Thus, the usual number of packet buffers that will be consumed is given
by (bandwidth * delay / buffer size). Plugging in the numbers yields roughly a
dozen packet buffers, as the short computation below shows.
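The arithmetic, worked through as a small standalone C program using the
values measured above:

#include <stdio.h>

int main(void)
{
    double bw_bytes = 100e6 / 8.0;  /* FDDI line rate, in bytes per second */
    double delay    = 4e-3;         /* measured UDP round-trip time */
    double bufsize  = 4324.0;       /* bytes per packet buffer */

    /* (bandwidth * delay / buffer size): prints about 11.6 buffers */
    printf("%.1f buffers\n", bw_bytes * delay / bufsize);
    return 0;
}

If the adapter holds more than roughly a dozen buffers, the usual case is that
packets are acknowledged before their buffers must be reused, as noted above.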
5.4.3 Results
Nocopyin has been implemented on a DECstation 5000/200 running Ultrix
4.2a, with support for nocopyin in the drivers of both the DEFZA and DEFTA FDDI
adapters. The DEFTA supports DMA for transmission, while the DEFZA uses PIO.
The baseline system is an unmodified Ultrix 4.2a kernel.
The benchmark run is a UDP throughput measurement tool called ncns; it
was run using 4000-byte packets. The measured transmission speed as given in
Table 4 is significant for a 19.5-SPECint CPU. All numbers were taken with check-
sum-computation avoidance in effect, reflecting the proliferation of workstations
shipping with techniques for reducing or eliminating checksumming.
Note that the technique is much more effective for the DMA-based DEFTA
adapter, which is the topic of the next section.
Figure 24: Send-side data movement with DMA but no copyin. No intermediate kernel buffers
are used; data is transferred directly from the user process's buffer.
Figure 25: Send-side data movement with PIO and no copyin. No intermediate kernel buffers
are used; data is copied directly from the user process's buffer to the adapter.
Unless a relatively high-performance part is used for the onboard adapter memory,
there is contention between DMA and CPU on the PIO-based adapter's onboard memory.
Our results support the claim that DMA-based adapters are often more effi-
cient than PIO-based adapters. The DEFTA achieves more than twice the perfor-
mance of the DEFZA.
5.5 Conclusions
The three modifications just described serve to largely eliminate data-touch-
ing overheads, substantially improving maximal throughput. Checksum Redun-
dancy Avoidance eliminates most checksum overheads, DMA Redirection obviates
most copying from kernel to user buffers, and Nocopyin avoids most copying from
user to kernel buffers.
An implementation on a DECstation 5000/200 which included all three
optimizations would probably be capable of a throughput nearly four times as
great as the throughput of a DECstation 5000/200 running Ultrix 4.2a, since data
touching overheads account for roughly 75% of the time required to process
8-kbyte packets, the packet size most commonly used for data transfer.
The other techniques are designed to largely eliminate each of the data-touching
operations known to be the major bottleneck to maximal throughput in such sys-
tems:
• DMA Redirection avoids the need to copy from kernel to user buffers by
redirecting adapter DMA pointers as soon as the header of each packet
arrives; the header identifies the packet sufficiently to set DMA pointers
to the buffer of a waiting process, as sketched below.
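A sketch of that redirection step, assuming a simple lookup of the process
waiting on the packet's destination port; find_waiter and the descriptor layout
are illustrative.

#include <stdint.h>
#include <stddef.h>

struct rx_desc { uint32_t addr; uint16_t len; };   /* receive descriptor */

struct waiter { uint32_t buf_paddr; size_t buf_len; };

/* Stub: would return the socket with a pending read on dst_port. */
static struct waiter *find_waiter(uint16_t dst_port)
{
    (void)dst_port;
    return NULL;
}

/* Called as soon as the header has arrived, before the body DMAs. */
void redirect(struct rx_desc *d, uint16_t dst_port)
{
    struct waiter *w = find_waiter(dst_port);

    if (w != NULL)
        d->addr = w->buf_paddr;  /* body lands directly in user memory */
    /* else leave d pointing at a kernel buffer, as usual */
}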
Some of these techniques need minor redesign in order to handle multicast. The locality detection and statistical
checksumming algorithms described in the Checksum Redundancy Avoidance sec-
tion are as yet unimplemented. More work is required in order for DMA Redirec-
tion to be able to raise throughput in addition to lowering latency of large packets.
Nocopyin does not yet support TCP or NFS.
These techniques could be applied to protocols other than TCP/IP and
implementations other than Berkeley Unix. Each technique attempts to solve a
problem not unique to TCP/IP or Unix in a manner that takes advantage of facili-
ties similarly not specific to TCP/IP or Unix.
6.4 Bibliographical References
[1] M. Accetta, R. Baron, D. Golub, R. Rashid, A. Tevanian, M. Young, “Mach: A New
Kernel Foundation for UNIX Development,” Proceedings of the Summer
1986 Usenix Technical Conference and Exhibition, June 1986.
[3] ANSI/IEEE, “Information Processing Systems - Local and Metropolitan Area
Networks - Part 1d,” ANSI/IEEE Std 802.1d, 1992.
[10] S. Bradner, Harvard Network Device Test Lab results, available by anony-
mous FTP from hsdndev.harvard.edu.
[36] R. Jain and S. Routhier, “Packet Trains: Measurements and New Model for
Computer Network Traffic,” IEEE Journal on Selected Areas in Communi-
cations, vol. 4, no. 6, pp. 986-995, September 1986.
[38] H. Kanakia, D. R. Cheriton, “The VMP Network Adapter Board (NAB): High-
Performance Network Communication for Multiprocessors,” Proceedings of
the SIGCOMM ’88 Symposium on Communications Architectures and Prin-
ciples, pp. 175-187, 1988.
[39] J. Kay, ccsum.c [an Internet checksum routine optimized for the DECstation
5000/200], available for anonymous ftp as ucsd.edu:pub/csl/fastnet/
ccsum.c.Z, April 1993.
[40] J. Kay, J. Pasquale, “Network Sloth: Where the Time Goes in Network Soft-
ware,” University of California, San Diego Technical Report CS92-238,
June 1992.
[44] LSI Logic, L64360 and ATMizer Architecture Manual, August 1994.
[47] J. Mogul, “Network Locality at the Scale of Processes,” Proceedings of the SIG-
COMM ’91 Symposium on Communications Architectures and Protocols, pp.
273-284, August 1991.
[48] J. Mogul, “Path MTU Discovery,” Internet Request for Comments 1191,
November 1990.
[62] Sun Microsystems, Inc., “NFS: Network File System Protocol Specification,”
Internet RFC 1094, March 1989.