10 1 1 30

UNIVERSITY OF CALIFORNIA, SAN DIEGO
Path IDs: A Mechanism for Reducing Network

Software Latency
A dissertation submitted in partial satisfaction of the

requirements for the degree Doctor of Philosophy
in Computer Science and Engineering
by
Jonathan Simon Kay
Committee in charge:
Professor Joseph Pasquale, Chair

Professor George Polyzos
Professor Keith Marzullo
Professor Ramesh Jain
Professor Rene Cruz
1995
Copyright
Jonathan Kay, 1994.
All rights reserved.
The dissertation of Jonathan Simon Kay is approved,
and it is acceptable in quality and form for publication
on microfilm:
Chair
University of California, San Diego

1995
iii
To my friends.
iv
Chapter I:Introduction 13
1.1 Introduction 13
1.2 The Importance of Latency 14
1.3 Latency Optimization 16
1.4 Throughput Optimization 16
1.5 Organization 16
Chapter II:Related Work 18
2.1 Introduction 18
2.2 Optimizing Data-Touching Overheads 18
2.2.1 Checksumming 18
2.2.2 Copying From User to Kernel 19
2.2.3 Copying From Kernel to User 20
2.3 Optimizing Non-Data-Touching Overheads 21
2.4 Network Software Structure 22
2.4.1 Modularity 23
2.4.2 Control Flow 23
2.4.3 Buffer Descriptors 24
2.5 Performance Measurements 25
2.6 Workload Measurement 25
2.7 Conclusions 26
Chapter III:Workloads and Overheads 27
3.1 Introduction 27
3.2 Workload 27
3.2.1 LAN Traffic 27
3.2.2 WAN Traffic 30
3.3 Overhead Categories 30
3.4 Experimental Setup 32
3.5 Packet Size Effects 34
3.5.1 Bottlenecks to Minimal Latency 36
3.5.2 The Importance of Non-Data-Touching Overheads 38
3.5.2.1 Difficulty of Optimizing Non-Data-Touching Overheads 39
3.6 In-Depth Analysis 40
3.6.1 Touching Data 41
3.6.2 Protocol-Specific Processing 43
3.6.3 Mbufs 43
3.6.4 Operating System Overheads 44
3.6.5 Error Checking 45
3.6.6 Data Structure Manipulations 45
3.7 Conclusions 47
Chapter IV:Optimizing Non-Data-Touching Overhead 49
4.1 PathIDs 49
4.2 PathID Implementation 50
4.2.1 New Link Layer 50
v
4.2.2 PathID Fast Lookup 52
4.2.3 Flow of Control 53
4.3 PathID Savings in Functionality 54
4.4 PathID Weaknesses 58
4.5 Performance 59
4.6 Conclusions 60
Chapter V:Optimizing Data-Touching Overhead 61
5.1 Introduction 61
5.2 Checksum Redundancy Avoidance 61
5.2.1 Eliminating Checksum Redundancy 62
5.2.2 UDP Implementation 63
5.2.3 TCP Implementation 63
5.2.4 Performance Improvement 64
5.2.5 Detecting Locality With Certainty 64
5.3 DMA Redirection 65
5.3.1 DMA Redirection 66
5.3.3 Early Packet Detection By Polling From the Idle Loop 69
5.3.4 Results 72
5.4 Eliminating User-to-Kernel Copying 72
5.4.1 Modern Network Data Movement Design 73
5.4.2 Nocopyin 77
5.4.3 Results 82
5.4.4 DMA vs. PIO for High Performance Adapters 84
5.5 Conclusions 85
Chapter VI:Conclusions 86
6.1 Problem Summary 86
6.2 Summary of Results 86
6.3 Future Work 87
6.4 Bibliographical References 89
vi
ACKNOWLEDGEMENTS
Hans-Werner Braun and the National Science Foundation helped by
giving me packet trace data for the NSF backbone and other backbone and
LAN environments around the San Diego Supercomputer Center and UCSD.
This made possible the analysis of WAN and backbone packet processing time
and strengthened the LAN analysis.
Kim Claffy helped greatly with impressively thorough proofreading
jobs. Furthermore, her own research into flow lengths agrees with the implica-
tions of this work that the common mental model of TCP/IP traffic as a series
of long data transfers is woefully inaccurate and in need of update.
Tuong Nguyen assisted by finding out how to use FrameMaker’s more
esoteric features when we were writing our theses at the same time.
Keith Muller was of particular help; he helped with the design and
debugging of the experiments and various of the schemes described in this dis-
sertation. I would also like to thank him for teaching me about the Berkeley
Unix kernel and for providing encouragement when I seemed to be at a dead
end with the the logic analyzer.
Vernon Schryver and Kevin Fall have been helpful in finding problems
with checksum redundancy avoidance. Max Okumoto helped me in finding the
locality detection algorithm that lays those problems to rest.
Peter Desnoyers helped in writing the fast MIPS checksum routine by
finding an algorithm that reduced per-word processing from my algorithm’s
five instructions to his algorithm’s four instructions.
Digital Equipment Corporation provided generous support to Project
Sequoia 2000, with which this work is intricately bound.
Bob Querido, Farhad Razavian, and Bob Cardinal, of Loral Test and
Information Systems, provided me with an education in network hardware and
market characteristics rare in a graduate student.
Lastly and most importantly, I’d like to acknowledge the role played by
my advisor, Joe Pasquale. This work was made possible by the manner in
vii
which he provided an atmosphere of ideas, was always supportive, encouraged inde-
pendent thinking, was generous and humane with financial support, and was able to
acquire and make available to us that rarest of resources in an academic environment
- equipment sufficient for advanced operating systems research.
viii
VITA
September 3, 1968 Born, Mayaguez, Puerto Rico

1989 B.S., Johns Hopkins University
1990 M.S., Johns Hopkins University
1995 Ph.D., University of California, San Diego
PUBLICATIONS
J. Kay, J. Pasquale, “Profiling and Reducing Processing Overheads in TCP/IP,”

submitted for consideration, IEEE/ACM Transactions on Networking.
J. Kay, J. Pasquale, “The Importance of Non-Data Touching Processing Over-

heads in TCP/IP”, Proceedings of the SIGCOMM ‘93 Symposium on Com-
munications Architectures and Protocols, September 1993.
J. Kay, J. Pasquale, “A Summary of TCP/IP Networking Software Performance

for the DECstation 5000,” Proceedings of the 1993 SIGMETRICS Confer-
ence, 249-258, January 1993.
J. Kay, J. Pasquale, “Measurement, Analysis, and Improvement of UDP/IP

Throughput for the DECstation 5000,” Proceedings of the Winter 1993
USENIX Conference, 249-258, January 1993 (available with TCP/IP
acceleration software on ucsd.edu:/pub/csl/fastnet).
J. Kay, T. Nguyen, J. Pasquale, “Fast Software Dithering of 24 to 8 or 1 Bit

Video,” to appear in High-Speed Networking and Multimedia Computing,
IS&T/SPIE Symposium on Electronic Imaging: Science & Technology,
February 1994.
J. Kay, “Network Sloth: Where the Time Goes in Network Software,” Technical
Report CS92-238, University of California, San Diego, December 1991.
J. Pasquale, G. Polyzos, E. Anderson, K. Fall, J. Kay, V. Kompella, S. McMul-

lan, D. Ranganathan, “Network and Operating Systems Support for Mul-
timedia Applications,” Technical Report CS91-186, University of
California, San Diego, March 1991.
J. Kay, “An Empirical Study of Using Certification Trails,” Master’s Thesis, The
Johns Hopkins University, August 1990.
ix
FIELDS OF STUDY
Major Field: Computer Science
Studies in Networking
Professors Joseph Pasquale and George Polyzos
Studies in Operating Systems

Professors Joseph Pasquale and George Polyzos
x
ABSTRACT OF THE DISSERTATION
Path IDs: A Mechanism for Reducing Network Software Latency
by
Jonathan Simon Kay

Doctor of Philosophy in Computer Science and Engineering
University of California, San Diego, 1994
Professor Joseph Pasquale, Chair
Network performance is important to an increasing number of com-

puter applications. Similarly, network hardware performance is climbing to
meet the demands of these new applications, with the advent of 100-megabit
Ethernet, ATM, and fast processors to drive them. By contrast, network soft-
ware latency is improving relatively slowly. It poses a barrier to the increasing
use of networks for a wide variety of applications; as this dissertation shows,
minimizing latency is more important to the minimization of processing time
of the overall network workload than maximizing throughput.
This dissertation introduces PathIDs, a means of reducing network
latency. The latency problem is exacerbated both by the relative lack of
research into latency optimization as compared to throughput optimization
and by the inherent difficulty of optimizing latency. It is difficult to significantly
reduce the average processing time due to non-data-touching overheads
because that time is spread across many operations, so that achieving a signif-
icant savings would require optimization of many different mechanisms.
PathIDs provides a framework for optimizing such a wide variety of overheads.
Optimizing large packet processing is also important; it is the focus of a
great deal of study and is believed to be important to applications such as sci-
xi
entific data management and digital video transmission. This dissertation describes
techniques that greatly reduce processing time for large packets, improving through-
put, by eliminating most of all three major CPU-moderated data-touching operations:
checksumming, copying data from kernel to user buffers, and copying data from user
to kernel buffers. Checksumming is largely eliminated through a technique which
avoids checksumming in the frequent case in which it is redundant with the CRC of a
LAN adapter. Copying from kernel to user buffer can be largely eliminated by redirect-
ing DMA pointers concurrently with packet delivery. Copying from user to kernel
buffer can be avoided by arranging for adapters to DMA directly from user buffers.
xii
Chapter I: Introduction
1.1 Introduction
Network performance is important to an increasing number of com-
puter applications. Many applications written without networking in mind are
turned into network applications through access to network file systems
BibRef[62]. Some newer applications are being written with networking in
mind, even designed with the expectation of a certain level of network perfor-
mance, such as multimedia applications BibRef[15], distributed applications
based upon LANs BibRef[5]BibRef[63], and scientific applications using dis-
tributed data BibRef[61]. Similarly, network hardware performance is climb-
ing to meet the demands of these new applications, with the advent of 100-
megabit Ethernet, ATM, and fast processors to drive them.
By contrast, network software performance has not risen to meet the
challenge implied by the new applications and network hardware. Network
software latency is especially disappointing - it was high compared to network
hardware latency even shortly after the invention of the Ethernet BibRef[60],
and has declined comparatively slowly since. Seven years have passed between
the introduction of the Sun 3/60 (1987) and the Alpha 3000/800 (1994). The
Alpha is roughly 40 times faster than a Sun 3/60, implying that CPU speeds
have improved by an average factor of 1.7 each year, yet the network latency
between a pair of the Alphas is only a fifth that between a pair of the Sun 3s,
implying that network latency has only decreased by an average factor of 1.3
each year.
This dissertation introduces PathIDs, a means of reducing network
latency. Network latency poses a barrier to the increasing use of networks for
a wide variety of applications such as cheap parallel processing
BibRef[63]BibRef[66] and scientific computing BibRef[27]. Additionally, high
latency hampers many other aspects of networking, as it causes a tradeoff
between application implementation ease and performance. Programs based
13
14
on simple remote procedure calls are notorious for poor performance. Protocols
carefully designed for performance over high-latency links, such as TCP, are
often relatively complex, require large buffer spaces to achieve high through-
puts, and require occasional modifications such as the provisions for increased
window space specified in BibRef[7].
1.2 The Importance of Latency

Perhaps in part because of the difficulty of designing applications able
to tolerate high latency, minimizing latency is more important to the minimi-
zation of processing time of the overall network workload than maximizing
throughput. This work includes a study of the breakdown of various processing
overheads required to process a realistic TCP/IP workload on a DECstation
5000/200 running Ultrix 4.2a and used that information to find out what
aspects of the TCP/IP implementation are the greatest bottlenecks. I found
that minimal latency is most in need of improvement.
Processing time breakdowns for control packets 64-128 bytes long fun-
damentally differ from those of 8 kbyte long NFS data packets. The processing
time of large packets is dominated by data-touching operations such as copying
and checksumming BibRef[11]BibRef[19]BibRef[21]BibRef[22]BibRef[67]
because these operations must be applied to each byte. However, small packets
have few bytes, and thus their processing time is dominated by non-data-touch-
ing operations. This property is useful for drawing distinction between large
and small packets: a packet is considered to be large if it is large enough that
its processing time is dominated by data-touching operations, and it is consid-
ered to be small otherwise. On the DECstation 5000/200 used for measurement
in this dissertation, the packet size at which the network software spends an
equal amount of time on data-touching and non-data-touching overheads is in
the vicinity of 1500 bytes, the Ethernet MTU.
Figures 1a-b and 2a-b show that TCP/IP LAN and WAN traffic largely
consists of large and small packets; large packets typically carry bulk data,
15
whereas small packets usually carry control information. On both the LAN and
WAN, small packets far outnumber large packets. In fact, even though process-
ing a large packet requires more time, the high proportion of small packets
causes non-data-touching processing time to dominate processing of the overall
workload.
There are philosophical reasons why network software latency improve-
ment has not historically been a high priority with the network research com-
munity. However, network measurement and observation do not bear out
either reason. There are those who believe that latency is a problem that will
solve itself as machines speed up, whereas throughput will become a worse
problem as data movement becomes more difficult relative to computation
BibRef[50]. However, as noted in the introduction, network software latency is
by no means keeping pace with CPU speed. This is because system software, of
which network software is a subset, exhibits poor memory locality
BibRef[2]BibRef[17] and includes varieties of processing not considered to be
“common cases” by RISC CPU designers such as interrupts and context
switches.
Another reason why network software latency improvement has errone-
ously been considered to be unimportant is it was believed that communication
patterns are more or less uniformly distributed and the increasing diameter of
the Internet would inevitably lead to higher network delays due to the limita-
tion imposed by the speed of light. As it turns out, there is a great deal of local-
ity within TCP/IP networks today; most packets go no further than the LAN on
which they are generated. A related argument is that the explosion of high-
speed WANs will make wide-area communication much more common than
local-area communication. This ignores the question of whether social and
administrative rather than performance issues produced the locality observed
today.
16
1.3 Latency Optimization

The latency problem is exacerbated both by the relative lack of research
into latency optimization as compared to throughput optimization and by the
inherent difficulty of optimizing latency. It is difficult to significantly reduce the
average processing time due to non-data-touching overheads because that time is
spread across many operations, so that achieving a significant savings would
require optimization of many different mechanisms. I suggest a technique,
PathIDs, that provides a framework for optimizing such a wide variety of over-
heads.
1.4 Throughput Optimization

Optimizing large packet processing is also important; it is the focus of a
great deal of study and is believed to be important to applications such as scientific
data management BibRef[61] and digital video transmission BibRef[53]. This dis-
sertation describes techniques that greatly reduce processing time for large pack-
ets, improving throughput, by eliminating most of all three major CPU-moderated
data-touching operations: checksumming, copying data from kernel to user buff-
ers, and copying data from user to kernel buffers. Checksumming is largely elimi-
nated through a technique which avoids checksumming in the frequent case in
which it is redundant with the CRC of a LAN adapter. Copying from kernel to user
buffer can be largely eliminated by redirecting DMA pointers concurrently with
packet delivery. Copying from user to kernel buffers can be avoided by arranging
for adapters to DMA directly from user buffers.
1.5 Organization
The dissertation is organized as follows. Chapter 2 describes related work.
Chapter 3 describes processing time breakdowns under realistic workloads. Chap-
ter 4 concerns a technique for reducing non-data-touching overheads, PathIDs.
17
Chapter 5 fully describes the techniques used to nearly eliminate data-touching

overheads. Finally, the dissertation concludes in Chapter 6.
18
Chapter II: Related Work

2.1 Introduction
Host network software optimization is an area of active interest. This chap-
ter describes much of the work in the field more directly relevant to this disserta-
tion. One important point that this chapter illustrates is the emphasis within the
network community upon improvement in throughput at the expense of improve-
ment in latency.
2.2 Optimizing Data-Touching Overheads

Optimizing data-touching overheads, and thus raising maximal through-
put, has been a major focus of work. The thrust of work in this area is to reduce or
eliminate one or more of six data touching operations involved in processing a
packet in Berkeley Unix BibRef[43]: moving data from the adapter to kernel mem-
ory, moving data from kernel memory to the adapter, moving data from kernel
memory to user buffers, moving data from user buffers to kernel, checksumming
incoming data, and checksumming outgoing data.
2.2.1 Checksumming
Checksumming is often the most time-consuming BibRef[19] of non-data-
touching overheads because it is tricky to implement efficiently, is not as pressing
for an operating system vendor to optimize as other bulk data operations such as
bcopy, and requires architectural carry bit support to perform as quickly as read-
ing data.
There are a variety of strategies for reducing time spent checksumming.
One is to take advantage of processor support to minimize the number of instruc-
tions required. BibRef[8] suggests how this may be done on CISC and vector archi-
tectures, taking advantage of carry bits, and gathering operations into either
processor word sizes or vector lengths on vector architectures. However, optimizing
checksum computations on RISC architectures requires additional techniques,
both to optimize memory accesses, and to reduce the number of instructions
19
required on RISC architectures lacking carry bits. BibRef[41] describes how check-
summing can be optimized on RISC architectures, taking advantage of pipelines,
load delay slots, and scoreboarding to minimize memory access delays; BibRef[39]
describes an algorithm for checksumming with minimal loss of performance on
architectures lacking carry bits.
An alternate strategy is to implement the Internet checksum on the net-
work adapter BibRef[4]BibRef[38]BibRef[57], but that requires extra adapter
hardware or extra muscle from an onboard CPU. For example, DEC’s FDDI
adapter, which does not support onboard Internet checksum processing, is based
on the relatively simple 16-bit MC68000 processor BibRef[55]BibRef[65]. SGI’s
FDDI adapters, which do support the Internet checksum, require a RISC processor
- the 32-bit AMD29000. The Silicon Graphics adapters further have a reputation
for being expensive, though this could be due to other factors such as Silicon
Graphics’ choices of I/O bus and business model.
Still another scheme is to merge copying and checksumming to reduce traf-
fic between memory and the CPU register file
BibRef[19]BibRef[22]BibRef[35]BibRef[51]. Instead of being implemented as sep-
arate routines, called separately, copying and checksumming are implemented as
a single operation that operates on a word in a buffer by reading it into a register,
adding the contents of a register to a sum register, and then writes the word out to
some other location. This saves a read operation over performing checksumming
and copying separately. On some machines, such as machines based upon SPARC
and PA-RISC processors, this strategy is sufficient to result in the effective elimi-
nation of checksum overhead BibRef[35], but it does not work as well on other
machines, such as machines based on MIPS and Alpha processors
BibRef[41]BibRef[57]BibRef[16].
2.2.2 Copying From User to Kernel

The HP Medusa and Afterburner adapters BibRef[4]BibRef[24] introduce a
design which largely removes the performance impact of a copy between adapter
and kernel by placing the kernel buffers on the adapter. These adapters each
20
include several megabytes of VRAM from which the kernel allocates network buff-
ers. Data is moved between the host and the adapter through the normal copy
operation between user and kernel buffers. The data is in fact later moved within
the adapter from the VRAM to the MAC chip FIFO, but CPU performance is not
affected because only adapter hardware resources are engaged. Since the adapter
memory is effectively dual-ported, there is no performance degradation due to con-
tention for the adapter memory. The disadvantages of this design are increased
cost relative to simpler competitive adapters and the fact that PIO is slow on many
hosts because of bus transaction overheads on each cache line/write buffer move-
ment - for example, DMA is ten times faster than programmed I/O across the DEC-
station I/O bus BibRef[25].
2.2.3 Copying From Kernel to User

On many machines which avoid software checksumming, whether because
of redundant checksum avoidance or because hardware checksum support is avail-
able, copying an incoming packet from kernel to user is the single most costly oper-
ation BibRef[41]. This is because the data is not yet in the cache, so copying is
relatively slow. On the DECstation 5000/200, copying uncached data is three times
more expensive than copying cached data. This problem is difficult to solve because
one must specify to network adapters where data should be put in advance of its
actual arrival, and one does not generally know in advance where a packet is going.
One approach to solving this problem, optimistic burst
BibRef[14]BibRef[49] is to guess that the next packet will be bound for the same
buffer in the same process as the previous packet, in the hope that packet trains
are long and common and the process is blocked waiting for the packet. However,
the chance that a packet is be bound for the same transport protocol port as its pre-
decessor - a likely indicator of the chance a packet is bound for the same buffer as
its predecessor - is 60-75% BibRef[47], and it turns out that it is actually impossi-
ble to respond to the beginning of a packet train until the third packet because
even the fastest modern processors do not approach the speed necessary to process
a packet and appropriately modify DMA pointers during the interframe gap after
21
the first packet - by the time processors are fast enough, 100baseT or ATM is likely
to be the dominant LAN medium. Furthermore, the common use of the select call
for synchronization in Berkeley Unix based networking code means that Unix pro-
cesses waiting for data frequently do not specify the buffer in which they want
incoming data to be placed until after the data has arrived.
Another approach is to place a context ID field in the link layer header that
the adapter knows about BibRef[6]. The adapter decides where to DMA incoming
packets based on the ID field, and the ID field corresponds to a transport protocol
session, which in turn can keep track of a list of buffers of waiting processes and
set DMA destinations accordingly.
2.3 Optimizing Non-Data-Touching Overheads

Relatively little attention has been paid to optimizing non-data-touching
overheads, and thus latency, within TCP/IP. There have been some contributions,
notably on fastpaths. The main idea behind fastpaths is to optimize network code
specifically for common varieties of packets. In the case of packet transmission,
this is taken to an extreme in which there are specialized paths which run through
the socket, transport, and network layers for each common kind of packet -
BibRef[35] describes a TCP fastpath, while BibRef[51] describes a UDP fastpath.
There has been more investigation into reduction of and minimization of
latency for simple RPC protocols. Impressive speedups were achieved by imple-
menting some RPC functionality in the microcode of an early Ethernet interface
BibRef[60]. More recently, Ethernet interfaces have become standardized to the
extent that the functionality previously implemented by a processor is imple-
mented in cheap MAC chips - there is no microcode to change. However, FDDI and
ATM are new enough that adapters for them still frequently include onboard pro-
cessors BibRef[44].
More recent approaches have assumed that RPC functionality is best
implented in CPU software. BibRef[56] concerns measurement and optimization
22
of an RPC protocol implemented on a Firefly multiprocessor. BibRef[64] concerns

possible limits to optimization of RPCs, particularly lower limits to latency.
An approach currently showing promise and high performance is the notion
of placing parallel processor network hardware on workstations BibRef[66]. This
hardware has the property that it is able to interface directly to individual pro-
cesses on hosts without a loss in security, so it is often possible for processes to com-
municate without operating system intervention. A further, related idea is the
notion of active messages BibRef[66]: protocol information should be limited to
specification of a function to call that knows how to handle the incoming packet
BibRef[66]. The eventual role of such protocols is so far unclear since losses in func-
tionality result from this approach.
Although little has been published on the subject, router and bridge ven-
dors have been very successful in achieving very high packet processing rates, up
to nearly 200,000 packets per second BibRef[10], which implies spectacularly low
non-data-touching overheads. It is known that Cisco achieves these packet pro-
cessing rates through hardware BibRef[30].
2.4 Network Software Structure

Network software performance concerns are closely bound with overall net-
work software design concerns, as some network software designs are much better
for performance than others. An ideal network software structure from the perfor-
mance point of view is one which makes it easy to write high performance network
software.
The Berkeley Unix network subsystem is probably the most widely investi-
gated network software. It is also one of the most widely used, as well as one of the
best-performing - most of the best-performing network software is based upon
recent versions of it.
As the major description of Berkeley Unix design, BibRef[43] is invaluable;
it probably remains the best description of a robust, yet high performance network
23
software subsystem in wide use. BibRef[23] also describes the design of a network
software subsystem, but it is not oriented towards high performance.
Most publications on network software structure concern a specific area of
design. Existing research is particularly strong on a few points of traditional diffi-
culty in networking software: modularity, control flow, and buffer descriptors.
2.4.1 Modularity
The software engineering difficulties associated with network subsystems
are formidable, as network subsystems usually comprise large bodies of code. As
in many other areas of operating system design, modularity has traditionally been
employed to combat these problems BibRef[19]. Sometimes modularity has been
taken to extremes that candamage performance: for example, in Xinu BibRef[23],
each protocol has its own thread and input/output queues. More modestly, streams
BibRef[54] is a system in which each protocol can have its own coroutines and
queues, but does not have to; still, some implementors of streams-based TCP/IP try
to reduce protocol boundary penalties by placing multiple protocols within a single
streams module BibRef[37]. Berkeley Unix largely uses procedure calls, source
code file boundaries, and naming conventions to achieve modularity BibRef[43]. It
has been argued that even this level of modularity impedes performance;
BibRef[19]BibRef[22] suggest violating layer boundaries to combine copying and
checksumming, while BibRef[35] suggest combination of layers into fastpaths that
take advantage of the limited scope of applicability of each path to reduce the
amount of functionality, and thus processing time, required to process packets eli-
gible for processing by a fastpath.
2.4.2 Control Flow

Different systems have different designs for control flow. Berkeley Unix has
a design which changes context relatively little. Transmission processing is largely
accomplished by the application process’ thread of control, with the exception that
some adapter processing is done in the transmit interrupt thread of control. Packet
reception is more complex: adapter processing is accomplished in the receive inter-
24
rupt thread of control, sockets (the interface between user processes and kernel
networking code) processing occurs in the application process’ thread of control,
but protocol processing is largely done in the thread of control of a software inter-
rupt generated during the receive interrupt. The additional software interrupt is
useful because it has a lower priority than network interrupts, so adapters with
small numbers of receive buffers can free those buffers as quickly as possible. How-
ever, most recently designed adapters are equipped with adequate receive buffers,
so recently there have been proposals to eliminate the software network interrupt
BibRef[51].
Some systems have more complicated flows of control - Xinu BibRef[23]
embeds each protocol in its own thread, Mach BibRef[1] moves network processing
to a separate user process, and streams BibRef[54] uses a flexible coroutines mech-
anism.
2.4.3 Buffer Descriptors

Any general-purpose network software implementation requires some form
of complex buffer descriptor, such as the Berkeley Unix mbuf, in order to defrag-
ment packets without copying fragments into an intermediate buffer. The striking
difference between the complexity of such buffers and the simplicity of more tradi-
tional <base, length> buffer descriptors usable for most other forms of I/O
causes ongoing debate. The additional memory allocation overheads resulting
from the more complex mbufs are the subject of still more debate - most variants
of Berkeley Unix use somewhat different network buffer allocators. BibRef[43]
explains the Berkeley Unix network buffer descriptors, called mbufs. BibRef[11]
details how the mbuf allocation policy, tuned towards memory conservation in
early versions of Berkeley Unix intended to run on 4 MB VAXes, produces counter-
intuitive performance behavior; this behavior is also observable in the Ultrix ker-
nel measured for this study. BibRef[32] describes the design of the buffer
descriptor used in version 3.1 of the x-kernel, which encapsulates for ease of use a
number of buffer manipulation primitives that must be written on a per-case basis
when using mbufs, and attempts to reduce the number of calls needed for buffer
25
manipulation. BibRef[40] shows how the x-kernel’s lack of an allocator tuned to

deliver certain block sizes of data rapidly results in memory allocation being a
major bottleneck to x-kernel processing time.
2.5 Performance Measurements

A measurement study of TCP/IP under Berkeley Unix is completed every
few years. BibRef[11] is an thorough study of overheads under 4.2 BSD and a
description of some improvements made in 4.3 BSD as a result of the study.
BibRef[21], BibRef[34] include information on throughput bottlenecks in the
Tahoe BSD release.
In addition, some vendors with network subsystems based on Berkeley
Unix have published papers which include information on maximal throughputs
achievable by their systems: BibRef[4]BibRef[24] include such studies of HP/UX,
BibRef[59] is a study of Ultrix throughput, and BibRef[16] is a study of OSF/1.
Each publication except BibRef[16] is primarily concerned with adapter design;
the throughput information appears because adapter design and host network per-
formance are closely intertwined.
2.6 Workload Measurement

This dissertation introduces little new information on network workloads.
The conclusion that non-data-touching overheads consume more time than data-
touching overheads is, however, sensitive to network workload. In particular, it is
dependent upon a few, relatively well-established principles of TCP/IP network
workloads: network workloads are bimodal, LAN workloads are dominated by
small NFS packets and a small fraction of large NFS packets, and nearly all pack-
ets on WANs are less than 512 bytes long. These points are made convincingly for
LANs in BibRef[29] (keeping in mind that 1k ND packets have been replaced with
8k NFS packets) and for WANs by BibRef[12].
26
2.7 Conclusions
In summary, although, as this chapter has shown, most aspects of network
software performance have been investigated to some degree, issues in reducing
minimal software latency in general-purpose protocol suites have been examined
with less thoroughness than issues in raising maximum software throughput. This
imbalance in effort is probably a contributor to the imbalance between network
throughput and latency. Recently, improvements in network throughput have been
seen from the deployment of adapters supporting hardware checksumming and
onboard kernel buffers. During the same time frame, relatively little improvement
has been seen in network latency, despite the effects of the increase of the ratio of
processor speed to memory speed BibRef[50].
27
Chapter III: Workloads and Overheads

3.1 Introduction
This chapter concerns why non-data touching operations dominate the time
spent processing workloads of packets with realistic size distributions. It goes on
to discuss problems in reducing latency and detailed breakdowns of network soft-
ware overheads.
These findings result from a study I undertook of bottlenecks to TCP/IP pro-
cessing to find what varieties of optimization would be most useful. The study
involved measurements of overhead breakdowns in a TCP/IP implementation and
of the packet size distribution on a network of general-purpose workstations.
3.2 Workload
To determine realistic TCP and UDP packet size distributions, packet
traces of two different FDDI networks, each used for a different purpose, were
obtained. One trace, which reflects wider usage, is of traffic destined for and gen-
erated by a file server on an FDDI LAN of general-purpose workstations at UC
Berkeley between 5 and 9 PM on Sept 11, 1992. The other trace, of wide area net-
work traffic, is from the FDDI network that feeds the NSFnet backbone node at the
San Diego Supercomputer Center. This is the same trace as was described and
analyzed in BibRef[18]; it was taken between 2 and 3 PM on March 23, 1993.
Both traces show strong bimodal behavior. This and the other traffic behav-
iors observed conform to findings of earlier studies
BibRef[12]BibRef[29]BibRef[47].
Packet size is defined to mean the amount of actual user data sent, not
including protocol headers. Figures 1a-b show the packet size distributions for
TCP and UDP packets.
3.2.1 LAN Traffic

The trace that reflects widest usage is a trace of traffic destined for and gen-
erated by a file server on an FDDI LAN used for general-purpose computing.
28
0.35
0.30
0.25
Fraction of Total Messages
0.20
0.15
0.10
0.05
0.00
0 1 3 7 15 31 63 127 255 511 1023 2047 4095 8191
Message Length in Bytes
Figure 1a: Histogram of TCP packet sizes on a LAN. The packet size scale is logarithmic.
Note that almost all TCP packets are small (0.5% of the TCP packets are actually larger than
256 bytes - they are merely invisible in the histogram).
0.35
0.30
0.25
0.20
0.15
0.10
0.05
0.00
0 1 3 7 15 31 63 127 255 511 1023 2047 4095 8191
Message Length in Bytes
Figure 1b: Histogram of UDP packet sizes on a LAN, with a logarithmic packet size scale. As
in Figure 1a, most of the packets are small. However, a significant fraction of the packets are
very large.
139,720 packet sizes were collected; 90% of the packets are UDP packets, mostly
generated by NFS. The rest are TCP packets.
Figure 1a shows that almost all the TCP packets observed are small; over
99% are less than 200 bytes long. Figure 1b shows a bimodal distribution of UDP
packet sizes. The great majority of UDP packets are small, but there are some very
29
large packets: 86% of the UDP packets are less than 200 bytes long, while 9% are
around 8192 bytes long.
The observed packet size distributions matched expectations set by previ-
ous work on Ethernet-based packet traces BibRef[29]BibRef[47]. The median
packet sizes for TCP and UDP packets are 32 and 128 bytes, respectively. The low
UDP median reflects the fact that even in the case of UDP, most packets are small
(e.g. NFS status packets). The reason for the large number of 8 kilobyte UDP pack-
ets is NFS. The scarcity of large TCP packets (there are a few large TCP packets,
as large as 2048 bytes, but not enough to be visible in Figures 1a-b) arises because
it is easier to move data by NFS than by more traditional TCP-based file movement
commands such as FTP and rcp. Examination of data from other LANs supported
these results. There are applications that produce large TCP packets, such as X
Window System image transfer and Network News Transfer Protocol, but these
packets are relatively infrequent.
0.12
0.30
0.10
0.25
0.08
0.20
0.06
0.15
0.04
0.10
0.05 0.02
0.00 0.00
0 1 3 7 15 31 63 127 255 511 1023 2047 0 1 3 7 15 31 63 127 255 511 1023
Message Length in Bytes Message Length in Bytes
Figure 2a-b: Histograms of TCP (2a) and UDP (2b) packet sizes on a WAN. The packet size
scale is logarithmic. The longest packets in the trace are 1460 bytes long, not quite long
enough for touching data to dominate their processing time. Contrary to the distribution seen
in the LAN trace, 92% of the packets in this trace are TCP packets.
30
3.2.2 WAN Traffic

Packets are even smaller on wide-area networks. No packet in the SDSC
WAN trace is larger than 1500 bytes. Packets of lengths in the vicinity of 1500
bytes are very much in the minority, and are TCP packets probably generated by
the few hosts that implement MTU Discovery BibRef[48]. 99.7% of the packets are
no more than 536 bytes long.
Contrary to the what is seen on the LAN, most WAN data is sent by TCP.
This is unsurprising because TCP contains mechanisms for working smoothly over
WANs and UDP does not. This is one reason why some vendors provide a port of
NFS to TCP for use over WANs.
Few hosts see a WAN-like distribution of traffic. Most TCP/IP packets are
generated on a LAN and destined for another host on the same LAN; thus, on most
hosts the workload component of packets that cross WANs is drowned out by the
much greater workload of purely local packets. The WAN trace is included as an
attempt at determining the processing profile of hosts communicating by wide area
networks. Also, once again, these results are unsurprising and consistent with
BibRef[12].
3.3 Overhead Categories

The major processing overheads in network software are broken into cate-
gories as follows (the name abbreviations used in subsequent graphs are in paren-
theses): checksum computation (“Checksum”), data movement (“DataMove”), data
structure manipulations (“Data Struct”), error checking (“ErrorChk”), network
buffer management (“Mbuf”), operating system functions (“OpSys”), and protocol-
specific processing (“ProtSpec”). Other studies have shown some of these over-
heads to be expensive BibRef[11]BibRef[19]BibRef[21]BibRef[67].
Checksum: Computing checksums is accomplished by a single procedure,
the Internet checksum routine BibRef[8], which is performed on data in TCP and
UDP, and on the header in IP.
31
DataMove: There are three of data movement: copying data between user
and kernel buffers (Usr-Krnl Cpy), copying data out to the FDDI controller (Device
Copy), and cache coherency maintenance (Cache Clear).
DataStruct: Manipulation of the socket buffer (Socket Buffer), IP defrag-
mentation queue (Defrag Queue), and interface queue (Device Queue) data struc-
tures. Mbuf manipulation is covered by its own category.
ErrorChk: The category of checks for user and system errors, such as
parameter checking on socket system calls.
Mbuf: All network software subsystems require a complex buffer descrip-
tor which allows the headers to be prepended and packets to be defragmented effi-
ciently. Berkeley Unix-based network subsystems buffer network data in a data
structure called an mbuf BibRef[43]. All mbuf operations are part of this category.
Allocation and freeing of mbufs are the most time-consuming mbuf operations.
OpSys: Operating system overhead includes support for sockets, synchro-
nization overhead (sleep/wakeup), and other general operating system support
functions.
ProtSpec: This category is of protocol-specific operations, such as setting
header fields and maintaining protocol state, which are not included in any of the
other categories. This category is a comparatively narrow definition of protocol-
specific processing. For example, although checksumming is a part of TCP, UDP,
and IP, it is placed in a separate category because of its high expense, and because
it is not limited specifically to any one of these protocols.
Other: This final category of overhead includes all the operations which are
too small to measure. An example of this is the symmetric multiprocessing (SMP)
locking mechanism called frequently in DEC’s Ultrix 4.2a, the operating system
measured. SMP locking executes hundreds of times per packet, but each execution
consumes less time than the uncertainty on the execution time of our probes. Thus
it is not possible to tell with certainty how much time is consumed by that mecha-
nism. In the processing times presented, the time due to “Other” is the difference
32
between the total processing time and the sum of the times of the categories listed
above.
DECstation 5000/200
DECstation 5000/200
process
process
kernel
Measured Region of Code = kernel
System Under
Test
Isolated FDDI Network
Figure 3: The experimental system consisted of two DECstation 5000/200 workstations con-
nected by an FDDI network. A process on one machine sends a packet to a process on the sys-
tem under test, which then sends the same packet back. The lines connecting the processes
show the path of the packet. Measurements are made on that part of executed code located in
the kernel of the system under test, highlighted by the stippled lines.
3.4 Experimental Setup

The experiments involve instrumenting the TCP/IP and UDP/IP protocol
stacks in the DEC Ultrix 4.2a kernel. All measurements were taken on a DECsta-
tion 5000/200 workstation, a 19.5 SPECint MIPS RISC machine, connected to
other similar workstations by an FDDI LAN. Software processing time measure-
ments with a resolution of 40 nanoseconds (the DECstation clock cycle time) are
taken by connecting an HP 1652B Logic Analyzer to the DECstation I/O bus. The
kernel is instrumented by placing C preprocessor macros at the beginning and end
of the source code for each operation of interest. Each macro execution causes a
pattern plus an event identifier to be sent over the DECstation’s I/O bus. The logic
analyzer recognizes the pattern and stores the event identifier, along with a times-
tamp. The measurement software causes minimal interference, generating over-
head of less than 1% of the total network software processing time.
Offline software later downloads the raw measurement information from
the logic analyzer and converts the information to processing times for various
33
components of network software when sending and then receiving the same
packet. The experimental system, shown in FigureFoobleFigure 3, consists of two
workstations connected by an FDDI network with no other workstations and no
network traffic other than that generated by the experiment. An experiment con-
sists of one workstation sending a packet to the system under test, which then
sends the same packet back. All measurements are made on the system under test,
which is executing a probed kernel and is hooked up to the logic analyzer. Each
experiment is carried out for 40 packet sizes evenly spaced from 1 byte to 8192
bytes. Experiments are repeated 100 times at the same packet size to obtain sta-
tistical significance in the results (the average percentage of standard error over
all categories is less than 5%).
The experiments are designed to only capture CPU time spent processing
network packets, ignoring other sources of delays in network communications. Our
timings do not include network transmission times. Nor is time that a packet is
held up by flow or congestion control counted, except for the processing time
needed to decide not to hold up the packet; the workload does not provoke flow or
congestion controls into operation. Note that the TCP workload design has the
result that acknowledgments are always piggybacked.
3000 3000
Checksum Checksum
2500 2500
2000 2000
Microseconds
Microseconds
1500 Data Move 1500 Data Move
1000 1000
500 ProtSpec 500 Mbuf

Mbuf
Other ProtSpec
Other
OpSys Error Chk
Data Struct Data Struct
0 Error Chk 0 OpSys
8 409 1024 1638 2252 2867 3481 4096 4710 5324 5939 6553 7168 7782 1 409 1024 1638 2252 2867 3481 4096 4710 5324 5939 6553 7168 7782
Message Length In Bytes Message Length In Bytes
Figure 4a-b: Breakdown of operation processing times for TCP (4a) and UDP (4b) packet
sizes ranging from very small, 0 bytes, to very large, 8192.
34
3.5 Packet Size Effects

This section presents results on processing overhead times by packet size.
These results provide a basis for understanding the performance effects given a
network workload. FigureFoobleFigure 4 shows the per-packet processing times
versus packet size for the various overheads for TCP and UDP packets, for a large
range of packet sizes, 1 to 8192 bytes. One can distinguish two different types of
overheads: those due to data-touching operations (Data Move and Checksum) and
those due to non-data-touching operations. Data-touching overheads dominate the
processing time for large data transfer packets, whereas non-data-touching oper-
ations dominate the processing time for small control transfer packets. This is
because generally, data-touching overhead times scale linearly with packet size,
whereas non-data-touching overhead times are comparatively constant.
The observation that non-data-touching overheads are constant is simplis-
tic; FigureFoobleFigure 4 shows variations in non-data-touching processing times
by packet size. Fragmentation is the reason for the jump at 4312 bytes: FDDI can-
not transmit packets larger than 4352 bytes, so TCP and IP fragment packets
larger than 4312 bytes (4312 = 4352 - TCP header length - IP header length) into
multiple packets. Fragmentation results in more non-data-touching processing
time because it is necessary to execute device driver routines for each fragment.
Mbuf manipulation is another reason for the humps. Ultrix 4.2a network
software (as well as other network software based on Berkeley UNIX) uses a data
structure called mbufs for buffering. An mbuf is a 128-byte structure. Under
Ultrix, it can either store up to 108 bytes of data or a pointer to a 4096-byte page
if more space is needed. Mbufs containing a pointer to a page are called ‘cluster
mbufs’ and do not store any data in the original 128-byte data structure. All mbufs
contain linked list pointers so they can be connected into an ‘mbuf chain,’ either to
store packets larger than 4096 bytes, to prepend protocol headers to mbuf chains,
or to string fragments into a single chain. Allocation of mbufs to hold packets is
done in a complex manner. To reduce internal fragmentation, the socket transmis-
sion code allocates a string of up to 10 small mbufs to hold packets no greater than
35
1024 bytes long, calling a memory allocation routine for each mbuf. Up to 4096
bytes of packets larger than 1024 bytes are held in a single cluster mbuf; this
causes two calls to the memory allocator, one for the mbuf and another for the
page. The algorithm is repeated to hold the rest of each packet longer than 4096
bytes until enough mbufs have been allocated to hold the entire packet. The bimo-
dality of the mbuf allocation algorithm causes the hump between 1 and 1024 bytes
and part of the hump between 4096 and 5120 bytes, largely because of the result-
ing distribution of numbers of calls to the mbuf allocation routine. Others have
remarked on the bimodality BibRef[11]BibRef[32].
The breakdowns of TCP and UDP processing times shown in FigureFooble-
Figure 4 are very similar. Even TCP protocol-specific processing is only slightly
more expensive than UDP protocol-specific processing. The differences are so
small because even though TCP is the most complicated portion of the TCP/IP
implementation, it is only a relatively small part of the executed layers of network
software. Sending or receiving either a TCP or UDP packet involves executing IP,
the socket layer, the FDDI driver, and numerous support routines.
100 100
Other Other
Mbuf Mbuf
90 OpSys 90 OpSys
Error Chk Error Chk
ProtSpec ProtSpec
80 Data Move 80 Data Move
Checksum Checksum
70 70
60 60
Microseconds
Microseconds
50 50
40 40
30 30
20 20
10 10
0 0
8 409 1024 1638 2252 2867 3481 4096 4710 5324 5939 6553 7168 7782 1 409 1024 1638 2252 2867 3481 4096 4710 5324 5939 6553 7168 7782
Message Length in Bytes Message Length in Bytes
Figure 5a-b: Breakdown of operation processing times expressed as accumulated percentages

of the total processing time for TCP (5a) and UDP (5b) packet sizes.
FigureFoobleFigure 5 presents a different view of the data, showing the

breakdown of processing overhead times expressed as cumulative percentages of
36
the total processing overhead time. Notice that the upper two regions in Figure-
FoobleFigure 5 are due to the data-touching overheads; for large packet sizes these
operations consume approximately 70% of the total processing overhead time.
However, as packets get smaller, the non-data-touching overhead times become
more prominent, as expected. In fact, for single byte packets, data-touching over-
heads contribute only 11% of the total processing overhead time.
400 400
350 ProtSpec 350
Mbuf
300 300
Mbuf
250 Checksum 250 Checksum
Data Move Data Move
Microseconds
Microseconds
ProtSpec
200 200
Other
150 150
Other
100 100 OpSys
OpSys Error Chk

50 Error Chk 50
0 0
8 204 409 614 1 204 409 614
Message Length In Bytes Message Length In Bytes
Figure 6a-b: Breakdown of operation processing times for small TCP (6a) and UDP (6b)
packet sizes.
FigureFoobleFigure 6 magnifies the left-most regions of the graphs in Fig-

ureFoobleFigure 4 to focus on small packets. The packet sizes range from 0 to 614,
which includes the most common sizes of packets BibRef[12] sent over an Internet.
For small packets, especially those smaller than 128 bytes, typical for remote pro-
cedure calls and TCP acknowledgments, the non-data-touching overheads clearly
dominate.
3.5.1 Bottlenecks to Minimal Latency

So far it has been established that network software processing overhead
times for small packets depend mainly on non-data-touching operations, while for
large packets the processing overhead times depend mainly on data-touching oper-
ations. It has also been established that the distribution of packet sizes differs by
37
400
300
Microseconds
200
100
0
Checksum Data Move ProtSpec Data Struct Error Chk OpSys Mbuf Other
Operation
Figure 7a: Profile of aggregate LAN TCP operation processing times. Since most TCP packets
are small, more time is spent on protocol overheads than on data movement.
400
300
Microseconds
200
100
0
Checksum Data Move ProtSpec Data Struct Error Chk OpSys Mbuf Other
Operation
Figure 7b: Profile of aggregate LAN UDP operation processing times. Since there are a num-
ber of large UDP packets, checksumming and copying dominate.
transport protocol (e.g., TCP or UDP) and environment (e.g., LAN or WAN). Thus,
the categories of processing overheads defined in Section 2 have different relative
costs depending upon protocol and environment because of the differing distribu-
tions of packet sizes. These observations lead to the aggregate processing overhead
times based on the packet size distributions given in FigureFoobleFigure 1 and
FigureFoobleFigure 2. FigureFoobleFigure 7 shows aggregate processing over-
38
head times for TCP and UDP on a LAN, while FigureFoobleFigure 8 shows aggre-
gate processing times for a WAN. This dissertation concentrates on the LAN
environment because it is represents the more common overall workload for a com-
puter.
300
300
250
250
200
200
Microseconds
Microseconds
150
150
100
100
50
50
0
0
Checksum Data Move ProtSpec Data Struct Error Chk OpSys Mbuf Other Checksum Data Move ProtSpec Data Struct Error Chk OpSys Mbuf Other
Operation Operation
Figure 8a-b: Histograms of TCP (8a) and UDP (8b) packet sizes on a WAN. The packet size
scale is logarithmic. All of the packets in the trace are too short for data-touching overheads
to dominate packet processing time, so it is unsurprising that copying and checksumming
consume so little time in the overall breakdown. In contrast to the distribution seen in the
LAN trace, 92% of the packets in this trace are TCP packets.
3.5.2 The Importance of Non-Data-Touching Overheads

Figure 7a shows that only a small fraction (16%) of the total time spent pro-
cessing TCP packets on a LAN is due to checksumming and data movement.
Checksumming and data-touching are significant factors for LAN UDP, as shown
in Figure 7b, but they do not overwhelm the other categories of overhead, and they
only represent 40% of the total processing time. Consequently, the non-data-touch-
ing operations have a major effect on performance. Because packets tend to be far
smaller on WANs than on LANs, touching data consumes still less time: on WANs,
17% of the time required to process TCP packets is spent touching data, while 13%
of the time needed to process UDP packets is spent touching data.
In considering reasons for the large size of the non-data-touching over-
heads, I do not believe that the Ultrix 4.2a network subsystem is poorly tuned. Our
39
observation is that TCP/IP implementation in Ultrix 4.2a and similar Berkeley

Unix variants has very rich functionality. Even the UDP/IP protocol stack under
Unix has a great deal of functionality, despite the reputation UDP has for being a
minimal protocol: UDP under Ultrix has its own port space, checksum integrity,
scalable internetworking, fragmentation, dynamic LAN address resolution, two-
level input buffering and queuing, an error-checked system call interface to user
programs, etc. Most of this functionality is due to layers other than UDP: IP, sock-
ets, the link layer, and the device driver. Each bit of functionality carries a cost, the
sum of which is significant. Consider that the TCP portion of the ProtSpec (i.e. pro-
tocol-specific processing) category accounts for only 13% of the total processing
time for single-byte packets (whose processing times are least affected by data-
touching operations); all the other time is consumed by functionality provided by
other layers and by the operating system.
Some researchers BibRef[60]BibRef[64]BibRef[66] have built specialized
protocols with impressively low latency. However, each of these protocols achieve
low latency by supporting the absolute minimal feature set needed by its applica-
tion, resulting in far simpler code than TCP/IP. Thus, at the moment, notably low
latency has only be achieved by building highly application-specific protocols from
the ground up.
3.5.2.1 Difficulty of Optimizing Non-Data-Touching Overheads

Eliminating the single operation of checksumming is sufficient to produce
a large reduction in processing time for large packets. An increasing number of
TCP/IP implementations succeed in reducing or avoiding the checksumming oper-
ation by various strategies. Some vendors have disabled UDP checksumming by
default, risking a loss in packet integrity and in contravention of Internet host
requirements BibRef[9]. Clark and Tennenhouse have suggested combining check-
summing and copying using integrated layer processing BibRef[22]. Silicon Graph-
ics, Inc. workstations move the burden of performing the checksum computation
from the CPU to their FDDI interfaces BibRef[57]. This dissertation includes a
proposal intended to preserve almost all of the performance gain from disabling
40
checksums with nearly none of the loss of integrity nor any additional cost in hard-
ware. Specifically, it seems reasonable to avoid the TCP and UDP data checksum
computations when the source and destination are both on the same LAN and that
LAN supports a hardware CRC. Since the overwhelming majority of packets fit
those criteria, our Checksum Redundancy Avoidance algorithm provides a dra-
matic performance improvement for most packets without loss of checksum protec-
tion. Such strategies can almost double throughputs.
Improving overall performance by optimizing non-data-touching operations
is more difficult. For example, because of the lack of large TCP packets, the most
prominent overhead category in the TCP profile in Figure 7a is ProtSpec. ProtSpec
consists of the protocol-specific processing overhead from each protocol layer (TCP,
IP, Link layer, and FDDI driver). The largest component of ProtSpec is that of TCP,
which as mentioned above consumes 13% of the total processing time; however,
TCP protocol-specific processing is actually made up of a large number of smaller
operations, as are the Mbuf and Other categories. Thus, a wide range of improve-
ments would be needed to produce a significant improvement in performance for
the non-data-touching operations. The system described in BibRef[64] could, from
a certain point of view, be considered to represent a successful dramatic improve-
ment in latency through a complete redesign and rewrite of a UDP-based RPC.
3.6 In-Depth Analysis

This section takes a look at the aggregate times for individual categories
when processing a LAN workload, in decreasing order of importance. Each cate-
gory is examined in detail and the most time-consuming factors in each category
are explained. Differences between TCP and UDP time breakdowns are also
explained. In general, the differences either result from differences between the
distributions of packet sizes or differences in TCP and UDP protocol complexity.
41
3.6.1 Touching Data

Checksum and DataMove are the operations that touch each byte of the
packet; thus, their operation times increase with packet size. The data-dependent
routines are the same for TCP and UDP.
As FigureFoobleFigure 4 and FigureFoobleFigure 5 illustrated, Checksum
is the dominant overhead for large packets. FigureFoobleFigure 7 shows that the
aggregate times spent checksumming are 48 microseconds and 356 microseconds
for the TCP and UDP packet size distributions, respectively.
DataMove is the next largest time-consuming category of overheads for
large packets; FigureFoobleFigure 9 presents a breakdown of this category into
three operations: Usr-Krnl Cpy, Device Copy, and Cache Clear. Usr-Krnl Cpy is the
amount of time spent copying incoming packets from kernel to user buffers and
outgoing packets from user to kernel buffers. Device Copy and Cache Clear con-
cern movement of data between kernel buffers and the FDDI controller. The con-
troller used for our measurements does not support send-side DMA, so the CPU
must copy data to the controller. Device Copy is the amount of time needed to copy
each packet from kernel buffers to buffers in the FDDI controller. The controller
does support receive-side DMA, so there is no receive-side equivalent to Device
Copy. A cache coherency protocol is needed; Cache Clear is the amount of time used
to insure cache consistency when the FDDI controller puts incoming packets into
memory.
The amount of time spent copying data between the device and memory
dominates the other times, with user-kernel copying time coming in a close second.
In fact, Device Copy and Krnl Cpy do roughly the same amount of work, but cach-
ing improves the latter’s time.
Since the time consumed by DataMove overheads increase with packet size,
the higher times for UDP reflect the higher average UDP packet length. Another
smaller source of difference between TCP and UDP is that Device Copy consumes
less time than Usr-Krnl Copy in TCP, but the pattern is reversed for UDP. That is
because each of the data-touching overhead times effectively has a constant com-
42
100
100
80
80
60
60
Microseconds
Microseconds
40
40
20
20
0
0
Cache Clear Device Copy Usr-Krnl Cpy Cache Clear Device Copy Usr-Krnl Cpy
Operation Operation
Figure 9a-b: Aggregate data movement times for TCP (9a) and UDP (9b) packets. Copying
times are much higher for UDP because of the much greater numbers of large UDP packets.
ponent and a component rising linearly with packet size. Usr-Krnl Copy has a
larger constant component but a smaller linear component than Device Copy, and
thus the reversal in relative sizes is due to the different TCP and UDP packet size
distributions.
150
150
100
100
Microseconds
Microseconds
50
50
0
Demux ifaddr Arp IP Protl Link Protl TCP Protl Device Protl Demux ifaddr Arp (Dis)connect IP Protl Link Protl UDP Protl Device Protl
Operation Operation
Figure 10a-b: Aggregate protocol processing (Protls) time under TCP (10a) and UDP (10b)
packets. TCP protocol processing time is large, but less than half of total protocol processing
time. UDP layer overheads are very small.
43
3.6.2 Protocol-Specific Processing

As seen in FigureFoobleFigure 7, ProtSpec is the dominant category for
TCP and is prominent for UDP. For TCP packets, ProtSpec consumes nearly half
the total processing overhead time.
A breakdown of the ProtSpec category is shown in FigureFoobleFigure 10.
Interface Protl is in the device driver layer, Link Protl is part of the IEEE 802
encapsulation layer, IP Protl is the IP layer, and TCP Protl is the TCP layer. Arp
is the Address Resolution Protocol, Demux is the operation of finding a protocol
control block, given a TCP or UDP header (in_pcblookup), and PCB (Dis)connect
includes the operations of checking that session information for a socket, setting
up the protocol control block to reflect current connection state properly, and
removing the session information (in_pcbconnect and in_pcbdisconnect).
TCP Protl dominates the ProtSpec category. This is in stark contrast to
UDP, of which UDP Protl is a rather small portion. However, despite TCP Protl’s
size, it only consumes 13% of the total TCP/IP processing time. Reduction of TCP
protocol-specific processing time would be useful, but it is questionable whether
the performance improvement would be worth the major effort required. Achieving
such a reduction would require significant improvements across the entire stack of
protocols.
At first glance, it may be surprising that Demux is so small, under 20 micro-
seconds, given reports that this operation is a bottleneck BibRef[21]BibRef[45].
This is because our single-user testing environment only has a single connection,
and hence a single entry in the list of protocol control blocks.
3.6.3 Mbufs
Mbuf is the second largest of the non-data-touching categories of overhead.
The mbuf data structure supports a number of operations, the most costly of which
is Mbuf Alloc. FigureFoobleFigure 11 contains breakdowns of the Mbuf category
into mbuf allocation and deallocation (Mbuf Alloc) and all other mbuf operations
(Mbuf Misc).
44
140
140
120
120
100
100
Microseconds
Microseconds
80
80
60
60
40
40
20
20
0
0
Mbuf Misc Mbuf Alloc Mbuf Misc Mbuf Alloc
Operation Operation
Figure 11a-b: Aggregate network buffer (i.e. “mbuf”) management times for TCP (11a) and
UDP (11b) packet sizes. The large cost of the several memory allocation operations per-
formed by the mbuf packages is one of the largest single operation costs.
Mbuf Alloc is more expensive for two reasons: memory allocation is an

inherently expensive operation and memory allocation is performed a number of
times per packet. Mbuf Alloc consumes more time for UDP because of the mbuf
allocation strategy. Under Ultrix, packets less than 1024 bytes long are stored in
as many small buffers as necessary, each individually allocated to store at most 108
bytes. The overwhelming majority of TCP packets fit within a single mbuf, while
many UDP packets are long enough to require two mbufs for data storage.
The other mbuf operations constituting Mbuf Misc are simpler operations
such as copying mbuf chains, defragmentation (implemented by joining two linked
lists of mbufs), and packet length checking. TCP spends more time in Mbuf Misc
than UDP because TCP must make a copy of each packet in case retransmission is
necessary.
3.6.4 Operating System Overheads

FigureFoobleFigure 12 shows the breakdown of operating system over-
heads. Sleep is the kernel call a process uses to block itself. Socket Misc is the mis-
cellaneous processing of the socket layer. Wakeup is the call used to awaken a
process. Proc Restart is the time needed for the sleeping process to start running
45
25
25
20
20
Microseconds
Microseconds
15
15
10
10
5
5
0
0
Proc Restart Socket Misc Sleep Wakeup Softint Handler Sched Soft Intr Proc Restart Socket Misc Sleep Wakeup Softint Handler Sched Soft Intr
Operation Operation
Figure 12a-b: Aggregate operating system overhead times for TCP (12a) and UDP (12b)
packet sizes. Transfer of control is surprisingly inexpensive.
again. Sched Soft Intr is the operation of scheduling a software interrupt to process
the incoming packet, and Softint Handler is the software interrupt handler, which
dequeues incoming packets and calls IP to handle them. Perhaps the most inter-
esting aspect of this category is that the various transfer-of-control operations are
so cheap relative to the other categories of overheads (see FigureFoobleFigure 7
and FigureFoobleFigure 8).
3.6.5 Error Checking

ErrorChk is the category of checks for user and system errors. Figure-
FoobleFigure 13 shows the breakdown of overheads for this category. Skt Errchk
contains assorted checks for errors within the socket layer. Syscall Argchk is
checking specifically for incorrect user arguments to the system calls used to send
and receive packets. Interestingly, Syscall Argchk consumes a relatively large
amount of time. This is mostly spent verifying that a user buffer is within a span
of virtual memory accessible by the user process.
3.6.6 Data Structure Manipulations

The DataStruct category consists of manipulations of various data struc-
tures whose cost do not justify individual scrutiny as did mbufs. FigureFoobleFig-
46
100
100
80
80
60
60
Microseconds
Microseconds
40
40
20
20
0
0
Skt Errchk Syscall Argchk Skt Errchk Syscall Argchk
Operation Operation
Figure 13a-b: Aggregate error checking times for TCP (13a) and UDP (13b) packet sizes.
Checking that a user has specified a correct receive buffer is expensive and imposes a large
penalty on large receive buffers.
40
40
30
30
Microseconds
Microseconds
20
20
10
10
0
Socket Buffer Defrag Queue Device Queue Socket Buffer Defrag Queue Device Queue
Operation Operation
Figure 14a-b: Aggregate data structure manipulation times for TCP (14a) and UDP (14b)
packet sizes. TCP makes more extensive use of the socket buffer’s properties than UDP does,
ure 14 show the breakdown of overheads for this category. The socket buffer is the
data structure in which a limited amount of data is enqueued either for or by the
transport protocol. TCP makes heavier use of this data structure than UDP, which
makes no use of the socket buffer structure for outgoing packets, and only uses it
47
as a finite-length queue upon reception. In contrast, TCP uses the socket buffer to
implement windowing flow control on both send and receive sides.
The Defrag Queue is the data structure used to defragment IP packets. The
times for Defrag Queue processing require explanation. In general, packets larger
than the FDDI MTU must be sent in multiple pieces (fragmented and defrag-
mented). Fragmentation is implemented in both IP and TCP. UDP packets are
fragmented by IP, but TCP does its own fragmentation specifically to avoid IP frag-
mentation. Thus, it is surprising that the code checking the defragmentation
queue is called at all for TCP packets; this phenomenon reflects a place in the IP
code where a check for matching fragments could have been avoided. Even more
interesting is the fact that the average amount of time spent in Defrag Queue for
UDP is not noticeably greater than the amount of time spent in Defrag Queue by
each unfragmented packet. Less than 10% of UDP packets are large enough to be
fragmented; the extra time spent defragmenting large packets is not sufficient to
noticeably raise the average cost of defragmentation.
The Device Queue is the data structure in which outgoing data is enqueued
by the link layer until the network controller is prepared to process it, and incom-
ing data is enqueued by the device driver until IP is ready to process it by a soft-
ware interrupt. UDP packets spend more time in Device Queue processing because
of fragmentation. If a UDP packet is fragmented into two FDDI packets, then twice
as much work must be performed by the Device Queue to send that packet.
3.7 Conclusions
I measured various categories of processing overhead times of the TCP/IP
and UDP/IP protocol stacks on a DECstation 5000/200, and used the results to
search for bottlenecks affecting throughput and minimal latency. As expected, the
data-touching operations, checksumming and copying, are bottlenecks affecting
throughput.
Non-data-touching overheads turn out to be the bottleneck to minimal
latency. Furthermore, because most packets observed in real networks are small,
48
non-data-touching overheads consume a majority of the total software processing

time (84% for TCP and 60% for UDP in our traces). Unfortunately, unlike the
breakdown for data-touching overheads, which only consists of checksumming and
a small number of types of copying, time is evenly spread among the non-data-
touching overheads. Reducing a single non-data-touching overhead, such as TCP
protocol-specific processing, has little effect on overall performance. Thus, a wide
range of optimizations to non-data-touching operations is needed to significantly
reduce minimal latency.
49
Chapter IV: Optimizing Non-

Data-Touching Overhead
4.1 PathIDs
This chapter describes path identifiers (PathIDs), a technique for reducing
non-data-touching overheads, and thus minimal latency. The difficulty with opti-
mizing non-data-touching overheads is that no single operation consumes a large
fraction of the time. Thus, unlike throughput-raising techniques, a design princi-
ple is required here that would make it much easier to optimize a large class of
operations.
A simple principle lies behind the design of PathIDs: as much information
should be available as soon as possible to network code. In particular, PathIDs
makes available implied knowledge of what steps are required to process an incom-
ing packet almost as soon as it arrives rather than waiting until the end of all pro-
cessing. The resulting advance knowledge of exact packet type and destination
makes a great deal of currently standard protocol processing unnecessary.
To achieve this goal, the link layer protocol is modified to add a new pathID
field. When a packet arrives, full session state is rapidly retrieved by using the con-
tents of the field as an index. The time required to determine what kind of packet
has arrived and where the packet is going is nearly eliminated. More importantly,
though, this means that the exact type of each packet is known at the beginning of
packet processing, so one knows exactly what processing each packet requires as
soon as possible. A great deal of packet processing, particularly protocol-specific
processing, consists of deciding what further processing is needed.
Once an implementation has determined this information about a packet,
it executes a fastpath specifically tailored to the type of packet. If an incoming
packet is of an uncommon type, it is processed in a normal manner.
An implementation can take advantage of the advance knowledge that
PathIDs provides through an adaptation of fastpaths techniques BibRef[35]. The
main idea behind fastpaths is to aggressively optimize network software for com-
50
mon cases. Implemenations of this idea range from rearrangement of packet pro-
cessing to be more efficient in common cases to sophisticated schemes involving
multiple processing paths. In the case of transmission, BibRef[35]BibRef[51] sug-
gest creation of different paths through multiple layers of network software -
socket, transport, and network layers - for each type of packet (for example, one
might create a path for TCP packets, a path for UDP packets, and a path for routed
IP packets). This is not too difficult, since simply by sending on a particular socket
one identifies the type of packet one is generating. However, packet reception is
more difficult, since one generally cannot know an incoming packet’s type until
after a fair amount of processing has transpired. By providing a facility for identi-
fying packets rapidly in advance of nearly all other processing, PathIDs makes it
possible to follow a similar strategy for packet reception.
PathIDs is solely intended to optimize receive-side processing. The informa-
tion that PathIDs makes available is already available on transmission to any
implementation, such as Berkeley Unix, that does not limit knowledge of useful
data structures by layer. PathIDs has no advantage in rapid transmission over a
fastpaths-based implementation.
4.2 PathID Implementation

An experimental version of PathIDs has been implemented on an Alpha
3000/400 running DEC OSF/1 v1.3 which gains a 31% performance advantage over
unmodified OSF/1. Only the DEFTA FDDI driver is supported.
4.2.1 New Link Layer

Since current versions of TCP/IP do not have space in their headers for a
PathID, protocol headers must be altered to support this functionality. The best
long-term solution is to use a mechanism like trailer negotiation to negotiate an
altered link-layer header. A change in the link layer header is best because then it
is not necessary to go through normal demultiplexing to reach a higher-level pro-
tocol. A possible future alternative is to use the flowID field that seems likely to be
supported in next generation IP.
51
The initial implementation is designed to discover as quickly as possibly

whether PathIDs produces a useful speedup and has no interoperability require-
ment. Thus, it is simpler than any of the above proposals. It inserts a PathID
header between the FDDI link layer header and the IP header. There is no negoti-
ation; hosts with this modification are only able to communicate over FDDI with
similarly altered hosts. The resulting headers is shown in FigureFoobleFigure 15.
The meanings of the fields are as follows:
FDDI Header
PathID Header Verification Pattern
(each field is 16 bits wide) Source PathID
Source Unused
Destination PathID
Destination Unused
IP Header
UDP Header
Figure 15: New packet header layout with PathIDs. They are the same as normal headers
except that a new header has been added between the FDDI link layer and IP headers. Names
of individual PathID header fields are on the right hand side of the PathID header.
Destination PathID: The field for which this protocol exists. The contents
of this field are used to look up path and socket information in a table. From there
processing proceeds along the fastpath corresponding to the PathID.
Source PathID: This field supports the negotiation process for PathIDs.
This is the PathID used to identify the entity that is sending. The destination ses-
sion in a PathID-supporting host that receives this will in future use the contents
of this field as the destination PathID when sending packets to the source of this
52
packet. Since this negotiation algorithm does not support multicast, it will have to
be changed.
Verification Pattern: A pattern impossible for a conventional FDDI
packet at this header location. This is necessary because few networks are likely
to carry exclusively PathID traffic.
Type: Conventional Ethernet type field, for use when no destination
PathID field is set.
Source/Destination Unused: Extra space for related experiments.
4.2.2 PathID Fast Lookup

The lookup algorithm PathIDs uses is simple table indexing. This is possi-
ble because in the initial implementation receivers choose their own IDs and com-
municate them when sending to their peers by placing them in the source PathID
field in the PathID header.
Exchanging the simplest form of identification, a pointer to a socket, would
be a bad idea because a forged packet or buggy send-side implementation could
easily cause crashes and security problems. Thus, an small integer, an index to a
table, is exchanged instead. The initial implementation uses the Unix open file
table BibRef[43] as the table, since a socket is a file. Each incoming ID is checked
for plausibility.
From the open file table it is possible to get to every data structure of use
in processing a packet through pointers. The open file table entry contains a
pointer to the socket data structure, which in turn contains a pointer to the proto-
col control block and so on.
Once lookup is complete, PathIDs extracts a pointer to a handler function
from the socket data structure, and calls the function with arguments of the socket
pointer and a pointer to the mbuf chain of the incoming packet. This allows each
socket to specify its own handling code (fastpath). The architecture may have to
change in the future to allow multiple paths per socket.
53
Berkeley PathIDs
User Unix
Kernel
Process
Sockets
socket Process
queue
UDP
Software
IP Interrupt
interface socket
queue queue
Link Adapter Adapter

Interrupt Interrupt
Driver
Device
Figure 16: Flow of control on Fastpaths. Link layer header with PathIDs. Transport and net-
work layer processing that was done in the network software interrupt in Berkeley Unix is
done by the user process under PathIDs.
4.2.3 Flow of Control

The flow of control for PathIDs is necessarily different from the Berkeley
Unix flow of control since the network software interrupt has been removed. The
new flow of control, shown in FigureFoobleFigure 16, is similar to the structure
suggested in BibRef[35]BibRef[51]. Specifically, processing that would have been
done in the network software interrupt is now done by the user process. Network
software interrupt processing could have been moved to the adapter interrupt han-
dler instead; the process is a better choice for two reasons: it allows the user pro-
54
cess address space to be available to the UDP and IP code within the fastpath (in
many operating systems some information concerning each process is only avail-
able from the context of the process in question), which permits more optimiza-
tions, and it means that interrupts are blocked for a shorter length of time.
4.3 PathID Savings in Functionality

PathIDs produces a performance improvement by reducing the functional-
ity required to process a packet. Here is a list of pieces of functionality present in
Berkeley Unix but unnecessary in an implementation that makes use of PathIDs,
along with explanations of how each saving is attained.
Demultiplexing: Normally each protocol must decide what entity it is
going to pass incoming packets to when it is done. This is obviated by PathIDs,
which uses the Path ID field to do all the demultiplexing at once; it is the most obvi-
ous optimization resulting from PathIDs.
UIO Conversion: Work is done to convert the simple <base pointer,
buffer length> buffer format supported by the read / write, send / recv, and
sendto / recvfrom system calls to a more complex format supporting the scatter/
gather I/O readv / writev system calls. String-matching on the source code to
OSF/1, SunOS, and the Tahoe and Reno releases of Berkeley Unix reveals that
only one program, the system error logger, not normally a resource-intensive pro-
gram, makes use of readv and writev. It seems reasonable to suggest avoiding
this multiple conversion in the fastpath. Calls to readv / writev would not use
any fastpaths.
High-level synchronization calls: Unix traditionally makes use of high-
level calls - sleep(ID) and wakeup(ID) - for synchronization. Sleep and wakeup
are thus required to hash based on the value of ID. However, all the information
on which process to awaken is already known from the PathID lookup, so the nor-
mal process of hashing on an ID makes little sense. Synchronization requires even
more unnecessary overhead under OSF/1 because the scheduler supports both
Unix and Mach 2.5 functionality, yet the network subsystem only makes use of a
55
particular subset of the functionality. Thus, it makes sense to develop customized

versions of sleep and wakeup for use by fastpaths.
MSG_PEEK: Berkeley Unix supports an option to some of its network data
reception calls which changes the behavior of those calls so that they examine data
enqueued for the user process without consuming that data. Some code in the com-
mon path is required to support this. Yet the use of MSG_PEEK is even more rare
than that of scatter/gather I/O: none of the source code trees mentioned above
made use of it at all.
Out Of Band Data: Berkeley Unix sockets includes support for out of band
data. This feature is actually used by a number of programs, but relatively few
packets contain out-of-band data. It makes sense to maintain the appropriate data
structures, even within fastpaths, but not to support calls to out of band data pro-
cessing features in any fastpath.
Source Address Notification: Many programs ask that the network
address of the sender of data be given to them with each incoming packet (e.g.,
using the recvfrom call). This functionality is currently implemented in a generic
and inefficient manner for reasons that PathIDs allows one to avoid. The algorithm
for deciding upon the source address of an incoming packet depends upon which
transport protocol is processing the packet, so such decisions must be made in
transport protocol code. Yet transport protocol implementations generally do not
know where the address is going to end up, since that can be complex and yet has
nothing to do with transport protocols. Thus, transport protocols must somehow
hand the address information to socket code, which does understand the address’
destination, in a generic format. Additionally, in Berkeley Unix there is a queue
between the transport and socket layers, so there can be multiple pending packets
(and thus addresses) per socket. Furthermore, an address can theoretically be of
any length, so the code must be prepared for memory allocation and deallocation
of some sort. Thus, it is logical, if inefficient, for each transport protocol to allocate
an mbuf for each packet to hold the address. Socket code later looks for the mbuf,
copies the address to a user process’ buffer if appropriate and deallocates the mbuf.
56
PathIDs is perfect for solving this sort of design problem: everything is known to
all software along the path, so the transport protocol code simply copies the
address directly into the user process’ buffer. The resulting code is far simpler than
the original.
ICMP Error Notification: A variety of error conditions, such as invalid
transport protocol port and unreachable destination network, require a host to
generate an error packet. Berkeley Unix must go through some preparation for
this event even when no error is found. However, the destination of a packet must
exist to end up on a fastpath at all, so no ICMP error need ever be generated from
a fastpath, which in turn implies that no preparation for such an event is needed.
Checksumming: Usually software checksumming is extremely common.
Some hosts, those either running Checksum Redundancy Avoidance (see Section
Section5.2) and those whose vendors supply hardware checksumming in most net-
work interfaces, may not see much checksumming. It may make sense to remove
it from the fastpath of such hosts. On the other hand, this may also noticeably
reduce the number of packets to which fastpaths apply or raise the number of fast-
paths in exchange for little gain. The initial implementation places checksum pro-
cessing in the fastpath.
Protocol Options Processing: Most packets do not contain options, so it
is worth removing this from the fastpath. Some designs for the next version of IP
embed a philosophy that options need not be uncommon. If options should become
common, the focus may shift from removing options from the fastpath to assuming
that options sets will usually not change much for any given transport protocol ses-
sion.
IP Length Checking: Since both UDP and TCP perform length checking
of their own, and execution of a UDP or TCP fastpaths is evidence that a packet
will be processed by a protocol supporting length checking, checking the IP length
field for consistency is entirely superfluous. The same logic also applies to checks
that each packet is larger than the IP header size.
57
Reassembly: Packets that have been fragmented must be reassembled.

The strategy in the initial implementation is to note that there are far more
unfragmented packets than fragmented packets and leave reassembly support out
of the fastpath. In the longer term, this is a more complex issue, since UDP frag-
ments generated by NFS are common. Probably the best design in a production
system would be to have support in the UDP fastpath for checking for previously
arrived fragments (e.g., last fragments would travel the UDP fastpath), and a sin-
gle fastpath for fragments that arrive in order and whose arrival is not interleaved
with the arrival of fragments from different packets.
IP Address Matching: Usually IP must check each incoming packet to
check if its destination IP address matches one of the host’s IP addresses. This is
unnecessary with PathIDs. Similarly with IP address lookup for routing: one
would probably use a separate ID for each routing table entry in active use in a
manner similar to routing over large ATM clouds BibRef[18].
Mbuf Pullup: Currently, data at the beginning of most incoming packets
is copied into a small mbuf to ensure that header data is aligned correctly. Since
most devices can be programmed to align incoming data conveniently, this func-
tionality is more appropriately moved to drivers which need it.
Software Network Interrupt: Normally this is needed because Berkeley
Unix makes few assumptions on what functionality is present in a network
adapter. In particular, some early adapters only contained single receive buffers.
With that type of adapter, it is important to give device driver processing a higher
priority than other packet reception processing so that as few packets as possible
are lost. The PathIDs philosophy is to optimize for the common case, which in
today’s workstations is multiple hardware receive buffers. Software network inter-
rupt functionality would be moved into the drivers of lacking adapters.
Software Interface Queue: This is another piece of functionality only
required by adapters with inadequate receive buffers. It is a queue maintained by
software for synchronization between driver processing and soft network interrupt
58
processing. Like the soft network interrupt, support for the software interface
queue should be moved out of the common path into drivers for which it is useful.
Driver-level buffer allocation: Normally, device drivers allocate net-
work buffers upon packet reception to replace each buffer that is to be sent up the
protocol stack. The buffers are later freed after they are in turn filled with an
incoming packet, processed, and consumed by a user process. However, this alloca-
tion and deallocation can be avoided by merging some of the functionality of hard-
ware adapter queues with the functionality of socket buffers. One could bind each
buffer permanently to a hardware queue buffer descriptor, clearing the buffer
pointer in the hardware queue so that full buffers are not overwritten; not only
would this eliminate the allocation, but it would also sharply reduce the amount of
time spent initializing adapter receive buffer descriptors. However, such a static
binding would have the disadvantage that the adapter would not be able to use the
buffer until the receiving process consumes it. This should only rarely cause
dropped packets for lack of a buffer, since hardware queues are growing quite long.
If it does prove to be a problem, allocation could be accomplished only when the
queue of contiguous, empty buffers grows short.
4.4 PathID Weaknesses

Some software engineering tradeoffs must be made in exchange for the
improved performance from PathIDs. A network software subsystem that uses
PathIDs is likely to be harder to maintain than a normal network software sub-
system. PathIDs may not be a good idea in environments in which maintainability
is more important than performance
The worst problem posed by PathIDs is the creation of multiple, parallel
paths through the network software. Fixing bugs in the various input paths asso-
ciated with the TCP/IP suite may require modification of several pieces of source
code that do similar but not identical things. In particular, a small change in sock-
ets could potentially require modification of every fastpath. Worse still, because
each path is so highly tuned, it would require thought to decide for each path
59
whether the fix being applied is appropriate; such thought may require a deeper
understanding of the code than a software maintainer has, since such paths
involve every layer of networking code.
Some of the changes proposed here are difficult to port from one operating
system to another. In particular, the implementation of synchronization calls is
often operating-system specific. This may not pose much of a problem for a vendor
implementing PathIDs in its own source code, but it creates problems for a
researcher or a business specializing in enhancing network code. Even a vendor
may face increased difficulties when it wishes to change source code trees (such as
the recent SunOS->Solaris and Ultrix->OSF/1 conversions).
Still another problem potentially limits the ultimate effectiveness of
PathIDs and fastpaths if care is not taken. These techniques optimize by restrict-
ing the domain of situations for which they are effective, in a manner similar to
decisions on RISC architecture design BibRef[31]. The more decisions made in
advance about what sorts of packets are supported, the faster a fastpath is. How-
ever, if each piece of functionality has a probability Pf of being needed by a partic-
ular packet independent of the chance of other pieces of functionality being used,
and there are n pieces of functionality which the fastpath does not support, then
the overall chance of a fastpath being applicable is the product of the Pf. In other
words, the chance that a path can be used for any given packet goes down exponen-
tially with the number of decisions made in advance in the design of the path. For-
tunately, the chances of use of these pieces of functionality are very much mutually
dependent; still, it is a problem that must be watched. PathIDs is superior to sim-
ple fastpaths in that it is easier to create many paths that optimize for disjoint sets
of functionality, although tens of thousands of PathIDs could potentially tax mem-
ory, identifier space, and connection setup time.
4.5 Performance
The PathIDs implementation exhibits 31% lower minimal latency than
unmodified OSF/1 on an Alpha 3000/400. The test used to derive these results mea-
60
sures round trip latency by sending a 1-byte UDP packet to another workstation
which returns the packet to its sender, and noting how long it takes before the
packet is returned to its original sender.
Table 1: Minimal latency improvements from PathIDs.
Minimal UDP Latency
Unmodified OSF/1 1.3 759 microseconds

PathIDs 578 microseconds (31% better)
4.6 Conclusions
PathIDs achieves substantial improvements in minimal latency by provid-
ing a mechanism that eases optimization of non-data-touching overheads. PathIDs
adds a field to link layer headers which can be used to immediately recognize the
destination and type of each incoming packet. The importance of this change
derives less from the resulting savings in demultiplexing time than from the fact
that all aspects of a packet are known from the beginning of packet processing.
Processing of each type of packet can be tailored to fit the bare requirements for
that type of packet instead of being designed to support any type of incoming
packet.
61
Chapter V: Optimizing Data-

Touching Overhead
5.1 Introduction
This chapter describes designs intended to reduce data-touching overheads,
and thus raise maximal throughput. A network protocol suite implementation
which included all of these techniques would be a zero-copy stack, from which CPU-
moderated data-touching overheads have been almost entirely removed without
loss of backward compatibility. Data movement between adapters and system
memory is the only bulk data-touching operation remaining; no CPU-moderated
bulk data-touching overheads remain when transmitting over or receiving from an
adapter supporting DMA.
Three techniques, each intended to avoid a different form of data-touching
overhead, are used to achieve this goal. Checksum Redundancy Avoidance,
Nocopyin, and DMA Redirection are intended to eliminate most of checksumming,
copying from user buffers to kernel buffers, and copying from kernel buffers to user
buffers, respectively.
5.2 Checksum Redundancy Avoidance

It is clear from FigureFoobleFigure 4 and FigureFoobleFigure 5 that the
largest bottleneck to maximal throughput in a DECstation 5000 with unmodified
Ultrix 4.2a is checksumming. Thus, the first optimization is a scheme, called
Checksum Redundancy Avoidance (CRA), for avoiding most checksumming.
The simplest way to eliminate checksum overheads is to eliminate the
checksum altogether. Options already exist in the Ultrix based networking code to
disable checksumming. However, the Internet checksum exists for a good reason;
packets are occasionally corrupted during transmission, and the checksum is
needed to detect corrupted data. It is recommended that systems not be shipped
with checksumming disabled by default BibRef[9].
62
5.2.1 Eliminating Checksum Redundancy

A great deal of checksumming is arguably redundant. Ethernet and FDDI
networks implement their own cyclic redundancy checksum. Thus, packets sent
directly over an Ethernet or FDDI network are already protected from data corrup-
tion. The high cost of checksumming found in this study illustrates that the redun-
dant checksum is expensive. I suggest that in cases of complete redundancy the
Internet checksum not be performed. One must be careful, though, about deciding
when the Internet checksum is redundant. It is reasonable to turn off checksums
when crossing a single network which implements its own checksum. Since the
destinations of most TCP and UDP packets are within the same LAN on which
they are sent, this policy would eliminate checksumming of most TCP and UDP
packets.
Such a policy differs somewhat from traditional TCP/IP design in one
aspect of protection against corruption. Always performing a software checksum in
host memory protects against errors in data transfer over the I/O bus in addition
to the protection between network interfaces given by the Ethernet and FDDI
checksums. However, since data transfers over the I/O bus for such common
devices such as disks are routinely assumed to be correct and are not checked in
software, a reduction in protection against I/O bus transfer errors for network
devices does not seem unreasonable, especially in light of the heavy cost of check-
summing.
However, turning off checksum protection in any wider area context seems
unwise without considerable justification. Not all networks are protected by check-
sums, and it is difficult to see how one might check that an entire routed path is
protected by checksums without undue complications involving IP extensions. A
more fundamental problem is that network checksums only protect a packet
between network interfaces; errors may arise while a packet is in a gateway
machine. Although such corruption is unlikely for a single machine, the chance of
data corruption occurring increases exponentially with the number of gateways a
packet crosses.
63
Applications should be able to, in addition to making use of CRA, disable

checksumming on a per-connection basis. While disabling checksums is unwise for
many applications such as NFS or FTP, many other applications, such as audio
and video applications, may be tolerant of some data corruption because of the
transient nature of the data.
5.2.2 UDP Implementation

The implementation of CRA in UDP uses the existing checksum-nonuse
protocol. In UDP, if the checksum field of an incoming packet is zero, then it is
assumed that there is no checksum on the packet. All that remains, then, is to
decide on the details of when the sender should send a zeroed checksum field. My
implementation adds a flag to each network interface which indicates which inter-
faces support some form of checksum. Under Berkeley Unix, routing information
is already available to UDP, and the route structure includes a flag indicating
whether the next hop in the route is a gateway or the destination host. CRA checks
whether the next hop is a destination, and whether the network interface that will
transmit the packet supports checksumming. Checksumming is skipped if both
conditions hold. A new socket option which allows applications to disable check-
summing on a per-socket basis is also implemented, even for sessions involving a
large number of hops between hosts.
5.2.3 TCP Implementation

Extending TCP to implement checksum elimination is more difficult than
extending UDP, since one is always supposed to checksum TCP data, and zero is a
valid checksum. Additional mechanism is required so that implementations can
agree to disable checksums. An experimental alternate checksum option already
exists BibRef[68]. The alternate checksum option provides a generic mechanism
for TCP implementations to agree to use an alternate checksum. It is straightfor-
ward to use the alternate checksum option to define an alternate checksum value
specifying no data checksum.
64
The side that wishes to create a connection starts negotiation. TCP uses the
same series of checks as in UDP to decide whether to send an option to turn off
checksums. When server-side TCP responds to a TCP connection request, it
responds with the checksum-avoidance option if it gets a checksum-avoidance
request in the connection setup packet. Each side of a conversation only disables
checksumming if it receives a checksum elimination option from the other host in
the conversation.
As with UDP, a socket option is implemented that causes the implementa-
tion to try to negotiate disabling checksums even when the checksum is not redun-
dant. Since the sender of the option makes the decision as to whether to negotiate
checksums, and receivers knowledgeable of this option automatically agree to any
suggestion of turning off checksums, only one side of a conversation need call the
option.
5.2.4 Performance Improvement

Table 2 summarizes throughput resulting from the improved checksum
techniques. The numbers in MB (i.e. 1,000,000 bytes) reflect absolute throughput
for each configuration, and the percentage values in parenthesis are the percent-
age improvements relative to unmodified Ultrix 4.2a. The performance improve-
ments are significant.
Table 2: Throughput improvements from checksum optimization and elimination
TCP UDP
Unmodified Ultrix 4.2a 17 Mbps 19 Mbps

Optimized Checksum 22 Mbps (29%) 26 Mbps (33%)
Checksum Elimination 25 Mbps (48%) 33 Mbps (71%)
5.2.5 Detecting Locality With Certainty

It has been pointed out BibRef[58] that CRA as described above is not
always certain that a remote host is truly on the same network as the local host.
65
LAN bridges BibRef[3] and proxy arp BibRef[13] both make it difficult for a host
to be certain that another host is truly on the same network.
It is possible to decide with certainty whether a remote host is truly on the
same network as a local host. The key is the IEEE spanning tree protocol used for
routing among bridges BibRef[3]. Bridges are not allowed to pass packets using
this protocol, and IP routers ignore such packets. Thus, reception of a spanning
tree protocol packet from a host is good evidence that host is on the same physical
network as the local host.
Hosts implementing this scheme would broadcast a spanning tree packet
right after responding to arp requests. The contents of the spanning tree packets
would be set in such a manner that no bridge will expect a host to actually bridge
MAC packets. Any host from which such a spanning tree packet is seen can be
assumed not to be on the other side of a bridge.
5.3 DMA Redirection

The second optimization technique, DMA Redirection, is designed to elimi-
nating most copying of incoming packets from kernel to user buffers. Using the
Checksum Redundancy Avoidance technique described in Section 5.2, it is possible
to transmit FDDI packets on a DECstation 5000/200 equipped with a DEFTA
FDDI adapter at 50 Mbps. However, the DECstation is limited to receiving data at
33 MBps. That is because, although both sides perform an equal number of copies,
the receive side must copy uncached data. Copying an incoming packet from kernel
to user buffers is the single most costly operation in many systems which avoid
software checksumming BibRef[41]BibRef[57] whether because of redundant
checksum avoidance or because hardware checksum support is available. This is
because the data is not yet in the cache, and reading data from main memory is
slow. On the DECstation 5000/200, copying uncached data is three times more
expensive than copying cached data BibRef[25].
DMA Redirection is intended to equalize the send and receive rates by
avoiding copying incoming packets. This is achieved by redirecting DMA of incom-
66
ing packets as they arrive. As soon as each packet’s header arrives in system mem-
ory, the CPU processes the headers and changes adapter DMA pointers to point
directly to user receive buffers.
User Buffer
User Copy out
Kernel Kernel Buffer Kernel Buffer
FDDI Interface DMA

DMA
4096 bytes
4096 bytes
Figure 17: Normal incoming packet DMA. For each packet, there are a number of receive
descriptors, each pointing to a kernel buffer.
5.3.1 DMA Redirection

It is tricky to arrange for adapters to DMA network data directly to a user
buffer. The principal difficulty is that the operating system must tell any device
which supports DMA, including network adapters, where data should be placed
before the data actually arrives; in the case of network adapters, the information
required to identify the final destination of the data is in the incoming data itself.
Thus, normal incoming data movement design is to arrange for each incoming
packet to go into a buffer not owned by any process and then copy the data into a
user buffer once header processing is complete.
5.3.1.1 Normal Scatter Receive Buffer Design

FigureFoobleFigure 17 illustrates the register/memory setup of DMA-
based network devices as managed by common Berkeley Unix-based systems.
DMA-based network devices typically support scatter I/O. Each incoming packet
can be scattered across multiple receive descriptors. Each descriptor has a length
67
and a pointer. The length specifies how many bytes to place in the buffer pointed
to by the descriptor in question. If an incoming packet contains more bytes than
the descriptor specifies, the next descriptor is used and the algorithm repeated.
Device driver writers are expected to provide enough bytes in the various descrip-
tors to hold worst-case large packets.
Once DMA into host memory is complete, the adapter generates an inter-
rupt and packet processing begins. The incoming packet is later copied into user
buffers as one of the last steps in packet processing.
User User Buffer

Kernel
Kernel Buffer
FDDI Interface DMA
<waitlen> bytes DMA

4096 bytes
Figure 18: Direct device to user DMA. The adapter is given a pointer to the user buffer while
data in the first descriptor is still arriving.
5.3.2.2 DMA Redirection Receive Buffer Design

By contrast, DMA Redirection starts processing the header of each packet
as soon as the header arrives, without waiting for the rest of the packet to appear.
Once the header has arrived and the host has finished processing it, there is
enough information for the host to change DMA targets while the packet is arriv-
ing. Since it is necessary to tell the device where data should be placed before the
data arrives, the pointer field of a descriptor must be filled before data for that
descriptor arrives. Use of multiple descriptors is key to the design. The new
scheme is shown in FigureFoobleFigure 18. The first descriptor points to a short
68
kernel buffer, and initially the second descriptor points to a kernel buffer long
enough to hold the rest of an incoming packet’s data. As soon as enough informa-
tion (i.e. the protocol header) has arrived to decide to which user buffer a packet is
bound, the second descriptor’s pointer is redirected to point to that user buffer. It
takes a certain amount of time to decide where a packet is bound; the buffer
pointed to by the first descriptor continues to fill while the packet is arriving. After
DMA redirection is complete, the bulk of the packet is placed directly into the
user’s buffer with no need for additional copying.
The byte length contained in the first descriptor, waitlen, is a function of
the speed of the network and the amount of time required to change the descriptor.
On the DECstation, the early implementation requires a waitlen value of 2048
bytes in order to have enough time to reliably process the header and set the sec-
ond descriptor’s pointer. Since this implementation uses an FDDI network, the
amount of required processing time can be computed as follows:
(2048 bytes * 8 bits/byte) / 100 Mbits/sec = 163 microseconds (EQ 1)
Since the second buffer initially points to a kernel buffer, if DMA is not redirected
for any reason the packet will still be received, though through the “normal” slower
path, with the extra copy operation.
5.3.2.3 Buffer Descriptor Details

Under normal operation, a device driver generates buffers and buffer
descriptors (mbufs) by allocating them from a free memory pool. Each adapter
receive descriptor is initialized to point to the data within a different mbuf. Cir-
cumstances are slightly different for a driver supporting DMA Redirection. The
driver must still allocate mbufs in normal fashion in case DMA is not successfully
redirected, but must also be prepared to change to a strategy of using the user
buffer directly as buffer space.
A user process usually supplies a receive buffer in terms of the beginning of
a contiguous area within the process’ address space and the buffer’s length. How-
ever, one cannot simply place the pointer to the beginning of the buffer within the
69
network adapter’s receive descriptor buffer pointer. Few architectures support

DMA to virtual memory, virtual memory may be paged out, and in any case it is
not certain that a process’s address space will be switched in when a packet arrives
for it. Instead, first the kernel forces the buffer to be paged in if it is not already,
then converts the single <base, length> buffer description into a list of physical
buffers within the physical pages to which the buffer is mapped.
Network code and network device drivers expect buffers to be described in
terms of lists of mbufs. Thus, the list of buffers in physical pages is represented as
a list of cluster mbufs. A cluster mbuf does not contain data, but rather contains a
pointer to an external buffer within the kernel. The pointer within each cluster
mbuf in this case is set to point to a buffer in physical memory.
When a packet begins to arrive, DMA Redirection finds a pointer to the list
of cluster mbufs and sets each adapter receive descriptor pointer to point to a dif-
ferent mbuf ’s data pointer.
5.3.3 Early Packet Detection By Polling From the Idle Loop

Operating systems typically deal with device-related processing by
responding to interrupts generated by the device. A device posts an interrupt when
an event of interest to the operating system has occurred. This causes a problem
for DMA Redirection, because it requires CPU processing immediately after a
packet begins to appear on the network medium, and most existing network inter-
faces are not built to deliver an interrupt under such circumstances; they only
deliver an interrupt when a packet has fully arrived.
This problem is circumvented by polling from the idle loop. A facility for reg-
istering handlers to be called upon each iteration of the idle loop has been added
to the kernel. Each device driver of a fast network interface would have such a reg-
istered handler.
The handler checks whether enough bytes (64 for FDDI and TCP/IP) have
arrived to decide to which user process a packet is bound by checking whether an
unusual bit pattern earlier stored at byte 63 of each buffer descriptor is still intact;
if not, it knows that a packet’s header has arrived. If a packet happens to have the
70
unusual bit pattern at byte 64, confusing the algorithm, the worst that can happen
is that the packet is processed in the usual, slower manner.
5.3.3.1 DMA Redirection Adapter Design

Given the polling design, incoming packets cannot be detected at all during
any user or interrupt processing whatsoever, since the poll routine derives its flow
of control from the idle loop. One could add polling points elsewhere in the operat-
ing system and try to take advantage of packet trains BibRef[36], but the polling
rate needs to be quite high to catch a reasonable percentage of FDDI packets.
Regardless, it is not unusual that a high-performance workstation’s operating sys-
tem, given a single-user workload, has all of its processes waiting, making the idle-
loop solution perform effectively.
If network adapters were designed to support interrupting the CPU upon
sighting packet arrival on the network medium, DMA Redirection would be faster,
more general, more reliable, and easier to implement.
DMA Redirection and the interrupt in question would allow network
adapter designers to support high performance network software while avoiding
the need to place buffer memory on a network adapter board. While memory is
cheap, memory controllers and time to design them are not.
Packet
Arrivals Packet 1 Packet 2 Packet 3
Network
Processing Polling Normal Recv Processing
User Buffer
Availability user buffer filled by packet 1
read() call
Time
Figure 19: Closely spaced packets cause problems for DMA Redirection with a normal pro-
cessing order. The user buffer must be available to do DMA Redirection. After packet 1
arrives, no buffer is available until network software is complete. Thus, DMA Redirection
cannot be performed for packet 2. DMA Redirection could be performed for packet 3.
71
5.3.3.2 Processing Upon Early Packet Detection

Early packet detection provides for a reduction in latency. Currently, hosts
do not even begin to process packets until they have finished arriving. This means
that the “receive latency”, the length of time between when a packet arrives at the
host’s network interface and when it becomes available for processing by user soft-
ware, is given by
receive latency = network transmission time + network software processing time (EQ 2)
If one were to begin processing packets immediately after their headers

have arrived, the receive latency would reduce to
receive latency = max(network transmission time, network software processing time) (EQ 3)
Processing would necessarily be split into two parts: the bulk of the process-
ing to be performed immediately on early packet detection, and a second part that
would confirm that no errors have been detected and awaken the receiving user
process if appropriate. This is similar in some ways to a split described in
BibRef[35]BibRef[51].
The early processing of a packet due to DMA Redirection can also improve
the likelihood that DMA Redirection can be used to process a given packet. Con-
sider the sequence of events when there is a sequence of closely spaced packets
bound for a single process, as illustrated in FigureFoobleFigure 19. A user buffer
from a waiting process must be available for DMA Redirection to be performed. Yet
the user buffer can only be available for DMA Redirection if a process is already
waiting for a packet. The user process cannot begin waiting for the next packet
until it has received the previous one; this can only occur after packet processing
of the previous packet is complete. Normally, packet processing does not begin
until a packet has finished arriving, and may not complete until well after the next
packet has begun arriving; thus, the user process does not have an opportunity to
provide a buffer for the next packet. If one begins packet processing upon early
packet detection, packet processing would complete relatively shortly after the
packet finished arriving. This would allow a process which is simply interested in
72
receiving packets as fast as possible to execute and immediately wait for the next
packet, switching execution to the idle loop if no other computation is ongoing,
which in turn allows DMA Redirection on the next packet.
5.3.4 Results
Our results indicate that processing time for UDP packets improved as a
result of our implementation of DMA Redirection on a DECstation 5000/200
equipped with a DEFTA FDDI adapter. Table 3 shows the resultant improvements
in round trip times for 4000-byte packets. In particular, round trip time was
reduced by approximately 15% when DMA Redirection was applied after having
already applied Redundant Checksum Elimination.
Table 3: Latency improvements from DMA Redirection

Round Trip
Latency (millisec)
Unmodified Ultrix 4.2a 5.8

DMA Redirection only 5.3
Redundant Cksum Elim 4.7
DMA Redirection and 4.0
Redundant Cksum Elim
5.4 Eliminating User-to-Kernel Copying

With the elimination of checksumming and copying from kernel to user
buffers, copying from user buffers to kernel buffers becomes the largest bottleneck
to throughput. This copying can be largely eliminated by changing the user-kernel
data copying architecture of the operating system to take advantage of new net-
work adapter capabilities. Many modern high-performance network adapters dif-
fer from their predecessors in that they contain enough adapter transmission
buffer space to transfer outgoing packets in their entirety. Such adapters can com-
pletely transfer an outgoing packet to adapter buffers at the speed of burst DMA.
Since burst DMA can often move data more rapidly than PIO, one can raise
throughput by letting the network adapter assume the burden of copying. In a sys-
73
tem equipped with some form of checksum-computation avoidance and a technique

such as DMA Redirection which eliminates copying from kernel to user process,
this change, Nocopyin, allows the elimination of data-touching operations in net-
work software, to make a zero-copy stack.
What is new about this work is not the idea of eliminating send-side copy-
ing; this has been done before BibRef[28]BibRef[35]BibRef[52], often in the context
of page-based I/O interfaces amenable to virtual memory manipulation. This is the
first design that works on standard networking hardware, avoids copying data into
the kernel, and yet yields increased single-threaded performance to programs
using a Unix-compatible I/O system call interface.
This design has the weakness that it does relatively little to improve TCP
performance. In-kernel TCP implementations must keep a copy of data for retrans-
mission purposes; this prevents our scheme from working with DMA-based adapt-
ers. This is not much of a problem because traces of LAN traffic BibRef[42] suggest
that TCP represents a minority of LAN traffic. This scheme does work for Mbone
multimedia traffic BibRef[15], in-kernel server NFS BibRef[62], and protocols such
as DNS BibRef[46] with retransmission implemented outside the kernel.
This section describes an implementation of the elimination of send-side
CPU copying, as well as why adapters able to fetch data by DMA are likely to sup-
port higher performance than adapters which require the CPU to copy data to the
adapter.
5.4.1 Modern Network Data Movement Design

To understand Nocopyin, it is necessary to have an understand how data
movement works under many commercial operating systems (e.g. BibRef[43]) and
most network adapters (e.g.
BibRef[4]BibRef[6]BibRef[24]BibRef[26]BibRef[33]BibRef[55]BibRef[65]).
5.4.1.1 Network Adapter Data Movement Design

Network adapters are frequently categorized according to the copying
interface they present to a host’s CPU. A DMA-based adapter moves data between
74
Mbuf Kernel Kernel Kernel

chain buffer buffer buffer
Pointers to memory
Descriptor ...
Host queue
Adapter DMA to adapter
Adapter transmission buffer
Figure 20: Output side of a DMA-based adapter. The adapter uses DMA to fill its internal
transmission buffer directly from kernel buffer memory. A queue of descriptors tells the
adapter where to fetch memory from.
host memory and adapter, whereas with a PIO-based adapter, the host CPU itself
copies the data.
Confusingly, both types of adapters often have an internal DMA engine, so
DMA-based adapters are described first. A diagram of the transmission side of
such an adapter is shown in FigureFoobleFigure 20. The adapter and device driver
jointly manage a queue of buffer descriptors in host memory. Each buffer descrip-
tor points to a location in host memory. The adapter contains a relatively small
amount of memory in an onboard transmission buffer, which holds data to be
placed on the network medium. When the transmission buffer approaches empti-
ness, the adapter transfers data from the buffer pointed to by the first descriptor
in the queue; that buffer is in the main memory of the host.
Many PIO-based LAN adapters, especially LAN adapters designed around
common MAC chipsets, are similar to DMA-based adapters, as shown in Figure-
FoobleFigure 21. In fact, PIO-based adapters often contain what amounts to an
internal DMA-based adapter, often a cheap MAC chip produced in volume. The dif-
ference is that the PIO-based adapter also contains a relatively large internal
memory (typically up to a couple of megabytes), and the adapter transfers data to
75

Host
CPU copy to adapter
Adapter
Adapter Adapter Adapter
buffer buffer buffer
Pointers to memory
Descriptor ...
queue
Adapter internal DMA
Figure 21: Output side of a PIO-based adapter. This is the same as a DMA-based adapter
except for a an extra buffer layer of memory in the adapter. Transfers to the transmission
buffer are satisfied from the adapter’s onboard memory. The CPU copies data into the
adapter’s memory.
the transmission buffer from the adapter memory instead of directly from main
memory. The CPU is expected to copy data from main memory into the adapter
memory. Examples of this may be found in several of the workstation Ethernet
adapters based on the AMD LANCE Ethernet MAC chipset BibRef[26], the DEC
DEFZA FDDI adapter BibRef[55], and the HP Afterburner and Medusa adapters
BibRef[4]BibRef[24]. The LANCE chipset itself implements a scatter/gather DMA
engine similar to FigureFoobleFigure 20. On many Sun workstations, the LANCE
chipset is interfaced to the host through the addition of I/O bus DMA support cir-
cuitry which allows the LANCE to fetch data directly from main memory. On DEC-
stations, the LANCE is interfaced to the host through a PIO-based interface - the
LANCE fetches data from the adapter memory, and the CPU copies data into
adapter memory. The reason often given for the extra layer of memory is that
because the LANCE’s FIFO is very small, there is risk of starvation for the I/O bus
76
long enough for the FIFO to either fill, if receiving a packet, or empty, if transmit-
ting a packet, resulting in incorrect packet transfer. Alternatively, when designing
adapters for hosts with new, low-volume, or complicated I/O bus interfaces it may
be cheaper to put the memory onboard with an ad hoc internal DMA interface than
to support I/O bus DMA.
The best choices in adapter and operating system design depend upon a
number of system parameters. One parameter is whether adapter DMA is able to
transfer data more rapidly than the CPU, both within the system and across the I/
O bus. DMA is often, though by no means always, faster than copying across the I/
O bus, because DMA typically is done in large bursts that amortize the overhead
of acquiring the bus over a large number of transferred bus words, while CPUs typ-
ically transfer data in small bursts if they use bursts at all. DMA is often faster
than CPU copying internal to a host because designers try to build I/O busses to
have cycle times comparable to other host circuitry, and thus if DMA is fast relative
to other I/O bus transfers it is often fast relative to internal data transfers.
5.4.1.2 Operating System Data Movement

FigureFoobleFigure 22 is a diagram of data movement operations in a Ber-
keley Unix kernel when transmitting TCP/IP data through a DMA-based adapter.
First data is copied from a user process’ buffer to kernel buffers (“copyin”). A check-
sum is calculated on the data in the kernel buffers. Finally, the adapter transfers
data from the kernel buffers to the network adapter’s transmission buffer. Each
byte is either examined or transferred three times.
Data movement through a PIO-based adapter, shown in FigureFoobleFig-
ure 23, is similar. An extra copy has been added between kernel buffers and
adapter buffers, and the adapter transfers from internal memory instead of main
memory.
A large number of variations have been developed according to the scheme
shown in FigureFoobleFigure 22. One popular set of variants involves replacing
copyin with virtual memory page mapping operations BibRef[28]BibRef[52]. Most
such schemes have difficulty preserving backward compatibility with existing
77
User buffer
Copy into kernel

User
Kernel Mbuf Kernel Kernel Kernel

Pointers to memory
Descriptor ...
queue
DMA to adapter
Adapter
Figure 22: Normal send-side data movement with DMA-based interface. Kernel copies data
from user process to kernel buffers. The adapter transfers data from kernel buffers. There
are a total of two data copies.
operating system I/O interfaces due to the necessity of page-aligning I/O buffers;
many make no attempt and require rewriting of applications to gain a performance
advantage. A major goal of all the work in this dissertation is to gain whatever per-
formance is possible without sacrificing compatibility.
5.4.2 Nocopyin
Nocopyin eliminates the copy from user process buffers into kernel buffers.
Strictly speaking, there is no need for this copy operation because data could be
transferred directly from the user process to the adapter. Retention of this copy
operation has resulted in higher performance in the past because Unix I/O seman-
tics allow a process to modify the contents of a buffer as soon as it returns from the
I/O operation.
78
User buffer
User Copy into kernel

Kernel
CPU copy to adapter
Adapter
Adapter Adapter Adapter
Pointers to memory
Descriptor
queue ...
Figure 23: Normal send-side data movement with PIO interface. Kernel copies data from
user process to kernel buffers, then copies data from kernel buffers to adapter. The adapter
transfers data from internal buffers. There are three data copies.
5.4.2.1 Optimal Copy Performance With DMA-Based Adapters

In the past, implementation of Nocopyin on workstations that use Unix I/O
semantics and are equipped with DMA-based adapters would not have resulted in
a performance gain. If an adapter transfers more slowly than the CPU is able to
copy buffers, then the CPU will be able to get more work done (such as in generat-
ing new packets) by copying the I/O buffer, freeing the user process to return, and
letting the adapter move the copied data at its own rate. However, many modern
high-performance adapters move data faster than the CPU can. This is because of
79
the high speed of adapter DMA and the increasing size of internal adapter trans-
mission buffers.
Many DMA-based adapters BibRef[26]BibRef[33] only contain small inter-
nal transmission buffers, on the order of a hundred bytes in size, not enough to con-
tain an entire packet. Data transmission proceeds as follows: first the adapter
transfers enough data to fill the internal transmission buffer. The adapter uses the
contents of the buffer to transmit to the network; when the buffer runs low, the
adapter repeats the process, transferring another bufferful of data. The process
repeats until the entire packet has been transmitted.
In this process, data transfer is not complete until the adapter has nearly
finished transmitting the packet. Over the duration of packet transmission, on
average, DMA proceeds at the same rate as transmission onto the network. Each
filling of the adapter transmission buffer is rapid, but is succeeded by a relatively
long waiting period during which the transmission buffer drains. Host packet buff-
ers cannot be freed until this relatively slow process is complete, and if a buffer
comes from a user process’ address space, that process must be frozen until trans-
mission is complete. It is faster from the point of a transmitting user process for
the buffer to be copied than for the process to wait until the relatively slow network
transmission is complete.
More recently designed, high-performance adapters such as the DEFTA
BibRef[65] or the Boggs T3 adapter BibRef[6] contain tens of kbytes of transmis-
sion buffer memory. When these adapters try to fill their queues, they can DMA
entire packets continuously. On systems in which adapter DMA is able to move
data more rapidly than PIO can, it is faster to wait for DMA to complete than to
copy the packet. Thus, on such systems, it makes sense to redesign the network
software to let network adapters DMA directly from user buffers, as shown in Fig-
ureFoobleFigure 24.
5.4.2.2 Nocopyin Implementation on DMA-Based Adapters

I implement nocopyin by moving control of copying from the socket level to
the driver level, in the expectation that the best tradeoffs for a given combination
80
of system and adapter are known to each driver writer. This is not unlike the
designs of BibRef[4]BibRef[35], except that my design does not call the driver to do
buffer allocation; it would not help in the highest performance case of the DMA-
based adapter, and it would not help performance much in the (slower) case of PIO-
based adapters. Instead, user buffers are locked into physical memory and mapped
into somewhat modified mbufs; copying is delayed until the driver entry point for
packet transmission is called in the normal Berkeley Unix design.
A driver interface flag has been added so that drivers not modified to know
about the new scheme have the data copied for them in the link layer, while mod-
ified drivers are expected to copy the data in some fashion. The experimental
driver spins until the driver has finished the DMA; since the DECstation is unable
to accomplish much computation under circumstances of contention with continu-
ous adapter DMA, this seems like the best course.
Nocopyin should not be used when adapter queues grow long. Even modern
adapter transmission buffers only have room for a small number of packets; once
the transmission buffer is full, the adapter will start transferring packets in rela-
tively small chunks to keep the transmission buffer full, changing to a more
LANCE-like strategy of transferring data in small chunks. A driver implementing
Nocopyin should start copying outgoing packets and stop spinning on transmission
once it enqueues enough packets to fill the transmission buffer. This is not a prob-
lem for the experimental system, a DECstation 5000/200 on an FDDI network that
it is unable to saturate.
5.4.2.3 Nocopyin with PIO-Based Adapters

Nocopyin works even better with PIO-based adapters. In fact, Nocopyin on
PIO-based adapters has been implemented by various researchers
BibRef[4]BibRef[24]BibRef[35]. Nocopyin implementation in drivers for PIO-
based adapters is trivial, since such drivers already copy the data. All that needs
to be done is to set the flag stating that the driver does its own copying in.
81
5.4.2.4 Retransmission
A problem with Nocopyin is that reliable transport protocols such as TCP
require the kernel to keep a copy of each packet in case the need arises to retrans-
mit. Even in the best case, this copy must be kept around long enough for the
packet to be acknowledged, which is typically a large amount of time compared to
the time required to transmit a packet. If the kernel did not copy the packet, it
would be forced to freeze the user process for an unacceptably long time.
Protocols implemented outside the kernel, built on top of UDP, such as the
various Mbone protocols or DNS, have no such requirement. Packet traces from
LANs supporting Mbone suggest that multimedia packets represent an increasing
component of LAN workload.
Copyin can be drastically reduced even under TCP when transmitting over
PIO-based adapters with an onboard, CPU-addressable transmit queue, such as
the DEFZA BibRef[55]. The algorithm used is that the data is copied out to the
adapter in much the same manner as for UDP. However, TCP continues to track,
in its retransmission queue, the packet data onboard the adapter. Even in the
worst (and common) case of a fixed circular array adapter queue, this works
because the data remains undisturbed until all the other packets in the queue
have been used and the buffer in question must be reused to transmit a newer
packet. When the time arrives for the buffer to be reused, the old packet data must
be copied back into the kernel so that TCP can maintain its retransmission queue;
in this circumstance, one may actually take a performance hit if copy speed across
the I/O bus is slow. However, the usual case is likely to be that the packet is
acknowledged before the adapter buffer must be reused.
As an example, it takes about 4 milliseconds on the DECstation 5000/200
using a PIO-based DEFZA FDDI adapter to send two 4324-byte UDP packets and
for a single-byte packet to return, the typical ACK pattern for TCP sending at max-
imal throughput. Being an FDDI adapter, the DEFZA can transmit no faster than
100 Mbps. Thus, the usual number of packet buffers that will be consumed is given
by (bandwidth * delay / buffer size). Plugging in numbers yields
82
(100 Mbps / 8 bits/byte) * 0.004 sec / 4324 bytes/buffer = 12 buffers (EQ 4)

For these purposes, the DEFZA effectively has 102 onboard transmit buff-
ers, almost an order of magnitude to spare. The only circumstance in which extra
copying is likely to be necessary is when TCP conversations cross WANs.
There are three circumstances that suggest usually extra copies are not
needed under this algorithm. The first is the observation that most TCP traffic is
local. Although the bandwidth-delay product of conversations which cross WANs is
certainly rising, it is not clear that the fraction of remote conversations is rising.
The second observation is that the size of buffers placed onboard PIO-based adapt-
ers appears to be rising; the DEFZA adapter contains a megabyte of memory,
whereas the DEC LANCE Ethernet adapter only contains 128 kbytes. The third
observation is that in normal TCP traffic patterns high bandwidths are normally
used only for relatively short bursts. Equation 4 is very pessimistic because it
applies to a workload which transmits at the maximum FDDI bandwidth. In most
TCP LAN workloads, one observes bursty behavior, with relatively short bursts at
high bandwidths. The amount of time that an ACK has to appear before a buffer
must be copied out of the adapter depends upon the rate at which adapter buffers
are consumed, which in turn depends upon the rate of packet transmission, which
in TCP usually depends upon the average rate of transmission, which in turn
depends upon the workload.
This idea is not entirely new. BibRef[4] suggests using the onboard buffers
of a PIO-based adapter for retransmission buffers as well, although the buffers in
question are not rigidly organized into a circular queue, and thus the notion of
copying buffers back from the adapter into main memory to make space in the
adapter transmission queue for outgoing data is not suggested in that work.
5.4.3 Results
Nocopyin has been implemented on a DECstation 5000/200 running Ultrix
4.2a, with support for nocopyin in the drivers of both the DEFZA and DEFTA FDDI
83
adapters. The DEFTA supports DMA for transmission, while the DEFZA uses PIO.
The baseline system is an unmodified Ultrix 4.2a kernel.
The benchmark run is a UDP throughput measurement tool called ncns; it
was run using 4000-byte packets. The measured transmission speed as given in
Table 4 is significant for a 19.5 SPECint CPU. All numbers are taken with check-
sum-computation avoidance in effect, due to the proliferation of workstations ship-
ping using techniques for reducing or eliminating checksumming.
Table 4: UDP transmission throughput, using 4000-byte packets.

Kernel DEFZA (without DMA) DEFTA (with DMA)
Ultrix 4.2a 34.2 Mbps 57.7 Mbps
Nocopyin 36.7 Mbps (7% faster) 77.9 Mbps (35% faster)
Note that the technique is much more effective for the DMA-based DEFTA
adapter, which is the topic of the next section.
User buffer
User Copy into kernel
Kernel Descriptor Pointers to memory

queue
...
DMA to adapter

Adapter
Figure 24: Send-side data movement with DMA but no copyin. No intermediate kernel buffers
are used. Data is transferred directly from the user process’ buffer.
84
User buffer
User
Kernel CPU copy to adapter
Adapter Adapter Adapter Adapter

Pointers to memory
Descriptor
queue ...
Figure 25: Send-side data movement with PIO and no copyin. No intermediate kernel buffers
are used. Data is copied directly from the user process’ buffer to the adapter.
5.4.4 DMA vs. PIO for High Performance Adapters

There has been debate concerning whether DMA or PIO is a better interface
for high-speed adapters. I believe that DMA-based adapters frequently allow
higher performance than PIO-based adapters. It is a choice between FigureFooble-
Figure 24 (zero CPU copies, one data transfer), as implemented by Nocopyin or
optimistic blast BibRef[14]BibRef[49], and FigureFoobleFigure 25 (a single CPU
copy and two data transfers), as implemented by a single-copy stack
BibRef[4]BibRef[24]. The DMA-based adapter requires fewer data transfers. It has
been pointed out that under PIO the DMA is insulated from main memory, which
means that contention with DMA need not freeze the CPU, but a time-consuming
copy operation consumes the CPU in a certain sense more effectively than DMA
possibly could; if adapter DMA consumes less time than PIO, then DMA is less dis-
ruptive than PIO. Furthermore, under situations of high throughput, unless a rel-
85
atively high-performance part is used for the onboard adapter memory, there is
contention between DMA and CPU on the PIO-based adapter’s onboard memory.
Our results support the claim that DMA-based adapters are often more effi-
cient than PIO-based adapters. The DEFTA achieves more than twice the perfor-
mance of the DEFZA.
5.5 Conclusions
The three modifications just described serve to largey eliminate data-touch-
ing overheads, substantially improving maximal throughput. Checksum Redun-
dancy Avoidance eliminates most checksum overheads, DMA Redirection obviates
most copying from kernel to user buffers, and Nocopyin avoids most copying from
user to kernel buffers.
An implementation on a DECstation 5000/200 which included all three
optimizations would probably be capable of a throughput nearly four times as
great as the throughput of a DECstation 5000/200 running Ultrix 4.2a, since data
touching overheads account for roughly 75% of the time required to process 8k
packets, the size of packet most commonly used for data transfer.
86
Chapter VI: Conclusions

6.1 Problem Summary
An increasing number of computer applications are dependent upon net-
work performance for their usefulness, from the plethora of existing applications
turned into network applications through networked file systems to newer appli-
cations such as parallel computing over TCP/IP networks. However, network soft-
ware performance is not rising to keep up with the demand posed by ever-faster
processors and increasingly network-dependent applications. Network software
latency is a particular source of trouble. Network software latency probably cumu-
latively causes more slowdowns than any other aspect of network performance,
since it is the major bottleneck to typical LAN workload processing on the software
we measured, most TCP/IP traffic remains on LANs, and most TCP/IP traffic con-
sists of small packets. Furthermore, latency is becoming an worse problem, since
improvements in latency are not keeping pace with either improvements in proces-
sor speed or improvements in network software throughput.
6.2 Summary of Results

This dissertation describes a set of techniques for improving TCP/IP soft-
ware performance.
• PathIDs is the most important technique, since it attacks the most

pressing problem. It yields a substantial improvement in network soft-
ware latency. It is difficult to significantly reduce the average processing
time due to non-data-touching overheads because that time is spread
across many operations, so that achieving a significant savings would
require optimization of many different mechanisms. PathIDs provides
a framework for optimizing such a wide variety of overheads.
The other techniques are designed to largely eliminate each of the data-touching
operations known to be the major bottleneck to maximal throughput in systems
87
with a DMA-based network adapter: checksumming, copying from kernel to user

buffers, and copying from user to kernel buffers.
• Checksum Redundancy Avoidance largely obviates checksumming by

not performing a software checksum when it would be redundant with
the CRC standard with most LAN adapters.
• DMA Redirection avoids the need to copy from kernel to user buffers by
redirecting adapter DMA pointers as soon as the header of each packet
arrives; the header identifies the packet sufficiently to set DMA pointers
to point to the buffer of a waiting packet.
• Nocopyin frequently makes copying from kernel to user buffers unnec-

essary by observing that, unlike Ethernet adapters, modern high-speed
adapters DMA sufficiently swiftly that there is no need for the tradi-
tional copy from user to kernel.
6.3 Future Work

As with any research, more can be done along the lines outlined in this dis-
sertation. On a general level, PathIDs is strikingly effective for optimizing protocol
code, but not so effective at reducing socket and driver level overheads. My next
area of investigation would be to examine means of reducing those overheads as
well.
A number of desirable implementation aspects are lacking. Some of the
techniques are implemented in Ultrix and others are implemented in OSF/1
because I tried to keep up with technology and move my efforts to Alpha worksta-
tions. It would be very interesting to put implementations of each technique, along
with Jacobson-style fastpaths BibRef[35], into a single, high-performance imple-
mentation. With some rewriting for general-purpose use, such a kernel could be a
valuable high-performance network technology resource. Such a unified kernel
would also allow easy comparison of the effectiveness of fastpaths and PathIDs at
reducing latency, still another desirable but unachieved goal.
Each technique has unimplemented design aspects. PathIDs requires
88
minor redesign in order to handle multicast. The locality detection and statistical
checksumming algorithms described in the Checksum Redundancy Avoidance sec-
tion are as yet unimplemented. More work is required in order for DMA Redirec-
tion to be able to raise throughput in addition to lowering latency of large packets.
Nocopyin does not yet support TCP or NFS.
These techniques could be applied to protocols other than TCP/IP and
implementations other than Berkeley Unix. Each technique attempts to solve a
problem not unique to TCP/IP or Unix in a manner that takes advantage of facili-
ties similarly not specific to TCP/IP or Unix.
6.4 Bibliographical References
[1] Accetta, R. Baron, D. Golub, R. Rashid, A. Tevanian, M. Young, “Mach: A New
Kernel Foundation for UNIX Development,” Proceedings of the Summer
1986 Usenix Technical Conference and Exhibition, June 1986.
[2] T. Anderson, H. M. Levy, B. N. Bershad, E. D. Lazowska, “The Interaction of

Architecture and Operating System Design,” Proceedings of the Fourth
International Conference on Architectural Support for Programming Lan-
guages and Operating Systems, pp. 108-120, April 1991.
[3] ANSI/IEEE, Std 802.1d, “Information Processing Systems - Local and Metro-
politan Area Networks - Part 1d,” ANSI/IEEE Std 802.1d, 1992.
[4] D. Banks, M. Prudence, “A High-Performance Network Architecture for a PA-

RISC Workstation,” IEEE Journal on Selected Areas in Communications, pp.
191-202, February 1993.
[5] K. P. Birman, T. A. Joseph, F. Schmuck, “Isis Documentation: Release 1,” Cor-

nell University Department of Computer Science Technical Report 87-849,
July 1987.
[6] D. Boggs, private communication, February 1994.
[7] D. Borman, R. Braden, V. Jacobson, “TCP Extensions for High Performance,”

Internet RFC 1323, May 1992.
[8] R. T. Braden, D. A. Borman, and C. Partridge, “Computing the Internet Check-

sum,” Internet RFC 1071, September 1988.
[9] R. T. Braden, ed., “Requirements for Internet Hosts - Communication Layers,”

Internet RFC 1122, October 1989.
[10] S. Bradner, Harvard Network Device Test Lab results, available by anony-
mous FTP from hsdndev.harvard.edu.
[11] L.-F. Cabrera, E. Hunter, M. J. Karels, D. A. Mosher, “User-Process Commu-

nication Performance in Networks of Computers,” IEEE Transactions on
Software Engineering, v14n1, pp. 38-53, Jan. 1988.
[12] R. Caceres, P. B. Danzig, S. Jamin, D. J. Mitzel. “Characteristics of Wide-Area

TCP/IP Conversations,” Proceedings of the SIGCOMM ‘91 Symposium on
Communications Architectures and Protocols, pp. 101-112, August 1991.
[13] S. Carl-Mitchell, J. Quarterman, “Using ARP to implement transparent sub-

net gateways,” Internet RFC 1027, October 1987.
[14] J. B. Carter, W. Zwaenepoel, “Optimistic Implementation of Bulk Data Trans-

fer Protocols,” ACM SIGMETRICS Conference on Computer Performance,
pp. 61-69, 1989.
89
90
[15] S. Casner, “Frequently Asked Questions on the Multicast Backbone

(MBONE),” available for anonymous FTP as venera.isi.edu:mbone/faq.txt,
May 1993.
[16] C.-H. Chang, R. Flower, J. Forecast, H. Gray, W. R. Hawe, K. K. Ramakrish-

nan, A. P. Nadkarni, U. N. Shikarpur, K. M. Wilde, “High-performance TCP/
IP and UDP/IP Networking in DEC OSF/1 for Alpha AXP,” Proceedings of the
Third IEEE International Symposium on High Performance Distributed
Computing, pp. 35-42, August 1994.
[17] J. B. Chen, B. N. Bershad, “The Impact of Operating System Structure on

Memory System Performance,” Proceedings of the Fourteenth ACM Sympo-
sium on Operating System Principles, December 1993.
[18] K. Claffy, “Internet Workload Characterization,” Ph.D Thesis, University of

California, San Diego, June 1994.
[19] D. D. Clark, “Modularity and Efficiency in Protocol Implementation,” Internet

RFC, 817, 1982.
[20] D. D. Clark, “The Structuring of Systems Using Upcalls,” Proceedings of the

Tenth ACM Symposium on Operating System Principles, pp. 171-180, Decem-
ber 1985.
[21] D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, “An Analysis of TCP Process-

ing Overhead,” IEEE Communications Magazine, pp. 23-29, June 1989.
[22] D. D. Clark, D. L. Tennenhouse, “Architectural Considerations for a New Gen-

eration of Protocols,” Proceedings of the SIGCOMM ‘90 Symposium on Com-
munications Architectures and Protocols, pp. 200-208, September 1990.
[23] D. Comer, Internetworking with TCP/IP, v3, Prentice-Hall, 1988.
[24] C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards, J. Lumley,

“Afterburner,” IEEE Network, pp. 36-43, July 1993.
[25] Digital Equipment Corporation Workstation Systems Engineering, DECsta-

tion 5000/200 KN02 System Module Functional Specification, August 1990.
[26] Digital Equipment Corporation Workstation Systems Engineering, PMAD-

AA TURBOchannel Ethernet Module Functional Specification, August 1990.
[27] J. Dongarra, private communication, May 1994.
[28] P. Druschel, L. L. Peterson, “Fbufs: A High-Bandwidth Cross-Domain Trans-

fer Facility,” Proceedings of the Fourteenth ACM Symposium on Operating
System Principles, August 1993
[29] R. Gusella, “A Measurement Study of Diskless Workstation Traffic on an

Ethernet,” IEEE Transactions on Communications, 38(9), pp. 1557-1568,
September 1990.
91
[30] J. Hawkinson, “comp.dcom.cisco Frequently Asked Questions (FAQ),”

comp.dcom.cisco, Usenet, November 1994.
[31] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative

Approach, Morgan Kaufmann Publishers, 1990.
[32] N. C. Hutchinson, S. Mishra, L. L. Peterson, V. T. Thomas, “Tools for Imple-

menting Network Protocols,” Software - Practice and Experience, pp. 895-
916, September 1989.
[33] Intel, Connectivity Options, 1992.
[34] V. Jacobson, “BSD TCP Ethernet Throughput”, comp.protocols.tcp-ip, Usenet,

1988.
[35] V. Jacobson, “Some Design Issues for High-speed Networks,” Networkshop

‘93, November 1993.
[36] R. Jain and S. Routhier, “Packet Trains: Measurements and New Model for
Computer Network Traffic,” IEEE Journal on Selected Areas in Communica-
tion, v4n6, 986-995, September 1986.
[37] G. Jones, private communication, August 1994.
[38] H. Kanakia, D. R. Cheriton, “The VMP Network Adapter Board (NAB): High-
Performance Network Communication for Multiprocessors,” Proceedings of
the SIGCOMM ‘88 Symposium on Communications Architectures and Prin-
ciples, pp. 175-187, 1988.
[39] J. Kay, ccsum.c [an Internet checksum routine optimized for the DECstation
5000/200], available for anonymous ftp as ucsd.edu:pub/csl/fastnet/
ccsum.c.Z, April 1993.
[40] J. Kay, J. Pasquale, “Network Sloth: Where the Time Goes in Network Soft-
ware,” University of California, San Diego Technical Report CS92-238,
June 1992.
[41] J. Kay, J. Pasquale, “Measurement, Analysis, and Improvement of UDP/IP

Throughput for the DECstation 5000,” Proceedings of the Winter 1993
USENIX Conference, pp. 249-258, January 1993.
[42] J. Kay, J. Pasquale, “The Importance of Non-Data Touching Processing Over-

heads in TCP/IP,” Proceedings of the SIGCOMM ‘93 on Communications
Architectures and Protocols, September 1993.
[43] S. J. Leffler, M. K. Mckusick, M. J. Karels, J. S. Quarterman, The Design and

Implementation of the 4.3 BSD UNIX Operating System, Addison-Wesley,
November 1989.
[44] LSI Logic, L64360 and ATMizer Architecture Manual, August 1994.
92
[45] P. E. McKenney, K. F. Dove, “Efficient Demultiplexing of Incoming TCP Pack-

ets,” Proceedings of the SIGCOMM ‘92 Symposium on Communications
Architectures and Protocols, pp. 269-279, August 1992.
[46] P. Mockapetris, “Domain Names - Implementation and Specification,” Inter-

net Request for Comments 1035, June 1987.
[47] J. Mogul, “Network Locality at the Scale of Processes,” Proceedings of the SIG-
COMM ‘91 Symposium on Communications Architectures and Protocols, pp.
273-284, August 1991
[48] J. Mogul, “Path MTU Discovery,” Internet Request for Comments 1191,
November 1990.
[49] S. W. O’Malley, M. A. Abbott, N. C. Hutchinson, L. L. Peterson, “A Transpar-

ent Blast Facility,” Journal of Internetworking Research and Experience,
December 1990.
[50] J. Ousterhout, “Why Aren’t Operating Systems Getting Faster As Fast As

Hardware,” Proceedings of the Summer 1990 USENIX Conference, pp. 247-
256, June 1990.
[51] C. Partridge, “Implementing Those Checksum Improvements,” comp.proto-

cols.tcp-ip, Usenet, March 1993.
[52] J. Pasquale, E. Anderson, K. Muller, “Container-Shipping: Operating System

Support for I/O Intensive Applications,” Computer, pp. 84-93, March 1994.
[53] J. Pasquale, G. Polyzos, E. Anderson, K. Fall, J. Kay, V. Kompella, S. McMul-

lan, D. Ranganathan, “Network and Operating Systems Support for Multi-
media Applications,” Technical Report CS91-186, University of California,
San Diego, March 1991.
[54] D. M. Ritchie, “A Stream Input-Output System,” AT&T Bell Laboratories

Technical Journal, pp. 1897-1910, October 1984.
[55] D. Sawyer, P. Weeks, F. Itkowsky, “DEC FDDIcontroller 700 Port Specifica-

tion,” Digital Equipment Corporation, March 1991.
[56] M. D. Schroeder, M. Burrows, “Performance of Firefly RPC,” ACM Transac-

tions on Computer Systems, pp. 1-17, February 1990.
[57] V. J. Schryver, private communication, November 1992.
[58] V. J. Schryver, “Re: bypassing TCP checksum,” comp.protocols.tcp-ip, Usenet,

June 1993.
[59] U. Sinkewicz, C.-H. Chang, L. G. Palmer, C. Smelser, F. L. Templin, “ULTRIX

Fiber Distributed Data Interface Networking Subsystem Implementation,”
Digital Technical Journal, Spring 1991.
93
[60] A. Z. Spector, “Performing Remote Operations Efficiently on a Local Com-

puter Network,” Communications of the ACM, pp. 246-260, April 1982.
[61] M. Stonebraker, J. Dozier, “Sequoia 2000: Large Capacity Object Servers to

Support Global Change Research,” Sequoia 2000 Technical Report #1, U.C.
Berkeley, July 1991.
[62] Sun Microsystems, Inc., “NFS: Network File System Protocol Specification,”
Internet RFC 1094, March 1989.
[63] V. S. Sunderam, G. A. Geist, J. Dongarra, R. Manchek, “The PVM Concurrent

Computing System: Evolution, Experiences, and Trends,” Parallel Comput-
ing, April 1994.
[64] C. A Thekkath, H. M. Levy, “Limits to Low-Latency Communication on High-

Speed Networks,” ACM Transactions on Computer Systems, pp. 179-203,
May 1993.
[65] D. M. Washabaugh, “PDQ FDDI Adapter Port Specification,” Digital Equip-

ment Corporation, January 1992.
[66] T. von Eicken, D. Culler, S. Goldstein, K. Schasuer, “Active Packets: A Mech-

anism for Integrated Communication and Computation,” Proceedings of the
19th International Symposium on Computer Architecture, pp. 19-21, May
1992.
[67] R. W. Watson, S. A. Mamrak, “Gaining Efficiency in Transport Services by

Appropriate Design and Implementation Choices,” ACM Transactions on
Computer Systems, 5(2), pp. 97-120, May 1987.
[68] J. Zweig, C. Partridge, “TCP Alternate Checksum Options,” Internet RFC

1146, March 1991.

10 1 1 30

Enviado por

Dados do documento

Descrição original:

Título original

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

10 1 1 30

Enviado por

Direitos autorais:

Formatos disponíveis

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Path IDs: A Mechanism for Reducing Network

A dissertation submitted in partial satisfaction of the

Jonathan Simon Kay

Professor Joseph Pasquale, Chair

University of California, San Diego

September 3, 1968 Born, Mayaguez, Puerto Rico

J. Kay, J. Pasquale, “Profiling and Reducing Processing Overheads in TCP/IP,”

J. Kay, J. Pasquale, “The Importance of Non-Data Touching Processing Over-

J. Kay, J. Pasquale, “A Summary of TCP/IP Networking Software Performance

J. Kay, J. Pasquale, “Measurement, Analysis, and Improvement of UDP/IP

J. Kay, T. Nguyen, J. Pasquale, “Fast Software Dithering of 24 to 8 or 1 Bit

J. Pasquale, G. Polyzos, E. Anderson, K. Fall, J. Kay, V. Kompella, S. McMul-

Major Field: Computer Science

Studies in Operating Systems

Path IDs: A Mechanism for Reducing Network Software Latency

Jonathan Simon Kay

Network performance is important to an increasing number of com-

1.2 The Importance of Latency

1.3 Latency Optimization

1.4 Throughput Optimization

Chapter 5 fully describes the techniques used to nearly eliminate data-touching

Chapter II: Related Work

2.2 Optimizing Data-Touching Overheads

2.2.2 Copying From User to Kernel

2.2.3 Copying From Kernel to User

2.3 Optimizing Non-Data-Touching Overheads

of an RPC protocol implemented on a Firefly multiprocessor. BibRef[64] concerns

2.4 Network Software Structure

2.4.2 Control Flow

2.4.3 Buffer Descriptors

manipulation. BibRef[40] shows how the x-kernel’s lack of an allocator tuned to

2.5 Performance Measurements

2.6 Workload Measurement

Chapter III: Workloads and Overheads

3.2.1 LAN Traffic

Fraction of Total Messages

3.2.2 WAN Traffic

3.3 Overhead Categories

3.4 Experimental Setup

1500 Data Move 1500 Data Move

500 ProtSpec 500 Mbuf

3.5 Packet Size Effects

Figure 5a-b: Breakdown of operation processing times expressed as accumulated percentages

FigureFoobleFigure 5 presents a different view of the data, showing the

350 ProtSpec 350

Data Struct Data Struct

FigureFoobleFigure 6 magnifies the left-most regions of the graphs in Fig-

3.5.1 Bottlenecks to Minimal Latency

3.5.2 The Importance of Non-Data-Touching Overheads

observation is that TCP/IP implementation in Ultrix 4.2a and similar Berkeley

3.5.2.1 Difficulty of Optimizing Non-Data-Touching Overheads

3.6 In-Depth Analysis

3.6.1 Touching Data

3.6.2 Protocol-Specific Processing

Mbuf Alloc is more expensive for two reasons: memory allocation is an

3.6.4 Operating System Overheads

3.6.5 Error Checking

3.6.6 Data Structure Manipulations

non-data-touching overheads consume a majority of the total software processing

Chapter IV: Optimizing Non-

4.2 PathID Implementation