
E2: A Framework for NFV Applications

Shoumik Palkar Chang Lan Sangjin Han


UC Berkeley UC Berkeley UC Berkeley
sppalkar@berkeley.edu clan@eecs.berkeley.edu sangjin@eecs.berkeley.edu

Keon Jang Aurojit Panda Sylvia Ratnasamy


Intel Labs UC Berkeley UC Berkeley
keon.jang@intel.com apanda@cs.berkeley.edu sylvia@eecs.berkeley.edu

Luigi Rizzo Scott Shenker


Università di Pisa UC Berkeley and ICSI
rizzo@iet.unipi.it shenker@icsi.berkeley.edu

Abstract

By moving network appliance functionality from proprietary hardware to software, Network Function Virtualization promises to bring the advantages of cloud computing to network packet processing. However, the evolution of cloud computing (particularly for data analytics) has greatly benefited from application-independent methods for scaling and placement that achieve high efficiency while relieving programmers of these burdens. NFV has no such general management solutions. In this paper, we present a scalable and application-agnostic scheduling framework for packet processing, and compare its performance to current approaches.

1. Introduction

The proliferation of network processing appliances ("middleboxes") has been accompanied by a growing recognition of the problems they bring, including expensive hardware and complex management. This recognition led the networking industry to launch a concerted effort towards Network Function Virtualization (NFV) with the goal of bringing greater openness and agility to network dataplanes [8]. Inspired by the benefits of cloud computing, NFV advocates moving Network Functions (NFs) out of dedicated physical boxes into virtualized software applications that can be run on commodity, general purpose processors. NFV has quickly gained significant momentum with over 220 industry participants, multiple proof-of-concept prototypes, and a number of emerging product offerings [2, 9].

While this momentum is encouraging, a closer look under the hood reveals a less rosy picture: NFV products and prototypes tend to be merely virtualized software implementations of products that were previously offered as dedicated hardware appliances. Thus, NFV is currently replacing, on a one-to-one basis, monolithic hardware with monolithic software. While this is a valuable first step, as it is expected to lower capital costs and deployment barriers, it fails to provide a coherent management solution for middleboxes. Each software middlebox still comes as a closed implementation bundled with a custom management solution that addresses issues such as overload detection, load balancing, elastic scaling, and fault-tolerance for that particular NF.

This leads to two problems. First, the operator must cope with many NF-specific management systems. Second, NF developers must invent their own solutions to common but non-trivial problems such as dynamic scaling and fault tolerance; in the worst case this results in inadequate solutions (e.g., solutions that do not scale well) and in the best case results in vendors constantly reinventing the wheel.

* Joint first authors
Inspired by the success of data analytic frameworks (e.g., MapReduce, Hadoop and Spark), we argue that NFV needs a framework, by which we mean a software environment for packet-processing applications that implements general techniques for common issues. Such issues include: placement (which NF runs where), elastic scaling (adapting the number of NF instances and balancing load across them), service composition, resource isolation, fault-tolerance, energy management, monitoring, and so forth. Although we are focusing on packet-processing applications, the above are all systems issues, with some aiding NF development (e.g., fault-tolerance), some NF management (e.g., dynamic scaling) and others orchestration across NFs (e.g., placement, service interconnection).

In this paper, we report on our efforts to build such a framework, which we call Elastic Edge (E2). From a practical perspective, E2 brings two benefits: (i) it allows developers to rely on external framework-based mechanisms for common tasks, freeing them to focus on their core application logic and (ii) it simplifies the operator's responsibilities, as it both automates and consolidates common management tasks. To our knowledge, no such framework for NFV exists today, although several efforts explore individual aspects of the problem (as we discuss in §9).

From a conceptual perspective, our contributions are also twofold. First, we describe algorithms to automate the common tasks of placement, service interconnection, and dynamic scaling. In other work, we also address the issue of fault-tolerance [46], with other issues such as performance isolation, energy management and monitoring left for future work. Second, we present a system architecture that simplifies building, deploying and managing NFs. Our architecture departs from the prevailing wisdom in that it blurs the traditional distinction between applications and the network. Typically one thinks of applications as having fully general programming abstractions while the network has very limited abstractions (essentially that of a switch); this constrains how functionality is partitioned between application and network (even when network processing is implemented at end-hosts [31, 39]) and encourages separate management mechanisms for each. In contrast, because we focus on more limited packet-processing applications and fully embrace software switches, we can push richer programming abstractions into the network layer.

More concretely, because of the above reasoning, we eschew the dominant software switch, OVS, in favor of a more modular design inspired by Click [30]. We also depart from the traditional SDN/NFV separation of concerns that uses SDN to route packets between NFs and separately lets NFV manage those NFs [17, 21, 40]; instead, in E2, a single controller handles both the management and interconnection of NFs based on a global system view that spans application and network resources (e.g., core occupancy and number of switch rules available). We show that E2's flexibility together with its coordinated approach to management enables significant performance optimizations; e.g., offering a 25-41% reduction in CPU use through flexible system abstractions (§7.1) and a 1.5-4.5x improvement in overall system throughput through better management decisions (§7.2).

2. Context and Assumptions

We now provide a motivating context for the deployment of a framework such as E2, describe the form of hardware infrastructure we assume, and briefly sketch the E2 design.

2.1 Motivation: A Scale-Out Central Office

We present a concrete deployment context that carriers cite as an attractive target for NFV: a carrier network's broadband and cellular edge, as embodied in their Central Offices (COs) [1]. A CO is a facility commonly located in a metropolitan area to which residential and business lines connect. Carriers hope to use NFV to transform their COs to more closely resemble modern datacenters so they can achieve: a uniform architecture based on commodity hardware, efficiency through statistical multiplexing, centralized management across CO locations, and the flexibility and portability of software services. Carriers cite two reasons for overhauling CO designs [1].

First, the capital and operational expenses incurred by a carrier's COs are very high. This is because there are many COs, each of non-trivial scale; e.g., AT&T reports 5,000 CO locations in the US alone, with 10-100K subscribers per CO. These COs contain specialized devices such as Broadband Network Gateways (BNGs) [3, 4] that connect broadband users to the carrier's IP backbone, and Evolved Packet Core (EPC) gateways that connect cellular users to the IP backbone. These are standalone devices with proprietary internals and vendor-specific management APIs.¹ NFV-based COs would enable operators to utilize commodity hardware while a framework such as E2 would provide a unified management system.

Secondly, carriers are seeking new business models based on opening up their infrastructure to 3rd party services. Hosting services in their COs would enable carriers to exploit their physical proximity to users, but this is difficult when new features require custom hardware; an NFV-based CO design would address this difficulty. In fact, if carriers succeed in opening up their infrastructure, then one might view the network as simply an extension (closer to the user) of existing cloud infrastructure, in which case the transition to NFV becomes necessary for portability between cloud and network infrastructures.

Carrier incentives aside, we note that a CO's workload is ideally suited to NFV's software-centric approach. A perusal of broadband standards [7] and BNG datasheets [4] reveals that COs currently support a range of higher-level traffic processing functions (e.g., content caching, Deep Packet Inspection (DPI), parental controls, WAN and application acceleration, traffic scrubbing for DDoS prevention, and encryption) in addition to traditional functions for firewalls, IPTV multicast, DHCP, VPN, Hierarchical QoS, and NAT.

¹ Standardization efforts such as OpenFlow target L2 and L3 forwarding devices and do not address the complexity of managing these specialized systems or middleboxes more generally [44, 47].
Figure 1: Hardware infrastructure that E2 manages. We show three examples of possible forwarding paths through the cluster, including one that involves no server.

As CO workloads grow in complexity and diversity, so do the benefits of transitioning to general-purpose infrastructure, and the need for a unified and application-independent approach to dealing with common management tasks. Thus, E2 addresses the question of how to efficiently manage a diverse set of packet processing applications without knowing much about their internal implementation. "Efficient" here means both that the management system introduces little additional overhead, and that it enables high utilization of system resources.

2.2 Hardware Infrastructure

E2 is designed for a hardware infrastructure composed of general-purpose servers (residing in racks) interconnected by commodity switches. As shown in Figure 1, we assume a fabric of commodity switches with N ports, of which K are dedicated to be externally facing (i.e., carrying traffic to/from the E2 cluster) while the remaining N-K interconnect the servers running NFV services. This switch fabric can be a single switch, or multiple switches interconnected with standard non-blocking topologies. Our prototype uses a single switch but we expect our design to scale to larger fabrics.

E2 is responsible for managing system resources and hence we briefly elaborate on the main hardware constraints it must accommodate. First, E2 must avoid over-booking the CPU and NIC resources at the servers. Second, E2 must avoid overloading the switch capacity by unnecessarily placing functions on different servers; e.g., a flow processed by functions running at two servers will consume 50% more switching capacity than if the two functions were placed on the same server (Figure 1). Third, since commodity switches offer relatively small flow tables that can be slow to update, E2 must avoid excessive use of the flow table at the switch (see §5.3).

Our current prototype has only a single rack. We presume, based on current packet processing rates and CO traffic volumes, that a CO can be serviced by relatively small cluster sizes (1-10 racks); while we believe that our architecture will easily scale to such numbers, we leave an experimental demonstration of this to future work.

2.3 Design Overview

Before presenting E2 in detail in the following sections, we first provide a brief overview.

E2 Context. We assume that COs reside within an overall network architecture in which a global SDN controller is given (by the operator) a set of network-wide policies to implement. The SDN controller is responsible for translating these network-wide policies into instructions for each CO, and the E2 cluster within each CO is responsible for carrying out these instructions. The E2 cluster is managed by an E2 Manager, which is responsible for communicating with the global SDN controller.

E2 Interface. Akin to several recent network management systems [12, 15-17, 20, 37, 49], E2 provides a declarative interface through which the global SDN controller tells each E2 cluster how traffic should be processed. It does so by specifying a set of policy statements that we call pipelets. Each pipelet defines a traffic class and a corresponding directed acyclic graph (DAG) that captures how this traffic class should be processed by NFs. A traffic class here refers to a subset of the input traffic; the DAG is composed of nodes, which represent NFs (or external ports of the switch), and edges, which describe the type of traffic (e.g., "port 80") that should reach the downstream NF. Figure 2 shows a simplified example of a pipelet.

Thus, the global SDN controller hands the E2 Manager a set of pipelets. The E2 Manager is responsible for executing these pipelets on the E2 cluster as described below, while communicating status information (e.g., overall load or hardware failure) back to the global controller.

In addition to policy, E2 takes two forms of external input: (i) an NF description enumerating any NF-specific constraints (e.g., whether the NF can be replicated across servers), configuration directives (e.g., number and type of ports), and resource requirements (e.g., per-core throughput), and (ii) a hardware description that enumerates switch and server capabilities (e.g., number of cores, flow table size).

E2 Internal Operation. Pipelets dictate what traffic should be processed by which NFs, but not where or how this processing occurs on the physical cluster. E2 must implement the policy directives expressed by the pipelets while respecting NF and hardware constraints and capabilities, and it does so with three components, activated in response to configuration requests or overload indications. (i) The scaling component (§5.3) computes the number of NF instances needed to handle the estimated traffic demand, and then dynamically adapts this number in response to varying traffic load. It generates an instance graph, or iGraph, reflecting the actual number of instances required for each NF mentioned in the set of pipelets, and how traffic is spread across these instances. (ii) The placement component (§5.1) translates the iGraph into an assignment of NF instances to specific servers. (iii) The interconnection component (§5.2) configures the network (including network components at the servers) to steer traffic across appropriate NF instances.
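For concreteness, the sketch below shows the arithmetic behind the scaling component's initial sizing: an estimated per-NF load (taken from pipelet rate annotations) divided by a per-core capacity (taken from the NF description) yields a starting instance count. This is an illustration only, not E2's code; the NF names and rates are invented.

import math

# Invented load estimates (from pipelet edge annotations) and per-core
# capacities (from NF descriptions), in Gbps.
estimated_load_gbps = {"IDS": 8.0, "WebCache": 5.0, "Monitor": 10.0}
per_core_gbps       = {"IDS": 1.5, "WebCache": 4.0, "Monitor": 9.0}

def initial_instances(load_gbps, capacity_gbps):
    # At least one instance for every NF mentioned in the pipelets.
    return max(1, math.ceil(load_gbps / capacity_gbps))

igraph_sizes = {nf: initial_instances(estimated_load_gbps[nf], per_core_gbps[nf])
                for nf in estimated_load_gbps}
print(igraph_sizes)   # {'IDS': 6, 'WebCache': 2, 'Monitor': 2}

These counts only bootstrap the iGraph; the scaling component later adjusts them in response to the measured load, as described in §5.3.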
In the following sections we describe E2's system architecture (§3), its dataplane design (§4), and its control plane design (§5). We present the implementation (§6) and evaluation (§7) of our E2 prototype, then discuss related work (§8) before concluding in §9.

3. E2 System Architecture

We now describe E2's API, inputs, and system components.

3.1 System API

As mentioned in §2, an operator expresses her policies via a collection of pipelets, each describing how a particular traffic class should be processed. This formulation is declarative, so operators can generate pipelets without detailed knowledge of per-site infrastructure or NF implementations. The necessary details will instead be captured in the NF and hardware descriptions. We now elaborate on how we express pipelets. Additional detail on the policy description language we use to express pipelets can be found in the Appendix.

Each pipelet defines a traffic class and a corresponding directed acyclic graph (DAG) that captures how this traffic class should be processed by NFs. In our current implementation, we define traffic classes in terms of packet header fields and physical ports on the switch; for example, one might identify traffic from a particular subscriber via the physical port, or traffic destined for another provider through address prefixes.

A node in the pipelet's DAG represents an NF or a physical port on the switch, and edges describe the traffic between nodes. Edges may be annotated with one or more traffic filters. A filter is a boolean expression that defines what subset of the traffic from the source node should reach the destination node.

Figure 2: An example pipelet. Input traffic is first sent to an IDS; traffic deemed safe by the IDS is passed to a Web Cache if it is destined for TCP port 80, and to a Network Monitor otherwise. Traffic that the IDS finds unsafe is passed to a Traffic Normalizer; all traffic leaving the Traffic Normalizer or the Web Cache is also passed to the Network Monitor.

Filters can refer both to the contents of the packet itself (e.g., header fields) and to semantic information associated with the packet. For example, the characterization of traffic as "safe" or "unsafe" in Figure 2 represents semantic information inferred by the upstream IDS NF. Filters can thus be viewed as composed of general attribute-value pairs, where attributes can be direct (defined on a packet's contents) or derived (capturing higher-level semantics exposed by network applications). A packet follows an edge only if it matches all of the traffic filters on the edge. Note that a traffic filter only defines which traffic flows between functions; E2's interconnection component (§5.2) addresses how this traffic is identified and forwarded across NF ports.

In addition to traffic filters, an edge is optionally annotated with an estimate of the expected rate of such traffic. E2's placement function uses this rate estimate to derive its initial allocation of resources; this estimate can be approximate or even absent because E2's dynamic scaling techniques will dynamically adapt resource allocations to varying load.
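E2's actual policy description language appears in the Appendix and is not reproduced here; the sketch below only illustrates the information a pipelet carries, using the pipelet of Figure 2. The Python classes, field names, and filter syntax are ours, for illustration.

from dataclasses import dataclass, field

@dataclass
class Edge:
    src: str
    dst: str
    filters: list            # all filters must match for a packet to follow the edge
    rate_gbps: float = None  # optional rate estimate, used only for initial sizing

@dataclass
class Pipelet:
    traffic_class: str       # defined on packet headers / physical switch ports
    edges: list = field(default_factory=list)

example = Pipelet(
    traffic_class="in_port == ext0",
    edges=[
        Edge("ext0", "IDS", filters=[], rate_gbps=10.0),
        Edge("IDS", "WebCache",   filters=["IDS.safe", "dst port 80"]),
        Edge("IDS", "NetMonitor", filters=["IDS.safe", "!(dst port 80)"]),
        Edge("IDS", "Normalizer", filters=["!IDS.safe"]),
        Edge("Normalizer", "NetMonitor", filters=[]),
        Edge("WebCache",   "NetMonitor", filters=[]),
    ],
)

for e in example.edges:
    print(f"{e.src} -> {e.dst}: {' && '.join(e.filters) or 'all traffic'}")

Note how the edge filters mix a direct attribute ("dst port 80") with a derived attribute ("IDS.safe") exposed by the upstream NF, exactly as described above.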
3.2 System Inputs

In addition to pipelets, E2 takes an NF description that guides the framework in configuring each NF, and a hardware description that tells the framework what hardware resources are available for use. We describe each in turn.

NF descriptions. E2 uses the following pieces of information for each NF. We envisage that this information (except the last one) will be provided by NF developers.

(1) Native vs. Legacy. E2 exports an optional API that allows NFs to leverage performance optimizations (§4). NFs that use this API are considered "native", in contrast to unmodified "legacy" NFs running on the raw socket interface provided by the OS; we discuss the native API further in §7.

(2) Attribute-Method bindings. Each derived attribute has an associated method for associating packets with their attribute values. Our E2 prototype supports two forms of methods: ports and per-packet metadata (§4).

With the port method, all traffic with an attribute value will be seen through a particular (virtual or physical) port. Since a port is associated with a specific value for an attribute, ports are well-suited for coarse-grained attributes that take on a small number of well-known values. E.g., in Figure 2, if the IDS defines the method associated with the "safe" attribute to be "port", all safe traffic exits the IDS through one virtual port, and all unsafe traffic through another. Legacy applications that cannot leverage the metadata method described below fit nicely into this model.

The metadata method is available as a native API. Conceptually, one can think of metadata as a per-packet annotation [30] or tag [17] that stores the attribute-value pair; §4 describes how our system implements metadata using a custom header. Metadata is well-suited for attributes that take many possible values; e.g., tagging packets with the URL associated with a flow (versus using a port per unique URL).

(3) Scaling constraints tell E2 whether the application can be scaled across servers/cores or not, thus allowing the framework to react appropriately on overload (§5.3).

(4) Affinity constraints. For NFs that scale across servers, the affinity constraints tell the framework how to split traffic across NF instances. Many NFs perform stateful operations on individual flows and flow aggregates. The affinity constraints define the traffic aggregates the NF acts on (e.g., all packets with a particular TCP port, or all packets in a flow), and the framework ensures that packets belonging to the same aggregate are consistently delivered to the same NF instance. Our prototype accepts affinity constraints defined in terms of the 5-tuple with wildcards.

(5) NF performance. This is an estimate of the per-core, per-GHz traffic rate that the NF can sustain.² This is optional information that E2's placement function uses to derive a closer-to-target initial allocation of cores per NF.

² Since the performance of NFs varies based on server hardware and traffic characteristics, we expect these estimates will be provided by the network operator (based on profiling the NF in their environment) rather than by the NF vendor.

Hardware description. In our current prototype, the hardware constraints that E2 considers when making operational decisions include: (1) the number of cores (and speed) and the network I/O bandwidth per server, (2) the number of switch ports, (3) the number of entries in the switch flow table, and (4) the set of available switch actions. Our hardware description thus includes this information. We leave to future work the question of whether and how to exploit richer models, e.g., that consider resources such as the memory or CPU cache at servers, availability of GPUs or specialized accelerators [24], programmable switches [15], and so forth.
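The listing below sketches, as plain Python dictionaries, the kind of information the NF and hardware descriptions convey. The field names and values are invented for illustration and do not reflect E2's actual description format.

# Illustrative NF description for the IDS of Figure 2.
ids_description = {
    "name": "IDS",
    "api": "native",                        # (1) native vs. legacy
    "attribute_methods": {"safe": "port"},  # (2) derived attribute -> method
    "output_ports": ["safe", "unsafe"],
    "scaling": {"across_cores": True, "across_servers": True},  # (3)
    "affinity": "5-tuple",                  # (4) aggregate pinned to one instance
    "per_core_gbps_per_ghz": 0.6,           # (5) optional, supplied by the operator
}

# Illustrative hardware description for a small cluster.
hardware_description = {
    "servers": [{"cores": 16, "core_ghz": 2.6, "nic_gbps": 10}] * 4,
    "switch": {"ports": 48, "flow_table_entries": 2048,
               "actions": ["forward", "vlan", "tunnel"]},
}

def per_core_rate_gbps(nf, server):
    # Per-core rate estimate used when sizing the iGraph (Section 5.1, Step 2).
    return nf["per_core_gbps_per_ghz"] * server["core_ghz"]

print(per_core_rate_gbps(ids_description, hardware_description["servers"][0]))  # ~1.56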
3.3 System Components

Figure 3: The overall E2 system architecture.

Figure 3 shows the three main system components in E2: the E2 Manager orchestrates overall operation of the cluster, a Server Agent manages operation within each server, and the E2 Dataplane (E2D) acts as a software traffic processing layer that underlies the NFs at each server. The E2 Manager interfaces with the hardware switch(es) through standard switch APIs [6, 14, 36] and with the Server Agents.

4. The E2 Dataplane, E2D

In the following subsections we describe the design of the E2 Dataplane (E2D). The goal of E2D is to provide flexible yet efficient "plumbing" across the NF instances in the pGraph.

4.1 Rationale

Our E2D implementation is based on SoftNIC [23], a high-performance, programmable software switch that allows arbitrary packet processing modules to be dynamically configured as a data flow graph, in a similar manner to the Click modular router [30].

While the Click-style approach is widely used in various academic and commercial contexts, the de-facto approach to traffic management on servers uses Open vSwitch (OVS) and the OpenFlow interface it exports. OVS is built on the abstraction of a conventional hardware switch: it is internally organized as a pipeline of tables that store "match-action" rules, with matches defined on packet header fields plus some limited support for counters and internal state. Given the widespread adoption of OVS, it is reasonable to ask why we adopt a different approach. In a nutshell, it is because NFV does not share many of the design considerations that (at least historically) have driven the architecture of OVS/OpenFlow, and hence the latter may be unnecessarily restrictive or even at odds with our needs.

More specifically, OVS evolved to support network virtualization platforms (NVPs) in multi-tenant datacenters [31]. Datacenter operators use NVPs to create multiple virtual networks, each with independent topologies and addressing architectures, over the same physical network; this enables (for example) tenants to "cut-paste" a network configuration from their local enterprise to a cloud environment. The primary operation that NVPs require on the dataplane is the emulation of a packet's traversal through a series of switches in the virtual topology, and thus OVS has focused on fast lookups on OpenFlow tables, e.g., using multiple layers of caching internally [39], and limited actions.

NFV does not face this challenge. Instead, since most cycles will likely be consumed in NFs, we are more interested in performance optimizations that improve the efficiency of NFs (e.g., our native APIs below). Thus, rather than work to adapt OVS to NFV contexts, we chose to explore a Click-inspired dataflow design more suited to our needs. This choice allowed us to easily implement various performance optimizations (§7) and functions in support of dynamic scaling (§5.3) and service interconnection (§5.2).

4.2 SoftNIC

SoftNIC exposes virtual NIC ports (vports) to NF instances; vports virtualize the hardware NIC ports (pports) for virtualized NFs. Between vports and pports, SoftNIC allows arbitrary packet processing modules to be configured as a data flow graph, in a manner similar to the Click modular router [30]. This modularity and extensibility differentiate SoftNIC from OVS, where expressiveness and functionality are limited by the flow-table semantics and predefined actions of OpenFlow.

SoftNIC achieves high performance by building on recent techniques for efficient software packet processing. Specifically, SoftNIC uses Intel DPDK [27] for low-overhead I/O to hardware NICs and uses pervasive batch processing within the pipeline to amortize per-packet processing costs. In addition, SoftNIC runs on a small number of dedicated processor cores for high throughput (by better utilizing the CPU cache) and sub-microsecond latency/jitter (by eliminating context switching cost). The SoftNIC core(s) continuously poll each physical and virtual port for packets. Packets are processed from one NF to another using a push-to-completion model; once a packet is read from a port, it is run through a series of modules (e.g., classification, rate limiting, etc.) until it reaches a destination port.

In our experiments with the E2 prototype (§7), we dedicate only one core to E2D/SoftNIC as we find a single core was sufficient to handle the network capacity of our testbed; [23] demonstrates SoftNIC's scalability to 40 Gbps per core.
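SoftNIC itself is written in C on top of DPDK and its module API is not shown in this paper; the toy Python sketch below only illustrates the dataflow model just described: a batch of packets read from a port is pushed through a chain of modules, with no queuing in between, until every packet is assigned to a destination port. The module behaviors and names are invented.

def classifier(batch):
    # Decide which destination vport each packet should eventually reach.
    for pkt in batch:
        pkt["out"] = "vport_web" if pkt.get("dst_port") == 80 else "vport_other"
    return batch

def rate_limiter(batch, max_pkts=2):
    # Crude limiter: keep only the first max_pkts packets of the batch.
    return batch[:max_pkts]

PIPELINE = [classifier, rate_limiter]

def push_to_completion(batch):
    for module in PIPELINE:            # whole batch traverses module after module
        batch = module(batch)
    outputs = {}
    for pkt in batch:                  # group surviving packets by destination port
        outputs.setdefault(pkt["out"], []).append(pkt)
    return outputs

batch = [{"dst_port": 80}, {"dst_port": 53}, {"dst_port": 80}]
print(push_to_completion(batch))

Processing the whole batch module-by-module, rather than queuing packets between modules, is what lets a single dedicated core amortize per-packet costs as described above.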
4.3 Extending SoftNIC for E2D

We extend SoftNIC in the following three ways. First, we implement a number of modules tailored for E2D, including modules for load monitoring, flow tracking, load balancing, packet classification, and tunneling across NFs. These modules are utilized to implement E2's components for NF placement, interconnection, and dynamic scaling, as will be discussed in the rest of this paper.

Second, as mentioned earlier, E2D provides a native API that NFs can leverage to achieve better system-wide performance and modularity. This native API provides support for: zero-copy packet transfer over vports for high-throughput communication between E2D and NFs, and rich message abstractions which allow NFs to go beyond traditional packet-based communication. Examples of rich messages include: (i) reconstructed TCP bytestreams (to avoid the redundant overhead at each NF), (ii) per-packet metadata tags that accompany the packet even across NF boundaries, and (iii) inter-NF signals (e.g., a notification to block traffic from an IPS to a firewall).

The richer cross-NF communication enables not only various performance optimizations but also better NF design, by allowing modular functions, rather than full-blown NFs from different vendors, to be combined and reused in a flexible yet efficient manner. We discuss and evaluate the native API further in §7.

Lastly, E2D extends SoftNIC with a control API exposed to E2's Server Agent, allowing it to: (i) dynamically create/destroy vports for NF instances, (ii) add/remove modules in E2D's packet processing pipeline, stitching NFs together both within and across servers, and (iii) receive notifications of NF overload or failure from the E2D (potentially triggering scaling or recovery mechanisms).

5. The E2 Control Plane

The E2 control plane is in charge of (i) placement (instantiating the pipelets on servers), (ii) interconnection (setting up and configuring the interconnections between NFs), (iii) scaling (dynamically adapting the placement decisions depending on load variations), and (iv) ensuring affinity constraints of NFs.

Figure 4: Transformations of a pGraph (a) into an iGraph (b, c, d): (a) the original pGraph; (b) the iGraph with split NF B; (c) the iGraph with split NFs A and B; (d) the optimized iGraph.

5.1 NF Placement

The initial placement of NFs involves five steps:

Step 1: Merging pipelets into a single policy graph. E2 first combines the set of input pipelets into a single policy graph, or pGraph; the pGraph is simply the union of the individual pipelets with one node for each NF and edges copied from the individual pipelets.

Step 2: Sizing. Next, E2 uses the initial estimate of the load on an NF (the sum of all incoming traffic streams), and its per-core capacity from the NF description, to determine how many instances (running on separate cores) should be allocated to it. The load and capacity estimates need not be accurate; our goal is merely to find a reasonable starting point for system bootstrapping. Dynamically adapting to actual load is discussed later in this section.

Step 3: Converting the pGraph to an iGraph. This step transforms the pGraph into the instance graph, or iGraph, in which each node represents an instance of an NF. Splitting a node involves rewiring its input and output edges; Figure 4 shows some possible cases. In the general case, as shown in Figure 4(b) and 4(c), splitting a node requires distributing the input traffic across all its instances in a manner that respects all affinity constraints, and generating the corresponding edge filters. As an example, NF B in Figure 4(b) might require traffic with the same 5-tuple to go to the same instance, hence E2 inserts a filter that hashes traffic from A on the 5-tuple and splits it evenly towards B's instances.

When splitting multiple adjacent nodes, the affinity constraints may permit optimizations in the "distribute" stages, as depicted in Figure 4(d). In this case, node B from the previous example is preceded by node A that groups traffic by source IP addresses. If the affinity constraint for A already satisfies the affinity constraint for B, E2 does not need to reclassify the outputs from A's instances, and instead can create direct connections as in Figure 4(d). By minimizing the number of edges between NF instances, instance placement becomes more efficient, as we explain below.

Step 4: Instance placement. The next step is to map each NF instance to a particular server. The goal is to minimize inter-server traffic for two reasons: (i) software forwarding within a single server incurs lower delay and consumes fewer processor cycles than going through the NICs [19, 43] and (ii) the link bandwidth between servers and the switch is a limited resource. Hence, we treat instance placement as an optimization problem to minimize the amount of traffic traversing the switch. This can be modeled as a graph partition problem, which is NP-hard, and hence we resort to an iterative local searching algorithm, in a modified form of the classic Kernighan-Lin heuristic [28].

The algorithm works as follows: we begin with a valid solution that is obtained by bin-packing vertices into partitions (servers) based on a depth-first search on the iGraph; then in each iteration, we swap a pair of vertices from two different partitions. The pair selected for a swap is the one that leads to the greatest reduction in cross-partition traffic. These iterations continue until no further improvement can be made. This provides an initial placement of NF instances in O(n² lg n) time, where n is the number of NF instances.

In addition, we must consider incremental placement as NF instances are added to the iGraph. While the above algorithm is already incremental in nature, our strategy of migration avoidance (§5.4) imposes that we do not swap an existing NF instance with a new one. Hence, the incremental placement is much simpler: we consider all possible partitions where the new instance may be placed, and choose the one that will incur the least cross-partition traffic by simply enumerating all the neighboring instances of the new NF instance. Thus the complexity of our incremental placement algorithm is O(n), where n is the number of NF instances.
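For concreteness, the sketch below implements the flavor of local search described above on a toy iGraph: start from a simple packing, then repeatedly apply the pair swap that most reduces cross-server traffic, stopping when no swap helps. It omits the affinity and capacity constraints that E2's real placement honors, and the instances and traffic rates are invented.

import itertools

traffic = {                       # iGraph edges: (instance, instance) -> Gbps
    ("A0", "B0"): 4, ("A0", "B1"): 1,
    ("A1", "B0"): 1, ("A1", "B1"): 4,
}
instances = ["A0", "A1", "B0", "B1"]

def cross_traffic(assign):
    # Traffic that must traverse the hardware switch under a given assignment.
    return sum(rate for (u, v), rate in traffic.items() if assign[u] != assign[v])

def place(instances, per_server=2):
    # Initial valid solution: pack instances onto servers in order (a stand-in
    # for the depth-first bin-packing that E2 uses).
    assign = {inst: i // per_server for i, inst in enumerate(instances)}
    while True:
        best_pair, best_cost = None, cross_traffic(assign)
        for u, v in itertools.combinations(instances, 2):
            if assign[u] == assign[v]:
                continue                                  # only swaps across servers
            assign[u], assign[v] = assign[v], assign[u]   # tentatively swap
            cost = cross_traffic(assign)
            assign[u], assign[v] = assign[v], assign[u]   # undo
            if cost < best_cost:
                best_pair, best_cost = (u, v), cost
        if best_pair is None:                             # no improving swap remains
            return assign
        u, v = best_pair                                  # apply the best swap
        assign[u], assign[v] = assign[v], assign[u]

print(place(instances))   # A0 and B0 share one server; A1 and B1 share the other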
Step 5: Offloading to the hardware switch. Today's commodity switch ASICs implement various low-level features, such as L2/L3 forwarding, VLAN/tunneling, and QoS packet scheduling. This opens the possibility of offloading these functions to hardware when they appear on the policy graph, similar to Dragonet [48], which offloads functions from the end-host network stack to NIC hardware. On the other hand, offloading requires that traffic traverse physical links and consume other hardware resources (table entries, switch ports, queues) that are also limited, so offloading is not always possible. To reduce complexity in the placement decisions, E2 uses an opportunistic approach: an NF is considered as a candidate for offloading to the switch only if, at the end of the placement, that NF is adjacent to a switch port, and the switch has available resources to run it. E2 does not preclude the use of specialized hardware accelerators to implement NFs, though we have not explored the issue in the current prototype.

Figure 5: E2 converts edge annotations on an iGraph (a) into output ports (b) that the applications write to, and then adds traffic filters that the E2D implements (c).

5.2 Service Interconnection

Recall that edges in the pGraph (and, by extension, the iGraph) are annotated with filters. Service interconnection uses these annotations to steer traffic between NF instances in three stages.

Instantiating NFs' ports. The NF description specifies how many output ports are used by an NF and which traffic attributes are associated with each port. E2D instantiates vports accordingly, as per the NF description and the iGraph. For example, Fig. 5(b) shows an IDS instance with two vports, which output "safe" and "unsafe" traffic respectively.

Adding traffic filters. An edge may require (as specified by the edge's filters) only a subset of the traffic generated by the NF instance it is attached to. In this case, E2 will insert an additional classification stage, implemented by the E2D, to ensure that the edge only receives traffic matching the edge filters. Figure 5(c) illustrates an example where "safe" traffic is further classified based on the destination port number. While E2's classifier currently implements BPF filtering [35] on packet header fields and metadata tags, we note that it can be extended beyond traditional filtering to (for example) filter packets based on CPU utilization or the active/standby status of NF instances. To disambiguate traffic leaving "mangling" NFs that rewrite key header fields (e.g., NAT), the E2D layer dynamically creates disambiguating packet steering rules based on the remaining header fields.³

³ Our approach to handling mangling NFs is enabled by the ability to inject code inline in the E2D layer. This allows us to avoid the complexity and inefficiency of solutions based on legacy virtual switches such as OVS; these prior solutions involve creating multiple instances of the mangling NF, one for each downstream path in the policy graph [20], and invoke the central controller for each new flow arrival [17, 20].

Configuring the switch and the E2D. After these steps, E2 must configure the switch and the E2D to attach NF ports to edges and instantiate the necessary filters. Edges that are local to one server are implemented by the E2D alone. Edges between servers also flow through the E2D, which routes them to physical NICs, possibly using tunneling to multiplex several edges onto the available NICs. Packet encapsulation for tunneling does not cause MTU issues, as commodity NICs and switches already support jumbo frames.
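E2's classifier compiles edge filters to BPF programs over header fields and metadata tags; the toy sketch below uses Python predicates instead, simply to show how an edge's filters select which downstream instance receives a packet, mirroring the classification of "safe" IDS traffic in Figure 5(c). All names are illustrative.

# Edges leaving the IDS's "safe" vport, as (filters, downstream instance) pairs.
edges_from_ids_safe_vport = [
    ([lambda p: p["dst_port"] == 80], "WebCache-0"),
    ([lambda p: p["dst_port"] != 80], "Monitor-0"),
]

def steer(packet, edges):
    # A packet follows an edge only if it matches all of the edge's filters.
    for filters, downstream in edges:
        if all(f(packet) for f in filters):
            return downstream
    return None   # no edge wants this packet

print(steer({"dst_port": 80},  edges_from_ids_safe_vport))   # WebCache-0
print(steer({"dst_port": 443}, edges_from_ids_safe_vport))   # Monitor-0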
5.3 Dynamic Scaling

The initial placement decisions are based on estimates of traffic and per-core performance, both of which are imperfect and may change over time. Hence, we need solutions for dynamically scaling in the face of changing loads; in particular, we must find ways to split the load over several NF instances when a single instance is no longer sufficient. We do not present the methods for contraction when underloaded, but these are similar in spirit. We provide hooks for NFs to report on their instantaneous load, and the E2D itself detects overloads based on queues and processing delays.

We say we split an instance when we redistribute its load to two or more instances (one of which is the previous instance) in response to an overload. This involves placing the new instances, setting up new interconnection state (as described previously in this section), and must consider the affinity requirements of flows (discussed later in this section), so it is not to be done lightly.

To implement splitting, when a node signals overload the Server Agent notifies the E2 Manager, which uses the incremental algorithm described in §5.1 to place the NF instances. The remaining step is to correctly split incoming traffic across the new and old instances; we address this next.

5.4 Migration Avoidance for Flow Affinity

Most middleboxes are stateful and require affinity, where traffic for a given flow must reach the instance that holds that flow's state. In such cases, splitting an NF's instance (and correspondingly, its input traffic) requires extra measures to preserve affinity.

Prior solutions that maintain affinity either depend on state migration techniques (moving the relevant state from one instance to another), which is both expensive and incompatible with legacy applications [21], or require large rule sets in hardware switches [41]; we discuss these solutions later in §7.

We instead develop a novel migration avoidance strategy in which the hardware and software switch act in concert to maintain affinity. Our scheme does not require state migration, is designed to minimize the number of flow table entries used on the hardware switch to pass traffic to NF instances, and is based on the following assumptions:

- each flow f can be mapped (for instance, through a hash function applied to relevant header fields) to a flow ID H(f), defined as an integer in the interval R = [0, 2^N);
- the hardware switch can compute the flow ID, and can match arbitrary ranges in R with a modest number of rules. Even TCAM-based switches, without a native range filter, require fewer than 2N rules for this;
- each NF instance is associated with one subrange of the interval R;
- the E2D on each server can track individual, active flows that each NF is currently handling.⁴ We call F_old(A) the current set of flows handled by some NF A.

⁴ The NF description indicates how to aggregate traffic into flows (i.e., the same subset of header fields used to compute the flow ID).

Figure 6: (a) Flows enter a single NF instance. (b) Migration avoidance partitions the range of flow IDs and punts new flows to a new replica using the E2D. Existing flows are routed to the same instance. (c) Once enough flows expire, E2 installs steering rules in the switch.

When an iGraph is initially mapped onto E2, each NF instance A may have a corresponding range filter [X, Y) → A installed in the E2D layer or in the hardware switch. When splitting A into A and A', we must partition the range [X, Y), but keep sending flows in F_old(A) to A until they naturally terminate.

Strawman approach. This can be achieved by replacing the filter [X, Y) → A with two filters

    [X, M) → A,    [M, Y) → A'

and higher priority filters ("exceptions") to preserve affinity:

    ∀ f : f ∈ F_old(A) ∧ H(f) ∈ [M, Y) : f → A

The number of exceptions can be very large. If the switch has small filtering tables (hardware switches typically have only a few thousand entries), we can reduce the range [M, Y) to keep the number of exceptions small, but this causes an uneven traffic split. This problem arises when the filters must be installed on a hardware switch, and A and A' reside on different servers.

Our solution. To handle this case efficiently, our migration avoidance algorithm uses the following strategy (illustrated in Figure 6):

- Upon splitting, the range filter [X, Y) on the hardware switch is initially unchanged, and the new filters (two new ranges plus exceptions) are installed in the E2D of the server that hosts A;
- As flows in F_old(A) gradually terminate, the corresponding exception rules can be removed;
- When the number of exceptions drops below some threshold, the new ranges and remaining exceptions are pushed to the switch, replacing the original rule [X, Y) → A.

By temporarily leveraging the capabilities of the E2D, migration avoidance achieves load distribution without the complexity of state migration and with efficient use of switch resources. The trade-off is the additional latency to new flows being punted between servers (but this overhead is small and lasts only for a short period of time) and some additional switch bandwidth (again, for a short duration); we quantify these overheads in §7.
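The sketch below walks through the bookkeeping of this strategy: split the flow-ID range, generate the exception rules that keep existing flows on A, and decide when the (shrinking) rule set is small enough to push down to the hardware switch. The hash, range width, flow tuples, and threshold are all illustrative.

N = 16                      # flow IDs live in [0, 2**N)
X, Y = 0, 2**N              # original range filter [X, Y) -> A
M = (X + Y) // 2            # split point: [X, M) -> A, [M, Y) -> A'

def flow_id(flow_tuple):
    return hash(flow_tuple) % (2**N)

def split_rules(old_flows):
    # New range rules plus exception rules that pin existing flows to A.
    ranges = [((X, M), "A"), ((M, Y), "A'")]
    exceptions = [(f, "A") for f in old_flows if M <= flow_id(f) < Y]
    return ranges, exceptions

def ready_to_push_to_switch(exceptions, threshold=100):
    # Exception rules disappear as old flows terminate; once few enough remain,
    # the ranges (and leftovers) can replace [X, Y) -> A in the hardware switch.
    return len(exceptions) < threshold

old_flows = [("src-%d" % i, "dst", 12345, 80, "TCP") for i in range(1000)]  # stand-in 5-tuples
ranges, exceptions = split_rules(old_flows)
print(len(exceptions), "exception rules installed in the E2D")
print("push to switch now?", ready_to_push_to_switch(exceptions))

Until the push happens, only the E2D of A's server needs to hold the exceptions, which is what keeps the hardware flow table usage small.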
6. Prototype Implementation

Our E2 implementation consists of the E2 Manager, the Server Agent, and the E2D. The E2 Manager is implemented in F# and connects to the switch and each server using an out-of-band control network. It interfaces with the switch via an OpenFlow-like API to program the flow table, which is used to load balance traffic and route packets between servers. The E2 Manager runs our placement algorithm (§5) and accordingly allocates a subset of nodes (i.e., NF instances) from the iGraph to each server and instructs the Server Agent to allocate cores for the NFs it has been assigned, to execute the NFs, and to interface with the E2D to allocate ports, create and compose processing modules in SoftNIC, and set up paths between NFs.

The Server Agent is implemented in Python and runs as a Python daemon on each server in the E2 cluster. The Server Agent acts as a shim layer between the E2 Manager and its local E2D, and it simply executes the instructions passed by the E2 Manager.

The E2D is built on SoftNIC (§4). Our E2D contains several SoftNIC modules which the Server Agent configures for service interconnection and load balancing. Specifically, we have implemented a match/action module for packet metadata, a module for tagging and untagging packets with tunneling headers to route between servers, and a steering module which implements E2D's part in migration avoidance. The E2D implements the native API discussed in §4.3; for legacy NFs, the E2D creates regular Linux network devices.

7. Evaluation

Prototype. Our E2 prototype uses an Intel FM6000 Seacliff Trail switch with 48 10 Gbps ports and 2,048 flow table entries. We connect four servers to the switch, each with one 10 Gbps link. One server uses the Intel Xeon E5-2680 v2 CPU with 10 cores in each of 2 sockets and the remaining use the Intel Xeon E5-2650 v2 CPU with 8 cores in each of 2 sockets, for a total of 68 cores running at 2.6 GHz. On each server, we dedicate one core to run the E2D layer. The E2 Manager runs on a standalone server that connects to each server and to the management port of the switch on a separate 1 Gbps control network.

We start with microbenchmarks that evaluate E2's data plane (§7.1), then evaluate E2's control plane techniques (§7.2) and finally evaluate overall system performance with E2 (§7.3).

Experimental Setup. We evaluate our design choices using the above E2 prototype. We connect a traffic generator to external ports on our switch with four 10 G links. We use a server with four 10 G NICs and two Intel Xeon E5-2680 v2 CPUs as a traffic generator. We implemented the traffic generator to act as the traffic source and sink. Unless stated otherwise, we present results for a traffic workload of all minimum-sized 60B Ethernet packets.

7.1 E2D: Data Plane Performance

We show that E2D introduces little overhead and that its native APIs enable valuable performance improvements.

E2D Overhead. We evaluate the overhead that E2D introduces with a simple forwarding test, where packets are generated by an NF and looped back by the switch to the same NF. In this setup, packets traverse the E2D layer twice (in the NF → switch and switch → NF directions). We record an average latency of 4.91 µs.

We compare this result with a scenario where the NF is directly implemented with DPDK (recall that SoftNIC and hence E2D build on top of DPDK), in order to rule out the overhead of E2D. In this case the average latency was 4.61 µs, indicating that E2D incurs a 0.3 µs delay (or 0.15 µs for each direction). Given that a typical end-to-end latency requirement within a CO is 1 ms⁵, we believe that this latency overhead is insignificant.

⁵ From discussion with carriers and NF vendors.

In terms of throughput, forwarding through E2D on a single core fully saturates the server's 10 Gbps link, as expected [23, 27].

The low latency and high throughput that E2D achieves is thanks to its use of SoftNIC/DPDK. Our results merely show that the baseline overhead that E2D/SoftNIC adds to its underlying DPDK is minimal; more complex packet processing at NFs would, of course, result in proportionally higher delays and lower throughput.

E2D Native API. Recall that E2's native API enables performance optimizations through its support for zero-copy vports and rich messages. We use the latter to implement two optimizations: (i) bytestream vports that allow the cost of TCP session reconstruction to be amortized across NFs and (ii) packet metadata tags that eliminate redundant work by allowing semantic information computed by one NF to be shared with the E2D or other NFs. We now quantify the performance benefits due to these optimizations: zero-copy vports, bytestream vports, and metadata.

Zero-copy vports. We measure the latency and throughput between two NFs on a single server (since the native API does nothing to optimize communication between servers). Table 1 compares the average latency and throughput of the legacy and native APIs along this NF → E2D → NF path. We see that our native API reduces the latency of NF-to-NF communication by over 2.5x on average and increases throughput by over 26x; this improvement is largely due to zero-copy vports (§4) and the fact that legacy NFs incur OS-induced overheads due to packet copies and interrupts. Our native API matches the performance of frameworks such as DPDK [27] and netmap [42].

Path (NF → E2D → NF)      Latency (µs)   Gbps (1500B)   Mpps (64B)
Legacy API                 3.2            7.437          0.929
Native Zero-Copy API       1.214          187.515        15.24

Table 1: Latency and throughput between NFs on a single server using E2's legacy vs. native API. Legacy NFs use the Linux raw socket interface.

Bytestream vports. TCP session reconstruction, which involves packet parsing, flow state tracking, and TCP segment reassembly, is a common operation required by most DPI-based NFs. Hence, when there are multiple DPI NFs in a pipeline, repeatedly performing TCP reconstruction can waste processing cycles.

We evaluate the performance benefits of bytestream vports using a pipeline of three simple DPI NFs: (i) SIG implements signature matching with the Aho-Corasick algorithm, (ii) HTTP implements an HTTP parser, and (iii) RE implements redundancy elimination using Rabin fingerprinting. These represent an IDS, URL-based filtering, and a WAN optimizer, respectively. The "Library" case in Fig. 7 represents a baseline, where each NF independently performs TCP reconstruction over received packets with our common TCP library. In the "ByteStream" case, we dedicate a separate NF ("TCP") to perform TCP reconstruction and produce metadata (TCP state changes and reassembly anomalies) and reassembled bytestream messages for the three downstream NFs to reuse. E2D guarantees reliable transfer of all messages between NFs that use bytestream vports, with much less overhead than full TCP. The results show that bytestream vports can save 25% of processing cycles for the same amount of input traffic.

Figure 7: Comparison of CPU cycles (GHz per Gb of traffic) for three DPI NFs, without and with bytestream vports. Both cases use the native API.
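The saving comes from paying the reconstruction cost once, in the shared TCP NF, rather than once per DPI NF. The sketch below is only a back-of-the-envelope model of that amortization with invented cycle counts; the 25% figure above is measured (Figure 7), not derived from this model.

recon_cycles = 300   # per-packet cost of TCP reconstruction (invented)
app_cycles   = 700   # per-packet cost of an NF's own processing (invented)
k            = 3     # number of DPI NFs in the pipeline (SIG, HTTP, RE)

library    = k * (recon_cycles + app_cycles)   # each NF reconstructs on its own
bytestream = recon_cycles + k * app_cycles     # one TCP NF reconstructs for all

print(f"library:    {library} cycles/packet")
print(f"bytestream: {bytestream} cycles/packet")
print(f"savings:    {100 * (library - bytestream) / library:.0f}%")

The benefit grows with the number of DPI NFs sharing the bytestream and with the fraction of per-packet work spent on reconstruction.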
Metadata Tags. Tags can carry information along with packets and save repeated work in the applications; having the E2D manage tags is both a convenience and potentially also a performance benefit for application writers. The following two experiments quantify the overhead and potential performance benefits due to tagging packets with metadata.

To measure the overhead, we measure the inter-NF throughput using our zero-copy native API under two scenarios. In Header-Match, the E2D simply checks a particular header field against a configured value; no metadata tags are attached to packets. In Metadata-Match, the source NF creates a metadata tag for each packet which is set to the value of a bit field in the payload; the E2D then checks the tag against a configured value. Table 2 shows our results. We see that Metadata-Match achieves a throughput of 11.96 Mpps, compared to 12.76 Mpps for Header-Match. Thus adding metadata lowers throughput by 5.7%.

Path (NF → E2D → NF)   Latency (µs)   Gbps (1500B)   Mpps (64B)
Header-Match            1.56           152.515        12.76
Metadata-Match          1.695          145.826        11.96

Table 2: Latency and throughput between NFs on a single server with and without metadata tags.

We demonstrate the performance benefits of metadata tags using a pipeline in which packets leaving an upstream HTTP Logger NF are forwarded to a CDN NF based on the value of the URL associated with their session. Since the Logger implements HTTP parsing, a native implementation of the Logger NF can tag packets with their associated URL, and the E2D layer will steer packets based on this metadata field. Without native metadata tags, we need to insert a standalone URL-Classifier NF in the pipeline between the Logger and CDN NFs to create equivalent information. In this case, traffic flows as Logger → E2D → URL-Classifier → E2D → CDN. As shown in Figure 8, the additional NF and E2D traversal increase the processing load by 41% compared to the use of native tags.

Figure 8: Comparison of CPU cycles between using URL metadata and a dedicated HTTP parser.

7.2 E2 Control Plane Performance

We now evaluate our control plane solutions for NF placement, interconnection, and dynamic scaling, showing that our placement approach achieves better efficiency than two strawmen solutions and that our migration-avoidance design is better than two natural alternatives.

NF Placement. E2 aims to maximize cluster-wide throughput by placing NFs in a manner that minimizes use of the hardware switch capacity. We evaluate this strategy by simulating the maximum cluster-wide throughput that a rack-scale E2 system (i.e., with 24 servers and 24 external ports) could sustain before any component of the system (cores, server links, or switch capacity) is saturated. We compare our solution to two strawmen: "Random", which places nodes on servers at random, and "Packing", which greedily packs nodes onto servers while traversing the iGraph depth-first. We consider two iGraphs: a linear chain with 5 nodes, and a more realistic random graph with 10 nodes.

Figure 9: Maximum cluster throughput (Gbps) with different placement solutions (E2 heuristic, Packing, Random), for two different pGraphs (a linear chain and a randomized graph).

Figure 9 shows that our approach outperforms the strawmen in all cases. We achieve 2.25-2.59x higher throughput compared to random placement; bin-packing does well on a simple chain but achieves only 0.78x of our throughput for more general graphs. Thus we see that our placement heuristic can improve the overall cluster throughput over the baseline bin-packing algorithm.

Finally, we measure the controller's time to compute placements. Our controller implementation takes 14.6 ms to compute an initial placement for a 100-node iGraph and has a response time of 1.76 ms when handling 68 split requests per second (which represents the aggressive case of one split request per core per second). We conclude that a centralized controller is unlikely to be a performance bottleneck in the system.

Updating Service Interconnection. We now look at the time the control plane takes to update interconnection paths. In our experiments, the time to update a single rule in the switch varies between 1-8 ms with an average of 3 ms (the datasheet suggests 1.8 ms as the expected time); the switch API only supports one update at a time. In contrast, the per-rule update latency in E2D is only 20 µs, which can be further amortized with a batch of multiple rules. The relatively long time it takes to update the hardware switch (as compared to the software switch) reinforces our conservative use of switch rule entries in migration avoidance. Reconfiguring the E2D after creating a new replica takes roughly 15 ms, including the time to coordinate with the E2 Manager and to invoke a new instance.

Dynamic Scaling. We start by evaluating migration avoidance for the simple scenario of a single NF instance that splits in two; the NF requires flow-level affinity. We drive the NF with 1 Gbps of input traffic, with 2,000 new flows arriving each second on average and flow length distributions drawn from published measurement studies [32]. This results in a total load of approximately 10,000 active concurrent flows, and hence dynamic scaling (effectively) requires shifting load equivalent to 5,000 flows off the original NF.

Figure 10: Traffic load at the original and new NF instance with migration avoidance; the original NF splits at 2s.

Fig. 10 shows the traffic load on the original and new NF instances over time; migration avoidance is triggered close to the 2 second mark. We see that our prototype is effective at balancing load: once the system converges, the imbalance in traffic load on the two instances is less than 10%.

Figure 11: Latency and bandwidth overheads of migration avoidance (the splitting phase is from 1.8s to 7.4s).

We also look at how active flows are impacted during the process of splitting. Fig. 11 shows the corresponding packet latency and switch bandwidth consumption over time. We see that packet latency and bandwidth consumption increase during the splitting phase (roughly between the two and eight second markers), as would be expected given that we detour traffic through the E2D layer at the original instance. However, this degradation is low: in this experiment, latency increases by less than 10 µs on average, while switch bandwidth increases by 0.5 Gbps in the worst case, for a small period of time; the former overhead is a fixed cost, while the latter depends on the arrival rate of new flows, which is relatively high in our experiment. In summary: migration avoidance balances load evenly (within 10% of ideal) and within a reasonable time frame (shifting load equivalent to roughly 5,000 flows in 5.6 seconds), and does so with minimal impact to active flows (adding less than 10 µs to packet latencies) and highly scalable use of the switch flow table.

We briefly compare to two natural strawmen. An "always migrate" approach, as explored in [41] and used in [21], migrates half the active flows to the new NF instance. This approach achieves an ideal balance of load but is complex⁶ and requires non-trivial code modifications to support surgical migration of per-flow state. In addition, the disruption due to migration is non-trivial: the authors of [41] report taking 5 ms to migrate a single flow, during which time traffic must be "paused"; the authors do not report performance when migrating more than a single flow.

⁶ For example, [41] reroutes traffic to an SDN controller while migration is in progress, while [21] requires a two-phase commit between the

A "never migrate" approach that does not leverage software switches avoids migrating flows by pushing exception filters to the hardware switch. This approach is simple and avoids the overhead of detouring traffic that we incur. However, this approach scales poorly; e.g., running the above experiment with never-migrate resulted in an 80% imbalance while consuming all 2,048 rules on the switch.⁷ Not only was the asymptotic result poor, but convergence was slow because the switch takes over 1 ms to add a single rule and we needed to add close to 2,000 rules.

7.3 E2 Whole-System Performance

To test overall system performance for more realistic NF workloads, we derived a policy graph based on carrier guide-

Figure 12: E2 under dynamic workload. The figure plots CPU usage and the optimal number of cores, and input traffic and switch traffic (Gbps), over time.

Figure 13: Pipeline used for the evaluation.

load. As points of comparison, we also plot the input traffic load (a lower bound on the switch bandwidth usage) and a computed value of the optimal number of cores. We derived the optimal number of cores by summing up the optimal number of NF instances for each NF in the pipeline. To derive the optimal number of NF instances for an NF, we multiply the cycles per unit of traffic that the NF consumes when running in isolation by the total input traffic to the NF, then we divide it by the cycle frequency of a CPU core.
requires non-trivial code modifications to support surgical migration of per-flow state (for example, [41] reroutes traffic to an SDN controller while migration is in progress, while [21] requires a two-phase commit between the controller and switches; the crux of the problem here is the need for close coordination between traffic and state migration). In addition, the disruption due to migration is non-trivial: the authors of [41] report taking 5ms to migrate a single flow, during which time traffic must be paused; the authors do not report performance when migrating more than a single flow.

A never migrate approach that does not leverage software switches avoids migrating flows by pushing exception filters to the hardware switch. This approach is simple and avoids the overhead of detouring traffic that we incur. However, this approach scales poorly; e.g., running the above experiment with never-migrate resulted in an 80% imbalance while consuming all 2,048 rules on the switch (the always migrate prototype in [41] also uses per-flow rules in switches, but this does not appear fundamental to their approach). Not only was the asymptotic result poor, but convergence was slow because the switch takes over 1ms to add a single rule and we needed to add close to 2,000 rules.

7.3 E2 Whole-System Performance

Figure 12: E2 under dynamic workload. (The plot tracks CPU usage, the computed optimal CPU count, input traffic, and switch traffic over time, with axes in CPU cores and throughput (Gbps).)

Figure 13: Pipeline used for the evaluation.

To test overall system performance for more realistic NF workloads, we derived a policy graph based on carrier guidelines [7] and BNG router datasheets [4] with 4 NFs: a NAT, a firewall, an IDS and a VPN, as shown in Figure 13. All NFs are implemented in C over our zero-copy native API.

We use our prototype with the server and switch configuration described earlier. As in prior work on scaling middleboxes [21], we generate traffic to match the flow-size distribution observed in real-world measurements [13].

We begin the experiment with an input load of 7.2 Gbps and the optimal placement of NFs. Over the course of the experiment, we then vary the input load dynamically up to a maximum of 12.3 Gbps and measure the CPU utilization and switch bandwidth used by E2. Figure 12 plots this measured CPU and switch resource consumption under varying input load. As points of comparison, we also plot the input traffic load (a lower bound on the switch bandwidth usage) and a computed value of the optimal number of cores. We derived the optimal number of cores by summing up the optimal number of NF instances for each NF in the pipeline. To derive the optimal number of NF instances for a NF, we multiply the cycles per unit of traffic that the NF consumes when running in isolation by the total input traffic to the NF, then we divide it by the cycle frequency of a CPU core.
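Written out as a formula (the symbols below are ours, introduced only for clarity): if c is the cycle cost per unit of traffic of an NF measured in isolation, T is the total input traffic offered to that NF, and f is the cycle frequency of one CPU core, then

\[
N^{*}_{\mathrm{NF}} \;=\; \frac{c \cdot T}{f},
\qquad
\mathrm{Cores}^{*} \;=\; \sum_{\mathrm{NF}\,\in\,\mathrm{pipeline}} N^{*}_{\mathrm{NF}} .
\]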
We observe that E2's resource consumption (both CPU and switch bandwidth) scales dynamically to track the trend in input load. At its maximum scaling point, the system consumed up to 60 cores, running an iGraph of 56 vertices (i.e., 56 NF instances) and approximately 600 edges. We also observe a gap between actual and optimal resource usage in terms of both CPU cores (22.7% on average) and switch bandwidth (16.4% on average). We note that our measured CPU usage does include the cores dedicated to running SoftNIC (always one per server in our experiments), while these cores are not included in the optimal core count. Thus the overheads that E2 incurs appear reasonable, given that our lower bounds ignore a range of system overheads around forwarding packets between NFs, the NIC and the switch, as well as the overheads of running multiple NFs on a shared core or CPU (cache effects, etc.), and the efficiencies that result from avoiding migration and incremental placement as traffic load varies. Finally, we note that our NFs do not use our bytestream and metadata APIs, which we expect could yield further improvements.

8. Related Work

We elaborate on related efforts beyond the work mentioned inline throughout this paper.

Industry efforts. Within industry, ETSI operates as the standards body for NFV and OPNFV is a nascent open-source effort with many of the same participants. While orchestration figures prominently in their white papers, discussions with participants in these efforts revealed that few demonstrated solutions exist as yet. Components of E2 have been approved as an informative standard at ETSI to fill this role [10].

In terms of academic projects, various systems address individual aspects of E2's functionality, such as load balancing [18, 38], interconnection [17], declarative policies [49], and migration [21, 41], but they do not provide an overall framework and the system optimizations this enables.

End-to-end NF management. Closest to E2 is Stratos [20], an orchestration layer for NFs deployed in clouds. E2 and Stratos share similar goals but differ significantly in their design details. At a high level, we believe these differences stem from E2's decision to deviate from the canonical SDN architecture. In canonical SDN, the virtual switch at servers
offers only the limited processing logic associated with the OpenFlow standard. Instead, E2's control solutions exploit the ability to inject non-standard processing logic on the data path, and this allows us to devise simpler and more scalable solutions for tasks such as service interconnection, overload detection, and dynamic scaling.

NF interconnection. Our use of metadata tags (§3) takes inspiration from prior work on FlowTags [17] but with a few simplifying differences: (1) we do not require metadata to accommodate mangling NFs, such as NAT (a key focus in [17]); and (2) metadata tags are only declared and configured at the time of system initialization, and at runtime the E2 datapath does not require reactive involvement of the SDN controller on a per-flow basis, as in FlowTags. Once again, we believe these differences follow from our decision to embed rich programmability in the E2D layer. Our metadata tags are also inspired by the concept of packet annotations in Click [30] and network service headers as defined by the IETF's Service Function Chaining [45].

Hardware platform. E2's platform of choice is a combination of commodity servers and merchant switch silicon. The ServerSwitch system [33] and some commercial service routers and appliances [5] also combine x86 and switching hardware; however, in these designs, the two are tightly integrated, locking operators into a fixed ratio of switching to compute capacity. E2 is instead based on loose integration: operators can mix-and-match components from different vendors and can reconfigure (even in the field) the amount of switching and compute capacity based on their evolving needs. Greenhalgh et al. [22] describe their vision of a flowstream platform that combines switch and x86 hardware in a manner similar to E2, but we are unaware of a detailed design or implementation based on the proposed platform, nor do they articulate the need for a framework of the form E2 aims to provide.

Data plane components. Multiple recent efforts provide specialized platforms in support of efficient software packet processing. Frameworks such as DPDK [27] and netmap [42] are well established tools for high performance packet I/O over commodity NICs. Other systems address efficient packet transfer between VMs in a single server [19, 25, 26, 43], and still others explore the trade-offs in hosting NFs in processes, containers, or VMs [29, 34]. All these systems address issues that are complementary but orthogonal to E2: they do not address the end-to-end NFV orchestration that E2 does (e.g., placement, scaling), but E2 (and the NFs that E2 supports) can leverage ideas developed in these systems for improved performance.

Dynamic scaling with cross-NF state. Our migration avoidance scheme (§5.4) avoids the complexity of state migration for NFs that operate on state that is easily partitioned or replicated across NF instances, e.g., per-flow counters, per-flow state machines, and forwarding tables. However, we do not address the consistency issues that arise when global or aggregate state is spread across multiple NF instances. This is a difficult problem that is addressed in the Split/Merge [41] and OpenNF [21] systems. These systems require that NF vendors adopt a new programming model [41] or add a non-trivial amount of code to existing NF implementations [21]. To support NFs that adopt the Split/Merge or OpenNF architecture, we could extend the E2 controller to implement their corresponding control mechanisms.

9. Conclusion

In this paper we have presented E2, a framework for NFV packet processing. It provides the operator with a single coherent system for managing NFs, while relieving developers from having to develop per-NF solutions for placement, scaling, fault-tolerance, and other functionality. We hope that an open-source framework such as E2 will enable potential NF vendors to focus on implementing interesting new NFs, while network operators (and the open-source community) can focus on improving the common management framework.

We verified that E2 did not impose undue overheads, and enabled flexible and efficient interconnection of NFs. We also demonstrated that our placement algorithm performed substantially better than random placement and bin-packing, and that our approach to splitting NFs with affinity constraints was superior to the competing approaches.

Acknowledgements

We thank our shepherd, Jeff Mogul, and the anonymous reviewers of the SOSP program committee, for their thoughtful feedback which greatly improved this paper. Ethan Jackson, Kay Ousterhout, and Joshua Reich offered feedback on early drafts of this paper, and Maziar Manesh helped us define our lab environment. We are indebted to Tom Anschutz (AT&T), Wenjing Chu (Dell), Yunsong Lu (Huawei), Christian Maciocco (Intel), Pranav Mehta (Intel), Andrew Randall (Metaswitch), Attila Takacs (Ericsson), Percy Tarapore (AT&T), Martin Taylor (Metaswitch) and Venky Venkatesan (Intel) for discussions on NFV efforts in industry.

Appendix: E2 Policy Language

In E2, the operator writes a collection of policy statements, each of which is represented as a directed acyclic graph which we call a pipelet. Nodes in a pipelet correspond to Network Functions (NFs), which receive packets on inbound edges and output packets to outbound edges. Edges are associated with filters, which we elaborate on shortly.

NFs are instantiated from specific application types, much like objects are instantiated from classes in object-oriented programming. In addition to user-defined NFs, there are predefined ones such as Port, which denotes a port on the switch, Drop, which discards packets, and Tee, which creates multiple copies of a packet.
// First, instantiate NFs from application types.
Proxy p;
NAT n;
FW f;
Port<0-7> int;    // internal customer-facing ports
Port<8-15> ext;   // external WAN-facing ports

// subnet declarations, to simplify filter writing
Address my_net 10.12.30.0/24;      // private IP addr
Address wan_net 131.114.88.92/30;  // public IP addr

pipelet { // outbound traffic
    int [dst port 80] -> p;
    int [!(dst port 80)] -> n;
    p [!(dst ip my_net)] -> n;
    n -> f;
    f [FW.safe && !(dst ip wan_net)] -> ext;
    f [!FW.safe] -> Drop;
}

pipelet { // inbound traffic
    ext -> f;
    f [FW.safe && (dst ip wan_net)] -> n;
    n [(src port 80) && (dst ip my_net)] -> p;
    n [!(src port 80) && (dst ip my_net)] -> int;
    p [dst ip my_net] -> int;
}

Figure 14: An example specification of a policy. (Certain drop actions are omitted for clarity.)

Figure 15: Graphical representation of the pipelets in Figure 14: (a) pipelet for outbound traffic, (b) pipelet for inbound traffic.

RateLimiter r;   // two input ports, vp0 and vp1
Port<0-7> int;   // internal customer-facing ports

pipelet { // outbound traffic
    int [tos 0] -> [prio0] r;
    int [tos 1] -> [prio1] r;
    r -> ...
}

Figure 16: An example of a policy that uses filter_in annotations.
each of which represents a subset of the possible paths for the operator to be agnostic to how the rate-limiter identifies
forward and reverse traffic respectively. Figure 15 shows
prio0 vs. prio1 traffic. For example, if this is done using
the same example in graph form. The first five lines define
the nodes in the graph. Following this are two pipelets, metadata (a native rate limiter), E2 will automatically con-
each defining a graph for a specific traffic class. Within the figure the E2D layer to add the appropriate metadata tags; if
pipelet, we list all the edges forming a policy graph for that instead the rate-limiter offers different input ports for each
traffic class. An edge is described using the simple syntax: priority class, then the E2D layer will simply steer traffic to
src [filter_out] -> (bw) [filter_in] dst; the appropriate vport (with no tags, as in our example).
where all three annotationsfilter out, bw, and filter in E2 merges all pipelets into a single policy graph, termed
are optional. Filters are boolean expressions computed over a pGraph. During this process, E2 checks that each packet
packet header fields, physical/virtual ports, or metadata tags, coming from a node has exactly one destination. This is ver-
and are written in the libpcap-filter syntax [11]. The fil- ified by checking that the filters on every pair of edges com-
ter out annotations specify which packets generated from a ing out from a NF has an empty intersection, and that the
NF should enter the edge and implicitly define the traffic union of all filters attached to a NFs output evaluates to
class; filter in annotations capture requirements on incom- true. If a packet has more than one possible destination, E2
ing traffic that are imposed by the downstream NF (example first attempts to remedy this by adding filters to ambiguous
below) and bw denotes the expected amount of traffic on the edges, specifying the traffic class of the pipelet correspond-
edge, at system bootup. ing to that edge. E.g., simply merging the two pipelets in
Figure 16 shows a partial example of the use of filter in Figure 15 results in ambiguity for packets leaving the NAT.
annotations. Here outbound traffic is passed through a rate- E2 will thus add a filter that limits the edge for the n -> f
limiter with two input vports; high priority traffic must ar- rule to traffic arriving from an internal port. If E2 is unable
rive on one vport and low priority traffic on the other. Thus to automatically resolve ambiguity, it returns an error (this
traffic must be filtered prior to entering the downstream NF; is possible if the operator writes a malformed pipelet); E2
we use filter in annotations to capture such requirements. In also returns an error if a packet coming from a node has no
this example, prio0 and prio1 are NF-specific metadata that, destination.
in this case, are resolved to ports vp0 and vp1 at compile
time based on information in the rate-limiters NF descrip-
tion (see 2). In some sense, these are placeholder metadata
in that they serve as a level of indirection between policy
and mechanism but may not appear at runtime. This allows

References

[1] AT&T Domain 2.0 Vision White Paper. https://www.att.com/Common/about_us/pdf/AT&T%20Domain%202.0%20Vision%20White%20Paper.pdf.
[2] Brocade Vyatta 5400 vRouter. http://www.brocade.com/products/all/network-functions-virtualization/product-details/5400-vrouter/index.page.
[3] Ericsson SE Family. http://www.ericsson.com/ourportfolio/products/se-family.
[4] Evolution of the Broadband Network Gateway. http://resources.alcatel-lucent.com/?cid=157553.
[5] Evolved Packet Core Solution. http://lte.alcatel-lucent.com/locale/en_us/downloads/wp_mobile_core_technical_innovation.pdf.
[6] Intel Ethernet Switch FM6000 Series - Software Defined Networking. http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ethernet-switch-fm6000-sdn-paper.pdf.
[7] Migration to Ethernet-Based Broadband Aggregation. http://www.broadband-forum.org/technical/download/TR-101_Issue-2.pdf.
[8] Network Functions Virtualisation. http://www.etsi.org/technologies-clusters/technologies/nfv.
[9] NFV Proofs of Concept. http://www.etsi.org/technologies-clusters/technologies/nfv/nfv-poc.
[10] REL002: Scalable Architecture for Reliability (work in progress). http://docbox.etsi.org/ISG/NFV/Open/Drafts/.
[11] pcap-filter(7). FreeBSD Man Pages, Jan 2008.
[12] Anderson, C. J., Foster, N., Guha, A., Jeannin, J.-B., Kozen, D., Schlesinger, C., and Walker, D. NetKAT: Semantic Foundations for Networks. In Proc. ACM POPL (2014).
[13] Benson, T., Akella, A., and Maltz, D. Network Traffic Characteristics of Data Centers in the Wild. In Proc. Internet Measurement Conference (2010).
[14] Bosshart, P., Daly, D., Izzard, M., McKeown, N., Rexford, J., Talayco, D., Vahdat, A., Varghese, G., and Walker, D. Programming Protocol-Independent Packet Processors. CoRR abs/1312.1719 (2013).
[15] Bosshart, P., Gibb, G., Kim, H.-S., Varghese, G., McKeown, N., Izzard, M., Mujica, F., and Horowitz, M. Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN. In Proc. ACM SIGCOMM (2013).
[16] Casado, M., Freedman, M. J., Pettit, J., Luo, J., McKeown, N., and Shenker, S. Ethane: Taking Control of the Enterprise. In Proc. ACM SIGCOMM (2007).
[17] Fayazbakhsh, S., Chiang, L., Sekar, V., Yu, M., and Mogul, J. FlowTags: Enforcing Network-Wide Policies in the Face of Dynamic Middlebox Actions. In Proc. USENIX NSDI (2014).
[18] Gandhi, R., Liu, H. H., Hu, Y. C., Lu, G., Padhye, J., Yuan, L., and Zhang, M. Duet: Cloud Scale Load Balancing with Hardware and Software. In Proc. ACM SIGCOMM (2014).
[19] Garzarella, S., Lettieri, G., and Rizzo, L. Virtual Device Passthrough for High Speed VM Networking. In Proc. ANCS (2015).
[20] Gember, A., Krishnamurthy, A., John, S. S., Grandl, R., Gao, X., Anand, A., Benson, T., Akella, A., and Sekar, V. Stratos: A Network-Aware Orchestration Layer for Middleboxes in the Cloud. CoRR abs/1305.0209 (2013).
[21] Gember-Jacobson, A., Viswanathan, R., Prakash, C., Grandl, R., Khalid, J., Das, S., and Akella, A. OpenNF: Enabling Innovation in Network Function Control. In Proc. ACM SIGCOMM (2014).
[22] Greenhalgh, A., Huici, F., Hoerdt, M., Papadimitriou, P., Handley, M., and Mathy, L. Flow Processing and the Rise of Commodity Network Hardware. ACM SIGCOMM Computer Communications Review 39, 2 (2009), 20-26.
[23] Han, S., Jang, K., Panda, A., Palkar, S., Han, D., and Ratnasamy, S. SoftNIC: A Software NIC to Augment Hardware. UCB Technical Report No. UCB/EECS-2015-155 (2015).
[24] Han, S., Jang, K., Park, K., and Moon, S. PacketShader: a GPU-Accelerated Software Router. In Proc. ACM SIGCOMM (2010).
[25] Honda, M., Huici, F., Lettieri, G., and Rizzo, L. mSwitch: A Highly-Scalable, Modular Software Switch. In Proc. SOSR (2015).
[26] Hwang, J., Ramakrishnan, K. K., and Wood, T. NetVM: High Performance and Flexible Networking Using Virtualization on Commodity Platforms. IEEE Transactions on Network and Service Management 12, 1 (2015), 34-47.
[27] Intel Data Plane Development Kit. http://dpdk.org.
[28] Kernighan, B., and Lin, S. An Efficient Heuristic Procedure for Partitioning Graphs. Bell System Technical Journal 49, 2 (February 1970).
[29] Kivity, A., Laor, D., Costa, G., Enberg, P., Har'El, N., Marti, D., and Zolotarov, V. OSv: Optimizing the Operating System for Virtual Machines. In Proc. USENIX ATC (2014).
[30] Kohler, E., Morris, R., Chen, B., Jannotti, J., and Kaashoek, M. F. The Click Modular Router. ACM Transactions on Computer Systems 18, 3 (August 2000), 263-297.
[31] Koponen, T., Amidon, K., Balland, P., Casado, M., Chanda, A., Fulton, B., Ganichev, I., Gross, J., Ingram, P., Jackson, E., Lambeth, A., Lenglet, R., Li, S.-H., Padmanabhan, A., Pettit, J., Pfaff, B., Ramanathan, R., Shenker, S., Shieh, A., Stribling, J., Thakkar, P., Wendlandt, D., Yip, A., and Zhang, R. Network Virtualization in Multi-tenant Datacenters. In Proc. USENIX NSDI (2014).
[32] Lee, D., and Brownlee, N. Passive Measurement of One-way and Two-way Flow Lifetimes. ACM SIGCOMM Computer Communications Review 37, 3 (November 2007).
[33] Lu, G., Guo, C., Li, Y., Zhou, Z., Yuan, T., Wu, H., Xiong, Y., Gao, R., and Zhang, Y. ServerSwitch: A Programmable and High Performance Platform for Data Center Networks. In Proc. USENIX NSDI (2011).
[34] Martins, J., Ahmed, M., Raiciu, C., Olteanu, V., Honda, M., Bifulco, R., and Huici, F. ClickOS and the Art of Network Function Virtualization. In Proc. USENIX NSDI (2014).
[35] McCanne, S., and Jacobson, V. The BSD Packet Filter: A New Architecture for User-level Packet Capture. In Proc. USENIX Winter (1993).
[36] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Turner, J. OpenFlow: Enabling Innovation in Campus Networks. ACM SIGCOMM Computer Communications Review 38, 2 (2008), 69-74.
[37] Monsanto, C., Reich, J., Foster, N., Rexford, J., and Walker, D. Composing Software-Defined Networks. In Proc. USENIX NSDI (2013).
[38] Patel, P., Bansal, D., Yuan, L., Murthy, A., Greenberg, A., Maltz, D. A., Kern, R., Kumar, H., Zikos, M., Wu, H., Kim, C., and Karri, N. Ananta: Cloud Scale Load Balancing. In Proc. ACM SIGCOMM (2013).
[39] Pfaff, B., Pettit, J., Koponen, T., Casado, M., and Shenker, S. Extending Networking into the Virtualization Layer. In Proc. ACM HotNets (2009).
[40] Qazi, Z., Tu, C., Chiang, L., Miao, R., Vyas, S., and Yu, M. SIMPLE-fying Middlebox Policy Enforcement Using SDN. In Proc. ACM SIGCOMM (2013).
[41] Rajagopalan, S., Williams, D., Jamjoom, H., and Warfield, A. Split/Merge: System Support for Elastic Execution in Virtual Middleboxes. In Proc. USENIX NSDI (2013).
[42] Rizzo, L. netmap: A Novel Framework for Fast Packet I/O. In Proc. USENIX ATC (2012).
[43] Rizzo, L., and Lettieri, G. VALE: A Switched Ethernet for Virtual Machines. In Proc. ACM CoNEXT (2012).
[44] Sekar, V., Ratnasamy, S., Reiter, M. K., Egi, N., and Shi, G. The Middlebox Manifesto: Enabling Innovation in Middlebox Deployment. In Proc. ACM HotNets (2011).
[45] Network Service Header. https://tools.ietf.org/html/draft-quinn-nsh-00.
[46] Sherry, J., Gao, P., Basu, S., Panda, A., Krishnamurthy, A., Maciocco, C., Manesh, M., Martins, J., Ratnasamy, S., Rizzo, L., and Shenker, S. Rollback-Recovery for Middleboxes. In Proc. ACM SIGCOMM (2015).
[47] Sherry, J., Hasan, S., Scott, C., Krishnamurthy, A., Ratnasamy, S., and Sekar, V. Making Middleboxes Someone Else's Problem: Network Processing as a Cloud Service. In Proc. ACM SIGCOMM (2012).
[48] Shinde, P., Kaufmann, A., Roscoe, T., and Kaestle, S. We Need to Talk About NICs.
[49] Soulé, R., Basu, S., Kleinberg, R., Sirer, E. G., and Foster, N. Managing the Network with Merlin. In Proc. ACM HotNets (2013).