Mark S. Birrittella, Mark Debbage, Ram Huggahalli, James Kunz, Tom Lovett, Todd Rimmer, Keith D. Underwood,
Robert C. Zak
Data Center Group
Intel Corporation
USA
{Mark.S.Birrittella, Mark.Debbage, Ram.Huggahalli, James.Kunz, Thomas.D.Lovett, Todd.Rimmer, Keith.D.Underwood,
Robert.C.Zak}@intel.com
I. INTRODUCTION
The Intel Omni-Path Architecture (Intel OPA) defines a
next generation fabric with its heritage in the Intel True Scale
[1] product line and the Cray Aries interconnect [2]. The Intel
OPA is designed for the integration of fabric components with
CPU and memory components to enable the low latency, high
bandwidth, dense systems required for the next generation of datacenters. Fabric integration takes advantage of locality
between processing, cache and memory subsystems, and
communication infrastructure to enable more rapid innovation.
Near term enhancements include higher overall bandwidth from
the processor, lower latency, and denser form factor systems.
The first implementation of Intel OPA-based products
focuses on supercomputing as well as the broader high
performance computing environments; however, Intel OPA is
generally applicable to a class of data center level computation
[3] requiring scalable, tightly coupled CPU, memory, and
storage resources. Intel OPA defines a single interoperable OSI
Layers 1 and 2 architecture that can be used to provide
connectivity between elements in energy efficient
[Figure: example fabric showing Partition A with service and compute nodes attached through HFIs to switches and the Fabric Manager]
FLITs are encoded with FLIT[62] set to 1 (i.e. 001). All other
(non-body/non-head/non-tail) FLITs are encoded with FLIT[62]
set to 0 (i.e. 000). Control FLITs are used to orchestrate the
retransmission protocol. Command FLITs are used to return VL
flow control credits [8] back to the transmit side of the link. FP
FLITs and Command FLITs can be mixed within reliable LTPs
for transmission over the link. Control FLITs are sent only in
special null LTPs and are not part of any fabric packet. Idle
FLITs are inserted into the continuously transmitted LTPs when
no FP FLITs are available at the egress port feeding the link.
B. Link Transfer Packets (LTPs)
A single-size Link Transfer Packet holds sixteen FLITs and the associated FLIT type bits for transmission over the link. In addition to the sixteen FLITs, each LTP has a two-bit VL credit sideband channel and a 14-bit LCRC which covers the contents of that individual LTP. There are a total of 128 bytes (16 FLITs) of data payload and 4 bytes of overhead (16 FLIT type bits, 14 LCRC bits, and 2 VL credit bits), resulting in a transfer efficiency of 64/66. There are two types of LTPs. Reliable LTPs contain FP FLITs and VL credit return FLITs and are held in a replay buffer for a period of time that is long enough to guarantee that a lack of a retransmission request indicates the LTP has been received successfully by the peer. Null LTPs do not consume space in the replay buffer and are never retransmitted. They are distinguished by a Control FLIT which specifies a specific operation in the retransmission protocol.
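To make the arithmetic concrete, a minimal C sketch of the LTP layout follows; the struct name, field names, and packing are illustrative assumptions rather than the specification's wire format.

#include <stdint.h>

#define FLITS_PER_LTP 16

/* Illustrative only: field names and ordering are assumptions, not the
 * Intel OPA wire format. The sizes follow the text: 16 FLITs of 64 data
 * bits each, 16 FLIT type bits, a 14-bit LCRC, and 2 VL credit bits. */
struct ltp {
    uint64_t flit_data[FLITS_PER_LTP]; /* 16 x 64 = 1024 payload bits   */
    uint16_t flit_type_bits;           /* one type bit per FLIT          */
    uint16_t lcrc      : 14;           /* LCRC covering this LTP         */
    uint16_t vl_credit : 2;            /* VL credit return sideband      */
};

/* Transfer-efficiency arithmetic from the text: 1024 payload bits out of
 * 1056 bits on the wire, i.e. 128/132 = 64/66. */
_Static_assert(16 * 64 + 16 + 14 + 2 == 1056, "LTP is 1056 bits on the wire");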
C. Interface to the Physical Layer
Each LTP is segmented into link transfer quantities without
regard to FLIT and type bit boundaries for transmission over a 1
to 4 lane wide link. Asymmetric link widths of 1x, 2x, 3x, and
4x are supported. Bit alignment within each lane, lane polarity
adjustment, peer physical lane identification for lane swizzle adjustment, packet framing, and lane de-skew are all
accomplished by the Link Transfer (LT) Layer completely
external to the physical layer. The LT Layer utilizes an
additive/synchronous LFSR-based bit scrambler on the transmit
side and de-scrambler on the receive side. The scrambler is
turned on and synchronized once using a turn-on frame at the
end of the link initialization process. The scrambler provides a
high enough probability of bit transitions such that 64b/66b-type encoding is not required. This enables encoding LTPs at the same link transfer efficiency as basic 64b/66b encoding.
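A minimal C sketch of such an additive scrambler follows; the polynomial (x^58 + x^39 + 1, borrowed from common 64b/66b practice) and the seed are assumptions, since the text does not specify them.

#include <stdint.h>

/* Sketch of an additive (synchronous) LFSR scrambler of the kind described
 * above. Transmitter and receiver run identical free-running LFSRs,
 * synchronized once by the turn-on frame at link initialization, so
 * de-scrambling is the same XOR operation. */
static uint64_t lfsr_state = 0x155555555555555ULL; /* assumed non-zero seed */

static inline uint64_t scrambler_next64(void)
{
    uint64_t out = 0;
    for (int i = 0; i < 64; i++) {
        uint64_t fb = ((lfsr_state >> 57) ^ (lfsr_state >> 38)) & 1;
        lfsr_state = ((lfsr_state << 1) | fb) & ((1ULL << 58) - 1);
        out = (out << 1) | fb;
    }
    return out;
}

/* Scrambling and de-scrambling are both a plain XOR with the LFSR stream. */
static inline uint64_t scramble64(uint64_t word)
{
    return word ^ scrambler_next64();
}

Because the scrambler is free-running rather than self-synchronizing, no extra encoding bits are added to the data stream, which is why the 64/66 transfer efficiency noted above is preserved.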
D. Intel Omni-Path Packet Integrity Protection (PIP)
Intel Omni-Path PIP is used to enhance the reliability of
transmission across the link. When a failed CRC check occurs,
a retry request containing the failed LTP sequence number is
sent to the peer informing it to retransmit the failed LTP and all
subsequent LTPs. The receive side drops all LTPs until the retry
begins, as indicated by a special retry marker null LTP. A replay
buffer is used to provide enough temporary storage to cover the
round trip time of a maximum length cable. The retransmission
protocol uses implied acknowledgments to reduce overhead.
After link initialization, the transmitter sends a special retry
marker null LTP indicating that reliable LTPs will commence
immediately. In response, the peer sends a special one-time
round trip marker null LTP back to inform the transmitter that
the first round trip is complete. As long as a retry request is not
received, LTPs are implicitly acknowledged as having been received successfully once the round trip time has elapsed.
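A simplified transmit-side sketch of this replay and implied-acknowledgment bookkeeping follows; the buffer depth, names, and transmit hook are assumptions for illustration.

#include <stdint.h>
#include <string.h>

/* Reliable LTPs are held in a replay buffer sized for the round trip of
 * the longest supported cable; acknowledgment is implicit (no retry
 * request arrives within a round trip), and a retry request triggers
 * replay from the failed sequence number onward. */
#define REPLAY_SLOTS 64            /* assumed depth covering the round trip */
#define LTP_BYTES    132

static void transmit_ltp(const uint8_t *ltp) { (void)ltp; /* placeholder hook */ }

struct replay_buffer {
    uint8_t  slot[REPLAY_SLOTS][LTP_BYTES];
    uint32_t next_seq;             /* sequence number of the next reliable LTP */
};

static void send_reliable_ltp(struct replay_buffer *rb, const uint8_t *ltp)
{
    memcpy(rb->slot[rb->next_seq % REPLAY_SLOTS], ltp, LTP_BYTES);
    transmit_ltp(ltp);
    rb->next_seq++;                /* implicitly acked one round trip later */
}

/* A retry request names the failed LTP; replay it and all later LTPs.
 * A retry marker null LTP would be sent first so the receiver knows where
 * the replayed stream begins, as described in the text. */
static void handle_retry_request(struct replay_buffer *rb, uint32_t failed_seq)
{
    for (uint32_t s = failed_seq; s != rb->next_seq; s++)
        transmit_ltp(rb->slot[s % REPLAY_SLOTS]);
}

Sizing the replay buffer for the longest supported cable, as the text notes, is what allows the absence of a retry request to serve as the acknowledgment.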
[Figure 2: Example of Traffic Class (TC), Service Level (SL), and Service Class (SC) usage in a topology with credit loop avoidance]
paths, memory and control logic, and with shared ASIC level
infrastructure for power, clock, ASIC reset, low-level
control/management and some shared pins. The maximum
available fabric bandwidth for two HFIs using both directions of
the fabric link is 50 GB/s. The ASIC can be used in a variety of
product configurations including standard form factor PCIe
cards, custom mezzanine card form factors, LAN on
Motherboard (LOM) solutions, and integrated into the processor
package.
The HFI partitions the layer 4 transport function between the
HFI hardware and the host software stack. In addition to typical
Link Layer offloads (e.g. packet formation and CRC
generation), the hardware provides targeted offloads to
accelerate typical per-packet transport layer capabilities and
reduce software overhead. In turn, the software on-load
component is responsible for higher level transport protocol
functionality including maintaining connection state, mapping
message-level transactions to packet sequences, and the end-to-end reliability protocol.
Host software can select between two send mechanisms
provided by the hardware. Short messages use the Programmed
Input/Output (PIO) Send mechanism to minimize latency. A
large PIO send buffer is dynamically partitioned into contexts
that are mapped for direct user mode application access.
Sufficient contexts are provided to support the large core counts
on the next generation of Intel Xeon Phi processors. Each
context provides a memory mapped FIFO that is used to
generate a flow-controlled stream of packets to the link. Host
software writes data directly into the send context to generate
the packet on the fabric.
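A minimal C sketch of the PIO send path follows; the structure, block size, and omitted credit check are assumptions rather than the actual driver interface.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* A send context is a memory-mapped FIFO, and the CPU writes the packet
 * (header plus payload) directly into it, avoiding DMA setup for short
 * messages. */
#define PIO_BLOCK_BYTES 64                 /* assumed write granularity */

struct pio_send_ctx {
    volatile uint8_t *fifo;                /* mmap()ed send context buffer */
    size_t            size;                /* bytes in the mapped FIFO     */
    size_t            head;                /* next free offset             */
};

static void pio_send(struct pio_send_ctx *ctx,
                     const void *hdr, size_t hdr_len,
                     const void *payload, size_t pay_len)
{
    size_t total  = hdr_len + pay_len;
    size_t blocks = (total + PIO_BLOCK_BYTES - 1) / PIO_BLOCK_BYTES;

    /* Real code would first check send credits for flow control. */
    memcpy((void *)(ctx->fifo + ctx->head), hdr, hdr_len);
    memcpy((void *)(ctx->fifo + ctx->head + hdr_len), payload, pay_len);

    /* Advance by whole blocks; the hardware forms and egresses the packet. */
    ctx->head = (ctx->head + blocks * PIO_BLOCK_BYTES) % ctx->size;
}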
For longer messages, typically consisting of multiple
packets, the Send Direct Memory Access (SDMA) mechanism
is used. Each SDMA engine processes an independent
descriptor queue held in host memory, and host software
appends descriptors to these queues. Taking the CPU core out of
the data movement path allows the SDMA mechanism to
achieve higher bandwidth and reduced CPU overhead.
Additionally, there is an Automatic Header Generation (AHG)
feature that allows hardware to generate headers for a stream of
packets based on update information held in the descriptors to
reduce the overhead of descriptor creation. Hardware arbitrates across all the packet sources and for VL resources and
link-level credits, performs packet header integrity checking,
and finally egresses to the link.
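A C sketch of this descriptor-queue model follows; the descriptor layout, the AHG update field, and the ring-management details are illustrative assumptions.

#include <stdint.h>

/* Host software appends descriptors to a per-engine queue in host memory
 * and the engine walks the queue, moving payload without the CPU in the
 * data path. */
struct sdma_desc {
    uint64_t payload_addr;   /* host address of the payload              */
    uint32_t payload_len;
    uint32_t flags;          /* e.g. "use AHG" (assumed encoding)        */
    uint64_t ahg_update;     /* per-packet header delta for Automatic
                                Header Generation (illustrative)          */
};

struct sdma_queue {
    struct sdma_desc *ring;  /* descriptor ring in host memory */
    uint32_t          entries;
    uint32_t          head;  /* consumed by the SDMA engine    */
    uint32_t          tail;  /* produced by host software      */
};

/* Append one descriptor; returns 0 on success, -1 if the ring is full. */
static int sdma_append(struct sdma_queue *q, const struct sdma_desc *d)
{
    uint32_t next = (q->tail + 1) % q->entries;
    if (next == q->head)
        return -1;           /* ring full: engine has not caught up */
    q->ring[q->tail] = *d;
    q->tail = next;          /* a doorbell write would notify the engine */
    return 0;
}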
The receive hardware provides a matching number of
receive contexts to enable packets to be directly delivered to a
user process. A variety of mapping algorithms are provided
including the receive context number being specified directly in
a packet header field, a lookup table based on the Queue Pair
Number (QPN) packet field, and a programmable mapping
based on hashes of arbitrary header fields called Receive Side
Mapping (RSM). The receive side of the HFI provides eager
receive and rendezvous receive capabilities for short and long
messages, respectively. Host software chooses between the
protocols for a given message. The eager protocol provides the lowest latency and highest message rate by delivering packets to FIFO ring buffers in host memory. Payloads are subsequently consumed by host software, typically involving a memory copy to the final destination buffer.
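A small C sketch of this protocol selection follows; the threshold value and function names are assumptions for illustration.

#include <stddef.h>
#include <string.h>

/* Host software picks eager delivery (packets land in per-context FIFO
 * ring buffers and are copied out) for short messages, and a rendezvous
 * exchange for long ones. */
#define EAGER_THRESHOLD_BYTES 8192     /* assumed cutover point */

enum recv_protocol { PROTO_EAGER, PROTO_RENDEZVOUS };

static enum recv_protocol choose_protocol(size_t msg_len)
{
    return (msg_len <= EAGER_THRESHOLD_BYTES) ? PROTO_EAGER
                                              : PROTO_RENDEZVOUS;
}

/* Eager consumption: the payload sits in the receive context's ring
 * buffer and is copied once into the application's final buffer. */
static void eager_consume(void *final_buf, const void *ring_slot, size_t len)
{
    memcpy(final_buf, ring_slot, len);
}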
3) OFA OFI
The communication operations of libfabric are layered on
top of the APIs mentioned in the previous sections. The message
queue and tagged message queue operations are layered on PSM
MQ, and collectives can be synthesized in terms of PSM MQ
operations. Put, get, atomics and other operations designed to
[Figure: switch ASIC block diagram with the management Cport (Port 0, MCU, I2C, EEPROM, MRT), a central crossbar, Mport 0 through Mport 11 (each with a URT, an Mport crossbar, and TX/RX blocks), and a base manager]
2) OFA Verbs
The implementation is partitioned between hardware
capability in the HFI and host software. The hardware
capabilities include the PIO send contexts, SDMA engines and
AHG feature on the send side. Under driver control, some portion of the PIO send contexts is set aside for driver use for Verbs protocols; these are typically used for small messages and for Verbs ACKs. Some or all of the SDMA engines are used for
large message transfers for Verbs transactions. On the receive
side, the incoming Verbs protocol packets are spread across
receive contexts and CPU cores using lower order bits from the
QPN value and a mapping table. Interrupt coalescing is used to
moderate the host interrupt rate without overly delaying
incoming packets that are latency sensitive. These features allow
the Verbs performance to scale as more CPU core resources are
assigned to running the Verbs protocol code.
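A C sketch of the QPN-based spreading follows; the table size and field names are assumptions.

#include <stdint.h>

/* Low-order bits of the Queue Pair Number index a mapping table that
 * selects the receive context (and hence the CPU core) that will process
 * the packet. */
#define QPN_MAP_BITS    6
#define QPN_MAP_ENTRIES (1u << QPN_MAP_BITS)

struct qpn_map {
    uint16_t rcv_context[QPN_MAP_ENTRIES];  /* programmed by the driver */
};

static uint16_t select_rcv_context(const struct qpn_map *map, uint32_t qpn)
{
    /* Use the low-order QPN bits so queue pairs spread evenly across the
       available receive contexts and the cores servicing them. */
    return map->rcv_context[qpn & (QPN_MAP_ENTRIES - 1)];
}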
[Figure: additional switch ASIC labels: PCI-e, External CPU, per-port TX/RX and SERDES blocks, ports P1 through P48]
D. Reliability Implementation
Data integrity and reliability are achieved at multiple levels. The fabric links are protected by a link-level CRC with a link-level retry mechanism. The HFI and switch hardware paths are
protected by error correcting code (ECC) and parity. End-to-end
packet integrity is provided by a 32-bit invariant CRC (ICRC) that is
generated and validated in hardware. Packet sequence numbers
(PSNs) are used to detect missing, duplicated and reordered
packets. The standard 24-bit PSNs are expanded to 31 bits to
give improved robustness for excessive delays in the fabric that
could otherwise potentially lead to sequence number roll-over in
extreme scenarios. Recovery mechanisms are implemented in
host software using time-out and retry techniques.
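A C sketch of modulo-2^31 PSN comparison follows; the helper names and classification policy are assumptions.

#include <stdint.h>
#include <stdbool.h>

/* With a 31-bit PSN, arithmetic is done modulo 2^31 so ordering decisions
 * stay correct across roll-over. */
#define PSN_BITS 31
#define PSN_MASK ((1u << PSN_BITS) - 1)

/* Signed distance from 'a' to 'b' modulo 2^31 (positive if 'b' is ahead). */
static int64_t psn_delta(uint32_t a, uint32_t b)
{
    uint32_t d = (b - a) & PSN_MASK;
    if (d >= (1u << (PSN_BITS - 1)))
        return (int64_t)d - ((int64_t)1 << PSN_BITS);
    return (int64_t)d;
}

static bool psn_in_order(uint32_t expected, uint32_t got)
{
    int64_t delta = psn_delta(expected, got);
    return delta == 0;      /* delta < 0: duplicate/old; delta > 0: gap */
}

static uint32_t psn_next(uint32_t psn)
{
    return (psn + 1) & PSN_MASK;
}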
[Table: Intel True Scale vs. Intel OPA comparison (True Scale / OPA): 10 / 25.78; 32 / 100; 35 / 160; Switch Ports 36 / 48; 42 / 195; 165-175 / 100-110]
REFERENCES