
Infiniband architecture

Specification (Infiniband architecture specification release 1.2, Oct. 5, 2004) available at Infiniband Trade Association (http://www.infinibandta.org)

Potential improvements

Infiniband architecture overview



Components:
Links
Channel adaptors
Switches
Routers

The specification allows an Infiniband wide area network, but Infiniband is mostly adopted as a system/storage area network.

Topology:
Irregular
Regular: fat tree

Link speed:
2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X).
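
The three link widths are multiples of one 2.5 Gbps lane. A minimal sketch of that arithmetic follows. The data-rate column assumes 8b/10b line coding, which the slides do not state, and the function name link_rates is illustrative only.

```python
LANE_SIGNAL_GBPS = 2.5  # signalling rate of a single lane (1X), from the slide


def link_rates(width, encoding_efficiency=0.8):
    """Return (signalling rate, data rate) in Gbps for a link of `width` lanes.

    encoding_efficiency=0.8 assumes 8b/10b line coding; this is an
    assumption of the sketch, not something stated in the slides.
    """
    signal = LANE_SIGNAL_GBPS * width
    return signal, signal * encoding_efficiency


for width in (1, 4, 12):
    signal, data = link_rates(width)
    print(f"{width}X: {signal:g} Gbps signalling, {data:g} Gbps data")
```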

Layers: somewhat similar to TCP/IP


Physical layer

Link layer
Error detection (CRC checksum)
Flow control (credit based)
Switching, virtual lanes (VL)
Forwarding table computed by the subnet manager; routing is not adaptive
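
Credit-based flow control means the sender transmits only when the receiver has advertised free buffer space, so packets are never dropped for lack of buffers. The toy model below sketches the idea for a single virtual lane. It counts credits in whole packets for simplicity, which is an assumption of the sketch rather than the accounting used by the specification, and the class name CreditedLink is hypothetical.

```python
from collections import deque


class CreditedLink:
    """Toy model of one virtual lane with credit-based flow control."""

    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers  # sender's view of free receiver buffers
        self.rx_buffer = deque()         # packets currently held by the receiver

    def send(self, packet):
        """Transmit only if a credit is available; never drop a packet."""
        if self.credits == 0:
            return False                 # no credit: the sender has to wait
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def drain(self):
        """Receiver forwards one packet and hands a credit back to the sender."""
        packet = self.rx_buffer.popleft()
        self.credits += 1
        return packet


link = CreditedLink(receiver_buffers=2)
results = [link.send(p) for p in ("p0", "p1", "p2")]  # third send finds no credit
link.drain()                                          # frees one receiver buffer
results.append(link.send("p2"))                       # retry now succeeds
print(results)
```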

Network layer: routing across subnets.
Not needed in a cluster environment (a single subnet)

Transport layer
Reliable/unreliable, connection/datagram

Verbs: interface between adaptors and OS/Users

Packet format:

Local Route Header (LRH): 8 bytes. Used for local routing by switches within an IBA subnet
Global Route Header (GRH): 40 bytes. Used for routing between subnets
Base Transport Header (BTH): 12 bytes, for IBA transport
Reliable Datagram Extended Transport Header (RDETH): 4 bytes, just for reliable datagram
Datagram Extended Transport Header (DETH): 8 bytes
RDMA Extended Transport Header (RETH): 16 bytes
Atomic, ACK, Atomic ACK, Immediate Data extended transport headers: 4 bytes, optimized for small packets
Invalidate extended transport header
Invariant CRC and variant CRC: the invariant CRC covers the fields that do not change in transit; the variant CRC also covers the fields that can change from hop to hop.
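
The header sizes above determine the per-packet overhead. The sketch below adds them up for a few header combinations. The header sizes are taken from the list; the CRC sizes (4 bytes invariant, 2 bytes variant) are not given in the slides and are stated here as assumptions, and the helper name packet_overhead is illustrative.

```python
# Header sizes in bytes, as listed on the slide.
HEADER_BYTES = {
    "LRH": 8,
    "GRH": 40,
    "BTH": 12,
    "RDETH": 4,
    "DETH": 8,
    "RETH": 16,
}
ICRC_BYTES = 4  # assumption: 32-bit invariant CRC (not stated in the slides)
VCRC_BYTES = 2  # assumption: 16-bit variant CRC (not stated in the slides)


def packet_overhead(headers):
    """Bytes of headers plus both CRCs for a packet carrying `headers`."""
    return sum(HEADER_BYTES[h] for h in headers) + ICRC_BYTES + VCRC_BYTES


local_send = packet_overhead(["LRH", "BTH"])           # SEND inside one subnet
local_rdma = packet_overhead(["LRH", "BTH", "RETH"])   # RDMA request, one subnet
routed_send = packet_overhead(["LRH", "GRH", "BTH"])   # SEND crossing subnets
print(local_send, local_rdma, routed_send)
```

Under these assumptions a SEND inside one subnet carries 26 bytes of overhead, and the GRH adds another 40 bytes whenever a packet crosses subnets.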

Local Route Header:

Switching based on the destination port address (LID)
Multipath switching by allocating multiple LIDs to one port
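
A small sketch of how several LIDs on one port give multiple paths: a switch looks up the destination LID in its forwarding table, so two LIDs that name the same port can be mapped to different output ports. The sketch assumes a port receives 2**lmc consecutive LIDs (the LID Mask Control mechanism, which the slides do not name); the table contents and port numbers are made up for illustration.

```python
def lids_for_port(base_lid, lmc):
    """LIDs assigned to one port: 2**lmc consecutive values from base_lid.

    The lmc parameter is an assumption of this sketch; the slides only
    say that multiple LIDs can be allocated to one port.
    """
    return [base_lid + i for i in range(2 ** lmc)]


# Hypothetical forwarding table of one switch: destination LID -> output port.
forwarding_table = {}
dest_lids = lids_for_port(base_lid=8, lmc=1)  # LIDs 8 and 9 name the same port
forwarding_table[dest_lids[0]] = 3            # path A leaves through port 3
forwarding_table[dest_lids[1]] = 5            # path B leaves through port 5


def forward(dlid):
    """Pick the output port purely from the destination LID."""
    return forwarding_table[dlid]


print(dest_lids, forward(8), forward(9))
```

The sender chooses a path simply by choosing which of the destination's LIDs to put in the LRH; the switch itself stays deterministic.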


GRH: addresses use the same format as IPv6 addresses (16 bytes each)

Base transport header:

Verbs
OS/Users access the adaptor through verbs.
Communication mechanism: Queue Pair (QP)
Supports the four types of service, including reliable connection service.
Each connection takes one QP on each end.
Each QP has a send queue and a receive queue.
Users can post send requests to the send queue and receive requests to the receive queue.
Three types of send operations: SEND, RDMA (WRITE, READ, ATOMIC), MEMORY-BINDING
One receive operation (matching SEND)

Queue Pair:
The result status of an operation (send/receive) is stored in a completion queue.
Send and receive queues can be bound to different completion queues.
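
The relationship between send queue, receive queue and completion queue can be sketched with a toy model. This is not the verbs API: the class and method names below are hypothetical, the process method stands in for the work the adaptor does, and the model assumes a matching receive request has already been posted on the peer.

```python
from collections import deque


class CompletionQueue:
    def __init__(self):
        self.entries = deque()

    def poll(self):
        """Return the oldest completion, or None if nothing has completed."""
        return self.entries.popleft() if self.entries else None


class QueuePair:
    def __init__(self, send_cq, recv_cq):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.send_cq = send_cq  # send and receive queues may use different CQs
        self.recv_cq = recv_cq
        self.peer = None        # remote QP this QP is connected to

    def post_send(self, wr_id, data):
        self.send_queue.append((wr_id, data))

    def post_recv(self, wr_id):
        self.recv_queue.append(wr_id)

    def process(self):
        """Stand-in for the adaptor: deliver each queued SEND to the peer."""
        while self.send_queue:
            wr_id, data = self.send_queue.popleft()
            recv_id = self.peer.recv_queue.popleft()  # matching receive request
            self.peer.recv_cq.entries.append((recv_id, "recv", data))
            self.send_cq.entries.append((wr_id, "send", None))


cq_a, cq_b = CompletionQueue(), CompletionQueue()
a, b = QueuePair(cq_a, cq_a), QueuePair(cq_b, cq_b)
a.peer, b.peer = b, a            # one QP on each end of the connection
b.post_recv(wr_id=100)           # receiver posts its request first
a.post_send(wr_id=1, data=b"hello")
a.process()
send_done, recv_done = cq_a.poll(), cq_b.poll()
print(send_done, recv_done)
```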

Related system level verbs:


Open QP, create completion queue, open HCA, open protection domain, register memory, allocate memory window, etc.

User level verbs:


Post send/receive requests, poll for completion.

To communicate:
Make system calls to set up everything (open QP, bind QP to port, bind completion queues, connect local QP to remote QP, register memory, etc).
Post send/receive requests.
Check completion.

What if a packet arrives before a receive request is posted?


Not specified in the standard.
The right response should be a receiver not ready (RNR) error.
The sender is back-pressured in this case.
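
The receiver-not-ready behaviour described above can be sketched as follows: the receiver answers RNR when no receive request is posted, and the sender backs off and retries. The names Receiver and send_with_retry, the retry limit, and the before_retry hook (standing in for a retry timer) are all assumptions of this sketch.

```python
from collections import deque


class Receiver:
    def __init__(self):
        self.recv_queue = deque()  # posted receive requests (buffer ids)
        self.delivered = []

    def post_recv(self, buffer_id):
        self.recv_queue.append(buffer_id)

    def on_packet(self, data):
        if not self.recv_queue:
            return "RNR"           # no receive request posted yet
        self.delivered.append((self.recv_queue.popleft(), data))
        return "ACK"


def send_with_retry(receiver, data, max_retries, before_retry=None):
    """Resend after each RNR; return the number of attempts used, or None."""
    for attempt in range(1, max_retries + 2):
        if receiver.on_packet(data) == "ACK":
            return attempt
        if before_retry:
            before_retry()         # stands in for waiting out a retry timer
    return None


rx = Receiver()
# The receive request is posted only after the first attempt has failed.
attempts = send_with_retry(rx, b"msg", max_retries=3,
                           before_retry=lambda: rx.post_recv(7))
print(attempts, rx.delivered)
```

Because the sender holds the message until it is acknowledged, a slow receiver delays the sender instead of losing data.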

Infiniband has a perfect software interface (Chien'94 paper):


The network subsystem realizes all user-level functionality.
User-level access to the network interface.
A few machine instructions accomplish the transmission task without involving the OS.
The network supports in-order delivery and fault tolerance.
Buffer management is pushed out to the user.

SilverStorm 9024:
24 ports 4X (10 Gbps) or 8 ports 12X (30 Gbps)
Switch type: cut-through
Switch latency: < 140 ns
Switch bandwidth: 480 Gbps
Forwarding table size: 48K
VL support: 8 + 1 management

SilverStorm 9240:
24 expansion slots, each expansion module with 12 ports 4X or 4 ports 12X (24 x 12 = 288, a 288 by 288 switch)
Switch type: cut-through
Switch latency: < 140 ns to < 420 ns
Switch bandwidth: 5.76 Tbps
Forwarding table size: 48K
VL support: 8 + 1 management
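
The quoted switch bandwidths can be reproduced from the port counts if the figure counts both directions of every port. That full-duplex reading is an assumption of this sketch, not something the slides state; the function name is illustrative.

```python
def aggregate_bandwidth_gbps(ports, port_gbps, full_duplex=True):
    """Total switch bandwidth, counting both directions when full_duplex."""
    return ports * port_gbps * (2 if full_duplex else 1)


bw_9024_4x = aggregate_bandwidth_gbps(24, 10)        # 9024 with 24 ports 4X
bw_9024_12x = aggregate_bandwidth_gbps(8, 30)        # 9024 with 8 ports 12X
bw_9240_4x = aggregate_bandwidth_gbps(24 * 12, 10)   # 9240 with 288 ports 4X
bw_9240_12x = aggregate_bandwidth_gbps(24 * 4, 30)   # 9240 with 96 ports 12X
print(bw_9024_4x, bw_9024_12x, bw_9240_4x, bw_9240_12x)
```

Both configurations of each switch give the same total: 480 Gbps for the 9024 and 5760 Gbps (5.76 Tbps) for the 9240.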

Potential improvements on Infiniband using compiled communication


Improving the internal Infiniband fabric:
Offline routing for static patterns (a static SM for a reduced traffic pattern) can be beneficial for irregular networks.
Simplify the layer architecture by having a direct link model (for known patterns): the header can be simplified, though this may not matter much (Infiniband layers are thin).
Simplify the protection mechanism.
Circuit-switched Infiniband.
A reliable communication protocol is still needed.
Potential benefits can be evaluated by simulation.

Improving the messaging software (software to hardware interface): no chance.

Improving the MPI implementation over Infiniband: similar to our current work on Ethernet
Message scheduling for collective/point-to-point communications based on the network topology.
Exploring NIC features (buffers in the NIC, multicast).
Reducing the number of instructions in a library routine makes sense.
Compiled communication can be used to optimize the MPI library.
Compiled communication can help improve the library implementation (e.g. reducing the number of message copies, early request posting, using RDMA, etc).

One particular project:


Design algorithms for the Infiniband subnet manager
Improve routing performance for the Infiniband subnet manager (SM).
Objective: minimize the maximum channel load for a given traffic pattern
Optimize according to a given pattern: the traffic pattern in an application is usually not all-to-all
Default routing used in IBA SM

For a sparse traffic pattern, the maximum channel load can usually be minimized using the minimum interference principle.
Need to extend minimum interference routing to load-balanced, deadlock-free routing.
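
The objective can be made concrete with a short sketch. It computes the maximum channel load of a set of routes, then picks routes greedily so that each new flow takes the candidate path whose busiest channel is least loaded so far. This greedy rule is only one simple reading of the minimum interference principle, not the subnet manager's algorithm, and it ignores deadlock freedom. The two-leaf, two-spine topology and all names are made up for illustration.

```python
from collections import Counter


def path_channels(nodes):
    """Directed channels (u, v) traversed by a path given as a node list."""
    return list(zip(nodes, nodes[1:]))


def max_channel_load(paths):
    """Largest number of paths that share a single directed channel."""
    load = Counter(ch for p in paths for ch in path_channels(p))
    return max(load.values()) if load else 0


def greedy_min_interference(candidates):
    """candidates: {flow: [candidate node paths]}.

    Flow by flow, pick the path whose busiest channel is least loaded so
    far (ties go to the first candidate listed).
    """
    load = Counter()
    chosen = {}
    for flow, paths in candidates.items():
        best = min(paths, key=lambda p: max(load[ch] for ch in path_channels(p)))
        chosen[flow] = best
        load.update(path_channels(best))
    return chosen, max(load.values())


# Hosts a, b hang off leaf switch L0; hosts c, d off L1; S0 and S1 are spines.
candidates = {
    ("a", "c"): [["a", "L0", "S0", "L1", "c"], ["a", "L0", "S1", "L1", "c"]],
    ("b", "d"): [["b", "L0", "S0", "L1", "d"], ["b", "L0", "S1", "L1", "d"]],
}
naive = [paths[0] for paths in candidates.values()]  # everything through S0
chosen, balanced_load = greedy_min_interference(candidates)
print(max_channel_load(naive), balanced_load)
```

Sending both flows through the same spine gives a maximum channel load of 2, while the greedy choice spreads them over both spines and brings it down to 1.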

The best way to realize the IBA SM is still not clear at this time; we can probably do something here.
Irregular network or fat tree network
