ABSTRACT: The architecture of the Intel QuickPath Interconnect (Intel QPI), like many components in a computer system, is composed of a number of layers. Each of these layers performs a distinct function in controlling the logical operation of the Intel QPI. This article introduces the four architectural layers of the Intel QPI (the physical, link, routing, and protocol layers) and then introduces other aspects of the architecture, including performance and reliability, as well as some initial design considerations for a system using Intel QPI.
cache structures that perform the actual instruction set execution. The connection between two Intel QPI devices is termed a link and is composed of a set of unidirectional signals transmitted by one device and received by the other. The individual signals within a link are called lanes. Multiple lanes operating in parallel form the link. A full bidirectional communication pathway between two devices uses two links working in concert with each other. This is sometimes referred to as a link pair.
One major characteristic of the protocol layer is that it deals with messages across multiple links, involving multiple agents in multiple devices. Lower layers typically deal only with two directly connected devices.
Figure 1
Figure 1 also introduces three terms that identify the granularity of the information being exchanged between the layers. These terms are phits, flits, and packets. These terms, and each layer's functions, will be examined in a little more detail in the following sections. Note that this layered architecture allows for a great deal of implementation flexibility and future growth, all within the scope of Intel QPI. The modular approach allows for extensions to be accommodated in an incremental manner. Therefore, in later sections of this text when we talk of certain characteristics of Intel QPI, we are pointing out things that usually reflect current implementations. Those should not be construed as limitations on the overall scope of the Intel QuickPath Architecture. One prime example is the physical layer descriptions. Today a certain set of operational speeds or interconnect channel characteristics may be supported. That does not prohibit Intel QPI from evolving to different speeds, communication channels, or even different types of messages.
Figure 2
With the currently defined physical layer, eighty-four pins are used to carry all the signals of one Intel QPI link operating at its full width. In some applications, the link can also operate at half or quarter width in order to reduce power consumption or work around failures. The unit of information transferred in each unit of time by the physical layer is termed a phit, which is short for physical unit. In the example shown in Figure 2, each phit would contain 20 bits of information. Typical signaling speeds of the link in current products call for operation at 6.4 GT/s for systems with short traces between components, and 4.8 GT/s for the longer traces found in large multiprocessor systems.
The physical layer is divided into two sections. The analog or electrical section manages the transmission of the digital data on the traces. This section drives the appropriate signal levels with the proper timing relative to the clock signal, then recovers the data at the other end and converts it back into digital data. The logical portion of the physical layer interfaces with the link layer and manages the flow of information back and forth between them. It also handles initialization and training of the link and manages the width of operation.
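As a rough illustration of the figures above, the raw signaling rate of one link direction is simply the phit width multiplied by the transfer rate. The sketch below assumes that half- and quarter-width operation carry 10 and 5 bits per phit (a straightforward scaling of the 20-bit full width); the function name is illustrative, and the result is a raw rate that ignores any overhead added by the layers above.

```python
def raw_link_rate_gbps(bits_per_phit: int, transfer_rate_gt_s: float) -> float:
    """Raw rate of one link direction in Gbit/s: one phit per transfer."""
    return bits_per_phit * transfer_rate_gt_s


# Assumed phit widths for the three operating widths (full width is from the text).
WIDTHS = {"full": 20, "half": 10, "quarter": 5}

if __name__ == "__main__":
    for rate in (6.4, 4.8):  # GT/s values quoted in the text
        for name, bits in WIDTHS.items():
            gbps = raw_link_rate_gbps(bits, rate)
            print(f"{rate} GT/s, {name:7s} width: {gbps:6.1f} Gbit/s raw")
```

Because a link pair has one link in each direction, the raw rate for bidirectional traffic is twice the per-direction figure.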
Figure 3
The link layer handles the flow of data and ensures that only as much data is sent to the link as the receiving agent can accept without overruns. It uses a credit exchange mechanism in which the two agents on a link exchange information about the number of buffers (and hence credits) they support. The link layer counts down the credits for each data entity it sends, ensuring it will not overrun the receiver. The receiver returns credits to the sender as it frees up its buffers.
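The credit mechanism described above can be sketched in a few lines. This is a minimal model under simplifying assumptions, not the Intel QPI implementation: a single credit pool, one credit per data entity, and hypothetical class and method names.

```python
class CreditedSender:
    """Sender side of a credit-based flow control scheme (illustrative only)."""

    def __init__(self, receiver_buffers: int) -> None:
        # The receiver advertises its buffer count when the link is set up.
        self.credits = receiver_buffers

    def try_send(self) -> bool:
        """Consume one credit and send, or refuse if the receiver is full."""
        if self.credits == 0:
            return False  # sending now would overrun the receiver
        self.credits -= 1
        return True

    def credit_returned(self, count: int = 1) -> None:
        """Called when the receiver reports that it has freed buffers."""
        self.credits += count


if __name__ == "__main__":
    sender = CreditedSender(receiver_buffers=2)
    print([sender.try_send() for _ in range(3)])  # third attempt is refused
    sender.credit_returned()
    print(sender.try_send())
```

The key property is that the sender never needs to ask whether the receiver has room: it can tell from its own credit count.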
Copyright 2009 Intel Corporation
The link layer also ensures that data is transferred reliably across the link. Every flit received by the link layer is checked for errors, and groups of flits are acknowledged if they are free of errors. Otherwise the receiving link layer requests retransmission of the flit with errors and of all subsequent flits that may have been transmitted. Both short transient errors and burst errors that affect several flits can be corrected by this means. Moreover, the order of transmission of the flits is maintained.
The link layer abstracts the physical link of Intel QPI into a set of message classes that operate independently of each other. The message classes are very similar to the different types of mail handled by the post office. You can request the post office to send a letter as ordinary first class mail, or registered mail, or even as express mail. You may also have different types of things to send, a letter and a large package for example, and you may wish to use different delivery mechanisms for the different types. Each of these classes of mail is handled independently of the others, very much like the message classes of the Intel QPI link layer. We will briefly introduce the message classes here without going into more detail on their operation and usage. The six message classes are: Home (HOM), Data Response (DRS), Non-Data Response (NDR), Snoop (SNP), Non-Coherent Standard (NCS), and Non-Coherent Bypass (NCB).
The link layer extends the notion of message classes to another level. One collection of the six message classes is called a virtual network. Intel QPI supports up to three independent virtual networks in a system. These are labeled VN0, VN1, and VNA. A basic one- or two-processor system can be implemented with just two networks, and typically VN0 and VNA are used. All three networks are typically used in multiple processor systems, which use the extra networks to manage traffic loading across the links, avoid deadlocks, and work around link failures.
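The link-layer retransmission behavior described above (resend the flit with errors and everything sent after it, preserving order) can be sketched as a small simulation. This is an illustrative model only: the function name, the `in_flight` parameter, and the rule that a flit is damaged only on its first attempt are all assumptions made for the example, not details of Intel QPI.

```python
def transmit(flits, corrupt_on_first_try, in_flight=1):
    """Deliver flits in order, replaying from the first damaged flit.

    corrupt_on_first_try: indices whose first transmission arrives damaged.
    in_flight: later flits already on the wire when the error is detected
               (the receiver discards them, so they must be sent again).
    Returns (delivered, log), where log lists every index put on the wire.
    """
    delivered, log = [], []
    already_failed = set()
    next_expected = 0  # receiver side: index of the next flit it will accept
    while next_expected < len(flits):
        for i in range(next_expected, len(flits)):
            log.append(i)
            if i in corrupt_on_first_try and i not in already_failed:
                already_failed.add(i)
                # Flits behind the damaged one were sent but are discarded.
                log.extend(range(i + 1, min(i + 1 + in_flight, len(flits))))
                break  # receiver requests a retry starting at flit i
            delivered.append(flits[i])
            next_expected = i + 1
    return delivered, log


if __name__ == "__main__":
    data, wire = transmit(["a", "b", "c", "d"], corrupt_on_first_try={1})
    print(data)  # flits arrive complete and in their original order
    print(wire)  # indices 1 and 2 appear twice: once discarded, once accepted
```

The log makes the cost of an error visible: the damaged flit and the flits already behind it are each sent twice, but the layers above still see a complete, ordered stream.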
The routing tables are set up by the firmware when the system is first powered up. Typical small systems will usually run with these values unchanged. Multiprocessor systems typically have more elaborate routing tables that contain information about alternative paths to reach the same destination. These can be used to help redirect traffic around a link that is heavily loaded. Fault-resilient multiprocessor systems can also use this information to work around failures in one or more links. The routing layer can also help to partition and reconfigure multiprocessor systems into several smaller systems that logically operate independently of each other while sharing some of the same physical resources. Intel offers Xeon and Itanium processors for high reliability servers that implement several of these features in their routing layers.
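A routing table with alternative paths can be pictured as a map from each destination to an ordered list of candidate output ports. The sketch below is a simplified model under that assumption; the table contents, port names, and `pick_port` function are hypothetical and do not reflect any actual firmware format.

```python
# Hypothetical table: destination -> output ports in order of preference.
ROUTES = {
    "cpu2": ["port0", "port1"],  # primary path plus one alternative
    "cpu3": ["port1"],           # only one path available
}


def pick_port(dest, routes, unavailable=frozenset()):
    """Return the first usable port toward dest, or None if there is none.

    unavailable: ports to avoid, for example because the link has failed
    or is heavily loaded.
    """
    for port in routes.get(dest, []):
        if port not in unavailable:
            return port
    return None


if __name__ == "__main__":
    print(pick_port("cpu2", ROUTES))                         # primary path
    print(pick_port("cpu2", ROUTES, unavailable={"port0"}))  # alternative path
    print(pick_port("cpu3", ROUTES, unavailable={"port1"}))  # no path left
```

In a small system each list would hold a single entry, which matches the observation that such systems usually run with the firmware defaults unchanged.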
The team decides that time is of the essence and they should come up with a mechanism that lets them get at the pages of interest as quickly as possible. So when Robert decides he needs page eleven of the document, he asks Mary for a copy from the central repository. However, he simultaneously sends messages to each of his peers to see if they might have copies of the page. If any one of them has a Modified copy, that person will send it to Robert and let Mary know she or he has done so. So Robert gets the page in two messages, or hops in Intel QPI terms. This is the source snooped method of managing cache coherence, as the source of the request for data also sends snoop messages to all peer caching agents.
This is a good time to revisit the Forward (F) state. In the example above Robert requests page eleven from Mary and lets each peer know that he is looking for that page. Normally, if any one of Robert's peers had the page in the Shared state, they would let Mary know of that fact and Mary would send a copy of the page to Robert. This would take a while, as Mary has to go through the entire repository to find the page, not unlike the long access time to read the contents of DRAM in a computer system. However, suppose Patty does have a copy: she can send it directly to Robert, as she can very quickly look for it in her small stack of cached pages. If both Patty and Tom had copies of page eleven, one of them would be marked as the designated Forwarder so that Robert would not get two copies of the same page. The Forward (F) state is used to designate one of the caching agents (among those with data in the Shared state) as the one responsible for forwarding data on to the next requestor. This reduces the time to get data to the destination and therefore improves system performance.
The team can use an alternate method where Mary, at the repository of all the pages, is responsible for checking with all the others for any cached copies of pages.
So when Robert requests a page from Mary, she first sends snoop messages to all of Robert's peers to see if any of them have a copy of page eleven. If none of them do, they let Mary know of that fact by their own messages and Mary can send the page to Robert from her central repository. However, if Janice had a copy of the page and she had changed it, Janice would send it directly to Robert and let Mary know she has done so. Janice would also let Robert know that she has kept a copy of the page so that Robert can mark his copy of the page as Shared. In this example Mary had to initiate all the snoops, meaning she was using a home snooped approach, and the page reached Robert after three hops.
Let us extend our example a bit and introduce the notion of a directory. Tom is called on to travel on a business trip and he takes his copies of the pages of the contract with him. When Robert requests page eleven from Mary, Mary has to ask all the others for the status of their caches regarding page eleven, just as before. However, contacting Tom takes a long time and this is becoming a
bottleneck for the whole team. So Mary decides that she is going to keep a list of what pages she has handed out to whom, in effect creating a directory of the state of sharing of pages of the document. So when Robert requests page eleven, Mary checks her directory and immediately knows whom to contact about that page. If Mary's directory indicates that Tom does not have a copy of page eleven, she can give Robert the copy of the page and let him proceed with his work. The directory is an excellent way to reduce the number of snoops that have to be sent across the system. A directory can also be used to manage traffic to agents with long latencies, such as those in remote nodes in large systems with multiple nodes. Intel QPI provides all the rules necessary to create such systems and have them operate effectively. Our examples above lay out the basic mechanisms of cache coherence management in Intel QPI.
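The contract-team examples above can be summarized in a short sketch. It records the message sequences for the source snooped and home snooped cases as described in the text (two and three hops respectively, when a peer holds a Modified copy) and shows how a directory narrows the set of peers the home agent must snoop. The data structures and the `peers_to_snoop` function are illustrative assumptions, not part of the Intel QPI protocol definition.

```python
# Hop sequences from the examples, for the case where a peer has a Modified copy.
SOURCE_SNOOP_HOPS = [
    "requester -> home and all peers (request plus snoops, sent together)",
    "peer with Modified copy -> requester (data)",
]
HOME_SNOOP_HOPS = [
    "requester -> home (request)",
    "home -> all peers (snoops)",
    "peer with Modified copy -> requester (data)",
]


def peers_to_snoop(page, requester, agents, directory=None):
    """Return the agents the home must snoop for `page`.

    directory: optional map of page -> set of agents holding a copy.
    Without a directory, every agent except the requester is snooped.
    """
    others = [a for a in agents if a != requester]
    if directory is None:
        return others
    holders = directory.get(page, set())
    return [a for a in others if a in holders]


if __name__ == "__main__":
    team = ["Robert", "Patty", "Tom", "Janice"]
    print(len(SOURCE_SNOOP_HOPS), len(HOME_SNOOP_HOPS))
    print(peers_to_snoop(11, "Robert", team))                  # snoop everyone
    print(peers_to_snoop(11, "Robert", team, {11: {"Janice"}}))  # only Janice
    print(peers_to_snoop(11, "Robert", team, {}))              # nobody to ask
```

With the directory in place, a traveling team member such as Tom is contacted only when the directory says he actually holds the page in question.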
query them in order to cut down on traffic for cache coherence resolution. Product designers can choose from a wide range of snoop filter or directory mechanisms to help reduce the traffic across the links. The Intel QPI coherence protocol provides considerable flexibility in this area. Larger system configurations with tens of processors can also be readily built in a hierarchical manner. Groups of two to eight processors and a node controller can be connected to other such nodes, where the node controllers are tied to each other over a set of links. Such a system configuration of two tiers of interconnect is shown in Figure 1. The second level of interconnect between node controllers may use Intel QPI; that is a platform architectural decision. Alternatively, an existing protocol can be used between nodes while Intel QPI is used for connectivity within the node. Such systems can take advantage of relatively inexpensive mass-produced processors to build massively parallel computing systems that can take on very large computing problems that require access to shared memory. Such node controllers can also incorporate sophisticated traffic filtering algorithms, taking advantage of the rich and flexible protocol of Intel QPI.
systems based on Intel QPI, traffic travels across multiple links to multiple memory controllers all working in loose concert with each other. Thus anyone modeling system operational behavior for performance analysis purposes must understand the nature of the interactions between the various elements. As an example, two processors in a multiple link system can issue transactions to different memory controllers down two different links. Although each of these transactions can be considered to be essentially independent of the other, they may share resources within the processors and create snoop traffic on adjacent links. Similarly, trying to observe the transactions of a single processor for the purpose of tracking down an anomaly requires integrating information from several locations in the system to create a complete picture. The high speed signaling technology of the Intel QPI links cannot be observed by a direct electrical connection to the signals. Intel, working with logic analyzer vendors, has created solutions to provide a way to observe these signals directly or indirectly. Several such probes are required to observe the transactions across many links while debugging a multiple processor system. The user then integrates information from all these sources to get a complete picture of the transactions in the system. All this is a significant departure from the methods used in FSB systems where one can observe all the system traffic on one logic analyzer connected to the bus.
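Integrating captures from several probes, as described above, amounts to merging per-link event streams into one timeline. The sketch below assumes each probe produces records of the form (timestamp, link name, message), already sorted by time and referenced to a common time base; the record layout and function name are hypothetical and do not correspond to any particular logic analyzer's output.

```python
import heapq


def merge_captures(*captures):
    """Merge per-link captures (each sorted by timestamp) into one timeline."""
    return list(heapq.merge(*captures, key=lambda record: record[0]))


if __name__ == "__main__":
    # Hypothetical records: (timestamp in ns, link, observed message).
    link_a = [(10, "A", "request"), (30, "A", "data")]
    link_b = [(20, "B", "snoop"), (25, "B", "snoop response")]
    for record in merge_captures(link_a, link_b):
        print(record)
```

In practice, aligning the time bases of the different probes is likely to be the harder part; the merge itself is straightforward once timestamps are comparable.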
Summary
With its high-bandwidth, low-latency characteristics, the Intel QuickPath Interconnect advances the processor bus evolution, unlocking the potential of next-generation microprocessors. With all these features and this level of performance, it's no surprise that numerous vendors are designing innovative products around this interconnect.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Publisher, Intel Press, Intel Corporation, 2111 NE 25 Avenue, JF3-330, Hillsboro, OR 97124-5961. E-mail: intelpress@intel.com 20090428