
Intel Multi-core Architecture and Implementations
Benson Inkley
Desktop Processor PAE Manager

Scott Tetrick
Principal Engineer

Intel Architecture Group


March 7, 2006
Session: MATS002

1
Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
• The Intel® processors mentioned may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
• This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design.
• Conroe, Paxville, Merom, Tulsa, Sossaman, Kentsfield and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.
• All dates specified are target dates, are provided for planning purposes only and are subject to change.
• All products, dates, and figures specified are preliminary based on current expectations, provided for planning purposes only, and are subject to change without notice.
• Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
• *Other names and brands are the property of their respective owners.
• Copyright © 2006, Intel Corporation

2
Session Objectives
• Describe the physical implementations of Intel's multi-core
processors on the 2006 and 2007 Intel platforms
• Explain the technical and performance differences between
Multi-Core processors and Hyper-Threading Technology
• Discuss bus traffic behavior on processors with
Hyper-Threading Technology and on Multi-Core processors
• Provide insight into the differences between shared and
independent cache architectures

3
Agenda

• Intel Multi-Core processors for 2006 and 2007
• The difference between Hyper-Threading Technology and
Multi-Core processors
• Bus Traffic Analysis
• Independent and shared cache designs

4
Multi-Core Physical Characteristics
• Two independent execution cores in one processor
• Monolithic and Multi-Chip configurations
– Implementations will vary over time
– Driven by design optimizations and market requirements
– May share L2 cache in monolithic designs

Multi-Chip example: 65 nm Pentium® D processor (900 Sequence)
Monolithic example: 65 nm Conroe

Not representative of relative die sizes

5


2006 65 nm Mobile Multi-Core Processors

Merom

Not representative of relative die sizes

6


All products and dates are preliminary and subject to change without notice.
2006 65 nm Client Multi-Core Processors

[Figure: two dual-core diagrams, each with Core 1 and Core 2. One shows a separate 2 MB L2 cache per core; the other, labeled Conroe, carries the cache labels 4 MB L2 and 2 MB L2.]

Not representative of relative die sizes

7


Intel Server Multi-Core Processors

[Figure: block diagrams of the server processors.
Intel® Xeon® processor DP: two execution cores, each with a 2 MB L2 cache, and a bus interface.
Intel® Xeon® processor MP: two cores, each with an L2 cache, and a bus interface.
Tulsa: two cores, each with an L2 cache, plus a 16 MB L3 cache.
Sossaman: two cores, one 2 MB L2 cache, and a bus interface.
Intel® Itanium® 2 Processor.]

Not representative of relative die sizes

8
Intel Desktop Chipsets

Pentium® D processor: Intel 945 Express Chipset Family
Conroe: Intel 965 Express Chipset Family
Kentsfield: Intel® 975X Express Chipset

Not representative of relative die sizes

9


DP Platform Bus Topology
• Dual Independent Bus maintains 3 electrical loads per bus
• Limiting each bus to three loads improves signal quality at high bus clock speeds

[Figure: Intel® E7520 Chipset platform with four cores, four L2 caches and two bus interfaces; Intel® 5000 Chipset platform with four cores, four caches and four bus interfaces]
10
MP Platform Bus Topology

[Figure: Intel E8500 and E8501 Chipsets connected to four processors; each processor has two cores, an L2 cache per core, a 16 MB L3 cache and a bus interface]

• Supports a total of 4 sockets with multiple cores per processor

11
Agenda

• Intel Multi-Core processors for 2006
• The difference between Hyper-Threading Technology and
Multi-Core processors
• Bus Traffic Analysis
• Independent and shared cache designs

12
Hyper-Threading and Multi-Core Definitions

• HT – Hyper-Threading Technology: 2 threads on the same execution core
– Shares functional execution units
– Shares cache hierarchy
• MC – Multi-core: 2 or more execution cores in the same processor
– Shares system bus and may share caches
– Similar in concept to Multi-Processing
• DP/MP – Dual/Multi-Processing: 2 or more processors in the same system
– May share the system bus
– Multiple sockets multiply execution resources
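
The three definitions above multiply together when counting what software sees. The following is a minimal sketch, not an Intel tool: the `Platform` class and the example configurations are hypothetical and chosen only to mirror the HT, MC and DP/MP terms on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """Hypothetical description of a system in the slide's terms."""
    sockets: int            # DP/MP: processors in the system
    cores_per_socket: int   # MC: execution cores per processor
    threads_per_core: int   # HT: 2 with Hyper-Threading, 1 without

    @property
    def execution_cores(self) -> int:
        # Real execution resources: sockets and cores multiply them.
        return self.sockets * self.cores_per_socket

    @property
    def logical_processors(self) -> int:
        # HT adds threads that share one core's execution units and caches.
        return self.execution_cores * self.threads_per_core


# Illustrative configurations (assumed, for the arithmetic only).
single_core_ht = Platform(sockets=1, cores_per_socket=1, threads_per_core=2)
dual_core_no_ht = Platform(sockets=1, cores_per_socket=2, threads_per_core=1)
dual_core_ht = Platform(sockets=1, cores_per_socket=2, threads_per_core=2)
four_socket_mp = Platform(sockets=4, cores_per_socket=2, threads_per_core=2)
```

Both `single_core_ht` and `dual_core_no_ht` expose two logical processors, but only the second has two sets of execution units, which is the distinction the following slides illustrate.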

13
Pentium® 4 Processor Block Diagram
[Block diagram: System Bus; L2 Cache and Control; BTB & I-TLB; Decoder; Trace Cache, BTB and uCode ROM; Rename/Alloc; uop Queues; Schedulers; FP and Integer register files; execution units (ALUs, Load and Store AGUs, FP move, FP store, FAdd, FMul, MMX, SSE); L1 D-Cache and D-TLB]

14
Pentium® D Processor Block Diagram
90 nm

[Block diagram: two cores on one System Bus. Each core has its own L2 Cache and Control, BTB & I-TLB, Decoder, Trace Cache, BTB, uCode ROM, Rename/Alloc, uop Queues, Schedulers, FP and Integer register files, execution units (ALUs, Load and Store AGUs, FP move, FP store, FAdd, FMul, MMX, SSE), and L1 D-Cache and D-TLB.]

15
65 nm Pentium® D and DP Xeon® Processor Block Diagrams (Presler and Dempsey)

[Block diagram: two cores on one System Bus. Each core has its own L2 Cache and Control, BTB & I-TLB, Decoder, Trace Cache, BTB, uCode ROM, Rename/Alloc, uop Queues, Schedulers, FP and Integer register files, execution units (ALUs, Load and Store AGUs, FP move, FP store, FAdd, FMul, MMX, SSE), and L1 D-Cache and D-TLB.]

16
Xeon® MP Processors
[Block diagram: System Bus and Bus Interface feeding two cores. Each core has its own L2 Cache and Control, BTB & I-TLB, Decoder, Trace Cache, BTB, uCode ROM, Rename/Alloc, uop Queues, Schedulers, FP and Integer register files, execution units (ALUs, Load and Store AGUs, FP move, FP store, FAdd, FMul, MMX, SSE), and L1 D-Cache and D-TLB.]

17
Tulsa Block Diagram
[Block diagram: System Bus, Bus Interface and a 16 MB Shared L3 Cache feeding two cores. Each core has its own L2 Cache and Control, BTB & I-TLB, Decoder, Trace Cache, BTB, uCode ROM, Rename/Alloc, uop Queues, Schedulers, FP and Integer register files, execution units (ALUs, Load and Store AGUs, FP move, FP store, FAdd, FMul, MMX, SSE), and L1 D-Cache and D-TLB.]

18
Merom, Conroe and Woodcrest Block Diagram

[Block diagram: two cores on one System Bus with a single L2 Cache and Control block. Each core has Instruction Fetch and PreDecode, Instruction Queue, uCode ROM, Decode, Rename/Alloc, Reorder Buffer, Retirement Unit, Schedulers, execution units (FPU, two ALUs, Load, Store), and L1 D-Cache and D-TLB.]

19
Pentium® 4 Processor Without HT
Integer Thread

[Diagram: L2 Cache and Control with Integer and Floating Point execution units]

20
Pentium® 4 Processor Without HT
Floating Point Thread

[Diagram: L2 Cache and Control with Integer and Floating Point execution units]

21
Pentium® 4 Processor With HT
Integer and Floating Point Threads

[Diagram: L2 Cache and Control with Integer and Floating Point execution units]

22
Pentium® 4 Processor With HT
Two Identical Floating Point Threads

[Diagram: L2 Cache and Control with Integer and Floating Point execution units]

23
Pentium® 4 Processor With HT
Two Identical Floating Point Threads With Offset Timing

[Diagram: L2 Cache and Control with Integer and Floating Point execution units]

24
Pentium® D Processor
Two Identical Floating Point Threads

[Diagram: two cores, each with its own L2 Cache and Control and Integer and Floating Point execution units]

25
Dual-Core Pentium® Processor Extreme Edition
Supports HT
Multiple Integer and Floating Point Threads

[Diagram: two cores, each with its own L2 Cache and Control and Integer and Floating Point execution units]
26
Four Core Multi-core Processors
Two Floating Point and Two Integer Threads
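
The thread-mix slides above can be approximated with a toy model. This is a rough sketch under stated assumptions, not a description of the real Pentium 4 scheduler: each thread is a string of one-cycle operations, `"I"` for integer and `"F"` for floating point; an HT core has exactly one unit of each kind; and when both threads want the same unit, priority alternates. All function names are hypothetical.

```python
def cycles_shared_core(thread_a, thread_b, delay_b=0):
    """Cycles to finish two threads on ONE core with one I unit and one F unit.

    delay_b holds thread B back by that many cycles (the 'offset timing' case).
    """
    a = list(thread_a)
    b = ["-"] * delay_b + list(thread_b)   # "-" means B has not started yet
    i = j = cycles = 0
    prefer_a = True
    while i < len(a) or j < len(b):
        cycles += 1
        want_a = a[i] if i < len(a) else None
        want_b = b[j] if j < len(b) else None
        conflict = want_a is not None and want_a == want_b
        if conflict:
            # Both threads need the same unit: only one issues this cycle.
            if prefer_a:
                i += 1
            else:
                j += 1
            prefer_a = not prefer_a
        else:
            # Different units (or one thread idle): both make progress.
            if want_a is not None:
                i += 1
            if want_b is not None:
                j += 1
    return cycles


def cycles_two_cores(thread_a, thread_b):
    """Each thread has its own core, so the units never conflict."""
    return max(len(thread_a), len(thread_b))


int_plus_fp = cycles_shared_core("IIII", "FFFF")          # complementary mix
fp_plus_fp = cycles_shared_core("FFFF", "FFFF")           # identical FP threads
mixed_same_start = cycles_shared_core("FFII", "FFII")     # identical, no offset
mixed_offset = cycles_shared_core("FFII", "FFII", delay_b=2)
dual_core_fp = cycles_two_cores("FFFF", "FFFF")
```

In this model the integer-plus-FP pair finishes in 4 cycles, two identical FP threads take 8 on one HT core but 4 on two cores, and offsetting two identical mixed threads trims the run from 7 cycles to 6. The numbers come from the toy model only and say nothing about measured speedups.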

27
Agenda
• Intel Multi-Core processors for 2006
• The difference between Hyper-Threading Technology and
Multi-Core processors
• Bus Traffic Analysis
• Independent and shared cache designs

28
Bus Traffic Data Format
[Chart layout: the title gives the software loop and processor configuration; the x-axis is time in seconds; the y-axis is average data transfers per second]

• Data sample rate: 1000/second
• Maximum bus bandwidth is 6.4 GB/sec on an 800 MHz bus
• SPEC version 1.2 with Intel compiler version 8.1
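
The two bus figures on this slide imply 8 bytes per transfer (6.4 GB/sec divided by 800 million transfers per second). The sketch below makes that arithmetic explicit and shows how sampled transfer counts could be turned into a utilization figure. The function names and the sample values are hypothetical; the 8-byte width is derived from the slide's numbers rather than stated on it.

```python
BUS_TRANSFERS_PER_S = 800_000_000   # "800MHz bus" from the slide
BYTES_PER_TRANSFER = 8              # assumption: 6.4 GB/s / 800 MT/s


def peak_bandwidth_bytes_per_s(transfers_per_s=BUS_TRANSFERS_PER_S,
                               bytes_per_transfer=BYTES_PER_TRANSFER):
    """Peak bus bandwidth in bytes per second."""
    return transfers_per_s * bytes_per_transfer


def bus_utilization(samples_transfers_per_s,
                    peak_transfers_per_s=BUS_TRANSFERS_PER_S):
    """Average of the sampled transfer rates as a fraction of the peak rate."""
    average = sum(samples_transfers_per_s) / len(samples_transfers_per_s)
    return average / peak_transfers_per_s


peak = peak_bandwidth_bytes_per_s()
# Made-up samples, one per sampling interval, in transfers per second.
example_utilization = bus_utilization([200_000_000, 600_000_000])
```

Here `peak` is 6,400,000,000 bytes per second, matching the 6.4 GB/sec on the slide, and the two made-up samples average to half of the peak transfer rate.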

29
Bus Traffic gzip Single Core
• Gzip is a small program that fits within the 2 MB cache
• Bus utilization is generally low with bursts of high use

[Chart: 4 gzip threads on Pentium 4 Processor without HT]
[Chart: 4 gzip threads on Pentium 4 Processor with HT]


30
Bus Traffic gzip Dual Core
• Substantial reduction in program execution time when moving from
single core to dual core

[Chart: 4 gzip threads on Pentium D Processor without HT]
[Chart: 4 gzip threads on Pentium Processor EE with HT]


31
Bus Traffic swim Single and Dual Core
• Swim is a bus-intensive program and is memory limited
• Substantially higher memory speed would be needed to fully utilize the bus

[Chart: 4 Swim threads on Pentium 4 Processor without HT]
[Chart: 4 Swim threads on Pentium D Processor without HT]


32
Memory Ranks
• 1024 Pages per Bank with 4 Banks per Rank (4096 Pages per Rank)
• Can have up to 4 open Pages per Rank

[Figure: the Chipset sits between the System Bus and the Memory Bus, which connects to Rank 1 through Rank 4; each rank is drawn with four open pages]
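
The page arithmetic on this slide can be written out as a short sketch. It assumes one open page per bank, which is what "4 Banks per Rank" and "up to 4 open Pages per Rank" imply together; the function names are hypothetical.

```python
PAGES_PER_BANK = 1024   # from the slide
BANKS_PER_RANK = 4      # from the slide


def pages_per_rank():
    """Total pages addressable in one rank."""
    return PAGES_PER_BANK * BANKS_PER_RANK


def max_open_pages(ranks):
    """Upper bound on simultaneously open pages, assuming one per bank."""
    return ranks * BANKS_PER_RANK


one_rank_open = max_open_pages(1)
four_rank_open = max_open_pages(4)
```

With the same total memory, a four-rank configuration can keep up to 16 pages open against 4 for a single rank, which is one way to read the rank comparisons on the slides that follow.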

33
Memory Influence on Data Speed
• Memory configuration influences data transfer rate
• Same amount of memory, but different Rank configurations
• 3.2 GHz Pentium® Extreme Edition dual-core processor

34
Memory Influence on Data Speed
• Faster memory and more ranks improve performance

35
Memory Influence on Data Speed
• Memory cannot fully utilize 6.4 GB/sec bus bandwidth
• 4-rank DDR2-667 memory at 800 MHz

36
Intel® Smart Cache
• Enables access to full cache size when only one core is active
• Dynamically allocates cache space between cores
• Minimizes bus traffic by allowing both cores to access a single copy of
data

[Figure: Intel Core™ Duo with Core 1 and Core 2 sharing a 2 MB L2 Cache]

37
Independent vs. Shared Cache Designs
• Independent caches transfer data via the Bus
• Shared L2 cache enables single copy of data to be used by each
execution core
• Shared cache designs will directly transfer data between L2 and L1
caches

[Figure: Independent Caches: Core 0 and Core 1, each with its own L1 and L2, connected to the MCH and memory. Shared Caches: Core 0 and Core 1, each with its own L1, sharing one L2 through L2 Cache Control, connected to the MCH and memory.]

38
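
The bus-traffic difference between the two designs can be illustrated with a deliberately simple model. It is a sketch under loud assumptions: reads only, unlimited cache capacity, no evictions, and no coherence protocol, so it counts nothing more than how many times a line has to cross the bus. The function name and the access trace are hypothetical.

```python
def bus_transfers(reads, shared_l2):
    """Count bus fetches for a sequence of (core, line) reads.

    shared_l2=True  -> both cores look up one common L2.
    shared_l2=False -> each core has a private L2 and must fetch
                       its own copy over the bus.
    """
    transfers = 0
    caches = {}  # cache id -> set of lines currently held
    for core, line in reads:
        cache_id = "shared" if shared_l2 else core
        held = caches.setdefault(cache_id, set())
        if line not in held:
            held.add(line)      # the miss is filled over the bus
            transfers += 1
    return transfers


# Both cores read the same two lines.
trace = [(0, "a"), (1, "a"), (0, "b"), (1, "b")]
independent = bus_transfers(trace, shared_l2=False)
shared = bus_transfers(trace, shared_l2=True)
```

For this trace the independent design moves each line across the bus twice (4 transfers), while the shared L2 fetches each line once (2 transfers), which is the "single copy of data" point made in the bullets.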
4 MB Shared vs 2 x 2 MB Independent Cache
• Conroe's bus bandwidth is higher, but for only about ½ as long
• Work completed in less time with plenty of bus headroom

SPEC “vpr” subroutine, Intel 975 Chipset
[Chart: Presler with both cores active]
[Chart: Presler with one core active and one core idle]
[Chart: Conroe with both cores active]
[Chart: Conroe with one core active and one core idle]

39
4 MB Shared vs 2 x 2 MB Independent Cache
• 4 MB shared cache on Conroe dramatically reduces bus utilization
• Memory technology is the limiter

SPEC “galgel” subroutine, Intel 975 Chipset
[Chart: Presler with both cores active]
[Chart: Presler with one core active and one core idle]
[Chart: Conroe with both cores active]
[Chart: Conroe with one core active and one core idle]

40
Influence of Timing on Performance
• Scheduling of threads on a multithreaded processor can
influence performance
• Performance of four identical threads can be improved by
offsetting the start times

[Figure: timeline of Thread 1 through Thread 4 against Time, each thread starting later than the one before]
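
Offsetting start times is easy to try in software. The following is a minimal sketch of one way to stagger identical worker threads; the function names, the fixed per-thread delay, and the placeholder workload are all assumptions, and the delay value would have to be tuned by measurement. In CPython, CPU-bound work would need processes rather than threads to run in parallel, so this only demonstrates the staggering itself.

```python
import threading
import time


def staggered_offsets(n_threads, delay_s):
    """Start offset in seconds for each thread: 0, d, 2d, ..."""
    return [k * delay_s for k in range(n_threads)]


def run_staggered(work, n_threads, delay_s):
    """Run work(k) on n_threads threads, each starting delay_s after the last."""
    results = [None] * n_threads

    def runner(k, offset):
        time.sleep(offset)          # hold this thread back by its offset
        results[k] = work(k)

    threads = [
        threading.Thread(target=runner, args=(k, offset))
        for k, offset in enumerate(staggered_offsets(n_threads, delay_s))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Placeholder workload: each "identical thread" just squares its index.
squares = run_staggered(lambda k: k * k, n_threads=4, delay_s=0.01)
```

The idea is that identical threads started together tend to hit their memory-heavy phases at the same moment; shifting the starts spreads those phases out.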

41
Influence of Timing on Performance

[Chart: x-axis is thread delay in seconds]


42
Conclusions
• Processor bus architecture has plenty of bandwidth to support
multi-core processors
• Memory speed cannot keep up with bus capabilities
• Multiple ways to work around memory speed limitations
– Write applications to be multi-threaded
– Shift start times of identical threads
– Increase cache size
– Use multiple smaller DIMMs rather than one large DIMM

43
Summary
• Multi-core processors provide multiple execution cores in a
single processor package
• Larger caches and shared caches improve performance by
reducing latency to frequently used data
• Choose memory implementation to maximize data transfers
• Today’s bus architecture is a high speed interface with plenty
of bandwidth for multi-core processors

44
Please fill out the Session Evaluation Form.

Chalk Talk: Wednesday 5:00 – 5:50, Room 2001B
Intel® Core™ Microarchitecture Class: Wednesday 2:00 – 3:50, Room 2011
Multi-threading Development Methodologies: Wednesday 4:00 – 4:50, Room 2011

Thank You!

45
