Microprocessor - Study
smpark@chonnam.chonnam.ac.kr
1-2 Computer Architecture Lab.
Contents
1. Review
2. [title illegible in source]
x86 Family
SPARC
MIPS
Alpha
ARM
PowerPC
Review: Instruction Level Parallelism
High speed execution based on Instruction Level Parallelism (ILP): potential of short instruction sequences to execute in parallel
High-speed microprocessors exploit ILP by:
pipelined execution: overlap instructions
superscalar execution: issue and execute multiple instructions per clock cycle
Out-of-order execution
Memory accesses for high-speed microprocessors
Data Cache, possibly multiported, multiple levels
Pipeline
Superscalar
Superpipeline
Vector Processing
VLIW
Dynamic Scheduling
Branch Prediction
Memory Access
Pipeline Concept
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing diagram against Clk, cycles 1 to 10, comparing three implementations]
Single Cycle Implementation: Load, then Store; the unused part of the long cycle is marked "Waste"
Multiple Cycle Implementation: Load (Ifetch, Reg, Exec, Mem, Wr), Store (Ifetch, Reg, Exec, Mem), R-type (Ifetch, ...)
Pipeline Implementation: Load, Store and R-type each pass through Ifetch, Reg, Exec, Mem, Wr, overlapped in time
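As a rough sketch of the comparison in the figure, the following Python snippet counts clock cycles for a Load, Store, R-type sequence under the multiple cycle and pipelined implementations. The stage counts per instruction class (Load 5, Store 4, R-type 4) follow the common textbook datapath and are an assumption here, as are the function names.

```python
# Stage counts per instruction class in the multiple cycle design (assumed values).
MULTI_CYCLE_STAGES = {"load": 5, "store": 4, "rtype": 4}
PIPELINE_DEPTH = 5  # Ifetch, Reg, Exec, Mem, Wr


def multi_cycle_cycles(program):
    """Each instruction runs to completion before the next one starts."""
    return sum(MULTI_CYCLE_STAGES[op] for op in program)


def pipelined_cycles(program, depth=PIPELINE_DEPTH):
    """Ideal pipeline without hazards: fill once, then finish one instruction per cycle."""
    return depth + len(program) - 1 if program else 0


program = ["load", "store", "rtype"]
print(multi_cycle_cycles(program))  # 13
print(pipelined_cycles(program))    # 7
```

The gap between the two counts widens as the program grows, because the pipeline fill cost is paid only once.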
Pipeline Property
Pipelining does not help the latency of a single task; it helps the throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously
Potential speedup = number of pipeline stages
Pipeline hazards reduce the performance from the ideal speedup gained by pipelining
$$\text{Time per instruction on pipelined machine (ideal)} = \frac{\text{Time per instruction on nonpipelined machine}}{\text{Number of pipe stages}}$$
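The two statements that the potential speedup equals the number of stages and that the rate is limited by the slowest stage can be checked with a small calculation. This is a minimal sketch; the stage latencies are made-up values in nanoseconds and `pipeline_speedup` is a hypothetical helper name.

```python
def pipeline_speedup(stage_times_ns):
    """Ideal speedup of a pipeline over the same logic run unpipelined."""
    unpipelined = sum(stage_times_ns)  # one instruction passes through every stage
    cycle = max(stage_times_ns)        # the clock period is set by the slowest stage
    return unpipelined / cycle


print(pipeline_speedup([2, 2, 2, 2, 2]))  # 5.0, balanced stages reach the stage count
print(pipeline_speedup([2, 1, 2, 3, 1]))  # 3.0, one slow stage limits the gain
```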
Pipeline Hazard
Situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle
Structural Hazard
Resource conflict when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution
Data Hazard
When an instruction depends on the results of a previous instruction still in the pipeline
Control Hazard
Arises from the pipelining of branches and other instructions that change the PC
Data Hazard Solution: Forwarding
[Figure: pipeline diagram, instruction order versus time (clock cycles), stages IF, ID/RF, EX, MEM, WB, for the sequence below]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
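To make the forwarding idea concrete, the sketch below works out where each dependent source operand of the five-instruction sequence would get its value. It assumes the classic five-stage pipeline (IF, ID/RF, EX, MEM, WB) with a register file that is written in the first half of a cycle and read in the second half; the names `SUPPLY`, `parse` and `operand_supply` are illustrative only.

```python
# Where a source operand is supplied from, keyed by distance to its producer.
# Assumes a 5-stage pipeline whose register file writes before it reads
# within the same clock cycle.
SUPPLY = {
    1: "forward from EX/MEM",
    2: "forward from MEM/WB",
    3: "register file (write then read in same cycle)",
}


def parse(line):
    """Split 'op rd,rs,rt' into (op, dest, [sources])."""
    op, regs = line.split(None, 1)
    dest, *sources = [r.strip() for r in regs.split(",")]
    return op, dest, sources


def operand_supply(program):
    """List (instruction index, register, supply path) for dependent operands."""
    parsed = [parse(line) for line in program]
    result = []
    for i, (_, _, sources) in enumerate(parsed):
        for reg in sources:
            for dist in range(1, i + 1):  # nearest earlier producer first
                if parsed[i - dist][1] == reg:
                    if dist in SUPPLY:
                        result.append((i, reg, SUPPLY[dist]))
                    break
    return result


program = [
    "add r1,r2,r3",
    "sub r4,r1,r3",
    "and r6,r1,r7",
    "or r8,r1,r9",
    "xor r10,r1,r11",
]
for entry in operand_supply(program):
    print(entry)
```

Under these assumptions only `sub` and `and` need a forwarding path; `or` reads r1 in the cycle it is written back, and `xor` reads it from the register file with no special handling.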
Hardware for Forwarding
Superscalar
Executes different instructions in parallel
Checks dependencies (data or control)
Superscalar execution with degree n = 3
[Figure: pipeline timing diagram with stages Fetch, Decode, Execute, Writeback; time axis in base cycles 0 to 9]
Performance of superscalar
m-issue, N instructions, k stages
The time required by the scalar base machine
$T(1,1) = k + N - 1$ (base cycles)
Ideal execution time of m-issue superscalar machine
$T(m,1) = k + (N - m)/m$ (base cycles)
Ideal speedup
$S(m,1) = T(1,1) / T(m,1) = (N + k - 1) / (N/m + k - 1)$
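A short numeric check of the ideal model, using T(1,1) = k + N - 1 and T(m,1) = k + (N - m)/m in base cycles. The workload size N = 120, pipeline depth k = 4 and issue width m = 3 are arbitrary example values, and the function names are hypothetical.

```python
def t_scalar(N, k):
    """Base cycles for N instructions on a k-stage scalar pipeline."""
    return k + N - 1


def t_superscalar(N, k, m):
    """Ideal base cycles for N instructions on an m-issue, k-stage machine."""
    return k + (N - m) / m


def speedup(N, k, m):
    """Ideal speedup S(m,1) of the m-issue machine over the scalar base machine."""
    return t_scalar(N, k) / t_superscalar(N, k, m)


N, k, m = 120, 4, 3
print(t_scalar(N, k))              # 123
print(t_superscalar(N, k, m))      # 43.0
print(round(speedup(N, k, m), 2))  # 2.86
```

For large N the speedup approaches m, since the pipeline fill term k - 1 becomes negligible.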
Superpipeline
Superpipeline of degree n
Cycle time is 1/n of base cycle
Superpipelined execution with degree n = 3
[Figure: pipeline timing diagram with stages Fetch, Decode, Execute, Writeback; time axis in base cycles 0 to 9]
Superpipelined superscalar
Superpipelined superscalar of degree (3,3)
Superpipelined superscalar of degree (m,n) executes m instructions every cycle with a pipeline cycle 1/n of base cycle
[Figure: pipeline timing diagram with stages Fetch, Decode, Execute, Writeback; time axis in base cycles 0 to 9]
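The slides give no timing formula for the combined machine. One common textbook extension of the ideal model, taken here as an assumption, is T(m,n) = k + (N - m)/(m n) base cycles, which reduces to k + N - 1 for the scalar base machine (m = n = 1). The sketch below evaluates it for a few degrees; the function names and the example values of N and k are hypothetical.

```python
def exec_time(N, k, m=1, n=1):
    """Ideal base cycles for N instructions on a k-stage machine of degree (m, n).

    Assumed model: m instructions issue every 1/n of a base cycle.
    """
    return k + (N - m) / (m * n)


def speedup(N, k, m=1, n=1):
    """Ideal speedup over the scalar base machine of degree (1, 1)."""
    return exec_time(N, k) / exec_time(N, k, m, n)


N, k = 120, 4
for m, n in [(1, 1), (3, 1), (1, 3), (3, 3)]:
    print((m, n), exec_time(N, k, m, n), round(speedup(N, k, m, n), 2))
```

Under this model the degree (3,3) machine approaches a ninefold speedup only for very long instruction streams, because the k-cycle fill time is not reduced.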
Vector Processing
Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
[Figure: scalar add versus vector add]
SCALAR (1 operation): add r3, r1, r2
VECTOR (N operations): add.vv v3, v1, v2, applied element by element over the vector length
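The contrast between the two instructions can be sketched in Python, with plain lists standing in for vector registers. This is only an illustration of the semantics, not of any particular vector instruction set; the function names are hypothetical.

```python
def scalar_add(r1, r2):
    """add r3, r1, r2: one instruction, one operation."""
    return r1 + r2


def vector_add(v1, v2):
    """add.vv v3, v1, v2: one instruction, len(v1) independent element operations."""
    if len(v1) != len(v2):
        raise ValueError("vector lengths differ")
    return [a + b for a, b in zip(v1, v2)]


print(scalar_add(1, 2))                   # 3
print(vector_add([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```

Because each element result is independent of the others, hardware is free to stream the element operations through a deep pipeline.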
Each result independent of previous result
=> long pipeline, compiler ensures no dependencies
=> high clock rate
Vector instructions access memory with known pattern
=> highly interleaved memory
=> amortize memory latency
=> no data caches required! (Do use instruction cache)
Reduces branches and branch problems in pipelines
Single vector instruction implies lots of work
=> fewer instruction fetches
Processor
Intel Pentium Pro Processor
Intel Pentium II Processor
Intel Pentium III Processor
Intel 80x86 Programming Model
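[Figure: register set, bit positions marked 31, 15, 7, 0]
8086, 80286: 16-bit registers AX (AH/AL), BX (BH/BL), CX (CH/CL), DX (DH/DL), SI, DI, BP, SP, IP, FLAGS, and segment registers CS, DS, ES, SS
80386, 80486, Pentium...: registers extended to 32 bits (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP), plus segment registers FS and GS
Based on 8080 programming model
Eight 8-bit GPRs
Two registers combined into one 16-bit register
Stack Pointer, Index registers, Segment registers
FLAG register
Expanded into 32 bits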
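The way the 8-bit, 16-bit and 32-bit register names overlap can be shown with a few bit masks. This is a minimal sketch using EAX as the example; the helper names are hypothetical.

```python
def sub_registers(eax):
    """Return (AX, AH, AL) as seen inside a 32-bit EAX value."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF      # low 16 bits
    ah = (ax >> 8) & 0xFF  # bits 15..8
    al = ax & 0xFF         # bits 7..0
    return ax, ah, al


def write_al(eax, value):
    """Writing AL replaces only the lowest byte of EAX."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


ax, ah, al = sub_registers(0x12345678)
print(hex(ax), hex(ah), hex(al))        # 0x5678 0x56 0x78
print(hex(write_al(0x12345678, 0xAB)))  # 0x123456ab
```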
Intel 80X86 Memory Addressing
86 ~ 286
4 Segment Registers : CS, DS, ES, SS
16-bit Offset Address : 64 Kbytes
Physical Memory : 1 Mbyte (8086), 16 Mbytes (80286)
Virtual Memory : 1 Gbyte (80286)
386 ~ Pentium
6 Segment Registers : CS, DS, ES, SS, FS, GS
32-bit Offset Address : 4 Gbytes
Physical Memory : 4 Gbytes
Virtual Memory : 64 Terabytes
Complex Addressing Modes : Base, Index, Scale
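The sizes quoted above follow from the address widths, and the sketch below reproduces them along with two address calculations. Two details are assumptions not stated on the slide: the virtual memory figures are derived as 2^14 selectable segments times the maximum segment size, and the effective address includes an optional displacement term. Function names are hypothetical.

```python
KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40


def real_mode_address(segment, offset):
    """8086 physical address: 16-bit segment shifted left 4 bits plus a 16-bit offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


def effective_address(base, index, scale, disp=0):
    """386-style offset: base + index * scale + displacement, wrapped to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & 0xFFFFFFFF


print(2**16 // KB, "KB per 16-bit offset")            # 64
print(2**20 // MB, "MB for 20 address bits (8086)")   # 1
print(2**24 // MB, "MB for 24 address bits (80286)")  # 16
print(2**14 * 64 * KB // GB, "GB virtual (80286)")    # 1
print(2**14 * 4 * GB // TB, "TB virtual (386)")       # 64
print(hex(real_mode_address(0x1234, 0x0010)))         # 0x12350
print(hex(effective_address(0x1000, 3, 4, 8)))        # 0x1014
```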
Intel 80286, 80386, 80486 Series Summary
[Figure: block diagram of the 80486 with Bus Interface Unit, Cache Unit, Prefetch Unit, Paging Unit, Segmentation Unit, Floating Point Unit, Integer Unit, Instruction Decoder, Control]
286
24-bit Address - 16MByte
Fully compatible with 8086
Protected mode was introduced
16 bit ALU, register set
386
32-bit ALU, register, bus
Paging support
486
Cache and FP unit were added
Intel 80486 Pipeline Architecture
5 Stage pipeline
PF : Prefetch
D1 : Instruction decode
D2 : Address generate
EX : Execute ALU, MEM operation
WB : Writeback to register
[Figure: three successive instructions overlapped in the pipeline]
PF D1 D2 EX WB
   PF D1 D2 EX WB
      PF D1 D2 EX WB
Intel Pentium Processor
Dual issue superscalar processor
Advanced design features
Branch prediction, Branch Target Buffer
Separated cache
8KB 2-way Instruction cache, 8KB 2-way data cache
4MB pages for increased TLB hit rate
Multiprocessor Supported
Multiprocessor instructions
Support for second level cache
32-bit CPU with 64-bit data bus
3.3V Bi-CMOS silicon technology
Dual power supplies - separated core, I/O
Pentium Pipeline
Pentium IU (Integer Unit) has 5 pipeline stages
PF : Prefetch
F : Fetch (Pentium processor with MMX Technology only)
D1 : Instruction Decode
D2 : Address Generate
EX : Execute - ALU and Cache Access
WB : Writeback
The two pipelines in the Pentium processor are called the U and V pipes; instructions are issued to them in pairs according to the instruction pairing rules
U : can execute any instruction, while the V-pipe can execute only 'simple' instructions
V : always takes the next sequential instruction after the one issued to the U-pipe
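A heavily simplified sketch of a pairing check follows. It models only two conditions: both instructions must be "simple", and the second must not read or write the register the first one writes. The real pairing rules have further conditions that are not modeled here, and the `SIMPLE_OPS` set is an illustrative subset chosen for this example, not a list taken from the slides.

```python
# Illustrative subset of "simple" operations (an assumption for this sketch).
SIMPLE_OPS = {"mov", "add", "sub", "and", "or", "xor", "inc", "dec", "lea", "nop"}


def can_pair(u, v):
    """Decide whether v may issue to the V-pipe alongside u in the U-pipe.

    u and v are (op, dest, sources) tuples for two sequential instructions.
    """
    u_op, u_dest, _ = u
    v_op, v_dest, v_srcs = v
    if u_op not in SIMPLE_OPS or v_op not in SIMPLE_OPS:
        return False
    if u_dest is not None and u_dest in v_srcs:  # read-after-write dependence
        return False
    if u_dest is not None and u_dest == v_dest:  # write-after-write dependence
        return False
    return True


first = ("mov", "eax", ["ebx"])
print(can_pair(first, ("add", "ecx", ["ecx", "edx"])))  # True
print(can_pair(first, ("add", "ecx", ["ecx", "eax"])))  # False
```

When the check fails, the second instruction simply waits and is issued to the U-pipe in the next cycle.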
Integer Unit Pipeline Execution
Basic Core Architecture
[Figure: integer pipeline flow through the U pipe and V pipe]
PF : Fetch and Align Instructions
D1 : Decode Instructions, Generate Control Words
D2 : Decode Control Word, Generate Memory Address (U pipe and V pipe)
EX : Calculate ALU Result or Access Data Cache (U pipe and V pipe)
WB : Write Result (U pipe and V pipe)
Intel Pentium FPU (Floating Point Unit)
Pentium FPU has 8 pipeline stages
PF : Prefetch
D1 : Instruction Decode
D2 : Address Generation
EX : Memory and register Read
X1 : FPU Execute Stage 1
X2 : FPU Execute Stage 2
WF : Write Floating-Point result to register file
ER : Error Reporting
Pentium Processor Block Diagram
Pentium Chip Layout Photograph
0.6um process
148 mm^2 die area
3.1*10^6 transistors
Intel Pentium MMX™ Processor
57 new instructions
Single Instruction Multiple Data (SIMD) architecture technique
Fixed-point integer
Mapped onto the 8 FP registers, with direct access
No new exceptions in Pentium
Low implementation complexity
Implementation shows full compatibility with existing OS and applications
Performance improvement of multimedia applications: 1.5-5X
Intel Pentium MMX™ Processor
8 MMX registers (MM0 - MM7)
4 MMX data types
MMX instruction set
data transfer instructions
arithmetic instructions
comparison instructions
conversion instructions
logical instructions
shift instructions
empty MMX state (EMMS) instruction
4 MMX Data Types
Packed bytes
- Mainly for graphics and video
Packed words
- Used mainly for audio and comm.
Packed doublewords
- General purpose use
Quadword
- Bitwise operations and data alignment
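The packed data types can be illustrated by treating a 64-bit integer as a row of equal-width lanes. The sketch below implements a lane-wise unsigned add with either wraparound or saturation, in the spirit of the packed arithmetic instructions. It is a behavioral model only; the function names are hypothetical and do not correspond to instruction mnemonics.

```python
def unpack(q, lane_bits):
    """Split a 64-bit quadword into lanes, lowest lane first."""
    mask = (1 << lane_bits) - 1
    return [(q >> (i * lane_bits)) & mask for i in range(64 // lane_bits)]


def pack(values, lane_bits):
    """Combine lanes (lowest first) back into one 64-bit quadword."""
    mask = (1 << lane_bits) - 1
    q = 0
    for i, v in enumerate(values):
        q |= (v & mask) << (i * lane_bits)
    return q


def packed_add(a, b, lane_bits, saturate=False):
    """Lane-wise unsigned add; lane_bits is 8, 16 or 32 for bytes, words or doublewords."""
    limit = (1 << lane_bits) - 1
    out = []
    for x, y in zip(unpack(a, lane_bits), unpack(b, lane_bits)):
        s = x + y
        out.append(min(s, limit) if saturate else s & limit)
    return pack(out, lane_bits)


a = 0x01020304050607FF
b = 0x0101010101010102
print(f"{packed_add(a, b, 8):016x}")                 # 0203040506070801
print(f"{packed_add(a, b, 8, saturate=True):016x}")  # 02030405060708ff
print(f"{packed_add(a, b, 16):016x}")                # 0203040506070901
```

In the byte case the lowest lane overflows (0xFF + 0x02), which shows the difference between wraparound and saturation; with 16-bit lanes the same inputs do not overflow at all.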
Intel Pentium MMX™ Processor
Implementation Architecture
MMX Technology Added in Parallel to the Existing Integer and FP H/W
Intel Pentium MMX™ Processor
MMX Technology Execution Pipe
MMX technology instructions use the integer pipe
After the execute stage the MMX technology pipe continues to the Mex and WM stages
Multiply instructions continue in the MMX technology pipe to the M2, M3 and Wmul stages
Intel Pentium MMX™ Processor
MMX Technology Execution Units
All MMX technology instructions can be issued every clock
Intel Pentium MMX™ Processor
Sample MMX Technology Operation
Intel Pentium Pro Processor
Level 2 Cache(256K or 512K)
36-bit Address Bus : 64Gbytes
3 Separate Instruction Decoders
3 Instruction Execution Units
2 for Integers
1 for Floating Point
Intel Pentium Pro Processor
http://www.intel.com/procs/p6/proceed/proceed.htm
Basic Block Diagram
Intel Pentium-II Processor
Available at 233 MHz, 266 MHz, 300 MHz, and 333 MHz core frequencies
Dynamic Execution micro architecture
Dual Independent Bus architecture
Separate dedicated external system bus
Dedicated internal high-speed cache bus
Power Management capabilities
System Management mode
Multiple low-power states
Optimized for 32-bit applications running on advanced 32-bit operating systems
Intel Pentium-II Processor
Single Edge Contact (S.E.C.) cartridge packaging technology
Integrated high performance 16 KB instruction and 16 KB data, nonblocking, level one cache
Scalable up to two processors and 64 GB of physical memory
Pentium Pro processor plus the capabilities of MMX technology
Intel Pentium-III Processor
Operating system flexibility to run leading applications on Microsoft Windows* NT or UNIX-based environments
Available in a number of Level two (L2) cache versions
256KB L2 Advanced Transfer Cache (ATC) at 667 and 733 MHz
Non-blocking, full speed, on-die Level 2 cache
8-way set associative
256-bit data bus to the level 2 cache
Advanced System Buffering at 667 and 733 MHz
4 writeback buffers, 6 fill buffers, 8 bus queue entries
Three engines communicating using an instruction pool
Fetch/Decode, Dispatch/Execute, Retire Unit
Intel Pentium-III Processor
Dynamic Execution Technology
Multiple branch prediction
Dataflow analysis
Speculative execution
Basic Core Architecture