Microprocessor LOL1

Computer Architecture Lab.
Microprocessor Study
smpark@chonnam.chonnam.ac.kr
Contents
1. Review
2. Processor families
x86 Family
SPARC
MIPS
Alpha
ARM
PowerPC
Review: Instruction Level Parallelism
High-speed execution is based on Instruction Level Parallelism (ILP): the potential of short instruction sequences to execute in parallel
High-speed microprocessors exploit ILP by:
pipelined execution: overlap instructions
superscalar execution: issue and execute multiple instructions per clock cycle
out-of-order execution
Memory accesses for high-speed microprocessors
Data cache, possibly multiported, with multiple levels
Pipeline
Superscalar
Superpipeline
Vector Processing
VLIW
Dynamic Scheduling
Branch Prediction
Memory Access
Pipeline Concept
[Figure: Single Cycle, Multiple Cycle, vs. Pipeline. Clock timing of Load, Store, and R-type instructions through the Ifetch, Reg, Exec, Mem, and Wr stages. Single-cycle implementation: Load and Store each take one long cycle, with part of the Store cycle wasted. Multiple-cycle implementation: one stage per cycle over Cycles 1 to 10. Pipeline implementation: Load, Store, and R-type overlap, each starting one cycle after the previous instruction.]
Pipeline Property
Pipelining does not help the latency of a single task; it helps the throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously
Potential speedup = number of pipeline stages
Pipeline hazards reduce the performance from the ideal speedup gained by pipelining
Time per instruction on pipelined machine (ideal) = Time per instruction on nonpipelined machine / Number of pipe stages
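The relation above can be checked numerically. The following is a minimal sketch, not taken from the slides: the function names, the optional slowest-stage delay, and the stall term are illustrative assumptions layered on the ideal formula.

```python
def pipelined_time_per_instruction(t_unpipelined, stages, slowest_stage=None):
    """Ideal time per instruction is t_unpipelined / stages.

    If a slowest-stage delay is given (assumption: unbalanced stages),
    the pipeline rate is limited by that stage instead.
    """
    ideal = t_unpipelined / stages
    if slowest_stage is None:
        return ideal
    return max(ideal, slowest_stage)


def pipeline_speedup(t_unpipelined, stages, slowest_stage=None, stalls_per_instruction=0.0):
    """Speedup over the nonpipelined machine.

    stalls_per_instruction models hazards as extra cycles per instruction
    (hypothetical parameter); with no stalls and balanced stages the
    speedup equals the number of stages.
    """
    cycle = pipelined_time_per_instruction(t_unpipelined, stages, slowest_stage)
    return t_unpipelined / (cycle * (1.0 + stalls_per_instruction))


if __name__ == "__main__":
    print(pipeline_speedup(10.0, 5))                              # balanced, no hazards
    print(pipeline_speedup(10.0, 5, slowest_stage=2.5))           # limited by slowest stage
    print(pipeline_speedup(10.0, 5, stalls_per_instruction=0.25)) # hazards reduce speedup
```

With balanced stages and no stalls the result equals the stage count; either an unbalanced stage or hazard stalls pulls it below that ideal.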
Pipeline Hazard
A situation that prevents the next instruction in the instruction stream from executing during its designated clock cycle
Structural Hazard
Resource conflict when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution
Data Hazard
When an instruction depends on the result of a previous instruction still in the pipeline
Control Hazard
Arises from the pipelining of branches and other instructions that change the PC
[Figure: Data hazard on r1. Vertical axis: instruction order; horizontal axis: time (clock cycles); pipeline stages IF, ID/RF, EX, MEM, WB]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

Data Hazard Solution: Forwarding
Hardware for Forwarding
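To make the forwarding cases concrete, the sketch below classifies each read-after-write dependence in the five-instruction sequence from the figure by the distance between producer and consumer. It assumes a classic five-stage pipeline (IF, ID/RF, EX, MEM, WB) with ALU producers only and a register file that writes in the first half of a cycle and reads in the second half; those assumptions and all names in the code are illustrative, not taken from the slides.

```python
# (opcode, destination, sources) for the sequence shown in the figure
PROGRAM = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r4", ("r1", "r3")),
    ("and", "r6", ("r1", "r7")),
    ("or", "r8", ("r1", "r9")),
    ("xor", "r10", ("r1", "r11")),
]


def classify_raw_hazards(program):
    """Return (producer, consumer, register, resolution) for each RAW dependence."""
    results = []
    for j, (consumer, _, sources) in enumerate(program):
        for src in sources:
            # Search backwards for the most recent earlier writer of this source.
            for i in range(j - 1, -1, -1):
                producer, dest, _ = program[i]
                if dest != src:
                    continue
                distance = j - i
                if distance == 1:
                    resolution = "forward from EX/MEM"
                elif distance == 2:
                    resolution = "forward from MEM/WB"
                elif distance == 3:
                    resolution = "register file (write first half, read second half)"
                else:
                    resolution = "no hazard"
                results.append((producer, consumer, src, resolution))
                break
    return results


if __name__ == "__main__":
    for row in classify_raw_hazards(PROGRAM):
        print(row)
```

Under these assumptions only sub and and need forwarding paths; or is covered by the split-cycle register file and xor reads r1 after it has been written back.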
Executes different instructions in parallel
Checks dependencies (data or control)
[Figure: Superscalar execution with degree m = 3. Stages: fetch, decode, execute, writeback; time axis in base cycles 0 to 9]
Superscalar
m-issue, N instructions, k stages
The time required by the scalar base machine
T(1,1) = k + N - 1 (base cycles)
Ideal execution time of an m-issue superscalar machine
T(m,1) = k + (N - m) / m (base cycles)
Ideal speedup
S(m,1) = T(1,1) / T(m,1)
= (N + k - 1) / (N/m + k - 1)
Performance of superscalar
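The ideal timing model above is easy to evaluate for concrete values. This is a small sketch under the slide's definitions (m-issue, N instructions, k stages); the function names are illustrative.

```python
def t_scalar(n_instr, k_stages):
    """T(1,1) = k + N - 1 base cycles for the scalar base machine."""
    return k_stages + n_instr - 1


def t_superscalar(n_instr, k_stages, m_issue):
    """T(m,1) = k + (N - m) / m base cycles for an ideal m-issue machine."""
    return k_stages + (n_instr - m_issue) / m_issue


def speedup_superscalar(n_instr, k_stages, m_issue):
    """S(m,1) = T(1,1) / T(m,1)."""
    return t_scalar(n_instr, k_stages) / t_superscalar(n_instr, k_stages, m_issue)


if __name__ == "__main__":
    # 12 instructions, 4 stages, 3-issue
    print(t_scalar(12, 4), t_superscalar(12, 4, 3), speedup_superscalar(12, 4, 3))
    # For very long instruction streams the speedup approaches m
    print(speedup_superscalar(10**9, 4, 3))
```

For short instruction streams the pipeline fill time k dominates, so the speedup stays well below m; it approaches m only as N grows.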
Superpipeline of degree n
Cycle time is 1/n of the base cycle
[Figure: Superpipelined execution with degree n = 3. Time axis in base cycles 0 to 9]
Superpipeline
[Figure: Superpipelined superscalar of degree (3,3). Time axis in base cycles 0 to 9]
A superpipelined superscalar machine of degree (m,n) executes m instructions every cycle, with a pipeline cycle 1/n of the base cycle
Superpipelined superscalar
Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
SCALAR (1 operation): add r3, r1, r2 (r3 = r1 + r2)
VECTOR (N operations): add.vv v3, v1, v2 (v3 = v1 + v2, element by element over the vector length)
Vector Processing
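The scalar/vector contrast can be mimicked in software. This is only an illustration of the semantics of `add r3, r1, r2` versus `add.vv v3, v1, v2`; plain Python lists stand in for vector registers, and the function names are hypothetical.

```python
def scalar_add(r1, r2):
    """One instruction, one operation: r3 = r1 + r2."""
    return r1 + r2


def vector_add(v1, v2):
    """One instruction, N operations: v3[i] = v1[i] + v2[i] for the whole vector length."""
    if len(v1) != len(v2):
        raise ValueError("vector operands must have the same length")
    # Each element result is independent of the others, which is what lets
    # real hardware stream the elements through a deep pipeline.
    return [a + b for a, b in zip(v1, v2)]


if __name__ == "__main__":
    print(scalar_add(1, 10))
    print(vector_add([1, 2, 3], [10, 20, 30]))
```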
Each result is independent of the previous result
=> long pipeline, compiler ensures no dependencies
=> high clock rate
Vector instructions access memory with a known pattern
=> highly interleaved memory
=> amortize memory latency
=> no data caches required! (Do use instruction cache)
Reduces branches and branch problems in pipelines
A single vector instruction implies lots of work
=> fewer instruction fetches
Properties of Vector Processors

Originated from parallel microcode
Code compaction by compiler
[Figure: VLIW organization. Main memory and register file feed the functional units: Load/Store unit, F.P. Add unit, Integer ALU, Branch unit, ... Instruction word fields: Load/Store | FP Add | FP Mul | Branch | Int ALU | ...]
VLIW (Very Long Instruction Word)
Easy decoding
Possibility of low code density
Different degrees of parallelism require different instruction sets
Exploits random parallelism among scalar operations
Vector and SIMD machines exploit regular parallelism
[Figure: VLIW execution. Stages: ifetch, decode, execute, writeback; time axis in cycles 0 to 9]
Properties of VLIW
Hardware rearranges instruction execution to prevent stalls
enables handling of unknown dependencies (e.g., memory references) and simplifies the compiler
enables compiled code to run efficiently on different platforms
complicates hardware
complicates exception handling
imprecise exceptions
difficult to restart after an interrupt
Dynamic Ordering
Issue
Completion
Dynamic Scheduling
Out-of-order Issue
Central Window
Complex (de)allocation
Needs the capability to hold any type of instruction
Selects among a larger number of instructions
Reservation Station
Partitions instructions by functional unit
Simple, duplicated control logic
Larger number of entries at equivalent performance
[Figure: Out-of-order issue organizations. An instruction decoder feeding a central window, and an instruction decoder feeding reservation stations]
Modern microprocessors
deeper and deeper pipelines
wider and wider execution
control hazard
frequent bubble maker
significant performance factor
How to solve the hazard?
Loop counter
Conditional instruction
Branch prediction
Control Hazard
Predict-taken
MIPS-X
Predict-not-taken
Motorola MC88000
Prediction bit in instruction
compiler support
PowerPC, SPARC V9
Static Branch Prediction
Branch Prediction
Branch address
Branch history
taken or not-taken outcomes of previous branches
two-level adaptive
Hybrid
correlated
Branch Target Buffer
lookup cache indexed by branch address
provides the target address without computing it
fetches the target instruction without a stall
Dynamic Prediction
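As one possible illustration of history-based prediction combined with a branch target buffer, the sketch below keeps a 2-bit saturating counter per branch address plus a dictionary acting as the BTB. The 2-bit counter scheme, the initial counter value, and all names and addresses are assumptions for illustration; the slides do not specify a particular predictor.

```python
class BranchPredictor:
    """Per-branch 2-bit saturating counters (0..3) with a simple BTB (assumed design)."""

    def __init__(self):
        self.counters = {}  # branch address -> counter; 2 or 3 means predict taken
        self.btb = {}       # branch address -> last seen target address

    def predict(self, pc):
        """Return (predicted_taken, predicted_target or None)."""
        taken = self.counters.get(pc, 1) >= 2  # unseen branches start weakly not-taken
        target = self.btb.get(pc) if taken else None
        return taken, target

    def update(self, pc, taken, target=None):
        """Record the actual outcome of the branch at pc."""
        counter = self.counters.get(pc, 1)
        self.counters[pc] = min(3, counter + 1) if taken else max(0, counter - 1)
        if taken and target is not None:
            self.btb[pc] = target


def count_correct(predictor, outcomes, pc=0x400, target=0x380):
    """Run one branch through the predictor and count correct direction predictions."""
    correct = 0
    for taken in outcomes:
        predicted, _ = predictor.predict(pc)
        correct += int(predicted == taken)
        predictor.update(pc, taken, target)
    return correct


if __name__ == "__main__":
    loop = [True] * 9 + [False]  # loop branch: taken 9 times, then falls through
    p = BranchPredictor()
    print(count_correct(p, loop), count_correct(p, loop))
```

The second pass over the same loop mispredicts only the exit, because a single not-taken outcome does not flip a strongly-taken counter.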
The aim of the MMU
The MMU provides the programmer with a large virtual memory through address translation
Role of the MMU
Protection and sharing
Relocation
Logical/physical memory organization
[Figure: Conceptual view of the MMU. The CPU issues a virtual address to the MMU, which sends a physical address to memory]
Memory Management Unit (MMU)
Address translation mechanism
Paging
Segmentation
Paged segmentation
[Figure: Address translation. A translation table maps entries of the virtual address space to the physical address space; an unmapped entry causes a fault]
Address translation
Paging
Most widely used virtual memory technique
Divides the virtual and physical address spaces into pages of the same size
Contiguous virtual addresses are scattered in the physical address space
Segmentation
Divides the virtual address space into segments that directly relate to objects at the programming level
Segments may even vary in size during process execution
Protection and sharing are possible at the object level
Paging vs. Segmentation
Address translation using paging
Rp: physical (real) page number
Vp: virtual page number
[Figure: The virtual address (Vp, Disp) indexes the page table located by PTR; entry PT[Vp] supplies Rp, giving the physical (real) address (Rp, Disp)]
Paging
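The paging translation described above (split the virtual address into Vp and Disp, look up Rp in the page table, recombine) can be sketched as follows. The 4 KB page size, the dictionary used as a page table, and the `PageFault` exception are illustrative assumptions.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageFault(Exception):
    """Raised when the virtual page has no mapping in the page table."""


def translate(vaddr, page_table, page_size=PAGE_SIZE):
    """Translate a virtual address to a physical address.

    page_table maps virtual page number (Vp) -> physical page number (Rp).
    """
    vp, disp = divmod(vaddr, page_size)  # split into (Vp, Disp)
    rp = page_table.get(vp)
    if rp is None:
        raise PageFault(vp)
    return rp * page_size + disp         # recombine as (Rp, Disp)


if __name__ == "__main__":
    table = {0: 5, 1: 2}  # contiguous virtual pages scattered in physical memory
    print(hex(translate(0x0010, table)))
    print(hex(translate(0x1234, table)))
```

The displacement passes through unchanged; only the page number is replaced, which is why virtual and physical pages must have the same size.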
Address translation using segmentation
[Figure: The virtual address (Vs, Disp) indexes the segment table located by STR; entry ST[Vs] supplies the segment base used with Disp to form the physical (real) address]
Segmentation
Address translation
[Figure: Paged segmentation. The virtual address consists of a segment number, a page number, and a displacement. STR locates the segment table; the segment table entry locates the page table of the segment; the page table entry supplies Rp, giving the real address (Rp, Disp)]
Paged Segmentation
The role of the TLB
Caches recent address mappings
Accelerates address translation with hardware assistance
TLB for paged segmentation
[Figure: TLB for paged segmentation. Labels: access type, virtual address, TLB, select, access-rights bits, Disp, physical address]
Translation Lookaside Buffer (TLB)
Modern large virtual address spaces require a large page table or multilevel paging
Inverted Page Table
Reduces the page table size
One entry for each physical memory page, instead of one for each virtual page
Uses associative search or hashing functions
[Figure: Inverted page table. The Vp field of the virtual address (Vp, Disp) is hashed; each inverted page table entry holds a link, Vp, and Rp; the matching entry yields the physical address]
Inverted Page Table
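A hashed inverted page table can be sketched in a few lines. In the sketch below the table has one slot per physical frame, a modulo hash selects a bucket, and each bucket's list plays the role of the link chain. The hash function, the class name, and the method names are assumptions for illustration only.

```python
class InvertedPageTable:
    """One entry per physical frame, located through a hash of the virtual page number."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.frames = [None] * num_frames  # index = Rp, value = Vp stored in that frame
        self.buckets = {}                  # hash value -> list of Rp (collision chain)

    def _hash(self, vp):
        return vp % self.num_frames        # assumed hash function

    def map(self, vp, rp):
        """Record that virtual page vp resides in physical frame rp."""
        if self.frames[rp] is not None:
            raise ValueError("frame already in use")
        self.frames[rp] = vp
        self.buckets.setdefault(self._hash(vp), []).append(rp)

    def lookup(self, vp):
        """Return the frame holding vp, or None if the page is not resident."""
        for rp in self.buckets.get(self._hash(vp), []):
            if self.frames[rp] == vp:      # compare the stored Vp tag
                return rp
        return None


if __name__ == "__main__":
    ipt = InvertedPageTable(4)
    ipt.map(9, 2)
    ipt.map(5, 0)  # 9 and 5 collide under the assumed hash
    print(ipt.lookup(9), ipt.lookup(5), ipt.lookup(13))
```

The table size depends on the number of physical frames rather than on the size of the virtual address space, which is the point of the inverted organization.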
Replacement policies for fixed-size partitions
Optimal (MIN) replacement
Ideal; used as a theoretical reference policy
Random replacement
FIFO replacement
FINUFO (first-in not-used first-out)
Each page has a used flag
Recentness problem
LRU (least recently used) replacement
Replacement Policies
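Two of the listed policies, FIFO and LRU, can be compared by counting page faults on a reference string. This is a minimal sketch; the reference string and the frame count are illustrative values, not data from the slides.

```python
from collections import OrderedDict, deque


def fifo_faults(refs, frames):
    """Count page faults with first-in first-out replacement."""
    resident, queue, faults = set(), deque(), 0
    for page in refs:
        if page in resident:
            continue
        faults += 1
        if len(resident) == frames:
            resident.discard(queue.popleft())  # evict the oldest loaded page
        resident.add(page)
        queue.append(page)
    return faults


def lru_faults(refs, frames):
    """Count page faults with least-recently-used replacement."""
    resident, faults = OrderedDict(), 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)         # mark as most recently used
            continue
        faults += 1
        if len(resident) == frames:
            resident.popitem(last=False)       # evict the least recently used page
        resident[page] = True
    return faults


if __name__ == "__main__":
    refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
    print(fifo_faults(refs, 3), lru_faults(refs, 3))
```

FIFO ignores how recently a page was used, which is the recentness problem noted above; LRU tracks recency at the cost of updating state on every reference.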
The dilemma: memory should be fast, large, and inexpensive
Locality solves the dilemma
Memory Hierarchy
Design considerations
Cache size
Block mapping
Replacement strategy
Write policy
Unified / separated
Cache coherency
[Figure: CPU, cache, and main memory. Word transfer between CPU and cache; block transfer between cache and main memory]
Cache Principles
Pipelined Cache
clears the main critical path of modern microprocessor designs
Non-blocking Cache
adopted for out-of-order execution
a cache miss does not block subsequent cache accesses
Data Prefetching Cache
based on data reference patterns
Tightly Coupled Cache
integration of address decoding and generation
direct binding with register files
Cache Trends
Intel x86 Family
Intel 80x86 Programming Model
Intel 80x86 Memory Addressing
Intel 80286, 80386, 80486 Series Summary
Intel 80486 Pipeline Architecture
Intel Pentium Processor
Intel Pentium MMX Processor
Intel Pentium Pro Processor
Intel Pentium II Processor
Intel Pentium III Processor
Intel 80x86 Programming Model
[Figure: Register set. 8086, 80286: 16-bit AX (AH, AL), BX (BH, BL), CX (CH, CL), DX (DH, DL), SI, DI, BP, SP, IP, FLAGS, and segment registers CS, DS, ES, SS. 80386, 80486, Pentium: registers extended to 32 bits (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP), with added segment registers FS and GS]
Based on the 8080 programming model
Eight 8-bit GPRs
Two registers combine into one 16-bit register
Stack pointer, index registers, segment registers
FLAGS register
Expanded to 32 bits
Intel 80x86 Memory Addressing
8086 ~ 80286
4 segment registers: CS, DS, ES, SS
16-bit offset address: 64 KB
Physical memory: 1 MB (8086), 16 MB (80286)
Virtual memory: 1 GB (80286)
80386 ~ Pentium
6 segment registers: CS, DS, ES, SS, FS, GS
32-bit offset address: 4 GB
Physical memory: 4 GB
Virtual memory: 64 TB
Complex addressing modes: base, index, scale
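The address arithmetic behind these figures can be sketched briefly. The slides give only the sizes; the 8086 real-mode rule used here (segment shifted left by 4 bits plus a 16-bit offset, wrapped to 20 bits) and the 32-bit base + index * scale + displacement form are standard x86 behaviour added for illustration, and the function names are hypothetical.

```python
def real_mode_address(segment, offset):
    """8086 physical address: (segment << 4) + offset, wrapped to the 20-bit (1 MB) space."""
    return (((segment & 0xFFFF) << 4) + (offset & 0xFFFF)) & 0xFFFFF


def effective_address(base=0, index=0, scale=1, disp=0):
    """80386-style effective address: base + index * scale + displacement (32-bit wrap)."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(real_mode_address(0x1234, 0x0010)))
    # Element 3 of an array of 4-byte items at base 0x1000, field at byte 8
    print(hex(effective_address(base=0x1000, index=3, scale=4, disp=8)))
```

A 16-bit offset spans 2^16 bytes (64 KB) within a segment, and the 20-bit result spans 2^20 bytes (1 MB), matching the 8086 figures listed above.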
Intel 80286, 80386, 80486 Series Summary
[Figure: Functional units in the 80486: bus interface unit, cache unit, prefetch unit, paging unit, floating-point unit, control unit, instruction decoder, integer unit, segmentation unit]
286
24-bit address: 16 MB
Fully compatible with the 8086
Protected mode was introduced
16-bit ALU and register set
386
32-bit ALU, registers, and bus
Paging support
486
Cache and FP unit were added
Intel 80486 Pipeline Architecture
5-stage pipeline
PF: prefetch
D1: instruction decode
D2: address generation
EX: execute ALU and memory operations
WB: writeback to register
[Figure: Three successive instructions, each passing through PF, D1, D2, EX, WB and offset by one cycle]
Intel Pentium Processor
Dual-issue superscalar processor
Advanced design features
Branch prediction, Branch Target Buffer
Separate caches
8 KB 2-way instruction cache, 8 KB 2-way data cache
4 MB pages for increased TLB hit rate
Multiprocessor support
Multiprocessor instructions
Support for second-level cache
32-bit CPU with 64-bit data bus
3.3 V BiCMOS silicon technology
Dual power supplies: separate core and I/O
Pentium Pipeline
The Pentium integer unit (IU) has 5 pipeline stages
PF: Prefetch
F: Fetch (Pentium processor with MMX technology only)
D1: Instruction Decode
D2: Address Generation
EX: Execute (ALU and cache access)
WB: Writeback
The two pipelines in the Pentium processor are called the U pipe and the V pipe
Instructions are issued to them according to the instruction pairing rules
U: can execute any instruction, while the V pipe can execute only "simple" instructions
V: always receives the next sequential instruction after the one issued to the U pipe
Integer Unit Pipeline Execution
Basic Core Architecture
[Figure: Pentium integer pipelines. PF: fetch and align instructions. D1: decode instructions, generate control words. D2 (U pipe and V pipe): decode control word, generate memory address. EX (U pipe and V pipe): calculate ALU result or access data cache. WB (U pipe and V pipe): write result]
Intel Pentium FPU (Floating Point Unit)
The Pentium FPU has 8 pipeline stages
PF: Prefetch
D1: Instruction Decode
D2: Address Generation
EX: Memory and register read
X1: FPU Execute Stage 1
X2: FPU Execute Stage 2
WF: Write floating-point result to register file
ER: Error Reporting
Pentium Processor Block Diagram
Pentium Chip Layout Photograph
0.6 µm process
148 mm² die
3.1 × 10^6 transistors
Intel Pentium MMX(TM) Processor
57 new instructions
Single Instruction Multiple Data (SIMD) architecture technique
Fixed-point integer
Mapped onto the 8 FP registers, with direct access
No new exceptions in Pentium
Low implementation complexity
Implementation shows full compatibility with existing OS and applications
Performance improvement of multimedia applications: 1.5-5X
Intel Pentium MMX(TM) Processor
8 MMX registers (MM0 - MM7)
4 MMX data types
MMX instruction set
data transfer instructions
arithmetic instructions
comparison instructions
conversion instructions
logical instructions
shift instructions
empty MMX state (EMMS) instruction
4 MMX Data Types
Packed bytes
- Mainly for graphics and video
Packed words
- Used mainly for audio and communications
Packed doublewords
- General-purpose use
Quadword
- Bitwise operations and data alignment
Intel Pentium MMX(TM) Processor
Implementation Architecture
MMX technology added in parallel to the existing integer and FP hardware
Intel Pentium MMX(TM) Processor
MMX Technology Execution Pipe
MMX technology instructions use the integer pipe
After the execute stage, the MMX technology pipe continues to the Mex and WM stages
Multiply instructions continue in the MMX technology pipe to the M2, M3, and Wmul stages
Intel Pentium MMX(TM) Processor
MMX Technology Execution Units
All MMX technology instructions can be issued every clock
Intel Pentium MMX(TM) Processor
Sample MMX Technology Operation
Intel Pentium Pro Processor
Level 2 cache (256 KB or 512 KB)
36-bit address bus: 64 GB
3 separate instruction decoders
3 instruction execution units
2 for integers
1 for floating point
http://www.intel.com/procs/p6/proceed/proceed.htm
Basic Block Diagram
Intel Pentium-II Processor
Available at 233 MHz, 266 MHz, 300 MHz, and 333 MHz core frequencies
Dynamic Execution microarchitecture
Dual Independent Bus architecture
Separate dedicated external system bus
Dedicated internal high-speed cache bus
Power management capabilities
System Management Mode
Multiple low-power states
Optimized for 32-bit applications running on advanced 32-bit operating systems
Intel Pentium-II Processor
Single Edge Contact (S.E.C.) cartridge packaging technology
Integrated high-performance 16 KB instruction and 16 KB data, nonblocking, level-one cache
Scalable up to two processors and 64 GB of physical memory
Pentium Pro processor plus the capabilities of MMX technology
Intel Pentium-III Processor
Operating system flexibility to run leading applications on Microsoft Windows NT or UNIX-based environments
Available in a number of Level 2 (L2) cache versions
256 KB L2 Advanced Transfer Cache (ATC) at 667 and 733 MHz
Non-blocking, full-speed, on-die Level 2 cache
8-way set associative
256-bit data bus to the Level 2 cache
Advanced System Buffering at 667 and 733 MHz
4 writeback buffers, 6 fill buffers, 8 bus queue entries
Three engines communicating through an instruction pool
Fetch/Decode unit, Dispatch/Execute unit, Retire unit
Intel Pentium-III Processor
Dynamic Execution Technology
Multiple branch prediction
Dataflow analysis
Speculative execution
Basic Core Architecture
