
CC513: Computing Systems

Lecture 3: Processors & Memory Hierarchy


Chapter 4: Advanced Computer Architecture
Kai Hwang

By: Dr Wael Hosny


Processors & Memory Hierarchy
 In this chapter the following points will be covered:
1. Instruction Set Architectures
 CISC
 RISC
2. Processors
 Superscalar
 VLIW
 Superpipelined
 Vector
3. Memory Hierarchy & Capacity Planning
4. Virtual memory, address translation & page replacement
policies
Advanced Processor Technology
 Major processor families
 CISC
 RISC
 Superscalar
 VLIW
 Superpipelined
 Vector
 Symbolic
 Scalar & vector processors → numerical computations
 Symbolic processors → AI applications



Design Space of Processors

 Processor families are classified according to:


 Clock rate (Hz)
 CPI (cycles per instruction)
 It is better to have:
 A higher clock rate
 A lower CPI (achieved by hardware & software approaches); see the worked example below
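
As a quick worked illustration (the numbers here are assumed for the example, not taken from the slides), the standard performance equation shows why a higher clock rate f and a lower CPI both reduce execution time:

\[ T_{\mathrm{CPU}} = IC \times CPI \times \frac{1}{f} \]

\[ IC = 10^6,\ CPI = 10,\ f = 50\,\mathrm{MHz} \ \Rightarrow\ T = \frac{10^6 \times 10}{50 \times 10^6} = 0.2\,\mathrm{s} \qquad IC = 10^6,\ CPI = 1.5,\ f = 100\,\mathrm{MHz} \ \Rightarrow\ T = 0.015\,\mathrm{s} \]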



Design Space of Modern Processor Families

[Figure: design space of modern processor families, plotted as CPI versus clock rate. CISC processors sit toward high CPI and low clock rate; scalar RISC and superpipelined processors toward lower CPI and higher clock rate; superscalar RISC, VLIW processors and vector supercomputers occupy the lowest-CPI, highest-clock-rate region.]


 Note that CISC falls at the upper left of this design space, where CPI is high (bad) & clock rate is low (bad)
 Complex Instruction Set Computing, e.g. the Intel i486, Motorola M68040
 Today RISC (Reduced Instruction Set Computer) processors reach faster clock rates
 Superscalar processors:
 A special subclass of RISC processors that allows multiple instructions to be issued simultaneously during each cycle
 VLIW (Very Long Instruction Word): uses more functional units than a superscalar processor, with very long instructions (256–1024 bits per instruction), implemented with microprogrammed control
 Superpipelined processors:
 Use a multiphase clock
 Higher clock rate (good)
 Higher CPI (bad)
 Vector supercomputers
 Processors are pipelined
 Use multiple functional units for concurrent scalar & vector
operations

 Cost increases toward the lower-right corner of the design space (low CPI, high clock rate)



Instruction Pipelines

 Execution of a typical instruction includes 4 phases:


 Fetch
 Decode
 Execute
 Write back
 These phases are often executed in an instruction pipeline



Pipeline Cycle

 The pipeline cycle is the time required for each phase to complete its operation, assuming equal delay in all phases (pipeline stages); see the worked example below
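
As a small worked example (stage count and delay values are assumed for illustration): if the pipeline has k stages of equal delay \tau, the pipeline cycle is \tau, and N instructions complete in (k + N - 1) cycles once the pipeline is kept full:

\[ T_{\mathrm{pipelined}} = (k + N - 1)\,\tau, \qquad k = 4,\ N = 100,\ \tau = 10\,\mathrm{ns} \ \Rightarrow\ T = 103 \times 10\,\mathrm{ns} = 1.03\,\mu\mathrm{s} \]

versus \( N \times k \times \tau = 4\,\mu\mathrm{s} \) without pipelining.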



Basic definitions associated with instruction pipeline
operations:

1. Instruction Pipeline Cycle
 The clock period of the instruction pipeline
2. Instruction Issue Latency
 The time (in cycles) required between the issuing of two adjacent instructions
3. Instruction Issue Rate
 The number of instructions issued per cycle (the degree of a superscalar processor)



4. Simple Operation Latency (measured in number of cycles)
 Simple operations make up the vast majority of the instructions executed by the machine
 E.g. adds, loads, stores, branches, moves, etc.
 Complex operations, by contrast, require latencies an order of magnitude longer
 E.g. divides, cache misses
5. Resource Conflict
 Refers to the situation where two or more instructions demand the use of the same functional unit at the same time



Base Scalar Processor
 A machine with ONE instruction issued per cycle
 A one-cycle latency for a simple operation
 A one-cycle latency between instruction issues (pipeline fully utilized)

 In practice, instruction latency can be more than 1 cycle



Execution in Base Scalar Processor

Successive instructions (F = fetch, D = decode, E = execute, W = write back), time in base cycles:

Cycle:  1  2  3  4  5  6  7
I1:     F  D  E  W
I2:        F  D  E  W
I3:           F  D  E  W
I4:              F  D  E  W


 Maximum utilization of the instruction pipeline occurs when one instruction is issued per cycle; this ideal rate is not achieved in the underutilized cases shown next



Underpipelined with 2 cycles per instruction issue

Successive instructions, time in base cycles:

Cycle:  1  2  3  4  5  6  7  8  9  10
I1:     F  D  E  W
I2:           F  D  E  W
I3:                 F  D  E  W
I4:                       F  D  E  W


 Here the instruction issue latency is 2 cycles per instruction → the pipeline is underutilized



Underpipelined with twice the base cycle

Successive instructions, time in base cycles (each pipeline stage now spans two base cycles: F+D form one stage, E+W the other):

Cycle:  1  2  3  4  5  6  7  8
I1:     F  D  E  W
I2:           F  D  E  W
I3:                 F  D  E  W


 Another underpipelined situation: the pipeline cycle time is doubled by combining pipeline stages
 In this case fetch & decode are combined into one pipeline stage, and execute & write back into another → poor utilization
 The effective CPI is 1 instruction per (doubled) cycle, but the clock rate is lowered by one half, so throughput is again about half that of the base scalar processor; the sketch below compares the three cases
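
A minimal simulation sketch (Python, written for these notes; the instruction count is arbitrary) that reproduces the three timing diagrams above and reports the effective CPI in base cycles:

def total_base_cycles(n_instr, issue_latency, n_stages, cycles_per_stage):
    """Base cycles until the last of n_instr instructions leaves the pipeline."""
    last_issue = (n_instr - 1) * issue_latency          # cycle in which the last instruction enters
    return last_issue + n_stages * cycles_per_stage     # plus the time to drain the pipeline

n = 64
cases = {
    "base scalar":               (1, 4, 1),   # issue every cycle, 4 one-cycle stages
    "2-cycle instruction issue": (2, 4, 1),   # pipeline sits idle every other issue slot
    "twice the base cycle":      (2, 2, 2),   # 2 combined stages of 2 base cycles each
}
for name, (lat, stages, per_stage) in cases.items():
    t = total_base_cycles(n, lat, stages, per_stage)
    print(f"{name:28s} {t:4d} base cycles, effective CPI = {t / n:.2f}")

Both underpipelined cases end up at roughly twice the effective CPI of the base scalar processor, i.e. about half the throughput.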



Processors & Coprocessors
 Processor = Central Processing Unit (CPU)
 CPU is essentially a scalar processor that consists of multiple
functional units
 Functional units
 Arithmetic & Logic Unit (ALU)
 Floating Point Accelerator
 … etc



Architectural Models of a Basic Scalar
Computer System

1. A CPU with a built-in floating point unit (first figure below)

2. A CPU with an attached coprocessor (second figure below)



CPU with Built-in Floating Point Unit

[Figure: the CPU contains an integer ALU, a floating point unit and a control unit; a cache connects the CPU to main memory (also accessed by DMA); the I/O bus connects to the I/O subsystem and mass storage.]


CPU with an Attached Coprocessor

[Figure: the CPU (instruction processor) fetches instructions and data from main memory; the attached coprocessor receives data and the instructions dispatched to it by the CPU; the CPU also connects to the I/O subsystem and mass storage.]


 The coprocessor executes instructions dispatched from the CPU
 A coprocessor may be:
 A floating point accelerator executing scalar data
 A vector processor
 A Digital Signal Processor (DSP)
 A LISP processor executing AI programs


Instruction Set Architecture
 In this section we characterize instruction sets & examine the hardware features built into RISC & CISC scalar processors
 The instruction set of a computer specifies the primitive
commands or machine instructions that a programmer can
use in programming the machine
 The complexity of an instruction set is attributed to the
 Instruction formats
 Data formats
 Addressing modes
 General purpose registers
 Opcode specifications & flow control mechanisms used



Complex Instruction Sets
 In the early days of computer families, instruction sets were kept simple because of the high cost of hardware
 But hardware cost has dropped & software cost has gone up over the last 3 decades
 As a result, more and more functions have been built into the hardware
 Making the instruction set very large & complex
 The growth of instruction sets was also encouraged by the popularity of microprogrammed control (which is flexible to use)
 A typical CISC instruction set contains approx 120 – 350 instructions using variable instruction/data formats



Reduced Instruction Sets
 Computers started with simple instruction sets and gradually moved to complex (CISC) instruction sets
 During the 1980s, after 2 decades of using CISC processors, computer users began to reevaluate the performance relationship between:
 Instruction set architecture
 Available hardware/software technology
 Through many years of programming it was found that:
 Only about 25% of the instructions are used frequently, about 95% of the time
 This implies that about 75% of the hardware-supported instructions are often not used at all
 So why waste valuable chip area on rarely used instructions? (a good design question)
 So remove these rarely used instructions from the hardware & let software deal with them
 In addition, the saved chip area can be used to improve RISC performance
 A RISC instruction set contains fewer than 100 instructions with a fixed instruction format
 Only 3-5 simple addressing modes are used
 Hardwired control is used
 A large register file is provided, which supports fast context switching among multiple users



Architectural Distinctions
 Hardware features built into CISC & RISC processors are
compared below

 The following figures show the architectural distinctions between modern CISC & traditional RISC:
a. CISC architecture with microprogrammed control & a unified cache
b. RISC architecture with hardwired control & split instruction & data caches



CISC Architecture with Microprogrammed Control & Unified Cache

[Figure: the control unit uses a microprogrammed control memory; a single unified cache sits on the shared instruction & data path between the processor and main memory.]


RISC Architecture with Hardwired Control & Split Instruction Cache & Data Cache

[Figure: a hardwired control unit; separate instruction cache and data cache, with separate instruction and data paths to main memory.]


 CISC uses a unified cache for holding both instructions & data, so they must share the same data/instruction path
 RISC uses separate instruction & data caches, accessed over different paths; a rough sketch of the difference follows
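
A rough sketch (Python; the workload fractions are assumed for illustration, not from the slides) of why the shared path matters: with single-ported caches, a unified cache must serialize an instruction fetch and a data access that arrive in the same cycle, while split caches serve them in parallel:

def cycles_needed(n_instr, frac_mem, split_caches):
    """Cycles to process n_instr instructions when a fraction frac_mem of them
    also needs a data access in the same cycle (single-ported caches assumed)."""
    data_accesses = int(n_instr * frac_mem)
    if split_caches:
        return n_instr                       # instruction fetch and data access overlap
    return n_instr + data_accesses           # unified cache serializes the two accesses

print("unified cache:", cycles_needed(1000, 0.3, split_caches=False))   # 1300 cycles
print("split caches: ", cycles_needed(1000, 0.3, split_caches=True))    # 1000 cycles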



Architectural characteristics of CISC vs. RISC:
 Instruction set size & instruction formats: CISC has a large set of instructions with variable formats (16 – 64 bits per instruction); RISC has a small set of instructions with a fixed 32-bit format
 Addressing modes: 12-24 in CISC; limited to 3-5 in RISC
 General purpose registers & cache design: CISC has 8-24 GPRs and a unified cache for instructions & data; RISC has a large number (32-192) of GPRs with mostly split data & instruction caches
 Clock rate & CPI: CISC 30-50 MHz with CPI = 2-15; RISC 50-150 MHz with CPI < 1.5
 CPU control: CISC is mostly micro-coded using control memory (ROM), but modern CISC also uses hardwired control; RISC is mostly hardwired, without control memory
(a rough per-instruction comparison using these numbers follows)
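
Using representative numbers from this comparison (picking CPI = 10 for CISC, in the middle of the 2-15 range; the specific choice is an assumption for the example), the average time per instruction is:

\[ t_{\mathrm{instr}} = \frac{CPI}{f}: \qquad \text{CISC: } \frac{10}{50\,\mathrm{MHz}} = 200\,\mathrm{ns} \qquad \text{RISC: } \frac{1.5}{100\,\mathrm{MHz}} = 15\,\mathrm{ns} \]

keeping in mind that RISC programs are typically longer (more instructions), so the per-program gap is smaller than this per-instruction gap.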



CISC Scalar Processors
Digital Equipment VAX 8600 Processor Architecture
 It is an example of a typical CISC processor architecture
 It consists of 2 functional units for concurrent execution of integer & floating point instructions
 A unified cache is used for both instructions & data
 A translation lookaside buffer (TLB) is used in the memory control unit for fast generation of a physical address from a virtual address; a small sketch of TLB translation follows
 The performance of the processor pipelines relies on the cache hit ratio and on minimal branching damage to the pipeline flow
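
A small sketch (Python; the page size, page-table contents and addresses are invented for illustration, not taken from the VAX 8600) of the translation the TLB speeds up: the virtual page number is looked up in the TLB first, and only on a miss is the page table consulted:

PAGE_SIZE = 4096                        # assumed page size for the example

page_table = {0: 7, 1: 3, 2: 9}         # virtual page number -> physical frame (toy data)
tlb = {}                                # small cache of recent translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                      # TLB hit: physical address in one fast lookup
        frame = tlb[vpn]
    else:                               # TLB miss: walk the page table, then cache the entry
        frame = page_table[vpn]
        tlb[vpn] = frame
    return frame * PAGE_SIZE + offset

print(hex(translate(0x1234)))           # first access to page 1: TLB miss -> 0x3234
print(hex(translate(0x1560)))           # same page again: TLB hit   -> 0x3560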
Typical CISC Processor Architecture (VAX 8600)

[Figure: the instruction unit and the execution unit (integer ALU with 16 GPRs, floating point unit) are driven by a microprogrammed control memory; virtual addresses go to the memory & I/O control unit (cache and TLB), which connects to main memory and the I/O subsystem; a console attaches via the console bus and operands move over the operand bus.]
Example: The Motorola MC68040 Microprocessor Architecture
 The processor implements over 100 instructions using 16 general purpose registers
 A 4-Kbyte data cache
 A 4-Kbyte instruction cache, with separate memory management units (MMUs) supported by an Address Translation Cache (ATC)
 The ATC plays the role of the TLB used in other systems
 18 addressing modes are supported
 The integer unit is organized as a 6-stage instruction pipeline
 The floating point unit consists of 3 pipeline stages



Motorola MC68040 Microprocessor
Architecture



 Dual MMUs allow interleaved fetch of instructions & data from the main memory
 Both the address bus and the data bus are 32 bits wide
 3 simultaneous memory requests can be generated by the dual MMUs, including data operand read & write
 Snooping logic is built into the memory units to monitor bus events for cache invalidation (a small sketch follows this list)
 Memory management is provided for a virtual, demand-paged operating system
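
A rough sketch (Python; the line size, addresses and cache state are invented for illustration) of what the snooping logic does: it watches writes performed by other bus masters and invalidates any matching line in this processor's cache:

LINE = 16                                # assumed cache line size in bytes
cache = {0x1000 // LINE: "valid"}        # one cached line (toy state)

def snoop_bus_write(addr):
    """Called for every write observed on the bus from another master."""
    tag = addr // LINE
    if cache.get(tag) == "valid":
        cache[tag] = "invalid"           # another master wrote this line: invalidate our copy

snoop_bus_write(0x1008)                  # falls in the cached line
print(cache)                             # {256: 'invalid'}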



RISC Scalar Processor
 Scalar RISC processors are designed to issue one instruction per cycle
 In theory, RISC & CISC scalar processors should perform about the same if they run with the same clock rate & equal program length
 These 2 assumptions are not always valid, because the architecture affects the
 Quality
 Density of code generated by the compiler
 The reliance on a good compiler is much more demanding in a RISC processor than in a CISC processor
 Instruction-level parallelism is exploited by pipelining in both processor architectures
 Neither RISC nor CISC can perform as well as designed without
 A high clock rate
 A low CPI
 Good compilation support
 The simplicity of a RISC processor may lead to the ideal performance of the base scalar processor, shown below (time in base cycles):

Cycle:  1  2  3  4  5  6  7
I1:     F  D  E  W
I2:        F  D  E  W
I3:           F  D  E  W
I4:              F  D  E  W


Representative RISC Processors
 Four representative RISC-based processors all use 32-bit instructions
 Their instruction sets consist of 51 to 124 basic instructions
 Among the 4 scalar RISC processors, we choose to examine the Sun SPARC & Intel i860 architectures
 SPARC stands for Scalable Processor ARChitecture
 The scalability of SPARC refers to the use of a different number of register windows in different SPARC implementations
 Scalability in the M88100, by contrast, refers to the number of special functional units (SFUs)



SUN Microsystems SPARC Architecture
 The SPARC implements the floating point unit (FPU) on a separate coprocessor
 It contains a RISC integer unit (IU) implemented with 2 to 32 register windows
 The SPARC runs each procedure with a set of thirty-two 32-bit IU registers
 8 of these registers are global registers shared by all procedures
 The other 24 are window registers
 Each register window is divided into 3 eight-register sections, labeled INs, LOCALs and OUTs
 The LOCAL registers are only locally addressable by each procedure



 The INs and OUTs are shared among procedures

[Figure: overlapping register windows. In each window, registers r31-r24 are the INs, r23-r16 the LOCALs and r15-r8 the OUTs; the OUTs of one window overlap the INs of the next window, and registers r7-r0 are the Globals shared by all windows.]


 The calling procedure passes parameters to the called procedure via its OUT registers (r8 to r15), which become the INs of the called procedure
 The window of the currently running procedure is called the active window, pointed to by the Current Window Pointer (CWP)
 A Window Invalid Mask (WIM) is used to indicate which windows are invalid
 The overlapping windows can significantly reduce the time required for interprocedure communication, resulting in much faster context switching among cooperative procedures; a small sketch of this overlap follows
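
A minimal sketch (Python; the window count and the register values are invented for illustration) of the overlap: the OUT registers of window w occupy the same physical registers as the IN registers of window w+1, so a call passes its arguments without copying:

N_WINDOWS = 8                                  # assumed number of windows
REGS = [0] * (16 * N_WINDOWS + 8)              # physical window registers (plus room for the last OUTs)

def phys_index(cwp, reg):
    """Physical index of window register r8-r31 for window `cwp`
    (r0-r7 are the globals and live in a separate file)."""
    base = 16 * cwp
    if 24 <= reg <= 31:  return base + (reg - 24)        # INs
    if 16 <= reg <= 23:  return base + 8 + (reg - 16)    # LOCALs
    if 8 <= reg <= 15:   return base + 16 + (reg - 8)    # OUTs = INs of window cwp+1
    raise ValueError("r0-r7 are globals")

REGS[phys_index(0, 8)] = 42        # caller (window 0) writes an argument into r8, its first OUT
print(REGS[phys_index(1, 24)])     # callee (window 1) reads r24, its first IN: prints 42, no copy made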



RISC Impacts
 A RISC processor lacks some of the sophisticated instructions found in CISC processors
 The increase in RISC program length implies more instruction traffic & greater memory demand
 Problems are also caused by the large register file
 Although the register file
 Reduces data traffic between CPU & memory
 Holds intermediate results
 RISC hardwired control is less flexible than microcoded control



Superscalar & Vector Processors
 CISC or RISC scalar processors can be improved with a superscalar or vector architecture
 Scalar processors are those executing 1 instruction per cycle (i.e., only 1 instruction issued per cycle)
 In a superscalar processor, multiple instruction pipelines are used
 This implies multiple instructions are issued per cycle and multiple results are generated per cycle
 A vector processor executes vector instructions on arrays of data
 Thus, each instruction involves a string of repeated operations, which are ideal for pipelining, producing one result per cycle; compare the sketch below
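
A rough sketch (Python; the data is made up) of the difference: the scalar view issues one add instruction per element, while the vector view expresses the whole array operation as a single instruction whose repeated element operations stream through a pipelined adder:

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]

# scalar processor: one add instruction fetched, decoded and issued per element
c_scalar = []
for x, y in zip(a, b):
    c_scalar.append(x + y)

# vector processor: a single vector-add instruction covers all the elements
c_vector = [x + y for x, y in zip(a, b)]

print(c_scalar == c_vector)     # True: same result, far fewer instructions issued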
Superscalar Processors
 Superscalar processors are designed to exploit more instruction-level parallelism in user programs
 Only INDEPENDENT instructions can be executed in parallel without causing a wait state
 The amount of instruction-level parallelism varies widely depending on the type of code being executed
 A figure in the text illustrates the use of 3 instruction pipelines in parallel in a triple-issue processor
 Superscalar processors were developed as an alternative to vector processors
 A superscalar processor of degree m can issue m instructions per cycle



 Thus, the base scalar processor, implemented in either RISC or CISC, has m = 1
 A superscalar machine that can issue a fixed-point, a floating-point, a load and a branch instruction all in one cycle achieves the same effective parallelism as a vector machine executing a vector load chained to a vector add
 In a typical architecture for a superscalar RISC:
 Multiple instruction pipelines are used
 The instruction cache supplies multiple instructions per fetch
 The actual number of instructions issued to the various functional units may vary each cycle
 The number of instructions issued is constrained by
 Data dependencies
 Resource conflicts among instructions (see the sketch below)
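
A minimal issue-stage sketch (Python; the instruction tuples and unit names are invented for illustration): each cycle, instructions are scanned in program order and issued until the degree is reached, a data dependence on an earlier instruction in the group is found, or a functional unit is already claimed, i.e. exactly the two constraints listed above:

DEGREE = 3                                     # a triple-issue processor (assumed)

def issue_one_cycle(window, busy_units=None):
    """window: list of (dest_reg, source_regs, functional_unit) in program order."""
    busy_units = set() if busy_units is None else busy_units
    issued, written = [], set()
    for dest, srcs, unit in window:
        if len(issued) == DEGREE:
            break
        if any(s in written for s in srcs):    # data dependence on an instruction issued this cycle
            break
        if unit in busy_units:                 # resource conflict on a functional unit
            break
        busy_units.add(unit)
        written.add(dest)
        issued.append(dest)
    return issued

window = [
    ("r1", ("r2", "r3"), "alu"),
    ("r4", ("r5", "r6"), "fpu"),
    ("r7", ("r1", "r8"), "alu"),               # needs r1 (just written) and the busy ALU
]
print(issue_one_cycle(window))                 # ['r1', 'r4']: only 2 of the 3 issue this cycle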
A Typical Architecture for Superscalar RISC

[Figure: instructions flow from instruction memory through the instruction cache to two decoders, one feeding the integer unit and one the floating point unit; each unit has its own register file, reorder buffer and reservation stations (RS); the integer unit contains branch, ALU, shifter, load and store units, while the floating point unit contains float add, convert, multiply, divide, load and store units; loads and stores reach data memory through the data cache over the address/data path.]
 Multiple functional units are built into the integer unit and into the floating point unit
 Multiple data buses exist among the functional units
 All functional units can be used simultaneously, provided there are
 No resource conflicts
 No dependencies among the instructions

