
CIS775: Computer Architecture

Chapter 1: Fundamentals of Computer Design


1

Course Objectives
- To evaluate the issues involved in choosing and designing instruction sets.
- To learn the concepts behind advanced pipelining techniques.
- To understand the "hitting the memory wall" problem and the current state of the art in memory system design.
- To understand the qualitative and quantitative tradeoffs in the design of modern computer systems.
2

What is Computer Architecture?


Functional operation of the individual HW units within a computer system, and the flow of information and control among them.

[Figure: Architecture = Interface Design (ISA) + Computer Hardware Organization, shaped by Technology, Parallelism, Applications, OS, the Programming Language Interface, and Measurement & Evaluation.]

3

Computer Architecture Topics


- Input/Output and Storage: Disks, WORM, Tape; RAID; Emerging Technologies
- Memory Hierarchy: L1 Cache, L2 Cache, DRAM; Interleaved Memories; Coherence, Bandwidth, Latency
- VLSI
- Instruction Set Architecture: Addressing, Protection, Exception Handling
- Pipelining and Instruction Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, DSP

4

Computer Architecture Topics


[Figure: multiple Processor-Memory (P-M) nodes attached through Network Interfaces to an Interconnection Network.]

- Multiprocessors, Networks and Interconnections
- Shared Memory, Message Passing, Data Parallelism
- Topologies, Routing, Bandwidth, Latency, Reliability
- Processor-Memory-Switch

Measurement and Evaluation


Architecture is an iterative process:
- Searching the space of possible designs
- At all levels of computer systems

[Figure: a Design / Analysis loop driven by Creativity and Cost / Performance Analysis, sorting Good Ideas from Mediocre Ideas and Bad Ideas.]
6

Issues for a Computer Designer


- Functional Requirements Analysis (Target)
  - Scientific computing: high-performance floating point
  - Business: transactional support / decimal arithmetic
  - General purpose: balanced performance for a range of tasks
- Level of software compatibility
  - PL level: flexible, needs a new compiler; portability is an issue
  - Binary level (x86 architecture): little flexibility; portability requirements minimal
- OS requirements
  - Address space issues, memory management, protection
- Conformance to standards
  - Languages, OS, networks, I/O, IEEE floating point

Computer Systems: Technology Trends


- 1988: Supercomputers, Massively Parallel Processors, Mini-supercomputers, Minicomputers, Workstations, PCs
- 2002: Powerful PCs and SMP Workstations, Networks of SMP Workstations, Mainframes, Supercomputers, Embedded Computers

Why Such Change in 10 years?


- Performance
  - Technology advances: CMOS (complementary metal oxide semiconductor) VLSI dominates older technologies like TTL (transistor-transistor logic) in cost AND performance
  - Computer architecture advances improve the low end: RISC, pipelining, superscalar, RAID, ...
- Price: lower costs due to
  - Simpler development: CMOS VLSI means smaller systems, fewer components
  - Higher volumes
  - Lower margins by class of computer, due to fewer services
- Function: rise of networking / local interconnection technology

Growth in Microprocessor Performance

10

Six Generations of DRAMs

11

Updated Technology Trends (Summary)


        Capacity        Speed (latency)
Logic   4x in 4 years   2x in 3 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 2 years   2x in 10 years

Network (bandwidth): 10x in 5 years

Updates during your study period?
- BS (4 yrs)
- MS (2 yrs)
- PhD (5 yrs)
12
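The "study period" question reduces to compound growth: a technology that improves by a factor R every P years improves by R^(years/P) over any interval. A minimal sketch, using the DRAM capacity rate from the table above (4x in 3 years):

```python
# Compound technology scaling: a factor R every P years
# gives R ** (years / P) over an arbitrary interval.
def growth(factor, period_years, years):
    return factor ** (years / period_years)

# DRAM capacity, 4x in 3 years (rate from the table above)
for degree, years in [("BS", 4), ("MS", 2), ("PhD", 5)]:
    print(f"{degree}: {growth(4, 3, years):.1f}x")
```

This prints roughly 6.3x over a BS, 2.5x over an MS, and 10.1x over a PhD.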

Cost of Microprocessors


13

Integrated Circuits Costs


IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Die cost = Wafer cost / (Dies per wafer * Die yield)

Dies per wafer = [pi * (Wafer_diam / 2)^2 / Die_Area] - [pi * Wafer_diam / sqrt(2 * Die_Area)] - Test dies

Die yield = Wafer yield * (1 + Defects_per_unit_area * Die_Area / alpha)^(-alpha)

Die cost goes roughly with (Die area)^4

14
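These formulas drop into a short script. This is a sketch only: the wafer cost, diameter, and defect density below are hypothetical example values, and alpha (a model parameter reflecting manufacturing complexity, commonly taken around 3) is an assumption, not a number given on the slide.

```python
import math

# Die-cost model from the equations above (lengths in cm,
# areas in cm^2, defect density in defects per cm^2).
def dies_per_wafer(wafer_diam, die_area, test_dies=0):
    gross = math.pi * (wafer_diam / 2) ** 2 / die_area
    edge_loss = math.pi * wafer_diam / math.sqrt(2 * die_area)
    return gross - edge_loss - test_dies

def die_yield(wafer_yield, defect_density, die_area, alpha=3.0):
    return wafer_yield * (1 + defect_density * die_area / alpha) ** -alpha

def die_cost(wafer_cost, wafer_diam, die_area,
             wafer_yield=1.0, defect_density=0.8):
    return wafer_cost / (dies_per_wafer(wafer_diam, die_area)
                         * die_yield(wafer_yield, defect_density, die_area))

# Hypothetical 20 cm wafer at $1000, 1 cm^2 die:
print(f"${die_cost(1000, 20, 1.0):.2f} per die")
```

Doubling `die_area` in this model cuts the good dies per wafer and the yield at the same time, which is why die cost grows so much faster than die area.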

Performance Trends (Summary)


Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months)
Improvement in cost-performance estimated at 70% per year

15

Computer Engineering Methodology


[Figure: an iterative loop]
- Evaluate existing systems for bottlenecks (using benchmarks and workloads)
- Simulate new designs and organizations (guided by technology trends and implementation complexity)
- Implement the next-generation system

16

How to Quantify Performance?


Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAD/Sud Concorde   3 hours       1350 mph   132          178,200

- Time to run the task (ExTime)
  - Execution time, response time, latency
- Tasks per day, hour, week, sec, ns (Performance)
  - Throughput, bandwidth
17

The Bottom Line: Performance and Cost or Cost and Performance?


"X is n times faster than Y" means ExTime(Y) --------ExTime(X) Performance(X) --------------Performance(Y) =

Speed of Concorde vs. Boeing 747 Throughput of Boeing 747 vs. Concorde Cost is also an important parameter in the equation which is why concordes are being put to pasture! 18
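Plugging the airliner table into the "n times faster" definition shows how the winner depends on the metric; a minimal sketch using only the numbers from the table above:

```python
# "X is n times faster than Y":
#   n = ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)
def times_faster(extime_y, extime_x):
    return extime_y / extime_x

# Latency: Concorde (3 h) vs. Boeing 747 (6.5 h)
print(f"Concorde: {times_faster(6.5, 3):.2f}x faster in flight time")

# Throughput (pmph): 747 (286,700) vs. Concorde (178,200)
print(f"747: {286_700 / 178_200:.2f}x the passenger throughput")
```

Each plane is "faster" under one metric and slower under the other, which is the point of the slide.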

Measurement Tools
- Benchmarks, Traces, Mixes
- Hardware: cost, delay, area, power estimation
- Simulation (many levels): ISA, RT, Gate, Circuit
- Queuing Theory
- Rules of Thumb
- Fundamental Laws/Principles

Understanding the limitations of any measurement tool is crucial.
19

Metrics of Performance
[Figure: each level of the system stack has its own performance metric]
- Application: answers per month, operations per second
- Programming Language / Compiler: (millions of) instructions per second (MIPS), (millions of) floating-point operations per second (MFLOP/s)
- ISA / Datapath, Control: megabytes per second
- Function Units / Transistors, Wires, Pins: cycles per second (clock rate)
20

Cases of Benchmark Engineering


The motivation is to tune the system to the benchmark to achieve peak performance.
- At the architecture level
  - Specialized instructions
- At the compiler level (compiler flags)
  - Blocking in SPEC89: a factor-of-9 speedup
  - Incorrect compiler optimizations/reordering: would work fine on the benchmark but not on other programs
- At the I/O level
  - SPEC92 spreadsheet program (sp): companies noticed that the output was always written to a file (so they stored the results in a memory buffer) and only flushed it at the end, which was not measured. One company eliminated the I/O altogether.

21

After putting in a blazing performance on the benchmark test, Sun issued a glowing press release claiming that it had outperformed Windows NT systems on the test. Pendragon president Ivan Phillips cried foul, saying the results weren't representative of real-world Java performance and that Sun had gone so far as to duplicate the test's code within Sun's Just-In-Time compiler. That's cheating, says Phillips, who claims that benchmark tests and real-world applications aren't the same thing. Did Sun issue a denial or a mea culpa? Initially, Sun neither denied optimizing for the benchmark test nor apologized for it. "If the test results are not representative of real-world Java applications, then that's a problem with the benchmark," Sun's Brian Croll said. After taking a beating in the press, though, Sun retreated and issued an apology for the optimization. [Excerpted from PC Online, 1997]
22

Issues with Benchmark Engineering


- Motivated by the bottom dollar: good performance on classic suites means more customers and better sales.
- Benchmark engineering limits the longevity of benchmark suites.
- Technology and applications also limit the longevity of benchmark suites.
23

SPEC: System Performance Evaluation Cooperative


- First Round, 1989
  - 10 programs yielding a single number (SPECmarks)
- Second Round, 1992
  - SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
  - Compiler flags unlimited
- Third Round, 1995
  - New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
  - Benchmarks useful for 3 years
  - Single flag setting for all programs: SPECint_base95, SPECfp_base95
- SPEC CPU2000
  - 11 integer benchmarks (CINT2000) and 14 floating-point benchmarks (CFP2000)
24

SPEC 2000 (CINT 2000)Results

25

SPEC 2000 (CFP 2000)Results

26

Reporting Performance Results


- Reproducibility
- Apply them on publicly available benchmarks.
- Pecking/picking order:
  - Real programs
  - Real kernels
  - Toy benchmarks
  - Synthetic benchmarks
27

How to Summarize Performance


- Arithmetic mean (weighted arithmetic mean) tracks execution time: sum(Ti)/n or sum(Wi*Ti)
- Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n/sum(1/Ri) or 1/sum(Wi/Ri)
- Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)
- But do not take the arithmetic mean of normalized execution times; use the geometric mean = (Product(Ri))^(1/n)
28
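The three summaries can be compared directly in code. A minimal sketch; the two-machine timing data at the end is made up, purely to illustrate why the geometric mean is the right choice for normalized times:

```python
# Summarizing a suite: arithmetic mean of times, harmonic mean
# of rates, geometric mean of normalized ratios.
def arithmetic_mean(times):
    return sum(times) / len(times)

def harmonic_mean(rates):
    return len(rates) / sum(1 / r for r in rates)

def geometric_mean(ratios):
    prod = 1.0
    for r in ratios:
        prod *= r
    return prod ** (1 / len(ratios))

# Hypothetical times (seconds) for two machines on two programs:
a, b = [2.0, 10.0], [4.0, 5.0]

# The geometric mean is reference-independent: normalizing B to A
# and A to B gives summaries that are exact reciprocals.
b_vs_a = geometric_mean([bi / ai for ai, bi in zip(a, b)])
a_vs_b = geometric_mean([ai / bi for ai, bi in zip(a, b)])
assert abs(b_vs_a * a_vs_b - 1.0) < 1e-12
```

An arithmetic mean of the same normalized ratios would rank the machines differently depending on which one is chosen as the reference.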

Performance Evaluation
For better or worse, benchmarks shape a field. Good products are created when you have:
- Good benchmarks
- Good ways to summarize performance

Given that sales is a function, in part, of performance relative to the competition, companies invest in improving the product as reported by the performance summary. If the benchmarks/summary are inadequate, then they must choose between improving the product for real programs and improving the product to get more sales; sales almost always wins! Execution time is the measure of computer performance!

29

Simulations
- When are simulations useful?
- What are their limitations, i.e., what real-world phenomena do they not account for?
- The larger the simulation trace, the less tractable the post-processing analysis.
30

Queueing Theory
What are the distributions of arrival rates and values for other parameters?
Are they realistic?

What happens when the parameters or distributions are changed?


31

Quantitative Principles of Computer Design


- Make the common case fast
  - Amdahl's Law
- CPU Performance Equation
  - Clock cycle time, CPI, instruction count
- Principle of locality
- Take advantage of parallelism


32

Amdahl's Law
Speedup due to enhancement E:
Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected.
33

Amdahl's Law
ExTime_new = ExTime_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
34
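Amdahl's Law drops straight into a one-line function; a minimal sketch:

```python
# Amdahl's Law: a fraction f of the task is sped up by factor s;
# the remaining (1 - f) is unaffected.
def speedup_overall(f, s):
    return 1 / ((1 - f) + f / s)

# Enhancing everything (f = 1) recovers the full factor s:
assert speedup_overall(1.0, 4) == 4.0

# The FP example worked later in the deck: f = 0.10, s = 2
print(round(speedup_overall(0.10, 2), 3))  # 1.053
```

Note how little the overall speedup is when f is small: making the common case fast matters far more than making a rare case very fast.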

Amdahl's Law
Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
ExTime_new = ?
Speedup_overall = ?

35

CPU Performance Equation


CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

What affects each component:

               Inst Count   CPI   Clock Rate
Program        X
Compiler       X            (X)
Inst. Set      X            X
Organization                X     X
Technology                        X
36
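The equation above is straightforward to evaluate; a sketch with hypothetical numbers (1 billion instructions, a CPI of 1.5, and a 500 MHz clock, chosen only for illustration):

```python
# CPU time = Instruction count x CPI x cycle time
#          = Instruction count x CPI / clock rate
def cpu_time(inst_count, cpi, clock_rate_hz):
    return inst_count * cpi / clock_rate_hz

t = cpu_time(1_000_000_000, 1.5, 500_000_000)
print(f"{t:.1f} seconds")  # 3.0 seconds
```

The table shows why the three factors are hard to improve independently: the compiler and ISA move instruction count and CPI together, while organization and technology move CPI and clock rate together.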

Cycles Per Instruction


Average cycles per instruction:

CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count

CPU time = CycleTime * sum(i = 1 to n) [ CPI_i * I_i ]

where I_i is the number of instructions of type i executed.

Instruction frequency:

CPI = sum(i = 1 to n) [ CPI_i * F_i ],  where F_i = I_i / Instruction Count

Invest resources where time is spent!


37

Example: Calculating CPI


Base Machine (Reg / Reg), typical mix:

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
                Total CPI: 1.5

38
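The table's arithmetic can be checked with CPI = sum of CPI_i * F_i; the mix and cycle counts below are exactly the ones given above:

```python
# Instruction mix from the table: class -> (frequency, cycles)
mix = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

# CPI = sum over classes of CPI_i * F_i
cpi = sum(freq * cyc for freq, cyc in mix.values())
print(f"CPI = {cpi:.1f}")  # CPI = 1.5

# Fraction of execution *time* per class (the % Time column)
for op, (freq, cyc) in mix.items():
    print(f"{op}: {freq * cyc / cpi:.0%}")
```

The % Time column, not the frequency column, tells you where to invest: loads and branches each cost 27% of the time despite being only 20% of the instructions.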

Chapter Summary, #1
Designing to Last through Trends
        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 3 years   2x in 10 years

- 6 yrs to graduate => 16X CPU speed, DRAM/Disk size
- Time to run the task: execution time, response time, latency
- Tasks per day, hour, week, sec, ns: throughput, bandwidth
- "X is n times faster than Y" means ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)
39

Chapter Summary, #2
Amdahl's Law:

Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

CPI Law:

CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

- Execution time is the REAL measure of computer performance!
- Good products are created when you have good benchmarks and good ways to summarize performance.
- Die cost goes roughly with (die area)^4.

40

Food for thought


Two companies report results on two benchmark suites: one a Fortran suite and the other a C++ suite. Company A's product outperforms Company B's on the Fortran suite; the reverse holds for the C++ suite. Assume the performance differences are similar in both cases. Do you have enough information to compare the two products? What information would you need?
41

Food for Thought II


In the CISC vs. RISC debate, a key argument of the RISC movement was that, because of its simplicity, RISC would always remain ahead: if there were enough transistors to implement a CISC on chip, then those same transistors could implement a pipelined RISC; if there were enough to allow a pipelined CISC, there would be enough for an on-chip cache for RISC; and so on. After 20 years of this debate, what do you think? Hint: think of commercial PCs, Moore's law, and some of the data in the first chapter of the book (and on these slides).
42

Amdahl's Law (answer)


Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old

Speedup_overall = 1 / 0.95 = 1.053

43
