
Introduction

Background: ECGR 3183 or equivalent, based on
Hennessy and Patterson's Computer Organization and
Design


Computer Technology
Performance improvements:
Improvements in semiconductor technology
- Feature size, clock speed
- Transistor density increases by 35% per year and die size
increases by 10-20% per year -> more functionality
- Transistor speed improves linearly with size (a complex
equation involving voltages, resistances, capacitances),
which can lead to clock speed improvements
- Wire delays do not scale down at the same rate as logic
delays
- The power wall: it is not possible to consistently run at higher
frequencies without hitting power/thermal limits (Turbo Mode
can cause occasional frequency boosts)
- In a clock cycle, more work can be done -- since transistors are
faster, transistors are more energy-efficient, and there are
more of them

Computer Technology
Performance improvements:
Improvements in computer architectures
- Enabled by HLL compilers, UNIX
- Led to RISC architectures
Together these have enabled:
- Lightweight computers
- Productivity-based managed/interpreted programming
languages


Single Processor Performance


[Figure: growth in single-processor performance over time, annotated with
the RISC era of rapid improvement and the move to multi-processor designs
after ~2003]


Current Trends in Architecture


Cannot continue to leverage instruction-level
parallelism (ILP)
Single-processor performance improvement ended in
2003

New models for performance:
Data-level parallelism (DLP)
Thread-level parallelism (TLP)
Request-level parallelism (RLP)

These require explicit restructuring of the
application


Classes of Computers
Personal Mobile Device (PMD)
e.g., smartphones, tablet computers
Emphasis on cost, energy efficiency, media performance, and real-time
performance

Desktop Computing
Emphasis on price-performance, energy, graphics

Servers
Emphasis on availability, scalability, throughput, energy

Clusters / Warehouse-Scale Computers
Used for Software as a Service (SaaS)
Emphasis on availability, price-performance, energy proportionality
Sub-class: Supercomputers, emphasis: floating-point performance and
fast internal networks

Embedded Computers
Emphasis: price

Embedded Processor Characteristics


The largest class of computers, spanning the widest range
of applications and performance

Often have minimum performance requirements. Example?

Often have stringent limitations on cost. Example?

Often have stringent limitations on power consumption.
Example?

Often have low tolerance for failure. Example?


Characteristics of Each Segment


Desktop: Need high performance (graphics performance?)
and low price
Servers: Need high throughput, availability (revenue
losses of $14K-$6.5M per hour of downtime), scalability
Embedded: Need low power, low cost, low memory (almost
don't care about performance, except real-time constraints)

Feature           Desktop           Server                  Embedded
System price      $1000 - $10,000   $10,000 - $10,000,000   $10 - $100,000
Processor price   $100 - $1000      $200 - $2000            $0.20 - $200
Design issues     Price-perf,       Throughput,             Price, power, app-
                  graphics          availability,           specific performance
                                    scalability

Parallelism
Classes of parallelism in applications:
Data-Level Parallelism (DLP)
Task-Level Parallelism (TLP)

Classes of architectural parallelism:
Instruction-Level Parallelism (ILP)
Vector architectures/Graphics Processor Units (GPUs)
Thread-Level Parallelism (TLP)
Request-Level Parallelism (RLP)


Flynn's Taxonomy
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data streams (SIMD)
Vector architectures
Multimedia extensions
Graphics processor units

Multiple instruction streams, single data stream (MISD)
No commercial implementation

Multiple instruction streams, multiple data streams
(MIMD)
Tightly-coupled MIMD
Loosely-coupled MIMD


Defining Computer Architecture


Old view of computer architecture:
Instruction Set Architecture (ISA) design
i.e. decisions regarding:
- registers, memory addressing, addressing modes,
instruction operands, available operations, control flow
instructions, instruction encoding

Real computer architecture:
Specific requirements of the target machine
Design to maximize performance within constraints:
cost, power, and availability
Includes ISA, microarchitecture, hardware

Trends in Technology
Integrated circuit technology
Transistor density: 35%/year
Die size: 10-20%/year
Integration overall: 40-55%/year

DRAM capacity: 25-40%/year (slowing)


Flash capacity: 50-60%/year
15-20X cheaper/bit than DRAM

Magnetic disk technology: 40%/year
15-25X cheaper/bit than Flash
300-500X cheaper/bit than DRAM


Bandwidth and Latency


Bandwidth or throughput
Total work done in a given time
10,000-25,000X improvement for processors
300-1200X improvement for memory and disks

Latency or response time


Time between start and completion of an event
30-80X improvement for processors
6-8X improvement for memory and disks


Bandwidth and Latency

[Figure: log-log plot of bandwidth and latency milestones]



Transistors and Wires


Feature size
Minimum size of transistor or wire in x or y dimension
10 microns in 1971 to 0.032 microns in 2011
Transistor performance scales linearly
- Wire delay does not improve with feature size!
Integration density scales quadratically


Trends in Cost
Cost driven down by learning curve
Yield

DRAM: price closely tracks cost

Microprocessors: price depends on volume


10% less for each doubling of volume


Integrated Circuit Cost


Integrated circuit

Bose-Einstein formula:

Defects per unit area = 0.016-0.057 defects per square cm (2010)


N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
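A minimal Python sketch of this cost model (the wafer cost, dies per wafer,
and the particular defect density and N below are assumed illustration
values, not figures from the slides):

    def die_yield(defects_per_cm2, die_area_cm2, n, wafer_yield=1.0):
        # Bose-Einstein model: wafer yield x 1 / (1 + D x A)^N
        return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** n

    def die_cost(wafer_cost, dies_per_wafer, yield_fraction):
        return wafer_cost / (dies_per_wafer * yield_fraction)

    y = die_yield(defects_per_cm2=0.03, die_area_cm2=1.0, n=13.5)
    print(round(y, 3))                          # ~0.671 for these inputs
    print(round(die_cost(5000.0, 400, y), 2))   # assumed $5000 wafer, 400 dies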

Measuring Performance
Typical performance metrics:
Response time
Throughput

Speedup of X relative to Y
= Execution time_Y / Execution time_X

Execution time
Wall clock time: includes all system overheads
CPU time: only computation time

Benchmarks
SPEC CPU 2006: CPU-oriented programs (for desktops)
SPECweb, TPC: throughput-oriented (for servers)
EEMBC: for embedded processors/workloads


Principles of Computer Design


Take Advantage of Parallelism
e.g. multiple processors, disks, memory banks,
pipelining, multiple functional units

Principle of Locality
Reuse of data and instructions

Focus on the Common Case

Amdahl's Law


Compute Speedup - Amdahl's Law

Speedup due to enhancement (E):

Speedup(E) = Time_before / Time_after = ExTime_before / ExTime_after

Let F be the fraction where the enhancement is applied (also called the
parallel fraction) and (1-F) the serial fraction; let S be the speedup of
the enhanced fraction. Then:

Execution time_after = ExTime_before x [(1-F) + F/S]

Speedup(E) = ExTime_before / ExTime_after = 1 / [(1-F) + F/S]
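A minimal Python sketch of the formula (function name and inputs are my
own; the 0.4/10 case matches the Web-serving example later in these notes):

    def amdahl_speedup(f, s):
        # Overall speedup = 1 / ((1 - F) + F / S)
        return 1.0 / ((1.0 - f) + f / s)

    print(amdahl_speedup(0.4, 10))   # 1.5625
    print(amdahl_speedup(0.2, 10))   # ~1.22 (the FPSQR example later)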

Principles of Computer Design


The Processor Performance Equation:

CPU time = Instruction count x CPI x Clock cycle time


Principles of Computer Design


Different instruction types have different CPIs:

CPU clock cycles = Sum over i = 1..n of (IC_i x CPI_i)


Measuring Performance

How do we conclude that System-A is better than
System-B?


Performance Metrics
Purchasing perspective
given a collection of machines, which has the
best performance?
least cost?
best cost/performance?

Design perspective
faced with design options, which has the
best performance improvement?
least cost?
best cost/performance?

Both require
basis for comparison
metric for evaluation

Our goal is to understand what factors in the architecture
contribute to overall system performance and the relative
importance (and cost) of these factors

Defining (Speed) Performance


Normally interested in reducing
Response time (execution time) - the time between the start and
the completion of a task
- Important to individual users

Thus, to maximize performance, need to minimize execution time

performance_X = 1 / execution_time_X

If X is n times faster than Y, then

performance_X / performance_Y = execution_time_Y / execution_time_X = n

Throughput - the total amount of work done in a given time
- Important to data center managers

Decreasing response time almost always improves throughput

Performance Factors
Want to distinguish elapsed time and the time spent on
our task
CPU execution time (CPU time) - time the CPU spends
working on a task
Does not include time waiting for I/O or running other programs

CPU execution time for a program
= # CPU clock cycles for a program x clock cycle time

or

CPU execution time for a program
= # CPU clock cycles for a program / clock rate

Can improve performance by reducing either the length
of the clock cycle or the number of clock cycles required
for a program

Clock Cycles per Instruction


Not all instructions take the same amount of time to
execute
One way to think about execution time is that it equals the
number of instructions executed multiplied by the average time
per instruction

# CPU clock cycles for a program
= # Instructions for a program x Average clock cycles per instruction

Clock cycles per instruction (CPI) - the average number
of clock cycles each instruction takes to execute
A way to compare two different implementations of the same
architecture

Instruction class:               A   B   C
CPI for this instruction class:  1   2   3

Effective CPI
Computing the overall effective CPI is done by looking at
the different types of instructions and their individual
cycle counts and averaging:

Overall effective CPI = Sum over i = 1..n of (CPI_i x IC_i)

where IC_i is the count (percentage) of the number of instructions
of class i executed, CPI_i is the (average) number of clock cycles
per instruction for that instruction class, and n is the number of
instruction classes

The overall effective CPI varies by instruction mix - a
measure of the dynamic frequency of instructions across
one or many programs
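A minimal Python sketch of this computation (the 50/30/20 mix over the
A/B/C classes from the previous slide is assumed for illustration):

    def effective_cpi(mix):
        # mix: list of (fraction of instructions, CPI of that class) pairs
        return sum(frac * cpi for frac, cpi in mix)

    print(effective_cpi([(0.5, 1), (0.3, 2), (0.2, 3)]))   # 1.7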

THE Performance Equation


Our basic performance equation is then

CPU time = Instruction_count x CPI x clock_cycle

or

CPU time = Instruction_count x CPI / clock_rate

These equations separate the three key factors that
affect performance
Can measure the CPU execution time by running the program
The clock rate is usually given
Can measure overall instruction count by using profilers/
simulators without knowing all of the implementation details
CPI varies by instruction type and ISA implementation, for which
we must know the implementation details

Determinants of CPU Performance

CPU time = Instruction_count x CPI x clock_cycle

                         Instruction_count   CPI   clock_cycle
Algorithm                       X             X
Programming language            X             X
Compiler                        X             X
ISA                             X             X         X
Processor organization                        X         X
Technology                                              X

A Simple Example

Op       Freq   CPI_i   Freq x CPI_i   better       branch       2 ALU ops
                        (base)         data cache   prediction   per cycle
ALU      50%    1       .5             .5           .5           .25
Load     20%    5       1.0            .4           1.0          1.0
Store    10%    3       .3             .3           .3           .3
Branch   20%    2       .4             .4           .2           .4
Overall CPI             2.2            1.6          2.0          1.95

How much faster would the machine be if a better data cache
reduced the average load time to 2 cycles?
CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster

How does this compare with using branch prediction to shave
a cycle off the branch time?
CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster

What if two ALU instructions could be executed at once?
CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster
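The same what-if analysis as a small Python sketch (the dictionary layout
and the modeling of paired ALU ops as half the frequency weight are my own
choices):

    base = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}

    def cpi(mix):
        return sum(f * c for f, c in mix.values())

    old = cpi(base)                              # 2.2
    better_cache = dict(base, Load=(0.20, 2))    # loads now take 2 cycles
    branch_pred = dict(base, Branch=(0.20, 1))   # one cycle shaved off branches
    dual_alu = dict(base, ALU=(0.25, 1))         # paired ALU ops -> half the weight
    for name, mix in [("cache", better_cache), ("branch", branch_pred), ("2xALU", dual_alu)]:
        print(name, round(cpi(mix), 2), round((old / cpi(mix) - 1) * 100, 1), "% faster")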

                m/c A   m/c B   m/c C
Prog 1 (secs)   1       10      20
Prog 2 (secs)   1000    100     20

Total time: 1001, 110, and 40 seconds


Comparing and Summarizing Performance


How do we summarize the performance for benchmark
set with a single number?
The average of execution times that is directly proportional to
total execution time is the arithmetic mean (AM)
n

AM =

1/n

Timei

i=1

Where Timei is the execution time for the ith program of a total of
n programs in the workload
A smaller mean indicates a smaller average execution time and
thus improved performance

Guiding principle in reporting performance measurements


is reproducibility list everything another experimenter
would need to duplicate the experiment (version of the
operating system, compiler settings, input set used,
specific computer configuration (clock rate, cache sizes
and speed, memory size and speed, etc.))

Normalized Execution Time

A second approach to an unequal mixture of programs in the workload - normalize execution times
to a reference m/c and then take the average of the normalized execution times.

Average normalized execution time - either an arithmetic or geometric mean

Geometric mean = nth root of [Product over i = 1..n of (execution time ratio_i)]

where execution time ratio_i is the execution time for program i normalized to the reference m/c

Interesting property:

Geometric mean(X_i) / Geometric mean(Y_i) = Geometric mean(X_i / Y_i)
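A quick Python check of this property using the A/B/C example times from
the tables below - whichever machine is the reference, the geometric means
keep the same ratios:

    from math import prod

    times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}   # P1, P2 times

    def gmean(xs):
        return prod(xs) ** (1.0 / len(xs))

    for ref in times:
        norm = {m: gmean([t / r for t, r in zip(ts, times[ref])])
                for m, ts in times.items()}
        print(ref, {m: round(v, 2) for m, v in norm.items()})
    # Every reference gives A and B tied, with C at 0.63x of them.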


Normalized Execution Times

Geometric means of normalized execution times are consistent - it does not matter which m/c is
the reference

Not true for the arithmetic mean - it should not be used to average normalized execution times

Weighted arithmetic means - weightings are proportional to execution times, influenced by
frequency of use in the workload, type of m/c, and size of program input. The geometric mean is
independent of the running times of the individual programs, and any m/c can be used for
normalization

Example: comparative performance evaluation where the programs are fixed but the inputs are not.
Competitors could inflate a weighted arithmetic mean by using their best-performing
benchmark with the longest input.
The geometric mean would be less misleading than the arithmetic mean in such situations


Normalized Execution Times

        Normalized to A         Normalized to B         Normalized to C
        A      B      C         A      B      C         A      B      C
P1      1.0    10.0   20.0      0.1    1.0    2.0       0.05   0.5    1.0
P2      1.0    0.1    0.02      10.0   1.0    0.2       50.0   5.0    1.0
AM      1.0    5.05   10.01     5.05   1.0    1.1       25.03  2.75   1.0
GM      1.0    1.00   0.63      1.0    1.0    0.63      1.58   1.58   1.0
Total   1.0    0.11   0.04      9.1    1.0    0.36      25.03  2.75   1.0

(Raw times: Prog 1 = 1, 10, and 20 secs; Prog 2 = 1000, 100, and 20 secs
on m/c A, B, and C respectively.)


Normalized Execution Times

From the geometric means:

P1 and P2 perform the same on A and B - true only if P1 runs 100 times for every run of
P2
For such a workload the execution times on machines A and B are 1100 secs, and the
execution time on m/c C is 2020 secs. Yet the geometric mean suggests C is faster than A and
B.

Conclusion - in general there is no workload for three or more machines that will
match the performance predicted by the geometric means of normalized execution
times.

Another problem with the GM - it encourages h/w and s/w designers to focus their
attention on the benchmarks where performance is easiest to improve rather than on
the benchmarks that are slowest.
Example: cutting running time from 2 seconds to 1 counts the same as cutting it from
10,000 seconds to 5,000


Example
A new laptop has an IPC that is 20% worse than the old
laptop. It has a clock speed that is 30% higher than the old
laptop. The same binaries run on both machines.
What is the speedup of the new laptop?


Example

A new laptop has an IPC that is 20% worse than the old
laptop. It has a clock speed that is 30% higher than the old
laptop. The same binaries run on both machines. Their IPCs are
listed below. The binaries are run such that each binary
gets an equal share of CPU time.
What is the speedup of the new laptop?

          P1    P2    P3
Old-IPC   1.2   1.6   2.0
New-IPC   1.6   1.6   1.6


Example

A new laptop has an IPC that is 20% worse than the old
laptop. It has a clock speed that is 30% higher than the old
laptop. The same binaries run on both machines. Their IPCs are
listed below. The binaries are run such that each binary
gets an equal share of CPU time.
What is the speedup of the new laptop?

          P1    P2    P3    AM    GM
Old-IPC   1.2   1.6   2.0   1.6   1.57
New-IPC   1.6   1.6   1.6   1.6   1.6
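A sketch of the intended reasoning (my wording, following the AM/GM row
above): because each binary gets an equal share of CPU time, the
instructions each completes are proportional to its IPC x clock rate, so
total throughput tracks the arithmetic mean of the IPCs. Speedup =
(AM_new x clock_new) / (AM_old x clock_old) = (1.6 x 1.3) / (1.6 x 1.0) = 1.3.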


Speedup Vs. Percentage


Speedup is a ratio = old exec time / new exec time
Improvement, Increase, Decrease usually refer to
percentage relative to the baseline
= (new perf old perf) / old perf
A program ran in 100 seconds on the old laptop and in 70
seconds on the new laptop
What is the speedup?
What is the percentage increase in performance?
What is the reduction in execution time?
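Worked answers (my arithmetic): speedup = 100/70 = 1.43; percentage
increase in performance = (1/70 - 1/100) / (1/100) = 43%; reduction in
execution time = (100 - 70) / 100 = 30%.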


Quantitative Principles of Computer Design

Designing for performance and cost-performance

One of the most important principles - make the common case fast: favor the frequent case over
the infrequent case in design trade-offs

How much performance can be improved by making the frequent case faster?

Amdahl's Law: The performance improvement to be gained from using some faster mode of execution
is limited by the fraction of the time the faster mode can be used - it defines the speedup that
can be gained by using a particular feature.

speedup = performance for the entire task using the enhancement /
          performance for the entire task without using the enhancement

speedup = execution time for the entire task without using the enhancement /
          execution time for the entire task using the enhancement

Quantitative Principles of Computer Design

Depends on two factors:

The fraction of the computation time in the original machine that can be converted to take
advantage of the enhancement
Example: if 20 seconds of the execution time of a program that takes 80 seconds in total
can use an enhancement, the fraction is 20/80

fraction_enhanced <= 1

The improvement gained by the enhanced execution mode - how much faster the task
would run if the enhanced mode were used for the entire program

execution time_new = execution time_old x
                     [(1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced]

The overall speedup is the ratio of the execution times:

speedup_overall = execution time_old / execution time_new
                = 1 / [(1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced]

Quantitative Principles of Computer Design


Example
Suppose we are considering an enhancement to the processor of a server
system used for Web serving. The new CPU is 10 times faster on
computation in the Web serving application than the original processor.
Assuming that the original CPU is busy with computation 40% of the time
and is waiting for I/O 60% of the time, what is the overall speedup gained
by incorporating the enhancement?

speedup_overall = 1 / [(1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced]

fraction_enhanced = 0.4
speedup_enhanced = 10

speedup_overall = 1 / (0.6 + 0.4/10) = 1 / 0.64 = 1.56


Quantitative Principles of Computer Design

Amdahl's Law expresses the law of diminishing returns - the
incremental improvement in speedup gained by an additional improvement
in the performance of just a portion of the computation diminishes as
improvements are added.

If an enhancement is only usable for a fraction of a task, we cannot speed
up the task by more than 1/(1 - fraction)

Caution! Do not confuse the fraction of time converted to use an enhancement
with the fraction of time after the enhancement is in use


Quantitative Principles of Computer Design


Example: A common transformation required in graphics engines is square
root. Implementations of floating-point (FP) square root vary significantly in
performance, especially among processors designed for graphics. Suppose
FP square root (FPSQR) is responsible for 20% of the execution time of a
critical graphics benchmark. One proposal is to enhance the FPSQR
hardware and speed up this operation by a factor of 10. The other alternative
is just to try to make all FP instrcutions in the graphics processor run faster by
1.6; FP instructions are responsible for a total of 50 % of the execution time
for the application. The design team believes that they can make all FP
instructions run 1.6 times faster with the same effort required for the fast
square root. Compare the two design alternatives.

speedup FPSQR =

speedup FP =

1
(1 - 0.2) +

0.2
10

1
0.5
(1 - 0.5) +
1.6

= 1.22

= 1.23

Improving the performance of the FP operations overall is slightly better because


of the higher frequency

The CPU Performance Equation

Sometimes it is difficult to measure the time consumed by different operations

Alternative - decompose CPU execution time into different components and see
how different alternatives affect these components

CPU time = CPU clock cycles for a program x clock cycle time

CPU time = CPU clock cycles for a program / clock rate

CPI = CPU clock cycles for a program / Instruction count

In terms of the number of instructions executed - instruction path length or instruction
count (IC) - and clock cycles per instruction (CPI)
This figure of merit provides insight into different styles of instruction sets and implementations

CPU Performance Equation

CPU clock cycles for a program = IC x CPI

CPU time = IC x CPI x clock cycle time

CPU time = IC x CPI / clock rate

Dimensionally:
instructions/program x clock cycles/instruction x seconds/clock cycle
= seconds/program = CPU time

CPU performance is dependent on three characteristics: clock cycle (or rate), clock cycles per
instruction, and instruction count

Difficult to change one parameter without affecting the others:
Clock cycle time - hardware technology and organization
CPI - organization and instruction set architecture
Instruction count - ISA and compiler technology

Many performance improvement techniques improve one component of CPU performance
with small or predictable impact on the other two

CPU Performance Equation

Sometimes it is useful to calculate the total number of CPU clock cycles as

CPU clock cycles = Sum over i = 1..n of (IC_i x CPI_i)

CPU time = [Sum over i = 1..n of (IC_i x CPI_i)] x clock cycle time

CPI = [Sum over i = 1..n of (IC_i x CPI_i)] / IC
    = Sum over i = 1..n of [(IC_i / IC) x CPI_i]

Note: CPI_i should be measured and not just calculated from the reference table

Possible uses of the CPU performance equation:
High-level performance comparisons
Back-of-the-envelope calculations
Enable architects to think about compilers and technology


CPU Performance Equation

Example
Suppose we have the following measurements:

Frequency of FP operations = 25%
Average CPI of FP operations = 4.0
Average CPI of other operations = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20

Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the
average CPI of all the FP operations to 2.5. Compare the two design alternatives using the
CPU performance equation.

Note: only the CPI changes; the clock rate and IC do not change


CPU Performance Equation


CPI without enhancement:

CPI_original = Sum over i of [CPI_i x (IC_i / IC)]
             = (4 x 25%) + (1.33 x 75%) = 2.0

CPI for enhanced FPSQR:

CPI_new FPSQR = CPI_original - 2% x (CPI_old FPSQR - CPI_new FPSQR only)
              = 2.0 - 2% x (20 - 2) = 1.64

CPI for enhanced FP instructions:

CPI_new FP = (75% x 1.33) + (25% x 2.5) = 1.62

Speedup_new FP = CPU time_original / CPU time_new FP
               = (IC x clock cycle time x CPI_original) / (IC x clock cycle time x CPI_new FP)
               = 2.0 / 1.62 = 1.23

CPU Performance Equation


Example
Machine A: 40% ALU operations (1 cycle), 24% loads (1 cycle), 11% stores (2 cycles), 25%
branches (2 cycles)

Should one-cycle stores be implemented if it slows the clock by 15%?

old CPI = 0.4 + 0.24 + (0.11 x 2) + (0.25 x 2) = 1.36 cycles/instruction

new CPI = 0.4 + 0.24 + 0.11 + (0.25 x 2) = 1.25 cycles/instruction

Speedup = (IC x old CPI x old clock cycle time) / (IC x new CPI x new clock cycle time)
        = (1.36 x T) / (1.25 x 1.15T) = 0.946 < 1

NO!

CPU Performance Equation

How do we actually measure execution time and CPI?

Execution time: the time command (Unix/Linux)

Calculate CPI

Break down CPI into compute, memory stalls, etc. - identifies performance problems

CPI breakdown measurement:

Hardware event counters

Cycle-level simulators


Harmonic Mean

If performance is expressed as a rate, then the average that tracks total execution
time is the harmonic mean:

HM = n / [Sum over i = 1..n of (1 / Rate_i)]

Note: the arithmetic mean cannot be used for rates

Example: travel at 40 MPH for one mile and 80 MPH for one mile - the average is not 60 MPH

HM = 2 / (1/40 + 1/80) = 53.33 MPH
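As a one-function Python sketch (the name is my own), checked against the
example:

    def harmonic_mean(rates):
        return len(rates) / sum(1.0 / r for r in rates)

    print(harmonic_mean([40, 80]))   # 53.33... MPH, matching the slide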


Little's Law

Key relationship between latency and bandwidth:

The average number of objects in a queue is the product of the entry rate and the average
holding time

average number of objects = arrival rate x average holding time
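For example (numbers chosen purely for illustration): if requests arrive at
200 per second and each stays in the system for 50 ms on average, then the
queue holds 200 x 0.05 = 10 requests on average.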


Power and Energy


Problem: Get power in, get power out
Thermal Design Power (TDP)
Characterizes sustained power consumption
Used as target for power supply and cooling system
Lower than peak power, higher than average power
consumption

Clock rate can be reduced dynamically to limit
power consumption
Energy per task is often a better measurement


Dynamic Energy and Power


Dynamic energy
Transistor switch from 0 -> 1 or 1 -> 0
Energy = 1/2 x Capacitive load x Voltage^2

Dynamic power
Power = 1/2 x Capacitive load x Voltage^2 x Frequency switched

Reducing clock rate reduces power, not energy


Power
Intel 80386 consumed ~2 W
A 3.3 GHz Intel Core i7 consumes 130 W
Heat must be dissipated from a 1.5 x 1.5 cm chip
This is the limit of what can be cooled by air


Reducing Power
Techniques for reducing power:
Do nothing well
Dynamic Voltage-Frequency Scaling
Low power state for DRAM, disks
Overclocking, turning off cores


Static Power
Static power consumption
Power_static = Current_static x Voltage
Scales with number of transistors
To reduce: power gating


Power Trends

The Power Wall

In CMOS IC technology:

Power = Capacitive load x Voltage^2 x Frequency

Over roughly three decades, frequency rose ~1000x and supply voltage
dropped from 5V to 1V, yet power still grew ~30x

Reducing Power
Suppose a new CPU has
85% of the capacitive load of the old CPU
15% voltage reduction and 15% frequency reduction

P_new / P_old = [C_old x 0.85 x (V_old x 0.85)^2 x F_old x 0.85] / [C_old x V_old^2 x F_old]
              = 0.85^4 = 0.52

The power wall:
We can't reduce voltage further
We can't remove more heat
How else can we improve performance?
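The same ratio as a tiny Python sketch (the scaling factors are those given
above):

    def dynamic_power_ratio(c_scale, v_scale, f_scale):
        # P ~ C x V^2 x F, so the ratio is the product of the scaling factors
        return c_scale * v_scale ** 2 * f_scale

    print(round(dynamic_power_ratio(0.85, 0.85, 0.85), 3))   # 0.522, about half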

Dependability
Module reliability
Mean time to failure (MTTF)
Mean time to repair (MTTR)
Mean time between failures (MTBF) = MTTF + MTTR
Availability = MTTF / MTBF


Define and quantify dependability

Module reliability = measure of continuous service
accomplishment (or time to failure).
Two metrics:

1. Mean Time To Failure (MTTF) measures reliability
2. Failures In Time (FIT) = 1/MTTF, the rate of failures
Traditionally reported as failures per billion hours of operation

Mean Time To Repair (MTTR) measures service
interruption
Mean Time Between Failures (MTBF) = MTTF + MTTR

Module availability measures service as alternating
between the 2 states of accomplishment and
interruption (a number between 0 and 1, e.g. 0.9)

Module availability = MTTF / (MTTF + MTTR)


Example calculating reliability

If modules have exponentially distributed lifetimes
(age of module does not affect probability of
failure), the overall failure rate is the sum of the failure
rates of the modules

Calculate FIT and MTTF for 10 disks (1M hour
MTTF per disk), 1 disk controller (0.5M hour
MTTF), and 1 power supply (0.2M hour MTTF):

FailureRate = 10 x (1/1,000,000) + 1/500,000 + 1/200,000
            = (10 + 2 + 5) / 1,000,000 = 17 / 1,000,000 failures per hour
            = 17,000 FIT (failures per billion hours)

MTTF = 1,000,000,000 / 17,000 = ~59,000 hours

Performance Improvement

Exploiting properties of programs

Principle of Locality - programs tend to reuse data and instructions they have used
recently
Rule of thumb - a program spends 90% of its execution time in only 10% of the code. We
can predict with reasonable accuracy what instructions and data a program will use in the
near future based on its accesses in the recent past.
The principle of locality also applies to data accesses, though not as strongly as to code accesses.
Two types of locality:

Temporal locality - recently accessed items are likely to be accessed again in the
near future
Spatial locality - items whose addresses are near one another tend to be
referenced close together in time


Performance Improvement

Take Advantage of Parallelism

One of the most important methods for improving performance

Parallelism at different levels. For example:

System level - multiprocessors and multiple disks for servers; workload spread among
CPUs or disks improves throughput; scalability is important for server applications
Individual processors - parallelism at the instruction level, such as pipelining
Detailed digital design - such as carry-lookahead ALUs and set-associative
caches for parallel searches

One common class of techniques is parallel computation of two or more possible
outcomes, followed by late selection, such as carry-select adders and the handling of
branches in pipelines


Multiprocessors
Multicore microprocessors
More than one processor per chip

Requires explicitly parallel programming


Compare with instruction level parallelism
- Hardware executes multiple instructions at once
- Hidden from the programmer
Hard to do
- Programming for performance
- Load balancing
- Optimizing communication and synchronization


Fallacies and Pitfalls

Fallacy - the relative performance of two processors with the same ISA can be judged by clock
rate or by the performance of a single benchmark suite.
As processors have become faster and more sophisticated, processor performance in one
application area can diverge from that in another area. Sometimes the ISA is responsible,
but more often the pipeline structure and memory system are responsible. Thus the clock
rate is not a good metric.
[Figure: performance of a 1.7 GHz Pentium 4 relative to a 1 GHz Pentium III]


Fallacies and Pitfalls

Pitfall - comparing hand-coded assembly and compiler-generated high-level language
performance
Embedded market - several factors encourage hand-coding, at least of key loops: the
importance of a few small loops to overall performance (real-time performance) in some
embedded applications, and the inclusion of instructions that can significantly boost the
performance of certain computations but that compilers cannot effectively use
Hand-coding of the critical parts can lead to large performance gains

Machine                    EEMBC           Compiler-generated   Hand-coded    Ratio
                           benchmark set   performance          performance   hand/compiler
Trimedia 1300 @ 166 MHz    Consumer        23.3                 110           4.7
BOPS Manta @ 136 MHz       Telecomm        2.6                  225.8         86.8
TI TMS320C6203 @ 300 MHz   Telecomm        6.8                  68.5          10.1


Fallacies and Pitfalls

Fallacy - peak performance tracks observed performance

One definition of peak performance is the performance level a machine is guaranteed
not to exceed. The gap between peak performance and observed performance is typically a
factor of 10 or more for supercomputers. Since the gap is large and can vary significantly,
peak performance is not useful in predicting observed performance unless the workload
consists of small programs that operate close to peak.
Peak MIPS - obtained by choosing an instruction mix that minimizes the CPI, even if the
mix is totally impractical.


Fallacies and Pitfalls

Fallacy - the best design for a computer is the one that optimizes the primary objective without
considering implementation
Complex designs take longer to complete and prolong time to market - the design will be less
competitive

Pitfall - neglecting the cost of software in either evaluating a system or examining
cost-performance
For many years, hardware was so expensive that it dominated the cost of software, but this
is no longer true
For a medium-size database server the software costs are roughly 50% of the total cost, while
for a desktop the software cost is somewhere between 23% and 38%


Fallacies and Pitfalls

Pitfall - falling prey to Amdahl's Law

Expending tremendous effort optimizing some aspect of a system before measuring its
usage

Fallacy - MIPS is an accurate measure for comparing performance among processors

The embedded market uses Dhrystone as the benchmark of choice and reports performance as
Dhrystone MIPS

MIPS = Instruction count / (Execution time x 10^6) = clock rate / (CPI x 10^6)

Execution time = IC / (MIPS x 10^6)

Easy to understand
Faster machines have higher MIPS ratings - matches intuition


Fallacies and Pitfalls

Problems with using MIPS as a measure for comparison:

MIPS is dependent on the instruction set, making it difficult to
compare MIPS of computers with different instruction sets

MIPS varies between programs on the same computer

MIPS can vary inversely to performance. Example: the MIPS rating of a
machine with optional floating-point hardware. The machine takes more clock
cycles per floating-point instruction than per integer instruction, so
floating-point programs using the optional hardware instead of software
floating-point routines take less time but have a lower MIPS rating. Software
floating point executes simpler instructions, resulting in a higher MIPS
rating, but it executes so many instructions that the execution time is
longer.

MIPS is sometimes used by a single vendor within a single set of
machines designed for a given class of applications


Fallacies and Pitfalls

Designers started reporting relative MIPS - the embedded market reports relative
Dhrystone MIPS

Relative MIPS for a machine M was defined relative to some
reference machine as:

MIPS_M = (Performance_M / Performance_reference) x MIPS_reference

MFLOPS measured the inverse of execution time


Other Performance Metrics


Power consumption - especially in the embedded market,
where battery life is important (and cooling is passive)
For power-limited applications, the most important metric is
energy efficiency


Summary: Evaluating ISAs

Design-time metrics:
Can it be implemented, in how long, at what cost?
Can it be programmed? Ease of compilation?

Static Metrics:
How many bytes does the program occupy in memory?

Dynamic Metrics:
How many instructions are executed? How many bytes does the
processor fetch to execute the program?
How many clocks are required per instruction? (CPI)
How "lean" a clock is practical? (cycle time)

Best metric: time to execute the program!

Execution time depends on instruction count, CPI, and cycle time -
that is, on the instruction set, the processor organization, and
compilation techniques.
