Digital Design

Chapter 6:
Optimizations and Tradeoffs
Slides to accompany the textbook Digital Design, with RTL Design, VHDL, and Verilog, 2nd Edition,
by Frank Vahid, John Wiley and Sons Publishers, 2010.
http://www.ddvahid.com

Copyright 2010 Frank Vahid

Instructors of courses requiring Vahid's Digital Design textbook (published by John Wiley and Sons) have permission to modify and use these slides for customary course-related activities, subject to keeping this copyright notice in place and unmodified. These slides may be posted as unanimated pdf versions on publicly-accessible course websites. PowerPoint source (or pdf with animations) may not be posted to publicly-accessible websites, but may be posted for students on internal protected sites or distributed directly to students by other electronic means. Instructors may make printouts of the slides available to students for a reasonable photocopying charge, without incurring royalties. Any other use requires explicit permission. Instructors may obtain PowerPoint source or obtain special use permissions from Wiley; see http://www.ddvahid.com for information.
6.1 Introduction
We now know how to build digital circuits
How can we build better circuits?
Let's consider two important design criteria
  Delay: the time from inputs changing to new correct stable output
  Size: the number of transistors
For quick estimation, assume
  Every gate has a delay of 1 gate-delay
  Every gate input requires 2 transistors
  Ignore inverters
Transforming F1 to F2 represents an optimization: better in all criteria of interest
  F1 = wxy + wxy' = wx(y + y') = wx = F2
  F1: 16 transistors, 2 gate-delays; F2: 4 transistors, 1 gate-delay
[Figure: circuits for F1 and F2, and a size-versus-delay plot showing F2 better than F1 in both criteria]
Note: Slides with animation are denoted with a small red "a" near the animated items
Introduction
Tradeoff
  Improves some, but worsens other, criteria of interest
Transforming G1 to G2 represents a tradeoff: some criteria better, others worse.
  G1 = wx + wy + z; G2 = w(x + y) + z
  G1: 14 transistors, 2 gate-delays; G2: 12 transistors, 3 gate-delays
[Figure: circuits for G1 and G2, and a size-versus-delay plot showing G2 smaller but slower than G1]
Introduction
Optimizations
  All criteria of interest are improved (or at least kept the same)
Tradeoffs
  Some criteria of interest are improved, while others are worsened
[Figure: two size-versus-delay plots contrasting an optimization with a tradeoff]
We obviously prefer optimizations, but often must accept tradeoffs
  You can't build a car that is the most comfortable, and has the best fuel efficiency, and is the fastest; you have to give up something to gain other things.
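The quick-estimation rules above (1 gate-delay per gate level, 2 transistors per gate input, inverters ignored) are easy to mechanize. A minimal Python sketch, ours rather than the book's, that reproduces the slide's F1/F2 numbers for a sum-of-products given as a list of product terms:

```python
def estimate(terms):
    """Estimate (transistors, gate-delays) of a sum-of-products.

    terms is a list of product terms; each term is a list of literal
    strings such as "w" or "y'". Rules of thumb from the slide:
    2 transistors per gate input, 1 gate-delay per gate level.
    """
    gate_inputs = sum(len(t) for t in terms if len(t) > 1)  # AND gates
    if len(terms) > 1:
        gate_inputs += len(terms)                           # the OR gate
    levels = (1 if any(len(t) > 1 for t in terms) else 0) \
           + (1 if len(terms) > 1 else 0)
    return 2 * gate_inputs, levels

f1 = [["w", "x", "y"], ["w", "x", "y'"]]   # F1 = wxy + wxy'
f2 = [["w", "x"]]                          # F2 = wx
print(estimate(f1))  # (16, 2)
print(estimate(f2))  # (4, 1)
```

The estimator confirms that transforming F1 into F2 improves both criteria at once.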
6.2 Combinational Logic Optimization and Tradeoffs
Two-level size optimization using algebraic methods
  Goal: two-level circuit (ANDed literals, ORed AND gates) with fewest transistors
  Though transistors are getting cheaper (Moore's Law), they still cost something
Define the problem algebraically
  Sum-of-products yields two levels
    F = abc + ab'c is sum-of-products; G = w(xy + z) is not.
  Transform the sum-of-products equation to have fewest literals and terms
  Each literal and term translates to a gate input, each of which translates to about 2 transistors (see Ch. 2)
  For simplicity, ignore inverters
Example
  F = xyz + xyz' + x'y'z + x'y'z'
  F = xy(z + z') + x'y'(z + z')
  F = xy*1 + x'y'*1
  F = xy + x'y'
  4 literals + 2 terms = 6 gate inputs = 12 transistors
[Figure: two-level circuits for the original and minimized F]
Note: Assuming 4-transistor 2-input AND/OR circuits; in reality, only NAND/NOR use only 4 transistors.
Algebraic Two-Level Size Optimization
Previous example showed a common algebraic minimization method
  (Multiply out to sum-of-products, then...)
  Apply the following as much as possible
    ab + ab' = a(b + b') = a*1 = a
    Combining terms to eliminate a variable (formally called the Uniting theorem)
      F = xyz + xyz' + x'y'z + x'y'z'
      F = xy(z + z') + x'y'(z + z')
      F = xy*1 + x'y'*1
      F = xy + x'y'
  Duplicating a term sometimes helps
    Doesn't change the function: c + d = c + d + d = c + d + d + d + d ...
      F = xyz + xyz' + xy'z
      F = xyz + xyz' + xy'z + xyz
      F = xy(z + z') + xz(y + y')
      F = xy + xz
  Sometimes after combining terms, can combine the resulting terms
      G = xy'z' + xy'z + xyz' + xyz
      G = xy'(z' + z) + xy(z' + z)
      G = xy' + xy (now do again)
      G = x(y' + y)
      G = x
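The uniting theorem ab + ab' = a can be applied mechanically. A minimal Python sketch (the term-as-frozenset representation is our own; note it never duplicates terms, so like the naive hand method it can miss minima that require the duplication trick):

```python
from itertools import combinations

def unite_once(terms):
    """Combine one pair of terms via ab + ab' = a; return None if none unites."""
    for t1, t2 in combinations(terms, 2):
        diff = t1 ^ t2                        # literals not shared by both
        if len(diff) == 2:
            u, v = diff
            if u.strip("'") == v.strip("'"):  # same variable, opposite polarity
                return (terms - {t1, t2}) | {t1 & t2}
    return None

def minimize(terms):
    """Repeatedly apply the uniting theorem until no pair combines."""
    terms = set(terms)
    while True:
        nxt = unite_once(terms)
        if nxt is None:
            return terms
        terms = nxt

# G = xy'z' + xy'z + xyz' + xyz
G = {frozenset({"x", "y'", "z'"}), frozenset({"x", "y'", "z"}),
     frozenset({"x", "y", "z'"}), frozenset({"x", "y", "z"})}
print(minimize(G))  # {frozenset({'x'})}, i.e. G = x
```

Running it on the slide's G reproduces the "combine, then combine again" result G = x.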
Karnaugh Maps for Two-Level Size Optimization
Easy to miss possible opportunities to combine terms when working algebraically
Karnaugh Maps (K-maps)
  Graphical method to help us find opportunities to combine terms
  Minterms differing in one variable are adjacent in the map
  Notice the columns are not in binary order (00, 01, 11, 10); treat the left and right columns as adjacent too
  Can clearly see opportunities to combine terms: look for adjacent 1s
Example: F = x'y'z' + x'y'z + xyz + xyz'
  [K-map: row x=0 has 1s in columns yz=00 and 01; row x=1 has 1s in columns yz=11 and 10]
  For F, clearly two opportunities
  The top-left circle is shorthand for: x'y'z' + x'y'z = x'y'(z' + z) = x'y'(1) = x'y'
  Draw a circle, write the term that has all the literals except the one that changes in the circle
    Circle xy: x=1 and y=1 in both cells of the circle, but z changes (z=1 in one cell, 0 in the other)
  Minimized function: OR the final terms: F = x'y' + xy
Easier than algebraically:
  F = x'y'z' + x'y'z + xyz + xyz'
  F = x'y'(z' + z) + xy(z + z')
  F = x'y'*1 + xy*1
  F = x'y' + xy
K-maps
Four adjacent 1s means two variables can be eliminated
  G = xy'z' + xy'z + xyz' + xyz
  G = x(y'z' + y'z + yz' + yz)
  G = x(y'(z' + z) + y(z' + z))
  G = x(y' + y)
  G = x
  Makes intuitive sense: those two variables appear in all combinations, so one term must be true
Draw one big circle: shorthand for the algebraic transformations above
  [K-map: row x=0 all 0s; row x=1 all 1s, one circle labeled x]
Draw the biggest circle possible, or you'll have more terms than really needed
  [K-map: the same 1s covered by two smaller circles, xy' and xy, yielding more terms]
K-maps
Four adjacent cells can be in the shape of a square
  H = x'y'z + x'yz + xy'z + xyz = z
  (x and y appear in all combinations)
  [K-map: 1s fill columns yz=01 and 11 of both rows, circled as z]
OK to cover a 1 twice
  Just like duplicating a term; remember, c + d = c + d + d
  I = x'y'z + xy'z' + xy'z + xyz + xyz'
  [K-map: a 1 at x'y'z in row x=0, and all four cells of row x=1; circles y'z and x]
  The two circles are shorthand for:
    I = x'y'z + xy'z' + xy'z + xyz + xyz'
    I = x'y'z + xy'z + xy'z' + xy'z + xyz + xyz'
    I = (x'y'z + xy'z) + (xy'z' + xy'z + xyz + xyz')
    I = y'z + x
No need to cover 1s more than once
  Doing so yields extra terms: not minimized
  [K-map for J: an unneeded middle circle covers only 1s already covered, adding an extra term]
K-maps
Circles can cross the left/right sides
  Remember, the edges are adjacent; minterms there differ in one variable only
  K = x'y'z + xy'z' + xyz' = x'y'z + xz'
  [K-map: the 1s at xy'z' and xyz' wrap around the sides to form the circle xz']
Circles must have 1, 2, 4, or 8 cells; 3, 5, or 7 cells not allowed
  3/5/7 doesn't correspond to algebraic transformations that combine terms to eliminate a variable
  [K-map for L: three adjacent 1s must be covered by two circles, not one 3-cell circle]
Circling all the cells is OK
  The function then just equals 1
  [K-map for E: all eight cells are 1, one big circle; E = 1]
K-maps for Four Variables
Four-variable K-map follows the same principle
  Adjacent cells differ in one variable
  Left/right cells are adjacent; top/bottom cells are also adjacent
  Example: F = w'xy' + yz
  [Four-variable K-map: rows wx = 00, 01, 11, 10; the yz=11 column is circled as yz, and cells w'xy'z' and w'xy'z are circled as w'xy']
5- and 6-variable maps exist
  But are hard to use
Two-variable maps exist
  But are not very useful; such functions are easy to do algebraically by hand
  [Second example: a four-variable map for G with 1s wherever z=1; G = z]
Two-Level Size Optimization Using K-maps
General K-map method
  1. Convert the function's equation into sum-of-minterms form
  2. Place 1s in the appropriate K-map cells for each minterm
  3. Cover all 1s by drawing the fewest largest circles, with every 1 included at least once; write the corresponding term for each circle
  4. OR all the resulting terms to create the minimized function.
Two-Level Size Optimization Using K-maps
Common to revise steps (1) and (2):
  Create a sum-of-products
  Draw 1s for each product term
Ex: F = w'xz + yz + w'xy'z'
  [Four-variable K-map: w'xz fills cells wxyz=0101 and 0111; yz fills the yz=11 column; w'xy'z' fills cell 0100]
Two-Level Size Optimization Using K-maps
Example: Minimize G = a + a'b'c' + b(c' + bc')
  1. Convert to sum-of-products
     G = a + a'b'c' + bc' + bc'
  2. Place 1s in the appropriate cells
     [K-map: rows a=0, a=1; columns bc = 00, 01, 11, 10; row a=0 has 1s at b'c' and bc'; row a=1 is all 1s]
  3. Cover 1s
     [Two circles: a (the bottom row) and c' (the left and right columns, wrapping)]
  4. OR terms: G = a + c'
Two-Level Size Optimization Using K-maps
Four-variable example
  Minimize: H = a'b'(c'd' + cd') + ab'c'd' + ab'cd' + a'bd + a'bcd'
  1. Convert to sum-of-products:
     H = a'b'c'd' + a'b'cd' + ab'c'd' + ab'cd' + a'bd + a'bcd'
  2. Place 1s in K-map cells
  3. Cover 1s
  4. OR the resulting terms: H = b'd' + a'bc + a'bd
  [K-map: the four corner 1s form one circle, b'd'; a funny-looking circle, but remember that left/right cells are adjacent, and top/bottom cells are adjacent]
Don't Care Input Combinations
What if we know that particular input combinations can never occur?
  e.g., Minimize F = xy'z', given that x'y'z' (xyz = 000) can never be true, and that xy'z (xyz = 101) can never be true
  So it doesn't matter what F outputs when x'y'z' or xy'z is true, because those cases will never occur
  Thus, make F be 1 or 0 for those cases in a way that best minimizes the equation
On the K-map
  Draw Xs for the don't care combinations
  Include an X in a circle ONLY if it minimizes the equation
  Don't include the other Xs
[K-maps: good use of don't cares circles y'z' (covering the X at 000 along with the 1 at 100), giving F = y'z'; unnecessary use also circles xy' (covering the X at 101), resulting in an extra, unneeded term]
Optimization Example Using Don't Cares
Minimize: F = a'b'c + a'bc' + abc'
  Given don't cares: a'bc, abc
  [K-map: Xs at a'bc and abc; circles a'c and b]
  F = a'c + b
Note: Introduce don't cares with caution
  Must be sure that we really don't care what the function outputs for that input combination
  If we do care, even the slightest, then it's probably safer to set the output to 0
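For a 3-variable function, the effect of don't cares can be shown by brute force: search for the fewest product terms that cover the on-set while touching no off-set minterm (don't cares may be covered or not). A Python sketch under our own encoding (minterms as bit tuples; a term assigns each variable 0, 1, or None for "absent"):

```python
from itertools import product, combinations

def covers(term, minterm):
    """A term covers a minterm if every fixed position matches."""
    return all(t is None or t == m for t, m in zip(term, minterm))

def min_cover(on_set, dc_set, nvars=3):
    """Smallest set of product terms covering on_set and avoiding the off-set."""
    off_set = [m for m in product((0, 1), repeat=nvars)
               if m not in on_set and m not in dc_set]
    candidates = [t for t in product((0, 1, None), repeat=nvars)
                  if not any(covers(t, m) for m in off_set)]
    for size in range(1, len(candidates) + 1):
        for cand in combinations(candidates, size):
            if all(any(covers(t, m) for t in cand) for m in on_set):
                return cand
    return ()

# F = a'b'c + a'bc' + abc', with don't cares a'bc and abc
on = [(0, 0, 1), (0, 1, 0), (1, 1, 0)]
dc = [(0, 1, 1), (1, 1, 1)]
cover = min_cover(on, dc)
print(len(cover))  # 2, e.g. a'c and b
```

Without the don't cares the same search needs more terms; with them, two terms suffice, matching F = a'c + b.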
Optimization with Don't Cares Example: Sliding Switch
Switch with 5 positions
  A 3-bit value gives the position in binary
Want a circuit that
  Outputs 1 when the switch is in position 2, 3, or 4
  Outputs 0 when the switch is in position 1 or 5
Note that the 3-bit input can never be binary 0, 6, or 7
  Treat those as don't care input combinations
[K-maps: without don't cares, F = x'y + xy'z'; with don't cares (Xs at 000, 110, 111), F = y + z']
Automating Two-Level Logic Size Optimization
Minimizing by hand
  Is hard for functions with 5 or more variables
  May not yield the minimum cover, depending on the order we choose
  Is error prone
Minimization thus typically done by automated tools
  Exact algorithm: finds the optimal solution
  Heuristic: finds a good solution, but not necessarily optimal
[K-maps for the same function I: (a) one choice of circles needs 4 terms; (b) a better choice covers all 1s with only 3 terms: y'z', x'z, xy]
Basic Concepts Underlying Automated Two-Level Logic Size Optimization
Definitions
  On-set: all minterms that define when F = 1
  Off-set: all minterms that define when F = 0
  Implicant: any product term (minterm or other) that when 1 causes F = 1
    On a K-map, any legal (but not necessarily largest) circle
    [K-map: 4 implicants of F shown: x'y'z, xyz, xyz', and xy]
  Cover: implicant xy covers minterms xyz and xyz'
  Expanding a term: removing a variable (like enlarging a K-map circle)
    xy is an expansion of xyz
  Prime implicant: maximally expanded implicant; any expansion would cover 1s not in the on-set
    x'y'z and xy, above
    But not xyz or xyz'; they can be expanded
Note: We use K-maps here just for intuitive illustration of concepts; automated tools do not use K-maps.
Basic Concepts Underlying Automated Two-Level Logic Size Optimization
Definitions (continued)
  Essential prime implicant: the only prime implicant that covers a particular minterm in a function's on-set
  Importance: we must include all essential PIs in a function's cover
  In contrast, some, but not all, non-essential PIs will be included
  [K-map for G: prime implicants x'y' (essential), y'z (not essential), xz (not essential), xy (essential)]
Automated Two-Level Logic Size Optimization Method
[Table: the method's steps: (1) determine all prime implicants, (2) add essential PIs to the cover, (3) cover the remaining minterms with the fewest remaining PIs]
Steps 1 and 2 are exact
Step 3: Hard. Checking all possibilities: exact, but computationally expensive. Checking some but not all: heuristic.
Tabular Method Step 1: Determine Prime Implicants
Methodically compare all implicant pairs, try to combine
  Example function: F = x'y'z' + x'y'z + x'yz + xy'z + xyz' + xyz
  Actually, comparing ALL pairs isn't necessary; only pairs whose number of uncomplemented literals differs by one can combine.
  Group the 3-literal minterms by number of uncomplemented literals:
    0: (0) x'y'z';  1: (1) x'y'z;  2: (3) x'yz, (5) xy'z, (6) xyz';  3: (7) xyz
  Combine pairs into 2-literal implicants:
    x'y'z' + x'y'z = (0,1) x'y';  (1,5) y'z;  (1,3) x'z;  (3,7) yz;  (5,7) xz;  (6,7) xy
    (x'y'z' + x'yz: No; they differ in more than one literal)
  Combine again into 1-literal implicants:
    (1,3) x'z + (5,7) xz = (1,3,5,7) z; likewise (1,5) y'z + (3,7) yz = z
  Prime implicants (implicants that combined no further): x'y', xy, z
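Step 1 can be sketched in Python (Quine-McCluskey style, with implicants as strings over '0', '1', '-' where '-' marks an eliminated variable); on the slide's function it yields exactly the primes x'y', xy, and z:

```python
from itertools import combinations

def combine(a, b):
    """Combine two implicants differing in exactly one non-'-' bit, else None."""
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diff) == 1 and '-' not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + '-' + a[i+1:]
    return None

def prime_implicants(minterms):
    """Repeatedly combine implicant pairs; uncombined ones are prime."""
    current = set(minterms)
    primes = set()
    while current:
        used, nxt = set(), set()
        for a, b in combinations(sorted(current), 2):
            c = combine(a, b)
            if c is not None:
                used.update((a, b))
                nxt.add(c)
        primes |= current - used
        current = nxt
    return primes

# F = x'y'z' + x'y'z + x'yz + xy'z + xyz' + xyz  (minterms 0,1,3,5,6,7)
pis = prime_implicants(['000', '001', '011', '101', '110', '111'])
print(sorted(pis))  # ['--1', '00-', '11-'] : z, x'y', xy
```

A real tool would also track which minterms each implicant covers, as the slide's (0,1)-style tags do.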
Tabular Method Step 2: Add Essential PIs to Cover
Prime implicants (from Step 1): x'y', xy, z
[Chart: rows are the minterms (0) x'y'z', (1) x'y'z, (3) x'yz, (5) xy'z, (6) xyz', (7) xyz; columns are the PIs x'y' (0,1), xy (6,7), z (1,3,5,7); an X marks each minterm a PI covers]
If only one X in a row, then that PI is essential: it's the only PI that covers that row's minterm.
Tabular Method Step 3: Use Fewest Remaining PIs to Cover Remaining Minterms
Essential PIs (from Step 2): x'y', xy, z
  They cover all the minterms, thus nothing to do in step 3
Final minimized equation:
  F = x'y' + xy + z
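Steps 2 and 3 revolve around the prime-implicant chart. A Python sketch of the essential-PI test (same string form as step 1; in this example the essential PIs already cover every minterm, so step 3 has nothing left to do):

```python
def covers(pi, minterm):
    """A PI covers a minterm if every non-'-' bit matches."""
    return all(p == '-' or p == m for p, m in zip(pi, minterm))

def essential_pis(pis, minterms):
    """PIs that are the sole cover of some minterm (one X in that chart row)."""
    essential = set()
    for m in minterms:
        covering = [p for p in pis if covers(p, m)]
        if len(covering) == 1:
            essential.add(covering[0])
    return essential

pis = ['00-', '11-', '--1']              # x'y', xy, z from step 1
minterms = ['000', '001', '011', '101', '110', '111']
print(sorted(essential_pis(pis, minterms)))  # ['--1', '00-', '11-']: all essential
```

Here minterm 0 is covered only by x'y', minterm 6 only by xy, and minterm 5 only by z, so all three PIs are essential and F = x'y' + xy + z.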
Problem with Methods that Enumerate All Minterms or Compute All Prime Implicants
Too many minterms for functions with many variables
  A function with 32 variables: 2^32 = 4 billion possible minterms
  Too much compute time/memory
Too many computations to generate all prime implicants
  Comparing every minterm with every other minterm, for 32 variables, is (4 billion)^2, on the order of 10^19 computations
  Functions with many variables could require days, months, years, or more of computation: unreasonable
Solution to Computation Problem
Solution
  Don't generate all minterms or prime implicants
  Instead, just take the input equation, and try to iteratively improve it
  Ex: F = abcdefgh + abcdefgh' + jklmnop
  Note: 15 variables, may have thousands of minterms
  But can minimize just by combining the first two terms:
  F = abcdefg(h + h') + jklmnop = abcdefg + jklmnop
Two-Level Optimization Using Iterative Method
Method: randomly apply expand operations, see if they help
  Expand: remove a variable from a term
    Like expanding a circle's size on a K-map
    e.g., Expanding x'z to z is legal, but expanding x'z to x' is not, in the shown function
  After an expand, remove other terms covered by the newly expanded term
  Keep trying (iterate) until it doesn't help
[K-maps: (a) x'z legally expands to z; (b) expanding x'z to x' would cover 0s: not legal]
Ex: F = abcdefgh + abcdefgh' + jklmnop
    F = abcdefg + abcdefgh' + jklmnop   (abcdefgh' is covered by the newly expanded term abcdefg, so remove it)
    F = abcdefg + jklmnop
Illustrated above on a K-map, but the iterative method is intended for an automated solution (no K-map)
Ex: Iterative Heuristic for Two-Level Logic Size Optimization
F = xyz + xyz' + x'y'z' + x'y'z   (minterms in on-set)
Random expand xyz to xy: F = xy + xyz' + x'y'z' + x'y'z
  Legal: xy covers xyz' and xyz, both in the on-set
  Any implicant covered by xy? Yes, xyz'; remove it:
  F = xy + x'y'z' + x'y'z
Random expand xy to x: F = x + x'y'z' + x'y'z
  Not legal (x covers xy'z', xy'z, xyz', xyz: two not in the on-set); undo
Random expand x'y'z' to x'y': F = xy + x'y' + x'y'z
  Legal
  Implicant covered by x'y': x'y'z; remove it:
  F = xy + x'y'
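The heart of the heuristic is the legality check for an expand: the expanded term must cover no off-set minterm. A Python sketch (string form as before; the off-set below is for the slide's F = xyz + xyz' + x'y'z' + x'y'z):

```python
def covers(term, minterm):
    return all(t == '-' or t == m for t, m in zip(term, minterm))

def try_expand(term, pos, off_set):
    """Drop the literal at position pos; return the expanded term, or None if illegal."""
    expanded = term[:pos] + '-' + term[pos+1:]
    if any(covers(expanded, m) for m in off_set):
        return None        # expanded term would cover an off-set minterm
    return expanded

# on-set = {111, 110, 000, 001}; off-set is everything else (variables x, y, z)
off_set = ['010', '011', '100', '101']
print(try_expand('111', 2, off_set))   # '11-'  (xyz -> xy: legal)
print(try_expand('11-', 1, off_set))   # None   (xy -> x: covers 100, 101)
print(try_expand('000', 2, off_set))   # '00-'  (x'y'z' -> x'y': legal)
```

A full implementation would pick terms and positions at random and also delete terms covered by a successful expansion, as the slide's trace does.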
Multi-Level Logic Optimization: Performance/Size Tradeoffs
We don't always need the speed of two-level logic
  Multiple levels may yield fewer gates
Example
  F1 = ab + acd + ace (two levels): 22 transistors, 2 gate-delays
  F2 = ab + ac(d + e) = a(b + c(d + e)) (multiple levels): 16 transistors, 4 gate-delays
General technique: factor out literals; xy + xz = x(y + z)
[Figure: circuits for F1 and F2, and a size-versus-delay plot showing the tradeoff]
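That the factored F2 computes the same function as the two-level F1 can be confirmed by exhaustive truth-table comparison. A short Python sketch:

```python
from itertools import product

def f1(a, b, c, d, e):
    # two-level form: F1 = ab + acd + ace
    return (a and b) or (a and c and d) or (a and c and e)

def f2(a, b, c, d, e):
    # multi-level factored form: F2 = a(b + c(d + e))
    return a and (b or (c and (d or e)))

# compare on all 2^5 = 32 input combinations
assert all(bool(f1(*v)) == bool(f2(*v)) for v in product((False, True), repeat=5))
print("F1 == F2 on all 32 input combinations")
```

The functions agree everywhere; only the gate structure (and thus size and delay) differs.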
Multi-Level Example
Q: Use multiple levels to reduce the number of transistors for F1 = abcd + abcef
A: abcd + abcef = abc(d + ef)
  Has fewer gate inputs, thus fewer transistors
  F1 (two levels): 22 transistors, 2 gate-delays
  F2 = abc(d + ef): 18 transistors, 3 gate-delays
[Figure: circuits for F1 and F2, and a size-versus-delay plot]
Multi-Level Example: Non-Critical Path
Critical path: the longest delay path to the output
Optimization: reduce the size of logic on non-critical paths by using multiple levels
  F1 = (a+b)c + dfg + efg: 26 transistors, 3 gate-delays
  F2 = (a+b)c + (d+e)fg: 22 transistors, 3 gate-delays
  Same delay, since the critical path is unchanged, but smaller
[Figure: circuits for F1 and F2, and a size-versus-delay plot]
Automated Multi-Level Methods
Main techniques use heuristic iterative methods
  Define various operations
    Factoring: abc + abd = ab(c + d)
  Plus other transformations similar to two-level iterative improvement
6.3 Sequential Logic Optimization and Tradeoffs
State Reduction:
  Reduce the number of states in an FSM without changing its behavior
  Fewer states potentially reduces the size of the state register and combinational logic
[Figure: (a) a four-state FSM (A, B, C, D; Inputs: x; Outputs: y), (b) an equivalent three-state FSM, and (c) a timing diagram: for the same sequence of inputs, the output of the two FSMs is the same]
State Reduction: Equivalent States
Two states are equivalent if:
  1. They assign the same values to outputs
     e.g., A and D both assign y to 0
  2. AND, for all possible sequences of inputs, the FSM outputs will be the same starting from either state
     e.g., say x = 1, 1, 0
     starting from A: y = 0, 1, 1, 1 (states A, B, C, B)
     starting from D: y = 0, 1, 1, 1 (states D, B, C, B)
State Reduction via the Partitioning Method
First partition states into groups based on outputs
  Initial groups: G1 = {A, D} (y=0), G2 = {B, C} (y=1)
Then partition based on next states
  For each possible input, for the states in a group, if the next states are in different groups, divide the group
  x=0: A goes to A (G1), D goes to D (G1): same. B goes to D (G1), C goes to B (G2): different, so divide G2
  x=1: A goes to B (G2), D goes to B (G2): same. B goes to C (G2), C goes to B (G2): same
  New groups: G1 = {A, D}, G2 = {B}, G3 = {C}
Repeat until no such states
  x=0: A goes to A (G1), D goes to D (G1): same
  x=1: A goes to B (G2), D goes to B (G2): same
  One-state groups: nothing to check
Done: A and D are equivalent
Ex: Minimizing States Using the Partitioning Method
Initial grouping based on outputs: G1 = {S3, S0, S4} (y=0) and G2 = {S2, S1} (y=1)
Group G1:
  for x=0: S3 goes to S3 (G1), S0 goes to S2 (G2), S4 goes to S0 (G1)
  for x=1: S3 goes to S0 (G1), S0 goes to S1 (G2), S4 goes to S0 (G1)
  S0 differs, so divide G1. New groups: G1 = {S3, S4}, G2 = {S2, S1}, and G3 = {S0}
Repeat for G1, then G2 (G3 only has one state). No further divisions, so done.
  S3 and S4 are equivalent
  S2 and S1 are equivalent
[Figure: the five-state FSM and the reduced three-state FSM with merged states S3,S4 and S1,S2]
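The partitioning method is straightforward to automate. A Python sketch on the earlier A/B/C/D machine (outputs A=0, D=0, B=1, C=1; the transition table below is reconstructed from that slide's step listing, so treat it as an assumption):

```python
# transitions reconstructed from the slide: (state, input) -> next state
outputs = {'A': 0, 'B': 1, 'C': 1, 'D': 0}
trans = {('A', 0): 'A', ('A', 1): 'B',
         ('B', 0): 'D', ('B', 1): 'C',
         ('C', 0): 'B', ('C', 1): 'B',
         ('D', 0): 'D', ('D', 1): 'B'}

def partition(states, outputs, trans, inputs=(0, 1)):
    # step 1: group states by their output values
    by_out = {}
    for s in states:
        by_out.setdefault(outputs[s], []).append(s)
    parts = [frozenset(g) for g in by_out.values()]
    # step 2: split any group whose states go to differing groups
    while True:
        def group_of(s):
            return next(i for i, p in enumerate(parts) if s in p)
        new_parts = []
        for p in parts:
            sigs = {}
            for s in p:
                sig = tuple(group_of(trans[(s, x)]) for x in inputs)
                sigs.setdefault(sig, []).append(s)
            new_parts.extend(frozenset(g) for g in sigs.values())
        if len(new_parts) == len(parts):
            return new_parts    # stable: states sharing a group are equivalent
        parts = new_parts

parts = partition('ABCD', outputs, trans)
print(sorted(sorted(p) for p in parts))  # [['A', 'D'], ['B'], ['C']]
```

The stable partition groups A with D, matching the slide's conclusion that A and D are equivalent.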
Need for Automation
Partitioning for a large FSM is too complex for humans
[Figure: a large FSM with states SA through SO (Inputs: x; Outputs: z)]
State reduction typically automated
  Often using heuristics to reduce compute time
State Encoding
State encoding: assigning a unique bit representation to each state
  Different encodings may reduce circuit size, or trade off size and performance
Minimum-bitwidth binary encoding: uses fewest bits
  Alternatives still exist, such as:
    A: 00, B: 01, C: 10, D: 11
    A: 00, B: 01, C: 11, D: 10
  4! = 24 alternatives for 4 states; N! for N states
Consider the Three-Cycle Laser Timer
  Example 3.7's encoding led to 15 gate inputs
  Try the alternative encoding Off: 00, On1: 01, On2: 11, On3: 10
    x = s1 + s0
    n1 = s0
    n0 = s1'b + s1's0
  Only 8 gate inputs
  Thus fewer transistors
State Encoding: One-Hot Encoding
One-hot encoding
  One bit per state; a bit being 1 corresponds to a particular state
  For A, B, C, D: A: 0001, B: 0010, C: 0100, D: 1000 (versus binary A: 00, B: 01, C: 10, D: 11)
Example: FSM that outputs 0, 1, 1, 1
  Equations if one-hot encoding:
    n3 = s2; n2 = s1; n1 = s0; x = s3 + s2 + s1
  Fewer gates and only one level of logic: less delay than two levels, so a faster clock frequency
[Figure: binary-encoded versus one-hot implementations, and a size-versus-delay plot]
One-Hot Encoding Example: Three-Cycles-High Laser Timer
Four states; use four-bit one-hot encoding (Off: 0001, On1: 0010, On2: 0100, On3: 1000)
State table leads to equations:
  x = s3 + s2 + s1
  n3 = s2
  n2 = s1
  n1 = s0*b
  n0 = s0*b' + s3
Smaller
  3 + 0 + 0 + 2 + (2 + 2) = 9 gate inputs
  Earlier binary encoding (Ch 3): 15 gate inputs
Faster
  Critical path: n0 = s0*b' + s3
  Previously: n0 = s1's0'b + s1's0 (needs a 3-input AND)
  2-input AND slightly faster than 3-input AND
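The one-hot equations can be checked by direct simulation. A Python sketch of the laser timer (state bits ordered s3 s2 s1 s0, starting in Off = 0001; one button press should give x = 0, 1, 1, 1 and return to Off):

```python
def step(state, b):
    """One clock tick of the one-hot laser timer; state = (s3, s2, s1, s0)."""
    s3, s2, s1, s0 = state
    x  = s3 | s2 | s1             # Moore output: laser on in On1, On2, On3
    n3 = s2
    n2 = s1
    n1 = s0 & b
    n0 = (s0 & (1 - b)) | s3      # s0*b' + s3
    return (n3, n2, n1, n0), x

state = (0, 0, 0, 1)              # Off
xs = []
for b in (1, 0, 0, 0):            # press the button once, then release
    state, x = step(state, b)
    xs.append(x)
print(xs, state)                  # [0, 1, 1, 1] (0, 0, 0, 1)
```

The machine holds x high for exactly three cycles and returns to Off, matching the intended behavior.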
Output Encoding
Output encoding: an encoding method where the state encoding is the same as the output values
  Possible if there are enough outputs, and all states have unique output values
[Figure: FSM with states A, B, C, D and outputs xy = 00, 01, 10, 11 used directly as the state encoding]
Output Encoding Example: Sequence Generator
Generate the sequence 0001, 0011, 1100, 1000, repeat
  FSM shown; use the output values as the state encoding: A: wxyz=0001, B: 0011, C: 1100, D: 1000
Create the state table
Derive equations for the next state:
  n3 = s1 + s2; n2 = s1; n1 = s1's0; n0 = s1's0 + s3s2'
[Figure: four-state FSM and its implementation with a four-bit state register whose outputs are w, x, y, z directly]
Moore vs. Mealy FSMs
FSM implementation architecture
  State register and combinational logic
  More detailed view: separate next-state logic and output logic
    Next-state logic is a function of the present state and FSM inputs
    Output logic:
      If a function of the present state only: Moore FSM
      If a function of the present state and FSM inputs: Mealy FSM
  The Mealy FSM adds a connection from the FSM inputs to the output logic
Graphically: a Mealy FSM shows outputs with transitions, not with states
[Figure: (a) Moore and (b) Mealy architectures; an example Mealy FSM with transitions labeled b'/x=0 and b/x=1]
Mealy FSMs May Have Fewer States
Soda dispenser example: initialize, wait until enough money, dispense
  enough: enough money for soda; d: dispenses soda; clear: resets the money count
Moore: 3 states (Init, Wait, Disp); Mealy: 2 states (Init, Wait)
  Mealy: Init goes to Wait with /d=0, clear=1 on the transition; Wait loops on enough'; Wait returns to Init on enough/d=1
[Timing diagrams: for the same enough sequence, Moore visits states I, W, W, D, I while Mealy visits I, W, W, I; both dispense]
Mealy vs. Moore
Q: Which is Moore, and which is Mealy?
A: Mealy on the left, Moore on the right
  Mealy outputs on arcs, meaning outputs are a function of state AND inputs
  Moore outputs in states, meaning outputs are a function of state only
Example is a wristwatch: pressing button b changes the display (s1s0) and also causes a beep (p=1)
  Mealy: 4 states (Time, Alarm, Date, Stpwch), with arc outputs such as b'/s1s0=00, p=0 and b/s1s0=01, p=1
  Moore: 8 states, since each display mode needs a beeping (p=1) and a non-beeping (p=0) version
Assumes the button press is synchronized to occur for one cycle only
Mealy vs. Moore Tradeoff
Mealy may have fewer states, but its drawback is that its outputs change mid-cycle if an input changes
  Note the earlier soda dispenser example
  Mealy had fewer states, but output d is not 1 for a full cycle
  Represents a type of tradeoff
[Figure: Moore (Init, Wait, Disp) and Mealy (Init, Wait) soda dispensers, with timing diagrams showing d asserted for a full cycle in the Moore FSM but only part of a cycle in the Mealy FSM]
Implementing a Mealy FSM
Straightforward
  Convert to a state table
  Derive equations for each output
Key difference from Moore: external outputs (d, clear) may have different values in the same state, depending on input values
[Figure: the two-state Mealy soda dispenser with transitions /d=0, clear=1; enough'/d=0; enough/d=1]
Mealy and Moore Can Be Combined
Combined Moore/Mealy FSM for the beeping wristwatch example
  Moore-style display assignments in the states: Time s1s0=00, Alarm s1s0=01, Date s1s0=10, Stpwch s1s0=11
  Mealy-style beep output on the arcs: b'/p=0 (stay), b/p=1 (advance)
Easier to comprehend, due to clearly associating the s1s0 assignments to each state, and not duplicating s1s0 assignments on outgoing transitions
6.4 Datapath Component Tradeoffs
Can make some components faster (but bigger), or smaller (but slower), than the straightforward components in Ch 4
This chapter builds:
  A faster (but bigger) adder than the carry-ripple adder
  A smaller (but slower) multiplier than the array-based multiplier
Could also do the same for the other Ch 4 components
Building a Faster Adder
Built the carry-ripple adder in Ch 4
  Similar to adding by hand, column by column
  Con: slow
    The output is not correct until the carries have rippled to the left: the critical path
    A 4-bit carry-ripple adder has 4*2 = 8 gate-delays
  Pro: small
    A 4-bit carry-ripple adder has just 4*5 = 20 gates
[Figure: 4-bit adder block; column addition showing carries c3, c2, c1, cin; chain of four full-adders]
Building a Faster Adder
Carry-ripple is slow but small
  8-bit: 8*2 = 16 gate-delays, 8*5 = 40 gates
  16-bit: 16*2 = 32 gate-delays, 16*5 = 80 gates
  32-bit: 32*2 = 64 gate-delays, 32*5 = 160 gates
Two-level logic adder (2 gate-delays)
  OK for a 4-bit adder: about 100 gates
  8-bit: 8,000 transistors / 16-bit: 2 M / 32-bit: 100 billion
  An N-bit two-level adder uses an absurd number of gates for N much beyond 4
[Figure: transistor count growing rapidly with N for a two-level adder]
Compromise
  Build a 4-bit adder using two-level logic, compose to build an N-bit adder
  8-bit adder: 2*(2 gate-delays) = 4 gate-delays, 2*(100 gates) = 200 gates
  32-bit adder: 8*(2 gate-delays) = 16 gate-delays, 8*(100 gates) = 800 gates
Can we do better?
[Figure: 8-bit adder built from two 4-bit adders, the carry out of the low block feeding the carry in of the high block]
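The compromise numbers above follow from two simple formulas. A Python sketch (the ~100-gate, 2-gate-delay figure for a 4-bit two-level block is the slide's estimate):

```python
def ripple(n):
    """N-bit carry-ripple adder: ~2 gate-delays and ~5 gates per bit."""
    return 2 * n, 5 * n                 # (gate-delays, gates)

def composed(n, block_delay=2, block_gates=100):
    """N-bit adder from 4-bit two-level blocks whose carries still ripple."""
    blocks = n // 4
    return block_delay * blocks, block_gates * blocks

print(ripple(32))    # (64, 160)
print(composed(8))   # (4, 200)
print(composed(32))  # (16, 800)
```

The composed design trades roughly 5x the gates for 4x less delay at 32 bits; carry-lookahead, next, does better still.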
Faster Adder: (Naive, Inefficient) Attempt at Lookahead
Idea
  Modify the carry-ripple adder: for a stage's carry-in, don't wait for the carry to ripple, but rather directly compute it from the inputs of the earlier stages
  Called "lookahead" because the current stage looks ahead at previous stages rather than waiting for the carry to ripple to it
[Figure: four full-adder stages (stage 0 to stage 3), each of stages 1-3 preceded by lookahead logic computing its carry-in directly from earlier a/b inputs and c0; notice no rippling of the carry]
Faster Adder: (Naive, Inefficient) Attempt at Lookahead
Want each stage's carry-in bit to be a function of external inputs only (a's, b's, or c0)
Full-adder: s = a xor b xor c; carry-out = bc + ac + ab
  c1 = co0 = b0c0 + a0c0 + a0b0
  c2 = co1 = b1c1 + a1c1 + a1b1
    c2 = b1(b0c0 + a0c0 + a0b0) + a1(b0c0 + a0c0 + a0b0) + a1b1
    c2 = b1b0c0 + b1a0c0 + b1a0b0 + a1b0c0 + a1a0c0 + a1a0b0 + a1b1
  c3 = co2 = b2c2 + a2c2 + a2b2
    (continue plugging in...)
Faster Adder: (Naive, Inefficient) Attempt at Lookahead
Carry-lookahead logic is a function of external inputs
  No waiting for ripple
Problem
  The equations get too big
  Not efficient
  Need a better form of lookahead
  c1 = b0c0 + a0c0 + a0b0
  c2 = b1b0c0 + b1a0c0 + b1a0b0 + a1b0c0 + a1a0c0 + a1a0b0 + a1b1
  c3 = b2b1b0c0 + b2b1a0c0 + b2b1a0b0 + b2a1b0c0 + b2a1a0c0 + b2a1a0b0 + b2a1b1 +
       a2b1b0c0 + a2b1a0c0 + a2b1a0b0 + a2a1b0c0 + a2a1a0c0 + a2a1a0b0 + a2a1b1 + a2b2
[Figure: stages with per-stage lookahead blocks]
Efficient Lookahead
c1 = a0b0 + (a0 xor b0)c0
c2 = a1b1 + (a1 xor b1)c1
c3 = a2b2 + (a2 xor b2)c2
c4 = a3b3 + (a3 xor b3)c3
If a0b0 = 1, then c1 = 1 (call this G: Generate)
If a0 xor b0 = 1, then c1 = 1 if c0 = 1 (call this P: Propagate)
Why those names? When a0b0 = 1, we should generate a 1 for c1. When a0 xor b0 = 1, we should propagate the c0 value as the value of c1, meaning c1 should equal c0.
With Gi = aibi (generate) and Pi = ai xor bi (propagate):
  c1 = G0 + P0c0
  c2 = G1 + P1c1
  c3 = G2 + P2c2
  c4 = G3 + P3c3
[Figure: column addition examples illustrating generate and propagate]
Efficient Lookahead
Substituting, as in the naïve scheme -- but Gi and Pi are functions of the inputs only:
  Gi = aibi (generate)    Pi = ai xor bi (propagate)
c1 = G0 + P0c0
c2 = G1 + P1c1
c2 = G1 + P1(G0 + P0c0)
c2 = G1 + P1G0 + P1P0c0
c3 = G2 + P2c2
c3 = G2 + P2(G1 + P1G0 + P1P0c0)
c3 = G2 + P2G1 + P2P1G0 + P2P1P0c0
c4 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0c0
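A quick behavioral check that the recurrence ci+1 = Gi + Pi*ci and the flattened two-level form agree, and that the G/P scheme really adds (a Python sketch of the logic, not the gates; names are illustrative):

```python
def cla_add4(a, b, c0=0):
    """4-bit carry-lookahead add: G_i = a_i*b_i, P_i = a_i xor b_i,
    c_{i+1} = G_i + P_i*c_i, and sum bit s_i = P_i xor c_i."""
    G = [(a >> i & 1) & (b >> i & 1) for i in range(4)]   # generate bits
    P = [(a >> i & 1) ^ (b >> i & 1) for i in range(4)]   # propagate bits
    c = [c0]
    for i in range(4):                      # recurrence form
        c.append(G[i] | (P[i] & c[i]))
    # flattened two-level form for c4, exactly as on the slide
    c4_flat = (G[3] | P[3] & G[2] | P[3] & P[2] & G[1]
               | P[3] & P[2] & P[1] & G[0]
               | P[3] & P[2] & P[1] & P[0] & c0)
    assert c4_flat == c[4]                  # both forms agree
    s = sum((P[i] ^ c[i]) << i for i in range(4))
    return c[4], s                          # (cout, 4-bit sum)

cout, s = cla_add4(0b1111, 0b0001)
assert (cout, s) == (1, 0b0000)             # 15 + 1 = 16
```

Note the key property the slide emphasizes: every ci here depends only on the G's, P's, and c0, never on another stage's rippled carry.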
CLA (Carry-Lookahead Adder)
Each stage: a half-adder computes G (AND) and P (XOR), plus another XOR for s
  Call this the SPG block
Create the carry-lookahead logic from the equations
More efficient than the naïve scheme, at the expense of one extra gate delay
[Figure: four SPG blocks (inputs a3 b3 .. a0 b0, cin) feed the carry-lookahead logic; stages 4-1 of that logic compute cout, c3, c2, c1 from P3 G3 .. P0 G0 and c0.]
c1 = G0 + P0c0
c2 = G1 + P1G0 + P1P0c0
c3 = G2 + P2G1 + P2P1G0 + P2P1P0c0
cout = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0c0
Carry-Lookahead Adder: High-Level View
[Figure: four SPG blocks (inputs a, b, cin; outputs P, G) feed 4-bit carry-lookahead logic, which produces c1, c2, c3, and cout.]
Fast: only 4 gate delays
  Each stage has an SPG block with 2 gate levels
  Carry-lookahead logic quickly computes the carry from the propagate and generate bits using 2 gate levels
Reasonable number of gates: the 4-bit adder has only 26 gates
4-bit adder comparison (gate delays, gates):
  Carry-ripple: (8, 20)
  Two-level: (2, 500)
  CLA: (4, 26) -- a nice compromise
Carry-Lookahead Adder: 32-bit?
Problem: Gates get bigger in each stage
  4th stage has 5-input gates
  32nd stage would have 33-input gates
  Too many inputs for one gate
  Would require building from smaller gates, meaning more levels (slower), more gates (bigger)
One solution: Connect 4-bit CLA adders in carry-ripple manner
  Ex: 16-bit adder: 4 + 4 + 4 + 4 = 16 gate delays. Can we do better?
[Figure: four 4-bit adders (inputs a15-a12/b15-b12 down to a3-a0/b3-b0) chained in carry-ripple fashion; each adder's cout feeds the next adder's cin.]
Hierarchical Carry-Lookahead Adders
Better solution: rather than rippling the carries, just repeat the carry-lookahead concept
  Requires minor modification of the 4-bit CLA adder to output P and G
[Figure: four 4-bit CLA adders (which use carry-lookahead internally) each output P and G; a second level of 4-bit carry-lookahead logic -- the same lookahead logic as inside the 4-bit adders -- computes c1, c2, c3, and cout from P3 G3 .. P0 G0 and c0.]
c1 = G0 + P0c0
c2 = G1 + P1G0 + P1P0c0
c3 = G2 + P2G1 + P2P1G0 + P2P1P0c0
cout = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0c0
Hierarchical Carry-Lookahead Adders
Hierarchical CLA concept can be applied for larger adders
32-bit hierarchical CLA:
  Only about 8 gate delays (2 for SPG block, then 2 per CLA level)
  Only about 14 gates in each 4-bit CLA logic block
[Figure: 32-bit hierarchical CLA -- SPG blocks feed eight 4-bit CLA logic blocks, whose P/G outputs feed two more 4-bit CLA logic blocks, topped by a 2-bit CLA logic block.]
Q: How many gate delays for a 64-bit hierarchical CLA, using 4-bit CLA logic?
A: 16 CLA-logic blocks in the 1st level, 4 in the 2nd, 1 in the 3rd -- so still just 8 gate delays (2 for SPG, and 2+2+2 for CLA logic). CLA is a very efficient method.
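The Q&A's counting argument generalizes. Using the slide's delay model (2 gate delays for the SPG block plus 2 per level of CLA logic -- an estimate, ignoring fan-in limits inside each block), a sketch:

```python
def hier_cla_gate_delays(n_bits, block=4):
    """Gate delays of a hierarchical CLA under the slide's model:
    2 for the SPG block, plus 2 per level of CLA logic.  Each level
    groups `block` G/P pairs, so count levels until one block remains."""
    levels, groups = 0, n_bits
    while groups > 1:
        groups = -(-groups // block)   # ceiling division
        levels += 1
    return 2 + 2 * levels

assert hier_cla_gate_delays(4) == 4    # plain 4-bit CLA
assert hier_cla_gate_delays(64) == 8   # 16 blocks, then 4, then 1
```

A 32-bit CLA also needs 3 levels (8, then 2, then 1 blocks), hence the "about 8 gate delays" on the slide; delay grows only logarithmically with width.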
Carry-Select Adder
Another way to compose adders
  High-order stage: compute the result for a carry-in of 1 and of 0
    The adders operate in parallel
  Select based on the carry-out of the low-order stage
  Faster than pure rippling
[Figure: 8-bit carry-select adder -- LO4 adds a3-a0/b3-b0 with the real carry-in; HI4_0 (cin=0) and HI4_1 (cin=1) both add a7-a4/b7-b4; LO4's carry-out (suppose it is 1) drives the select of a 5-bit-wide 2x1 mux, choosing HI4_1's sum and carry-out.]
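A behavioral Python sketch of the 8-bit carry-select idea (modeling the three 4-bit adders and the mux functionally; names are illustrative):

```python
def carry_select_add8(a, b, cin=0):
    """8-bit carry-select add: the low half adds normally, while two
    high-half adders precompute results for carry-in 0 and 1; the low
    half's carry-out then selects one (the 5-bit-wide 2x1 mux)."""
    def add4(x, y, c):                 # one 4-bit adder: (cout, sum)
        total = x + y + c
        return total >> 4 & 1, total & 0xF

    co_lo, s_lo = add4(a & 0xF, b & 0xF, cin)      # LO4
    hi0 = add4(a >> 4, b >> 4, 0)                  # HI4_0, in parallel
    hi1 = add4(a >> 4, b >> 4, 1)                  # HI4_1, in parallel
    co, s_hi = hi1 if co_lo else hi0               # mux select = co_lo
    return co, (s_hi << 4) | s_lo

co, s = carry_select_add8(200, 100)
assert (co << 8) | s == 300
```

The speed comes from the high half never waiting on the low half's carry; only the final mux delay is added after the low carry-out settles.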
Adder Tradeoffs
[Figure: size-vs-delay plot -- multilevel carry-lookahead (biggest, fastest), carry-lookahead, carry-select, and carry-ripple (smallest, slowest).]
Designer picks the adder that satisfies particular delay and size requirements
  May use different adder types in different parts of the same design
  Faster adders on the critical path, smaller adders on non-critical paths
Smaller Multiplier
Multiplier in Ch 4 was array style
  Fast, reasonable size for 4-bit: 4*4 = 16 partial product AND terms, 3 adders
  But big for 32-bit: 32*32 = 1024 AND terms, and 31 adders (big ones, too)
[Figure: 4-bit array multiplier -- AND gates form partial products pp1-pp4 from a3-a0 and b3-b0; 5-, 6-, and 7-bit adders sum them into p7..p0. A 32-bit version would have 1024 gates and 31 adders here.]
Smaller Multiplier -- Sequential (Add-and-Shift) Style
Smaller multiplier: Basic idea
  Don't compute all partial products simultaneously
  Rather, compute one at a time (similar to by hand), maintain a running sum
Worked example, 0110 * 0011:

                       Step 1    Step 2    Step 3     Step 4
  (running sum)         0000     00110     010010     0010010
  (partial product)   + 0110    + 0110    + 0000     + 0000
  (new running sum)     00110    010010    0010010    00010010
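The four steps above can be simulated directly (a behavioral Python sketch of the add-and-shift scheme, not the datapath; the function name is illustrative):

```python
def add_and_shift_mult(multiplicand, multiplier, n=4):
    """Sequential (add-and-shift) multiply, mirroring the slide's
    steps: if the current multiplier bit is 1, add the multiplicand
    into the upper half of a 2n-bit running sum, then shift the
    running sum right one position."""
    running = 0                             # 2n-bit running-sum register
    for i in range(n):
        if (multiplier >> i) & 1:           # controller checks bit mr_i
            running += multiplicand << n    # add into the upper n bits
        running >>= 1                       # shift running sum right
    return running                          # product, in 2n bits

assert add_and_shift_mult(0b0110, 0b0011) == 0b00010010   # 6 * 3 = 18
```

The right shift after each step is what keeps each new partial product aligned with the correct running-sum bits, exactly as the table shows.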
Smaller Multiplier -- Sequential (Add-and-Shift) Style
Design a circuit that computes one partial product at a time and adds it to the running sum
Note that shifting the running sum right (relative to the partial product) after each step ensures the partial product is added to the correct running-sum bits
[Figure: datapath -- multiplicand register (4 bits, load mdld), multiplier register (4 bits, load mrld, bits mr3-mr0 read by the controller), a 4-bit adder, and an 8-bit running-sum register with load (rsload), clear (rsclear), and shift-right (rsshr) controls. For 0110 * 0011 the running-sum register steps through 00000000, 00110000, 01001000, 00100100, 00010010 -- the final product.]
Smaller Multiplier -- Sequential Style: Controller
Controller waits for start=1, then loads the multiplicand and multiplier (mdld=1, mrld=1) and clears the running sum (rsclear=1)
Looks at the multiplier one bit at a time (mr0, then mr1, mr2, mr3)
  Adds the partial product (the multiplicand) to the running sum if the present multiplier bit is 1 (rsload=1)
  Then shifts the running sum right one position (rsshr=1)
Vs. array-style:
  Pro: small
    Just three registers, an adder, and a controller
  Con: slow
    2 cycles per multiplier bit
    32-bit: 32*2 = 64 cycles (plus 1 for init.)
[Figure: controller FSM -- an init state, then for each bit mri a conditional rsload=1 state followed by an rsshr=1 state -- driving the datapath of the previous slide; the register ends holding the correct product, 00010010.]
6.5 RTL Design Optimizations and Tradeoffs
While creating the datapath during RTL design, there are several optimizations and tradeoffs, involving:
  Pipelining
  Concurrency
  Component allocation
  Operator binding
  Operator scheduling
  Moore vs. Mealy high-level state machines
Pipelining
Intuitive example: Washing dishes with a friend -- you wash, friend dries
  You wash plate 1
  Then friend dries plate 1, while you wash plate 2
  Then friend dries plate 2, while you wash plate 3; and so on
  You don't sit and watch your friend dry; you start on the next plate
Pipelining: Break a task into stages, each stage outputs data for the next stage, all stages operate concurrently (if they have data)
[Figure: timeline -- without pipelining: W1 D1 W2 D2 W3 D3; with pipelining: stage 1 runs W1 W2 W3 while stage 2 runs D1 D2 D3.]
Pipelining Example
S = W+X+Y+Z, computed by a tree of adders, each with 2 ns delay
Datapath on the left has a critical path of 2+2 = 4 ns, so the minimum clock period is 4 ns
  Can read new data, add, and write the result to S every 4 ns
Datapath on the right inserts pipeline registers between the two adder stages; the longest path is then only 2 ns, so the minimum clock period is 2 ns
  So can read new data every 2 ns -- doubled performance (sort of...)
[Figure: (a) unpipelined two-level adder tree; (b) the same tree with pipeline registers between stage 1 and stage 2.]
Pipelining Example
[Figure: the same unpipelined (a) and pipelined (b) datapaths as on the previous slide.]
Pipelining requires a refined definition of performance
  Latency: Time for new input data to result in new output data (seconds)
  Throughput: Rate at which new data can be input (items / second)
So pipelining the above system:
  Doubled the throughput, from 1 item / 4 ns to 1 item / 2 ns
  Latency stayed the same: 4 ns
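The latency/throughput arithmetic can be captured in a few lines (a sketch under the simple model that the clock period equals the slowest stage and every item spends one clock per stage; names are illustrative):

```python
def pipeline_metrics(stage_delays_ns):
    """Given per-stage combinational delays, return (clock period,
    latency, throughput): period = slowest stage, latency = stages *
    period, throughput = 1 item per period."""
    period = max(stage_delays_ns)
    latency = period * len(stage_delays_ns)
    throughput = 1.0 / period          # items per ns
    return period, latency, throughput

# Unpipelined adder tree: one 4 ns "stage"
assert pipeline_metrics([4]) == (4, 4, 0.25)
# Pipelined: two 2 ns stages -- throughput doubles, latency unchanged
assert pipeline_metrics([2, 2]) == (2, 4, 0.5)
```

Note the "sort of..." caveat: an unbalanced split such as stages of 3 ns and 1 ns would give a 3 ns period, and the latency would actually grow to 6 ns.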
Pipeline Example: FIR Datapath
100-tap FIR filter: Row of 100 concurrent multipliers, followed by a tree of adders
  Assume 20 ns per multiplier, 14 ns for the entire adder tree
  Critical path of 20+14 = 34 ns
Add pipeline registers between the multipliers (stage 1) and the adder tree (stage 2)
  Longest path now only 20 ns
  Clock frequency can be nearly doubled
  Great speedup with minimal extra hardware
[Figure: row of xt registers feeds the multipliers; pipeline registers separate them from the adder tree and the yreg output Y.]
Concurrency
Concurrency: Divide a task into subparts, execute the subparts simultaneously
  Dishwashing example: Divide the stack into 3 substacks, give the substacks to 3 neighbors, who work simultaneously -- 3 times speedup (ignoring time to move dishes to neighbors' homes)
Concurrency does things side-by-side; pipelining instead uses stages (like a factory line)
  Can do both, too
Already used concurrency in the FIR filter -- concurrent multiplications
[Figure: a task divided for concurrency (side-by-side units) vs. for pipelining (staged units).]
Concurrency Example: SAD Design Revisited
Sum-of-absolute-differences video compression example (Ch 5)
  Compute the sum of absolute differences (SAD) of 256 pairs of pixels
  Original: Main loop did 1 sum per iteration, 256 iterations, 2 cycles per iteration
  256 iters. * 2 cycles/iter. = 512 cycles
  The subtract/abs/add is done in 1 cycle, but done 256 times
[Figure: controller FSM (S0-S4) and datapath -- counter i compared against 256, A/B pixel inputs, a subtractor, an abs unit, a 32-bit adder, the sum register, and sadreg holding the result sad.]
Concurrency Example: SAD Design Revisited
More concurrent design
  Compute the SAD for 16 pairs concurrently, done 16 times to cover all 16*16 = 256 pairs
  Main loop does 16 sums per iteration, only 16 iterations, still 2 cycles per iteration
  Orig: 256*2 = 512 cycles
  New: 16*2 = 32 cycles
[Figure: datapath with 16 subtractors (inputs A0/B0 .. A15/B15), 16 absolute-value units, and an adder tree summing the 16 values into sum and sad_reg; all subtract/abs/adds shown are done in 1 cycle, but done only 16 times. The controller is as before, with i counting to 16.]
Concurrency Example: SAD Design Revisited
Comparing the two designs
  Original: 256 iterations * 2 cycles/iter = 512 cycles
  More concurrent: 16 iterations * 2 cycles/iter = 32 cycles
  Speedup: 512/32 = 16x
Versus software
  Recall: Estimated about 6 microprocessor cycles per iteration
  256 iterations * 6 cycles per iteration = 1536 cycles
  Original design's speedup vs. software: 1536 / 512 = 3x (assuming cycle lengths are equal)
  Concurrent design's speedup vs. software: 1536 / 32 = 48x
  48x is very significant -- quality of video may be much better
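Both datapaths compute the same function; only the cycle counts differ. A behavioral sketch plus the slide's cycle arithmetic (function names are illustrative):

```python
def sad(pixels_a, pixels_b):
    """Behavioral SAD: what every version of the datapath computes."""
    return sum(abs(x - y) for x, y in zip(pixels_a, pixels_b))

def cycles(pairs_per_iter, cycles_per_iter=2, total_pairs=256):
    """Cycle count when the loop handles pairs_per_iter pairs at once."""
    return (total_pairs // pairs_per_iter) * cycles_per_iter

assert sad([10, 20, 30], [12, 18, 30]) == 4
original, concurrent = cycles(1), cycles(16)
software = 256 * 6                       # slide's software estimate
assert (original, concurrent) == (512, 32)
assert software // original == 3 and software // concurrent == 48
```

The 16x hardware speedup comes purely from doing 16 of the identical subtract/abs/add operations per iteration instead of 1.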
Component Allocation
Another RTL tradeoff: Component allocation -- choosing a particular set of functional units to implement a set of operations
  e.g., given two states, each with a multiplication
  Can use 2 multipliers (*)
  OR, can instead use 1 multiplier and 2 muxes
  Smaller size, but slightly longer delay due to the mux delay
[Figure: (a) FSM with states A: t1 := t2*t3 and B: t4 := t5*t6; (b) two-multiplier datapath (FSM-A: t1ld=1; B: t4ld=1) vs. one multiplier with two 2x1 muxes (A: sl=0, sr=0, t1ld=1; B: sl=1, sr=1, t4ld=1); (c) size-vs-delay plot -- the 2-mul allocation is bigger but faster than the 1-mul allocation.]
Operator Binding
Another RTL tradeoff: Operator binding -- mapping a set of operations to a particular component allocation
  Note: operator/operation means behavior (multiplication, addition), while component (aka functional unit) means hardware (multiplier, adder)
  Different bindings may yield different size or delay
[Figure: states A, B, C with operations t1 = t2*t3, t4 = t5*t6, t7 = t8*t3, and 2 multipliers (MULA, MULB) allocated. Binding 1 (t1 and t4 on MULA, t7 on MULB) needs 2 muxes; Binding 2 (t1 and t7 on MULA -- sharing input t3 -- and t4 on MULB) needs only 1 mux, so it is smaller, as the size-vs-delay plot shows.]
Operator Scheduling
Yet another RTL tradeoff: Operator scheduling -- introducing or merging states, and assigning operations to those states
[Figure: a 3-state schedule (A, B, C) with both t1 = t2*t3 and t4 = t5*t6 in state B, requiring 2 multipliers, vs. a 4-state schedule that splits B into B and B2 with one multiplication each, requiring only 1 multiplier plus 2x1 muxes (selects sl, sr). The size-vs-delay plot shows the 4-state schedule is smaller (only 1 *) but has more delay, due to the muxes and the extra state.]
Operator Scheduling Example: Smaller FIR Filter
3-tap FIR filter of Ch 5: Two states; the datapath computes a new Y every cycle
  y(t) = c0*x(t) + c1*x(t-1) + c2*x(t-2)
  Used 3 multipliers and 2 adders; can we reduce the design's size?
HLSM: Inputs: X (12 bits); Outputs: Y (12 bits); Local storage: xt0, xt1, xt2, c0, c1, c2, Yreg (12 bits)
  Init: Yreg := 0; xt0 := 0; xt1 := 0; xt2 := 0; c0 := 3; c1 := 2; c2 := 2
  FC: Yreg := c0*xt0 + c1*xt1 + c2*xt2; xt0 := X; xt1 := xt0; xt2 := xt1
[Figure: datapath for the 3-tap FIR filter -- shift registers xt0-xt2 holding x(t), x(t-1), x(t-2), constant registers c0-c2, three multipliers, two adders, and Yreg.]
Operator Scheduling Example: Smaller FIR Filter
Reduce the design's size by re-scheduling the operations
  Do only one multiplication operation per state
(a) Original schedule -- a single state does everything:
  S1: Yreg := c0*xt0 + c1*xt1 + c2*xt2; xt0 := X; xt1 := xt0; xt2 := xt1
(b) Re-scheduled, with new local storage "sum" (12 bits):
  S1: sum := 0; xt0 := X; xt1 := xt0; xt2 := xt1
  S2: sum := sum + c0*xt0
  S3: sum := sum + c1*xt1
  S4: sum := sum + c2*xt2
  S5: Yreg := sum
(Inputs: X (12 bits); Outputs: Y (12 bits); Local storage: xt0, xt1, xt2, c0, c1, c2, Yreg, sum (12 bits))
Operator Scheduling Example: Smaller FIR Filter
Reduce the design's size by re-scheduling the operations
  Do only one multiplication (*) operation per state, along with the sum (+)
  S1: sum := 0; xt0 := X; xt1 := xt0; xt2 := xt1
  S2: sum := sum + c0*xt0
  S3: sum := sum + c1*xt1
  S4: sum := sum + c2*xt2
  S5: Yreg := sum
Multiply-accumulate (MAC): a common datapath component
[Figure: serialized datapath for the 3-tap FIR filter -- 3x1 muxes (selects mul_s1, mul_s0) pick one c/xt pair per state into a single multiplier-adder (MAC) feeding the sum register (sum_ld, sum_clr) and yreg.]
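Walking the 5-state schedule in software makes the serialization explicit: one multiply and one add per state through the shared MAC (a behavioral sketch; the function name is illustrative):

```python
def fir3_serialized(c, xt):
    """Trace the re-scheduled 3-tap FIR: S1 clears sum (the xt shift
    is omitted here), S2-S4 each perform one multiply-accumulate
    through the single shared MAC, S5 loads Yreg."""
    sum_ = 0                          # S1: sum := 0
    for i in range(3):                # S2, S3, S4
        sum_ = sum_ + c[i] * xt[i]    # one * and one + per state
    return sum_                       # S5: Yreg := sum

# y(t) = c0*x(t) + c1*x(t-1) + c2*x(t-2), with the slide's c = 3, 2, 2
assert fir3_serialized([3, 2, 2], [5, 4, 1]) == 3*5 + 2*4 + 2*1
```

Same output as the fully-concurrent datapath, but produced over 5 clock cycles instead of 1 -- the size-for-delay tradeoff in miniature.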
Operator Scheduling Example: Smaller FIR Filter
Many other options exist between fully-concurrent and fully-serialized
  e.g., for the 3-tap FIR, can use 1, 2, or 3 multipliers
  Can also choose fast array-style multipliers (which are concurrent internally) or slower shift-and-add multipliers (which are serialized internally)
  Each option represents a compromise
[Figure: size-vs-delay plot from the concurrent FIR (big, fast) to the serialized FIR (small, slow), with compromises in between.]
6.6 More on Optimizations and Tradeoffs
Serial vs. concurrent computation has been a common tradeoff theme at all levels of design
  Serial: Perform tasks one at a time
  Concurrent: Perform multiple tasks simultaneously
Combinational logic tradeoffs
  Concurrent: Two-level logic (fast but big)
  Serial: Multi-level logic (smaller but slower)
  abc + abd + ef --> (ab)(c+d) + ef essentially computes ab first (serialized)
Datapath component tradeoffs
  Serial: Carry-ripple adder (small but slow)
  Concurrent: Carry-lookahead adder (faster but bigger)
    Computes the carry-in bits concurrently
  Also multiplier: concurrent (array-style) vs. serial (shift-and-add)
RTL design tradeoffs
  Concurrent: Schedule multiple operations in one state
  Serial: Schedule one operation per state
Higher vs. Lower Levels of Design
Optimizations and tradeoffs at higher levels typically have greater impact than those at lower levels
  Ex: RTL decisions impact size/delay more than gate-level decisions
Spotlight analogy: The lower you are, the less of the solution landscape is illuminated (meaning possible)
  High-level changes light up far more of the landscape
[Figure: (a) and (b) -- a spotlight over a size/delay solution landscape; low-level decisions illuminate only a small region.]
Algorithm Selection
Chosen algorithm can have a big impact
  e.g., which filtering algorithm? FIR is one type, but others require less computation at the expense of lower-quality filtering
Example: Quickly find an item's address in a 256-word memory
  One use: data compression. Many others.
Algorithm 1: Linear search
  Compare the item with M[0], then M[1], M[2], ...
  256 comparisons worst case
Algorithm 2: Binary search (sort memory first)
  Start by considering the entire memory range
  If M[mid] > item, consider the lower half of M; if M[mid] < item, consider the upper half
  Repeat on the new smaller range
  Dividing the range by 2 each step; at most 8 such divisions
  Only 8 comparisons in the worst case
Choice of algorithm has tremendous impact -- far more impact than, say, the choice of comparator type
[Figure: 256x32 memory (addresses 0-255 with sample word contents); linear search scans from address 0 upward, while binary search probes addresses 128, 64, 96, ...]
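The comparison counts can be checked in a few lines. A sketch that tallies every probe of M[mid] (the slide's "8" counts the range halvings; counting the final one-element probe as well gives at most 9 memory compares, versus 256 for linear search):

```python
def binary_search(mem, item):
    """Binary search over a sorted memory; returns (address, probes).
    The range halves on each probe, so probes grow as log2(len(mem))."""
    lo, hi, probes = 0, len(mem) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if mem[mid] == item:
            return mid, probes
        if mem[mid] > item:
            hi = mid - 1          # consider the lower half of M
        else:
            lo = mid + 1          # consider the upper half of M
    return None, probes           # item not present

mem = list(range(256))            # any sorted 256-word memory
worst = max(binary_search(mem, v)[1] for v in mem)
assert worst <= 9                 # vs. 256 worst case for linear search
```

That logarithmic-vs-linear gap is the point of the slide: no comparator-level optimization of the linear-search datapath could ever close a 256-to-9 difference.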
Power Optimization
Until now, we've focused on size and delay; power is another important design criterion
  Measured in Watts (energy/second): the rate at which energy is consumed
  Increasingly important as more transistors fit on a chip
  Power not scaling down at the same rate as size -- means more heat per unit area; cooling is difficult
  Coupled with batteries not improving at the same rate -- means a battery can't supply the chip's power for as long
[Figure: chip energy demand rising much faster than battery energy density, 2001-2009 (1 = value in 2001).]
CMOS technology: Switching a wire from 0 to 1 consumes power (known as dynamic power)
  P = k * C * V^2 * f
  k: constant; C: capacitance of the wires; V: voltage; f: switching frequency
Power reduction methods
  Reduce voltage: But slower, and there's a limit
  What else?
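The equation itself shows where the leverage is: V enters squared, and f enters linearly -- which is exactly what the next two slides' methods attack. A numerical sketch (the parameter values are made up for illustration):

```python
def dynamic_power(k, C, V, f):
    """CMOS dynamic power: P = k * C * V^2 * f."""
    return k * C * V**2 * f

base = dynamic_power(1.0, 1e-12, 1.2, 100e6)   # illustrative values
# Halving the voltage quarters the power (V is squared)...
half_v = dynamic_power(1.0, 1e-12, 0.6, 100e6)
assert abs(half_v - base / 4) < 1e-12 * base
# ...and gating a register's clock drives its f (hence its dynamic
# power contribution) to zero while the register sits idle.
assert dynamic_power(1.0, 1e-12, 1.2, 0) == 0
```

The catch noted on the slide: lowering V also slows the transistors, so voltage reduction trades delay for power and has a floor.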
Power Optimization using Clock Gating
P = k * C * V^2 * f
Much of a chip's switching (>30%) is due to clock signals
  After all, the clock goes to every register
  Portion of the FIR filter shown on the right; notice clock signals n1, n2, n3, n4 -- much switching on clock wires
Solution: Disable clock switching to registers unused in a particular state
  Achieved using AND gates: the FSM sets the 2nd input of an AND gate to 1 only in those states during which the register gets loaded
  Greatly reduces switching on n1, n2, n3, n4 -- less power
Note: Advanced method, usually done by tools, not designers
  Putting gates on clock wires creates variations in the clock signal (clock skew); must be done with great care
[Figure: FIR registers xt0-xt2 (load x_ld) and yreg (load y_ld) -- clock wires n1-n4 driven directly vs. gated through AND gates by state signals s1 and s5.]
Power Optimization using Low-Power Gates on Non-Critical Paths
Another method: Use low-power gates
  Multiple versions of gates may exist: fast/high-power and slow/low-power versions
  Use slow/low-power gates on non-critical paths
  Reduces power without increasing delay
[Figure: power-vs-delay plot of high-power vs. low-power gates; example circuit F1 with each gate annotated delay(ns)/power(nW) -- all 1/1 gates: 26 transistors, 3 ns delay, 5 nanowatts; with 2/0.5 gates on the non-critical path: still 26 transistors and 3 ns delay, but only 4 nanowatts.]
Chapter Summary
Optimization (improve criteria without loss) vs. tradeoff (improve one criterion at the expense of another)
Combinational logic
  K-maps and the tabular method for two-level logic size optimization
  Iterative improvement heuristics for two-level optimization, and for multi-level too
Sequential logic
  State minimization, state encoding, Moore vs. Mealy
Datapath components
  Faster adder using carry-lookahead
  Smaller multiplier using sequential multiplication
RTL
  Pipelining, concurrency, component allocation, operator binding, operator scheduling
  Serial vs. concurrent, efficient algorithms, power reduction methods (clock gating, low-power gates)
Multibillion-dollar EDA (electronic design automation) industry creates tools for automated optimization/tradeoffs