Chapter 6:
Optimizations and Tradeoffs
Slides to accompany the textbook Digital Design, with RTL Design, VHDL,
and Verilog, 2nd Edition,
by Frank Vahid, John Wiley and Sons Publishers, 2010.
http://www.ddvahid.com
Introduction
We now know how to build digital circuits
How can we build better circuits?
Let's consider two important design criteria
Delay: the time from inputs changing to new correct stable output
Size: the number of transistors
For quick estimation, assume
Every gate has delay of 1 gate-delay
Every gate input requires 2 transistors
Ignore inverters
Transforming F1 to F2 represents an optimization: better in all criteria of interest
F1 = wxy + wxy': 16 transistors, 2 gate-delays
F2 = wx: 4 transistors, 1 gate-delay
F1 = wxy + wxy' = wx(y+y') = wx = F2
[Figure: (a) F1 circuit, (b) F2 circuit, (c) plot of size (transistors) vs. delay (gate-delays) for F1 and F2]
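The quick-estimation rules above can be tried out in a few lines of Python. This is only a sketch: the nested-tuple circuit description and the helper names (`gate_inputs`, `transistors`, `delay`) are assumptions made for illustration, not part of the textbook.

```python
# Hypothetical representation: a circuit is a nested tuple
# ("and" | "or", input, input, ...), where each input is either a
# variable name (str) or another gate tuple. Inverters are ignored,
# so a complemented input such as "y'" is treated as a plain variable.

def gate_inputs(expr):
    """Total number of gate inputs in the circuit."""
    if isinstance(expr, str):  # a variable is not a gate
        return 0
    _, *inputs = expr
    return len(inputs) + sum(gate_inputs(i) for i in inputs)

def transistors(expr):
    """Rule of thumb from the slide: 2 transistors per gate input."""
    return 2 * gate_inputs(expr)

def delay(expr):
    """Rule of thumb from the slide: 1 gate-delay per gate on the longest path."""
    if isinstance(expr, str):
        return 0
    _, *inputs = expr
    return 1 + max(delay(i) for i in inputs)

F1 = ("or", ("and", "w", "x", "y"), ("and", "w", "x", "y'"))
F2 = ("and", "w", "x")
print(transistors(F1), delay(F1))  # 16 2
print(transistors(F2), delay(F2))  # 4 1
```

The same helpers also work on multi-level circuits, since gates can be nested to any depth.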
Digital Design 2e
Copyright 2010 2
Frank Vahid
Note: Slides with animation are denoted with a small red "a" near the animated items
Introduction
Tradeoff
Improves some, but worsens other, criteria of interest
Transforming G1 to G2 represents a tradeoff: some criteria better, others worse.
G1 = wx + wy + z: 14 transistors, 2 gate-delays
G2 = w(x+y) + z: 12 transistors, 3 gate-delays
[Figure: G1 circuit, G2 circuit, and plot of size (transistors) vs. delay (gate-delays) for G1 and G2]
Introduction
Optimizations: All criteria of interest are improved (or at least kept the same)
Tradeoffs: Some criteria of interest are improved, while others are worsened
[Figure: size vs. delay plots illustrating an optimization and a tradeoff]
6.2
ab + ab' = a(b + b') = a*1 = a
F = xy + xy'
G = x(y+y')
G = x
Karnaugh Maps for Two-Level Size Optimization
Easy to miss possible opportunities to combine terms when doing algebraically
Karnaugh Maps (K-maps)
Minterms differing in one variable are adjacent in the map
Can clearly see opportunities to combine terms: look for adjacent 1s
Notice the column order (00, 01, 11, 10) is not in binary order
K-map cell layout:
F   yz
x   00      01     11    10
0   x'y'z'  x'y'z  x'yz  x'yz'
1   xy'z'   xy'z   xyz   xyz'
F = x'y'z' + x'y'z + xyz + xyz'
F   yz
x   00 01 11 10
0   1  1  0  0
1   0  0  1  1
For F, clearly two opportunities
Top left circle is shorthand for:
x'y'z' + x'y'z = x'y'(z+z') = x'y'(1) = x'y'
J yz xy yz 1 1 1 1 1
x
x 00 01 11 10 The two circles are shorthand for:
I = xyz + xyz + xyz + xyz + xyz
xz I = xyz + xyz + xyz + xyz + xyz + xyz
0 1 1 0 0
I = (xyz + xyz) + (xyz + xyz + xyz + xyz)
a I = (yz) + (x)
1 0 1 1 0
K-maps
Circles can cross left/right sides
Remember, edges are adjacent
Minterms differ in one variable only
K   yz
x   00 01 11 10
0   0  1  0  0
1   1  0  0  1
K = x'y'z + xz' (the xz' circle crosses the left/right sides)
Circles must have 1, 2, 4, or 8 cells; 3, 5, or 7 not allowed
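The K-map column order 00, 01, 11, 10 is a Gray code, which is why neighboring cells (including the left and right edges) differ in exactly one variable. A minimal Python sketch, with hypothetical helper names, that generates this order and checks the adjacency property:

```python
def gray_code(n_bits):
    """Return the n-bit reflected Gray code sequence as bit strings."""
    return [format(i ^ (i >> 1), f"0{n_bits}b") for i in range(2 ** n_bits)]

def differ_in_one_bit(a, b):
    """True if bit strings a and b differ in exactly one position."""
    return sum(x != y for x, y in zip(a, b)) == 1

cols = gray_code(2)
print(cols)  # ['00', '01', '11', '10']

# Every column is adjacent to the next one, and the last wraps around
# to the first (the left/right edges of the map are adjacent).
print(all(differ_in_one_bit(cols[i], cols[(i + 1) % len(cols)])
          for i in range(len(cols))))  # True
```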
K-maps for Four Variables
F yz
Four-variable K-map follows wx 00 01 11 10
same principle 00 0 0 1 0
wxy
variable
11 0 0 1 0
Left/right adjacent
Top/bottom also adjacent 10 0 0 1 0
algebraically by hand F z 11 0 1 1 0
y 0 1
0 10 0 1 1 0
1 z
G=z
Two-Level Size Optimization Using K-maps
General K-map method
1. Convert the function's equation into
sum-of-minterms form
2. Place 1s in the appropriate K-map
cells for each minterm
3. Cover all 1s by drawing the fewest
largest circles, with every 1
included at least once; write the
corresponding term for each circle
4. OR all the resulting terms to create
the minimized function.
Two-Level Size Optimization Using K-maps
Common to revise steps (1) and (2) of the general K-map method:
Create sum-of-products
Draw 1s for each product
Ex: F = w'xz + yz + w'xy'z'
F   yz
wx  00 01 11 10
00  0  0  1  0
01  1  1  1  0
11  0  0  1  0
10  0  0  1  0
Products drawn on the map: w'xz, yz, w'xy'z'
Two-Level Size Optimization Using K-maps
Example: Minimize:
G = a + a'b'c' + b*(c' + bc')
1. Convert to sum-of-products
G = a + a'b'c' + bc' + bc'
2. Place 1s in appropriate cells
G   bc
a   00 01 11 10
0   1  0  0  1
1   1  1  1  1
3. Cover 1s: circle a (bottom row) and circle c' (columns 00 and 10)
4. OR terms: G = a + c'
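A K-map result can be double-checked by comparing truth tables. The sketch below assumes the example reads G = a + a'b'c' + b(c' + bc') and that the minimized form is G = a + c' (the complement marks are inferred from the K-map, since the extracted text dropped them). The function names are hypothetical.

```python
from itertools import product

def equivalent(f, g, n_vars):
    """True if f and g agree on every combination of n_vars Boolean inputs."""
    return all(f(*v) == g(*v) for v in product([False, True], repeat=n_vars))

def g_original(a, b, c):
    # G = a + a'b'c' + b(c' + bc')
    return a or (not a and not b and not c) or (b and (not c or (b and not c)))

def g_minimized(a, b, c):
    # G = a + c'
    return a or not c

print(equivalent(g_original, g_minimized, 3))  # True
```

Exhaustive comparison is only practical for a handful of variables, since the number of rows doubles with each added input.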
Two-Level Size Optimization Using K-maps
Four Variable Example
Minimize:
H = a'b'(cd' + c'd') + ab'c'd' + ab'cd' + a'bd + a'bcd'
1. Convert to sum-of-products:
H = a'b'cd' + a'b'c'd' + ab'c'd' + ab'cd' + a'bd + a'bcd'
2. Place 1s in K-map cells
H   cd
ab  00 01 11 10
00  1  0  0  1
01  0  1  1  1
11  0  0  0  0
10  1  0  0  1
3. Cover 1s: b'd' (four corners), a'bc, a'bd
4. OR resulting terms
H = b'd' + a'bc + a'bd
The four-corner circle is funny-looking, but remember that left/right are adjacent, and top/bottom are adjacent
Don't Care Input Combinations
What if we know that particular input combinations can never occur?
e.g., Minimize F = xy'z', given that x'y'z' (xyz=000) can never be true, and that xy'z (xyz=101) can never be true
F   yz
x   00 01 11 10
0   X  0  0  0
1   1  X  0  0
Include X in circle ONLY if minimizes equation: circling the 00 column gives F = y'z'
Don't include other Xs: unnecessary use of don't cares results in extra term (the xy' circle)
Optimization Example using Don't Cares
Minimize:
F = a'bc' + abc' + a'b'c
Given don't cares: a'bc, abc
F   bc
a   00 01 11 10
0   0  1  X  1
1   0  0  X  1
F = a'c + b
Note: Introduce don't cares with caution
Must be sure that we really don't care what the function outputs for that input combination
If we do care, even the slightest, then it's probably safer to set the output to 0
Optimization with Don't Cares Example:
Sliding Switch
Switch with 5 positions
1 2 3 4 5 x
3-bit value gives position in 2,3,4,
binary y
detector
1 1 0 1 1
depending on order we choose
Is error prone yz xy yz xy
Minimization thus typically I yz
4 terms
done by automated tools x 00 01 11 10
Exact algorithm: finds optimal
solution 0 1 1 1 0
a
(b)
a Heuristic: finds good solution,
1 1 0 1 1
but not necessarily optimal
yz xz xy
Only 3 terms
Basic Concepts Underlying Automated Two-Level
Logic Size Optimization
Definitions
On-set: All minterms that define when F=1
Off-set: All minterms that define when F=0
Implicant: Any product term (minterm or other) that when 1 causes F=1
On K-map, any legal (but not necessarily largest) circle
Cover: Implicant xy covers minterms xyz and xyz'
Expanding a term: removing a variable
Prime implicant: Maximally expanded implicant; any expansion would cover 1s not in on-set
F   yz
x   00 01 11 10
0   0  1  0  0
1   0  0  1  1
4 implicants of F: x'y'z, xyz, xyz', xy
Note: We use K-maps here just for intuitive illustration of concepts; automated tools do not use K-maps.
Automated Two-Level Logic Size Optimization Method
Tabular Method Step 1: Determine Prime Implicants
Methodically Compare All Implicant Pairs, Try to Combine
Example function: F = x'y'z' + x'y'z + x'yz + xy'z + xyz' + xyz
Actually, comparing ALL pairs isn't necessary; just pairs whose numbers of uncomplemented literals differ by one.
Minterms (3-literal implicants), grouped by number of uncomplemented literals:
0: (0) x'y'z'
1: (1) x'y'z
2: (3) x'yz, (5) xy'z, (6) xyz'
3: (7) xyz
Example comparisons: x'y'z' + x'y'z = x'y'; x'y'z' + x'yz: No
2-literal implicants: (0,1) x'y', (1,3) x'z, (1,5) y'z, (3,7) yz, (5,7) xz, (6,7) xy
1-literal implicants: (1,3,5,7) z
Step 2 (table of prime implicants vs. minterms): If only one X in row, then that PI is essential; it's the only PI that covers that row's minterm.
Tabular Method Step 3: Use Fewest Remaining PIs to
Cover Remaining Minterms
Essential PIs (from Step 2): x'y', xy, z Prime implicants
(5) xy'z
(6) xyz'
(7) xyz
(c)
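The tabular method's first two steps can be sketched in Python for the example function F = x'y'z' + x'y'z + x'yz + xy'z + xyz' + xyz (minterms 0, 1, 3, 5, 6, 7). This is an illustrative sketch, not the textbook's code: implicants are written as strings over '0', '1', and '-' (a dash marks a removed variable), and all function names are hypothetical. For simplicity it compares all pairs rather than only pairs in neighboring groups.

```python
from itertools import combinations

def combine(a, b):
    """Merge two implicants that differ in exactly one (non-dash) position."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]
    return None

def prime_implicants(minterms, n_vars):
    """Step 1: repeatedly combine implicant pairs; uncombined ones are prime."""
    current = {format(m, f"0{n_vars}b") for m in minterms}
    primes = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            c = combine(a, b)
            if c is not None:
                merged.add(c)
                used.update((a, b))
        primes |= current - used
        current = merged
    return primes

def covers(implicant, minterm, n_vars):
    bits = format(minterm, f"0{n_vars}b")
    return all(i in ("-", b) for i, b in zip(implicant, bits))

def essential(primes, minterms, n_vars):
    """Step 2: a PI is essential if it is the only PI covering some minterm."""
    result = set()
    for m in minterms:
        covering = [p for p in primes if covers(p, m, n_vars)]
        if len(covering) == 1:
            result.add(covering[0])
    return result

def to_term(implicant, names="xyz"):
    """Render '00-' as x'y', '--1' as z, and so on."""
    return "".join(n + ("" if bit == "1" else "'")
                   for n, bit in zip(names, implicant) if bit != "-")

minterms = [0, 1, 3, 5, 6, 7]
pis = prime_implicants(minterms, 3)
print(sorted(to_term(p) for p in pis))                            # ["x'y'", 'xy', 'z']
print(sorted(to_term(p) for p in essential(pis, minterms, 3)))    # ["x'y'", 'xy', 'z']
```

For this function every prime implicant turns out to be essential, so Step 3 has no remaining minterms to cover.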
Problem with Methods that Enumerate all Minterms or
Compute all Prime Implicants
Too many minterms for functions with many variables
Function with 32 variables:
2^32 = 4 billion possible minterms.
Too much compute time/memory
Too many computations to generate all prime implicants
Comparing every minterm with every other minterm, for 32
variables, is (4 billion)^2 = about 1.8 x 10^19 computations
Functions with many variables could require days, months, years,
or more of computation: unreasonable
Solution to Computation Problem
Solution
Don't generate all minterms or prime implicants
Instead, just take input equation, and try to iteratively improve it
Ex: F = abcdefgh + abcdefgh' + jklmnop
Note: 15 variables, may have thousands of minterms
But can minimize just by combining first two terms:
F = abcdefg(h+h') + jklmnop = abcdefg + jklmnop
Two-Level Optimization using Iterative Method
Method: Randomly apply "expand" operations, see if helps
Expand: remove a variable from a term
I   yz
x   00 01 11 10
0   0  1  1  0
1   0  1  1  0
[Figure (a): on this map, the term x'z expands to z]
X + x'y'z' + x'y'z
F = xy + xyz'
Random expand: F = xy X + x'y'z' + x'y'z
Not legal (x covers xy'z', xy'z, xyz', xyz: two not in on-set) a
X + x'y'z
Random expand: F = xy + x'y'z'
Legal
Implicant covered by x'y': x'y'z
X
F = xy + x'y'z' + x'y'z
Multi-Level Logic Optimization Performance/Size
Tradeoffs
We don't always need the speed of two-level logic
Multiple levels may yield fewer gates
Example
F1 = ab + acd + ace F2 = ab + ac(d + e) = a(b + c(d + e))
General technique: Factor out literals: xy + xz = x(y+z)
F1 = ab + acd + ace: 22 transistors, 2 gate delays
F2 = a(b+c(d+e)): 16 transistors, 4 gate-delays
[Figure: (a) F1 circuit, (b) F2 circuit, (c) plot of size (transistors) vs. delay (gate-delays) for F1 and F2]
Multi-Level Example
Q: Use multiple levels to reduce number of transistors for
F1 = abcd + abcef
A: abcd + abcef = abc(d + ef)
Has fewer gate inputs, thus fewer transistors
F1 = abcd + abcef: 22 transistors, 2 gate delays
F2 = abc(d + ef): 18 transistors, 3 gate delays
[Figure: (a) F1 circuit, (b) F2 circuit, (c) plot of size (transistors) vs. delay (gate-delays) for F1 and F2]
Multi-Level Example: Non-Critical Path
Critical path: longest delay path to output
Optimization: reduce size of logic on non-critical paths by using multiple
levels
F1 = (a+b)c + dfg + efg: 26 transistors, 3 gate-delays
F2 = (a+b)c + (d+e)fg: 22 transistors, 3 gate-delays
[Figure: (a) F1 circuit, (b) F2 circuit, (c) plot of size (transistors) vs. delay (gate-delays) for F1 and F2]
Automated Multi-Level Methods
6.3
State Reduction: Equivalent States
Two states are equivalent if: Inputs: x; Outputs : y
State Reduction via the Partitioning Method
First partition states into groups based on outputs
Then partition based on next states
For each possible input, for states in group, if next states in different groups, divide group
Repeat until no such states
Example (Inputs: x; Outputs: y; states A, B, C, D):
Initial groups: G1 = {A, D}, G2 = {B, C}
x=0: A goes to A (G1), D goes to D (G1): same. B goes to D (G1), C goes to B (G2): different
x=1: A goes to B (G2), D goes to B (G2): same. B goes to C (G2), C goes to B (G2): same
New groups: G1 = {A, D}, G2 = {B}, G3 = {C}
x=0: A goes to A (G1), D goes to D (G1): same
x=1: A goes to B (G2), D goes to B (G2): same
G2 and G3 are one-state groups; nothing to check
Done: A and D are equivalent
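The partitioning method lends itself to a short program. The sketch below is one possible rendering in Python, with hypothetical names (`reduce_states`, `output`, `next_state`). The transitions are those listed for the A, B, C, D example; the concrete output values (0 for A and D, 1 for B and C) are an assumption, and only the fact that A and D share one output and B and C share another affects the result.

```python
def reduce_states(states, inputs, output, next_state):
    """Return groups of equivalent states using the partitioning method."""
    # First partition states into groups based on outputs.
    groups = {}
    for s in states:
        groups.setdefault(output[s], []).append(s)
    partition = list(groups.values())

    while True:
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        new_partition = []
        for g in partition:
            # States stay together only if, for every input value,
            # their next states fall in the same group.
            buckets = {}
            for s in g:
                signature = tuple(group_of[next_state[s][x]] for x in inputs)
                buckets.setdefault(signature, []).append(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):  # no group was divided
            return new_partition
        partition = new_partition

states = ["A", "B", "C", "D"]
output = {"A": 0, "B": 1, "C": 1, "D": 0}  # assumed output values
next_state = {
    "A": {0: "A", 1: "B"},
    "B": {0: "D", 1: "C"},
    "C": {0: "B", 1: "B"},
    "D": {0: "D", 1: "B"},
}
print(reduce_states(states, [0, 1], output, next_state))
# [['A', 'D'], ['B'], ['C']]
```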
Ex: Minimizing States using the Partitioning Method
[Figure: original FSM with states S0, S1, S2, S3, S4 and reduced FSM with states S0, {S1,S2}, {S3,S4}; Inputs: x; Outputs: y; y=0 in S0, S3, S4 and y=1 in S1, S2]
Initial grouping based on outputs: G1={S3, S0, S4} and G2={S2, S1}
Group G1:
for x=0, S3 -> S3 (G1), S0 -> S2 (G2), S4 -> S0 (G1): differ, so divide
for x=1, S3 -> S0 (G1), S0 -> S1 (G2), S4 -> S0 (G1): differ, so divide
Divide G1. New groups: G1={S3,S4}, G2={S2,S1}, and G3={S0}
Repeat for G1, then G2. (G3 only has one state). No further divisions, so done.
S3 and S4 are equivalent
S2 and S1 are equivalent
Need for Automation
Inputs: x; Outputs: z
x
SA
x
SB x
SH
x SI
z=1
SO x
x x
complex for humans z=1
x
x
z=0
x z=0
x
x x' x
z=0
compute time
State Encoding
State encoding: Assigning unique bit representation to each state
Different encodings may reduce circuit size, or trade off size and performance
Minimum-bitwidth binary encoding: Uses fewest bits
Alternatives still exist, such as
A:00, B:01, C:10, D:11
A:00, B:01, C:11, D:10
4! = 24 alternatives for 4 states, N! for N states
Consider Three-Cycle Laser Timer (Inputs: b; Outputs: x; x=0 in Off, x=1 in On1, On2, On3)
Example 3.7's encoding (Off 00, On1 01, On2 10, On3 11) led to 15 gate inputs
Try alternative encoding: Off 00, On1 01, On2 11, On3 10
x = s1 + s0
n1 = s0
n0 = s1'b + s1's0
Only 8 gate inputs
Thus fewer transistors
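One way to gain confidence in an alternative encoding is to simulate its equations. The Python sketch below assumes the alternative laser-timer encoding is Off=00, On1=01, On2=11, On3=10 with x = s1 + s0, n1 = s0, n0 = s1'b + s1's0 (the complement marks are inferred, since the extracted text lost them). The `step` and `run` helpers are hypothetical.

```python
def step(s1, s0, b):
    """One clock cycle: return (x, n1, n0) for the current state and input b."""
    x = s1 | s0                            # x = s1 + s0
    n1 = s0                                # n1 = s0
    n0 = ((1 - s1) & b) | ((1 - s1) & s0)  # n0 = s1'b + s1's0
    return x, n1, n0

def run(b_sequence):
    """Simulate from the Off state (00); return the x output on each cycle."""
    s1 = s0 = 0
    xs = []
    for b in b_sequence:
        x, s1, s0 = step(s1, s0, b)
        xs.append(x)
    return xs

# Press the button (b=1) for one cycle; x should then be 1 for exactly
# three cycles before returning to 0.
print(run([0, 1, 0, 0, 0, 0]))  # [0, 0, 1, 1, 1, 0]
```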
State Encoding: One-Hot Encoding
Inputs: none; Outputs: x
One-hot encoding x=0 x=1
[Figure: state register and next-state logic for binary encoding (s1 s0, n1 n0) and for one-hot encoding (s3 s2 s1 s0, n3 n2 n1 n0); plot comparing binary and one-hot, with delay (gate-delays) on the horizontal axis]
One-Hot Encoding Example:
Three-Cycles-High Laser Timer
Four states: Use four-bit one-hot encoding (Off 0001, On1 0010, On2 0100, On3 1000)
Inputs: b; Outputs: x (x=0 in Off, x=1 in On1, On2, On3)
State table leads to equations:
x = s3 + s2 + s1
n3 = s2
n2 = s1
n1 = s0*b
n0 = s0*b' + s3
Smaller
3+0+0+2+(2+2) = 9 gate inputs
Earlier binary encoding (Ch 3): 15 gate inputs
Faster
Critical path: n0 = s0*b' + s3
Previously: n0 = s1's0'b + s1s0'
2-input AND slightly faster than 3-input AND
Output Encoding
Output encoding: Encoding Use the output values
as the state encoding
method where the state
encoding is same as the Inputs: none; Outputs: x,y
xy=00 xy=01
output values
A 00 D 11
Possible if enough outputs, all
states with unique output values
B 01 C 10 a
xy=11 xy=10
Output Encoding Example: Sequence Generator
w
Inputs: none; Outputs: w, x, y, z x
y
wxyz=0001 wxyz=1000 z
A D
B C
wxyz=0011 wxyz=1100
s3 s2 s1 s0
Generate sequence 0001, 0011, 1100, 1000, repeat
FSM shown
Use output values as state encoding
[Figure: state register s3 s2 s1 s0 with next-state signals n3 n2 n1 n0 and outputs w, x, y, z]
Moore vs. Mealy FSMs
Mealy FSM adds this
N N
(a) (b)
clk clk
a
Inputs: enough Inputs: enough
State: I W W D I State: I W W I
Moore outputs in b
s1s0=10, p=0
clk clk
Mealy and Moore can be Combined
6.4
Building a Faster Adder
Built carry-ripple adder in Ch 4
Similar to adding by hand, column by column
Con: Slow
Output is not correct until the carries have rippled to the left (critical path)
4-bit carry-ripple adder has 4*2 = 8 gate delays
Pro: Small
4-bit carry-ripple adder has just 4*5 = 20 gates
[Figure: 4-bit adder with inputs a3..a0, b3..b0, cin and outputs cout, s3..s0, built from four full adders (FA) with carries c3, c2, c1 rippling left]
Building a Faster Adder
Carry-ripple is slow but small
8-bit: 8*2 = 16 gate delays, 8*5 = 40 gates
16-bit: 16*2 = 32 gate delays, 16*5 = 80 gates
32-bit: 32*2 = 64 gate delays, 32*5 = 160 gates
Two-level logic adder (2 gate delays)
OK for 4-bit adder: About 100 gates
8-bit: 8,000 transistors / 16-bit: 2 M / 32-bit: 100 billion
N-bit two-level adder uses absurd number of gates for N much beyond 4
Compromise
Build 4-bit adder using two-level logic, compose to build N-bit adder
8-bit adder: 2*(2 gate delays) = 4 gate delays, 2*(100 gates)=200 gates
32-bit adder: 8*(2 gate delays) = 16 gate delays, 8*(100 gates)=800 gates
Can we do better?
[Figure: plot of transistors (0 to 10000) vs. N (1 to 8) for a two-level adder; 8-bit adder composed of two 4-bit adders, with the co of the low adder feeding the ci of the high adder]
Faster Adder (Naive Inefficient) Attempt at "Lookahead"
Idea
Modify carry-ripple adder: For a stage's carry-in, don't wait for carry to ripple, but rather directly compute from inputs of earlier stages
Called "lookahead" because current stage "looks ahead" at previous stages rather than waiting for carry to ripple to current stage
[Figure: 4-bit adder (stages 3 to 0) in which each stage's carry-in c3, c2, c1 is computed directly from the inputs of earlier stages instead of rippling]
Full-adder: s = a xor b xor c; co = bc + ac + ab
c1 = co0 = b0c0 + a0c0 + a0b0
Not efficient
Need a better form of lookahead
Efficient Lookahead
[Figure: column-by-column addition of A (a3 a2 a1 a0) and B (b3 b2 b1 b0) with carries c4 c3 c2 c1 c0, where c0 = cin]
c1 = a0b0 + (a0 xor b0)c0
c2 = a1b1 + (a1 xor b1)c1
c3 = a2b2 + (a2 xor b2)c2
c4 = a3b3 + (a3 xor b3)c3
if a0b0 = 1 then c1 = 1 (call this G: Generate)
if a0 xor b0 = 1 then c1 = 1 if c0 = 1 (call this P: Propagate)
Why those names? When a0b0=1, we should generate a 1 for c1. When a0 XOR b0 = 1, we should propagate the c0 value as the value of c1, meaning c1 should equal c0.
c3 = G2 + P2c2
c3 = G2 + P2(G1 + P1G0 + P1P0c0)
c3 = G2 + P2G1 + P2P1G0 + P2P1P0c0
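The generate/propagate idea can be checked against ordinary addition. The Python sketch below uses Gi = ai*bi and Pi = ai xor bi and expands every carry into two-level form following the same pattern as the c3 expansion above; the c1, c2, and c4 expansions and the function names are filled in here as assumptions by analogy.

```python
def lookahead_carries(a_bits, b_bits, c0):
    """a_bits and b_bits are lists [bit0, bit1, bit2, bit3] of 0/1 values."""
    G = [a & b for a, b in zip(a_bits, b_bits)]  # generate
    P = [a ^ b for a, b in zip(a_bits, b_bits)]  # propagate
    # Each carry depends only on G, P, and c0, never on another carry.
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))
    return [c0, c1, c2, c3, c4], P

def cla_add(a, b, c0=0):
    """Add two 4-bit numbers; return (4-bit sum, carry out)."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    carries, P = lookahead_carries(a_bits, b_bits, c0)
    # Sum bit of each stage: s_i = P_i xor c_i
    s = sum((P[i] ^ carries[i]) << i for i in range(4))
    return s, carries[4]

print(cla_add(0b1011, 0b0110))  # (1, 1), since 11 + 6 = 17 = 16 + 1
```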
[Figure: 4-bit carry-lookahead adder with inputs a3 b3 a2 b2 a1 b1 a0 b0 and cin; one SPG block per stage (a half-adder producing G and P, plus an XOR producing s) feeding carry-lookahead logic (Stage 1 to Stage 4) that produces c1, c2, c3, and cout]
Each stage: HA for G and P, another XOR for s; call SPG block
Create carry-lookahead logic from equations
More efficient than naive scheme, at expense of one extra gate delay
c1 = G0 + P0c0
c2 = G1 + P1G0 + P1P0c0
c3 = G2 + P2G1 + P2P1G0 + P2P1P0c0
cout = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0c0
Carry-Lookahead Adder High-Level View
a3 b3 a2 b2 a1 b1 a0 b0 c0
P3 G3 c3 P2 G2 c2 P1 G1 c1 P0 G0
4-bit carry-lookahead logic
cout
cout s3 s2 s1 s0
Carry-Lookahead Adder 32-bit?
Problem: Gates get bigger in each stage
4th stage has 5-input gates
32nd stage would have 33-input gates
Too many inputs for one gate
Would require building from smaller gates,
meaning more levels (slower), more gates
(bigger)
One solution: Connect 4-bit CLA adders in carry-ripple manner
Ex: 16-bit adder: 4 + 4 + 4 + 4 = 16 gate delays. Can we do better?
[Figure: carry-lookahead logic in which gates get bigger in each stage (Stage 4 shown); four 4-bit CLA adders for a15-a12/b15-b12, a11-a8/b11-b8, a7-a4/b7-b4, and a3-a0/b3-b0 connected in carry-ripple manner]
Hierarchical Carry-Lookahead Adders
Better solution: Rather than rippling the carries, just repeat the carry-lookahead concept
Requires minor modification of 4-bit CLA adder to output P and G
These use carry-lookahead internally
P3G3 c3 P2 G2 c2 P1 G1 c1 P0 G0
a
4-bit carry-lookahead logic
P G cout
a3a2a1a0 b3b2b1b0
I1 I0 suppose =1
5-bit wide 2x1 mux S
Q
co s7s6s5s4 s3s2s1s0
Adder Tradeoffs
carry-lookahead
multilevel
size
carry-lookahead
carry-select
carry-
ripple
delay
b0
pp1
b1
pp2
0 0
b2
+ (5-bit)
pp3
00
a
b3 pp4
+ (6-bit) ... and 31 adders
0 00 here (big ones, too)
+ (7-bit)
Digital Design 2e
Copyright 2010 p7..p0 65
Frank Vahid
Smaller Multiplier -- Sequential (Add-and-Shift) Style
Smaller multiplier: Basic idea
Don't compute all partial products simultaneously
Rather, compute one at a time (similar to by hand), maintain running sum
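The add-and-shift idea can be illustrated with a small Python sketch. It models the running sum as an integer, adds each partial product into the upper bits, and then shifts right once per multiplier bit; the function name and the `n_bits` parameter are hypothetical, and the sketch ignores register widths and the controller's cycle-by-cycle signals.

```python
def multiply(multiplicand, multiplier, n_bits=4):
    """Multiply two n_bits-wide unsigned numbers by repeated add and shift."""
    running_sum = 0
    for i in range(n_bits):
        # Look at one multiplier bit at a time, least significant first.
        if (multiplier >> i) & 1:
            # Add the partial product into the upper half of the running sum.
            running_sum += multiplicand << n_bits
        # Shift the running sum right one position.
        running_sum >>= 1
    return running_sum

print(multiply(0b0110, 0b0011))               # 18
print(format(multiply(0b0110, 0b0011), "08b"))  # 00010010
```

Shifting the running sum right, rather than shifting each partial product left, is what lets the hardware get by with a single narrow adder.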
Smaller Multiplier -- Sequential (Add-and-Shift) Style
multiplier multiplicand
Controller
4-bit adder
running sum right multiplier
register (4)
mrld
(relative to partial load
0011
product) after each step mr3
mr2
ensures partial product mr1
added to correct running mr0
00010010
rsload
sum bits rsclear
load
running sum00100100
rsshr
clear 01001000
shr register (8) 00110000
Step 1 Step 2 Step 3 Step 4 00000000
0110 0110 0110 0110
0011 0 01 1 001 1 start 0011 a
0000 00110 010010 0010010 (running sum)
+ 0110 +0 1 1 0 + 0000 + 0000 (partial product)
00110 010010 0010010 0 0 0 1 0 0 1 0 (new running sum) product
Smaller Multiplier -- Sequential Style: Controller
Vs. array-style:
Pro: small
Just three registers, adder, and controller
Con: slow
2 cycles per multiplier bit
32-bit: 32*2=64 cycles (plus 1 for init.)
[Figure: controller FSM. On start: mdld = 1, mrld = 1, rsclear = 1. Then, for each multiplier bit mr0, mr1, mr2, mr3 in turn: rsload=1 if the bit is 1, followed by rsshr=1. Controller inputs: start, mr3..mr0; outputs: mdld, mrld, rsload, rsclear, rsshr]
start
multiplier multiplicand
register (4)
Looks at multiplier one bit at a load
0110
time
Adds partial product mdld
multiplier 4-bit adder
controller
mr2
Then shifts running sum right mr1
mr0 a
rsload
one position rsclear
load
running sum
rsshr
10010
00110
0110
00000000
0010010
00010010
010010
clear
shr 000
0000
000
register (8)
start
Digital Design 2e
Copyright 2010
Correct product product 68
Frank Vahid
6.5
Pipelining
[Figure: (a) Without pipelining, two levels of 2 ns adders feed register S: longest path is 2+2 = 4 ns, so minimum clock period is 4 ns. (b) With pipeline registers between Stage 1 and Stage 2: longest path is only 2 ns, so minimum clock period is 2 ns.]
Multipliers: 20 ns
14 ns for entire adder tree
Critical path of 20+14 = 34 ns
Add pipeline registers between the multipliers (Stage 1) and the adder tree (Stage 2)
Clock frequency can be nearly doubled
Great speedup with minimal extra hardware
[Figure: datapath with multipliers (20 ns), pipeline registers, adder tree (14 ns), and yreg driving Y]
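The effect of pipeline registers on the clock can be put into numbers with a tiny Python sketch. It assumes the minimum clock period equals the longest register-to-register path and ignores register setup and propagation overhead; the helper name is hypothetical.

```python
def min_clock_period(stage_delays_ns):
    """Minimum clock period is set by the slowest stage between registers."""
    return max(stage_delays_ns)

# Multipliers (20 ns) followed by the adder tree (14 ns).
unpipelined = min_clock_period([20 + 14])  # one long combinational path
pipelined = min_clock_period([20, 14])     # pipeline registers in between
print(unpipelined, pipelined)              # 34 20
print(unpipelined / pipelined)             # 1.7, i.e. frequency nearly doubled

# The two-level adder example: 2 ns + 2 ns becomes two 2 ns stages.
print(min_clock_period([2 + 2]) / min_clock_period([2, 2]))  # 2.0
```

The speedup is limited by the slower stage, which is why balancing stage delays matters when choosing where to place pipeline registers.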
Concurrency
Concurrency: Divide task into subparts, execute subparts simultaneously
Dishwashing example: Divide stack into 3 substacks, give substacks to 3 neighbors, who work simultaneously: 3 times speedup (ignoring time to move dishes to neighbors' homes)
Concurrency does things side-by-side; pipelining instead uses stages (like a factory line)
Can do both, too
Already used concurrency in FIR filter: concurrent multiplications
[Figure: a task divided by pipelining, by concurrency, and by both; three concurrent multipliers (*)]
Concurrency Example: SAD Design Revisited
Sum-of-absolute differences video compression example (Ch 5)
Compute sum of absolute differences (SAD) of 256 pairs of pixels
Original : Main loop did 1 sum per iteration, 256 iterations, 2 cycles per iter.
go AB_rd AB_addr A_data B_data
i_lt_256 A
S0 go lt 8 8
cmp B
go 256 9
i_inc
sum_clr=1 A B
S1
i_clr=1 i_clr
i
8
S2
sum_ld
i_lt_256'
S1 i_clr=1 i_clr
New: 16*2 = 32 cycles
sum_ld
S2 16 absolute
i_lt_16'
Component Allocation
Another RTL tradeoff: Component allocation: choosing a particular set of functional units to implement a set of operations
e.g., given two states, each with multiplication
Can use 2 multipliers (*)
OR, can instead use 1 multiplier, and 2 muxes
Smaller size, but slightly longer delay due to the mux delay
[Figure: (a) two states computing t1 and t4; (b) datapath with one multiplier and two 2x1 muxes (selects sl, sr) choosing among t2, t3, t5, t6; (c) size vs. delay plot comparing 2 mul and 1 mul]
Operator Binding
Another RTL tradeoff: Operator binding: mapping a set of operations to a particular component allocation
Note: operator/operation mean behavior (multiplication, addition), while component (aka functional unit) means hardware (multiplier, adder)
Different bindings may yield different size or delay
[Figure: states A, B, C with 2 multipliers (MULA, MULB) allocated; Binding 1 needs 2 muxes, Binding 2 needs 1 mux; size vs. delay plot comparing Binding 1 and Binding 2]
Operator Scheduling
Yet another RTL tradeoff: Operator scheduling
Introducing or merging states, and assigning operations to
those states.
A B C A B B2 C
t2 t5 t3 t6
a
but more
t2 t3 t5 t6 delay due to
sl 2x1 2x1 sr
muxes, and
* * smaller extra state
3-state schedule
(only 1 *) *
size
t1 t4
t1 t4
4-state schedule
Yreg := 0 Yreg :=
xt0 := 0 c0*xt0 +
xt1 := 0 c1*xt1 + 3 2 2
xt2 := 0 c2*xt2
c0 := 3 xt0 := X c0_ld c1_ld c2_ld
c1 := 2 xt1 := xt0 xt0_clr
c2 := 2 xt2 := xt1 c0 c1 c2
...
FIR filter
...
xt0_ld
xt0 xt1 xt2
X
12
clk
x(t)
* x(t-1)
* x(t-2)
*
Yreg_clr
+ + Yreg_ld
Y
Yreg
12
Datapath for 3-tap FIR filter
Operator Scheduling Example: Smaller FIR Filter
Reduce the design's size by re-scheduling the operations
Do only one multiplication operation per state
Inputs: X (12 bits) Inputs : X (12 bits)
Outputs: Y (12 bits) Outputs : Y (12 bits)
Local storage: xt0, xt1, xt2, Local storage: xt0, xt1, xt2,
c0, c1, c2 (12 bits); c0, c1, c2 (12 bits);
Yreg (12 bits) Yreg, sum (12 bits)
Yreg := sum := 0
c0*xt0 + xt0 := X
S1 c1*xt1 + S1
xt1 := xt0
c2*xt2 xt2 := xt1
xt0 := X
xt1 := xt0
sum := sum + c0xt0
xt2 := xt1
S2
(a)
a
Operator Scheduling Example: Smaller FIR Filter
Many other options exist between fully-concurrent and fully-serialized
e.g., for 3-tap FIR, can use 1, 2, or 3 multipliers
Can also choose fast array-style multipliers (which are concurrent internally) or slower shift-and-add multipliers (which are serialized internally)
Each option represents compromises
[Figure: size vs. delay plot with concurrent FIR, compromises, and serial FIR]
6.6
high-level changes
size
delay land
Algorithm Selection
Chosen algorithm can have big impact
e.g., which filtering algorithm?
FIR is one type, but others require less computation at expense of lower-quality filtering
Example: Quickly find item's address in 256-word memory
One use: data compression. Many others.
Algorithm 1: "Linear search"
Compare item with M[0], then M[1], M[2], ...
256 comparisons worst case
Algorithm 2: "Binary search" (sort memory first)
Start considering entire memory range
If M[mid]>item, consider lower half of M
If M[mid]<item, consider upper half of M
Repeat on new smaller range
Dividing range by 2 each step; at most 8 such divisions
Only 8 comparisons in worst case
Choice of algorithm has tremendous impact
Far more impact than say choice of comparator type
[Figure: 256x32 memory with contents 0: 0x00000000, 1: 0x00000001, 2: 0x0000000F, 3: 0x000000FF, ..., 96: 0x00000F0A, 128: 0x0000FFAA, ..., 255: 0xFFFF0000; linear search scans from address 0, binary search probes addresses such as 128, 64, 96]
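The gap between the two algorithms is easy to measure in software. The Python sketch below counts probes for both searches over a hypothetical sorted 256-word memory (the contents are made up for illustration). Note that this particular binary search counts every memory probe, so its worst case may come out one higher than the eight range divisions mentioned above.

```python
def linear_search(memory, item):
    """Return (address, comparisons) scanning from address 0 upward."""
    comparisons = 0
    for address, word in enumerate(memory):
        comparisons += 1
        if word == item:
            return address, comparisons
    return None, comparisons

def binary_search(memory, item):
    """Return (address, probes); memory must be sorted in ascending order."""
    lo, hi = 0, len(memory) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if memory[mid] == item:
            return mid, probes
        if memory[mid] > item:
            hi = mid - 1   # consider lower half
        else:
            lo = mid + 1   # consider upper half
    return None, probes

memory = [3 * i for i in range(256)]  # hypothetical sorted contents
worst_linear = max(linear_search(memory, w)[1] for w in memory)
worst_binary = max(binary_search(memory, w)[1] for w in memory)
print(worst_linear, worst_binary)
```

Running it shows the worst case dropping from hundreds of comparisons to fewer than ten, which is the slide's point about algorithm choice outweighing low-level component choices.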
Power Optimization
Until now, we've focused on size and delay
Power Optimization using Clock Gating
P = k * CV^2 f
[Figure: datapath registers c0, c1, c2 and xt0, xt1, xt2 with input X]
Multiple versions of gates may exist
Fast/high-power, and slow/low-power, versions
[Figure: power comparison of high-power gates vs. low-power gates on non-critical path]
Chapter Summary
Optimization (improve criteria without loss) vs. tradeoff (improve criteria at
expense of another)
Combinational logic
K-maps and tabular method for two-level logic size optimization
Iterative improvement heuristics for two-level optimization and multi-level too
Sequential logic
State minimization, state encoding, Moore vs. Mealy
Datapath components
Faster adder using carry-lookahead
Smaller multiplier using sequential multiplication
RTL
Pipelining, concurrency, component allocation, operator binding, operator
scheduling
Serial vs. concurrent, efficient algorithms, power reduction methods (clock
gating, low-power gates)
Multibillion dollar EDA industry (electronic design automation) creates tools
for automated optimization/tradeoffs