Bit-Level Optimization of Adder-Trees for Multiple Constant Multiplication FIR Filter Implementation

SUBMITTED TO IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS
Yu Pan and Pramod Kumar Meher, Senior Member, IEEE

Yu Pan and Pramod Kumar Meher are with the Institute for Infocomm Research, 1 Fusionopolis Way, Singapore 138632 (e-mail: {ypan, pkmeher}@i2r.a-star.edu.sg).
Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

Abstract: The multiple constant multiplication (MCM) scheme is widely used for implementing transposed direct-form FIR filters. While MCM research has focused on more effective common subexpression elimination, the optimization of the adder-trees, which sum up the computed subexpressions for each coefficient, has been largely neglected. In this paper, we identify the resource minimization problem in the scheduling of adder-tree operations for the MCM block, and present a mixed integer programming (MIP) based algorithm for more efficient MCM-based implementation of FIR filters. Experimental results show that up to 15% reduction in area and 11.6% reduction in power (8.46% and 5.96% on average, respectively) can be achieved on top of an already optimized adder/subtractor network of the MCM block.

Index Terms: Adder-tree optimization, finite impulse response filter, FIR, multiple constant multiplication, MCM

I. INTRODUCTION

Finite impulse response (FIR) digital filters are widely used as a basic tool in many digital signal processing (DSP) and communication applications. The complexity of an FIR filter is largely dominated by the multiplication of input samples with filter coefficients. Fortunately, the filter coefficients are constants for a given filter, so the multiplications can be implemented by a network of adders, subtractors, and hardwired shifts, in which the number of adders and subtractors is minimized by a constant multiplication scheme. In a transposed direct-form FIR filter, the most recent input sample at any given clock period is multiplied by all the filter coefficients. A set of intermediate results is generated in this case and shared across all the multiplications in order to minimize the total number of additions/subtractions using multiple constant multiplication (MCM) techniques (see Fig. 1(a)). Each such intermediate result in an MCM process corresponds to one of the common subexpressions (CS) of the set of constants to be multiplied.

A great deal of research has been done to develop effective algorithms that identify the optimal set of non-redundant subexpressions to achieve the minimum number of logic operators and the minimum logic depth of the MCM [1]-[7]. Irrespective of differences in methodology and level of optimality, in all these works, after the common subexpression terms are determined and the ADD/SUB network of non-redundant subexpressions (or terms) is formed, the product value corresponding to each coefficient is computed by an adder-tree that sums up its relevant terms. As shown in Fig. 1(b), two adder-trees are formed to compute the products for a pair of coefficients using shifted versions of the unique CS terms $1$, $10\bar{1}$ and $1001$ from the term network (subexpression network).

[Figure: block diagrams for two example coefficients C0 and C1, showing the MCM term network, Adder Tree 0, Adder Tree 1, and the structural add/sub and register (REG) line.]
Fig. 1. Composition of the MCM block. (a) MCM and common subexpressions. (b) Term network and adder-trees for each coefficient.
A brief characterization study of the number of ADD/SUB operators in different parts of arbitrary filters, using 2-bit recursive common subexpression elimination for MCM, reveals that the number of operators used to form the adder-tree networks is very significant, sometimes several times larger than that in the term networks. While the main research focus of MCM is on more effective common subexpression sharing techniques, optimization of the adder-trees is largely overlooked [8], [9]. Irrespective of which common subexpression identification algorithm is used, the formation of an adder-tree is commonly handled by the same tree-height minimization algorithm [10], which guarantees that the height of the generated adder-tree is minimum at the operator level. In this paper, we present a method to derive equivalent adder-trees that minimize the adder-tree resource. We develop a cost model of the shift-ADD/SUB network by bit-level analysis, and show that the cost can be reduced by suitable scheduling of operations on the adder-tree. We find that significant area and power reduction (up to 15% and 11.6%, respectively, with averages of 8.46% and 5.96%) can be achieved on top of already optimized MCM blocks.
II. ADDER-TREE SCHEDULING PROBLEM

Given $N$ input terms $T = \{T_0, \ldots, T_{N-1}\}$ and their earliest arrival times (delays) $D_i$, the objective of an adder-tree scheduling algorithm is to define an assignment of binary addition and subtraction operations that sums up the input terms such that the total delay to produce the final output is minimized.
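As a rough illustration of this objective, the sketch below repeatedly combines the two earliest-arriving terms into a new term that is ready one ADD/SUB step after the later of the two, which is the idea behind tree-height minimization. It is a minimal sketch rather than the algorithm of [10]: the function name `greedy_schedule`, the term labels `T0`, `N0`, ... and the returned tuple layout are assumptions made for illustration, and signs and shifts are ignored.

```python
import heapq
from itertools import count


def greedy_schedule(delays):
    """Collapse the two earliest-arriving terms until a single term remains.

    delays: non-empty list of arrival times D_i of the input terms.
    Returns (output_delay, ops); each op is
    (left_name, right_name, result_name, result_delay).
    """
    tie = count()  # unique tie-breaker so names are never compared
    heap = [(d, next(tie), f"T{i}") for i, d in enumerate(delays)]
    heapq.heapify(heap)
    ops = []
    while len(heap) > 1:
        d_a, _, a = heapq.heappop(heap)
        d_b, _, b = heapq.heappop(heap)
        d_new = max(d_a, d_b) + 1  # one ADD/SUB step after the later input
        name = f"N{len(ops)}"
        ops.append((a, b, name, d_new))
        heapq.heappush(heap, (d_new, next(tie), name))
    return heap[0][0], ops
```

For example, four terms that are all available at time 0 are summed in two steps using three operators, while terms with delays 0, 1, 2 and 3 need four steps.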
Fig. 2. (a) An example adder-tree with delays and signs on each input term. (b) An internal schedule with minimum delay.
A. Greedy Adder-Tree Scheduling

The common practice for handling the summation of the CS terms of each coefficient is to use the tree-height minimization algorithm [10] to produce a height-optimal adder-tree. The algorithm iteratively collapses the pair $\{T_i, T_j\}$ with the smallest delays using an ADD/SUB to form a new term with delay $\max(D_i, D_j) + 1$, until a single term remains. Fig. 2 gives an example of an adder-tree and a minimum-delay schedule for it.

Note that either a positive or a negative sign is associated with each input term (see Fig. 2(a)), which denotes whether the corresponding term is added to or subtracted from the summation. These signs also determine whether an addition or a subtraction is used when the algorithm collapses a pair of terms, based on the following rules. (1) If the two input edges have the same sign, an ADD is used; otherwise, a SUB is used. (2) The sign of the output edge is always the same as that of the left input edge (i.e., the minuend edge in the subtraction case). With these two rules, the final term producing the summation result may carry a negative sign, in which case a negation is needed after the adder-tree to correct the value. For an FIR filter, results from multiple adder-trees are accumulated by a structural adder-register line, so the negation can be eliminated by replacing the structural adder with a subtractor (see coefficient C1 in Fig. 1 for an example).

B. Cost Model

In order to quantify and minimize the hardware cost of the adder-tree, we model the cost of ADD/SUB operations in this section based on the ripple-carry implementation, which is the most area-efficient and will be selected by the hardware compiler whenever timing allows.

Without loss of generality, for a single ADD/SUB operation, the two input operands may be of different bit widths, and one of them is left-shifted by a certain number of bit positions. We enumerate all the scenarios of shift-add/sub operations in Fig. 3. The cost is calculated separately in three bit segments. Starting from the least significant bit (LSB), the 1st segment covers the bit positions up to but not including the first bit of the shifted operand; the 3rd segment covers the bits corresponding to the sign-extension bits of the sign-extended operand; the 2nd segment takes the rest of the bit positions.

Two cases of the ADD operation are shown in Fig. 3(a) and (b). In both cases, the 2nd and 3rd segments are implemented by one full adder (FA) per bit, while the 1st segment costs nothing but wiring.

Fig. 3. Cost of ADD/SUB operation under various input scenarios. Notations: FA - full adder, overline - inverter (INV). (a)(b) Cases for ADD. (c)(d)(e)(f) Cases for SUB.
[Figure: two adder-trees for the example coefficient C = 30914, with nodes marked as term, tree, or structural nodes and annotated as value/[bit width].]
Fig. 4. (a) Adder-tree by greedy scheduling algorithm. (b) Bit-level resource optimal adder-tree.
Four cases of the SUB operation are shown in Fig. 3(c)-(f). In the first two cases, where the shift is on the minuend, the 1st segment is implemented by FAs with inverters on the subtrahend bits, except for a single special case at the LSB, which uses a direct wire connection (see footnote 1) to save one FA and inverter pair. The 1st segments for the last two cases are wires. The 2nd segment in all cases is implemented with pairs of FAs and inverters. For the 3rd segment, when the sign-extension bits come from the subtrahend, inverters are not needed since these bits simply take the value of the inverted sign of the subtrahend.

This cost model has been verified experimentally against synthesis results from Synopsys Design Compiler for ASICs, and is also applicable to cases where either or both input operands are unsigned signals.
C. Motivational Example

The number of operators in an adder-tree is determined by the number of input terms the coefficient uses from the term network. An $N$-input adder-tree requires $N - 1$ operators.

A motivational example in Fig. 4 shows the key differences between a greedily scheduled adder-tree based on the height minimization algorithm and a resource-optimized adder-tree of the same height. First, the bit widths (see footnote 2) of intermediate nodes are significantly smaller. For example, while the other adder-tree nodes are of similar bit widths, the one in the second layer is only 7 bits wide in the optimal adder-tree (Fig. 4(b)), compared to 14 bits in the non-optimal adder-tree. Note that a wider signal may further increase hardware cost when it is input to the next layer of operators. Second, subtractions with the shift on the subtrahend also reduce operator cost. For example, for a base bit width of 8 bits, the operator performing $(1 \ll 4) - 1$ in Fig. 4(a) costs 11 FAs and 7 INVs according to the cost model, while the operator performing $1 - (1 \ll 4)$ in Fig. 4(b) costs 8 FAs and 8 INVs. In total, the optimal adder-tree sums up to 30 FAs and 20 INVs, while the non-optimal one uses 40 FAs and 7 INVs. Assuming the ratio of the resource consumption of an FA to that of an INV is around 8 to 1, nearly 18% of the resource is saved by using the optimal adder-tree in this example.

Footnote 1: Adding 1 to the inverted LSB is equivalent to wiring the LSB and moving the plus 1 to the second LSB. This optimization is performed by most synthesis tools.
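The two per-operator counts quoted in this example can be reproduced with a small sketch of the cost model. It assumes the closed-form expressions that the paper gives later as constraint (18), counts full adders and inverters separately, and takes the output width of both example operators to be 12 bits (8 base bits plus 4 bits for the value 15, following the bit width rule of (17)). The function names and argument order are illustrative only.

```python
def add_cost(bw_out, sh_left, sh_right):
    """(full adders, inverters) for an ADD; shifted-in low bits are plain wires."""
    return bw_out - (sh_left + sh_right), 0


def sub_cost(bw_out, sh_left, sh_right, bw_right):
    """(full adders, inverters) for a SUB computing left - right.

    bw_out:   bit width of the output edge
    sh_left:  shift on the minuend (left input)
    sh_right: shift on the subtrahend (right input)
    bw_right: bit width of the subtrahend
    """
    s_left = 1 if sh_left > 0 else 0        # only the LSB is saved when the minuend is shifted
    full_adders = bw_out - (s_left + sh_right)
    inverters = bw_right - s_left
    return full_adders, inverters


# Shift on the minuend: (X << 4) - X with an 8-bit X.
print(sub_cost(bw_out=12, sh_left=4, sh_right=0, bw_right=8))
# Shift on the subtrahend: X - (X << 4) with an 8-bit X.
print(sub_cost(bw_out=12, sh_left=0, sh_right=4, bw_right=8))
```

Under these assumptions the first call yields 11 FAs and 7 INVs and the second yields 8 FAs and 8 INVs, matching the figures quoted for Fig. 4(a) and Fig. 4(b).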
Note that the adder-tree also determines the type of operator used for its structural accumulation. An output edge carrying a $-$ sign requires a SUB on the accumulation line, which usually consumes more hardware than an ADD. For linear-phase FIR filters, where coefficients are symmetric, each adder-tree corresponds to 2 structural operators.
D. Logic Depth Relaxation

The clock performance of the entire FIR filter is determined by the largest of the delays of all coefficients. Assuming the delay of an ADD/SUB operator to be 1 unit, the delay of the constant multiplication by a coefficient can be measured simply by the number of ADD/SUB steps on a maximal path in the part of the network corresponding to the coefficient. We use the term logic depth for the required number of ADD/SUB steps. For a coefficient whose logic depth is less than the filter's logic depth, incrementing (relaxing) its logic depth may reduce the resource consumption.

Given an algorithm that computes the minimum-resource adder-tree of a given depth $L$ for a coefficient, if $L$ is less than the filter's logic depth, one can always try increasing $L$ by 1 and rescheduling onto a depth-$(L + 1)$ adder-tree for a possible reduction of resource without degrading the filter's clock performance.
III. MINIMUM RESOURCE ADDER-TREE SCHEDULING USING MIP

In this section, we describe a mixed integer programming (MIP) based formulation that schedules the adder-tree of an individual coefficient to minimize the hardware resource. As linearity is required in MIP, various techniques for transforming modeling-friendly non-linear expressions into linear equations and inequalities are indispensable and are discussed in detail. This MIP procedure is then used in the next section as the building block for resource minimization of the entire FIR filter.
Given a set of input terms $T = \{T_0, \ldots, T_{N-1}\}$ and their earliest arrival times or delays $D_i$, we try to form an adder-tree of maximum depth $L$ that sums up the input terms and at the same time minimizes the hardware resource used.

Footnote 2: This refers to the additional bit width imposed by the ADD/SUB network on top of the base bit width of the input signal $X$.

In order to model the above problem, we construct a complete binary tree $G(V, E)$ of depth $L$. Each node $V_i \in V$ on the tree is a position to hold a binary operator (ADD/SUB), and $|V| = 2^L - 1$. Each edge $E_i \in E$ is a potential operand position to accommodate an input term, and $|E| = 2^{L+1} - 1$. Fig. 5 shows an example of the decision tree when $L = 3$.

[Figure: complete binary tree with operator positions V0 to V6 and edge positions E0 to E14 arranged in Layer 0 to Layer 3.]
Fig. 5. Binary adder-tree of depth $L = 3$ used for MIP modeling.

An input term $T_i$ of delay $D_i$ is schedulable on all edge positions of layer $D_i$ and downwards. For each input term $T_i$, we create a set of binary variables $\mathit{TSet}_i = \{t_{i.j} \mid E_j \text{ is of depth equal to or greater than } D_i\}$. $T_i$ is said to be scheduled on $E_j$ if $t_{i.j}$ is assigned 1. The adder-tree scheduling problem is equivalent to finding an assignment of input terms to the edge positions on the binary tree such that the resource of the adder-tree is minimized. Constraints are formulated to ensure that the adder-tree summing up these input terms is correctly formed.
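The decision tree can be made concrete with a small sketch that enumerates the edge positions layer by layer and records, for each edge, the edge produced by the operator it feeds. The layer-by-layer numbering (E0 to E7 on layer 0, E8 to E11 on layer 1, and so on for $L = 3$) is an assumption read off the labels that survive from Fig. 5, and the helper names are hypothetical.

```python
def build_tree(depth):
    """Return (layer_of, output_of) for a complete binary tree of the given depth.

    layer_of[j]  : layer of edge E_j (layer 0 holds the leaf edges)
    output_of[j] : index of the edge produced by the operator that E_j feeds
    """
    layer_of, output_of = {}, {}
    start = 0
    for layer in range(depth + 1):
        width = 2 ** (depth - layer)      # number of edges on this layer
        next_start = start + width
        for pos in range(width):
            j = start + pos
            layer_of[j] = layer
            if layer < depth:             # the final output edge feeds no operator
                output_of[j] = next_start + pos // 2
        start = next_start
    return layer_of, output_of


def schedulable_edges(layer_of, delay):
    """Edge positions on which a term of the given delay may be placed."""
    return sorted(j for j, layer in layer_of.items() if layer >= delay)


def depends_on(output_of, j2, j1):
    """True if there is a directed path from E_j1 down to E_j2."""
    j = j1
    while j in output_of:
        j = output_of[j]
        if j == j2:
            return True
    return False
```

With `depth = 3` this gives 15 edge positions and 7 operator positions, and a term of delay 2 can only be placed on E12, E13 or E14 under the assumed numbering.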
A. Structural Constraints on Forming the Adder-Tree

Firstly, each input term should be scheduled on exactly one edge position. For each input term $T_i$, we have

$$\sum_{t_{i.j} \in \mathit{TSet}_i} t_{i.j} = 1 \tag{1}$$

We group all the binary variables $t_{i.j}$ associated with a particular edge position $E_j$ into the set $\mathit{ETSet}_j = \{t_{i.j} \mid t_{i.j} \in \mathit{TSet}_i \text{ and } t_{i.j} \text{ is associated with } E_j\}$. The next constraint states that at most one input term is scheduled on $E_j$. For each edge position $E_j$, we have

$$\mathit{sumETSet}_j = \sum_{t_{i.j} \in \mathit{ETSet}_j} t_{i.j} \le 1 \tag{2}$$

The next constraint ensures that no two input terms are assigned to dependent edge positions. On the binary tree, $E_{j2}$ is dependent on $E_{j1}$ if there exists a directed path from $E_{j1}$ to $E_{j2}$. For example, in Fig. 5, $E_{12}$ and $E_8$ are both dependent on $E_0$, while $E_{12}$ is dependent on $E_8$. For each dependent pair of edge positions $E_{j1}$ and $E_{j2}$ on the binary tree, we have

$$\mathit{sumETSet}_{j1} + \mathit{sumETSet}_{j2} \le 1 \tag{3}$$
Lastly, we express that the number of ADD/SUB operators used for the summation of input terms totals $N - 1$. However, it is easier to count the NULL operators on the binary tree. A NULL operator is an empty operator that ends up being assigned neither an ADD nor a SUB on the binary tree. It should be clear that if $E_j$ is assigned any input term, all the operators that precede it on the binary tree must be NULLs. The number of operators nullified by an edge $E_j$ on the $l$-th level is a constant, $\mathit{NumNull}(E_j) = 2^l - 1$. The total number of operator positions on the binary tree equals the number of ADDs/SUBs plus the number of NULL operators. As a result, the number of NULL operators is $(2^L - 1) - (N - 1) = 2^L - N$. This is expressed as follows:

$$\sum_{E_j} \mathit{NumNull}(E_j) \cdot \mathit{sumETSet}_j = 2^L - N \tag{4}$$

The above constraints suffice to produce correct assignments of the input terms to form adder-trees of various topologies. In order to model the cost of the adder-tree, we further need to model the ADD/SUB type of the operators, the bit widths of the edges, and the cost of the operators.

Fig. 6. Relationship of operator type and the output sign with the signs of input operands.

B. Constraints for the Type of Operators

For a non-NULL operator, its ADD/SUB type is determined by the simple rules outlined in Section II-A and graphically depicted in Fig. 6. On the binary decision tree, the type of each operator is determined by (1) the assignment of input terms, which carry the original edge signs, and (2) the propagation of the signs throughout the network. Consequently, we use two different sets of variables to model the initial condition and the propagation.

For the initial condition, for each edge $E_j$, we define two binary variables $\mathit{isp}_j$ and $\mathit{isn}_j$ to express the sign of $E_j$ due to input term assignment. $\mathit{isp}_j$ is 1 if the input term assigned to $E_j$ carries a $+$ sign, $\mathit{isn}_j$ is 1 if it carries a $-$ sign, and both are 0 if no input term is assigned to $E_j$. For each $E_j$, we have

$$\mathit{isp}_j = \sum_{t_{i.j} \in \mathit{ETSet}_j,\ \mathit{Sign}(T_i) = +} t_{i.j}, \qquad \mathit{isn}_j = \sum_{t_{i.j} \in \mathit{ETSet}_j,\ \mathit{Sign}(T_i) = -} t_{i.j} \tag{5}$$

For sign propagation, for each edge $E_j$, we define another pair of binary variables $\mathit{sp}_j$ and $\mathit{sn}_j$ for the resultant sign of $E_j$. $\mathit{sp}_j$ is 1 if $E_j$ carries a $+$ sign, and $\mathit{sn}_j$ is 1 if $E_j$ carries a $-$ sign. If both are 0, $E_j$ is the output of a NULL operator. The sign propagates from the uppermost layer downwards. For each $E_j \in$ 0th layer, we have

$$\mathit{sp}_j = \mathit{isp}_j, \qquad \mathit{sn}_j = \mathit{isn}_j \tag{6}$$

For each $E_j$ from the 1st layer downwards, supposing $E_{jl}$ is the left input edge to the operator that produces $E_j$, we have

$$\mathit{sp}_j = \mathit{isp}_j + \mathit{sp}_{jl}, \qquad \mathit{sn}_j = \mathit{isn}_j + \mathit{sn}_{jl} \tag{7}$$

Fig. 7. (a) Absolute shifts propagation. (b) Actual shifts.
Now we are ready to formulate the type of operators using the signs on the edges. We define three binary variables $\mathit{ka}_j$, $\mathit{ks}_j$ and $\mathit{kn}_j$ for each $E_j \notin$ 0th layer to represent the kind of the operator that produces $E_j$. $\mathit{ka}_j$ is 1 if the operator is an ADD; $\mathit{ks}_j$ is 1 if the operator is a SUB. If both are 0, the operator is a NULL, in which case $\mathit{kn}_j$ is 1. Suppose the operator's two input edges are $E_{jl}$ and $E_{jr}$, respectively; then we have

$$\begin{aligned}
\mathit{ka}_j &= (\mathit{sp}_{jl} \wedge \mathit{sp}_{jr}) \vee (\mathit{sn}_{jl} \wedge \mathit{sn}_{jr}) \\
\mathit{ks}_j &= (\mathit{sp}_{jl} \wedge \mathit{sn}_{jr}) \vee (\mathit{sn}_{jl} \wedge \mathit{sp}_{jr}) \\
\mathit{kn}_j &= 1 - \mathit{ka}_j - \mathit{ks}_j
\end{aligned} \tag{8}$$

[Linearization] Neither the binary $\wedge$ nor the binary $\vee$ operation is linear. For a general expression $c = a \wedge b$, where $a$, $b$ and $c$ are binaries, extra linear constraints are added to linearize it as follows:

$$2c - a - b \le 0; \qquad c - a - b + 1 \ge 0$$

The binary $\vee$ operator can be replaced with $+$ in our case, since it is not possible for both of its sides to be 1 at the same time.
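The operator-kind rules and the AND linearization are small enough to check exhaustively. The sketch below evaluates the kind of an operator from the sign bits of its two input edges, and tests whether a binary triple satisfies the two linear inequalities given for $c = a \wedge b$. The function names are illustrative and not part of the formulation.

```python
def operator_kind(sp_l, sn_l, sp_r, sn_r):
    """Return (ka, ks, kn) for the operator fed by a left and a right edge.

    Each argument is a 0/1 sign indicator: sp_* for '+', sn_* for '-'.
    """
    ka = (sp_l & sp_r) | (sn_l & sn_r)   # equal signs: ADD
    ks = (sp_l & sn_r) | (sn_l & sp_r)   # opposite signs: SUB
    kn = 1 - ka - ks                     # neither: NULL
    return ka, ks, kn


def and_is_linearized(a, b, c):
    """True if binary (a, b, c) satisfies both linear AND constraints."""
    return 2 * c - a - b <= 0 and c - a - b + 1 >= 0


# Over all binary triples, the constraints hold exactly when c equals a AND b.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert and_is_linearized(a, b, c) == (c == (a & b))
```

Enumerating all eight binary triples confirms that the two inequalities admit exactly the assignments with $c = a \wedge b$.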
C. Constraints for Operator Cost

To model the cost of adder-tree operators, we further require the bit widths of the operands (edges). As an edge carries the signal of a multiplication between the input signal $X$ and the partial coefficient value of the term (or intermediate term) assembled on the adder-tree up to this point, the bit width of an edge is the sum of two parts: the base bit width imposed by $X$, which is a constant, and the bit width of the value of the term. Formulating the term values in turn requires knowledge of the shift values on the edges. In this section, we model the shifts, values and bit widths of the edges, and finally the cost of operators on the adder-tree.

1) Edge shifts: We model the shifts on the edges in a two-step process, modeling the absolute shifts first, followed by the actual shifts. The absolute shift of a term (edge) within a coefficient is the shift of the LSB of the term relative to the LSB of the coefficient. Fig. 7(a) gives an example of the absolute shift of each edge of a specific adder-tree of the coefficient. The absolute shift of an output edge simply takes the minimum of those of its two input edges. Given that the absolute shifts of all input terms are known constants for a coefficient, the absolute shifts of intermediate terms on its possible adder-trees can be obtained by a top-down propagation process.
Absolute shifts are modeled in a similar way to signs. On each edge $E_j$, we first define an integer variable $\mathit{iash}_j$ for the initial absolute shift imposed by input terms. Letting $\mathit{AbsSh}(T_i)$ denote the constant absolute shift value of input term $T_i$, we have

$$\mathit{iash}_j = \sum_{t_{i.j} \in \mathit{ETSet}_j} \mathit{AbsSh}(T_i) \cdot t_{i.j} \tag{9}$$

Then we define an integer variable $\mathit{ash}_j$ on each edge $E_j$ for the absolute shift value. The top-down propagation is modeled as follows. For each $E_j \in$ 0th layer, we have

$$\mathit{ash}_j = \mathit{iash}_j \tag{10}$$

For each $E_j \notin$ 0th layer, we have

$$\mathit{ash}_j = \min(\mathit{ash}_{jl}, \mathit{ash}_{jr}) + \mathit{iash}_j \tag{11}$$

Given the absolute shifts, we are ready to express the actual shifts. For a pair of input edges to the same operator, their common absolute shift is trimmed off and taken care of by the output edge of the operator (see Fig. 7(b)). For each edge $E_j \notin$ the last layer, letting $E_{jo}$ be its output edge, the actual shift integer variable of $E_j$, denoted $\mathit{sh}_j$, is expressed as

$$\mathit{sh}_j = \max(\mathit{ash}_j - \mathit{ash}_{jo}, 0) \tag{12}$$

The max function with 0 ensures that the resulting value for a NULL operator is 0. For the last-layer edge $E_j$, i.e., the output edge of the adder-tree, we simply have

$$\mathit{sh}_j = \mathit{ash}_j \tag{13}$$

[Linearization] Note that both the min and max functions are non-linear. The general form $s = \min(a, b)$ can be linearized by introducing an additional binary variable $z$ and the following linear constraints:

$$s \le a; \qquad s \le b; \qquad s + M \cdot z \ge a; \qquad s + M \cdot (1 - z) \ge b$$

where $M$ is a big positive constant greater than the possible difference between $a$ and $b$. Similarly, $s = \max(a, b)$ can be linearized with

$$s \ge a; \qquad s \ge b; \qquad s \le a + M \cdot z; \qquad s \le b + M \cdot (1 - z)$$

2) Edge values: Computing the values of edges is also a top-down propagation process. For each edge $E_j$, we define an integer variable $\mathit{iv}_j$ for the initial value imposed by input terms. Letting $\mathit{Value}(T_i)$ be the constant value of input term $T_i$, we have

$$\mathit{iv}_j = \sum_{t_{i.j} \in \mathit{ETSet}_j} \mathit{Value}(T_i) \cdot t_{i.j} \tag{14}$$

Let the integer variable $v_j$ denote the value of edge $E_j$. For each $E_j \in$ 0th layer, we have

$$v_j = \mathit{iv}_j \tag{15}$$

For each $E_j \notin$ 0th layer, we have

$$v_j = \begin{cases}
\mathit{iv}_j & \text{if } \mathit{kn}_j \text{ is } 1 \\
(v_{jl} \ll \mathit{sh}_{jl}) + (v_{jr} \ll \mathit{sh}_{jr}) & \text{if } \mathit{ka}_j \text{ is } 1 \\
(v_{jl} \ll \mathit{sh}_{jl}) - (v_{jr} \ll \mathit{sh}_{jr}) & \text{if } \mathit{ks}_j \text{ is } 1
\end{cases} \tag{16}$$

[Linearization] The conditional equalities above should be linearized. Since exactly one among $\mathit{ka}_j$, $\mathit{ks}_j$ and $\mathit{kn}_j$ is 1 at any time, each of these conditional equations can be treated in the following form:

$$s = \begin{cases}
a & \text{if } z \text{ is } 1 \\
\text{unrestricted} & \text{otherwise } (z \text{ is } 0)
\end{cases}$$

which can be expressed with the following linear constraints

$$a - M \cdot (1 - z) \le s \le a + M \cdot (1 - z)$$

where $M$ is a sufficiently big positive constant to ensure that "unrestricted" is not violated under the upper/lower bound values of $a$.
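Before linearization, the shift and value propagation of (11), (12), (13) and (16) is ordinary integer arithmetic, which the following sketch traces for one small tree. The input terms (values 15, 3 and 1 with absolute shifts 11, 6 and 1) are assumptions chosen so that the intermediate values 97 and 15457 and the coefficient 30914 annotated in Fig. 4 are reproduced; the function name `combine` and the tuple layout are illustrative.

```python
def combine(left, right, op):
    """Combine two edges, each given as (value, absolute_shift).

    Returns the output edge (value, absolute_shift) and the actual
    shifts (sh_left, sh_right) applied to the two inputs.
    """
    (v_l, ash_l), (v_r, ash_r) = left, right
    ash_out = min(ash_l, ash_r)        # output keeps the common shift, cf. (11)
    sh_l = max(ash_l - ash_out, 0)     # actual shifts, cf. (12)
    sh_r = max(ash_r - ash_out, 0)
    right_part = v_r << sh_r
    if op == "SUB":
        right_part = -right_part
    return ((v_l << sh_l) + right_part, ash_out), (sh_l, sh_r)   # value, cf. (16)


# Hypothetical input terms (value, absolute shift) for the coefficient 30914.
t15, t3, t1 = (15, 11), (3, 6), (1, 1)

n97, shifts_97 = combine(t3, t1, "ADD")        # (3 << 5) + 1
root, shifts_root = combine(t15, n97, "ADD")   # (15 << 10) + 97
coefficient = root[0] << root[1]               # output shift equals its absolute shift, cf. (13)
```

In this trace the common absolute shift of 1 is carried down to the output edge, so only the relative shifts 5 and 10 appear inside the tree.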
Furthermore, we need to linearize the $\ll$ shift operation. Let us consider the general form $s = v_j \ll \mathit{sh}_j$, where $\mathit{sh}_j \in \{0, 1, \ldots, S\}$ and $S$ is the largest shift value $\mathit{sh}_j$ may take. We create a set of additional binary variables $\{z_0, \ldots, z_S\}$ such that $z_k$ is 1 when $\mathit{sh}_j$ equals $k$, that is

$$\sum_{k=0}^{S} z_k = 1, \qquad \sum_{k=0}^{S} k \cdot z_k = \mathit{sh}_j$$

Now $s = v_j \ll \mathit{sh}_j$ can be expressed as

$$s = \sum_{k=0}^{S} 2^k \cdot (z_k \cdot v_j)$$

which is still not linear, since both $z_k$ and $v_j$ are variables. Let their product be $c_k = z_k \cdot v_j$; this can be expressed with the following linear constraints

$$-M \cdot z_k \le c_k \le M \cdot z_k$$
$$v_j - M \cdot (1 - z_k) \le c_k \le v_j + M \cdot (1 - z_k)$$

where $M$ is a positive constant greater than the upper bound of $|v_j|$.
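A brute-force check over a small range shows how the big-M inequalities pin the product $c_k$ to $z_k \cdot v_j$, and how the one-hot encoding of the shift then reproduces $v_j \ll \mathit{sh}_j$. This is only a sanity check over illustrative bounds ($|v_j| \le 7$, $M = 8$, shifts up to 4), not a solver model, and the helper names are hypothetical.

```python
def feasible_products(v, z, big_m):
    """All integers c in [-big_m, big_m] that satisfy the four big-M inequalities."""
    return [
        c
        for c in range(-big_m, big_m + 1)
        if -big_m * z <= c <= big_m * z
        and v - big_m * (1 - z) <= c <= v + big_m * (1 - z)
    ]


def shift_via_one_hot(v, sh, max_shift):
    """Evaluate v << sh through one-hot selectors z_0 .. z_S."""
    z = [1 if k == sh else 0 for k in range(max_shift + 1)]
    # The selectors satisfy sum(z_k) = 1 and sum(k * z_k) = sh.
    assert sum(z) == 1 and sum(k * zk for k, zk in enumerate(z)) == sh
    return sum((2 ** k) * (zk * v) for k, zk in enumerate(z))


for v in range(-7, 8):
    for z in (0, 1):
        assert feasible_products(v, z, big_m=8) == [z * v]
    for sh in range(5):
        assert shift_via_one_hot(v, sh, max_shift=4) == v << sh
```

The choice of $M$ matters: it must exceed the largest $|v_j|$, otherwise the second pair of inequalities would cut off valid values when $z_k = 0$.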
3) Edge bit width: The bit width of an edge is the sum of the base bit width constant $B$ of the input signal $X$ and the bit width contributed by the value of the edge. Supposing $\mathit{bits}(v)$ computes the bit width of a non-negative value $v$, for the bit width $\mathit{bw}_j$ of $E_j$ we have

$$\mathit{bw}_j = \mathit{bits}(\max(|v_j| - 1, 0)) + B \tag{17}$$

The $-1$ in the formula covers the case where $v_j$ has the value $2^k$, for which the bit width imposed by the value should be $\mathit{bits}(2^k - 1)$ (e.g., when $v_j$ is 1, the signal is $X$ itself, so no additional bit is required on top of $B$). The special case $v_j = 0$ indicates that $E_j$ is an input to a NULL operator. In this case, we do not care what value $\mathit{bw}_j$ takes, because it will not be counted in the later resource accumulation; consequently, we need not separate this case in the formula.
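Evaluated directly, the bit width rule (17) is a one-liner. The sketch below uses Python's built-in `int.bit_length` for the bits function; the function name `edge_bit_width` is illustrative. The sample values 3, 15, 97, 15361 and 15457 are the node values annotated in Fig. 4, where they are labelled with 2, 4, 7, 14 and 14 additional bits.

```python
def edge_bit_width(value, base_bits):
    """Bit width of an edge carrying value * X, where X has base_bits bits."""
    return max(abs(value) - 1, 0).bit_length() + base_bits


# Additional bits on top of the base width for the node values of Fig. 4.
extra_bits = {v: edge_bit_width(v, 0) for v in (1, 3, 15, 97, 15361, 15457)}
```

With an 8-bit input signal, for instance, the edge carrying $15 X$ is 12 bits wide.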
[Linearization] The absolute value function and the bits function should be linearized. For the general form $s = |a|$, we define an extra binary variable $z$ such that it equals 0 when $a$ is not positive, and 1 otherwise. That is

$$-a - M \cdot (1 - z) < 0; \qquad -a + M \cdot z \ge 0$$

where $M$ is a positive constant greater than the upper bound of $|a|$. Given $z$, $s = |a|$ can be expressed as

$$-a - M \cdot z \le s \le -a + M \cdot z$$
$$a - M \cdot (1 - z) \le s \le a + M \cdot (1 - z)$$

where the $M$ here should be a positive constant greater than the upper bound of $2|a|$.
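The absolute value linearization can be checked by enumeration. The sketch below reads the sign-indicator inequalities as $-a - M(1 - z) < 0$ and $-a + M z \ge 0$, which is a reconstruction that assumes the minus signs were lost in typesetting; under this reading $z$ is 1 exactly when $a$ is positive. The bounds ($|a| \le 5$, with $M = 6$ for the indicator and $M = 11$ for the value constraints) and the function name are illustrative.

```python
def abs_feasible(a, m_sign, m_abs):
    """All (z, s) pairs allowed by the indicator and |a| constraints."""
    pairs = []
    for z in (0, 1):
        # Sign indicator: z = 1 forces a > 0, z = 0 forces a <= 0.
        if not (-a - m_sign * (1 - z) < 0 and -a + m_sign * z >= 0):
            continue
        for s in range(-2 * m_abs, 2 * m_abs + 1):
            if (-a - m_abs * z <= s <= -a + m_abs * z
                    and a - m_abs * (1 - z) <= s <= a + m_abs * (1 - z)):
                pairs.append((z, s))
    return pairs


for a in range(-5, 6):
    assert abs_feasible(a, m_sign=6, m_abs=11) == [(1 if a > 0 else 0, abs(a))]
```

For every value in the range, exactly one $(z, s)$ pair survives and its $s$ equals $|a|$.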
We further linearize the bits function in the general form $s = \mathit{bits}(a)$, where $a$ is non-negative. Supposing $a$ does not exceed $S$ bits, we create a set of binary variables $\{z_0, \ldots, z_{S-1}\}$ to represent the bits of $a$ from LSB to MSB, such that

$$a = \sum_{k=0}^{S-1} 2^k \cdot z_k$$

We create another set of binary variables $\{p_0, \ldots, p_{S-1}\}$ such that if $z_k$ is the most significant bit of value 1 among all bits, then $p_k$ is 1, which in turn indicates that the bit width of $a$ is $k + 1$. If $z_k$ is the most significant bit of value 1, the bits more significant than it are all 0. For each $p_k$, we have

$$p_k = \begin{cases}
1 & \text{if } (1 - z_k), z_{k+1}, \ldots, z_{S-1} \text{ are all } 0 \\
0 & \text{otherwise}
\end{cases}$$

which can be expressed with the following linear constraints

$$(1 - z_k) + z_{k+1} + \ldots + z_{S-1} - M \cdot (1 - p_k) \le 0$$
$$(1 - z_k) + z_{k+1} + \ldots + z_{S-1} + p_k > 0$$

where $M$ is a positive constant greater than the number of $z$ variables involved in the inequalities. Finally, $s = \mathit{bits}(a)$ is expressed as

$$s = \sum_{k=0}^{S-1} (k + 1) \cdot p_k$$

4) Cost of adder-tree operators: Let $F$ and $I$ be the constant costs of a 1-bit full adder and an inverter, respectively. The cost $c_j$ of each operator that produces edge $E_j$ is described as follows, according to the cost model described in Section II-B:

$$c_j = \begin{cases}
0 & \text{if } \mathit{kn}_j \text{ is } 1 \\
F \cdot [\mathit{bw}_j - (\mathit{sh}_{jl} + \mathit{sh}_{jr})] & \text{if } \mathit{ka}_j \text{ is } 1 \\
F \cdot [\mathit{bw}_j - (s_{jl} + \mathit{sh}_{jr})] + I \cdot (\mathit{bw}_{jr} - s_{jl}) & \text{if } \mathit{ks}_j \text{ is } 1
\end{cases} \tag{18}$$

where $s_{jl}$ is a binary variable indicating whether there is a shift on the left input edge $E_{jl}$:

$$s_{jl} = \begin{cases}
1 & \text{if } \mathit{sh}_{jl} > 0 \\
0 & \text{otherwise}
\end{cases} \tag{19}$$

[Linearization] The conditional equalities can be linearized using the same technique discussed in the edge value modeling section. Constraint (19) is expressed linearly as

$$\mathit{sh}_{jl} - M \cdot s_{jl} \le 0; \qquad \mathit{sh}_{jl} - s_{jl} \ge 0$$

where $M$ is a constant greater than the upper bound of $\mathit{sh}_{jl}$.

5) Cost of structural operators: For the sake of clarity, the structural operator(s) of the adder-tree are not modeled on the binary tree. In linear-phase FIR filters, which are used in most cases, a coefficient is accumulated twice at symmetric tap positions, and thus corresponds to two structural operators. In a general FIR filter, an adder-tree corresponds to one structural operator. For a structural operator, the adder-tree output edge is always its right-side operand, while the left input and output edges are on the accumulation line, and their bit widths can be predetermined according to the upper/lower bounds of the accumulated values up to this coefficient tap. The left input edge is always $+$ signed and carries 0 shift. Let $E_s$ denote the structural operator output and $E_{sr}$ its right-hand-side input edge (i.e., the final output edge of the adder-tree). The cost of the structural operator $c_s$ is modeled as follows:

$$c_s = \begin{cases}
F \cdot [\mathit{Bits}(E_s) - \mathit{sh}_{sr}] & \text{if } \mathit{sp}_{sr} \text{ is } 1 \\
F \cdot [\mathit{Bits}(E_s) - \mathit{sh}_{sr}] + I \cdot \mathit{bw}_{sr} & \text{if } \mathit{sn}_{sr} \text{ is } 1
\end{cases} \tag{20}$$

where $\mathit{Bits}(E_s)$ is the bit width constant of the output edge.

D. Objective Function

The objective function, which minimizes the resource cost of all operators on the adder-tree and its structural operator(s), is

$$\text{minimize:} \quad \sum_{E_j \notin \text{0th layer}} c_j + \sum_{\text{structural ops}} c_s \tag{21}$$
IV. RESOURCE MINIMIZATION FOR THE ENTIRE FIR FILTER

In this section, we describe the top-level algorithm that optimizes the resource consumption of the entire FIR filter, based on the minimum resource of each individual coefficient returned by solving the MIP formulations discussed previously. The top-level algorithm is based on the observation that logic depth relaxation on a coefficient may result in an adder-tree of less resource. Logic depth relaxation is applicable, without affecting the filter performance, to all the coefficients whose logic depths are smaller than the logic depth of the filter.

We first compute the minimum logic depth required by a coefficient adder-tree, and then apply logic depth relaxation to improve the minimum resource when applicable.
A. Minimum Logic Depth of a Coefficient Adder-Tree

Firstly, the total number of ADD/SUB operators needed to sum up $N$ input terms is $N - 1$. Based on the binary tree model of the MIP formulation, an input term $T_i$ with latency/logic depth $D_i$ nullifies at least $2^{D_i} - 1$ predecessor operator positions on the binary tree. As a result, the minimum binary tree required to accommodate the input terms should consist of at least $N - 1 + \sum_{i=0}^{N-1} (2^{D_i} - 1)$ operator positions. On the other hand, a binary tree of depth $L$ contains $2^L - 1$ operators. Thus we have $2^L - 1 \ge N - 1 + \sum_{i=0}^{N-1} (2^{D_i} - 1)$, which simplifies to $2^L \ge \sum_{i=0}^{N-1} 2^{D_i}$. The minimum depth required by the coefficient is therefore given by

$$L \ge \left\lceil \log_2 \sum_{i=0}^{N-1} 2^{D_i} \right\rceil \tag{22}$$

The logic depth of the entire FIR filter is the largest among all its coefficients' logic depths.
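The bound in (22) can be evaluated exactly with integer arithmetic, avoiding floating-point logarithms. The sketch below uses the identity that, for a positive integer $n$, $\lceil \log_2 n \rceil$ equals the bit length of $n - 1$; the function names are illustrative.

```python
def min_logic_depth(delays):
    """Smallest L with 2**L >= sum(2**D_i), for a non-empty list of delays D_i."""
    total = sum(1 << d for d in delays)
    return (total - 1).bit_length()   # ceil(log2(total)) for total >= 1


def filter_logic_depth(per_coefficient_delays):
    """Logic depth of the filter: the largest of its coefficients' minimum depths."""
    return max(min_logic_depth(d) for d in per_coefficient_delays)
```

For example, four terms of delay 0 need depth 2, while terms with delays 0, 1, 2 and 3 need depth 4.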
B. Resource Minimization Algorithm of FIR Filter

TABLE I
AREA AND POWER CONSUMPTION IMPROVEMENT ON MCM BLOCK

Filter  Logic   Cell Area (sq. um)          Power Consumption (mW)
Order   Depth   Conv.    MIP      Improv.   Conv.    MIP      Improv.
16      3       15764    13393    15.04%    0.280    0.247    11.57%
24      3       23065    21003     8.94%    0.425    0.391     7.97%
32      3       28897    26909     6.88%    0.531    0.511     3.93%
40      3       36400    33281     8.57%    0.672    0.636     5.39%
48      4       41271    38399     6.96%    0.785    0.742     5.44%
56      4       49642    46837     5.65%    0.927    0.902     2.65%
64      4       57026    52925     7.19%    1.064    1.013     4.80%
Avg                                8.46%                       5.96%
Algorithm 1: Resource minimization of entire FIR filter.

1  compute the logic depth of the filter L_filter;
2  for each non-zero coefficient C_i do
3      compute C_i's minimum logic depth L_Ci;
4      repeat
5          rsrc := MipSched(C_i, L_Ci);
6          if rsrc is not improved over the previous iteration then
7              use the previous adder-tree schedule and break;
8          L_Ci := L_Ci + 1;
9      until L_Ci > L_filter;

The algorithm for resource minimization of the entire FIR filter is given in Algorithm 1. For a particular coefficient, the MIP-based resource minimization procedure is first applied to yield an adder-tree schedule at its minimum logic depth (line 5). Further logic depth relaxation is attempted provided that the current depth does not exceed that of the filter (line 8) and the current attempt did improve the minimum resource (line 6). At the end of the algorithm, each coefficient is resource-minimized with its adder-tree schedule. The total resource of the entire filter is minimized without increasing its overall logic depth.
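The control flow of Algorithm 1 can be sketched in a few lines. In the sketch below, `min_depth` and `mip_sched` are hypothetical callables standing in for the minimum-depth computation and the MIP procedure MipSched; the MIP solve itself is not shown, and the returned tuple layout is an assumption made for illustration.

```python
def minimize_filter(coeffs, min_depth, mip_sched):
    """Depth-relaxation loop in the spirit of Algorithm 1.

    coeffs:    coefficients of the filter (at least one must be non-zero)
    min_depth: callable, min_depth(c) -> minimum logic depth of coefficient c
    mip_sched: callable, mip_sched(c, depth) -> (resource, schedule)
    Returns {c: (resource, schedule, depth)} for every non-zero coefficient.
    """
    nonzero = [c for c in coeffs if c != 0]
    filter_depth = max(min_depth(c) for c in nonzero)   # logic depth of the filter
    result = {}
    for c in nonzero:
        depth = min_depth(c)
        best = None
        while depth <= filter_depth:
            rsrc, sched = mip_sched(c, depth)
            if best is not None and rsrc >= best[0]:
                break                    # no improvement: keep the previous schedule
            best = (rsrc, sched, depth)
            depth += 1                   # relax the logic depth by one step
        result[c] = best
    return result
```

Stopping at the first depth that brings no improvement follows the listing above; continuing up to the filter depth regardless would be a different design choice that trades more solver calls for a wider search.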
V. E XPERIMENTAL E VALUATIONS
We performed our experiments on seven linear phase band
pass FIR filters of orders 16, 24, 32, 40, 48, 56 and 64. The
coefficients are quantized as 16-bit signed integers and then
converted to their Canonical Signed Digit (CSD) representations. A 2-bit common subexpression identification algorithm,
which iteratively identifies and eliminates the most frequently
occurring 2-bit horizontal pattern among the coefficients outlined in [11], is developed to produce the input term network
for the adder-trees.
Graph algorithms are developed using the LEMON [12] C++
graph library, which includes well-defined interfaces for linear
programming and supports several popular solvers. IBM
ILOG's CPLEX 12.3 [13] library is used to solve the MIP
procedures. A small library of generic hardware components
(ADDs, SUBs, registers and negators) handling different
operand cases (signed/unsigned) and shifts is composed to
facilitate parameterizable hardware generation. ADDs and SUBs
in these components are performed directly by the VHDL +
and - operators on operands with the appropriate bit widths and
shifts. At the end of the algorithm, a synthesizable
hardware description of the filter in VHDL is produced. The
filters are then synthesized with Synopsys Design Compiler
using a TSMC 90nm library.
The area and power consumption of the filters' MCM blocks
(inclusive of structural operators) are shown in Table I and
compared against the conventional adder-tree scheduling
algorithm. The average area improvement is 8.46%, and the average
power saving is 5.96%. The reductions tend to be lower as the
filter order increases, because higher-order filters contain a
larger proportion of small-valued coefficients, which use small
adder-trees that can hardly be improved. Among a total of 140
adder-trees optimized in the table, 19 are further improved by
relaxing the logic depth by one more step.
The run time of solving the MIP problem for each coefficient
is generally within a few seconds. In a small number of cases,
a few adder-trees of depth 4 take more than 1 minute to solve.
In these cases, due to the auxiliary variables introduced to
linearize the << and bits functions in the MIP formulation,
the total number of binary/integer variables may approach a
thousand. However, since filter area optimization is a one-shot
process at design time, this run time overhead is still affordable.
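To see why linearizing a variable shift adds variables, consider one common textbook approach, which is an assumption here and not necessarily the authors' formulation. The shift amount is one-hot encoded by binaries z_k, and each product w_k = z_k * x is replaced by big-M constraints. The Python sketch below checks by brute force that these constraints admit exactly one w_k, and that summing 2^k * w_k reproduces x << s. All names and the bound `upper` are illustrative.

```python
from typing import List

def feasible_w(x: int, z: int, upper: int) -> List[int]:
    """Integer w in [0, upper] satisfying the big-M constraints for w = z * x:
         w <= upper * z,   w <= x,   w >= x - upper * (1 - z)
    where 0 <= x <= upper and z is binary."""
    return [w for w in range(upper + 1)
            if w <= upper * z and w <= x and w >= x - upper * (1 - z)]

def shifted_value(x: int, one_hot: List[int], upper: int) -> int:
    """Evaluate sum_k 2^k * w_k, taking each w_k from its linear constraints."""
    total = 0
    for k, z in enumerate(one_hot):
        (w,) = feasible_w(x, z, upper)   # unpacking fails unless w is unique
        total += (1 << k) * w
    return total

UPPER = 15      # assumed bound on x for this toy check
MAX_SHIFT = 3   # candidate shift amounts 0..3

all_match = all(
    shifted_value(x, [int(k == s) for k in range(MAX_SHIFT + 1)], UPPER) == (x << s)
    for x in range(UPPER + 1)
    for s in range(MAX_SHIFT + 1)
)
```

Under this encoding every candidate shift amount contributes one binary variable and one auxiliary integer variable per shifted operand, so the variable count grows quickly with adder-tree depth and operand width.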
VI. CONCLUSION
In this paper, we have identified the resource minimization
problem in the scheduling of adder-tree operations for the FIR
MCM block, and presented an MIP-based algorithm for exact
bit-level resource optimization. Experimental results show that
up to 15% reduction of area and 11.6% reduction of power can
be achieved on top of already optimized ADD/SUB networks
of MCM blocks. Efficient heuristic algorithms for resource
minimization of the adder-trees of FIR filters could be
explored in future work.
REFERENCES
[1] D. R. Bull and D. H. Horrocks, "Primitive operator digital filter," IEE
Proceedings-G, vol. 138, no. 3, pp. 401–412, Jun. 1991.
[2] A. G. Dempster and M. D. Macleod, "Use of minimum-adder multiplier
blocks in FIR digital filters," IEEE Transactions on Circuits and Systems II:
Analog and Digital Signal Processing, vol. 42, no. 9, pp. 569–577, 1995.
[3] S. D. S. M. Mehendale and G. Venkatesh, "Synthesis of multiplierless
FIR filters with minimum number of additions," in Proc. IEEE
International Conference on Computer-Aided Design (ICCAD), 1995.
[4] I. C. Park and H. J. Kang, "Digital filter synthesis based on minimal
signed digit representation," in Proc. Design Automation Conference
(DAC), 2001.
[5] Y. Voronenko and M. Püschel, "Multiplierless multiple constant
multiplication," ACM Transactions on Algorithms, vol. 3, no. 2, 2007.
[6] P. K. Meher and Y. Pan, "MCM-based implementation of block FIR filters
for high-speed and low-power applications," in Proc. IEEE/IFIP 19th
International Conference on VLSI and System-on-Chip (VLSI-SoC),
Oct. 2011, pp. 118–121.
[7] L. Aksoy, C. Lazzari, E. Costa, P. Flores, and J. Monteiro, "Design of
digit-serial FIR filters: Algorithms, architectures, and a CAD tool," IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21,
no. 3, pp. 498–511, Mar. 2013.
[8] M. B. Gately, M. B. Yeary, and C. Y. Tang, "Multiple real-constant
multiplication with improved cost model and greedy and optimal searches," in
Proc. IEEE International Symposium on Circuits and Systems (ISCAS),
May 2012, pp. 588–591.
[9] M. Kumm, P. Zipf, M. Faust, and C.-H. Chang, "Pipelined adder graph
optimization for high speed multiple constant multiplication," in Proc.
IEEE International Symposium on Circuits and Systems (ISCAS),
May 2012, pp. 49–52.
[10] R. Hartley and A. Casavant, "Tree-height minimization in pipelined
architectures," in Proc. IEEE International Conference on
Computer-Aided Design (ICCAD), Nov. 1989.
[11] R. Mahesh and A. Vinod, "A new common subexpression elimination
algorithm for realizing low-complexity higher order digital filters," IEEE
Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol. 27, no. 2, pp. 217–229, 2008.
[12] E. L. University, "LEMON (library for efficient modeling and
optimization in networks)," http://lemon.cs.elte.hu/.
[13] "IBM ILOG CPLEX optimizer," www.ibm.com/software/integration/optimization/cplex-optimizer/.
