
On a Parallel Decimal Multiplier based on Hybrid 8421-5421 BCD Recoding

Ming Zhu, Abu M. Baker, Yingtao Jiang


Department of Electrical and Computer Engineering, University of Nevada, Las Vegas, NV, USA, 89154
{zhum2, bakera2}@unlv.nevada.edu, yingtao.jiang@unlv.edu
Abstract- Parallel decimal multiplication can be broadly divided
into two major steps: partial product generation (PPG) and
partial product accumulation (PPA). Although most operands
in decimal multipliers are represented in the popular 8421 BCD
code, the alternative 4221 and 5211 BCD codes are also sometimes
employed, alone or mixed with 8421 codes, to represent the partial
products, in the hope that doing so will improve the multiplier's
area/time efficiency. However, our study shows that multipliers
based on an unsuitable mixing of codes, such as 4221-8421, can have
worse area and timing performance than 8421 multipliers, owing to
the long carry propagation delay of the 4221 PPA and the extra
4221-8421 conversions required. Consequently, in this paper, we
propose an 8421-5421 BCD multiplier that is equipped with a
simplified PPG based on 8421-5421 recoding and an improved 8421
carry-lookahead adder tree for PPA, and compare it against the two
best known multiplier designs using 90nm TSMC technology. The
synthesis results confirm that the proposed 8421-5421 multiplier
achieves the lowest delay and is the most area-time efficient design.
Keywords: Decimal; Multiplication; Recoding; Architecture.

I. INTRODUCTION
An area-time efficient hardware-based BCD multiplier is
often preferred to much slower software simulations [1] to
meet the precision and performance requirements of financial
computing. In general, an n-digit-by-n-digit BCD multiplication
can be performed as:

$$P = A \times B = \sum_{i=0}^{n-1} A \cdot b_i \cdot 10^{i} = \sum_{i=0}^{n-1} PP_i \cdot 10^{i}, \qquad (1)$$

where $PP_i = A \cdot b_i$ is referred to as the i-th partial product,
given by the multiplicand $A$ and the i-th digit of the multiplier $B$
(i.e., $b_i$).
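
As a rough illustration of (1), the following Python sketch accumulates one partial product per multiplier digit. It is a behavioral model only: the function name `bcd_multiply` and the little-endian digit lists are assumptions made for this example and are not part of the proposed hardware.

```python
def bcd_multiply(a_digits, b_digits):
    """Multiply two decimal numbers given as little-endian digit lists.

    Follows Eq. (1): the i-th partial product is (multiplicand * i-th
    multiplier digit), weighted by 10**i. Hypothetical helper for
    illustration only.
    """
    a_value = sum(d * 10**k for k, d in enumerate(a_digits))
    product = 0
    for i, b_i in enumerate(b_digits):
        partial_product = a_value * b_i      # PPG: one multiple of the multiplicand
        product += partial_product * 10**i   # PPA: shift by i digits and accumulate
    return product


print(bcd_multiply([3, 2, 1], [6, 5, 4]))  # 123 * 456, prints 56088
```

In hardware, the multiples are selected from a few pre-computed values rather than computed per digit, and the accumulation is done by an adder tree instead of a loop.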

Figure 1. A general structure for an n-by-n BCD multiplier


TABLE 1 BCD REPRESENTATIONS

Decimal Value   8421   5421   4221          5211
0               0000   0000   0000          0000
1               0001   0001   0001          0001 | 0010
2               0010   0010   0010 | 0100   0011 | 0100
3               0011   0011   0011 | 0101   0101 | 0110
4               0100   0100   0110 | 1000   0111
5               0101   1000   1001 | 0111   1000
6               0110   1001   1100 | 1010   1010 | 1001
7               0111   1010   1101 | 1011   1100 | 1011
8               1000   1011   1110          1110 | 1101
9               1001   1100   1111          1111

978-1-4799-0066-4/13/$31.00 2013 IEEE

A general parallel BCD multiplier architecture is depicted in Figure 1. It has two major stages: (i) the partial
product generation (PPG) stage, which includes a decoder of $b_i$ that
selects the right multiples of the multiplicand made available by
the pre-computation logic, and (ii) the partial product accumulation (PPA) stage, where all the partial products are
shifted and added together to obtain the final product.
Although operands in decimal multipliers (Figure 1)
are typically represented and computed in the popular 8421 BCD code,
4221, 5211 and 5421 BCD (TABLE 1) can also be
employed, alone or along with 8421 codes, to simplify the
computation logic and thus improve the area/time efficiency over conventional 8421-based BCD multipliers. For
instance, [2] [3] introduce 5421-to-4221 and 8421-to-5211
BCD recoding to speed up the PPG and generate 4221-coded
partial products, so that 4221 carry-save adders (CSAs),
which have simpler logic than 8421 adders, can be applied in
the partial product reduction (PPR). However, as existing
4221 PPRs involve long carry propagations, their overall performance tends to be much worse than that of an 8421 PPR [4] of
the same size. Besides, instead of using a 4221 adder to add
the two 4221-coded results of the 4221 PPR directly, current
practice employs 4221-8421 conversions so that classical
8421 full adders can be used to obtain the final 8421 products;
this conversion increases both the delay and the chip area.
In this paper, we design three 16-by-16 multipliers based
on different combinations of 8421 and 4221 BCD codes. The
first, an 8421 multiplier, follows the architecture
originally reported in [4], but with a significant improvement:
the pre-computations are performed with 8421-5421
recoding logic. We also design a 4221 BCD multiplier, which
includes modified 4221-5211 recoding for the multiplicand pre-computations and novel 4221 carry-lookahead adders (CLAs)
for PPA. The third design is also a 4221 BCD multiplier, but
built upon 32:2 4221 CSA trees and a 4221 CLA for PPA, so
that no 4221-8421 conversions are required. We synthesize
all three multipliers and compare them against the best known
designs in terms of delay and delay-area product.
In what follows, we review previous work in Section
II. The proposed 8421 and 4221 multipliers are described in
Sections III and IV, respectively. Performance results of the
various BCD multiplier designs are reported and analyzed in
Section V. Finally, conclusions are drawn in Section VI.
II. PRIOR WORK
In this section, various design techniques that are applicable to PPGs and PPAs (Figure 1) in decimal multipliers
with different BCD representations will be reviewed.
A. Data Coding and Logic in Partial Product Generation
Existing PPGs decode $b_i$ to select the right multiples of
the multiplicand $A$ available from the pre-computation logic
(Figure 1). Several decoding strategies for $b_i$ have been proposed to simplify the logic of the PPG. In [5] [6], $b_i$ is written as
$b_i = b_i^0 + b_i^1$, where $b_i^0, b_i^1 \in \{0,1,2,4,5\}$, so that a large
partial product can be computed as the summation of two
smaller intermediate partial products (e.g., $7A = 2A + 5A$).
[4] [7] exploit a Radix-5 algorithm, i.e. $b_i = 5 b_i^1 + b_i^0$,
where $b_i^1 \in \{0,1,2\}$ and $b_i^0 \in \{-2,-1,0,1,2\}$, and negative intermediate
partial products are obtained by calculating their 10s
complement (i.e., adding the 9s complement and 1). Similar
to the Radix-5 case, a Radix-4 BCD multiplication decomposes $b_i$ into two numbers, but this time in the form
$b_i = 4 b_i^1 + b_i^0$, and a longer delay is observed when computing $8A$. The Radix-10 recoding adopted in [2] [3] recodes $b_i$
from an integer interval of [6, 9] to [-4, 0] and adds 1 to the
higher significant multiplier digit; such recoding logic can be
time-consuming, and yet the summation of two intermediate partial products is still required to get one partial product, as is the
case in computing $3A = A + 2A$.
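
The digit decompositions reviewed above can be checked with a short Python sketch. For every multiplier digit 0-9 it finds a pair from the set {0, 1, 2, 4, 5} whose sum equals the digit, and a Radix-5 pair with an upper part in {0, 1, 2} and a lower part in {-2, ..., 2}. The function names and the brute-force search are assumptions for illustration; the cited designs implement these recodings as combinational logic.

```python
from itertools import product


def two_term_recode(digit, multiples=(0, 1, 2, 4, 5)):
    """Return (m0, m1) from `multiples` with m0 + m1 == digit, e.g. 7 -> (2, 5)."""
    for m0, m1 in product(multiples, repeat=2):
        if m0 <= m1 and m0 + m1 == digit:
            return m0, m1
    raise ValueError(f"digit {digit} is not a sum of two allowed multiples")


def radix5_recode(digit):
    """Return (upper, lower) with digit == 5 * upper + lower,
    upper in {0, 1, 2} and lower in {-2, ..., 2}."""
    upper = (digit + 2) // 5
    lower = digit - 5 * upper
    return upper, lower


for d in range(10):
    print(d, two_term_recode(d), radix5_recode(d))
```

A negative lower part corresponds to an intermediate partial product that has to be subtracted, which is where the 10s complement comes in.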
Conventionally, the operands of the PPGs (i.e. multiplicands, multipliers and generated partial products) are all
represented in 8421 BCD, as in [4] [5] [6] [7], where 2A and
5A are pre-computed using a scheme suggested in [5] and the
additions for getting the final products can be performed using the schemes described in [4] [8]. A different approach is
adopted in [2] [3], where the generated partial products are coded
in 4221 (TABLE 1). In this approach, (i) 2A is obtained by
first applying an 8421-to-5211 conversion to A, and then left-shifting
the 5211-encoded A by 1 bit to get 2A in 4221; and
(ii) 5A is obtained by first left-shifting A represented in
8421 by 3 bits to get 5A in 5421 and then recoding this
5421-encoded 5A to its 4221 BCD representation. There are
two compelling reasons why 4221 is preferred for representing the
partial products: (i) the 9s complement of a decimal number
represented in 4221 can be obtained by simply inverting each
bit of its digits, and (ii) the 4221 CSAs required for the subsequent PPR have simpler logic than their 8421 counterparts.
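
Both properties of the 4221 code mentioned above follow directly from the bit weights, and can be checked exhaustively with the Python sketch below. It verifies that inverting every bit of any 4221 code of d gives a code of 9 - d (the weights sum to 9), and that shifting a 5211-coded operand left by one bit yields twice the operand when read with 4221 weights. The helper names are assumptions for this example, and the choice of the first matching 5211 code is arbitrary.

```python
from itertools import product

W4221 = (4, 2, 2, 1)   # bit weights, most significant bit first
W5211 = (5, 2, 1, 1)


def value(bits, weights):
    """Decimal value of a 4-bit code under the given weights."""
    return sum(b * w for b, w in zip(bits, weights))


def codes_for(digit, weights):
    """All 4-bit codes whose weighted sum equals `digit`."""
    return [bits for bits in product((0, 1), repeat=4)
            if value(bits, weights) == digit]


def complement_holds():
    """True if bit inversion of every 4221 code of d is a 4221 code of 9 - d."""
    return all(
        value(tuple(1 - b for b in bits), W4221) == 9 - d
        for d in range(10)
        for bits in codes_for(d, W4221)
    )


def double_via_5211(digits):
    """Doubling by a 1-bit left shift: 5211 in, 4221 out.

    `digits` holds the decimal digits of the operand, most significant first.
    """
    bits = [0, 0, 0, 0]                      # extra leading digit for the shifted-out bit
    for d in digits:
        bits.extend(codes_for(d, W5211)[0])  # any valid 5211 code of d would do
    shifted = bits[1:] + [0]                 # left shift by one bit position
    out = 0
    for k in range(0, len(shifted), 4):
        out = out * 10 + value(shifted[k:k + 4], W4221)
    return out


print(complement_holds())
print(double_via_5211([4, 7, 9]))  # operand 479
```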
B. Data Coding and Logic in Partial Product Accumulation
After obtaining the 8421 partial products from the PPGs,
these partial products can be summed up either iteratively [5]
[6], or using parallel 8421 CSAs for the PPR and a multi-digit
8421 adder for obtaining the final product [7]. In [4], the PPA
is performed using 8421 CLAs organized as a tree structure,
which gives the best speed and chip area performance.
In [2] [3], 4221 CSA trees are utilized for PPR. Since no
4221 full adders or CLAs have been considered in the literature, all the known 4221 multipliers use 4221-8421 conversion logic to recode the 4221-encoded PPR results back to
their 8421 representations so that a classical 8421 CLA can
be employed to get the final product [2]. Although a single
4221 CSA has simpler logic than an 8421 full adder, the carry
propagation among the CSA trees and the extra 4221-8421 conversion stage tend to increase the delay and hardware overhead of the 4221 multiplication considerably.
III. PROPOSED 8421 BCD MULTIPLIER
We implement an 8421 BCD multiplier based on the architecture originally proposed in our earlier work [4],
but with optimizations in the pre-computation units of the PPG
(Figure 1) that compute 2A and 5A.
2A represented in 8421 can be obtained by recoding A
from its 8421 code (X) to the 5421 form (Y) and then left-shifting it by 1 bit. The recoding logic from 8421 to 5421 is
expressed in (2). Similarly, if A in 8421 is left-shifted by 3
bits, it becomes 5A in 5421, and we can get 5A in 8421
by applying the 5421-8421 recoding logic given in (3) [9].
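
The two shift-based pre-computations can be modeled behaviorally as follows, with A denoting the multiplicand. This Python sketch is an illustration under our own assumptions: the recoding is done by arithmetic on the digit value rather than by the gate-level logic of (2) and (3), digits are stored least significant first, and all function names are hypothetical.

```python
def enc_5421(d):
    """5421 code of a decimal digit: 0-4 unchanged, 5-9 map to 1000-1100."""
    return d if d < 5 else d + 3


def dec_5421(code):
    """Inverse of enc_5421 for valid 5421 codes."""
    return code if code < 8 else code - 3


def pack(codes):
    """Concatenate 4-bit digit codes (least significant digit first) into one bit vector."""
    vec = 0
    for k, c in enumerate(codes):
        vec |= c << (4 * k)
    return vec


def unpack(vec, n_digits):
    """Split a bit vector into 4-bit digit codes, least significant first."""
    return [(vec >> (4 * k)) & 0xF for k in range(n_digits)]


def to_digits(n, n_digits):
    return [(n // 10**k) % 10 for k in range(n_digits)]


def from_digits(digits):
    return sum(d * 10**k for k, d in enumerate(digits))


def times2(a, n_digits=16):
    """2A: recode 8421 -> 5421, shift left by 1 bit, read the result as 8421."""
    recoded = pack(enc_5421(d) for d in to_digits(a, n_digits))
    return from_digits(unpack(recoded << 1, n_digits + 1))


def times5(a, n_digits=16):
    """5A: shift the 8421 vector left by 3 bits, read it as 5421, recode to 8421."""
    shifted = pack(to_digits(a, n_digits)) << 3
    return from_digits(dec_5421(c) for c in unpack(shifted, n_digits + 1))


print(times2(1234567890123456), times5(1234567890123456))
```

The shifts work because a 1-bit shift maps the 5421 weights (5, 4, 2, 1) onto the 8421 weights of 2A, with the weight-5 bit moving into the next digit, and a 3-bit shift maps the 8421 weights onto the 5421 weights of 5A.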

The Boolean functions in (2) and (3) are simpler than
those in [5], and the shifting incurs no gate delay.
In addition, every literal of a recoding output bit in
(2) and (3) takes a unique input variable, which corresponds to
a smaller fan-in, leading to lower circuit delay and/or area.
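
To give a feel for the size of such recoding logic, the sketch below lists one possible two-level (sum-of-products) realization of the 8421-to-5421 and 5421-to-8421 recodings and checks it against the code table for all ten digits. These expressions were derived independently for this example, treating unused input codes as don't-cares; they are not claimed to match the published forms of (2) and (3).

```python
def recode_8421_to_5421(x3, x2, x1, x0):
    """8421 bits (x3 = MSB) to 5421 bits; inputs 10-15 are don't-cares."""
    n3, n2, n1, n0 = 1 - x3, 1 - x2, 1 - x1, 1 - x0
    y3 = x3 | (x2 & x1) | (x2 & x0)
    y2 = (x3 & x0) | (x2 & n1 & n0)
    y1 = (x3 & n0) | (n2 & x1) | (x1 & x0)
    y0 = (x3 & n0) | (n3 & n2 & x0) | (x2 & x1 & n0)
    return y3, y2, y1, y0


def recode_5421_to_8421(y3, y2, y1, y0):
    """5421 bits (y3 = MSB) to 8421 bits; invalid 5421 codes are don't-cares."""
    n3, n2, n1, n0 = 1 - y3, 1 - y2, 1 - y1, 1 - y0
    x3 = (y3 & y2) | (y3 & y1 & y0)
    x2 = (n3 & y2) | (y3 & n2 & n1) | (y3 & n2 & n0)
    x1 = (n3 & y1) | (y3 & n1 & y0) | (y3 & y1 & n0)
    x0 = y3 ^ y0
    return x3, x2, x1, x0


def bits(code):
    """4-bit integer to a tuple of bits, most significant first."""
    return tuple((code >> k) & 1 for k in (3, 2, 1, 0))


# 5421 code of each digit: 0-4 unchanged, 5-9 map to 1000-1100.
TABLE_5421 = {d: (d if d < 5 else d + 3) for d in range(10)}

for d in range(10):
    assert recode_8421_to_5421(*bits(d)) == bits(TABLE_5421[d])
    assert recode_5421_to_8421(*bits(TABLE_5421[d])) == bits(d)
print("recoding verified for digits 0-9")
```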
[Equations (2) and (3): the 8421-to-5421 and 5421-to-8421 recoding logic. The expressions are not recoverable from this copy.]

IV. 4221 BCD MULTIPLIER
To take full advantage of the simplicity of CSAs and the 9s
complement in 4221, we implement a novel 4221 full adder
so that the 4221-to-8421 conversion required in [2] [3] is
no longer needed in our 4221 multipliers, and all the inputs
and outputs of the multipliers, as well as the internal intermediate results, can be represented directly in 4221. However,
owing to the redundant and discontinuous nature of the 4221 representation, we use only the codes listed in the left sub-column
of the 4221 column in TABLE 1 when there are two representations for one decimal value. This way, we avoid the so-called
many-to-many 5211-to-4221 recoding and the high
complexity of the 4221 full adder logic that would otherwise be incurred.
A. 4221 BCD PPG
The PPG is similar to what is shown in [4]. Since we
force each decimal value to be represented by one unique
4221 BCD code, we now calculate 2A in 4221 directly by (4), and
5A by left-shifting the 4221-encoded A by 3 bits to become
a 5211-encoded 5A and recoding it from 5211 to 4221 codes
(5). Obviously, calculating 2A in 4221 needs more gates
than 8421-5421 recoding does. In addition, the Boolean
expressions for $b_i^0$, $b_i^1$ and OP in 4221, (6)~(8) respectively,
are also more complicated than those in 8421 [4]. Even so,
given the simpler logic for obtaining a number's 9s complement in 4221, the PPG for 4221 may still hold its performance advantage.
For each PPG, after obtaining OP and the two intermediate partial products, the second of which might be negative,
we sum them up using one of two addition schemes, matched to the two
4221 PPAs that follow, to ensure that the operands for the
PPRs are positive. One scheme uses a 4221 CLA to
add the two intermediate partial products and OP to produce
one positive 4221-coded partial product for the following
4221 CLA trees; the other utilizes a multi-digit 4221 CSA [2]
to produce two positive intermediate 4221-coded partial
products for the following PPR based on 32:2 CSA trees.
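
The shift-based computation of five times the multiplicand in 4221 can be illustrated with the Python sketch below. It uses the unique 4221 code per digit from the left sub-column of TABLE 1, shifts the operand left by 3 bits, reads each resulting digit with 5211 weights, and maps it back to the unique 4221 code. The names are assumptions for this example, and the 5211-to-4221 step is modeled as a table lookup rather than the Boolean logic of (5).

```python
# Unique 4221 code per digit (left sub-column of TABLE 1), MSB first.
CODE_4221 = ["0000", "0001", "0010", "0011", "0110",
             "1001", "1100", "1101", "1110", "1111"]
W5211 = (5, 2, 1, 1)


def times5_4221(digits):
    """Five times the operand, as a list of 4221 codes (most significant first).

    `digits` holds the decimal digits of the operand, most significant first.
    """
    bitstring = "0000" + "".join(CODE_4221[d] for d in digits)
    shifted = bitstring[3:] + "000"      # left shift by 3 bits; dropped bits are zero
    out = []
    for k in range(0, len(shifted), 4):
        group = shifted[k:k + 4]
        digit = sum(int(b) * w for b, w in zip(group, W5211))  # read as 5211
        out.append(CODE_4221[digit])                            # recode 5211 -> 4221
    return out


def decode_4221(codes):
    """Decimal value of a list of unique 4221 codes, most significant first."""
    return int("".join(str(CODE_4221.index(c)) for c in codes))


print(decode_4221(times5_4221([7, 3, 8])))  # 5 * 738, prints 3690
```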
B. 4221 BCD Addition and PPA
As in [8], the logic of a 1-digit 4221 full adder can be
expressed as (9); that is, adding two 1-digit inputs, a[3:0]
and b[3:0], and a single-bit carry-in, cin, gives a 1-digit sum
s[3:0], a single-bit carry-out, cout, a single-bit carry-generation
signal, gdigit, and a single-bit carry-propagation signal, pdigit.
Let us define g[i] = a[i] & b[i], p[i] = a[i] | b[i], h[i] = a[i] ^ b[i], for i = 0, 1, 2, 3.
Obviously, the logic of the 4221 full adder is much more
complicated than that of the 8421 full adder in [8]. The 4221 CLAs are built with the
same carry generation/propagation structure as those in [10].
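
The interface of the 1-digit adder described above can be modeled behaviorally as in the following Python sketch. It decodes the 4221 inputs, derives the digit-level generate and propagate signals from the digit sum, and re-encodes the result with the unique codes from the left sub-column of TABLE 1. This is an assumption-laden functional model for illustration: it does not reproduce the gate-level expressions of (9), and the multi-digit routine evaluates the carry recurrence sequentially, whereas a CLA evaluates the same recurrence in parallel.

```python
CODE_4221 = [0b0000, 0b0001, 0b0010, 0b0011, 0b0110,
             0b1001, 0b1100, 0b1101, 0b1110, 0b1111]
DIGIT_OF = {code: d for d, code in enumerate(CODE_4221)}


def digit_adder_4221(a, b, cin):
    """Add two 4221-coded digits; returns (sum_code, cout, g_digit, p_digit)."""
    total = DIGIT_OF[a] + DIGIT_OF[b]
    g_digit = int(total >= 10)          # a carry is generated regardless of cin
    p_digit = int(total >= 9)           # an incoming carry would be passed on
    cout = g_digit | (p_digit & cin)
    sum_code = CODE_4221[(total + cin) % 10]
    return sum_code, cout, g_digit, p_digit


def cla_add_4221(a_codes, b_codes, cin=0):
    """Multi-digit 4221 addition, least significant digit first."""
    carries = [cin]
    for a, b in zip(a_codes, b_codes):
        _, _, g, p = digit_adder_4221(a, b, 0)
        carries.append(g | (p & carries[-1]))   # carry recurrence c[k+1] = g | p & c[k]
    sums = [digit_adder_4221(a, b, c)[0]
            for a, b, c in zip(a_codes, b_codes, carries)]
    return sums, carries[-1]


def enc(n, width):
    """Unique 4221 codes of the decimal digits of n, least significant first."""
    return [CODE_4221[(n // 10**k) % 10] for k in range(width)]


def dec(codes):
    return sum(DIGIT_OF[c] * 10**k for k, c in enumerate(codes))


sums, cout = cla_add_4221(enc(1234, 4), enc(8765, 4))
print(dec(sums), cout)  # 1234 + 8765 = 9999, prints "9999 0"
```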

[Equations (4)-(9): the 4221 logic for 2A, the 5211-to-4221 recoding for 5A, the multiplier-digit recoding signals and OP, and the 1-digit 4221 full adder. The expressions are not recoverable from this copy.]

Figure 2. 32 parallel 4221 32:2 CSA trees for PPR

Figure 3. 32:2 4221 CSA tree for PPR

A PPR with 4221 CSA trees is shown in Figure 2. Partial
products are separated by horizontal grid lines, from top to bottom.
Each partial product includes its first positive intermediate
partial product as a black dot row, and its positive second
intermediate partial product as a grey dot row. Each column
includes a 32:2 4221 CSA tree, as depicted in Figure 3, where
each CSA is a 3:2 4-bit binary CSA and each ×2 block (i.e., a multiplication by 2)
has a carry-in from the CSA tree of the less significant digit
and a carry-out to the more significant one [2]. The ×1 block
ensures the correct 4221 representation for the 4221 full adder.
Each CSA tree adds up the digits of the same weight from the
32 positive intermediate partial products, propagates the
carries, and outputs two 4221 digits with the same weight.
Finally, a 32-digit 4221 CLA adds up the results of the PPR
to get the final 4221 product.
This proposed 4221 PPA architecture is similar to the CSA-based one
used in [2], but with one major distinction: no
4221-8421 recoding logic is needed in our design.
V. EVALUATION AND COMPARISON
In this paper, we implement 16-by-16 decimal multipliers
for 8421 and 4221, and compare them to those in [2] [4],
the two most area-time efficient and highest-performance architectures
known in the open literature. All the designs are
coded in Verilog HDL and synthesized using Synopsys Design
Compiler, with the 90nm technology from TSMC. We
use the product of delay and circuit area as the figure of merit for performance comparisons.
A. Decimal PPG
We first evaluate the logic for generating 2A and 5A, as
listed in TABLE 2. Then we synthesize the respective PPGs
of all three multipliers (TABLE 3).
The results show that 8421-5421 recoding is the most efficient
for 2A and 5A, and that it reduces the 8421 PPG area in [4] by
about 10% for the same delay. Since the 4221
PPG takes advantage of both the simple 9s complement and
multi-digit 4221 CSAs for the two intermediate partial products,
no carry propagation is involved, leading to low delay
and circuit area.
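
The delay-area figure of merit used throughout this section is a simple product, as the Python sketch below shows using the PPG delay and area values reported in TABLE 3. The dictionary layout and the function names are assumptions made for this example.

```python
# (delay in ns, area) per PPG design, copied from TABLE 3.
ppg_results = {
    "8421 [4]": (0.50, 6802.69),
    "8421-5421": (0.50, 6313.00),
    "4221-5211 w/ 4221 CLA": (0.50, 9439.52),
    "4221-5211 w/ 4221 CSAs": (0.30, 4114.35),
}


def delay_area(delay_ns, area):
    """Figure of merit: smaller is better."""
    return delay_ns * area


def relative_saving(baseline, candidate):
    """Fractional reduction of `candidate` with respect to `baseline`."""
    return (baseline - candidate) / baseline


for name, (delay, area) in ppg_results.items():
    print(f"{name:24s} delay x area = {delay_area(delay, area):9.2f}")
```

Calling `relative_saving` on two area values, or on two delay-area products, gives the fractional improvement of one design over another.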


TABLE 2 16-DIGIT 2A AND 5A COMPARISON

Multiple   Multiplier      Delay (ns)   Area (µm²)   Delay × Area
2A         8421 [5]        0.03         622.34       18.67
2A         8421-5421 [2]   0.03         600.47       18.01
2A         4221-5211       0.05         1281.37      60.07
5A         8421 [5]        0.03         1450.01      45.50
5A         5421-8421 [2]   0.03         751.46       22.54
5A         5211-4221       0.03         975.84       29.28

TABLE 3 16-DIGIT PPG COMPARISON

PPG                       Delay (ns)   Area (µm²)   Delay × Area
8421 [4]                  0.50         6802.69      3401.35
8421-5421                 0.50         6313.00      3156.50
4221-5211 w/ 4221 CLA     0.50         9439.52      4719.76
4221-5211 w/ 4221 CSAs    0.30         4114.35      1234.31

TABLE 4 FULL ADDERS COMPARISON

Digits   Full Adder   Delay (ns)   Area (µm²)   Delay × Area
1        8421         0.1          282.24       28.22
1        4221         0.11         823.44       90.58
16       8421         0.30         3389.00      1016.70
16       4221         0.30         6236.09      1870.83

TABLE 5 PPR AND PPA FOR 16-BY-16 DECIMAL MULTIPLICATION

Module                      Delay (ns)   Area (µm²)   Delay × Area
8421 PPA [4]                1.00         63,149       63,149
4221 PPA1 with 4221 CLAs    1.23         98,487       121,139
4221 32:2 CSA               0.73         5,810        3,548
4221 PPR                    1.00         91,990       91,990
4221 PPA2 w/ PPR            1.26         125,280      157,853

B. Decimal Addition and PPA
We first compare the full adders for each BCD form in
TABLE 4, which indicates that 8421 CLAs are more efficient
than the 4221 ones in terms of delay-area product.
Then, we compare the 8421 PPA and the 4221 PPA1, both organized
in the same tree structure as in [4] but with 8421 CLAs
and 4221 CLAs, respectively, and the 4221 PPA2 with 32:2
4221 CSA trees and a 32-digit 4221 full adder. Synthesis
results are presented in TABLE 5. Owing to the simplicity of
8421 CLAs, the 8421 PPA has much lower delay and circuit cost
than the 4221 PPA1. On the other hand, the long carry propagation
among the 4221 CSA trees in the 4221 PPA2 increases the
delay significantly, and the 32-digit 4221 full adder has a
very negative impact on the performance. Overall,
the 8421 PPA is the most area-time efficient.
C. 16-by-16 Decimal Multiplication
Synthesis results for the 8421 and 4221 multipliers are
tabulated in TABLE 6. Owing to the simple logic of the 8421-5421
PPG and the 8421 full adder, our 8421-5421 decimal multiplier
outperforms the one in [4] slightly in delay and by
about 12% in area, and outperforms the 4221 multiplier with
the same tree structure by 16% in delay and 68% in area-time
efficiency. Relying on the high area/time efficiency of
the 8421 full adder and the 4-stage tree structure for PPA, the
8421-5421 BCD multiplier is superior to the 4221 multiplier
with 4221 CSA trees by about 20% in area-time efficiency.
Following the same performance scaling methodology
suggested in [2], our 8421-5421 multiplier outperforms
the radix-10 multiplier in [2] by 42.26% in delay
and 32% in area-time efficiency, and the radix-5 multiplier in
[2] by 27.5% and 44.6%, respectively. Our 4221 multiplier
using 4221 CSAs outperforms the radix-10 multiplier in [2] by 34.87% in
delay and 10% in delay-area product, and the radix-5 multiplier in [2] by
20% and 20.81%, respectively.
All in all, our proposed 8421-5421 decimal
multiplier has the lowest delay and the highest area-time efficiency
among all the BCD multiplier designs compared.

TABLE 6 16-BY-16 DECIMAL MULTIPLICATION

BCD Form                    Delay (ns)   Area (µm²)   Delay × Area
8421 Multiplier [4]         1.49         200,903      299,345
8421-5421 Multiplier        1.46         181,873      265,535
4221 Multiplier1 w/ PPA1    1.70         263,089      447,251
4221 Multiplier2 w/ PPA2    1.55         205,103      317,909

VI. CONCLUSION
In this paper, we presented several decimal multipliers
using different combinations of 8421 and 4221 BCD representations.
We took advantage of 8421-5421 recoding to
improve the pre-computation of 2A and 5A, and thus obtained
an 8421 multiplier that outperforms the previous designs we compared against.
We also designed a 4221 multiplier with a novel 4221 full adder, in which the
4221-8421 recoding required by previous 4221 multipliers
is eliminated entirely. The proposed multipliers are compared
with each other as well as with the best known designs in [2]
[4]. The results demonstrate that our 8421-5421 decimal
multiplier outperforms the existing BCD multipliers considered
in terms of both delay and area-time efficiency.

REFERENCES
[1] M. F. Cowlishaw, "Decimal floating-point: algorism for computers,"
Proc. of 16th IEEE Symposium on Computer Arithmetic, June 2003, pp.
104-111.
[2] A. Vazquez, E. Antelo, and P. Montuschi, "A New Family of High-Performance Parallel Decimal Multipliers," 18th IEEE Symposium on
Computer Arithmetic, Montpellier, France, June 2007, pp. 195-204.

[3] A. Vazquez, E. Antelo, and P. Montuschi, "Improved Design of High-Performance
Parallel Decimal Multipliers," IEEE Transactions on
Computers, vol. 59, no. 5, pp. 679-693, May 2010.
[4] M. Zhu and Y. Jiang, "An Area-Time Efficient Architecture for 16x16
Decimal Multiplications," Information Technology: New Generations,
ITNG 2013 (In Press), Las Vegas, April 2013.
[5] M. A. Erle and M. J. Schulte, "Decimal Multiplication Via Carry-Save
Addition," IEEE International Conference on Application-Specific
Systems, Architectures, and Processors, 2003, pp. 348-358.
[6] M. A. Erle, M. J. Schulte, and E. M. Schwarz, "Decimal Multiplication
with Efficient Partial Product Generation," 17th IEEE Symposium on
Computer Arithmetic, pp. 21-28, IEEE Computer Society, June 2005.
[7] T. Lang and A. Nannarelli, "A Radix-10 Combinational Multiplier,"
Fortieth Asilomar Conference on Signals, Systems, and Computers,
ACSSC '06, Oct. 29 - Nov. 1, 2006, pp. 313-317.
[8] M. S. Schmookler and A. Weinberger, "High Speed Decimal Addition,"
IEEE Transactions on Computers, vol. C-20, no. 8, August 1971.
[9] M. Baesler and T. Teufel, "FPGA Implementation of a Decimal
Floating-Point Accurate Scalar Product Unit with a Parallel Fixed-Point
Multiplier," in International Conference on Reconfigurable Computing
and FPGAs, Cancun, Quintana Roo, Mexico, Dec. 2009, pp. 6-11.
[10] A. Weinberger and J. L. Smith, "A One-Microsecond Adder Using One-Megacycle
Circuitry," IRE Transactions on Electronic Computers, vol.
EC-5, no. 2, pp. 65-73, June 1956.

