
A High-Speed Dual-Phase Processing Pipelined Domino Circuit Design with a Built-in Performance Adjusting Mechanism


Ching-Hwa Cheng, and *Jiun-In Guo
Dept. of Electronic Engineering, Feng Chia University, * Dept. of Electronics Engineering, National Chiao Tung University
100, Wenhwa Road, Seatwen, TaiChung, Taiwan, R.O.C. *1001, Ta-Hsueh Road, HsinChu, Taiwan, R.O.C.

Abstract
A high-speed dual-phase domino circuit design with high
performance and reliable characteristics is proposed. A
cell-based automatic synthesis flow supports the quick
design of high-performance chips. A test chip containing a
dual-phase 64-bit high-speed multiplier with a built-in
performance adjustment mechanism has been successfully
validated using TSMC 0.18 um CMOS technology. The test
chip shows a 2.7X performance improvement compared to
the conventional static CMOS logic design. In addition, a
cell-based synthesizable CAD design flow that takes skew
tolerance into account has been established. A latch-type
domino cell library that alleviates noise, charge sharing,
and crosstalk was also developed to support the proposed
design flow. Finally, a performance adjustment mechanism
is built into the design. This mechanism supports
performance adjustment after chip fabrication, taking clock
skew into account.
Keywords: pipelined domino circuits, performance
adjustment

Several studies have introduced dual-phase operational
dynamic circuits. One example, the No Race (NORA)
dual-phase operational domino circuit [1], is shown in
Figure 1. The NORA circuit, which is mostly used for
designing high-speed circuits, alternately uses PMOS-type
and NMOS-type logic gates. However, the NORA dynamic
circuitry can only be applied to positive logic circuits, and
it requires a full-custom design technique. Without logic
optimization and synthesis, it is difficult to enhance the
design performance or to reduce the area cost of large
circuits.

I. Introduction
The domino logic design offers a smaller circuit area and
higher operating speed than the complementary CMOS
design. In a conventional domino circuit, the single-phase
function evaluation occupies only half of a clock cycle. For
all of the internal gates within the circuit, only rising
signal transitions are permitted; falling transitions are
not. This restricts conventional domino logic to positive
(non-inverting) functions. For the correct operation of all
domino gates, a low-to-high (rising) signal transition must
be maintained at the fan-ins during the evaluation phase.
Table 1 shows that implementing conventional domino logic
with the bubble (inverter) pushing technique greatly
increases the gate count, and hence the circuit area.
Table 1. Gate count comparison of static and domino circuits

ISCAS-85 benchmark   Static (# of gates)   Dynamic (# of gates)   Gate increase ratio
C17                  3                     13                     333.33%
C432                 112                   260                    132.14%
C499                 312                   392                    25.64%
C1355                306                   416                    35.95%
C1908                279                   369                    32.26%
C3540                463                   1085                   134.34%
C6288                1651                  2789                   68.93%
Average              446.57                760.57                 70.32%

978-1-4577-2081-9/12/$26.00 ©2012 IEEE
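The gate increase ratio in Table 1 can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes the ratio is (dynamic - static) / static and that the "Average" row is the ratio of the two column averages, which is what the printed numbers suggest. The row data are transcribed from the table in table order; the function and variable names are made up for this example.

```python
# (static gate count, domino gate count) for the seven benchmark rows of Table 1
ROWS = [(3, 13), (112, 260), (312, 392), (306, 416),
        (279, 369), (463, 1085), (1651, 2789)]


def increase_ratio(static, dynamic):
    """Percentage growth in gate count when moving from static to domino logic."""
    return 100.0 * (dynamic - static) / static


# Per-benchmark ratios, rounded to two decimals as in the table
ratios = [round(increase_ratio(s, d), 2) for s, d in ROWS]

# Column averages and the ratio computed from them
avg_static = sum(s for s, _ in ROWS) / len(ROWS)
avg_dynamic = sum(d for _, d in ROWS) / len(ROWS)
avg_ratio = increase_ratio(avg_static, avg_dynamic)

print(ratios)
print(round(avg_static, 2), round(avg_dynamic, 2), round(avg_ratio, 2))
```

Under these assumptions the per-row ratios match the table, and the averaged ratio comes out within rounding of the printed 70.32%.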

Figure 1. The conventional NORA (ZIPPER) dynamic circuit.
In this paper, a dual-phase DominoLatch design is
proposed. With the proposed design technique, the designed
circuits operate in a dual-phase manner. The conventional
domino and the proposed DominoLatch logic circuits are
compared in Figure 2. The DominoLatch circuit operation is
similar to that of pipelined circuits. The positive and
negative functions of the logic gates use individual clock
signals with opposite phases. Because the DominoLatch
design can use inverting logic gates, it does not share the
limitation of the conventional domino circuit, to which
logic optimization techniques cannot be applied. The
DominoLatch circuitry uses the conventional logic synthesis
technique to optimize performance and area. A generic
static logic circuit, after synthesis flow optimization, can
be kept as it is and mapped directly into a DominoLatch
circuit. Hence, DominoLatch circuits make it easy to
implement large, complex functions. In addition, compared
with the conventional dual-phase dynamic circuitry, the
DominoLatch circuitry achieves high-speed operation with a
lower gate count.

[Figure panels: (a) Conventional domino AND gate, with precharge and evaluate phases; (b) Proposed DominoLatch AND gate, with embedded latch and evaluate and store phases. Labels: Vdd, Mp, Mn, Vx, Vout, Co, In1, In2, Ink, Phi.]

The bi-phase clock is applied to the logic gates within the
DominoLatch circuitry. The positive and inverted functions
of the logic gates use opposite clock phases. Positive-logic
functions use the same clock phase, and the inverted-phase
clock is applied wherever there is an inverting logic gate.
As shown in Figure 4, the na2 gate feeds the positive-logic
an2 gate. The na2 gate and the an2 gate are clocked by phi1
and phi0, respectively. The phi1 clock phase is inverted
with respect to the phi0 clock phase.
[Figure labels: conventional bubble-push technique; proposed DominoLatch technique; gates G1 and G2; dynamic latch; clocks CLK1 and CLK2; nodes N1-, N2+ and N2-; N1- latch output.]

Figure 2. The conventional and proposed DominoLatch circuits


In this paper, a design automation flow for dual-phase
pipelined domino circuits is proposed, covering both the
theory and its practical implementation in a real chip. The
cell-based synthesis design flow produces DominoLatch
circuits that feature skew tolerance, low power consumption,
and high-performance operation.

II. DominoLatch Circuit & Design Automation


An example of an original static logic circuit is shown in
Figure 3(a). The same circuit, after DominoLatch logic
synthesis processing, is shown in Figure 3(b). Dynamic
latches (dl) have been added to the a3 and a4 fan-in pins to
meet the signal synchronization requirement. The
DominoLatch transistor schematic is shown in Figure 3(c).
Gate-na2, gate-inv, gate-dl1, gate-dl3, and gate-dl4 use the
phi1 clock; gate-an2, gate-nr2, and gate-dl2 use the phi0
clock. The added dl is very simple, with only 6 transistors,
and therefore requires less area than a conventional static
latch. The half dynamic latch (hl) is embedded in the
negation logic gate, and this embedded latch has only 3
transistors.
[Figure 3 panels: (a) Original gate circuit; (b) Proposed gate circuit; (c) DominoLatch transistor schematic circuit. Labels: gates na2, an2, nr2, inv, hl, dl1 to dl4; inputs a1 to a4; clocks phi0 and phi1; output.]

Figure 3. The proposed DominoLatch example circuit

[Figure labels: original circuit with gates G1 and G2 and a dynamic latch; CLK1 with N1- evaluate and precharge phases; CLK2 with N2+ precharge and evaluate phases; "Latch keeps data"; "Delayed falling signal"; "Next stage gets the data after the previous stage is stable".]

Figure 4. The basic DominoLatch circuit construction


The automatic design flow for DominoLatch circuitry
consists mainly of the clock phase assignment process and
the circuit retiming process. The clock phase assignment
process traces each data transmission path, following the
inverting functions of the gates, in order to assign each
logic stage an appropriate operational clock phase. A logic
stage is a group of gates that share the same operational
clock phase.
DominoLatch logic gates need to use opposite clock phases
alternately. Two neighboring stages separated by an
inverting function are assigned opposite operational clock
phases. The clock phase assignment procedure traces the
logic gate functions of the circuit in order to assign a
suitable operating clock phase. In the operating diagram
shown in Figure 3, the output of the inverting gate na2
feeds the positive-logic gate an2; the na2 output transition
takes place while an2 is in its precharge phase, which
results in a correct output function.
During the clock phase assignment process, a counting
number is assigned to each logic gate to represent its depth
in the circuit stages. After the logic gates along each
transmission path have been traced and numbered, the gates
are divided into two groups (even and odd numbers) and a
clock signal is assigned to each group. If the fan-in paths
of a logic gate have different counting numbers (input stage
lengths), the largest fan-in counting number is adopted for
that gate.
After the clock phase assignment procedure, the retiming
process maintains synchronous pipeline operation by
inserting additional dynamic latch gates into the data
transmission paths, which ensures that the whole circuit
operates correctly. The retiming process checks whether the
fan-in counting numbers of the current gate differ. For
gates with differing fan-in counting numbers, the retiming
process adds dynamic latches to the data transmission paths
until all fan-in pins of each gate have the same counting
number, in order to maintain synchronous circuit operation.
Figure 3 shows three dynamic latches (ncc) added to the a4
fan-in path for correct pipeline behavior. The dual-phase
operation allows the DominoLatch circuitry to have stable
pipeline behavior. The pipeline operation and timing
diagram of the circuit are shown in Figure 5.
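The clock phase assignment and retiming steps described above can be sketched in a few lines. This is a simplified model, not the authors' tool: it assumes that the counting number advances by one each time a signal passes through an inverting gate, that every inserted dynamic latch also advances it by one, and that even counts map to phi1 and odd counts to phi0. The netlist, the gate instance names, and the set of inverting cell types are hypothetical; the example is not the actual topology of Figure 3.

```python
INVERTING_CELLS = {"na2", "nr2", "inv"}  # assumed inverting cell types


def assign_phases_and_latches(netlist, primary_inputs):
    """netlist maps gate name -> (cell type, list of fan-in names)."""
    stage = {}

    def stage_of(gate):
        # Counting number of a gate: the largest count among its fan-ins
        if gate not in stage:
            _, fanins = netlist[gate]
            stage[gate] = max(arrival(f) for f in fanins)
        return stage[gate]

    def arrival(node):
        # Count seen by the fan-out of a node; inverting gates advance it
        if node in primary_inputs:
            return 0
        cell, _ = netlist[node]
        return stage_of(node) + (1 if cell in INVERTING_CELLS else 0)

    plan = {}
    for gate, (_, fanins) in netlist.items():
        s = stage_of(gate)
        # Retiming: pad each early fan-in with latches up to the gate's count
        latches = {f: s - arrival(f) for f in fanins if s > arrival(f)}
        phase = "phi1" if s % 2 == 0 else "phi0"
        plan[gate] = {"stage": s, "phase": phase, "latches": latches}
    return plan


# Hypothetical four-gate chain using cell names that appear in the text
EXAMPLE = {
    "g_na2": ("na2", ["a1", "a2"]),
    "g_an2": ("an2", ["g_na2", "a3"]),
    "g_nr2": ("nr2", ["g_an2", "a4"]),
    "g_inv": ("inv", ["g_nr2"]),
}

for name, info in assign_phases_and_latches(EXAMPLE, {"a1", "a2", "a3", "a4"}).items():
    print(name, info)
```

In this toy chain the two inverting-to-next-gate boundaries flip the clock phase, and the primary inputs a3 and a4 each receive one latch so that every gate sees fan-ins with equal counting numbers.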

III. Built-in Performance Adjustment and Function Self-Test Mechanism
The phi1 and phi0 evaluation times are important for the
DominoLatch circuit, because process variation influences
the clock skew and the duty cycle pulse widths (from the
ADPLL). To deal with this variation in a reliable high-speed
DominoLatch design, the proposed Duty Cycle Pulse
Generator (DCPG) circuit provides an adjustment mechanism
within the chip that ensures correct functioning at high
speed. The DCPG circuit design is shown in Figure 7. The
DCPG circuit generates a dual clock in which the phi0
falling edges are aligned with the phi1 rising edges. The
evaluation times of the two clock phases do not overlap.
According to FF and SS corner model simulations, the DCPG
adjusts the duty cycle time discrepancy in coarse steps of
55.5 ps and fine steps of 1.3 ps. The DCPG was designed
with delay lines in order to keep its circuitry stable and
tolerant of process variation.
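As a rough illustration of how the two step sizes combine, the sketch below treats the DCPG adjustment as a sum of coarse and fine delay steps, using the 55.5 ps and 1.3 ps figures quoted above. The function names and the way a target delay is split into step counts are assumptions made for this example; the paper does not specify how the DCPG control bits map to coarse and fine steps.

```python
COARSE_STEP_PS = 55.5  # coarse step size quoted in the text
FINE_STEP_PS = 1.3     # fine step size quoted in the text


def dcpg_adjust_ps(coarse_steps, fine_steps):
    """Total duty-cycle adjustment for a number of coarse and fine steps."""
    return coarse_steps * COARSE_STEP_PS + fine_steps * FINE_STEP_PS


def steps_for(target_ps):
    """Hypothetical helper: split a target adjustment into (coarse, fine) steps."""
    coarse = int(target_ps // COARSE_STEP_PS)
    remainder = target_ps - coarse * COARSE_STEP_PS
    fine = round(remainder / FINE_STEP_PS)
    return coarse, fine


coarse, fine = steps_for(120.0)
print(coarse, fine, round(dcpg_adjust_ps(coarse, fine), 1))
```

The coarse steps cover most of the range and the fine steps close the gap, so the achievable adjustment lands within about one fine step of the target.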
[Figure 7 labels: DCPG_IN, DCPG_IN5, DCPG_IN6, IN, MUX, OUT.]

Figure 7. The proposed Duty Cycle Pulse Generator (DCPG)
The DCPG circuit generates a dual clock with aligned phi0
falling edges and phi1 rising edges, so the circuit
performance can be adjusted with this technique. Because
the adjusting mechanism works by changing the evaluation
time of the dual-phase clock, the circuit operation speed
can be adjusted gradually. The functional evaluation of each
stage, together with its data transmission time, needs to be
completed within the assigned duty cycle time of the clock.
The evaluation time is shortened when the duty cycle time
decreases. The DCPG can therefore control the evaluation
times of the odd and even stages to push the DominoLatch
circuit to its highest stable operating speed.
[Figure 6 flow labels, recovered from the figure (Step 1 to Step 7): RTL code; Synopsys pre-synthesis and optimization; ungroup FFs and logic gates; clock phase assignment (assigns a suitable clock phase to each logic gate); adding latches to resolve the inverting function problem; data retiming (synchronization by adding latches); circuit combination; DominoLatch final result Verilog netlist; Synopsys post-synthesis; chip physical design.]

Figure 6. The design automation flow for DominoLatch circuits

[Other figure labels: Clk, phi0, phi1, (Pd+Td) odd stage, (Pd+Td) even stage; DCPG_IN0 to DCPG_IN5, DCPG_OUT, IN, MUX, coarse delay, fine delay.]
With the pipelined operation of the DominoLatch circuit,
the circuit throughput is decided by the largest stage delay
within the circuit. In DominoLatch circuitry, this delay is
that of the worst stage along the path from the data inputs
to the output. The highest operating frequency
(performance) of the circuit therefore depends on the stage
with the longest delay.
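The relation between stage delay and throughput can be sketched as follows. The sketch assumes, as a simplification not stated in the paper, that each clock period is split into two equal, non-overlapping evaluation windows (one for odd stages, one for even stages), so the period must be at least twice the worst stage delay. The stage delays are hypothetical. The latency helper follows the annotation in Figure 10, where a multiplier with 77 levels delivers its first output at the 39th clock cycle (77/2 rounded up).

```python
import math


def min_period_ps(stage_delays_ps):
    """Shortest clock period if each phase window must cover the worst stage."""
    return 2 * max(stage_delays_ps)


def max_frequency_mhz(stage_delays_ps):
    """Highest clock frequency allowed by the slowest stage (1 ps period = 1e6 MHz)."""
    return 1e6 / min_period_ps(stage_delays_ps)


def first_output_latency_cycles(num_levels):
    """Two stages (one per clock phase) complete in every clock cycle."""
    return math.ceil(num_levels / 2)


stage_delays = [400, 500, 450]  # hypothetical (Pd+Td) per stage, in ps
print(min_period_ps(stage_delays), max_frequency_mhz(stage_delays))
print(first_output_latency_cycles(77))
```

Only the slowest stage matters in this model: shortening any other stage leaves the maximum frequency unchanged.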
Since the DominoLatch circuit design can use inverting
logic gates, a generic static circuit design can be mapped
directly to a DominoLatch circuit after the optimization
synthesis flow. Figure 6 shows the proposed automatic
synthesis flow, which helps to optimize the circuit delay
and area cost when designing large, complex circuitry.

Figure 5. The pipeline operation of the DominoLatch circuits

Figure 8. The DCPG progressive adjusting diagrams.


The clock waveforms for the progressive duty cycle
adjustment are shown in Figure 8. Performance adjustment
is achieved with the proposed mechanism. By adjusting the
clock duty cycles to control the evaluation time, the
function evaluation of each stage, including the data
transmission time between two DominoLatch stages, can be
evaluated. Regulating the evaluation time during normal
circuit operation can even help the circuit to tolerate
clock generator problems (e.g. clock skew and jitter) after
chip manufacturing.

IV. DominoLatch Test Chip Design
Figure 9 shows the circuit structure, with a layout view of
the test chip. Block-1 is the system clock generation block.
It includes an ADPLL (All-Digital Phase-Locked Loop), which
generates the system clock, as well as the mechanism for
switching to an external clock input. The DCPG in Block-2 is
used to adapt the clock pulse width. The dual-phase
operation divides the clock into positive and inverted clock
signals, and the clock duty cycle adjustment mechanism
generates clipped clock pulses that allow the designed
circuit to work toward its highest stable speed. Block-3
contains the high-speed DominoLatch multiplier with built-in
self-test and control circuitry.
Figure 10 shows the waveforms for the pulse width
adjustment. Clk2_out is the system clock generated by
Block-1. DCPG_IN is the adjustment control signal for the
DCPG. The edge-clipped phi0 and phi1 signals are sent to
Block-3. The BIST operation quickly locks onto the
appropriate working frequency of Block-3.

[Waveform labels: DCPG initialization, Clk2_out, phi0, phi1, En_count, DCPG_IN, rst, en.]
[Figure 10 waveform labels: LFSR_count, ck1_out, ck2_out, phi0, phi1, count, MISR data compression. Annotation: "The multiplier has 77 levels, so the first output must wait until the 39th clock cycle (77/2)".]

Figure 10. The adjustment mechanism of the test chip

Table 2. Experimental results of the test chip

[Figure 9 block labels: Block 1 (ADPLL; signals INE_CLK, IN_CLK, reset_IN, adpll_Lock, CLK_SEL, DIV_number [9:0]); Block 2 (DCPG, Mechanism (I), Mechanism (II), tii [6:0]); Block 3 (LFSR, count [7:0], TPI [63:0], 32x32-bit high-speed multiplier, ORA, ora_ctl, ora_out [63:0], tpg_go, scan_in).]

Figure 9. The DominoLatch test chip with layout view
Due to the I/O pad limitation, the highest speed of the
DominoLatch circuit cannot easily be measured with external
instruments. The duty cycle adjusting mechanism can
instead be combined with the BIST functionality embedded in
the high-speed DominoLatch design. The DCPG is used to
adjust the evaluation time of the logic gate stages, and
this adjustment mechanism is implemented in cooperation
with the built-in self-test (BIST) technique. The evaluation
times of the odd and even stages can be clipped by the DCPG
adjustment mechanism. Cooperation with the automatic test
pattern generation (ATPG) and multiple-input signature
register (MISR) circuits provides both performance
adjustment and testing functionality.
To evaluate the highest operating frequency, the built-in
function checking technique uses an LFSR to generate
pseudo-random input test patterns and a MISR to compress
the multiplier output responses. The comparison mechanism
examines the output responses of the DominoLatch circuit at
the highest speed after the duty cycle has been adjusted.
The automatic comparison result is a 1 or a 0, indicating
whether or not the DominoLatch circuit functions correctly,
so it is easy to check the circuit function after each duty
cycle adjustment. From the BIST results, the highest
operating clock frequency of the circuit can be determined
by progressively shortening the evaluation time for the odd
and even stages. In this way, the DominoLatch circuit
performance can be verified.
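The search for the highest working speed described above can be sketched as a loop that shortens the evaluation window step by step and re-runs the self-test each time. Everything in this sketch is a toy model under stated assumptions: the 16-bit LFSR, the 32-bit MISR feedback polynomial, the operand widths, and the device model (which returns a product with a flipped least significant bit whenever the window is shorter than a hypothetical critical delay) are illustrative choices, not the circuits on the test chip.

```python
MASK32 = 0xFFFFFFFF


def lfsr16(state):
    """One step of a 16-bit Galois LFSR (feedback mask 0xB400)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state


def misr32(signature, response, poly=0x04C11DB7):
    """Fold one response word into the signature: shift, feedback, XOR."""
    msb = (signature >> 31) & 1
    signature = (signature << 1) & MASK32
    if msb:
        signature ^= poly
    return signature ^ (response & MASK32)


def run_bist(multiply, n_patterns=16, seed=0xACE1):
    """Apply LFSR operand pairs to `multiply` and return the MISR signature."""
    state, signature = seed, 0
    for _ in range(n_patterns):
        a = state
        state = lfsr16(state)
        b = state
        state = lfsr16(state)
        signature = misr32(signature, multiply(a, b))
    return signature


def make_device(window_ps, critical_ps):
    """Hypothetical device: wrong LSB when the evaluation window is too short."""
    def multiply(a, b):
        product = a * b
        return product if window_ps >= critical_ps else product ^ 1
    return multiply


def find_shortest_window(start_ps, step_ps, critical_ps):
    """Shorten the window until the BIST flag drops to 0; return the last pass."""
    golden = run_bist(lambda a, b: a * b)  # reference signature
    window, last_good = start_ps, None
    while window > 0:
        flag = 1 if run_bist(make_device(window, critical_ps)) == golden else 0
        if flag == 0:
            break
        last_good = window
        window -= step_ps
    return last_good


print(find_shortest_window(start_ps=1000, step_ps=50, critical_ps=730))
```

The loop mirrors the 1-or-0 comparison result in the text: each iteration corresponds to one duty cycle adjustment followed by one BIST run, and the last passing window sets the highest usable clock frequency.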

[Figure 10 waveform annotations: "BIST output correct, and the DCPG is enabled to adjust the duty cycle again"; "end compression". Signals: result [63:0], ck2_out, rst, DCPG_IN [6:0].]

Table 2 shows the performance comparison of the static
circuit and the proposed DominoLatch circuit. Column 2
gives the logic gate counts, including the dynamic latches
added for data synchronization. Columns 3 and 4 show,
respectively, the delay time and the power consumption with
the clock frequency fixed at 100 MHz. Columns 5 and 6 show,
respectively, the highest operating frequency of the
DominoLatch circuit and the time needed to finish the
multiplication operations.

V. Conclusion
A dual-phase high-speed domino circuit design technique
with an in-house EDA cell-based automatic synthesis flow is
proposed. The clock phase assignment and retiming processes
maintain correct circuit operation. The clock cycle
adjustment mechanism cooperates with the BIST circuit to
provide high-speed performance adjustment for the chip
after manufacturing. A test chip containing a dual-phase
64-bit high-speed multiplier with the performance
adjustment mechanism has been successfully validated.

References
[1] N. Goncalves and H. De Man, "NORA: A racefree
dynamic CMOS technique for pipelined logic structures",
IEEE J. Solid-State Circuits, vol. 18, pp. 261-266, June
1983.
[2] M. Sjalander and P. Larsson-Edefors, "Multiplication
Acceleration Through Twin Precision", IEEE Trans. on
VLSI Systems, vol. 17, no. 9, 2009.
