A High-Speed Dual-Phase Processing Pipelined Domino Circuit Design With A Built-In Performance Adjusting Mechanism

Ching-Hwa Cheng, and *Jiun-In Guo
Dept. of Electronic Engineering, Feng Chia University, * Dept. of Electronics Engineering, National Chiao Tung University
100, Wenhwa Road, Seatwen, TaiChung, Taiwan, R.O.C. *1001, Ta-Hsueh Road, HsinChu, Taiwan, R.O.C.
Abstract
A high-speed dual-phase domino circuit design with high
performance and reliable characteristics is proposed. A
cell-based automatic synthesis flow supports the quick
design of high-performance chips. A test chip of a
dual-phase 64-bit high-speed multiplier with a built-in
performance adjustment mechanism has been successfully
validated using TSMC 0.18 um CMOS technology. The test
chip shows a 2.7X performance improvement compared to
the conventional static CMOS logic design. In addition, a
cell-based synthesizable CAD design flow that takes clock
skew tolerance into account has been established.
A latch-type domino cell library with noise, charge-sharing,
and crosstalk alleviation abilities was also developed to
support the proposed design flow. Finally, a built-in
performance adjustment mechanism is included in the
design. This mechanism supports performance adjustment
after chip fabrication, taking clock skew into account.
Keywords: pipelined domino circuits, performance
adjustment
Several studies have introduced dual-phase operational
dynamic circuits; for example, the No Race (NORA)
dual-phase domino circuit [1] is shown in Figure 1. The NORA
circuit, which is mainly intended for high-speed designs,
alternates PMOS-type and NMOS-type logic gates. However,
NORA dynamic circuitry can only implement positive logic
circuits, and it requires a full-custom design technique.
Without logic optimization and synthesis, it is difficult to
improve design performance and to reduce the area cost of
large circuits.
I. Introduction
Domino logic offers a smaller circuit area and a higher
operating speed than complementary CMOS design. In a
conventional domino circuit, the single-phase function
evaluation occupies only half of a clock cycle. For all
internal gates, only rising signal transitions are permitted
during evaluation; falling transitions are not. This
restricts conventional domino logic to positive
(non-inverting) functions. For all domino gates to operate
correctly, every fan-in must make only low-to-high (rising)
transitions during the evaluation phase. Inverting functions
are therefore handled with a bubble (inverter) pushing
technique, which, as Table 1 shows, greatly increases the
gate count and hence the circuit area.
Table 1. Gate count comparison of static and domino circuits

ISCAS-85     Static       Dynamic      Gate increase
benchmark    (# gates)    (# gates)    ratio
C17          3            13           333.33%
C432         112          260          132.14%
C499         312          392          25.64%
C1355        306          416          35.95%
C1908        279          369          32.26%
C3540        463          1085         134.34%
C6288        1651         2789         68.93%
Average      446.57       760.57       70.32%

978-1-4577-2081-9/12/$26.00 2012 IEEE
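The ratio column of Table 1 can be reproduced from the two gate-count columns. The short Python sketch below assumes the ratio is defined as (dynamic - static) / static, which matches every row; the function name and list layout are illustrative only.

```python
# Gate counts from Table 1, in row order: (static gates, domino gates).
ROWS = [
    (3, 13),
    (112, 260),
    (312, 392),
    (306, 416),
    (279, 369),
    (463, 1085),
    (1651, 2789),
]

def increase_ratio(static_gates, domino_gates):
    """Relative gate-count increase of the domino version, in percent."""
    return 100.0 * (domino_gates - static_gates) / static_gates

# Per-benchmark ratios, rounded to two decimals as in the table.
ratios = [round(increase_ratio(s, d), 2) for s, d in ROWS]

# Column averages over the seven benchmarks.
avg_static = sum(s for s, _ in ROWS) / len(ROWS)
avg_domino = sum(d for _, d in ROWS) / len(ROWS)

print(ratios)
print(round(avg_static, 2), round(avg_domino, 2))
```

The averages come out at 446.57 and 760.57 gates, and the ratio of those two averages is about 70.3%, in line with the last row of the table.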
Figure 1. The conventional NORA (ZIPPER) dynamic circuit.
In this paper, a dual-phase DominoLatch design is
proposed. With the proposed design technique, the designed
circuits operate in two clock phases. The conventional
domino and the proposed DominoLatch logic circuits are
compared in Figure 2. DominoLatch circuit operation is
similar to that of pipelined circuits. The positive and
negative (inverting) functions of the logic gates use
individual clock signals with opposite phases. Because the
DominoLatch design can use inverting logic gates, it can
apply logic optimization techniques, unlike the conventional
domino circuit. DominoLatch circuitry uses conventional
logic synthesis to optimize performance and area. Generic
static logic circuits, once optimized by the synthesis flow,
can be retained and mapped directly onto DominoLatch
circuits. Hence, DominoLatch circuits make it easy to
implement large, complex functions. In addition, compared
with conventional dual-phase dynamic circuitry, DominoLatch
circuitry achieves high-speed operation with a lower gate
count.
[Figure panels: (a) Conventional domino AND gate; (b) Proposed DominoLatch AND gate. Labels: inputs In1, In2, ..., Ink; clock Phi; transistors Mp and Mn; nodes Vx and Vout; Embedded Latch; precharge, evaluate, and store phases]
The bi-phase clock is applied to the logic gates within the
DominoLatch circuitry. The positive and inverting functions
of the logic gates use opposite clock phases: a positive
logic gate keeps the same clock phase, and the
inverted-phase clock is applied where there is an inverting
logic gate. As shown in Figure 4, the na2 gate drives the
positive-logic an2 gate. The na2 and an2 gates are clocked
by phi1 and phi0, respectively. The phi1 clock is inverted
with respect to the phi0 clock.
[Figure panels: Conventional Bubble-Push technique; Proposed DominoLatch technique. Labels: gates G1 and G2, dynamic latch, clocks CLK1, CLK2, phi, and clk]
Figure 2. The conventional and proposed DominoLatch circuits

In this paper, a design automation flow for dual-phase
pipelined domino circuits is proposed, covering both theory
and practical implementation in a real chip. The cell-based
synthesis design flow produces DominoLatch circuits that
feature skew tolerance, low power consumption, and
high-performance operation.
II. DominoLatch Circuit & Design Automation

An original static logic circuit is shown in Figure 3(a).
The same circuit after DominoLatch logic synthesis is shown
in Figure 3(b). Dynamic latches (dl) have been added to the
a3 and a4 fanin pins to keep the signals synchronized. The
DominoLatch transistor schematic is shown in Figure 3(c).
Gate-na2, gate-inv, gate-dl1, gate-dl3, and gate-dl4 use the
phi1 clock; gate-an2, gate-nr2, and gate-dl2 use the phi0
clock. The added dl is very simple, with only 6 transistors,
so it requires less area than a conventional static latch.
The half dynamic latch (hl) is embedded in the inverting
logic gate, and this embedded latch has only 3 transistors.
[Figure 3 panels: (a) Original gate circuit; (b) Proposed gate circuit; (c) DominoLatch transistor schematic circuit. Labels: inputs a1 to a4; gates na2, an2, nr2, inv; latches dl1 to dl4 and hl; clocks phi0 and phi1]
Figure 3. The proposed DominoLatch example circuit
[Figure 4 labels: Original Circuit; gates G1 and G2 with dynamic latch; clocks CLK1, CLK2, clk, and phi; alternating evaluate and precharge phases for N1- and N2+; "Latch keeps data"; "Delayed falling signal"; "Next stage gets the data after the previous stage is stable"]
Figure 4. The basic DominoLatch circuit construction

The automatic design flow for DominoLatch circuitry
consists mainly of the clock phase assignment process and
the circuit retiming process. The clock phase assignment
process traces each data transmission path, following the
inverting functions of the gates, in order to assign each
logic stage an appropriate operational clock phase. A logic
stage is a group of gates that share the same operational
clock phase.
DominoLatch logic gates need to use opposite clock phases
alternately. For two neighboring stages separated by an
inverting function, the assigned clock phases are opposite.
The clock phase assignment procedure traces the logic gate
functions of the circuit in order to assign a suitable
operating clock phase. In the operating diagram shown in
Figure 3, the output transition of the inverting gate na2
feeds the positive-logic gate an2 while an2 is in its
precharge phase, which results in a correct output function.
During the clock phase assignment process, a counting
number is assigned to each logic gate to represent its depth
in circuit stages. After the logic gates along each
transmission path have been traced and numbered, the gates
are divided into two groups (even and odd numbers), and a
clock signal is assigned to each group. If the fanin paths
of a logic gate have different counting numbers (input stage
lengths), the largest fanin counting number is adopted for
that gate.
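The clock phase assignment described above can be sketched in Python as follows. This is one possible reading of the text, not the authors' tool: it assumes that a positive gate stays in the stage of its driver, that a gate driven by an inverting gate or by a primary input starts the next stage, and that odd stages use phi1 and even stages use phi0. The netlist reuses the gate names of Figure 3, but its connectivity is hypothetical.

```python
# Hypothetical netlist: gate -> (is_inverting, list of fanins).
# Names follow Figure 3; the connectivity is assumed for illustration.
NETLIST = {
    "na2": (True, ["a1", "a2"]),
    "an2": (False, ["na2", "a3"]),
    "nr2": (True, ["an2", "a4"]),
    "inv": (True, ["nr2"]),
}

def assign_stages(netlist):
    """Return {gate: counting number}; names not in netlist are primary inputs."""
    stage = {}

    def ready(node):
        # First stage that may consume the value produced by `node`.
        if node not in netlist:  # primary input
            return 1
        inverting, _ = netlist[node]
        return visit(node) + (1 if inverting else 0)

    def visit(gate):
        if gate not in stage:
            _, fanins = netlist[gate]
            # When fanin counting numbers differ, adopt the largest one.
            stage[gate] = max(ready(f) for f in fanins)
        return stage[gate]

    for gate in netlist:
        visit(gate)
    return stage

def clock_phase(stage_number):
    """Odd stages use phi1, even stages use phi0 (assumed convention)."""
    return "phi1" if stage_number % 2 else "phi0"

stages = assign_stages(NETLIST)
phases = {gate: clock_phase(n) for gate, n in stages.items()}
print(stages)
print(phases)
```

With this assumed netlist, na2 and inv land on phi1 while an2 and nr2 land on phi0, which agrees with the clock assignment stated for those gates in the Figure 3 example. The sketch assumes an acyclic combinational netlist.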
After the clock phase assignment procedure, the retiming
process maintains synchronous pipeline operation by
inserting additional dynamic latch gates into the data
transmission paths, so that the whole circuit operates
correctly. The retiming process checks whether the fanin
counting numbers of each gate differ. For gates with
different fanin counting numbers, the retiming process adds
dynamic latches to the data transmission paths until every
fanin pin of the gate has the same counting number. Figure 3
shows three dynamic latches (ncc) added to the a4 fanin path
for correct pipeline behavior. Dual-phase DominoLatch
circuitry thus gives the circuit stable pipeline behavior.
The pipeline operation and timing diagram of the circuit are
shown in Figure 5.
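The latch-insertion rule of the retiming step can be illustrated with a small Python sketch. It assumes that each inserted dynamic latch raises the counting number of its path by exactly one; the gate names g1 and g2, the signal names x and y, and the counting numbers are hypothetical.

```python
def latches_needed(fanin_counts):
    """Latches to insert on each fanin path of one gate.

    fanin_counts maps each fanin pin to its counting number. Every path is
    padded up to the largest counting number among the fanins.
    """
    target = max(fanin_counts.values())
    return {pin: target - count for pin, count in fanin_counts.items()}

def retime(circuit):
    """Return {(fanin, gate): latch count} for every path that needs padding."""
    plan = {}
    for gate, fanin_counts in circuit.items():
        for pin, count in latches_needed(fanin_counts).items():
            if count > 0:
                plan[(pin, gate)] = count
    return plan

# Hypothetical example: gate -> {fanin: counting number of that fanin}.
example = {
    "g1": {"x": 1, "a3": 0},
    "g2": {"y": 3, "a4": 0},
}
plan = retime(example)
print(plan)
```

In this example one latch is placed on the a3 path and three on the a4 path, so that both fanins of each gate arrive with the same counting number.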
III. Built-in Performance Adjustment and Function Self Test Mechanism
The phi1 and phi0 evaluation times are important for the
DominoLatch circuit, because process variation influences
the clock skew and the duty cycle pulse widths (from the
ADPLL). To handle this variation in a reliable high-speed
DominoLatch design, the proposed Duty Cycle Pulse Generator
(DCPG) circuit provides an on-chip adjustment mechanism that
ensures correct functioning at high speed. The DCPG circuit
design is shown in Figure 7. The DCPG circuit generates a
dual clock in which the phi0 falling edges are aligned with
the phi1 rising edges, so the evaluation times of the two
clock phases do not overlap. According to FF and SS corner
model simulations, the DCPG adjusts the duty cycle time
discrepancy in coarse steps of 55.5 ps and fine steps of
1.3 ps. The DCPG is built from delay lines in order to keep
it stable and tolerant of process variation.
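As a rough illustration of how the coarse and fine scaling combine, the Python sketch below treats 55.5 ps and 1.3 ps as step sizes and computes the total adjustment for a given number of steps. How the DCPG control word encodes the step counts, and how many steps are available, is not stated in the text, so the step counts are passed in directly and the function names are hypothetical.

```python
COARSE_STEP = 555  # 55.5 ps, stored in tenths of a picosecond
FINE_STEP = 13     # 1.3 ps, stored in tenths of a picosecond

def adjustment_ps(coarse_steps, fine_steps):
    """Total duty-cycle adjustment in ps for the given step counts."""
    return (coarse_steps * COARSE_STEP + fine_steps * FINE_STEP) / 10

def steps_for(target_ps):
    """Pick (coarse, fine) step counts that approximate target_ps."""
    target = round(target_ps * 10)
    coarse, rest = divmod(target, COARSE_STEP)
    # Round the remainder to the nearest whole fine step.
    fine = (2 * rest + FINE_STEP) // (2 * FINE_STEP)
    return coarse, fine

coarse, fine = steps_for(120.0)
print(coarse, fine, adjustment_ps(coarse, fine))
```

For a 120 ps target this picks 2 coarse steps and 7 fine steps, for a total of 120.1 ps. Integer tenths of a picosecond are used internally to avoid floating-point drift when steps are accumulated.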
[Figure 7 labels: S, DCPG_IN, DCPG_IN5, DCPG_IN6, IN, OUT, MUX]
Figure 7. The proposed Duty Cycle Pulse Generator (DCPG)
Because the DCPG circuit generates a dual clock with
aligned phi0 falling edges and phi1 rising edges, the
circuit performance can be adjusted with this technique.
Since the adjusting mechanism works by changing the
evaluation time of the dual-phase clock, the circuit
operating speed can be adjusted gradually. The functional
evaluation of each stage, together with its data
transmission time, must be completed within the assigned
duty cycle time of the clock. The evaluation time is
shortened when the duty cycle time decreases. The DCPG can
thus control the evaluation times of the odd and even stages
to push the DominoLatch circuit to its highest stable
operating speed.
[Figure 6 flow labels, Steps 1 to 7: RTL code; Synopsys presynthesis and optimization; ungroup FF and logic gate; clock phase assignment (assigns a suitable clock phase to each logic gate); adding latch (to resolve the inverting function problem); data retiming (synchronous process by adding latches); circuit combination; post synthesis (Synopsys); Verilog netlist; chip physical design]
Figure 6. The design automation flow for DominoLatch circuit
[Figure 7 labels: Coarse Delay, Fine Delay, MUX, IN, DCPG_IN0 to DCPG_IN5, DCPG_OUT, phi0]
[Timing diagram labels: Clk, phi0, phi1, (Pd+Td) odd stage, (Pd+Td) even stage]
With DominoLatch pipeline operation, the circuit
throughput is determined by the largest stage delay within
the DominoLatch circuit. Here, the stage delay is the
worst-case time for a data input signal of a stage to
propagate to the output of that stage. The circuit's highest
operating frequency (performance) therefore depends on the
stage with the longest delay.
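A back-of-the-envelope model of this relationship is sketched below in Python. It assumes that odd stages evaluate in one clock phase and even stages in the other, that the two evaluation windows do not overlap, and that each window must cover the slowest stage of its group; the stage delays are made-up numbers. The latency helper follows the annotation in Figure 10, where a multiplier with 77 levels produces its first output in the 39th clock cycle, that is, two levels per cycle.

```python
import math

def min_clock_period_ps(stage_delays_ps):
    """Smallest clock period under the two-window assumption.

    stage_delays_ps lists the (Pd+Td) delay of stage 1, 2, 3, ... in ps
    and must contain at least two stages.
    """
    odd = max(stage_delays_ps[0::2])   # stages 1, 3, 5, ...
    even = max(stage_delays_ps[1::2])  # stages 2, 4, 6, ...
    return odd + even

def max_frequency_mhz(stage_delays_ps):
    """Highest clock frequency in MHz for the given stage delays."""
    return 1e6 / min_clock_period_ps(stage_delays_ps)

def first_output_cycle(levels):
    """Clock cycle in which the first result appears, two levels per cycle."""
    return math.ceil(levels / 2)

delays = [400, 500, 300, 450]  # hypothetical stage delays in ps
print(min_clock_period_ps(delays), round(max_frequency_mhz(delays), 2))
print(first_output_cycle(77))
```

Under these assumptions the slowest odd stage and the slowest even stage alone set the clock period, which is why shortening the evaluation window of either phase directly probes the worst-case stage.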
Since the DominoLatch circuit design can use inverting
logic gates, generic static circuit designs can, after the
optimization synthesis flow, be mapped directly onto
DominoLatch circuits. Figure 6 shows the proposed automatic
synthesis flow, which helps to optimize circuit delay and
area cost when designing large, complex circuitry.
Figure 5. The pipeline operation of the DominoLatch circuits
Figure 8. The DCPG progressive adjusting diagrams.

The clock waveforms of the progressive duty cycle
adjustment are shown in Figure 8. Performance adjustment is
achieved with the proposed mechanism. By adjusting the clock
duty cycles to control the evaluation time, the function
evaluation of each stage, including the data transmission
time between two DominoLatch stages, can be assessed.
Regulating the evaluation time during normal operation can
even help the circuit tolerate clock generator problems
(e.g. clock skew and jitter) after chip manufacturing.
IV. DominoLatch Test Chip Design
Figure 9 shows the circuit structure, with a layout view of
the test chip. Block-1 handles system clock generation; it
includes an ADPLL (All Digital Phase Lock Loop), which
generates the system clock, as well as the mechanism for
switching to an external clock input. The DCPG in Block-2 is
used to adapt the clock pulse width. The two-phase operation
divides the clock into positive and inverted clock signals.
The dual-phase operation mechanism uses the clock duty cycle
adjustment mechanism to generate clipped clock pulses that
allow the designed circuit to work toward its highest stable
speed. Block-3 contains the high-speed DominoLatch
multiplier with built-in self test and control techniques.
Figure 10 shows the waveforms for the pulse width
adjustment. Clk2_out is the system clock generated by
Block-1. DCPG_IN is the control signal that adjusts the
DCPG. The edge-clipped phi0 and phi1 signals are sent to
Block-3. The BIST operation quickly locks onto the
appropriate working frequency of Block-3.
[Figure 10 waveform labels: DCPG initialization, Clk2_out, phi0, phi1, En_count, DCPG_IN, rst, en, LFSR_count, MISR data compression. Annotation: the multiplier has 77 levels, so the first output must wait until the 39th clock cycle (77/2)]
Figure 10. The adjustment mechanism of the test chip
Table 2. Experimental results of the test chip
[Figure 9 labels: Block 1, Block 2, Block 3; ADPLL, DCPG, LFSR, 32x32 bit High Speed Multiplier, ORA, Mechanism (I), Mechanism (II); signals IN_CLK, CLK_SEL, DIV_number [9:0], count [7:0], TPI [63:0], ora_out [63:0], ck1_out, ck2_out, phi0, phi1]
Figure 9. The DominoLatch test chip with layout view

Due to I/O pad limitations, the highest speed of the
DominoLatch circuit cannot easily be measured by external
instruments. The duty cycle adjusting mechanism is therefore
combined with the BIST functionality embedded in the
high-speed DominoLatch design. Because the DCPG adjusts the
evaluation time of the logic gate stages, the adjustment
mechanism is implemented in cooperation with the built-in
self test (BIST) technique. The evaluation times of the odd
and even stages can be clipped by the DCPG adjustment
mechanism. Combined with automatic test pattern generation
(ATPG) and a multiple-input signature register (MISR)
circuit, this provides both performance adjustment and
testing functionality.
To evaluate the highest operating frequency, the built-in
function checking technique uses an LFSR to generate
pseudo-random input test patterns and a MISR to compress the
multiplier output responses. The comparison mechanism
examines the output responses of the DominoLatch circuit at
the highest speed after the duty cycle is adjusted. The
automatic comparison result is 1 or 0, indicating whether
the DominoLatch circuit function is correct or not, so it is
easy to check circuit correctness after each duty cycle
adjustment. From the BIST results, the highest operating
clock frequency of the circuit can be determined by
progressively shortening the evaluation time of the odd and
even stages. In this way, the DominoLatch circuit
performance can be verified.
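The search procedure described above can be modeled in software as follows. This Python sketch is a behavioral illustration only, not the chip's logic: the LFSR taps, the seed, the pattern count, the split of each 64-bit pattern into two 32-bit operands, and the toy fault model (one critical pattern returns a wrong bit when the evaluation window is shorter than 700 ps) are all assumptions made for the example.

```python
MASK64 = (1 << 64) - 1
TAPS = (64, 63, 61, 60)  # assumed feedback taps, not taken from the paper

def lfsr_step(state):
    """Advance a 64-bit Fibonacci LFSR by one shift."""
    bit = 0
    for tap in TAPS:
        bit ^= (state >> (tap - 1)) & 1
    return ((state << 1) | bit) & MASK64

def make_patterns(seed, count):
    """Generate `count` pseudo-random 64-bit test patterns."""
    patterns, state = [], seed
    for _ in range(count):
        state = lfsr_step(state)
        patterns.append(state)
    return patterns

def misr_signature(responses):
    """Compress a response sequence into one 64-bit signature."""
    signature = 0
    for response in responses:
        signature = lfsr_step(signature) ^ (response & MASK64)
    return signature

def chip_responses(patterns, eval_time_ps, critical_ps=700):
    """Toy chip model: the first pattern exercises the critical path and
    returns one wrong bit when the evaluation window is too short."""
    responses = []
    for index, pattern in enumerate(patterns):
        a, b = pattern & 0xFFFFFFFF, pattern >> 32
        product = a * b
        if index == 0 and eval_time_ps < critical_ps:
            product ^= 1
        responses.append(product)
    return responses

PATTERNS = make_patterns(seed=0xACE1ACE1ACE1ACE1, count=32)
# Golden signature taken from a run with a generous evaluation window.
GOLDEN = misr_signature(chip_responses(PATTERNS, eval_time_ps=10_000))

def run_bist(eval_time_ps):
    """Return 1 if the compressed responses match the golden signature, else 0."""
    return int(misr_signature(chip_responses(PATTERNS, eval_time_ps)) == GOLDEN)

def shortest_passing(settings_ps):
    """Walk from the longest to the shortest evaluation time and stop at the
    first failure; return the last setting that passed, or None."""
    best = None
    for setting in settings_ps:
        if not run_bist(setting):
            break
        best = setting
    return best

print(shortest_passing([1000, 900, 800, 700, 600, 500]))
```

With the toy fault model, the sweep reports 700 ps as the shortest evaluation time that still yields a correct signature. Because the signature update is a linear, invertible shift followed by an XOR, a single wrong response always changes the final signature in this model.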
[Figure 10 annotations: "BIST output correct and the DCPG is enabled to adjust duty cycle again"; "end compression". Signals: result [63:0], ck2_out, rst, DCPG_IN [6:0]]
Table 2 compares the performance of the static circuit and
the proposed DominoLatch circuit. Column 2 gives the logic
gate count, including the dynamic latches for data
synchronization. Columns 3 and 4 show, respectively, the
delay time and the power consumption with the clock
frequency fixed at 100 MHz. Columns 5 and 6 show,
respectively, the highest operating frequency of the
DominoLatch circuit and the time to finish the
multiplication operations.
V. Conclusion
A dual-phase high-speed domino circuit design technique
with an in-house, cell-based automatic synthesis EDA flow is
proposed. The clock phase assignment and retiming processes
maintain correct circuit operation. The clock duty cycle
adjustment mechanism cooperates with the BIST circuit to
provide performance adjustment for high-speed circuits after
chip manufacturing. A test chip of a dual-phase 64-bit
high-speed multiplier with the performance adjustment
mechanism has been successfully validated.
References
[1] N. Goncalves and H. De Man, "NORA: A racefree dynamic CMOS technique for pipelined logic structures", IEEE J. Solid-State Circuits, vol. 18, pp. 261-266, June 1983.
[2] M. Sjalander and P. Larsson-Edefors, "Multiplication Acceleration Through Twin Precision", IEEE Trans. on VLSI Systems, vol. 17, no. 9, 2009.
