
A novel clock synchronizer for low-voltage clock distribution network

Chong Lu1,2*, Zhi-kui Duan2, Yi Ding3, Hong-zhou Tan1,2

1 SYSU-CMU Shunde International Joint Research Institute, Sun Yat-sen University, Shunde 528300, China
2 School of Information Science and Technology, Sun Yat-sen University, Guangzhou 510006, China
3 School of Computer Science and Technology, Hunan University of Arts and Science, Changde 415000, China
* Email: luchong@mail2.sysu.edu.cn
Abstract
In this paper, we propose a fast clock synchronizer for low-voltage clock distribution networks that reduces power consumption and suppresses phase error. The proposed circuit aligns the clock signals at the leaf nodes with the root source in at most 4 clock cycles and reduces the number of buffers in the original clock driver chains. A coarse tuning component (CTC) and a fine tuning component (FTC) perform coarse and fine tuning in 2 and 3 clock cycles respectively, with one cycle shared, and low-voltage phase detectors are used to meet the supply-voltage requirement. Interleaved delay units improve the precision of coarse tuning, and a binary search scheme shortens the fine tuning period. The proposed circuit is designed in the TSMC 65 nm GP process with a supply voltage as low as 0.6 V. It is compared against an H-tree clock network synthesized for a single core of OpenSPARC T2. The experimental results show that the clock is synchronized in at most 4 cycles with the phase error suppressed below 48 ps, and that the power saving is up to 42%.
1. Introduction
The clock signal is essential to synchronous digital circuit blocks and memory devices, and distributing it to millions of registers and latches in very-large-scale microprocessors is a considerable challenge. Normally, the distribution network is divided into global and local parts, each with its own methodology. Binary trees are widely adopted for global clock distribution because of their simplicity and ease of integration into the digital back-end flow with the aid of automatic clock synthesis algorithms [1], [2]. A large number of inverters and buffers are inserted and sized to balance the propagation paths and to drive the local clock networks or mesh. The power consumption of the global distribution tree is enormous, and the simultaneous rising transitions of the clock signals cause a notable swing on the power supply network [3].
Recently, academic research on reducing the power consumption of clock distribution networks has suggested possible solutions [4-7]. Reducing the supply voltage is a promising technique; however, the performance of the clock driving buffers is degraded, the phase error increases considerably [8], and the sensitivity to temperature variations must also be taken into account [9].
In this paper, a novel low-power clock distribution network with a fast synchronization circuit is proposed. The
operating supply voltage is reduced to 0.6 V using the TSMC

978-1-4799-8485-5/15/$31.00 2015 IEEE

65 nm GP process, which lowers the power consumption. The resulting impact on phase error and jitter is compensated by the synchronization circuit. The proposed circuit is composed of an improved synchronous mirror delay (SMD) circuit with a closed-loop structure and a dynamic compensation circuit for coarse and fine tuning. The loss of precision is minor, and alignment takes at most 4 and at least 2 clock cycles.
The synchronization circuit consists of several functional blocks: a coarse tuning component (CTC), a fine tuning component (FTC), an input buffer (IB) and a feedback buffer (FB), as shown in Figure 1. Unlike in a conventional SMD, the clock drivers (CD) are separated from the circuit and work as part of the global clock distribution network, and their sizes are determined by the effective load of the local networks or mesh.

Figure 1. Architecture of proposed circuit


The rest of the paper is organized as follows. Section 2 introduces the structure of the coarse tuning component of the synchronization circuit and its operation timing. Section 3 describes the details and implementation of the fine tuning component and the low-voltage circuits. The experimental results and a comparison with a conventional H-tree are given in Section 4. The conclusions are given in the last section.
2. Structure of CTC
In the first clock cycle, the external clock from the root source, EXTCLK, propagates through IB and CD, feeds back through FB as FBCLK, and arrives at the input port of CTC with delay d1+d2+d1. FBCLK (CTCI) is then compared with the reference clock generated in the second clock cycle, which is delayed by d1; the original phase difference Tv is thus measured and sampled, and then compensated with T'v. However, the precision of CTC is restricted by the resolution of the measurement and compensation units. Three reference clocks are generated in IB: the normal reference clock RCLK with delay d1, and two inverted clocks, NCLK and PCLK, which differ in phase: PCLK arrives earlier than NCLK by a phase shift θ.
Since the supply voltage is much lower than in the normal case, the delay of a measurement unit is nearly double that at the normal supply voltage. Another challenge comes from the transition time of the reference clocks, which is even worse in tri-state inverters or AND gates, the fundamental delay units of conventional clock synchronization circuits. In CTC, the delay units of IMDL are simplified to balanced inverters with delay R and no duty-cycle distortion. CTC consists of an interleaved measurement delay line (IMDL), a control circuit (CC) and a dual control delay line (DCDL), as shown in Figure 2.

Figure 3. Schematic view of LVPD


Two groups of control signals are generated by CC to manipulate the dual-control delay units (DCDU): I for clock injection and P for pass. Only one of I0,...,IN is high while the others stay low, whereas P0,...,PN hold a pattern of the form 111...000. The signals Ik and Pk are expressed by the equations below.
Ik

Qk  Q k 1

Pk

Pk 1  I k

PN 1

Q N 1 1

(2)

To improve the resolution R, the DCDU circuit is employed. As shown in Figure 4, when Pk is logic high and Ik is low, DCDU works as an inverter. An inverter I1 is inserted to accelerate the falling transition of MN0. The performance of DCDU is then quite close to that of an IMDL unit when the clock signal propagates through it.

Figure 2. Diagram of CTC


IMDL is separated into odd and even parts, which accept the normal and inverted reference clocks. Phase detection of the propagated signals is likewise divided into odd and even groups using a pair of reference clocks, RCLK and NCLK, and the delay information is captured and sampled in digital form. The residual measurement error is confined to a range determined by R.

Figure 4. Schematic view of DCDU

$$T_v = T_{ck} - d_1 - d_2$$
$$T'_v = K \times R$$
$$T'_v - T_v = K \times R - T_v \in (-R, 0) \qquad (1)$$
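The coarse measurement can be illustrated numerically. The following is a minimal sketch, assuming Tv = Tck - d1 - d2, T'v = K*R, and that K is the largest integer for which K*R does not exceed Tv (the floor rule is an assumption made here so that the residual stays within one delay unit). The function name and the delay values are hypothetical; all times are in picoseconds.

```python
def coarse_tune(t_ck_ps: int, d1_ps: int, d2_ps: int, r_ps: int) -> tuple[int, int, int]:
    """Return (K, T'_v, residual) for the coarse measurement, in picoseconds."""
    t_v = t_ck_ps - d1_ps - d2_ps        # T_v = T_ck - d1 - d2
    if t_v <= 0:
        raise ValueError("d1 + d2 must be shorter than one clock period")
    k = t_v // r_ps                      # assumed rule: whole delay units that fit in T_v
    t_v_comp = k * r_ps                  # T'_v = K * R
    return k, t_v_comp, t_v_comp - t_v   # residual is in (-R, 0]; 0 only on an exact fit


# Hypothetical numbers: 1 GHz clock, d1 = 120 ps, d2 = 310 ps, R = 60 ps.
k, t_v_comp, residual = coarse_tune(1000, 120, 310, 60)
print(k, t_v_comp, residual)  # prints: 9 540 -30
```

With these values Tv is 570 ps, nine delay units fit, and the compensated delay falls 30 ps short, which is within one unit R.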

The Low Voltage Phase Detector (LVPD) in CC is based on the E-TSPC register [10]; it accepts clocks of arbitrary duty cycle and sends the measurement result to the combinational logic of CC. The E-TSPC register responds only to the positive clock edge, which improves the response time. The structure of LVPD is shown in Figure 3. Its stages are simplified to extend operating flexibility, at the cost of reduced tolerance to overlap of the input signals. Therefore, the transition time of the input signals must be guaranteed by properly sized buffers in IB and IMDL.

Clock injection occurs in the one DCDU whose control signals Ik and Pk are both logic high: its propagation path VLx receives the gated reference clock, RCLK or PCLK, under the control of Ik. The odd units let PCLK propagate through the transmission gate TG0, while the even DCDUs accept RCLK. Injection occurs in only one DCDU (I=P=1), while every other DCDU either works as an inverter (I=0, P=1) or is fully disabled (I=P=0).
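The control patterns described above can be sketched in a few lines. This is an illustrative model only: the injection index is passed in directly (in the circuit it is derived by CC from the sampled phase-detector outputs), the inverter mode is taken to be P high with I low, as in the description of Figure 4, and the names `Mode`, `control_signals` and `unit_mode` are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    INJECT = "inject"      # I = 1, P = 1: gated reference clock enters here
    INVERT = "invert"      # I = 0, P = 1: unit passes the clock as an inverter
    DISABLED = "disabled"  # I = 0, P = 0: unit is switched off


def control_signals(n_units: int, inject_at: int) -> tuple[list[int], list[int]]:
    """Build a one-hot I vector and a 111...000 P vector for units 0..n_units-1."""
    if not 0 <= inject_at < n_units:
        raise ValueError("inject_at must index an existing unit")
    i_sig = [1 if k == inject_at else 0 for k in range(n_units)]
    p_sig = [1 if k <= inject_at else 0 for k in range(n_units)]
    return i_sig, p_sig


def unit_mode(i: int, p: int) -> Mode:
    """Classify one DCDU from its pair of control bits."""
    if i and p:
        return Mode.INJECT
    if p:
        return Mode.INVERT
    if not i:
        return Mode.DISABLED
    raise ValueError("I = 1 with P = 0 is not produced by control_signals")


i_sig, p_sig = control_signals(6, 2)
modes = [unit_mode(i, p) for i, p in zip(i_sig, p_sig)]
print(i_sig, p_sig)  # prints: [0, 0, 1, 0, 0, 0] [1, 1, 1, 0, 0, 0]
```

Exactly one unit is in injection mode, the units before it pass the clock as inverters, and the remaining units are disabled.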
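However, the performance loss of injection depends on which reference clock is selected, owing to the charging speed of the transmission gate. Let σ denote this injection delay, σo in the odd units and σe in the even units. The falling edge of PCLK is affected, whereas σe is much smaller than R and is insignificant; the phase shift θ of PCLK is manually adjusted to equal σo. Therefore, the compensation time of DCDL is T''v = T'v + σ. The phase error φ = T''v - Tv is described below, where δ = T'v - Tv is the residual error of the coarse measurement.

$$\varphi_o = \delta + \sigma_o - \theta = \delta \in (-R, 0)$$
$$\varphi_e = \delta + \sigma_e \approx \delta \in (-R, 0) \qquad (3)$$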

Therefore, the total delay from the input clock EXTCLK to the output clock INTCLK is d1+d1+d2+Tv+T''v+d2 = 2Tck+φ, and the output clock is roughly aligned within 2 cycles with a phase error in the range (-R, 0). Fine tuning, however, already starts in the second clock cycle.
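The total-delay expression can be checked step by step. The short derivation below assumes Tv = Tck - d1 - d2 from equation (1) and writes φ for the phase error T''v - Tv; it is a consistency check on the stated result, not an additional claim.

```latex
\begin{aligned}
d_1 + d_1 + d_2 + T_v + T''_v + d_2
  &= 2(d_1 + d_2) + 2T_v + (T''_v - T_v) \\
  &= 2(d_1 + d_2 + T_v) + \varphi \\
  &= 2T_{ck} + \varphi
\end{aligned}
```

The last step uses d1 + d2 + Tv = Tck, so the output clock lags the input by two full periods plus the residual phase error.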
3. Structure of FTC
As mentioned above, the phase error is confined to a certain range once the rough alignment is complete. In the fine tuning process, the phase error is reduced further by the compensation of FTC. Since the phase error is restricted to the range (-R, 0), the resolution τ of FTC is set such that 7τ < R < 8τ. However, unlike the measurement-compensation method of CTC, FTC applies a trial-and-error strategy, because the precise phase error is quite difficult to obtain.

Figure 5. Diagram of FTC


FTC is composed of the low voltage phase detector (LVPD), the variable delay path (VDP) and the state machine (STM). LVPD captures the arrival order of EXTCLK and INTCLK and sends one of the indication signals R-, R0 or R+ to STM, as shown in Figure 5. A second path, EXTCLKD, with extra delay is introduced to resolve the timing relationship. EXTCLK and its delayed variant EXTCLKD serve as the reference clocks, while the signal INTCLK is sampled, as shown in Figure 6(a). The results from the two LVPD phase comparators fall into three cases. If INTCLK arrives earlier than EXTCLK, both sampled results Q0 and Q1 are logic high; this is case R+, which implies that more delay must be added to INTCLK. Conversely, if Q0 and Q1 are both logic low, INTCLK arrives later than EXTCLKD and less delay is required (case R-). If Q0 = 0 and Q1 = 1, INTCLK falls into the gap between EXTCLK and EXTCLKD; the fine tuning is then stable and the signal R0 is sent to the control component.
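The three cases can be written as a small decision function. This is a behavioural sketch, not the transistor-level LVPD: it assumes Q0 is INTCLK sampled at the EXTCLK edge and Q1 is INTCLK sampled at the EXTCLKD edge, and the names `sample`, `classify` and `window_ps` (the extra delay of EXTCLKD) are hypothetical.

```python
def sample(t_int_ps: float, t_ext_ps: float, window_ps: float) -> tuple[int, int]:
    """Sample INTCLK at the EXTCLK and EXTCLKD rising edges (assumed model)."""
    q0 = 1 if t_int_ps < t_ext_ps else 0              # INTCLK already high at EXTCLK
    q1 = 1 if t_int_ps < t_ext_ps + window_ps else 0  # INTCLK already high at EXTCLKD
    return q0, q1


def classify(q0: int, q1: int) -> str:
    """Map the two sampled outputs to an indication signal."""
    if q0 == 1 and q1 == 1:
        return "R+"   # INTCLK earlier than EXTCLK: more delay is required
    if q0 == 0 and q1 == 0:
        return "R-"   # INTCLK later than EXTCLKD: less delay is required
    if q0 == 0 and q1 == 1:
        return "R0"   # INTCLK between EXTCLK and EXTCLKD: lock
    raise ValueError("Q0 = 1 with Q1 = 0 is not one of the described cases")


# INTCLK edge 20 ps before, 5 ps after, and 40 ps after EXTCLK, with a 10 ps window.
print([classify(*sample(t, 0.0, 10.0)) for t in (-20.0, 5.0, 40.0)])
# prints: ['R+', 'R0', 'R-']
```

Under this sampling model the combination Q0 = 1, Q1 = 0 cannot occur, since EXTCLKD always follows EXTCLK, which is why the function rejects it.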
There are 8 legal states in STM: S0,...,S6 and an additional stable state SL for lock. Seven individual delay paths DP0,...,DP6 with delays τ,...,7τ, where τ is the FTC resolution, are selected by the state registers. The initial state is S3 with delay 3τ, and the next state is determined by the current state of STM and the indication signal, as shown in Figure 6(b). Since the indication signal is obtained when the reference clock arrives, the state register is updated immediately afterwards and the selection of the delay path completes synchronously.

(a) Timing of LVPD


(b) state transitions
Figure 6. Operation timing of LVPD and states in STM
In the second clock cycle the coarse tuning is not yet finished, so the state register is set to S3 and the indication signal generated by LVPD is invalid in this cycle. Therefore, the phase error is φ+3τ in this cycle, where φ is the phase error left by coarse tuning and τ is the FTC resolution, and the state register holds the value S3 until the next clock cycle arrives.
In the third clock cycle, LVPD captures the phase error φ+3τ, and the next state of STM (unless it is SL) is determined by the indication signal when the reference clock arrives. If R- is received, the state register moves to S1 and the delay decreases to τ. If R+ is received, the state moves to S5 with delay 5τ. If R0 is received, STM enters the SL state and becomes stable.
In the fourth clock cycle, the remaining states S0, S2, S4 and S6 are reached in the same way, with the tuning step reduced to τ. The binary search scheme is much faster than the conventional method of traversing from S0 to S6, and the phase error is reduced to the range (-τ, 0).
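The state transitions of the third and fourth clock cycles amount to a two-step binary search starting from S3. The sketch below replays a sequence of indication signals; it assumes that entering SL keeps the delay path of the state from which it was entered, which the text does not state explicitly, and the function name `fine_tune` is hypothetical.

```python
def fine_tune(indications: list[str]) -> tuple[int, bool]:
    """Replay the STM from S3 and return (state index, locked)."""
    state, step, locked = 3, 2, False   # start at S3; the first move spans two states
    for ind in indications[:2]:         # only cycles 3 and 4 change the state
        if ind == "R0":
            locked = True               # SL: assumed to keep the current delay path
            break
        if ind not in ("R+", "R-"):
            raise ValueError(f"unknown indication {ind!r}")
        state += step if ind == "R+" else -step
        step //= 2                      # halve the step: 2 in cycle 3, 1 in cycle 4
    return state, locked


print(fine_tune(["R-", "R+"]))  # prints: (2, False)  S3 -> S1 -> S2
print(fine_tune(["R+", "R0"]))  # prints: (5, True)   S3 -> S5, then lock
```

Any of the seven states is reached in at most two decisions, whereas stepping linearly from S0 could take up to six.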
4. Experimental Results
The experiment is carried out in the TSMC 65 nm general purpose process with a supply voltage of 0.6~1 V, at most 40% below the standard value. The delay d1 of IB increases by 1.6X due to the lowered supply voltage; therefore a trade-off is made between the operating frequency range and the driving ability. The area is 460 µm x 9.6 µm (2X height) and the average power is 1.4 mW when the input frequency is 1 GHz and the supply voltage is 0.6 V.
The functionality of the proposed synchronization circuit is verified by post-layout simulation using HSPICE with Verilog-AMS models of the clock drivers. The operating frequency is 1 GHz and the supply voltage is 0.6 V. As shown in Figure 7, INTCLK is roughly synchronized with EXTCLK in 2 clock cycles and the phase error is further suppressed in the next 2 clock cycles.

Figure 7. Simulation waveform


The baseline for comparison is a fully digital implementation of a single core of OpenSPARC T2 [11], with the clock tree synthesized by EDA software and the inverters and buffers re-characterized. The block contains 44041 registers; the memories are omitted. The clock tree synthesis is repeated 8 times at different supply voltages in the range 0.6~1 V with a step of 0.05 V. In the other case, the proposed circuit is applied after the partition into global and local networks. In this experiment, the local networks of the modules MMU, TLU, FGU and LSU are attached to the synchronization circuit. The power of the clock tree can be calculated with the expression P = N*0.5CV^2 and decreases with the supply voltage; however, the reduction also depends on the total number of inverters and buffers, N, which increases to meet the timing margin, as shown in Figure 8.
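The interplay between supply voltage and buffer count in this expression can be made concrete with a small calculation. The sketch below uses the expression P = N*0.5CV^2 as given; the buffer counts and the per-buffer capacitance are hypothetical values chosen for illustration and are not taken from the experiment.

```python
def clock_tree_power(n_buffers: int, c_farad: float, v_supply: float) -> float:
    """Evaluate P = N * 0.5 * C * V^2 for N buffers of capacitance C."""
    return n_buffers * 0.5 * c_farad * v_supply ** 2


def power_reduction(n_ref: int, v_ref: float, n_low: int, v_low: float,
                    c_farad: float = 1e-15) -> float:
    """Fractional saving of the low-voltage tree relative to the reference tree."""
    p_ref = clock_tree_power(n_ref, c_farad, v_ref)
    p_low = clock_tree_power(n_low, c_farad, v_low)
    return 1.0 - p_low / p_ref


# Hypothetical buffer counts, not taken from the paper.
same_n = power_reduction(1000, 1.0, 1000, 0.6)  # voltage scaling only
more_n = power_reduction(1000, 1.0, 1500, 0.6)  # 50% more buffers at 0.6 V
print(round(same_n, 2), round(more_n, 2))  # prints: 0.64 0.46
```

Scaling from 1 V to 0.6 V alone would save 64% under this expression, but the saving shrinks as extra buffers are added to meet timing, which is the effect the text attributes to N.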

Figure 8. Power reduction with various power supply


Another comparison is shown in Figure 9 to demonstrate the relationship between the power reduction and the load of the local clock network. The power reduction grows as the supply voltage decreases. Since the propagation path of the global network is only responsible for the load of the local network, the total number of buffers and the overall power consumption are reduced.

Figure 9. Power reduction with various capacitance load of local clock network when operating at 0.6 V

5. Summary
In this paper, we propose a novel clock synchronization circuit for low-voltage clock distribution. The proposed circuit performs clock alignment in at least 2 and at most 4 cycles with a phase error under 48 ps. The active area is 460 µm x 9.6 µm with a power consumption of 1.4 mW. With this circuit, the power saving in the clock distribution network reaches up to 42%.
References
[1] H. Qian, P. Restle, J. Kozhaya, and C. Gunion, "Subtractive router for tree-driven-grid clocks," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 31(6), pp. 868-877 (2012).
[2] C. Deng, Y. Cai, and Q. Zhou, "A register clustering algorithm for low power clock tree synthesis," 2014 IEEE International Symposium on Circuits and Systems (ISCAS), (2014).
[3] A. Kahng, S. Kang, and H. Lee, "Smart non-default routing for clock power reduction," Design Automation Conference (DAC), pp. 1-7 (2013).
[4] H.-T. Lin, Y.-L. Chuang, Z.-H. Yang, and T.-Y. Ho, "Pulsed-latch utilization for clock-tree power optimization," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, pp. 721-733 (2014).
[5] F. Haj Ali Asgari and M. Sachdev, "A low-power reduced swing global clocking methodology," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 12(5), pp. 538-545 (2004).
[6] N. Kancharapu, M. Dave, V. Masimukkula, M. Baghini, and D. Sharma, "A low-power low-skew current-mode clock distribution network in 90nm CMOS technology," VLSI, IEEE Computer Society Annual Symposium on, pp. 132-137 (2011).
[7] A. Kulkarni and P. Khandekar, "Design and implementation of low power clock distribution network," in Advances in Engineering, Science and Management, International Conference on, pp. 761-765 (2012).
[8] J. Pangjun and S. Sapatnekar, "Low-power clock distribution using multiple voltages and reduced swings," Very Large Scale Integration Systems, IEEE Transactions on, 10(3), pp. 309-318 (2002).
[9] S. Tawfik and V. Kursun, "Low-power low-voltage hot-spot tolerant clocking with suppressed skew," in Circuits and Systems, IEEE International Symposium on, pp. 645-648 (2007).
[10] M.-V. Krishna, M.-A. Do, K.-S. Yeo, C.-C. Boon, and W.-M. Lim, "Design and Analysis of Ultra Low Power True Single Phase Clock CMOS 2/3 Prescaler," IEEE Trans. Circuits Syst. I, Reg. Papers, 57(1), pp. 72-82 (2010).
[11] www.opensparc.org
