
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 1, JANUARY 2011, p. 161

Reconfigurable SRAM Architecture With Spatial Voltage Scaling for Low Power Mobile Multimedia Applications
Minki Cho, Jason Schlessman, Wayne Wolf, and Saibal Mukhopadhyay

Abstract: This paper presents a dynamically reconfigurable SRAM array for low-power mobile multimedia applications. The proposed structure uses a lower voltage for cells storing low-order bits and a nominal voltage for cells storing higher-order bits. The architecture allows the number of bits in the low-voltage mode to be reconfigured at run-time, which changes the error characteristics of the array. Simulations in predictive 70 nm nodes show that the proposed array can obtain 45% savings in memory power with a marginal (10%) reduction in image quality.
Index Terms: Image processing, low-power, multimedia, process variations, reconfiguration, SRAM.

support dynamic run-time reconfiguration of the number of bits in the low-voltage mode. A static choice of the number of bits in low-voltage mode can lead to unacceptable quality degradation for different images and/or applications. The different failure mechanisms in an SRAM cell depend differently on the supply, wordline, and bitline voltages of the cell [3]. All of these voltage levels must be configured properly to dynamically modify the number of bits in the low-error and high-error modes. This cannot be achieved in a standard SRAM architecture [8]. In this paper, we present a reconfigurable architecture that can dynamically modify the number of bits in the low-voltage domain (Lbit) and in the high-voltage domain. Real-time modification of Lbit, depending on the error tolerance of an application, can result in a better energy-accuracy tradeoff for mobile multimedia applications.

II. ACCURACY-AWARE SRAM ARCHITECTURE
The overall accuracy-aware reconfigurable SRAM architecture is presented in Fig. 1. We consider the conventional array architecture with column multiplexing. In a column-multiplexing-based architecture, particular bits of different words are grouped together (hereafter referred to as a MUX group). A single read/write circuit is used per MUX group (Fig. 1). Therefore, an entire MUX group needs to be at the same voltage level. The key requirement of the proposed approach is that the cell supply, bitline precharge, and write voltage (i.e., the voltage applied to the bitline of logic 1 while writing) need to be at the same voltage level. To achieve this, we propose the following modifications to the array design. A column-based supply network for cells is used (as discussed in [9]). The supply networks of all the cells in a MUX group [i.e., number of rows (NROW) × number of columns in a MUX group (NMUX)] are connected together. The supply networks of different MUX groups are disconnected to allow bit-by-bit reconfiguration.
Reconfigurable unit: A set of bits that are connected to the same voltage domain. Reconfiguration can be performed only on a unit-by-unit basis. The cell supply of a MUX group is also connected to the corresponding precharge device and the write driver supply. Therefore, changing the precharge voltage using a voltage switching network (the top PMOSes in Fig. 1) is sufficient to change the cell supply, bitline, and write voltage for the array. The major challenge is to reconfigure the WL signal. In a regular array, the WL signals of all the bits are connected. To address this challenge, we have developed a reconfigurable wordline structure, as shown in Fig. 1. In this structure, the local WLs (LWLs) of a row of cells in a MUX group are connected together. The LWLs of different MUX groups are disconnected. The LWLs are connected to the outputs of the reconfiguring inverters (ReconfigInvs), which have the global WL (GWL) as their input. The supply of the ReconfigInvs is connected to the supply voltage of that MUX group. For a selected row, GWL is '0', which makes the LWL ('1') voltage the same as the cell supply and bitline voltage. The driver of the GWL operates at nominal voltage to eliminate the short-circuit current through the ReconfigInvs in the nominal-voltage units. The ReconfigInvs can also be applied to the precharge and column-selection networks for power saving (this is not necessary for failure reduction).
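The unit-by-unit voltage assignment described above can be illustrated with a minimal Python sketch. It assumes the convention used for the voltage reconfiguration network in Section III (select signal 0 puts a bit in the low-voltage domain) and the reconfiguration length RL defined there (bits per reconfigurable unit). The function names `reconfig_bits` and `select_signals` are hypothetical and are not taken from the paper.

```python
def reconfig_bits(n_bits: int, rl: int) -> int:
    """Control bits to decode so that each of the n_bits / rl units gets a signal."""
    if n_bits <= 0 or rl <= 0 or n_bits % rl != 0:
        raise ValueError("n_bits must be a positive multiple of rl")
    n_units = n_bits // rl
    # Integer form of ceil(log2(n_units)); avoids floating-point rounding.
    return (n_units - 1).bit_length()


def select_signals(n_bits: int, rl: int, l_bit: int) -> list[int]:
    """Per-bit select signals: 0 = low-voltage domain, 1 = nominal voltage.

    The l_bit lowest-order bits are placed in the low-voltage domain. Because
    a reconfigurable unit switches as a whole, l_bit must be a multiple of rl.
    """
    if n_bits <= 0 or rl <= 0 or n_bits % rl != 0:
        raise ValueError("n_bits must be a positive multiple of rl")
    if not 0 <= l_bit <= n_bits or l_bit % rl != 0:
        raise ValueError("l_bit must be a multiple of rl within [0, n_bits]")
    return [0 if j < l_bit else 1 for j in range(n_bits)]


# 16-bit pixels: RL = 1 and RL = 4 (the two cases discussed in Section III-C).
print(reconfig_bits(16, 1), reconfig_bits(16, 4))
# 8-bit pixel with the four low-order bits in the low-voltage domain.
print(select_signals(8, 1, 4))
```

The sketch models only the control view (which bits sit in which voltage domain); it says nothing about the electrical behavior of the switching PMOSes or the ReconfigInvs.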

I. INTRODUCTION
Reducing memory power is critical to improving the energy efficiency of mobile multimedia systems performing video/audio processing applications [1], [2]. Scaling the operating voltage of the memory array can reduce the power dissipated in each memory access. However, a lower supply voltage increases the parametric failures (access, disturb, and write) in the SRAM array caused by manufacturing and intrinsic device variations [3]. This leads to a higher bit-error rate from memory under voltage scaling and limits the opportunity for voltage scaling and power saving. However, image processing and multimedia applications can provide acceptable quality of service even with a non-negligible bit-error rate. Aggressive voltage scaling can be performed in SRAM arrays for multimedia and communication applications by exploiting this inherent error tolerance [4], [5]. However, aggressive scaling of the voltage in all bits of a pixel increases the error rates in all bits and leads to faster degradation of image quality. It is well known that the lower-order bits (LOBs) of an image pixel (and of many other signals) are more tolerant to noise than the higher-order bits (HOBs) in typical multimedia applications. Hence, appreciable power savings with minimal image quality degradation can be achieved by operating LOBs at a low voltage and HOBs at nominal voltage [6]. Yi et al. have shown that such spatial voltage scaling can be very useful for improving the effective yield of video memories [7]. Effective application of spatial voltage scaling for the accuracy-energy tradeoff requires a feasible SRAM architecture that can dynamically reconfigure the error behavior of the array. The SRAM array in a mobile system can be shared among different applications with varying error tolerance: image processing applications have inherent error tolerance, but data-centric applications do not.
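To make the LOB/HOB argument concrete, the following Python sketch shows how much a single bit flip changes a pixel value depending on the bit position. The helper names are hypothetical. The expected-squared-error expression is an added illustration, not a result from the paper: it assumes each of the Lbit low-order bits flips independently with the same probability and that the stored bits are uniformly random, so cross terms cancel on average.

```python
def flip_bit(pixel: int, k: int) -> int:
    """Return the pixel value with bit k inverted (bit 0 is the lowest-order bit)."""
    return pixel ^ (1 << k)


def max_error_low_bits(l_bit: int) -> int:
    """Largest possible value change when only the l_bit low-order bits can flip."""
    return (1 << l_bit) - 1


def expected_squared_error(p_flip: float, l_bit: int) -> float:
    """Expected squared pixel error under the stated independence assumptions.

    Bit k contributes p_flip * (2**k)**2, and the sum over k < l_bit is
    p_flip * (4**l_bit - 1) / 3.
    """
    return p_flip * (4 ** l_bit - 1) / 3


pixel = 200
print(abs(flip_bit(pixel, 0) - pixel))  # error from a lowest-order bit flip
print(abs(flip_bit(pixel, 7) - pixel))  # error from a highest-order bit flip
print(max_error_low_bits(4))            # bound when only 4 LOBs are at low voltage
```

A flip in bit k changes the value by exactly 2^k, which is why confining faults to the low-order bits bounds the per-pixel error.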
The characteristics of the image stored in an array also strongly impact the effect of errors in LOBs on the overall image quality degradation. Hence, the SRAM array needs to
Manuscript received October 27, 2008; revised February 13, 2009, June 02, 2009, and July 29, 2009. First published October 06, 2009; current version published December 27, 2010. This work was supported in part by the National Science Foundation under Grant CCF-0916083. M. Cho, W. Wolf, and S. Mukhopadhyay are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: mcho8@gatech.edu; wolf@ece.gatech.edu; saibal@ece.gatech.edu). J. Schlessman is with Princeton University, Princeton, NJ 08544 USA (e-mail: jschless@princeton.edu). Digital Object Identifier 10.1109/TVLSI.2009.2031468

III. DESIGN CONSIDERATIONS
Reconfiguration can be performed only on a unit-by-unit basis. A reconfigurable unit is defined as a set of bits that are

1063-8210/$26.00 © 2009 IEEE


without (Dorg) and with (D) the reconfiguring inverter, for low and high supply voltages of the reconfiguring inverter, is plotted in Fig. 2(a). Given Vlow, we first estimate the delay of the reconfigurable WL for different pMOS widths and find the width that provides the minimum WL delay in the low-voltage mode. The maximum allowable pMOS width can be obtained from the cell layout. We consider IBM 130 nm CMOS technology to estimate the area overhead. The layout of one row of the MUX group shows that the inverter area can be kept under half a cell width, resulting in 6% overhead for the core SRAM array [Fig. 2(b)]. The smaller of the maximum pMOS width from layout and the optimal pMOS width from delay is used as the final pMOS width. A minimum-size nMOS is used to reduce energy. ReconfigInvs for the precharge and column-selection networks can be sized using the same principle.

B. Voltage Reconfiguration Network
The reconfiguration of the array is performed by changing the voltage levels for a number of LOBs to high or low voltage. The voltage reconfiguration network consists of two PMOSes: one connected to the low voltage and the other to the high voltage. The gate voltage of the PMOSes for the j-th bit is controlled by the j-th select signal (selj). To reconfigure a bit to the low voltage, selj is set to 0. Note that this architecture also allows uniform voltage scaling, which can be performed by setting selj = 0 for all the bits and changing Vlow. The switching PMOSes in the voltage reconfiguration network for the array and precharge devices need to ensure a low voltage drop. A higher drop will reduce the effective cell supply and can degrade the read margin. However, as the same pMOS also acts as the supply for the ReconfigInv, the wordline and cell supply voltages remain the same, which reduces this effect. The width of the pMOS only marginally impacts the switching speed of the ReconfigInvs. Hence, it has minimal effect on access and write failures.
As the pMOS network is at the array periphery, its impact on the overall array area is negligible [9].

C. Reconfiguration Length
Implementation of a given RL can be performed by putting multiple bits in the same voltage network. For example, RL = 2 implies that the MUX groups for bit 0 and bit 1 share the same cell supply and bitline voltage network, the same local wordline, and the same reconfiguration inverter. A shorter RL improves power saving and provides more room for the accuracy-power tradeoff. Note that RL does not impact the area overhead, performance, or failure rates of the cell. RL determines the number of reconfiguring bits, i.e., the input signals that are decoded to generate control signals for the voltage configuration network. The number of reconfiguration bits is a key limiting factor for a smaller RL with 16- or 24-bit image storage. For example, if we consider a 16-bit image and RL = 1, then a separate control signal is required for the voltage configuration network of each bit. These signals can be generated by decoding 4 reconfiguration bits. If RL = 4, we need 4 different control signals: the 1st signal controls the voltage configuration network for bit 0 to bit 3, the 2nd for bit 4 to bit 7, the 3rd for bit 8 to bit 11, and the 4th for bit 12 to bit 15. These signals can be generated by decoding 2 reconfiguration bits.

D. Reconfiguration Time
The reconfiguration of the array essentially implies a dynamic change in the voltage of MUX groups from low to nominal level, or vice versa, for a number of bits. The reconfiguration is performed at run-time and can always be revoked. The array can move between high-error and low-error modes at run-time. The error characteristics in the low-error mode (i.e., number of low-voltage bits) can also be changed at run-time. The choice of the size of the pMOS devices in the voltage reconfiguration network determines the reconfiguration time (the time required

Fig. 1. Proposed accuracy-aware low-power array.

connected to the same voltage domain. The number of bits in each reconfigurable unit is defined as the reconfiguration length (RL). The efficiency of the reconfigurable solution depends on the RL. Although the basic structure of the regular array remains unchanged, the following elements need careful design consideration.

A. Reconfiguring Inverters
One ReconfigInv is shared by a MUX group and needs to be designed properly. The low-to-high transition of the local WL is the critical output transition for the ReconfigInv. Therefore, the critical transistor is the pMOS, which needs to be sized to minimize the wordline delay, particularly for the low-voltage domain. The total WL delay with ReconfigInvs is given by

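$$
T = \underbrace{\frac{\left[N_{col}\,C_{bitWL} + N_{bit}\,(u\,C_{ax} + C_N)\right] V_{high}}{s\,I_{ax}}}_{\text{Driver Delay}}
  + \underbrace{\frac{\left(2 N_{mux}\,C_{ax} + N_{mux}\,C_{bitWL}\right) V}{u\,\beta\,(I_{ax}/r)}}_{\text{ReconfigInv Delay}}
\tag{1}
$$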

where $u = W_P(\text{ReconfigInv})/W_{AX}$; $C_N$ is the capacitance of the NFET of the ReconfigInvs; $r$ is the nMOS-to-pMOS current ratio; $s = W_N(\text{driver})/W_{AX}$; $C_{bitWL}$ is the wordline capacitance per cell; $C_{ax}$ is the gate capacitance of the access device; and $I_{ax}$ is the effective current of the access device. $W_P$ is the width of the pMOS device in the reconfiguration inverter, $W_N$ is the width of the nMOS device in the reconfiguration inverter, and $W_{AX}$ is the width of the access transistor in the SRAM cell. The factor $\beta$ represents the pMOS strength in the different voltage domains (i.e., $V = V_{low}$ or $V_{high}$). The total delay can be minimized as:

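$$
\frac{\partial T}{\partial u} = 0 \;\Rightarrow\; u_{opt} = \sqrt{\frac{s\,r\,N_{mux}\,\left(2 + C_{bitWL}/C_{ax}\right) V}{\beta\,N_{bit}\,V_{high}}}
\tag{2}
$$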
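As a numerical cross-check of the delay model and its minimizer, the sketch below evaluates the wordline delay of Eq. (1) and the closed-form optimum. It is an illustration under assumptions: the symbol `beta` stands for the pMOS strength factor, the optimum is taken as the square root that follows from setting the derivative of the delay to zero, and every parameter value is a hypothetical placeholder rather than data from the paper.

```python
import math


def wl_delay(u, *, n_col, n_bit, n_mux, c_bitwl, c_ax, c_n,
             i_ax, s, r, beta, v, v_high):
    """Total WL delay: GWL driver delay plus ReconfigInv delay, as in Eq. (1)."""
    driver = (n_col * c_bitwl + n_bit * (u * c_ax + c_n)) * v_high / (s * i_ax)
    reconfig_inv = (2 * n_mux * c_ax + n_mux * c_bitwl) * v / (u * beta * (i_ax / r))
    return driver + reconfig_inv


def u_opt(*, n_bit, n_mux, c_bitwl, c_ax, s, r, beta, v, v_high, **_unused):
    """pMOS-to-access-transistor width ratio at which dT/du = 0."""
    return math.sqrt(
        s * r * n_mux * (2 + c_bitwl / c_ax) * v / (beta * n_bit * v_high)
    )


# Hypothetical placeholder values (SI units), chosen only to exercise the formulas.
params = dict(n_col=128, n_bit=8, n_mux=16, c_bitwl=0.2e-15, c_ax=0.1e-15,
              c_n=0.05e-15, i_ax=20e-6, s=10.0, r=2.0, beta=0.3,
              v=0.4, v_high=1.0)

u_star = u_opt(**params)
print("u_opt =", u_star)
print("T(u_opt) =", wl_delay(u_star, **params))
```

Because the delay has the form a + b·u + c/u with b, c > 0, it is convex for u > 0, so the stationary point is the unique minimum; evaluating `wl_delay` slightly above and below `u_star` confirms this for any positive parameter set.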

A large value of uopt will be required for the inverters in the low-voltage domain. However, increasing the pMOS size increases the power dissipation in the GWL driver and the area overhead of the array. Therefore, an engineering choice of pMOS size is required. Fig. 2(a) shows the impact of the ratio of pMOS transistor width to SRAM access transistor width on the WL delay. The ratio of the WL delays


Fig. 2. Physical design considerations: (a) effect of the pMOS size of the ReconfigInv on WL delay, where D is the WL delay through the reconfiguration network and Dorg is the original WL delay; (b) layout of one row of the proposed reconfigurable SRAM showing the reconfiguring inverters.

to change the voltage level of a bit). This is because, during reconfiguration, the supply line for the MUX group associated with a bit needs to be discharged from high voltage to low voltage (while reconfiguring from low-error to high-error mode) or charged from low voltage to high voltage (while reconfiguring from high-error to low-error mode). Ideally, the voltage reconfiguration network for a MUX group can be designed using two properly sized pMOS devices. However, to improve the voltage stability, we consider a distributed network (i.e., one voltage reconfiguration network with two pMOS devices per column). The time required to change the common voltage (i.e., the cell supply, the supply of the ReconfigInvs, the supply of the precharge transistors, and the write driver) is estimated considering this distributed pMOS network per column. The capacitance of the supply line is estimated considering 256 cells, precharge PMOSes, write drivers, and 256 ReconfigInvs. Note that ReconfigInvs are shared across a MUX group, so this provides a worst-case estimate. The metal capacitance of the shared supply line is estimated from the height of the cells and a metal capacitance of 0.2 fF/µm, and is included in the analysis. We consider the PMOSes in the voltage reconfiguration network per column to have 5× the width of the cell PMOSes, which results in an area overhead of approximately 1 SRAM cell per column of 256 cells (0.4% area overhead). With this additional area, and considering predictive 70 nm technology, the estimated reconfiguration time is 4 ns. For a 250 MHz clock, this corresponds to 1 clock cycle of reconfiguration time. Since all the columns are reconfigured in parallel, the total reconfiguration time for the array is the same (i.e., 1 clock cycle). If we consider reading/writing a 256 × 256 pixel image, then even if reconfiguration is performed for every image, the performance overhead due to reconfiguration time is negligible. The access protocol will require an additional clock cycle per image for reconfiguration.
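The overhead figures quoted above follow from simple arithmetic, sketched below in Python. The function names are hypothetical, and the per-image access count assumes one pixel is read or written per clock cycle, which the text does not state; the real access pattern may differ.

```python
def reconfig_cycles(t_reconfig_ps: int, f_clk_mhz: int) -> int:
    """Clock cycles needed to cover the reconfiguration time (rounded up).

    Integer units (picoseconds and megahertz) avoid floating-point rounding:
    time_ps * freq_MHz / 1e6 is the dimensionless cycle count.
    """
    return (t_reconfig_ps * f_clk_mhz + 999_999) // 1_000_000


def area_overhead(extra_cells: int, cells_per_column: int) -> float:
    """Fractional area overhead of the switching PMOSes per column."""
    return extra_cells / cells_per_column


def cycle_overhead(extra_cycles: int, access_cycles: int) -> float:
    """Fractional performance overhead of reconfiguring once per image."""
    return extra_cycles / access_cycles


cycles = reconfig_cycles(4000, 250)           # 4 ns at 250 MHz
print(cycles)
print(area_overhead(1, 256))                  # about 1 cell per 256-cell column
print(cycle_overhead(cycles, 256 * 256))      # ASSUMED: one pixel per cycle
```

With these numbers, one extra cycle per 65,536-cycle image access is what makes the performance overhead negligible.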
The reconfiguration time can be reduced at the cost of additional area.

IV. SIMULATION RESULTS
The effect of the proposed architecture is evaluated on a standard 8-bit grayscale test image suite available in [10], considering 250 MHz operation. The total power of the array is computed considering the read access, write access, and active leakage power. The read/write power is computed considering the wordline (including ReconfigInvs), bitline, and cell switching energy [8]. The power savings were computed with reference to a regular array [all bits at nominal voltage (1 V)]. A system-level fault simulation methodology was used to simulate the error characteristics of the array [8]. First, circuit simulations (including the ReconfigInvs and voltage configuration PMOSes) are performed to estimate the distributions of read margin, write margin, and access time of cells at different voltages. Next, a random array instance is created in which the images are mapped to evaluate the effect of

faults. The locations of faulty cells are created by randomly associating cell margin values (generated from the estimated distributions) with each cell and comparing them against a target [8]. A weak value indicator (weak 0 or weak 1) is also associated with each cell. While mapping the image, a bit is flipped if it maps to a faulty cell location and the bit value matches the weak value indicator of the cell. The image quality degradation is estimated by computing the Mean Structural Similarity (MSSIM) index (proposed in [11]) between the original and modified images. If two images are visually identical, MSSIM is 1. A degradation in image quality results in a lower MSSIM. We consider the effect of the different types of failures as well as the worst-case condition, in which a cell is considered faulty if it has any type of fault and any weak value indicator.

A. Effect of Voltage Scaling on Image Quality and Power
Fig. 3(a) and (b) show that significant power saving can be obtained with a graceful degradation of image quality for different images. As expected, Lbit = 8 provides more power saving at the cost of quality. Using Lbit = 4, 45% power saving can be obtained compared to a regular array, with a 10% reduction in quality. The power saving is 20% higher than the saving achievable by reducing the voltage of all bits (blind scaling) at the same degradation level. Considering that memory power in multimedia applications can be as much as 50% of system power [12], the proposed architecture with Lbit = 4 and Vlow = 0.4 V can result in an overall 23% savings in system power (0.45 × 0.50 ≈ 0.23).

B. Reconfiguration: Accuracy-Energy Tradeoff
We consider the effect of reconfiguring Lbit at a given low voltage level [Fig. 3(c)]. Increasing the number of reconfiguring bits degrades the quality while increasing the power saving over the regular array. The quality degrades slowly up to the 6th bit, but reconfiguring the 7th and 8th bits to low-voltage mode can result in significant error.
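The fault-mapping step of the simulation methodology can be sketched in Python as follows. This is a simplified illustration, not the authors' simulator: cell margins are drawn from a single Gaussian per voltage domain, and the means, sigma, and target are hypothetical placeholders standing in for the circuit-simulated distributions. The function names are also hypothetical.

```python
import random


def make_fault_map(n_pixels, n_bits, l_bit, rng, *, mu_low, mu_nom, sigma, target):
    """Assign each cell a (faulty, weak_value) pair.

    A cell is faulty when its randomly drawn margin falls below `target`.
    Bits j < l_bit use the low-voltage margin distribution (ASSUMED Gaussian).
    """
    fault_map = []
    for _ in range(n_pixels):
        row = []
        for j in range(n_bits):
            mu = mu_low if j < l_bit else mu_nom
            margin = rng.gauss(mu, sigma)
            weak_value = rng.randint(0, 1)  # weak 0 or weak 1
            row.append((margin < target, weak_value))
        fault_map.append(row)
    return fault_map


def read_back(pixels, fault_map, n_bits):
    """Flip a stored bit only if its cell is faulty and the bit equals the weak value."""
    result = []
    for value, row in zip(pixels, fault_map):
        for j in range(n_bits):
            faulty, weak_value = row[j]
            if faulty and ((value >> j) & 1) == weak_value:
                value ^= 1 << j
        result.append(value)
    return result


rng = random.Random(0)
pixels = [rng.randrange(256) for _ in range(16)]
fmap = make_fault_map(len(pixels), 8, 4, rng,
                      mu_low=0.05, mu_nom=0.30, sigma=0.04, target=0.0)
corrupted = read_back(pixels, fmap, 8)
print(sum(a != b for a, b in zip(pixels, corrupted)), "of", len(pixels), "pixels changed")
```

A quality metric such as MSSIM would then be computed between `pixels` and `corrupted`; that step is omitted here to keep the sketch dependency-free.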
Further, a higher Vlow provides more room for reconfiguration, with a lower power saving in each configuration. If Vlow and Lbit are reconfigured simultaneously, a better energy-accuracy tradeoff and higher power saving can be obtained [Fig. 4(a) and (b)]. However, as fine-grain change in the supply voltage increases the design complexity, a more practical approach is to select a low voltage level and use Lbit for reconfiguration.

C. Effect of Chip-to-Chip Variation in Fault Locations
Since, for a given voltage and Lbit value, different chips will have different fault locations, we performed a Monte Carlo simulation to analyze this effect. Each instance of the Monte Carlo simulation results in different fault locations, even for the same values of Vlow and


Fig. 3. Effect of supply voltage scaling on (a) image quality and (b) power, and (c) effect of the number of LOBs.

Fig. 4. Effect of Vlow and Lbit on (a) quality and (b) power.

Fig. 6. Effect of image quality on reconfiguring bit: (a) brightness, (b) contrast, (c) noise, (d) blurring/sharpening.

times. To capture transient noise, an image was read repeatedly and a normal variation was applied to the read margin of the cells (on top of the read margin obtained after manufacturing variations) during each read operation. This results in transient disturb failures in the image during each read. We compute the MSSIM after each read operation. It is observed that, as blind scaling makes HOB cells weak, the image becomes more susceptible to transient noise [Fig. 5(b)]. This shows that the effect of error propagation is smaller with spatial voltage scaling, as faults are localized in the LOBs.

E. Effect of Image Property
Fig. 5. (a) Quality degradation of different images considering multiple MC runs; (b) impact of transient noise on repeated reading of the same image (MSSIM normalized to that of the 1st read).

Lbit. A consistent improvement in image quality was observed with spatial voltage scaling [Fig. 5(a)]. The variation in quality degradation is higher when the voltages of all bits are scaled uniformly.
D. Effect of Transient Noise
Cells that are in the weak corner (but may not be failing) can fail due to transient noise (such as thermal or supply noise) at different

We evaluate the impact of different image characteristics on reconfigurability. Images with less brightness show more error under spatial voltage scaling [Fig. 6(a)]. The MSSIM of the low (high) brightness image after memory access is computed with respect to the input low (high) brightness image (not with respect to the original image). Images with higher contrast show less degradation when Lbit is reconfigured to higher values [Fig. 6(b)]. Next, we apply different amounts of salt-and-pepper (S&P) noise to a given image. Images with a low noise level are observed to maintain a target quality with a higher Lbit [Fig. 6(c)]. We evaluate the correlation of frequency characteristics with quality degradation under reconfiguration. We observe that the sharpened image (increased high-frequency components)


allows a higher level of reconfiguration than blurred images for the same quality [Fig. 6(d)].

A Low-Jitter ADPLL via a Suppressive Digital Filter and an Interpolation-Based Locking Scheme
Hsuan-Jung Hsu and Shi-Yu Huang

V. CONCLUSION
We have presented a reconfigurable SRAM architecture for low-power mobile multimedia applications. The proposed architecture uses spatial voltage scaling, where cells storing the lower-order bits of an image pixel are operated at a lower voltage and the higher-order cells are operated at nominal voltage. The results demonstrate that 45% power savings can be obtained with marginal image quality degradation.
Abstract: In this brief, we present a low-jitter and wide-range all-digital phase-locked loop (ADPLL). This ADPLL achieves low output clock jitter by a number of schemes. First, the phase is locked quickly through a predictive phase-locking scheme. Then, the jitter is further reduced by a suppressive digital loop filter. Finally, an interpolation-based locking scheme is utilized to enhance the resolution of the digitally controlled oscillator (DCO) so as to further reduce the phase error and jitter. Simulation results show that the jitter performance is very close to that of the free-running DCO. Measurement results show that the two reported jitter figures are 56 and 7.28 ps, respectively, when the output clock of the ADPLL is running at 600 MHz.
Index Terms: All-digital phase-locked loop (ADPLL), digital filter, digitally controlled oscillator (DCO), frequency interpolation, locking algorithm.

REFERENCES
[1] S. Yang et al., "Power and performance analysis of motion estimation based on hardware and software realizations," IEEE Trans. Computers, vol. 54, no. 6, pp. 714-716, Jun. 2005.
[2] G. Chen and M. Kandemir, "Optimizing address code generation for array-intensive DSP applications," in Proc. Int. Symp. Code Generation Optimization, 2005, pp. 141-152.
[3] S. Mukhopadhyay et al., "Modeling of failure probability and statistical design of SRAM array for yield enhancement in nano-scaled CMOS," IEEE Trans. Comput.-Aided Design, vol. 24, no. 5, pp. 1859-1880, Dec. 2005.
[4] F. Kurdahi et al., "Error aware design," in Proc. 10th Eur. Conf. Digital System Design Architectures, Aug. 2007, pp. 8-15.
[5] A. K. Djahromi et al., "Cross layer error exploitation for aggressive voltage scaling," in Proc. IEEE Int. Symp. Quality Electronic Design, Mar. 2007, pp. 192-197.
[6] J. George, B. Marr, B. E. S. Akgul, and K. V. Palem, "Probabilistic arithmetic and energy efficient embedded signal processing," in Proc. CASES, Oct. 2006, pp. 168-198.
[7] K. Yi, S. Y. Cheng, F. Kurdahi, and A. Ettawil, "A partial memory protection scheme for higher effective yield of embedded memory for video data," in Proc. ACSAC, 2008, pp. 273-278.
[8] M. Cho, J. Schlessman, W. Wolf, and S. Mukhopadhyay, "Accuracy-aware SRAM: A reconfigurable low power SRAM architecture for mobile multimedia applications," in Proc. ASPDAC, 2009, pp. 823-828.
[9] K. Zhang, V. Bhattacharya, Z. Chen, F. Hamzaoglu, D. Murray, N. Vallepalli, Y. Wang, B. Zheng, and M. Bohr, "A 3-GHz 70 MB SRAM in 65 nm CMOS technology with integrated column-based dynamic power supply," in Proc. ISSCC, 2005, pp. 474-611.
[10] [Online]. Available: http://www.imageprocessingplace.com/
[11] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600-612, Apr. 2004.
[12] T. Liu, T. Lin, S. Wang, W. Lee, J. Yang, K. Hou, and C. Lee, "A 125 µW, fully scalable MPEG-2 and H.264/AVC video decoder for mobile applications," IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 161-169, Jan. 2007.

I. INTRODUCTION
In the system-on-chip (SOC) era, it is common for a chip to be embedded with its own clock generator. The phase-locked loop (PLL) is an important clocking device for these digital systems. Traditionally, a PLL is partly made of analog components. However, an analog control signal is often subject to digital switching noise, so more design effort is needed when operating in a noisy digital environment. Moreover, the loop filter in traditional analog PLLs is usually composed of passive devices, such as resistors and capacitors, leading not only to large area but also to low portability across processes. By contrast, an all-digital phase-locked loop (ADPLL) has not only higher noise immunity but also a number of other attractive features, including better testability, programmability, and ease of integration into digital systems. With the aid of digital filtering techniques, the passive devices can be avoided to save cost and increase the portability of ADPLLs. Numerous works on ADPLLs have been presented in the past [1]-[4], [6], [9]-[12], in which the digitally controlled oscillator (DCO) is at the heart. Unlike the traditional analog voltage-controlled oscillator (VCO), the oscillator in an ADPLL is digitally controlled, so some quantization error exists. High-resolution DCO design is thus pursued to catch up with the performance of the analog VCO, where the resolution can be defined as the minimum delay difference between the clock periods of two adjacent frequencies that a DCO can generate. How to design the fine-resolution delay cell was the early challenge of DCO design. Several types of DCOs have been analyzed and summarized in [12]. In [8], the average resolution of the DCO could be enhanced to as fine as 1 ps with the insertion of more delay cells and control lines, implying that the performance bottleneck of the ADPLL is now shifting to other components, such as the phase detectors (PDs) and/or the loop filters.
In addition to the DCO, the resolution of an ADPLL is also dictated by the phase-frequency detector (PFD), or simply PD in some ADPLLs. Recently, the design of a high-resolution PD that is able to distinguish a minute phase difference is becoming more desirable. Sheng
Manuscript received January 15, 2009; revised May 18, 2009. First published October 13, 2009; current version published December 27, 2010. The authors are with the Department of Electrical Engineering, National Tsing-Hua University, Hsin-Chu 30013, Taiwan (e-mail: hjhsu00@larc.ee.nthu.edu.tw; syhuang@ee.nthu.edu.tw). Digital Object Identifier 10.1109/TVLSI.2009.2030410
