
Fast Parallel CRC & DBI Calculation for High-speed Memories: GDDR5 and DDR4

Jinyeong Moon
DRAM Design Team II, Hynix Semiconductor Inc., Icheon, South Korea, jinyeong.moon@gmail.com

Joong Sik Kih
Department of EECS, Hanyang University, Seoul, South Korea, jskih119@gmail.com

Abstract: In this paper, a new XOR gate and an architecture for parallel calculation of CRC and DBI are proposed. With this proposal, speed constraints in high-speed DRAMs such as GDDR5 and DDR4 SDRAM are relaxed. This helps minimize the latency increase, and hence the effective bandwidth loss, caused by the CRC and DBI functions.

I. INTRODUCTION

Recent high-speed DRAMs such as GDDR5 and DDR4 SDRAM support Cyclic Redundancy Check (CRC) and Data Bus Inversion (DBI) functions by default. GDDR5 and DDR4 SDRAM use the same 8-bit CRC, ATM-8 HEC, whose polynomial is x^8+x^2+x^1+1, and the same DBI-dc function. The DBI operation is well described in [1, 2]. If both functions are enabled simultaneously, it is natural to apply the DBI function first and then calculate the CRC value, since the DBI function may alter the actual data on the channel, on which the CRC calculation is based. However, this serial operation can become a problem at high speed, as in the DDR4 SDRAM case below.

To maintain seamless data transfer in DDR4 SDRAM, the CRC operation time should not exceed the CAS-to-CAS delay time (tCCD). The first constraint is

    CRC time < tCCD = 5 nCK    (1)

Referring to Fig. 1, the minimum CAS Latency (CL) for DDR4 with CRC support can be calculated as follows.

    CLmin = tCore + Max(0, tCRC - tPrep) + tAlign    (2)

To avoid a latency increase (and therefore an effective bandwidth loss) caused solely by CRC, it is best to make the Max function return a value as close to 0 as possible. So, from a latency point of view, DDR4 SDRAM has a second constraint, which is stronger than the previous one.

    tCalc < 4 nCK - Flight time1 - Flight time2    (3)

After sufficient margining of flight times and package non-idealities, the inequality leaves about 1.2 ns of calculation time at 3.2 Gbps, the expected maximum data rate of DDR4 SDRAM. (Here 500 ps is assumed for each flight time, and 20% is taken off the final value as a package margin.) Considering that a CRC8 block and a DBI block usually require 6 and 4 stages of XOR blocks, respectively, a delay of 120 ps (or less) should be allocated to each XOR gate in a serial CRC+DBI operation, which seems hard to meet in slow PVT conditions in a conventional DRAM process. Therefore CL would have to be increased to guarantee proper timing slots, which would eventually lead to an effective bandwidth loss for the system.

Fig. 1. Minimum CAS Latency Timing Diagram

Fig. 2. Data Packet Format (a) GDDR5's CRC has the same timing as other DQs, (b) DDR4's CRC has 8-UI (4 nCK) free preparation time

Fig. 1 and 2 illustrate that things become worse in GDDR5, because GDDR5's DQ data and DBI-adjusted CRC values must be shipped at the same time, starting from Burst 0, whereas DDR4 SDRAM has 4 nCK of free preparation time before sending out DBI-adjusted CRC values. So, in GDDR5, improving CLmin by fast CRC and DBI calculation has a direct impact on speed performance, which is a strong motivation for this work.

Furthermore, besides speed performance, a correct XOR operation at a very low voltage, such as 0.8 V, is not easily achieved, and area issues arise as well. To resolve these high-speed and low-voltage difficulties, a faster and more robust XOR gate and a new architecture for parallel calculation and correction of CRC and DBI are presented in the subsequent sections.
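The calculation-time budget quoted in the introduction can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original paper: the function name and parameter names are hypothetical, and the only inputs are the values stated in the text (3.2 Gbps, a 4 nCK window, 500 ps per flight time, a 20% package margin, and 6 + 4 XOR stages).

```python
def crc_dbi_timing_budget(data_rate_gbps=3.2, window_nck=4,
                          flight_ns=0.5, package_margin=0.20,
                          crc_stages=6, dbi_stages=4):
    """Evaluate constraint (3) with the margins stated in the text.

    All names here are illustrative; the numeric defaults are the
    values quoted in the introduction.
    """
    # Double data rate: two bits per clock, so tCK = 2 / data rate.
    t_ck_ns = 2.0 / data_rate_gbps
    # Right-hand side of (3): 4 nCK minus the two flight times.
    window_ns = window_nck * t_ck_ns - 2 * flight_ns
    # Take the package margin off the remaining window.
    budget_ns = window_ns * (1.0 - package_margin)
    # Serial CRC+DBI: the budget is shared by all XOR stages in series.
    per_gate_ps = 1000.0 * budget_ns / (crc_stages + dbi_stages)
    return t_ck_ns, budget_ns, per_gate_ps


t_ck, budget, per_gate = crc_dbi_timing_budget()
print(f"tCK = {t_ck:.3f} ns, budget = {budget:.2f} ns, per XOR = {per_gate:.0f} ps")
```

With the default values this evaluates to a budget of roughly 1.2 ns and about 120 ps per XOR stage, which agrees with the figures given in the text.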

978-1-4244-9474-3/11/$26.00 2011 IEEE


II. XOR GATE

Even though many XOR circuits are available in [3-5], most of them have an area issue or a low-voltage problem. Fig. 3 (a) is a slightly modified version of the XOR circuit proposed in [3], and is confirmed to work well at 4 Gbps with a VDD of 1.5 V (Hynix GDDR5 [1] used it to implement the CRC and DBI blocks). However, this XOR gate has a serious problem in low-voltage operation, in that the internal node to which the keeper is connected is not full-rail. The internal node reaches VDD - VTN instead of VDD for the input combination A=B=1, and this is why special NMOS transistors with a low threshold voltage (LVT) are used. Since an LVT NMOS has a larger leakage current, the keeper is added as well. In low-voltage operation, for example at 0.8 V, at which DDR4 SDRAMs might work in the future, VDD is aggressively scaled, but the threshold voltage does not change much. Further accounting for additional voltage couplings and sudden voltage drops, this circuit, even with LVT NMOS, fails to deliver a proper XOR operation due to the lack of noise margin. Our DDR4 SDRAM chip simulation shows that it starts to fail at a VDD of 0.95 V in the presence of a maximum VDD noise of 0.35 V. Fig. 3, 4(b), 5, and 6 in [4], Fig. 3 and 4 in [5], and Fig. 3 (e, g, i, j) in [6] have a similar low-voltage issue, while all the internal nodes in the proposed XOR, Fig. 3 (c), are capable of a full-rail swing.

Since DDR4 SDRAM uses more than 700 XOR gates for Read/Write CRC and DBI, the pure active area of the XOR blocks already takes up a large portion of the crowded peripheral circuits, making area one of the most important factors. Fig. 3 (a, b, c, d, f, h) in [6] have at least 8 transistors, while Fig. 3(a) and the proposal's 2-input equivalent sub-block require 7 (or 9) and 6.67 effective transistors, respectively.

Fig. 3 (c) is the proposed XOR gate. Architecturally, it is a 4-input gate comprised of 3 sub-blocks, each containing a 2-input equivalent XOR/XNOR gate. The first stage contains two 2-input XOR gates, and their outputs are XNORed first and then inverted by the final driver. The reason for alternating the XOR/XNOR placement is that the 2-input XOR sub-block, as in Fig. 4(a) in [4], can form an indefinitely long chain of transmission gates, depending upon the input combination, if placed serially. Without intermediate repeaters, such a long transmission gate chain significantly deteriorates delay performance. Therefore inverter gates are used to block the calculation path every several stages. Similar 8-input XOR gates would have an XNOR+INV substitution every third stage.

An XOR chain which imitates an actual CRC8 block is constructed for each implementation in Fig. 3, with an effort to make an asymmetric propagation path as long as possible. The result is shown in Fig. 4. XOR4 and XOR8 stand for the proposed 4-input XOR gate and the similar 8-input XOR gate with the substitution in its third stage, respectively. The test inputs are 64 random cases plus 16 hand-picked transition cases intended to make the propagation path as long as possible for XOR4 and XOR8. The coefficients in the legends of Fig. 4 indicate the number of cascades each implementation needs to construct a chain.

Fig. 4. XOR Performance Comparison

The result is from a parasitic-free simulation, in which XOR4 seems the best choice. First, XOR4 ensures correct operation in low-voltage cases, and it does not hit the 1.2 ns delay cap discussed in the previous section. Second, it has the lowest operating current. Third, its active area is one of the smallest, and it does not require any special transistor. Note that all but the CMOS type have an almost identical active area, while the CMOS type is nearly 3 times larger (but still fails to show decent delay performance).

For the current comparison, the numbers of sub-blocks are matched among the 2-, 4-, and 8-input gates. Operating currents are similar except for the CMOS type, and the off-currents, IDD2P, are relatively larger in XOR4 and XOR8. Since the CRC and DBI blocks are almost entirely XOR gates, the power performance of those blocks would yield a similar result.
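The XOR/XNOR alternation described for the proposed 4-input gate leaves the logic function unchanged: two XOR sub-blocks, an XNOR of their outputs, and a final inverting driver still compute the 4-input XOR. The sketch below is a purely logical model with hypothetical function names; it says nothing about the transistor-level circuit in Fig. 3 (c), only that the gate arrangement described in the text is functionally an XOR4.

```python
from itertools import product


def xor2(a, b):
    """2-input XOR sub-block (logic model only)."""
    return a ^ b


def xnor2(a, b):
    """2-input XNOR sub-block (logic model only)."""
    return 1 ^ a ^ b


def inv(a):
    """Final inverting driver."""
    return 1 ^ a


def xor4(a, b, c, d):
    """Two XORs in the first stage, XNORed, then inverted by the driver."""
    return inv(xnor2(xor2(a, b), xor2(c, d)))


# Exhaustive check against the definition of a 4-input XOR (odd parity).
for bits in product((0, 1), repeat=4):
    assert xor4(*bits) == sum(bits) % 2
```

The inverter that breaks the transmission-gate chain is therefore "free" in the logical sense: the XNOR and the driver's inversion cancel.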

Fig. 3. XOR Implementations


III. PARALLEL CRC AND DBI CALCULATION

In DDR4 and GDDR5, all DQ ports and the DBI# port are terminated at VDDQ. Signals with VDDQ termination dissipate less channel power if transitions to zero are restrained. The DBI function in DDR4 and GDDR5 limits the maximum number of simultaneously driven-low DQ channels to 4. More specifically, if the number of zeroes in DQ[0:7] at a certain burst would be greater than 4, the DBI function flags low (DBI# is active-low) and inverts all of DQ[0:7] for that burst. However, since the DBI function changes the actual channel data, CRC values must be adjusted if the CRC calculation was started before the completion of DBI.

In the DBI calculation, the input bits for each DBI#[k] are the 8-bit wide DQ[0:7][k], where k is the burst order. In the CRC calculation, each CRC[i] gets its own unique bit mapping, some bits from DQs and others from DBI#s. An illustration is given in Fig. 5. For example, DBI#[2] is determined by the thick black box, the 8 DQ bits of Burst 2, and the gray boxes indicate the unique bit mapping for CRC[0].

The CRC operation is simply a pure XOR chain: each CRC bit tells whether the number of 1s in its bit mapping is odd. Let us first assume that the CRC calculation has already started without DBI information. For this early start, the DBI# slots in the CRC bit mappings are filled with zeroes so as not to affect the number of 1s in the CRC calculation. The information needed for post-processing the CRC+DBI correction is:

1) Whether DBI#[k] is included in CRC[i] or not,
2) Oddness of the DQ bits associated with both burst k and CRC[i],
3) The actual DBI#[k] value.

Based on the three items above, the flip decision table in Fig. 6 can be made to test the impact of each DBI#[k] on CRC[i]. After summing the decisions from each DBI#[k], the final evenness/oddness decides whether to flip CRC[i] or not. The logic equations for each decision, D, and the final adjusted CRC value, CRC_new, can be written as

    CRC_new[i] = CRC[i] xor D[0] xor ... xor D[7]    (4)

    D[k] = not(Self) * Odd * (DBI#[k]==0)
         + Self * Even * (DBI#[k]==1)
         + Self * Odd    (5)

Here Self, Odd/Even, and the test on DBI#[k] correspond to items 1), 2), and 3) above, respectively.

Fig. 5. CRC[0] Bit Mapping: XOR of Gray Blocks
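Because the CRC is a pure XOR chain, every CRC bit is the parity of a fixed subset of the input bits, which is what the bit mapping in Fig. 5 expresses. The following sketch illustrates this with a generic bit-serial CRC-8 using the polynomial x^8+x^2+x^1+1 (0x07). The MSB-first bit ordering, the zero initial value, the 72-bit frame layout, and all function names are assumptions made for illustration; they are not claimed to match the JEDEC bit mapping drawn in Fig. 5.

```python
import random

POLY = 0x07      # x^8 + x^2 + x + 1 (the x^8 term is implicit)
FRAME_BITS = 72  # assumed frame: 8 bursts x (8 DQ + 1 DBI#)


def crc8(bits):
    """Bit-serial CRC-8, MSB first, zero initial value (assumed conventions)."""
    reg = 0
    for b in bits:
        feedback = ((reg >> 7) & 1) ^ b
        reg = (reg << 1) & 0xFF
        if feedback:
            reg ^= POLY
    return reg


def build_masks(n_bits):
    """masks[i][p] == 1 when input bit p contributes to CRC bit i.

    With a zero initial value the CRC is linear over GF(2), so feeding
    one-hot frames recovers the complete bit mapping.
    """
    masks = [[0] * n_bits for _ in range(8)]
    for p in range(n_bits):
        unit = [0] * n_bits
        unit[p] = 1
        r = crc8(unit)
        for i in range(8):
            masks[i][p] = (r >> i) & 1
    return masks


def crc8_by_parity(bits, masks):
    """Compute each CRC bit as the parity of its mapped input bits."""
    out = 0
    for i in range(8):
        parity = 0
        for m, b in zip(masks[i], bits):
            parity ^= m & b
        out |= parity << i
    return out


masks = build_masks(FRAME_BITS)
rng = random.Random(0)
frame = [rng.randint(0, 1) for _ in range(FRAME_BITS)]
assert crc8(frame) == crc8_by_parity(frame, masks)
```

In hardware, each row of `masks` corresponds to one XOR tree, so the depth of the tree is set by the number of 1s in the row.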

Fig. 6. Flip Decision Table (D[k])
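The flip decision of (5) can be checked against the serial "DBI first, then CRC" order by brute force. The sketch below models a single CRC bit with an arbitrary bit mapping: `dq_mask` and `dbi_mask` are hypothetical random mappings used only for the test, not the ATM-8 HEC mapping, and "Odd" is taken to mean that the number of DQ bits of burst k mapped into CRC[i] is odd, which is one reading of item 2) in the text.

```python
import random

BURSTS, DQ_WIDTH = 8, 8


def dbi_dc(dq_bits):
    """DBI-dc for one burst: return (DBI#, bits driven on the channel)."""
    if dq_bits.count(0) > 4:
        return 0, [b ^ 1 for b in dq_bits]  # DBI# is active-low: invert burst
    return 1, list(dq_bits)


def parity(bits):
    return sum(bits) & 1


def serial_crc_bit(data, dq_mask, dbi_mask):
    """Reference: apply DBI first, then XOR all mapped channel bits."""
    acc = 0
    for k in range(BURSTS):
        dbi_n, driven = dbi_dc(data[k])
        acc ^= parity([driven[j] for j in range(DQ_WIDTH) if dq_mask[k][j]])
        if dbi_mask[k]:
            acc ^= dbi_n
    return acc


def parallel_crc_bit(data, dq_mask, dbi_mask):
    """Early CRC on raw data with DBI# slots held at 0, corrected by D[k]."""
    crc = 0
    for k in range(BURSTS):
        crc ^= parity([data[k][j] for j in range(DQ_WIDTH) if dq_mask[k][j]])
    for k in range(BURSTS):
        dbi_n, _ = dbi_dc(data[k])
        self_ = bool(dbi_mask[k])          # item 1): DBI#[k] is in CRC[i]
        odd = sum(dq_mask[k]) % 2 == 1     # item 2): odd count of mapped DQs
        d = ((not self_ and odd and dbi_n == 0)
             or (self_ and not odd and dbi_n == 1)
             or (self_ and odd))           # equation (5)
        crc ^= int(d)                      # equation (4)
    return crc


rng = random.Random(0)
for _ in range(200):
    data = [[rng.randint(0, 1) for _ in range(DQ_WIDTH)] for _ in range(BURSTS)]
    dq_mask = [[rng.randint(0, 1) for _ in range(DQ_WIDTH)] for _ in range(BURSTS)]
    dbi_mask = [rng.randint(0, 1) for _ in range(BURSTS)]
    assert parallel_crc_bit(data, dq_mask, dbi_mask) == serial_crc_bit(data, dq_mask, dbi_mask)
```

Under this reading, `self_` and `odd` depend only on the bit mapping, so in hardware each D[k] reduces to a constant, DBI#[k], or its inverse.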

Once a CRC polynomial and its bit mappings are given, everything except the DBI# values in (5) becomes a structural constant, and D[k] is left as either an inverter or DBI#[k] (or DBI#[k] with an inverter). In the end, CRC_new[i] simply becomes a function of CRC[i] and the DBI# values, possibly with an inversion to account for an odd number of inversions.

IV. CRC AND DBI CORRECTION ARCHITECTURE

Even though the input field to the entire CRC box is 72 bits wide, the maximum number of DQ bits feeding each CRC[i] is 37 in ATM-8 HEC. This means only 6 stages of 2-input XOR gates are needed, and at least 27 input slots are available at the input level of the XOR tree.

The DBI# circuitry is shown in Fig. 7. By using the proposed 4-input XOR gate from Section II for the CRC box, and a full adder comprised of the same XOR gates and NAND gates, the delay difference between CRC and DBI is 4*tXOR2 - 2*tCMOS, where tCMOS is an approximate CMOS gate delay. Further assuming tXOR2 = 2*tCMOS, the delay difference is ~3*tXOR2, which means DBI# is ready nearly 3*tXOR2 earlier than CRC. So there are free input slots in the CRC tree and nearly 3 stages of free time available for DBI# before a single CRC calculation completes.

Fig. 7. DBI# Circuit

Clearly, if there are enough input slots after the 3rd stage along the CRC chain, which is also the 3rd from last, the DBI#s can be plugged into the CRC slots, and CRC_new[i] is generated without any additional latency. However, in a 6-stage XOR tree with 37 DQ inputs, only 3 slots are available at this point, while the maximum number of D[k] fields for each CRC_new[i] is 6 in ATM-8 HEC. Moving the plugging point one XOR stage earlier guarantees 6 slots at the 2nd stage, and the CRC+DBI correction can be finalized at the cost of one extra XOR2 delay.
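The slot counts quoted for the 6-stage XOR tree (27 at the input level, 3 after the 3rd stage, 6 after the 2nd stage) follow from simple counting. The sketch below assumes a balanced binary tree of 2-input XOR gates in which the 37 DQ inputs are packed into as few subtrees as possible; this packing model and the function names are assumptions for illustration, not the paper's layout.

```python
import math


def tree_depth(num_inputs):
    """Number of 2-input XOR stages needed to reduce num_inputs to one bit."""
    return (num_inputs - 1).bit_length()


def free_slots(num_inputs, stages_done, depth):
    """Unused signal positions after `stages_done` XOR stages.

    A balanced tree of `depth` stages has 2**(depth - stages_done) signal
    positions at that level; with tightly packed inputs, only
    ceil(num_inputs / 2**stages_done) of them carry DQ data.
    """
    positions = 2 ** (depth - stages_done)
    occupied = math.ceil(num_inputs / 2 ** stages_done)
    return positions - occupied


DQ_INPUTS = 37  # maximum DQ bits per CRC[i] quoted in the text
depth = tree_depth(DQ_INPUTS)
for stage in range(depth):
    print(f"after stage {stage}: {free_slots(DQ_INPUTS, stage, depth)} free slots")
```

With 37 inputs the model gives a depth of 6, and free-slot counts of 27, 6, and 3 after 0, 2, and 3 stages, respectively. Since up to 6 D[k] terms may be needed per CRC bit, the 3rd-stage point is too narrow and the 2nd-stage point is the latest one that fits, which is the trade-off described in the text.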


The proposed architecture is presented in Fig. 8. In a parasitic-free simulation in the slow corner with a VDD of 0.95 V, the CRC+DBI time is 1.204 ns. The architecture introduced in [1] starts the CRC+DBI correction only upon the completion of both functions. So, in addition to the CRC calculation time, it requires a full 3 stages of XOR gates for the correction alone (and possibly additional inter-routing delays) if 6 coefficients and the original CRC value are all active in the worst case.

V. POST-LAYOUT SIMULATION & RESULTS

The process technology used for the simulations is the Hynix 38 nm DRAM process. All the numbers in the tables are obtained from post-layout simulation. Propagation delays of the 6-stage 2-input XOR chain and the equivalent 3-stage 4-input XOR chain are presented in Table I. In the slow corner, with a similar active area, the proposed XOR gate is 31.35% faster than Fig. 3(a), which ASSCC08 [1] used. Powers are similar to those in Section II.

The calculation time of CRC+DBI altogether is presented in Table II. The same proposed XOR gate was used in the test circuits for all three cases. In the slow corner, the proposed architecture is 25% and 48% faster than the architecture of ASSCC08 and the serial type, respectively.

VI. CONCLUSION

A new 4-input XOR gate is proposed. All of its internal nodes are capable of full swing, which makes it appropriate for low-voltage operation. Furthermore, it is the fastest among the options considered and consumes less area. Also, a new architecture for CRC+DBI correction is proposed. In the proposed architecture, the CRC+DBI correction starts before the completion of the CRC calculation, which shortens the total calculation time by the saved amount. With this architectural change, the total CRC+DBI calculation time is decreased by at least 17.9% from ASSCC08. With such a time reduction, GDDR5 and DDR4 get a relaxed CL constraint, possibly leading to a CL reduction and hence an increase in the overall bandwidth of the system.

REFERENCES
[1] S. Yoon, B. Kim, Y. Kim, B. Chung, "A Fast GDDR5 Read CRC Calculation Circuit with Read DBI Operation," IEEE Asian Solid-State Circuits Conference, pp. 249-252, November 2008.
[2] Seung-Jun Bae, Kwang-Il Park, "An 80 nm 4 Gb/s/pin 32 bit 512 Mb GDDR4 Graphics DRAM With Low Power and Low Noise Data Bus Inversion," IEEE Journal of Solid-State Circuits, Vol. 43, pp. 121-131, January 2008.
[3] K. Lin, C. Wu, "A Low-cost Realization of Multiple-input Exclusive-OR gates," ASIC Conference and Exhibit, Proceedings of the Eighth Annual IEEE International, pp. 307-310, September 1995.
[4] J. Wang, S. Fang, W. Feng, "New Efficient Designs for XOR and XNOR Functions on the Transistor Level," IEEE Journal of Solid-State Circuits, Vol. 29, No. 7, pp. 780-786, July 1994.
[5] H. Bui, Y. Wang, and Y. Jiang, "Design and Analysis of Low-Power 10-Transistor Full Adders Using Novel XOR-XNOR Gates," IEEE Transactions on Circuits and Systems II, Vol. 49, No. 1, pp. 25-30, January 2002.
[6] H. Bui, Y. Wang, and Y. Jiang, "New 4-Transistor XOR and XNOR Designs," in Proc. 2nd IEEE Asia Pacific Conf. ASICs, pp. 25-28, August 2000.

Fig. 8. CRC+DBI Correction Proposal

TABLE I. 6-STAGE XOR CHAIN DELAY (POST SIM)

Process   Voltage / Temperature   Circuit    Max Delay
TYPICAL   1.20 V / 25 C           Proposed   0.842 ns
TYPICAL   1.20 V / 25 C           ASSCC08    1.115 ns
SLOW      1.05 V / 100 C          Proposed   1.270 ns
SLOW      1.05 V / 100 C          ASSCC08    1.850 ns
FAST      1.35 V / -10 C          Proposed   0.772 ns
FAST      1.35 V / -10 C          ASSCC08    0.925 ns

TABLE II. THE CRC+DBI CORRECTION RESULT (POST SIM)

Process   Voltage / Temperature   Circuit    Calc Time
TYPICAL   1.20 V / 25 C           Proposed   1.20 ns
TYPICAL   1.20 V / 25 C           ASSCC08    1.50 ns
TYPICAL   1.20 V / 25 C           Serial     2.19 ns
SLOW      1.05 V / 100 C          Proposed   1.89 ns
SLOW      1.05 V / 100 C          ASSCC08    2.52 ns
SLOW      1.05 V / 100 C          Serial     3.62 ns
FAST      1.35 V / -10 C          Proposed   1.10 ns
FAST      1.35 V / -10 C          ASSCC08    1.34 ns
FAST      1.35 V / -10 C          Serial     1.81 ns
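The percentages quoted in Sections V and VI can be recomputed from the two tables. The short sketch below uses the tabulated post-simulation values and reads "X% faster" as the relative reduction in delay with respect to the baseline; that interpretation and the helper name `reduction_pct` are assumptions, chosen because they reproduce the quoted figures.

```python
def reduction_pct(proposed_ns, baseline_ns):
    """Relative delay reduction of the proposal versus a baseline, in percent."""
    return 100.0 * (baseline_ns - proposed_ns) / baseline_ns


# Table I, slow corner: 6-stage XOR chain delay (ns).
xor_slow = {"Proposed": 1.270, "ASSCC08": 1.850}

# Table II: CRC+DBI calculation time (ns) per process corner.
crc_dbi = {
    "TYPICAL": {"Proposed": 1.20, "ASSCC08": 1.50, "Serial": 2.19},
    "SLOW":    {"Proposed": 1.89, "ASSCC08": 2.52, "Serial": 3.62},
    "FAST":    {"Proposed": 1.10, "ASSCC08": 1.34, "Serial": 1.81},
}

xor_gain = reduction_pct(xor_slow["Proposed"], xor_slow["ASSCC08"])
slow = crc_dbi["SLOW"]
vs_asscc08 = reduction_pct(slow["Proposed"], slow["ASSCC08"])
vs_serial = reduction_pct(slow["Proposed"], slow["Serial"])
# Smallest improvement over ASSCC08 across the three corners.
worst_case = min(reduction_pct(c["Proposed"], c["ASSCC08"]) for c in crc_dbi.values())

print(f"XOR chain, slow corner: {xor_gain:.2f}%")
print(f"CRC+DBI, slow corner: {vs_asscc08:.0f}% vs ASSCC08, {vs_serial:.0f}% vs serial")
print(f"Minimum CRC+DBI reduction vs ASSCC08: {worst_case:.1f}%")
```

Under this reading, the smallest reduction occurs in the fast corner, which is where the "at least 17.9%" figure in the conclusion comes from.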
