

# Low-Power, High-Bandwidth, and Ultra-Small Memory Module Design

Qawi I. Harvard, PhD, and R. Jacob Baker, PhD

Department of Electrical and Computer Engineering





The main memory subsystem has become inefficient: sustaining performance gains is driving power consumption, capacity, and cost in the wrong direction. This talk proposes novel module, DRAM, and interconnect architectures to alleviate these trends. The proposed architectures use inexpensive innovations in interconnect and packaging to substantially reduce power while increasing the capacity and bandwidth of the main memory system. A low-cost advanced packaging technology is used to propose 8-die and 32-die memory modules. The 32-die memory module measures less than 2 cm<sup>3</sup>. The size and packaging technique allow the memory module to consume less power than conventional module designs. A 4 Gb DRAM architecture with 64 data pins is proposed. The DRAM architecture is in line with ITRS roadmaps and can consume 50% less power while increasing bandwidth by 100%. The large number of data pins is supported by a low-power capacitive-coupled interconnect. The receivers developed for the capacitive interface were fabricated in 0.5 µm and 65 nm CMOS technologies. The 0.5 µm design operated at 200 Mbps, used a coupling capacitor of 100 fF, and consumed less than 3 pJ/bit of energy. The 65 nm design operated at 4 Gbps, used a coupling capacitor of 15 fF, and consumed less than 15 fJ/bit.





## Mobile Platform

□ Motorola Atrix (Front)





## Mobile Platform

- □ Motorola Atrix (Back)
  - Memory (DSP)
  - Memory & CPU
  - HSPA+ DSP
  - 802.11n & Bluetooth
  - Compass







#### Server Platform

#### □ Intel Server Board S5520UR



Memory Slots



# Organization

- □ Main Memory Limitations
- □ Nano-Module
- □ Wide I/O DRAM Architecture
- □ High Bandwidth Interconnect
- □ Conclusions





Datacenter sparsity masked power limitations

- ✓ Power trend: Energy consumption doubled every 5 years
- □ Server power
  - ✓ ~50 W in 2000
  - ✓ ~250 W in 2008
- □ Server power breakdown
  - ✓ CPU: 37%, Memory: 17%
  - ✓ Trend is Memory power > CPU power
- □ Main memory power
  - ✓ More die per module
  - ✓ Fewer modules per channel
  - ✓ Higher bandwidth

















- □ CPU power wall
  - ✓ Voltage scaling reached its limit
  - ✓ Multiple cores supplement performance gains
  - ✓ No "multi-core" for DRAM
- □ DRAM voltage scaling reaching its limit
  - ✓ Current rate increase > voltage reduction rate
  - ✓ Power increasing
- □ DRAM pre-fetch
  - ✓ Memory core operates at slower frequency
  - ✓ High power I/O devices and data-path



Boise State University, College of Engineering



□ DRAM inefficiencies increase cost and power

- ✓ Processor cache increasing
- ✓ Intel Nehalem processor
- ✓ DRAM would need to have L3 BW and latency
- ✓ "…create the illusion of a large memory that we can access as fast as a very small memory." – Patterson & Hennessy

| Local                    | L1      | L2       | L3        | RAM        |
|--------------------------|---------|----------|-----------|------------|
| Read BW [GB/s]           | 45.6    | 31.1     | 26.2      | 10.1       |
| Write BW [GB/s]          | 45.6    | 28.8     | 19.9      | 8.4        |
| Latency [ns]<br>(cycles) | 1.3 (4) | 3.4 (10) | 13.0 (38) | 65.1 (191) |
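The gap that DRAM would have to close to match L3 can be read directly off the table. The sketch below only restates the table's figures as ratios; the variable names are illustrative and no data beyond the table is assumed.

```python
# Figures copied from the Nehalem table above (local accesses).
levels = {
    "L1":  {"read_gbs": 45.6, "write_gbs": 45.6, "latency_ns": 1.3,  "cycles": 4},
    "L2":  {"read_gbs": 31.1, "write_gbs": 28.8, "latency_ns": 3.4,  "cycles": 10},
    "L3":  {"read_gbs": 26.2, "write_gbs": 19.9, "latency_ns": 13.0, "cycles": 38},
    "RAM": {"read_gbs": 10.1, "write_gbs": 8.4,  "latency_ns": 65.1, "cycles": 191},
}

l3, ram = levels["L3"], levels["RAM"]

# How much faster DRAM would need to be to look like L3.
read_bw_gap = l3["read_gbs"] / ram["read_gbs"]
write_bw_gap = l3["write_gbs"] / ram["write_gbs"]
latency_gap = ram["latency_ns"] / l3["latency_ns"]

# Cycles divided by nanoseconds gives the core clock implied by the table, in GHz.
implied_clock_ghz = ram["cycles"] / ram["latency_ns"]

print(f"Read bandwidth gap (L3 / RAM):  {read_bw_gap:.2f}x")
print(f"Write bandwidth gap (L3 / RAM): {write_bw_gap:.2f}x")
print(f"Latency gap (RAM / L3):         {latency_gap:.2f}x")
print(f"Implied core clock:             {implied_clock_ghz:.2f} GHz")
```

By these figures, main memory is roughly 2.4x to 2.6x short on bandwidth and about 5x short on latency relative to L3.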





# DRAM efficiencies increase performance

Capacity versus Performance



- □ Capacity costs power
  - ✓ Multiple memory channels
  - ✓ Each additional module increases power

#### □ Bandwidth versus performance

[Figure: bandwidth versus performance; x-axis: Arithmetic Intensity (Flops/Byte)]

- □ Bandwidth costs power
  - ✓ Buffer on board
  - ✓ Multiple channels




- □ DRAM inefficiencies in practice
- □ Typical video/web server motherboard
  - ✓ 20+ layer PCB
  - ✓ 6 memory channels
- □ RDIMM
  - ✓ 10+ layer PCB
  - ✓ Maximum component count







#### □ 12 RDIMM

- ✓ Termination
  - o 36 components per DIMM
  - o 8 I/O per component
  - o 2.7 W of termination power for a read/write per module
  - o 32.4 W total termination power
- ✓ Wordline firing
  - o 100 ns activation rate
  - o 8126 page size
  - o 200 fF per bitline
  - o 11.2 W total bitline sense amplifier power
- □ Sustaining performance gains through capacity and bandwidth increases power and cost; innovation is required.
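The termination total on this slide follows from the per-module figure. A minimal sketch of that arithmetic, using only the numbers listed above (the per-I/O value is derived here and is not quoted in the talk):

```python
# Figures from the 12-RDIMM slide.
DIMMS = 12
COMPONENTS_PER_DIMM = 36
IO_PER_COMPONENT = 8
TERMINATION_W_PER_DIMM = 2.7  # read/write termination power per module

# Terminated data I/O on one module.
io_per_dimm = COMPONENTS_PER_DIMM * IO_PER_COMPONENT

# Total across all modules; this matches the 32.4 W quoted on the slide.
total_termination_w = DIMMS * TERMINATION_W_PER_DIMM

# Derived: average termination power per I/O, in milliwatts.
termination_mw_per_io = TERMINATION_W_PER_DIMM / io_per_dimm * 1000.0

print(f"I/O per DIMM:            {io_per_dimm}")
print(f"Total termination power: {total_termination_w:.1f} W")
print(f"Termination per I/O:     {termination_mw_per_io:.2f} mW")
```

The bitline sense amplifier figure is not recomputed here because the slide does not state the bitline voltage swing it assumes.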





Goals

- ✓ Purpose was to move labs into prototype generation
- ✓ Required low cost, high bandwidth, and low power memory solution that can be used with capacitive-coupled interconnects in advanced server architectures
- □ Module component count trends required a new approach
- □ Nano-module proposed
  - ✓ Low cost advanced packaging technology
  - ✓ Off-the-shelf memory components
- □ Results can be leveraged
  - ✓ NAND
  - ✓ Mobile





□ Literature review of high capacity memory stacks

□ 1990s

- ✓ Multichip Modules
  - o Realized planar space limitations
- ✓ Val & Lemoine
- ✓ Irvine Sensors

□ Solutions proposed in research

✓ No industry adoption due to memory hierarchy effectiveness





□ Memory stack technology gaining new attention

□ 2010

✓ Samsung quad die with TSV

o 80 µm pitch, 30 µm diameter, 300 TSV

o $R_{TSV} = 5\ \Omega$, $C_{TSV} = 300$ fF

**Pros**:

✓ Lower power, higher bandwidth

Cons:





□ Literature review revealed novel solutions

□ Slant the die!

□ Applicable to capacitive-coupled interconnects





## Nano-Module

#### □ Not the first to try it:







## Nano-Module

□ Controlled Impedance

- ✓ All signals 50 Ω controlled impedance
- ✓ DQS and CLK 120 Ω differential impedance

□ Trace Length Matching

- ✓ All Data matched to worst case
- ✓ All CLK matched to worst case
- ✓ All Address/Command matched to worst case

$$Z_0 = \frac{87}{\sqrt{\varepsilon_r + 1.41}} \ln\left(\frac{5.98H}{0.8W + T}\right)$$

[Figure: microstrip cross-section with trace width $W$ and trace thickness $T$ labeled]
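The microstrip formula on this slide can be evaluated directly, or inverted to size a trace for the 50 Ω single-ended target. The sketch below is a minimal illustration; the board parameters (`ER`, `H_MM`, `T_MM`) are assumed example values, not the nano-module's actual stack-up, and the 120 Ω differential pairs are outside what this single-ended formula covers.

```python
import math

def microstrip_z0(er, h, w, t):
    """Characteristic impedance in ohms from the slide's formula.

    er: relative permittivity; h: dielectric height; w: trace width;
    t: trace thickness. h, w, and t must share the same length unit.
    """
    return 87.0 / math.sqrt(er + 1.41) * math.log(5.98 * h / (0.8 * w + t))

def width_for_z0(z0_target, er, h, t):
    """Solve the same formula for the trace width W."""
    ratio = math.exp(z0_target * math.sqrt(er + 1.41) / 87.0)
    return (5.98 * h / ratio - t) / 0.8

# Assumed example stack-up (hypothetical, for illustration only).
ER = 4.3      # FR-4-like dielectric
H_MM = 0.2    # dielectric height in mm
T_MM = 0.035  # copper thickness in mm (about 1 oz)

w_50 = width_for_z0(50.0, ER, H_MM, T_MM)
z0_check = microstrip_z0(ER, H_MM, w_50, T_MM)

print(f"Trace width for 50 ohm: {w_50:.3f} mm")
print(f"Check: Z0 = {z0_check:.2f} ohm at that width")
```

This closed-form approximation is usually quoted as valid only for moderate width-to-height ratios (roughly 0.1 < W/H < 2), so a field solver is the safer choice for final routing.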









## Nano-Module

#### □ Thermal option

- ✓ Thermal conductivity
  - o Silicon, Metals >> Mold Compound
  - o Hot spots
  - o Temperature gradient











# Wide I/O DRAM Architecture

- □ 4 Gb DRAM
  - ✓ Meets 2012 ITRS predictions
  - ✓ Developed at Boise State
- □ Edge aligned pads
- □ Page size reduction
- □ Low cost process
  - ✓ < 4 levels of metal
  - ✓ No impact to die size
  - ✓ No impact to array efficiency
- □ Move to 64 data pins
  - ✓ Report challenges
  - ✓ Propose innovations

[Figure: die layout with a 10.2 mm dimension marked]







# Wide I/O DRAM Architecture

#### □ 4 Gb Edge DRAM

- ✓ Centralized Row and Column
- ✓ Smaller die
- ✓ Higher efficiency
- ✓ < 4 levels of metal







# Wide I/O DRAM Architecture

- □ Challenges
  - ✓ Number of metal layers
  - ✓ Global data routing
  - ✓ Local data routing
- □ Proposals
  - ✓ Split bank structure
  - ✓ Data-path design
  - ✓ Through bitline routing
  - ✓ SLICE architecture
  - ✓ Capacitive-coupled I/O







#### Capacitive-coupling

- ✓ Increased bandwidth
  - o Reduced ESD capacitance
  - o Smaller I/O channel = more I/O
  - o Removal of inductive channel
- ✓ Low power
  - o Reduced ESD capacitance
  - o Low power Tx & Rx
- ✓ Low cost
  - o Simple
- ✓ Alignment required
- □ Literature review
  - ✓ Revealed inefficiencies and lack of application







- □ Proposed receiver design
  - ✓ Extremely low power
  - ✓ ~1 gate delay latency
  - ✓ 'DC' transmission
  - ✓ RTZ → NRZ
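The "RTZ → NRZ" and "'DC' transmission" bullets can be illustrated with a purely behavioural model. This is a sketch under stated assumptions, not the fabricated circuit: the coupling capacitor is modeled as passing only level changes (as pulses that return to zero), and the receiver is modeled as a latch that sets on a positive pulse, resets on a negative pulse, and otherwise holds. The function names `couple` and `recover_nrz` are hypothetical.

```python
def couple(nrz_bits, initial_level=0):
    """Idealized series capacitor: only transitions pass through.

    Returns +1 for a rising edge, -1 for a falling edge, and 0 when the
    transmitted level does not change (the pulse has returned to zero).
    """
    pulses = []
    previous = initial_level
    for bit in nrz_bits:
        pulses.append(bit - previous)
        previous = bit
    return pulses

def recover_nrz(pulses, initial_level=0):
    """Latch-style receiver: set on +pulse, reset on -pulse, hold on 0."""
    level = initial_level
    recovered = []
    for pulse in pulses:
        if pulse > 0:
            level = 1
        elif pulse < 0:
            level = 0
        recovered.append(level)
    return recovered

tx_bits = [0, 1, 1, 1, 0, 0, 1, 0]
rx_pulses = couple(tx_bits)
rx_bits = recover_nrz(rx_pulses)

print("TX (NRZ):   ", tx_bits)
print("Pulses:     ", rx_pulses)
print("RX (NRZ):   ", rx_bits)
```

Because the latch holds its state between pulses, runs of identical bits survive the AC-coupled channel, which is one way to read the 'DC' transmission claim.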











## Chip micrograph

- ✓ 1.5 mm x 1.5 mm
- ✓ 9 structures

□ Experimental results

- ✓ Operate at $V_{TX} = 2.0$ V
- ✓ 3 pJ/bit at 200 Mbps





# 65 nm CMOS design (proof of scalability)

- ✓ 1.2 V process
- ✓ 15 fF metal-metal capacitor
- ✓ 4 Gbps
- ✓ 17 µm<sup>2</sup>
- ✓ 227 Tbps/mm<sup>2</sup>
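The bandwidth density figure is the data rate divided by the area. A minimal sketch of the unit conversion, using only the numbers on this slide:

```python
UM2_PER_MM2 = 1.0e6  # 1 mm^2 = 1e6 um^2

def density_tbps_per_mm2(rate_gbps, area_um2):
    """Bandwidth density in Tbps/mm^2 from a rate in Gbps and an area in um^2."""
    area_mm2 = area_um2 / UM2_PER_MM2
    return rate_gbps / area_mm2 / 1000.0

def area_um2_for_density(rate_gbps, density_tbps):
    """Area in um^2 implied by a given rate and density."""
    return rate_gbps / (density_tbps * 1000.0) * UM2_PER_MM2

# Using the rounded 17 um^2 from the slide.
density_from_rounded_area = density_tbps_per_mm2(4.0, 17.0)

# Area implied by the quoted 227 Tbps/mm^2 at 4 Gbps.
implied_area_um2 = area_um2_for_density(4.0, 227.0)

print(f"Density at 17 um^2:            {density_from_rounded_area:.1f} Tbps/mm^2")
print(f"Area implied by 227 Tbps/mm^2: {implied_area_um2:.2f} um^2")
```

The rounded 17 µm² gives about 235 Tbps/mm², so the quoted 227 Tbps/mm² suggests the unrounded area is closer to 17.6 µm².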













| Work         | Process | Supply | Data Rate | Coupling Capacitance | Bandwidth Density [Gbps/mm<sup>2</sup>] | Energy       | Requires CLK? |
|--------------|---------|--------|-----------|----------------------|------------------------------------------|--------------|---------------|
| Kanda, 2003  | 0.35 µm | 3.3 V  | 1.27 Gbps | ~10 fF               | 2117                                     | 2.4 pJ/bit   | Yes           |
| Wilson, 2007 | 0.18 µm | 1.8 V  | 3 Gbps    | 150 fF               | 555                                      | 5 pJ/bit     | No            |
| Fazzi, 2008  | 0.13 µm | 1.2 V  | 1.23 Gbps | ~10 fF               | 19,219                                   | 0.14 pJ/bit  | Yes           |
| Kim, 2009    | 0.18 µm | 1.8 V  | 2 Gbps    | 600 fF               | 690                                      | 0.8 pJ/bit   | Yes           |
| This work    | 0.5 µm  | 5.0 V  | 200 Mbps  | 50 fF                | 325                                      | 8 pJ/bit     | No            |
| This work    | 65 nm   | 1.2 V  | 4 Gbps    | 15 fF                | 226,757                                  | 0.015 pJ/bit | No            |

Kanda, K., Antono, D.D., Ishida, K., Kawaguchi, H., Kuroda, T., Sakurai, T., "1.27 Gb/s/pin 3 mW/pin Wireless Superconnect (WSC) Interface Scheme," IEEE Solid-State Circuits Conference, Session 10, Paper 10.7, February 11<sup>th</sup>, 2003.

Wilson, J.; Mick, S.; Jian Xu; Lei Luo; Bonafede, S.; Huffman, A.; LaBennett, R.; Franzon, P.D., "Fully Integrated AC Coupled Interconnect Using Buried Bumps," Advanced Packaging, IEEE Transactions on, vol. 30, no. 2, pp. 191-199, May 2007

Fazzi, A., Canegallo, R., Ciccarelli, L., Magani, L., Natali, F., Jung, E., Rolandi, P., Guerrieri, R., "3-D Capacitive Interconnections With Mono- and Bi-Directional Capabilities," Solid-State Circuits, IEEE Journal of, vol. 43, no. 1, pp. 275-284, Jan. 2008

Kim, G., Takamiya, M., Sakurai, T., "A 25-mV-Sensitivity 2-Gb/s Optimum-Logic-Threshold Capacitive-Coupling Receiver for Wireless Wafer Probing Systems," Circuits and Systems, IEEE Transactions on, vol. 56, no. 9, pp. 710-713, Sept. 2009
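The Energy column relates to link power through the data rate: energy per bit equals power divided by bit rate. The sketch below checks this against the Kanda entry, whose title quotes 3 mW/pin at 1.27 Gb/s/pin, and converts the two "This work" rows into average power. The power values are derived here from the table and are not figures quoted in the talk.

```python
def energy_pj_per_bit(power_w, rate_bps):
    """Energy per bit in pJ from average power (W) and data rate (bit/s)."""
    return power_w / rate_bps * 1e12

def power_uw(energy_pj, rate_bps):
    """Average power in microwatts from energy per bit (pJ) and data rate (bit/s)."""
    return energy_pj * 1e-12 * rate_bps * 1e6

# Kanda, 2003: 3 mW/pin at 1.27 Gb/s/pin (from the paper title).
kanda_pj = energy_pj_per_bit(3e-3, 1.27e9)

# "This work" rows of the comparison table.
power_05um_uw = power_uw(8.0, 200e6)   # 0.5 um: 8 pJ/bit at 200 Mbps
power_65nm_uw = power_uw(0.015, 4e9)   # 65 nm: 0.015 pJ/bit at 4 Gbps

print(f"Kanda energy:         {kanda_pj:.2f} pJ/bit")
print(f"0.5 um receiver power: {power_05um_uw:.0f} uW")
print(f"65 nm receiver power:  {power_65nm_uw:.0f} uW")
```

The Kanda result rounds to the 2.4 pJ/bit listed in the table, which suggests the Energy column was built the same way.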





# Conclusions

□ Nano-Module

✓ Developed a new research direction for industry research labs

✓ Developed initial motivation

✓ Developed initial prototype

□ DRAM Architecture

✓ Demonstrated benefits of wide I/O topologies

✓ Proposed several low power innovations

✓ Provided application for novel interconnect technologies

□ Capacitive-Coupled Receiver

✓ Demonstrated low power receiver designs

✓ Achieved 4 Gbps at < 15 fJ/bit in 65 nm





## Questions

?









# Appendix - PLL





□ Voltage controlled oscillator

$$A_{VCO} = 2\pi \cdot \frac{f_{MAX} - f_{MIN}}{V_{MAX} - V_{MIN}}$$
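The VCO gain expression is a slope: the tuning range in frequency divided by the control voltage range, scaled by 2π to give rad/s per volt. A minimal sketch follows; the tuning range and control voltages are hypothetical example values, not measurements from the appendix PLL.

```python
import math

def vco_gain(f_max_hz, f_min_hz, v_max, v_min):
    """VCO gain in rad/(s*V), following the slide's expression for A_VCO."""
    return 2.0 * math.pi * (f_max_hz - f_min_hz) / (v_max - v_min)

# Hypothetical tuning range, for illustration only.
F_MIN_HZ = 1.0e9
F_MAX_HZ = 3.0e9
V_MIN = 0.4
V_MAX = 0.8

a_vco = vco_gain(F_MAX_HZ, F_MIN_HZ, V_MAX, V_MIN)

print(f"A_VCO = {a_vco:.3e} rad/(s*V)")
print(f"      = {a_vco / (2.0 * math.pi) / 1e9:.1f} GHz/V")
```

The expression assumes the frequency-versus-voltage curve is close to linear between the two endpoints.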














#### □ PLL at lock






#### □ PLL layout







#### □ PRBS generator







# Appendix - PCB

#### □ PCB test board







#### Appendix – Dead Bug












## Appendix – 65 nm Chip





## References

- [1] Val, C.; Lemoine, T.; , "3-D interconnection for ultra-dense multichip modules," *Components, Hybrids, and Manufacturing Technology, IEEE Transactions on*, vol.13, no.4, pp.814-821, Dec 1990
- [2] Bertin, C.L.; Perlman, D.J.; Shanken, S.N.; , "Evaluation of a three-dimensional memory cube system," *Components, Hybrids, and Manufacturing Technology, IEEE Transactions on*, vol.16, no.8, pp.1006-1011, Dec 1993
- [3] Uksong Kang; Hoe-Ju Chung; Seongmoo Heo; Duk-Ha Park; Hoon Lee; Jin Ho Kim; Soon-Hong Ahn; Soo-Ho Cha; Jaesung Ahn; DukMin Kwon; Jae-Wook Lee; Han-Sung Joo; Woo-Seop Kim; Dong Hyeon Jang; Nam Seog Kim; Jung-Hwan Choi; Tae-Gyeong Chung; Jei-Hwan Yoo; Joo Sun Choi; Changhyun Kim; Young-Hyun Jun; , "8 Gb 3-D DDR3 DRAM Using Through-Silicon-Via Technology," *Solid-State Circuits, IEEE Journal of*, vol.45, no.1, pp.111-119, Jan. 2010
- [4] Matthias, T.; Kim, B.; Burgstaller, D.; Wimplinger, M.; Lindner, P., "State-of-the-art Thin Wafer Processing," Chip Scale Review, vol. 14, no. 4, pp. 26, July 2010.
- [5] U.S. Environmental Protection Agency, "Report to Congress on Server and Data Center Energy Efficiency Public Law 109-431," 2007.
- [6] L. Minask, B. Ellison, "The Problem of Power Consumption in Servers," Intel Press, 2009, http://www.intel.com/intelpress/articles/rpcs1.htm
- [7] D. Patterson, J. Hennessy, Computer Organization and Design, 4<sup>th</sup> ed., Morgan Kaufmann Publishers, San Francisco, 2009.
- [8] Karp, J.; Regitz, W.; Chou, S.; , "A 4096-bit dynamic MOS RAM," Solid-State Circuits Conference. Digest of Technical Papers. 1972 IEEE International, vol.XV, no., pp. 10- 11, Feb 1972
- [9] Micron Technology Inc. Various Datasheets: http://www.micron.com/products/dram/
- [10] B. Gervasi, "Time to Rethink DDR4," MEMCON 2010, http://discobolusdesigns.com/personal/20100721a\_gervasi\_rethinking\_ddr4.pdf
- [11] Various IBM datasheets. www.ibm.com
- [12] "Power-Efficiency with 2, 4, 6, and 8 Gigabytes of Memory for Intel and AMD Servers," Neal Nelson & Associates, White Paper 2007.
- [13] Rambus, "Challenges and Solutions for Future Main Memory," http://www.rambus.com/assests/documents/products/future\_main\_memory\_whitepaper.pdf, May 2009.
- [14] Intel AMB Datasheet, http://www.intel.com/assets/pdf/datasheet/313072.pdf, pg 38.






- [15] "Intel Server Board S5520UR and S5520URT, Technical Product Specification" Rev. 1.6, July 2010, Intel Corporation.
- [16] D. Klein, "The Future of Memory and Storage: Closing the Gap," Microsoft WinHEC 2007, May 2007.
- [17] Cotues, "Stepped Electronic Device Package," U.S. Patent 5,239,447, Aug. 24, 1993.
- [18] G. Rinne, P. Deane, "Microelectronic Packaging Using Arched Solder Columns," U.S. Patent 5,963,793, Oct. 5, 1999.
- [19] R. Plieninger, "Challenges and New Solutions for High Integration IC Packaging," ESTC, July 2006, http://141.30.122.65/Keynotes/6-Plieninger-ESTC\_Keynote\_20060907.pdf
- [20] Harvard, Q., "Wide I/O DRAM Architecture Utilizing Proximity Communication" (2009). *Boise State University Theses and Dissertations*. Paper 72.
- [21] International Technology Roadmap for Semiconductors, 2007 Edition, http://www.itrs.net/Links/2007ITRS/Home2007.htm, 2007.
- [22] K. Kilbuck, "Main Memory Technology Direction," Microsoft WinHEC 2007, May 2007.
- [23] R. Drost, R. Hopkins, I. Sutherland, "Proximity Communication," *Proceedings of the IEEE 2003 Custom Integrated Circuits Conference*, vol. 39, issue 9, pp. 469-472, September 2003.
- [24] Salzman, D.; Knight, T., Jr., "Capacitive coupling solves the known good die problem," *Multi-Chip Module Conference, 1994. MCMC-94, Proceedings., 1994 IEEE*, vol., no., pp.95-100, 15-17 Mar 1994
- [25] Salzman, D.; Knight, T., Jr.; Franzon, P., "Application of capacitive coupling to switch fabrics," Multi-Chip Module Conference, 1995. MCMC-95, Proceedings., 1995 IEEE, vol., no., pp.195-199, 31 Jan-2 Feb 1995
- [26] Wilson, J.; Mick, S.; Jian Xu; Lei Luo; Bonafede, S.; Huffman, A.; LaBennett, R.; Franzon, P.D.; , "Fully Integrated AC Coupled Interconnect Using Buried Bumps," Advanced Packaging, IEEE Transactions on , vol.30, no.2, pp.191-199, May 2007
- [27] Luo, L.; Wilson, J.M.; Mick, S.E.; Jian Xu; Liang Zhang; Franzon, P.D.; , "3 Gb/s AC coupled chip-to-chip communication using a low swing pulse receiver," Solid-State Circuits, IEEE Journal of, vol.41, no.1, pp. 287-296, Jan. 2006
- [28] R. Baker, CMOS: Circuit Design, Layout, and Simulation, Third Edition, Wiley-IEEE, 2010
- [29] O. Schwartsglass, "PRBS Work," The Hebrew University of Jerusalem, VLSI class notes, 2002. http://www.cs.huji.ac.il/course/2002/vlsilab/files/prbs/PRBS.pdf

