### CS152: Computer Architecture and Engineering Locality and Memory Technology

### October 29, 1997

### Dave Patterson (http.cs.berkeley.edu/~patterson)

lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/

### Recap

- MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)
- More performance from deeper pipelines, parallelism
- Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency
- SW Pipelining
  - Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead
- Dynamic Branch Prediction + early branch address for speculative execution
- Superscalar and VLIW
  - CPI < 1
  - Dynamic issue vs. Static issue
  - More instructions issue at same time, larger the penalty of hazards
- Intel EPIC in IA-64 a hybrid: compact LIW + data hazard check

### The Five Classic Components of a Computer



### Today's Topics:

- Recap last lecture
- Locality and Memory Hierarchy
- Administrivia
- SRAM Memory Technology
- DRAM Memory Technology
- Memory Organization

### **Technology Trends (from 1st lecture)**

|        | Capacity      | Speed (latency) |
|--------|---------------|-----------------|
| Logic: | 2x in 3 years | 2x in 3 years   |
| DRAM:  | 4x in 3 years | 2x in 10 years  |
| Disk:  | 4x in 3 years | 2x in 10 years  |



## Who Cares About the Memory Hierarchy?

### **Processor-DRAM Memory Gap (latency)**



### **Today's Situation: Microprocessor**

- Rely on caches to bridge gap
- Microprocessor-DRAM performance gap
  - time of a full cache miss in instructions executed
    - 1st Alpha (7000): 340 ns/5.0 ns = 68 clks x 2 or 136 instructions
    - 2nd Alpha (8400): 266 ns/3.3 ns = 80 clks x 4 or 320 instructions
    - 3rd Alpha (t.b.d.): 180 ns/1.7 ns = 108 clks x 6 or 648 instructions
  - 1/2X latency x 3X clock rate x 3X Instr/clock $\Rightarrow \approx 5X$
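The miss-cost arithmetic above can be sketched in a few lines of Python. The function and parameter names are illustrative and do not come from the lecture; the model simply multiplies the miss time in clocks by the peak issue width.

```python
def miss_cost_in_instructions(miss_latency_ns, cycle_time_ns, issue_width):
    """Instruction issue slots lost while one full cache miss is serviced.

    miss_latency_ns: time to service the miss from DRAM
    cycle_time_ns:   processor clock period
    issue_width:     peak instructions issued per clock
    """
    clocks = miss_latency_ns / cycle_time_ns
    return clocks * issue_width


# First Alpha system on the slide: 340 ns miss, 5.0 ns clock, 2-issue.
print(round(miss_cost_in_instructions(340, 5.0, 2)))
```

Feeding in the other two systems gives values within a few percent of the slide's 320 and 648, which are computed from rounded clock counts.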

### **Impact on Performance**

### Suppose a processor executes at

- Clock Rate = 200 MHz (5 ns per cycle)
- CPI = 1.1
- 50% arith/logic, 30% ld/st, 20% control
- Suppose that 10% of memory operations get 50 cycle miss penalty




- CPI = ideal CPI + average stalls per instruction = 1.1 (cycles) + (0.30 (data mops/ins) x 0.10 (miss/data mop) x 50 (cycles/miss)) = 1.1 cycles + 1.5 cycles = 2.6
- 58% of the time the processor is stalled waiting for memory!
- A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
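As a quick check of the numbers above, here is a minimal Python sketch of the stall calculation. The function name is hypothetical, and the instruction-miss line assumes exactly one instruction fetch per instruction.

```python
def cpi_with_memory_stalls(ideal_cpi, mem_ops_per_instr, miss_rate, miss_penalty):
    """Return (total CPI, fraction of time stalled on memory)."""
    stall_cpi = mem_ops_per_instr * miss_rate * miss_penalty
    total_cpi = ideal_cpi + stall_cpi
    return total_cpi, stall_cpi / total_cpi


# Slide values: CPI 1.1, 30% loads/stores, 10% of them miss, 50-cycle penalty.
total_cpi, stalled = cpi_with_memory_stalls(1.1, 0.30, 0.10, 50)
print(f"CPI = {total_cpi:.1f}, stalled {stalled:.0%} of the time")

# Extra CPI from a 1% instruction miss rate (assumes 1 fetch per instruction).
instr_miss_cpi = 1.0 * 0.01 * 50
print(f"instruction misses add {instr_miss_cpi:.1f} cycles")
```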

### The Goal: illusion of large, fast, cheap memory

- Fact: Large memories are slow, fast memories are small
- How do we create a memory that is large, cheap and fast (most of the time)?
  - Hierarchy
  - Parallelism



| Speed: | Fastest  | Slowest |
|--------|----------|---------|
| Size:  | Smallest | Biggest |
| Cost:  | Highest  | Lowest  |

### The Principle of Locality:

- Programs access a relatively small portion of the address space at any instant of time.



### Temporal Locality (Locality in Time):

=> Keep most recently accessed data items closer to the processor

### Spatial Locality (Locality in Space):

=> Move blocks consisting of contiguous words to the upper levels



### **Memory Hierarchy: Terminology**

### Hit: data appears in some block in the upper level (example: Block X)

- Hit Rate: the fraction of memory accesses found in the upper level
- Hit Time: Time to access the upper level, which consists of RAM access time + Time to determine hit/miss

### Miss: data needs to be retrieved from a block in the lower level (Block Y)

- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor
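These terms combine into the standard average memory access time formula, hit time + miss rate x miss penalty. The slide does not state that formula itself, so the sketch below is an addition; the function name and the example numbers (1-cycle hit, 90% hit rate, 50-cycle miss penalty) are assumptions chosen for illustration.

```python
def average_access_time(hit_time, hit_rate, miss_penalty):
    """Average memory access time, in the same units as hit_time."""
    miss_rate = 1.0 - hit_rate          # Miss Rate = 1 - Hit Rate
    return hit_time + miss_rate * miss_penalty


# Illustrative values only: 1-cycle hit, 90% hit rate, 50-cycle miss penalty.
print(f"{average_access_time(1, 0.90, 50):.1f} cycles on average")
```

Even a 10% miss rate makes the average access several times slower than a hit, which is why a short hit time alone is not enough.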



### Memory Hierarchy of a Modern Computer System

### By taking advantage of the principle of locality:

- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.



### Registers <-> Memory

- by compiler (programmer?)

### cache <-> memory

- by the hardware

### memory <-> disks

- by the hardware and operating system (virtual memory)
- by the programmer (files)

### Memory Hierarchy Technology

### Random Access:

- "Random" is good: access time is the same for all locations
- DRAM: Dynamic Random Access Memory
  - High density, low power, cheap, slow
  - Dynamic: need to be "refreshed" regularly
- SRAM: Static Random Access Memory
  - Low density, high power, expensive, fast
  - Static: content will last "forever" (until power is lost)

### "Not-so-random" Access Technology:

- Access time varies from location to location and from time to time
- Examples: Disk, CDROM

### Sequential Access Technology: access time linear in location (e.g., Tape)

### The next two lectures will concentrate on random access technology

- The Main Memory: DRAMs + Caches: SRAMs

### **Main Memory Background**

### Performance of Main Memory:

- Latency: Cache Miss Penalty
  - Access Time: time between the request and the arrival of the word
  - Cycle Time: time between requests
- Bandwidth: I/O & Large Block Miss Penalty (L2)

### Main Memory is **DRAM**: Dynamic Random Access Memory

- Dynamic since it needs to be refreshed periodically (8 ms)
- Addresses divided into 2 halves (Memory as a 2D matrix):
  - RAS or Row Access Strobe
  - CAS or Column Access Strobe

### Cache uses **SRAM**: Static Random Access Memory

- No refresh (6 transistors/bit vs. 1 transistor/bit)
- Address not divided

#### Size: DRAM/SRAM ≈ 4-8, Cost/Cycle time: SRAM/DRAM ≈ 8-16

DAP Fa97, © U.CB

### Random Access Memory (RAM) Technology

- Why do computer designers need to know about RAM technology?
  - Processor performance is usually limited by memory bandwidth
  - As IC densities increase, lots of memory will fit on processor chip
    - Tailor on-chip memory to specific needs
      - Instruction cache
      - Data cache
      - Write buffer

### What makes RAM different from a bunch of flip-flops?

- Density: RAM is much denser

## Office Hours:

- Gebis: Tuesday, 3:30-4:30
- Kirby: ?
- Kozyrakis: Monday 1pm-2pm, Th 11am-noon 415 Soda Hall
- Patterson: Wednesday 12-1 and Wednesday 3:30-4:30 635 Soda Hall

### Reflector site for handouts and lecture notes (backup):

- http://HTTP.CS.Berkeley.EDU/~patterson/152F97/index\_handouts.html
- http://HTTP.CS.Berkeley.EDU/~patterson/152F97/index\_lectures.html

### Computers in the news

- Intel buys DEC fab line for \$700M + rights to DEC patents; Intel pays some royalty per chip from 1997-2007
- DEC has rights to continue to fab Alpha in future on Intel-owned line
- Intel offers jobs to 2000 fab/process people; DEC keeps MPU designers
- DEC will build servers based on IA-64 (+Alpha); customers choose
- Intel gets rights to DEC UNIX

### Static RAM Cell



- 3. Cell pulls one line low
- 4. Sense amp on column detects difference between bit and bit bar

### **Typical SRAM Organization: 16-word x 4-bit**




### Logic Diagram of a Typical SRAM



Write Enable is usually active low (WE\_L)

### Din and Dout are combined to save pins:

- A new control signal, output enable (OE\_L), is needed
- WE\_L is asserted (Low), OE\_L is deasserted (High)
  - D serves as the data input pin
- WE\_L is deasserted (High), OE\_L is asserted (Low)
  - D is the data output pin
- Both WE\_L and OE\_L are asserted:
  - Result is unknown. Don't do that!!!
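The pin-sharing rules above amount to a small truth table, sketched here in Python. The function name is hypothetical, and the last case (neither signal asserted) is not covered by the slide; treating D as not driven is an assumption.

```python
def sram_d_pin_role(we_l, oe_l):
    """Role of the shared D pin for the given active-low pin levels.

    we_l, oe_l: 0 means asserted (Low), 1 means deasserted (High).
    """
    write_enabled = (we_l == 0)
    output_enabled = (oe_l == 0)
    if write_enabled and output_enabled:
        return "unknown"        # the slide's "Don't do that!!!" case
    if write_enabled:
        return "data input"
    if output_enabled:
        return "data output"
    return "not driven"         # assumption: the slide does not cover this case


for we_l in (0, 1):
    for oe_l in (0, 1):
        print(f"WE_L={we_l} OE_L={oe_l}: D is {sram_d_pin_role(we_l, oe_l)}")
```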

Although you could change the VHDL to do what you desire, you must do the best with what you've got (vs. what you need)

## **Typical SRAM Timing**





### Six transistors use up a lot of area

### Consider a "Zero" is stored in the cell:

- Transistor N1 will try to pull "bit" to 0
- Transistor P2 will try to pull "bit bar" to 1

#### But bit lines are precharged to high: Are P1 and P2 necessary?

### 1-Transistor Memory Cell (DRAM)

### Write:

- 1. Drive bit line
- 2. Select row

### Read:

- 1. Precharge bit line to Vdd
- 2. Select row
- 3. Cell and bit line share charges
  - Very small voltage changes on the bit line
- 4. Sense (fancy sense amp)
  - Can detect changes of ~1 million electrons
- 5. Write: restore the value

### Refresh

- 1. Just do a dummy read to every cell.



#### **Classical DRAM Organization (square)**

[Figure: square RAM cell array. A row decoder, driven by the row address, selects one word (row) line; the bit (data) lines feed the column selector & I/O circuits, driven by the column address. Each intersection represents a 1-T DRAM cell.]

- Row and Column Address together: Select 1 bit at a time



Square root of bits per RAS/CAS

### **DRAM** physical organization (4 Mbit)



### **Memory Systems**



### Tc = Tcycle + Tcontroller + Tdriver

### Logic Diagram of a Typical DRAM



- Control Signals (RAS\_L, CAS\_L, WE\_L, OE\_L) are all active low
- Din and Dout are combined (D):
  - WE\_L is asserted (Low), OE\_L is deasserted (High)
    - D serves as the data input pin
  - WE\_L is deasserted (High), OE\_L is asserted (Low)
    - D is the data output pin

### Row and column addresses share the same pins (A)

- RAS\_L goes low: Pins A are latched in as row address
- CAS\_L goes low: Pins A are latched in as column address
- RAS/CAS edge-sensitive
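Because the row and column halves share the A pins, a memory controller has to split each cell address in two before presenting it. A minimal sketch follows, assuming a 4 Mbit x 1 part (2^22 cells, 11 bits per half) and assuming the row is taken from the upper bits; the slides specify neither the bit assignment nor the function name used here.

```python
def split_dram_address(address, half_bits=11):
    """Split a cell address into the (row, column) halves sent on pins A.

    half_bits=11 matches a square 4 Mbit x 1 array (2048 x 2048); both the
    width and the row-in-upper-bits choice are assumptions for illustration.
    """
    mask = (1 << half_bits) - 1
    row = (address >> half_bits) & mask      # latched when RAS_L falls
    column = address & mask                  # latched when CAS_L falls
    return row, column


print(split_dram_address((5 << 11) | 7))     # row 5, column 7
```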

- t<sub>RAC</sub>: minimum time from RAS line falling to the valid data output.
  - Quoted as the speed of a DRAM
  - A fast 4Mb DRAM t<sub>RAC</sub> = 60 ns
- t<sub>RC</sub>: minimum time from the start of one row access to the start of the next.
  - t<sub>RC</sub> = 110 ns for a 4Mbit DRAM with a t<sub>RAC</sub> of 60 ns
- t<sub>CAC</sub>: minimum time from CAS line falling to valid data output.
  - 15 ns for a 4Mbit DRAM with a t<sub>RAC</sub> of 60 ns
- t<sub>PC</sub>: minimum time from the start of one column access to the start of the next.
  - 35 ns for a 4Mbit DRAM with a t<sub>RAC</sub> of 60 ns

- A 60 ns (t<sub>RAC</sub>) DRAM can
  - perform a row access only every 110 ns (t<sub>RC</sub>)
  - perform column access (t<sub>CAC</sub>) in 15 ns, but time between column accesses is at least 35 ns (t<sub>PC</sub>).
    - In practice, external address delays and turning around buses make it 40 to 50 ns
- These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead.
  - Drive parallel DRAMs, external memory controller, bus to turn around, SIMM module, pins...
  - 180 ns to 250 ns latency from processor to memory is good for a "60 ns" (t<sub>RAC</sub>) DRAM
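To see what these parameters imply, the sketch below compares reading several words from different rows against reading them from one open row. It is a simplified chip-level model using only the four timings quoted above; the function names are illustrative, and controller, bus and address-drive overheads are ignored.

```python
T_RAC, T_RC, T_CAC, T_PC = 60, 110, 15, 35   # ns, "60 ns" 4 Mbit DRAM


def time_separate_rows(n_words):
    """Each word needs its own row access: starts are t_RC apart and the
    last word's data is valid t_RAC after its row access starts."""
    return (n_words - 1) * T_RC + T_RAC


def time_same_row(n_words):
    """First word pays the row access (t_RAC); later column accesses to
    the same row start t_PC apart (simplified model)."""
    return T_RAC + (n_words - 1) * T_PC


for n in (1, 4):
    print(n, "words:", time_separate_rows(n), "ns vs", time_same_row(n), "ns")
```

Under this model the gap grows with every extra word, because each additional access costs t<sub>PC</sub> rather than t<sub>RC</sub>.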

### **DRAM Write Timing**




### **DRAM Read Timing**



### **Main Memory Performance**

### **Simple**:

- CPU, Cache, Bus, Memory same width (32 bits)

### **Wide**:

- CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)

### **Interleaved**:

- CPU, Cache, Bus 1 word: Memory N Modules (4 Modules); example is word interleaved



### **Cycle Time versus Access Time**



### DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time

- ≈ 2:1; why?

### DRAM (Read/Write) Cycle Time:

- How frequently can you initiate an access?
- Analogy: A little kid can only ask his father for money on Saturday

### DRAM (Read/Write) Access Time:

- How quickly will you get what you want once you initiate an access?
- Analogy: As soon as he asks, his father will give him the money

### DRAM Bandwidth Limitation analogy:

- What happens if he runs out of money on Wednesday?

### **Increasing Bandwidth - Interleaving**



## **Main Memory Performance**

#### Timing model

- 1 to send address,
- 6 access time, 1 to send data
- Cache Block is 4 words
- Simple M.P. = 4 x (1+6+1) = 32
- Wide M.P. = 1 + 6 + 1 = 8
- Interleaved M.P. = 1 + 6 + 4x1 = 11

Word addresses in a 4-bank, word-interleaved memory:

| Bank 0 | Bank 1 | Bank 2 | Bank 3 |
|--------|--------|--------|--------|
| 0      | 1      | 2      | 3      |
| 4      | 5      | 6      | 7      |
| 8      | 9      | 10     | 11     |
| 12     | 13     | 14     | 15     |
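The miss penalties under this timing model can be reproduced with a short Python sketch. The function name is hypothetical, and the interleaved case assumes that all four banks overlap their accesses and then return one word per bus transfer.

```python
def miss_penalties(block_words=4, send_addr=1, access=6, send_data=1):
    """Miss penalty in clocks for the simple, wide and interleaved designs."""
    # Simple: every word pays the full address + access + transfer cost.
    simple = block_words * (send_addr + access + send_data)
    # Wide: the whole block is accessed and transferred at once.
    wide = send_addr + access + send_data
    # Interleaved (assumed model): one address, overlapped bank accesses,
    # then one bus transfer per word.
    interleaved = send_addr + access + block_words * send_data
    return simple, wide, interleaved


print(miss_penalties())
```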

#### How many banks?

number banks ≥ number clocks to access word in bank

- For sequential accesses, otherwise will return to original bank before it has next word ready
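The bank-count rule can be illustrated with a small simulation of sequential word accesses. This is a sketch under stated assumptions, not a model from the lecture: words are interleaved across banks by `address % n_banks`, one access can be issued per clock, and a bank stays busy for `access_clocks` after an access starts.

```python
def clocks_to_issue(n_words, n_banks, access_clocks):
    """Clocks needed to issue n_words sequential accesses, one per clock,
    stalling whenever the target bank is still busy."""
    free_at = [0] * n_banks            # clock at which each bank is free again
    clock = 0
    for address in range(n_words):
        bank = address % n_banks       # word-interleaved mapping
        clock = max(clock, free_at[bank])   # stall until the bank is ready
        free_at[bank] = clock + access_clocks
        clock += 1                     # next access can issue one clock later
    return clock


for banks in (4, 6, 8):
    print(banks, "banks:", clocks_to_issue(16, banks, 6), "clocks")
```

With a 6-clock access time, 6 or more banks let 16 sequential words issue in 16 clocks, while 4 banks stall each time the sequence wraps around to a bank that is still busy.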

#### Increasing DRAM => fewer chips => harder to have banks

- Growth bits/chip DRAM: 50%-60%/yr
- Nathan Myhrvold M/S: mature software growth (33%/yr for NT) ≈ growth MB/\$ of DRAM (25%-30%/yr)

## Fewer DRAMs/System over Time

(from Pete MacWilliams, Intel)

Number of DRAM chips per system, by DRAM generation:

| Minimum PC Memory Size | '86 (1 Mb) | '89 (4 Mb) | '92 (16 Mb) | '96 (64 Mb) | '99 (256 Mb) | '02 (1 Gb) |
|------------------------|------------|------------|-------------|-------------|--------------|------------|
| 4 MB                   | 32         | 8          |             |             |              |            |
| 8 MB                   |            | 16         | 4           |             |              |            |
| 16 MB                  |            |            | 8           | 2           |              |            |
| 32 MB                  |            |            |             | 4           | 1            |            |
| 64 MB                  |            |            |             | 8           | 2            |            |
| 128 MB                 |            |            |             |             | 4            | 1          |
| 256 MB                 |            |            |             |             | 8            | 2          |

- Memory per DRAM growth @ 60% / year
- Memory per System growth @ 25%-30% / year
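The chip counts in this table follow from dividing system memory by chip capacity. A minimal sketch, with a hypothetical function name and a handful of (memory size, DRAM generation) pairs taken from the table:

```python
def drams_per_system(memory_mbytes, chip_mbits):
    """Number of DRAM chips needed to build memory_mbytes of memory."""
    return memory_mbytes * 8 // chip_mbits    # 8 bits per byte


# (system memory in MB, chip capacity in Mbit) pairs from the table
for mbytes, mbits in [(4, 1), (4, 4), (16, 16), (32, 256), (256, 1024)]:
    print(f"{mbytes} MB from {mbits} Mbit chips:",
          drams_per_system(mbytes, mbits))
```

Once the count reaches one or two chips per system, there are too few chips left to spread across independent banks.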

## **Page Mode DRAM: Motivation**



#### **Fast Page Mode Operation**




|                   | DRAM                                       | Microprocessor                          |
|-------------------|--------------------------------------------|-----------------------------------------|
| Standards         | pinout, package, refresh rate, capacity,   | binary compatibility, IEEE 754, I/O bus |
| Sources           | Multiple                                   | Single                                  |
| Figures of Merit  | 1) capacity, 1a) \$/bit, 2) BW, 3) latency | 1) SPEC speed, 2) cost                  |
| Improve Rate/year | 1) 60%, 1a) 25%, 2) 20%, 3) 7%             | 1) 60%, 2) little change                |

- Reduce cell size 2.5, increase die size 1.5
- Sell 10% of a single DRAM generation
  - 6.25 billion DRAMs sold in 1996
- 3 phases: engineering samples, first customer ship (FCS), mass production
  - Fastest to FCS, mass production wins share
- Die size, testing time, yield => profit
  - Yield >> 60% (redundant rows/columns to repair flaws)

#### **DRAM History**

#### DRAMs: capacity +60%/yr, cost -30%/yr

- 2.5X cells/area, 1.5X die size in ≈3 years
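The per-generation and per-year figures are linked by compound growth, which the following sketch makes explicit. The function name is illustrative; the inputs are the slide's 2.5X cells/area, 1.5X die size and a 3-year generation.

```python
def annual_growth(factor, years):
    """Yearly growth rate that compounds to `factor` over `years`."""
    return factor ** (1 / years) - 1


per_generation = 2.5 * 1.5                   # cells/area x die size
print(f"capacity per generation: {per_generation:.2f}x")
print(f"implied yearly growth:   {annual_growth(per_generation, 3):.0%}")
print(f"60%/yr over 3 years:     {1.60 ** 3:.2f}x")
```

The two views agree roughly: 3.75X in three years is a little under the quoted 60% per year, which would compound to about 4X.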

#### '97 DRAM fab line costs \$1B to \$2B

- DRAM only: density, leakage v. speed
- Rely on increasing no. of computers & memory per computer (60% market)
  - SIMM or DIMM is replaceable unit => computers use any generation DRAM

#### Commodity, second source industry => high volume, low profit, conservative

- Little organization innovation in 20 years: page mode, EDO, Synch DRAM

#### Order of importance: 1) Cost/bit 1a) Capacity

- RAMBUS: 10X BW, +30% cost => little impact

### **Today's Situation: DRAM**

#### Commodity, second source industry $\Rightarrow$ high volume, low profit, conservative

- Little organization innovation (vs. processors) in 20 years: page mode, EDO, Synch DRAM

#### DRAM industry at a crossroads:

- Fewer DRAMs per computer over time
  - Growth bits/chip DRAM: 50%-60%/yr
  - Nathan Myhrvold M/S: mature software growth (33%/yr for NT) ≈ growth MB/\$ of DRAM (25%-30%/yr)
- Starting to question buying larger DRAMs?



### Summary:

#### Two Different Types of Locality:

- Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.
- Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon.

#### By taking advantage of the principle of locality:

- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.

#### DRAM is slow but cheap and dense:

- Good choice for presenting the user with a BIG memory system

#### SRAM is fast but expensive and not very dense:

- Good choice for providing the user FAST access time.

| Processor       | % Area (≈cost) | % Transistors (≈power) |
|-----------------|----------------|------------------------|
| Alpha 21164     | 37%            | 77%                    |
| StrongArm SA110 | 61%            | 94%                    |
| Pentium Pro     | 64%            | 88%                    |

- Pentium Pro: 2 dies per package: Proc/I\$/D\$ + L2\$
- Caches have no inherent value, only try to close performance gap