
15IF11: Multicore Technology @ PSG Tech, Coimbatore

Session-3

Dr. John Jose


Assistant Professor
Department of Computer Science & Engineering
Indian Institute of Technology Guwahati, Assam.
9th & 10th March 2019
Accessing Cache Memory
[Figure: CPU ↔ Cache ↔ Memory. A cache access takes the hit time; going to memory on a miss adds the miss penalty.]

Average memory access time (AMAT) = Hit time + (Miss rate × Miss penalty)
• Hit Time: Time to find the block in the cache and return it to the processor [indexing, tag comparison, transfer].
• Miss Rate: Fraction of cache accesses that result in a miss.
• Miss Penalty: Number of cycles required to fetch the block from the next level of the memory hierarchy.

AMAT  Thit  f miss * Tmiss


Tmiss means the extra (not total) time (or cycle) for a miss in
addition to Thit, which is incurred by all accesses
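As a quick sanity check, here is a minimal C sketch of the formula; the latency and miss-rate values are illustrative assumptions, not measurements:

    #include <stdio.h>

    /* AMAT = T_hit + f_miss * T_miss, where T_miss is the *extra* miss cost. */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Assumed values: 1-cycle hit, 5% miss rate, 100-cycle penalty. */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 100.0));  /* 6.00 */
        return 0;
    }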
How to optimize a cache?
• Reduce average memory access time
• AMAT = Hit Time + Miss Rate × Miss Penalty
• Approaches
  – Reducing the miss rate
  – Reducing the miss penalty
  – Reducing the hit time
Larger Block Size
• Larger block size to reduce miss rate
• Advantages
  – Exploits spatial locality
  – Reduces compulsory misses
• Disadvantages
  – Increases miss penalty: more time to fetch a block into the cache [bus width issue]
  – Increases conflict misses: more blocks map to the same location
  – May bring in useless data and evict useful data [cache pollution]
[Figure: Larger Block Size]
Larger Caches
• Larger cache to reduce miss rate
• Advantages
  – Reduces capacity misses
  – Can accommodate a larger memory footprint
• Drawbacks
  – Longer hit time
  – Higher cost, area, and power
[Figure: Larger Caches]
Higher Associativity
• Higher associativity to reduce miss rate
• Fully associative caches give the lowest miss rate, but at a high hit time; so raise associativity only as far as is practical.
• Advantages
  – Reduces conflict misses
  – Reduces miss rate and eviction rate
• Drawbacks
  – Increases hit time
  – More complex design than direct-mapped
  – More time to search within the set (tag comparison time); see the sketch below.
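To make the tag-comparison cost concrete, here is a minimal C sketch of a set-associative lookup; the geometry (4 ways, 64 sets, 64B blocks) and the struct layout are assumptions for illustration:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WAYS  4    /* assumed associativity      */
    #define SETS  64   /* assumed number of sets     */
    #define BLOCK 64   /* assumed block size (bytes) */

    struct line { bool valid; uint32_t tag; };
    static struct line cache[SETS][WAYS];

    static bool lookup(uint32_t addr) {
        uint32_t set = (addr / BLOCK) % SETS;   /* index bits          */
        uint32_t tag = (addr / BLOCK) / SETS;   /* remaining high bits */
        for (int w = 0; w < WAYS; w++)          /* search every way... */
            if (cache[set][w].valid && cache[set][w].tag == tag)
                return true;                    /* ...tag match = hit  */
        return false;                           /* miss                */
    }

    int main(void) {
        cache[1][2] = (struct line){ .valid = true, .tag = 0 }; /* preload */
        printf("addr 64: %s\n", lookup(64) ? "hit" : "miss");   /* hit    */
        return 0;
    }

Higher associativity widens the inner loop: more tags must be compared per lookup, which is exactly the hit-time cost noted above.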
[Figure: AMAT vs. cache associativity]
Multilevel Caches
• Multilevel caches to reduce miss penalty
• There is a widening performance gap between processors and memory: caches should be faster to keep pace with the processor, AND larger to bridge the gap to main memory.
• Solution: add another level of cache between the first cache and memory.
  – The first-level cache (L1) can be small enough to match the clock cycle time of the fast processor. [Low hit time]
  – The second-level cache (L2) can be large enough to capture many accesses that would otherwise go to main memory, lessening the effective miss penalty. [Low miss rate] (See the two-level AMAT sketch below.)
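The L1 miss penalty is then itself an AMAT expression for the L2 lookup. A minimal sketch, with all latencies and miss rates as made-up example values:

    #include <stdio.h>

    /* Two-level AMAT: the L1 miss penalty is the AMAT of the L2 lookup.
     * AMAT = T_L1 + m_L1 * (T_L2 + m_L2 * T_mem)                        */
    int main(void) {
        double t_l1 = 2,  m_l1 = 0.10;  /* assumed: 2-cycle L1 hit, 10% miss  */
        double t_l2 = 12, m_l2 = 0.25;  /* assumed: 12-cycle L2 hit, 25% miss */
        double t_mem = 200;             /* assumed: 200-cycle memory penalty  */
        double amat = t_l1 + m_l1 * (t_l2 + m_l2 * t_mem);
        printf("Two-level AMAT = %.1f cycles\n", amat); /* 2 + 0.1*62 = 8.2 */
        return 0;
    }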
[Figure: Multilevel Caches]
Components of a Modern Computer
[Figure: components of a modern computer]
Main Memory in the System
[Figure: a four-core chip. Each core (Core 0–3) has a private L2 cache; all cores share an L3 cache. An on-chip DRAM memory controller drives the DRAM interface, which connects to the DRAM banks.]
DRAM (Dynamic Random Access Memory)
DRAM vs SRAM
• DRAM
  – Slower access (capacitor)
  – Higher density (1T-1C cell)
  – Lower cost
  – Requires refresh (power, performance, circuitry)
  – Manufacturing requires putting capacitor and logic together
• SRAM
  – Faster access (no capacitor)
  – Lower density (6T cell)
  – Higher cost
  – No need for refresh
  – Manufacturing compatible with logic process (no capacitor)
DRAM Subsystem Organization
Channel → DIMM → Rank → Chip → Bank → Row → Column → Cell
The DRAM Subsystem
[Figure: the processor connects over one or more memory channels to DIMMs (dual in-line memory modules); each channel hosts its own DIMM(s).]
Breaking down a DIMM
[Figure: side, front, and back views of a DIMM. The front of the DIMM carries Rank 0, a collection of 8 chips; the back carries Rank 1.]

Rank
[Figure: Rank 0 (front) and Rank 1 (back) each present a 64-bit data interface <0:63>. Both ranks share the memory channel's Addr/Cmd and Data <0:63> lines; chip select CS <0:1> picks which rank responds.]
Breaking down a Rank
[Figure: Rank 0 is built from Chips 0–7. Each chip drives 8 of the 64 data bits: Chip 0 drives <0:7>, Chip 1 drives <8:15>, ..., Chip 7 drives <56:63>, together forming Data <0:63>.]
Breaking down a Chip
[Figure: a chip (e.g., Chip 0) contains multiple banks (Bank 0, ...), all sharing the chip's 8-bit data interface <0:7>.]
Breaking down a Bank
[Figure: a bank is a 2D array of cells, e.g., 16k rows of 2kB each (2k one-byte columns). An activated row is latched into the row buffer; column accesses then read 1B at a time over the chip's <0:7> interface.]
DRAM Rank
[Figure: Bank 0 of a rank built from Chips 0–3, each contributing 8 bits to a 32-bit data path.]
• Rank: a set of chips that respond to the same command and the same address at the same time, but with different pieces of the requested data.
• It is easier to produce an 8-bit chip than a 32-bit chip.
• So produce 8-bit chips, but control and operate them together as a rank to get 32 bits of data in a single read (a sketch follows).
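A minimal sketch of the idea, assembling one 32-bit word from four hypothetical x8 chips; the function name and data values are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: a rank of four x8 chips delivers one 32-bit word per read.
     * Chip i contributes bits <8i : 8i+7>.                              */
    static uint32_t rank_read(const uint8_t chip_data[4]) {
        uint32_t word = 0;
        for (int i = 0; i < 4; i++)
            word |= (uint32_t)chip_data[i] << (8 * i);
        return word;
    }

    int main(void) {
        uint8_t chips[4] = {0xDD, 0xCC, 0xBB, 0xAA};  /* one byte per chip */
        printf("0x%08X\n", rank_read(chips));          /* 0xAABBCCDD       */
        return 0;
    }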
DRAM Bank Operation
[Figure: a bank with a row decoder, cell array, row buffer, and column mux. Access sequence: (Row 0, Column 0) finds the row buffer empty, so Row 0 is activated into the row buffer; (Row 0, Column 1) and (Row 0, Column 85) then hit in the open row, needing only a column access; (Row 1, Column 0) is a row conflict: open Row 0 must be closed before Row 1 can be activated.] A toy model of this behavior follows.
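A minimal C model of the bank state machine; the latency constants are placeholders, not datasheet timings:

    #include <stdio.h>

    /* Toy row-buffer model: returns the access latency in "cycles".
     * T_CAS/T_RAS/T_PRE are illustrative placeholder values.         */
    enum { T_CAS = 15, T_RAS = 15, T_PRE = 15 };

    static int open_row = -1;  /* -1: row buffer empty (bank precharged) */

    static int access_bank(int row) {
        if (row == open_row) return T_CAS;   /* row hit: column access only */
        if (open_row == -1) { open_row = row; return T_RAS + T_CAS; } /* empty */
        open_row = row;                      /* row conflict: close, reopen  */
        return T_PRE + T_RAS + T_CAS;
    }

    int main(void) {
        /* (Row0,Col0) (Row0,Col1) (Row0,Col85) (Row1,Col0) */
        int rows[] = {0, 0, 0, 1};
        for (int i = 0; i < 4; i++)
            printf("access row %d: %d cycles\n", rows[i], access_bank(rows[i]));
        return 0;   /* prints 30, 15, 15, 45 */
    }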
Transferring a Cache Block
[Figure sequence: a 64B cache block at physical address 0x40 in the physical memory space (0x00–0xFFFF…F) is mapped to Channel 0, DIMM 0, Rank 0. Each of the rank's 8 chips supplies 8 bits of Data <0:63> (Chip 0 → <0:7>, Chip 1 → <8:15>, ..., Chip 7 → <56:63>). Row 0 is activated in all chips, then Column 0, Column 1, ... are read in turn; each column read returns 8B of the block, 1B per chip.]
A 64B cache block takes 8 I/O cycles to transfer; during the process, 8 columns are read sequentially (a sketch follows).
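A minimal sketch of the transfer loop, assuming the 8-chip x8 rank above; the data values are stand-ins for whatever the chips return:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: 8 column reads (I/O cycles), each delivering 8B of the
     * 64B block, 1B per chip. The byte values are illustrative stubs. */
    int main(void) {
        uint8_t block[64];
        for (int col = 0; col < 8; col++)          /* 8 sequential columns */
            for (int chip = 0; chip < 8; chip++)   /* 8 chips in parallel  */
                block[col * 8 + chip] = (uint8_t)(col * 8 + chip); /* stub */
        printf("moved %d bytes in 8 I/O cycles\n", (int)sizeof block);
        return 0;
    }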


Multiple Banks and Channels
• Multiple banks
  – Enable concurrent DRAM accesses
  – Bits in the address determine which bank an address resides in
• Multiple independent channels
  – Fully parallel, since they have separate data buses
  – Increased bus bandwidth
  – But: more wires, area, and power consumption
  – More pins on the on-chip memory controller
• Enabling more concurrency requires reducing
  – Bank conflicts
  – Channel conflicts
[Figure: multiple banks to reduce delay]
Address Mapping (Single Channel)
• Single-channel system with an 8B memory bus
  – 2GB memory, 8 banks, 16K rows and 2K columns per bank
• Row interleaving: consecutive rows of memory in consecutive banks; accesses to consecutive cache blocks are serviced in a pipelined manner.

  | Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits) |

• Cache block interleaving: consecutive cache block addresses in consecutive banks (64-byte cache blocks); accesses to consecutive cache blocks proceed in parallel. The 11 column bits split into a high part (8 bits) and a low part (3 bits) around the bank bits. A decoding sketch follows.

  | Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Column (3 bits) | Byte in bus (3 bits) |
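A C sketch of both decodings for the 31-bit address above; the field widths come from the example, and the masks/shifts follow directly from them:

    #include <stdint.h>
    #include <stdio.h>

    /* Widths: row 14, bank 3, column 11 (= high 8 + low 3), byte-in-bus 3. */
    static void row_interleave(uint32_t a) {
        uint32_t byte = a & 0x7,         col = (a >> 3) & 0x7FF;
        uint32_t bank = (a >> 14) & 0x7, row = (a >> 17) & 0x3FFF;
        printf("row-ilv:   row=%u bank=%u col=%u byte=%u\n", row, bank, col, byte);
    }

    static void block_interleave(uint32_t a) {
        uint32_t byte = a & 0x7,        lo = (a >> 3) & 0x7;
        uint32_t bank = (a >> 6) & 0x7, hi = (a >> 9) & 0xFF;
        uint32_t row = (a >> 17) & 0x3FFF;
        printf("block-ilv: row=%u bank=%u col=%u byte=%u\n",
               row, bank, (hi << 3) | lo, byte);
    }

    int main(void) {
        /* Consecutive 64B blocks: 0x00 and 0x40 */
        block_interleave(0x00);  /* bank 0                                   */
        block_interleave(0x40);  /* bank 1: consecutive blocks, diff. banks  */
        row_interleave(0x40);    /* same bank 0, column 8: pipelined, not ||  */
        return 0;
    }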
Address Mapping (Multiple Channels)
The channel bit (C) can be placed at different positions in the address:

  | C | Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits) |
  | Row (14 bits) | C | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits) |
  | Row (14 bits) | Bank (3 bits) | C | Column (11 bits) | Byte in bus (3 bits) |
  | Row (14 bits) | Bank (3 bits) | Column (11 bits) | C | Byte in bus (3 bits) |
Basic DRAM Operation
• CPU → controller transfer time
• Controller latency
  – Queuing and scheduling delay at the controller
  – Access converted into basic commands
• Controller → DRAM transfer time
• DRAM bank latency
  – Simple CAS (column address strobe) if the row is "open", OR
  – RAS (row address strobe) + CAS if the array is precharged, OR
  – PRE + RAS + CAS (worst case)
• DRAM → controller transfer time
  – Bus latency (BL)
• Controller → CPU transfer time
A worked sum appears below.
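Summing the components gives the end-to-end read latency; every number below is an assumed placeholder, not a real datasheet value:

    #include <stdio.h>

    /* End-to-end read latency as a sum of the components above. */
    int main(void) {
        int cpu_to_ctrl = 5, ctrl_queue = 10, ctrl_to_dram = 5;
        int bank = 15 + 15 + 15;   /* worst case: PRE + RAS + CAS */
        int bus = 8, dram_to_ctrl = 5, ctrl_to_cpu = 5;
        printf("total = %d cycles\n",
               cpu_to_ctrl + ctrl_queue + ctrl_to_dram + bank
               + bus + dram_to_ctrl + ctrl_to_cpu);   /* 83 */
        return 0;
    }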
[Figure: DRAM Controller Overview]
DRAM Scheduling Policies
• FCFS (first come, first served)
  – Oldest request first
• FR-FCFS (first ready, first come first served)
  – Row-hit requests first, then oldest first
  – Goal: maximize the row buffer hit rate → maximize DRAM throughput
• In practice, scheduling is done at the command level
  – Column commands (read/write) are prioritized over row commands (activate/precharge)
  – Within each group, older commands are prioritized over younger ones
A comparator sketch follows.
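A minimal sketch of the FR-FCFS priority rule as a comparator; the request struct is an illustrative assumption:

    #include <stdbool.h>
    #include <stdio.h>

    /* FR-FCFS: row hits first, then oldest first. */
    struct req { bool row_hit; long arrival; };

    /* Returns true if a should be scheduled before b. */
    static bool fr_fcfs_before(struct req a, struct req b) {
        if (a.row_hit != b.row_hit) return a.row_hit;  /* row-hit first */
        return a.arrival < b.arrival;                  /* then oldest   */
    }

    int main(void) {
        struct req young_hit = { true, 20 }, old_miss = { false, 5 };
        printf("%s\n", fr_fcfs_before(young_hit, old_miss)
                     ? "young row-hit wins" : "old miss wins");
        return 0;
    }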
DRAM Scheduling Policies
• A scheduling policy is essentially a prioritization order
• Prioritization can be based on
  – Request age
  – Row buffer hit/miss status
  – Request type (prefetch, read, write)
  – Request mode (load miss or store miss)
  – Requestor type (CPU, DMA, GPU)
  – Request criticality
    · Is it the oldest miss in the core?
    · How many instructions in the core depend on it?
    · Will it stall the processor?
  – Interference caused to other cores
Row Buffer Management Policies
• Open row
  – Keep the row open after an access
  – Next access might need the same row → row hit
  – Next access might need a different row → row conflict, wasted energy
• Closed row
  – Close the row after an access (if no other request already in the request buffer needs the same row)
  – Next access might need a different row → avoids a row conflict
  – Next access might need the same row → extra activate latency
• Adaptive policies: predict whether or not the next access to the bank will be to the same row
A latency comparison sketch follows.
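A quick latency comparison of the two static policies, reusing the toy timing constants from the bank-operation sketch (placeholders, not datasheet values):

    #include <stdio.h>

    /* Per-access latency under open- vs closed-row policy (toy numbers). */
    enum { T_CAS = 15, T_RAS = 15, T_PRE = 15 };

    int main(void) {
        /* Open row: a hit pays CAS only; a conflict pays PRE+RAS+CAS. */
        printf("open-row:   hit=%d conflict=%d\n", T_CAS, T_PRE + T_RAS + T_CAS);
        /* Closed row: the bank is precharged eagerly, so every access
         * pays RAS+CAS, regardless of which row it targets.           */
        printf("closed-row: any=%d\n", T_RAS + T_CAS);
        return 0;
    }

Open row wins when accesses tend to hit the same row; closed row wins when they tend not to, which is exactly what the adaptive policies try to predict.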
DRAM Refresh
• DRAM capacitor charge leaks over time
• The memory controller must read each row periodically to restore the charge
  – Activate + precharge each row every N ms
  – Typical N = 64 ms (refresh interval)
• Implications on performance?
  – A DRAM bank is unavailable while it is being refreshed
  – Long pause times: if we refresh all rows in a burst, every 64 ms the DRAM is unavailable until the refresh ends
DRAM Refresh
• Burst refresh: all rows refreshed immediately after one another
• Distributed refresh: each row refreshed at a different time, at regular intervals
• Distributed refresh eliminates long pause times (a worked interval follows)
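With distributed refresh, one row is refreshed every refresh-window / row-count. Using the numbers from these slides (16k rows per bank, 64 ms window):

    #include <stdio.h>

    /* Distributed refresh interval: refresh one row every window/rows. */
    int main(void) {
        double window_us = 64000.0;   /* 64 ms refresh window */
        int rows = 16 * 1024;         /* 16k rows per bank    */
        printf("one row every %.2f us\n", window_us / rows);  /* ~3.91 us */
        return 0;
    }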
johnjose@iitg.ac.in
http://www.iitg.ac.in/johnjose/
