Assignment (G)

High Performance Computing

1. Consider only the Level 1 cache. We compare the performance of a pair of separate 16KB-instruction/16KB-data caches against a 32KB unified cache. A benchmark suite contains 25% data transfer instructions. Each data transfer instruction consists of one instruction fetch and one data transfer. Assume a hit takes 1 clock cycle and the miss penalty is 100 clock cycles. A load or store hit takes 1 extra clock cycle on a unified cache. The following table shows misses per 1000 instructions for instruction, data, and unified caches of different sizes.
Size    Instruction cache    Data cache    Unified cache
16KB    3.62                 40.9          50.6
32KB    1.27                 37.8          43.2

(1) Find the percentage of instruction references among all memory references.

From the benchmark suite, we know that 25% of the instructions are data transfer instructions. Thus, assuming there are 100 instructions, there will be 100 instruction fetches and 25 data transfers. The percentage of instruction references among all memory references is

%instruction references = instruction references / (instruction references + data references) = 100 / (100 + 25) = 80%

(2) Find the miss rate of the 16KB instruction cache and that of the 16KB data cache, respectively.

Miss rate is given by

Miss rate = (Misses per 1000 instructions / 1000) / (Memory accesses per instruction)

Since every instruction has exactly one memory access to fetch the instruction, the instruction miss rate is

Miss rate(16KB instruction) = (3.62 / 1000) / 1.00 = 0.00362
Since 25% of the instructions are data transfers, the data cache sees 0.25 memory accesses per instruction, so the data miss rate is

Miss rate(16KB data) = (40.9 / 1000) / 0.25 = 0.1636

(3) Find the overall miss rate of the separate caches.

Since 80% of the memory accesses are instruction references, the overall miss rate is

(80% × 0.00362) + (20% × 0.1636) = 0.035616

(4) Find the miss rate of the 32KB unified cache.

Miss rate(32KB unified) = (43.2 / 1000) / (1.00 + 0.25) = 0.03456

(5) Find the average memory access time of the separate caches.

Average memory access time is given by

Average memory access time = Hit time + Miss rate × Miss penalty

Average memory access time(separate)
= 80% × (1 + 0.00362 × 100) + 20% × (1 + 0.1636 × 100)
= (80% × 1.362) + (20% × 17.36)
= 4.5616 clock cycles

(6) Find the average memory access time of the unified cache.

Average memory access time(unified)
= 80% × (1 + 0.03456 × 100) + 20% × (1 + 1 + 0.03456 × 100)
= (80% × 4.456) + (20% × 5.456)
= 4.656 clock cycles

From the results above, we can see that the overall miss rate of the separate caches is higher than that of the unified cache. This is because the unified cache has the flexibility to devote its entire capacity to whatever mix of instructions and data the program needs, which lowers the chance of a miss. However, when we look at the average memory access time, the separate caches turn out to be better, showing the lower value. This is because the separate caches provide an extra port per clock cycle: instruction fetches and data accesses can proceed in parallel, avoiding the structural hazard that costs the unified cache an extra cycle on every load or store hit.
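As a quick sanity check of the arithmetic above, the short C program below recomputes the miss rates and average memory access times for both configurations (a minimal sketch; the variable names and program structure are my own, only the numbers come from the problem statement).

#include <stdio.h>

int main(void) {
    /* Workload: per 100 instructions there are 100 instruction fetches and 25 data accesses */
    const double accesses_per_instr = 1.25;                 /* memory accesses per instruction */
    const double frac_instr = 1.00 / accesses_per_instr;    /* = 0.80 instruction references   */
    const double frac_data  = 0.25 / accesses_per_instr;    /* = 0.20 data references          */
    const double miss_penalty = 100.0;                      /* clock cycles                    */

    /* Misses per 1000 instructions, taken from the table */
    const double mpki_i16 = 3.62, mpki_d16 = 40.9, mpki_u32 = 43.2;

    /* Miss rate = (misses per 1000 instructions / 1000) / (memory accesses per instruction) */
    double mr_i16 = (mpki_i16 / 1000.0) / 1.00;                /* 0.00362  */
    double mr_d16 = (mpki_d16 / 1000.0) / 0.25;                /* 0.1636   */
    double mr_u32 = (mpki_u32 / 1000.0) / 1.25;                /* 0.03456  */
    double mr_sep = frac_instr * mr_i16 + frac_data * mr_d16;  /* 0.035616 */

    /* AMAT = hit time + miss rate x miss penalty; the unified cache adds 1 cycle to data hits */
    double amat_sep = frac_instr * (1.0 + mr_i16 * miss_penalty)
                    + frac_data  * (1.0 + mr_d16 * miss_penalty);   /* 4.5616 */
    double amat_uni = frac_instr * (1.0 + mr_u32 * miss_penalty)
                    + frac_data  * (2.0 + mr_u32 * miss_penalty);   /* 4.656  */

    printf("overall miss rate (separate) = %.6f\n", mr_sep);
    printf("miss rate (32KB unified)     = %.5f\n", mr_u32);
    printf("AMAT separate = %.4f cycles, AMAT unified = %.4f cycles\n", amat_sep, amat_uni);
    return 0;
}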
2. Multicore processors are popular in current microprocessor designs. Some multicore processors on the market have separate L2 (or L3) caches, i.e. one cache per core, while others have L2 (L3) caches shared by all cores. (Note that we assume the L2/L3 caches are all unified, i.e. both instructions and data are in the same cache.)

Discuss the pros and cons of the separate cache architecture, using the knowledge you have acquired in the lectures and reliable information found in books, technical papers, on the web, etc.

When the number of processor cores increases, for example to 4, 8, or more, what would be the major problem(s) with the shared cache? Explain a possible solution if you have an idea.

I would like to begin with what cache memory is and why it is important. In the past, the CPU and main memory ran at only slightly different clock speeds. As they developed, however, CPU speed increased much faster than memory speed. This was largely because engineers stuck with slow DRAM to keep costs low, rather than migrating to fast but very expensive SRAM. This is where cache memory first appeared: it is essentially a small SRAM. Building a small SRAM does not cost that much, yet it gives a significant performance improvement.

Cache memory works as a bridge between the CPU and main memory. The cache stores particular blocks from main memory so that they can be readily used by the CPU, minimizing the waiting time. For comparison, accessing the cache is roughly 100 times faster than accessing main memory. However, since the cache is small, only a very limited amount of data can be stored in it. This is where the terms "cache hit" and "cache miss" come from: a "hit" means the data requested by the CPU is available in the cache, while a "miss" means the reverse. When a miss takes place, the cache has to retrieve the data from main memory, which is far slower. This leads to the term "miss penalty", the time spent waiting for the cache to fetch the data from main memory. One way to reduce this penalty is to introduce a second-level (L2) or third-level (L3) cache, which has a much larger capacity and is slower than the first-level (L1) cache, but is still much faster than main memory. Since in recent years there have also been several implementations of an L4 cache, I will frame the discussion in terms of the last-level cache, whether it is L2, L3, or L4. Thus, our discussion here is about the pros and cons of having a shared or a separate last-level cache in a multicore processor.
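To make the benefit of adding an L2/L3 level concrete, the usual multi-level formula is AMAT = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × MissPenalty_memory). The short C sketch below evaluates it with purely illustrative numbers of my own choosing (they are not taken from the assignment):

#include <stdio.h>

int main(void) {
    /* Illustrative, assumed parameters (not from the assignment) */
    double l1_hit = 1.0,  l1_miss_rate = 0.05;   /* L1: 1-cycle hit, 5% miss rate   */
    double l2_hit = 10.0, l2_miss_rate = 0.20;   /* L2: 10-cycle hit, 20% miss rate */
    double mem_penalty = 100.0;                  /* main-memory access time         */

    /* Without an L2, every L1 miss pays the full main-memory penalty */
    double amat_no_l2 = l1_hit + l1_miss_rate * mem_penalty;
    /* With an L2, an L1 miss first tries the L2; only an L2 miss goes to memory */
    double amat_l2 = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty);

    printf("AMAT without L2 = %.2f cycles\n", amat_no_l2);  /* 6.00 */
    printf("AMAT with L2    = %.2f cycles\n", amat_l2);     /* 2.50 */
    return 0;
}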

Figure 1. Multicore processor die map (source: http://www.techwarelabs.com/)


Figure 1 gives an example of how an L3 cache can be placed on a die. Although it is not the only way to provide a shared L3 for a multicore processor, it is the current common practice among vendors. The alternative is to have a separate L3 cache for every core; referring to Figure 1, imagine the cache being cut into 4 private caches. With a shared L3 cache, it is possible that the data required by core 0 resides in a cache block near core 3, far away from core 0. Having a private L3 cache localizes the required block, so the time to access a block in the L3 cache is smaller. The first advantage of separate private L3 caches is therefore lower access latency. Another advantage is that, since every cache has its own bus connecting it to its core, the chance of bus congestion is reduced, which in turn reduces the L3 miss penalty. Furthermore, separate caches reduce the probability of cache contention, meaning two different cores will not overwrite vital data that another core has placed in a specific block location. However, separate L3 caches also have disadvantages. One major drawback is that the large L3 capacity is sometimes not fully used. Since the work of each core depends strongly on the program, it is common for programs to give the cores unequal burdens, i.e. some cores work harder than others. The lightly loaded cores will not fill their available L3 capacity, yet there is no way for other cores to use this spare capacity. This leads to a higher L3 miss rate for the separate implementation.

In spite of the many advantages offered by separate L3 caches, many vendors prefer a shared L3 cache. This is perhaps because a shared L3 cache is easier to implement in hardware, which reduces the price. However, as the number of processor cores on a single chip increases, a shared L3 cache runs into problems. One major problem is cache contention, as explained in the previous paragraph: cores compete for cache capacity, and in the worst case they may overwrite vital data previously written by other cores. Another problem is access latency caused by bus congestion, when the data required by a particular core is located far away within the L3 cache. To overcome these problems, my idea would be to give each core its own region of the shared cache, located nearest to that core. This prevents cores from interfering with each other's cache regions. Furthermore, the size of each region need not be the same for every core: it can be optimized by assessing each core's workload. When the workloads are well approximated, a percentage of the cache can be assigned to each processor core. This maximizes the use of the L3 capacity, in addition to preventing cache contention as well as bus congestion thanks to the locality it provides. A sketch of this idea is given below.
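The following is a minimal sketch of that idea, assuming a way-partitioned shared cache in which each core is granted a number of ways proportional to its estimated workload. Everything here (the function name, the way counts, the workload numbers) is hypothetical and only meant to illustrate the policy, not any particular vendor's mechanism.

#include <stdio.h>

#define NUM_CORES  4
#define TOTAL_WAYS 16   /* assumed associativity of the shared last-level cache */

/* Grant each core a share of the cache ways proportional to its measured
 * workload (e.g. recent miss counts or memory traffic). A runtime policy
 * could re-run this periodically and reprogram the cache accordingly. */
void partition_ways(const double workload[NUM_CORES], int ways[NUM_CORES]) {
    double total = 0.0;
    for (int c = 0; c < NUM_CORES; c++)
        total += workload[c];

    int assigned = 0;
    for (int c = 0; c < NUM_CORES; c++) {
        ways[c] = (int)(TOTAL_WAYS * workload[c] / total);
        if (ways[c] < 1)
            ways[c] = 1;                 /* every core keeps at least one way */
        assigned += ways[c];
    }

    /* Hand any leftover ways to the busiest core */
    int busiest = 0;
    for (int c = 1; c < NUM_CORES; c++)
        if (workload[c] > workload[busiest])
            busiest = c;
    ways[busiest] += TOTAL_WAYS - assigned;
}

int main(void) {
    double workload[NUM_CORES] = {8.0, 1.0, 4.0, 2.0};   /* assumed relative burdens */
    int ways[NUM_CORES];
    partition_ways(workload, ways);
    for (int c = 0; c < NUM_CORES; c++)
        printf("core %d: %d of %d ways\n", c, ways[c], TOTAL_WAYS);
    return 0;
}

Conceptually this resembles the way-partitioning controls that some commercial processors expose for their shared last-level cache; the point is simply that the partition can follow the workload instead of being fixed.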

3. Discuss briefly how a "spinlock" implemented with the TS (Test and Set) instruction may cause performance deterioration on a bus-based multi-processor system.

We can understand the Test-and-Set spinlock by imagining people queueing for a toilet. If someone is using the toilet at a given moment, nobody else can use it. The lock is implemented with a variable x that can be accessed by every processor in the shared-memory system. Whenever x is 1, requests from other processors are trapped in a loop and cannot
proceed while x is still 1. On a bus-based multi-processor system, this can deteriorate performance badly. Returning to the toilet analogy: a public toilet is usually inside a larger room where people can wait. Suppose there is only one Western-style toilet alongside many Japanese-style toilets, and a lot of people want that single Western-style toilet. They will queue inside the room, and the more people queue for it, the more crowded the room becomes, disturbing the flow of people using the Japanese-style toilets. In technical terms this is called "high bus traffic": while one processor holds the lock, the other processors keep issuing bus transactions in their attempts to acquire it, keeping the bus busy and disturbing the flow of other processes. Another problem is "fairness". Imagine queueing for the toilet without any order: if a group of big and small guys compete for the toilet, the big guys will most likely win, and a small guy may never get his turn if big guys keep arriving and pushing in front of him. The same happens in a multiprocessor system when a processor never gets a fair chance to acquire the lock once it is released. This becomes even worse on a NUMA architecture, where processors do not have symmetric access to main memory.
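To make the mechanism concrete, here is a minimal C11 sketch of a Test-and-Set spinlock together with the common test-and-test-and-set refinement (the function names and structure are my own; only the idea of spinning on a shared flag comes from the discussion above). The naive version performs an atomic exchange on every spin iteration, so each waiting processor keeps generating bus transactions even while the lock is held; the refined version spins on a plain read of its locally cached copy and only retries the atomic exchange when the lock looks free, which is the usual way to reduce the bus traffic described above.

#include <stdatomic.h>
#include <stdbool.h>

/* The shared lock word "x": false = free, true = held */
atomic_bool x = false;

/* Naive Test-and-Set spinlock: every iteration executes an atomic exchange,
 * which needs exclusive ownership of the cache line and therefore a bus
 * transaction -- even while another processor still holds the lock. */
void ts_lock(atomic_bool *lock) {
    while (atomic_exchange(lock, true)) {
        /* spin: keep hammering the bus until the exchange returns false */
    }
}

/* Test-and-Test-and-Set: spin on an ordinary load first, so waiting
 * processors hit in their own caches and stay off the bus; only attempt
 * the atomic exchange when the lock appears to be free. */
void tts_lock(atomic_bool *lock) {
    for (;;) {
        while (atomic_load_explicit(lock, memory_order_relaxed)) {
            /* local spinning: no bus traffic while the lock stays held */
        }
        if (!atomic_exchange(lock, true))
            return;                      /* lock acquired */
    }
}

void ts_unlock(atomic_bool *lock) {
    atomic_store(lock, false);           /* release: set x back to 0 (free) */
}

Note that test-and-test-and-set reduces the bus traffic problem but not the fairness problem; queue-based locks, for example ticket locks, are the usual answer to that.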
