Assignment (G)
1. Consider only the Level 1 cache. We compare the performance of a pair of separate 16 KB instruction / 16 KB data caches against a 32 KB unified cache. A benchmark suite contains 25% data transfer instructions. Each data transfer instruction consists of one instruction fetch and one data transfer. Assume a hit takes 1 clock cycle and the miss penalty is 100 clock cycles. A load or store hit takes 1 extra cycle on the unified cache. The following table shows misses per 1000 instructions for instruction, data, and unified caches of different sizes (only the values used below are reproduced here):

Cache                     Misses per 1000 instructions
16 KB instruction cache   3.62
16 KB data cache          40.9
32 KB unified cache       43.2
(1) Find the percentage of instruction references among all memory references.
From the benchmark suite, we know that 25% of the instructions are data transfer instructions. Thus, assuming there are 100 instructions, there will be 100 instruction fetches and 25 data transfers. The percentage of instruction references among all memory references is

Instruction references / All memory references = 100 / (100 + 25) = 0.80 = 80%
(2) Find the miss rate of the 16 KB instruction cache and that of the 16 KB data cache, respectively.
Since every instruction access involves exactly one memory access to fetch the instruction, the instruction miss rate is

Miss rate (16 KB instruction) = (3.62 / 1000) / 1.00 = 0.00362
Since 25% of the instructions are data transfers, the data miss rate is

Miss rate (16 KB data) = (40.9 / 1000) / 0.25 = 0.1636
Since each instruction generates 1.25 memory accesses on average (equivalently, 80% of the memory accesses are instruction references), the overall miss rate of the unified cache is

Miss rate (32 KB unified) = (43.2 / 1000) / (1.00 + 0.25) = 0.03456
Average memory access time = Hit time + Miss rate × Miss penalty

Average memory access time (separate) = 80% × (1 + 0.00362 × 100) + 20% × (1 + 0.1636 × 100)
                                      = (80% × 1.362) + (20% × 17.36)
                                      = 4.5616 clock cycles

Average memory access time (unified) = 80% × (1 + 0.03456 × 100) + 20% × (1 + 1 + 0.03456 × 100)
                                     = (80% × 4.456) + (20% × 5.456)
                                     = 4.656 clock cycles
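As a quick sanity check, the following minimal C sketch reproduces the numbers above from the misses-per-1000-instructions values. The helper function amat() and the variable names are my own, not part of the assignment.

#include <stdio.h>

/* Average memory access time = hit time + miss rate * miss penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    const double miss_penalty = 100.0;          /* clock cycles           */
    const double instr_frac   = 100.0 / 125.0;  /* 80% instruction refs   */
    const double data_frac    =  25.0 / 125.0;  /* 20% data refs          */

    /* Misses per 1000 instructions, converted to miss rates per access. */
    double mr_instr   = (3.62 / 1000.0) / 1.00;          /* 16 KB I-cache */
    double mr_data    = (40.9 / 1000.0) / 0.25;          /* 16 KB D-cache */
    double mr_unified = (43.2 / 1000.0) / (1.00 + 0.25); /* 32 KB unified */

    double amat_separate = instr_frac * amat(1.0, mr_instr, miss_penalty)
                         + data_frac  * amat(1.0, mr_data, miss_penalty);

    /* The unified cache charges one extra cycle on a load/store hit. */
    double amat_unified  = instr_frac * amat(1.0, mr_unified, miss_penalty)
                         + data_frac  * amat(2.0, mr_unified, miss_penalty);

    printf("separate: %.4f cycles, unified: %.4f cycles\n",
           amat_separate, amat_unified);
    return 0;
}

Compiling and running this prints 4.5616 cycles for the separate caches and 4.6560 cycles for the unified cache, matching the hand calculation.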
From the results shown above, we can see that the overall miss rate of the separate caches is higher than that of the unified cache. This is because the unified cache has the flexibility to decide how to fill its entire capacity with either instructions or data, which lowers the chance of a miss. However, when we examine the average memory access time, the separate caches turn out to be better, since they show the lower value. This is because the separate caches offer more ports per clock cycle (one for instructions and one for data) and therefore avoid the structural hazard that costs the unified cache an extra cycle on every load or store hit.
2. Multicore processors are popular in current microprocessor designs. Some multicore processors on the market have separate L2 (or L3) caches, i.e. one cache per core, while others have L2 (L3) caches shared by all cores. (Note that we assume the L2/L3 caches are all unified, i.e. both instructions and data are in the same cache.)
Discuss the pros and cons of the separate cache architecture, using the knowledge you have acquired in the lectures and reliable information found in books, technical papers, on the web, etc.
When the number of processor cores increases, for example to 4, 8, or more, what would be the major problem(s) with the shared cache? Explain a possible solution if you have an idea.
I would like to begin with what cache memory is and why it is important. In the past, the CPU and memory had only slightly different clock speeds. As they developed, it turned out that CPU speed increased much faster than memory speed. This was largely a cost constraint: engineers stuck with slow DRAM to keep the cost low, as opposed to migrating to fast SRAM, which is very expensive. This is where cache memory first appeared; it is in fact a small SRAM. Building a small SRAM does not cost that much, yet it gives a significant performance improvement.
Cache memory works as a bridge between the CPU and main memory. The cache stores particular blocks from main memory so that they can be readily used by the CPU, thus minimizing the waiting time. As a comparison, accessing the cache is roughly 100 times faster than accessing main memory. However, since the cache is small, only a very limited amount of data can be stored in it. This is where the terms "cache hit" and "cache miss" appear: a "hit" means the data requested by the CPU is available in the cache, while a "miss" means the reverse. When a miss takes place, the cache has to retrieve the data from main memory, which is far slower. This leads to the term "miss penalty", which is the time spent waiting for the cache to fetch the data from main memory.
One way to reduce this is to introduce a second-level (L2) or third-level (L3) cache, which has a much larger capacity but is slower than the first-level (L1) cache, while still being fast compared to main memory. Since in recent days there have also been several implementations of an L4 cache, I will simply refer to the last-level cache here, whether it is L2, L3, or L4. Thus, our current discussion is about the pros and cons of having a shared or a separate last-level cache for a multicore processor.
In spite of the many advantages offered by separate L3 caches, many vendors prefer a shared L3 cache. This is perhaps due to the easier hardware implementation of a shared L3 cache, which reduces the price. However, as the number of processor cores in a single chip increases, this leads to some problems for a shared L3 cache. One major problem is cache contention: several cores compete for cache capacity, and in the worst case one core may evict vital data previously placed in the cache by another core. Another problem comes from access latency caused by bus congestion, especially if the data required by a particular core is located far away within the L3 cache. To overcome these problems, my idea would be to give each core a localized region of the shared cache, placed nearest to that core. This prevents cores from interfering with each other's cache regions. Furthermore, the size of each core's region need not be the same; it can be optimized by assessing each core's workload. When each workload is well approximated, a percentage of the cache can be assigned to each processor core. This makes the best use of the L3 cache capacity, in addition to preventing cache contention as well as bus congestion, thanks to the locality it provides.
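To make the idea concrete, here is a small C sketch of how a shared last-level cache might be partitioned by ways among cores in proportion to an estimated per-core workload. The function name partition_ways(), the weights, and the choice of partitioning by ways rather than by sets are my own illustrative assumptions, not something prescribed by the assignment.

#include <stdio.h>

#define NUM_CORES 4
#define LLC_WAYS  16   /* total associativity of the shared last-level cache */

/* Assign each core a share of the cache ways proportional to its estimated
   workload weight. Every core gets at least one way so it is never starved. */
static void partition_ways(const double weight[NUM_CORES], int ways[NUM_CORES])
{
    double total = 0.0;
    int assigned = 0;

    for (int c = 0; c < NUM_CORES; c++)
        total += weight[c];

    for (int c = 0; c < NUM_CORES; c++) {
        ways[c] = (int)(LLC_WAYS * weight[c] / total);
        if (ways[c] < 1)
            ways[c] = 1;
        assigned += ways[c];
    }

    /* Give any leftover ways (from rounding down) to the busiest core. */
    int busiest = 0;
    for (int c = 1; c < NUM_CORES; c++)
        if (weight[c] > weight[busiest])
            busiest = c;
    ways[busiest] += LLC_WAYS - assigned;
}

int main(void)
{
    /* Hypothetical workload weights, e.g. estimated misses per core. */
    double weight[NUM_CORES] = { 4.0, 2.0, 1.0, 1.0 };
    int ways[NUM_CORES];

    partition_ways(weight, ways);
    for (int c = 0; c < NUM_CORES; c++)
        printf("core %d: %d ways\n", c, ways[c]);
    return 0;
}

As far as I know, real processors implement essentially this idea in hardware as way-based cache partitioning (for example Intel's Cache Allocation Technology), although the details differ from this sketch.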
3. Discuss briefly how a "spinlock" built from the TS (Test and Set) instruction may cause performance deterioration on a bus-based multiprocessor system.
We can understand the Test and Set spinlock by imagining people queueing for a toilet. If someone is using the toilet at the moment, other people cannot use it. This is implemented by assigning a variable x that can be accessed by every processor in a shared memory system. Whenever the variable x is 1, requests from other processors are held in a loop and cannot
proceed further while x is still 1. On a bus-based multiprocessor system, this can deteriorate performance badly. We can return to the toilet analogy. A public toilet is commonly contained in a larger room where people can wait. Imagine that we have only one western-style toilet along with many Japanese-style toilets. If a lot of people want that single western-style toilet, they will normally queue inside the room. The more people queue for that toilet, the more crowded the room becomes, disturbing the flow of people using the Japanese-style toilets. In technical terms, this is referred to as "high bus traffic": while a particular processor holds the lock, the other processors keep incurring bus transactions in their attempts to acquire it. This keeps the bus busy and disturbs the flow of other processes. Another problem is referred to as "fairness". We can imagine people queueing for a toilet in no particular order. Suppose a group of big and small guys are competing to use the toilet; in this case, the big guys will most likely win. A small guy may never get to use the toilet if big guys keep coming and taking his turn. This happens in a multiprocessor system when a processor does not get a fair chance of obtaining the lock when it is released. This becomes even worse on a NUMA architecture, where the processors do not have symmetric access to main memory.
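To make the bus-traffic problem concrete, here is a minimal C11 sketch of a test-and-set spinlock, assuming a compiler with <stdatomic.h>; the names are mine, and the atomic flag plays the role of the variable x described above.

#include <stdatomic.h>

/* The shared flag plays the role of the "occupied" variable x. */
static atomic_flag lock = ATOMIC_FLAG_INIT;

static void spin_lock(void)
{
    /* atomic_flag_test_and_set is a read-modify-write, so every failed
       attempt still claims the cache line exclusively: while one core
       holds the lock, the waiting cores keep bouncing the line between
       their caches and the bus, which is the "high bus traffic" above. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;  /* spin: each iteration is another bus transaction */
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

A common refinement of this scheme is test-and-test-and-set: spin on an ordinary load until the flag looks free, and only then retry the atomic operation, so that waiting cores mostly hit in their own caches instead of flooding the bus.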