Computer Architecture Arm Subtitles
ARM
Module 1, Intro
[music] I'm Richard Grisenthwaite, Chief Architect at Arm. What does a Chief
Architect do? The Chief Architect is responsible for the evolution of the Arm
architecture. One of the things about an architecture is, you want your
software to continue running on future products. So it's a very long term
commitment, and the role of the chief architect is essentially to curate the
architecture over time, adding new features as needed by the partner. Why are
microprocessors so important and ubiquitous? I think the fundamental reason
is that they are a general purpose device. They describe a basic instruction
set that does the stuff that you can construct pretty much any program out
of. Indeed, they're Turing complete, which means, by definition, you can
solve any computable problem with the processor. They're not customized to
any particular usage, but they do two
things really rather well, which are data processing and decision making. In
other words, having, for example, added two numbers and compared them, you're
then taking a branch and making a decision based off that. Now in reality, an
awful lot of general purpose problems are ones that involve: work out some
value or some criteria, compare against that, and then make a decision on it.
And essentially that allows you to solve a huge number of different problems
and that's why the microprocessor has become so ubiquitous. What is the
difference between architecture and microarchitecture? Architecture is what
it has to do, and microarchitecture is how it does it. So for example, we
will define in the architecture a set of instructions that do "add" or "load"
or "branch", but it says nothing about whether you've got a ten-stage
pipeline or a three-stage pipeline. It says nothing about branch prediction.
All of those sort of features which actually determine the performance of the
device; all of those are part of the microarchitecture. The key point is that
software that is written to be compliant on the architecture will actually
run on lots of different implementations, lots of different
microarchitectures that essentially implement that architecture in different
ways, choosing a trade-off between the power, the performance and the area of
the design. What does it take to build a commercially successful processor
today? If you actually start from scratch with an established architecture
but want to create a new microarchitecture, we reckon, for a high-end
processor, you're talking about three-to-four hundred person-years' worth of
work in order to create a properly competitive multi-issue out-of-order
machine compared with the state-of-the-art that you can get from Arm. In
addition to that — and that's just to come up with the RTL — in addition to
that, you've then got to do the implementation. If you're going to build that
on a three-nanometer process, the leading edge to get the best possible
performance, you're talking about tens of millions of dollars for the mask
sets. There's a whole bunch of software you've got to go and build on top of
that and do all of that work. When we've looked at companies that are
interested in actually starting up, taking an Arm architecture license — say
I want to go and build my own business — we reckon that you need to be
prepared to spend getting on for half a billion dollars before you're
actually going to be successful because it takes time. Your first product is
not necessarily going to be fully competitive because it would be slightly
surprising if the first thing that you built was as good as what people have
been building for many years. It takes time to build up that expertise and
skills. And so you're going to see a couple of iterations before even the
best teams end up really being competitive. And so as I say, I think if you
went from nothing and wanted to essentially create something, and that's
using the Arm architecture with all of its existing software and so on,
you're talking the best part of half a billion. What is the Arm business
model? The Arm business model fundamentally is the licensing of semiconductor
IP to as wide a range of companies as possible and in as many ways as
possible, in order to maximize the uptake of the IP. When we talk about IP,
we're talking about essentially designs for processors and we license that
both as an architecture license, where you effectively gain the right to
build your own processor to the specification of the Arm architecture, or an
implementation license where we are licensing, actually, an implementation
that is compliant with the Arm architecture in the form of register
transfer level code, or RTL. What makes Arm different from its competitors?
So if you go back to where we were when Arm started out in the early 90s,
there were
many, many different architectures available and they were all kind of doing
the same thing but slightly differently. They would have different
instructions, different instruction sets, and so software needed to be
ported. A huge part of Arm's success actually comes from the fact that we
created a business model of licensing the IP to make it very easy for people
to build processors, to incorporate the designs into their SoCs, into their
systems. And that then made it very straightforward for people to be able to
use the Arm architecture. Now what this then meant was, people said: I will
port more software to this because it's more widely available and you get
this positive feed-forward effect whereby more availability of hardware
encourages more software, encourages more hardware, and so on. And
essentially that meant that a lot of people said: there's no point in me
having a different architecture. I'm not getting a valuable difference from
doing that. All I've got is, kind of, a needless change to the software that
I need to make. So actually the whole model we pursued at Arm was:
let's license our IP to a great number of different players to come up with
different solutions to meet the form factor of a camera or a PDA, or whatever
it was back in the day. Those things made it much more straightforward for
people to incorporate our technology. [music]
Module 1, Video 1
Computer architecture is the study of tools and techniques that help us to
design computers. More precisely, it helps us understand how to meet the
needs of particular markets and applications, using the technology and
components that are available. For example, we might need to produce the chip
at the heart of a smartphone using 10 billion transistors and a power of less
than two Watts. How do we achieve the performance a customer wants? The
challenge is a fascinating one, and one that requires a broad understanding.
For example, what target market are we designing for? What are the
characteristics of the applications to be run? How will programming languages
and compilers interact with the microprocessor? How best to craft the narrow
interface between the hardware and software? How to organize the components
of our microprocessor? And how to design a circuit, given the characteristics
of individual transistors and wires? Like many design problems, computer
architecture requires many trade-offs to be made and evaluated. Each design
decision will impact trade-offs between size, performance, power, security,
complexity, and cost. Trade-offs must be re-evaluated regularly, due to
advances in fabrication technology, applications, and computer architecture.
Computer architecture must be grounded in quantitative techniques and
experimentation, but the endless number of possible designs means that the
field depends on a high degree of human ingenuity and art. Perhaps
surprisingly, the earliest computers and today's most advanced machines have
much in common. They both execute a stored program constructed from machine
instructions. These instructions perform simple operations such as adding two
numbers. Nevertheless, greater numbers of faster transistors, and the
application of a relatively small number of computer architecture concepts,
have enabled us to construct machines that can perform billions of
instructions per second, and shrink these machines to fit in hand-held
battery-powered devices. It is this rapid progress that has supported
breakthroughs in machine learning, drug discovery, climate modeling, and
supports our modern world where computation and storage are almost free. The
task of designing a microprocessor is split into different levels of
abstraction: "Architecture;" "Microarchitecture;" and "Implementation."
"Architecture" focuses on the contract between programmers and hardware. It
allows compatible families of microprocessor products to be built. The ARMv8-
A architecture is an example of this. Architecture includes the "Instruction
Set Architecture," or ISA, which defines what instructions exist. It also
defines precisely the behavior of memory and other features needed to build a
complete processor. "Microarchitecture" focuses on the organization and
structure of the major components of a microprocessor. It has to match the
rules set by the architecture. The microarchitecture still has flexibility
though; and so the implementation specifies the circuit detail precisely.
This culminates in the exact circuit design for manufacture. Each of these
levels is vital, and each comes with its own challenges and opportunities.
Module 1, Video 2
At the beginning of the 20th century, "computers" were people employed to
perform calculations. These computers used mechanical calculators to help
them perform arithmetic. They followed instructions to decide what
calculation to perform next. These instructions defined the algorithm or
program they were executing. They would consult paper records to find the
inputs to their calculations, and would store their intermediate results on
paper so they could refer back to them later. A modern electronic computer is
organized in a similar way. We will look in more detail at these components
of a microprocessor in the next video, but for now let's look at how it
operates. Microprocessors also follow instructions one by one, and then
perform relevant calculations. This idea of fetching instructions from memory
and executing them is called the "Fetch-Execute Cycle." In the "Fetch" stage,
the computer reads the next instruction from the program. This instruction is
encoded in binary as ones and zeroes, so it must be decoded to understand
what it means. This is done in the "Decode" stage. Once it is clear what to
do, we move to the "Execute" phase. This can involve different tasks such as
reading memory, performing a calculation, and storing a result. Once done,
the computer is then ready to begin the cycle again, by fetching the next
instruction. Instructions are normally fetched sequentially in order, but
some special instructions called "branches" can change which instruction will
be executed next. For branches, a calculation determines the next
instruction. This can mean evaluating a condition, or reading a register to
determine the next instruction's location. Branches allow computers to make
decisions, and to re-use instruction sequences for common tasks. A modern
computer program, like a web browser, contains millions of instructions, and
computers execute billions of instructions per second, but they all
conceptually follow this "Fetch-Execute Cycle."
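To make the Fetch-Execute Cycle a little more concrete, here is a minimal
Python sketch of a behavioral fetch-decode-execute loop. The instruction
format and register names below are invented purely for illustration; they
are not the real A64 encoding.

    # Minimal fetch-decode-execute loop (illustrative only; not real A64 encodings).
    registers = {"X1": 7, "X2": 5, "X3": 0}
    program = [                      # instruction memory
        ("add", "X3", "X1", "X2"),   # X3 = X1 + X2
        ("sub", "X3", "X3", "X1"),   # X3 = X3 - X1
        ("halt",),
    ]
    pc = 0                           # program counter
    while True:
        instruction = program[pc]    # Fetch: read the next instruction
        op = instruction[0]          # Decode: work out what it means
        if op == "halt":
            break
        dest, a, b = instruction[1:]
        if op == "add":              # Execute: perform the calculation
            registers[dest] = registers[a] + registers[b]
        elif op == "sub":
            registers[dest] = registers[a] - registers[b]
        pc += 1                      # then fetch the next instruction in sequence
    print(registers)                 # {'X1': 7, 'X2': 5, 'X3': 5}

A real branch instruction would instead compute a new value for pc rather
than simply moving to the next instruction in sequence.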
Module 1, Video 3
Modern microprocessors are circuits built using anywhere from 1,000 to 100
billion tiny transistors. The key to designing circuits with such huge
numbers of parts is to build them from re-usable blocks. Let's take a look at
some of these. Each transistor is an electrically-controlled switch. When
there is too little voltage at the gate, the switch is off, so the electrical
signal cannot propagate from the drain to the source. However, when there is
sufficient voltage at the gate, the switch is on, so the signal does
propagate. When designing processors, we use digital electronics. The only
voltage or current values we consider represent zero and one, enabling us to
build robust circuits from imprecise components, even in the event of
electrical noise or manufacturing imperfections. This means our circuits
represent binary numbers, since they only have two states: zero and one. We
use transistors to build increasingly complex circuits. We can design
circuits that can remember binary values, or select between multiple inputs,
or even do arithmetic, like addition. We can then use those circuits as
building blocks in even larger designs. When designing a digital system, we
must keep everything synchronized to control when its behavior occurs. To do
this, we use a "Clock Signal," which is a wire in the circuit whose signal
cycles between zero and one. We measure the rate in Hertz. For example, if
you hear a processor has a clock speed of two Gigahertz, it means the clock
cycles between zero and one 2 billion times per second. The maximum speed of
the clock signal is
determined by the longest, and therefore slowest, path in the circuit between
2 clocked flip-flops. This is referred to as the "Critical Path." The signal
must have time to propagate all the way along the critical path before the
clock cycle completes. A microprocessor needs to do many types of arithmetic.
For this, we build an "Arithmetic Logic Unit" or "ALU." This circuit receives
2 numbers as input, as well as an indication of what operation to perform,
for example, addition or subtraction. In addition to logic, a microprocessor
needs memory. Memory is organized as arrays of memory cells that are able to
store many "Words" of data. A specific word, commonly 32 bits in length, can
be accessed by specifying its "Address." Each address is a number that
indicates the location in the memory that should be read or written. Memory
cells range from hundreds of bits to millions of bits in size, but larger
ones are slower to access, as signals on their long internal wires take
longer to propagate. For that reason, almost all microprocessors include at
least two types of memory for storing data: a big slow "Data Memory," and a small fast
memory called a "Register File." In reality, the "Data Memory" may be
implemented using many different sizes of memory, as we'll see in Module 4.
As well as storing data in memory, we also use some memory to store the
instructions. We need a way to keep track of which instruction we will fetch
next, so we have a "Program Counter." This stores the address in Instruction
Memory of the next instruction to be accessed. Since instructions are encoded
in binary, we also have "Instruction Decode Logic" that converts that binary
to the various signals needed to control the microprocessor.
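As a quick worked example of the relationship between the critical path and
the clock: if we assume, purely for illustration, a critical path of 500
picoseconds between two flip-flops, the fastest safe clock is the reciprocal
of that delay.

    # Illustrative figure only: a real critical-path delay depends on the design.
    critical_path_seconds = 500e-12              # 500 picoseconds along the slowest path
    max_clock_hz = 1 / critical_path_seconds     # the clock period must cover the critical path
    print(max_clock_hz / 1e9, "GHz")             # about 2.0 GHz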
Module 1, Video 4
We've already seen how a microprocessor is controlled by instructions. But
what are they really? An instruction is a simple command that the
microprocessor hardware can perform directly. We write them as text like this
to make them easier to read. But for the microprocessor we encode them in
binary. We use a program called an assembler to translate between the human
text and the binary. In this video we'll be looking specifically at an Arm
instruction set called A64 but other instruction sets follow similar
principles. Arithmetic and logic instructions are the simplest type of
instruction. The first word tells us what operation will be performed such as
addition or multiplication. The values after this tell the processor where to
put the result and where to get the inputs. Values starting with X are
addresses in the register file. Arithmetic instructions read one or two
registers and then put the result into a third register. Branch instructions
are used to make decisions and to repeat instructions. Normally the
microprocessor executes instructions in sequential order but branches change
that, and explicitly tell the microprocessor the address of the instruction
to run next. This is done by giving the address of the next instruction in
the instruction memory. Some branches are unconditional, meaning they always
occur and always affect the next instruction address. Other branches are
conditional, meaning the processor will perform a calculation to decide
whether to follow the branch or to continue executing instructions
sequentially following the branch. These are preceded by a comparison
instruction that calculates the condition. Loads and stores are the instructions
for accessing the data memory. Loads copy values from memory to the register
file. Stores do the opposite. In both cases, the instruction needs to know
the address in the data memory and the location in the register file to copy
between. For data memory, loads and stores read an address from a base
register. They can also optionally add to this base address by reading
another register, or by simply specifying a number in the instruction itself.
Using sequences of instructions we can build programs. Here is an example of
a small program that implements Euclid's greatest common divisor algorithm.
Let's take a look at it working, one instruction at a time. To start with,
the inputs stored in X1 and X2 are compared. If they're equal, we have found
the greatest common divisor, so a conditional branch instruction moves to
instruction 7. If they're not equal, another conditional branch instruction
can be used to determine whether X1 is smaller. Finally, we use an arithmetic
instruction to subtract either X1 or X2 depending on which was larger and
then unconditionally branch back to the start. Here we can see an example of
the program running. One instruction executes at a time. As we step through
the program, the values in the registers X1 and X2 are shown. And we see
these are modified each time we reach instruction 3 or 5. The instructions
repeat in a loop, as is common for many different types of program. Each time
round the loop instruction 0 and 1 are checking whether or not to continue
the loop. When they detect X1 and X2 have become equal, the program finishes
and the processor moves on to the next task.
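The instruction listing itself appears on screen rather than in this
transcript, but as a rough guide the following Python sketch mirrors the flow
just described: compare X1 and X2, exit if they are equal, otherwise subtract
the smaller from the larger and branch back to the start.

    # High-level mirror of the GCD loop described above (not the actual A64 listing).
    def gcd(x1, x2):
        while True:
            if x1 == x2:       # instructions 0-1: compare and branch out when equal
                return x1      # X1 now holds the greatest common divisor
            if x1 < x2:        # conditional branch: is X1 the smaller value?
                x2 = x2 - x1   # subtract instruction updating X2
            else:
                x1 = x1 - x2   # subtract instruction updating X1
            # unconditional branch back to the start of the loop

    print(gcd(12, 18))         # 6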
Module 1, Lab
In this exercise, we're going to be using ASim. ASim is a behavioral
simulator for a subset of the Arm AArch64 instruction set. Behavioral
simulator means that it's not simulating the real circuit-level details of a
microprocessor; it's just simulating the behavior of each instruction as it
runs. Nevertheless, behavioral simulators are vital tools for computer
architects. We use them to check that the designs we've built match the
behavior we intended, and we can also use them to explore new ideas, for
example adding new instructions to the instruction set. We're just going to
be using it though to get familiar with the basics of Arm AArch64. So we can
create a new file which will allow us to type in some Arm AArch64
instructions and when we press assemble that will cause ASim to begin the
simulation of the instructions. If we scroll down we can see that below we've
put a little cheat sheet of Arm AArch64 instructions that you can use as a
reference. Do note that these are just examples though. So for example this
first one is mov X1, X2 and the descriptive text says that this copies the
value of register X2 to register X1. But equally that would imply that if we
used mov X3, X4 instead, that would copy the value of X4 to the register X3. So
we don't have to just use these exact instructions, we can tweak them as we
need to, to achieve our goals. So going back to the ASim, let's say that we
wanted to add the number two to the number three. In our cheat sheet we could
see that there is an 'add' instruction but it requires that the inputs for
the addition are stored in registers. So first of all, we're going to need to
load up two registers with the numbers that we want to add in this case two
and three. So looking at our cheat sheet again we can see that there's an
instruction mov that allows us to do that. So if I write mov X2, #3, this
would cause the register X2 to be loaded with the value 3, and similarly I
could write mov X1, #2, and this would cause the register X1 to be loaded
with the value two. Last but not least, we could then do the addition we
wanted to do, such as add X3, X1, X2. This will cause X1 to be added to X2 and
the result stored in X3. If I press assemble now, we will see that the UI
changes slightly. Here we can now see the simulated state of the processor.
Because it's a behavioral simulator, ASim is only going to
show the state before and after each instruction. And so right now we are
before executing the first instruction. That's the yellow highlighting there,
and we can also see it in the simulated memory. If I press step we can now
see that that instruction has completed, and we're now before executing the
mov X1, #2. And notably we can see that X2 has been loaded with the number 3,
which is exactly what we would expect. If I press step again, we can see that
the mov X1, #2 has now occurred, which has loaded the
value 2 into the register X1. And last but not least, if we step again we
see that X3 has become equal to five, which is exactly what we would expect if
register X1 was added to register X2. So this allows us to get a feel for
what it's like for a machine to execute these instructions. We can reset back
to the beginning if we want to watch it go through again. If we want to do an
edit for example adding a new instruction we can just do that. Press assemble
and that will add the new instruction to our program, and we can see its
effects by simulating. Now in the exercise you will be invited to take an
existing program and add 1 new instruction to it at the indicated position.
Do note that when you assemble the exercise program, you'll be taken to a
gray bit of code, which is our testing code. But as you step, you'll see that
the simulator flicks between the testing code and the code that we wrote,
the Exercise 1 code. You can also click between them using this. Feel free to
try to work out what our testing code does but you can just ignore it if you
want to. The point of the exercise is just to add the instruction in the
indicated position. When you're comfortable you've got the right instruction
you can press run to get all the way to the end. And if you really think
that's correct if you scroll down to the bottom of the page you'll see the
submit button which you can use to send in your answer. Best of luck.
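If you want to sanity-check the expected result away from the simulator, a
behavioral simulation of the three-instruction example above boils down to
applying each instruction's effect to a register state, roughly like this
Python sketch:

    # Behavioral view of: mov X2, #3 ; mov X1, #2 ; add X3, X1, X2
    regs = {"X1": 0, "X2": 0, "X3": 0}
    regs["X2"] = 3                          # mov X2, #3
    regs["X1"] = 2                          # mov X1, #2
    regs["X3"] = regs["X1"] + regs["X2"]    # add X3, X1, X2
    print(regs["X3"])                       # 5, matching what ASim shows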
Module 2, Intro
[music] Hello, my name is Martin Weidmann. I'm an engineer and product
manager with Arm's Architecture and Technology Group. I look after the A-
profile architecture and I maintain Arm's Interrupt Controller
specifications. Computer architecture is sometimes called a science of trade-
offs. Why is everything a trade-off when it comes to designing a processor?
Let's take an example of where we've had to make trade-offs when developing
the processor architecture. So the Arm architecture has a feature called Pointer
Authentication. This is often abbreviated to PAC for Pointer Authentication
Code. What this feature is trying to do is protect against a form of attack
called ROP and JOP. These are Return-Oriented and Jump-Oriented
Programming, and it's where an attacker tries to subvert things like the call
stack to run legitimate code, but in ways that weren't expected by the
programmer or the compiler. PAC or Pointer Authentication tries to defend
against this kind of attack by using part of an address to provide an
encrypted signature. So we can check the signature and the address match and
if they don't, we can spot an attack in progress. So why is this a trade-off?
Well, because to add security, we want that signature to be as big as
possible. The bigger the signature, the more bits we use for that, the
stronger cryptographically that signature is. The trade-off is: the more bits
we use for the signature, the fewer bits we have available for other things,
such as the address. So you can have a big signature with a small address,
but if you want the address to get bigger, then you get a smaller signature,
and that's then cryptographically weaker. So the trade-off we have to make
when designing a technology like that is: What's the right amount of bits for
the signature? What's the strength of cryptography we need from that
signature in order to get the design goal, which is to defeat these attacks
and give us more robust computing? What sort of guiding principles do you use
when designing a processor? So when you're designing a processor, the key
thing you have to bear in mind is: What's it going to be used for? What you
don't want to end up with is a very expensive paperweight. So we need to
understand the requirements that the processor has to meet. We have to
understand the design trade-offs we're making and how they work into meeting
those requirements. We also have to consider not just the processor itself,
but how we're going to show that that processor is correct. We'll put as much
time, if not more, into testing and validating the design as we do into
designing it. How do you design a new microprocessor? If you wanted to create
a new processor from scratch, the first thing you're going to have to do is
understand the market that that processor is going to address and to then
build a team to design that processor. There isn't such a thing as one
processor for every possible market. The requirements for something like an
embedded microcontroller are going to be very different to what you want from
the processor in your mobile phone, your laptop, or your server. So you need
to understand those requirements as the first step into building a new
processor. What determines the best design for a microprocessor? So when
you're designing a processor, you need to work out what the best design for a
given market or application is going to be. There's no magic formula for
this. It's going to depend a lot on what you're trying to achieve with that
processor. You need to understand things like the power requirements, the
performance requirements. Is it going to work in a highly noisy electrical
environment? There's a big difference between the reliability requirements
you need from something like a watch versus a satellite. So you would take
those requirements and you'd work out what the best set of trade-offs is
going to be, and that's an art more than it is a science. How do the
underlying technologies contribute to this best design? A lot of technologies
go into our processor. There's the design of the microarchitecture, the
implementation of the processor. There's the silicon process you're going to
use, how you're going to integrate that processor into an SoC or ASIC. Is it
a single die, or is it going to be using chiplets or multiple sockets? All of
those different technologies are going to be factors in how you design the
processor, what trade-offs you make, and what performance and power you get
out of the design once you're done. In reality there may be many different
'best' designs, so how do you pick one? So when you're designing a processor,
what you want is the best design. But often there isn't "a" best design,
there's just different trade-offs. You have to decide what the best set of
trade-offs is for the particular use case you're going for. And that's also
going to depend on: Is this a device that will be off the shelf, used for
lots of different applications—a general purpose processor? Or is this being
designed for one specific use case? Again, there isn't really a magic bullet
or single answer for this type of question. You need to understand how the
processor is going to be used and then use your experience to judge the
trade-offs, and what will give you the best mix of power, performance, area,
cost, reliability for your target use case.
Module 2, Video 1
[music] In this module, we're going to explore how to improve the simple
microprocessor design from Module 1 in order to allow it to execute programs
more efficiently. First, let's find out how long a program takes to execute.
The time taken to perform the average instruction is equal to the number of
clock cycles taken to perform an instruction multiplied by the duration of
one clock cycle. The time taken to run our program is found by multiplying
the average time to perform an instruction by the number of instructions in
our program. How could we make this faster? One thing we could try is to
reduce the number of instructions in a program. We might be able to optimize
the code, removing unnecessary and repeated work, and selecting instructions to
minimize code size and maximize performance. We could give our microprocessor
the ability to perform more operations in order to help programmers or
compilers further reduce the number of instructions in their program. For
example, allowing the loading of two data values at the same time might allow
fewer instructions to be used in the program. The downside to this approach
is that adding more instructions will require extra circuitry in the
processor, and therefore will likely increase the clock period. If the extra
instructions are rarely used this could even mean an overall decrease in
performance. We see this theme often in computer architecture: trade-offs that
we have to carefully balance. Another approach is to use faster transistors,
perhaps constructed from a more recent fabrication technology. This would
reduce the clock period but may increase costs. The rest of this module
focuses on an optimization to reduce the clock period called pipelining. This
is the most important optimization we use when designing processors. It uses
a similar concept to an assembly line in a factory where work can start on
the next item before the previous one finishes. Let's take a closer look.
Imagine that each instruction has to go through four circuits in a processor.
If we attempt to do all of these in one clock cycle this means our clock
period is the latency of all four circuits added together. If we were to
pipeline this, we would add a pipeline register in the middle. This divides
the circuit into two sections called stages. Notice that although each
instruction takes a similar amount of time to travel down the whole pipeline,
the pipeline design can execute nearly twice as many instructions per second.
The throughput has doubled. This is because we can set the clock period much
shorter. It's now the maximum latency of the two stages. We can pipeline into
many stages and this allows for much faster execution of programs.
Unfortunately, though, pipelining a real microprocessor design is not quite
as simple because the processor has various feedback signals and loops in the
circuit. In the next video, we'll take a look at the challenges of pipelining
in practice. [music]
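As a hedged, purely illustrative calculation of the two ideas in this video,
suppose the four circuits take 200, 300, 250 and 250 picoseconds (invented
figures), and we run a notional one-million-instruction program at one cycle
per instruction:

    # All latencies and the instruction count are assumed for illustration only.
    section_delays = [200e-12, 300e-12, 250e-12, 250e-12]     # the four circuits, in seconds

    unpipelined_period = sum(section_delays)                  # one cycle covers all four circuits
    pipelined_period = max(sum(section_delays[:2]),           # register added in the middle:
                           sum(section_delays[2:]))           # period = slower of the two stages

    instructions, cpi = 1_000_000, 1.0                        # time = instructions x CPI x period
    print(instructions * cpi * unpipelined_period)            # about 0.001 s  (one millisecond)
    print(instructions * cpi * pipelined_period)              # about 0.0005 s (throughput roughly doubles)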
Module 2, Video 2
[music] In this video, we're going to look at applying the pipeline
optimization to a realistic microprocessor design. In the first module, we
met the components of a microprocessor, so let's look at how these are really
connected. This diagram shows all the connections needed for a real,
unpipelined microprocessor. Each clock cycle, the processor starts by
fetching an instruction from the instruction memory. Once the instruction
reaches the decode logic, it is decoded to produce the control signals
necessary to execute it. The exact control signals vary depending on the type
of instruction. For example, arithmetic instructions access the register file
and interact with the ALU. Ultimately, no matter how the instruction was
executed, the last step of each clock cycle is to update the program counter.
This is done by the branch unit. For non-branch instructions, this just means
incrementing the program counter. However, for branch instructions, the
branch unit has to do some calculations. When we apply our pipeline
optimization to this design, we face some challenges. The design has several
loops because instructions have dependencies. How can we break these cycles?
The key observation is that not every instruction is the same. In real
programs, branch instructions usually make up less than 20 percent of the
program. For non-branches, the branch unit doesn't actually need to wait for
the ALU before calculating the result. Let's look at how we can use this fact
to pipeline the processor. Once the first instruction shown in yellow reaches
the pipeline register, we're ready to begin fetching the next instruction,
shown in blue. The yellow instruction can be in the execute stage whilst the
blue instruction is being fetched. Once the yellow instruction is finished,
the blue instruction is ready to enter the execute stage and a new green
instruction enters the fetch stage. What about the branches though? Let's
imagine this next yellow instruction is a branch. The fetch stage works
normally until the branch unit, but the branch unit cannot proceed.
Consequently, the pipeline stalls. The fetch stage spends a cycle waiting
whilst the execute stage executes the branch. Finally, once the ALU is done,
the branch unit can proceed and the next instruction, in this case blue, can
be fetched. Overall, this means that the processor wasted one cycle stalling
due to the branch. Since only 20 percent of instructions are branches, this
means that each instruction would require on average 1.2 cycles. The same
idea of stalling the pipeline can be used to create even longer pipeline
designs. This diagram shows a typical five-stage processor pipeline. In the
next video, we'll look at how we can manage or prevent some of the stalls in
a design like this. [music]
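As a quick check of the 1.2 cycles-per-instruction figure mentioned above,
assuming exactly 20 percent branches and one stall cycle per branch:

    branch_fraction = 0.20                 # roughly 20% of instructions are branches
    stalls_per_branch = 1                  # one wasted fetch cycle per branch
    cpi = 1.0 + branch_fraction * stalls_per_branch
    print(cpi)                             # 1.2 cycles per instruction on average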
Module 2, Video 3
[music] Instructions within a program may be dependent on each other. That
is, one instruction may produce a value that a subsequent instruction
consumes. Data values may be communicated through registers or memory. The
simple program shown has a number of so-called true data dependencies. This
means we must take care to execute these instructions in order, and make sure
results are correctly communicated. Additionally, the outcomes of branch
instructions may affect the path taken through the program, and consequently,
this affects whether an instruction is actually executed. This sort of
dependency is known as a control dependency. In the previous video, we met a
realistic processor pipeline with five stages. Circumstances that prevent an
instruction making progress in our pipeline are known as pipeline hazards.
Let's take a look at how dependencies cause hazards. This program has a true
data dependency. The first instruction writes to register one, which is then
read by the second instruction. If we send this down our pipeline, we see
that the second instruction must stall, waiting for register one to be
written, before it can read and proceed. This is a data hazard.
Unfortunately, dependent instructions are common and stalling in this way
would significantly increase the average cycles per instruction. Let's take a
closer look at the hazard though. The ADD instruction is in the execute
stage, meaning its result is being computed. The SUB instruction needs that
result to proceed. Rather than waiting for the ADD to reach the writeback
stage, we could add an extra path into our pipeline to carry the output of
one stage to a later instruction, making the result available straight away.
We call this a forwarding path. In this case, the ALU result is forwarded to
the SUB instruction to be used as X1. This small piece of extra circuitry
allows this data hazard to be eliminated completely. Unfortunately, even if
we add forwarding paths everywhere, it's not possible to eliminate all data
hazards. For example, this program has a data hazard due to the load
instruction. There are other types of hazard too. This program contains a
control hazard. We cannot be sure which instruction to fetch until after the
branch instruction executes. Consequently, this program has two stall cycles.
We will look in more detail at how control hazards can be mitigated in the
next module. Another class of hazards, called "structural hazards" occur when
two instructions require the same resources simultaneously. For example, if
instructions and data were stored in the same memory, and this could only be
accessed once per cycle, we would have to very frequently stall our pipeline
to let these stages access memory one-by-one. [music]
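As a rough sketch of how a pipeline might spot the data hazard discussed
here, the Python fragment below checks whether a younger instruction reads a
register that an older, still-in-flight instruction writes. The instruction
representation is invented purely for illustration.

    # Each instruction: (mnemonic, destination register, list of source registers).
    older   = ("ADD", "X1", ["X2", "X3"])   # ADD X1, X2, X3 - currently in the execute stage
    younger = ("SUB", "X4", ["X1", "X5"])   # SUB X4, X1, X5 - about to read its sources

    def raw_hazard(older, younger):
        """True if the younger instruction reads the older instruction's destination."""
        return older[1] in younger[2]

    if raw_hazard(older, younger):
        # With a forwarding path, the ALU result is passed straight to the SUB;
        # without forwarding, the pipeline would have to stall instead.
        print("forward the ALU result (or stall if there is no forwarding path)")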
Module 2, Video 4
[music] In the previous videos, we explored how pipelining could improve
performance by reducing our clock period and by overlapping the execution of
different instructions. We also saw that it was sometimes necessary to stall
our pipeline to ensure that instructions were executed correctly. Ideally,
our average cycles per instruction, or CPI, will remain at 1.0. If we must
stall, however, this will increase. For example, if 20 percent of our
instructions were loads and each of these caused one stall cycle, our CPI
would be 1.2. If a further 20 percent of instructions were branches, and each
of these caused two stall cycles, our CPI would be 1.6. The longer we make
our pipeline, the more stall cycles there will be, and eventually the cost of
stalls may outweigh the benefit of the faster clock period. For example,
let's imagine we added a stage to our five-stage pipeline from the previous
video. Now the number of stalls after a branch instruction increases to
three, hurting our CPI. On the other hand, our clock period would improve. So
whether or not this helps speed program execution would depend on the exact
details. It may eventually become more difficult to reduce our clock period
by adding further pipelining stages. This is because it becomes harder to
perfectly balance the logic between stages and because of the constant delays
associated with clocking and our pipeline registers. To mitigate these
issues, we will need to invest in more transistors and our design will
require more area and power. The deeper our pipeline gets, the greater the
investment we need to make in terms of area and power for the same
incremental improvement. Commercial processors today have anywhere from two to
twenty pipeline stages. The faster, more expensive and power-hungry
processors tend to have longer pipelines than the smaller, cheaper processors
in embedded devices. As with many techniques in computer architecture,
eventually it becomes more profitable to invest our time and resources in an
alternative way of improving performance. In later modules, we'll explore how
we can reduce the CPI, even in heavily pipelined processors. [music]
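As a hedged worked example of the numbers in this video, using the quoted
instruction fractions and stall counts together with two invented clock
periods for the five- and six-stage designs:

    # The fractions and stall counts follow the video; the clock periods are assumed.
    load_fraction, branch_fraction = 0.20, 0.20

    cpi_five_stage = 1.0 + load_fraction * 1 + branch_fraction * 2   # = 1.6, the figure above
    cpi_six_stage  = 1.0 + load_fraction * 1 + branch_fraction * 3   # branches now cost three stalls

    period_five_stage = 1.00e-9          # assumed 1.00 ns clock period
    period_six_stage  = 0.85e-9          # assumed slightly shorter period for the longer pipeline

    print(cpi_five_stage * period_five_stage)   # average time per instruction, five stages
    print(cpi_six_stage * period_six_stage)     # whether the extra stage wins depends on the numbers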
Module 2, Lab
[music] In this exercise, we're going to be using a model of a processor
pipeline to explore the effect of the pipelining optimization. Computer
architects use models like this to make high-level decisions early on about
what parameters they will use for a processor. Using a model such as this
saves the burden of actually building the processor to find out its
performance. The model is not accurately simulating the performance of the
processor but rather it's giving us an idea for what performance we might
expect. So what can we do with this model? Well, we can configure the number
of pipeline stages, which we can see affects the diagram. And we can also
turn on or off the forwarding optimization. As we change these numbers notice
that the design parameters change down here. So for example, the clock
frequency is improved by increasing the number of pipeline stages but the
design area will get bigger. And so this may be a consideration depending on
the problem. We can also choose which of two programs we're going to put down
our pipeline. When we press the step forward button the pipeline advances to
the next clock cycle and we can see the instructions have started to flow
down our pipeline and interesting events such as forwarding will be noted in
the simulation. Sometimes the simulation will detect that there will be a
stall for example, in this case, we can see that there is a data hazard
because the instruction in the red memory stage writes to register X13 which
is read by the instruction in the yellow decode stage and therefore a stall
cycle is necessary in order to allow the result to be generated. If we press
the play button, the simulation will proceed automatically and we can see
various stall events happening as the simulation proceeds. But notice that
the program we're simulating is nearly 1,000,000 cycles long, so watching
it play out at this speed is going to take quite a while. So we can use the
fast forward slider to simulate much, much faster. Notice that the statistics
down the bottom have updated depending on the results of the simulation, and
at this point we can see that the program has finished and that the simulated
program took 3.98 milliseconds to execute. We can
also see that just below, the results of past simulations are stored in
little tables so we can easily refer back to them when we're doing later
experiments. So as an experiment, let's imagine what would happen if we
disabled the forwarding optimization but changed nothing else, and we'll just
run this program through. What we can see immediately is that the design size
is slightly smaller, which is what we would expect. It's about 1% better in
fact in this case because of the lack of the forwarding wires. But now that the
program is finished, we can see that the program execution time is a lot
worse. 6.34 milliseconds is about 50% worse. So again, looking in our table,
we can compare the execution times and the area, and we can see that in most
cases the forwarding optimization would be a big win here because, at
the cost of an increase in the area of about 1%, we've had an improvement in
execution time of about 50%, which is likely to be a good trade-off, but not
always. It would depend on the exact scenario. Perhaps that 1% area is more
important than the performance of this program. In the exercise, you'll be
invited to suggest a design, using the number of pipeline stages
and whether forwarding is enabled, that will meet certain criteria. You can
play about and do as many simulations as you wish to figure out what the best
design might be. Once you've got it set up, select the processor that you're
happy with at the top and then scroll down to the submit button and press
that. Good luck. [music]
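For reference, the comparison made in this lab is just a ratio; with the
numbers above:

    time_with_forwarding    = 3.98e-3    # seconds, from the first simulation
    time_without_forwarding = 6.34e-3    # seconds, with forwarding disabled
    slowdown = time_without_forwarding / time_with_forwarding - 1
    print(round(slowdown * 100), "% slower without forwarding, for about 1% less area")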
Module 3, Intro
[music] Hi, I'm Nigel Stevens. I'm Lead Instruction Set Architect for the Arm
A-Profile architecture. I've been at Arm for about 14 years and I have
responsibility for the Arm V8-A instruction set including recent developments
such as the Scalable Vector Extension and Scalable Matrix Extension. What do
we mean by the instruction set architecture? The instruction set
architecture, primarily, most people think of, I guess, as the opcodes, the
encodings of instructions that are executed by an Arm-based processor. But it
also includes other aspects as well such as the exception model, system
programming features, memory management and suchlike. The architecture for
Arm is rather distinct from what other companies may call an architecture.
For Arm, architecture is a legal contract, if you will, between hardware and
software. If software uses only those instruction opcodes and features of the
ISA that are described by the Arm architecture to perform its work, and the
hardware that it's running on implements all of those op codes and features
exactly as defined by the architecture, then any architecturally compliant
software will run on any architecturally compliant hardware that implements
that Arm architecture. And that doesn't mean just processors from Arm itself,
but also processors that are designed by our partners and which we have
validated are conformant with our architecture. How do you decide which
instructions to include in an ISA? When we are looking at requests from
partners or from internal research to add a new instruction, we go through
quite a long process of trying to justify that instruction, or, quite
commonly, a set of instructions rather than a single instruction. We have to
show that it gives us some real benefit in performance, the performance of
your code running on that CPU. Or maybe not performance. Maybe it's security
you're trying to achieve. But it has to give you some really concrete benefit
that is worth the cost of adding all of the software, the validation
software, the implementation costs for all of the different implementations,
compiler support, so on and so forth. It has to answer
that cost-benefit analysis. What is the difference between an ISA and a
microarchitecture? The difference between an instruction set architecture, or
ISA, and the microarchitecture is that the ISA is an abstract concept. It
defines a set of instruction encodings which software can use, and which
hardware has to recognize and implement. How that is implemented is a choice
for the microarchitecture. So the instruction set architecture is fixed, it's
defined by Arm. The microarchitecture is defined by whatever team of people
is designing that CPU. And there are many different approaches to
implementing the Arm architecture, from very small, efficient cores with in-
order pipelines up to very high-performance, state-of-the-art, out-of-order
execution, and everywhere in between. So the microarchitecture is
implementation-specific, the architecture is generic, and software written
for the architecture should run on any microarchitecture. Why does Arm
produce processors with different instruction sets? Arm supports multiple
instruction sets. Some of that is to do with legacy: you can't abandon your
legacy software, your legacy ecosystem. So as the architecture has advanced
and we've introduced major new instruction sets, we still have to continue to
support old software. It takes years, maybe 10 years to move the software
ecosystem to a major new ISA. So for example, AArch64, which is the 64-bit
architecture that we introduced with Arm V8, also supported what
we called AArch32, the old 32-bit architecture that was implemented in the
Arm V7 architecture, and prior to that including the Arm and the Thumb
instruction sets. And we needed to do that because, while some software might
start to migrate to the 64-bit architecture, there's still a lot of software
on the planet which is going to continue using the 32-bit architecture, and
that has to survive. So that's part of the reason: it's about legacy. You
can't just obsolete the whole world when you introduce a new architecture, a
new instruction set architecture in particular. There are other reasons as
well, which is there are certain instruction sets that are different for
reasons of the ecosystem that they're working with. So if you were to
compare, for example, the A-profile architecture that's designed for
application processors that run rich operating systems with virtual memory
supporting SMP (symmetric multiprocessing) operating systems running large
applications, whatever it may be: web browsers on your phone or something, or
a web server in a server farm somewhere. You have your R-profile
architecture, which is designed for high-performance, real-time embedded
systems. The constraints there are somewhat different. The instruction set is
actually fairly similar to the A-profile, but some of the underpinnings of
the architecture, the system side of the architecture, are simplified in
order to give more consistent and predictable real-time response to things
like interrupts or memory translation and suchlike for real-time systems. And
then at the other extreme you have the M-profile architecture which is
designed to be capable of being built in a very simple, ultra-low power
implementation with low code size and again, similar to the R profile, very
predictable real-time performance. So the answer is there are different
instruction sets for reasons of the market that they're trying to address,
and then there are different instruction sets because, well, we have history.
[music]
Module 3, Video 1
[music] In the previous module, we explored how pipelining can be used to
improve performance. We also saw how it is sometimes necessary to stall our
pipeline to ensure our program is executed correctly. In a simple pipeline,
it will be necessary to stall the pipeline whenever we encounter a branch
instruction. This is because we must wait until our branch is executed before
we can be sure which instruction to fetch next. As a recap, branches are
instructions that change which instruction in the program will be executed
next. There are two types of branches: conditional branches and unconditional
branches. Unconditional branches always change which instruction executes
next, whereas conditional ones may or may not, depending on the computations
in the program. In real programs, between approximately one fifth and one
quarter of all instructions are branches, and the majority of these are
conditional. Executing a branch involves calculating the new address to load
into our program counter. This is the branch's "target address." However,
conditional branches have an extra task: we must first determine whether the
branch is taken. If the branch is not taken, we can effectively ignore the
branch and fetch the next instruction as normal. Recall the processor
performance equation from an earlier video. Since we have to wait for
branches to complete before fetching the next instruction, we generate stall
cycles. These increase the average number of "cycles per instruction," which
reduces our microprocessor's performance. The longer our pipeline gets, the
longer it is before each branch is resolved, and the more costly branches
become. Can you think of a way to avoid some of this stalling? One idea is to
evaluate branches earlier in the pipeline, for example in the Decode stage
instead of in the Execute stage. This can indeed help to reduce the number of
stalls, but we may still need to stall if the branch depends on other
instructions that haven't been executed yet. Another idea is to continue
fetching instructions in program order, effectively assuming that each branch
is not taken. The number of stalls in the pipeline for a not-taken branch is
zero in this design. On the other hand, if the branch is in fact taken, the
subsequent instructions that we fetched will be incorrect. So, we must remove
all instructions that have been fetched on this incorrect path from our
pipeline. This is called "flushing" the pipeline. Unfortunately, in real
programs, branches are taken much more than not taken. Could we then simply
assume instead that all branches will be taken? Sadly not, no, because then
we would also need to know the specific "target address" immediately, in
order to know which instruction to fetch next. It may at first seem
impossible to know this before the instruction is decoded. However, computer
architects have found a way to do exactly this. The next video will look at
"dynamic branch prediction:" the idea of predicting the behavior of the
branch instruction before it has even been fetched. [music]
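As a rough estimate of what the "assume not taken" scheme costs: suppose 20
percent of instructions are branches, around 60 percent of those turn out to
be taken, and each taken branch forces a two-cycle flush. The 60 percent and
two-cycle figures are assumptions for illustration only.

    branch_fraction = 0.20     # roughly a fifth of instructions are branches
    taken_fraction  = 0.60     # assumed: most branches turn out to be taken
    flush_penalty   = 2        # assumed: cycles lost flushing wrongly fetched instructions

    cpi = 1.0 + branch_fraction * taken_fraction * flush_penalty
    print(cpi)                 # 1.24 cycles per instruction under these assumptions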
Module 3, Video 2
[music] In this video, we'll explore how to predict the behavior of a branch
instruction. This can sometimes eliminate the cost of branches altogether.
Fundamentally, a branch involves changing the value of the program counter,
which is the address of the next instruction in the program. If we could
predict what this change will be, quickly and accurately, we would have no
need to stall. Precisely, we need to predict that we are fetching a branch,
predict it as taken or not taken, and predict what its target address is. How
could we ever make such predictions? What would we base them on? Well, since
instructions are often executed multiple times, we can accurately make these
predictions based on just the instruction address. If we've previously seen
that a particular address contains a branch, and we see that address again,
we can predict whether that branch will be taken, and its target address,
based on its behavior last time. Amazingly, for real programs, simply
predicting repeating behavior is typically around 90 percent accurate. This
means we could eliminate stalls caused by branch instructions 90 percent of
the time. Let's apply these insights to try to build a branch predictor. We
will add two extra blocks to the Fetch stage of the pipeline we met in Module
2. The first will remember information about recently executed branches. This
will include the program counter values of branches and their target
addresses. This memory is called the "Branch Target Buffer" or BTB. The
second block will make predictions about whether an address containing a
branch is taken or not. We simply call this the "branch predictor." In the
next video, we will look at these in more detail. Combining these two gives
us all the information we need to predict the next value of the program
counter based solely on the current value of the program counter. We don't
even need to decode the instruction to make this prediction. Here we can see
how a running branch predictor behaves for a sample program. Each program
counter is checked in the BTB to see if it's predicted to be a branch and to
identify its predicted target. We also simultaneously check the branch
predictor to see if the branch is predicted to be taken. Based on these
predictions, the branch unit computes the predicted next program counter.
Many cycles after the prediction, feedback will be given by the rest of the
pipeline as to whether or not the prediction was correct. Whenever the
prediction is wrong, we have to flush the pipeline and update the BTB and
branch predictor. The pipeline will then resume fetching from the correct
program counter as computed by the pipeline. The majority of instructions are
not branches, so most of the time the branch unit just increments the program
counter to the next sequential instruction. The BTB contains the instruction
address and target address of some
recently executed branches. The exact number of entries in the BTB varies
considerably. BTBs in large modern processors contain many thousands of
branches. In operation, the BTB checks the supply program counter against its
memory to see whether it has a match, and if so, it returns the target
address. Otherwise, it predicts that the instruction is not a branch. After
each branch executes, the BTB is updated with the true target address. The BTB
cannot be arbitrarily large, so it may have to forget an existing branch to
remember a new one. A simple BTB design like this is typically around 90
percent accurate at predicting target addresses in real programs. [music]
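The Python sketch below is a heavily simplified picture of this flow,
assuming 4-byte instructions and ignoring the BTB's limited capacity; it is
nothing like a real BTB circuit, just the behavior.

    # Toy BTB: maps the address of a recently seen branch to its last target address.
    btb = {0x1008: 0x2000}                         # one remembered branch, for illustration

    def predict_next_pc(pc, predict_taken):
        """Predict the next program counter using only the current one."""
        if pc in btb and predict_taken(pc):        # BTB hit and the predictor says "taken"
            return btb[pc]                         # fetch from the remembered target
        return pc + 4                              # otherwise assume the next sequential instruction

    def update_btb(pc, taken, target):
        """After the branch really executes, record its true target address."""
        if taken:
            btb[pc] = target

    print(hex(predict_next_pc(0x1008, lambda pc: True)))   # 0x2000: predicted taken branch
    print(hex(predict_next_pc(0x1004, lambda pc: True)))   # 0x1008: not known to be a branch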
Module 3, Video 3
[music] In the previous video, we met the two major components of dynamic
branch prediction, the BTB and the branch predictor. In this video, we'll
take a deeper look at the branch predictor. It predicts whether or not a
branch will be taken, based on the program counter. A simple branch predictor
would try to remember what the branch did last time and predict that the same
behavior will repeat. Let's see how such a predictor might be organized.
Remembering a branch prediction for every possible instruction address would
take up far too much memory, so we reuse the same memory for different
branches via a process called "hashing." We hash the address of the branch to
a smaller number. This does unfortunately lead to a problem called
"aliasing," where two different branches can hash to the same value, but this
is rare in practice. Let's see what happens now, when we execute a simple
loop. We see a misprediction when the predictor encounters our branch for the
first time, and another when we exit the loop. The first case will be dependent
on
the value in our predictor's memory, and it may be that we are able to
predict the branch correctly the first time we see it. The second case is
hard to avoid, although some more sophisticated branch predictors will learn
how many iterations a loop will make. A common improvement to this simple
scheme is to avoid instantly flipping our prediction just because the branch
does something unexpected once. This can be achieved with a saturating
counter, which instead remembers how many times the branch has been taken
recently, versus not taken. The counter increments when the branch is taken
and decrements when not taken. It predicts "taken" if the majority of recent
executions of this branch were taken. When building higher-performance
processors, we often have to discard many instructions every time we
mispredict a branch, so accurate branch prediction is very important.
Therefore, branch prediction is still an active area of research. One of the
key ideas used is correlation between branches. In real programs, a common
pattern is for example a pair of branches that always have opposite behavior.
A branch predictor can take advantage of this by remembering a history of
whether recent branches were taken or not taken, and incorporating this history
in the hash. Another idea is to combine multiple different types of predictor
in a "tournament predictor." We use a "meta predictor" to predict which of
two branch predictors will do a better job. As you can see, branch prediction
can get quite complicated. [music]
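As an illustration of the saturating-counter scheme just described, here is a small Python sketch. The table size, the hash (just the low-order bits of the address), and the example branch address are assumptions made purely for the example.

# Sketch of a branch predictor using 2-bit saturating counters.
# Counter values 0-1 predict not-taken, 2-3 predict taken.

TABLE_SIZE = 256                                  # illustrative size

class BranchPredictor:
    def __init__(self):
        self.counters = [1] * TABLE_SIZE          # start weakly not-taken

    def _index(self, pc):
        return pc % TABLE_SIZE                    # simple hash: low-order address bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2    # True means "predict taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = BranchPredictor()
correct = 0
for taken in [True] * 9 + [False]:        # a loop branch: taken 9 times, then exits
    if bp.predict(0x80) == taken:
        correct += 1
    bp.update(0x80, taken)
print(correct, "of 10 predicted correctly")   # 8 of 10: only the first and last mispredict

Running this on a loop branch that is taken nine times and then falls through, the predictor mispredicts only the first and the last iteration, matching the behavior described in the video.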
Module 3, Video 4
[music] In this module, we've met the concept of branch prediction. Even
using relatively small simple circuits, we can accurately predict real branch
behavior more than 95 percent of the time. Could we do better, and do we
really need to? In small, simple processors, these prediction accuracies are
fine, because each misprediction causes only a few stall cycles. Increasing
the accuracy is not that impactful. However, in complex processors with very
long pipelines, the difference between 98 percent and 99 percent prediction
accuracy can be significant for performance as a misprediction could incur
dozens of stall cycles. Accurate prediction really does matter. One of the
problems we face in making accurate predictors is that they need to be small
enough and fast enough to fit in a microprocessor. We can imagine all sorts
of ways to do accurate branch prediction, but if their circuits were slower
than simply executing the branch, they would not be useful. Modern
high-performance processors will often have multiple branch predictors, for
example a small fast one and a slower complex one. The slower one can
override the prediction of the fast one if it thinks it got it wrong, which
does incur some stall cycles, but fewer than a total misprediction. Another
problem we face is that some branches are just very hard to predict. No
matter what technique is used, there are some branches in real programs that
are effectively "random". For example, when compressing or decompressing
data, the combination of the underlying algorithm and the input data may
provide no clear patterns to a prediction. No matter how hard we try, some
branches will never be predicted correctly 100 percent of the time. A final
problem is that since predictors work by observing the program, there will
always be a period of time while the predictor trains on a program to learn
its behavior. [music]
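A rough calculation shows why that extra percentage point matters in a long pipeline. The branch frequency and misprediction penalty below are assumed figures chosen only for illustration; they are not numbers quoted in the video.

# Rough cost of branch mispredictions, per instruction executed.
# Assumed figures for illustration only.
branch_fraction = 0.2        # assume roughly 1 in 5 instructions is a branch
penalty_cycles = 30          # assume a deep pipeline loses ~30 cycles per mispredict

for accuracy in (0.95, 0.98, 0.99):
    stalls_per_instruction = branch_fraction * (1 - accuracy) * penalty_cycles
    print(f"accuracy {accuracy:.0%}: {stalls_per_instruction:.2f} stall cycles per instruction")

# Going from 98% to 99% halves the misprediction stalls (0.12 -> 0.06 cycles per
# instruction with these assumptions), a noticeable fraction of a CPI near one.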
Module 3, Lab
[music] In this exercise, we're going to be using a branch predictor
simulator to explore dynamic branch prediction. This simulator will
accurately simulate the details of a branch predictor but it uses a trace of
a real program executing on a real machine to avoid the need to simulate the
rest of the processor. Computer architects use techniques such as this to
quickly explore one area of processor design, understanding that the accuracy
may not be completely correct given that we're not simulating the full
processor design. The interface allows us to configure details of our branch
predictor, for example the maximum value of the saturating counters used in
the branch predictor's table, or the hash function. And we can see the impact
of changes on the delay of the branch predictor and also the design size
which are two key metrics when designing a branch predictor. Once we've happily
configured the design we want, we can press run to simulate a program, and the
results of that program's execution will be displayed in the rest of the stats.
So for example, we can see here that the predictor predicted 95.24% of the
branches correctly, which is fairly good, and this resulted in an overall
execution time for the program of 5.335 milliseconds. Just below we
can see a table that records previous simulation runs as well so we can do
multiple experiments and see which produces the best results. For example, if
we were curious about the effects of using a saturating counter with a maximum
of three rather than one, we could change that and notice that the design size
has substantially increased. When we press run, we notice that the predictor
accuracy has also increased: it's gone up to 96.31%, and consequently the
execution time of the program has fallen slightly. So we can compare these two
designs and see whether or not this represents a good trade-off for our
processor. Perhaps the area cost is justified, or perhaps it's too much; it
would depend on the exact processor we're trying to design. In the problems,
you'll be invited to come up with designs that are suitable for particular
constraints, for example, particular constraints on the runtime of the program
or the design size. Once you've
configured the branch predictor that you think meets the objectives you can
scroll down to the submit button and click that, and you'll be told whether
or not your answer is correct. Good luck. [music]
Module 4, Video 1
[music] So far, we've looked at the microprocessor's "datapath"— meaning its
execution units, registers, and control circuitry. We have given less
attention to its memory. We usually implement memory using a different type
of chip than the microprocessor, using a technology called DRAM. It is very
dense, allowing us to store lots of data in a small area. However, one issue
with DRAM is its speed. Since the 1980s, processor performance has increased
very rapidly at roughly 55 percent per year, so CPUs of today are many orders
of magnitude faster than those of 40 years ago. In contrast, memory
performance has grown much more modestly. Whilst memories are also much
faster than they were in previous decades, their performance has not kept
pace with processors. This leads to a processor-memory performance gap, with
memory accesses becoming relatively more costly as time goes on. One of the issues
that makes these memories slow is their size. We can make a memory as fast as
our microprocessor, if it is very small. However, this is the opposite of
what programmers want. Programmers can use extra memory to solve more complex
problems, or to solve existing problems faster. This complicates
microprocessor design because although we want a large memory to hold all our
data, large memories are slow and this slows down our processor. How do we
overcome this? What if we could get the speed benefit of a small memory
alongside the size benefit of a large memory? One solution is to have a
large, slow memory and a small, fast memory in our system. For this to be
useful, we need the small memory to hold the data that we use most often, so
that we often get the speed benefits of accessing it, and only rarely have to
access the slow memory. Whilst there are different arrangements of these two
memories, the configuration that is most often used is an "on-chip cache
memory." The small memory sits inside the processor between the pipeline and
the large memory. We keep all data in the large main memory, and put copies
of often-used data in the small memory, which we call a cache. But we are not
limited to only one cache! Our pipeline reads memory in two places: when
fetching the instructions; and when accessing the data. It makes sense to
have two caches here, each optimized for different purposes. The instruction
cache is optimized for fast reading of instructions at Fetch. The data cache
is optimized for reading and writing data from the memory stage. We will
often put a larger "level 2" cache between these two caches and the main
memory. The L2 cache is a "medium-sized" memory: faster and smaller than main
memory, but slower and larger than the two L1 caches. Using a hierarchy of
caches reduces the bandwidth requirements to main memory and the energy cost
of moving data around. [music]
Module 4, Video 2
[music] We previously looked at the need for a cache, which can be used to
store often-used data for fast access by the processor. But which data is
used often enough that it is worth including in the cache? Programs are
mostly made of loops. Here's a simple one that sums values in memory. It
displays the two characteristics that we can exploit to decide what data to
put into a cache. Let's look briefly at how it works. Each time round the
loop, there are two loads and one store. The first load is to load the data
we're summing. The other load, and the store, are to update the running sum.
Notice two things here. First, we access the running sum over and over again;
each time round the loop. Second, when we access part of the data in one loop
iteration, we've already accessed its predecessor in the previous iteration,
and will access its successor in the next. Caches exploit two types of
"locality" in programs in order to be effective. The first is temporal
locality: if a piece of data is accessed, it is likely that it will be
accessed again in the near future. The running sum has temporal locality. The
second type of locality is spatial locality: If a piece of data is accessed,
then its close neighbors are quite likely to be accessed in the near future.
By close neighbors, we mean data whose memory addresses are not far from each
other. The data accesses have spatial locality. It turns out that most
programs exhibit lots of temporal and spatial locality. We can exploit this
to determine what to put in the cache. Exploiting temporal locality is fairly
easy: we simply see which values have been accessed and place them in the
cache. Exploiting spatial locality is also quite simple: when a piece of data
is accessed, place its neighbors in the cache. We'll see in the next module
how this is actually achieved. We use locality to guess what values will be
accessed next and store them in the cache. If we are correct, we get the
benefit of fast memory, but if we are wrong, we must perform a slow access to
main memory— just as we would if the cache were not present. In real
programs, we see hit rates of 90 percent or more in microprocessor caches,
resulting in vast performance improvements. However, it's worth noting that
some programs have less locality, and for those programs, caches offer little
performance benefit. [music]
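The summing loop discussed above might look like the following Python sketch. The array size is arbitrary, and the comments mark which accesses exhibit which kind of locality; in a compiled program the running sum would, as the video says, involve a load and a store each time round the loop.

# A simple loop that sums values in memory, annotated with the locality it exhibits.
data = list(range(1000))     # values stored at consecutive addresses

total = 0
for i in range(len(data)):
    # Spatial locality: data[i] sits next to data[i-1] and data[i+1] in memory,
    # so fetching one cache line brings in several future elements.
    value = data[i]
    # Temporal locality: the running sum is read and written on every iteration,
    # so it will almost always be found in the cache.
    total = total + value

print(total)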
Module 4, Video 3
[music] We've looked at the reasons why we build caches, but how do they
actually work? To the outside world, a cache simply takes an address as
input, and either provides the data that is stored at that location as
output, or returns a signal to say that it doesn't have it. If the data is
found, this is called a "cache hit". If the data is not in the cache, this is
called a "cache miss", and it means we must look for the data in main memory
instead. After each miss, we update the contents of the cache. The
fundamental building block of a cache is called the "cache line". It's a
number of data bytes from consecutive addresses in main memory. Cache lines
are typically 32 or 64 bytes long, and a cache typically has an array of
hundreds to many thousands of lines. The line captures spatial locality,
because it is larger than the data read by a single load instruction. When a
cache miss occurs, the whole line containing that data is copied to the cache
from main memory, meaning we have nearby values for future accesses. When a
request comes in, we use some bits from that address to index the line array.
Just like in the last module on branch prediction, this leads to the problem
of aliasing again, since selecting only some of the bits to index into the
array is like a hash. This means that many addresses map to the same line in
the cache, but we can only store one of their lines of data. We need to note
down which address's line is currently stored in the data array, in what we
call the "tag array". There is one tag for each cache line. When we access
the cache, we access the tag array with the same index to see if the data we
need is present. This design is called a "direct-mapped cache". Direct-mapped
caches work fairly well, but for some programs, we can be unlucky, with the
program accessing two aliasing lines frequently. We can do something about
this by duplicating both the arrays, so that each line of data can now be
stored in one of two places in the cache. When we access the cache, we look
at both arrays and only get a miss if neither of the tags match. This is
called a "2-way set-associative cache", because each line has a set of two
places or "ways" it could reside. The "associativity" of this cache is
therefore two. Set-associative caches introduce a further complication: when
we want to add data, where do we put it? In a 2-way set-associative cache,
there are two choices. How we decide which cache line to evict is called the
"replacement policy". There are many different types of replacement policy. A
simple one is just to make a pseudo-random choice from all the possible cache
lines. Another option is to keep track of when each cache line was last
accessed and to evict the one last used furthest in the past. This is called
a "least recently used policy", and takes advantage of temporal locality. It
does, however, mean storing extra information in the tags to track usage.
Many other ideas are possible too. [music]
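A minimal software model of the direct-mapped lookup just described might look like this in Python. The line size and number of lines are illustrative assumptions, and a real cache does all of this in parallel hardware rather than with arithmetic.

# Sketch of a direct-mapped cache: index bits select a line, and the tag
# disambiguates the many addresses that alias onto that line.

LINE_SIZE = 64           # bytes per cache line (illustrative)
NUM_LINES = 256          # number of lines in the array (illustrative)

class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES       # tag array
        self.data = [None] * NUM_LINES       # data array (one line of bytes per entry)

    def _split(self, address):
        block = address // LINE_SIZE          # which line-sized block of memory this is
        index = block % NUM_LINES             # which entry in the arrays to use
        tag = block // NUM_LINES              # remaining bits identify the block
        return index, tag

    def lookup(self, address):
        """Return True on a hit, False on a miss."""
        index, tag = self._split(address)
        return self.tags[index] == tag

    def fill(self, address, line_bytes):
        """After a miss, copy the whole line from main memory into the cache."""
        index, tag = self._split(address)
        self.tags[index] = tag
        self.data[index] = line_bytes

# Two addresses exactly NUM_LINES * LINE_SIZE bytes apart alias to the same line:
c = DirectMappedCache()
c.fill(0x0000, b"\x00" * LINE_SIZE)
print(c.lookup(0x0000))                          # True: hit
print(c.lookup(0x0000 + LINE_SIZE * NUM_LINES))  # False: same index, different tag

Duplicating both arrays and checking two tags at the same index, as described above, would turn this sketch into a 2-way set-associative cache.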
Module 4, Video 4
[music] Now that we've seen how caches work, let's see how they affect the
performance of a processor. Recall the processor performance equation, where
the processing time is proportional to the average cycles per instruction.
Without a data cache, if 20 percent of instructions are loads, and main
memory takes 20 cycles to access, our CPI figure must be at least 5. However,
if we provide a cache that holds the required data 80 percent of the time...
...and only takes 2 cycles to access, our CPI reduces to 2.2, which is a
significant improvement! We can isolate the memory terms in this equation to
get the average memory access time —abbreviated to AMAT— which allows us to
compare different cache configurations more easily. Changing the cache
configuration will impact the AMAT. There are many different cache parameters
we can change, such as the size, replacement policy, associativity, whether
we put data in the cache for stores or just for loads, and so on. For
example, reducing the size of the cache will improve the access time for a
hit, but will also increase the miss rate. Let's say that we can halve the
access time to 1 with a corresponding halving of the hit rate. This alters
the AMAT to 13, which in this case is worse for performance overall. It's
also useful to look at why an address might miss in the cache. Broadly
speaking, we can divide cache misses into three different categories.
Compulsory misses occur when we attempt to access an address that we have
never seen before and so never had the opportunity to cache it. Capacity
misses occur when there is more data being accessed than the cache could
hold, even if we had complete freedom in where to put each cache block.
Conflict misses occur in caches where there are more addresses hashing to the
same index than arrays to hold the data. We can alter our cache
configurations to lower these misses, but as always, there are trade-offs
involved. Compulsory misses can be reduced by increasing the cache block
size, to take advantage of spatial locality. But for a fixed cache size, this
reduces the number of different addresses or cache lines that can be stored.
A technique called "pre-fetching" can also be used to predict the addresses
that will soon be accessed, and bring their data into the cache early. But
this increases energy consumption, and may make the cache perform worse if
the predictions are not highly accurate. Capacity misses can be reduced
through increasing the size of the cache, although, as we saw before, this
impacts the number of cycles taken to determine a hit. Conflict misses can be
reduced through increasing the number of cache blocks in each set, with an
increase in energy consumption as a side effect of this. [music]
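The numbers quoted in this video can be reproduced with a short calculation. The sketch below assumes a base CPI of one with the memory access time added on top, which is the accounting that matches the figures given above.

# Reproducing the worked example: 20% of instructions are loads, main memory
# takes 20 cycles, and a cache hit takes 2 cycles with an 80% hit rate.
load_fraction = 0.2
memory_cycles = 20

# Without a cache, every load pays the full memory latency.
cpi_no_cache = 1 + load_fraction * memory_cycles            # = 5.0

# AMAT = hit time + miss rate * miss penalty.
amat = 2 + (1 - 0.8) * memory_cycles                        # = 6.0
cpi_with_cache = 1 + load_fraction * amat                   # = 2.2

# Halving the hit time to 1 cycle, but also halving the hit rate to 40%:
amat_small = 1 + (1 - 0.4) * memory_cycles                  # = 13.0, worse overall

print(cpi_no_cache, cpi_with_cache, amat, amat_small)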
Module 4, Lab
[music] In this exercise, we're going to be using a cache memory simulator to
explore the effects of cache memories on processor performance. The simulator
accurately simulates the behavior of the cache memory system, but it uses a
model of the rest of the processor, allowing us to quickly simulate the
effects of the cache on the processor without needing to simulate the full
processor. We can configure a number of parameters about our cache, for
example, the number of levels in our cache, whether or not the cache
separates instructions and data, or is unified, keeping them both in the same
cache. We can also configure the size, the line size, and the associativity,
and changing these numbers will affect design parameters, such as the access
times in the case of a level one (L1) cache hit or a cache miss, and also the
design size. Once we're happy we've found a design we'd like to investigate, we
can press "run", at which point the simulator will run through the program.
The hit rates for instructions in the level one cache, and also for data in
the level one cache, are displayed. We can also see the average memory access
time that results from this. And then below everything, we can
see a table of past simulations so that we can quickly refer back to our
previous experiments when we do new ones. So, for example, let's say we were
curious about the effects of increasing the size of the cache. If we change
the parameter and then press "run", we can immediately see that the design
size has substantially increased, which makes sense because we've doubled the
size of the cache's contents, and therefore we'd expect roughly double the
area. And we can also see that the
hit rates have improved. So there's about a 1% improvement to the L1
instruction cache hits and a 1% improvement to the L1 data cache hits, which
has reduced the overall average memory access time. And so we can compare
these two designs to see which of them we think is better. It's a trade-off
of course, though: the larger design has better performance, but it is
larger, and so depending on the context and our performance goals, we may need
to pick the smaller design or the bigger design. In the
exercises, you'll be invited to come up with a series of designs for caches
that meet certain performance goals. For example, you'll have a constraint on
the area of the design or the execution time of the program, and you need to
optimize the cache to meet those goals. Best of luck. [music]
Module 5, Video 1
[music] In this module, we'll look at how to further improve performance by
exploiting "instruction-level parallelism." In Module 2, we explored how
pipelining can improve the performance of our processor. This reduced our
clock period, and allowed execution of instructions to be overlapped,
improving throughput. One way to boost performance further would be to create
a much deeper pipeline. At some point, this would mean even the ALU in our
Execute stage will be pipelined. Consider the simple program shown in the
slide. Some instructions are dependent on a result from the previous
instruction. Remember in our 5-stage pipeline that these dependent
instructions could be executed in consecutive clock cycles with the aid of
data forwarding. If execution takes place over two pipeline stages within the
pipeline, we need to stall if adjacent instructions share a dependency. This
allows time for the result to be computed. The programmer may be able to
rewrite their program to get the same result with fewer stalls, by placing an
independent instruction between our pair of dependent instructions. In this
case, we can move the third and fifth instructions earlier to optimize
performance. The performance of programs that run on our "super-pipelined"
processor would, to some degree, be determined by the availability of
independent instructions that could be executed in parallel in the pipeline.
This is "instruction-level parallelism"—or ILP. Very deep pipelines are
problematic as they would require: a very high-frequency clock to be
distributed across the chip very precisely; careful balancing of logic
between many, very short, pipeline stages; the pipelining of logic that is
difficult to divide further into stages; and the division of logic at points
requiring many pipelining registers to be inserted. A different approach to
exploit ILP is to make our pipeline wider rather than deeper. In this design,
the processor will fetch, decode, and potentially execute multiple
instructions each cycle. Such a design avoids the problems of a super-
pipelined processor, although as we'll see in the next video, it does
introduce some new complications. [music]
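To make the effect of scheduling concrete, here is a small Python sketch that counts stall cycles for an in-order pipeline whose execution is spread over two stages, so a dependent instruction can only receive a forwarded result two cycles after its producer issues. The instruction sequences are invented purely for illustration.

# Count stall cycles when execution is spread over two pipeline stages, so a
# dependent instruction must wait one extra cycle for its forwarded result.

def count_stalls(program):
    """program is a list of (destination, sources); returns total stall cycles."""
    issue_cycle = {}
    cycle = 0
    stalls = 0
    for dst, srcs in program:
        # A result produced by an instruction issued on cycle c is forwardable
        # to an instruction issued on cycle c + 2 or later.
        ready = max((issue_cycle.get(s, -2) + 2 for s in srcs), default=0)
        if ready > cycle:
            stalls += ready - cycle
            cycle = ready
        issue_cycle[dst] = cycle
        cycle += 1
    return stalls

dependent = [("x1", ["x0"]), ("x2", ["x1"]), ("x3", ["x2"])]
print(count_stalls(dependent))        # 2: each dependent neighbour stalls for a cycle

interleaved = [("x1", ["x0"]), ("x4", ["x5"]), ("x2", ["x1"]),
               ("x6", ["x7"]), ("x3", ["x2"])]
print(count_stalls(interleaved))      # 0: independent work fills the stall slots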
Module 5, Video 2
[music] In this video, we are going to explore "superscalar" processors,
which can process multiple instructions in each pipeline stage. In our simple
5-stage pipeline, there is at most one instruction per pipeline stage. At
best, we can complete one instruction per cycle. We call such a design a
"scalar" processor. In a 2-way superscalar version of this processor, we
would extend this design so it is able to fetch, decode, execute and
writeback up to two instructions at a time. In general, superscalar
processors may vary the number of instructions that can be processed together
in each stage. Let's step through the design. Our instruction cache will need
to supply two instructions per cycle. Typical superscalar processors only
ever fetch adjacent instructions on a given cycle. This can lower performance
if, for example, the first instruction fetched is a taken branch, because
then the second would not be required. Note that now every cycle lost due to
control hazards will cost us two instructions rather than one, so accurate
branch prediction matters even more in superscalar designs. The Decode stage
must now decode and read the registers for two instructions simultaneously.
Fortunately, we are able to extend the register file design to read many
register values at the same time. The Decode stage also needs to check
whether the two instructions are independent. If so, and if the functional
units they both need are available, it can "issue" them for execution in
parallel on the next clock cycle. Otherwise, in this simple design, it will
only issue the first, and keep the second back. A simple design such as this—
where two instructions are fetched, decoded and issued— is called a "2-way"
or "dual-issue" processor. In other designs, the width may vary at different
stages of the pipeline. To support the execution of multiple instructions at
the same time, the Execute stage is expanded and contains two execution
pipelines. It's common for these to have slightly different capabilities to
save area. For example, the top pipeline can execute both ALU and memory
instructions, while the second pipeline only executes ALU instructions. To
ensure that dependent instructions can execute on consecutive clock cycles,
we must add data forwarding paths. These data forwarding paths must allow
results stored in either execution pipeline to be forwarded to the input of
either ALU. During writeback, we need to store both results to the register
file. This means the register file must be redesigned to allow two writes per
clock cycle. Overall, these changes typically require 25 percent more logic
circuitry in our processor, compared with a scalar processor. But we'd expect
an improvement in execution time of between 25 and 30 percent for real world
programs. [music]
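The independence check performed by the Decode stage can be sketched as follows. The instruction encoding and the rule that only the first pipeline handles memory operations are simplifying assumptions that mirror the design described above, not a description of any real decoder.

# Can two adjacent instructions issue together in the simple 2-way design above?
# An instruction is modelled as (opcode, destination register, set of source registers).

MEMORY_OPS = {"LDR", "STR"}

def can_dual_issue(first, second):
    op1, dst1, srcs1 = first
    op2, dst2, srcs2 = second
    if dst1 is not None and dst1 in srcs2:
        return False          # second reads the first's result: issue separately
    if dst1 is not None and dst1 == dst2:
        return False          # both write the same register: keep the writes in order
    if op1 in MEMORY_OPS and op2 in MEMORY_OPS:
        return False          # only the first pipeline handles memory in this sketch
    return True

# ADD x2, x0, x1 followed by ADD x3, x2, x1: dependent, so issue one at a time.
print(can_dual_issue(("ADD", "x2", {"x0", "x1"}), ("ADD", "x3", {"x2", "x1"})))  # False
# ADD x2, x0, x1 followed by ADD x3, x4, x5: independent, so issue together.
print(can_dual_issue(("ADD", "x2", {"x0", "x1"}), ("ADD", "x3", {"x4", "x5"})))  # True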
Module 5, Video 3
[music] We've seen that instruction-level parallelism can be used on
superscalar processors, to run them faster than would be possible on a scalar
processor. But how much of a speedup is this in practice? Ultimately, this
depends on how much instruction-level parallelism is possible in a typical
program. How might we measure this? We can do this initially without
considering any constraints that will be imposed by the processor it will run
on. Let's consider the instructions executed by the program. Let's assume
that we can predict all the branches in the program perfectly. Then we can
ignore branch instructions, as they don't need to flow down our pipeline. Now
let's imagine we can execute any instruction as soon as the data it needs is
ready. That is, we are only restricted by the presence of true data
dependencies. Note that some dependencies are carried through writes and
reads to memory. Rather than considering program order, we can now just look
at the order the dependencies impose on instructions. This is referred to as
"data-flow analysis." Assuming each instruction takes exactly one cycle to
execute, the fastest possible execution time of the whole program in cycles
is given by the longest path in the data-flow graph. The instruction-level
parallelism of this program is the number of instructions divided by this
duration, as this gives the average number of instructions we would need to
be able to execute each cycle to achieve this duration. In real programs,
this can be anywhere from around five, to hundreds or even thousands. An
active area of research and innovation for computer architects is to imagine
processor designs that can expose and exploit as much of this parallelism as
possible. One insight architects have had is that superscalar processors need
to have a fast supply of instructions to be able to analyze dependencies
effectively. This often means that the front end of our processor pipeline is
much wider than the rest of the pipeline, so that it can "run ahead" and see
what behavior the program will have next. Fast and accurate branch prediction
is vital, as we often have to predict multiple branches ahead accurately, to
achieve good performance. Another key insight is that we don't have to wait
to execute the instructions in program order. If all the dependencies of an
instruction are satisfied, the instruction can proceed down the pipeline even
if previous instructions are yet to execute. This can reduce program
execution time by taking advantage of more instruction-level parallelism. In
practice though, this creates extra complications, as we will see in the next
module. [music]
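The data-flow analysis described here fits in a few lines of Python. The dependence graph below is invented purely to show the calculation, and each instruction is assumed to take exactly one cycle, as in the video.

# ILP estimate from a data-flow graph: instructions are nodes, edges are true
# data dependencies, and every instruction takes one cycle.

def critical_path_length(deps):
    """deps maps each instruction to the set of instructions it depends on."""
    finish = {}
    def finish_time(i):
        if i not in finish:
            finish[i] = 1 + max((finish_time(d) for d in deps[i]), default=0)
        return finish[i]
    return max(finish_time(i) for i in deps)

# A made-up 6-instruction fragment: i2 needs i0 and i1, i3 needs i2,
# and i4 and i5 are independent of everything else.
deps = {"i0": set(), "i1": set(), "i2": {"i0", "i1"}, "i3": {"i2"},
        "i4": set(), "i5": set()}

longest = critical_path_length(deps)         # 3 cycles along i0/i1 -> i2 -> i3
ilp = len(deps) / longest                    # 6 instructions / 3 cycles = 2.0
print(longest, ilp)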
Module 5, Lab
[music] In this exercise, we will be using a simulator to explore superscalar
microprocessor design. The simulator has a number of parameters that we can
configure, such as the number of pipeline stages, the width of the Fetch
stage, the width of the issue, and the number of ALUs in the design. We can
see a diagram of the processor that we've created and we can also see a
number of parameters about that design, for example the clock frequency and
the overall area of the design. When we press step, the simulator will
advance one clock cycle and so for example here we can see that the fetch
stage has fetched the first 4 instructions. However, immediately we see one
of the problems with designs such as this, which is that three of the four
instructions that have been fetched are in fact useless because the first
instruction was an unconditional taken branch and therefore the remaining
three instructions will not be executed by the program and so these will
immediately be discarded on the next cycle. Pressing the run button allows us
to simulate, and we can use the fast-forward option to simulate much more
quickly in order to get to the end of the long program execution. In this case we can
see that our design achieved an average cycles per instruction less than one,
which is to say that we on average executed more than one instruction per
cycle, which means that our superscalar design has fundamentally worked, and
we can see that, for example, the overall execution time of the program is
1.6 milliseconds. In a table below, we can also see a record of our previous
simulation runs. So let's say for example, we were curious about the effect
of increasing the issue width by one. We can make that change and then press
run again in order to run our new design, and when it finishes we can scroll
down to take a look and we can see that the program execution time has indeed
improved, down to 1.51 milliseconds, at a cost of only 1% more area. So it looks
like this was a very good improvement to our design and it's almost surely
going to be a beneficial trade-off in practice. In the exercise you will be
invited to configure a number of different superscalar processor designs with
various targets in terms of clock frequency, design area and execution time.
Once you're happy that you've configured the processor that you think completes
the exercise, you can scroll all the way down to the bottom, where you'll see
the submit button that you can press to have your answer checked. Good luck.
[music]
Module 6, Video 1
Introduction:
Module 6, Video 2
[music] In the previous video, we outlined the concepts of out-of-order
execution, and register renaming. The issue window will be filled with
instructions fetched along the path that our branch predictor believes the
program will take. While we hope our branch predictor will be correct in most
cases, it will sometimes be wrong. How do we handle such cases? A simple
approach is to start by recording the original program order of the
instructions, and then to monitor their progress. We call the structure that
stores the instructions the "reorder buffer." As each instruction executes
and produces a result, we can mark it as done. When the oldest instruction
has completed, we can remove it from the end of the reorder buffer, and the
instruction is said to have "committed." This stream of committed
instructions represents how our program would be executed on a simple in-
order pipeline or by an unpipelined processor. It usefully also provides a
point at which we can process exceptions. For example, if the program divides
by zero or attempts to access memory that does not exist. We also check
branch instructions as they complete in order. If they have been
mispredicted, we flush the reorder buffer, our instruction window and any
currently executing instructions and start fetching down the correct path. To
preserve correctness, we must also restore our registers and the register map
table to the values they had when we mispredicted the branch. This can be
done with the aid of a second register map table, updated only when
instructions commit in program order. This can simply be copied to the map
table used by our renaming hardware to "rewind time" for the processor. All
the register values we need will be present, as we don't recycle registers
before we know they will not be needed again. In reality, handling branches
in this way is too slow. Processors instead take many copies of the register
map tables and can handle branches as soon as they are resolved, and we
discover they have been mispredicted. They can also selectively neutralize
the in-flight instructions in the datapath that are on the wrong path, rather
than flushing all of these instructions away. [music]
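A very simplified model of the reorder buffer's commit logic might look like this in Python. Exceptions and the register map checkpoints discussed above are omitted, and the capacity is an illustrative assumption.

# Sketch of a reorder buffer (ROB): instructions enter in program order, are
# marked done as they execute (possibly out of order), and commit in order.
from collections import deque

class ReorderBuffer:
    def __init__(self, capacity=128):           # illustrative capacity
        self.capacity = capacity
        self.entries = deque()                   # oldest instruction at the left

    def dispatch(self, instruction):
        if len(self.entries) >= self.capacity:
            return False                         # ROB full: stall the front end
        self.entries.append({"insn": instruction, "done": False})
        return True

    def mark_done(self, instruction):
        for entry in self.entries:
            if entry["insn"] == instruction:
                entry["done"] = True

    def commit(self):
        """Remove and return instructions that complete in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["insn"])
        return committed

rob = ReorderBuffer()
for insn in ("i0", "i1", "i2"):
    rob.dispatch(insn)
rob.mark_done("i2")                  # executes out of order...
print(rob.commit())                  # ...but nothing commits: i0 is still outstanding
rob.mark_done("i0"); rob.mark_done("i1")
print(rob.commit())                  # now ['i0', 'i1', 'i2'] commit in program order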
Module 6, Video 3
[music] We can now bring everything together and look at what a typical
pipeline for an out-of-order superscalar processor might look like. The Fetch
stage is aided by an accurate branch predictor as we met in Module 3. It will
fetch a group of instructions on every clock cycle. This group of
instructions will be requested from the instruction cache, and will be from
consecutive memory locations. Branches may reduce the number of useful
instructions that can, in practice, be fetched on each cycle. The Decode
stage decodes multiple instructions in parallel. At this point, modern high-
performance processors may also split complex instructions into simpler
operations or "macro-ops." In some cases, there may also be opportunities to
combine simple instructions into a single operation. The next step on an
instruction's journey is renaming to receive a unique destination register.
As we saw in the last video, this increases opportunities for out-of-order
execution. Remember, there are several times more physical registers in our
processor than those available to the compiler. Instructions are placed in
the reorder buffer, and are also "dispatched" to the Issue stage. They will
wait in the window as necessary, and are ready to be issued once all their
operands are available. In the most complex of today's superscalar
processors, there may be hundreds of instructions buffered in the issue
window at the same time. Instructions finally commit in program order. At
this point, any physical registers that are no longer needed can be added
back to the pool of free registers. These are then assigned during the
register renaming step. Once an instruction is issued, it reads its operands
from the physical register file. The Execute stage consists of many
functional units operating in parallel. These may each support different
operations and take different numbers of cycles to execute. A network of
forwarding paths is also provided to ensure we can execute any dependent
instruction on the next clock cycle after the generation of the result. This
requires being able to quickly communicate—or "forward"— a result from the
output of any functional unit, to the input of any other. Some instructions
will need access to memory. After computing their addresses, they are placed
in the processor's load-store queues. "Stores" are sent to memory in program
order, but "loads" can often be sent out of order, and ahead of other older
stores or loads that are not yet ready to be issued to memory. The memory
system reduces the average memory access time by providing numerous levels of
cache memory. After generating results, we write them back to the register
file. This overview is representative of the fastest modern microprocessors
found today in laptops, smartphones and servers. Whilst much extra innovation
goes into real designs, they generally follow the ideas discussed in the
course. [music]
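As a companion to this overview, here is a minimal sketch of the renaming step in Python. The register names and pool size are illustrative assumptions; a real renamer also recycles physical registers at commit, as described above, which this sketch omits.

# Sketch of register renaming: each architectural destination register is given
# a fresh physical register drawn from a free pool, removing false dependencies.

class Renamer:
    def __init__(self, num_physical=16):                  # illustrative pool size
        self.free = [f"p{i}" for i in range(num_physical)]
        self.map_table = {}                                # architectural -> physical

    def rename(self, dst, srcs):
        # Sources read whichever physical register currently holds their value.
        renamed_srcs = [self.map_table.get(s, s) for s in srcs]
        new_dst = self.free.pop(0)                         # allocate a fresh register
        self.map_table[dst] = new_dst
        return new_dst, renamed_srcs

r = Renamer()
# Two writes to x1 no longer conflict once renamed to different physical registers:
print(r.rename("x1", ["x2", "x3"]))    # ('p0', ['x2', 'x3'])
print(r.rename("x1", ["x1", "x4"]))    # ('p1', ['p0', 'x4']): the source x1 reads
                                       # the physical register written previously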
Module 6, Video 4
[music] One question computer architects always ask themselves is: "how much
can we scale up our design?" Let's take a look at some further potential
optimizations to our out-of-order superscalar processor. We could try to make
it wider. For example, by doubling the number of parallel instructions, we
can fetch, decode and execute more instructions per cycle. Would this double
our performance? Sadly, no, things are not that simple! In practice, some
components quickly become very complex, and performance gains may be hard to
extract. For example, today's largest machines fetch at most ten instructions
per cycle from their instruction caches. Fetching more instructions than this
offers minimal performance gain, despite a large hardware cost. If we
increase the number of registers, the size of our issue window, the size of
our load-store queues, or perhaps use a larger and more accurate branch
predictor, our processor's performance will only improve slightly despite a
significant increase in the size of these structures. After a point, the
increase in performance is no longer worth the cost of the extra transistors.
It's also possible that performance might reduce overall as we may need to
lower our clock frequency as the structures get larger. Finally, we could
introduce more pipeline stages, but we know this doesn't necessarily lead to
higher performance, as mispredictions may become more costly. The combination
of these issues means that extracting performance using instruction-level
parallelism alone becomes more expensive as more performance is sought. This
graph shows how the energy cost of executing an instruction grows quickly as
we try to build higher performance processors. Let's look at some example
designs. Suppose we have a core, which requires a certain area. If we double
its area, its performance improves, although there is a small rise in energy
per instruction. If we quadruple its area instead, its performance has now
doubled compared to our original core, while energy has increased by 50
percent. Going further, if we increase our processor's area by a factor of
10, performance is only 2.5 times our original core, but energy per
instruction is now 3 times higher. Its performance does not improve as fast
as the cost of running it! Of course, engineers are clever and determined,
and are constantly developing new techniques to bypass many of these issues.
This means the performance of processors—even ones running a single thread or
program— still improves by around 10 to 25 percent each year. Nevertheless,
ultimately we often need more performance than can be provided by
instruction-level parallelism alone. A modern solution is to employ multiple
processor cores on the same chip—called a "multicore" processor. This changes
the task for programmers; they may need to redesign their programs to take
advantage of such parallelism, but if they can, it can give vast performance
benefits. As we've learned throughout the course, every decision involves
trade-offs and compromise. We are faced with a fascinating but often highly-
constrained design problem. We've seen how performance bottlenecks, that at
first seem impassable, can be overcome with innovative designs. What might
the future hold for microprocessors? Can you think of ideas? What would you
design? [music]
Module 6, Lab
[music] In this exercise, we'll be using a simulator to explore an out-of-
order superscalar processor design. The simulator allows us to configure a
number of aspects of our processor, for example, the number of pipeline
stages, the width of the Fetch stage, the size of the issue window, the
number of ALUs, and the size of our re-order buffer. The changes will be
reflected in the pipeline diagram, which we can see below, and also in the
statistics below that, with key design metrics, such as the clock frequency,
clock period, and design size, visible below. When we press "step," the
simulation will advance by 1 clock cycle, and so we can see, for example, on
the first clock cycle the first four instructions are loaded, although three of
them are unusable because they follow a taken branch, and therefore these will
not be executed and will be discarded on the next clock cycle. We can press
"run" to watch our design in action and in order to quickly get to the end,
we can use the "fast forward" feature to simulate the millions of
instructions in this particular program. After the simulation is complete, we
can check below to see a number of statistics about our pipeline, which are
useful for understanding why the performance is as it is, and in particular
we can see the program execution time: 1.18 milliseconds is the overall
execution time of the program that our design achieved, and notably our
average instructions per cycle gives us the number of instructions that we
were able to complete on each clock cycle, in this case well over 1,
indicating that we are taking advantage of instruction level parallelism in
this simulation. Below all that, we can see our completed simulations and we
can check back on these as we explore new designs. So let's say, for example,
we were curious about the effect of increasing the re-order buffer size. We
could change that and then press "run" to quickly run our next experiment.
And then if we scroll down to our table, we can compare the results and see
that actually the program execution time was substantially reduced by
increasing the size of the re-order buffer, although, admittedly, this did
come at a not insignificant increase in the area of our design, and so
whether or not this represents a good trade-off in practice would very much
depend on the problem we're trying to solve. In the exercises, you'll be
given a number of scenarios for processors, which generally revolve around
certain constraints on the design size, or the clock frequency, or the target
program execution time. And once you think you've found a design that meets
the goals required, configure it using the settings and then scroll down
to the "submit" button at the bottom, and click that to have your answer
checked. Best of luck. [music]