Computer Architecture Arm Subtitles
Module 1, Intro
[music] I'm Richard Grisenthwaite, Chief Architect at Arm. What does a Chief
Architect do? The Chief Architect is responsible for the evolution of the Arm
architecture. One of the things about an architecture is, you want your software to
continue running on future products. So it's a very long term commitment, and the
role of the chief architect is essentially to curate the architecture over time,
adding new features as needed by the partner. Why are microprocessors so important
and ubiquitous? I think the fundamental reason is that they are a general purpose
device. They describe a basic instruction set that does the stuff that you can
construct pretty much any program out of. Indeed, they're Turing complete, which means, by definition, you can solve any computable problem with the processor. They're not customized to any particular usage, but they
do two things really rather well, which are data processing and decision making. In
other words, having, for example, added two numbers and compared them, you're then
taking a branch and making a decision based off that. Now in reality, an awful lot of
general purpose problems are ones that involve: work out some value or some criteria,
compare against that, and then make a decision on it. And essentially that allows you
to solve a huge number of different problems and that's why the microprocessor has
become so ubiquitous. What is the difference between architecture and
microarchitecture? Architecture is what it has to do, and microarchitecture is how it
does it. So for example, we will define in the architecture a set of instructions
that do "add" or "load" or "branch", but it says nothing about whether you've got a
ten-stage pipeline or a three-stage pipeline. It says nothing about branch
prediction. All of those sort of features which actually determine the performance of
the device; all of those are part of the microarchitecture. The key point is that
software that is written to be compliant on the architecture will actually run on
lots of different implementations, lots of different microarchitectures that
essentially implement that architecture in different ways, choosing a trade-off
between the power, the performance and the area of the design. What does it take to
build a commercially successful processor today? If you actually start from scratch
with an established architecture, but we want to create a new microarchitecture, we
reckon, for a high end processor, you're talking about three-to-four hundred person-
years' worth of work in order to create a properly competitive multi-issue out-of-
order machine compared with the state-of-the-art that you can get from Arm. In
addition to that — and that's just to come up with the RTL — in addition to that,
you've then got to do the implementation. If you're going to build that on a three-
nanometer process, the leading edge to get the best possible performance, you're
talking about tens of millions of dollars for the mask sets. There's a whole bunch of
software you've got to go and build on top of that and do all of that work. When
we've looked at companies that are interested in actually starting up, taking an Arm
architecture license — say I want to go and build my own business — we reckon that
you need to be prepared to spend getting on for half a billion dollars before you're
actually going to be successful because it takes time. Your first product is not
necessarily going to be fully competitive because it would be slightly surprising if
the first thing that you built was as good as what people have been building for many
years. It takes time to build up that expertise and skills. And so you're going to
see a couple of iterations before even the best teams end up really being
competitive. And so as I say, I think if you went from nothing and wanted to
essentially create something, and that's using the Arm architecture with all of its
existing software and so on, you're talking the best part of half a billion. What is
the Arm business model? The Arm business model fundamentally is the licensing of
semiconductor IP to as wide a range of companies as possible and in as many ways as
possible, in order to maximize the uptake of the IP. When we talk about IP, we're
talking about essentially designs for processors, and we license that either as an
architecture license, where you effectively gain the right to build your own
processor to the specification of the Arm architecture, or an implementation license
where we are licensing, actually, an implementation that is compliant with the Arm
architecture in the form of some register-transfer-level (RTL) code. What makes Arm different from its competitors? So if you go back to where we were when Arm started out in the
early 90s, there were many, many different architectures available and they were all
kind of doing the same thing but slightly differently. They would have different
instructions, different instruction sets, and so software needed to be ported. A huge
part of Arm's success actually comes from the fact that we created a business model
of licensing the IP to make it very easy for people to build processors, to
incorporate the designs into their SoCs, into their systems. And that then made it
very straightforward for people to be able to use the Arm architecture. Now what this
then meant was, people said: I will port more software to this because it's more
widely available and you get this positive feed-forward effect whereby more
availability of hardware encourages more software, encourages more hardware, and so
on. And essentially that meant that a lot of people said: there's no point in me
having a different architecture. I'm not getting a valuable difference from doing
that. All I've got is, kind of, a needless change to the software that I need to
make. So actually the whole model, we went on at Arm, which was: let's license our IP
to a great number of different players to come up with different solutions to meet
the form factor of a camera or a PDA, or whatever it was back in the day. Those
things made it much more straightforward for people to incorporate our technology.
[music]
Module 1, Video 1
Computer architecture is the study of tools and techniques that help us to design
computers. More precisely, it helps us understand how to meet the needs of particular
markets and applications, using the technology and components that are available. For
example, we might need to produce the chip at the heart of a smartphone using 10
billion transistors and a power of less than two Watts. How do we achieve the
performance a customer wants? The challenge is a fascinating one, and one that
requires a broad understanding. For example, what target market are we designing for?
What are the characteristics of the applications to be run? How will programming
languages and compilers interact with the microprocessor? How best to craft the
narrow interface between the hardware and software? How to organize the components of
our microprocessor? And how to design a circuit, given the characteristics of
individual transistors and wires? Like many design problems, computer architecture
requires many trade-offs to be made and evaluated. Each design decision will impact
trade-offs between size, performance, power, security, complexity, and cost. Trade-
offs must be re-evaluated regularly, due to advances in fabrication technology,
applications, and computer architecture. Computer architecture must be grounded in
quantitative techniques and experimentation, but the endless number of possible
designs means that the field depends on a high degree of human ingenuity and art.
Perhaps surprisingly, the earliest computers and today's most advanced machines have
much in common. They both execute a stored program constructed from machine
instructions. These instructions perform simple operations such as adding two
numbers. Nevertheless, greater numbers of faster transistors, and the application of
a relatively small number of computer architecture concepts, have enabled us to
construct machines that can perform billions of instructions per second, and shrink
these machines to fit in hand-held battery-powered devices. It is this rapid progress
that has supported breakthroughs in machine learning, drug discovery, climate
modeling, and supports our modern world where computation and storage are almost
free. The task of designing a microprocessor is split into different levels of
abstraction: "Architecture;" "Microarchitecture;" and "Implementation."
"Architecture" focuses on the contract between programmers and hardware. It allows
compatible families of microprocessor products to be built. The ARMv8-A architecture
is an example of this. Architecture includes the "Instruction Set Architecture," or
ISA, which defines what instructions exist. It also defines precisely the behavior of
memory and other features needed to build a complete processor. "Microarchitecture"
focuses on the organization and structure of the major components of a
microprocessor. It has to match the rules set by the architecture. The
microarchitecture still has flexibility though; and so the implementation specifies
the circuit detail precisely. This culminates in the exact circuit design for
manufacture. Each of these levels is vital, and each comes with its own challenges
and opportunities.
Module 1, Video 2
At the beginning of the 20th century, "computers" were people employed to perform
calculations. These computers used mechanical calculators to help them perform
arithmetic. They followed instructions to decide what calculation to perform next.
These instructions defined the algorithm or program they were executing. They would
consult paper records to find the inputs to their calculations, and would store their
intermediate results on paper so they could refer back to them later. A modern
electronic computer is organized in a similar way. We will look in more detail at
these components of a microprocessor in the next video, but for now let's look at how
it operates. Microprocessors also follow instructions one by one, and then perform
relevant calculations. This idea of fetching instructions from memory and executing
them is called the "Fetch-Execute Cycle." In the "Fetch" stage, the computer reads
the next instruction from the program. This instruction is encoded in binary as ones
and zeroes, so it must be decoded to understand what it means. This is done in the
"Decode" stage. Once it is clear what to do, we move to the "Execute" phase. This can
involve different tasks such as reading memory, performing a calculation, and storing
a result. Once done, the computer is then ready to begin the cycle again, by fetching
the next instruction. Instructions are normally fetched sequentially in order, but
some special instructions called "branches" can change which instruction will be
executed next. For branches, a calculation determines the next instruction. This can
mean evaluating a condition, or reading a register to determine the next
instruction's location. Branches allow computers to make decisions, and to re-use
instruction sequences for common tasks. A modern computer program, like a web
browser, contains millions of instructions, and computers execute billions of
instructions per second, but they all conceptually follow this "Fetch-Execute Cycle."
Module 1, Video 3
Modern microprocessors are circuits built using anywhere from 1,000 to 100 billion
tiny transistors. The key to designing circuits with such huge numbers of parts is to
build them from re-usable blocks. Let's take a look at some of these. Each transistor
is an electrically-controlled switch. When there is too little voltage at the gate,
the switch is off, so the electrical signal cannot propagate from the drain to the
source. However, when there is sufficient voltage at the gate, the switch is on, so
the signal does propagate. When designing processors, we use digital electronics. The
only voltage or current values we consider represent zero and one, enabling us to
build robust circuits from imprecise components, even in the event of electrical
noise or manufacturing imperfections. This means our circuits represent binary
numbers, since they only have two states: zero and one. We use transistors to build
increasingly complex circuits. We can design circuits that can remember binary
values, or select between multiple inputs, or even do arithmetic, like addition. We
can then use those circuits as building blocks in even larger designs. When designing
a digital system, we must keep everything synchronized to control when its behavior
occurs. To do this, we use a "Clock Signal," which is a wire in the circuit whose
signal cycles between zero and one. We measure the rate in Hertz. For example, if you
hear a processor has a clock speed of two gigahertz, it means its clock signal cycles between zero and one 2 billion times per second. The maximum speed of the clock signal is determined by the
longest, and therefore slowest, path in the circuit between 2 clocked flip-flops.
This is referred to as the "Critical Path." The signal must have time to propagate
all the way along the critical path before the clock cycle completes. A
microprocessor needs to do many types of arithmetic. For this, we build an
"Arithmetic Logic Unit" or "ALU." This circuit receives 2 numbers as input, as well
as an indication of what operation to perform, for example, addition or subtraction.
In addition to logic, a microprocessor needs memory. Memory is organized as arrays of
memory cells that are able to store many "Words" of data. A specific word, commonly
32 bits in length, can be accessed by specifying its "Address." Each address is a
number that indicates the location in the memory that should be read or written.
Memory cells range from hundreds of bits to millions of bits in size, but larger ones
are slower to access, as signals and their long internal wires take longer to
propagate. For that reason, almost all microprocessors include at least two types for
storing data: a big slow "Data Memory," and a small fast memory called a "Register
File." In reality, the "Data Memory" may be implemented using many different sizes of
memory, as we'll see in Module 4. As well as storing data in memory, we also use some
memory to store the instructions. We need a way to keep track of which instruction we
will fetch next, so we have a "Program Counter." This stores the address in
Instruction Memory of the next instruction to be accessed. Since instructions are
encoded in binary, we also have "Instruction Decode Logic" that converts that binary
to the various signals needed to control the microprocessor.
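As a quick worked example (the 500-picosecond figure is an assumption, chosen only for illustration): if the critical path takes 500 picoseconds for signals to propagate, the clock period can be no shorter than that, so

    maximum clock frequency = 1 / clock period = 1 / 500 ps = 2 GHz

Shortening the critical path, for example to 250 ps, would allow the clock to run at up to 4 GHz.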
Module 1, Video 4
We've already seen how a microprocessor is controlled by instructions. But what are
they really? An instruction is a simple command that the microprocessor hardware can
perform directly. We write them as text like this to make them easier to read. But
for the microprocessor we encode them in binary. We use a program called an assembler
to translate between the human text and the binary. In this video we'll be looking
specifically at an Arm instruction set called A64 but other instruction sets follow
similar principles. Arithmetic and logic instructions are the simplest type of
instruction. The first word tells us what operation will be performed such as
addition or multiplication. The values after this tell the processor where to put the
result and where to get the inputs. Values starting with X are addresses in the
register file. Arithmetic instructions read one or two registers and then put the
result into a third register. Branch instructions are used to make decisions and to
repeat instructions. Normally the microprocessor executes instructions in sequential
order but branches change that, and explicitly tell the microprocessor the address of
the instruction to run next. This is done by giving the address of the next
instruction in the instruction memory. Some branches are unconditional, meaning they
always occur and always affect the next instruction address. Other branches are
conditional, meaning the processor will perform a calculation to decide whether to
follow the branch or to continue executing instructions sequentially following the
branch. These are preceded by a comparison instruction that calculates the condition.
Loads and stores are the instructions for accessing the data memory. Loads copy
values from memory to the register file. Stores do the opposite. In both cases, the
instruction needs to know the address in the data memory and the location in the
register file to copy between. For data memory, loads and stores read an address from
a base register. They can also optionally add to this base address by reading another
register, or by simply specifying a number in the instruction itself. Using sequences
of instructions we can build programs. Here is an example of a small program that
implements Euclid's greatest common divisor algorithm. Let's take a look at it
working, one instruction at a time. To start with, the inputs stored in X1 and X2 are compared. If they're equal, we have found the greatest common divisor, so a conditional branch instruction moves to instruction 7. If they're not equal, another
conditional branch instruction can be used to determine whether X1 is smaller.
Finally, we use an arithmetic instruction to subtract either X1 or X2 depending on
which was larger and then unconditionally branch back to the start. Here we can see
an example of the program running. One instruction executes at a time. As we step
through the program, the values in the registers X1 and X2 are shown. And we see
these are modified each time we reach instruction 3 or 5. The instructions repeat in
a loop, as is common for many different types of program. Each time round the loop, instructions 0 and 1 check whether or not to continue the loop. When they
detect X1 and X2 have become equal, the program finishes and the processor moves on
to the next task.
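For reference, the loop described above can be written out as A64 assembly. This is a sketch reconstructed from the walkthrough; the labels, the use of a signed less-than condition, and the comments are assumptions rather than the exact program shown in the video.

    loop:    CMP  X1, X2       // 0: compare X1 and X2
             B.EQ done         // 1: if equal, the GCD has been found
             B.LT smaller      // 2: is X1 smaller than X2?
             SUB  X1, X1, X2   // 3: X1 was larger, so X1 = X1 - X2
             B    loop         // 4: unconditionally branch back to the start
    smaller: SUB  X2, X2, X1   // 5: X2 was larger, so X2 = X2 - X1
             B    loop         // 6: unconditionally branch back to the start
    done:                      // 7: X1 (and X2) now hold the greatest common divisor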
Module 1, Lab
In this exercise, we're going to be using ASim. ASim is a behavioral simulator for a
subset of the Arm AArch64 instruction set. "Behavioral simulator" means that it's not simulating the real circuit-level details of a microprocessor; it's just simulating the behavior of each instruction as it runs. Nevertheless, behavioral simulators are vital tools for computer architects. We use them to check that the designs we've built match the behavior we intended, and we can also use them to explore new ideas,
for example adding new instructions to the instruction set. We're just going to be
using it though to get familiar with the basics of Arm AArch64. So we can create a
new file which will allow us to type in some Arm AArch64 instructions and when we
press assemble that will cause ASim to begin the simulation of the instructions. If
we scroll down we can see that below we've put a little cheat sheet of Arm AArch64
instructions that you can use as a reference. Do note that these are just examples
though. So for example, this first one is mov X1, X2, and the descriptive text says that this copies the value of register X2 to register X1. But equally, that would imply that if we used mov X3, X4 instead, that would copy the value of X4 to the register X3.
So we don't have to just use these exact instructions, we can tweak them as we need
to, to achieve our goals. So going back to the ASim, let's say that we wanted to add
the number two to the number three. In our cheat sheet we could see that there is an
'add' instruction but it requires that the inputs for the addition are stored in
registers. So first of all, we're going to need to load up two registers with the
numbers that we want to add in this case two and three. So looking at our cheat sheet
again, we can see that there's an instruction mov that allows us to do that. So if I write mov X2, #3, this would cause the register X2 to be loaded with the value 3, and similarly I could write mov X1, #2, and this would cause the register X1 to be loaded with the value two. Last but not least, we could then do the addition we wanted to do, such as add X3, X1, X2. This will cause X1 to be added to X2 and the result stored in X3. If I press assemble now, we will see that the UI changes
slightly. Here we can now see the simulated state of the processor. Because ASim is a behavioral simulator, it's only going to show the state before and after each instruction. And so right now we are before executing the first instruction. That's the yellow highlighting there, and we can also see it in the simulated memory. If I press step, we can now see that that instruction has completed, and we are now before executing mov X1, #2. And notably, we can see that X2 has been loaded with the number 3, which is exactly what we would expect. If I press step again, we can see that mov X1, #2 has now occurred, which has loaded the value 2 into the register X1. And last but not least, if we step again, we see that X3 has become equal to five, which is exactly what we would expect if register X1 was added to register X2. So this allows us to get a feel for
what it's like for a machine to execute these instructions. We can reset back to the
beginning if we want to watch it go through again. If we want to do an edit for
example adding a new instruction we can just do that. Press assemble and that will
add the new instruction to our program, and we can see its effects by simulating. Now
in the exercise you will be invited to take an existing program and add 1 new
instruction to it at the indicated position. Do note that when you assemble the
exercise program, you'll be taken to a gray bit of code, which is our testing code.
But as you step, you'll see that the simulator flicks between the testing code and
the code that we wrote, the Exercise 1 code. You can also click between them using
this. Feel free to try to work out what our testing code does but you can just ignore
it if you want to. The point of the exercise is just to add the instruction in the
indicated position. When you're comfortable that you've got the right instruction, you can press run to get all the way to the end. And if you really think that's correct, scroll down to the bottom of the page and you'll find the submit button, which you can
use to send in your answer. Best of luck.
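Putting the walkthrough together, the complete demo program typed into ASim is just three instructions:

    mov X2, #3        // load the value 3 into register X2
    mov X1, #2        // load the value 2 into register X1
    add X3, X1, X2    // X3 = X1 + X2, so X3 becomes 5

Stepping through it in the simulator shows X2 become 3, then X1 become 2, and finally X3 become 5, exactly as described above.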
Module 2, Intro
[music] Hello, my name is Martin Weidmann. I'm an engineer and product manager with
Arm's Architecture and Technology Group. I look after the A-profile architecture and
I maintain Arm's Interrupt Controller specifications. Computer architecture is
sometimes called a science of trade-offs. Why is everything a trade-off when it comes
to designing a processor? Let's take an example of where we've had to make trade-offs
when developing process architecture. So the Arm architecture has a feature called
Pointer Authentication. This is often abbreviated to PAC for Pointer Authentication
Code. What this feature is trying to do is protect against a form of attack called
ROP and JOP. These are Return-Oriented and Jump-Oriented Programming, and it's
where an attacker tries to subvert things like the call stack to run legitimate code,
but in ways that weren't expected by the programmer or the compiler. PAC or Pointer
Authentication tries to defend against this kind of attack by using part of an
address to provide an encrypted signature. So we can check the signature and the
address match and if they don't, we can spot an attack in progress. So why is this a
trade-off? Well, because to add security, we want that signature to be as big as
possible. The bigger the signature, the more bits we use for that, the stronger
cryptographically that signature is. The trade-off is: the more bits we use for the
signature, the fewer bits we have available for other things, such as the address. So
you can have a big signature with a small address, but if you want the address to get
bigger, then you get a smaller signature, and that's then cryptographically weaker.
So the trade-off we have to make when designing a technology like that is: What's the
right amount of bits for the signature? What's the strength of cryptography we need
from that signature in order to get the design goal, which is to defeat these attacks
and give us more robust computing? What sort of guiding principles do you use when
designing a processor? So when you're designing a processor, the key thing you have
to bear in mind is: What's it going to be used for? What you don't want to end up
with is a very expensive paperweight. So we need to understand the requirements that
the processor has to meet. We have to understand the design trade-offs we're making
and how they work into meeting those requirements. We also have to consider not just
the processor itself, but how we're going to show that that processor is correct.
We'll put as much time, if not more, into testing and validating the design as we do
into designing it. How do you design a new microprocessor? If you wanted to create a
new processor from scratch, the first thing you're going to have to do is understand
the market that that processor is going to address and to then build a team to design
that processor. There isn't such a thing as one processor for every possible market.
The requirements for something like an embedded microcontroller are going to be very
different to what you want from the processor in your mobile phone, your laptop, or
your server. So you need to understand those requirements as the first step into
building a new processor. What determines the best design for a microprocessor? So
when you're designing a processor, you need to work out what the best design for a
given market or application is going to be. There's no magic formula for this. It's
going to depend a lot on what you're trying to achieve with that processor. You need
to understand things like the power requirements, the performance requirements. Is it
going to work in a highly noisy electrical environment? There's a big difference
between the reliability requirements you need from something like a watch versus a
satellite. So you would take those requirements and you'd work out what the best set
of trade-offs is going to be, and that's an art more than it is a science. How do the
underlying technologies contribute to this best design? A lot of technologies go into
our processor. There's the design of the microarchitecture, the implementation of the
processor. There's the silicon process you're going to use, how you're going to
integrate that processor into an SoC or ASIC. Is it a single die, or is it going to
be using chiplets or multiple sockets? All of those different technologies are going
to be factors in how you design the processor, what trade-offs you make, and what
performance and power you get out of the design once you're done. In reality there
may be many different 'best' designs, so how do you pick one? So when you're
designing a processor, what you want is the best design. But often there isn't "a"
best design, there's just different trade-offs. You have to decide what the best set
of trade-offs is for the particular use case you're going for. And that's also going
to depend on: Is this a device that will be off the shelf, used for lots of different
applications—a general purpose processor? Or is this being designed for one specific
use case? Again, there isn't really a magic bullet or single answer for this type of
question. You need to understand how the processor is going to be used and then use
your experience to judge the trade-offs, and what will give you the best mix of
power, performance, area, cost, reliability for your target use case.
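To put rough numbers on the Pointer Authentication trade-off (these figures are illustrative assumptions, not taken from the interview): a pointer is 64 bits wide, so if the virtual address occupies 48 of those bits, around 16 upper bits are free to hold the signature; if the address space grows to 52 bits, only around 12 bits remain, and the signature becomes correspondingly weaker cryptographically.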
Module 2, Video 1
[music] In this module, we're going to explore how to improve the simple
microprocessor design from Module 1 in order to allow it to execute programs more
efficiently. First, let's find out how long a program takes to execute. The time
taken to perform the average instruction is equal to the number of clock cycles taken
to perform an instruction multiplied by the duration of one clock cycle. The time
taken to run our program is found by multiplying the average time to perform an
instruction by the number of instructions in our program. How could we make this
faster? One thing we could try is to reduce the number of instructions in a program.
We might be able to optimize the code removing unnecessary and repeated work and
selecting instructions to minimize code size and maximize performance. We could give
our microprocessor the ability to perform more operations in order to help
programmers or compilers further reduce the number of instructions in their program.
For example, allowing the loading of two data values at the same time might allow
fewer instructions to be used in the program. The downside to this approach is that adding more instructions will require extra circuitry in the processor, and therefore we will likely increase the clock period. If the extra instructions are rarely used, this could even mean an overall decrease in performance. We see this theme often in computer architecture: trade-offs that we have to carefully balance. Another approach
is to use faster transistors perhaps constructed from a more recent fabrication
technology. This would reduce the clock period but may increase costs. The rest of
this module focuses on an optimization to reduce the clock period called pipelining.
This is the most important optimization we use when designing processors. It uses a
similar concept to an assembly line in a factory where work can start on the next
item before the previous one finishes. Let's take a closer look. Imagine that each
instruction has to go through four circuits in a processor. If we attempt to do all
of these in one clock cycle this means our clock period is the latency of all four
circuits added together. If we were to pipeline this, we would add a pipeline
register in the middle. This divides the circuit into two sections called stages.
Notice that although each instruction takes a similar amount of time to travel down
the whole pipeline, the pipeline design can execute nearly twice as many instructions
per second. The throughput has doubled. This is because we can set the clock period
much shorter. It's now the maximum latency of the two stages. We can pipeline into
many stages and this allows for much faster execution of programs. Unfortunately,
though, pipelining a real microprocessor design is not quite as simple because the
processor has various feedback signals and loops in the circuit. In the next video,
we'll take a look at the challenges of pipelining in practice. [music]
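As a worked example of the performance equation (all the numbers here are made up for illustration): suppose a program executes 1,000,000 instructions, the average instruction takes 1 clock cycle, and the clock period is 1 nanosecond. Then

    execution time = instructions × cycles per instruction × clock period
                   = 1,000,000 × 1 × 1 ns
                   = 1 millisecond

If pipelining into two stages lets us roughly halve the clock period while keeping the cycles per instruction close to 1, the same program would take roughly half a millisecond.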
Module 2, Video 2
[music] In this video, we're going to look at applying the pipeline optimization to a
realistic microprocessor design. In the first module, we met the components of a
microprocessor, so let's look at how these are really connected. This diagram shows
all the connections needed for a real, unpipelined microprocessor. Each clock cycle,
the processor starts by fetching an instruction from the instruction memory. Once the
instruction reaches the decode logic, it is decoded to produce the control signals
necessary to execute it. The exact control signals vary depending on the type of
instruction. For example, arithmetic instructions access the register file and
interact with the ALU. Ultimately, no matter how the instruction was executed, the
last step of each clock cycle is to update the program counter. This is done by the
branch unit. For non-branch instructions, this just means incrementing the program
counter. However, for branch instructions, the branch unit has to do some
calculations. When we apply our pipeline optimization to this design, we face some
challenges. The design has several loops because instructions have dependencies. How
can we break these cycles? The key observation is that not every instruction is the
same. In real programs, branch instructions usually make up less than 20 percent of
the program. For non-branches, the branch unit doesn't actually need to wait for the
ALU before calculating the result. Let's look at how we can use this fact to pipeline
the processor. Once the first instruction shown in yellow reaches the pipeline
register, we're ready to begin fetching the next instruction, shown in blue. The
yellow instruction can be in the execute stage whilst the blue instruction is being
fetched. Once the yellow instruction is finished, the blue instruction is ready to
enter the execute stage and a new green instruction enters the fetch stage. What
about the branches though? Let's imagine this next yellow instruction is a branch.
The fetch stage works normally until the branch unit, but the branch unit cannot
proceed. Consequently, the pipeline stalls. The fetch stage spends a cycle waiting
whilst the execute stage executes the branch. Finally, once the ALU is done, the
branch unit can proceed and the next instruction, in this case blue, can be fetched.
Overall, this means that the processor wasted one cycle stalling due to the branch.
Since only 20 percent of instructions are branches, this means that each instruction
would require on average 1.2 cycles. The same idea of stalling the pipeline can be
used to create even longer pipeline designs. This diagram shows a typical five-stage
processor pipeline. In the next video, we'll look at how we can manage or prevent
some of the stalls in a design like this. [music]
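The figure of 1.2 cycles per instruction follows directly from the stall accounting described above: every instruction takes at least one cycle, and the 20 percent of instructions that are branches each add one stall cycle, so

    average cycles per instruction = 1 + (0.2 × 1) = 1.2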
Module 2, Video 3
[music] Instructions within a program may be dependent on each other. That is, one
instruction may produce a value that a subsequent instruction consumes. Data values
may be communicated through registers or memory. The simple program shown has a
number of so-called true data dependencies. This means we must take care to execute
these instructions in order, and make sure results are correctly communicated.
Additionally, the outcomes of branch instructions may affect the path taken through
the program, and consequently, this affects whether an instruction is actually
executed. This sort of dependency is known as a control dependency. In the previous
video, we met a realistic processor pipeline with five stages. Circumstances that
prevent an instruction making progress in our pipeline are known as pipeline hazards.
Let's take a look at how dependencies cause hazards. This program has a true data
dependency. The first instruction writes to register one, which is then read by the
second instruction. If we send this down our pipeline, we see that the second
instruction must stall, waiting for register one to be written, before it can read
and proceed. This is a data hazard. Unfortunately, dependent instructions are common
and stalling in this way would significantly increase the average cycles per
instruction. Let's take a closer look at the hazard though. The ADD instruction is in
the execute stage, meaning its result is being computed. The SUB instruction needs
that result to proceed. Rather than waiting for the ADD to reach the writeback stage,
we could add an extra path into our pipeline to carry the output of one stage to a
later instruction, making the result available straight away. We call this a
forwarding path. In this case, the ALU result is forwarded to the SUB instruction to
be used as X1. This small piece of extra circuitry allows this data hazard to be
eliminated completely. Unfortunately, even if we add forwarding paths everywhere,
it's not possible to eliminate all data hazards. For example, this program has a data
hazard due to the load instruction. There are other types of hazard too. This program
contains a control hazard. We cannot be sure which instruction to fetch until after
the branch instruction executes. Consequently, this program has two stall cycles. We
will look in more detail at how control hazards can be mitigated in the next module.
Another class of hazards, called "structural hazards" occur when two instructions
require the same resources simultaneously. For example, if instructions and data were
stored in the same memory, and this could only be accessed once per cycle, we would
have to very frequently stall our pipeline to let these stages access memory one-by-
one. [music]
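The two kinds of data hazard discussed above can be written as short A64 fragments. The register numbers other than X1 are illustrative choices, not the exact instructions shown in the video.

    // True dependency: the SUB reads X1, which the ADD writes.
    // A forwarding path passes the ALU result straight to the SUB, removing the stall.
    ADD X1, X2, X3
    SUB X4, X1, X5

    // Load-use hazard: the loaded value is not available until the memory stage,
    // so even with forwarding at least one stall cycle remains.
    LDR X1, [X2]
    SUB X4, X1, X5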
Module 2, Video 4
[music] In the previous videos, we explored how pipelining could improve performance
by reducing our clock period and by overlapping the execution of different
instructions. We also saw that it was sometimes necessary to stall our pipeline to
ensure that instructions were executed correctly. Ideally, our average cycles per
instruction, or CPI, will remain at 1.0. If we must stall, however, this will
increase. For example, if 20 percent of our instructions were loads and each of these
caused one stall cycle, our CPI would be 1.2. If a further 20 percent of instructions
were branches, and each of these caused two stall cycles, our CPI would be 1.6. The
longer we make our pipeline, the more stall cycles there will be, and eventually the
cost of stalls may outweigh the benefit of the faster clock period. For example,
let's imagine we added a stage to our five-stage pipeline from the previous video.
Now the number of stalls after a branch instruction increases to three, hurting our
CPI. On the other hand, our clock period would improve. So whether or not this helps
speed program execution would depend on the exact details. It may eventually become
more difficult to reduce our clock period by adding further pipelining stages. This
is because it becomes harder to perfectly balance the logic between stages and
because of the constant delays associated with clocking and our pipeline registers.
To mitigate these issues, we will need to invest in more transistors and our design
will require more area and power. The deeper our pipeline gets, the greater the
investment we need to make in terms of area and power for the same incremental
improvement. Commercial processors today have anywhere from two to twenty pipeline stages. The faster, more expensive and power-hungry processors tend to have longer pipelines than the smaller, cheaper processors in embedded devices. As with many
techniques in computer architecture, eventually it becomes more profitable to invest
our time and resources in an alternative way of improving performance. In later
modules, we'll explore how we can reduce the CPI, even in heavily pipelined
processors. [music]
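The cycle counts above follow the same stall accounting as the previous video:

    CPI = 1 + (0.2 loads × 1 stall)                             = 1.2
    CPI = 1 + (0.2 loads × 1 stall) + (0.2 branches × 2 stalls) = 1.6

and with the six-stage pipeline, where each branch instead costs three stalls, the same assumptions would give 1 + 0.2 + (0.2 × 3) = 1.8.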
Module 2, Lab
[music] In this exercise, we're going to be using a model of a processor pipeline to
explore the effect of the pipelining optimization. Computer architects use models like this to make high-level decisions early on about what parameters they will use for a processor; using a model such as this saves the burden of actually building the processor to find out its performance. The model is not accurately simulating the performance of the processor, but rather giving us an idea of what performance we might expect. So what can we do with this model? Well, we can configure the number
of pipeline stages, which we can see affects the diagram. And we can also turn on or
off the forwarding optimization. As we change these numbers notice that the design
parameters change down here. So for example, the clock frequency is improved by
increasing the number of pipeline stages but the design area will get bigger. And so
this may be a consideration depending on the problem. We can also choose which of two
programs we're going to put down our pipeline. When we press the step forward button
the pipeline advances to the next clock cycle and we can see the instructions have
started to flow down our pipeline and interesting events such as forwarding will be
noted in the simulation. Sometimes the simulation will detect that there will be a stall. For example, in this case, we can see that there is a data hazard because the instruction in the red memory stage writes to register X13, which is read by the instruction in the yellow decode stage, and therefore a stall cycle is necessary in order to allow the result to be generated. If we press the play button, the
simulation will proceed automatically and we can see various stall events happening
as the simulation proceeds. But notice that the program we're simulating is
nearly 1,000,000 cycles long so watching it play out at this speed is going to take
quite a while. So we can use the fast forward slider to simulate much, much faster.
Notice that the statistics down the bottom have updated depending on the results of
the simulation, and at this point we can see that the program is finished and that the simulated program took 3.98 milliseconds to execute.
We can also see that just below, the results of past simulations are stored in little
tables so we can easily refer back to them when we're doing later experiments. So as
an experiment, let's imagine what would happen if we disabled the forwarding
optimization but change nothing else, and we'll just run this program through. What we can see immediately is that the design size is slightly better, which is what we would expect. It's 1% better, in fact, in this case because of the lack of the forwarding wires. But now that the program is finished, we can see that the program execution
time is a lot worse: 6.34 milliseconds is about 50% worse. So again, looking in our table, we can compare the execution times and the area, and we can see that in most cases the forwarding optimization would be a big win here, because at the cost of an increase in area of about 1%, we've had an improvement in execution time of about 50%, which is likely to be a good trade-off, but not always. It would depend on the exact scenario. Perhaps that 1% area is more important than the
performance of this program. In the exercise, you'll be invited to suggest a design, using the number of pipeline stages and whether forwarding is enabled, that will meet certain criteria. You can play about and do as many simulations as you wish to figure out what the best design might be. Once you've got
it set up select the processor that you're happy with at the top and then scroll down
to the submit button and press that. Good luck. [music]
Module 3, Intro
[music] Hi, I'm Nigel Stevens. I'm Lead Instruction Set Architect for the Arm A-
Profile architecture. I've been at Arm for about 14 years and I have responsibility
for the Arm V8-A instruction set including recent developments such as the Scalable
Vector Extension and Scalable Matrix Extension. What do we mean by the instruction
set architecture? The instruction set architecture is primarily, I guess, what most people think of as the opcodes, the encodings of the instructions that are executed by an Arm-based processor. But it also includes other aspects as well, such as the exception
model, system programming features, memory management and suchlike. The architecture
for Arm is rather distinct from what other companies may call an architecture. For
Arm, architecture is a legal contract, if you will, between hardware and software. If
software uses only those instruction opcodes and features of the ISA that are
described by the Arm architecture to perform its work, and the hardware that it's
running on implements all of those op codes and features exactly as defined by the
architecture, then any architecturally compliant software will run on any
architecturally compliant hardware that implements that Arm architecture. And that
doesn't mean just processors from Arm itself, but also processors that are designed
by our partners and which we have validated are conformant with our architecture. How
do you decide which instructions to include in an ISA? When we are looking at
requests from partners or from internal research to add a new instruction, we go
through quite a long process of trying to justify that instruction, or, quite
commonly, a set of instructions rather than a single instruction. We have to show
that it gives us some real benefit in performance, the performance of your code
running on that CPU. Or maybe not performance. Maybe it's security you're trying to
achieve. But it has to give you some really concrete benefit that is worth the cost
of adding all of the software, the validation software, the implementation costs for
all of the different implementations, compiler support, and so on and so forth. It has to answer that cost-benefit analysis. What is the difference
between an ISA and a microarchitecture? The difference between an instruction set
architecture, or ISA, and the microarchitecture is that the ISA is an abstract
concept. It defines a set of instruction encodings which software can use, and which
hardware has to recognize and implement. How that is implemented is a choice for the
microarchitecture. So the instruction set architecture is fixed, it's defined by Arm.
The microarchitecture is defined by whatever team of people is designing that CPU.
And there are many different approaches to implementing the Arm architecture, from
very small, efficient cores with in-order pipelines up to very high-performance,
state-of-the-art, out-of-order execution, and everywhere in between. So the
microarchitecture is implementation-specific, the architecture is generic, and
software written for the architecture should run on any microarchitecture. Why does
Arm produce processors with different instruction sets? Arm supports multiple
instruction sets. Some of that is to do with legacy: you can't abandon your legacy
software, your legacy ecosystem. So as the architecture has advanced and we've
introduced major new instruction sets, we still have to continue to support old
software. It takes years, maybe 10 years to move the software ecosystem to a major
new ISA. So for example, AArch64, which is the 64-bit architecture that we introduced
with Armv8, also supported what we called AArch32, the old 32-bit architecture that was implemented in the Armv7 architecture and prior to that, including the Arm and the Thumb instruction sets. And we needed to do that because,
while some software might start to migrate to the 64-bit architecture, there's still
a lot of software on the planet which is going to continue using the 32-bit
architecture, and that has to survive. So that's part of the reason: it's about
legacy. You can't just obsolete the whole world when you introduce a new
architecture, a new instruction set architecture in particular. There are other
reasons as well, which is there are certain instruction sets that are different for
reasons of the ecosystem that they're working with. So if you were to compare, for
example, the A-profile architecture that's designed for application processors that
run rich operating systems with virtual memory, supporting SMP (symmetric multiprocessing) operating systems running large applications, whatever it may be: web browsers on your phone or something, or a web server in a server farm somewhere. Then you
have your R-profile architecture, which is designed for high-performance, real-time
embedded systems. The constraints there are somewhat different. The instruction set
is actually fairly similar to the A-profile, but some of the underpinnings of the
architecture, the system side of the architecture, are simplified in order to give
more consistent and predictable real-time response to things like interrupts or
memory translation and suchlike for real-time systems. And then at the other extreme
you have the M-profile architecture which is designed to be capable of being built in
a very simple, ultra-low power implementation with low code size and again, similar
to the R profile, very predictable real-time performance. So the answer is there are
different instruction sets for reasons of the market that they're trying to address,
and then there are different instruction sets because, well, we have history. [music]
Module 3, Video 1
[music] In the previous module, we explored how pipelining can be used to improve
performance. We also saw how it is sometimes necessary to stall our pipeline to
ensure our program is executed correctly. In a simple pipeline, it will be necessary
to stall the pipeline whenever we encounter a branch instruction. This is because we
must wait until our branch is executed before we can be sure which instruction to
fetch next. As a recap, branches are instructions that change which instruction in
the program will be executed next. There are two types of branches: conditional
branches and unconditional branches. Unconditional branches always change which
instruction executes next, whereas conditional ones may or may not, depending on the
computations in the program. In real programs, between approximately one fifth and
one quarter of all instructions are branches, and the majority of these are
conditional. Executing a branch involves calculating the new address to load into our
program counter. This is the branch's "target address." However, conditional branches
have an extra task: we must first determine whether the branch is taken. If the
branch is not taken, we can effectively ignore the branch and fetch the next
instruction as normal. Recall the processor performance equation from an earlier
video. Since we have to wait for branches to complete before fetching the next
instruction, we generate stall cycles. These increase the average number of "cycles
per instruction," which reduces our microprocessor's performance. The longer our
pipeline gets, the longer it is before each branch is resolved, and the more costly
branches become. Can you think of a way to avoid some of this stalling? One idea is
to evaluate branches earlier in the pipeline, for example in the Decode stage instead
of in the Execute stage. This can indeed help to reduce the number of stalls, but we
may still need to stall if the branch depends on other instructions that haven't been
executed yet. Another idea is to continue fetching instructions in program order,
effectively assuming that each branch is not taken. The number of stalls in the
pipeline for a not-taken branch is zero in this design. On the other hand, if the
branch is in fact taken, the subsequent instructions that we fetched will be
incorrect. So, we must remove all instructions that have been fetched on this
incorrect path from our pipeline. This is called "flushing" the pipeline.
Unfortunately, in real programs, branches are taken much more than not taken. Could
we then simply assume instead that all branches will be taken? Sadly no, because
then we would also need to know the specific "target address" immediately, in order
to know which instruction to fetch next. It may at first seem impossible to know this
before the instruction is decoded. However, computer architects have found a way to
do exactly this. The next video will look at "dynamic branch prediction:" the idea of
predicting the behavior of the branch instruction before it has even been fetched.
[music]
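To see why assuming not-taken is not enough on its own, here is a rough calculation with illustrative numbers (the taken rate and flush penalty are assumptions): if 20 percent of instructions are branches, 60 percent of those branches are taken, and each taken branch costs a 2-cycle flush, then

    average cycles per instruction = 1 + (0.2 × 0.6 × 2) = 1.24

so roughly a quarter of a cycle is still wasted per instruction, which is what dynamic branch prediction aims to recover.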
Module 3, Video 2
[music] In this video, we'll explore how to predict the behavior of a branch
instruction. This can sometimes eliminate the cost of branches altogether.
Fundamentally, a branch involves changing the value of the program counter, which is
the address of the next instruction in the program. If we could predict what this
change will be, quickly and accurately, we would have no need to stall. Precisely, we
need to predict that we are fetching a branch, predict it as taken or not taken, and
predict what its target address is. How could we ever make such predictions? What
would we base them on? Well, since instructions are often executed multiple times, we
can accurately make these predictions based on just the instruction address. If we've
previously seen that a particular address contains a branch, and we see that address
again, we can predict whether that branch will be taken, and its target address,
based on its behavior last time. Amazingly, for real programs, simply predicting
repeating behavior is typically around 90 percent accurate. This means we could
eliminate stalls caused by branch instructions 90 percent of the time. Let's apply
these insights to try to build a branch predictor. We will add two extra blocks to
the Fetch stage of the pipeline we met in Module 2. The first will remember
information about recently executed branches. This will include the program counter
values of branches and their target addresses. This memory is called the "Branch
Target Buffer" or BTB. The second block will make predictions about whether an
address containing a branch is taken or not. We simply call this the "branch
predictor." In the next video, we will look at these in more detail. Combining these
two gives us all the information we need to predict the next value of the program
counter based solely on the current value of the program counter. We don't even need
to decode the instruction to make this prediction. Here we can see how a running
branch predictor behaves for a sample program. Each program counter is checked in the
BTB to see if it's predicted to be a branch and to identify its predicted target. We
also simultaneously check the branch predictor to see if the branch is predicted to
be taken. Based on these predictions, the branch unit computes the predicted next
program counter. Many cycles after the prediction, feedback will be given by the rest
of the pipeline as to whether or not the prediction was correct. Whenever the
prediction is wrong, we have to flush the pipeline and update the BTB and branch
predictor. The pipeline will then resume fetching from the correct program counter as
computed by the pipeline. The majority of instructions are not branches, so most of the time the branch unit just increments the program counter to the next instruction. The BTB contains the
instruction address and target address of some recently executed branches. The exact
number of entries in the BTB varies considerably. BTBs in large modern processors
contain many thousands of branches. In operation, the BTB checks the supplied program
counter against its memory to see whether it has a match, and if so, it returns the
target address. Otherwise, it predicts that the instruction is not a branch. After
each branch executes, the BTB is updated with the true target address. The BTB cannot be
arbitrarily large, so it may have to forget an existing branch to remember a new one.
A simple BTB design like this is typically around 90 percent accurate at predicting
target addresses in real programs. [music]
Module 3, Video 3
[music] In the previous video, we met the two major components of dynamic branch
prediction, the BTB and the branch predictor. In this video, we'll take a deeper look
at the branch predictor. It predicts whether or not a branch will be taken, based on
the program counter. A simple branch predictor would try to remember what the branch
did last time and predict that the same behavior will repeat. Let's see how such a
predictor might be organized. Remembering a branch prediction for every possible
instruction address would take up far too much memory, so we reuse the same memory
for different branches via a process called "hashing." We hash the address of the
branch to a smaller number. This does unfortunately lead to a problem called
"aliasing," where two different branches can hash to the same value, but this is rare
in practice. Let's see what happens now, when we execute a simple loop. We see a
misprediction when it encounters our branches for the first time, and another when we
exit the loop. The first case will be dependent on the value in our predictor's
memory, and it may be that we are able to predict the branch correctly the first time
we see it. The second case is hard to avoid, although some more sophisticated branch
predictors will learn how many iterations a loop will make. A common improvement to
this simple scheme is to avoid instantly flipping our prediction just because the
branch does something unexpected once. This can be achieved with a saturating
counter, which instead remembers how many times the branch has been taken recently,
versus not taken. The counter increments when the branch is taken and decrements when
not taken. It predicts "taken" if the majority of recent executions of this branch
were taken. When building higher-performance processors, we often have to discard
many instructions every time we mispredict a branch, so accurate branch prediction is
very important. Therefore, branch prediction is still an active area of research. One
of the key ideas used is correlation between branches. In real programs, a common pattern is, for example, a pair of branches that always have opposite behavior. A branch predictor can take advantage of this by remembering a history of whether recent branches were taken or not taken, and incorporating this history into the hash.
Another idea is to combine multiple different types of predictor in a "tournament
predictor." We use a "meta predictor" to predict which of two branch predictors will
do a better job. As you can see, branch prediction can get quite complicated. [music]
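As a rough illustration of the simple scheme described in this video, here is a sketch in Python of a table of saturating counters indexed by a hash of the branch address. The table size, hash function and names are assumptions for the example, not details of any real predictor.

```
# A minimal dynamic branch predictor: a table of 2-bit saturating counters
# indexed by a hash of the branch address.

class SaturatingPredictor:
    def __init__(self, table_bits=10, max_count=3):
        self.size = 1 << table_bits
        self.max_count = max_count            # 3 -> a 2-bit counter
        self.table = [max_count // 2] * self.size

    def _index(self, pc):
        # Hash the branch address down to a table index; different branches
        # can alias to the same entry.
        return (pc >> 2) % self.size

    def predict(self, pc):
        # Predict "taken" if recent executions were mostly taken.
        return self.table[self._index(pc)] > self.max_count // 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)


p = SaturatingPredictor()
correct = 0
for taken in [True] * 9 + [False]:   # a loop branch: taken 9 times, then exits
    if p.predict(0x4000) == taken:
        correct += 1
    p.update(0x4000, taken)
print(correct, "of 10 predicted correctly")
```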
Module 3, Video 4
[music] In this module, we've met the concept of branch prediction. Even using
relatively small simple circuits, we can accurately predict real branch behavior more
than 95 percent of the time. Could we do better, and do we really need to? In small,
simple processors, these prediction accuracies are fine, because each misprediction
causes only a few stall cycles. Increasing the accuracy is not that impactful.
However, in complex processors with very long pipelines, the difference between 98
percent and 99 percent prediction accuracy can be significant for performance as a
misprediction could incur dozens of stall cycles. Accurate prediction really does
matter. One of the problems we face in making accurate predictors is that they need
to be small enough and fast enough to fit in a microprocessor. We can imagine all sorts of ways to do accurate branch prediction, but if their circuits were slower than simply executing the branch, they would not be useful. Modern high-performance
processors will often have multiple branch predictors, for example a small fast one
and a slower complex one. The slower one can override the prediction of the fast one
if it thinks it got it wrong, which does incur some stall cycles, but fewer than a
total misprediction. Another problem we face is that some branches are just very hard
to predict. No matter what technique is used, there are some branches in real
programs that are effectively "random". For example, when compressing or
decompressing data, the combination of the underlying algorithm and the input data
may provide no clear patterns to a prediction. No matter how hard we try, some
branches will never be predicted correctly 100 percent of the time. A final problem
is that since the predictors work based on observing the program, there will always
be a period of time when the predictors train on a program to learn its behavior.
[music]
Module 3, Lab
[music] In this exercise, we're going to be using a branch predictor simulator to
explore dynamic branch prediction. This simulator will accurately simulate the
details of a branch predictor but it uses a trace of a real program executing on a
real machine to avoid the need to simulate the rest of the processor. Computer architects use techniques such as this to quickly explore one area of processor design, understanding that the accuracy may not be perfect given that we're not simulating the full processor design. The interface allows us to configure details of our branch predictor, for example the maximum value of the saturating counters used in the branch predictor's table, or the hash function. We can also see the impact of changes on the delay of the branch predictor and on the design size, which are two key metrics when designing a branch predictor. Once we've happily configured the design we want, we can press run to simulate a program, and the results of that program's execution will be displayed in the rest of the stats. So, for example, we can see here that the predictor predicted 95.24% of the branches correctly, which is fairly good, and this resulted in an overall execution time for the program of 5.335 milliseconds. Just below, we can see a table that records previous simulation runs as
well, so we can do multiple experiments and see which produces the best results. For example, if we were curious about the effect of using a saturating counter with a maximum of three rather than one, we could change that and notice that the design size has substantially increased. When we press run, we notice that the predictor accuracy has also increased: it's gone up to 96.31%, and consequently the execution time of the program has fallen slightly. We can then compare these two designs and see whether or not this represents a good trade-off for our processor. Perhaps the area cost is justified, or perhaps it's too much; it would depend on the exact processor we're trying to design. In the problems, you'll be invited to come up with designs that are suitable for particular constraints, for example constraints on the runtime of the program or the design size. Once you've configured the branch predictor that you think meets the objectives, you can scroll down to the submit button and click it, and you'll be told whether or not your answer is correct.
Good luck. [music]
Module 4, Video 1
[music] So far, we've looked at the microprocessor's "datapath"— meaning its
execution units, registers, and control circuitry. We have given less attention to
its memory. We usually implement memory using a different type of chip than the
microprocessor, using a technology called DRAM. It is very dense, allowing us to
store lots of data in a small area. However, one issue with DRAM is its speed. Since
the 1980s, processor performance has increased very rapidly at roughly 55 percent per
year, so CPUs of today are many orders of magnitude faster than those of 40 years
ago. In contrast, memory performance has grown much more modestly. Whilst memories
are also much faster than they were in previous decades, their performance has not
kept pace with processors. This leads to a processor-memory performance gap, with it
becoming more costly to access memory as time goes on. One of the issues that makes
these memories slow is their size. We can make a memory as fast as our
microprocessor, if it is very small. However, this is the opposite of what
programmers want. Programmers can use extra memory to solve more complex problems, or
to solve existing problems faster. This complicates microprocessor design because
although we want a large memory to hold all our data, large memories are slow and
this slows down our processor. How do we overcome this? What if we could get the
speed benefit of a small memory alongside the size benefit of a large memory? One
solution is to have a large, slow memory and a small, fast memory in our system. For
this to be useful, we need the small memory to hold the data that we use most often,
so that we often get the speed benefits of accessing it, and only rarely have to
access the slow memory. Whilst there are different arrangements of these two
memories, the configuration that is most often used is an "on-chip cache memory." The
small memory sits inside the processor between the pipeline and the large memory. We
keep all data in the large main memory, and put copies of often-used data in the
small memory, which we call a cache. But we are not limited to only one cache! Our
pipeline reads memory in two places: when fetching the instructions; and when
accessing the data. It makes sense to have two caches here, each optimized for
different purposes. The instruction cache is optimized for fast reading of
instructions at Fetch. The data cache is optimized for reading and writing data from
the memory stage. We will often put a larger "level 2" cache between these two caches
and the main memory. The L2 cache is a "medium-sized" memory: faster and smaller than
main memory, but slower and larger than the two L1 caches. Using a hierarchy of
caches reduces the bandwidth requirements to main memory and the energy cost of
moving data around. [music]
Module 4, Video 2
[music] We previously looked at the need for a cache, which can be used to store
often-used data for fast access by the processor. But which data is used often enough
that it is worth including in the cache? Programs are mostly made of loops. Here's a
simple one that sums values in memory. It displays the two characteristics that we
can exploit to decide what data to put into a cache. Let's look briefly at how it
works. Each time round the loop, there are two loads and one store. The first load is
to load the data we're summing. The other load, and the store, are to update the
running sum. Notice two things here. First, we access the running sum over and over
again, each time round the loop. Second, when we access part of the data in one loop
iteration, we've already accessed its predecessor in the previous iteration, and will
access its successor in the next. Caches exploit two types of "locality" in programs
in order to be effective. The first is temporal locality: if a piece of data is
accessed, it is likely that it will be accessed again in the near future. The running
sum has temporal locality. The second type of locality is spatial locality: If a
piece of data is accessed, then its close neighbors are quite likely to be accessed
in the near future. By close neighbors, we mean data whose memory addresses are not
far from each other. The data accesses have spatial locality. It turns out that most
programs exhibit lots of temporal and spatial locality. We can exploit this to
determine what to put in the cache. Exploiting temporal locality is fairly easy: we
simply see which values have been accessed and place them in the cache. Exploiting
spatial locality is also quite simple: when a piece of data is accessed, place its
neighbors in the cache. We'll see in the next module how this is actually achieved.
We use locality to guess what values will be accessed next and store them in the
cache. If we are correct, we get the benefit of fast memory, but if we are wrong, we
must perform a slow access to main memory— just as we would if the cache were not
present. In real programs, we see hit rates of 90 percent or more in microprocessor
caches, resulting in vast performance improvements. However, it's worth noting that
some programs have less locality, and for those programs, caches offer little
performance benefit. [music]
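To make the example loop concrete, here is a minimal Python version of the kind of program described above. It is only meant to show where the temporal and spatial locality come from; the array size and contents are arbitrary, and a compiled program would of course keep the running sum in a register or a fixed memory location rather than a Python variable.

```
# Summing values in memory: the access pattern that caches exploit.

data = list(range(1000))   # values stored at consecutive addresses in memory
total = 0                  # the "running sum"

for i in range(len(data)):
    # Spatial locality: data[i] sits next to data[i-1] and data[i+1],
    # which we access in the neighboring iterations.
    value = data[i]
    # Temporal locality: the running sum is read and written on every
    # single iteration of the loop.
    total = total + value

print(total)
```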
Module 4, Video 3
[music] We've looked at the reasons why we build caches, but how do they actually
work? To the outside world, a cache simply takes an address as input, and either
provides the data that is stored at that location at output, or returns a signal to
say that it doesn't have it. If the data is found, this is called a "cache hit". If
the data is not in the cache, this is called a "cache miss", and it means we must
look for the data in main memory instead. After each miss, we update the contents of
the cache. The fundamental building block of a cache is called the "cache line". It's
a number of data bytes from consecutive addresses in main memory. Cache lines are
typically 32 or 64 bytes long, and a cache typically has an array of hundreds to many
thousands of lines. The line captures spatial locality, because it is larger than the
data read by a single load instruction. When a cache miss occurs, the whole line
containing that data is copied to the cache from main memory, meaning we have nearby
values for future accesses. When a request comes in, we use some bits from that
address to index the line array. Just like in the last module on branch prediction,
this leads to the problem of aliasing again, since selecting only some of the bits to
index into the array is like a hash. This means that many addresses map to the same
line in the cache, but we can only store one of their lines of data. We need to note
down which addresses' line is currently stored in the data array, in what we call the
"tag array". There is one tag for each cache line. When we access the cache, we
access the tag array with the same index to see if the data we need is present. This
design is called a "direct-mapped cache". Direct-mapped caches work fairly well, but
for some programs, we can be unlucky, with the program frequently accessing two lines that alias to the same index. We can do something about this by duplicating both the arrays, so that
each line of data can now be stored in one of two places in the cache. When we access
the cache, we look at both arrays and only get a miss if neither of the tags match.
This is called a "2-way set-associative cache", because each line has a set of two
places or "ways" it could reside. The "associativity" of this cache is therefore two.
Set-associative caches introduce a further complication: when we want to add data,
where do we put it? In a 2-way set-associative cache, there are two choices. How we
decide which cache line to evict is called the "replacement policy". There are many
different types of replacement policy. A simple one is just to make a pseudo-random
choice from all the possible cache lines. Another option is to keep track of when
each cache line was last accessed and to evict the one last used furthest in the
past. This is called a "least recently used policy", and takes advantage of temporal
locality. It does, however, mean storing extra information in the tags to track
usage. Many other ideas are possible too. [music]
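Here is a minimal sketch in Python of the direct-mapped organization described above, with the address split into offset, index and tag bits, a tag array, and a data array. The line size, the number of lines and the toy main memory are illustrative assumptions.

```
# A minimal direct-mapped cache model.

LINE_BYTES = 64         # bytes per cache line
NUM_LINES = 256         # number of lines in the cache

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 6
INDEX_BITS = NUM_LINES.bit_length() - 1     # 8

tags = [None] * NUM_LINES          # the "tag array"
data = [None] * NUM_LINES          # the "data array" (one line per entry)

def split(address):
    index = (address >> OFFSET_BITS) % NUM_LINES
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return index, tag

def access(address, main_memory_read):
    index, tag = split(address)
    if tags[index] == tag:
        return data[index], True            # cache hit
    # Cache miss: fetch the whole line from main memory, replacing whatever
    # line was stored at this index before.
    line_start = address & ~(LINE_BYTES - 1)
    data[index] = main_memory_read(line_start, LINE_BYTES)
    tags[index] = tag
    return data[index], False

# Toy main memory: return the byte addresses covered by the line.
fetch = lambda start, size: list(range(start, start + size))
print(access(0x1234, fetch)[1])   # False: first access misses
print(access(0x1238, fetch)[1])   # True: same line, spatial locality hit
```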
Module 4, Video 4
[music] Now that we've seen how caches work, let's see how they affect the
performance of a processor. Recall the processor performance equation, where the
processing time is proportional to the average cycles per instruction. Without a data
cache, if 20 percent of instructions are loads, and main memory takes 20 cycles to
access, our CPI figure must be at least 5. However, if we provide a cache that holds
the required data 80 percent of the time, and only takes 2 cycles to access, our
CPI reduces to 2.2, which is a significant improvement! We can isolate the memory
terms in this equation to get the average memory access time —abbreviated to AMAT—
which allows us to compare different cache configurations more easily. Changing the
cache configuration will impact the AMAT. There are many different cache parameters
we can change, such as the size, replacement policy, associativity, whether we put
data in the cache for stores or just for loads, and so on. For example, reducing the
size of the cache will improve the access time for a hit, but will also increase the
miss rate. Let's say that we can halve the access time to 1 with a corresponding
halving of the hit rate. This alters the AMAT to 13, which in this case is worse for
performance overall. It's also useful to look at why an address might miss in the
cache. Broadly speaking, we can divide cache misses into three different categories.
Compulsory misses occur when we attempt to access an address that we have never seen
before and so never had the opportunity to cache it. Capacity misses occur when there
is more data being accessed than the cache could hold, even if we had complete
freedom in where to put each cache block. Conflict misses occur in caches where there
are more addresses hashing to the same index than arrays to hold the data. We can
alter our cache configurations to lower these misses, but as always, there are trade-
offs involved. Compulsory misses can be reduced by increasing the cache block size,
to take advantage of spatial locality. But for a fixed cache size, this reduces the
number of different addresses or cache lines that can be stored. A technique called
"pre-fetching" can also be used to predict the addresses that will soon be accessed,
and bring their data into the cache early. But this increases energy consumption, and
may make the cache perform worse if the predictions are not highly accurate. Capacity
misses can be reduced through increasing the size of the cache. Although, as we saw
before, this impacts the number of cycles taken to determine a hit. Conflict misses
can be reduced through increasing the number of cache blocks in each set, with an
increase in energy consumption as a side effect of this. [music]
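The figures quoted in this video can be reproduced with a short calculation. The sketch below assumes a base CPI of 1 plus load stalls, and that a miss pays the cache access time plus the main-memory time; those assumptions are made to match the numbers above rather than taken from a specific design.

```
# Reproducing the CPI and AMAT figures from this video.

load_fraction = 0.20          # 20 percent of instructions are loads
main_memory_cycles = 20       # time to access main memory

# No data cache: every load pays the main-memory latency.
cpi_no_cache = 1 + load_fraction * main_memory_cycles           # 5.0

# With a cache: 80 percent of loads hit in 2 cycles; a miss pays the
# hit check plus the main-memory access.
hit_rate, hit_cycles = 0.80, 2
amat = hit_cycles + (1 - hit_rate) * main_memory_cycles         # 6.0 cycles
cpi_with_cache = 1 + load_fraction * amat                       # 2.2

# Halving the hit time to 1 cycle but also halving the hit rate:
smaller_amat = 1 + (1 - 0.40) * main_memory_cycles              # 13.0 cycles

print(cpi_no_cache, cpi_with_cache, amat, smaller_amat)
```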
Module 4, Lab
[music] In this exercise, we're going to be using a cache memory simulator to explore
the effects of cache memories on processor performance. The simulator accurately
simulates the behavior of the cache memory system, but it's using a model of the rest
of the processor to quickly allow us to simulate the effects of the cache on the
processor without needing to simulate the full processor. We can configure a number
of parameters about our cache, for example, the number of levels in our cache,
whether or not the cache separates instructions and data, or is unified, keeping them
both in the same cache. We can also configure the size, the line size, and the
associativity, and changing these numbers will affect design parameters, such as the
access times in the case of a level one (L1) cache hit or a cache miss, and also the design size. Once we're happy we've found a design we'd like to investigate, we can
press "run", at which point the simulator will run through the program. We can see
the hit rates for instructions in the level one cache, and also data in the level one
cache are displayed. And we can also see the average memory access time, the results
from this. And then below everything, we can see a table of past simulations so that
we can quickly refer back to our previous experiments when we do new ones. So, for
example, let's say we were curious about the effects of increasing the size of the
cache. If we change the parameter and then press "run", we can immediately see that
the design size has substantially increased, which makes sense because we've doubled the size of the cache's contents, and therefore we'd expect roughly double the area. We can also see that the hit rates
have improved. So there's about a 1% improvement to the L1 instruction cache hits and
a 1% improvement to the L1 data cache hits, which has reduced the overall average
memory access time. And so we can compare these two designs to see which of them we
think is better. It's a trade-off, of course: the larger design has better performance, but it is larger, and so depending on the context and our performance goals, we may need to pick either the smaller design or the bigger one. In the
exercises, you'll be invited to come up with a series of designs for caches that meet
certain performance goals. For example, you'll have a constraint on the area of the
design or the execution time of the program, and you need to optimize the cache to
meet those goals. Best of luck. [music]
Module 5, Video 1
[music] In this module, we'll look at how to further improve performance by
exploiting "instruction-level parallelism." In Module 2, we explored how pipelining
can improve the performance of our processor. This reduced our clock period, and
allowed execution of instructions to be overlapped, improving throughput. One way to
boost performance further would be to create a much deeper pipeline. At some point,
this would mean even the ALU in our Execute stage will be pipelined. Consider the
simple program shown in the slide. Some instructions are dependent on a result from
the previous instruction. Remember in our 5-stage pipeline that these dependent
instructions could be executed in consecutive clock cycles with the aid of data
forwarding. If execution takes place over two pipeline stages within the pipeline, we
need to stall if adjacent instructions share a dependency. This allows time for the
result to be computed. The programmer may be able to rewrite their program to get the
same result with fewer stalls, by placing an independent instruction between our pair
of dependent instructions. In this case, we can move the third and fifth instructions
earlier to optimize performance. The performance of programs that run on our "super-
pipelined" processor would, to some degree, be determined by the availability of
independent instructions that could be executed in parallel in the pipeline. This is
"instruction-level parallelism"—or ILP. Very deep pipelines are problematic as they
would require: a very high-frequency clock to be distributed across the chip very
precisely; careful balancing of logic between many, very short, pipeline stages; the
pipelining of logic that is difficult to divide further into stages; and the division
of logic at points requiring many pipelining registers to be inserted. A different
approach to exploit ILP is to make our pipeline wider rather than deeper. In this
design, the processor will fetch, decode, and potentially execute multiple
instructions each cycle. Such a design avoids the problems of a super-pipelined
processor, although as we'll see in the next video, it does introduce some new
complications. [music]
Module 5, Video 2
[music] In this video, we are going to explore "superscalar" processors, which can
process multiple instructions in each pipeline stage. In our simple 5-stage pipeline,
there is at most one instruction per pipeline stage. At best, we can complete one
instruction per cycle. We call such a design a "scalar" processor. In a 2-way
superscalar version of this processor, we would extend this design so it is able to
fetch, decode, execute and writeback up to two instructions at a time. In general,
superscalar processors may vary the number of instructions that can be processed
together in each stage. Let's step through the design. Our instruction cache will
need to supply two instructions per cycle. Typical superscalar processors only ever
fetch adjacent instructions on a given cycle. This can lower performance if, for
example, the first instruction fetched is a taken branch, because then the second
would not be required. Note that now every cycle lost due to control hazards will
cost us two instructions rather than one, so accurate branch prediction matters even
more in superscalar designs. The Decode stage must now decode and read the registers
for two instructions simultaneously. Fortunately, we are able to extend the register
file design to read many register values at the same time. The Decode stage also
needs to check whether the two instructions are independent. If so, and if the
functional units they both need are available, it can "issue" them for execution in
parallel on the next clock cycle. Otherwise, in this simple design, it will only
issue the first, and keep the second back. A simple design such as this—where two
instructions are fetched, decoded and issued— is called a "2-way" or "dual-issue"
processor. In other designs, the width may vary at different stages of the pipeline.
To support the execution of multiple instructions at the same time, the Execute stage
is expanded and contains two execution pipelines. It's common for these to have
slightly different capabilities to save area. For example, the top pipeline can
execute both ALU and memory instructions, while the second pipeline only executes ALU
instructions. To ensure that dependent instructions can execute on consecutive clock
cycles, we must add data forwarding paths. These data forwarding paths must allow
results stored in either execution pipeline to be forwarded to the input of either
ALU. During writeback, we need to store both results to the register file. This means
the register file must be redesigned to allow two writes per clock cycle. Overall,
these changes typically require 25 percent more logic circuitry in our processor,
compared with a scalar processor. But we'd expect an improvement in execution time of
between 25 and 30 percent for real world programs. [music]
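As a rough sketch of the issue check described above, the Python below decides whether two adjacent decoded instructions can be issued together. The instruction format and the rule that only one pipeline handles memory operations are assumptions for illustration, not details of a particular design.

```
# A sketch of the dual-issue decision: issue two adjacent decoded instructions
# together only if the second doesn't depend on the first and a suitable
# execution pipeline is free.

from collections import namedtuple

Instr = namedtuple("Instr", "op dest srcs")   # e.g. Instr("add", "x1", ["x2", "x3"])

def can_dual_issue(first, second):
    # True data dependency: the second reads the first's result.
    if first.dest in second.srcs:
        return False
    # Both write the same destination register.
    if first.dest == second.dest:
        return False
    # Structural constraint: only one pipeline handles loads and stores.
    if first.op in ("load", "store") and second.op in ("load", "store"):
        return False
    return True

i1 = Instr("add",  "x1", ["x2", "x3"])
i2 = Instr("sub",  "x4", ["x1", "x5"])   # reads x1, so it must wait
i3 = Instr("load", "x6", ["x7"])
print(can_dual_issue(i1, i2))   # False: dependent, issue one at a time
print(can_dual_issue(i1, i3))   # True: independent, issue together
```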
Module 5, Video 3
[music] We've seen that instruction-level parallelism can be used on superscalar
processors, to run them faster than would be possible on a scalar processor. But how
much of a speedup is this in practice? Ultimately, this depends on how much
instruction-level parallelism is possible in a typical program. How might we measure
this? We can do this initially without considering any constraints that will be
imposed by the processor it will run on. Let's consider the instructions executed by
the program. Let's assume that we can predict all the branches in the program
perfectly. Then we can ignore branch instructions, as they don't need to flow down
our pipeline. Now let's imagine we can execute any instruction as soon as the data it
needs is ready. That is, we are only restricted by the presence of true data
dependencies. Note that some dependencies are carried through writes and reads to
memory. Rather than considering program order, we can now just look at the order the
dependencies impose on instructions. This is referred to as "data-flow analysis."
Assuming each instruction takes exactly one cycle to execute, the fastest possible
execution time of the whole program in cycles is given by the longest path in the
data-flow graph. The instruction-level parallelism of this program is the number of
instructions divided by this duration, as this gives the average number of
instructions we would need to be able to execute each cycle to achieve this duration.
In real programs, this can be anywhere from around five, to hundreds or even
thousands. An active area of research and innovation for computer architects is to
imagine processor designs that can expose and exploit as much of this parallelism as
possible. One insight architects have had is that superscalar processors need to have
a fast supply of instructions to be able to analyze dependencies effectively. This
often means that the front end of our processor pipeline is much wider than the rest
of the pipeline, so that it can "run ahead" and see what behavior the program will
have next. Fast and accurate branch prediction is vital, as we often have to predict
multiple branches ahead accurately, to achieve good performance. Another key insight
is that we don't have to wait to execute the instructions in program order. If all
the dependencies of an instruction are satisfied, the instruction can proceed down
the pipeline even if previous instructions are yet to execute. This can reduce
program execution time by taking advantage of more instruction-level parallelism. In
practice though, this creates extra complications, as we will see in the next module.
[music]
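The data-flow analysis described here can be sketched in a few lines of Python: given each instruction's true dependencies, the longest dependency chain gives the fastest possible execution time, and dividing the instruction count by that length gives the available ILP. The example dependency graph is invented for illustration, and each instruction is assumed to take one cycle.

```
# Data-flow analysis: critical path length and instruction-level parallelism.

def critical_path_and_ilp(deps):
    """deps[i] lists the earlier instructions that instruction i depends on."""
    finish = {}
    for i in sorted(deps):                       # instructions in program order
        finish[i] = 1 + max((finish[d] for d in deps[i]), default=0)
    longest = max(finish.values())               # longest dependency chain
    return longest, len(deps) / longest

# Six instructions; instruction 2 depends on 0 and 1, 5 depends on 2 and 4, etc.
deps = {0: [], 1: [], 2: [0, 1], 3: [], 4: [3], 5: [2, 4]}
cycles, ilp = critical_path_and_ilp(deps)
print(cycles, ilp)    # longest chain is 3 cycles -> ILP of 2.0
```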
Module 5, Lab
[music] In this exercise, we will be using a simulator to explore superscalar
microprocessor design. The simulator has a number of parameters that we can
configure, such as the number of pipeline stages, the width of the Fetch stage, the
issue width, and the number of ALUs in the design. We can see a diagram of the
processor that we've created and we can also see a number of parameters about that
design, for example the clock frequency and the overall area of the design. When we
press step, the simulator will advance one clock cycle and so for example here we can
see that the fetch stage has fetched the first four instructions. However, we immediately see one of the problems with designs such as this, which is that three of the four instructions that have been fetched are in fact useless: the first instruction was an unconditional taken branch, and therefore the remaining three instructions will not be executed by the program and will immediately be discarded on the next cycle. Pressing the run button allows us to simulate, and we can use the fast-forward
option to simulate much quicker in order to get to the end of the long program
execution. In this case, we can see that our design achieved an average cycles per instruction of less than one, which is to say that on average we executed more than one instruction per cycle, meaning that our superscalar design has fundamentally worked. We can also see that, for example, the overall execution time of the program is 1.6 milliseconds. In a table below, we can see a record of our previous
simulation runs. So let's say for example, we were curious about the effect of
increasing the issue width by one. We can make that change and then press run again
in order to run our new design, and when it finishes we can scroll down to take a
look and we can see that the program execution time has indeed improved down to 1.51
milliseconds at a cost of only 1% area. So it looks like this was a very good
improvement to our design and it's almost surely going to be a beneficial trade-off
in practice. In the exercise you will be invited to configure a number of different
superscalar processor designs with various targets in terms of clock frequency,
design area and execution time. Once you're happy that you've configured the processor that you think completes the exercise, you can scroll all the way down to the bottom,
where you'll see the submit button that you can press to have your answer checked.
Good luck. [music]
Module 6, Video 1
Introduction:
So hi, my name is Peter Greenhalgh. I'm Senior Vice President of Technology and an
Arm fellow. I'm responsible for the Central Technology Group at Arm. We're about 250
people. We work on everything from machine learning to CPU, GPU, system IP, and the
solutions that we create as well. And we basically path-find future technology at the
product level that goes into all of our products and the IP that we produce. Arm is
known for the power efficiency of its microprocessors. How have you managed to keep a
focus on power when building processors with very complex and power-hungry features?
We've got some really great design teams. In fact, we churn out I think more CPUs
than pretty much anyone else does on the planet. I think we're producing something
like four or five CPUs per year. So we've got a lot of experience in designing for
power efficiency and performance, and in fact we can leverage the understanding that
we have all the way down to microcontrollers through to the smaller A-class
processors, all the way up to the high-performance ones. There's a lot of sharing between
the teams in terms of strong knowledge and capability, and insight into how to design
for both performance and power efficiency. More specifically, I mean, ultimately, you
have a performance goal that you need to achieve, and then as part of that you have
to figure out how to get the best possible power out of the design when you're
achieving that performance goal. And to do that, there's kind of some different ways
of looking at it. There's the really detail-orientated work that you need to do around things like clock gating, data gating, all the things to try and stop
unnecessary power use deep within the microarchitecture when the instructions are
flowing through the pipeline or data's moving through the pipeline. And then there's
essentially the structure of the design that you've created. And that then dictates
fundamentally what the power of the design is going to be. You can't fix a design
that's got bad structure with improved clock gating, data gating, and just good low-
level design. You have to marry the two together. And that high-level work that you
do is around making sure that the pipeline is well balanced, that you aren't opening
up the pipeline, going too wide too soon; you're extracting data, you're extracting
information as late as you possibly can and just when you need it, and not just
pipelining it down through the design for the sake of it; and then, fundamentally,
good microarchitecture around branch prediction, which stops you putting things down
through the pipeline that you're just ultimately going to flush; good pre-fetching on
the data side so that you make sure you get the data in the design when you need it,
and you're not sitting around waiting for it. So you have to marry that all together,
and we've got a lot of great techniques in order to achieve that. Fundamentally, as I say, you need to achieve the performance target, and then everything else comes together to achieve that performance target with the best possible energy efficiency. How did Moore's Law affect computer architectures of the past, and what
will its influence be on future designs? Gordon Moore's influence on the industry has
been massive, and the tenets behind the law still continue today, albeit in a
slightly different form. I mean, I started designing at 0.18 micron, essentially 180
nanometers, and here we are today working on 3 nanometers. So it's a vast difference
now compared to when I started 22 years ago. And there's no way we could have got to
where we are today without the process scaling from all of the foundries out there
and all the companies that provide the foundry technology. So it's a little bit like
magic, all of the work that they do. I can't say I understand it in detail, but it's
incredible technology, and that allows us... If it hadn't happened, we'd still be
stuck with the designs which were fairly simple. There's no way that we'd have got to the sort of designs that we have today: massively out-of-order, with a huge depth of instructions in flight, and very, very wide. All of that has been made
possible by the steady improvement, a predictable improvement of the foundries. And
that's kind of one of the key points which really was, and still is, captured by Moore's Law: that kind of predictable knowledge that you will get an improvement in the process. Is it 10%? Is it 15? Is it 20% on, say, power, for
example? It kind of doesn't matter in a way because you can work with what you
eventually get. You can do things like voltage scaling to be able to make use of the
power that's available to you. Is it 5%? Is it 10% on frequency? Again, it kind of
doesn't matter in a way. But what matters is that when we start designing a processor today, we finish it in 18 months' time, and then two years after that it arrives in a product in the shops that consumers can buy. We know that over that period, the
process improvements have happened, which allows us to liberate essentially more
performance, more energy efficiency from the design. And we don't mind too much if it
takes another three months or six months to get to the process. We don't mind too
much if the performance or power is not exactly where it was predicted at the start.
But, ultimately, we know we'll get an improvement, and we know there'll be an
improvement in two years, and three years, and four years, and Moore's Law may have
slowed, but it's certainly not stopped.
[music] As we saw in the last module, instruction level parallelism can be used to
improve program execution time in our microprocessor designs. To enable this, the
compiler creates an optimized instruction schedule when the program is converted into
machine code. Unfortunately, the compiler cannot know precisely what will happen at run-time, and this static schedule is constrained by the order of instructions in the program.
The compiler won't know what the program's input data will be, whether branches will
be mispredicted, or whether memory accesses hit or miss in our data cache. In
contrast, a superscalar processor with "out-of-order" execution can produce an
instruction schedule at run-time, only constrained by true data dependencies and its
hardware limits. This schedule is produced on demand and so can even change each time
the code runs. To do this, we introduce an "issue window" or "issue queue" after the
Decode stage. This holds instructions until they can be executed, not necessarily in
the order they arrived in. Within this window, instructions can be issued whenever
their dependencies are available, and when a functional unit is available to process
it. To be able to detect when an instruction is ready to be issued, we must know
whether the instruction's dependencies are ready when it enters the issue window. We
must then update this status as new results are produced. To implement this, the
names of result registers of executed instructions are broadcast to the issue window.
The instructions waiting there compare the register names to the registers they
require. However, this scheme has a problem: A register will be written multiple
times in the program, and since the instructions are executed out-of-order, the
register name alone is not sufficient to record dependencies. It also means that
instructions would have to wait until all previous reads of a register had finished
before executing. These are called "false dependencies." These problems can be
resolved by "renaming" register names at run-time so that each "in-flight"
instruction writes to a unique destination register. We use a "physical register
file" that is large enough to ensure we don't run out. We keep a "register mapping
table" to store the mapping between architectural, compiler-assigned registers, and
physical registers. Register reads to the same architectural register are renamed
consistently, so that dependencies can be tracked correctly with physical register
names. Physical registers are reused only when they are no longer used by any
instruction currently in-flight or any entry in the register mapping table. The other
big issue with out-of-order execution is memory dependencies. Load and store
instructions can have memory dependencies because they access the same memory
location. To detect this, we need to compare the computed memory addresses that the
instructions access. We thus split memory operations into two steps: address calculation and memory access. We issue the address calculation step as soon as the
dependencies are available. Then, the memory access step is placed in a special load-
store queue to be sent to our data cache as soon as possible. We carefully ensure
that operations that access the same address are kept properly ordered, but
independent accesses can be reordered if beneficial. No access can occur until the
addresses of all previous accesses are known. Since memory writes are irreversible,
store instructions must also wait until we are certain that they will execute.
[music]
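Here is a minimal sketch in Python of the renaming step described above. The register names and the number of physical registers are illustrative assumptions, and recycling of physical registers once they are no longer needed is omitted to keep the example short.

```
# Register renaming: each destination gets a fresh physical register, and
# source reads are redirected through the register mapping table.

class Renamer:
    def __init__(self, num_physical=128):
        self.free = list(range(num_physical))   # pool of free physical registers
        self.map_table = {}                     # architectural -> physical

    def rename(self, dest, srcs):
        # Sources are read through the current mapping, so they point at the
        # physical register holding the latest value of each name. Registers
        # never written before keep their architectural name in this sketch.
        renamed_srcs = [self.map_table.get(s, s) for s in srcs]
        # The destination gets a brand-new physical register, removing
        # false dependencies with older writes to the same name.
        phys = self.free.pop(0)
        self.map_table[dest] = phys
        return phys, renamed_srcs

r = Renamer()
print(r.rename("x1", ["x2", "x3"]))   # first write to x1 -> p0
print(r.rename("x1", ["x1", "x4"]))   # new write to x1 -> p1, reads old x1 as p0
```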
Module 6, Video 2
[music] In the previous video, we outlined the concepts of out-of-order execution,
and register renaming. The issue window will be filled with instructions fetched
along the path that our branch predictor believes the program will take. While we
hope our branch predictor will be correct in most cases, it will sometimes be wrong.
How do we handle such cases? A simple approach is to start by recording the original
program order of the instructions, and then to monitor their progress. We call the
structure that stores the instructions the "reorder buffer." As each instruction
executes and produces a result, we can mark it as done. When the oldest instruction
has completed, we can remove it from the end of the reorder buffer, and the
instruction is said to have "committed." This stream of committed instructions
represents how our program would be executed on a simple in-order pipeline or by an
unpipelined processor. It usefully also provides a point at which we can process
exceptions. For example, if the program divides by zero or attempts to access memory
that does not exist. We also check branch instructions as they complete in order. If
they have been mispredicted, we flush the reorder buffer, our instruction window and
any currently executing instructions and start fetching down the correct path. To
preserve correctness, we must also restore our registers and the register map table
to the values they had when we mispredicted the branch. This can be done with the aid
of a second register map table, updated only when instructions commit in program
order. This can simply be copied to the map table used by our renaming hardware to
"rewind time" for the processor. All the register values we need will be present, as
we don't recycle registers before we know they will not be needed again. In reality,
handling branches in this way is too slow. Processors instead take many copies of the
register map tables and can handle branches as soon as they are resolved, and we
discover they have been mispredicted. They can also selectively neutralize the in-
flight instructions in the datapath that are on the wrong path, rather than flushing
all of these instructions away. [music]
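A reorder buffer of the kind described in this video can be sketched as follows. The structure and field names are assumptions for illustration; the point is simply that instructions may complete out of order but only commit from the head, in program order.

```
# A minimal reorder buffer model: dispatch in program order, mark completion
# out of order, commit only from the head once the oldest entry is done.

from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # oldest instruction at the head (left)

    def dispatch(self, instr):
        entry = {"instr": instr, "done": False}
        self.entries.append(entry)
        return entry

    def mark_done(self, entry):
        entry["done"] = True

    def commit(self):
        """Remove and return instructions from the head that have completed."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["instr"])
        return committed

rob = ReorderBuffer()
a = rob.dispatch("add")
b = rob.dispatch("load")
rob.mark_done(b)            # the load finishes first, out of order...
print(rob.commit())         # ...but nothing commits until the add is done
rob.mark_done(a)
print(rob.commit())         # now both commit, in program order
```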
Module 6, Video 3
[music] We can now bring everything together and look at what a typical pipeline for
an out-of-order superscalar processor might look like. The Fetch stage is aided by an
accurate branch predictor as we met in Module 3. It will fetch a group of
instructions on every clock cycle. This group of instructions will be requested from
the instruction cache, and will be from consecutive memory locations. Branches may
reduce the number of useful instructions that can, in practice, be fetched on each
cycle. The Decode stage decodes multiple instructions in parallel. At this point,
modern high-performance processors may also split complex instructions into simpler
operations or "macro-ops." In some cases, there may also be opportunities to combine
simple instructions into a single operation. The next step on an instruction's
journey is renaming to receive a unique destination register. As we saw in the last
video, this increases opportunities for out-of-order execution. Remember, there are
several times more physical registers in our processor than those available to the
compiler. Instructions are placed in the reorder buffer, and are also "dispatched" to
the Issue stage. They will wait in the window as necessary, and are ready to be
issued once all their operands are available. In the most complex of today's
superscalar processors, there may be hundreds of instructions buffered in the issue
window at the same time. Instructions finally commit in program order. At this point,
any physical registers that are no longer needed can be added back to the pool of
free registers. These are then assigned during the register renaming step. Once an
instruction is issued, it reads its operands from the physical register file. The
Execute stage consists of many functional units operating in parallel. These may each
support different operations and take different numbers of cycles to execute. A
network of forwarding paths is also provided to ensure we can execute any dependent
instruction on the next clock cycle after the generation of the result. This requires
being able to quickly communicate—or "forward"— a result from the output of any
functional unit, to the input of any other. Some instructions will need access to
memory. After computing their addresses, they are placed in the processor's load-
store queues. "Stores" are sent to memory in program order, but "loads" can often be
sent out of order, and ahead of other older stores or loads that are not yet ready to
be issued to memory. The memory system reduces the average memory access time by
providing numerous levels of cache memory. After generating results, we write them
back to the register file. This overview is representative of the fastest modern
microprocessors found today in laptops, smartphones and servers. Whilst much extra
innovation goes into real designs, they generally follow the ideas discussed in the
course. [music]
Module 6, Video 4
[music] One question computer architects always ask themselves is: "how much can we
scale up our design?" Let's take a look at some further potential optimizations to
our out-of-order superscalar processor. We could try to make it wider. For example,
by doubling the number of parallel instructions, we can fetch, decode and execute
more instructions per cycle. Would this double our performance? Sadly, no, things are
not that simple! In practice, some components quickly become very complex, and
performance gains may be hard to extract. For example, today's largest machines fetch
at most ten instructions per cycle from their instruction caches. Fetching more
instructions than this offers minimal performance gain, despite a large hardware
cost. If we increase the number of registers, the size of our issue window, the size
of our load-store queues, or perhaps use a larger and more accurate branch predictor,
our processor's performance will only improve slightly despite a significant increase
in the size of these structures. After a point, the increase in performance is no
longer worth the cost of the extra transistors. It's also possible that performance might reduce overall, since we may need to lower our clock frequency as the structures get larger. Finally, we could introduce more pipeline stages, but we know this
doesn't necessarily lead to higher performance, as mispredictions may become more
costly. The combination of these issues means that extracting performance using
instruction-level parallelism alone becomes more expensive as more performance is
sought. This graph shows how the energy cost of executing an instruction grows
quickly as we try to build higher performance processors. Let's look at some example
designs. Suppose we have a core, which requires a certain area. If we double its
area, its performance improves, although there is a small rise in energy per
instruction. If we quadruple its area instead, its performance has now doubled
compared to our original core, while energy has increased by 50 percent. Going
further, if we increase our processor's area by a factor of 10, performance is only 2.5 times that of our original core, but energy per instruction is now 3 times higher.
Its performance does not improve as fast as the cost of running it! Of course,
engineers are clever and determined, and are constantly developing new techniques to
bypass many of these issues. This means the performance of processors—even ones
running a single thread or program— still improves by around 10 to 25 percent each
year. Nevertheless, ultimately we often need more performance than can be provided by
instruction-level parallelism alone. A modern solution is to employ multiple
processor cores on the same chip—called a "multicore" processor. This changes the
task for programmers; they may need to redesign their programs to take advantage of
such parallelism, but if they can, it can give vast performance benefits. As we've
learned throughout the course, every decision involves trade-offs and compromise. We
are faced with a fascinating but often highly-constrained design problem. We've seen
how performance bottlenecks, that at first seem impassable, can be overcome with
innovative designs. What might the future hold for microprocessors? Can you think of
ideas? What would you design? [music]
Module 6, Lab
[music] In this exercise, we'll be using a simulator to explore an out-of-order
superscalar processor design. The simulator allows us to configure a number of
aspects of our processor, for example, the number of pipeline stages, the width of
the Fetch stage, the size of the issue window, the number of ALUs, and the size of
our re-order buffer. The changes will be reflected in the pipeline diagram, which we
can see below, and also in the statistics below that, with key design metrics, such
as the clock frequency, clock period, and design size, visible below. When we press
"step," the simulation will advance by 1 clock cycle, and so we can see, for example,
on the first clock cycle the 1st 4 instructions are loaded. Although three of them
are unusable because they follow a taken branch and therefore these will not be
executed and will be discarded on the next clock cycle. We can press "run" to watch
our design in action and in order to quickly get to the end, we can use the "fast
forward" feature to simulate the millions of instructions in this particular program.
After the simulation is complete, we can check below to see a number of statistics
about our pipeline, which are useful for understanding why the performance is as it
is, and in particular we can see the program execution time. 1.18 milliseconds gives
us the overall execution time of the program that our design achieved, and notably
our average instructions per cycle gives us the number of instructions that we were
able to complete on each clock cycle, in this case well over 1, indicating that we
are taking advantage of instruction level parallelism in this simulation. Below all
that, we can see our completed simulations and we can check back on these as we
explore new designs. So let's say, for example, we were curious about the effect of
increasing the re-order buffer size. We could change that and then press "run" to
quickly run our next experiment. And then if we scroll down to our table, we can
compare the results and see that the program execution time was substantially improved by increasing the size of the re-order buffer. Although,
admittedly, this did come at a not insignificant increase in the area of our design,
and so whether or not this represents a good trade-off in practice would very much
depend on the problem we're trying to solve. In the exercises, you'll be given a
number of scenarios for processors, which generally revolve around certain
constraints on the design size, or the clock frequency, or the target program
execution time. And once you think you've found a design that meets the goals
required, configure it up using the settings and then scroll down to the "submit"
button at the bottom, and click that to have your answer checked. Best of luck.
[music]