Computer Architecture Essentials by Arm
Module 1, Intro
[music] I'm Richard Grisenthwaite, Chief Architect at Arm. What does a Chief
Architect do? The Chief Architect is responsible for the evolution of the Arm
architecture. One of the things about an architecture is, you want your
software to continue running on future products. So it's a very long term
commitment, and the role of the chief architect is essentially to curate the
architecture over time, adding new features as needed by the partner. Why are
microprocessors so important and ubiquitous? I think the fundamental reason is
that they are a general purpose device. They describe a basic instruction set
that does the stuff that you can construct pretty much any program out of.
Indeed, they're Turing complete, which means, by definition, you can solve any
computable problem with the processor. They're
not customized to any particular usage, but they do two things really rather
well, which are data processing and decision making. In other words, having,
for example, added two numbers and compared them, you're then taking a branch
and making a decision based off that. Now in reality, an awful lot of general
purpose problems are ones that involve: work out some value or some criteria,
compare against that, and then make a decision on it. And essentially that
allows you to solve a huge number of different problems and that's why the
microprocessor has become so ubiquitous. What is the difference between
architecture and microarchitecture? Architecture is what it has to do, and
microarchitecture is how it does it. So for example, we will define in the
architecture a set of instructions that do "add" or "load" or "branch", but it
says nothing about whether you've got a ten-stage pipeline or a three-stage
pipeline. It says nothing about branch prediction. All of those sort of
features which actually determine the performance of the device; all of those
are part of the microarchitecture. The key point is that software that is
written to be compliant with the architecture will actually run on lots of
different implementations, lots of different microarchitectures that
essentially implement that architecture in different ways, choosing a trade-off
between the power, the performance and the area of the design. What does it
take to build a commercially successful processor today? If you start from
scratch with an established architecture but want to create a new
microarchitecture, we reckon, for a high-end processor, you're talking about
three-to-four hundred person-years' worth of work in order to create a properly
competitive multi-issue out-of-order machine compared with the state-of-the-art
that you can get from Arm. In addition to that — and that's just to come up
with the RTL — in addition to that, you've then got to do the implementation.
If you're going to build that on a three-nanometer process, the leading edge to
get the best possible performance, you're talking about tens of millions of
dollars for the mask sets. There's a whole bunch of software you've got to go
and build on top of that and do all of that work. When we've looked at
companies that are interested in actually starting up, taking an Arm
architecture license — say I want to go and build my own business — we reckon
that you need to be prepared to spend getting on for half a billion dollars
before you're actually going to be successful because it takes time. Your first
product is not necessarily going to be fully competitive because it would be
slightly surprising if the first thing that you built was as good as what
people have been building for many years. It takes time to build up that
expertise and skills. And so you're going to see a couple of iterations before
even the best teams end up really being competitive. And so as I say, I think
if you went from nothing and wanted to essentially create something, and that's
using the Arm architecture with all of its existing software and so on, you're
talking the best part of half a billion. What is the Arm business model? The
Arm business model fundamentally is the licensing of semiconductor IP to as
wide a range of companies as possible and in as many ways as possible, in order
to maximize the uptake of the IP. When we talk about IP, we're talking about
essentially designs for processors and we license that both as an architecture
license, where you effectively gain the right to build your own processor to
the specification of the Arm architecture, or an implementation license where
we are licensing, actually, an implementation that is compliant with the Arm
architecture in the form of register transfer level code, or RTL. What makes Arm
different from its competitors? So if you go back to where we were when Arm
started out in the early 90s, there were many, many different architectures
available and they were all kind of doing the same thing but slightly
differently. They would have different instructions, different instruction
sets, and so software needed to be ported. A huge part of Arm's success
actually comes from the fact that we created a business model of licensing the
IP to make it very easy for people to build processors, to incorporate the
designs into their SoCs, into their systems. And that then made it very
straightforward for people to be able to use the Arm architecture. Now what
this then meant was, people said: I will port more software to this because
it's more widely available, and you get this positive feedback effect
whereby more availability of hardware encourages more software, encourages more
hardware, and so on. And essentially that meant that a lot of people said:
there's no point in me having a different architecture. I'm not getting a
valuable difference from doing that. All I've got is, kind of, a needless
change to the software that I need to make. So the whole model we went with at
Arm was: let's license our IP to a great number of different
players to come up with different solutions to meet the form factor of a camera
or a PDA, or whatever it was back in the day. Those things made it much more
straightforward for people to incorporate our technology. [music]

Module 1, Video 1
Computer architecture is the study of tools and techniques that help us to
design computers. More precisely, it helps us understand how to meet the needs
of particular markets and applications, using the technology and components
that are available. For example, we might need to produce the chip at the heart
of a smartphone using 10 billion transistors and a power budget of less than two
watts. How do we achieve the performance a customer wants? The challenge is a
fascinating one, and one that requires a broad understanding. For example, what
target market are we designing for? What are the characteristics of the
applications to be run? How will programming languages and compilers interact
with the microprocessor? How best to craft the narrow interface between the
hardware and software? How to organize the components of our microprocessor?
And how to design a circuit, given the characteristics of individual
transistors and wires? Like many design problems, computer architecture
requires many trade-offs to be made and evaluated. Each design decision will
impact trade-offs between size, performance, power, security, complexity, and
cost. Trade-offs must be re-evaluated regularly, due to advances in fabrication
technology, applications, and computer architecture. Computer architecture must
be grounded in quantitative techniques and experimentation, but the endless
number of possible designs means that the field depends on a high degree of
human ingenuity and art. Perhaps surprisingly, the earliest computers and
today's most advanced machines have much in common. They both execute a stored
program constructed from machine instructions. These instructions perform
simple operations such as adding two numbers. Nevertheless, greater numbers of
faster transistors, and the application of a relatively small number of
computer architecture concepts, have enabled us to construct machines that can
perform billions of instructions per second, and shrink these machines to fit
in hand-held battery-powered devices. It is this rapid progress that has
supported breakthroughs in machine learning, drug discovery, climate modeling,
and supports our modern world where computation and storage are almost free.
The task of designing a microprocessor is split into different levels of
abstraction: "Architecture;" "Microarchitecture;" and "Implementation."
"Architecture" focuses on the contract between programmers and hardware. It
allows compatible families of microprocessor products to be built. The ARMv8-A
architecture is an example of this. Architecture includes the "Instruction Set
Architecture," or ISA, which defines what instructions exist. It also defines
precisely the behavior of memory and other features needed to build a complete
processor. "Microarchitecture" focuses on the organization and structure of the
major components of a microprocessor. It has to match the rules set by the
architecture. The microarchitecture still has flexibility though; and so the
implementation specifies the circuit detail precisely. This culminates in the
exact circuit design for manufacture. Each of these levels is vital, and each
comes with its own challenges and opportunities.

Module 1, Video 2
At the beginning of the 20th century, "computers" were people employed to
perform calculations. These computers used mechanical calculators to help them
perform arithmetic. They followed instructions to decide what calculation to
perform next. These instructions defined the algorithm or program they were
executing. They would consult paper records to find the inputs to their
calculations, and would store their intermediate results on paper so they could
refer back to them later. A modern electronic computer is organized in a
similar way. We will look in more detail at these components of a
microprocessor in the next video, but for now let's look at how it operates.
Microprocessors also follow instructions one by one, and then perform relevant
calculations. This idea of fetching instructions from memory and executing them
is called the "Fetch-Execute Cycle." In the "Fetch" stage, the computer reads
the next instruction from the program. This instruction is encoded in binary as
ones and zeroes, so it must be decoded to understand what it means. This is
done in the "Decode" stage. Once it is clear what to do, we move to the
"Execute" phase. This can involve different tasks such as reading memory,
performing a calculation, and storing a result. Once done, the computer is then
ready to begin the cycle again, by fetching the next instruction. Instructions
are normally fetched sequentially in order, but some special instructions
called "branches" can change which instruction will be executed next. For
branches, a calculation determines the next instruction. This can mean
evaluating a condition, or reading a register to determine the next
instruction's location. Branches allow computers to make decisions, and to re-
use instruction sequences for common tasks. A modern computer program, like a
web browser, contains millions of instructions, and computers execute billions
of instructions per second, but they all conceptually follow this "Fetch-
Execute Cycle."

Module 1, Video 3
Modern microprocessors are circuits built using anywhere from 1,000 to 100
billion tiny transistors. The key to designing circuits with such huge numbers
of parts is to build them from re-usable blocks. Let's take a look at some of
these. Each transistor is an electrically-controlled switch. When there is too
little voltage at the gate, the switch is off, so the electrical signal cannot
propagate from the drain to the source. However, when there is sufficient
voltage at the gate, the switch is on, so the signal does propagate. When
designing processors, we use digital electronics. The only voltage or current
values we consider represent zero and one, enabling us to build robust circuits
from imprecise components, even in the event of electrical noise or
manufacturing imperfections. This means our circuits represent binary numbers,
since they only have two states: zero and one. We use transistors to build
increasingly complex circuits. We can design circuits that can remember binary
values, or select between multiple inputs, or even do arithmetic, like
addition. We can then use those circuits as building blocks in even larger
designs. When designing a digital system, we must keep everything synchronized
to control when its behavior occurs. To do this, we use a "Clock Signal," which
is a wire in the circuit whose signal cycles between zero and one. We measure
the rate in Hertz. For example, if you hear a processor has a clock speed of
two Gigahertz, it means the clock signal cycles 2 billion times per second. The maximum speed
of the clock signal is determined by the longest, and therefore slowest, path
in the circuit between 2 clocked flip-flops. This is referred to as the
"Critical Path." The signal must have time to propagate all the way along the
critical path before the clock cycle completes. A microprocessor needs to do
many types of arithmetic. For this, we build an "Arithmetic Logic Unit" or
"ALU." This circuit receives 2 numbers as input, as well as an indication of
what operation to perform, for example, addition or subtraction. In addition to
logic, a microprocessor needs memory. Memory is organized as arrays of memory
cells that are able to store many "Words" of data. A specific word, commonly 32
bits in length, can be accessed by specifying its "Address." Each address is a
number that indicates the location in the memory that should be read or
written. Memory cells range from hundreds of bits to millions of bits in size,
but larger ones are slower to access, as signals and their long internal wires
take longer to propagate. For that reason, almost all microprocessors include
at least two types for storing data: a big slow "Data Memory," and a small fast
memory called a "Register File." In reality, the "Data Memory" may be
implemented using many different sizes of memory, as we'll see in Module 4. As
well as storing data in memory, we also use some memory to store the
instructions. We need a way to keep track of which instruction we will fetch
next, so we have a "Program Counter." This stores the address in Instruction
Memory of the next instruction to be accessed. Since instructions are encoded
in binary, we also have "Instruction Decode Logic" that converts that binary to
the various signals needed to control the microprocessor.
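
As a small worked example of the clock-speed point (the 2 GHz figure comes from
this video; the critical-path delays below are invented for illustration):

    clock period = 1 / clock frequency = 1 / (2 × 10^9 Hz) = 0.5 ns

    A circuit whose critical path takes 0.4 ns fits within that 0.5 ns budget, so
    it can be clocked at 2 GHz. A 0.8 ns critical path would force the clock
    period up to at least 0.8 ns, limiting the clock to 1.25 GHz.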

Module 1, Video 4
We've already seen how a microprocessor is controlled by instructions. But what
are they really? An instruction is a simple command that the microprocessor
hardware can perform directly. We write them as text like this to make them
easier to read. But for the microprocessor we encode them in binary. We use a
program called an assembler to translate between the human text and the binary.
In this video we'll be looking specifically at an Arm instruction set called
A64 but other instruction sets follow similar principles. Arithmetic and logic
instructions are the simplest type of instruction. The first word tells us what
operation will be performed such as addition or multiplication. The values
after this tell the processor where to put the result and where to get the
inputs. Values starting with X are addresses in the register file. Arithmetic
instructions read one or two registers and then put the result into a third
register. Branch instructions are used to make decisions and to repeat
instructions. Normally the microprocessor executes instructions in sequential
order but branches change that, and explicitly tell the microprocessor the
address of the instruction to run next. This is done by giving the address of
the next instruction in the instruction memory. Some branches are
unconditional, meaning they always occur and always affect the next instruction
address. Other branches are conditional, meaning the processor will perform a
calculation to decide whether to follow the branch or to continue executing
instructions sequentially following the branch. These are preceded by
a comparison instruction that calculates the condition. Loads and stores are the
instructions for accessing the data memory. Loads copy values from memory to
the register file. Stores do the opposite. In both cases, the instruction needs
to know the address in the data memory and the location in the register file to
copy between. For data memory, loads and stores read an address from a base
register. They can also optionally add to this base address by reading another
register, or by simply specifying a number in the instruction itself. Using
sequences of instructions we can build programs. Here is an example of a small
program that implements Euclid's greatest common divisor algorithm. Let's take
a look at it working, one instruction at a time. To start with, the inputs
stored in X1 and X2 are compared. If they're equal, we have found the greatest
common divisor, so a conditional branch instruction moves to instruction
7. If they're not equal, another conditional branch instruction can be used to
determine whether X1 is smaller. Finally, we use an arithmetic instruction to
subtract either X1 or X2 depending on which was larger and then unconditionally
branch back to the start. Here we can see an example of the program running.
One instruction executes at a time. As we step through the program, the values
in the registers X1 and X2 are shown. And we see these are modified each time
we reach instruction 3 or 5. The instructions repeat in a loop, as is common
for many different types of program. Each time round the loop, instructions 0
and 1 check whether or not to continue the loop. When they detect X1 and X2
have become equal, the program finishes and the processor moves on to the next
task.
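
Based on the description above, a minimal A64 sketch of such a program might
look like the following. This is a reconstruction for illustration, not the
exact code shown in the video; the labels are invented, and the comments give
the instruction numbers used in the narration:

    loop:  cmp  x1, x2        // 0: compare X1 and X2
           b.eq done          // 1: if equal, the GCD has been found; go to 7
           b.lt x1_smaller    // 2: if X1 < X2, go to instruction 5
           sub  x1, x1, x2    // 3: X1 was larger, so X1 = X1 - X2
           b    loop          // 4: unconditionally branch back to the start
    x1_smaller:
           sub  x2, x2, x1    // 5: X2 was larger, so X2 = X2 - X1
           b    loop          // 6: unconditionally branch back to the start
    done:                     // 7: the greatest common divisor is now in X1 (and X2)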

Module 1, Lab
In this exercise, we're going to be using ASim. ASim is a behavioral simulator
for a subset of the Arm AArch64 instruction set. "Behavioral simulator" means
that it's not simulating the real circuit-level details of a microprocessor;
it's just simulating the behavior of each instruction as it runs. Nevertheless,
behavioral simulators are vital tools for computer architects. We use them to
check that the designs we've built match the behavior we intended, and we can
also use them to explore new ideas, for example adding new instructions to the
instruction set. We're just going to be using it though to get familiar with
the basics of Arm AArch64. So we can create a new file which will allow us to
type in some Arm AArch64 instructions and when we press assemble that will
cause ASim to begin the simulation of the instructions. If we scroll down we
can see that below we've put a little cheat sheet of Arm AArch64 instructions
that you can use as a reference. Do note that these are just examples though.
So for example this first one is mov X1, X2 and the descriptive text says that
this copies the value of register X2 to register X1. But equally that would
imply that if we use mov X3, X4 instead that would copy the value of X4 to the
register X3. So we don't have to just use these exact instructions, we can
tweak them as we need to, to achieve our goals. So going back to ASim,
let's say that we wanted to add the number two to the number three. In our
cheat sheet we could see that there is an 'add' instruction but it requires
that the inputs for the addition are stored in registers. So first of all,
we're going to need to load up two registers with the numbers that we want to
add in this case two and three. So looking at our cheat sheet again we can see
that there's an instruction mov that allows us to do that. So if I write mov
X2, #3 this would cause the register X2 to be loaded with the value 3, and
similarly I could write mov X1, #2 and this would cause the register X1 to be
loaded with the value two. Last but not least, we could then do the addition we
wanted to do, such as add X3, X1, X2. This will cause X1 to be added to X2 and
the result stored in X3. If I press assemble now, we will see that the UI
changes slightly. Here we can now see the simulated state of the processor.
Because ASim is a behavioral simulator, it's only going to show the state
before and after each instruction. And so right now we are before executing
the first instruction. That's the yellow highlighting there, and we can also
see it in the simulated memory. If I press step we can now see that that
instruction has completed, and it's before executing the mov X1, #2. And
notably we can see that X2 has been loaded with the number 3, which is exactly
what we would expect. If I press step again, we can see that mov X1, #2 has
now occurred, which has loaded the value 2 into the register X1. And last but
not least, if we step again we see that
X3 has become equal to five, which is exactly what we would expect if register
X1 was added to register X2. So this allows us to get a feel for what it's like
for a machine to execute these instructions. We can reset back to the beginning
if we want to watch it go through again. If we want to do an edit for example
adding a new instruction we can just do that. Press assemble and that will add
the new instruction to our program, and we can see its effects by simulating.
Now in the exercise you will be invited to take an existing program and add 1
new instruction to it at the indicated position. Do note that when you assemble
the exercise program, you'll be taken to a gray bit of code, which is our
testing code. But as you step, you'll see that the simulator flicks between the
testing code and the code that we wrote, the exercise one code. You can also
click between them using this. Feel free to try to work out what our testing
code does but you can just ignore it if you want to. The point of the exercise
is just to add the instruction in the indicated position. When you're
comfortable you've got the right instruction you can press run to get all the
way to the end. And if you really think that's correct if you scroll down to
the bottom of the page you'll see the submit button which you can use to send
in your answer. Best of luck.
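
For reference, the complete little program built up in this walkthrough is just
these three instructions:

    mov X2, #3       // load the value 3 into register X2
    mov X1, #2       // load the value 2 into register X1
    add X3, X1, X2   // add X1 to X2 and store the result, 5, in X3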

Module 2, Intro
[music] Hello, my name is Martin Weidmann. I'm an engineer and product manager
with Arm's Architecture and Technology Group. I look after the A-profile
architecture and I maintain Arm's Interrupt Controller specifications. Computer
architecture is sometimes called a science of trade-offs. Why is everything a
trade-off when it comes to designing a processor? Let's take an example of
where we've had to make trade-offs when developing the processor architecture. So the
Arm architecture has a feature called Pointer Authentication. This is often
abbreviated to PAC for Pointer Authentication Code. What this feature is trying
to do is protect against a form of attack called ROP and JOP. These are Return
Orientated and Jump Orientated Programming, and it's where an attacker tries to
subvert things like the call stack to run legitimate code, but in ways that
weren't expected by the programmer or the compiler. PAC or Pointer
Authentication tries to defend against this kind of attack by using part of an
address to provide an encrypted signature. So we can check the signature and
the address match and if they don't, we can spot an attack in progress. So why
is this a trade-off? Well, because to add security, we want that signature to
be as big as possible. The bigger the signature, the more bits we use for that,
the stronger cryptographically that signature is. The trade-off is: the more
bits we use for the signature, the fewer bits we have available for other
things, such as the address. So you can have a big signature with a small
address, but if you want the address to get bigger, then you get a smaller
signature, and that's then cryptographically weaker. So the trade-off we have
to make when designing a technology like that is: What's the right number of
bits for the signature? What's the strength of cryptography we need from that
signature in order to get the design goal, which is to defeat these attacks and
give us more robust computing? What sort of guiding principles do you use when
designing a processor? So when you're designing a processor, the key thing you
have to bear in mind is: What's it going to be used for? What you don't want to
end up with is a very expensive paperweight. So we need to understand the
requirements that the processor has to meet. We have to understand the design
trade-offs we're making and how they work into meeting those requirements. We
also have to consider not just the processor itself, but how we're going to
show that that processor is correct. We'll put as much time, if not more, into
testing and validating the design as we do into designing it. How do you design
a new microprocessor? If you wanted to create a new processor from scratch, the
first thing you're going to have to do is understand the market that that
processor is going to address and to then build a team to design that
processor. There isn't such a thing as one processor for every possible market.
The requirements for something like an embedded microcontroller are going to be
very different to what you want from the processor in your mobile phone, your
laptop, or your server. So you need to understand those requirements as the
first step into building a new processor. What determines the best design for a
microprocessor? So when you're designing a processor, you need to work out what
the best design for a given market or application is going to be. There's no
magic formula for this. It's going to depend a lot on what you're trying to
achieve with that processor. You need to understand things like the power
requirements, the performance requirements. Is it going to work in a highly
noisy electrical environment? There's a big difference between the reliability
requirements you need from something like a watch versus a satellite. So you
would take those requirements and you'd work out what the best set of trade-
offs is going to be, and that's an art more than it is a science. How do the
underlying technologies contribute to this best design? A lot of technologies
go into our processor. There's the design of the microarchitecture, the
implementation of the processor. There's the silicon process you're going to
use, how you're going to integrate that processor into an SoC or ASIC. Is it a
single die, or is it going to be using chiplets or multiple sockets? All of
those different technologies are going to be factors into how you design the
processor, what trade-offs you make, and what performance and power you get out
of the design once you're done. In reality there may be many different 'best'
designs, so how do you pick one? So when you're designing a processor, what you
want is the best design. But often there isn't "a" best design, there's just
different trade-offs. You have to decide what the best set of trade-offs is for
the particular use case you're going for. And that's also going to depend on:
Is this a device that will be off the shelf, used for lots of different
applications—a general purpose processor? Or is this being designed for one
specific use case? Again, there isn't really a magic bullet or single answer
for this type of question. You need to understand how the processor is going to
be used and then use your experience to judge the trade-offs, and what will
give you the best mix of power, performance, area, cost, reliability for your
target use case.

Module 2, Video 1
[music] In this module, we're going to explore how to improve the simple
microprocessor design from Module 1 in order to allow it to execute programs
more efficiently. First, let's find out how long a program takes to execute.
The time taken to perform the average instruction is equal to the number of
clock cycles taken to perform an instruction multiplied by the duration of one
clock cycle. The time taken to run our program is found by multiplying the
average time to perform an instruction by the number of instructions in our
program. How could we make this faster? One thing we could try is to reduce the
number of instructions in a program. We might be able to optimize the code
removing unnecessary and repeated work and selecting instructions to minimize
code size and maximize performance. We could give our microprocessor the
ability to perform more operations in order to help programmers or compilers
further reduce the number of instructions in their program. For example,
allowing the loading of two data values at the same time might allow fewer
instructions to be used in the program. The downside to this approach is that
adding more instructions will require extra circuitry in the processor, and
therefore we will likely increase the clock period. If the extra instructions
are rarely used, this could even mean an overall decrease in performance. We
see this theme often in computer architecture: trade-offs that we have to
carefully balance. Another approach is to use faster transistors, perhaps constructed from
a more recent fabrication technology. This would reduce the clock period but
may increase costs. The rest of this module focuses on an optimization to
reduce the clock period called pipelining. This is the most important
optimization we use when designing processors. It uses a similar concept to an
assembly line in a factory where work can start on the next item before the
previous one finishes. Let's take a closer look. Imagine that each instruction
has to go through four circuits in a processor. If we attempt to do all of
these in one clock cycle this means our clock period is the latency of all four
circuits added together. If we were to pipeline this, we would add a pipeline
register in the middle. This divides the circuit into two sections called
stages. Notice that although each instruction takes a similar amount of time to
travel down the whole pipeline, the pipeline design can execute nearly twice as
many instructions per second. The throughput has doubled. This is because we
can set the clock period much shorter. It's now the maximum latency of the two
stages. We can pipeline into many stages and this allows for much faster
execution of programs. Unfortunately, though, pipelining a real microprocessor
design is not quite as simple because the processor has various feedback
signals and loops in the circuit. In the next video, we'll take a look at the
challenges of pipelining in practice. [music]
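
Written out as a formula, with some invented example numbers, the argument in
this video is:

    execution time = number of instructions × cycles per instruction × clock period

    For example, a 1,000,000-instruction program at 1 cycle per instruction with
    a 4 ns clock period takes 1,000,000 × 1 × 4 ns = 4 ms. Splitting the logic
    into two well-balanced pipeline stages roughly halves the clock period, to a
    little over 2 ns, so the same program finishes in roughly half the time.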

Module 2, Video 2
[music] In this video, we're going to look at applying the pipeline
optimization to a realistic microprocessor design. In the first module, we met
the components of a microprocessor, so let's look at how these are really
connected. This diagram shows all the connections needed for a real,
unpipelined microprocessor. Each clock cycle, the processor starts by fetching
an instruction from the instruction memory. Once the instruction reaches the
decode logic, it is decoded to produce the control signals necessary to execute
it. The exact control signals vary depending on the type of instruction. For
example, arithmetic instructions access the register file and interact with the
ALU. Ultimately, no matter how the instruction was executed, the last step of
each clock cycle is to update the program counter. This is done by the branch
unit. For non-branch instructions, this just means incrementing the program
counter. However, for branch instructions, the branch unit has to do some
calculations. When we apply our pipeline optimization to this design, we face
some challenges. The design has several loops because instructions have
dependencies. How can we break these cycles? The key observation is that not
every instruction is the same. In real programs, branch instructions usually
make up less than 20 percent of the program. For non-branches, the branch unit
doesn't actually need to wait for the ALU before calculating the next program counter. Let's
look at how we can use this fact to pipeline the processor. Once the first
instruction shown in yellow reaches the pipeline register, we're ready to begin
fetching the next instruction, shown in blue. The yellow instruction can be in
the execute stage whilst the blue instruction is being fetched. Once the yellow
instruction is finished, the blue instruction is ready to enter the execute
stage and a new green instruction enters the fetch stage. What about the
branches though? Let's imagine this next yellow instruction is a branch. The
fetch stage works normally until the branch unit, but the branch unit cannot
proceed. Consequently, the pipeline stalls. The fetch stage spends a cycle
waiting whilst the execute stage executes the branch. Finally, once the ALU is
done, the branch unit can proceed and the next instruction, in this case blue,
can be fetched. Overall, this means that the processor wasted one cycle
stalling due to the branch. Since only 20 percent of instructions are branches,
this means that each instruction would require on average 1.2 cycles. The same
idea of stalling the pipeline can be used to create even longer pipeline
designs. This diagram shows a typical five-stage processor pipeline. In the
next video, we'll look at how we can manage or prevent some of the stalls in a
design like this. [music]
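
The 1.2 figure quoted above comes from a simple weighted average, using the 20
percent branch fraction and the one-cycle stall from the video:

    average CPI = 1 + (fraction of branches × stall cycles per branch)
                = 1 + (0.2 × 1)
                = 1.2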

Module 2, Video 3
[music] Instructions within a program may be dependent on each other. That is,
one instruction may produce a value that a subsequent instruction consumes.
Data values may be communicated through registers or memory. The simple program
shown has a number of so-called true data dependencies. This means we must take
care to execute these instructions in order, and make sure results are
correctly communicated. Additionally, the outcomes of branch instructions may
affect the path taken through the program, and consequently, this affects
whether an instruction is actually executed. This sort of dependency is known
as a control dependency. In the previous video, we met a realistic processor
pipeline with five stages. Circumstances that prevent an instruction making
progress in our pipeline are known as pipeline hazards. Let's take a look at
how dependencies cause hazards. This program has a true data dependency. The
first instruction writes to register one, which is then read by the second
instruction. If we send this down our pipeline, we see that the second
instruction must stall, waiting for register one to be written, before it can
read and proceed. This is a data hazard. Unfortunately, dependent instructions
are common and stalling in this way would significantly increase the average
cycles per instruction. Let's take a closer look at the hazard though. The ADD
instruction is in the execute stage, meaning its result is being computed. The
SUB instruction needs that result to proceed. Rather than waiting for the ADD
to reach the writeback stage, we could add an extra path into our pipeline to
carry the output of one stage to a later instruction, making the result
available straight away. We call this a forwarding path. In this case, the ALU
result is forwarded to the SUB instruction to be used as X1. This small piece
of extra circuitry allows this data hazard to be eliminated completely.
Unfortunately, even if we add forwarding paths everywhere, it's not possible to
eliminate all data hazards. For example, this program has a data hazard due to
the load instruction. There are other types of hazard too. This program
contains a control hazard. We cannot be sure which instruction to fetch until
after the branch instruction executes. Consequently, this program has two stall
cycles. We will look in more detail at how control hazards can be mitigated in
the next module. Another class of hazards, called "structural hazards," occurs
when two instructions require the same resources simultaneously. For example,
if instructions and data were stored in the same memory, and this could only be
accessed once per cycle, we would have to very frequently stall our pipeline to
let these stages access memory one-by-one. [music]
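
A hypothetical two-instruction A64 sequence with the kind of true data
dependency described above (the register numbers here are arbitrary):

    add x1, x2, x3   // computes its result in the execute stage and writes X1
    sub x4, x1, x5   // reads X1; without a forwarding path it must stall until
                     // the add reaches the writeback stage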

Module 2, Video 4
[music] In the previous videos, we explored how pipelining could improve
performance by reducing our clock period and by overlapping the execution of
different instructions. We also saw that it was sometimes necessary to stall
our pipeline to ensure that instructions were executed correctly. Ideally, our
average cycles per instruction, or CPI, will remain at 1.0. If we must stall,
however, this will increase. For example, if 20 percent of our instructions
were loads and each of these caused one stall cycle, our CPI would be 1.2. If a
further 20 percent of instructions were branches, and each of these caused two
stall cycles, our CPI would be 1.6. The longer we make our pipeline, the more
stall cycles there will be, and eventually the cost of stalls may outweigh the
benefit of the faster clock period. For example, let's imagine we added a stage
to our five-stage pipeline from the previous video. Now the number of stalls
after a branch instruction increases to three, hurting our CPI. On the other
hand, our clock period would improve. So whether or not this helps speed
program execution would depend on the exact details. It may eventually become
more difficult to reduce our clock period by adding further pipelining stages.
This is because it becomes harder to perfectly balance the logic between stages
and because of the constant delays associated with clocking and our pipeline
registers. To mitigate these issues, we will need to invest in more transistors
and our design will require more area and power. The deeper our pipeline gets,
the greater the investment we need to make in terms of area and power for the
same incremental improvement. Commercial processors today have anywhere from two
to twenty pipeline stages. The faster, more expensive and power-hungry
processors tend to have longer pipelines than the smaller, cheaper processors in
embedded devices. As with many techniques in computer architecture, eventually
it becomes more profitable to invest our time and resources in an alternative
way of improving performance. In later modules, we'll explore how we can reduce
the CPI, even in heavily pipelined processors. [music]
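
The example CPI figures in this video come from the same weighted-sum reasoning
as before, using the stated percentages and stall counts:

    CPI = 1 + (0.2 loads × 1 stall cycle)                         = 1.2
    CPI = 1 + (0.2 loads × 1 stall) + (0.2 branches × 2 stalls)   = 1.6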

Module 2, Lab
[music] In this exercise, we're going to be using a model of a processor
pipeline to explore the effect of the pipelining optimization. Computer
architects use models like this to make high level decisions early on about
what parameters they will use for a processor and using a model such as this
saves the burden of actually building the processor to find out its
performance. The model is not simulating accurately the performance of the
processor but rather it's giving us an idea for what performance we might
expect. So what can we do with this model? Well, we can configure the number of
pipeline stages, which we can see affects the diagram. And we can also turn on
or off the forwarding optimization. As we change these numbers notice that the
design parameters change down here. So for example, the clock frequency is
improved by increasing the number of pipeline stages but the design area will
get bigger. And so this may be a consideration depending on the problem. We can
also choose which of two programs we're going to put down our pipeline. When we
press the step forward button the pipeline advances to the next clock cycle and
we can see the instructions have started to flow down our pipeline and
interesting events such as forwarding will be noted in the simulation.
Sometimes the simulation will detect that there will be a stall. For example, in
this case, we can see that there is a data hazard because the instruction in
the red memory stage writes to register X13 which is read by the instruction in
the yellow decode stage and therefore a stall cycle is necessary in order to
allow the result to be generated. If we press the play button, the simulation
will proceed automatically and we can see various stall events happening as the
simulation proceeds. But notice that the program we're simulating is nearly
1,000,000 cycles long so watching it play out at this speed is going to take
quite a while. So we can use the fast forward slider to simulate much, much
faster. Notice that the statistics down the bottom have updated depending on
the results of the simulation, and at this point we can see that the program is
finished and the simulated program took 3.98
milliseconds to execute. We can also see that just below, the results of past
simulations are stored in little tables so we can easily refer back to them
when we're doing later experiments. So as an experiment, let's imagine what
would happen if we disabled the forwarding optimization but changed nothing
else, and we'll just run this program through. What we can see immediately is
that the design size is slightly better, which is what we would expect. It's
about 1% better in fact in this case, because of the lack of the forwarding
wires. But now that the program is finished, we can see that the program
execution time is a lot worse: 6.34 milliseconds is roughly 60% worse. So
again, looking in our table, we can compare the execution times and the areas,
and we can see that in most cases the forwarding optimization would be a big
win here, because at the cost of an increase in area of about 1%, we've avoided
an increase in execution time of roughly 60%, which is likely to be a good
trade-off, but not always. It would depend on the exact scenario. Perhaps that 1% area is more
important than the performance of this program. In the exercise, you'll be
invited to suggest a design, using the number of pipeline stages and whether
forwarding is enabled, that will meet certain criteria. You can play about and
do as many simulations as you wish to figure out what the best design might
be. Once you've got it set up, select the processor that you're
happy with at the top and then scroll down to the submit button and press that.
Good luck. [music]
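
For completeness, the comparison described in this walkthrough works out as
follows, using the two measured execution times:

    6.34 ms / 3.98 ms ≈ 1.6

    That is, the run without forwarding is roughly 60% slower, and that slowdown
    is avoided at the cost of only about 1% extra area.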

Module 3, Intro
[music] Hi, I'm Nigel Stevens. I'm Lead Instruction Set Architect for the Arm
A-Profile architecture. I've been at Arm for about 14 years and I have
responsibility for the Arm V8-A instruction set including recent developments
such as the Scalable Vector Extension and Scalable Matrix Extension. What do we
mean by the instruction set architecture? The instruction set architecture is
primarily, I guess, what most people think of as the opcodes, the encodings of
instructions that are executed by an Arm-based processor. But it also includes
other aspects as well such as the exception model, system programming features,
memory management and suchlike. The architecture for Arm is rather distinct
from what other companies may call an architecture. For Arm, architecture is a
legal contract, if you will, between hardware and software. If software uses
only those instruction opcodes and features of the ISA that are described by
the Arm architecture to perform its work, and the hardware that it's running on
implements all of those op codes and features exactly as defined by the
architecture, then any architecturally compliant software will run on any
architecturally compliant hardware that implements that Arm architecture. And
that doesn't mean just processors from Arm itself, but also processors that are
designed by our partners and which we have validated are conformant with our
architecture. How do you decide which instructions to include in an ISA? When
we are looking at requests from partners or from internal research to add a new
instruction, we go through quite a long process of trying to justify that
instruction, or, quite commonly, a set of instructions rather than a single
instruction. We have to show that it gives us some real benefit in performance,
the performance of your code running on that CPU. Or maybe not performance.
Maybe it's security you're trying to achieve. But it has to give you some
really concrete benefit that is worth the cost of adding all of the software,
the validation software, the implementation costs for all of the different
implementations, compiler support, and so on and so forth. It has to answer
that cost-benefit analysis. What is the difference between an ISA
and a microarchitecture? The difference between an instruction set
architecture, or ISA, and the microarchitecture is that the ISA is an abstract
concept. It defines a set of instruction encodings which software can use, and
which hardware has to recognize and implement. How that is implemented is a
choice for the microarchitecture. So the instruction set architecture is fixed,
it's defined by Arm. The microarchitecture is defined by whatever team of
people is designing that CPU. And there are many different approaches to
implementing the Arm architecture, from very small, efficient cores with in-
order pipelines up to very high-performance, state-of-the-art, out-of-order
execution, and everywhere in between. So the microarchitecture is
implementation-specific, the architecture is generic, and software written for
the architecture should run on any microarchitecture. Why does Arm produce
processors with different instruction sets? Arm supports multiple instruction
sets. Some of that is to do with legacy: you can't abandon your legacy
software, your legacy ecosystem. So as the architecture has advanced and we've
introduced major new instruction sets, we still have to continue to support old
software. It takes years, maybe 10 years to move the software ecosystem to a
major new ISA. So for example, Armv8, which introduced AArch64, the 64-bit
architecture, also supported what we call AArch32, the old 32-bit architecture
that was implemented in the Armv7 architecture and prior to that, including
the Arm and the Thumb instruction sets. And we
needed to do that because, while some software might start to migrate to the
64-bit architecture, there's still a lot of software on the planet which is
going to continue using the 32-bit architecture, and that has to survive. So
that's part of the reason: it's about legacy. You can't just obsolete the whole
world when you introduce a new architecture, a new instruction set architecture
in particular. There are other reasons as well, which is there are certain
instruction sets that are different for reasons of the ecosystem that they're
working with. So if you were to compare, for example, the A-profile
architecture that's designed for application processors that run rich operating
systems with virtual memory, supporting SMP (symmetric multiprocessing)
operating systems running large applications, whatever it may be: web browsers
on your phone or something, or a web server in a server farm somewhere. You
have your R-profile architecture, which is designed for high-performance, real-
time embedded systems. The constraints there are somewhat different. The
instruction set is actually fairly similar to the A-profile, but some of the
underpinnings of the architecture, the system side of the architecture, are
simplified in order to give more consistent and predictable real-time response
to things like interrupts or memory translation and suchlike for real-time
systems. And then at the other extreme you have the M-profile architecture
which is designed to be capable of being built in a very simple, ultra-low
power implementation with low code size and again, similar to the R profile,
very predictable real-time performance. So the answer is there are different
instruction sets for reasons of the market that they're trying to address, and
then there are different instruction sets because, well, we have history.
[music]

Module 3, Video 1
[music] In the previous module, we explored how pipelining can be used to
improve performance. We also saw how it is sometimes necessary to stall our
pipeline to ensure our program is executed correctly. In a simple pipeline, it
will be necessary to stall the pipeline whenever we encounter a branch
instruction. This is because we must wait until our branch is executed before
we can be sure which instruction to fetch next. As a recap, branches are
instructions that change which instruction in the program will be executed
next. There are two types of branches: conditional branches and unconditional
branches. Unconditional branches always change which instruction executes next,
whereas conditional ones may or may not, depending on the computations in the
program. In real programs, between approximately one fifth and one quarter of
all instructions are branches, and the majority of these are conditional.
Executing a branch involves calculating the new address to load into our
program counter. This is the branch's "target address." However, conditional
branches have an extra task: we must first determine whether the branch is
taken. If the branch is not taken, we can effectively ignore the branch and
fetch the next instruction as normal. Recall the processor performance equation
from an earlier video. Since we have to wait for branches to complete before
fetching the next instruction, we generate stall cycles. These increase the
average number of "cycles per instruction," which reduces our microprocessor's
performance. The longer our pipeline gets, the longer it is before each branch
is resolved, and the more costly branches become. Can you think of a way to
avoid some of this stalling? One idea is to evaluate branches earlier in the
pipeline, for example in the Decode stage instead of in the Execute stage. This
can indeed help to reduce the number of stalls, but we may still need to stall
if the branch depends on other instructions that haven't been executed yet.
Another idea is to continue fetching instructions in program order, effectively
assuming that each branch is not taken. The number of stalls in the pipeline
for a not-taken branch is zero in this design. On the other hand, if the branch
is in fact taken, the subsequent instructions that we fetched will be
incorrect. So, we must remove all instructions that have been fetched on this
incorrect path from our pipeline. This is called "flushing" the pipeline.
Unfortunately, in real programs, branches are taken much more than not taken.
Could we then simply assume instead that all branches will be taken? Sadly not,
no, because then we would also need to know the specific "target address"
immediately, in order to know which instruction to fetch next. It may at first
seem impossible to know this before the instruction is decoded. However,
computer architects have found a way to do exactly this. The next video will
look at "dynamic branch prediction:" the idea of predicting the behavior of the
branch instruction before it has even been decoded. [music]
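
As a small illustration of the two kinds of branch discussed here (a
hypothetical A64 fragment; the register, the labels and the constant are
invented for this example):

    cmp  x1, #10    // a comparison sets the condition flags
    b.lt again      // conditional branch: taken only if X1 < 10
    b    finish     // unconditional branch: always taken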

Module 3, Video 2
[music] In this video, we'll explore how to predict the behavior of a branch
instruction. This can sometimes eliminate the cost of branches altogether.
Fundamentally, a branch involves changing the value of the program counter,
which is the address of the next instruction in the program. If we could
predict what this change will be, quickly and accurately, we would have no need
to stall. Precisely, we need to predict that we are fetching a branch, predict
it as taken or not taken, and predict what its target address is. How could we
ever make such predictions? What would we base them on? Well, since
instructions are often executed multiple times, we can accurately make these
predictions based on just the instruction address. If we've previously seen
that a particular address contains a branch, and we see that address again, we
can predict whether that branch will be taken, and its target address, based on
its behavior last time. Amazingly, for real programs, simply predicting
repeating behavior is typically around 90 percent accurate. This means we could
eliminate stalls caused by branch instructions 90 percent of the time. Let's
apply these insights to try to build a branch predictor. We will add two extra
blocks to the Fetch stage of the pipeline we met in Module 2. The first will
remember information about recently executed branches. This will include the
program counter values of branches and their target addresses. This memory is
called the "Branch Target Buffer" or BTB. The second block will make
predictions about whether an address containing a branch is taken or not. We
simply call this the "branch predictor." In the next video, we will look at
these in more detail. Combining these two gives us all the information we need
to predict the next value of the program counter based solely on the current
value of the program counter. We don't even need to decode the instruction to
make this prediction. Here we can see how a running branch predictor behaves
for a sample program. Each program counter is checked in the BTB to see if it's
predicted to be a branch and to identify its predicted target. We also
simultaneously check the branch predictor to see if the branch is predicted to
be taken. Based on these predictions, the branch unit computes the predicted
next program counter. Many cycles after the prediction, feedback will be given
by the rest of the pipeline as to whether or not the prediction was correct.
Whenever the prediction is wrong, we have to flush the pipeline and update the
BTB and branch predictor. The pipeline will then resume fetching from the
correct program counter as computed by the pipeline. The majority of
instructions are not branches, so most of the time the branch unit just
increments the program counter to the next sequential instruction. The BTB
contains the instruction address and target
address of some recently executed branches. The exact number of entries in the
BTB varies considerably. BTBs in large modern processors contain many thousands
of branches. In operation, the BTB checks the supplied program counter against
its memory to see whether it has a match, and if so, it returns the target
address. Otherwise, it predicts that the instruction is not a branch. After
each branch executes, the BTB is updated with the true target address. The BTB
cannot be arbitrarily large, so it may have to forget an existing branch to
remember a new one. A simple BTB design like this is typically around 90
percent accurate at predicting target addresses in real programs. [music]
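
One way to see the payoff, combining the rough figures from this module (20
percent of instructions are branches, 90 percent prediction accuracy) with,
purely for illustration, a one-cycle penalty per branch that is not predicted
correctly:

    without prediction: CPI = 1 + (0.2 × 1)         = 1.2
    with prediction:    CPI = 1 + (0.2 × 0.1 × 1)   = 1.02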

Module 3, Video 3
[music] In the previous video, we met the two major components of dynamic
branch prediction, the BTB and the branch predictor. In this video, we'll take
a deeper look at the branch predictor. It predicts whether or not a branch will
be taken, based on the program counter. A simple branch predictor would try to
remember what the branch did last time and predict that the same behavior will
repeat. Let's see how such a predictor might be organized. Remembering a branch
prediction for every possible instruction address would take up far too much
memory, so we reuse the same memory for different branches via a process called
"hashing." We hash the address of the branch to a smaller number. This does
unfortunately lead to a problem called "aliasing," where two different branches
can hash to the same value, but this is rare in practice. Let's see what
happens now, when we execute a simple loop. We see a misprediction when it
encounters our branches for the first time, and another when we exit the loop.
The first case will be dependent on the value in our predictor's memory, and it
may be that we are able to predict the branch correctly the first time we see
it. The second case is hard to avoid, although some more sophisticated branch
predictors will learn how many iterations a loop will make. A common
improvement to this simple scheme is to avoid instantly flipping our prediction
just because the branch does something unexpected once. This can be achieved
with a saturating counter, which instead remembers how many times the branch
has been taken recently, versus not taken. The counter increments when the
branch is taken and decrements when not taken. It predicts "taken" if the
majority of recent executions of this branch were taken. When building higher-
performance processors, we often have to discard many instructions every time
we mispredict a branch, so accurate branch prediction is very important.
Therefore, branch prediction is still an active area of research. One of the
key ideas used is correlation between branches. In real programs, a common
pattern is for example a pair of branches that always have opposite behavior. A
branch predictor can take advantage of this by remembering a history of whether
or not recent branches were taken or not taken and incorporating this in the
hash. Another idea is to combine multiple different types of predictor in a
"tournament predictor." We use a "meta predictor" to predict which of two
branch predictors will do a better job. As you can see, branch prediction can
get quite complicated. [music]
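
Here is a minimal sketch, not from the course, of the simple scheme described
above: a table of two-bit saturating counters indexed by a hash of the program
counter. The table size and the starting counter value are assumptions;
correlating and tournament predictors would extend this with history bits and a
meta predictor.

```python
class BimodalPredictor:
    """Sketch of a table of 2-bit saturating counters indexed by a PC hash."""

    def __init__(self, num_counters=4096):
        self.num_counters = num_counters
        self.counters = [1] * num_counters  # counters range 0..3, start at 1

    def _index(self, pc):
        # Hashing the PC: different branches may alias to the same counter.
        return pc % self.num_counters

    def predict(self, pc):
        # Predict taken if the majority of recent outcomes were taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)  # saturate at 3
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)  # saturate at 0

# Example: a loop branch at a hypothetical address, taken 9 times then exiting.
p = BimodalPredictor()
wrong = 0
for taken in [True] * 9 + [False]:
    if p.predict(0x8000) != taken:
        wrong += 1
    p.update(0x8000, taken)
print(wrong)   # 2: one misprediction while warming up, one at the loop exit
```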

Module 3, Video 4
[music] In this module, we've met the concept of branch prediction. Even using
relatively small simple circuits, we can accurately predict real branch
behavior more than 95 percent of the time. Could we do better, and do we really
need to? In small, simple processors, these prediction accuracies are fine,
because each misprediction causes only a few stall cycles. Increasing the
accuracy is not that impactful. However, in complex processors with very long
pipelines, the difference between 98 percent and 99 percent prediction accuracy
can be significant for performance as a misprediction could incur dozens of
stall cycles. Accurate prediction really does matter. One of the problems we
face in making accurate predictors is that they need to be small enough and
fast enough to fit in a microprocessor. We can imagine all sorts of ways to do
accurate branch prediction, but if their circuits were slower than simply
executing the branch, they would not be useful. Modern high-performance
processors will often have multiple branch predictors, for example a small fast
one and a slower complex one. The slower one can override the prediction of the
fast one if it thinks it got it wrong, which does incur some stall cycles, but
fewer than a total misprediction. Another problem we face is that some branches
are just very hard to predict. No matter what technique is used, there are some
branches in real programs that are effectively "random". For example, when
compressing or decompressing data, the combination of the underlying algorithm
and the input data may provide no clear pattern for a predictor to exploit. No matter how
hard we try, some branches will never be predicted correctly 100 percent of the
time. A final problem is that since the predictors work based on observing the
program, there will always be a warm-up period while the predictor trains on a
program to learn its behavior. [music]
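
As a rough, back-of-the-envelope illustration of why an extra percentage point
of accuracy matters in a long pipeline, consider the calculation below. The 20
percent branch frequency and the 30-cycle flush penalty are assumed figures
chosen for illustration, not numbers from the course.

```python
# Rough cost of mispredictions, in extra cycles per instruction (CPI penalty).
branch_fraction = 0.20   # assumed: 20% of instructions are branches
flush_penalty = 30       # assumed: a misprediction flushes ~30 cycles of work

for accuracy in (0.95, 0.98, 0.99):
    cpi_penalty = branch_fraction * (1 - accuracy) * flush_penalty
    print(f"accuracy {accuracy:.0%}: +{cpi_penalty:.2f} cycles per instruction")

# accuracy 95%: +0.30 cycles per instruction
# accuracy 98%: +0.12 cycles per instruction
# accuracy 99%: +0.06 cycles per instruction
```

Under these assumptions, moving from 98 to 99 percent accuracy halves the
misprediction penalty, which is why the small difference matters so much.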

Module 3, Lab
[music] In this exercise, we're going to be using a branch predictor simulator
to explore dynamic branch prediction. This simulator will accurately simulate
the details of a branch predictor but it uses a trace of a real program
executing on a real machine to avoid the need to simulate the rest of the
processor. Computer architects use techniques such as this to quickly explore
one area of processor design, understanding that the results may not be
completely accurate given that we're not simulating the full processor design.
The interface allows us to configure details of our branch predictor, for
example the maximum value of the saturating counters used in the branch
predictor's table, or the hash function. We can also see the impact of any
changes on the delay of the branch predictor and on the design size, which are
two key metrics when designing a branch predictor. Once we're happy with the
design we've configured, we can press run to simulate a program, and the
results of that program's execution will be displayed in the rest of the stats.
So, for example, we can see here that the predictor predicted 95.24% of the
branches correctly, which is fairly good, and this resulted in an overall
execution time for the program of 5.335 milliseconds. Just below, we can see a
table that records previous
simulation runs as well, so we can do multiple experiments and see which
produces the best results. For example, if we were curious about the effect of
using a saturating counter with a maximum of three rather than one, we could
change that and notice that the design size has substantially increased. When
we press run, we notice that the predictor accuracy has also increased: it's
gone up to 96.31%, and consequently the execution time of the program has
fallen slightly. So we can compare these two designs and see whether or not
this represents a good trade-off for our processor. Perhaps the area cost is
justified, or perhaps it's too much; it would depend on the exact processor
we're trying to design. In the problems, you'll be invited to come up with
designs that are suitable for particular constraints. For example,
particular constraints on the runtime of the program or the design size. Once
you've configured a branch predictor that you think meets the objectives, you
can scroll down to the submit button, click it, and you'll be told whether or
not your answer is correct. Good luck. [music]

Module 4, Video 1
[music] So far, we've looked at the microprocessor's "datapath"— meaning its
execution units, registers, and control circuitry. We have given less attention
to its memory. We usually implement memory using a different type of chip than
the microprocessor, using a technology called DRAM. It is very dense, allowing
us to store lots of data in a small area. However, one issue with DRAM is its
speed. Since the 1980s, processor performance has increased very rapidly at
roughly 55 percent per year, so CPUs of today are many orders of magnitude
faster than those of 40 years ago. In contrast, memory performance has grown
much more modestly. Whilst memories are also much faster than they were in
previous decades, their performance has not kept pace with processors. This
leads to a processor-memory performance gap, with it becoming more costly to
access memory as time goes on. One of the issues that makes these memories slow
is their size. We can make a memory as fast as our microprocessor, if it is
very small. However, this is the opposite of what programmers want. Programmers
can use extra memory to solve more complex problems, or to solve existing
problems faster. This complicates microprocessor design because although we
want a large memory to hold all our data, large memories are slow and this
slows down our processor. How do we overcome this? What if we could get the
speed benefit of a small memory alongside the size benefit of a large memory?
One solution is to have a large, slow memory and a small, fast memory in our
system. For this to be useful, we need the small memory to hold the data that
we use most often, so that we often get the speed benefits of accessing it, and
only rarely have to access the slow memory. Whilst there are different
arrangements of these two memories, the configuration that is most often used
is an "on-chip cache memory." The small memory sits inside the processor
between the pipeline and the large memory. We keep all data in the large main
memory, and put copies of often-used data in the small memory, which we call a
cache. But we are not limited to only one cache! Our pipeline reads memory in
two places: when fetching the instructions; and when accessing the data. It
makes sense to have two caches here, each optimized for different purposes. The
instruction cache is optimized for fast reading of instructions at Fetch. The
data cache is optimized for reading and writing data from the memory stage. We
will often put a larger "level 2" cache between these two caches and the main
memory. The L2 cache is a "medium-sized" memory: faster and smaller than main
memory, but slower and larger than the two L1 caches. Using a hierarchy of
caches reduces the bandwidth requirements to main memory and the energy cost of
moving data around. [music]

Module 4, Video 2
[music] We previously looked at the need for a cache, which can be used to
store often-used data for fast access by the processor. But which data is used
often enough that it is worth including in the cache? Programs are mostly made
of loops. Here's a simple one that sums values in memory. It displays the two
characteristics that we can exploit to decide what data to put into a cache.
Let's look briefly at how it works. Each time round the loop, there are two
loads and one store. The first load is to load the data we're summing. The
other load, and the store, are to update the running sum. Notice two things
here. First, we access the running sum over and over again; each time round the
loop. Second, when we access part of the data in one loop iteration, we've
already accessed its predecessor in the previous iteration, and will access its
successor in the next. Caches exploit two types of "locality" in programs in
order to be effective. The first is temporal locality: if a piece of data is
accessed, it is likely that it will be accessed again in the near future. The
running sum has temporal locality. The second type of locality is spatial
locality: If a piece of data is accessed, then its close neighbors are quite
likely to be accessed in the near future. By close neighbors, we mean data
whose memory addresses are not far from each other. The data accesses have
spatial locality. It turns out that most programs exhibit lots of temporal and
spatial locality. We can exploit this to determine what to put in the cache.
Exploiting temporal locality is fairly easy: we simply see which values have
been accessed and place them in the cache. Exploiting spatial locality is also
quite simple: when a piece of data is accessed, place its neighbors in the
cache. We'll see in the next module how this is actually achieved. We use
locality to guess what values will be accessed next and store them in the
cache. If we are correct, we get the benefit of fast memory, but if we are
wrong, we must perform a slow access to main memory— just as we would if the
cache were not present. In real programs, we see hit rates of 90 percent or
more in microprocessor caches, resulting in vast performance improvements.
However, it's worth noting that some programs have less locality, and for those
programs, caches offer little performance benefit. [music]
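
For reference, here is the kind of loop described in this video, sketched in
Python. At the machine level, as described above, each iteration involves a
load of the next data element plus a load and a store to update the running
sum, although a compiler may well keep the sum in a register.

```python
# A simple loop that sums values in memory, illustrating the two kinds of
# locality that caches exploit.
data = list(range(1000))   # stored at consecutive addresses
total = 0                  # the running sum: accessed every iteration,
                           # so it has temporal locality
for value in data:         # each element's neighbour is accessed in the
    total += value         # previous or next iteration: spatial locality
print(total)               # 499500
```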

Module 4, Video 3
[music] We've looked at the reasons why we build caches, but how do they
actually work? To the outside world, a cache simply takes an address as input,
and either provides the data that is stored at that location at output, or
returns a signal to say that it doesn't have it. If the data is found, this is
called a "cache hit". If the data is not in the cache, this is called a "cache
miss", and it means we must look for the data in main memory instead. After
each miss, we update the contents of the cache. The fundamental building block
of a cache is called the "cache line". It's a number of data bytes from
consecutive addresses in main memory. Cache lines are typically 32 or 64 bytes
long, and a cache typically has an array of hundreds to many thousands of
lines. The line captures spatial locality, because it is larger than the data
read by a single load instruction. When a cache miss occurs, the whole line
containing that data is copied to the cache from main memory, meaning we have
nearby values for future accesses. When a request comes in, we use some bits
from that address to index the line array. Just like in the last module on
branch prediction, this leads to the problem of aliasing again, since selecting
only some of the bits to index into the array is like a hash. This means that
many addresses map to the same line in the cache, but we can only store one of
their lines of data. We need to note down which address's line is currently
stored in the data array, in what we call the "tag array". There is one tag for
each cache line. When we access the cache, we access the tag array with the
same index to see if the data we need is present. This design is called a
"direct-mapped cache". Direct-mapped caches work fairly well, but for some
programs, we can be unlucky, with the program accessing two aliasing lines
frequently. We can do something about this by duplicating both the arrays, so
that each line of data can now be stored in one of two places in the cache.
When we access the cache, we look at both arrays and only get a miss if neither
of the tags match. This is called a "2-way set-associative cache", because each
line has a set of two places or "ways" it could reside. The "associativity" of
this cache is therefore two. Set-associative caches introduce a further
complication: when we want to add data, where do we put it? In a 2-way set-
associative cache, there are two choices. How we decide which cache line to
evict is called the "replacement policy". There are many different types of
replacement policy. A simple one is just to make a pseudo-random choice from
all the possible cache lines. Another option is to keep track of when each
cache line was last accessed and to evict the one last used furthest in the
past. This is called a "least recently used policy", and takes advantage of
temporal locality. It does, however, mean storing extra information in the
tags to track usage. Many other ideas are possible too. [music]
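
The following Python sketch shows how a direct-mapped lookup might split an
address into tag, index and offset; the line size and line count used here are
illustrative assumptions. A 2-way set-associative version would keep two tag
and data arrays, check both on a lookup, and use a replacement policy to choose
a way on a fill.

```python
# Sketch of a direct-mapped cache lookup. Sizes are illustrative assumptions.
LINE_SIZE = 64          # bytes per cache line
NUM_LINES = 256         # lines in the cache

OFFSET_BITS = LINE_SIZE.bit_length() - 1   # 6
INDEX_BITS = NUM_LINES.bit_length() - 1    # 8

tag_array = [None] * NUM_LINES    # which address's line is stored in each slot
data_array = [None] * NUM_LINES   # the cached line of bytes

def split(address):
    offset = address & (LINE_SIZE - 1)
    index = (address >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def lookup(address):
    """Return True on a cache hit, False on a miss."""
    tag, index, _ = split(address)
    return tag_array[index] == tag

def fill(address, line_bytes):
    """On a miss, copy in the whole line, evicting whatever aliased here."""
    tag, index, _ = split(address)
    tag_array[index] = tag
    data_array[index] = line_bytes

fill(0x1000, bytes(64))           # bring the line containing 0x1000 into the cache
print(lookup(0x1008))             # True: same line, so spatial locality pays off
print(lookup(0x1000 + NUM_LINES * LINE_SIZE))   # False: aliases to the same index
```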

Module 4, Video 4
[music] Now that we've seen how caches work, let's see how they affect the
performance of a processor. Recall the processor performance equation, where
the processing time is proportional to the average cycles per instruction.
Without a data cache, if 20 percent of instructions are loads, and main memory
takes 20 cycles to access, our CPI figure must be at least 5. However, if we
provide a cache that holds the required data 80 percent of the time, and
only takes 2 cycles to access, our CPI reduces to 2.2, which is a significant
improvement! We can isolate the memory terms in this equation to get the
average memory access time —abbreviated to AMAT— which allows us to compare
different cache configurations more easily. Changing the cache configuration
will impact the AMAT. There are many different cache parameters we can change,
such as the size, replacement policy, associativity, whether we put data in the
cache for stores or just for loads, and so on. For example, reducing the size
of the cache will improve the access time for a hit, but will also increase the
miss rate. Let's say that we can halve the access time to 1 with a
corresponding halving of the hit rate. This alters the AMAT to 13, which in
this case is worse for performance overall. It's also useful to look at why an
address might miss in the cache. Broadly speaking, we can divide cache misses
into three different categories. Compulsory misses occur when we attempt to
access an address that we have never seen before and so never had the
opportunity to cache it. Capacity misses occur when there is more data being
accessed than the cache could hold, even if we had complete freedom in where to
put each cache block. Conflict misses occur in caches where there are more
addresses hashing to the same index than arrays to hold the data. We can alter
our cache configurations to lower these misses, but as always, there are trade-
offs involved. Compulsory misses can be reduced by increasing the cache block
size, to take advantage of spatial locality. But for a fixed cache size, this
reduces the number of different addresses or cache lines that can be stored. A
technique called "pre-fetching" can also be used to predict the addresses that
will soon be accessed, and bring their data into the cache early. But this
increases energy consumption, and may make the cache perform worse if the
predictions are not highly accurate. Capacity misses can be reduced through
increasing the size of the cache. Although, as we saw before, this impacts the
number of cycles taken to determine a hit. Conflict misses can be reduced
through increasing the number of cache blocks in each set, with an increase in
energy consumption as a side effect of this. [music]
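
The worked numbers in this video can be reproduced with a small calculation.
The exact way memory time is folded into CPI isn't spelled out, so the model
below (CPI = 1 + load fraction × AMAT) is an assumption that happens to match
the quoted figures.

```python
# Reproducing the worked example from this video.
base_cpi = 1.0          # every instruction takes at least one cycle
load_fraction = 0.20    # 20 percent of instructions are loads
memory_latency = 20     # cycles to access main memory

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles."""
    return hit_time + miss_rate * miss_penalty

# Without a data cache, every load pays the full memory latency.
cpi_no_cache = base_cpi + load_fraction * memory_latency
print(f"CPI without a cache: {cpi_no_cache:.1f}")              # 5.0

# With a 2-cycle cache that hits 80 percent of the time.
amat_cache = amat(hit_time=2, miss_rate=0.2, miss_penalty=memory_latency)
cpi_cache = base_cpi + load_fraction * amat_cache
print(f"AMAT with the cache: {amat_cache:.1f} cycles")          # 6.0
print(f"CPI with the cache:  {cpi_cache:.1f}")                  # 2.2

# Halving the cache's access time to 1 cycle, but also halving its hit rate.
amat_small = amat(hit_time=1, miss_rate=0.6, miss_penalty=memory_latency)
print(f"AMAT, smaller faster cache: {amat_small:.1f} cycles")   # 13.0
```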

Module 4, Lab
[music] In this exercise, we're going to be using a cache memory simulator to
explore the effects of cache memories on processor performance. The simulator
accurately simulates the behavior of the cache memory system, but it's using a
model of the rest of the processor to quickly allow us to simulate the effects
of the cache on the processor without needing to simulate the full processor.
We can configure a number of parameters about our cache, for example, the
number of levels in our cache, whether or not the cache separates instructions
and data, or is unified, keeping them both in the same cache. We can also
configure the size, the line size, and the associativity, and changing these
numbers will affect design parameters, such as the access times in the case of
a level one (L1) cache hit or a cache miss, and also the design size. Once we're
happy we've found a design we'd like to investigate, we can press "run", at which
point the simulator will run through the program. The hit rates for
instructions in the level one cache, and for data in the level one cache, are
displayed, and we can also see the average memory access time that results from
this. Then, below everything, we can see a table of past simulations so that
we can quickly refer back to our previous experiments when we do new ones. So,
for example, let's say we were curious about the effects of increasing the size
of the cache. If we change the parameter and then press "run", we can
immediately see that the design size has substantially increased, which makes
sense because we've doubled the size of the cache's contents, and therefore
we'd expect roughly double the area. And
we can also see that the hit rates have improved. So there's about a 1%
improvement to the L1 instruction cache hits and a 1% improvement to the L1
data cache hits, which has reduced the overall average memory access time. And
so we can compare these two designs to see which of them we think is better.
It's a trade-off of course, though. The larger design has got better
performance, but it is larger, and so depending on the context, we may need to
pick the smaller design or the bigger design depending on our performance
goals. In the exercises, you'll be invited to come up with a series of designs
for caches that meet certain performance goals. For example, you'll have a
constraint on the area of the design or the execution time of the program, and
you need to optimize the cache to meet those goals. Best of luck. [music]

Module 5, Video 1
[music] In this module, we'll look at how to further improve performance by
exploiting "instruction-level parallelism." In Module 2, we explored how
pipelining can improve the performance of our processor. This reduced our clock
period, and allowed execution of instructions to be overlapped, improving
throughput. One way to boost performance further would be to create a much
deeper pipeline. At some point, this would mean even the ALU in our Execute
stage will be pipelined. Consider the simple program shown in the slide. Some
instructions are dependent on a result from the previous instruction. Remember
in our 5-stage pipeline that these dependent instructions could be executed in
consecutive clock cycles with the aid of data forwarding. If execution takes
place over two pipeline stages, we need to stall if adjacent instructions share
a dependency. This allows time for the result to be
computed. The programmer may be able to rewrite their program to get the same
result with fewer stalls, by placing an independent instruction between our
pair of dependent instructions. In this case, we can move the third and fifth
instructions earlier to optimize performance. The performance of programs that
run on our "super-pipelined" processor would, to some degree, be determined by
the availability of independent instructions that could be executed in parallel
in the pipeline. This is "instruction-level parallelism"—or ILP. Very deep
pipelines are problematic as they would require: a very high-frequency clock to
be distributed across the chip very precisely; careful balancing of logic
between many, very short, pipeline stages; the pipelining of logic that is
difficult to divide further into stages; and the division of logic at points
requiring many pipelining registers to be inserted. A different approach to
exploit ILP is to make our pipeline wider rather than deeper. In this design,
the processor will fetch, decode, and potentially execute multiple instructions
each cycle. Such a design avoids the problems of a super-pipelined processor,
although as we'll see in the next video, it does introduce some new
complications. [music]
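
Here is a hypothetical illustration (not the program on the slide) of the
scheduling idea described above: if execution is split over two pipeline
stages, a result can only be forwarded to the instruction after next, so
placing an independent instruction between a dependent pair removes the bubble.
The instruction representation and register names below are invented for the
sketch.

```python
# Hypothetical illustration: count stall cycles when the Execute stage is
# pipelined over two stages, so a result can only be forwarded to the
# instruction *after next*. Each instruction is (destination, sources).
def stall_cycles(program):
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_dest, _ = prev
        _, curr_srcs = curr
        if prev_dest in curr_srcs:   # back-to-back dependency: one bubble
            stalls += 1
    return stalls

# r1 = load, r2 = r1 + r3, r4 = load, r5 = r4 + r6  (invented fragment)
original = [("r1", []), ("r2", ["r1", "r3"]), ("r4", []), ("r5", ["r4", "r6"])]
# The same work with the independent load moved between the dependent pairs.
scheduled = [("r1", []), ("r4", []), ("r2", ["r1", "r3"]), ("r5", ["r4", "r6"])]

print(stall_cycles(original))    # 2 bubbles
print(stall_cycles(scheduled))   # 0 bubbles
```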

Module 5, Video 2
[music] In this video, we are going to explore "superscalar" processors, which
can process multiple instructions in each pipeline stage. In our simple 5-stage
pipeline, there is at most one instruction per pipeline stage. At best, we can
complete one instruction per cycle. We call such a design a "scalar" processor.
In a 2-way superscalar version of this processor, we would extend this design
so it is able to fetch, decode, execute and writeback up to two instructions at
a time. In general, superscalar processors may vary the number of instructions
that can be processed together in each stage. Let's step through the design.
Our instruction cache will need to supply two instructions per cycle. Typical
superscalar processors only ever fetch adjacent instructions on a given cycle.
This can lower performance if, for example, the first instruction fetched is a
taken branch, because then the second would not be required. Note that now
every cycle lost due to control hazards will cost us two instructions rather
than one, so accurate branch prediction matters even more in superscalar
designs. The Decode stage must now decode and read the registers for two
instructions simultaneously. Fortunately, we are able to extend the register
file design to read many register values at the same time. The Decode stage
also needs to check whether the two instructions are independent. If so, and if
the functional units they both need are available, it can "issue" them for
execution in parallel on the next clock cycle. Otherwise, in this simple
design, it will only issue the first, and keep the second back. A simple design
such as this—where two instructions are fetched, decoded and issued— is called
a "2-way" or "dual-issue" processor. In other designs, the width may vary at
different stages of the pipeline. To support the execution of multiple
instructions at the same time, the Execute stage is expanded and contains two
execution pipelines. It's common for these to have slightly different
capabilities to save area. For example, the top pipeline can execute both ALU
and memory instructions, while the second pipeline only executes ALU
instructions. To ensure that dependent instructions can execute on consecutive
clock cycles, we must add data forwarding paths. These data forwarding paths
must allow results stored in either execution pipeline to be forwarded to the
input of either ALU. During writeback, we need to store both results to the
register file. This means the register file must be redesigned to allow two
writes per clock cycle. Overall, these changes typically require 25 percent
more logic circuitry in our processor, compared with a scalar processor. But
we'd expect an improvement in execution time of between 25 and 30 percent for
real world programs. [music]
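
A minimal sketch of the Decode-stage check described above, assuming a
simplified instruction representation: the pair can issue together only if the
second doesn't read the first one's result, and only one of the two can be a
memory instruction, since only the first execution pipeline handles memory
operations in this simple design.

```python
# Sketch of the dual-issue check. Instructions are modelled as
# (kind, destination, sources); the representation is an assumption.
def can_dual_issue(first, second):
    kind1, dest1, _ = first
    kind2, _, srcs2 = second

    # The second instruction must not need the first one's result (RAW hazard).
    if dest1 is not None and dest1 in srcs2:
        return False

    # Structural limit: only one pipeline executes memory instructions, so two
    # memory instructions cannot issue together in this simple design.
    if kind1 == "mem" and kind2 == "mem":
        return False

    return True

print(can_dual_issue(("alu", "r1", ["r2", "r3"]), ("alu", "r4", ["r5", "r6"])))  # True
print(can_dual_issue(("alu", "r1", ["r2", "r3"]), ("alu", "r4", ["r1", "r6"])))  # False
print(can_dual_issue(("mem", "r1", ["r2"]), ("mem", "r4", ["r5"])))              # False
```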

Module 5, Video 3
[music] We've seen that instruction-level parallelism can be used on
superscalar processors, to run them faster than would be possible on a scalar
processor. But how much of a speedup is this in practice? Ultimately, this
depends on how much instruction-level parallelism is possible in a typical
program. How might we measure this? We can do this initially without
considering any constraints that will be imposed by the processor it will run
on. Let's consider the instructions executed by the program. Let's assume that
we can predict all the branches in the program perfectly. Then we can ignore
branch instructions, as they don't need to flow down our pipeline. Now let's
imagine we can execute any instruction as soon as the data it needs is ready.
That is, we are only restricted by the presence of true data dependencies. Note
that some dependencies are carried through writes and reads to memory. Rather
than considering program order, we can now just look at the order the
dependencies impose on instructions. This is referred to as "data-flow
analysis." Assuming each instruction takes exactly one cycle to execute, the
fastest possible execution time of the whole program in cycles is given by the
longest path in the data-flow graph. The instruction-level parallelism of this
program is the number of instructions divided by this duration, as this gives
the average number of instructions we would need to be able to execute each
cycle to achieve this duration. In real programs, this can be anywhere from
around five, to hundreds or even thousands. An active area of research and
innovation for computer architects is to imagine processor designs that can
expose and exploit as much of this parallelism as possible. One insight
architects have had is that superscalar processors need to have a fast supply
of instructions to be able to analyze dependencies effectively. This often
means that the front end of our processor pipeline is much wider than the rest
of the pipeline, so that it can "run ahead" and see what behavior the program
will have next. Fast and accurate branch prediction is vital, as we often have
to predict multiple branches ahead accurately, to achieve good performance.
Another key insight is that we don't have to wait to execute the instructions
in program order. If all the dependencies of an instruction are satisfied, the
instruction can proceed down the pipeline even if previous instructions are yet
to execute. This can reduce program execution time by taking advantage of more
instruction-level parallelism. In practice though, this creates extra
complications, as we will see in the next module. [music]
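
The data-flow analysis can be sketched in a few lines of Python. This version
tracks only register dependencies and assumes one cycle per instruction and
perfect branch prediction, as in the discussion above; memory-carried
dependencies are ignored for simplicity, and the example program is invented.

```python
# Sketch of data-flow analysis: with perfect branch prediction and unlimited
# resources, execution time is the longest chain of true dependencies, and
# ILP = instruction count / that chain length.
def ilp(program):
    """program is a list of (destination, sources) register names."""
    ready = {}         # register -> cycle at which its value becomes available
    longest_path = 0
    for dest, srcs in program:
        start = max((ready.get(s, 0) for s in srcs), default=0)
        finish = start + 1
        ready[dest] = finish
        longest_path = max(longest_path, finish)
    return len(program) / longest_path

# A hypothetical fragment: two independent two-instruction chains.
program = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r3"])]
print(ilp(program))    # 4 instructions / 2-cycle critical path = 2.0
```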

Module 5, Lab
[music] In this exercise, we will be using a simulator to explore superscalar
microprocessor design. The simulator has a number of parameters that we can
configure, such as the number of pipeline stages, the width of the Fetch stage,
the width of the issue, and the number of ALUs in the design. We can see a
diagram of the processor that we've created and we can also see a number of
parameters about that design, for example the clock frequency and the overall
area of the design. When we press step, the simulator will advance one clock
cycle and so for example here we can see that the fetch stage has fetched the
first 4 instructions. However, immediately we see one of the problems with
designs such as this, which is that three of the four instructions that have
been fetched are in fact useless because the first instruction was an
unconditional taken branch and therefore the remaining three instructions will
not be executed by the program and so these will immediately be discarded on
the next cycle. Pressing the run button allows us to simulate, and we can use
the fast-forward option to simulate much more quickly in order to get to the end of
the long program execution. In this case we can see that our design achieved an
average cycles per instruction less than one, which is to say that we on
average executed more than one instruction per cycle, which means that our
superscalar design has fundamentally worked, and we can see that, for example,
the overall execution time of the program is 1.6 milliseconds. In a table
below, we can also see a record of our previous simulation runs. So let's say
for example, we were curious about the effect of increasing the issue width by
one. We can make that change and then press run again in order to run our new
design, and when it finishes we can scroll down to take a look and we can see
that the program execution time has indeed improved down to 1.51 milliseconds
at a cost of only 1% area. So it looks like this was a very good improvement to
our design and it's almost surely going to be a beneficial trade-off in
practice. In the exercise you will be invited to configure a number of
different superscalar processor designs with various targets in terms of clock
frequency, design area and execution time. Once you're happy that you've
configured the processor that you think completes the exercise, you can scroll
all the way down to the bottom, where you'll see the submit button that you can
press to have your answer checked. Good luck. [music]

Module 6, Video 1
Introduction:

So hi, my name is Peter Greenhalgh. I'm Senior Vice President of Technology and
an Arm fellow. I'm responsible for the Central Technology Group at Arm. We're
about 250 people. We work on everything from machine learning to CPU, GPU,
system IP, and the solutions that we create as well. And we basically path-find
future technology at the product level that goes into all of our products and
the IP that we produce. Arm is known for the power efficiency of its
microprocessors. How have you managed to keep a focus on power when building
processors with very complex and power-hungry features? We've got some really
great design teams. In fact, we churn out I think more CPUs than pretty much
anyone else does on the planet. I think we're producing something like four or
five CPUs per year. So we've got a lot of experience in designing for power
efficiency and performance, and in fact we can leverage the understanding that
we have all the way down to microcontrollers through to the smaller A-class
processors, all the way up to the high performance. There's a lot of sharing
between the teams in terms of strong knowledge and capability, and insight into
how to design for both performance and power efficiency. More specifically, I
mean, ultimately, you have a performance goal that you need to achieve, and
then as part of that you have to figure out how to get the best possible power
out of the design when you're achieving that performance goal. And to do that,
there's kind of some different ways of looking at it. There's the really
detail-orientated work that you need to do around things like clock gating,
data gating, all the things to try and stop unnecessary power use deep within
the microarchitecture when the instructions are flowing through the pipeline or
data's moving through the pipeline. And then there's essentially the structure
of the design that you've created. And that then dictates fundamentally what
the power of the design is going to be. You can't fix a design that's got bad
structure with improved clock gating, data gating, and just good low-level
design. You have to marry the two together. And that high-level work that you
do is around making sure that the pipeline is well balanced, that you aren't
opening up the pipeline, going too wide too soon; you're extracting data,
you're extracting information as late as you possibly can and just when you
need it, and not just pipelining it down through the design for the sake of it;
and then, fundamentally, good microarchitecture around branch prediction, which
stops you putting things down through the pipeline that you're just ultimately
going to flush; good pre-fetching on the data side so that you make sure you
get the data in the design when you need it, and you're not sitting around
waiting for it. So you have to marry that altogether, and we've got a lot of
great techniques in order to achieve that, which fundamentally, I say, you need
to achieve the performance target, and then everything else comes together to
achieve that performance target in the best possible energy efficiency. How did
Moore's Law affect computer architectures of the past, and what will its
influence be on future designs? Gordon Moore's influence on the industry has
been massive, and the tenets behind the law still continue today, albeit in a
slightly different form. I mean, I started designing at 0.18 micron, essentially
180 nanometers, and here we are today working on 3 nanometers. So it's a vast
difference now compared to when I started 22 years ago. And there's no way we
could have got to where we are today without the process scaling from all of
the foundries out there and all the companies that provide the foundry
technology. So it's a little bit like magic, all of the work that they do. I
can't say I understand it in detail, but it's incredible technology, and that
allows us... If it hadn't happened, we'd still be stuck in all the designs
which were fairly simple. There's no way that we'd have got to the sort of
designs that we have today of massively out of order, deep, deep amount of
instruction and depth, very, very wide designs. All of that has been made
possible by the steady improvement, a predictable improvement of the foundries.
And that's kind of one of the key points which really was captured by Moore's
law or is captured by Moore's law of that kind of predictable knowledge that
you will get an improvement in the process. Is it 10%, is it 15? Is it 20% on,
say, power, for example? It kind of doesn't matter in a way because you can
work with what you eventually get. You can do things like voltage scaling to be
able to make use of the power that's available to you. Is it 5%? Is it 10% on
frequency? Again, it kind of doesn't matter in a way. But what matters is when
we start designing a processor today and we finish it in 18 months time, and
then two years after that it arrives in the product in the shops that consumers
can buy. We know that over that period, the process improvements have happened,
which allows us to liberate essentially more performance, more energy
efficiency from the design. And we don't mind too much if it takes another
three months or six months to get to the process. We don't mind too much if the
performance or power is not exactly where it was predicted at the start. But,
ultimately, we know we'll get an improvement, and we know there'll be an
improvement in two years, and three years, and four years, and Moore's Law may
have slowed, but it's certainly not stopped.

[music] As we saw in the last module, instruction level parallelism can be used
to improve program execution time in our microprocessor designs. To enable
this, the compiler creates an optimized instruction schedule when the program
is converted into machine code. Unfortunately, the compiler cannot know
precisely what will happen at run-time, so an in-order design is constrained by the
order of instructions in the program. The compiler won't know what the
program's input data will be, whether branches will be mispredicted, or whether
memory accesses hit or miss in our data cache. In contrast, a superscalar
processor with "out-of-order" execution can produce an instruction schedule at
run-time, only constrained by true data dependencies and its hardware limits.
This schedule is produced on demand and so can even change each time the code
runs. To do this, we introduce an "issue window" or "issue queue" after the
Decode stage. This holds instructions until they can be executed, not
necessarily in the order they arrived in. Within this window, instructions can
be issued whenever their dependencies are available, and when a functional unit
is available to process it. To be able to detect when an instruction is ready
to be issued, we must know whether the instruction's dependencies are ready
when it enters the issue window. We must then update this status as new results
are produced. To implement this, the names of result registers of executed
instructions are broadcast to the issue window. The instructions waiting there
compare the register names to the registers they require. However, this scheme
has a problem: A register will be written multiple times in the program, and
since the instructions are executed out-of-order, the register name alone is
not sufficient to record dependencies. It also means that instructions would
have to wait until all previous reads of a register had finished before
executing. These are called "false dependencies." These problems can be
resolved by "renaming" register names at run-time so that each "in-flight"
instruction writes to a unique destination register. We use a "physical
register file" that is large enough to ensure we don't run out. We keep a
"register mapping table" to store the mapping between architectural, compiler-
assigned registers, and physical registers. Register reads to the same
architectural register are renamed consistently, so that dependencies can be
tracked correctly with physical register names. Physical registers are reused
only when they are no longer used by any instruction currently in-flight or any
entry in the register mapping table. The other big issue with out-of-order
execution is memory dependencies. Load and store instructions can have memory
dependencies because they access the same memory location. To detect this, we
need to compare the computed memory addresses that the instructions access. We
thus split memory operations into two steps: address calculation and memory
access. We issue their address calculation step as soon as the dependencies are
available. Then, the memory access step is placed in a special load-store queue
to be sent to our data cache as soon as possible. We carefully ensure that
operations that access the same address are kept properly ordered, but
independent accesses can be reordered if beneficial. No access can occur until
the addresses of all previous accesses are known. Since memory writes are
irreversible, store instructions must also wait until we are certain that they
will execute. [music]
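
A minimal sketch of register renaming with a mapping table and a free list, as
described above. The register counts are illustrative assumptions, and the
reclaiming of physical registers once they are no longer needed is omitted for
brevity.

```python
# Sketch of register renaming: each new result gets a fresh physical register,
# so "false" write-after-write and write-after-read dependencies disappear.
class Renamer:
    def __init__(self, num_arch_regs=32, num_phys_regs=128):
        # Architectural register i initially maps to physical register i.
        self.map_table = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
        self.free_list = [f"p{i}" for i in range(num_arch_regs, num_phys_regs)]

    def rename(self, dest, srcs):
        # Sources are read through the current mapping...
        phys_srcs = [self.map_table[s] for s in srcs]
        # ...then the destination is given a brand-new physical register.
        phys_dest = self.free_list.pop(0)
        self.map_table[dest] = phys_dest
        return phys_dest, phys_srcs

r = Renamer()
# Two writes to r1: after renaming they target different physical registers,
# so the second no longer has to wait for readers of the first.
print(r.rename("r1", ["r2", "r3"]))   # ('p32', ['p2', 'p3'])
print(r.rename("r1", ["r4", "r5"]))   # ('p33', ['p4', 'p5'])
print(r.rename("r6", ["r1"]))         # ('p34', ['p33']) - reads the newest r1
```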

Module 6, Video 2
[music] In the previous video, we outlined the concepts of out-of-order
execution, and register renaming. The issue window will be filled with
instructions fetched along the path that our branch predictor believes the
program will take. While we hope our branch predictor will be correct in most
cases, it will sometimes be wrong. How do we handle such cases? A simple
approach is to start by recording the original program order of the
instructions, and then to monitor their progress. We call the structure that
stores the instructions the "reorder buffer." As each instruction executes and
produces a result, we can mark it as done. When the oldest instruction has
completed, we can remove it from the end of the reorder buffer, and the
instruction is said to have "committed." This stream of committed instructions
represents how our program would be executed on a simple in-order pipeline or
by an unpipelined processor. It usefully also provides a point at which we can
process exceptions. For example, if the program divides by zero or attempts to
access memory that does not exist. We also check branch instructions as they
complete in order. If they have been mispredicted, we flush the reorder buffer,
our issue window and any currently executing instructions and start
fetching down the correct path. To preserve correctness, we must also restore
our registers and the register map table to the values they had when we
mispredicted the branch. This can be done with the aid of a second register map
table, updated only when instructions commit in program order. This can simply
be copied to the map table used by our renaming hardware to "rewind time" for
the processor. All the register values we need will be present, as we don't
recycle registers before we know they will not be needed again. In reality,
handling branches in this way is too slow. Processors instead take many copies
of the register map tables and can handle branches as soon as they are
resolved, and we discover they have been mispredicted. They can also
selectively neutralize the in-flight instructions in the datapath that are on
the wrong path, rather than flushing all of these instructions away. [music]
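
A minimal sketch of a reorder buffer, as described above: instructions enter in
program order, are marked done as they complete out of order, and commit only
from the oldest end. Branch-misprediction flushes and exception handling are
left out for brevity.

```python
from collections import deque

class ReorderBuffer:
    """Instructions enter in program order and commit from the oldest end."""

    def __init__(self):
        self.entries = deque()   # each entry: {"name": ..., "done": bool}

    def dispatch(self, name):
        self.entries.append({"name": name, "done": False})

    def mark_done(self, name):
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True

    def commit(self):
        """Commit as many of the oldest completed instructions as possible."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["name"])
        return committed

rob = ReorderBuffer()
for name in ["i1", "i2", "i3"]:
    rob.dispatch(name)
rob.mark_done("i3")      # i3 finishes early, out of order...
print(rob.commit())      # [] ...but cannot commit past the unfinished i1
rob.mark_done("i1")
rob.mark_done("i2")
print(rob.commit())      # ['i1', 'i2', 'i3'] commits in program order
```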

Module 6, Video 3
[music] We can now bring everything together and look at what a typical
pipeline for an out-of-order superscalar processor might look like. The Fetch
stage is aided by an accurate branch predictor as we met in Module 3. It will
fetch a group of instructions on every clock cycle. This group of instructions
will be requested from the instruction cache, and will be from consecutive
memory locations. Branches may reduce the number of useful instructions that
can, in practice, be fetched on each cycle. The Decode stage decodes
multiple instructions in parallel. At this point, modern high-performance
processors may also split complex instructions into simpler operations or
"macro-ops." In some cases, there may also be opportunities to combine simple
instructions into a single operation. The next step on an instruction's journey
is renaming to receive a unique destination register. As we saw in the last
video, this increases opportunities for out-of-order execution. Remember, there
are several times more physical registers in our processor than those available
to the compiler. Instructions are placed in the reorder buffer, and are also
"dispatched" to the Issue stage. They will wait in the window as necessary, and
are ready to be issued once all their operands are available. In the most
complex of today's superscalar processors, there may be hundreds of
instructions buffered in the issue window at the same time. Instructions
finally commit in program order. At this point, any physical registers that are
no longer needed can be added back to the pool of free registers. These are
then assigned during the register renaming step. Once an instruction is issued,
it reads its operands from the physical register file. The Execute stage
consists of many functional units operating in parallel. These may each support
different operations and take different numbers of cycles to execute. A network
of forwarding paths is also provided to ensure we can execute any dependent
instruction on the next clock cycle after the generation of the result. This
requires being able to quickly communicate—or "forward"— a result from the
output of any functional unit, to the input of any other. Some instructions
will need access to memory. After computing their addresses, they are placed in
the processor's load-store queues. "Stores" are sent to memory in program
order, but "loads" can often be sent out of order, and ahead of other older
stores or loads that are not yet ready to be issued to memory. The memory
system reduces the average memory access time by providing numerous levels of
cache memory. After generating results, we write them back to the register
file. This overview is representative of the fastest modern microprocessors
found today in laptops, smartphones and servers. Whilst much extra innovation
goes into real designs, they generally follow the ideas discussed in the
course. [music]

Module 6, Video 4
[music] One question computer architects always ask themselves is: "how much
can we scale up our design?" Let's take a look at some further potential
optimizations to our out-of-order superscalar processor. We could try to make
it wider. For example, by doubling the number of parallel instructions, we can
fetch, decode and execute more instructions per cycle. Would this double our
performance? Sadly, no, things are not that simple! In practice, some
components quickly become very complex, and performance gains may be hard to
extract. For example, today's largest machines fetch at most ten instructions
per cycle from their instruction caches. Fetching more instructions than this
offers minimal performance gain, despite a large hardware cost. If we increase
the number of registers, the size of our issue window, the size of our load-
store queues, or perhaps use a larger and more accurate branch predictor, our
processor's performance will only improve slightly despite a significant
increase in the size of these structures. After a point, the increase in
performance is no longer worth the cost of the extra transistors. It's also
possible that performance might reduce overall as we may need to lower our
clock frequency as the structures get larger. Finally, we could introduce more
pipeline stages, but we know this doesn't necessarily lead to higher
performance, as mispredictions may become more costly. The combination of these
issues means that extracting performance using instruction-level parallelism
alone becomes more expensive as more performance is sought. This graph shows
how the energy cost of executing an instruction grows quickly as we try to
build higher performance processors. Let's look at some example designs.
Suppose we have a core, which requires a certain area. If we double its area,
its performance improves, although there is a small rise in energy per
instruction. If we quadruple its area instead, its performance has now doubled
compared to our original core, while energy has increased by 50 percent. Going
further, if we increase our processor's area by a factor of 10, performance is
only 2.5 times our original core, but energy per instruction is now 3
times higher. Its performance does not improve as fast as the cost of running
it! Of course, engineers are clever and determined, and are constantly
developing new techniques to bypass many of these issues. This means the
performance of processors—even ones running a single thread or program— still
improves by around 10 to 25 percent each year. Nevertheless, ultimately we
often need more performance than can be provided by instruction-level
parallelism alone. A modern solution is to employ multiple processor cores on
the same chip—called a "multicore" processor. This changes the task for
programmers; they may need to redesign their programs to take advantage of such
parallelism, but if they can, it can give vast performance benefits. As we've
learned throughout the course, every decision involves trade-offs and
compromise. We are faced with a fascinating but often highly-constrained design
problem. We've seen how performance bottlenecks, that at first seem impassable,
can be overcome with innovative designs. What might the future hold for
microprocessors? Can you think of ideas? What would you design? [music]
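
Tabulating the example design points quoted in this video, relative to the
original core, makes the diminishing returns explicit. The performance-per-area
column is simply derived from those quoted figures.

```python
# The example design points quoted above, relative to the original core.
designs = [
    # (relative area, relative performance, relative energy per instruction)
    (1,  1.0, 1.0),
    (4,  2.0, 1.5),
    (10, 2.5, 3.0),
]

for area, perf, energy in designs:
    perf_per_area = perf / area
    print(f"area x{area}: performance x{perf}, "
          f"perf/area {perf_per_area:.2f}, energy/instruction x{energy}")
```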

Module 6, Lab
[music] In this exercise, we'll be using a simulator to explore an out-of-order
superscalar processor design. The simulator allows us to configure a number of
aspects of our processor, for example, the number of pipeline stages, the width
of the Fetch stage, the size of the issue window, the number of ALUs, and the
size of our re-order buffer. The changes will be reflected in the pipeline
diagram, which we can see below, and also in the statistics beneath it, where
key design metrics, such as the clock frequency, clock period, and design size,
are visible. When we press "step," the simulation will advance by 1 clock
cycle, and so we can see, for example, that on the first clock cycle the first
four instructions are fetched, although three of them are unusable because they
follow a taken branch and therefore these will not be executed and will be
discarded on the next clock cycle. We can press "run" to watch our design in
action and in order to quickly get to the end, we can use the "fast forward"
feature to simulate the millions of instructions in this particular program.
After the simulation is complete, we can check below to see a number of
statistics about our pipeline, which are useful for understanding why the
performance is as it is, and in particular we can see the program execution
time. 1.18 milliseconds gives us the overall execution time of the program that
our design achieved, and notably our average instructions per cycle gives us
the number of instructions that we were able to complete on each clock cycle,
in this case well over 1, indicating that we are taking advantage of
instruction level parallelism in this simulation. Below all that, we can see
our completed simulations and we can check back on these as we explore new
designs. So let's say, for example, we were curious about the effect of
increasing the re-order buffer size. We could change that and then press "run"
to quickly run our next experiment. And then if we scroll down to our table, we
can compare the results and see that actually the program execution time was
substantially reduced by increasing the size of the re-order buffer.
Although, admittedly, this did come at a not insignificant increase in the area
of our design, and so whether or not this represents a good trade-off in
practice would very much depend on the problem we're trying to solve. In the
exercises, you'll be given a number of scenarios for processors, which
generally revolve around certain constraints on the design size, or the clock
frequency, or the target program execution time. And once you think you've
found a design that meets the goals required, configure it up using the
settings and then scroll down to the "submit" button at the bottom, and click
that to have your answer checked. Best of luck. [music]
