Computer Architecture Essentials by Arm

Module 1, Intro
[music] I'm Richard Grisenthwaite, Chief Architect at Arm. What does a Chief Architect
do? The Chief Architect is responsible for the evolution of the Arm architecture. One
of the things about an architecture is, you want your software to continue running on
future products. So it's a very long term commitment, and the role of the chief
architect is essentially to curate the architecture over time, adding new features as
needed by the partner. Why are microprocessors so important and ubiquitous? I think
the fundamental reason is that they are a general purpose device. They describe a
basic instruction set that does the stuff that you can construct pretty much any
program out of. Indeed, they're Turing complete, which means, by definition, you can
solve any computable problem with the processor.
They're not customized to any particular usage, but they do two things really rather
well, which are data processing and decision making. In other words, having, for
example, added two numbers and compared them, you're then taking a branch and making a
decision based off that. Now in reality, an awful lot of general purpose problems are
ones that involve: work out some value or some criteria, compare against that, and
then make a decision on it. And essentially that allows you to solve a huge number of
different problems and that's why the microprocessor has become so ubiquitous. What is
the difference between architecture and microarchitecture? Architecture is what it has
to do, and microarchitecture is how it does it. So for example, we will define in the
architecture a set of instructions that do "add" or "load" or "branch", but it says
nothing about whether you've got a ten-stage pipeline or a three-stage pipeline. It
says nothing about branch prediction. All of those sort of features which actually
determine the performance of the device; all of those are part of the
microarchitecture. The key point is that software that is written to be compliant with
the architecture will actually run on lots of different implementations, lots of
different microarchitectures that essentially implement that architecture in different
ways, choosing a trade-off between the power, the performance and the area of the
design. What does it take to build a commercially successful processor today? If you
actually start from scratch with an established architecture, but you want to create a
new microarchitecture, we reckon, for a high-end processor, you're talking about
three-to-four hundred person-years' worth of work in order to create a properly
competitive multi-issue out-of-order machine compared with the state-of-the-art that
you can get from Arm. In addition to that — and that's just to come up with the RTL —
you've then got to do the implementation. If you're going to
build that on a three-nanometer process, the leading edge to get the best possible
performance, you're talking about tens of millions of dollars for the mask sets.
There's a whole bunch of software you've got to go and build on top of that and do all
of that work. When we've looked at companies that are interested in actually starting
up, taking an Arm architecture license — say I want to go and build my own business —
we reckon that you need to be prepared to spend getting on for half a billion dollars
before you're actually going to be successful because it takes time. Your first
product is not necessarily going to be fully competitive because it would be slightly
surprising if the first thing that you built was as good as what people have been
building for many years. It takes time to build up that expertise and skills. And so
you're going to see a couple of iterations before even the best teams end up really
being competitive. And so as I say, I think if you went from nothing and wanted to
essentially create something, and that's using the Arm architecture with all of its
existing software and so on, you're talking the best part of half a billion. What is
the Arm business model? The Arm business model fundamentally is the licensing of
semiconductor IP to as wide a range of companies as possible and in as many ways as
possible, in order to maximize the uptake of the IP. When we talk about IP, we're
talking about essentially designs for processors, and we license that either as an
architecture license, where you effectively gain the right to build your own processor
to the specification of the Arm architecture, or an implementation license where we
are licensing, actually, an implementation that is compliant with the Arm architecture
in the form of register-transfer-level code, or RTL. What makes Arm different from its
competitors? So if you go back to where we were when Arm started out in the early
90s, there were many, many different architectures available and they were all kind of
doing the same thing but slightly differently. They would have different instructions,
different instruction sets, and so software needed to be ported. A huge part of Arm's
success actually comes from the fact that we created a business model of licensing the
IP to make it very easy for people to build processors, to incorporate the designs
into their SoCs, into their systems. And that then made it very straightforward for
people to be able to use the Arm architecture. Now what this then meant was, people
said: I will port more software to this because it's more widely available and you get
this positive feed-forward effect whereby more availability of hardware encourages
more software, encourages more hardware, and so on. And essentially that meant that a
lot of people said: there's no point in me having a different architecture. I'm not
getting a valuable difference from doing that. All I've got is, kind of, a needless
change to the software that I need to make. So that was the whole model we adopted at
Arm: let's license our IP to a great number of different players to come up
with different solutions to meet the form factor of a camera or a PDA, or whatever it
was back in the day. Those things made it much more straightforward for people to
incorporate our technology. [music]

Module 1, Video 1
Computer architecture is the study of tools and techniques that help us to design
computers. More precisely, it helps us understand how to meet the needs of particular
markets and applications, using the technology and components that are available. For
example, we might need to produce the chip at the heart of a smartphone using 10
billion transistors and a power budget of less than two watts. How do we achieve the
performance a customer wants? The challenge is a fascinating one, and one that
requires a broad understanding. For example, what target market are we designing for?
What are the characteristics of the applications to be run? How will programming
languages and compilers interact with the microprocessor? How best to craft the narrow
interface between the hardware and software? How to organize the components of our
microprocessor? And how to design a circuit, given the characteristics of individual
transistors and wires? Like many design problems, computer architecture requires many
trade-offs to be made and evaluated. Each design decision will impact trade-offs
between size, performance, power, security, complexity, and cost. Trade-offs must be
re-evaluated regularly, due to advances in fabrication technology, applications, and
computer architecture. Computer architecture must be grounded in quantitative
techniques and experimentation, but the endless number of possible designs means that
the field depends on a high degree of human ingenuity and art. Perhaps surprisingly,
the earliest computers and today's most advanced machines have much in common. They
both execute a stored program constructed from machine instructions. These
instructions perform simple operations such as adding two numbers. Nevertheless,
greater numbers of faster transistors, and the application of a relatively small
number of computer architecture concepts, have enabled us to construct machines that
can perform billions of instructions per second, and shrink these machines to fit in
hand-held battery-powered devices. It is this rapid progress that has supported
breakthroughs in machine learning, drug discovery, climate modeling, and supports our
modern world where computation and storage are almost free. The task of designing a
microprocessor is split into different levels of abstraction: "Architecture;"
"Microarchitecture;" and "Implementation." "Architecture" focuses on the contract
between programmers and hardware. It allows compatible families of microprocessor
products to be built. The ARMv8-A architecture is an example of this. Architecture
includes the "Instruction Set Architecture," or ISA, which defines what instructions
exist. It also defines precisely the behavior of memory and other features needed to
build a complete processor. "Microarchitecture" focuses on the organization and
structure of the major components of a microprocessor. It has to match the rules set
by the architecture. The microarchitecture still has flexibility though; and so the
implementation specifies the circuit detail precisely. This culminates in the exact
circuit design for manufacture. Each of these levels is vital, and each comes with its
own challenges and opportunities.

Module 1, Video 2
At the beginning of the 20th century, "computers" were people employed to perform
calculations. These computers used mechanical calculators to help them perform
arithmetic. They followed instructions to decide what calculation to perform next.
These instructions defined the algorithm or program they were executing. They would
consult paper records to find the inputs to their calculations, and would store their
intermediate results on paper so they could refer back to them later. A modern
electronic computer is organized in a similar way. We will look in more detail at
these components of a microprocessor in the next video, but for now let's look at how
it operates. Microprocessors also follow instructions one by one, and then perform
relevant calculations. This idea of fetching instructions from memory and executing
them is called the "Fetch-Execute Cycle." In the "Fetch" stage, the computer reads the
next instruction from the program. This instruction is encoded in binary as ones and
zeroes, so it must be decoded to understand what it means. This is done in the
"Decode" stage. Once it is clear what to do, we move to the "Execute" phase. This can
involve different tasks such as reading memory, performing a calculation, and storing
a result. Once done, the computer is then ready to begin the cycle again, by fetching
the next instruction. Instructions are normally fetched sequentially in order, but
some special instructions called "branches" can change which instruction will be
executed next. For branches, a calculation determines the next instruction. This can
mean evaluating a condition, or reading a register to determine the next instruction's
location. Branches allow computers to make decisions, and to re-use instruction
sequences for common tasks. A modern computer program, like a web browser, contains
millions of instructions, and computers execute billions of instructions per second,
but they all conceptually follow this "Fetch-Execute Cycle."
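
As a rough illustration of that cycle, here is a minimal Python sketch of a fetch-decode-execute loop. The tuple-based instruction format and the opcode names are invented for this example; a real processor fetches binary encodings and decodes them in hardware.

```python
def run(program, registers):
    pc = 0                                    # program counter: index of the next instruction
    while pc < len(program):
        instr = program[pc]                   # Fetch
        op, *operands = instr                 # Decode (trivially, by unpacking the tuple)
        if op == "add":                       # Execute: arithmetic
            dst, a, b = operands
            registers[dst] = registers[a] + registers[b]
            pc += 1
        elif op == "beq":                     # Execute: branch if the two registers are equal
            a, b, target = operands
            pc = target if registers[a] == registers[b] else pc + 1
        else:
            raise ValueError(f"unknown opcode {op}")
    return registers

print(run([("add", "X3", "X1", "X2"),         # X3 = X1 + X2
           ("beq", "X3", "X3", 3),            # always taken: skip the next instruction
           ("add", "X3", "X3", "X3")],        # never executed
          {"X1": 2, "X2": 3, "X3": 0}))       # X3 ends up holding 5
```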

Module 1, Video 3
Modern microprocessors are circuits built using anywhere from 1,000 to 100 billion
tiny transistors. The key to designing circuits with such huge numbers of parts is to
build them from re-usable blocks. Let's take a look at some of these. Each transistor
is an electrically-controlled switch. When there is too little voltage at the gate,
the switch is off, so the electrical signal cannot propagate from the drain to the
source. However, when there is sufficient voltage at the gate, the switch is on, so
the signal does propagate. When designing processors, we use digital electronics. The
only voltage or current values we consider represent zero and one, enabling us to
build robust circuits from imprecise components, even in the event of electrical noise
or manufacturing imperfections. This means our circuits represent binary numbers,
since they only have two states: zero and one. We use transistors to build
increasingly complex circuits. We can design circuits that can remember binary values,
or select between multiple inputs, or even do arithmetic, like addition. We can then
use those circuits as building blocks in even larger designs. When designing a digital
system, we must keep everything synchronized to control when its behavior occurs. To
do this, we use a "Clock Signal," which is a wire in the circuit whose signal cycles
between zero and one. We measure the rate in Hertz. For example, if you hear a
processor has a clock speed of two gigahertz, it means the clock signal cycles between
zero and one two billion times per second. The maximum speed of the clock signal is determined by the longest, and
therefore slowest, path in the circuit between 2 clocked flip-flops. This is referred
to as the "Critical Path." The signal must have time to propagate all the way along
the critical path before the clock cycle completes. A microprocessor needs to do many
types of arithmetic. For this, we build an "Arithmetic Logic Unit" or "ALU." This
circuit receives 2 numbers as input, as well as an indication of what operation to
perform, for example, addition or subtraction. In addition to logic, a microprocessor
needs memory. Memory is organized as arrays of memory cells that are able to store
many "Words" of data. A specific word, commonly 32 bits in length, can be accessed by
specifying its "Address." Each address is a number that indicates the location in the
memory that should be read or written. Memory cells range from hundreds of bits to
millions of bits in size, but larger ones are slower to access, as signals on their
long internal wires take longer to propagate. For that reason, almost all
microprocessors include at least two types of memory for storing data: a big slow "Data Memory,"
and a small fast memory called a "Register File." In reality, the "Data Memory" may be
implemented using many different sizes of memory, as we'll see in Module 4. As well as
storing data in memory, we also use some memory to store the instructions. We need a
way to keep track of which instruction we will fetch next, so we have a "Program
Counter." This stores the address in Instruction Memory of the next instruction to be
accessed. Since instructions are encoded in binary, we also have "Instruction Decode
Logic" that converts that binary to the various signals needed to control the
microprocessor.
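
To make the clock-speed discussion concrete, here is a small sketch of how the critical path limits the maximum clock frequency. The path delays are invented numbers; in practice they come from analyzing the actual circuit.

```python
# Hypothetical propagation delays, in nanoseconds, of the paths between
# clocked flip-flops in a design.
path_delays_ns = [0.21, 0.35, 0.48, 0.30]

critical_path_ns = max(path_delays_ns)    # the longest, and therefore slowest, path
max_clock_ghz = 1.0 / critical_path_ns    # the clock period must cover the critical path

print(f"critical path = {critical_path_ns} ns")
print(f"maximum clock frequency ~ {max_clock_ghz:.2f} GHz")   # ~2.08 GHz
```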

Module 1, Video 4
We've already seen how a microprocessor is controlled by instructions. But what are
they really? An instruction is a simple command that the microprocessor hardware can
perform directly. We write them as text like this to make them easier to read. But for
the microprocessor we encode them in binary. We use a program called an assembler to
translate between the human text and the binary. In this video we'll be looking
specifically at an Arm instruction set called A64 but other instruction sets follow
similar principles. Arithmetic and logic instructions are the simplest type of
instruction. The first word tells us what operation will be performed such as addition
or multiplication. The values after this tell the processor where to put the result
and where to get the inputs. Values starting with X are addresses in the register
file. Arithmetic instructions read one or two registers and then put the result into a
third register. Branch instructions are used to make decisions and to repeat
instructions. Normally the microprocessor executes instructions in sequential order
but branches change that, and explicitly tell the microprocessor the address of the
instruction to run next. This is done by giving the address of the next instruction in
the instruction memory. Some branches are unconditional, meaning they always occur and
always affect the next instruction address. Other branches are conditional, meaning
the processor will perform a calculation to decide whether to follow the branch or to
continue executing instructions sequentially following the branch. These are preceded
by a comparison instruction to calculate the condition. Loads and stores are the
instructions for accessing the data memory. Loads copy values from memory to the
register file. Stores do the opposite. In both cases, the instruction needs to know
the address in the data memory and the location in the register file to copy between.
For data memory, loads and stores read an address from a base register. They can also
optionally add to this base address by reading another register, or by simply
specifying a number in the instruction itself. Using sequences of instructions we can
build programs. Here is an example of a small program that implements Euclid's
greatest common divisor algorithm. Let's take a look at it working, one instruction at
a time. To start with, the inputs stored in X1 and X2 are compared. If they're equal,
we have found the greatest common divisor, and a conditional branch instruction
moves to instruction 7. If they're not equal, another conditional branch instruction
can be used to determine whether X1 is smaller. Finally, we use an arithmetic
instruction to subtract either X1 or X2 depending on which was larger and then
unconditionally branch back to the start. Here we can see an example of the program
running. One instruction executes at a time. As we step through the program, the
values in the registers X1 and X2 are shown. And we see these are modified each time
we reach instruction 3 or 5. The instructions repeat in a loop, as is common for many
different types of program. Each time round the loop, instructions 0 and 1 check
whether or not to continue the loop. When they detect X1 and X2 have become equal, the
program finishes and the processor moves on to the next task.
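
As a hedged companion to that walkthrough, here is a short Python sketch that mirrors the flow just described: compare X1 and X2, exit when they are equal, otherwise subtract the smaller from the larger and branch back to the start.

```python
def gcd(x1, x2):
    """Behavioral sketch of the Euclid greatest-common-divisor loop described above."""
    while True:                  # the unconditional branch at the end returns here
        if x1 == x2:             # instructions 0 and 1: compare, branch out if equal
            return x1            # the greatest common divisor has been found
        if x1 < x2:              # conditional branch: is X1 the smaller value?
            x2 = x2 - x1         # one of the subtract instructions (instruction 3)
        else:
            x1 = x1 - x2         # the other subtract instruction (instruction 5)

print(gcd(21, 6))                # prints 3
```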

Module 1, Lab
In this exercise, we're going to be using ASim. ASim is a behavioral simulator for a
subset of the Arm AArch64 instruction set. "Behavioral simulator" means that it's not
simulating the real circuit-level details of a microprocessor; it's just simulating the
behavior of each instruction as it runs. Nevertheless, behavioral simulators are vital
tools for computer architects. We use them to check that the designs we've built match
the behavior we intended, and we can also use them to explore new ideas, for
example adding new instructions to the instruction set. We're just going to be using
it though to get familiar with the basics of Arm AArch64. So we can create a new file
which will allow us to type in some Arm AArch64 instructions and when we press
assemble that will cause ASim to begin the simulation of the instructions. If we
scroll down we can see that below we've put a little cheat sheet of Arm AArch64
instructions that you can use as a reference. Do note that these are just examples
though. So for example this first one is mov X1, X2 and the descriptive text says that
this copies the value of register X2 to register X1. But equally that would imply that
if we use mov X3, X4 instead that would copy the value of X4 to the register X3. So we
don't have to just use these exact instructions, we can tweak them as we need to, to
achieve our goals. So going back to the ASim, let's say that we wanted to add the
number two to the number three. In our cheat sheet we could see that there is an 'add'
instruction but it requires that the inputs for the addition are stored in registers.
So first of all, we're going to need to load up two registers with the numbers that we
want to add in this case two and three. So looking at our cheat sheet again we can see
that there's an instruction mov that allows us to do that. So if I write mov X2, #3
this would cause the register X2 to be loaded with the value 3 and similarly I could
write mov X1, #2 and this would cause the register X1 to be loaded with the value
two. Last but not least, we could then do the addition we wanted to do, such as add X3,
X1, X2. This will cause X1 to be added to X2 and the result stored in X3. If I press
assemble now, we will see that the UI changes slightly. Here we can now see the
simulated state of the processor. ASim is only going to, because it's a behavioral
simulator it's only going to show the state before and after each instruction. And so
right now we are before executing the first instruction. That's the yellow
highlighting there and we can also see it in the simulated memory. If I press step we
can now see that that instruction has completed, and we're now before executing the mov
X1, #2. And notably we can see that X2 has been loaded with the number 3, which is exactly
what we would expect. If I press step again, we can see that mov X1, #2 has now
occurred, which has loaded the value 2 into the
register X1. And last but not least, if we step again we see that X3 has become
equal to five, which is exactly what we would expect if register X1 was added to
register X2. So this allows us to get a feel for what it's like for a machine to
execute these instructions. We can reset back to the beginning if we want to watch it
go through again. If we want to do an edit for example adding a new instruction we can
just do that. Press assemble and that will add the new instruction to our program, and
we can see its effects by simulating. Now in the exercise you will be invited to take
an existing program and add 1 new instruction to it at the indicated position. Do note
that when you assemble the exercise program, you'll be taken to a gray bit of code,
which is our testing code. But as you step, you'll see that the simulator flicks
between the testing code and the code that we wrote for the exercise. You can
also click between them using this. Feel free to try to work out what our testing code
does but you can just ignore it if you want to. The point of the exercise is just to
add the instruction in the indicated position. When you're comfortable you've got the
right instruction you can press run to get all the way to the end. And if you really
think that's correct if you scroll down to the bottom of the page you'll see the
submit button which you can use to send in your answer. Best of luck.
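
The three instructions typed into ASim above can also be modeled in a few lines of Python, which is essentially what a behavioral simulator does internally: track the register state after each instruction, with no circuit-level detail.

```python
registers = {"X1": 0, "X2": 0, "X3": 0}

# The little lab program: mov X2, #3 ; mov X1, #2 ; add X3, X1, X2
registers["X2"] = 3                                   # mov X2, #3
registers["X1"] = 2                                   # mov X1, #2
registers["X3"] = registers["X1"] + registers["X2"]   # add X3, X1, X2

print(registers)   # {'X1': 2, 'X2': 3, 'X3': 5}, matching what ASim shows after three steps
```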

Module 2, Intro
[music] Hello, my name is Martin Weidmann. I'm an engineer and product manager with
Arm's Architecture and Technology Group. I look after the A-profile architecture and I
maintain Arm's Interrupt Controller specifications. Computer architecture is sometimes
called a science of trade-offs. Why is everything a trade-off when it comes to
designing a processor? Let's take an example of where we've had to make trade-offs
when developing processor architecture. So the Arm architecture has a feature called
Pointer Authentication. This is often abbreviated to PAC for Pointer Authentication
Code. What this feature is trying to do is protect against a form of attack called ROP
and JOP. These are Return-Oriented and Jump-Oriented Programming, and it's where
an attacker tries to subvert things like the call stack to run legitimate code, but in
ways that weren't expected by the programmer or the compiler. PAC or Pointer
Authentication tries to defend against this kind of attack by using part of an address
to provide an encrypted signature. So we can check the signature and the address match
and if they don't, we can spot an attack in progress. So why is this a trade-off?
Well, because to add security, we want that signature to be as big as possible. The
bigger the signature, the more bits we use for that, the stronger cryptographically
that signature is. The trade-off is: the more bits we use for the signature, the fewer
bits we have available for other things, such as the address. So you can have a big
signature with a small address, but if you want the address to get bigger, then you
get a smaller signature, and that's then cryptographically weaker. So the trade-off we
have to make when designing a technology like that is: What's the right amount of bits
for the signature? What's the strength of cryptography we need from that signature in
order to get the design goal, which is to defeat these attacks and give us more robust
computing? What sort of guiding principles do you use when designing a processor? So
when you're designing a processor, the key thing you have to bear in mind is: What's
it going to be used for? What you don't want to end up with is a very expensive
paperweight. So we need to understand the requirements that the processor has to meet.
We have to understand the design trade-offs we're making and how they work into
meeting those requirements. We also have to consider not just the processor itself,
but how we're going to show that that processor is correct. We'll put as much time, if
not more, into testing and validating the design as we do into designing it. How do
you design a new microprocessor? If you wanted to create a new processor from scratch,
the first thing you're going to have to do is understand the market that that
processor is going to address and to then build a team to design that processor. There
isn't such a thing as one processor for every possible market. The requirements for
something like an embedded microcontroller are going to be very different to what you
want from the processor in your mobile phone, your laptop, or your server. So you need
to understand those requirements as the first step into building a new processor. What
determines the best design for a microprocessor? So when you're designing a processor,
you need to work out what the best design for a given market or application is going
to be. There's no magic formula for this. It's going to depend a lot on what you're
trying to achieve with that processor. You need to understand things like the power
requirements, the performance requirements. Is it going to work in a highly noisy
electrical environment? There's a big difference between the reliability requirements
you need from something like a watch versus a satellite. So you would take those
requirements and you'd work out what the best set of trade-offs is going to be, and
that's an art more than it is a science. How do the underlying technologies contribute
to this best design? A lot of technologies go into our processor. There's the design
of the microarchitecture, the implementation of the processor. There's the silicon
process you're going to use, how you're going to integrate that processor into an SoC
or ASIC. Is it a single die, or is it going to be using chiplets or multiple sockets?
All of those different technologies are going to be factors in how you design the
processor, what trade-offs you make, and what performance and power you get out of the
design once you're done. In reality there may be many different 'best' designs, so how
do you pick one? So when you're designing a processor, what you want is the best
design. But often there isn't "a" best design, there's just different trade-offs. You
have to decide what the best set of trade-offs is for the particular use case you're
going for. And that's also going to depend on: Is this a device that will be off the
shelf, used for lots of different applications—a general purpose processor? Or is this
being designed for one specific use case? Again, there isn't really a magic bullet or
single answer for this type of question. You need to understand how the processor is
going to be used and then use your experience to judge the trade-offs, and what will
give you the best mix of power, performance, area, cost, reliability for your target
use case.
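
As a loose illustration of the Pointer Authentication trade-off described above, the sketch below packs a truncated signature into the upper bits of a 64-bit pointer. This is not Arm's actual PAC algorithm (which uses a dedicated cipher and hardware key registers); the hash, the key handling, and the 48-bit address split are stand-ins chosen only to show that more signature bits leave fewer bits for the address.

```python
import hashlib

POINTER_BITS = 64

def sign_pointer(addr, key, addr_bits=48):
    """Pack a truncated signature into the unused upper bits of a 64-bit pointer."""
    sig_bits = POINTER_BITS - addr_bits               # bigger signature = smaller address
    digest = hashlib.blake2b(addr.to_bytes(8, "little"), key=key).digest()
    sig = int.from_bytes(digest, "little") & ((1 << sig_bits) - 1)
    return (sig << addr_bits) | addr

def authenticate(signed, key, addr_bits=48):
    addr = signed & ((1 << addr_bits) - 1)            # strip the signature bits
    if sign_pointer(addr, key, addr_bits) != signed:  # recompute and compare
        raise ValueError("signature mismatch: possible ROP/JOP attack in progress")
    return addr

key = b"per-process secret key"
p = sign_pointer(0x7F12_3456_7890, key)
print(hex(p), hex(authenticate(p, key)))
```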

Module 2, Video 1
[music] In this module, we're going to explore how to improve the simple
microprocessor design from Module 1 in order to allow it to execute programs more
efficiently. First, let's find out how long a program takes to execute. The time taken
to perform the average instruction is equal to the number of clock cycles taken to
perform an instruction multiplied by the duration of one clock cycle. The time taken
to run our program is found by multiplying the average time to perform an instruction
by the number of instructions in our program. How could we make this faster? One thing
we could try is to reduce the number of instructions in a program. We might be able to
optimize the code, removing unnecessary and repeated work, and selecting instructions to
minimize code size and maximize performance. We could give our microprocessor the
ability to perform more operations in order to help programmers or compilers further
reduce the number of instructions in their program. For example, allowing the loading
of two data values at the same time might allow fewer instructions to be used in the
program. The downside to this approach is that adding more instructions will require
extra circuitry in the processor, and therefore we will likely increase the clock period. If
the extra instructions are rarely used this could even mean an overall decrease in
performance. We see this theme often in computer architecture: trade-offs that we have
to carefully balance. Another approach is to use faster transistors, perhaps
constructed from a more recent fabrication technology. This would reduce the clock
period but may increase costs. The rest of this module focuses on an optimization to
reduce the clock period called pipelining. This is the most important optimization we
use when designing processors. It uses a similar concept to an assembly line in a
factory where work can start on the next item before the previous one finishes. Let's
take a closer look. Imagine that each instruction has to go through four circuits in a
processor. If we attempt to do all of these in one clock cycle this means our clock
period is the latency of all four circuits added together. If we were to pipeline
this, we would add a pipeline register in the middle. This divides the circuit into
two sections called stages. Notice that although each instruction takes a similar
amount of time to travel down the whole pipeline, the pipeline design can execute
nearly twice as many instructions per second. The throughput has doubled. This is
because we can set the clock period much shorter. It's now the maximum latency of the
two stages. We can pipeline into many stages and this allows for much faster execution
of programs. Unfortunately, though, pipelining a real microprocessor design is not
quite as simple because the processor has various feedback signals and loops in the
circuit. In the next video, we'll take a look at the challenges of pipelining in
practice. [music]
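
The performance equation and the pipelining example can be written out as a short calculation. The stage latencies and instruction count below are made up purely for illustration.

```python
# Hypothetical latencies of the four circuits an instruction passes through, in ns.
stage_latency_ns = [0.5, 0.6, 0.5, 0.4]
instructions = 1_000_000
cpi = 1.0                                   # assume one instruction completes per cycle

# Unpipelined: the clock period must cover all four circuits back to back.
period_unpipelined = sum(stage_latency_ns)                   # 2.0 ns
time_unpipelined = instructions * cpi * period_unpipelined   # in ns

# Two-stage pipeline: a register splits the circuit; the period is set by the slower half.
period_two_stage = max(sum(stage_latency_ns[:2]), sum(stage_latency_ns[2:]))   # 1.1 ns
time_two_stage = instructions * cpi * period_two_stage

print(f"unpipelined: {time_unpipelined / 1e6:.2f} ms, two-stage: {time_two_stage / 1e6:.2f} ms")
```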

Module 2, Video 2
[music] In this video, we're going to look at applying the pipeline optimization to a
realistic microprocessor design. In the first module, we met the components of a
microprocessor, so let's look at how these are really connected. This diagram shows
all the connections needed for a real, unpipelined microprocessor. Each clock cycle,
the processor starts by fetching an instruction from the instruction memory. Once the
instruction reaches the decode logic, it is decoded to produce the control signals
necessary to execute it. The exact control signals vary depending on the type of
instruction. For example, arithmetic instructions access the register file and
interact with the ALU. Ultimately, no matter how the instruction was executed, the
last step of each clock cycle is to update the program counter. This is done by the
branch unit. For non-branch instructions, this just means incrementing the program
counter. However, for branch instructions, the branch unit has to do some
calculations. When we apply our pipeline optimization to this design, we face some
challenges. The design has several loops because instructions have dependencies. How
can we break these cycles? The key observation is that not every instruction is the
same. In real programs, branch instructions usually make up less than 20 percent of
the program. For non-branches, the branch unit doesn't actually need to wait for the
ALU before calculating the result. Let's look at how we can use this fact to pipeline
the processor. Once the first instruction shown in yellow reaches the pipeline
register, we're ready to begin fetching the next instruction, shown in blue. The
yellow instruction can be in the execute stage whilst the blue instruction is being
fetched. Once the yellow instruction is finished, the blue instruction is ready to
enter the execute stage and a new green instruction enters the fetch stage. What about
the branches though? Let's imagine this next yellow instruction is a branch. The fetch
stage works normally until the branch unit, but the branch unit cannot proceed.
Consequently, the pipeline stalls. The fetch stage spends a cycle waiting whilst the
execute stage executes the branch. Finally, once the ALU is done, the branch unit can
proceed and the next instruction, in this case blue, can be fetched. Overall, this
means that the processor wasted one cycle stalling due to the branch. Since only 20
percent of instructions are branches, this means that each instruction would require
on average 1.2 cycles. The same idea of stalling the pipeline can be used to create
even longer pipeline designs. This diagram shows a typical five-stage processor
pipeline. In the next video, we'll look at how we can manage or prevent some of the
stalls in a design like this. [music]
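
The 1.2 cycles-per-instruction figure quoted above is just a weighted average, as this small sketch shows.

```python
branch_fraction = 0.20        # roughly 20 percent of instructions are branches
stall_cycles_per_branch = 1   # each branch stalls this two-stage pipeline for one cycle

average_cpi = 1.0 + branch_fraction * stall_cycles_per_branch
print(average_cpi)            # 1.2 cycles per instruction on average
```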

Module 2, Video 3
[music] Instructions within a program may be dependent on each other. That is, one
instruction may produce a value that a subsequent instruction consumes. Data values
may be communicated through registers or memory. The simple program shown has a number
of so-called true data dependencies. This means we must take care to execute these
instructions in order, and make sure results are correctly communicated. Additionally,
the outcomes of branch instructions may affect the path taken through the program, and
consequently, this affects whether an instruction is actually executed. This sort of
dependency is known as a control dependency. In the previous video, we met a realistic
processor pipeline with five stages. Circumstances that prevent an instruction making
progress in our pipeline are known as pipeline hazards. Let's take a look at how
dependencies cause hazards. This program has a true data dependency. The first
instruction writes to register one, which is then read by the second instruction. If
we send this down our pipeline, we see that the second instruction must stall, waiting
for register one to be written, before it can read and proceed. This is a data hazard.
Unfortunately, dependent instructions are common and stalling in this way would
significantly increase the average cycles per instruction. Let's take a closer look at
the hazard though. The ADD instruction is in the execute stage, meaning its result is
being computed. The SUB instruction needs that result to proceed. Rather than waiting
for the ADD to reach the writeback stage, we could add an extra path into our pipeline
to carry the output of one stage to a later instruction, making the result available
straight away. We call this a forwarding path. In this case, the ALU result is
forwarded to the SUB instruction to be used as X1. This small piece of extra circuitry
allows this data hazard to be eliminated completely. Unfortunately, even if we add
forwarding paths everywhere, it's not possible to eliminate all data hazards. For
example, this program has a data hazard due to the load instruction. There are other
types of hazard too. This program contains a control hazard. We cannot be sure which
instruction to fetch until after the branch instruction executes. Consequently, this
program has two stall cycles. We will look in more detail at how control hazards can
be mitigated in the next module. Another class of hazards, called "structural hazards"
occur when two instructions require the same resources simultaneously. For example, if
instructions and data were stored in the same memory, and this could only be accessed
once per cycle, we would have to very frequently stall our pipeline to let these
stages access memory one-by-one. [music]
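
As a sketch of the kind of check a pipeline makes to detect the data hazards described above: does an older instruction still in flight write a register that the instruction now being decoded reads, with no forwarding path able to supply the value in time? The data structures and field names here are illustrative, not taken from any real design.

```python
def needs_stall(decoding, in_flight, forwarding_enabled):
    """Return True if the instruction in decode must stall behind an older instruction."""
    for older in in_flight:
        if older["writes"] and older["writes"] in decoding["reads"]:
            # Loads produce their data too late for forwarding to cover the gap.
            if not forwarding_enabled or older["is_load"]:
                return True
    return False

add_x1 = {"writes": "X1", "reads": {"X2", "X3"}, "is_load": False}   # ADD X1, X2, X3
sub_x4 = {"writes": "X4", "reads": {"X1", "X5"}, "is_load": False}   # SUB X4, X1, X5

print(needs_stall(sub_x4, [add_x1], forwarding_enabled=False))   # True: wait for X1
print(needs_stall(sub_x4, [add_x1], forwarding_enabled=True))    # False: X1 is forwarded
```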
Module 2, Video 4
[music] In the previous videos, we explored how pipelining could improve performance
by reducing our clock period and by overlapping the execution of different
instructions. We also saw that it was sometimes necessary to stall our pipeline to
ensure that instructions were executed correctly. Ideally, our average cycles per
instruction, or CPI, will remain at 1.0. If we must stall, however, this will
increase. For example, if 20 percent of our instructions were loads and each of these
caused one stall cycle, our CPI would be 1.2. If a further 20 percent of instructions
were branches, and each of these caused two stall cycles, our CPI would be 1.6. The
longer we make our pipeline, the more stall cycles there will be, and eventually the
cost of stalls may outweigh the benefit of the faster clock period. For example, let's
imagine we added a stage to our five-stage pipeline from the previous video. Now the
number of stalls after a branch instruction increases to three, hurting our CPI. On
the other hand, our clock period would improve. So whether or not this helps speed
program execution would depend on the exact details. It may eventually become more
difficult to reduce our clock period by adding further pipelining stages. This is
because it becomes harder to perfectly balance the logic between stages and because of
the constant delays associated with clocking and our pipeline registers. To mitigate
these issues, we will need to invest in more transistors and our design will require
more area and power. The deeper our pipeline gets, the greater the investment we need
to make in terms of area and power for the same incremental improvement. Commercial
processors today have anywhere from two to twenty pipeline stages. The faster, more
expensive and power-hungry processors tend to have longer pipelines than the smaller,
cheaper processors in embedded devices. As with many techniques in computer
architecture, eventually it becomes more profitable to invest our time and resources
in an alternative way of improving performance. In later modules, we'll explore how we
can reduce the CPI, even in heavily pipelined processors. [music]
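
Whether a deeper pipeline pays off can be estimated with the same performance equation. The clock periods and stall counts below are assumptions invented to illustrate the trade-off; real numbers depend on the exact design.

```python
instructions = 1_000_000
branch_fraction, load_fraction = 0.20, 0.20

def execution_time_ms(period_ns, branch_stalls, load_stalls):
    cpi = 1.0 + branch_fraction * branch_stalls + load_fraction * load_stalls
    return instructions * cpi * period_ns / 1e6

# Five-stage pipeline: two stall cycles per branch, one per load, as in the videos.
five_stage = execution_time_ms(period_ns=1.0, branch_stalls=2, load_stalls=1)    # CPI 1.6
# Hypothetical six-stage pipeline: three stall cycles per branch, slightly faster clock.
six_stage = execution_time_ms(period_ns=0.85, branch_stalls=3, load_stalls=1)    # CPI 1.8

# With these made-up numbers the deeper pipeline wins narrowly; with a smaller clock
# improvement it would lose.
print(f"five-stage: {five_stage:.2f} ms, six-stage: {six_stage:.2f} ms")
```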

Module 2, Lab
[music] In this exercise, we're going to be using a model of a processor pipeline to
explore the effect of the pipelining optimization. Computer architects use models like
this to make high level decisions early on about what parameters they will use for a
processor and using a model such as this saves the burden of actually building the
processor to find out its performance. The model is not simulating accurately the
performance of the processor but rather it's giving us an idea for what performance we
might expect. So what can we do with this model? Well, we can configure the number of
pipeline stages, which we can see affects the diagram. And we can also turn on or off
the forwarding optimization. As we change these numbers notice that the design
parameters change down here. So for example, the clock frequency is improved by
increasing the number of pipeline stages but the design area will get bigger. And so
this may be a consideration depending on the problem. We can also choose which of two
programs we're going to put down our pipeline. When we press the step forward button
the pipeline advances to the next clock cycle and we can see the instructions have
started to flow down our pipeline and interesting events such as forwarding will be
noted in the simulation. Sometimes the simulation will detect that there will be a
stall. For example, in this case, we can see that there is a data hazard because the
instruction in the red memory stage writes to register X13, which is read by the
instruction in the yellow decode stage, and therefore a stall cycle is necessary in
order to allow the result to be generated. If we press the play button, the simulation
will proceed automatically and we can see various stall events happening as the
simulation proceeds. But notice that the program we're simulating is nearly
1,000,000 cycles long so watching it play out at this speed is going to take quite a
while. So we can use the fast forward slider to simulate much, much faster. Notice
that the statistics down the bottom have updated depending on the results of the
simulation, and at this point we can see that the program is finished and the
simulation of the program, the simulated program took 3.98 milliseconds to execute. We
can also see that just below, the results of past simulations are stored in little
tables so we can easily refer back to them when we're doing later experiments. So as
an experiment, let's imagine what would happen if we disabled the forwarding
optimization but change nothing else and we'll just run this program through. What we
can see immediately is that the design size is slightly better, which is what we would
expect. It's got 1% better in fact in this case because of the lack of the forwarding
wires. But now that the program is finished, we can see that the program execution
time is a lot worse. 6.34 milliseconds is about 50% worse. So again, looking in our
table, we can compare the execution times and the area, and we can see that in most
cases the forwarding optimization would be a big win here, because at the cost
of an increase in the area of about 1%, we've had an improvement in execution time of
about 50%, which is likely to be a good trade-off, but not always. It would depend on
the exact scenario. Perhaps that 1% area is more important than the performance of
this program. In the exercise, you'll be invited to suggest a design, choosing
the number of pipeline stages and whether forwarding is enabled, that will meet certain
criteria. You can play about and do as many simulations as you wish to figure out what
the best design might be. Once you've got it set up, select the processor that you're
happy with at the top and then scroll down to the submit button and press that. Good
luck. [music]

Module 3, Intro
[music] Hi, I'm Nigel Stevens. I'm Lead Instruction Set Architect for the Arm
A-profile architecture. I've been at Arm for about 14 years and I have responsibility
for the Arm V8-A instruction set including recent developments such as the Scalable
Vector Extension and Scalable Matrix Extension. What do we mean by the instruction set
architecture? The instruction set architecture, primarily, most people think of, I
guess, as the opcodes, the encodings of instructions that are executed by an Arm-based
processor. But it also includes other aspects as well such as the exception model,
system programming features, memory management and suchlike. The architecture for Arm
is rather distinct from what other companies may call an architecture. For Arm,
architecture is a legal contract, if you will, between hardware and software. If
software uses only those instruction opcodes and features of the ISA that are
described by the Arm architecture to perform its work, and the hardware that it's
running on implements all of those op codes and features exactly as defined by the
architecture, then any architecturally compliant software will run on any
architecturally compliant hardware that implements that Arm architecture. And that
doesn't mean just processors from Arm itself, but also processors that are designed by
our partners and which we have validated are conformant with our architecture. How do
you decide which instructions to include in an ISA? When we are looking at requests
from partners or from internal research to add a new instruction, we go through quite
a long process of trying to justify that instruction, or, quite commonly, a set of
instructions rather than a single instruction. We have to show that it gives us some
real benefit in performance, the performance of your code running on that CPU. Or
maybe not performance. Maybe it's security you're trying to achieve. But it has to
give you some really concrete benefit that is worth the cost of adding all of the
software, the validation software, the implementation costs for all of the different
implementations, compiler support, and so on and so forth. It has to satisfy that
cost-benefit analysis. What is the difference between an ISA and a
microarchitecture? The difference between an instruction set architecture, or ISA, and
the microarchitecture is that the ISA is an abstract concept. It defines a set of
instruction encodings which software can use, and which hardware has to recognize and
implement. How that is implemented is a choice for the microarchitecture. So the
instruction set architecture is fixed, it's defined by Arm. The microarchitecture is
defined by whatever team of people is designing that CPU. And there are many different
approaches to implementing the Arm architecture, from very small, efficient cores with
in-order pipelines up to very high-performance, state-of-the-art, out-of-order
execution, and everywhere in between. So the microarchitecture is implementation-
specific, the architecture is generic, and software written for the architecture
should run on any microarchitecture. Why does Arm produce processors with different
instruction sets? Arm supports multiple instruction sets. Some of that is to do with
legacy: you can't abandon your legacy software, your legacy ecosystem. So as the
architecture has advanced and we've introduced major new instruction sets, we still
have to continue to support old software. It takes years, maybe 10 years to move the
software ecosystem to a major new ISA. So for example, AArch64, which is the 64-bit
architecture that we introduced with Arm V8, also supported what we called
AArch32, the old 32-bit architecture that was implemented in the Arm V7 architecture
and prior to that, including the Arm and the Thumb instruction sets. And we needed to
do that because, while some software might start to migrate to the 64-bit
architecture, there's still a lot of software on the planet which is going to continue
using the 32-bit architecture, and that has to survive. So that's part of the reason:
it's about legacy. You can't just obsolete the whole world when you introduce a new
architecture, a new instruction set architecture in particular. There are other
reasons as well, which is there are certain instruction sets that are different for
reasons of the ecosystem that they're working with. So if you were to compare, for
example, the A-profile architecture that's designed for application processors that
run rich operating systems with virtual memory, supporting SMP (symmetric
multiprocessing) operating systems running large applications, whatever it may be: web
browsers on your phone or something, or a web server in a server farm somewhere. You
have your R-profile architecture, which is designed for high-performance, real-time
embedded systems. The constraints there are somewhat different. The instruction set is
actually fairly similar to the A-profile, but some of the underpinnings of the
architecture, the system side of the architecture, are simplified in order to give
more consistent and predictable real-time response to things like interrupts or memory
translation and suchlike for real-time systems. And then at the other extreme you have
the M-profile architecture which is designed to be capable of being built in a very
simple, ultra-low power implementation with low code size and again, similar to the R
profile, very predictable real-time performance. So the answer is there are different
instruction sets for reasons of the market that they're trying to address, and then
there are different instruction sets because, well, we have history. [music]

Module 3, Video 1
[music] In the previous module, we explored how pipelining can be used to improve
performance. We also saw how it is sometimes necessary to stall our pipeline to ensure
our program is executed correctly. In a simple pipeline, it will be necessary to stall
the pipeline whenever we encounter a branch instruction. This is because we must wait
until our branch is executed before we can be sure which instruction to fetch next. As
a recap, branches are instructions that change which instruction in the program will
be executed next. There are two types of branches: conditional branches and
unconditional branches. Unconditional branches always change which instruction
executes next, whereas conditional ones may or may not, depending on the computations
in the program. In real programs, between approximately one fifth and one quarter of
all instructions are branches, and the majority of these are conditional. Executing a
branch involves calculating the new address to load into our program counter. This is
the branch's "target address." However, conditional branches have an extra task: we
must first determine whether the branch is taken. If the branch is not taken, we can
effectively ignore the branch and fetch the next instruction as normal. Recall the
processor performance equation from an earlier video. Since we have to wait for
branches to complete before fetching the next instruction, we generate stall cycles.
These increase the average number of "cycles per instruction," which reduces our
microprocessor's performance. The longer our pipeline gets, the longer it is before
each branch is resolved, and the more costly branches become. Can you think of a way
to avoid some of this stalling? One idea is to evaluate branches earlier in the
pipeline, for example in the Decode stage instead of in the Execute stage. This can
indeed help to reduce the number of stalls, but we may still need to stall if the
branch depends on other instructions that haven't been executed yet. Another idea is
to continue fetching instructions in program order, effectively assuming that each
branch is not taken. The number of stalls in the pipeline for a not-taken branch is
zero in this design. On the other hand, if the branch is in fact taken, the subsequent
instructions that we fetched will be incorrect. So, we must remove all instructions
that have been fetched on this incorrect path from our pipeline. This is called
"flushing" the pipeline. Unfortunately, in real programs, branches are taken much more
than not taken. Could we then simply assume instead that all branches will be taken?
Sadly not, no, because then we would also need to know the specific "target address"
immediately, in order to know which instruction to fetch next. It may at first seem
impossible to know this before the instruction is decoded. However, computer
architects have found a way to do exactly this. The next video will look at "dynamic
branch prediction:" the idea of predicting the behavior of the branch instruction
before it has even been fetched. [music]
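
To see how branches affect cycles per instruction under the predict-not-taken scheme just described, here is a small calculation with assumed, purely illustrative numbers.

```python
branch_fraction = 0.22    # roughly a fifth to a quarter of instructions are branches
taken_fraction = 0.6      # assumed: taken branches are more common than not-taken ones
flush_penalty = 2         # assumed cycles lost flushing wrongly fetched instructions

# Predict not-taken: only taken branches cost a pipeline flush.
average_cpi = 1.0 + branch_fraction * taken_fraction * flush_penalty
print(f"average CPI ~ {average_cpi:.2f}")   # about 1.26 with these assumptions
```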

Module 3, Video 2
[music] In this video, we'll explore how to predict the behavior of a branch
instruction. This can sometimes eliminate the cost of branches altogether.
Fundamentally, a branch involves changing the value of the program counter, which is
the address of the next instruction in the program. If we could predict what this
change will be, quickly and accurately, we would have no need to stall. Precisely, we
need to predict that we are fetching a branch, predict it as taken or not taken, and
predict what its target address is. How could we ever make such predictions? What
would we base them on? Well, since instructions are often executed multiple times, we
can accurately make these predictions based on just the instruction address. If we've
previously seen that a particular address contains a branch, and we see that address
again, we can predict whether that branch will be taken, and its target address, based
on its behavior last time. Amazingly, for real programs, simply predicting repeating
behavior is typically around 90 percent accurate. This means we could eliminate stalls
caused by branch instructions 90 percent of the time. Let's apply these insights to
try to build a branch predictor. We will add two extra blocks to the Fetch stage of
the pipeline we met in Module 2. The first will remember information about recently
executed branches. This will include the program counter values of branches and their
target addresses. This memory is called the "Branch Target Buffer" or BTB. The second
block will make predictions about whether an address containing a branch is taken or
not. We simply call this the "branch predictor." In the next video, we will look at
these in more detail. Combining these two gives us all the information we need to
predict the next value of the program counter based solely on the current value of the
program counter. We don't even need to decode the instruction to make this prediction.
Here we can see how a running branch predictor behaves for a sample program. Each
program counter is checked in the BTB to see if it's predicted to be a branch and to
identify its predicted target. We also simultaneously check the branch predictor to
see if the branch is predicted to be taken. Based on these predictions, the branch
unit computes the predicted next program counter. Many cycles after the prediction,
feedback will be given by the rest of the pipeline as to whether or not the prediction
was correct. Whenever the prediction is wrong, we have to flush the pipeline and
update the BTB and branch predictor. The pipeline will then resume fetching from the
correct program counter as computed by the pipeline. The majority of instructions are
not branches, so most of the time the branch unit just advances the program counter to the next sequential instruction.
The BTB contains the instruction address and target address of some recently executed
branches. The exact number of entries in the BTB varies considerably. BTBs in large
modern processors contain many thousands of branches. In operation, the BTB checks the
supplied program counter against its memory to see whether it has a match, and if so, it
returns the target address. Otherwise, it predicts that the instruction is not a
branch. After each branch executes, the BTB is updated with the true target address.
The BTB cannot be arbitrarily large, so it may have to forget an existing branch to
remember a new one. A simple BTB design like this is typically around 90 percent
accurate at predicting target addresses in real programs. [music]
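
A BTB can be sketched as a small, fixed-capacity table keyed by the instruction address. The capacity and the forget-an-old-entry policy below are simplifications for illustration; real BTBs are set-associative hardware structures.

```python
class BranchTargetBuffer:
    """Toy BTB: maps a branch's program counter to its most recent target address."""
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.entries = {}                                # pc -> target address

    def predict(self, pc):
        # Hit: predict the stored target. Miss: predict that this is not a branch.
        return self.entries.get(pc)

    def update(self, pc, target):
        if pc not in self.entries and len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))   # forget an existing branch
        self.entries[pc] = target                        # remember the true target

btb = BranchTargetBuffer()
btb.update(0x1000, 0x2000)            # a branch at address 0x1000 jumped to 0x2000
print(hex(btb.predict(0x1000)))       # 0x2000: predicted from its behavior last time
print(btb.predict(0x1004))            # None: not known to be a branch
```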

Module 3, Video 3
[music] In the previous video, we met the two major components of dynamic branch
prediction, the BTB and the branch predictor. In this video, we'll take a deeper look
at the branch predictor. It predicts whether or not a branch will be taken, based on
the program counter. A simple branch predictor would try to remember what the branch
did last time and predict that the same behavior will repeat. Let's see how such a
predictor might be organized. Remembering a branch prediction for every possible
instruction address would take up far too much memory, so we reuse the same memory for
different branches via a process called "hashing." We hash the address of the branch
to a smaller number. This does unfortunately lead to a problem called "aliasing,"
where two different branches can hash to the same value, but this is rare in practice.
Let's see what happens now, when we execute a simple loop. We see a misprediction when
it encounters our branches for the first time, and another when we exit the loop. The
first case will be dependent on the value in our predictor's memory, and it may be
that we are able to predict the branch correctly the first time we see it. The second
case is hard to avoid, although some more sophisticated branch predictors will learn
how many iterations a loop will make. A common improvement to this simple scheme is to
avoid instantly flipping our prediction just because the branch does something
unexpected once. This can be achieved with a saturating counter, which instead
remembers how many times the branch has been taken recently, versus not taken. The
counter increments when the branch is taken and decrements when not taken. It predicts
"taken" if the majority of recent executions of this branch were taken. When building
higher-performance processors, we often have to discard many instructions every time
we mispredict a branch, so accurate branch prediction is very important. Therefore,
branch prediction is still an active area of research. One of the key ideas used is
correlation between branches. In real programs, a common pattern is, for example, a pair of branches that always have opposite behavior. A branch predictor can take advantage of this by remembering a history of whether recent branches were taken or not taken, and incorporating this history in the hash. Another idea is to combine multiple
different types of predictor in a "tournament predictor." We use a "meta predictor" to
predict which of two branch predictors will do a better job. As you can see, branch
prediction can get quite complicated. [music]
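
As a rough sketch of the simple scheme described in this video, the Python below hashes the branch address into a small table of saturating counters. The table size and counter maximum are assumed values chosen purely for illustration.

    # Illustrative saturating-counter branch predictor (parameters are assumptions).
    class BranchPredictor:
        def __init__(self, table_size=4096, max_count=3):
            self.table_size = table_size                  # many addresses share entries: aliasing
            self.max_count = max_count                    # 3 gives a classic 2-bit counter
            self.counters = [max_count // 2] * table_size

        def _index(self, pc):
            return pc % self.table_size                   # a very simple hash of the branch address

        def predict_taken(self, pc):
            # Predict taken if the majority of recent outcomes for this entry were taken.
            return self.counters[self._index(pc)] > self.max_count // 2

        def update(self, pc, taken):
            i = self._index(pc)
            if taken:
                self.counters[i] = min(self.counters[i] + 1, self.max_count)  # saturate at the top
            else:
                self.counters[i] = max(self.counters[i] - 1, 0)               # saturate at zero

With max_count set to 3, a single surprising outcome nudges the counter but does not instantly flip the prediction, which is exactly the improvement over the "remember last time" scheme described above.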

Module 3, Video 4
[music] In this module, we've met the concept of branch prediction. Even using
relatively small simple circuits, we can accurately predict real branch behavior more
than 95 percent of the time. Could we do better, and do we really need to? In small,
simple processors, these prediction accuracies are fine, because each misprediction
causes only a few stall cycles. Increasing the accuracy is not that impactful.
However, in complex processors with very long pipelines, the difference between 98
percent and 99 percent prediction accuracy can be significant for performance as a
misprediction could incur dozens of stall cycles. Accurate prediction really does
matter. One of the problems we face in making accurate predictors is that they need to
be small enough and fast enough to fit in a microprocessor. We can imagine all sorts
of ways to do accurate branch prediction, but if their circuits were slower than simply executing the branch, they would not be useful. Modern high-performance
processors will often have multiple branch predictors, for example a small fast one
and a slower complex one. The slower one can override the prediction of the fast one
if it thinks it got it wrong, which does incur some stall cycles, but fewer than a
total misprediction. Another problem we face is that some branches are just very hard
to predict. No matter what technique is used, there are some branches in real programs
that are effectively "random". For example, when compressing or decompressing data,
the combination of the underlying algorithm and the input data may provide no clear
patterns to a prediction. No matter how hard we try, some branches will never be
predicted correctly 100 percent of the time. A final problem is that since the
predictors work based on observing the program, there will always be a period of time
when the predictors train on a program to learn its behavior. [music]

Module 3, Lab
[music] In this exercise, we're going to be using a branch predictor simulator to
explore dynamic branch prediction. This simulator will accurately simulate the details
of a branch predictor but it uses a trace of a real program executing on a real
machine to avoid the need to simulate the rest of the processor. Computer architects use techniques such as this to quickly explore one area of processor design, understanding that the accuracy may not be perfect given that we're not simulating the full processor design. The interface allows us to configure details of our branch predictor, for example the maximum value of the saturating counters used in the branch predictor's table, or the hash function. And we can see the impact of changes on the delay of the branch predictor and also the design size, which are two key metrics when designing a branch predictor. Once we've happily configured the design we want, we can press run to simulate a program, and the results of that program's execution will be displayed in the rest of the stats. So, for example, we can see here that the predictor predicted 95.24% of the branches correctly, which is fairly good, and this resulted in an overall execution time for the program of 5.335 milliseconds. Just below, we can see a table that records previous simulation runs as well, so we can do multiple experiments and see which produces the best results. For example, if we were curious about the effects of using a saturating counter with a maximum of three rather than one, we could change that and notice that the design size has substantially increased. When we press run, we notice that the predictor accuracy has also increased: it's gone up to 96.31%, and consequently the execution time of the program has fallen slightly. We can then compare these two designs and see whether or not this represents a good trade-off for our processor. Perhaps the area cost is justified, or perhaps it's too much; it would depend on the exact processor we're trying to design. In the problems, you'll be invited to come up with designs that are suitable for particular constraints, for example constraints on the runtime of the program or the design size. Once you've configured the branch predictor that you think meets the objectives, you can scroll down to the submit button and click that, and you'll be told whether or not your answer meets the objectives. Good luck. [music]

Module 4, Video 1
[music] So far, we've looked at the microprocessor's "datapath"— meaning its execution
units, registers, and control circuitry. We have given less attention to its memory.
We usually implement memory using a different type of chip than the microprocessor,
using a technology called DRAM. It is very dense, allowing us to store lots of data in
a small area. However, one issue with DRAM is its speed. Since the 1980s, processor
performance has increased very rapidly at roughly 55 percent per year, so CPUs of
today are many orders of magnitude faster than those of 40 years ago. In contrast,
memory performance has grown much more modestly. Whilst memories are also much faster
than they were in previous decades, their performance has not kept pace with
processors. This leads to a processor-memory performance gap, with it becoming more
costly to access memory as time goes on. One of the issues that makes these memories
slow is their size. We can make a memory as fast as our microprocessor, if it is very
small. However, this is the opposite of what programmers want. Programmers can use
extra memory to solve more complex problems, or to solve existing problems faster.
This complicates microprocessor design because although we want a large memory to hold
all our data, large memories are slow and this slows down our processor. How do we
overcome this? What if we could get the speed benefit of a small memory alongside the
size benefit of a large memory? One solution is to have a large, slow memory and a
small, fast memory in our system. For this to be useful, we need the small memory to
hold the data that we use most often, so that we often get the speed benefits of
accessing it, and only rarely have to access the slow memory. Whilst there are
different arrangements of these two memories, the configuration that is most often
used is an "on-chip cache memory." The small memory sits inside the processor between
the pipeline and the large memory. We keep all data in the large main memory, and put
copies of often-used data in the small memory, which we call a cache. But we are not
limited to only one cache! Our pipeline reads memory in two places: when fetching the
instructions; and when accessing the data. It makes sense to have two caches here,
each optimized for different purposes. The instruction cache is optimized for fast
reading of instructions at Fetch. The data cache is optimized for reading and writing
data from the memory stage. We will often put a larger "level 2" cache between these
two caches and the main memory. The L2 cache is a "medium-sized" memory: faster and
smaller than main memory, but slower and larger than the two L1 caches. Using a
hierarchy of caches reduces the bandwidth requirements to main memory and the energy
cost of moving data around. [music]

Module 4, Video 2
[music] We previously looked at the need for a cache, which can be used to store
often-used data for fast access by the processor. But which data is used often enough
that it is worth including in the cache? Programs are mostly made of loops. Here's a
simple one that sums values in memory. It displays the two characteristics that we can
exploit to decide what data to put into a cache. Let's look briefly at how it works.
Each time round the loop, there are two loads and one store. The first load is to load
the data we're summing. The other load, and the store, are to update the running sum.
Notice two things here. First, we access the running sum over and over again; each
time round the loop. Second, when we access part of the data in one loop iteration,
we've already accessed its predecessor in the previous iteration, and will access its
successor in the next. Caches exploit two types of "locality" in programs in order to
be effective. The first is temporal locality: if a piece of data is accessed, it is
likely that it will be accessed again in the near future. The running sum has temporal
locality. The second type of locality is spatial locality: If a piece of data is
accessed, then its close neighbors are quite likely to be accessed in the near future.
By close neighbors, we mean data whose memory addresses are not far from each other.
The data accesses have spatial locality. It turns out that most programs exhibit lots
of temporal and spatial locality. We can exploit this to determine what to put in the
cache. Exploiting temporal locality is fairly easy: we simply see which values have
been accessed and place them in the cache. Exploiting spatial locality is also quite
simple: when a piece of data is accessed, place its neighbors in the cache. We'll see
in the next module how this is actually achieved. We use locality to guess what values
will be accessed next and store them in the cache. If we are correct, we get the
benefit of fast memory, but if we are wrong, we must perform a slow access to main
memory— just as we would if the cache were not present. In real programs, we see hit
rates of 90 percent or more in microprocessor caches, resulting in vast performance
improvements. However, it's worth noting that some programs have less locality, and
for those programs, caches offer little performance benefit. [music]
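
The summing loop described in this video might look like the sketch below. The array name and its contents are invented for illustration, and the Python version only shows the access pattern rather than the two loads and one store performed by each iteration of the machine code.

    # Illustrative summation loop (names and data are assumptions).
    data = list(range(1000))   # values stored at consecutive memory addresses

    total = 0                  # the running sum: touched every iteration -> temporal locality
    for i in range(len(data)):
        total += data[i]       # data[i] is a close neighbor of data[i-1] -> spatial locality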

Module 4, Video 3
[music] We've looked at the reasons why we build caches, but how do they actually
work? To the outside world, a cache simply takes an address as input, and either
provides the data that is stored at that location as output, or returns a signal to
say that it doesn't have it. If the data is found, this is called a "cache hit". If
the data is not in the cache, this is called a "cache miss", and it means we must look
for the data in main memory instead. After each miss, we update the contents of the
cache. The fundamental building block of a cache is called the "cache line". It's a
number of data bytes from consecutive addresses in main memory. Cache lines are
typically 32 or 64 bytes long, and a cache typically has an array of hundreds to many
thousands of lines. The line captures spatial locality, because it is larger than the
data read by a single load instruction. When a cache miss occurs, the whole line
containing that data is copied to the cache from main memory, meaning we have nearby
values for future accesses. When a request comes in, we use some bits from that
address to index the line array. Just like in the last module on branch prediction,
this leads to the problem of aliasing again, since selecting only some of the bits to
index into the array is like a hash. This means that many addresses map to the same
line in the cache, but we can only store one of their lines of data. We need to note
down which address's line is currently stored in the data array, in what we call the
"tag array". There is one tag for each cache line. When we access the cache, we access
the tag array with the same index to see if the data we need is present. This design
is called a "direct-mapped cache". Direct-mapped caches work fairly well, but for some
programs, we can be unlucky, with the program accessing two aliasing lines frequently.
We can do something about this by duplicating both the arrays, so that each line of
data can now be stored in one of two places in the cache. When we access the cache, we
look at both arrays and only get a miss if neither of the tags match. This is called a
"2-way set-associative cache", because each line has a set of two places or "ways" it
could reside. The "associativity" of this cache is therefore two. Set-associative
caches introduce a further complication: when we want to add data, where do we put it?
In a 2-way set-associative cache, there are two choices. How we decide which cache
line to evict is called the "replacement policy". There are many different types of
replacement policy. A simple one is just to make a pseudo-random choice from all the
possible cache lines. Another option is to keep track of when each cache line was last
accessed and to evict the one last used furthest in the past. This is called a "least
recently used policy", and takes advantage of temporal locality. It does, however,
mean storing extra information in the tags to track usage. Many other ideas are
possible too. [music]
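
A minimal sketch of the direct-mapped lookup described here is shown below; the line size, line count and names are illustrative assumptions, not values from the course, and for simplicity the sketch returns the whole line rather than the requested bytes within it.

    # Illustrative direct-mapped cache lookup (parameters are assumptions).
    LINE_SIZE = 64          # bytes per cache line: larger than one load, capturing spatial locality
    NUM_LINES = 256         # number of lines in the cache

    tags = [None] * NUM_LINES    # "tag array": records which address's line is stored at each index
    data = [None] * NUM_LINES    # "data array": the cached line contents

    def lookup(address, main_memory):
        line_addr = address // LINE_SIZE      # which line of memory this byte belongs to
        index = line_addr % NUM_LINES         # some address bits index the line array (a hash)
        tag = line_addr // NUM_LINES          # remaining bits identify the exact line

        if tags[index] == tag:                # cache hit
            return data[index]

        # Cache miss: copy the whole line in from main memory, evicting whatever was there.
        tags[index] = tag
        data[index] = main_memory[line_addr * LINE_SIZE : (line_addr + 1) * LINE_SIZE]
        return data[index]

A 2-way set-associative version would keep two tag and data arrays, check both tags on every lookup, and use a replacement policy to choose which of the two ways to evict on a miss.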

Module 4, Video 4
[music] Now that we've seen how caches work, let's see how they affect the performance
of a processor. Recall the processor performance equation, where the processing time
is proportional to the average cycles per instruction. Without a data cache, if 20
percent of instructions are loads, and main memory takes 20 cycles to access, our CPI
figure must be at least 5. However, if we provide a cache that holds the required data
80 percent of the time... ...and only takes 2 cycles to access, our CPI reduces to
2.2, which is a significant improvement! We can isolate the memory terms in this
equation to get the average memory access time —abbreviated to AMAT— which allows us
to compare different cache configurations more easily. Changing the cache
configuration will impact the AMAT. There are many different cache parameters we can
change, such as the size, replacement policy, associativity, whether we put data in
the cache for stores or just for loads, and so on. For example, reducing the size of
the cache will improve the access time for a hit, but will also increase the miss
rate. Let's say that we can halve the access time to 1 with a corresponding halving of
the hit rate. This alters the AMAT to 13, which in this case is worse for performance
overall. It's also useful to look at why an address might miss in the cache. Broadly
speaking, we can divide cache misses into three different categories. Compulsory
misses occur when we attempt to access an address that we have never seen before and
so never had the opportunity to cache it. Capacity misses occur when there is more
data being accessed than the cache could hold, even if we had complete freedom in
where to put each cache block. Conflict misses occur in caches where there are more
addresses hashing to the same index than arrays to hold the data. We can alter our
cache configurations to lower these misses, but as always, there are trade-offs
involved. Compulsory misses can be reduced by increasing the cache block size, to take
advantage of spatial locality. But for a fixed cache size, this reduces the number of
different addresses or cache lines that can be stored. A technique called "pre-
fetching" can also be used to predict the addresses that will soon be accessed, and
bring their data into the cache early. But this increases energy consumption, and may
make the cache perform worse if the predictions are not highly accurate. Capacity
misses can be reduced through increasing the size of the cache. Although, as we saw
before, this impacts the number of cycles taken to determine a hit. Conflict misses
can be reduced through increasing the number of cache blocks in each set, with an
increase in energy consumption as a side effect of this. [music]
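
The figures quoted in this video can be reproduced with a short calculation; the sketch below simply re-derives the CPI and AMAT numbers from the stated assumptions.

    # Worked example: 20% of instructions are loads, main memory takes 20 cycles.
    base_cpi   = 1.0
    load_frac  = 0.20
    mem_cycles = 20

    cpi_no_cache = base_cpi + load_frac * mem_cycles       # 1 + 0.2*20 = 5.0

    # With a cache: 2-cycle access, 80% hit rate, and misses still pay the 20-cycle memory access.
    hit_time, hit_rate = 2, 0.80
    amat = hit_time + (1 - hit_rate) * mem_cycles          # 2 + 0.2*20 = 6 cycles
    cpi_with_cache = base_cpi + load_frac * amat           # 1 + 0.2*6 = 2.2

    # Halving the hit time to 1 cycle, with the hit rate halved to 40%, is worse overall.
    amat_small = 1 + (1 - 0.40) * mem_cycles               # 1 + 0.6*20 = 13 cycles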

Module 4, Lab
[music] In this exercise, we're going to be using a cache memory simulator to explore
the effects of cache memories on processor performance. The simulator accurately
simulates the behavior of the cache memory system, but it's using a model of the rest
of the processor to quickly allow us to simulate the effects of the cache on the
processor without needing to simulate the full processor. We can configure a number of
parameters about our cache, for example, the number of levels in our cache, whether or
not the cache separates instructions and data, or is unified, keeping them both in the
same cache. We can also configure the size, the line size, and the associativity, and
changing these numbers will affect design parameters, such as the access times in the
case of a level one (L1) cache hit or a cache miss, and also the design size. Once we're happy we've found a design we'd like to investigate, we can press "run", at which point the simulator will run through the program. The hit rates for instructions in the level one cache, and also for data in the level one cache, are displayed, and we can also see the average memory access time that results from this.
And then below everything, we can see a table of past simulations so that we can
quickly refer back to our previous experiments when we do new ones. So, for example,
let's say we were curious about the effects of increasing the size of the cache. If we
change the parameter and then press "run", we can immediately see that the design size
has substantially increased, which makes sense because we've doubled the size of the cache's contents and therefore we'd expect roughly double the area. And we can also see that the hit rates have improved.
So there's about a 1% improvement to the L1 instruction cache hits and a 1%
improvement to the L1 data cache hits, which has reduced the overall average memory
access time. And so we can compare these two designs to see which of them we think is
better. It's a trade-off of course, though. The larger design has got better
performance, but it is larger, and so depending on the context, we may need to pick
the smaller design or the bigger design depending on our performance goals. In the
exercises, you'll be invited to come up with a series of designs for caches that meet
certain performance goals. For example, you'll have a constraint on the area of the
design or the execution time of the program, and you need to optimize the cache to
meet those goals. Best of luck. [music]

Module 5, Video 1
[music] In this module, we'll look at how to further improve performance by exploiting
"instruction-level parallelism." In Module 2, we explored how pipelining can improve
the performance of our processor. This reduced our clock period, and allowed execution
of instructions to be overlapped, improving throughput. One way to boost performance
further would be to create a much deeper pipeline. At some point, this would mean even
the ALU in our Execute stage will be pipelined. Consider the simple program shown in
the slide. Some instructions are dependent on a result from the previous instruction.
Remember in our 5-stage pipeline that these dependent instructions could be executed
in consecutive clock cycles with the aid of data forwarding. If execution takes place
over two pipeline stages within the pipeline, we need to stall if adjacent instructions
share a dependency. This allows time for the result to be computed. The programmer may
be able to rewrite their program to get the same result with fewer stalls, by placing
an independent instruction between our pair of dependent instructions. In this case,
we can move the third and fifth instructions earlier to optimize performance. The
performance of programs that run on our "super-pipelined" processor would, to some
degree, be determined by the availability of independent instructions that could be
executed in parallel in the pipeline. This is "instruction-level parallelism"—or ILP.
Very deep pipelines are problematic as they would require: a very high-frequency clock
to be distributed across the chip very precisely; careful balancing of logic between
many, very short, pipeline stages; the pipelining of logic that is difficult to divide
further into stages; and the division of logic at points requiring many pipelining
registers to be inserted. A different approach to exploit ILP is to make our pipeline
wider rather than deeper. In this design, the processor will fetch, decode, and
potentially execute multiple instructions each cycle. Such a design avoids the
problems of a super-pipelined processor, although as we'll see in the next video, it
does introduce some new complications. [music]
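
As a toy illustration of why instruction order matters here, the sketch below counts the stalls caused by back-to-back dependent instructions and shows how moving an independent instruction between them removes a stall. The one-cycle stall per adjacent dependency and the instruction format are assumptions made for this example.

    # Illustrative stall count for a pipeline whose results take an extra cycle to become available,
    # so an instruction that immediately uses the previous result must stall for one cycle.
    # Each instruction: (destination register, set of source registers). Instructions are invented.
    def count_stalls(schedule):
        stalls = 0
        for prev, curr in zip(schedule, schedule[1:]):
            if prev[0] in curr[1]:        # current instruction reads the result just produced
                stalls += 1
        return stalls

    original = [
        ("r1", {"r2", "r3"}),   # r1 = r2 + r3
        ("r4", {"r1", "r5"}),   # r4 = r1 + r5  <- depends on the instruction just before it
        ("r6", {"r7", "r8"}),   # independent work
    ]
    reordered = [original[0], original[2], original[1]]

    print(count_stalls(original))   # 1 stall
    print(count_stalls(reordered))  # 0 stalls: the independent instruction fills the gap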

Module 5, Video 2
[music] In this video, we are going to explore "superscalar" processors, which can
process multiple instructions in each pipeline stage. In our simple 5-stage pipeline,
there is at most one instruction per pipeline stage. At best, we can complete one
instruction per cycle. We call such a design a "scalar" processor. In a 2-way
superscalar version of this processor, we would extend this design so it is able to
fetch, decode, execute and writeback up to two instructions at a time. In general,
superscalar processors may vary the number of instructions that can be processed
together in each stage. Let's step through the design. Our instruction cache will need
to supply two instructions per cycle. Typical superscalar processors only ever fetch
adjacent instructions on a given cycle. This can lower performance if, for example,
the first instruction fetched is a taken branch, because then the second would not be
required. Note that now every cycle lost due to control hazards will cost us two
instructions rather than one, so accurate branch prediction matters even more in
superscalar designs. The Decode stage must now decode and read the registers for two
instructions simultaneously. Fortunately, we are able to extend the register file
design to read many register values at the same time. The Decode stage also needs to
check whether the two instructions are independent. If so, and if the functional units
they both need are available, it can "issue" them for execution in parallel on the
next clock cycle. Otherwise, in this simple design, it will only issue the first, and
keep the second back. A simple design such as this—where two instructions are fetched,
decoded and issued— is called a "2-way" or "dual-issue" processor. In other designs,
the width may vary at different stages of the pipeline. To support the execution of
multiple instructions at the same time, the Execute stage is expanded and contains two
execution pipelines. It's common for these to have slightly different capabilities to
save area. For example, the top pipeline can execute both ALU and memory instructions,
while the second pipeline only executes ALU instructions. To ensure that dependent
instructions can execute on consecutive clock cycles, we must add data forwarding
paths. These data forwarding paths must allow results stored in either execution
pipeline to be forwarded to the input of either ALU. During writeback, we need to
store both results to the register file. This means the register file must be
redesigned to allow two writes per clock cycle. Overall, these changes typically
require 25 percent more logic circuitry in our processor, compared with a scalar
processor. But we'd expect an improvement in execution time of between 25 and 30
percent for real world programs. [music]
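
As a toy illustration of the issue decision described in this video, the sketch below checks whether a second decoded instruction can be issued alongside the first. The instruction representation, and the rule that at most one of the two may be a memory instruction, are simplifying assumptions for this example.

    # Illustrative dual-issue check (instruction format is an assumption).
    # Each instruction: (destination register, set of source registers, kind), kind is "alu" or "mem".
    def can_dual_issue(first, second):
        dest1, _, kind1 = first
        _, srcs2, kind2 = second

        # The second instruction must not read the result the first one is about to produce...
        independent = dest1 not in srcs2
        # ...and in this simplified design only one pipeline handles memory instructions,
        # so two memory instructions cannot be issued together.
        units_free = not (kind1 == "mem" and kind2 == "mem")

        return independent and units_free

    # ADD x1, x2, x3 followed by LDR x4, [x5] can issue together,
    # but ADD x1, x2, x3 followed by SUB x6, x1, x7 cannot (true dependency on x1).
    print(can_dual_issue(("x1", {"x2", "x3"}, "alu"), ("x4", {"x5"}, "mem")))       # True
    print(can_dual_issue(("x1", {"x2", "x3"}, "alu"), ("x6", {"x1", "x7"}, "alu"))) # False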

Module 5, Video 3
[music] We've seen that instruction-level parallelism can be used on superscalar
processors, to run them faster than would be possible on a scalar processor. But how
much of a speedup is this in practice? Ultimately, this depends on how much
instruction-level parallelism is possible in a typical program. How might we measure
this? We can do this initially without considering any constraints that will be
imposed by the processor it will run on. Let's consider the instructions executed by
the program. Let's assume that we can predict all the branches in the program
perfectly. Then we can ignore branch instructions, as they don't need to flow down our
pipeline. Now let's imagine we can execute any instruction as soon as the data it
needs is ready. That is, we are only restricted by the presence of true data
dependencies. Note that some dependencies are carried through writes and reads to
memory. Rather than considering program order, we can now just look at the order the
dependencies impose on instructions. This is referred to as "data-flow analysis."
Assuming each instruction takes exactly one cycle to execute, the fastest possible
execution time of the whole program in cycles is given by the longest path in the
data-flow graph. The instruction-level parallelism of this program is the number of
instructions divided by this duration, as this gives the average number of
instructions we would need to be able to execute each cycle to achieve this duration.
In real programs, this can be anywhere from around five, to hundreds or even
thousands. An active area of research and innovation for computer architects is to
imagine processor designs that can expose and exploit as much of this parallelism as
possible. One insight architects have had is that superscalar processors need to have
a fast supply of instructions to be able to analyze dependencies effectively. This
often means that the front end of our processor pipeline is much wider than the rest
of the pipeline, so that it can "run ahead" and see what behavior the program will
have next. Fast and accurate branch prediction is vital, as we often have to predict
multiple branches ahead accurately, to achieve good performance. Another key insight
is that we don't have to wait to execute the instructions in program order. If all the
dependencies of an instruction are satisfied, the instruction can proceed down the
pipeline even if previous instructions are yet to execute. This can reduce program
execution time by taking advantage of more instruction-level parallelism. In practice
though, this creates extra complications, as we will see in the next module. [music]
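
The data-flow idea can be illustrated with a short sketch that finds the longest dependency chain in a straight-line block and divides the instruction count by it. The example instructions are invented, and a real analysis would also account for dependencies carried through memory.

    # Illustrative data-flow analysis: ILP = instruction count / longest dependency chain.
    # Each instruction: (name, set of instruction names whose results it needs).
    program = [
        ("i1", set()),          # e.g. load a value
        ("i2", set()),          # e.g. load another value
        ("i3", {"i1", "i2"}),   # add the two loaded values
        ("i4", set()),          # an independent load
        ("i5", {"i3", "i4"}),   # add that to the previous sum
    ]

    depth = {}                                  # longest chain (in cycles) ending at each instruction
    for name, deps in program:                  # assumes producers appear before consumers
        depth[name] = 1 + max((depth[d] for d in deps), default=0)

    critical_path = max(depth.values())         # fastest possible execution time in cycles
    ilp = len(program) / critical_path          # average instructions per cycle needed to achieve it
    print(critical_path, ilp)                   # 3 cycles, ILP of about 1.67 for this tiny example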

Module 5, Lab
[music] In this exercise, we will be using a simulator to explore superscalar
microprocessor design. The simulator has a number of parameters that we can configure,
such as the number of pipeline stages, the width of the Fetch stage, the width of the
issue, and the number of ALUs in the design. We can see a diagram of the processor
that we've created and we can also see a number of parameters about that design, for
example the clock frequency and the overall area of the design. When we press step,
the simulator will advance one clock cycle and so for example here we can see that the
fetch stage has fetched the first 4 instructions. However, immediately we see one of
the problems with designs such as this, which is the three of the four instructions
that have been fetched are in fact useless because the first instruction was an
unconditional taken branch and therefore the remaining three instructions will not be
executed by the program and so these will immediately be discarded on the next cycle.
Pressing the run button allows us to simulate and we can use the fast-forward option
to simulate much quicker in order to get to the end of the long program execution. In
this case we can see that our design achieved an average cycles per instruction less
than one, which is to say that we on average executed more than one instruction per
cycle, which means that our superscalar design has fundamentally worked, and we can
see that, for example, the overall execution time of the program is 1.6 milliseconds.
In a table below, we can also see a record of our previous simulation runs. So let's
say for example, we were curious about the effect of increasing the issue width by
one. We can make that change and then press run again in order to run our new design,
and when it finishes we can scroll down to take a look and we can see that the program
execution time has indeed improved down to 1.51 milliseconds at a cost of only 1%
area. So it looks like this was a very good improvement to our design and it's almost
surely going to be a beneficial trade-off in practice. In the exercise you will be
invited to configure a number of different superscalar processor designs with various
targets in terms of clock frequency, design area and execution time. Once you're happy
that you've configured the processor that you think completes the exercise, you can
scroll all the way down to the bottom, where you'll see the submit button that you can
press to have your answer checked. Good luck. [music]

Module 6, Video 1
Introduction:

So hi, my name is Peter Greenhalgh. I'm Senior Vice President of Technology and an Arm
fellow. I'm responsible for the Central Technology Group at Arm. We're about 250
people. We work on everything from machine learning to CPU, GPU, system IP, and the
solutions that we create as well. And we basically path-find future technology at the
product level that goes into all of our products and the IP that we produce. Arm is
known for the power efficiency of its microprocessors. How have you managed to keep a
focus on power when building processors with very complex and power-hungry features?
We've got some really great design teams. In fact, we churn out I think more CPUs than
pretty much anyone else does on the planet. I think we're producing something like
four or five CPUs per year. So we've got a lot of experience in designing for power
efficiency and performance, and in fact we can leverage the understanding that we have
all the way down to microcontrollers through to the smaller A-class processors, all
the way up to the high performance. There's a lot of sharing between the teams in
terms of strong knowledge and capability, and insight into how to design for both
performance and power efficiency. More specifically, I mean, ultimately, you have a
performance goal that you need to achieve, and then as part of that you have to figure
out how to get the best possible power out of the design when you're achieving that
performance goal. And to do that, there's kind of some different ways of looking at
it. There's the really detail-orientated work that you need to do around things,
like clock gating, data gating, all the things to try and stop unnecessary power use
deep within the microarchitecture when the instructions are flowing through the
pipeline or data's moving through the pipeline. And then there's essentially the
structure of the design that you've created. And that then dictates fundamentally what
the power of the design is going to be. You can't fix a design that's got bad
structure with improved clock gating, data gating, and just good low-level design. You
have to marry the two together. And that high-level work that you do is around making
sure that the pipeline is well balanced, that you aren't opening up the pipeline,
going too wide too soon; you're extracting data, you're extracting information as late
as you possibly can and just when you need it, and not just pipelining it down through
the design for the sake of it; and then, fundamentally, good microarchitecture around
branch prediction, which stops you putting things down through the pipeline that
you're just ultimately going to flush; good pre-fetching on the data side so that you
make sure you get the data in the design when you need it, and you're not sitting
around waiting for it. So you have to marry that all together, and we've got a lot of
great techniques in order to achieve that, which fundamentally, I say, you need to
achieve the performance target, and then everything else comes together to achieve
that performance target in the best possible energy efficiency. How did Moore's Law
affect computer architectures of the past, and what will its influence be on future
designs? Gordon Moore's influence on the industry has been massive, and the tenets
behind the law still continue today, albeit in a slightly different form. I mean, I
started designing at 0.18 micron, essentially 180 nanometers, and here we are today
working on 3 nanometers. So it's a vast difference now compared to when I started 22
years ago. And there's no way we could have got to where we are today without the
process scaling from all of the foundries out there and all the companies that provide
the foundry technology. So it's a little bit like magic, all of the work that they do.
I can't say I understand it in detail, but it's incredible technology, and that allows
us... If it hadn't happened, we'd still be stuck in all the designs which were fairly
simple. There's no way that we'd have got to the sort of designs that we have today of
massively out-of-order, very deep in terms of instructions in flight, and very, very wide
designs. All of that has been made possible by the steady improvement, a predictable
improvement of the foundries. And that's kind of one of the key points captured by Moore's Law: that predictable knowledge that you will get an improvement in the process. Is it 10%, is it 15? Is it
20% on, say, power, for example? It kind of doesn't matter in a way because you can
work with what you eventually get. You can do things like voltage scaling to be able
to make use of the power that's available to you. Is it 5%? Is it 10% on frequency?
Again, it kind of doesn't matter in a way. But what matters is when we start designing
a processor today and we finish it in 18 months time, and then two years after that it
arrives in the product in the shops that consumers can buy. We know that over that
period, the process improvements have happened, which allows us to liberate
essentially more performance, more energy efficiency from the design. And we don't
mind too much if it takes another three months or six months to get to the process. We
don't mind too much if the performance or power is not exactly where it was predicted
at the start. But, ultimately, we know we'll get an improvement, and we know there'll
be an improvement in two years, and three years, and four years, and Moore's Law may
have slowed, but it's certainly not stopped.

[music] As we saw in the last module, instruction level parallelism can be used to
improve program execution time in our microprocessor designs. To enable this, the
compiler creates an optimized instruction schedule when the program is converted into
machine code. Unfortunately, the compiler cannot know precisely what will happen at
run-time, so this approach is constrained by the order of instructions in the program.
The compiler won't know what the program's input data will be, whether branches will
be mispredicted, or whether memory accesses hit or miss in our data cache. In
contrast, a superscalar processor with "out-of-order" execution can produce an
instruction schedule at run-time, only constrained by true data dependencies and its
hardware limits. This schedule is produced on demand and so can even change each time
the code runs. To do this, we introduce an "issue window" or "issue queue" after the
Decode stage. This holds instructions until they can be executed, not necessarily in
the order they arrived in. Within this window, instructions can be issued whenever
their dependencies are available, and when a functional unit is available to process
it. To be able to detect when an instruction is ready to be issued, we must know
whether the instruction's dependencies are ready when it enters the issue window. We
must then update this status as new results are produced. To implement this, the names
of result registers of executed instructions are broadcast to the issue window. The
instructions waiting there compare the register names to the registers they require.
However, this scheme has a problem: A register will be written multiple times in the
program, and since the instructions are executed out-of-order, the register name alone
is not sufficient to record dependencies. It also means that instructions would have
to wait until all previous reads of a register had finished before executing. These
are called "false dependencies." These problems can be resolved by "renaming" register
names at run-time so that each "in-flight" instruction writes to a unique destination
register. We use a "physical register file" that is large enough to ensure we don't
run out. We keep a "register mapping table" to store the mapping between
the architectural registers assigned by the compiler and the physical registers. Register reads to
the same architectural register are renamed consistently, so that dependencies can be
tracked correctly with physical register names. Physical registers are reused only
when they are no longer used by any instruction currently in-flight or any entry in
the register mapping table. The other big issue with out-of-order execution is memory
dependencies. Load and store instructions can have memory dependencies because they
access the same memory location. To detect this, we need to compare the computed
memory addresses that the instructions access. We thus split memory operations into
two steps: address calculation and memory access. We issue their address calculation
step as soon as the dependencies are available. Then, the memory access step is placed
in a special load-store queue to be sent to our data cache as soon as possible. We
carefully ensure that operations that access the same address are kept properly
ordered, but independent accesses can be reordered if beneficial. No access can occur
until the addresses of all previous accesses are known. Since memory writes are
irreversible, store instructions must also wait until we are certain that they will
execute. [music]
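
A minimal sketch of the renaming step described here is shown below, assuming a simple free list and an invented instruction format. A real processor would also return physical registers to the free list once no in-flight instruction or map-table entry still refers to them, as noted above.

    # Illustrative register renaming (register counts and instruction format are assumptions).
    free_physical = [f"p{i}" for i in range(64)]   # pool of physical registers
    map_table = {}                                 # architectural register -> physical register

    def rename(dest, sources):
        # Source operands are read through the current mapping, so dependencies are
        # tracked with physical register names.
        renamed_sources = [map_table.get(s, s) for s in sources]
        # Every in-flight instruction writes a fresh physical register, removing the
        # false dependencies caused by reuse of architectural register names.
        new_dest = free_physical.pop(0)
        map_table[dest] = new_dest
        return new_dest, renamed_sources

    # Two writes to x1 no longer conflict: each gets its own physical register.
    print(rename("x1", ["x2", "x3"]))   # ('p0', ['x2', 'x3'])
    print(rename("x4", ["x1"]))         # ('p1', ['p0'])  -- reads the first write's register
    print(rename("x1", ["x5"]))         # ('p2', ['x5'])  -- a new, independent destination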

Module 6, Video 2
[music] In the previous video, we outlined the concepts of out-of-order execution, and
register renaming. The issue window will be filled with instructions fetched along the
path that our branch predictor believes the program will take. While we hope our
branch predictor will be correct in most cases, it will sometimes be wrong. How do we
handle such cases? A simple approach is to start by recording the original program
order of the instructions, and then to monitor their progress. We call the structure
that stores the instructions the "reorder buffer." As each instruction executes and
produces a result, we can mark it as done. When the oldest instruction has completed,
we can remove it from the end of the reorder buffer, and the instruction is said to
have "committed." This stream of committed instructions represents how our program
would be executed on a simple in-order pipeline or by an unpipelined processor. It
usefully also provides a point at which we can process exceptions. For example, if the
program divides by zero or attempts to access memory that does not exist. We also
check branch instructions as they complete in order. If they have been mispredicted,
we flush the reorder buffer, our instruction window and any currently executing
instructions and start fetching down the correct path. To preserve correctness, we
must also restore our registers and the register map table to the values they had when
we mispredicted the branch. This can be done with the aid of a second register map
table, updated only when instructions commit in program order. This can simply be
copied to the map table used by our renaming hardware to "rewind time" for the
processor. All the register values we need will be present, as we don't recycle
registers before we know they will not be needed again. In reality, handling branches
in this way is too slow. Processors instead take many copies of the register map
tables and can handle branches as soon as they are resolved, and we discover they have
been mispredicted. They can also selectively neutralize the in-flight instructions in
the datapath that are on the wrong path, rather than flushing all of these
instructions away. [music]
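
The in-order commit described in this video can be sketched as follows; the entry format is heavily simplified and invented for illustration.

    # Illustrative reorder buffer: instructions enter in program order and commit in order.
    from collections import deque

    reorder_buffer = deque()                 # oldest instruction at the left

    def dispatch(instruction):
        reorder_buffer.append({"inst": instruction, "done": False})

    def mark_done(instruction):
        for entry in reorder_buffer:
            if entry["inst"] == instruction:
                entry["done"] = True         # executed, possibly out of order, but not yet committed

    def commit():
        committed = []
        # Only the oldest instructions may leave, and only once they have finished, so the
        # committed stream looks like execution on a simple in-order or unpipelined machine.
        while reorder_buffer and reorder_buffer[0]["done"]:
            committed.append(reorder_buffer.popleft()["inst"])
        return committed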

Module 6, Video 3
[music] We can now bring everything together and look at what a typical pipeline for
an out-of-order superscalar processor might look like. The Fetch stage is aided by an
accurate branch predictor as we met in Module 3. It will fetch a group of instructions
on every clock cycle. This group of instructions will be requested from the
instruction cache, and will be from consecutive memory locations. Branches may reduce
the number of useful instructions that can, in practice, be fetched on each cycle.
The Decode stage decodes multiple instructions in parallel. At this point, modern
high-performance processors may also split complex instructions into simpler
operations or "macro-ops." In some cases, there may also be opportunities to combine
simple instructions into a single operation. The next step on an instruction's journey
is renaming to receive a unique destination register. As we saw in the last video,
this increases opportunities for out-of-order execution. Remember, there are several
times more physical registers in our processor than those available to the compiler.
Instructions are placed in the reorder buffer, and are also "dispatched" to the Issue
stage. They will wait in the window as necessary, and are ready to be issued once all
their operands are available. In the most complex of today's superscalar processors,
there may be hundreds of instructions buffered in the issue window at the same time.
Instructions finally commit in program order. At this point, any physical registers
that are no longer needed can be added back to the pool of free registers. These are
then assigned during the register renaming step. Once an instruction is issued, it
reads its operands from the physical register file. The Execute stage consists of many
functional units operating in parallel. These may each support different operations
and take different numbers of cycles to execute. A network of forwarding paths is also
provided to ensure we can execute any dependent instruction on the next clock cycle
after the generation of the result. This requires being able to quickly communicate—or
"forward"— a result from the output of any functional unit, to the input of any other.
Some instructions will need access to memory. After computing their addresses, they
are placed in the processor's load-store queues. "Stores" are sent to memory in
program order, but "loads" can often be sent out of order, and ahead of other older
stores or loads that are not yet ready to be issued to memory. The memory system
reduces the average memory access time by providing numerous levels of cache memory.
After generating results, we write them back to the register file. This overview is
representative of the fastest modern microprocessors found today in laptops,
smartphones and servers. Whilst much extra innovation goes into real designs, they
generally follow the ideas discussed in the course. [music]

Module 6, Video 4
[music] One question computer architects always ask themselves is: "how much can we
scale up our design?" Let's take a look at some further potential optimizations to our
out-of-order superscalar processor. We could try to make it wider. For example, by
doubling the number of parallel instructions, we can fetch, decode and execute more
instructions per cycle. Would this double our performance? Sadly, no, things are not
that simple! In practice, some components quickly become very complex, and performance
gains may be hard to extract. For example, today's largest machines fetch at most ten
instructions per cycle from their instruction caches. Fetching more instructions than
this offers minimal performance gain, despite a large hardware cost. If we increase
the number of registers, the size of our issue window, the size of our load-store
queues, or perhaps use a larger and more accurate branch predictor, our processor's
performance will only improve slightly despite a significant increase in the size of
these structures. After a point, the increase in performance is no longer worth the
cost of the extra transistors. It's also possible that performance might reduce
overall as we may need to lower our clock frequency as the structures get larger.
Finally, we could introduce more pipeline stages, but we know this doesn't necessarily
lead to higher performance, as mispredictions may become more costly. The combination
of these issues means that extracting performance using instruction-level parallelism
alone becomes more expensive as more performance is sought. This graph shows how the
energy cost of executing an instruction grows quickly as we try to build higher
performance processors. Let's look at some example designs. Suppose we have a core,
which requires a certain area. If we double its area, its performance improves,
although there is a small rise in energy per instruction. If we quadruple its area
instead, its performance has now doubled compared to our original core, while energy
has increased by 50 percent. Going further, if we increase our processor's area by a
factor of 10, performance is only 2.5 times our original core, but energy per
instruction is now 3 times higher. Its performance does not improve as fast as the
cost of running it! Of course, engineers are clever and determined, and are constantly
developing new techniques to bypass many of these issues. This means the performance
of processors—even ones running a single thread or program— still improves by around
10 to 25 percent each year. Nevertheless, ultimately we often need more performance
than can be provided by instruction-level parallelism alone. A modern solution is to
employ multiple processor cores on the same chip—called a "multicore" processor. This
changes the task for programmers; they may need to redesign their programs to take
advantage of such parallelism, but if they can, it can give vast performance benefits.
As we've learned throughout the course, every decision involves trade-offs and
compromise. We are faced with a fascinating but often highly-constrained design
problem. We've seen how performance bottlenecks, that at first seem impassable, can be
overcome with innovative designs. What might the future hold for microprocessors? Can
you think of ideas? What would you design? [music]

Module 6, Lab
[music] In this exercise, we'll be using a simulator to explore an out-of-order
superscalar processor design. The simulator allows us to configure a number of aspects
of our processor, for example, the number of pipeline stages, the width of the Fetch
stage, the size of the issue window, the number of ALUs, and the size of our re-order
buffer. The changes will be reflected in the pipeline diagram, which we can see below,
and also in the statistics below that, with key design metrics, such as the clock
frequency, clock period, and design size, visible below. When we press "step," the
simulation will advance by 1 clock cycle, and so we can see, for example, on the first
clock cycle the first four instructions are fetched, although three of them are unusable because they follow a taken branch, and therefore these will not be executed and will
be discarded on the next clock cycle. We can press "run" to watch our design in action
and in order to quickly get to the end, we can use the "fast forward" feature to
simulate the millions of instructions in this particular program. After the simulation
is complete, we can check below to see a number of statistics about our pipeline,
which are useful for understanding why the performance is as it is, and in particular
we can see the program execution time: 1.18 milliseconds is the overall execution time of the program that our design achieved. Notably, our average
instructions per cycle gives us the number of instructions that we were able to
complete on each clock cycle, in this case well over 1, indicating that we are taking
advantage of instruction level parallelism in this simulation. Below all that, we can
see our completed simulations and we can check back on these as we explore new
designs. So let's say, for example, we were curious about the effect of increasing the
re-order buffer size. We could change that and then press "run" to quickly run our
next experiment. And then if we scroll down to our table, we can compare the results
and see that actually the program execution time was substantially reduced by
increasing the size of the re-order buffer. Although, admittedly, this did come at a
not insignificant increase in the area of our design, and so whether or not this
represents a good trade-off in practice would very much depend on the problem we're
trying to solve. In the exercises, you'll be given a number of scenarios for
processors, which generally revolve around certain constraints on the design size, or
the clock frequency, or the target program execution time. And once you think you've
found a design that meets the goals required, configure it up using the settings and
then scroll down to the "submit" button at the bottom, and click that to have your
answer checked. Best of luck. [music]
