Lec 5
Lecture – 05
Architecture of ARM Microcontroller (Part II)
In this lecture, we shall be mainly talking about the pipeline features that are present in
the ARM processor. Because the concept of pipelining may be new to some of you, I
shall be devoting some time to explaining the basic concept of a pipeline. Then I shall very
briefly tell you what the pipeline in the ARM processor looks like, why we use a pipeline,
and what advantages we gain out of it.
(Refer Slide Time: 01:16)
We start with the basic question: what is pipelining? Basically, pipelining is a mechanism
for overlapped execution. When we are carrying out certain tasks, normally we do
a task, complete it, and then start with the next task. In pipelining the concept is that even
before we finish the first task, we start with the second task; even before we finish the
second task, we start the third task. So, different parts of the tasks can be carried out in
parallel in an overlapped fashion.
In the present context we are talking about instruction execution, but pipelining can be
used very effectively in other domains of processing also, during arithmetic operations,
during memory access of a vector of data, etc. But in the present context, since
instruction execution is the only thing we are concerned about, we shall not be going into
too much detail about the other ones.
(Refer Slide Time: 04:18)
Let us first understand the concept. We take a real life example, which has nothing to do
with computers. Suppose we want to wash some clothes. Suppose I have built a machine
M that can do three things one by one: wash, dry and iron. I give the machine a cloth, it
will wash it, dry it, iron it, and output that cloth in the ironed form. Let me assume that
the total time required for the whole thing is T.
Pictorially I show it like this. I have my machine that does washing, drying and ironing,
and the total time taken is T. So, if I have N clothes, then the total time required
will be N x T.
Now, instead of a single very complex machine that can do everything,
let me divide this machine into three smaller pieces: three simple machines, one which
can only wash, one which can only dry, and one which can only iron. Naturally this will not
cost three times as much as the original machine; it will be cheaper. Let us make
another assumption: the total time earlier was T, so let us say the time for each of these
machines is T/3, T/3 and T/3, so that the total time still remains T.
We shall be seeing in the next slide that for processing N clothes the total time
required will be only (2 + N) T / 3. If N is very large, then this 2 can be neglected, and
you can approximate it to NT/3. So, I have got a 3 times speedup, but I have not paid 3
times more; I am using simpler machines. This is the concept of a pipeline.
(Refer Slide Time: 07:11)
Let us look at this diagram in an animated form. The first cloth comes for washing; this
will take T/3 time. After it is finished with washing, this cloth can go for drying.
And when cloth-1 has gone for drying, machine W is free again, so W can start with
cloth-2. In the next step, cloth-1 is coming for ironing, cloth-2 is coming for drying, cloth-3
for washing, and so on. So, you see, after all the machines are filled up, every T/3 time one
cloth will be coming out. Earlier one cloth was coming out every time T. So, I am getting
a 3 times speedup effectively. This is the essential idea behind pipelining.
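To make the washing example concrete, here is a minimal Python sketch that compares the single machine M with the three-machine pipeline. The function names and the choice T = 3 time units are only illustrative assumptions, not something taken from the slides.

```python
from fractions import Fraction

def sequential_time(n_clothes, total_time):
    """Single machine M: every cloth takes the full time T."""
    return n_clothes * total_time

def pipelined_time(n_clothes, total_time, stages=3):
    """Equal stages of T/stages each: the first cloth needs `stages` steps,
    and every later cloth comes out one step after the previous one."""
    if n_clothes == 0:
        return Fraction(0)
    step = Fraction(total_time, stages)
    return (stages - 1 + n_clothes) * step

T = 3  # hypothetical: T = 3 time units, so each stage takes 1 unit
for n in (1, 3, 100):
    seq = sequential_time(n, T)
    pipe = pipelined_time(n, T)
    print(f"N={n:>3}  single machine={seq}  pipeline={pipe}  speedup={float(seq / pipe):.2f}")
```

For a single cloth there is no gain at all; the speedup only approaches 3 as N grows, which is the (2 + N) T / 3 result from the slide.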
In alternative 1 the cost will also go up k times, because you have to buy k copies of this
CPU. Alternative 2 is to split the computation into k stages; here you are not multiplying
the hardware by k, rather you are splitting it into k pieces, resulting in a very nominal
increase in cost.
But in order to do this, you need some buffering. Take the earlier example of clothes. When
the washing of the first cloth is finished, you want to give it for drying. You must keep it
somewhere in between, so that you can accept the next cloth for washing. There must be
a buffer or a tray between the machines; these are called buffering
requirements. For an instruction pipeline also we need some registers or latches in
between the stages, which will temporarily store the result of the previous stage,
to be used by the next stage for processing. If you do not use these registers, then
the previous stage can go and modify this value, so the next stage might carry out some
wrong computation because of that.
Let us carry out a quick calculation of the speedup and efficiency of a pipeline in the
general sense. Earlier we showed the calculation for the washing example. Let us say tau
is the clock period of the pipeline, which means that every tau time data moves from one
stage to the next. And ti is the time delay of the circuit that is there in stage Si. And the
latch delay will be dL.
What will be the maximum stage delay? See, this is t1, this is t2, this is t3. The slowest
stage in the pipeline will determine the maximum speed with which we can shift
the data, because this slowest stage will become the bottleneck. So, max{ti}, let us
call it taum, is the maximum stage delay. And to it we also have to add the latch delay dL. So,
tau = taum + dL will be your clock period.
Now, the pipeline frequency f will be 1 / tau. f will also be the maximum throughput of the
pipeline if you are expecting one result to come out every clock. With this assumption let
us try to make a quick calculation.
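As a small sketch of this relation, the following Python snippet computes the clock period tau = max{ti} + dL and the frequency f = 1 / tau. The stage delays and the latch delay used here are made-up numbers for illustration only.

```python
def clock_period(stage_delays, latch_delay):
    """tau = (slowest stage delay) + (latch delay)."""
    return max(stage_delays) + latch_delay

def pipeline_frequency(stage_delays, latch_delay):
    """f = 1 / tau, in the reciprocal of whatever time unit is used."""
    return 1.0 / clock_period(stage_delays, latch_delay)

# Hypothetical delays in nanoseconds (not taken from the lecture).
stage_delays_ns = [8, 10, 9]   # t1, t2, t3
latch_delay_ns = 1             # dL

tau_ns = clock_period(stage_delays_ns, latch_delay_ns)
f_mhz = 1000.0 * pipeline_frequency(stage_delays_ns, latch_delay_ns)  # 1/ns = 1000 MHz
print(f"tau = {tau_ns} ns, f = {f_mhz:.1f} MHz")
```

Note that making the 8 ns or 9 ns stage faster would not help here; only the slowest stage and the latch delay set the clock.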
(Refer Slide Time: 14:02)
We calculate the total time to process N sets of data. tau is my clock period. Now, k - 1
clocks are required to fill up the pipeline. There are k stages, so I need k - 1 clocks to
reach a state where all the k stages are working on something. After that, every tau time
there will be one new result being generated. So, (k - 1) x tau will be the initial time for
the pipe to fill up, and then N x tau for the outputs to be generated. The total time to
process N data sets will therefore be (k - 1 + N) x tau.
Now, if we have an equivalent non-pipelined processor, and if you ignore the latch delays for
the time being, then the total time can be estimated as N x k x tau. So, in a pipeline the
speedup we are getting is Sk = (N x k x tau) / ((k - 1 + N) x tau) = N k / (k + N - 1).
As N becomes very large this Sk approaches k. So, for a large number of data sets that you
are processing in the pipeline, your speedup will be close to
the number of stages k. This is an important result.
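The two total times and their ratio can be checked numerically with a short Python sketch. The function names and the sample values of N, k and tau below are assumptions chosen only for illustration.

```python
def pipelined_time(n, k, tau):
    """(k - 1) clocks to fill the pipe, then one result per clock."""
    return (k - 1 + n) * tau

def non_pipelined_time(n, k, tau):
    """Every data set takes k * tau, latch delays ignored."""
    return n * k * tau

def speedup(n, k):
    """Sk = N k / (k + N - 1); tau cancels out of the ratio."""
    return (n * k) / (k + n - 1)

k, tau = 4, 1.0  # hypothetical 4-stage pipeline with a 1-unit clock
for n in (1, 4, 1_000_000):
    t1 = non_pipelined_time(n, k, tau)
    tk = pipelined_time(n, k, tau)
    print(f"N={n:>7}  T1={t1:.0f}  Tk={tk:.0f}  Sk={speedup(n, k):.4f}")
```

With N = 1 the speedup is exactly 1, and with a million data sets it is within a rounding error of k = 4.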
(Refer Slide Time: 16:21)
Now, there is another term we define called pipeline efficiency: how close the
performance is to the ideal value. Well, the speedup Sk tends to k, so
that is when the pipeline is operating at maximum efficiency. Dividing
Sk = N k / (k + N - 1) by k, the k cancels out, and I get the factor N / (k + N - 1).
This I can define as the actual pipeline efficiency. For a finite N it will
never be 100%; maybe the pipeline is working at 90% efficiency.
Another term, which of course is not that important in the present context, is called
pipeline throughput: the number of operations completed per unit time. The total number of
operations is N, and the time taken is Tk = (k - 1 + N) x tau. If you divide, you get the
expression N / ((k - 1 + N) x tau); this is the pipeline throughput.
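Both quantities are easy to evaluate. The sketch below is a minimal Python illustration; the function names and the sample values (N = 27, k = 4, tau = 2) are hypothetical and picked only because they give round numbers.

```python
def efficiency(n, k):
    """Speedup divided by its ideal value k: N / (k + N - 1)."""
    return n / (k + n - 1)

def throughput(n, k, tau):
    """Operations completed per unit time: N / ((k - 1 + N) * tau)."""
    return n / ((k - 1 + n) * tau)

n, k, tau = 27, 4, 2.0  # hypothetical values
print(f"efficiency = {efficiency(n, k):.0%}")
print(f"throughput = {throughput(n, k, tau):.2f} results per time unit")
print(f"upper bound on throughput = {1 / tau:.2f} (the clock frequency f)")
```

The throughput is simply the efficiency multiplied by f = 1 / tau, so it approaches f as N grows.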
(Refer Slide Time: 17:48)
This is a very typical plot I am showing: number of tasks N versus speedup for various
values of k. Let us say k = 4; you see, as the number of tasks increases, the speedup
increases and levels off very close to 4. For k = 8, it levels off very close to 8; and
so on. Here I have shown N up to 256. You get some idea of what is actually happening.
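The shape of this plot can be reproduced as a small table. This is only a sketch using the speedup expression N k / (k + N - 1) from the calculation above; the particular values of N printed here are my own choice, not the points on the slide.

```python
def speedup(n, k):
    """Ideal pipeline speedup for N tasks on a k-stage pipeline."""
    return (n * k) / (k + n - 1)

task_counts = [1, 2, 4, 8, 16, 32, 64, 128, 256]
print(f"{'N':>5} {'k=4':>8} {'k=8':>8}")
for n in task_counts:
    print(f"{n:>5} {speedup(n, 4):>8.2f} {speedup(n, 8):>8.2f}")
```

The k = 8 column saturates more slowly than the k = 4 column, because a deeper pipeline takes more clocks to fill.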
Now, coming to ARM, I am not going into the details because this is not a course where I
am teaching computer architecture; rather, I am trying to tell you that ARM uses
instruction pipelining. If I have a k-stage pipeline, I can expect a speedup of up to k times.
In the ARM7 architecture, of which the ARM7TDMI is one particular processor, there are three
stages: fetch, decode and execute. If everything else is fine, we are expected to get about
3 times speedup in terms of instruction execution.
Similarly, ARM9 has a 5-stage pipeline: fetch, decode, execute, memory, write. We can
see some finer details. Within decode, the values of whatever registers are required are
also read. During execute the barrel shifter is working and the ALU operations
are carried out; memory accesses are carried out in the memory stage; during write, the
results are written back into the register bank. For ARM7 all of these were done
during execute: the register read, shift, ALU and register write were all done in
execute. But in ARM9, you are making the pipeline more elaborate and more flexible.
So, here the speedup can be at most 5.
Just one thing I want to mention here is that this speedup of k that I am talking about is
an ideal speedup. It is the speedup you can get when the pipeline is operating at its full
speed, but sometimes, for some reason, you cannot operate the pipeline at full speed.
In those cases, the speedup will become less than k. I am giving one example. Suppose
there are some instructions being executed, and let us say these are all ADD
instructions, and there is one complex multi-register store instruction, which will need
multiple clock cycles.
The idea is that normally every stage finishes in one clock. But for the STR instruction what
might happen is that this pink colored box, the execute stage, can actually require multiple
cycles. What does that mean? It means that you cannot decode the next instruction here;
unless this execute is over, you cannot decode it. So, there will be some delay here because
the execute is requiring multiple cycles. Such delays are referred to as stalls; we call
them stall cycles.
A stall cycle means that some cycles are wasted. You see here the first instruction is finished;
the second instruction finished here, the third instruction here, the fourth instruction here. But
because of this delay, this instruction, which was supposed to finish here, got delayed. Not
only this one; all subsequent instructions got delayed, and there can be many such instructions
in between. So, for every such instruction there will be some stall cycles inserted.
And once a stall is there, it will be carried by all subsequent instructions until that
instruction exits the pipe. Such cases can slow down the effective operational speed of a
pipeline.
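The effect of one multi-cycle instruction on everything behind it can be sketched with a very simplified Python model of a 3-stage pipeline. This is only an illustration under stated assumptions: fetch and decode always take one clock, only execute may take longer, and the 3-cycle execute for STR is a made-up number rather than an actual ARM7 timing.

```python
def completion_cycles(execute_cycles, stages=3):
    """Return the clock at which each instruction leaves the last stage.

    Simplified model: an instruction can enter execute only after it has
    passed the earlier stages AND the previous instruction has left execute.
    """
    finish = []
    prev_finish = 0
    for i, cycles in enumerate(execute_cycles):
        earliest_start = (stages - 1) + i     # clocks spent reaching execute
        start = max(prev_finish, earliest_start)
        prev_finish = start + cycles
        finish.append(prev_finish)
    return finish

# Hypothetical program: the STR is assumed to need 3 execute cycles.
program = [("ADD", 1), ("ADD", 1), ("STR", 3), ("ADD", 1), ("ADD", 1)]
actual = completion_cycles([c for _, c in program])
ideal = completion_cycles([1] * len(program))

for (name, _), a, b in zip(program, actual, ideal):
    print(f"{name}: finishes at clock {a} (ideal {b}, delayed by {a - b})")
```

In this model the two ADD instructions after the STR each finish 2 clocks late, which is the stall being carried by all subsequent instructions.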
We give a bird's-eye view of the ARM7, ARM9 and ARM10 pipelines.
There is something called pipeline hazards. For example, a data dependency can decide whether
you can feed the next instruction into the pipeline or not. There are a lot of architectural
issues that can prevent the pipeline from operating at its maximum speed.
Stall cycles can occur due to this. One example I gave was because of a complex
instruction, but there can be other reasons. There are situations called data hazards,
there can be structural hazards, and there can be control hazards. Because of the
sequence of operations that is being carried out, and because of instructions like jumps and
branches, you may have to insert stall cycles. All of these prevent the pipeline from
operating at its maximum speed.
(Refer Slide Time: 25:39)
Just another thing let me tell you: in this ARM7 kind of architecture there is a 3-stage
pipeline. Consider an instruction that has reached the execute phase. One thing you remember
I told you is that all instructions are 32-bit instructions, and your memory is byte addressable.
So, if the first instruction is stored in memory location 100, the next instruction will be
stored in memory location 104, and the one after that in location 108, because each
instruction requires 4 bytes.
The point is that when some instruction is being executed, the PC will always be 8 bytes ahead.
Each instruction fetch adds 4 to the memory address, and the PC is a special register which
always stores the address of the next instruction to be fetched. So the address you are
fetching from will be the address of the instruction currently being executed plus 8. If we
add 4 to the current address, we will be getting the next instruction; if we add 8 to it,
we will be getting the next-to-next instruction. Here you are always fetching the
next-to-next instruction because there are three stages; that is why the PC is always
8 bytes ahead.
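This can be seen by listing which address sits in each stage at every clock. The Python sketch below is only an illustration: it assumes a 3-stage pipeline with no stalls, 4-byte instructions, and a program starting at address 100 as in the example above; the function name is hypothetical.

```python
INSTRUCTION_BYTES = 4

def stage_addresses(cycle, base=100):
    """Addresses in (fetch, decode, execute) at a given clock, 0-based.

    A stage that has not been reached yet (pipeline still filling) is None.
    """
    fetch = base + INSTRUCTION_BYTES * cycle
    decode = fetch - INSTRUCTION_BYTES if cycle >= 1 else None
    execute = fetch - 2 * INSTRUCTION_BYTES if cycle >= 2 else None
    return fetch, decode, execute

print("clock  fetch(PC)  decode  execute")
for cycle in range(5):
    fetch, decode, execute = stage_addresses(cycle)
    print(f"{cycle:>5}  {fetch:>9}  {str(decode):>6}  {str(execute):>7}")
```

From clock 2 onwards the fetch address is always the execute address plus 8: when the instruction at 100 is executing, the instruction at 108 is being fetched.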
By the time an instruction is being executed, the PC has already been incremented two
times, so it is always 8 bytes ahead; whenever you use the value of the PC, you should
adjust for this accordingly. With this we come to the end of this lecture. In the next
lecture, we shall be looking at some of the unique features of ARM with respect to register
organization, the various execution modes and so on.
Thank you.