Advanced Processing Technique
segment registers
CS - points at the segment containing the current program.
DS - generally points at the segment where variables are defined.
ES - extra segment register; its use is left to the programmer.
SS - points at the segment containing the stack.
special purpose registers
IP - the instruction pointer.
flags register - reflects the current state of the microprocessor.
MOV instruction: copies operand2 into operand1 (MOV operand1, operand2).
Flags set in the Status Register by the ALU
An important function of the ALU is to set up bits or flags which give information to
the control unit about the result of an operation. The flags are grouped together in the
status word.
As the ALU has only an adder, subtraction must be done with 2's-complement
arithmetic. The ALU has no knowledge of this at all; it simply adds two binary inputs
and sets the flags. It is up to the control unit (or really the programmer's instructions
executed by the control unit) to interpret the results.
Z Zero flag: This is set to 1 whenever the output from the ALU is zero.
N Negative flag: This is set to 1 whenever the most significant bit of the output is 1.
Note that it is not correct to say that it is set when the output of the ALU is negative:
the ALU doesn't know or care whether you are working in 2's complement. However,
this flag is used by the controller for just such interpretation.
O Overflow flag: This is set to 1 if the result of a signed operation is too large to fit
in the number of bits available to represent it; otherwise it is reset to 0.
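The flag rules above can be sketched in Python (an illustration only; the 8-bit width and the function name `add8` are assumptions, not part of the lecture):

```python
def add8(a, b):
    """Add two 8-bit values the way a simple ALU would, returning
    (result, flags). The ALU just adds bit patterns; interpreting the
    result as signed or unsigned is left to the program."""
    result = (a + b) & 0xFF            # keep only 8 bits
    z = 1 if result == 0 else 0        # Z: output is all zeros
    n = (result >> 7) & 1              # N: most significant bit of output
    # V: signed overflow -- both operands have the same sign bit,
    # but the result's sign bit differs
    sa, sb, sr = (a >> 7) & 1, (b >> 7) & 1, (result >> 7) & 1
    v = 1 if (sa == sb and sr != sa) else 0
    return result, {"Z": z, "N": n, "V": v}

# 100 + 100 = 200, which does not fit in signed 8 bits (max 127):
print(add8(100, 100))   # result 200 (i.e. -56 as signed), N = 1, V = 1
```

Note that subtracting a number by adding its 2's complement, e.g. `add8(5, (-5) & 0xFF)`, yields 0 and sets Z, exactly as the text describes: the ALU only adds.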
Counter Registers: The PC and SP are counting registers, which either act
as loadable registers (loaded from the IR (address) register) or count,
both on receipt of a clock pulse. Their internal workings appear
complicated, but are unremarkable, so we leave them as black boxes here.
[Figure: array multiplier for the above example]
Practical
Example:
Write an EMU8086 program that adds the contents of an array a[4].
Homework:
Q1. Write an EMU8086 program that checks whether a number is odd;
print "odd" if it is, otherwise print "even".
Q2. Write an EMU8086 program that adds the contents of an array a[10].
Q3. Write an EMU8086 program that adds the numbers 1…10.
Q4. Write an EMU8086 program that finds how many odd numbers
an array a[10] contains.
Q5. Write an EMU8086 program that finds the factorial of any
number.
Practical
The most significant bits (MSBs) of the address select the memory module, and the least
significant bits (LSBs) give the address of the data within that module.
For example, to fetch the data value 90 the processor issues the address 1000. The high-order
bits 10 indicate that the data is in module 10 (module 3), and the low-order bits 00 are the
address of 90 within module 10 (module 3).
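The module/offset split in the example can be sketched in Python (illustration only; the 2-bit module field and the function name `split_address` are assumptions matching the 4-module example):

```python
def split_address(addr_bits, module_bits):
    """Split a binary address string into (module, offset) fields.
    The high-order bits select the module and the low-order bits
    select the word inside it, as in the worked example."""
    module = addr_bits[:module_bits]   # MSBs: which module
    offset = addr_bits[module_bits:]   # LSBs: location inside the module
    return module, offset

# The example from the text: address 1000 -> module "10", offset "00"
print(split_address("1000", 2))   # ('10', '00')
```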
Draw the hardware of a 4-stage pipeline unit (Fetch, Decode, Execute, Write).
With pipelining: Time = 210
Without pipelining: Time = 360
Practical
Example: Write a macro that prints any string, and use it to print "Computer Science"
and "Network".
Example: Write a macro that prints 3 strings, and call it 2 times.
Homework:
Q1. Write a macro that finds the factorial of any number, and use it to find the
factorials of 4 and 5.
Q2. Write a macro that prints any string, and use it to print 'Computer Engineering'.
Q3. Use a macro to find y:
y = 2^1 + 2^2 + 2^3 + 2^4 + … + 2^n
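For Q3, the series has the closed form y = 2^(n+1) - 2 (geometric series); a quick Python check of the arithmetic (illustration only, not the requested macro):

```python
def y(n):
    # y = 2^1 + 2^2 + ... + 2^n, a geometric series with ratio 2
    return sum(2 ** i for i in range(1, n + 1))

# Closed form check: y(n) = 2^(n+1) - 2
for n in range(1, 10):
    assert y(n) == 2 ** (n + 1) - 2

print(y(4))   # 2 + 4 + 8 + 16 = 30
```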
Pipelining
Example: Suppose you have 9 instructions and a pipeline with 6 stages:
Fetch instruction (FI)
Decode instruction (DI)
Calculate operands (CO)
Fetch operands (FO)
Execute instructions (EI)
Write result (WR)
Represent this without pipelining and with pipelining (draw the timing diagram)
and find the time for each case:
1. With pipeline: time = 14 units
2. Without pipeline: time = 9 × 6 = 54 units
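The 14-unit figure follows from the standard pipeline-timing formula, sketched here in Python (a stage time of one unit is assumed; the function names are mine):

```python
def pipeline_time(n, k):
    """Clock cycles for n instructions on a k-stage pipeline:
    k cycles to fill the pipe, then one instruction completes per cycle."""
    return k + (n - 1)

def sequential_time(n, k):
    """Without pipelining, every instruction takes all k stages in turn."""
    return n * k

n, k = 9, 6
print(pipeline_time(n, k))    # 6 + 8 = 14 units, as in the example
print(sequential_time(n, k))  # 9 * 6 = 54 units
print(sequential_time(n, k) / pipeline_time(n, k))  # speedup ~ 3.86
```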
Structural Hazards. They arise from resource conflicts when the hardware cannot
support all possible combinations of instructions in simultaneous overlapped
execution.
Data Hazards. They arise when an instruction depends on the result of a previous
instruction in a way that is exposed by the overlapping of instructions in the
pipeline.
Control Hazards. They arise from the pipelining of branches and other instructions
that change the PC.
Structural Hazards
Instr       1    2    3    4    5    6    7    8
Load        IF   ID   EX   MEM  WB
Instr 1          IF   ID   EX   MEM  WB
Instr 2               IF   ID   EX   MEM  WB
Instr 3                    IF   ID   EX   MEM  WB
To resolve this, we stall the pipeline for one clock cycle when a data-memory
access occurs. The effect of the stall is actually to occupy the resources for that
instruction slot. The following table shows how the stalls are actually
implemented.
Clock cycle number
Instr       1    2    3    4      5    6    7    8    9
Load        IF   ID   EX   MEM    WB
Instr 1          IF   ID   EX     MEM  WB
Instr 2               IF   ID     EX   MEM  WB
Instr 3                    stall  IF   ID   EX   MEM  WB
Introducing stalls degrades performance, as we saw before. Why, then, would the
designer allow structural hazards? There are two reasons:
To reduce cost. For example, machines that support both an instruction access and a
data access every cycle (to prevent the structural hazard of the above example) require
at least twice as much total memory bandwidth.
To reduce the latency of the unit. The shorter latency comes from the lack of pipeline
registers that introduce overhead.
Data Hazards
A major effect of pipelining is to change the relative timing of instructions
by overlapping their execution. This introduces data and control hazards. Data hazards
occur when the pipeline changes the order of read/write accesses to operands so that
the order differs from the order seen by sequentially executing instructions on the
unpipelined machine.
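The read/write reordering just described can be modeled with a toy Python sketch (the 5-stage pipeline and the assumption that a register written in WB can be read by a later ID in the same cycle are mine, for illustration):

```python
# Toy model of a RAW (read-after-write) hazard in a 5-stage pipeline.
# Assumption: a register written in WB can be read by ID in that same
# cycle (register file writes in the first half, reads in the second).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_cycle(issue_cycle, stage):
    """Cycle in which an instruction issued at issue_cycle reaches stage."""
    return issue_cycle + STAGES.index(stage)

# Instruction A (issued in cycle 1) writes R1 in WB; instruction B
# (issued in cycle 2) wants to read R1 during its ID stage.
write_cycle = stage_cycle(1, "WB")   # cycle 5
read_cycle = stage_cycle(2, "ID")    # cycle 3 without stalls
stalls = max(0, write_cycle - read_cycle)
print(stalls)   # 2 stall cycles before B's ID can safely read R1
```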
Consider the pipelined execution of these instructions over clock cycles 1-9
(table omitted).
Control hazards can cause a greater performance loss for the pipeline than data hazards.
When a branch is executed, it may or may not change the PC (program counter) to
something other than its current value plus 4. If a branch changes the PC to its target
address, it is a taken branch; if it falls through, it is not taken.
If instruction i is a taken branch, then the PC is normally not changed until the end of
the MEM stage, after the completion of the address calculation and comparison.
The simplest method of dealing with branches is to stall the pipeline as soon as the
branch is detected until we reach the MEM stage, which determines the new PC. The
pipeline behavior looks like :
Clock cycle            1    2          3      4      5    6    7    8    9    10
Branch instruction     IF   ID         EX     MEM    WB
Branch successor            IF(stall)  stall  stall  IF   ID   EX   MEM  WB
Branch successor + 1                                      IF   ID   EX   MEM  WB
The stall does not occur until after the ID stage (where we know that the instruction is
a branch). This control-hazard stall must be implemented differently from a data-hazard
stall, since the IF cycle of the instruction following the branch must be repeated as soon
as we know the branch outcome. Thus, the first IF cycle is essentially a stall (because it
never performs useful work), which brings the total to 3 stalls.
Three clock cycles wasted for every branch is a significant loss. With a 30% branch
frequency and an ideal CPI of 1, the machine with branch stalls achieves only half the
ideal speedup from pipelining!
The number of clock cycles can be reduced by two steps:
• Find out whether the branch is taken or not taken earlier in the pipeline;
• Compute the taken PC (i.e., the address of the branch target) earlier.
Both steps should be taken as early in the pipeline as possible. By moving the zero test
into the ID stage, it is possible to know whether the branch is taken at the end of the ID
cycle. Computing the branch target address during ID requires an additional adder,
because the main ALU, which has been used for this function so far, is not usable until
EX. With this revised datapath we will need only a one-clock-cycle stall on
branches.
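Both the "half the ideal speedup" claim and the gain from resolving branches in ID can be checked numerically (Python, illustration only; the function name is mine, the 30% branch frequency and ideal CPI of 1 come from the text):

```python
def effective_cpi(ideal_cpi, branch_freq, branch_penalty):
    """Average CPI when every branch costs branch_penalty extra cycles."""
    return ideal_cpi + branch_freq * branch_penalty

# Stalling until MEM costs 3 cycles per branch:
cpi_mem = effective_cpi(1.0, 0.30, 3)   # 1.9, i.e. roughly half the ideal speedup
# Resolving the branch in ID costs only 1 cycle per branch:
cpi_id = effective_cpi(1.0, 0.30, 1)    # 1.3
print(cpi_mem, cpi_id)
```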
In some machines, branch hazards are even more expensive in clock
cycles. For example, a machine with separate decode and register fetch
stages will probably have a branch delay - the length of the control
hazard - that is at least one clock cycle longer. The branch delay, unless it
is dealt with, turns into a branch penalty. Many older machines that
implement more complex instruction sets have branch delays of four
clock cycles or more. In general, the deeper the pipeline, the worse the
branch penalty in clock cycles.
Finding the length of any string: use $ - string. For example:
msg db 'Hello, world!', 0xa    ; our dear string
len equ $ - msg                ; length of our dear string
Homework :
Q1. Write an EMU8086 program that prints the letters from A to Z.
Q2. Write an EMU8086 program that finds the largest number in an array
a[10] of integers.
Q3. Write an EMU8086 program that sorts an integer array a[6] in ascending order.
Q4. Write an EMU8086 program that finds the length of your name.
Vectorization and Vector Processors
A scalar processor is a normal processor, which works on one
instruction at a time, each operating on single data
items. In today's world this technique proves
highly inefficient, as the overall processing of instructions is
very slow. A scalar processor is classified as an SISD
processor in Flynn's taxonomy.
In computing, a vector processor or array processor is a central
processing unit (CPU) that implements an instruction set containing instructions that
operate on one-dimensional arrays of data called vectors, compared to the scalar
processors, whose instructions operate on single data items. Vector processors can
greatly improve performance on certain workloads, notably numerical simulation and
similar tasks. Vector machines appeared in the early 1970s and dominated supercomputer
design through the 1970s into the 1990s.
Characteristics of Vector Processors
A vector processor is a CPU (Central Processing Unit) in a computer with parallel
processors and the capability for vector processing. The main characteristic of a
vector processor is that it makes use of the parallel processing capability of the
processor where two or more processors operate concurrently. This makes it
possible for the processors to perform multiple tasks simultaneously, or
for the task to be split into different subtasks handled by different processors and
combined to get the result.
The vector processor treats all of the elements of the vector as one single
operand as it traverses the vector in a single loop. Computers with vector
processors find many uses that involve computation of massive amounts of data
such as image processing, artificial intelligence, mapping the human genome,
space simulations, seismic data, and hurricane predictions.
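The scalar/vector contrast can be sketched in plain Python (illustration only; real vector hardware performs the element-wise operations in parallel under one instruction, which ordinary Python cannot do, so the second function only mimics the idea):

```python
def scalar_add(a, b):
    """Scalar style: one add instruction per element, driven by a loop."""
    out = []
    for i in range(len(a)):
        out.append(a[i] + b[i])   # one element at a time
    return out

def vector_add(a, b):
    """Vector style: conceptually one instruction applied to whole
    vectors; here the element-wise adds are only simulated."""
    return [x + y for x, y in zip(a, b)]

print(scalar_add([1, 2, 3], [4, 5, 6]))   # [5, 7, 9]
print(vector_add([1, 2, 3], [4, 5, 6]))   # [5, 7, 9]
```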
Types of Array Processors
There are basically two types of array processors:
• Attached Array Processors
• SIMD Array Processors
SIMD Array Processors Single Instruction Multiple Data
SIMD is the organization of a single computer containing multiple processors operating in
parallel. The processing units are made to operate under the control of a common control unit,
thus providing a single instruction stream and multiple data streams.
A general block diagram of an array processor is shown below. It contains a set of identical
processing elements (PEs), each of which has a local memory M. Each processing element
includes an ALU and registers. The master control unit controls all the operations of the
processing elements. It also decodes the instructions and determines how each instruction is
to be executed. The main memory is used for storing the program. The control unit is
responsible for fetching the instructions. Vector instructions are sent to all PEs
simultaneously, and the results are returned to memory.
Why use the Array Processor
example:
org 100h
mov dx, offset msg   ; DX = address of the string
mov ah, 9            ; DOS function 09h: print '$'-terminated string
int 21h              ; call DOS
ret
msg db "hello world $"
INT 21h / AH=1 - read character from standard input, with echo, result is stored in AL.
if there is no character in the keyboard buffer, the function waits until any key is pressed.
example:
mov ah, 1   ; DOS function 01h: read a character, with echo, into AL
int 21h     ; call DOS
ret
Computer Architecture Flynn’s taxonomy
Parallel computing is computing where a job is broken into discrete parts that can be
executed concurrently. Each part is further broken down into a series of instructions, and
instructions from each part execute simultaneously on different CPUs. Parallel systems
deal with the simultaneous use of multiple computer resources, which can include a single
computer with multiple processors, or a number of computers connected by a network to
form a parallel processing cluster.
Parallel systems are more difficult to program than computers with a single processor because
the architecture of parallel computers varies accordingly and the processes of multiple CPUs
must be coordinated and synchronized.
The crux of parallel processing is the CPU. Based on the number of instruction and data
streams that can be processed simultaneously, computing systems are classified into four
major categories:
Flynn's taxonomy is a classification of computer architectures.
The four classifications defined by Flynn are based upon the number of concurrent
instruction (or control) streams and data streams available in the architecture.
According to Flynn's taxonomy, computers can be divided into 4 major groups:
• SISD (Single Instruction stream, Single Data stream)
• SIMD (Single Instruction stream, Multiple Data streams)
• MISD (Multiple Instruction streams, Single Data stream)
• MIMD (Multiple Instruction streams, Multiple Data streams)
Multiprocessor:
A multiprocessor is a computer system with two or more central processing units (CPUs)
that share full access to a common RAM. The main objective of using a multiprocessor is to
boost the system's execution speed, with other objectives being fault tolerance and
application matching.
There are two types of multiprocessors: shared-memory multiprocessors and
distributed-memory multiprocessors. In a shared-memory multiprocessor all the CPUs
share a common memory, but in a distributed-memory multiprocessor every CPU has
its own private memory.
Multicomputer:
A multicomputer system is a computer system with multiple processors that are connected
together to solve a problem. Each processor has its own memory, accessible only by that
particular processor, and the processors can communicate with each other via an
interconnection network.
Tree: D = 2h, where h is the height of the tree; here h = 4, so D = 2 × 4 = 8, with b = 2.
Mesh: D = 2√p - 2, where p represents the number of nodes; here D = 2√9 - 2 = 4.
When we connect p0 to p8 it needs 4 links, giving b = 3.
Shared Memory. A shared-memory multiprocessor can be modeled as a complete graph,
in which every node is connected to every other node, as shown in Fig. 4. In the earlier
networks a node has to go through an intermediary to send/receive data to/from P4, say;
in a shared-memory multiprocessor, every piece of data is directly accessible to every
processor (we assume that each processor can simultaneously send/receive data over all
of its p - 1 links). The diameter D = 1 of a complete graph is an indicator of this direct
access. The node degree d = p - 1; here d = 9 - 1 = 8.
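The graph parameters quoted above can be checked in Python (illustration only; the function names are mine):

```python
import math

def complete_graph_params(p):
    """Complete graph on p nodes: every node linked to every other."""
    diameter = 1          # any node reaches any other in one hop
    degree = p - 1        # links per node
    return diameter, degree

def mesh_diameter(p):
    """Square mesh of p nodes (p a perfect square): D = 2*sqrt(p) - 2,
    the hop count between opposite corners."""
    side = math.isqrt(p)
    return 2 * side - 2

print(complete_graph_params(9))   # (1, 8), matching the text
print(mesh_diameter(9))           # 2*3 - 2 = 4
```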