
CMP 345: COMPUTER ARCHITECTURE

COURSE OUTLINE

Computer architecture is concerned with the structure and behaviour of the various functional modules of the computer and how they interact to provide the processing needs of the user. In particular, this course covers computer systems ranging from PCs through multiprocessors with respect to hardware design and instruction set architecture. This includes units and related technology such as primary and secondary memories, caches, the central processing unit (CPU) and pipelines. A menu of "possibilities" will be presented, analysed and evaluated based on the technology available today. In no event should it be assumed that the architecture that looks strongest today will be the best in the new millennium. The emphasis is therefore on methodology rather than conclusions, for while methodology is relatively timeless, conclusions are not.

Levels of machine design: gate, register and processor levels.

(Parallelism, multiprocessors and pipelining, memory system organization, fault tolerance)

Basic Blocks of a Microcomputer

The microcomputer is made up of three basic building blocks: the Central Processing Unit, the
memory unit and the input and output unit.

The Central Processing Unit: executes all instructions and performs arithmetic and logic
operations on data. It is also called the microprocessor. The MOS microprocessor is a single LSI
chip that contains all of the control, arithmetic and logic circuits of the microcomputer.

Memory unit: stores both data and instructions. Typically, it contains ROM and RAM chips.

ROM: is read only, non-volatile, used to store instructions and data that do not change.

RAM: is volatile, used to store programs and data that are temporary and might change in the
course of executing a program.

I/O unit: transfers data between the microcomputer and external devices. The transfer involves
data, status and control signals.

Microprocessor Bus

Address bus: information transfer takes place in only one direction from the microprocessor to
memory or I/O elements (unidirectional).

Data bus: data can flow in both directions to and from the microprocessor (bi-directional).

Control bus: consists of a number of signals that are used to synchronize the operation

of the individual microcomputer elements. The microprocessor sends some of these control
signals to the other elements to indicate the type of operation being performed.

The basic elements of a digital computer are gates and flip-flops which are also known as
combinational and sequential logic elements respectively.

A combinational logic element is a circuit whose output depends only on its current inputs, whereas the output of a sequential element depends on its past history (i.e. it remembers its previous inputs). The behaviour of digital circuits can be described in terms of a formal notation called Boolean algebra.

Analog and Digital Systems

Analog circuits represent physical quantities in continuous forms whereas digital circuits
represent physical quantities in discrete forms.

Fundamental gates

A gate is a two-state device: open or closed. A gate can be thought of as a black box with one or
more input terminals and an output terminal. It is important to note that:

Electronic gates require a power supply.


Gate INPUTS are driven by voltages having two nominal values, e.g. 0V and 5V representing
logic 0 and logic 1 respectively.
The OUTPUT of a gate provides two nominal values of voltage only, e.g. 0V and 5V
representing logic 0 and logic 1 respectively. In general, there is only one output to a logic
gate except in some special cases.
There is always a time delay between an input being applied and the output responding.

Digital systems are said to be constructed by using logic gates. These gates are the AND, OR, NOT,
NAND, NOR, EXOR and EXNOR gates. The basic operations are described below with the aid of
truth tables.

The AND gate

The output of an AND gate is true if and only if each of its inputs is also true: it gives a
high output (1) only if all its inputs are high. A dot (.) is used to show the AND operation, e.g.
A.B. Bear in mind that the dot is sometimes omitted, i.e. AB.

Representation

[Switch circuit: the circuit is completed only if switch A and switch B are both closed.]

Consider the effect of ANDing the following 2 8-bit words: A = 11011100 and B = 01100101

11011100 word A

01100101 word B

01000100 C=A.B

The AND gate is used to mask certain bits in a word by forcing them to 0. For example, suppose we wish to
clear the leftmost four bits of an 8-bit word to zero; ANDing the word with 00001111 will do
the task.

11011011 source word

00001111 mask

00001011 result
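As a quick check, the same masking operation can be carried out in a few lines of Python (a minimal sketch; the values are the ones from the example above):

    # Using AND to mask bits: clear the leftmost 4 bits of an 8-bit word.
    a = 0b11011011          # source word
    mask = 0b00001111       # keep the low 4 bits, force the high 4 bits to 0

    result = a & mask       # bitwise AND

    print(f"{a:08b}  source word")
    print(f"{mask:08b}  mask")
    print(f"{result:08b}  result")   # prints 00001011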

The OR gate

The OR gate is an electronic circuit that gives a high output (1) if one or more of its inputs are
high. In other words, the output is true if any one of the inputs is true. A plus (+) is used to show
the OR operation.
[Parallel switch circuit for the OR operation: the circuit is completed if switch A or switch B (or both) is closed.]
The OR operation is used to set one or more bits in a word to a logical 1. Consider the following
example:

10011100

00100101

10111101
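The same bit-setting operation, as a minimal Python sketch using the values from the example above:

    # Using OR to set selected bits of a word to 1.
    word = 0b10011100
    set_mask = 0b00100101    # bits to force to 1

    result = word | set_mask

    print(f"{word:08b}")
    print(f"{set_mask:08b}")
    print(f"{result:08b}")   # prints 10111101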

NOT gate

The NOT gate is an electronic circuit that produces an inverted version of its input at its
output. It is also known as an inverter or a complementer. If the input variable is A, the inverted
output is known as NOT A; this is also shown as A', or A with a bar over the top. The diagrams
below show two ways that the NAND logic gate can be configured to produce a NOT gate; it can
also be done using NOR logic gates in the same way.

NAND gate

[NAND gate: an AND gate followed by an inverter, giving C = (A.B)'.]

This is a NOT-AND gate which is equal to an AND gate followed by a NOT gate. The outputs of all
NAND gates are high if any of the inputs are low. The symbol is an AND gate with a small circle
on the output. The small circle represents inversion.

NOR gate

This is a NOT-OR gate which is equal to an OR gate followed by a NOT gate. The outputs of all
NOR gates are low if any of the inputs are high.

The symbol is an OR gate with a small circle on the output. The small circle represents inversion.

Exclusive OR (EOR/XOR) gate

The 'Exclusive-OR' gate is a circuit which will give a high output if either, but not both, of its two
inputs is high. An encircled plus sign (⊕) is used to show the EOR operation.

Exclusive NOR EXNOR gate

The 'Exclusive-NOR' gate circuit does the opposite of the EOR gate: it gives a low output if
either, but not both, of its two inputs is high. It is the exclusive-OR gate with its output inverted,
so its output is true if both inputs are the same. The symbol is an EXOR gate with a small circle
on the output; the small circle represents inversion.

The ability of the EOR gate to detect whether its inputs differ allows us to build an equality tester
that indicates whether or not two words are identical. The NAND and NOR gates are called
universal functions, since with either one the AND, OR and NOT functions can be generated;
thus they are the most widely used. A function in sum-of-products form can be implemented using
NAND gates by replacing all AND and OR gates with NAND gates.

A function in product-of-sums form can be implemented using NOR gates by replacing all AND
and OR gates with NOR gates.
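As a small illustration of the equality-tester idea above, a minimal Python sketch (the function name words_equal is just an illustrative choice):

    # XOR of two words is all zeros exactly when the words are identical,
    # so an equality tester XORs the words bit by bit and checks for zero.
    def words_equal(a: int, b: int) -> bool:
        return (a ^ b) == 0     # no differing bit positions

    print(words_equal(0b10110010, 0b10110010))  # True
    print(words_equal(0b10110010, 0b10110011))  # False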

The table below summarizes the input/output combinations for the NOT gate together
with all possible input/output combinations for the other gate functions. Also note that a truth
table with n inputs has 2^n rows.
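Since a truth table with n inputs has 2^n rows, the whole summary table for the two-input gates can be generated mechanically. A minimal Python sketch (the layout is illustrative):

    # Print the summary truth table for two-input AND, OR, NAND, NOR, XOR and XNOR.
    from itertools import product

    print("A B | AND OR NAND NOR XOR XNOR")
    for a, b in product((0, 1), repeat=2):      # 2**2 = 4 rows
        and_ = a & b
        or_ = a | b
        xor_ = a ^ b
        print(f"{a} {b} |  {and_}   {or_}   {1 - and_}    {1 - or_}   {xor_}    {1 - xor_}")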

Application of Gates

Circuits are constructed by connecting gates together. The output from one gate can be
connected to the input of one or more other gates. Two outputs, however, cannot be connected
together.

[Circuit diagram: A, B and C are inputs; P, Q and R are intermediate variables; F is the output.]

Create a truth table with the 2^n possible combinations of the n inputs.

A  B  C   P = A.B   Q = B.C   R = A.C   F = P + Q + R
0 0 0 0 0 0 0
0 0 1 0 0 0 0
0 1 0 0 0 0 0
0 1 1 0 1 0 1
1 0 0 0 0 0 0
1 0 1 0 0 1 1
1 1 0 1 0 0 1
1 1 1 1 1 1 1

The circuit implements a majority-logic function, whose output takes the same value as the
majority of its inputs. We can write down the output function F as the logical OR of the three
intermediate signals P, Q and R, i.e. F = P + Q + R; substituting gives F = A.B + B.C + A.C. This is
a Boolean equation.
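A minimal Python sketch that checks the Boolean equation F = A.B + B.C + A.C against the truth table above (the majority comparison is just a cross-check):

    # Verify F = A.B + B.C + A.C row by row and compare with the majority of the inputs.
    from itertools import product

    for a, b, c in product((0, 1), repeat=3):
        p, q, r = a & b, b & c, a & c
        f = p | q | r
        majority = 1 if (a + b + c) >= 2 else 0
        print(a, b, c, p, q, r, f, "matches majority:", f == majority)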

Expressions for the output in terms of the inputs can be derived in two ways:

• From the circuit diagram, by writing the output of each gate in terms of its inputs
• From the truth table: each time a logic one appears in the output column, write down the
set of inputs that causes the output to be true.

[Circuit diagram: inputs X, Y and Z; P is the output of an inverter on X, Q and R are NAND-gate outputs, and F is the circuit output.]

X  Y  Z   P = X'   Q = (P.Y)'   R = (X.Z)'   F = (Q.R)'
0  0  0     1          1            1            0
0  0  1     1          1            1            0
0  1  0     1          0            1            1
0  1  1     1          0            1            1
1  0  0     0          1            1            0
1  0  1     0          1            0            1
1  1  0     0          1            1            0
1  1  1     0          1            0            1

The output F is true when:
X = 0, Y = 1, Z = 0   (X'.Y.Z')
X = 0, Y = 1, Z = 1   (X'.Y.Z)
X = 1, Y = 0, Z = 1   (X.Y'.Z)
X = 1, Y = 1, Z = 1   (X.Y.Z)
Exercise

1. Using AND, OR and NOT gates only, design circuit diagrams to generate P and Q from inputs
X, Y and Z, where

P = (X + Y) (Y Z)

Q = Y.Z + X.Y.Z

Do not simplify or otherwise modify these expressions.

(a) By means of a truth table, establish the relationship between P and Q.

(b) Compare the circuit diagrams of P and Q in terms of implementation cost.

2. Using AND, OR and INVERTER gates, construct the corresponding logic circuits for the following
expressions:
I. X = AB (C + D + E) + BCD

II. Z = MN (P ⨁ N)

MICROPROCESSOR REGISTERS

Basic microprocessor Registers

There are four (4) basic microprocessor registers

Instruction Register (IR): stores instructions. The contents of an IR are always decoded by the
microprocessor as an instruction. After fetching an instruction code from memory, the
microprocessor stores it in the IR. The instruction is decoded internally by the
microprocessor, which then performs the required operation.
Program Counter (PC): contains the address of the instruction (Operation code). The PC
usually points to the next location, i.e. contains the address of the next instruction to be
executed.
Memory Address Register (MAR) or Data Counter: contains the address of data. The
microprocessor uses the address stored in the memory address register as a direct
pointer to memory. The contents of that address are the actual data being transferred.
Accumulator (A): used to store the result after most ALU operations. The accumulator is
typically used for inputting a byte from an external device into the accumulator or
outputting a byte to an external device from the accumulator.

Other Microprocessor Registers

General Purpose Register: for storing temporary data or for carrying out data transfers between
various registers.

Status Register (processor status word or condition code register): contains individual bits, each
with its own significance; the bits in the status register are called flags. Each flag indicates the
status of a specific microprocessor operation. The common flags are listed below, and a short
sketch after the list shows how they can be computed.

• Carry flag: is 1 if an arithmetic operation results in a carry, and 0 otherwise.
• Auxiliary carry flag: is 1 if there is a carry out of the low 4 bits of the result, 0 otherwise.
• Zero flag: is 1 if the result of an operation is zero, and 0 otherwise.
• Parity flag: indicates the parity of the result; typically it is 1 if the result contains an even
number of 1's (even parity) and 0 otherwise.
• Sign flag: indicates whether the result of the last operation is positive or negative.
• Overflow flag: is set to 1 if the result of an arithmetic operation is too large to fit in the
microprocessor's word size; otherwise it is reset to 0.
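A minimal Python sketch of how these flags could be computed after an 8-bit addition. The flag conventions follow the list above; the function name and the exact definitions chosen here (even parity, signed overflow test) are illustrative assumptions rather than those of any particular microprocessor:

    # Compute typical status-register flags for an 8-bit addition a + b.
    def add8_flags(a: int, b: int):
        total = a + b
        result = total & 0xFF
        flags = {
            "carry":     1 if total > 0xFF else 0,                        # carry out of bit 7
            "aux_carry": 1 if ((a & 0x0F) + (b & 0x0F)) > 0x0F else 0,    # carry out of bit 3
            "zero":      1 if result == 0 else 0,
            "sign":      (result >> 7) & 1,                               # most significant bit
            "parity":    1 if bin(result).count("1") % 2 == 0 else 0,     # even parity
            # signed overflow: both operands have the same sign but the result's sign differs
            "overflow":  1 if ((a ^ result) & (b ^ result) & 0x80) != 0 else 0,
        }
        return result, flags

    print(add8_flags(0x7F, 0x01))   # overflow and sign set, no carry
    print(add8_flags(0xFF, 0x01))   # carry and zero set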

Stack pointer register: contains the address of the top of a stack. Two operations are performed on
the stack: PUSH and POP. Push adds an item onto the stack and pop removes an item from the
stack.

COMPUTER ARCHITECTURE

Computer hardware comprises the components from which computers are built, i.e. computer
organization. In contrast, computer architecture is the science of integrating those components to
achieve a level of functionality and performance. Computer architecture is the structure and
organization of a computer's hardware and software systems, or the structure and organization
of the different components of a computer system. It is the combination of machine organization
and instruction set architecture. In its broadest definition, computer architecture is the design of
the abstraction layers that allow us to implement information-processing applications efficiently
using available manufacturing technologies.

Abstraction Layers
Abstractions help us express intangible concepts in visible representations that can be
manipulated.

Levels of abstraction in computer systems, adapted from [Maf01]

• Operating System - Provides a convenient interface between (a) the user and his/her
application software, and (b) the hardware (sometimes called the bare machine).
• Assembler - Translates assembly language, a primitive type of programming language,
into machine code, which is a stream of ones and zeroes.
• Instruction Set Architecture (ISA) - Interfaces the software (listed above) to the hardware
(listed below), and provides support for programming.
• Processor, Memory, and I/O System - These components support the execution of
machine code instructions expressed in terms of the ISA.

• Datapath and Control - Provide a convenient abstraction for connecting the processor,
memory, and I/O system and controlling their function efficiently.

• Level 5 - Problem Oriented Language - Provides a convenient interface and applications


engine that helps the user produce results specific to a given application area. For
example, Microsoft Word is used for document creation or editing, Excel for accounting
spreadsheets, etc. The language at this level is usually a sequence of keystrokes or a high-
level scripting language. In software design, the language is a high-level programming
language such as C, C++, or Java.
• Level 4 - Assembly Language - Assembly is a very detailed language that helps the
systems programmer or software designer move information around in a computer
architecture in a highly specific way. For example, many compilers (programs that
translate programming language into an assembly-like language) are written in assembly
language. The advantage to assembly language is speed and power in accessing various
features of the hardware.
• Level 3 - Operating System Machine - Provides a convenient interface between assembly
language and the abstraction of the hardware architecture's instruction set. Operating
systems generally contain many libraries that help a software developer or programmer
connect to lower-level system functions (e.g., I/O, memory allocation, etc.) in an
organized way.
• Level 2 - Instruction Set Architecture (ISA) - One of the most important parts of a
computer is the interface between the lowest-level software and the hardware. The ISA
includes anything programmers use to make a binary machine language program work
correctly, including instructions, I/O, etc. The ISA facilitates design of functionality
independent of the hardware.
• Level 1 - Microarchitectural Level - Microinstructions are low-level control instructions
that define the set of datapath control signals which apply to a given state of a computing
machine. The microinstructions, together with their sequencing, comprise the
microarchitecture, whose purpose is to rigorously and consistently express the control of
logic circuits that comprise the computer hardware. Designing this control in terms of a
program that implements machine instructions in terms of simpler microinstructions is
called microprogramming.
• Level 0 - Digital Logic - The circuitry that makes a digital computer run is called logic.
All processes of a digital computer are expressed in terms of functions of ones and zeros,
for example, and, or, and not functions.

Structure of Computer Architecture

[Figure: applications and the operating system sit above the instruction set architecture; beneath it are the machine organisation (instruction set processor, I/O system, firmware), logic design and circuit design, which together make up the computer.]

Instruction Set Architecture (ISA) is the interface between hardware and software; it includes
anything programmers need to know to make the machine function correctly and effectively,
including instructions, registers, I/O devices, memory addressing, etc. The ISA permits two different
machines (with different implementations, costs and performance) to run the same software if they
have the same defined ISA. It describes the functional behaviour of the system, which includes:

• Organisation of storage (Memory organization / Memory addressing techniques)


• Data type organization / data representation (the number of bits used to represent data,
e.g. floating point, bits, bytes, words, blocks, sectors, etc.)
• Coding and representation, i.e. instruction format
• Instruction set or operation codes (opcodes), e.g. SISC, CISC, RISC
• Addressing modes
• Program exceptions or conditions, e.g. INT, OV, UD, etc., which enable the
development of error-correction routines

Factors affecting the architecture of a machine

Technology
Applications
Interface
Machine organisation
Measurement and evaluation

Technology

The first generation used vacuum tubes and relays, electromechanical devices employed as bistable
components. They generated a great deal of heat, were bulky, and were prone to wear
because of their mechanical parts.
The second generation used diodes and transistors. This made machines more compact and faster, and
improved heat generation and dissipation. These were discrete components.
The third generation used Large Scale Integration (LSI), microchips made of silicon.
They were faster and more reliable.
The fourth generation used Very Large Scale Integration (VLSI), an improvement on the
third generation that ushered in the era of the microprocessor. This made it possible to build
tens of thousands of transistors on a single microchip.
(SSI: tens of components per chip; MSI: hundreds; LSI: thousands; VLSI: tens of thousands.)

The effect of technology on architecture of the machine

High-speed performance

Systems reliability
Overall machine size
Increased functionality (multi-dimensional)
Improved storage organization and management (increases in speed and volume)
Secondary storage devices have improved in terms of volume and speed. Magnetic
disks are also increasing in density at a rate of about 60% every 18 months.

Problem of technological Advances on the architecture of the machine

Crosstalk between the constituent circuits


Electron bombardment
Electrons are being replaced by photons, which do not generate magnetic radiation and travel at
higher speed.

Computer organisation

This consists of the basic blocks of a computer system, more specifically the basic blocks of the CPU
and of the memory hierarchy, and it is concerned with how these basic blocks are designed,
controlled and connected. Computer organisation is influenced by the input, comprising the data
and the program. Programs are made up of instructions, which consist of an opcode and the
address(es) of the operand(s).

The primitive machines contained few and simple instructions. Generally, the number of instructions
in a machine influences the design of its control logic. To solve the
problems created by complex instruction set machines, compilers were introduced, which
reduce the workload of the processor, while microprogramming was used to simplify the design of
the machine's control logic. Because of the inefficiencies of CISC (most of its instructions were
rarely used), RISC was introduced. RISC enables hardwired control, which makes the machine
faster due to the sequential circuits used to implement it.

Classical Machine Organisation

Classical machine organisation includes features such as microprogramming, hardwired control,
modularity and virtual memory management, and it is the basis of modern machines. It is also
the forerunner of parallel processing. Array processing makes use of a single instruction stream with
multiple processors. A problem with parallel processing machines is that the fetch cycle is shorter
than the execute cycle. Pipelined machines work on distributed processing, as in an assembly-line
manufacturing process. Super-pipelined organisations also exist to improve the
efficiency of the machine.

4-address machine (primitive machine)
3-address machine
2-address machine        (the 3-, 2- and 1-address machines follow the von Neumann architecture)
1-address machine (classical machine)
0-address machine (stack machine)

Organisation of Stacks

Stack machines can be implemented in three different ways

Through a stack pointer (SP), which normally points to the top of a stack held in main memory.
Through a set of registers, which can be arranged either in parallel or in series as shift registers.
Register implementation is faster because information can be moved more quickly, but it is more
expensive due to the cost of the registers.

Main memory implementation is cost effective and is used in commercial computers.

Stack Architecture

A stack is a linear structure which may be accessed only at one end, known as the top of the stack.
Only two operations can be carried out: PUSH, which inserts an item onto the stack, and POP, which
removes an item from the stack. A stack is a LIFO (last-in, first-out) list, and serves as a temporary
storage facility in memory. Stack machine code executes more efficiently than conventional machine
code compiled from a high-level language. Data and instruction codes are represented in Polish notation.

In a stack machine, data and control statements are referred to as 'syllables'. There are three
basic types of syllables:

OPERATOR syllables: express the arithmetic and logic operations to be performed by the
processor.
VALUE-CALL syllables: indicate the memory address where the desired operand is located.
LITERAL syllables: are actual operands.

A program for execution by a stack processor consists simply of a sequence of syllables that
represent, in reverse Polish notation, the arithmetic and logic operations to be performed.
These syllables are stored in consecutive memory locations, i.e. they are stacked.

N.B.: the reverse Polish notation corresponds exactly to the sequence of required stack
machine instructions. All modern compilers first translate any arithmetic or logic expression into
Polish notation and then into machine language; compilers for a stack machine effectively only have
to perform the first of these translations, since the Polish form corresponds directly to the machine
instructions, and that makes the machine faster.

[Figure A: parallel-register implementation of a stack (registers R1 to RN, with the stack pointer SP indicating the top of stack; PUSH and POP move items up and down the register bank).]

[Figure B: shift-register implementation of a stack.]

[Figure C: a stack processor - stack implementation of the Burroughs B6500, showing the SP, MAR, main memory, memory register, control unit, program counter and fetch counter.]

The von Neumann Architecture (defined in 1945)

The principles:
• The von Neumann architecture is based on the stored-program concept, by which data and
instructions are both stored in the main memory.
• The content of the memory is addressable by location, without regard to what is stored in
that location (data or code).
• Instructions are executed sequentially (from one instruction to the next, in order of their
location in memory) unless the order is explicitly modified.
• The computer contains the following subsystems: input/output system, memory, control unit
and arithmetic/logic unit (ALU).
• von Neumann computers are general-purpose computers.
• Computers based on the von Neumann architecture are sequential computers.

Differences between stack machine and von Neumann machine

Stack processors are particularly good for block-structured languages such as Pascal, COBOL,
Algol, C, Java, PL/1, etc.
High-level languages are comparatively more efficient on stack machines.
Stack machines produce extremely compact code, which can be as little as half the size of its
conventional counterpart.
This produces a faster machine.

Example: consider an expression of the form

W = (B - C) * F / (H + J)

We have to rewrite this in parenthesis-free (reverse) Polish notation using postfix form; thus we have

B C - F * H J + /

The syllable-string program for this expression, and the evaluation it produces, is shown below; a short sketch after the table traces the same evaluation.

Program              Action           Contents of stack
Value call B         stack B          B
Value call C         stack C          B C
Subtract operator    B - C = r1       r1
Value call F         stack F          r1 F
Multiply operator    r1 * F = r2      r2
Value call H         stack H          r2 H
Value call J         stack J          r2 H J
Add operator         H + J = r3       r2 r3
Divide operator      r2 / r3 = r4     r4
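A minimal Python sketch that traces the same stack evaluation; the operand values assigned to B, C, F, H and J are arbitrary illustrative numbers:

    # Evaluate the reverse Polish string B C - F * H J + / with a stack,
    # printing the stack contents after each syllable, as in the table above.
    def eval_postfix(tokens, values):
        stack = []
        ops = {"-": lambda x, y: x - y,
               "*": lambda x, y: x * y,
               "+": lambda x, y: x + y,
               "/": lambda x, y: x / y}
        for tok in tokens:
            if tok in ops:                      # operator syllable
                y = stack.pop()
                x = stack.pop()
                stack.append(ops[tok](x, y))
            else:                               # value-call syllable
                stack.append(values[tok])
            print(tok, "->", stack)
        return stack.pop()

    values = {"B": 9, "C": 4, "F": 6, "H": 2, "J": 3}   # assumed sample values
    w = eval_postfix(["B", "C", "-", "F", "*", "H", "J", "+", "/"], values)
    print("W =", w)   # (9 - 4) * 6 / (2 + 3) = 6.0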

Tree representation

[Expression tree: the root is '/'; its left subtree is '*' applied to (B - C) and F, and its right subtree is '+' applied to H and J.]

VON NEUMANN ISA REGISTERS

Architected registers:
• Accumulator (AC), 40 bits - Holds the output of the ALU after an arithmetic operation, a datum
loaded from memory, the most-significant digits of a product, and the divisor for division.
• Multiplier-quotient register (MQ), 40 bits - Holds a temporary data value such as the multiplier,
the least-significant bits of the product as multiplication proceeds, and the quotient from division.

Implemented registers:
• Program counter (PC), 12 bits* - Holds the pointer to memory. The PC contains the address of
the instruction pair to be fetched next.
• Instruction buffer register (IBR), 40 bits - Holds the instruction pair when fetched from the memory.
• Instruction register (IR), 20 bits - Holds the active instruction while it is decoded in the control unit.
• Memory address register (MAR), 12 bits - Holds the memory address while the memory is being
cycled (read or write). The MAR receives input from the program counter for an instruction fetch
and from the address field of an instruction for a datum read or write.
• Memory data register (MDR), 40 bits - Holds the datum (instruction or data) for a memory read or
write cycle.

* The program counter is a special case. The PC can be loaded with a value by a branch
instruction, making it architected, but cannot be read and stored, making it implemented.

[Figure: the instruction fetch cycle.]

[Figure: the instruction decode and execute cycle.]

The primitive machines

The basic structure of the primitive computer is composed of the memory, data busses and
registers as shown below

[Figure: primitive computer structure, showing the memory, MAR, MBR, registers R0-R4, the ALU and the control unit.]

The operation of this primitive computer is described as follows:


If the first instruction of program is stored in address zero, the following steps are necessary in
order to execute the program.

i. Clear the MAR; the address of the first instruction (zero) is then in the MAR


ii. Transfer first instruction into MBR
iii. Decode MBR into R0, R1, R2, R3, R4
iv. Address of first operand in MAR
v. First operand in MBR
vi. Transfer first operand into R1
vii. Address of second operand in MAR
viii. Second operand in MBR
ix. Transfer second operand into R2
x. Address of result in MAR
xi. Perform the operation specified by the opcode
xii. Transfer result into R3
xiii. Address of next instruction in MAR
xiv. Go to step ii

Advantages of primitive machines

They offer simple and systematic machine operation.

All instructions may have the same format.
A relatively small number of operation codes is needed to specify the essential operations.

Disadvantages

It wastes memory space, memory access time and information-transfer capacity.

Three Address Instruction Format

To improve on this type of design, instead of specifying all four addresses explicitly in an instruction,
some of the addresses can be omitted by specifying them implicitly in the operation code, as follows:

Let the instructions be stored at consecutive locations in the memory


A program counter (PC) is employed to point to the address of instruction being executed.
After the present instruction is executed, the program counter is automatically incremented
by one to point to the address of the next instruction. Next instruction address may be
omitted from the instruction. The result is a 3-address machine as shown

OPCODE | Address of first operand | Address of second operand | Address of result

Two-Address Instruction Format

By designating the accumulator (AC) to store the result of the operation automatically, the memory
address of the result is omitted from the instruction. Since the operation result is not stored in
memory, this leads to a 2-address machine, as shown below.

OPCODE | Address of first operand | Address of second operand

One-Address Instruction Format

Let the two operands to be operated upon be fetched from memory one at a time, each
specified by an instruction carrying a single address, thus eliminating the second operand address.
We now have a one-address machine. The instruction format is shown below.

OPCODE | Address of first operand


This leads to the development of the classical machine organization shown below.

[Figure: classical machine organization, showing the memory, MAR, PC, accumulator (AC), ALU, IR (opcode and address fields), control network with its control signals, GO/WAIT flag (G), clock (CLK) and Major State Register (MSR).]

Memory Buffer Register (MBR) is used for reading the data out of memory or for writing data
into the memory. It has the same number of bits as the memory word.

Memory Address Register (MAR): is the means of addressing specific memory locations. Its size
depends on the number of words the computer can address directly; e.g. if it has 12 bits it can
address 4096 (4K) words of main memory directly.

Instruction Register (IR): is used to keep the operation code of the instruction currently being
executed by the processor. If the operation code has 8 bits, IR will also have 8 bits. The IR decodes
the opcode and initiates the steps necessary for the execution of the instructions.

Major State Register (MSR): all operations in the machine are performed step by step. One step
represents the basic computer cycle, which in modern machines lasts about 1 microsecond. Some
instructions are performed in 2 or 3 cycles, and not all the cycles are used for the same purpose.
The MSR controls the computer with three possible states: fetch, wait and execute. Each state is
used for a specific purpose.

FETCH: the fetch state is used to read the instruction from the memory and to decode it. It goes
through the following steps:

The address in the PC is transferred to the MAR and the contents of the memory location it specifies are read into the MBR
The contents of the PC are incremented by 1
The instruction is decoded: the operation code is transferred into the IR to cause
enactment, and the address part is transferred into the MAR
The machine then goes to the execute state.
Fetch state operation:
/F.P0/ : MAR ← PC
/F.P1/ : MBR ← m(MAR); PC ← PC + 1
/F.P2/ : IR ← MBR[OP]; MAR ← MBR[ADDR]    $$ decode instruction
/F.P3/ : MSR ← 2                          $$ go to execute state

EXECUTE: the execute state is used to execute instructions as follows:

For instructions which need an operand from the memory, the execute state is used to read
the operand into MBR and to perform the operation specified by the operation code of
the instruction, leaving the result in the accumulator (ACC). These instructions are ADD,
AND, LOAD e.t.c.

For the store instruction, this state is used to deposit the contents of the ACC in the specified
memory location
For the jump (JMP) instruction, during this state, the contents of the PC are transferred into
the main memory address designated by the instruction. Also, the address designated by
the instruction is transferred into the PC in order to change the program control
Execute state operation (e.g. for an ADD instruction):
/E.ADD.P1/ : MBR ← m(MAR)
/E.ADD.P2/ : ACC ← ACC + MBR
/E.ADD.P3/ : (G = 1): MSR ← 1; (G = 0): MSR ← 0    $$ done at the end of every instruction

There is no operation in the P0 clock pulse because this is a synchronous operation; e.g. for a CLA
(clear accumulator) instruction:

/E.CLA.P1/ : AC ← 0

/E.CLA.P3/ : (G = 1): MSR ← 1; (G = 0): MSR ← 0

Wait: during this state the computer is idle. It only checks whether to remain idle or to go to
the FETCH state.

GO/WAIT register (G): is used to indicate to the computer when to go into action and when to be
idle. Wait state operation:

/W.P3/ : (G = 1): MSR ← 1    $$ go to fetch state
         (G = 0): MSR ← 0    $$ stay in wait state
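A minimal Python sketch of the wait/fetch/execute state machine described above, for a tiny one-address machine; the memory layout, opcodes (CLA, ADD, HLT) and word format are illustrative assumptions, not those of any particular machine:

    # Simulate the major states of a simple one-address machine.
    WAIT, FETCH, EXECUTE = 0, 1, 2

    memory = {0: ("CLA", None), 1: ("ADD", 10), 2: ("ADD", 11), 3: ("HLT", None),
              10: 25, 11: 17}                    # instructions at 0-3, data at 10-11

    pc, acc, msr, g = 0, 0, FETCH, 1
    ir, mar, mbr = None, 0, 0

    while True:
        if msr == WAIT:
            msr = FETCH if g == 1 else WAIT       # /W.P3/
            if g == 0:
                break
        elif msr == FETCH:
            mar = pc                              # /F.P0/  MAR <- PC
            mbr = memory[mar]; pc += 1            # /F.P1/  MBR <- m(MAR); PC <- PC + 1
            ir, mar = mbr[0], mbr[1]              # /F.P2/  decode: IR <- opcode, MAR <- address
            msr = EXECUTE                         # /F.P3/  go to execute state
        elif msr == EXECUTE:
            if ir == "ADD":
                mbr = memory[mar]                 # /E.ADD.P1/  MBR <- m(MAR)
                acc = (acc + mbr) & 0xFF          # /E.ADD.P2/  ACC <- ACC + MBR
            elif ir == "CLA":
                acc = 0                           # /E.CLA.P1/  ACC <- 0
            elif ir == "HLT":
                g = 0                             # drop the GO flag so the machine waits
            msr = FETCH if g == 1 else WAIT       # /E.*.P3/
            print(f"executed {ir:>3}  ACC = {acc}")

    print("final ACC =", acc)                     # 25 + 17 = 42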

[Figure: the fetch and execute cycles - in the fetch cycle the machine locates, fetches and decodes the next instruction; the GO/WAIT flag then determines whether it proceeds to the execute cycle or waits.]

Each control state lasts for one computer cycle. It is divided into smaller time slices which can be
used to perform sequential logic operation. Each instruction determines the major state the
computer must enter for the execution of that instruction. The computer operates in
synchronism (unison) with a clock (CLK). A number of clock cycles are required to accomplish the
tasks specified by one instruction. The execution of one instruction is called an instruction cycle.
An instruction cycle typically consists of one or more machine cycles.

[Figure: the machine cycle and the Major State Register (MSR) encoding: Wait = 00, Fetch = 01, Execute = 10.]

Modern Computer Organisation

The classical machine organisation in its basic form is primarily a result of the minimum hardware
requirements in the early days of computer development. In recent years, technological
advances in semiconductors and magnetics have drastically reduced the cost of
digital electronic components. Hence, some wastefulness of hardware components is
justified in order to:

Improve the overall machine performance in speed, precision and versatility.


Render the design and construction of the machine simple and systematic, thus economical.
Make user’s programming and system programming easier and cheaper to prepare.

Thus over the years the basic classical organisation has been modified in a number of ways to
achieve these objectives. Some of the prominent features of modern computer design are:

Modular design
Micro-programmed control
Memory hierarchy

The modular design employs a number of standardised functional units linked by several data
buses to realise a systematic and flexible organisation. The functional units may be arranged to
operate simultaneously (parallel operation) or in cascade (pipeline operation). The aim is to
increase the effective computing speed by allowing several instructions to be executed at the
same time (i.e instruction level parallelism)

The simplest method is to have 2 or more independent CPUs sharing common memory as shown
below

[Figure: two CPUs, each with its own control unit, ALU and registers, sharing a common main memory (MM): a Multiple Instruction stream, Multiple Data stream (MIMD) computer.]
Usually, each CPU has its own program to run with no interaction between the CPUs. A
computer with multiple processors is called a “MULTIPROCESSOR SYSTEM”.
Some problems require carrying out the same computation on many sets of data; e.g. a
weather-prediction program might read hourly temperature measurements taken from
1000 weather stations and then compute the daily average at each station by performing
exactly the same computation on each set of 24 hourly readings. Since the same program
is used on each data set, a processor with one program counter and one instruction
decoder but N arithmetic (functional) units and N register sets could carry out
computations on N data sets simultaneously, as shown below. Such a machine is an array
processor, and the configuration is sometimes called a single instruction stream, multiple
data stream (SIMD) computer. An example of this type of organisation is the ILLIAC IV,
developed by the Burroughs Corporation. The ILLIAC IV consists of 4 functional units,
each of which operates on 64 data sets simultaneously; it can perform 4 different
computations at the same time, each computation being carried out on 64 sets of data,
making a total of 256 calculations in parallel.

[Figure: array processor (SIMD computer) - a single control unit driving several ALUs, each with its own registers, all sharing main memory (MM).]

The CDC 6600 and Cyber 74 use yet another form of parallel instruction execution. These machines have
separate arithmetic units for addition, multiplication, division and other operations, for a total
of 10 separate units. There is a single control unit that fetches and decodes instructions. As soon
as the type of an instruction is known, it is sent to the appropriate unit (e.g. addition) and the
fetching and decoding of the next instruction begins, concurrently with the execution of the first
instruction. A well-written program may have up to 10 instructions being executed
simultaneously. However, this scheme only makes sense if the time needed to fetch and decode
an instruction is very small compared to the time needed to carry it out, which is the case on the
CDC machines.

[Figure: a single CPU with multiple functional units - a control unit with an instruction analyser and decode unit dispatching to separate add, shift, multiply, compare, Boolean and floating-point units, all connected to main memory.]

Instruction Set Architectures

When computers were first developed, they had very small instruction sets, because the algorithms
and hardware for complicated tasks had not yet been developed. As computer design continued
into the 1960s, high-level languages (e.g., FORTRAN, Cobol, Lisp) evolved. Hardware designers
formed the idea that it might be useful to develop different computers for each type of language -
a FORTRAN machine, a COBOL machine, etc. When this approach was tried, it resulted in very
complicated instruction sets. Parallel to this development was the IBM philosophy of upward
compatibility, which they also tried to implement in hardware. This produced a huge collection of
instructions to support all their old computer programs (called legacy code). The result of this was
called Complex Instruction Set Computing (CISC), whose philosophy is summarized as follows:

• Bigger is better!
• Make the hardware as "smart" (and, hence, as complex) as possible.
• If you have a great sprawling architecture, that's ok. The hardware fabricators will
figure out what to do with your design.
• Don't worry about whether or not the system design is neatly partitioned into
layers. (One great sea of logic gates would be ok until we figure out something
else that works.)
• When one little part fails, the whole system dies and we will never find out why.
That's ok - just build another CPU chip from the ground up. Maybe that one will
work better.

CISC has many problems. Some of the bigger ones include lack of maintainability, lack of
verifiability, and brittleness. In practice, humans don't know how to verify or maintain really
complicated designs, and we don't yet have software that can perform all the verification and
maintenance tasks for us. As a result, as CISC machines got more and more complex, they failed
considerably more frequently. This yielded brittle (non-robust) performance in practical
computing problems. As the world became more dependent on computers, the CISC design
philosophy gradually became unacceptable.

In response to this problem, computer designers returned to the primitive roots of computer
science and developed the Reduced Instruction Set Computing (RISC) philosophy. The main
concept in RISC is a very simple Instruction Set Architecture: there is a compact
instruction set, into which every high-level command or instruction is translated by the
compiler. RISC computers tend to run faster, are smaller, and have fewer problems because they
have a simple instruction set. The RISC philosophy includes the following concepts:

• Small is beautiful.
• Keep the hardware simple and stupid (KISS design philosophy).
• Hardware and software designers should work together to make the architecture
simple and modular.
• Neatly partition the system design into layers,
• Then, take the vast majority of the functionality that CISC implements in
hardware, and put it into software (using compiler transformations) instead.
• Make the hardware and compiler robust, so the entire system can perform
reliably.

By keeping the hardware small and modular, the design and fabrication, maintenance, and
debugging costs are reduced. This makes sense from an economic perspective. It is also easier to
make new generations of RISC computer chips, and to produce them more quickly. This implies
potentially greater profits by shortening both the development and product life cycles.

Modularity and simplicity in hardware and software help designers and engineers achieve greater
robustness, because the system can be maintained and understood more easily. With simple
hardware and software, the functionality of the design can be verified more extensively using
software-based testing and proof-of-correctness. Finally, it makes sense to put the CISC
complexity into the compiler software, because the software can be modified and corrected much
more easily, inexpensively, and quickly than hardware can be changed or updated. Again, this
makes good economic sense, because development and re-work costs are significantly reduced.

Reduced Instruction Set Computer (RISC)

Basically, the machine is restricted to only memory-reference instructions, e.g. the ADD instruction
(note: there are also non-memory-reference instructions). RISC represents a CPU design strategy
emphasizing the insight that simplified instructions that "do less" may still provide higher
performance if this simplicity can be exploited to make instructions execute very quickly.

RISC is a type of microprocessor architecture that utilizes a small, highly optimized set of
instructions, rather than the more specialized sets of instructions often found in other types of
architecture. In CISC, memory-reference instructions are many and this wastes time; in RISC,
therefore, memory-reference instructions are limited to LOAD and STORE. Moreover, registers are
used to hold operands and so prevent persistent referencing of memory, hence there is a bank of
registers. One of the rules is that a RISC must be able to execute one instruction per cycle, which
means that each instruction must be very simple. Other features of RISC are: a 3-register
instruction format, with register-to-register arithmetic/logical operations; the only memory
operations are to load a register from memory and to store the contents of a register in memory;
RISC has a limited number of addressing modes, and instructions perform elementary operations.

Rules Guiding the Building of RISC

A minimum number of instructions and addressing modes: a minimum number of instructions
reduces the amount of control logic, and with fewer addressing modes decoding is
made easier, since fewer instruction forms have to be decoded.
Hardwired instruction decoding: hardwired control is used in conjunction with RISC, and this
makes the machine faster.
Single-cycle execution of most instructions: this helps to avoid complex instructions that would
always entail more than one cycle.
Only LOAD/STORE instructions deal with the memory: all ALU instructions work with values
in registers, and this makes machine operations faster.

RISC architecture, despite the above guidelines still relies on effective utilization of additional
architectural techniques such as:

• Pipelining
• Multiple data paths for parallelism and concurrency.
• Large register sets: this is used for parameter passing e.g in subroutine call. This makes
RISC to use register window to pass parameter.

PARAMETER PASSING IN RISC


Parameters are passed to subroutines without memory-intensive stack operations.
Subroutines need not save state before beginning actual work.
During subroutine execution, only information relevant to the main program is saved, not that of
the subroutine (e.g. the address of the next instruction and the values of the condition registers in
the main program before executing the subroutine).

Comparison of RISC and CISC

RISC                                                    CISC
Simple instructions taking one cycle                    Complex instructions taking multiple cycles
Only LOAD and STORE access memory                       Any instruction may reference memory
Highly pipelined                                        Less pipelined
Instructions executed by the hardware                   Instructions interpreted by the microprogram
Fixed-format instructions                               Variable-format instructions
Few instructions and modes                              Many instructions and modes
Complexity is in the compiler                           Complexity is in the microprogram
Multiple register sets (138 general-purpose registers)  Single register set (16 registers)
Instruction size about 32 bits                          Instruction size 16-256 bits

PARALLELISM

Parallelism is the architectural technique that allows an overlap of individual machine operations:
multiple operations execute in parallel (simultaneously), and the goal is to speed up execution.
A parallel computer (or multiple-processor system) is a collection of communicating processing
elements (processors) that cooperate to solve large computational problems quickly by dividing such
problems into parallel tasks, exploiting Thread-Level Parallelism (TLP).

Broad issues involved:


• The concurrency and communication characteristics of parallel algorithms for a given
computational problem (represented by dependency graphs)
• Computing Resources and Computation Allocation:
The number of processing elements (PEs), computing power of each element and
amount/organization of physical memory used.
What portions of the computation and data are allocated or mapped to each PE.
• Data access, Communication and Synchronization
How the processing elements cooperate and communicate.
How data is shared/transmitted between processors.
Abstractions and primitives for cooperation/communication and synchronization.
The characteristics and performance of parallel system network (System
interconnects).
Parallel Processing Performance and Scalability Goals:
• Maximize performance enhancement of parallelism: Maximize Speedup.
– By minimizing parallelization overheads and balancing workload on processors
• Scalability of performance to larger systems/problems.

The Need and Feasibility of Parallel Computing

Application demands: More computing cycles/memory needed
– Scientific/Engineering computing: CFD, Biology, Chemistry, Physics, ...
– General-purpose computing: Video, Graphics, CAD, Databases, Transaction
Processing,Gaming…
– Mainstream multithreaded programs, are similar to parallel programs
Technology Trends:
– Number of transistors on chip growing rapidly. Clock rates expected to continue to go
up but only slowly. Actual performance returns diminishing due to deeper pipelines.
– Increased transistor density allows integrating multiple processor cores per chip, creating Chip-
Multiprocessors (CMPs) even for mainstream computing applications (desktop/laptop, etc.).
Architecture Trends:
– Instruction-level parallelism (ILP) is valuable (superscalar, VLIW) but limited.
– Increased clock rates require deeper pipelines with longer latencies and higher CPIs.
– Coarser-level parallelism (at the task or thread level, TLP), utilized in multiprocessor
systems is the most viable approach to further improve performance.
Main motivation for development of chip-multiprocessors (CMPs)
Economics:
– The increased utilization of commodity off-the-shelf (COTS) components in high-performance
parallel computing systems, instead of the costly custom components used in
traditional supercomputers, leads to much lower parallel system cost.
Today's microprocessors offer high performance and have multiprocessor support,
eliminating the need for designing expensive custom PEs.
Commercial System Area Networks (SANs) offer an alternative to custom, more costly
networks.

Elements of Parallel Computing


1. Computing Problems:
a) Numerical Computing: Science and engineering numerical problems demand
intensive integer and floating point computations.
b) Logical Reasoning: Artificial intelligence (AI) demand logic inferences and symbolic
manipulations and large space searches.
2. Parallel Algorithms and Data Structures
– Special algorithms and data structures are needed to specify the computations and
communication present in computing problems (from dependency analysis).
Most numerical algorithms are deterministic using regular data structures.
Symbolic processing may use heuristics or non-deterministic searches.
Parallel algorithm development requires interdisciplinary interaction.
3. Hardware Resources
Processors, memory, and peripheral devices (processing nodes) form the hardware core
of a computer system.
Processor connectivity (system interconnects, network), memory organization, influence
the system architecture.
4. Operating Systems
Manages the allocation of resources to running processes.

Mapping to match algorithmic structures with hardware architecture and vice versa:
processor scheduling, memory mapping, interprocessor communication.
• Parallelism exploitation possible at: 1- algorithm design, 2- program writing,
3- compilation, and 4- run time.
5. System Software Support
Needed for the development of efficient programs in high-level languages (HLLs.)
Assemblers, loaders.
Portable parallel programming languages/libraries
User interfaces and tools.
6. Compiler Support
a) Implicit Parallelism Approach
Parallelizing compiler: Can automatically detect parallelism in sequential source code
and transforms it into parallel constructs/code.
Source code written in conventional sequential languages
b) Explicit Parallelism Approach:
Programmer explicitly specifies parallelism using:
Sequential compiler (conventional sequential HLL) and low-level library of the target
parallel computer , or ..
Concurrent (parallel) HLL .
Concurrency Preserving Compiler: The compiler in this case preserves the
parallelism explicitly specified by the programmer. It may perform some program flow
analysis, dependence checking, limited optimizations for parallelism detection.

Classification of Computer System

Flynn's taxonomy is a classification system for parallel (and sequential) computers and programs.
Flynn classified computers and programs by:

whether they operate with a single or multiple instruction streams, and
whether those instructions operate on a single or multiple data streams.

Single instruction, single data stream (SISD) computers have only one processor operating on
a single instruction stream, and a single memory for storing data, i.e. a single data stream. In such
computers instructions are executed sequentially and the system possesses little parallelism;
it is equivalent to a sequential program.

Multiple instruction single data stream (MISD) computers have a single data stream transmitted
to multiple processors, each of which will execute a different instruction sequence. (A rarely used
classification).

Single instruction multiple data stream (SIMD). In SIMD, a single instruction controls execution
of a number of processing elements. Each of the processing elements is associated with its own
data memory. Hence, a single instruction is executed upon data from multiple data streams. This
is analogous to doing the same operation repeatedly over a large data set (commonly done in
signal processing application).

Multiple instruction multiple data (MIMD), with this classification, a set of processors
simultaneously execute different instructions on different data sets. They have the capability to
execute several programs at the same time. (this is the most common type of parallel programs).

Classes of parallel application

Fine-grained parallelism: the application's subtasks must communicate many times per second.

Coarse-grained parallelism: the subtasks do not communicate many times per second.

Embarrassingly parallel: the subtasks rarely or never have to communicate.

Embarrassingly parallel applications are considered the easiest to parallelize.

Grain Size

[Figure: grain-size spectrum, from fine-grained through coarse-grained to embarrassingly parallel.]

Types of parallelism

Bit-level parallelism: a form of parallel computing based on increasing the processor word
size. Increasing the word size reduces the number of instructions the processor must
execute to perform an operation on variables whose sizes are greater than the length of
the word. For example, for an 8-bit processor to add two 16-bit integers, it must first add the
lower 8 bits of each integer and then the higher-order 8 bits together with the carry,
requiring two instructions; a 16-bit processor requires only one instruction (a sketch
follows below).
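A minimal Python sketch of the bit-level parallelism example above: a 16-bit addition carried out as two 8-bit additions with a carry (the function name is illustrative):

    # Add two 16-bit integers the way an 8-bit machine would: low bytes first,
    # then high bytes plus the carry out of the low-byte addition.
    def add16_on_8bit(a: int, b: int) -> int:
        low = (a & 0xFF) + (b & 0xFF)                    # first instruction: low bytes
        carry = low >> 8
        high = ((a >> 8) + (b >> 8) + carry) & 0xFF      # second instruction: high bytes + carry
        return (high << 8) | (low & 0xFF)

    a, b = 0x1234, 0x0FCD
    print(hex(add16_on_8bit(a, b)), hex((a + b) & 0xFFFF))   # both print 0x2201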
Instruction level parallelism: A computer program is a stream of instructions executed by a
processor. These instructions can be reordered and combined into groups which are then
executed in parallel without changing the result of the program. This is instruction level
parallelism.

[Figure: five instructions overlapping in a five-stage pipeline (IF, ID, EX, MEM, WB), each instruction one stage behind the previous one.]

Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds
to a different action; a processor with an N-stage pipeline can have up to N different instructions
at different stages of completion. The classic example of a pipelined processor is a RISC processor
with 5 stages: instruction fetch, instruction decode, execute, memory access and register write-back
(a sketch of the overlap follows below).
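A minimal Python sketch that prints the overlap of instructions in a 5-stage pipeline (the timeline layout is illustrative):

    # Show N instructions flowing through the IF, ID, EX, MEM, WB stages, one cycle apart.
    STAGES = ["IF", "ID", "EX", "MEM", "WB"]

    def pipeline_timeline(n_instructions: int):
        total_cycles = n_instructions + len(STAGES) - 1
        for i in range(n_instructions):
            row = ["    "] * total_cycles
            for s, stage in enumerate(STAGES):
                row[i + s] = f"{stage:<4}"      # instruction i is in stage s at cycle i + s
            print(f"I{i + 1}: " + "".join(row))

    pipeline_timeline(5)   # five instructions, up to five in flight at once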

Data parallelism: is inherent in program loops, which focuses on distributing the data across
different computing nodes to be processed in parallel. Parallelizing loops often leads to similar
(not necessarily identical) operation sequences or functions being performed on elements of a
large data structure.

Task parallelism: is the characteristic of a parallel program that entirely different calculations can
be performed on either the same or different sets of data. This contrasts with data parallelism,
where the same calculation is performed on the same or different sets of data.

Multiprocessing

Multiprocessing is the coordinated processing of programs by more than one computer


processor. Multiprocessing is a general term that can mean the dynamic assignment of a program
to one of two or more computers working in tandem or can involve multiple computers working
on the same program at the same time (in parallel).

With the advent of parallel processing multiprocessing is divided into two:

Symmetric multiprocessing (SMP), or tightly coupled multiprocessing: the processors share
memory and the I/O bus or data path, and a single copy of the operating system is in charge
of all the processors. SMP, also known as a "shared everything" system, does not usually
exceed 16 processors. SMP allows either system or user tasks to run on any processor,
which is more flexible and leads to better performance.
Massively parallel processing (MPP), or loosely coupled multiprocessing: may have up to 200 or more
processors working on the same application. Each processor has its own operating system
and memory, but an interconnect arrangement of data paths allows messages to be sent
between processors. An MPP system is known as a "shared nothing" system.

Tightly coupled systems fall into the uniform memory access model (UMA), the non-uniform memory access (NUMA) model and the cache-only memory architecture (COMA) model.

The aim of multiprocessing is to double performance by using two processors instead of one. In
reality it does not work this well, although multiprocessing does result in improved performance
under certain conditions. In order to employ multiprocessing effectively, the computer system
must have all of the following in place:

Motherboard support: a motherboard capable of handling multiple processors. This means


additional sockets or slots for the extra chips and chip sets capable of handling the
multiprocessing arrangement.
Processor support: processors that are capable of being used in a multiprocessing system.
Not all are; in fact, some versions of the same processor are while others are not.
Operating system support: an OS that supports multiprocessing, such as Windows NT or one
of the flavours of Unix.
Protocol support: a multiprocessing protocol which dictates how the processors and chipsets
talk to one another.

Multiprocessing is most effective when used with application software designed for it. Multiprocessing is managed by the OS, which allocates the different tasks to be performed by the various processors in the system. Intel processors such as the Pentium and Pentium Pro use an SMP protocol called APIC, and Intel chipsets that support multiprocessing, such as the 430HX, 440FX and 450GX/KX, are designed to work with it.

Pipelining
A pipeline is a set of data processing elements connected in series, so that the output of one
element is the input of the next. The elements of a pipeline are often executed in parallel or in
time-sliced fashion. In that case, some amount of buffer storage is often inserted between
elements.

Computer Related Pipelines

Instruction pipelines: used in processors to allow overlapping execution of multiple


instructions with the same circuitry. The circuitry is divided up into stages, each stage
processes one instruction at a time.
Graphic pipelines: found in most graphics cards, which consists of multiple arithmetic units
or complete CPUs, that implement the various stages of common rendering operations
(perspective projection, window clipping, colour and light calculation, rendering etc)
Software pipelines: consist of multiple processes arranged so that the output stream of one process is automatically and promptly fed as the input stream of the next one. Unix pipelines are the classical implementation of this concept, as sketched below.
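
A software pipeline can be sketched with ordinary Python generators (an illustrative example, not taken from the notes); each stage consumes the output stream of the previous one, much like a Unix pipeline.

# Minimal software pipeline: three stages connected output-to-input, like Unix pipes.
def produce(n):                 # stage 1: generate raw items
    for i in range(n):
        yield i

def transform(items):           # stage 2: process each item as it arrives
    for x in items:
        yield x * x

def consume(items):             # stage 3: aggregate the stream
    return sum(items)

print(consume(transform(produce(10))))   # prints 285
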

[Figure: the classic 5-stage RISC pipeline. Five instructions are in flight at once, each offset by one clock cycle; the pipe latency is the time t for one instruction to pass through all five stages. The vertical axis is successive instructions; the horizontal axis is time.]
IF: Instruction Fetch
ID: Instruction Decode
EX: Execute
MEM: Memory Access
WB: Register Write Back

Instruction Fetch: during the IF stage, a 32-bit instruction is fetched from the instruction cache. The PC predictor sends the program counter (PC) to the instruction cache to read the current instruction; at the same time, it predicts the address of the next instruction by incrementing the PC by 4 (all instructions are 4 bytes long). In short, the instruction is read from memory and the program counter is incremented.

Instruction Decode: identify the input registers, read the register file, and decode the instruction fetched from memory during the previous stage. Most instructions have at most two register inputs; during the decode stage these two register names are identified within the instruction and the two named registers are read from the register file. While the register file is being read, instruction issue logic in this stage determines whether the pipeline is ready to execute the instruction. If not, the issue logic causes both the instruction fetch stage and the decode stage to stall; on a stall cycle, these stages prevent their initial flops from accepting new bits.

Execute

Execute: the operation specified by the instruction is carried out. Instructions in RISC machines can be divided into three latency classes according to the type of operation.

Register-register operation (single-cycle latency): add, subtract, compare and logical operations. During the execute stage, the two arguments are fed into a simple ALU, which generates the result by the end of the execute stage.
Memory reference (two-cycle latency): all loads from memory. During the EX stage, the ALU adds the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle.
Multi-cycle instructions (many-cycle latency): integer multiply and divide and floating-point operations. During the EX stage, the operands to these operations are fed to the multi-cycle multiply/divide unit.

Memory Access: the memory location to be read or written is accessed. During this stage, single-cycle latency instructions simply have their results forwarded to the next stage. This forwarding ensures that both single- and two-cycle instructions always write their results in the same stage of the pipeline, so that just one write port to the register file is needed and it is always available.

Write Back: the result obtained during the execution phase is written to the destination operand. During this stage, both single-cycle and two-cycle instructions write their results into the register file.
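
The staircase behaviour of the diagram above can be reproduced with a few lines of Python (an assumed illustration; the instruction names I1 to I5 are invented), printing which instruction occupies each of the five stages on every clock cycle of an ideal, stall-free pipeline.

# Print the stage occupancy of an ideal 5-stage pipeline, one row per clock cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_trace(instructions):
    n, k = len(instructions), len(STAGES)
    for cycle in range(n + k - 1):                 # total cycles = n + k - 1
        row = []
        for s in range(k):
            i = cycle - s                          # index of the instruction in stage s this cycle
            row.append(f"{STAGES[s]}:{instructions[i]}" if 0 <= i < n else f"{STAGES[s]}:--")
        print(f"cycle {cycle + 1:2d}  " + "  ".join(row))

pipeline_trace(["I1", "I2", "I3", "I4", "I5"])
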

Hazards: are situations in which a completely trivial pipeline would produce wrong answers.

Structural hazards: occur when two instructions might attempt to use the same resources at the same time. Classic RISC pipelines avoided these hazards by replicating hardware.
Control hazards: are caused by the delay between the fetching of an instruction and the decision about changes in control flow (branches and jumps).
Data hazards: occur when an instruction, scheduled blindly, would attempt to use data before the data is available in the register file.
Bypassing: suppose the CPU is executing the following piece of code
SUB r3, r4 -> r10
AND r10, 3 -> r11

Pipeline stage | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6
Fetch          | SUB     | AND     |         |         |         |
Decode         |         | SUB     | AND     |         |         |
Execute        |         |         | SUB     | AND     |         |
Access         |         |         |         | SUB     | AND     |
Write Back     |         |         |         |         | SUB     | AND

In cycle 3, SUB calculates the new value of r10. In the same cycle, AND is decoded and the value of r10 is fetched from the register file. However, the SUB instruction has not yet written its result to r10; its write-back occurs in cycle 5. Therefore, the value read from the register file and passed to the ALU in the execute stage of AND is incorrect.
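
Bypassing (operand forwarding) resolves this hazard by routing the ALU result of SUB straight back to the ALU input needed by AND, instead of waiting for write-back. The toy Python sketch below is a hypothetical illustration of the selection logic only, not the actual hardware: if the register an instruction needs was just computed by an older instruction still in the pipeline, the forwarded value is used instead of the stale register-file value.

# Toy operand-forwarding check for a value needed in the EX stage.
def select_operand(reg, regfile, ex_mem, mem_wb):
    """reg: register name needed now; ex_mem / mem_wb: (dest_reg, value) of older
    instructions still in the pipeline, or None. Forward the newest matching result."""
    if ex_mem and ex_mem[0] == reg:      # result just produced by the ALU last cycle
        return ex_mem[1]
    if mem_wb and mem_wb[0] == reg:      # result about to be written back
        return mem_wb[1]
    return regfile[reg]                  # no hazard: the register-file value is current

# Example: SUB has computed r10 = 6 but has not yet written it back.
regfile = {"r10": 99}                    # stale value still in the register file
print(select_operand("r10", regfile, ex_mem=("r10", 6), mem_wb=None))   # prints 6, not 99
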

[Figure: a superscalar pipeline issuing two instructions per clock cycle. Pairs of instructions move through the IF, ID, EX, MEM and WB stages together, with each pair one cycle behind the previous pair.]

A 5-stage pipelined superscalar processor, capable of issuing two instructions per cycle, can have two instructions in each stage of the pipeline, for a total of up to 10 instructions being executed simultaneously.

Cost drawbacks and benefits

Pipelining does not reduce the time for a single datum to be processed; it only increases the throughput of the system when processing a stream of data.

A deep pipeline leads to increased latency: the time required for a signal to propagate through a full pipe.

A pipelined system requires more resources (circuit elements, processing units, computer memory, etc.) than one that executes one batch at a time, because its stages cannot reuse the resources of a previous stage. Moreover, pipelining may increase the time it takes for an individual instruction to finish.

Design consideration

Balancing pipeline stages


Provision of adequate buffering between the pipeline stages, especially when the processing
times are irregular, or when data items may be created or destroyed along the pipeline.

Illustration.

Assembly of a car involves installing the engine, the hood and the wheels. Assume the installation of each part takes the following time: engine 20 minutes, hood 5 minutes and wheels 10 minutes. Using a pipeline, determine how long it takes to assemble 3 cars.

The assembly involves a 3-stage pipeline as follows:

Engine (20 min) -> Hood (5 min) -> Wheels (10 min)

The latency of the full pipe is 20 + 5 + 10 = 35 minutes.

Using the pipeline, the three cars overlap: car 2 starts its engine installation as soon as car 1's engine is finished, and car 3 follows car 2 in the same way. Because the engine stage (20 min) is the slowest, each additional car leaves the pipeline 20 minutes after the previous one.

Assembling the three cars individually would take 35 * 3 = 105 minutes, whereas using the pipeline takes 35 + 20 + 20 = 75 minutes.
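
The arithmetic generalises: without a pipeline, n items take n times the full latency; with a pipeline, the first item takes the full latency and every further item finishes one slowest-stage time later. The small Python helper below (an illustrative sketch, not part of the original notes) reproduces the 105-minute and 75-minute figures above.

# Assembly time with and without a pipeline of (possibly unbalanced) stages.
def time_without_pipeline(stage_times, n_items):
    return sum(stage_times) * n_items

def time_with_pipeline(stage_times, n_items):
    # First item needs the full latency; each further item is limited by the slowest stage.
    return sum(stage_times) + (n_items - 1) * max(stage_times)

car_stages = [20, 5, 10]                      # engine, hood, wheels (minutes)
print(time_without_pipeline(car_stages, 3))   # 105
print(time_with_pipeline(car_stages, 3))      # 75
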

A computer assembly process consists of four stages: installing the motherboard, installing the CPU and battery, installing the hard drive, and installing the add-on cards. Assume the installation of each component takes the following time: motherboard 28 s, CPU and battery 15 s, hard drive 12 s, and add-on cards 32 s.

i. What is the latency of assembling a single computer without the pipeline?
ii. What is the latency of assembling a single computer with the pipeline?
iii. How long does it take to assemble 50 computers without the pipeline?
iv. How long does it take to assemble 50 computers using the pipeline?

MEMORY SYSTEM ORGANIZATION

The memory construct.

The visible part of the memory system on a computer consists of the memory modules and their insertion slots, sometimes referred to as 'banks'. Each generation and type of memory has a slightly different layout to prevent accidental installation of incompatible modules, which could risk damage.

Measurements

Prefix | Symbol | Word        | Power of ten | Decimal equivalent
Peta   | P      | Quadrillion | 10^15        | 1,000,000,000,000,000
Tera   | T      | Trillion    | 10^12        | 1,000,000,000,000
Giga   | G      | Billion     | 10^9         | 1,000,000,000
Mega   | M      | Million     | 10^6         | 1,000,000
Kilo   | K      | Thousand    | 10^3         | 1,000
Micro  | μ      | Millionth   | 10^-6        | 0.000 001
Nano   | n      | Billionth   | 10^-9        | 0.000 000 001
Pico   | p      | Trillionth  | 10^-12       | 0.000 000 000 001
Commonly used metric prefixes with SI units.

Memory Hierarchy

Memory hierarchy is a pyramid structure that is commonly used to illustrate the significant differences among memory types. Different types of memory perform different functions and are classified based on their purpose, speed, complexity and cost of manufacturing. Generally, faster memories cost more to design and manufacture, so their capacity is usually more limited; conversely, slower memory and storage devices offer higher capacity.

At the highest level (closest to the processor) are the processor registers. Next comes one or
more levels of cache. When multiple levels are used, they are denoted L1, L2, etc… Next comes
main memory, which is usually made out of a dynamic random-access memory (DRAM). All of
these are considered internal to the computer system. The hierarchy continues with external
memory, with the next level typically being a fixed hard disk, and one or more levels below that
consisting of removable media such as ZIP cartridges, optical disks, and tape. As one goes down
the memory hierarchy, one finds decreasing cost/bit, increasing capacity, and slower access time.
It would be nice to use only the fastest memory, but because that is the most expensive memory,
we trade off access time and cost by using more of the slower memory. The trick is to organize
the data and programs in memory so that the memory words needed are usually in the fastest
memory.

Why Hierarchical Memory System?

1) Economics and performance.
2) Main memory has limited storage capacity, is relatively expensive, and is slow compared with cache memory (i.e. in the access time required to fetch data or instructions).
3) Cache memory is the most expensive, smallest and fastest memory, with the least access time.

However, using a hierarchical memory system, programs are processed fast. Programs which
are not manipulated currently can be kept in the auxiliary memory. As and when these are
needed, they can be loaded into memory. The cache memory would hold even more
frequently used programs such as subroutines and data and therefore, enhances processing
speed of the CPU.

Primary memory: this is a small, relatively fast and relatively low-cost storage unit that stores the data and instructions currently used by the CPU. It is directly addressed by the CPU, and its size determines the length and number of programs that can be stored at a time.

Characteristics of Memory Systems

Location

• Processor

• Internal – main memory

• External – secondary memory

Capacity

• Word size – natural unit or organization

• Number of words – number of bytes

Unit of Transfer

• Internal

o Usually governed by bus width

• External

o Usually a block which is much larger than a word

Addressable unit

o Smallest location which can be uniquely addressed

o On an external disk, the addressable unit is typically a cluster

Access Methods

• Sequential – tape

o Start at the beginning and read through in order

o Access time depends on location of data and previous location

• Direct – disk

o Individual blocks have unique address

o Access is by jumping to vicinity plus sequential search

o Access time depends on location of data and previous location

• Random - RAM

o Individual addresses identify location exactly

o Access time is independent of data location and previous location

• Associative – cache

o Data is located by a comparison with contents of a portion of the store

o Access time is independent of data location and previous location

Performance

• Access time (latency)

o The time between presenting an address and getting access to valid data

• Memory Cycle time – primarily random-access memory

o Time may be required for the memory to “recover” before the next access

o Access time plus recovery time

• Transfer rate

o The rate at which data can be transferred into or out of a memory unit (a small worked example follows this list)
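
As a small worked example (the timing figures below are assumed for illustration, not taken from the notes): for random-access memory the transfer rate is simply 1/(cycle time), while for non-random-access memory the time to read N bits is approximately the average access time plus N divided by the transfer rate.

# Illustrative timing figures (assumed values, not measurements from the notes).
cycle_time = 10e-9                 # 10 ns memory cycle time
transfer_rate = 1 / cycle_time     # random-access: about 1e8 transfers per second

access_time = 5e-3                 # 5 ms average access time of a disk-like device
rate_bits = 40e6                   # 40 Mbit/s sustained transfer rate
n_bits = 8 * 1024 * 1024           # read 1 MiB
t_n = access_time + n_bits / rate_bits   # T_N = T_A + N / R
print(transfer_rate, t_n)          # ~1e8 transfers/s and ~0.215 s
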

Physical Types

• Semiconductor – RAM

• Magnetic – disk and tape

• Optical – CD and DVD

• Magneto-optical

Physical Characteristics

• Volatile/non-volatile

• Erasable/non-erasable

• Power requirements

Organization

• The physical arrangement of bits to form words

• The obvious arrangement is not always used

The Memory Hierarchy

• How much? If the capacity is there, applications will be developed to use it.

• How fast? To achieve performance, the memory must be able to keep up with the processor.

• How expensive? For a practical system, the cost of memory must be reasonable in relationship to other components.

There is a trade-off among the three key characteristics of memory: cost, capacity, and access time.

• Faster access time – greater cost per bit

• Greater capacity – smaller cost per bit

• Greater capacity – slower access time

The way out of this dilemma is not to rely on a single memory component or technology, but to employ a memory hierarchy. As one goes down the hierarchy: (a) decreasing cost per bit; (b) increasing capacity; (c) increasing access time; (d) decreasing frequency of access of the memory by the processor. Thus smaller, more expensive, faster memories are supplemented by larger, cheaper, slower memories. The key to the success of this organization is item (d).

Locality of Reference principle

• Memory references by the processor, for both data and instructions, cluster

• Programs contain iterative loops and subroutines – once a loop or subroutine is entered, there are repeated references to a small set of instructions

• Operations on tables and arrays involve access to a clustered set of data words

Cache Memory Principles

Cache memory

• Small amount of fast memory

• Placed between the processor and main memory

• Located either on the processor chip or on a separate module

Cache Operation Overview

• Processor requests the contents of some memory location

• The cache is checked for the requested data

o If found, the requested word is delivered to the processor

o If not found, a block of main memory is first read into the cache, then the requested word is delivered to the processor (a sketch of the resulting average access time follows below)
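
The benefit of this arrangement is usually summarised by the average (effective) memory access time: hit time plus miss ratio times miss penalty. The numbers in the sketch below are assumed for illustration only.

# Average memory access time with one level of cache (assumed example numbers).
hit_time = 2e-9        # 2 ns to read a word that is in the cache
miss_penalty = 60e-9   # 60 ns extra to fetch the block from main memory on a miss
hit_ratio = 0.95       # 95% of references are found in the cache

amat = hit_time + (1 - hit_ratio) * miss_penalty
print(amat)            # 5e-09 seconds: close to cache speed, thanks to locality of reference
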

Types of Memory

MSD (Memory Storage Devices)

• Primary storage
o RAM: SRAM, DRAM, Molecular RAM, CAM (Content Addressable Memory, also called Associative Memory)
o ROM: ROM, PROM (OTP), RPROM, EPROM (UVROM), EEPROM, Flash

• Secondary storage
o Magnetic: tape
o Optical: CD-ROM, WORM (CD-R), Magneto-optical, CD-RW, DVD, Hologram Optical Disc
• ROM: Read Only memory, permanent storage (boot code, embedded code)
• SRAM: Static Random Access memory, cache and high speed access
• DRAM: dynamic RAM (Main Memory)
• SDRAM - synchronous dynamic RAM
• DDR: Double Data Rate memory
• DDR-SDRAM - double-data-rate SDRAM
• MDRAM – multi-bank DRAM
• ESDRAM - cache-enhanced DRAM
• EPROM: Erasable Programmable ROM (typically UV-erasable); replaces ROM when reprogramming is required
• EEPROM: Electrically Erasable Programmable ROM. Alternative to EPROM for limited but regular reprogramming; holds device configuration info during power down (USB memories)
• Flash: advancement on EEPROM technology allowing blocks of memory locations to be written and erased at one time instead of individual bytes.

FAULT TOLERANCE

Fault tolerance is the property that enables a system to continue operating properly in the event
of the failure of (or one or more faults within) some of its components. If its operating quality
decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively
designed system in which even a small failure can cause total breakdown. Fault tolerance is
particularly sought after in high-availability or life-critical systems.

A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced
level, rather than failing completely, when some part of the system fails. The term is most
commonly used to describe computer systems designed to continue more or less fully operational
with, perhaps, a reduction in throughput or an increase in response time in the event of some partial
failure. That is, the system as a whole is not stopped due to problems either in the hardware or the
software. An example in another field is a motor vehicle designed so it will continue to be drivable if one of the tires is punctured, or a structure that is able to retain its integrity in the presence of damage due to causes such as fatigue, corrosion, manufacturing flaws, or impact.

Terminology
An example of graceful degradation by design is an image with transparency. In a viewer that recognises transparency, the composite image is displayed correctly; in a viewer with no support for transparency, the transparency mask is discarded and only the overlay remains. An image designed to degrade gracefully is still meaningful even without its transparency information.

A highly fault-tolerant system might continue at the same level of performance even though one
or more components have failed. For example, a building with a backup electrical generator will
provide the same voltage to all outlets even if the grid power fails.

A system that is designed to fail safe, or fail-secure, or fail gracefully, whether it functions at a
reduced level or fails completely, does so in a way that protects people, property, or data from
injury, damage, intrusion, or disclosure. In computers, a program might fail-safe by executing a
graceful exit (as opposed to an uncontrolled crash) in order to prevent data corruption after
experiencing an error. A similar distinction is made between "failing well" and "failing badly".

Fail-deadly is the opposite strategy, which can be used in weapon systems that are designed to kill
or injure targets even if part of the system is damaged or destroyed.

A system that is designed to experience graceful degradation, or to fail soft (used in computing,
similar to "fail safe"[2]) operates at a reduced level of performance after some component failures.
For example, a building may operate lighting at reduced levels and elevators at reduced speeds if
grid power fails, rather than either trapping people in the dark completely or continuing to operate
at full power. In computing an example of graceful degradation is that if insufficient network
bandwidth is available to stream an online video, a lower-resolution version might be streamed in
place of the high-resolution version.

A system with high failure transparency will alert users that a component failure has occurred,
even if it continues to operate with full performance, so that failure can be repaired or imminent
complete failure anticipated. Likewise, a fail-fast component is designed to report at the first point
of failure, rather than allow downstream components to fail and generate reports then. This allows
easier diagnosis of the underlying problem, and may prevent improper operation in a broken state.

Components
If each component, in turn, can continue to function when one of its subcomponents fails, this will
allow the total system to continue to operate as well. Using a passenger vehicle as an example, a
car can have "run-flat" tires, which each contain a solid rubber core, allowing them to be used even
if a tire is punctured. The punctured "run-flat" tire may be used for a limited time at a reduced
speed.

Redundancy
Redundancy is the provision of functional capabilities that would be unnecessary in a fault-free
environment.[3] This can consist of backup components which automatically "kick in" should one
component fail. For example, large cargo trucks can lose a tire without any major consequences.
They have many tires, and no one tire is critical (with the exception of the front tires, which are
used to steer). The idea of incorporating redundancy in order to improve the reliability of a system
was pioneered by John von Neumann in the 1950s.

Two kinds of redundancy are possible: space redundancy and time redundancy. Space redundancy
provides additional components, functions, or data items that are unnecessary for fault-free
operation. Space redundancy is further classified into hardware, software and information
redundancy, depending on the type of redundant resources added to the system. In time redundancy
the computation or data transmission is repeated and the result is compared to a stored copy of the
previous result.
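
A minimal sketch of time redundancy (hypothetical Python, not from the source): the computation is simply run twice and the two results are compared, so a disagreement signals a transient fault.

# Time redundancy: repeat the computation and compare with the stored first result.
def run_twice_and_check(computation, *args):
    first = computation(*args)
    second = computation(*args)          # repeated execution at a later time
    if first != second:
        raise RuntimeError("results disagree: possible transient fault")
    return first

print(run_twice_and_check(sum, [1, 2, 3]))   # 6
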

Criteria
Providing fault-tolerant design for every component is normally not an option. Associated
redundancy brings a number of penalties: increase in weight, size, power consumption, cost, as
well as time to design, verify, and test. Therefore, a number of choices have to be examined to
determine which components should be fault tolerant:[6]

• How critical is the component? In a car, the radio is not critical, so this component has
less need for fault tolerance.

• How likely is the component to fail? Some components, like the drive shaft in a car, are
not likely to fail, so no fault tolerance is needed.

• How expensive is it to make the component fault tolerant? Requiring a redundant car
engine, for example, would likely be too expensive both economically and in terms of
weight and space, to be considered.

Requirements
The basic characteristics of fault tolerance require:

1. No single point of failure – If a system experiences a failure, it must continue to operate
without interruption during the repair process.
2. Fault isolation to the failing component – When a failure occurs, the system must be able
to isolate the failure to the offending component. This requires the addition of dedicated
failure detection mechanisms that exist only for the purpose of fault isolation. Recovery
from a fault condition requires classifying the fault or failing component.
3. Fault containment to prevent propagation of the failure – Some failure mechanisms can
cause a system to fail by propagating the failure to the rest of the system. Firewalls or
other mechanisms that isolate a failing component to protect the system are required.
4. Availability of reversion modes

Replication
Spare components address the first fundamental characteristic of fault tolerance in three ways:

• Replication: Providing multiple identical instances of the same system or subsystem, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum (a majority-vote sketch follows this list);
• Redundancy: Providing multiple identical instances of the same system and switching to
one of the remaining instances in case of a failure (failover);
• Diversity: Providing multiple different implementations of the same specification, and
using them like replicated systems to cope with errors in a specific implementation.
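
The replication idea can be sketched in a few lines (an illustrative example only, with an artificially injected fault): three copies of the same module are run and a voter picks the majority answer, so a single faulty copy is masked, as in triple modular redundancy (TMR).

# Triple modular redundancy: three replicas, majority vote masks a single faulty result.
from collections import Counter

def majority_vote(results):
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica has failed")
    return value

def good(x):
    return x + 1

def faulty(x):
    return x + 2          # a replica with a fault injected for illustration

replicas = [good, good, faulty]
print(majority_vote([r(10) for r in replicas]))   # 11 — the faulty replica is outvoted
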

Disadvantages
Fault-tolerant design's advantages are obvious, while many of its disadvantages are not:

• Interference with fault detection in the same component. To continue the above
passenger vehicle example, with either of the fault-tolerant systems it may not be obvious
to the driver when a tire has been punctured. This is usually handled with a separate
"automated fault-detection system". In the case of the tire, an air pressure monitor detects
the loss of pressure and notifies the driver. The alternative is a "manual fault-detection
system", such as manually inspecting all tires at each stop.

• Interference with fault detection in another component. Another variation of this


problem is when fault tolerance in one component prevents fault detection in a different
component. For example, if component B performs some operation based on the output
from component A, then fault tolerance in B can hide a problem with A. If component B
is later changed (to a less fault-tolerant design) the system may fail suddenly, making it
appear that the new component B is the problem. Only after the system has been carefully
scrutinized will it become clear that the root problem is actually with component A.

• Reduction of priority of fault correction. Even if the operator is aware of the fault,
having a fault-tolerant system is likely to reduce the importance of repairing the fault. If

the faults are not corrected, this will eventually lead to system failure, when the fault-
tolerant component fails completely or when all redundant components have also failed.

• Test difficulty. For certain critical fault-tolerant systems, such as a nuclear reactor, there
is no easy way to verify that the backup components are functional. The most infamous
example of this is Chernobyl, where operators tested the emergency backup cooling by
disabling primary and secondary cooling. The backup failed, resulting in a core meltdown
and massive release of radiation.

• Cost. Both fault-tolerant components and redundant components tend to increase cost.
This can be a purely economic cost or can include other measures, such as weight.
Manned spaceships, for example, have so many redundant and fault-tolerant components
that their weight is increased dramatically over unmanned systems, which don't require
the same level of safety.

• Inferior components. A fault-tolerant design may allow for the use of inferior
components, which would have otherwise made the system inoperable. While this
practice has the potential to mitigate the cost increase, use of multiple inferior
components may lower the reliability of the system to a level equal to, or even worse
than, a comparable non-fault-tolerant system.

Related terms
There is a difference between fault tolerance and systems that rarely have problems. For instance,
the Western Electric crossbar systems had failure rates of two hours per forty years, and therefore
were highly fault resistant. But when a fault did occur they still stopped operating completely, and
therefore were not fault tolerant.

A number of recent trends, such as harsh environments, novice users, larger and more complex systems, and rising downtime costs, have accelerated interest in making general-purpose computer systems fault tolerant. The primary goals of fault tolerance are to avoid downtime and to ensure correct operation even in the presence of faults; more specifically, they include high availability, long life, postponed maintenance, high-performance computing, and critical computations. System performance, minimally defined as the number of results per unit time multiplied by the uninterrupted length of time of correct processing, should not be compromised. In real systems, however, price-performance trade-offs must be made; fault-tolerance features will incur some costs in hardware, in performance, or both.

Fault-tolerance features basically allow the computer to keep executing in the presence of defects. Such systems are usually classified as either highly reliable or highly available. Reliability, as a
function of time, is the conditional probability that the system has survived the interval [0,t], given
that it was operational at time t=0. Highly reliable systems are used in situations in which repair
cannot take place (e.g. spacecraft) or in which the computer is performing a critical function for
which even the small amount of time lost due to repairs cannot be tolerated (e.g. flight-control
computers). Availability is the intuitive sense of reliability. A system is available if it is able to
perform its intended function at the moment the function is required. Formally, the availability of

a system as a function of time is the probability that the system is operational at the instant of time,
t. If the limit of this function exists as t goes to infinity, it expresses the expected fraction of time
that the system is available to perform useful computations.
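
As a worked illustration with assumed figures, steady-state availability is commonly estimated as the mean time to failure divided by the sum of the mean time to failure and the mean time to repair.

# Steady-state availability from assumed MTTF / MTTR figures (illustrative only).
mttf_hours = 10_000          # mean time to failure
mttr_hours = 2               # mean time to repair
availability = mttf_hours / (mttf_hours + mttr_hours)
print(round(availability, 5))        # 0.9998 -> operational about 99.98% of the time
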

Faults and their manifestation

Understanding how a system fails is certainly necessary before designing a fault-tolerant system. Basically, failures begin as physical failures, logical faults then arise, and system errors result. The definitions involved in this propagation process are as follows:

• Failure. Physical change in hardware.


• Fault. Erroneous state of hardware or software resulting from failures of components,
physical interference from the environment, operator error, or incorrect design.
• Error. Manifestation of a fault within a program or data structure. The error may occur
some distance from the fault site.
• Permanent / hard. Describes a failure, fault, or error that is continuous and stable. In
hardware, permanent failure reflects an irreversible physical change.
• Intermittent. Describes a fault or error that is only occasionally present due to unstable hardware or varying hardware or software states (e.g. as a function of load or activity).
• Transient / soft. Describes a fault or error resulting from temporary environmental
conditions.

Transient faults and intermittent faults are the major source of system errors. Transient faults are considered not repairable, while intermittent ones are repairable. The manifestations of transient and intermittent faults, and of incorrect hardware or software design, are much more difficult to determine than those of permanent faults.

System Fault Response stages

Table-1 lists the ten system fault response stages, giving each stage a detailed explanation together with some additional points that need attention.

Table-1 System Fault Response Stages

1. Fault confinement
Explanation: limit the scope of a fault's effects to a local area, or protect other areas of the system from being contaminated by the fault.
Extra mention: this technique may be applied in both hardware and software. For instance, it can be achieved by liberal use of fault detection circuits, consistency checks before performing a function ("mutual suspicion"), and multiple requests/confirmations before performing a function.

2. Fault detection
Explanation: locate the fault. Multiple techniques have been developed and applied for fault detection; they can basically be classified into off-line fault detection and on-line fault detection. With off-line detection, the device is unable to perform any function during the test, while with on-line detection, operation can continue while the tests and the consequent work are being applied.
Extra mention: thus, off-line detection assures integrity before and possibly at intervals during operation, but not during the entire time of operation, while on-line techniques have to guarantee system integrity all through the detection stage. The arbitrary period that passes before detection occurs is called the fault latency.

3. Fault masking
Explanation: also called static redundancy, fault-masking techniques hide the effects of failures by ensuring that redundant information outweighs the incorrect information. Majority voting is an example of fault masking.
Extra mention: in its pure form, masking provides no detection. However, many fault-masking techniques can be extended to provide on-line detection as well; otherwise, off-line detection techniques are needed to discover failures.

4. Retry
Explanation: in some cases a second attempt at an operation is effective enough, especially for those transient faults which cause no physical damage. Retry can be applied more than once; when a certain number of retries is reached, the system needs to go through diagnosis and detection.
Extra mention: it may appear that "retry" should be attempted after recovery is effected, but many times an operation that failed will execute correctly if it is tried again immediately. For instance, a transient error may prevent a successful operation, but an immediate retry will succeed since the transient will have died away a few moments later.

5. Diagnosis
Explanation: the diagnosis stage becomes necessary when detection does not provide the fault location and other fault information.
Extra mention: refer to the article on "Diagnosis" listed on this course web page for more detail.

6. Reconfiguration
Explanation: if a fault is detected and a permanent failure located, the system may be able to reconfigure its components to replace the failed component or to isolate it from the rest of the system. The component may be replaced by backup spares; alternatively, it may simply be switched off and the system capability degraded, which is called graceful degradation.
Extra mention: graceful degradation is one of the dynamic redundancy techniques.

7. Recovery
Explanation: after detection and possibly reconfiguration, the effects of errors must be eliminated. Normally the system operation is backed up to some point in its processing that preceded the fault detection, and operation recommences from this point. This form of recovery, often called rollback, usually entails strategies using backup files, checkpointing, and journaling.
Extra mention: in recovery, error latency becomes an important issue because the rollback must go far enough back to avoid the effects of undetected errors that occurred before the detected one.

8. Restart
Explanation: restart may be necessary if too much information has been damaged by an error, or if the system is not designed for recovery. A "hot" restart, a resumption of all operations from the point of fault detection, is possible only if the damage that occurred is recoverable. A "warm" restart implies that only some of the processes can be resumed without loss. A "cold" restart corresponds to a complete reload of the system, with no processes surviving.

9. Repair
Explanation: replace the damaged component. Repair can be carried out either off-line or on-line.

10. Reintegration
Explanation: finally, the repaired device or module is reintegrated into the system. Especially for on-line repair, this has to be done without delaying system operation.

Reliability and Availability Techniques

Two approaches to increasing system reliability are fault avoidance and fault tolerance. Fault
avoidance results from conservative design practices such as the use of high-reliability
parts. Though the goal of fault avoidance is to reduce the likelihood of failure, even after the most
careful application of fault-avoidance techniques, failures will occur eventually owing to defects
in the system. In comparison to this approach, fault tolerance starts from the assumption that defects are very likely to surface during the system's operational stage, so the design is oriented towards keeping the system operating correctly in the presence of defects. Redundancy is a classic technique used in both the fault-avoidance and fault-tolerance approaches; with redundancy, a system is much more likely to pass through the ten fault response stages listed above.

Table-2 gives a graphical overview of the reliability techniques.

Table-2 Taxonomy of Reliability Techniques

Region: Fault avoidance
Techniques: Environment modification; Quality changes; Component integration level

Region: Fault detection
Techniques: Duplication; Error detection codes (M-of-N codes, parity, checksums, arithmetic codes, cyclic codes); Self-checking and fail-safe logic; Watchdog timers and time-outs; Consistency and capability checks

Region: Redundancy – static redundancy / masking
Techniques: NMR/voting; Error correcting codes (Hamming SEC/DED, other codes); Masking logic (interwoven logic, coded-state machines)

Region: Redundancy – dynamic redundancy
Techniques: Reconfigurable duplication; Reconfigurable NMR; Backup sparing; Graceful degradation; Reconfiguration; Recovery

There are other useful fault-tolerance techniques, such as hardware redundancy, N-version programming and graceful degradation; their details are not covered here and can be found in [1].
