System Software3160715 Handbook
System Software3160715 Handbook
System Software3160715 Handbook
SYSTEM SOFTWARE
(3160715 )
Lecture
Note
Year: 2020-21
Index
1.0 Overview of System Software : Introduction
2.0 :Software, Software Hierarchy, Systems Programming
3.0 : Machine Structure, Interfaces, Address Space, Computer Languages
4.0: Life Cycle of a Source Program
5.0: System Software Development, Recent Trends in Software
Development, Levels of System Software
6.0. Overview of Language Processors
7.0 Programming Languages and Language Processors, Language
Processing Activities
8.0 Program Execution, Fundamental of Language
Processing,
9.0 Symbol Tables
10.0 Data Structures for Language Processing: Search Data structures,
Allocation Data Structures.
11.0Assemblers : Elements of Assembly Language Programming
12.0 Design of the Assembler, Assembler Design Criteria, Example of an
assembly program Types of Assemblers , Working of an Assembler
13.0Two-Pass Assemblers, One-Pass Assemblers, Single pass Assembler
for Intel x86
14.0Algorithm of Single Pass Assembler, Multi-Pass Assemblers
15.0 Design Alternatives, Advanced Assembly Process, Variants of
Assemblers Design of two pass assembler
Characteristic of Software
Usability: Refers to the extent to which the software can be used with ease.
Efficiency: Refers to the ability of the software to use system resources in the most
effective and efficient manner.
Maintainability: Refers to the ease with which the modifications can be made in a
software system to extend its functionality, improve its performance, or correct errors.
Portability: Refers to the ease with which software developers can transfer software
from one platform to another, without (or with minimum) changes. In simple terms,
it
refers to the ability of software to function properly on different hardware and
software platforms without making any changes in it.
In addition to the above mentioned characteristics, robustness and integrity are also
important.
Robustness refers to the degree to which the software can keep on
functioning in spite of being provided with invalid data.
integrity refers to the degree to which unauthorized access to the software or data
can be
prevented.
Software Hierarchy
Application software
which is software that uses the computer system to perform special functions or provide
entertainment functions beyond the basic operation of the computer itself. There are
many
different types of application software, because the range of tasks that can be performed
with a modern computer is so large—see list of software.
System software
which is software that directly operates the computer hardware, to provide basic
functionality needed by users and other software, and to provide a platform for
running application software.
System software includes:
Operating systems
which are essential collections of software that manage resources and provides common
services for other software that runs "on top" of them. Supervisory programs, boot
loaders, shells and window systems are core parts of operating systems. In practice, an
operating system comes bundled with additional software (including application
software)
so that a user can potentially do some work with a computer that only has one operating
system.
Device drivers
System Programming
1) Programmers are expected to know the hardware and internal behavior of the
computer
system on which the program will run. System programmers explore these known
hardware
properties and write software for specific hardware using efficient algorithms.
2) Uses a low level programming language or some programming dialect.
3) Requires little run time overheads and can execute in a resource-constrained
environment.
) These are very efficient programs with a small or no runtime library
requirements.
5) Has access to systems resources, including memory
6) Can be written in assembly language
Interface
Types of Interface
1) Software Interface
2) Hardware Interface
• These hardware interfaces may be designed to support either parallel or serial data
transfer or both. Hardware interfaces with parallel implementations allow more than
one connection to carry data simultaneously, while serial allows data to be sent one
bit
at a time.
• One of the popular standard interfaces is Small Computer System Interface (SCSI)
that
defines the standards for physically connecting and communicating data between
peripherals and computers.
3) User Interface
User interface allows interaction between a computer and a user by providing various
modalities of interaction including graphics, sound, position, movement, etc. These
interfaces facilitate transfer of data between the user and the computing system. User
interface is very important for all systems that require user inputs.
Address space
Address space is the amount of memory allocated for all possible addresses for a
computational entity, such as a device, a file, a server, or a networked computer.
Address space may refer to a range of either physical or virtual addresses accessible
to a
processor or reserved for a process.
Computer Languages
A Computer language includes various languages that are used to communicate with
a
Computer machine. Some of the languages like programming language which is a set
of
codes or instructions used for communicating the machine.
Machine code is also considered as a computer language that can be used for
programming. And also HTML which is a computer language or a markup language
but
not a programming language.
Similarly there are different types of languages developed for different types of work
to be
performed by communicating with the machine. But all the languages that are now
available are categorized into two basic types of languages including Low-level
language
and High level language.
Low level languages are the machine codes in which the instructions are given in
machine language in the form of 0 and 1 to a Computer system. It is mainly designed
to operate and handle all the hardware and instructions set architecture of a Computer.
The main function of the Low level language is to operate, manage and manipulate
the
hardware and system components. There are various programs and applications
written in low level languages that are directly executable without any interpretation
or translation.
The most famous and the base of all programming languages “C” and “C++” are
mostly used Low level languages till today. Low level language is also divided into
two parts are Machine language and Assembly language.
Machine Language
The main advantage of using Machine language is that there is no need of a translator
or interpreter to translate the code, as the Computer directly can understand. But
there
are some disadvantages also like you have to remember the operation codes, memory
address every time you write a program and also hard to find errors in a written
program. It is a machine dependent and can be used by a single type of Computer.
Assembly Language
The main disadvantage of this language is that it is written only for a single type of
CPU and does not run on any other CPU. But its speed makes it the most used low
level language till today which is used by many programmers.
The high level languages are the most used and also more considered programming
languages that helps a programmer to read, write and maintain. It is also the third
generation language that is used and also running till now by many programmers.
They are less independent to a particular type of Computer and also require a
translator that can convert the high level language to machine language. The
translator
may be an interpreter and Compiler that helps to convert into binary code for a
Computer to understand. There is various high level programming languages like C,
FORTRAN or Pascal that are less independent and also enables the programmer to
write a program.
The Compiler plays an important role on the Computer as it can convert to machine
language and also checks for errors if any before executing. There are several high
level languages that were used earlier and also now like COBOL, FORTRAN,
BASIC,
C, C++, PASCAL, LISP, Ada, Algol, Prologue and Java. It is user-friendly as the
programs are written in English using words, symbols, characters, numbers that
needs
to be converted to machine code for processing.
o It is machine-independent.
o It can be used both as problem- and procedure-oriented.
o It follows simple English-like structure for program coding.
o It does not necessitate extensive knowledge of computer architectures.
o Writing code using HLL consumes less time.
o Debugging and program maintenance are easier in HLL.
The disadvantage of high level languages is as follows:
o It requires language translators for converting instructions from high level language
into low level object code.
The life cycle of a source program defines the program behavior and extends through
execution stage, which exhibits the behavior specified in the program. Every source
program goes through a life cycle of several stages.
Edit time
It is the phase where editing of the program code takes place and is also known as
design time. At this stage, the code is in its raw form and may not be in a consistent
state.
Compile time
At the compile time stage, the source code after editing is passed to a translator that
translates it into machine code. One such translator is a compiler. This stage checks
the program for inconsistencies and errors and produces an executable file.
Distribution time
It is the stage that sends or distributes the program from the entity creating it to an
entity invoking it. Mostly executable files distributed.
Installation time
Typically, a program goes through the installation process, which makes it ready for
execution within the system. The installation can also optionally generate calls to
other stages of a program’s life cycle.
Load time
This stage actively takes the executable image from its stored repositories and places
them into active memory to initiate the execution. Load time activities influenced by
the underlying operating system.
Run time
This is the final stage of the life cycle in which the programmed behavior of the
source program demonstrated.
Link time
At this stage, the specific implementation of the interface linked and associated to the
program invoking it. System libraries linked by using the look up of the name and
the
interface of the library needed during compile time or throughout the installation
time,
or invoked with the start or even during the execution process.
Software development process follows the Software Development Life Cycle (SDLC)
which has each step doing a specific activity till the final software is built. The
system software development process also follows all the stages of SDLC, which
are as follows:
System analysis: It investigates the problem on a large scale and gathers all the
information.
It identifies the execution environment and interfaces required by the software to be built.
System design: This is concerned with designing the blueprint of system software that
specifies how the system software looks like and how it will perform.
System tool acquisition: It decides and works around the software tools to develop the
functionalities of the system software.
Implementation: It builds the software using the software tools with all the
functionality,
interfaces, and support for the execution. This may be very specific as the system software
adheres to the architecture. Operating system support is sought for allocations and other
related matters.
System maintenance: Once the system software is ready, it is installed and used. The
maintenance includes timely updating software what is already installed.
Latest trends in program development from the coding perspective include the following:
Language Processors
Any computer system is made of hardware and software. The hardware understands a
language, which humans cannot understand. So we write programs in high-level
language, which is easier for us to understand and remember. These programs are
then fed into a series of tools and OS components to get the desired code that can be
used by the machine. This is known as Language Processing System.
Preprocessor
Interpreter
Assembler
Linker
Linker is a computer program that links and merges various object files together in
order to make an executable file. All these files might have been compiled by
separate assemblers. The major task of a linker is to search and locate referenced
module/routines in a program and to determine the memory location where these
codes will be loaded, making the program instruction to have absolute references.
Loader
Loader is a part of operating system and is responsible for loading executable files
into memory and execute them. It calculates the size of a program (instructions and
data) and creates memory space for it. It initializes various registers to initiate
execution.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for
platform (B) is called a cross-compiler.
Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it
into the source code of another programming language is called a source-to-source
compiler
There are mainly two types of language processing activity which bridges the
semantic gap
between source language and target language.
2. Program Execution
Two popular models for program execution are translation and interpretation.
A. Translation
The program translation model bridges the execution gap by translating a program
written in PL, called source program, into an equivalent program in machine or
assembly
language of the computer system, called target program.
B. Interpretation
The interpreter reads the source program and stores it in its memory.
The CPU uses the program counter (PC) to note the address of the next instruction
to be
executed.
The statement would be subjected to the interpretation cycle, which could consist
the
following steps:
A compiler can broadly be divided into two phases based on the way they compile.
Analysis Phase
Known as the front-end of the compiler, the analysis phase of the compiler reads the
source program, divides it into core parts and then checks for lexical, grammar and
syntax errors.The analysis phase generates an intermediate representation of the
source program and symbol table, which should be fed to the Synthesis phase as
input.
Synthesis Phase
Known as the back-end of the compiler, the synthesis phase generates the target
program with the help of intermediate source code representation and symbol table.
A compiler can have many phases and passes.
Pass : A pass refers to the traversal of a compiler through the entire program.
Phase : A phase of a compiler is a distinguishable stage, which takes input from the
previous stage, processes and yields output that can be used as input for the next stage.
A pass can have more than one phase.
The compilation process is a sequence of various phases. Each phase takes input from
its previous stage, has its own representation of source program, and feeds its output
to the next phase of the compiler. Let us understand the phases of a compiler.
as
Analysis phase
Lexical Analysis
The first phase of scanner works as a text scanner. This phase scans the source code
as a stream of characters and converts it into meaningful lexemes. Lexical analyzer
represents these lexemes in the form of tokens as:
<token-name, attribute-value>
Code #no
where Code can be Id or Op for identifier or operator respectively and no indicates the
entry
for the identifier or operator in symbol or operator table.
Consider following code
i: integer;
a, b: real;
a= b + i;
A = b + I
Id#1 Op#1 Id#2 Op#2 Id#3
Syntax Analysis
The next phase is called the syntax analysis or parsing. It takes the token produced by
lexical analysis as input and generates a parse tree (or syntax tree). In this phase,
token arrangements are checked against the source code grammar, i.e. the parser
checks if the expression made by the tokens is syntactically correct.
Syntax analysis processes the string of token to determine its grammatical structure
and
builds an intermediate code that represents the structure.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of
language. For example, assignment of values is between compatible data types, and
adding string to an integer. Also, the semantic analyzer keeps track of identifiers,
their types and expressions; whether identifiers are declared before use or not etc.
The semantic analyzer produces an annotated syntax tree as an output.
If node is operand, then type of operand is added in the description field of operand.
While evaluating the expression the type of b is real and i is int so type of i is
converted to real i*.
After semantic analysis the compiler generates an intermediate code of the source
code for the target machine. It represents a program for some abstract machine. It is
in between the high-level language and the machine language. This intermediate code
should be generated in such a way that it makes it easier to be translated into the
target machine code.
IR contains intermediate code and table.
Symbol table
Intermediate code
Memory allocation
The memory requirement of an identifier is computed from its type, length and
dimensionality and memory is allocated to it.
The address of the memory area is entered in the symbol table
Synthesis phase
Code Optimization
The next phase does code optimization of the intermediate code. Optimization can be
assumed as something that removes unnecessary code lines, and arranges the
sequence of statements in order to speed up the program execution without wasting
resources (CPU, memory).
Code Generation
In this phase, the code generator takes the optimized representation of the
intermediate code and maps it to the target machine language. The code generator
translates the intermediate code into a sequence of (generally) re-locatable machine
code. Sequence of instructions of machine code performs the task as the intermediate
code would do.
The synthesis phase may decide to hold the value of i* and temp in machine registers
and
may generate the assembly code
CONV_R AREG, I CONV_R TEMP, I
ADD_R AREG, B < - - ADD_R B , TEMP
MOV_R A, B
Symbol Tables
Lexical analysis is the first phase of a compiler. It takes the modified source code
from language preprocessors that are written in the form of sentences. The lexical
analyzer breaks these syntaxes into a series of tokens, by removing any white space or
comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer
works closely with the syntax analyzer. It reads character streams from the source
code, checks for legal tokens, and passes the data to the syntax analyzer when it
demands.
Tokens
A pattern
explains what can be a token, and these patterns are defined by means of regular
expressions.
In programming language, keywords, constants, identifiers, strings, numbers,
operators and punctuations symbols can be considered as tokens.
A source code can directly be translated into its target machine code, then why at all
we need to translate the source code into an intermediate code which is then
translated to its target code? Let us see the reasons why we need an intermediate
code.
If a compiler translates the source language to its target machine language without
having the option for generating intermediate code, then for each new machine, a full
native compiler is required.
Intermediate code eliminates the need of a new full compiler for every unique
machine by keeping the analysis portion same for all the compilers.
The second part of compiler, synthesis, is changed according to the target machine.
It becomes easier to apply the source code modifications to improve code
performance by applying code optimization techniques on the intermediate code.
Intermediate Representation
Intermediate codes can be represented in a variety of ways and they have their own
benefits.
Low Level IR - This one is close to the target machine, which makes it suitable for
register and memory allocation, instruction set selection, etc. It is good for
machine-independent optimization.
symbol table
A symbol table is simply a table which can be either linear or a hash table. It
maintains an entry for each name in the following format:
<symbol name, type, attribute>
For example, if a symbol table has to store information about the following variable
declaration:
static int interest;
then it should store the entry such as:
<interest, int, static>
The attribute clause contains the entries related to the name.
Implementation
If a compiler is to handle a small amount of data, then the symbol table can be
implemented as an unordered list, which is easy to code, but it is only suitable for
small tables only. A symbol table can be implemented in one of the following ways:
Among all, symbol tables are mostly implemented as hash tables, where the source
code symbol itself is treated as a key for the hash function and the return value is the
information about the symbol.
Operations
A symbol table, either linear or hash, should provide the following operations.
1. insert()
This operation is more frequently used by analysis phase, i.e., the first half of the
compiler where tokens are identified and names are stored in the table. This operation
is used to add information in the symbol table about unique names occurring in the
source code. The format or structure in which the names are stored depends upon the
compiler in hand.
An attribute for a symbol in the source code is the information associated with that
symbol. This information contains the value, state, scope, and type about the symbol.
The insert() function takes the symbol and its attributes as arguments and stores the
information in the symbol table.
For example:
int a;
should be processed by the compiler as:
insert(a, int);
2. lookup()
The format of lookup() function varies according to the programming language. The
basic format should match the following:
lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol table.
If the symbol exists in the symbol table, it returns its attributes stored in the table.
Two kinds of data structures can be used for organizing its entries:
o Linear data structure: Entries in the symbol table occupy adjoining areas of
memory.
This property is used to facilitate search.
o Nonlinear data structure: Entries in the symbol table do no occupy contiguous
areas
of memory. The entries are searched and accessed using pointers.
Each entry in the symbol table is comprised of fields that accommodate the
attributes of one symbol. The symbol field of fields stores the symbol to which entry
pertains.
The symbol field is key field which forms the basis for a search in the table.
The following entry formats can be used for accommodating the attributes:
o Fixed length entries: Each entry in the symbol table has fields for all attributes
specified in the programming language.
o Variable-length entries: the entry occupied by a symbol has fields only for the
attributes specified for symbols of its class.
o Hybrid entries: A hybrid entry has fixed-length part and a variable-length part.
Search data structures (Search structure) is used to create and organize various
tables of information and mainly used during the analysis of the program.
Search Operation: To enable and support search and locate activity for the entry
of symbol.
Delete Operation: To delete the entry of a symbol especially when identified by
language
processor as redundant declarations.
In sequential search organization, during the search for a symbol, probability of all
active
entries being accessed in the table is same.
For an unsuccessful search, the symbol can be entered using an ‘add’ operation into
the
table.
Tables using binary search organization have their entries assumed to satisfy
an ordering relation. It should be considered that for table containing ‘f’ occupied
entries, the
probability of successful search is log2f and unsuccessful search is log2f. The binary
search organization requires that entry number of a symbol table should not change
after ‘add’ operation. This may become limiting factor for addition and deletion during
language processing.
• A hash table, also known as a hash map is a data structure that has the ability to map
keys to the
values using a hash function.
• Hash table organization is an efficient m implementing associative arrays and symbol
tables that
outperform other data structures with its capability of performing 'm' accesses on 'n'
names.
• It has the following two parts:
o A hash table that contains a fixed array of 'm' pointers to storage table entries.
o Storage table entries organized into separate linked lists called buckets.
• Hash function is used for the mapping of a key value and the slot where that value
belongs
to the hash table.
• The hash function takes any key value from the collection and computes an integer
value
from it in the range of slot names, between 0 and m - 1.
• The advantage of using list is that it takes minimum possible space. On the other hand,
it may suffer for performance for larger values of 'n' and 'm’.
Self-Organizing List
• Searching in symbol table takes most of the time during symbol table management
process.
• The pointer field called 'LINK' is added to each record, and the search is controlled
by the order
indicated by the ‘LINK’.
• A pointer called 'FIRST' can be used to designate the position of the first record on
the
linked list,
and each 'LINK' field indicates the next record on the list.
• Self-organizing list is advantageous over simple list implementation in the sense that
frequently
referenced name variables will likely to be at the top of the list.
• If the access is random, the self-organizing list will cost time and space.
Search Trees
• Symbol tables can also be organized as binary tree organization with two pointer
fields, namely,
'LEFT' and 'RIGHT' in each record that points to the left and right sub trees
respectively.
• The left sub tree of the record contains only records with names less than the current
records name.
• The right sub tree of the node will contain only records with name variables greater
than the current name.
• The advantage of using search tree organization is that it proves efficient in searching
operations, which are the most performed operations over the symbol tables.
• A binary search tree gives both performance compared to list organization at some
difficulty in implementation.
Stack Allocation
• Stack is a linear data structure that satisfies last-in, first-out (LIFO) policy for its
allocation
and de allocation.
• This makes only last element of the stack accessible at any time.
• Implementing stack data structure requires use of Stack Base (SB) that points to first
entry
of stack, and a Top of Stack (TOS) pointer to point to last entry allocated to stack.
• The stack allocation is based on the principles of control stack in which entries are
Activation
Records (ARs) of some procedure.
• ARs are pushed and popped on each call (activations) and return (the end of
procedure),
respectively.
• On each activation request, the memory is allocated on the TOS and pointer is
incremented
by the size of allocation.
• On execution of return from the procedure, AR is deallocated and TOS is
decremented by
the same size.
• Figure shows the AR on stack during procedure call.
Heap Allocation
• Heaps are a kind of non-linear data structure that permits allocation and deallocation
of
entities in any (random) order as needed.
• Heap data structure returns a pointer to allocated and deallocated area in heap for an
allocation request.
• Hence, an allocated entity must maintain a pointer to the memory area allocated to it.
• Space allocation strategy using heaps is optimized for performance by maintaining
list of
free areas and implementing policies such as first-fit and best-fit for new object
allocation
Assemblers
Assembly language is the most basic programming language available to any processor.
Assembly language allows a programmer to work with operations that you can implement
directly on the CPU for executing programs. An assembly language source code file consists
of collection of statements that helps the assembler to run the programs. These
statements in assembly language include:
The basic building block of an assembly language program includes characters, identifiers,
labels, constants and assembly language counter.
1. Instructions
2. Integer expressions
3. Reserved words and identifiers
4. Directives
1. Instructions
Instructions are assembled to machine code by assembler and executed at run-time by the
CPU.
Consider the following example of instruction used in assembly language:
A Labels
There is not much difference between an identifier and a label. A label is written as an
identifier
immediately followed by a colon (:). A label represents the current value of the current
location
counter that holds the current location of the instruction. You can use a label in assembler
instruction as an operand. A label consists of a digit between 0 and 9 and must be followed by
a
colon. Local numeric labels allow compilers and programmers to use label names temporarily.
The labels that have been created by you can be reused number of times throughout the
program
for representing the current value of the current location counter. The following example
shows a
label x that contains a value 20.
x: .word 20!int x = 20;
B Operands
You can use various operands in assembly language programming that help initialise data and
store data. Operands may be of following types:
• constant (immediate value)
• constant expression
• memory (data label)
• registers
1 Constants:
There are four different types of constants such as numeric, character, string and
floating -point, which helps store data. A numeric constant also starts with a digit and can be a
decimal, hexadecimal or octal. The rules for a numeric constant are as follows:
• Decimal constants contain digits between 0 to 9 only.
• Hexadecimal constants start with 0x (or 0X), followed decimal digits by between one to
eight or hexadecimal digits (0 to 9, a to f and A to F).
• Octal constants start with 0 and are followed by one to eleven octal digits (0 to 7). Values
that do not fit in 32 bits are big nums.
2 Constant Expression: A single character constant consists of a single quotation mark („)
followed by an ASCII character. It can be an alphanumeric character, i.e., from A to Z or
from a
to z or from 0 to 9. The characters may also include other printable ASCII characters that
include
#, $, :, ., +, -, *, / and |. Additionally, the non-printable characters, which are ASCII characters
such as space, tab, return and new line. Following example uses special character constants
for
creating a program using assembly language programming
3 Memory: The basic unit of memory is a byte that comprises of eight bits of
information. Table lists some other units of memory
4 Registers: The registers are used mainly for storing various arithmetic and logical operations.
There are 16 general-purpose registers in a processor and each consists of 32 bits of memory.
In
addition, there are 4 floating-point registers, each consisting of 64 bits of memory. The
general
purpose registers are sometimes used as base registers also. For example, the instruction
Add A1,16 interprets an add instruction. This instruction depicts a number that is to be
added to
the contents of register A1.
C. Comments
A comment explains the program‟s purpose and are of two types: single-line and multiple-
line
comments. Single-line comments begin with semicolon (;) and multi-line comments begin
with
COMMENT directive, and a programmer-chosen character ends with the same programmer
chosen character
2 Integer Expressions
An integer expression contains various constants and operators such as + and *. Table 1.2 lists
various operators of integer expression and their precedence order that allow you to evaluate
an
Expression
Reserved words can be instruction mnemonics, directives, type attributes, operators and
predefined symbols that are predefined in the language. The reserved words used in assembly
language include following general categories of words:
4 Directives
Directives are the commands that are recognised and acted upon by the assembler. Directives are
used for various purposes such as to declare code, data areas, select memory model and declare
procedures. Directives are not case-sensitive and different assemblers can have different
directives. Each group of variable declaration should be preceded by a data directive. Each group
of assembly language instructions should be preceded by a text directive. The model directive
tells the assembler to generate code that will run in protected mode and in 32-bit mode. The end
directive marks the end of the program. The data and code directives mark the beginning of the
data and code segments. The data is a read and write segment, and code is a read-only segment.
Example: Following example defines directives to allocate space for three variables, x, y and z.
You should initialise these variables with decimal 20, hexadecimal 3fcc and decimal 41,
respectively
You need to follow a syntax for creating a program in assembly language. Each assembly
language statement has the following general form:
Consider the following assembly program to add and subtract two 32-bit integers val
ASSEMBLY SCHEME
Once you have come across the elements of assembly language, you will know how to
communicate with the computer that uses assembly language. The scheme followed in the
assembly language is considered as a largely machine-dependent scheme used by the
programmers. The assembly language is preferred by programmers as it is easy to read and uses
symbolic addresses.
The assembly scheme can be structured with the help of the following components:
• BALR: This is an instruction which informs an assembler to load a register with the next
address. This instruction branch the address in the second field. It is important to note
that in case the second operand is 0 in the BALR instruction, the execution starts
proceeding with the next instruction. The major difference between the BALR and the
USING components is that the BALR instruction loads a base register, whereas the
USING pseudo-op does not load a register in the memory, rather it informs the assembler
about the contents of the base register. If the register does not contain the address that is
specified by the USING component, it results in a program error.
• START: This pseudo-op informs an assembler about the beginning of a program. It also
allows a user to give a name to the program.
• END: This pseudo-op informs an assembler that the last address of a program has been
reached.
• BR X: This is the instruction that branches to the location that has its address in general
register X.
The assembly scheme can also contain addresses in the operand fields of an instruction. Figure
shows the implementation of assembly scheme:
In the example, ASSEMBLY is the name of the program. The next instruction BALR 10,0 in the
program sets register 10 to the address of the next instruction, as the second operation in the
instruction is 0.
Next comes USING BEGIN+2,10, the pseudo- op in this instruction is indicating to the
assembler that register 10 as a base register and also that the content of the register 10 is the
address of the next instruction. SR 4,4 clears register 4 and sets index to 0.
The next instruction L 3, NINE loads the number 9 into register 3. The instruction L 2,DATA (8)
loads data (index) into register 2. Next comes A 2, FORTYFIVE, which adds 45 to register 2.
Similarly, the next two instructions store the updated value of data (index) and 8 is added to
register 8.
The next instruction BCT 3,LOOP will decrement register 3 by 1. In case, if the result generated
by this instruction comes to non-zero, then it will branch back to the LOOP instruction.
The instruction BR 14 would branch back to a caller that has called the LOOP instruction. The
next three instructions contain DC pseudo-op, which indicates that these are the constants in a
program. The next instruction is also a DC pseudo-op, which indicates the words that are to be
processed in a program. The last instruction is the END instruction, which informs the assembler
that the last card of the program has been reached.
An assembly scheme provides advantages such as readability and easy understanding of the
instructions. The primary disadvantage of an assembly scheme is that it uses an assembler to
translate a source program into object code.
An assembler is a program that is responsible for generating machine code instructions from a
source code program. Source code programs are written in assembly languages and an assembler
converts the written assembly language source program to a format that can be executed on a
processor. Each machine code instruction is replaced by a mnemonic, an abbreviation that
represents the actual information. An assembler is responsible for translating the programs
written in the mnemonic assembler language into executable form. These mnemonics are
frequently used in assembly programming as they reduce the chances of making an error.
Following are the features of an assembler:
Working of an Assembler
The essential task of an assembler is to convert symbolic text into binary values that are loaded
into successive bytes of memory. There are three major components to this task.
1. Allocating and initialising storage,
2. Conversion of mnemonics to binary instructions, and
3. Resolving addresses using single or two-pass assembler.
An assembler scans through a file maintaining a set of “location pointers” for next available byte
of each segment (.data, .text, .kdata and .ktext).
.align 2
array: .space 40
fun: .word 0xdeadbeef
The example consists of x, msg, array and fun labels and when an assembler encounters label
declarations while scanning a source file, it creates a SYMBOL table entry to keep track of its
location. Table 1.3 lists the SYMBOL table for different labels in the program:
Encoding Mnemonics
Mnemonics generate binary encoded instructions; unresolved addresses are set to 0 in the
location-offset pointer and set to unknown in the symbol table. An assembler also computes the
offsets to branch targets such as loading a register. Table 1.4 lists the updated symbols after
encoding the mnemonics:
Resolving Addresses
You can also resolve addresses that are stored in memory. Resolving of address can be done
using single-pass and two-pass assembler.
Single Pass
The first type of assembler developed was a single-pass assembler. The single-pass assembler is
not used in many systems and is primitive. The source code is processed only once in a single
pass assembler. Once the source code is processed, the labels and the operational codes those are
encountered while processing receives an address and opcode and therefore, are stored in
SYSTAB table and OPTAB table, respectively.
In the single pass, data and instructions are encoded and assigned offsets within their segment,
while the symbol table is constructed and unresolved address references are set to 0 in the
symbol table. Figure shows the old single pass or one pass approach to resolve the address.
Modern assemblers keep more information in their symbol table, which allows them to resolve
addresses in a single pass. Modern single-pass assemblers maintains two types of addresses as:
• Known addresses (backward references) are immediately resolved.
• Unknown addresses (forward references) are “back-filled” once they are resolved
Figure shows the symbol table after single-pass through the assembler in the modern way:
Two-Pass Assembler
In a two-pass assembler, the source code is passed twice through an assembler. The first pass in
a two-pass assembler is specifically for the purpose of assigning an address to all labels. Once all
the labels get stored in a table with the appropriate addresses, the second pass is processed to
translate the source code into machine code. After the second pass, the assembler generates the
object program, assembly listing and output information for the linker to link the program. This
is the most popular type of assembler currently in use. The flow chart shown below is for a two
pass assembler.
You need to follow a procedure that allows you to design a two-pass assembler. The general
design procedure of an assembler involves the following six steps:
The operation of a two-pass assembler can be explained with the help of the following functions
that are performed by an assembler while translating a program:
• It replaces symbolic addresses with numeric addresses.
• It replaces symbolic operation codes with machine operation codes.
• It reserves storage for instructions and data.
• It translates constants into machine representation.
A two-pass assembler includes two passes, Pass1 and Pass 2. Therefore, it is named as two-pass
assembler. In Pass 1 of the two-pass assembler, the assembler reads the assembly source
program. Each instruction in the program is processed and is then translated to the generator.
These translations are accumulated in appropriate tables, which help fetch a stored value. In
addition to this, an intermediate form of each statement is generated. Once these translations
have been made, Pass 2 starts its operation.
Pass 2 examines each statement that has been saved in the file containing the intermediate
program. Each statement in this file searches for the translations and then assembles these
translations into the machine language program.
The primary aim of the first pass of a two-pass assembler is to draw up a symbol table. Once
Pass 1 has been completed, all necessary information on each user-defined identifier should have
been recorded in this table.
A Pass 2 over the program then allows full assembly to take place quite easily. Refer to the
symbol table whenever it is necessary to determine an address for a named label, or the value of
a named constant. The first pass can also perform some error checking. Figure shows the steps
followed by a two-pass assembler:
The table stores the translations that are made to the program. The table also includes operation
table and symbol table. The operation table is denoted as OPTAB. The OPTAB table contains:
An assembler designer is responsible for creating OPTAB table. Once the table is created, an
assembler can use the tables contained within OPTAB, known as sub-tables. These sub-tables
include mnemonics and their translations. The introduction of mnemonics for each machine
instruction is subsequently gets translated into the machine language for the convenience of the
programmers.
3. Macro-operation definitions.
4. Macro-operation call, which includes calls such as PUSH and LOAD.
Another type of tables used in the two-pass assembler is the symbol table.
This table stores the symbols that are user-defined in the assembly program. These symbols may
be identifiers, constants or labels. Another specification for symbol tables is that these tables are
dynamic and you cannot predict the length of the symbol table. The implementation of
SYMTAB includes arrays, linked lists and a hash table.
Figure shows the implementation of a two-pass assembler in which you translate a program.
You can now notice from the code shown that JACK is the name of the program. You start from
the START instruction and come to know that it is a pseudo -op instruction, which is instructing
the assembler.
The next instruction in the code is the USING pseudo-op, which informs the assembler that
register 10 is the base register and at the execution time, USING pseudo-op contains the address
of the first instruction of the program.
Next in the code is the load instruction: L 1,EIGHT and to execute this instruction, the assembler
needs to have the address of EIGHT. Since, no index register is being used, therefore, you can
place a 0 in the relative address for the index register.
The next instruction is an ADD instruction. The offset for NINE is still not known. Similar is the
case with the next STORE instruction. You will notice that whenever an instruction is executed,
A location counter is maintained by the processor, which indicates that the relative address of an
instruction is being processed. Here, the counter is incremented by 4 in each of the instruction as
the length of a load instruction is 4. The DC instruction is a pseudo-op that asks for the definition
of data. For example, for NINE, a „9‟ is the output and an „8‟ for EIGHT.
In the instruction, DC F‟9‟, the word „9‟ is stored at the relative location 12, as the location
counter is having the value 12 currently. Similarly, the next instruction with label EIGHT that
holds the relative address with value 16 and label TEMP is associated with the counter value 20.
This completes the description of the column 2 of the above-mentioned code.
The macro processor replaces each macro call with the following lines:
A 1,DATA
A 2,DATA
A 3,DATA
The process of such a replacement is known as expanding the macro. The macro definition itself does
not
appear in the expanded source code. This is because the macro processor saves the definition of the
macro. In addition, the occurrence of the
macro name in the source program refers to a macro call. When the macro is called in the program, the
sequence of instructions corresponding to that macro name gets replaced in the expanded source.
Nested macro calls refer to the macro calls within the macros. A macro is available within other macro
definitions
also. In the scenario where a macro call occurs, which contains another macro call, the macro processor
generates
the nested macro definition as text and places it on the input stack. The definition of the macro is then
scanned and
the macro processor compiles it. This is important to note that the macro call is nested and not the
macro definition.
If you nest the macro definition, the macro processor compiles the same macro repeatedly, whenever
the section of
the outer macro is executed. The following example can make you understand the nested macro calls:
MACRO
SUB 1 &PAR
L 1, & PAR
A1, = F‟2‟
ST 1, &PAR
MEND
MACRO
SUBST &PAR1, &PAR2, &PAR3
SUB1 &PAR1
SUB1 &PAR2
SUB1 &PAR3
2MEND
You can easily notice from this example that the definition of the macro „SUBST‟ contains three
separate calls
to a previously defined macro „SUB1‟. The definition of the macro SUB1 has shortened the length of
the definition of
the macro „SUBST‟. Although this technique makes the program easier to understand, at the same time,
it is
considered as an inefficient technique. This technique uses several macros that result in macro
expansions on
multiple levels. The following code describes how to implement a nested macro call:
This is clear from the example that a macro call, SUBST, in the source is expanded in the expanded
source (Level
1) with the help of SUB1, which is further expanded in the expanded source (Level 2).
Macro calls within macros can involve several levels. This means a macro can include within itself any
number of
macros, which can further include macros. There is no limit while using macros in a program. In the
example
discussed, the macro SUB1 might be called within the definition of another macro. The conditional
macro facilities such as AIF and AGO makes it possible for a macro to call itself. The concept of
conditional macro expansion will be discussed later in this unit. The use of nested macro calls is
beneficial until it causes an infinite loop.
Apart from the benefits provided by the macro calls, there are certain shortcomings with this technique.
Therefore, it is always recommended to define macros separately in a separate file, which makes them
easier to maintain.
We have already come across the need and importance of macros. In this section, we are
concentrating on some of the features of macros. We have already discussed one of the primary
features of macro, which are nested macros. The features of the macro facility are as follows:
A 1,DATA1
A 2, DATA1
A 3, DATA1
:
:
:
A 1,DATA2
A 2,DATA2
A 3,DATA2
:
:
:
A 1,DATA3
A 2,DATA3
A 3,DATA3
DATA1 DC F‟5‟
DATA2 DC F‟10‟
DATA3 DC F‟15‟
In this example, the instruction sequences are very much similar but these sequences are not
identical. It is important to note that the first sequence performs an operation on an operand
DATA1. On the other hand, in the second sequence the operation is being performed on operand
DATA2. The third sequence performs operations on DATA3. They can be considered to perform
the same operation with a variable parameter or argument. This parameter is known as a macro
instruction argument or dummy argument. The program, previously discussed, can be rewritten
as follows:
Notice that in this program, a dummy argument is specified on the macro name line and is
distinguished by inserting an ampersand (&) symbol at the beginning of the name. There is no
limitation on supplying arguments in a macro call. The important thing to understand about the
macro instruction argument is that each argument must correspond to a definition or dummy
argument on the macro name line of the macro definition. The supplied arguments are
substituted for the respective dummy arguments in the macro definition whenever a macro call is
processed.
Conditional Macro Expansion
We have already mentioned the conditional macro expansion in the nested macro calls. In this
section, we will discuss about the important macro processor pseudo- ops such as AIF and AGO.
These macro expansions permit conditional reordering of the sequence of macro expansion.
They are responsible for the selection of the instructions that appear in the expansions of a macro
call. These selections are based on the conditions specified in a program. Branches and tests in
the macro instructions permit the use of macros that can be used for assembling the instructions.
The facility for selective assembly of these macros is considered as the most powerful
programming tool for the system software. The use of the conditional macro expansion can be
explained with the help of an example.
We have already mentioned the conditional macro expansion in the nested macro calls. In this
section, we will discuss about the important macro processor pseudo- ops such as AIF and AGO.
These macro expansions permit conditional reordering of the sequence of macro expansion.
They are responsible for the selection of the instructions that appear in the expansions of a macro
call. These selections are based on the conditions specified in a program. Branches and tests in
the macro instructions permit the use of macros that can be used for assembling the instructions.
The facility for selective assembly of these macros is considered as the most powerful
programming tool for the system software. The use of the conditional macro expansion can be
explained with the help of an example. Consider the following set of instructions:
LOOP 1 A 1,DATA1
A 2,DATA2
A 3,DATA3
LOOP 2 A 1,DATA3
A 2,DATA2
DATA1 DC F‟5‟
DATA2 DC F‟10‟
DATA3 DC F‟15‟
In this example, the operands, labels and number of instructions generated are different in each
sequence. Rewriting the set of instructions in a program might look like:
The labels starting with a period (.) such as .FINI are macro labels. These macro labels do not
appear in the output of the macro processor. The statement AIF (& COUNT EQ 1).FINI directs
the macro processor to skip to the statement labeled .FINI, if the parameter corresponding to
&COUNT is one. Otherwise, the macro processor continues with the statement that follows the
AIF pseudo-op.
AIF pseudo- op performs an arithmetic test and since it is a conditional branch pseudo-op, it
branches only if the tested condition is true. Another pseudo-op used in this program is AGO,
which is an unconditional branch pseudo-op and works as a GOTO statement. This is the label in
the macro instruction definition that specifies the sequential processing of instructions from the
location where it appears in the instruction. These statements are indications or directives to the
macro processor that do not appear in the macro expansions.
We can conclude that AIF and AGO pseudo-ops control the sequence of the statements in macro
instructions in the same way as the conditional and unconditional branch instructions direct the
order of program flow in a machine language program.
In this code, first the macro INSTRUCT has been defined and then within INSTRUCT, a new
macro &ADD is being defined. Macro definitions within macros are also known as “macro
definitions within macro definitions”.
definitions that are provided within macros. The macro calls are not considered here because the
macro pre-processor does not perform any macro expansion.
The macro preprocessor generally works in two modes: passive and active. The passive mode
looks for the macro definitions in the input and copies macro definitions found in the input to the
output. By default, the macro pre-processor works in the passive mode. The macro pre-processor
switches over to the active mode whenever it finds a macro definition in the input. In this mode,
the macro pre-processor is responsible for storing the macro definitions in the internal data
structures. When the macro definition is completed and the macros get translated, then the macro
Pre -processor switches back to the passive mode.
2. Saving the definitions: The pre-processor must save the macro instructions
definitions
that can be later required for expanding macro calls.
3. Recognising macro calls: The pre-processor must recognise macro calls along with
the
macro definitions. The macro calls appear as operation mnemonics in a program.
4. Replacing macro definitions with macro calls: The pre-processor needs to expand
macro calls and substitute arguments when any macro call is encountered. The pre
processor must substitute macro definition arguments within a macro call.
It is quite essential for a pre-processor designer to decide on certain issues such as whether or not
dummy arguments appear in the macro definition. It is important to note that dummy arguments
can appear anywhere in a macro definition.
The implementation of macro pre- processor can be performed by specifying the databases used
macro pre-processor. You can implement macro-preprocessor using:
• Implementation of two-pass algorithm
• Implementation of single-pass algorithm
First Pass
The first pass processes the definition of the macro by checking each operation code of the
macro. In first pass, each operation code is saved in a table called Macro Definition Table
(MDT). Another table is also maintained in first pass called Macro Name Table (MNT). First pass uses
various other databases such as Macro Name Table Counter (MNTC) and Macro Name Table Counter
(MDTC). The various databases used by first pass are:
1. The input macro source deck.
2. The output macro source deck copy that can be used by pass 2.
3. The Macro Definition Table (MDT), which can be used to store the body of the macro
definitions. MDT contains text lines and every line of each macro definition, except the
MACRO line gets stored in this table. For example, consider the code described in macro
expansion section where macro INC used the macro definition of INC in MDT. Table 2.1
shows the MDT entry for INC macro:
4. The Macro Name Table (MNT), which can be used to store the names of defined macros.
Each MNT entry consists of a character string such as the macro name and a pointer such
as index to the entry in MDT that corresponds to the beginning of the macro definition.
5. MNT entry for INCR macro:The Macro Definition Table Counter (MDTC) that indicates the next
available entry in the MDT.
The Macro Name Table Counter (MNTC) that indicates the next available entry in the MNT.
The Argument List Array (ALA) that can be used to substitute index markers for dummy
arguments prior to store a macro definition. ALA is used during both the passes of the macro
pre-processor. During Pass 1, dummy arguments in the macro definition are replaced with
positional indicators when the macro definition is stored. These positional indicators are used to
refer to the memory address in the macro expansion. It is done in order to simplify the later
argument replacement during macro expansion. The ith dummy argument on the macro name
card is represented in the body of the macro by the index marker symbol #. The # symbol is a
symbol reserved for the use of macro pre-processor
Second Pass
Second pass of two-pass algorithm examine each operation mnemonic such that it replaces
macro name with the macro definition. The various data-bases used by second pass are:
1. The copy of the input macro source deck.
2. The output expanded source deck that can be used as an input to the assembler.
3. The MDT that was created by pass 1.
4. The MNT that was created by pass 1.
5. The MDTP for indicating the next line of text that is to be used during macro expansion.
6. The ALA that is used to substitute macro calls arguments for the index markers in the
Two-Pass Algorithm
In two-pass macro-preprocessor, you have two algorithms to implement, first pass and second
pass. Both the algorithms examines line by line over the input data available. Two algorithms to
implement two-pass macro-preprocessor are:
• Pass 1 Macro Definition
• Pass 2 Macro Calls and Expansion
4. When macro name is not found then create expanded data file.
5. If end pseudo opcode is encountered then feed the expanded source file to assembler for
processing.
6. Else read next source of data input.
Figure shows the above steps followed to implement second pass algorithm in
diagrammatical structure:
C. Increase the value of MDTP by value one.
D. Read next line from MDT.
E. Substitute the values from the arguments list of the macro for dummy arguments.
F. If mend pseudo opcode is found then next source of input data is read.
G. Else expands data input.
4. When macro name is not found then create expanded data file.
5. If end pseudo opcode is encountered then feed the expanded source file to assembler for
processing.
7. Else read next source of data input.
Figure shows the above steps followed to implement second pass algorithm in
diagrammatical structure:
expansion of macro call, MDI indicator has value ON and retains value OFF otherwise. If
MDI indicator is on, then input data lines are read from MDT until mend pseudo opcode
is not encountered. When MDI is off, then data input is read from data source instead of
MDI.
• MDLC indicator: MDLC ensures you that macro definition is stored in MDT. MDLC is
a counter that keeps track of the numbers of macro1 and mend pseudo opcodes found.
Single-pass algorithm combines both the algorithms defined above to implement two-pass macro
pre
-processor. Following are the steps that are followed during single-pass algorithm:
1. Initialize MDTC and MNTC to value one and MDLC to zero.
2. Set MDI to value OFF.
3. Performs read operation.
4. Examine MNT to get the match with operation code.
5. If macro name is found then
A. MDI is set to ON.
B. Prepare argument list array containing a table of dummy arguments.
C. Performs read operation.
6. Else it examines that macro pseudo opcode is encountered. If macro pesudo opcode is
found then
A. Enter the name of the macro and current value of MDTC in MNT at entry number
MNTC.
B. Increment the MNTC to value one.
C. Prepare argument list array containing a table of dummy arguments..
D. Enter the macro card into MDT.
E. Increment the MDTC to value one.
F. Increment the MDLC to value one.
G. Performs read operation.
H. Substitute the index notations for the arguments list of the macro for dummy
arguments.
I. Enter data input line into MDT.
J. Increment the MDTC to value one.
K. If macro pseudo opcode is found then increments the MDLC to value one and
performs read operation.
L. Else it checks for mend pseudo opcode if not found then performs read operation.
M. If mend pseudo opcode is found then decrement the MDLC to value one.
N. If MDLC is equal to zero then it goes to step 2. Otherwise, it performs read
operation.
7. In case macro pseudo opcode is not found, then write it into expanded source card file.
INTRODUCTION
Earlier programmers used loaders that could take program routines stored in tapes and combine
and relocate them in one program. Later, these loaders evolved into linkage editor used for
linking the program routines but the program memory remained expensive and computers were
slow. A little progress in linking technology helped computers become faster and disks larger,
thus program linking became easier. For the easier use of memory space and efficiency in speed,
you need to use linkers and loaders.
Figure 1 shows a schematic flow of program execution, in which a translator, linker and loader
performs functions such as translating and linking a program for the final result of the program.
Object Modules
The object module of a program contains information, which is necessary to relocate and link the
program with other programs. The object module of a program, P, consists of various
components:
1. Header contains translated origin, size and execution start address of the program P.
Translated origin is the address of the origin assumed by the translator.
2. Program contains the machine language program corresponding to P.
3. Relocation table(RELOCTAB) maintains a list of entries that contains the translated
address for the corresponding entry.
4. Linking table(LINKTAB) contains information about the external references and public
definitions of the program P.
Binary Programs
A binary program is a machine language program, which contains a set of program units P
where Pi ε P. Pi has a memory address that is relocated at its link origin. Link origin is the
address of the origin assigned by the linker while producing a binary program. You need to
invoke a linker command for creating a binary program from a set of object modules. The
syntax for the linker command is:
linker <link origin>, <object module names>,[<execution start address>]
LOADERS
As already discussed that assemblers and compilers are used to convert the source program
developed by the user to the corresponding object program. A loader is a program that performs
the functions of a linker program and then immediately schedules the resulting executable
program for some kind of action. In other words, a loader accepts the object program, prepares
these programs for execution by the computer and then initiates the execution. It is not necessary
for the loader to save a program as an executable file. The functions performed by a loader are as
follows:
There are various other loading schemes such as compile and go loader, absolute loader,
relocating loader and direct linking loader.
This scheme is easy to implement, as the assembler simply places the code into core and the
loader, which consists of one instruction, transfers control to the starting instruction of the newly
assembled program. However, this scheme has several disadvantages. First, in this scheme, a
portion of memory is wasted. This is mainly because the core occupied by the assembler is not
available to the object program. Second, it is essential to assemble the user’s program deck every
time it is executed. Finally, it is quite difficult to handle multiple segments, if the source
programs are in different languages. This disadvantage makes it difficult to produce orderly
modular programs.
Absolute Loader
An absolute loader is the simplest type of loader scheme that fits the general model of loaders.
The assembler produces the output in the same way as in the “compile and go loader”. The
assembler outputs the machine language translation of the source program. The difference lies in
the form of data, i.e., the data in the absolute loader is punched on cards or you can say that it
uses object deck as an intermediate data. The loader in turn simply accepts the machine language
text and places it at the location prescribed by the assembler. When the text is being placed into
the core, it can be noticed that much core is still available to the user. This is because, within this
scheme, the assembler is not in the memory at the load time.
Although absolute loaders are simple to implement, they do have several disadvantages. First, it
is desirable for the programmer to specify the address in core where the program is to be loaded.
Furthermore, a programmer needs to remember the address of each subroutine, if there are
multiple subroutines in the program. Additionally, each absolute address is to be used by the
programmer explicitly in the other subroutines such that subroutine linkage can be maintained.
This is quite necessary for a programmer that no two subroutines are assigned to same or
overlapping locations. Programmer performs the functions of allocation and linker, the assembler
performs the relocation part and the loader in the absolute loader scheme performs loading.
Consider an example to see the implementation of an absolute loader.
In this section, the implementation of the absolute loader scheme is discussed. Figure 4.4 shows
the absolute loader scheme.
In the figure, the MAIN program is assigned to locations 1000-2470 and the SQRT subroutine is
assigned locations 4000-4770. This means the length of MAIN has increased to more than 3000
bytes, as it can be noticed from figure. If the modifications are required to be made in MAIN
subroutine, then the end of MAIN subroutine, i.e., 1000+3000=4000, gets overlapped with the
start of SQRT, i.e., with 4000. Therefore, it is necessary to assign a new location to SQRT. This
can be made possible by changing the START pseudo-op card and reassembling it. It is then
quite obvious to modify all other subroutines that refer to address of SQRT.
Relocating Loaders
Relocating loaders was introduced in order to avoid possible reassembling of all subroutines
when a single subroutine is changed. It also allows you to perform the tasks of allocation and
linking for the programmer. The example of relocating loaders includes the Binary Symbolic
Subroutine (BSS) loader. Although the BSS loader allows only one common data segment, it
allows several procedure segments. The assembler in this type of loader assembles each
procedure segment independently and passes the text and information to relocation and
intersegment references.
In this scheme, the assembler produces an output in the form of text for each source program. A
transfer vector that contains addresses, which includes names of the subroutines referenced by
the source program, prefixes the output text. The assembler would also provide the loader with
additional information such as the length of the entire program and also the length of the transfer
vector portion. Once this information is provided, the text and the transfer vector get loaded into
the core. Followed by this, the loader would load each subroutine, which is being identified in
the transfer vector. A transfer instruction would then be placed to the corresponding subroutine
for each entry in the transfer vector.
The output of the relocating assembler is the object program and information about all the
programs to which it references. Additionally, it also provides relocation information for the
locations that need to be changed if it is to be loaded in the core. This location may be arbitrary
in the core, let us say the locations, which are dependent on the core allocation. The BSS loader
scheme is mostly used in computers with a fixed-length direct-address instruction format.
Consider an example in which the 360 RX instruction format is as follows:
In this format, A2 is the 16-bit absolute address of the operand, which is the direct address
instruction format. It is desirable to relocate the address portion of every instruction. As a result,
the computers with a direct-address instruction format have much severe problems than the
computes having 360-type base registers. The 360-type base registers solve the problem using
relocation bits.
The relocation bits are included in the object desk and the assembler associates a
bit with each instruction or address field. The corresponding address field to each instruction
must be relocated if the associated bit is equal to one; otherwise this field is not relocated.
Direct-Linking Loaders
A direct-linking loader is a general relocating loader and is the most popular loading scheme
presently used. This scheme has an advantage that it allows the programmer
to use multiple procedure and multiple data segments. In addition, the programmer is free to
reference data or instructions that are contained in other segments. The direct-linking loaders
provide flexible intersegment referencing and accessing ability. An assembler provides the
following information to the loader along with each procedure or data segment. This information
includes:
• Length of segment.
• List of all the symbols and their relative location in the segment that are referred by other
segments.
• Information regarding the address constant which includes location in segment and
description about the revising their values.
• Machine code translation of the source program and the relative addresses assigned.
Let us understand the implementation of the direct-linking loader with the help of an example.
In the following example, there is a source program, which is being translated by an assembler in
order to produce the object code. The source program and the object code, which is being
translated, both are depicted code and their var.
You must have noticed that the card number 15 in the example contains a Define Constant (DC)
pseudo operation that provide instructions to the assembler. This pseudo-op helps in creating a
constant with the value that contains the address of TABLE, which places this constant value at
the location labelled as POINTER.
This is the point where the assembler does not know the final absolute address of TABLE. This
is because the assembler does not have any idea where the program is to be loaded. However, the
assembler has the knowledge that address of TABLE is 28th byte from the beginning of the
program. Therefore, the assembler places 28 in POINTER. It also informs to the loader that if
this program gets loaded at some other location, other than absolute location 0, then the content
of location POINTER is incorrect.
Another usage of DC pseudo-op in this program is on card number 17. This pseudo-op instructs
the assembler to create a constant with the value of the address of the subroutine ADD. It would
also cause this constant to be placed in the location, which is labelled as AADD.
The assembler cannot generate this constant, as the assembler does not have any idea of the
location where the procedure ADD is loaded. It is desirable for the assembler to provide
information to the loader such that the final absolute address of ADD can be loaded at the
designated location, i.e., AADD, when the program gets loaded
LINKAGE EDITOR
Linkage editor allows you to combine two or more objects defined in a program and supply
information needed to allow references between them. A linkage editor is also known as linker.
To allow linking in a program, you need to perform:
• Program relocation
• Program linking
Program relocation
Program relocation is the process of modifying the addresses containing instructions of a
program. You need to use program relocation to allocate a new memory address to the
instruction. Instructions are fetched from the memory address and are followed sequentially to
execute a program. The relocation of a program is performed by a linker and for performing
relocation you need to calculate the relocation_factor that helps specify the translation time
address in every instruction. Let the translated and linked origins of a program P be t_ origin and
l_origin, respectively. The relocation factor of a program P can be defined as:
relocation_factor = l_origin – t_origin
The algorithm that you use for program relocation is:
1. program_linked_origin:=<link origin> from linker command;
2. For each object module
3. t_origin :=translated origin of the object module; OM_size
Program Linking
Linking in a program is a process of binding an external reference to a correct link address. You
need to perform linking to resolve external reference that helps in the execution of a program.
All the external references in a program are maintained in a table called name table (NTAB),
which contains the symbolic name for external references or an object module. The information
specified in NTAB is derived from LINKTAB entries having type=PD. The algorithm that you
use for program linking is:
DESIGN OF A LINKER
You need to perform linking and relocation to explain the design of a linker. For example, the
design of a program LINKER helps explain linking and relocation. The output of a LINKER
program is in binary form with a .COM extension. You need to use the following syntax to
create a LINKER program:
stored in <executable file>, the <list of binary files> contains a list of external object module
name that are not specified in <object module name>. You need to resolve all the external object
module names for terminating the execution of LINKER program. The LINKER program is
explained using an example:
Alpha, beta and min are the names of the object modules that you need to link, and calculate is
the name given to the executable file, which is generated by LINKER. The load origin of the
program that you want to execute is 1000. The external names, which are not defined in object
modules alpha, beta and min needs to be searched in pas.lib library
For proper execution of the LINKER program you need to use a two-pass strategy:
• First pass
• Second pass
First Pass
In the first pass, the object module collects information about the segments and definitions that
are defined in the program. The algorithm that you use for the first pass is:
Algorithm (first pass of LINKER)
Second Pass
The second pass performs relocation and linking of the program. Second pass helps execute
program in work_area. You need to fetch the data from LEDATA records using segment index
and data offset and send them to the correct parts of work_area. The segment index helps obtain
the load origin of the segment from NTAB while data offset gives the load address of the first
byte of the data in the LEDATA record. A segment index can also contain a numeric value rather
than the name of the segment, LNAMES records help obtain the segment name. The obtained
segment name, the segment name is searched in NTAB for its load origin. For the relocation and
linking you need to FIXUPP records, while processing a FIXUPP the target data can be searched
in EXTDEF or SEGDEF to obtain the target name. The obtained target name can be searched in
NTAB for its load origin. LINKER constructs a table called the segment table (SEGTAB), which
contains the name of all the segments defined in the object module. A SEGTAB entry has the
following syntax:
The algorithm that you use for the second pass is:
1. list_of_object_modules: = object modules named in LINKER command;
2. Repeat step 3 until list_of_object_modules is empty.
3. Select an object module and process its object records.
A. If an LNAMES record
Enter the names in NAMELIST.
B. If a SEGDEF record
i:= name index; segment_name :=NAMELIST[i] ;
Enter (segment_name, load address from NTAB) in SEGTAB.
C. If an EXTTAB record
ƒ external_name: =name from EXTTAB record;
ƒ If external_name is not found in NTAB, then
-Locate an object module in the library which contains external_name as a
segment or public definition.
-Add name of object module to list_of_object_modules.
-Perform first pass of LINKER, for the new object module.
ƒ Enter (external_name, load address from NTAB) in
EXTTAB
D. If an LEDATA record
<load origin> + d;
ƒ Move the data from LEDATA into the memory area starting at the address
address_in_work_area.
E. If a FIXUPP record, for each FIXUPP specification
ƒ f:=offset from locat field;
ƒ fix_up_address:= address in work_area + f;
ƒ perform required fix up using a load address from SEGTAB or EXTTAB
and the value of the code in locat and fix dat.
F. If a MODEND record
If start address is specified, compute the corresponding load address (analogous to
the computation while processing an LEDATA record) and record it in the
executable file being generated.
DYNAMIC LINKING
Sophisticated operating systems, such as Windows allow you to link executable object modules
to be linked to a program while a program is running. This is known as dynamic linking. The
operating system contains a linker that determines functions, which are not specified in a
progam. A linker searches through the specified libraries for the missing function and helps
extract the object modules containing the missing functions from the libraries. The libraries are
constructed in a way to be able to work with dynamic linkers. Such libraries are known as
dynamic link libraries (DLLs). Technically, dynamic linking is not like static linking, which is
done at build time. DLLs contain functions or routines, which are loaded and executed when
needed by a program. The advantages of DLLs are:
• Code sharing: Programs in dynamic linking can share an identical code instead of
creating an individual copy of a same library. Sharing allows executable functions and
routines to be shared by many application programs. For example, the object linking and
embedding (OLE) functions of OLE2.DLL can be invoked to allow the execution of
functions or routines in any program.
• Automatic updating: Whenever you install a new version of dynamic link library, the
older version is automatically overridden. When you run a program the updated version
of the dynamic link library is automatically picked.
• Securing: Splitting the program you create into several linkage units makes it harder for
crackers to read an executable file.
We have seen that a lexical analyzer can identify tokens with the help of regular expressions and
pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to the
limitations of the regular expressions. Regular expressions cannot check balancing tokens, such as
parenthesis. Therefore, this phase uses context-free grammar (CFG), which is recognized by
push-down automata.
It implies that every Regular Grammar is also context-free, but there exists some problems, which
are beyond the scope of Regular Grammar. CFG is a helpful tool in describing the syntax of
programming languages.
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce terminologies
used in parsing technology.
A context-free grammar has four components:
A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings.
The non-terminals define sets of strings that help define the language generated by the
grammar.
A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from
which strings are formed.
A set of productions (P). The productions of a grammar specify the manner in which the
terminals and non-terminals can be combined to form strings. Each production consists of a
non-terminal called the left side of the production, an arrow, and a sequence of tokens
and/or on- terminals, called the right side of the production.
One of the non-terminals is designated as the start symbol (S); from where the production
begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the
start symbol) by the right side of a production, for that non-terminal.
Example
We take the problem of palindrome language, which cannot be described by means of Regular
Expression. That is, L = { w | w = wR } is not a regular language. But it can be described by means
of CFG, as illustrated below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → ℇ | Z → 0Q0 | N → 1Q1 }
S={Q}
This grammar describes palindrome language, such as: 1001, 11100111, 00100, 1010101, 11111,
etc.
Context-free languages are closed under −
⚫ Union
⚫ Concatenation
⚫ Kleene Star operation
Union
Let L1 and L2 be two context free languages. Then L1 ∪ L2 is also context free.
Example
Let L1 = { anbn , n > 0}. Corresponding grammar G1 will have P: S1 → aAb|ab
Let L2 = { cmdm , m ≥ 0}. Corresponding grammar G2 will have P: S2 → cBb| ε
Union of L1 and L2, L = L1 ∪ L2 = { anbn } ∪ { cmdm }
The corresponding grammar G will have the additional production S → S1 | S2
Concatenation
If L1 and L2 are context free languages, then L1L2 is also context free.
Example
Union of the languages L1 and L2, L = L1L2 = { anbncmdm }
The corresponding grammar G will have the additional production S → S1 S2
Kleene Star
Complement − If L1 is a context free language, then L1’ may not be context free.
Finite automata - is a state machine that takes a string of symbols as input and changes its state
accordingly. Finite automata is a recognizer for regular expressions. When a regular expression
string is fed into finite automata, it changes its state for each literal. If the input string is
successfully processed and the automata reaches its final state, it is accepted, i.e., the string just fed
was said to be a valid token of the language in hand.
The mathematical model of finite automata consists of:
Finite set of states (Q)
Finite set of input symbols (Σ)
One Start state (q0)
Set of final states (qf)
Transition function (δ)
The transition function (δ) maps the finite set of state (Q) to a finite set of input symbols (Σ), Q × Σ
➔Q
Finite Automata Construction
Intermediate states : All intermediate states have at least two arrows; one pointing to and
another pointing out from them.
Final state : If the input string is successfully parsed, the automata is expected to be in this
state. Final state is represented by double circles. It may have any odd number of arrows
pointing to it and even number of arrows pointing out from it. The number of odd arrows are
one greater than even, i.e. odd = even+1.
Transition : The transition from one state to another state happens when a desired symbol in
the input is found. Upon transition, automata can either move to the next state or stay in the
same state. Movement from one state to another is shown as a directed arrow, where the
arrows points to the destination state. If automata stays on the same state, an arrow pointing
from a state to itself is drawn.
Example : We assume FA accepts any three digit binary value ending in digit 1. FA = {Q(q0, qf),
Σ(0,1), q0, qf, δ}
We will reduce the regular expression into smallest regular expressions and converting these to
NFA and finally to DFA.
Some basic RA expressions are the following −
Case 1 − For a regular expression ‘a’, we can construct the following FA −
Method
Step 1 Construct an NFA with Null moves from the given regular expression.
Step 2 Remove Null transition from the NFA and convert it into its equivalent DFA.
Problem
Convert the following RA into its equivalent DFA − 1 (0 + 1)* 0
Solution
We will concatenate three expressions "1", "(0 + 1)*" and "0
Now we will remove the ε transitions. After we remove the ε transitions from the NDFA, we get the
following −
A Finite Automaton with null moves (FA-ε) does transit not only after giving input from the
alphabet set but also without any input symbol. This transition without input is called a null move.
An NFA-ε is represented formally by a 5-tuple (Q, ∑, δ, q0, F), consisting of
Q − a finite set of states
∑ − a finite set of input symbols
δ − a transition function δ : Q × (∑ ∪ {ε}) → 2Q
q0 − an initial state q0 ∈ Q
F − a set of final state/states of Q (F⊆Q).
If in an NDFA, there is ϵ-move between vertex X to vertex Y, we can remove it using the following
steps −
Find all the outgoing edges from Y.
Copy all these edges starting from X without changing the edge labels.
If X is an initial state, make Y also an initial state.
If Y is a final state, make X also a final state.
Problem
S
lo
u
tin
Step 1 −
Step 2 −
Now we will Copy all these edges from q1 without changing the edges from qf and get the
following FA −
Step 3 −
Here q1 is an initial state, so we make qf also an initial state.
So the FA becomes −
Step 4 −
Here qf is a final state, so we make q1 also a final state.
So the FA becomes −
If (Q, ∑, δ, q0, F) be a DFA that accepts a language L, then the complement of the DFA can be
obtained by swapping its accepting states with its non-accepting states and vice versa.
We will take an example and elaborate this below −
Problem
Check whether the grammar G with production rules −
X → X+X | X*X |X| a
is ambiguous or not.
Solution
Let’s find out the derivation tree for the string "a+a*a". It has two leftmost derivations.
Since there are two parse trees for a single string "a+a*a", the grammar G is ambiguous.
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The
parser analyzes the source code (token stream) against the production rules to detect any errors in
the code. The output of this phase is a parse tree.
this way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and
generating a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use
error recovering strategies, which we will learn later in this chapter.
Derivation
A derivation is basically a sequence of production rules, in order to get the input string. During
parsing, we take two decisions for some sentential form of input:
Deciding the non-terminal which is to be replaced.
Deciding the production rule, by which, the non-terminal will be replaced.
To decide which non-terminal to be replaced with production rule, we can have two options.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most derivation is called the left-sentential form.
Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-most
derivation. The sentential form derived from the right-most derivation is called the right-sentential
form.
Example
Production rules:
E→E+E
E→E*E
E → id
Input string: id + id * id
The left-most derivation is:
E→E*E
E→E+E*E
E → id + E * E
E → id + id * E
E → id + id * id
Notice that the left-most side non-terminal is always processed first.
The right-most derivation is:
E→E+E
E→E+E*E
E → E + E * id
E → E + id * id
E → id + id * id
Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived
from the start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us
see this by an example from the last topic.
We take the left-most derivation of a + b * c
The left-most derivation is:
E→E*E
E→E+E*E
E → id + E * E
E → id + id * E
E → id + id * id
Step 1:
E→E*E
Step 2:
E→E+E*E
Step 3:
E → id + E * E
Step 4:
E → id + id * E
Step 5:
E → id + id * id
n a parse tree:
All leaf nodes are terminals.
All interior nodes are non-terminals.
(2) S => Aα | β
A => Sd
(1) is an example of immediate left recursion, where A is any non-terminal symbol and α represents
a string of non-terminals.
(2) is an example of indirect-left recursion.
A top-down parser will first parse the A, which in-turn will yield a string consisting of A itself and
the parser may go into a loop forever.
Removal of Left Recursion
One way to remove left recursion is to use the following technique:
The production
A => Aα | β
is converted into following productions
A => βA'
A'=> αA' | ε
This does not impact the strings derived from the grammar, but it removes immediate left recursion.
Second method is to use the following algorithm, which should eliminate all direct and indirect left
recursions.
START
Arrange non-terminals in some order like A1, A2, A3,…, An
for each i from 1 to n
{
for each j from 1 to i-1
{
replace each production of form Ai Aj
with Ai Ɓ1
㤵2
㤵| Ɓ 㤵|…
㤵
where Aj ⟹Ɓ1 | Ɓ2|…| Ɓn are current Aj productions
}
}
eliminate immediate left-recursion
END
Example
Example
The above productions can be written as
A => αA'
A'=> β | 㤵
Now the parser has only one production per prefix which makes it easier to take decisions.
First and Follow Sets
An important part of parser table construction is to create first and follow sets. These sets can
provide the actual position of any terminal in the derivation. This is done to create the parsing table
where the decision of replacing T[A, t] = α with some production rule.
First Set
This set is created to know what terminal symbol is derived in the first position by a non-terminal.
For example,
α→tβ
That is α derives t (terminal) in the very first position. So, t ∈ FIRST(α).
if α is a non-terminal and α → 1 2
Follow Set
Likewise, we calculate what terminal symbol immediately follows a non-terminal α in production
rules. We do not consider what the non-terminal can generate but instead, we see what would be the
next terminal symbol that follows the productions of a non-terminal.
Algorithm for calculating Follow set:
if α is a start symbol, then FOLLOW() = $
if α is a non-terminal and has a production α → AB, then FIRST(B) is in FOLLOW(A)
except ℇ .
if α is a non-terminal and has a production α → AB, where B ℇ , then FOLLOW(A) is in
FOLLOW(α).
Follow set can be seen as: FOLLOW(α) = { t | S *αt*}
Top-down Parsing
When the parser starts constructing the parse tree from the start symbol and then tries to transform
the start symbol to the input, it is called top-down parsing.
Recursive descent parsing : It is a common form of top-down parsing. It is called recursive
as it uses recursive procedures to process the input. Recursive descent parsing suffers from
backtracking.
Backtracking : It means, if one derivation of a production fails, the syntax analyzer restarts
the process using different rules of same production. This technique may process the input
string more than once to determine the right production.
Bottom-up Parsing
As the name suggests, bottom-up parsing starts with the input symbols and tries to construct the
parse tree up to the start symbol.
Example:
Input string : a + b * c
Production rules:
S→E
E→E+T
E→E*T
E→T
T → id
Let us start bottom-up parsing
a+b*c
Read the input and check if any production matches with the input:
a+b*c
T+b*c
E+b*c
E+T*c
E*c
E*T
E
S
We have learnt in the last chapter that the top-down parsing technique parses the input, and starts
constructing a parse tree from the root node gradually moving down to the leaf nodes. The types of
top-down parsing are depicted below:
For an input string: read, a top-down parser, will behave like this:
It will start with S from the production rules and will match its yield to the left-most letter of the
input, i.e. ‘r’. The very production of S (S → rXd) matches with it. So the top-down parser
advances to the next input letter (i.e. ‘e’). The parser tries to expand non-terminal ‘X’ and checks its
production from the left (X → oa). It does not match with the next input symbol. So the top-down
parser backtracks to obtain the next production rule of X, (X → ea).
Now the parser matches all the input letters in an ordered manner. The string is accepted
Predictive Parser
Predictive parser is a recursive descent parser, which has the capability to predict which production
is to be used to replace the input string. The predictive parser does not suffer from backtracking.
To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to the next
input symbols. To make the parser back-tracking free, the predictive parser puts some constraints on
the grammar and accepts only a class of grammar known as LL(k) grammar.
Predictive parsing uses a stack and a parsing table to parse the input and generate a parse tree. Both
the stack and the input contains an end symbol $ to denote that the stack is empty and the input is
consumed. The parser refers to the parsing table to take any decision on the input and stack element
combination.
In recursive descent parsing, the parser may have more than one production to choose from for a
single instance of input, whereas in predictive parser, each step has at most one production to
choose. There might be instances where there is no production matching the input string, making
the parsing procedure to fail
LL Parser
An LL Parser accepts LL grammar. LL grammar is a subset of context-free grammar but with some
restrictions to get the simplified version, in order to achieve easy implementation. LL grammar can
be implemented by means of both algorithms namely, recursive-descent or table-driven.
LL parser is denoted as LL(k). The first L in LL(k) is parsing the input from left to right, the second
L in LL(k) stands for left-most derivation and k itself represents the number of look aheads.
Generally k = 1, so LL(k) may also be written as LL(1).
LL Parsing Algorithm
We may stick to deterministic LL(1) for parser explanation, as the size of table grows exponentially
with the value of k. Secondly, if a given grammar is not LL(1), then usually, it is not LL(k), for any
given k.
Given below is an algorithm for LL(1) Parsing:
Input:
string ω
parsing table M for grammar G
Output:
If ω is in L(G) then left-most derivation of ω,
error otherwise.
Shift-Reduce Parsing
Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as
shift-step and reduce-step.
Shift step: The shift step refers to the advancement of the input pointer to the next input
symbol, which is called the shifted symbol. This symbol is pushed onto the stack. The
shifted symbol is treated as a single node of the parse tree.
Reduce step : When the parser finds a complete grammar rule (RHS) and replaces it to
(LHS), it is known as reduce-step. This occurs when the top of the stack contains a handle.
To reduce, a POP function is performed on the stack which pops off the handle and replaces
it with LHS non-terminal symbol.
LR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It uses a wide class of context-free
grammar which makes it the most efficient syntax analysis technique. LR parsers are also known as
LR(k) parsers, where L stands for left-to-right scanning of the input stream; R stands for the
construction of right-most derivation in reverse, and k denotes the number of lookahead symbols to
make decisions.
There are three widely used algorithms available for constructing an LR parser:
SLR(1) – Simple LR Parser:
Works on smallest class of grammar
Few number of states, hence very small table
Simple and fast construction
LR(1) – LR Parser:
Works on complete set of LR(1) Grammar
Generates large table and large number of states
Slow construction
LALR(1) – Look-Ahead LR Parser:
Works on intermediate size of grammar
Number of states are same as in SLR(1)
LR Parsing Algorithm
Here we describe a skeleton algorithm of an LR parser:
token = next_token()
repeat forever
s = top of stack
if action[s, token] = “shift si” then
PUSH token
PUSH si
token = next_token()
else if action[s, token] = “reduce A::= β“ then
POP 2 * |β| symbols
s = top of stack
PUSH A
PUSH goto[s,A]
else if action[s, token] = “accept” then
return
else
error()
If watched closely, we find most of the leaf nodes are single child to their parent nodes. This
information can be eliminated before feeding it to the next phase. By hiding extra information, we
can obtain a tree as shown below:
ASTs are important data structures in a compiler with least unnecessary information. ASTs are
more compact than a parse tree and can be easily used by a compiler.
COMPILER
Introduction
The semantic gap characterizes the difference between two descriptions of an object by
different linguistic representations, for instance languages or symbols.
The semantic gap can be defined as "the difference in meaning between constructs formed
within different representation systems”
Memory Allocation
Memory allocation is the process of reserving a partial or complete portion of computer
memory for the execution of programs and processes. Memory allocation is achieved through a process
known as memory management.
1. Static allocation is done at compile time when you know the size of the array.
2. The memory size allocated to “data” is static. But it is possible to change content of a static
structure without increasing the memory space allocated to it.
3. Global variables are declared “ahead of time,” such as fixed array.
4. Lifetime of Static allocation is the entire runtime of program.
5. It has efficient execution time
Dynamic allocation of memory space is done by using these functions when value is returned by
functions and assigned to pointer variables.
System designers often have a choice of when the system should make decisions about how to
treat a feature; such a decision is called a binding.
Usually, this boils down to making the decision at compile-time based solely on the program
text (a static binding) or at run-time based on values computed by the program (a dynamic binding).
Data structures can grow and shrink to fit changing data requirements.
Early binding times (before run time) are associated with greater efficiency
Compilers try to fix decisions that can be taken at compile time to avoid to generate
code that makes a decision at run time
Syntax and static semantics checking is performed only once at compile time and does
not impose any run-time overheads
• Late binding times (at run time) are associated with greater flexibility.
• Interpreters allow programs to be extended at run time.
• Languages such as Smalltalk-80 with polymorphic types allow variable names to refer to
objects of multiple types at run time.
• Method binding in object-oriented languages must be late to support dynamic binding
• allocate (create) additional storage.
• de-allocate (free/delete) dynamic space.
Binding is the act of associating properties with names. Binding time is the moment in the program's
life cycle when this association occurs.
The actual values stored in the variables is perhaps the most important of these properties. In
dynamically typed languages we will only know the types of variables during the execution of the
program. Languages that provide some form of late binding will only lets us know the target of a
function call at runtime, for instance.
Source codes generally have a number of instructions, which are always executed in sequence and are
considered as the basic blocks of the code. These basic blocks do not have any jump statements among
them, i.e., when the first instruction is executed, all the instructions in the same basic block will be
executed in their sequence of appearance without losing the flow control of the program.
A program can have various constructs as basic blocks, like IF-THEN-ELSE, SWITCH-CASE
conditional statements and loops such as DO-WHILE, FOR, and REPEAT-UNTIL, etc.
Basic block identification
We may use the following algorithm to find the basic blocks in a program:
• Search header statements of all the basic blocks from where a basic block starts:
• First statement of a program.
• Statements that are target of any branch (conditional/unconditional).
Activation Record
Control stack or runtime stack is used to keep track of the live procedure activations i.e the procedures
whose execution have not been completed. A procedure name is pushed on to the stack when it is
called (activation begins) and it is popped when it returns (activation ends). Information needed by a
single execution of a procedure is managed using an activation record or frame. When a procedure is
called, an activation record is pushed into the stack and as soon as the control returns to the caller
function the activation record is popped.
Directed Acyclic Graph (DAG) is a tool that depicts the structure of basic blocks, helps to see the flow
of values flowing among the basic blocks, and offers optimization too. DAG provides easy
transformation on basic blocks. DAG can be understood here:
• Leaf nodes represent identifiers, names or constants.
• Interior nodes represent operators.
• Interior nodes also represent the results of expressions or the identifiers/name where the values
are to be stored or assigned.
Example:
t0 = a + b
t1 = t0 + c
d = t0 + t1
[t0 = a + b]
[t1 = t0 + c]
[d = t0 + t1]
This optimization technique works locally on the source code to transform it into an optimized code.
By locally, we mean a small portion of the code block at hand. These methods can be applied on
intermediate codes as well as on target codes. A bunch of statements is analyzed and are checked for
the following possible optimization:
At compilation level, the compiler searches for instructions redundant in nature. Multiple loading and
storing of instructions may carry the same meaning even if some of them are removed. For example:
• MOV x, R0
• MOV R0, R1
We can delete the first instruction and re-write the sentence as:
MOV x, R1
Unreachable code
Unreachable code is a part of the program code that is never accessed because of programming
constructs. Programmers may have accidently written a piece of code that can never be reached.
Example:
void add_ten(int x)
{
return x + 10;
A source code can directly be translated into its target machine code, then why at all we need to
translate the source code into an intermediate code which is then translated to its target code? Let us see
the reasons why we need an intermediate code.
• If a compiler translates the source language to its target machine language without having the
option for generating intermediate code, then for each new machine, a full native compiler is
required.
• Intermediate code eliminates the need of a new full compiler for every unique machine by
keeping the analysis portion same for all the compilers.
• The second part of compiler, synthesis, is changed according to the target machine.
• It becomes easier to apply the source code modifications to improve code performance by
applying code optimization techniques on the intermediate code.
Intermediate codes can be represented in a variety of ways and they have their own benefits.
• High Level IR - High-level intermediate code representation is very close to the source
language itself. They can be easily generated from the source code and we can easily apply code
modifications to enhance performance. But for target machine optimization, it is less preferred.
• Low Level IR - This one is close to the target machine, which makes it suitable for register and
memory allocation, instruction set selection, etc. It is good for machine-dependent
optimizations.
Intermediate code can be either language specific (e.g., Byte Code for Java) or language independent
(three-address code).
Intermediate code generator receives input from its predecessor phase, semantic analyzer, in the form
of an annotated syntax tree. That syntax tree then can be converted into a linear representation, e.g.,
postfix notation. Intermediate code tends to be machine independent code. Therefore, code generator
assumes to have unlimited number of memory storage (register) to generate code.
For example:
a = b + c * d;
The intermediate code generator will try to divide this expression into sub-expressions and then
generate the corresponding code.
r1 = c * d;
r2 = b + r1;
a = r2
r being used as registers in the target program.
A three-address code has at most three address locations to calculate the expression. A three-address
code can be represented in two forms : quadruples and triples.
Quadruples
Each instruction in quadruples presentation is divided into four fields: operator, arg1, arg2, and result.
The above example is represented below in quadruples format:
Op arg1 arg2 result
* c d r1
+ b r1 r2
+ r2 r1 r3
= r3 a
Triples
Each instruction in triples presentation has three fields : op, arg1, and arg2.The results of respective
sub-expressions are denoted by the position of expression. Triples represent similarity with DAG and
syntax tree. They are equivalent to DAG while representing expressions.
Op arg1 arg2
* c d
+ b (0)
+ (1) (0)
= (2)
Triples face the problem of code immovability while optimization, as the results are positional and
changing the order or position of an expression may cause problems.
Indirect Triples
This representation is an enhancement over triples representation. It uses pointers instead of position to
store results. This enables the optimizers to freely re-position the sub-expression to produce an
optimized code.
A variable or procedure has to be declared before it can be used. Declaration involves allocation of
space in memory and entry of type and name in the symbol table. A program may be coded and
designed keeping the target machine structure in mind, but it may not always be possible to accurately
convert a source code to its target language.
Taking the whole program as a collection of procedures and sub-procedures, it becomes possible to
declare all the names local to the procedure. Memory allocation is done in a consecutive manner and
names are allocated to memory in the sequence they are declared in the program. We use offset
variable and set it to zero {offset = 0} that denote the base address.
The source programming language and the target machine architecture may vary in the way names are
stored, so relative addressing is used. While the first name is allocated memory starting from the
memory location 0 {offset=0}, the next name declared later, should be allocated memory next to the
first one.
Example:
We take the example of C programming language where an integer variable is assigned 2 bytes of
memory and a float variable is assigned 4 bytes of memory.
int a;
float b;
Allocation process:
{offset = 0}
int a;
id.type = int
id.width = 2
float b;
id.type = float
id.width = 4
ffset = offset + id.width
{offset = 6}
To enter this detail in a symbol table, a procedure enter can be used. This method may have the
following structure:
enter(name, type, offset)
This procedure should create an entry in the symbol table, for variable name, having its type set to type
and relative address offset in its data area.
Peephole optimizations
Usually performed late in the compilation process after machine code has been generated. This form of
optimization examines a few adjacent instructions (like "looking through a peephole" at the code) to see whether
they can be replaced by a single instruction or a shorter sequence of instructions. For instance, a multiplication
of a value by 2 might be more efficiently executed by left-shifting the value or by adding the value to itself (this
example is also an instance of strength reduction).
Local optimizations
These only consider information local to a basic block. Since basic blocks have no control flow, these
optimizations need very little analysis (saving time and reducing storage requirements), but this also means that
no information is preserved across jumps.
Global optimizations
These are also called "intraprocedural methods" and act on whole functions This gives them more information to
work with but often makes expensive computations necessary. Worst case assumptions have to be made when
function calls occur or global variables are accessed (because little information about them is available).
Loop optimizations
These act on the statements which make up a loop, such as a for loop (e.g., loop-invariant code motion). Loop
optimizations can have a significant impact because many programs spend a large percentage of their time inside
loops.
Prescient store optimizations
Allow store operations to occur earlier than would otherwise be permitted in the context of threads and locks.
The process needs some way of knowing ahead of time what value will be stored by the assignment that it
should have followed. The purpose of this relaxation is to allow compiler optimization to perform certain kinds
of code rearrangement that preserve the semantics of properly synchronized programs.
Most high-level languages share common programming constructs and abstractions: decision (if, switch, case),
looping (for, while, repeat.. until, do.. while), and encapsulation (structures, objects). Thus similar optimization
techniques can be used across languages. However, certain language features make some kinds of optimizations
difficult. For instance, the existence of pointers in C and C++ makes it difficult to optimize array accesses (see
alias analysis). However, languages such as PL/1 (that also supports pointers) nevertheless have available
sophisticated optimizing compilers to achieve better performance in various other ways. Conversely, some
language features make certain optimizations easier. For example, in some languages functions are not permitted
to have side effects. Therefore, if a program makes several calls to the same function with the same arguments,
the compiler can immediately infer that the function's result need be computed only once. In languages where
functions are allowed to have side effects, another strategy is possible. The optimizer can determine which
function has no side effects, and restrict such optimizations to side effect free functions. This optimization is
only possible when the optimizer has access to the called function.
Many optimizations that operate on abstract programming concepts (loops, objects, structures) are independent
of the machine targeted by the compiler, but many of the most effective optimizations are those that best exploit
special features of the target platform. E.g.: Instructions which do several things at once, such as decrement
register and branch if not zero.
The following is an instance of a local machine dependent optimization. To set a register to 0, the obvious
way is to use the constant '0' in an instruction that sets a register value to a constant. A less obvious way is to
XOR a register with itself. It is up to the compiler to know which instruction variant to use. On many RISC
machines, both instructions would be equally appropriate, since they would both be the same length and take the
same time. On many other microprocessors such as the Intel x86 family, it turns out that the XOR variant is
shorter and probably faster, as there will be no need to decode an immediate operand, nor use the internal
"immediate operand register". (A potential problem with this is that XOR may introduce a data dependency on
the previous value of the register, causing a pipeline stall. However, processors often have XOR of a register
with itself as a special case that does not cause stalls.)
Loop optimizations
Roughly, if a variable in a loop is a simple linear function of the index variable, such as j := 4*i + 1, it can be
updated appropriately each time the loop variable is changed. This is a strength reduction, and also may allow
the index variable's definitions to become dead code. This information is also useful for bounds-checking
elimination and dependence analysis, among other things.
Loop fission attempts to break a loop into multiple loops over the same index range but each taking only a part
of the loop's body. This can improve locality of reference, both of the data being accessed in the loop and the
Another technique which attempts to reduce loop overhead. When two adjacent loops would iterate the same
number of times (whether or not that number is known at compile time), their bodies can be combined as long as
they make no reference to each other's data.
Loop inversion
This technique changes a standard while loop into a do/while (also known as repeat/until) loop wrapped in an if
conditional, reducing the number of jumps by two, for cases when the loop is executed. Doing so duplicates the
condition check (increasing the size of the code) but is more efficient because jumps usually cause a pipeline
stall. Additionally, if the initial condition is known at compile-time and is known to be side-effect-free, the if
guard can be skipped.
Loop interchange
These optimizations exchange inner loops with outer loops. When the loop variables index into an array, such a
transformation can improve locality of reference, depending on the array's layout.
If a quantity is computed inside a loop during every iteration, and its value is the same for each iteration, it can
vastly improve efficiency to hoist it outside the loop and compute its value just once before the loop begins. This
is particularly important with the address-calculation expressions generated by loops over arrays. For correct
implementation, this technique must be used with loop inversion, because not all code is safe to be hoisted
outside the loop.
Some pervasive algorithms such as matrix multiplication have very poor cache behavior and excessive memory
accesses. Loop nest optimization increases the number of cache hits by performing the operation over small
blocks and by using a loop interchange.
Loop reversal
Loop reversal reverses the order in which values are assigned to the index variable. This is a subtle optimization
which can help eliminate dependencies and thus enable other optimizations. Furthermore, on some architectures,
loop reversal contributes to smaller code, as when the loop index is being decremented, the condition that needs
to be met in order for the running program to exit the loop is a comparison with zero. This is often a special,
parameter-less instruction, unlike a comparison with a number, which needs the number to compare to.
Therefore, the amount of bytes needed to store the parameter is saved by using the loop reversal. Additionally, if
the comparison number exceeds the size of word of the platform, in standard loop order, multiple instructions
would need to be executed in order to evaluate the comparison, which is not the case with loop reversal.
Loop unrolling
Unrolling duplicates the body of the loop multiple times, in order to decrease the number of times the loop
condition is tested and the number of jumps, which hurt performance by impairing the instruction pipeline. A
"fewer jumps" optimization. Completely unrolling a loop eliminates all overhead, but requires that the number of
iterations be known at compile time.
Loop splitting
Loop splitting attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops which
have the same bodies but iterate over different contiguous portions of the index range. A useful special case is
loop peeling, which can simplify a loop with a problematic first iteration by performing that iteration separately
before entering the loop.
Loop unswitching
Unswitching moves a conditional from inside a loop to outside the loop by duplicating the loop's body inside
each of the if and else clauses of the conditional.
Software pipelining
The loop is restructured in such a way that work done in an iteration is split into several parts and done over
several iterations. In a tight loop this technique hides the latency between loading and using values.
Automatic parallelization
A loop is converted into multi-threaded or vectorized (or even both) code in order to utilize multiple processors
simultaneously in a shared-memory multiprocessor (SMP) machine, including multi-core machines.
Data-flow optimizations
Data-flow optimizations, based on data-flow analysis, primarily depend on how certain properties of data are
propagated by control edges in the control flow graph. Some of these include:
Common subexpression elimination
In the expression (a + b) - (a + b)/4, "common subexpression" refers to the duplicated (a + b). Compilers
implementing this technique realize that (a + b) will not change, and so only calculate its value once.
replacing expressions consisting of constants (e.g., 3 + 5) with their final value (8) at compile time, rather than
doing the calculation in run-time. Used in most modern languages.
in the presence of pointers, it is difficult to make any optimizations at all, since potentially any variable can have
been changed when a memory location is assigned to. By specifying which pointers can alias which variables,
unrelated pointers can be ignored.
removal of assignments to variables that are not subsequently read, either because the lifetime of the variable
ends or because of a subsequent assignment that will overwrite the first value.
Most programs run as a loop in the system. It becomes necessary to optimize the loops in order to save
CPU cycles and memory. Loops can be optimized by the following techniques:
• Invariant code : A fragment of code that resides in the loop and computes the same value at
each iteration is called a loop-invariant code. This code can be moved out of the loop by saving
it to be computed only once, rather than with each iteration.
• Induction analysis : A variable is called an induction variable if its value is altered within the
loop by a loop-invariant value.
• Strength reduction : There are expressions that consume more CPU cycles, time, and memory.
These expressions should be replaced with cheaper expressions without compromising the
output of expression. For example, multiplication (x * 2) is expensive in terms of CPU cycles
than (x << 1) and yields the same result.
Dead-code Elimination
Dead code is one or more than one code statements, which are:
• Either never executed or unreachable,
• Or if executed, their output is never used.
Thus, dead code plays no role in any program operation and therefore it can simply be eliminated.
The above control flow graph depicts a chunk of program where variable ‘a’ is used to assign the
output of expression ‘x * y’. Let us assume that the value assigned to ‘a’ is never used inside the
loop.Immediately after the control leaves the loop, ‘a’ is assigned the value of variable ‘z’, which
would be used later in the program. We conclude here that the assignment code of ‘a’ is never used
anywhere, therefore it is eligible to be eliminated.
Likewise, the picture above depicts that the conditional statement is always false, implying that the
code, written in true case, will never be executed, hence it can be removed.
Partial Redundancy
Redundant expressions are computed more than once in parallel path, without any change in
operands.whereas partial-redundant expressions are computed more than once in a path, without any
change in operands. For example,
Loop-invariant code is partially redundant and can be eliminated by using a code-motion technique.
Another example of a partially redundant code can be:
If (condition)
{
a = y OP z;
}
else
{
...
}
c = y OP z;
We assume that the values of operands (y and z) are not changed from assignment of variable a to
variable c. Here, if the condition statement is true, then y OP z is computed twice, otherwise once.
Code motion can be used to eliminate this redundancy, as shown below:
If (condition)
{
...
tmp = y OP z;
a = tmp;
...
}
else
{
...
tmp = y OP z;
}
c = tmp
Strength reduction
There are operations that consume more time and space. Their ‘strength’ can be reduced by replacing
them with other operations that consume less time and space, but produce the same result.
For example, x * 2 can be replaced by x << 1, which involves only one left shift. Though the output of
a * a and a2 is same, a2 is much more efficient to implement.
Code Generator
A code generator is expected to have an understanding of the target machine’s runtime environment
and its instruction set. The code generator should take the following things into consideration to
generate the code:
• Target language : The code generator has to be aware of the nature of the target language for
which the code is to be transformed. That language may facilitate some machine-specific
instructions to help the compiler generate the code in a more convenient way. The target
machine can have either CISC or RISC processor architecture.
• IR Type : Intermediate representation has various forms. It can be in Abstract Syntax Tree
(AST) structure, Reverse Polish Notation, or 3-address code.
• Selection of instruction : The code generator takes Intermediate Representation as input and
converts (maps) it into target machine’s instruction set. One representation can have many ways
(instructions) to convert it, so it becomes the responsibility of the code generator to choose the
appropriate instructions wisely.
• Register allocation : A program has a number of values to be maintained during the execution.
The target machine’s architecture may not allow all of the values to be kept in the CPU memory
or registers. Code generator decides what values to keep in the registers. Also, it decides the
registers to be used to keep these values.
• Ordering of instructions : At last, the code generator decides the order in which the instruction
will be executed. It creates schedules for instructions to execute them.
Code Generation
Basic blocks comprise of a sequence of three-address instructions. Code generator takes these sequence
of instructions as input.
Note : If the value of a name is found at more than one place (register, cache, or memory), the
register’s value will be preferred over the cache and main memory. Likewise cache’s value will be
preferred over the main memory. Main memory is barely given any preference.
getReg : Code generator uses getReg function to determine the status of available registers and the
location of name values. getReg works as follows:
Overview of Interpretation
An interpreter is system software that translates a given High-Level Language (HLL) program into a low-
level one, but it differs from compilers. Interpretation is a real-time activity where an interpreter takes
theprogram, onestatementatatime, andtranslateseachlinebeforeexecuting it.
Benefits of Interpretation
The distinguished benefits of interpretation are as follows:
• Executes the source code directly. It translates the source code into some efficient
Intermediate Code (IC) and immediately executes it. The process of execution can be
• The Java virtual machine essentially interprets the bytecode formof aprogram. The Java compiler
andtheJavavirtualmachinethusimplementtheimpureinterpretationscheme.
• Use of an interpretive scheme allows certain elements of a program to be specified during
interpretation. This feature is exploited to provide a capability for including program modules
called Java class files during interpretation of a Java program.
• The class loader is invoked whenever a new class file is to be dynamically included in
program. Theclassloaderlocatesthedesiredclassfileandpassesittothe Javabytecode verifier.
• The Java bytecode verifier checks whether
o The program forges pointers, thereby potentially accessing invalid data or performing
branches to invalid locations.
o Theprogram violatesaccess restrictions, e.g., byaccessing private data.
o The program has type-mismatches whereby it may access data in an invalid manner.
o The program may have stack overflows or underflows during execution.
• The Java language environment provides the two compilation schemes shown in the lower half of
Figure The Java Just-In-Time compiler compiles parts of the Java bytecode that are consuming a
significant fraction of the execution time into the machine language of the computer to improve
their execution efficiency. It is implemented using the schemeof dynamic compilation.
• After the just-in-time compiler has compiled some part of the program, some parts of the Java
source program has been converted into the machine language while the remainder of the program
still exists in the bytecode form. Hence the Java virtual machine uses a mixed- mode execution
approach.
• Theothercompilationoptionusesthe Javanativecodecompilershown inthe lower part of
Figure .It simply compiles the complete Java program into the machine language of a computer.
This scheme provides fast execution of the Java program; however, it cannot provide any of the
benefits of interpretation or just-in- time compilation.
class file, and a bytecode verifier, which ensures that execution of the bytecode would not cause
any breaches of security.
• The Java virtual machine is a stack machine. By contrast, a stack machine performs
computations by using the values existing in the top few entries on a stack and leaving their results
on the stack. This arrangement requires that a program should load the values on which it wishes
to operate on the stack before performing operations on them and should take their results from
the stack.
• The stack machine has the following three kinds of operations:
o Pushoperation: Thisoperationhasoneoperand, whichistheaddressofamemory location.
The operation creates a new entry at the top of the stack and copies the value that is
contained in the specified memory location into this entry.
o Pop operation: This operation also has the address of a memory location as its operand.
It performs the converse of the push operation—it copies the value contained in the
entry that is at the top of the stack into the specified memory location and also deletes
that entry from the stack.
o n-ary operation: This operation operates on the values existing in the top n entries of the
stack, deletes the top n entries from the stack, and leaves the result, if any, in the top
entry of the stack. Thus, a unary operation operates only on the value contained in the
top entry of the stack, a binary operation operates on values contained in the top two
entries of the stack, etc.
• A stack machine can evaluate expressions very efficiently because partial results need not be stored
in memory—they can be simply left on the stack.
Types of Errors
Syntax Error
• Syntax errors occur due to the fact that the syntax of the programming language is not
followed.
• The errors in token formation, missing operators, unbalanced parenthesis, etc., constitute syntax
errors.
• Theseare generallyprogrammer inducedduetomistakesandnegligencewhilewritinga program.
• Syntaxerrorsaredetectedearlyduringthecompilationprocessandrestrictthecompilerto
proceed for code generation.
• Let us see the syntax errors with Java language in the following examples.
Example 1: Missing punctuation-"semicolon" int
age = 50 // note here semicolon is missing
Example 2: Errors inexpressionsyntax
x = ( 30 - 15; // note the missing closing parenthesis ")"
Semantic Error
Logical Error
• Logical errors occur due to the fact that the software specification is not followed while
writing the program. Although the program is successfully compiled and executed error free,
the desired results are not obtained.
• Let us look into some logical errors with Java language.
Example: Errorsincomputation
publicstatic int mul(inta, int b) {
return a + b ;
}
// this method returns the incorrect value with respect to the specification that requires to multiply
two integers
Example:Non-terminatingloops
String str = br. readLine();
while (str != null) {
System.out.pri ntln(str);
} // the loop in the code did not terminate
• Logical errors may cause undesirable effect and program behaviors. Sometimes, these errors
remain undetected unless the results are analyzed carefully.
Debugging Procedures
• Whenever there is a gap between an expected output and an actual output of a program, the
program needs to be debugged.
• Anerrorinaprogramiscalledbug, anddebuggingmeansfindingandremovingtheerrors present
in theprogram.
• Debugging involves executing the program in a controlled fashion.
• During debugging, the execution of a program can be monitored at every step.
• Inthedebugmode,activitiessuchasstartingtheexecutionandstoppingtheexecutionare in the
hands of the debugger.
• The debugger provides the facility toexecute a program up to the specified instruction by
inserting a breakpoint.
• Itgivesachancetoexaminethevaluesassignedtothevariablespresentintheprogramat any
instant and, if required, offers an opportunity to update the program.
• Types of debugging procedures:
o Debug Monitors:
Adebugmonitorisaprogramthatmonitorstheexecutionofaprogramandreports thestate
ofaprogramduringitsexecution.Itmayinterfereintheexecutionprocess, depending upon
the actions carried out by a debugger (person). In order to initiate theprocessofdebugging,
a programmer must compile the program with the debug option first. This option, along
with other information, generates a table that stores the information about the variables
used in a program and their addresses.
o Assertions:
Assertions are mechanisms used by a debugger to catch the errors at a stage before the
execution of a program. Sometimes, while programming, some assumptions are made about
the data involved in computation. If these assumptions went wrong during the
execution of the program, it may lead to erroneous results. For this, a programmer can
make use of an assert statement. Assertions are the statements used in programs, which are
always associated with Boolean conditions. If an assert() statement is evaluated to be true,
nothing happens. But if it is realized that the statement is false, the execution program
halts.
Classification of Debuggers
Static Debugging
• Static debugging focuses on semantic analysis.
• In a certain program, supposethere are twovariables: var1 and var2. The type of var1 is an integer,
and the type of var2 is a float. Now, the program assigns the value of var2 to var1; then, there is a
possibility that it may not get correctly assigned to the variable due to truncation. This type of
analysis falls under static debugging.
• Static debugging detects errors before the actual execution.
• Static code analysis may include detection of the following situations:
o Dereferencing of variable before assigning a value to it
o Truncation of value due to wrong assignment
o Redeclaration of variables
o Presence of unreachablecode
Dynamic/Interactive Debugger
Dynamic analysis is carried out during program execution. An interactive debugging system provides
programmerswithfacilitiesthataidintestinganddebuggingprogramsinteractively.
whether some pre-condition or post-condition has been met in the program during execution.
• Tracing: Tracing monitors step by step the execution of all executable statements present in a
program. The other name for this process is "step into". Another possible variation is "step over"
debugging that can be executed at the level of procedure or function. This can be implemented
byaddinga breakpoint at the lastexecutable statementin aprogram.
• Traceback: This gives a user the chance to traceback over the functions, and the traceback utility
uses stack data structure. Traceback utility should show the path by which the current statement in
the program was reached.
• Program-display capabilities: While debugging is in progress, the program being debugged must be
made visible on the screen along with the line numbers.
• Multilingual capability: The debugging system must also consider the language in which the
debugging is done. Generally, different programming languages involve different user
environments and applicationssystems.
• Optimization: Sometimes, to make a program efficient, programmers may use an optimized code.
Debugging of such statements can be tricky. However, to simplify the debugging process, a
debuggermayuseanoptimizingcompilerthatdealswiththefollowingissues:
o Removing invariant expressions from a loop
o Merging similar loops
o Eliminating unnecessarystatements
o Removing branch instructions
Page 123