System software (BCA)
An assembler is a crucial system software that converts assembly language programs into
machine code or object code, which the central processing unit (CPU) can directly execute. It
serves as a translator, converting human-readable mnemonics and symbols into the binary
instructions that make up machine language. This process is essential for low-level programming
and system development, where precise control over hardware is necessary.
Assemblers are similar to compilers in that they produce executable code. However, assemblers
are simpler, since they only convert low-level code (assembly language) to machine code.
Since each assembly language is designed for a specific processor, assembling a program is
performed using a simple one-to-one mapping from assembly code to machine code. Compilers,
on the other hand, must convert generic high-level source code into machine code for a specific
processor.
Most programs are written in high-level programming languages and are compiled directly to
machine code using a compiler. However, in some cases, assembly code may be used to
customize functions and ensure they perform in a specific way. Therefore, IDEs often include
assemblers so they can build programs from both high and low-level languages.
The design of an assembler depends on the machine architecture, because the mnemonic language it
accepts corresponds directly to that machine's instruction set.
Key Characteristics of an Assembler
1. Translation of Assembly Language to Machine Code:
o The primary function of an assembler is to convert assembly language, which uses symbolic
instructions (like MOV, ADD, SUB), into the corresponding binary code understood by a CPU.
o Every assembly instruction is mapped directly to a machine instruction, making this translation
straightforward but vital for efficient system execution.
2. Symbolic Addressing:
o Assemblers use symbols to represent memory addresses, making it easier for programmers to
reference data without needing to know the exact address. The assembler manages these
symbols in a symbol table and resolves them to actual memory addresses during translation.
3. Platform-Specific:
o Assemblers are highly architecture-specific. They generate machine code for a specific
processor type or instruction set architecture (ISA) such as x86, ARM, or MIPS.
o Each processor has its own set of instructions, registers, and addressing modes, so the
assembler must be tailored to that particular architecture.
4. Two Types of Assemblers:
o Single-Pass Assembler: Processes the source code once and generates machine code
immediately. Efficient but struggles with forward references.
o Two-Pass Assembler: Processes the code in two passes—first building a symbol table, then
generating the machine code. Easier to manage forward references but requires more time and
memory.
5. Macro Processing:
o Assemblers often support macros, which allow the definition of reusable code blocks. A macro
can be called multiple times, simplifying repetitive code patterns and reducing the overall
complexity of the source code.
6. Error Detection and Reporting:
o Assemblers perform error checking, ensuring that the syntax of the assembly language is correct
and that symbolic references are properly resolved. Errors like undefined labels, incorrect
instructions, and invalid operand types are flagged during assembly.
Functions of an Assembler
1. Lexical Analysis: The assembler scans the source code, breaking it down into tokens
(mnemonics, labels, constants, directives, etc.). This process identifies the basic
elements of the assembly code that need to be translated into machine code.
2. Syntax Analysis: After lexical analysis, the assembler checks the syntax of each
instruction to ensure that it follows the rules of the assembly language. For example, it
verifies that the correct number and types of operands are used for each instruction.
3. Symbol Table Management: The assembler creates and maintains a symbol table,
which maps symbolic names (labels, variables) to memory addresses. During the
assembly process, symbolic addresses are replaced with actual memory addresses from
this table.
4. Instruction Encoding: The assembler translates each mnemonic instruction into its
corresponding binary machine code instruction. For example, MOV A, B might be
translated into a binary sequence like 10101011, depending on the processor's
instruction set.
5. Handling Directives: Assembly programs often include assembler directives (like
START, END, ORG, DB) that are instructions for the assembler itself, not for the
machine. These directives control aspects like where code is loaded in memory or how
data is initialized.
6. Forward Reference Handling: Forward references occur when a label or variable is
used before it is defined. Assemblers handle these either by backpatching (in single-pass
assemblers) or by resolving them in a second pass (in two-pass assemblers).
7. Literal Handling: The assembler also handles literals (constants used in the program)
and stores them in a literal table. These literals are replaced with actual values when
generating machine code.
8. Error Handling: Assemblers detect and report various errors, such as:
o Syntax Errors: Mistakes in instruction formats, invalid operands, etc.
o Semantic Errors: Undefined symbols, incorrect label definitions.
o The assembler outputs error messages to help the programmer debug the code.
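To make the lexical-analysis function above concrete, the following short Python sketch shows how an assembler might scan a single source line and split it into a label, a mnemonic and an operand list. The line format assumed here (label in column 1, comma-separated operands, ';' starting a comment) is an illustrative assumption, not the syntax of any particular real assembler.

# A minimal sketch of lexical analysis for one assembly source line.
# The column-1 label rule, comma-separated operands and ';' comments
# are simplifying assumptions for illustration only.
def tokenize_line(line):
    """Return (label, mnemonic, operands) for one assembly source line."""
    code = line.split(';', 1)[0].rstrip()        # drop an end-of-line comment
    if not code.strip():
        return None                              # blank or comment-only line
    label = None
    if not code[0].isspace():                    # a label starts in column 1
        label, _, code = code.partition(' ')
        label = label.rstrip(':')
    parts = code.split(None, 1)                  # mnemonic, then operand field
    mnemonic = parts[0] if parts else None
    operands = [op.strip() for op in parts[1].split(',')] if len(parts) > 1 else []
    return label, mnemonic, operands

print(tokenize_line("LOOP  MOV AX, COUNT   ; load the counter"))
# -> ('LOOP', 'MOV', ['AX', 'COUNT'])

The tuple produced here is the kind of token stream that the later steps (symbol table management, instruction encoding) would consume.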
Assembler Directives
Assembler directives are special instructions that guide the assembler during
the assembly process. These directives are not translated into machine code
and do not generate any executable instructions. Instead, they provide
information to the assembler about how to interpret and manage the assembly
code, such as memory allocation, symbol definition, and data initialization.
Assembler directives help control the organization of the program in memory,
initialization of data, label assignment, and macro definitions. They are often
prefixed by a dot (.), but this can vary based on the assembler and architecture.
Design of Assembler
The assembler generates machine instructions by evaluating the mnemonics in the operation field and by
finding the values of the symbols and literals used in the program. If the assembler does all of this
work in a single scan of the source program, it is called a single-pass assembler; if it needs two
scans, it is a two-pass assembler.
Single-Pass Assembler: A single-pass assembler processes the source program in one pass. It is
faster and more efficient but has limitations in handling forward references.
Scenarios for one-pass assemblers:
o The object code is generated directly in memory for immediate execution (a load-and-go assembler).
o External storage for an intermediate file between two passes is slow or inconvenient to use.
Main problem: forward references, both to data items and to labels on instructions.
Solution: require that all storage areas be defined before they are referenced. This is possible,
although inconvenient, for data items, but forward jumps to instruction labels cannot be easily
eliminated. Instead, the assembler inserts (label, address_to_be_modified) entries into SYMTAB;
usually the addresses to be modified are stored in a linked list.
Forward Reference in a One-pass Assembler:
o If the symbol has not yet been defined, the operand address is omitted for the time being.
o The undefined symbol is entered into SYMTAB and marked as undefined.
o The address of this operand field is added to a list of forward references associated with the
SYMTAB entry.
o When the definition of the symbol is encountered, the assembler scans the reference list and
inserts the address into every location on it.
o At the end of the program, the assembler reports an error if there are still SYMTAB entries
marked as undefined.
o For a load-and-go assembler, SYMTAB is searched for the symbol named in the END statement,
and execution begins at that location if there are no errors.
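The following hedged Python sketch illustrates the forward-reference mechanism just described: an undefined symbol keeps a list of object-code locations to be patched once its definition is seen (a linked list in the classical description, a Python list here). The table layout and field names are assumptions made for the example.

# Sketch of forward-reference handling in a one-pass (load-and-go) assembler.
# SYMTAB entries for undefined symbols keep a fixup list of operand locations.
SYMTAB = {}        # symbol -> {"address": int or None, "fixups": [locations]}
object_code = {}   # operand location -> resolved operand address

def use_symbol(symbol, operand_location):
    entry = SYMTAB.setdefault(symbol, {"address": None, "fixups": []})
    if entry["address"] is None:                  # forward reference
        entry["fixups"].append(operand_location)  # remember where to patch
    else:
        object_code[operand_location] = entry["address"]

def define_symbol(symbol, address):
    entry = SYMTAB.setdefault(symbol, {"address": None, "fixups": []})
    entry["address"] = address
    for loc in entry["fixups"]:                   # backpatch every pending use
        object_code[loc] = address
    entry["fixups"].clear()

use_symbol("NEXT", operand_location=2001)         # NEXT used before definition
define_symbol("NEXT", address=2050)               # definition found later
print(object_code)                                # {2001: 2050}
undefined = [s for s, e in SYMTAB.items() if e["address"] is None]
print(undefined)                                  # symbols still undefined -> errors

At the end of assembly, any symbol whose address is still None corresponds to the "undefined symbol" error mentioned above.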
Two-Pass Assembler
A Two-Pass Assembler makes two passes over the source program. This type
of assembler is commonly used because it resolves forward references
(instructions that refer to labels that are defined later in the program). Here's
how it works in depth:
Pass 1: Symbol Table Construction
Objective: The first pass's primary goal is to gather information about all the
symbols (labels) in the source code.
Key Activities:
1. Location Counter (LC) Initialization: The LC is initialized to the starting address
of the program (typically zero or a user-defined address). It keeps track of the
memory location of each instruction.
2. Scanning the Source Code: The assembler reads the source code line by line.
For each label encountered, it stores the label in the Symbol Table along with
its corresponding address (from the LC).
3. Address Assignment: Each instruction or label is assigned an address based
on the current value of the LC.
4. LC Increment: After each instruction, the LC is incremented by the instruction's
size, ensuring that the next instruction gets the correct address.
5. Error Handling: If a symbol (label) is used in an instruction but not defined in
the source code, it raises an undefined symbol error.
Output: At the end of the first pass, a Symbol Table and intermediate file are
created, containing the instruction addresses and symbol definitions.
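A compact Python sketch of Pass 1 is given below. It assumes a pre-tokenised source in (label, mnemonic, operands) form and a fixed instruction size of one word; a real assembler would compute the size of each instruction and data definition from its opcode table and directives.

# Hedged sketch of Pass 1: build the symbol table with a location counter (LC).
INSTRUCTION_SIZE = 1      # simplifying assumption: every statement is one word

def pass_one(source_lines, start_address=0):
    symtab, intermediate, errors = {}, [], []
    lc = start_address
    for label, mnemonic, operands in source_lines:
        if label:
            if label in symtab:
                errors.append("duplicate symbol " + label)
            symtab[label] = lc                     # record label -> address
        intermediate.append((lc, mnemonic, operands))
        lc += INSTRUCTION_SIZE                     # advance the location counter
    return symtab, intermediate, errors

program = [
    (None,  "MOVER", ["R1", "X"]),
    ("L1",  "ADD",   ["R1", "ONE"]),
    ("X",   "DS",    ["1"]),
    ("ONE", "DC",    ["1"]),
]
symtab, intermediate, errors = pass_one(program, start_address=200)
print(symtab)        # {'L1': 201, 'X': 202, 'ONE': 203}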
Pass 2: Machine Code Generation
Objective: The second pass's purpose is to generate the actual machine code
using the Symbol Table created in Pass 1.
Key Activities:
1. Reading the Source Code and Intermediate File: The assembler reads the
intermediate file and uses the Symbol Table to resolve addresses for all
symbols.
2. Opcode Conversion: Each assembly instruction is converted into its
corresponding machine code (binary or hexadecimal) using the instruction's
mnemonic and the operand addresses.
3. Forward Reference Resolution: All forward references are resolved using the
Symbol Table. Since the symbol definitions were gathered in Pass 1, the
assembler can now assign the correct memory locations to instructions
referencing labels.
4. Object Code Generation: The assembler generates the object code for each
instruction and outputs it in a machine-readable form, typically in an object file
or for immediate execution.
5. Error Handling: It checks for errors such as undefined symbols and incorrect
syntax. If any error is detected, it halts the assembly process and reports the
issues.
Output: The result is a fully assembled object code.
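The sketch below continues the Pass 1 example and shows what Pass 2 might look like: each intermediate statement is translated using an opcode table, and symbolic operands are resolved through the symbol table produced by Pass 1. The opcodes ("04", "01") and the textual object-code format are made-up values for illustration, not a real instruction set.

# Hedged sketch of Pass 2: generate object code from the intermediate listing.
OPTAB = {"MOVER": "04", "ADD": "01", "DS": None, "DC": None}   # assumed opcodes

def pass_two(intermediate, symtab):
    object_code, errors = [], []
    for lc, mnemonic, operands in intermediate:
        if mnemonic not in OPTAB:
            errors.append(str(lc) + ": unknown mnemonic " + mnemonic)
            continue
        opcode = OPTAB[mnemonic]
        if opcode is None:                         # assembler directive: no code
            continue
        resolved = []
        for op in operands:
            if op in symtab:                       # symbolic operand -> address
                resolved.append(str(symtab[op]))
            else:
                resolved.append(op)                # register/constant kept as-is
        object_code.append((lc, opcode, resolved))
    return object_code, errors

symtab = {"L1": 201, "X": 202, "ONE": 203}
intermediate = [(200, "MOVER", ["R1", "X"]), (201, "ADD", ["R1", "ONE"]),
                (202, "DS", ["1"]), (203, "DC", ["1"])]
print(pass_two(intermediate, symtab)[0])
# -> [(200, '04', ['R1', '202']), (201, '01', ['R1', '203'])]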
Advantages of Two-Pass Assembler:
1. Forward Reference Handling: Since it makes two passes, it handles forward
references smoothly.
2. Error Detection: More thorough error checking is possible since all symbols
and addresses are known after the first pass.
3. Modularity: It separates symbol resolution from machine code generation,
making the process more organized.
Disadvantages of Two-Pass Assembler:
1. Inefficient: The assembler has to scan the source code twice, which can be
slower for large programs.
2. Memory Usage: It requires additional memory to store the intermediate file and
Symbol Table across passes.
Working of Assembler
The assembler divides its tasks into two passes:
Pass-1
o Define symbols and literals and remember them in the symbol table and literal table respectively.
o Keep track of the location counter.
o Process pseudo-operations.
o Assign memory addresses to the variables and prepare an intermediate representation of the source
code for Pass-2.
Pass-2
o Generate object code by converting each symbolic op-code into its respective numeric op-code.
o Generate data for literals and look up the values of symbols.
o Read the intermediate code produced by Pass-1 and translate it into the final object code.
First, we will take a small assembly language program to understand how it is processed in the two
passes. Assembly language statement format:
[Label] [Opcode] [operand]
Assembly Program:
Label Op-code operand LC value(Location counter)
JOHN START 200
MOVER R1, ='3' 200
MOVEM R1, X 201
L1 MOVER R2, ='2' 202
LTORG 203
X DS 1 204
END 205
Working of Pass-1 on this program:
Steps 1-4 (JOHN START 200 up to L1 MOVER R2, ='2' at LC 202): the START directive initialises the
location counter to 200. The literals ='3' and ='2' are entered in the literal table without
addresses, X is entered in the symbol table as undefined because it is used in step 3 before its
definition, and the label L1 is assigned the address 202. After step 4 the tables are:
Symbol Address
X –––
L1 202
Literal Address
=’3′ –––
=’2′ –––
Step-5: LTORG 203
Assign an address to the first literal specified by LC value, i.e., 203
Literal Address
=’3′ 203
=’2′ –––
Step-6: X DS 1 204
It is a data declaration statement i.e. X is assigned a data space of 1. But X is a symbol that was
referred to earlier in step 3 and defined in step 6. This condition is called a Forward Reference
Problem where the variable is referred prior to its declaration and can be solved by back-patching.
So now the assembler will assign X the address specified by the LC value of the current step.
Symbol Address
X 204
L1 202
Step-7: END 205
The END directive marks the end of the program, and the remaining literal gets the address specified
by the LC value of the END instruction. Here is the complete symbol and literal table made by pass-1
of the assembler.
Symbol Address
X 204
L1 202
Literal Address
=’3′ 203
=’2′ 205
Now tables generated by pass 1 along with their LC value will go to pass 2 of the assembler for
further processing of pseudo-opcodes and machine op-codes.
Working of Pass-2
Pass-2 of the assembler generates machine code by converting the symbolic machine op-codes into
their respective bit configurations (machine-understandable form). It stores all machine op-codes in
the MOT table (machine op-code table) with the symbolic code, their length, and their bit
configuration. It also processes pseudo-ops and stores them in the POT table (pseudo-op table). The
databases required by pass-2 are:
1. MOT table (machine opcode table)
2. POT table (pseudo opcode table)
3. Base table (storing the value of the base register)
4. LC (location counter)
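As a small illustration of the first two databases, the sketch below models MOT and POT as Python tables and classifies a mnemonic as a machine op or a pseudo-op. The lengths and bit configurations are invented for the example; a real MOT holds the target machine's actual encodings.

# Illustrative MOT/POT lookup (values are assumptions, not real encodings).
MOT = {                      # mnemonic -> (length in words, bit configuration)
    "MOVER": (1, "0000 0100"),
    "MOVEM": (1, "0000 0101"),
    "ADD":   (1, "0000 0001"),
}
POT = {"START", "END", "LTORG", "DS", "DC"}   # pseudo-op table

def classify(mnemonic):
    if mnemonic in MOT:
        length, bits = MOT[mnemonic]
        return ("machine-op", length, bits)
    if mnemonic in POT:
        return ("pseudo-op", 0, None)
    return ("error", 0, None)

print(classify("MOVER"))     # ('machine-op', 1, '0000 0100')
print(classify("LTORG"))     # ('pseudo-op', 0, None)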
2. Types of Macros
Macros can be classified into various types based on their complexity, functionality, and how they
are used. These types offer flexibility in how code is defined and reused.
a. Simple Macros
Description: The most basic type of macro, which simply replaces the macro name with a fixed
block of code. There are no parameters or conditions in simple macros.
Example:
ADD_VALUES MACRO
MOV AX, BX
ADD AX, CX
ENDM
Here, ADD_VALUES is a simple macro that moves the content of BX into AX and adds the value
of CX to it.
b. Parameterized Macros
Description: Macros that accept arguments, allowing the same block of code to be reused with
different values or variables. The arguments act as placeholders that are replaced with actual
values during macro invocation.
Example:
ADD_VALUES MACRO A, B
MOV AX, A
ADD AX, B
ENDM
When invoked with specific parameters like ADD_VALUES 5, 10, the macro expands to use
those values, making it more versatile than a simple macro.
c. Conditional Macros
Description: Macros that incorporate conditional logic, typically using assembly language
directives like IF, ELSE, and ENDIF. This allows the macro to generate different code based on
certain conditions.
Example:
MAXIMUM MACRO A, B
IF A > B
MOV AX, A
ELSE
MOV AX, B
ENDIF
ENDM
Depending on whether A is greater than B, the macro will move the correct value into the AX
register.
d. Nested Macros
Description: Macros that are defined within other macros. This allows for complex hierarchies
and modularization within macro definitions.
Example:
OUTER_MACRO MACRO X
NESTED_MACRO MACRO Y
MOV AX, X
ADD AX, Y
ENDM
NESTED_MACRO 5
ENDM
OUTER_MACRO defines NESTED_MACRO, which can then be invoked with parameters inside
OUTER_MACRO.
e. Recursive Macros
Description: Macros that invoke themselves, either directly or indirectly. This kind of macro is
powerful but should be used carefully to avoid infinite recursion.
Example:
FACTORIAL MACRO N
IF N <= 1
MOV AX, 1
ELSE
MOV AX, N
FACTORIAL N-1
MUL AX, N
ENDIF
ENDM
A one-pass macro processor that alternates between macro definition and macro expansion is able to
handle a "macro within a macro". However, because of the one-pass structure, the definition of a
macro must appear in the source program before any statements that invoke that macro. This
restriction is reasonable and does not create any real inconvenience.
Three main data structures are involved in a one-pass macro processor:
• DEFTAB: stores the macro definitions, including the macro prototype and the macro body. Comment
lines are omitted. References to the macro instruction parameters are converted to a positional
notation for efficiency in substituting arguments.
• NAMTAB: stores the macro names and serves as an index to DEFTAB, containing pointers to the
beginning and end of each definition.
• ARGTAB: used during the expansion of macro invocations. When a macro invocation statement is
encountered, the arguments are stored in this table according to their position in the argument
list.
[Figures omitted: data structures and algorithm of the one-pass macro processor]
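The following Python sketch puts DEFTAB, NAMTAB and ARGTAB together in a toy one-pass macro processor. The positional notation ?1, ?2, ... and the use of Python lists and dictionaries in place of the classical array-plus-pointer layout are simplifying assumptions.

# Toy one-pass macro processor built around the three tables described above.
import re

DEFTAB = []    # macro bodies, parameters converted to positional notation ?1, ?2, ...
NAMTAB = {}    # macro name -> (start index, end index) into DEFTAB

def define_macro(name, params, body_lines):
    start = len(DEFTAB)
    for line in body_lines:
        for i, p in enumerate(params, start=1):
            line = re.sub(r"\b" + re.escape(p) + r"\b", "?" + str(i), line)
        DEFTAB.append(line)
    NAMTAB[name] = (start, len(DEFTAB))            # pointers to begin and end

def expand_macro(name, args):
    ARGTAB = {"?" + str(i): a for i, a in enumerate(args, start=1)}  # per invocation
    start, end = NAMTAB[name]
    expanded = []
    for line in DEFTAB[start:end]:
        for positional, actual in ARGTAB.items():
            line = line.replace(positional, actual)   # substitute the arguments
        expanded.append(line)
    return expanded

define_macro("ADD_VALUES", ["A", "B"], ["MOV AX, A", "ADD AX, B"])
print(expand_macro("ADD_VALUES", ["5", "10"]))
# -> ['MOV AX, 5', 'ADD AX, 10']

Because definition and expansion share the same tables, this structure can also handle a macro defined inside another macro, as long as each definition is processed before its first invocation.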
2. Functions of a Loader
The loader performs several vital functions that are crucial for
the proper execution of programs. Below are the main tasks
that a loader typically handles:
1. Memory Allocation
The loader allocates memory for the program’s text segment
(where instructions are stored), data segment (for global/static
variables), and stack (for function calls and local variables).
Example:
If a program’s text segment needs 100 KB of memory and its
data segment requires 50 KB, the loader allocates 150 KB of
memory and ensures that each segment is placed in separate
memory locations.
2. Relocation
Relocation involves adjusting memory addresses based on
where the program is actually loaded. During the linking stage,
memory addresses are typically relative (e.g., starting from 0).
When the program is loaded, the loader updates these
addresses to reflect their true positions in physical memory.
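Example (a hedged sketch): the Python fragment below shows the idea of relocation, assuming the object code records addresses relative to 0 together with a relocation table listing which words contain addresses. Real object formats such as ELF or PE are far richer; only the core address adjustment is shown.

# Minimal relocation sketch: add the load address to every word that holds an address.
def relocate(code_words, relocation_table, load_address):
    relocated = list(code_words)
    for index in relocation_table:          # offsets flagged as containing addresses
        relocated[index] += load_address
    return relocated

# words 1 and 3 hold addresses assembled relative to 0
code = [0x10, 0x0040, 0x20, 0x0054]
print([hex(w) for w in relocate(code, relocation_table=[1, 3], load_address=0x4000)])
# -> ['0x10', '0x4040', '0x20', '0x4054']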
3. Symbol Resolution
In the case of dynamic loading, where a program depends on
external libraries, the loader resolves external symbols (e.g.,
function names, variable names) by finding the appropriate
libraries and linking them to the main program.
4. Dynamic Loading of Libraries
In modern systems, programs often rely on dynamically loaded
libraries (e.g., .dll files in Windows or .so files in Linux). The
loader is responsible for loading these libraries into memory at
runtime.
5. Transfer of Control
After all initializations and setups (memory allocation, symbol
resolution, etc.), the loader passes control to the program’s
entry point, such as the main() function in C/C++ programs.
Example of Loader Functionality:
When a user executes a program, the loader may find that it
depends on a dynamic library like libssl.so for SSL encryption.
The loader searches for this library, loads it into memory, links
it to the main program, and updates all references to libssl.so
functions.
3. Types of Loaders
Different types of loaders are used depending on the system
design and the specific needs of the application. The following
are common types:
1. Absolute Loader
In an absolute loader, the executable file contains absolute
memory addresses. The loader simply transfers the program to
a specific location in memory.
Advantages:
o Simple to implement, and loading is fast because no relocation or linking needs to be performed at
load time.
5. Design of Loaders
1. Absolute Loader
An absolute loader directly loads programs without any
address modifications. It’s simple but lacks flexibility. Absolute
loaders are often used in embedded systems where programs
are always loaded into the same memory locations.
2. Dynamic Loading and DLL (Dynamic Link Library)
A dynamic loader loads shared libraries at runtime. This is
common in modern operating systems. For example, Windows
uses .dll files to dynamically load and link libraries to running
applications.
Advantages of DLLs:
Memory efficiency: Shared libraries reduce memory
consumption.
Code reusability: Libraries can be reused across different
programs.
Ease of maintenance: Updating the library updates all
programs that depend on it.
Disadvantages of DLLs:
Runtime overhead: Linking happens at runtime, adding
overhead.
Compatibility issues: Different programs may require
incompatible versions of the same library, leading to "DLL hell."
Example of Dynamic Loading:
In Windows, a program may use a DLL for graphics rendering.
When the program is executed, the loader dynamically loads
the necessary DLL into memory, allowing the program to use
the library's functions.
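A hedged illustration of dynamic loading from the application side: Python's standard ctypes module asks the operating system's loader to map a shared library (.so / .dll) into the running process at runtime and then resolves a symbol from it. Using the C math library ("m") is an assumption for the example; the library name and its availability differ across platforms.

# Run-time loading of a shared library via the OS loader (illustrative example).
import ctypes
import ctypes.util

libm_path = ctypes.util.find_library("m")   # locate the shared math library, if present
libm = ctypes.CDLL(libm_path)               # the OS loader maps it into this process

libm.sqrt.restype = ctypes.c_double         # declare the C function's signature
libm.sqrt.argtypes = [ctypes.c_double]
print(libm.sqrt(2.0))                       # 1.4142135623730951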
UNIT V: Basics of Compiler: A Simple Compiler
A compiler is a computer program that transforms source code written in a high-level language into
low-level machine language. It translates code written in one programming language into another
language without changing its meaning. The compiler also makes the resulting code efficient,
optimizing it for execution time and memory space.
The compiling process includes basic translation mechanisms and error detection. The compilation
process goes through lexical, syntax, and semantic analysis at the front end, and code generation
and optimization at the back end.
Features of Compilers
Correctness
Speed of compilation
Preserve the correct meaning of the code
The speed of the target code
Recognize legal and illegal program constructs
Good error reporting/handling
Code debugging help
Types of Compiler
Single Pass Compilers
Two Pass Compilers
Multipass Compilers
Single Pass Compiler
In a single-pass compiler, the source code is transformed directly into machine code in one pass. For
example, Pascal.
Two Pass Compiler
A two-pass compiler is divided into two sections: the front end, which maps the legal source code
into an intermediate representation, and the back end, which maps that intermediate representation
onto the target machine code.
Multipass Compiler
The multipass compiler processes the source code or syntax tree of a program several times. It
divides a large program into multiple small passes and processes them, developing multiple
intermediate codes. Each pass takes the output of the previous pass as its input, so it requires
less memory. It is also known as a 'wide compiler'.
> During analysis, the operations implied by the source program are determined and recorded in a
hierarchical structure called a tree.
> Often, a special kind of tree called a syntax tree is used.
> In a syntax tree, each node represents an operation and the children of the node represent the
arguments of the operation.
> For example, in the syntax tree of an assignment statement, the root is the assignment operator
and its children are the variable being assigned and the expression whose value it receives.
The analysis-synthesis model is a foundational concept in compiler design, breaking the
compilation process into two major phases: Analysis and Synthesis.
Analysis Phase
The analysis phase is responsible for examining the source code and breaking it down into its
components. It consists of the following steps:
1. Lexical Analysis:
o This is the first step, where the compiler reads the source code and converts
it into tokens (basic syntactic units).
o The lexical analyzer (lexer) identifies keywords, operators, identifiers, and
literals using regular expressions and finite automata.
o Example: For the statement int a = 5;, the tokens generated might be int, a,
=, 5, and ;.
2. Syntax Analysis:
o In this step, the compiler parses the token sequence to ensure it adheres to
the grammatical structure of the language.
o The syntax analyzer (parser) constructs a syntax tree (or parse tree) that
represents the hierarchical structure of the program.
o Example: The parser might build a tree that shows the assignment operation,
with a as the variable being assigned and 5 as the value.
3. Semantic Analysis:
o This step verifies the logical correctness of the program, checking for
semantic errors such as type mismatches or undeclared variables.
o The semantic analyzer checks that operations are valid for the types involved
(e.g., you cannot add a string to an integer).
o Example: If b is declared as a string and the program attempts to add it to an
integer, a semantic error will be flagged.
Synthesis Phase
The synthesis phase takes the analyzed representation and constructs the final machine code. It
consists of the following steps:
1. Intermediate Code Generation:
o After the analysis phase, the compiler produces an intermediate
representation (IR) of the code that is easier to manipulate and optimize than
the original source code.
o Example: The statement int a = 5; might be represented in IR as LOAD 5
INTO a.
2. Optimization:
o The compiler optimizes the intermediate code to improve performance and
reduce resource usage. Optimization can occur at different levels: local
optimizations (within a single function) and global optimizations (across
multiple functions).
o Example: If a variable is assigned a value that is not used later in the
program, the compiler can eliminate that assignment to save space and time.
3. Code Generation:
o The final step of the synthesis phase involves generating the target machine
code from the optimized intermediate representation.
o The code generator produces assembly or machine language instructions
specific to the target architecture.
o Example: The IR instruction LOAD 5 INTO a might be translated into a
machine instruction like MOV R1, 5 followed by MOV a, R1.
Together, these phases convert the source code into target code by dividing it into tokens, building
parse trees, generating intermediate code, and optimizing it before the final code is produced.
Lexical Analysis
Lexical analysis is the first phase of the compiler; it works as a text scanner. This phase scans the
source code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer
represents these lexemes in the form of tokens as:
<token-name, attribute-value>
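As a hedged sketch of this phase, the short Python tokenizer below turns the statement int a = 5; into <token-name, attribute-value> pairs using regular expressions. The token categories and patterns are illustrative and far from a complete lexer for any real language.

# Tiny lexical analyser producing <token-name, attribute-value> pairs.
import re

TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|float|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("NUMBER",     r"\d+"),
    ("ASSIGN",     r"="),
    ("SEMICOLON",  r";"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in TOKEN_SPEC))

def tokenize(source):
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        if kind != "SKIP":                  # whitespace is discarded
            tokens.append((kind, match.group()))
    return tokens

print(tokenize("int a = 5;"))
# -> [('KEYWORD', 'int'), ('IDENTIFIER', 'a'), ('ASSIGN', '='), ('NUMBER', '5'), ('SEMICOLON', ';')]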
Syntax Analysis
The next phase is called the syntax analysis or parsing. It takes the token produced by lexical
analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements are
checked against the source code grammar, i.e. the parser checks if the expression made by the
tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language. For
example, it checks that values are assigned between compatible data types and reports an error for
operations such as adding a string to an integer.
Also, the semantic analyzer keeps track of identifiers, their types and expressions; whether
identifiers are declared before use or not etc. The semantic analyzer produces an annotated syntax
tree as an output.
Intermediate Code Generation
After semantic analysis the compiler generates an intermediate code of the source code for the
target machine. It represents a program for some abstract machine. It is in between the high-level
language and the machine language. This intermediate code should be generated in such a way
that it makes it easier to be translated into the target machine code.
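The following hedged Python sketch shows one common form of intermediate code, three-address code: a tiny expression tree for a = b + c * 5 is flattened into instructions that each use at most one operator, introducing temporaries t1, t2, ... (the tuple-based tree and the temporary names are illustrative assumptions).

# Generating three-address intermediate code from a small syntax tree.
temp_count = 0

def new_temp():
    global temp_count
    temp_count += 1
    return "t" + str(temp_count)

def gen(node, code):
    """Return the name holding node's value, appending IR statements to code."""
    if isinstance(node, (int, str)):        # leaf: constant or variable name
        return str(node)
    op, left, right = node                  # interior node: (operator, left, right)
    l = gen(left, code)
    r = gen(right, code)
    t = new_temp()
    code.append(t + " = " + l + " " + op + " " + r)
    return t

code = []
result = gen(("+", "b", ("*", "c", 5)), code)   # tree for b + c * 5
code.append("a = " + result)
print("\n".join(code))
# t1 = c * 5
# t2 = b + t1
# a = t2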
Code Optimization
The next phase performs code optimization on the intermediate code. Optimization removes unnecessary
code lines and rearranges the sequence of statements in order to speed up program execution without
wasting resources (CPU, memory).
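A minimal sketch of one such local optimization, constant folding, is shown below: any three-address statement whose operands are both constants is evaluated at compile time. The statement format and the restriction to + and * are assumptions made to keep the example short.

# Constant folding over simple three-address statements of the form "x = a op b".
def fold_constants(code):
    optimized = []
    for stmt in code:
        target, expr = [part.strip() for part in stmt.split("=", 1)]
        pieces = expr.split()
        if len(pieces) == 3 and pieces[0].isdigit() and pieces[2].isdigit():
            a, op, b = int(pieces[0]), pieces[1], int(pieces[2])
            value = a + b if op == "+" else a * b   # evaluate at compile time
            optimized.append(target + " = " + str(value))
        else:
            optimized.append(stmt)                  # leave everything else untouched
    return optimized

print(fold_constants(["t1 = 4 * 5", "t2 = b + t1", "a = t2"]))
# -> ['t1 = 20', 't2 = b + t1', 'a = t2']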
Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code and
maps it to the target machine language. The code generator translates the intermediate code into a
sequence of (generally) relocatable machine code. This sequence of machine instructions performs the
same task as the intermediate code.
Symbol Table
The symbol table is a data structure maintained throughout all the phases of a compiler. All
identifier names, along with their types, are stored here. The symbol table makes it easier for the
compiler to quickly search for an identifier record and retrieve it. The symbol table is also used
for scope management.
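A hedged sketch of such a symbol table with scope management is given below, implemented as a stack of dictionaries in which the innermost scope is searched first. The entry fields and lookup rule are the usual textbook choices, not any specific compiler's layout.

# Symbol table with nested scopes (stack of dictionaries).
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                       # global scope at the bottom

    def enter_scope(self):
        self.scopes.append({})                   # opening a block or function

    def exit_scope(self):
        self.scopes.pop()                        # leaving the block

    def declare(self, name, type_):
        self.scopes[-1][name] = {"type": type_}  # declare in the current scope

    def lookup(self, name):
        for scope in reversed(self.scopes):      # innermost scope wins
            if name in scope:
                return scope[name]
        return None                              # undeclared identifier

table = SymbolTable()
table.declare("a", "int")
table.enter_scope()
table.declare("a", "float")                      # shadows the outer 'a'
print(table.lookup("a"))                         # {'type': 'float'}
table.exit_scope()
print(table.lookup("a"))                         # {'type': 'int'}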
Error Handling: Throughout the compilation process, the compiler must handle errors gracefully,
providing useful feedback to the programmer regarding syntax or semantic issues.