
DSA_Module1_Notes_stu (1) (1)

The document provides an overview of data structures, including their classifications into primitive and non-primitive types, and outlines various operations performed on them. It details linear data structures like arrays, linked lists, stacks, and queues, as well as non-linear structures such as trees and graphs, highlighting their advantages and disadvantages. Additionally, it discusses the importance of selecting appropriate data structures based on problem requirements and operational constraints.

DATA STRUCTURES AND APPLICATIONS

Subject Code: 18CS32


Module -1
Introduction: Data Structures, Classifications (Primitive & Non Primitive), Data structure
Operations, Review of Arrays, Structures, Self-Referential Structures, and Unions.
Pointers and Dynamic Memory Allocation Functions. Representation of Linear Arrays in
Memory, Dynamically allocated arrays, Array Operations: Traversing, inserting, deleting,
searching, and sorting. Multidimensional Arrays, Polynomials and Sparse Matrices.
Strings: Basic Terminology, Storing, Operations and Pattern Matching algorithms.
Programming Examples.

Data:
The term data means a value or set of values. It specifies either the value of a variable or a
constant (e.g., marks of students, name of an employee, address of a customer, value of pi, etc.).
Thus, data are simply a collection of facts and figures. A data item refers to a single unit of values.
A data item that does not have subordinate data items is categorized as an
elementary item, whereas one that is composed of one or more subordinate data items is called a
group item. For example, a student's name may be divided into three sub-items (first name,
middle name, and last name), but the USN of a student would normally be treated as a single
item.

In the above example, (USN, Age, Gender, First, Middle, Last, Street, Area) are elementary data
items, whereas (Name, Address) are group data items.
A record is a collection of related data items. For example, the name, address, course,
and marks obtained are individual data items of a student record, and all these data items can be
grouped together to form a record.
A file is a collection of related records. For example, if there are 60 students in a class,
then there are 60 student records. All these related records are stored in a file.
Moreover, each record in a file may consist of multiple data items, but the value of a
certain data item uniquely identifies the record in the file. Such a data item K is called a primary
key, and the values K1, K2, ... in such a field are called keys or key values. For example, in a
student's record that contains roll number, name, address, course, and marks obtained, the field
roll number is a primary key. The rest of the fields (name, address, course, and marks) cannot serve as
primary keys, since two or more students may have the same name, or may have the same
address (as they might be staying at the same place), or may be enrolled in the same course, or
may have obtained the same marks.
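The ideas of record, file, and primary key above can be sketched in C with a structure and an array of structures. This is only an illustration: the field names, the array sizes, and the helper find_by_roll are assumptions made for the example and do not come from the notes.

```c
#include <stddef.h>

/* One student record: a collection of related data items. */
struct student {
    int  roll_no;     /* primary key: assumed unique for every record */
    char name[40];    /* kept as a single field here for brevity */
    char course[20];
    int  marks;
};

/* Search a "file" (here simply an in-memory array of records) by primary key.
   Returns a pointer to the matching record, or NULL if no record has that key. */
const struct student *find_by_roll(const struct student *file, size_t n, int roll_no)
{
    for (size_t i = 0; i < n; i++) {
        if (file[i].roll_no == roll_no)
            return &file[i];
    }
    return NULL;
}
```

Searching by name or marks instead could match several records, which is why those fields are not suitable as primary keys.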
This organization and hierarchy of data is taken further to form more complex types of
data structures.
An entity is something that has certain attributes or properties which may be assigned
values. The values themselves may be either numeric or non-numeric.

Example:

Entities with similar attributes (e.g. all the employees in an organization) form an entity
set. Each attribute of an entity set has a range of values, the set of all possible values that could
be assigned to the particular attribute.
The term "information" is sometimes used for data with given attributes; in other words,
information is meaningful or processed data.

Data Structure:
A data structure is a particular way of storing and organizing data in a computer's memory so
that it can be used efficiently. Data may be organized in many different ways. Hence, a data
structure can also be defined as a mathematical or logical model of a particular organization of data.
The choice of a particular data model depends on two considerations. First, it must be
rich enough in structure to mirror the actual relationships of the data in the real world. Second,
the structure should be simple enough that one can effectively process the data whenever
necessary.
In practical terms, a data structure is a group of data elements that are put together
under one name.
Data structures are used in almost every program or software system. Some common
examples of data structures are arrays, linked lists, queues, stacks, binary trees, and hash tables.
Data structures are widely applied in the following areas:
- Compiler design
- Operating system
- Statistical analysis package
- DBMS
- Numerical analysis
- Simulation
- Artificial intelligence
- Graphics
Applying an appropriate data structure usually leads to a more efficient solution to a
problem. When selecting a data structure to solve a problem, the following steps should be
performed.
1. Analyze the problem to determine the basic operations that must be supported. For
example, the basic operations may include inserting, deleting, or searching for a data item in the
data structure.
2. Quantify the resource constraints for each operation.
3. Select the data structure that best meets these requirements.
Data Structure Example Applications
1. How does Google quickly find web pages that contain a search term?
2. What's the fastest way to broadcast a message to a network of computers?
3. How can a subsequence of DNA be quickly found within the genome?
4. How does your operating system track which memory (disk or RAM) is free?

CLASSIFICATION OF DATA STRUCTURES


Data structures are generally categorized into two classes: primitive and non-primitive data
structures.
Primitive and Non-primitive Data Structures

Primitive data structures are the fundamental data types which are supported by a
programming language. Some basic data types are integer, real, character, and boolean. The
terms 'data type', 'basic data type', and 'primitive data type' are often used interchangeably.
Non-primitive data structures are those data structures which are created using primitive
data structures. Examples of such data structures include linked lists, stacks, trees, and graphs.
Non-primitive data structures can further be classified into two categories:
- Linear Data Structure
- Non-linear Data Structure
1. Linear Data Structures
If the elements of a data structure are stored in a linear or sequential order, then it is a linear
data structure. Common examples of linear data structures are
- Arrays
- Queues
- Stacks
- Linked lists
Linear data structures can be represented in memory in two different ways.
i) One way is to have a linear relationship between elements by means of sequential
memory locations. These linear structures are called arrays.
ii) The other way is to have a linear relationship between elements by means of links.
These linear structures are called linked lists.
Arrays:
The simplest type of data structure is a linear (or one-dimensional) array. An array is a
collection of similar data elements. The elements of the array are stored in consecutive memory
locations and are referenced by an index (also known as the subscript).

Linked List:
A linked list is a very flexible, dynamic data structure in which elements (called nodes) form
a sequential list; the linear order is given by means of pointers. In contrast to static arrays, a
programmer need not decide in advance how many elements will be stored in the linked list. This feature
enables programmers to write robust programs which require less maintenance.
In a linked list, each node is allocated space as it is added to the list. Every node in the list
points to the next node in the list. Therefore, in a linked list, every node contains the following two
types of data:
i) The value of the node or any other data that corresponds to that node
ii) A pointer or link to the next node in the list
The last node in the list contains a NULL pointer to indicate that it is the end or tail of the
list. Since the memory for a node is dynamically allocated when it is added to the list, the total
number of nodes that may be added to a list is limited only by the amount of memory available.
Figure 1.2 shows a linked list of seven nodes.

Advantage: Easier to insert or delete data elements


Disadvantage: i) Slow search operation
ii) Requires more memory space
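The node layout described above (a value plus a link to the next node, with NULL marking the tail) can be sketched in C as follows. The names node, push_front, list_length, and free_list are assumptions chosen for this example, and int is used as the data type only for simplicity.

```c
#include <stdlib.h>

struct node {
    int data;            /* the value stored in the node */
    struct node *next;   /* link to the next node; NULL marks the tail */
};

/* Allocate a node dynamically and place it at the front of the list.
   Returns the new head, or the old head unchanged if malloc fails. */
struct node *push_front(struct node *head, int value)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return head;
    n->data = value;
    n->next = head;
    return n;
}

/* Count the nodes by following the links until NULL is reached. */
size_t list_length(const struct node *head)
{
    size_t count = 0;
    while (head != NULL) {
        count++;
        head = head->next;
    }
    return count;
}

/* Release every node of the list. */
void free_list(struct node *head)
{
    while (head != NULL) {
        struct node *next = head->next;
        free(head);
        head = next;
    }
}
```

Inserting at the front needs no shifting of existing elements, which is the advantage noted above; finding a node still requires walking the links one by one, and every node carries an extra pointer.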

Stack:
A stack is a linear data structure in which insertion and deletion of elements are done at only one
end, which is known as the top of the stack. A stack is called a last-in, first-out (LIFO) structure
because the last element added to the stack is the first element deleted from the
stack. In the computer's memory, stacks can be implemented using either arrays or linked lists.
The variable Top stores the address (or index) of the topmost element of the stack. It is at this position that
an element is added or deleted.
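A minimal array-based sketch of a stack is given below to make the role of the top position concrete. The capacity STACK_MAX and the function names are assumptions for illustration; a linked-list version would work equally well.

```c
#include <stdbool.h>

#define STACK_MAX 100   /* assumed fixed capacity for this sketch */

struct stack {
    int items[STACK_MAX];
    int top;            /* index of the topmost element; -1 when empty */
};

void stack_init(struct stack *s)
{
    s->top = -1;
}

/* Push: add an element at the top. Returns false on overflow. */
bool stack_push(struct stack *s, int value)
{
    if (s->top == STACK_MAX - 1)
        return false;
    s->items[++s->top] = value;
    return true;
}

/* Pop: remove the element at the top. Returns false on underflow. */
bool stack_pop(struct stack *s, int *out)
{
    if (s->top == -1)
        return false;
    *out = s->items[s->top--];
    return true;
}
```

Pushing 1, 2, 3 and then popping once yields 3, the last element pushed, which is the LIFO behaviour.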

Queue:
A queue is a first-in, first-out (FIFO) data structure in which the element that is inserted first is the
first one to be taken out. The elements in a queue are added at one end called the rear and
removed from the other end called the front. Like stacks, queues can be implemented by using
either arrays or linked lists. Every queue has front and rear variables that point to the positions
where deletions and insertions can be done, respectively.
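The front and rear variables can be illustrated with a small array-based queue. The sketch below assumes a circular array (indices wrap around) and an explicit element count; these choices, the capacity QUEUE_MAX, and the function names are assumptions for the example rather than something prescribed by the notes.

```c
#include <stdbool.h>

#define QUEUE_MAX 100   /* assumed fixed capacity for this sketch */

struct queue {
    int items[QUEUE_MAX];
    int front;   /* index of the next element to be removed */
    int rear;    /* index at which the next element will be inserted */
    int count;   /* number of elements currently stored */
};

void queue_init(struct queue *q)
{
    q->front = 0;
    q->rear = 0;
    q->count = 0;
}

/* Insert at the rear. Returns false if the queue is full. */
bool enqueue(struct queue *q, int value)
{
    if (q->count == QUEUE_MAX)
        return false;
    q->items[q->rear] = value;
    q->rear = (q->rear + 1) % QUEUE_MAX;   /* wrap around at the end */
    q->count++;
    return true;
}

/* Remove from the front. Returns false if the queue is empty. */
bool dequeue(struct queue *q, int *out)
{
    if (q->count == 0)
        return false;
    *out = q->items[q->front];
    q->front = (q->front + 1) % QUEUE_MAX;
    q->count--;
    return true;
}
```

Enqueuing 1, 2, 3 and then dequeuing once yields 1, the first element inserted, which is the FIFO behaviour.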
2. Non-linear Data Structures
If the elements of a data structure are not stored in a sequential order, then it is a non-linear data
structure. The relationship of adjacency is not maintained between elements of a non-linear data
structure. Examples include trees and graphs.
Tree:
A tree is a non-linear data structure which consists of a collection of nodes arranged in
a hierarchical order. One of the nodes is designated as the root node, and the remaining nodes
can be partitioned into disjoint sets such that each set is a sub-tree of the root. The simplest form
of a tree is a binary tree. A binary tree consists of a root node and left and right sub-trees, where
both sub-trees are also binary trees. Each node contains a data element, a left pointer which
points to the left sub-tree, and a right pointer which points to the right sub-tree.
Advantage: Provides quick search, insert, and delete operations (when the tree is kept ordered and reasonably balanced)
Disadvantage: Complicated deletion algorithm
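The node layout of a binary tree (data, left pointer, right pointer) can be sketched in C as below. To show why search and insert can be quick, the sketch assumes binary search tree ordering (smaller values to the left, larger to the right); that ordering, and the names tnode, bst_insert, bst_contains, and bst_free, are assumptions for illustration and are not stated in the notes.

```c
#include <stdbool.h>
#include <stdlib.h>

struct tnode {
    int data;               /* data element */
    struct tnode *left;     /* points to the left sub-tree */
    struct tnode *right;    /* points to the right sub-tree */
};

/* Insert value into the tree rooted at root and return the (possibly new) root.
   Duplicates are ignored; if malloc fails the tree is left unchanged. */
struct tnode *bst_insert(struct tnode *root, int value)
{
    if (root == NULL) {
        struct tnode *n = malloc(sizeof *n);
        if (n == NULL)
            return NULL;
        n->data = value;
        n->left = NULL;
        n->right = NULL;
        return n;
    }
    if (value < root->data)
        root->left = bst_insert(root->left, value);
    else if (value > root->data)
        root->right = bst_insert(root->right, value);
    return root;
}

/* Search: at each node, go left or right, discarding the other sub-tree. */
bool bst_contains(const struct tnode *root, int value)
{
    while (root != NULL) {
        if (value == root->data)
            return true;
        root = (value < root->data) ? root->left : root->right;
    }
    return false;
}

/* Release all nodes (post-order: children first, then the node itself). */
void bst_free(struct tnode *root)
{
    if (root == NULL)
        return;
    bst_free(root->left);
    bst_free(root->right);
    free(root);
}
```

Deletion is omitted here because, as noted above, it is the more complicated operation: a node with two children has to be replaced by a suitable descendant.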
Graph:
A graph is a non-linear data structure which is a collection of vertices (also called nodes) and
edges that connect these vertices. A graph is often viewed as a generalization of the tree
structure, where instead of a purely parent-to-child relationship between tree nodes, any kind of
complex relationship between the nodes can exist.
In a tree structure, a node can have any number of children but only one parent. A graph, on
the other hand, relaxes all such restrictions. Fig. 1.8 shows a graph with five nodes.

Advantage: Best models real-world situations

Disadvantage: Some algorithms are slow and very complex

Every data structure has its own strengths and weaknesses. Also, every data structure suits
specific problems, depending upon the operations performed and the data organization. Fig. 1.9
shows a detailed data structure classification.

Fig. 1.9 Classification of Data Structure

Data structures are the building blocks of a program and so the selection of a particular
data structure addresses the following two things:
i) Data structure should be rich enough in structure for reflecting real relationship
existing between data.
ii) A structure must be simple such that we can process data effectively whenever
needed.

DATA STRUCTURE OPERATIONS:


The data appearing in our data structures are processed by means of certain operations. In fact,
the particular data structure that one chooses for a given situation depends largely on the
frequency with which specific operations are performed.
The different operations that can be performed on the various data structures are as follows.
i) Create: Produces a new, empty data structure of appropriate size.
ii) Insert: Adds new data items to the given list of data items. For example, to add the details
of a new student who has recently joined the course.
iii) Delete: Removes (deletes) a particular data item from the given collection of data items.
For example, to delete the name of a student who has left the course.
iv) Search: Finds the location of one or more data items that satisfy the given constraint or finds
the location of the desired data item with a given key value. Such a data item may or may not
be present in the given collection of data items. For example, to find the names of all the
students who secured 100 marks in mathematics.
v) Sort: Arranges the data items in some particular order, such as ascending or descending order,
depending on the type of application. For example, arranging the names of students in a
class in alphabetical order, or finding the top three winners by arranging the
participants' scores in descending order and then extracting the top three.
vi) Merge: Combines two lists of sorted data items to form a single list of sorted data items.
vii) Traversal: Means accessing each data item exactly once so that it can be processed.
Traversing is also called visiting. For example, to print the names of all the students in a
class.
Many a time, two or more operations are applied together in a given situation. For
example, if we want to delete the details of a student whose name is X, then we first have to
search the list of students to find whether the record of X exists and, if it exists, at
which location, so that the details can be deleted from that particular location.

ARRAYS
Array Definition:
An array is a list of a finite number n of homogeneous data elements (i.e. data elements of the
same type) such that:
i) The elements of the array are referenced respectively by an index set consisting of n
consecutive numbers.
ii) The elements of the array are stored respectively in successive memory locations.
We assume that the index set consists of the integers 1, 2, ..., n, unless explicitly stated otherwise.
The number n of elements is called the length or size of the array. In general, the length of the
array can be obtained from the index set by using the formula:
Length = UB - LB + 1, where UB is the upper bound, i.e. the largest index of the array, and
LB is the lower bound, i.e. the smallest index of the array.

Suppose we have an array A with subscripts 1, 2, 3, ..., n. Then the elements of A are denoted by

i) the subscript notation A1, A2, A3, ..., An, or
ii) the parenthesis notation A(1), A(2), A(3), ..., A(n), or
iii) the bracket notation A[1], A[2], A[3], ..., A[n]
The number K in A[K] is called a subscript or index and A[K] is called a subscripted variable.
Representation of Arrays:
Declaring an array means specifying the following:
- Data type: the kind of values it can store, for example, int, char, float, double.
- Name: to identify the array.
- Size: the maximum number of values that the array can hold.
In C, arrays are declared using the following syntax:
data-type name[size];
For example,
int marks[10];
The above statement declares an array marks that contains 10 elements.
In C, the array index starts from zero. The first element will be stored in marks[0], the second
element in marks[1], and so on. Therefore, the last element, that is the 10th element, will be
stored in marks[9].
In memory, the array will be stored in consecutive memory locations as shown in Fig.
1.1. The computer need not keep track of the address of every element of the array, but needs to
keep track only of the address of the first element, called the base address of the array.

The array name denotes

i) the name of the array
ii) the address of the array
iii) the address of the first byte of the array.
The subscript or index represents the offset from the beginning of the array to the element
being referenced. Thus, with just the array name and the index, the computer can calculate the
address of any element in the array, since an array stores all its data elements in consecutive
memory locations. The address of the data elements can simply be calculated using the base
address.
The compiler calculates the address of any element of A by the following formula:
Address(A[K]) = Base(A) + w * (K - LB)
where Base(A) denotes the base address of the array A,
K is any subscript,
w is the number of words per memory cell for the array A (the size of one element), and
LB is the lower bound of the array.
The time to calculate Address(A[K]) is essentially the same for any value of K. Further,
given any subscript K, one can locate and access the content of A[K] without scanning any other
element of A.
Ex. 1.1 Given an array int marks[] = {99, 67, 78, 56, 88, 90, 34, 85}, calculate the address of
marks[4] if the base address = 1000.
Solution: Assume that an int occupies 2 bytes, as on older 16-bit compilers (many current
compilers use 4 bytes, so in general w = sizeof(int)). The LB of a C array is 0.
Therefore, the address of marks[4] = 1000 + 2 * (4 - 0) = 1000 + 8 = 1008
Fig. 1.2 shows the storage representation of array marks in memory.

Fig. 1.2 Storage representation of array marks in memory
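The address formula can be checked with a few lines of C. The helper name element_address is an assumption for this sketch, and addresses are modelled as plain integers so that the numbers of Ex. 1.1 can be reproduced regardless of the real size of int on a given machine.

```c
/* Address(A[K]) = Base(A) + w * (K - LB)
   base : base address of the array (modelled as a plain integer)
   w    : size of one element
   k    : subscript of the element
   lb   : lower bound of the index set (0 for C arrays) */
unsigned long element_address(unsigned long base, unsigned long w, long k, long lb)
{
    return base + w * (unsigned long)(k - lb);
}
```

With base = 1000, w = 2, K = 4, and LB = 0 this gives 1008, as in Ex. 1.1. With w = 4 (a common size of int today) the same call gives 1016.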


Array Operations:
Some of the operations that can be performed on arrays are:
i) Traversing an array
ii) Inserting an element in an array
iii) Deleting an element from an array
iv) Merging two arrays
v) Searching an element in an array
vi) Sorting an array in ascending or descending order

i) Traversing an array
Traversing an array means accessing each and every element of the array for a specific purpose.
Traversing the data elements of an array A can include printing every element, counting the
number of elements with a given property, or performing any process on these elements. Since an
array is a linear data structure, i.e. all its elements form a sequence, traversing its elements is
very simple and straightforward.
Algorithm: To traverse an array:
Here A is an array with lower bound LB and upper bound UB. The algorithm traverses A applying
an operation PROCESS to each element of A.
Step 1: [Initialize counter] SET I = LB
Step 2: Repeat Steps 3 to 4 while I <= UB
Step 3: [Visit element] Apply PROCESS to A[I]
Step 4: [Increment counter] SET I = I + 1
[End of step 2 loop]
Step 5: EXIT
The operation PROCESS in the traversal algorithm may use certain variables which must be
initialized before PROCESS is applied to any of the elements in the array. Accordingly, the
algorithm may need to be preceded by such an initialization step.
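As an illustration of the traversal algorithm, the sketch below counts the elements with a given property (here, elements greater than a threshold). The function name count_above and the chosen property are assumptions for the example. The counter is the kind of variable that must be initialized before PROCESS is applied.

```c
/* Traverse a[0..n-1] (LB = 0, UB = n - 1 in C) and count the elements
   that are greater than threshold. */
int count_above(const int a[], int n, int threshold)
{
    int count = 0;                  /* initialization step before the traversal */
    for (int i = 0; i < n; i++) {   /* I runs from LB to UB */
        if (a[i] > threshold)       /* PROCESS applied to A[I] */
            count++;
    }
    return count;
}
```

For the array marks of Ex. 1.1 and a threshold of 80, the traversal visits all eight elements once and counts 99, 88, 90, and 85.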

ii) Inserting
Inserting refers to the operation of adding another element to the array A.
Inserting an element at the "end" of an array can be easily done provided the
memory space allocated for the array is large enough to accommodate the additional element. All
we have to do is to add 1 to the number of elements and assign the value of the element. But, if
we need to insert an element in the middle of an array, then, on average, half of the elements must
be moved down to new locations to accommodate the new element and keep the order of the
other elements.
Algorithm: To insert an element into an array: INSERT (A, N, POS, ELEM)
Here, A is the array in which the element has to be inserted,
N is the number of elements in the array,
POS is the position at which the element has to be inserted; it is a positive integer such that POS
<= N,
ELEM is the value that has to be inserted.
The algorithm inserts an element ELEM into the POSth position in the array A.
Step 1: [Initialize counter] SET I = N
Step 2: Repeat Steps 3 and 4 while I >= POS
Step 3: [Move Ith element down] SET A[I + 1] = A[I]
Step 4: [Decrease counter] SET I = I - 1
[End of Step 2 Loop]
Step 5: [Update N] SET N = N + 1
Step 6: [Insert element] SET A[POS] = ELEM
Step 7: EXIT
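The INSERT algorithm above can be written in C roughly as follows. Note the differences, which are choices made for this sketch: positions are 0-based as in C, the capacity of the array is passed explicitly so that overflow can be detected, and the function name array_insert is an assumption.

```c
/* Insert elem at 0-based position pos in a[0..*n-1].
   capacity is the total number of slots available in a.
   Returns 1 on success, 0 if the array is full or pos is out of range. */
int array_insert(int a[], int *n, int capacity, int pos, int elem)
{
    if (*n >= capacity || pos < 0 || pos > *n)
        return 0;
    /* Move elements from the last one down to pos one slot to the right. */
    for (int i = *n - 1; i >= pos; i--)
        a[i + 1] = a[i];
    a[pos] = elem;   /* place the new element in the freed slot */
    (*n)++;          /* update the number of elements */
    return 1;
}
```

Inserting at pos equal to the current number of elements appends at the end and moves nothing; inserting at pos 0 moves every element.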

iii) Deleting
Deleting refers to the operation of removing one of the elements from the array A.
Similar to insertion, we face no problem if an element at the "end" of an array is to be
deleted. But, if an element somewhere in the middle of the array is to be deleted, each
subsequent element must be moved one location upward to "fill up" the array.
Algorithm: To delete an element from an array: DELETE(A, N, POS, ELEM)
Here, A is the array from which the element has to be deleted,
N is the number of elements in the array,
POS is the position of the element that is to be deleted; it is a positive integer such that POS
<= N,
ELEM is the value that has been deleted.
The algorithm deletes the POSth element from the array A.
Step 1: SET ELEM = A[POS]
Step 2: [Initialize counter] SET I = POS
Step 3: Repeat Steps 4 and 5 while I <= N - 1
Step 4: [Move (I+1)st element upward] SET A[I] = A[I + 1]
Step 5: [Increment counter] SET I = I + 1
[End of Step 3 Loop]
Step 6: [Update N] SET N = N - 1
Step 7: EXIT
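A C sketch of the DELETE algorithm is given below. As with any C array, positions are 0-based here; the function name array_delete and the success/failure return value are assumptions made for the example.

```c
/* Delete the element at 0-based position pos from a[0..*n-1].
   The deleted value is stored in *elem.
   Returns 1 on success, 0 if pos is out of range. */
int array_delete(int a[], int *n, int pos, int *elem)
{
    if (pos < 0 || pos >= *n)
        return 0;
    *elem = a[pos];                     /* remember the deleted value */
    /* Move each subsequent element one slot to the left to fill the gap. */
    for (int i = pos; i < *n - 1; i++)
        a[i] = a[i + 1];
    (*n)--;                             /* update the number of elements */
    return 1;
}
```

Deleting the last element moves nothing, while deleting the first element moves all the remaining ones.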
Arrays have the following limitations:
i) Arrays are of fixed size.
ii) Data elements are stored in contiguous memory locations, which may not always be
available.
iii) Insertion and deletion of elements can be costly because elements have to be shifted
from their positions.
These limitations can be addressed by using linked lists.

Strings:
Introduction:
Today, computers are frequently used for processing non-numerical data, called character data.
One of the primary applications is word processing, which usually involves some type of pattern
matching.
In computer terminology, a "string" is a sequence of characters. Also, instead of
the term 'word processing', terms such as "string processing", "string
manipulation", and "text editing" are used.

Basic Terminology:
Each programming language contains a character set that is used to communicate with the
computer. Usually, this set consists of:
Alphabet: ABCDEFGHIJKLMNOPQRSTUVWXYZ
Digits: 0123456789
Special Characters: + - / * ( ) , . $ = ' (blank space)
String: a finite sequence S of zero or more characters.
Length of a string: the number of characters in a string.
Empty string or null string: the string with zero characters.
Specific strings will be denoted by enclosing their characters in single quotation marks.
The quotation marks also serve as string delimiters. Hence
'THE END'   'TO BE OR NOT TO BE'   ''   '  '
are strings with lengths 7, 18, 0 and 2, respectively.
The blank space is a character and contributes to the length of a string.
CONCATENATION: Let S1 and S2 be two strings. The string of the characters of S1 followed by the
characters of S2 is called the concatenation of S1 and S2; it will be denoted by S1 // S2. For example:
i. 'THE' // 'END' = 'THEEND'
ii. 'THE' // ' ' // 'END' = 'THE END'.
iii. Suppose S1 = 'MARK' and S2 = 'TWAIN', then:
S1 // S2 = 'MARKTWAIN'
S1 // ' ' // S2 = 'MARK TWAIN'
The length of the string S1 // S2 is equal to the sum of the lengths of the strings S1 and S2.

A string Y is called a substring of a string S if there exist strings X and Z such that
S = X // Y // Z
If X is an empty string, then Y is called an initial substring of S, and if Z is an empty string, then
Y is called a terminal substring of S. For example,
'BE OR NOT' is a substring of 'TO BE OR NOT TO BE'
'THE' is an initial substring of 'THE END'
If Y is a substring of S, then the length of Y cannot exceed the length of S.
STORING STRINGS:
Generally, strings are stored in three types of structures:
1) Fixed-length structures
2) Variable-length structures with fixed maximums
3) Linked structures.
Record-oriented, Fixed-Length Storage:
In fixed-length storage, each line of print is viewed as a record, where all records are of the same
length. Since data are often input on terminals with 80-column lines, it is assumed that a record
has a length of 80.
The main advantages of storing strings in this way are:
i) The ease of accessing data from any given record.
ii) The ease of updating data in any given record (as long as the length of the new data does
not exceed the record length).
The main disadvantages are:
i) Time is wasted reading an entire record if most of the storage consists of inessential blank
spaces.
ii) Certain records may require more space than available.
iii) When the correction consists of more or fewer characters than the original text, changing a
misspelled word requires the entire record to be changed.

Suppose a new record is to be inserted in the above example. This would require that all
succeeding records be moved to new memory locations. However, this disadvantage can be easily
remedied as shown in Fig. 3.3. Here, we use an array POINT which gives the address of each
successive record, so that the records need not be stored in consecutive locations in memory.
Now, to insert a new record, we only need to update the array POINT.

Variable-Length Storage with Fixed Maximum:


Although strings may be stored in fixed-length memory locations, it is useful to know the actual length
of each string. Then, one need not read the entire record when the string occupies only the
beginning part of the memory location. Also, certain string operations depend on having such
variable-length strings.
The storage of variable-length strings in memory cells with fixed lengths can be done in
two general ways:
1. Use a marker (such as two dollar signs, $$) called a sentinel, to signal the end of the string.
Fig. 3.4(a)
2. List the length of the string as an additional item in the pointer array. Fig. 3.4(b)

Note: One might store strings one after another by using some separation marker, such as the
two dollar signs ($$) as in Fig. 3.5(a), or by using a pointer array giving the location of the strings
as in Fig. 3.5(b). These ways of storing strings obviously save space and are sometimes used in
secondary memory when records are relatively permanent and require little change. However,
such methods of storage are usually inefficient when the strings and their lengths are frequently
being changed.

Linked Storage:
In word processing, one must be able to correct and modify the printed matter, i.e. delete, change
and insert words, phrases, sentences and even paragraphs in the text. Fixed-length memory cells,
however, cannot easily handle these operations. Hence, for most word processing applications,
strings are stored by means of linked lists.
Strings may be stored in linked lists as shown in Fig. 3.7. Each memory cell is assigned
one character or a fixed number of characters, and a link contained in the cell gives the address
of the cell containing the next character or group of characters in the string. Fig. 3.7(a) shows how
the string 'To be or not to be, that is the question.' would appear in memory with one
character per node and Fig. 3.7(b) shows the same with four characters per node.

STRING OPERATIONS
A string may be viewed as a sequence or array of characters. But, there is a fundamental
difference in use between strings and other types of arrays. Specifically, substrings (i.e. groups of
consecutive elements in a string) may be units themselves. Also, the basic units of access in a
string are usually substrings and not individual characters.
For example, the string 'TO BE OR NOT TO BE' is an 18-character sequence. But,
the substrings TO, BE, OR, ... have their own meaning. On the other hand, in an 18-element array of
integers,
4, 8, 6, 15, 9, 5, 4, 13, 8, 5, 11, 9, 9, 13, 7, 10, 6, 11
the basic unit of access is usually an individual element. Groups of consecutive elements normally
do not have any special meaning.
The different operations that can be performed on strings are:
- Concatenation
- Length
- Substring
- Indexing

LENGTH:
The number of characters in a string is called its length. It is denoted by:
LENGTH(string)
Example:
i) Suppose S = 'COMPUTER', LENGTH(S) = 8
ii) LENGTH('MARK TWAIN') = 10

In C, string length is determined using the strlen() function, which is declared in the
string.h header file. Hence, this header must be included at the time of pre-processing. Example:
X = strlen("Sunrise"); // strlen() returns the value 7, which is
// assigned to the variable X
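To show what LENGTH computes for a C string, the sketch below counts characters up to the terminating '\0', which is how a C string marks its end. The function name count_chars is an assumption for the example; in practice one would simply call strlen() from string.h.

```c
#include <stddef.h>

/* Count the characters of s up to, but not including, the terminating '\0'. */
size_t count_chars(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```

For the examples above, count_chars("COMPUTER") is 8 and count_chars("MARK TWAIN") is 10, since the blank space is counted as a character. The library function strlen() returns the same values.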

CONCATENATION: Let S1 and S2 be two strings. The string of the characters of S1 followed by the
characters of S2 is called the concatenation of S1 and S2; it will be denoted by S1 // S2. For example:
i. 'THE' // 'END' = 'THEEND'
ii. 'THE' // ' ' // 'END' = 'THE END'.
iii. Suppose S1 = 'MARK' and S2 = 'TWAIN', then:
S1 // S2 = 'MARKTWAIN'
S1 // ' ' // S2 = 'MARK TWAIN'
The length of the string S1 // S2 is equal to the sum of the lengths of the strings S1 and S2.
Concatenation is denoted in some programming languages as follows:
PL/1: S1 || S2
FORTRAN77: S1 // S2
BASIC: S1 + S2
SNOBOL: S1 - S2
In C, concatenation is performed using the strcat() function, which is declared in the
string.h header file. Hence, this header must be included at the time of pre-processing.
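Since strcat() appends to an existing buffer without checking its size, a small wrapper that checks the available space first may be helpful. The sketch below is one possible way to do this; the function name concat and its return convention are assumptions for the example.

```c
#include <string.h>

/* Store S1 // S2 in dest, which has room for dest_size characters
   (including the terminating '\0').
   Returns 1 on success, 0 if dest is too small (dest is then left untouched). */
int concat(char *dest, size_t dest_size, const char *s1, const char *s2)
{
    if (strlen(s1) + strlen(s2) + 1 > dest_size)
        return 0;
    strcpy(dest, s1);   /* copy the characters of S1 */
    strcat(dest, s2);   /* append the characters of S2 */
    return 1;
}
```

With s1 = "MARK" and s2 = "TWAIN" the result is "MARKTWAIN", whose length 9 is the sum of the lengths 4 and 5.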

SUBSTRING:
Accessing a substring from a given string requires three pieces of information:
1) The name of the string or the string itself.
2) The position of the first character of the substring in the given string.
3) The length of the substring or the position of the last character of the substring.
SUBSTRING(string, initial, length)
SUBSTRING(S, K, L) denotes the operation SUBSTRING, i.e. it returns the substring of a
string S beginning at position K and having length L.
Example:
i) SUBSTRING('TO BE OR NOT TO BE', 4, 7) = 'BE OR N'
ii) SUBSTRING('THE END', 4, 4) = ' END'

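// Function SUBSTR: returns the substring of STR that begins at position i
// (counted from 1) and has length j. The caller must ensure that the
// substring lies inside STR. At most 79 characters are copied.
char *SUBSTR(char *STR, int i, int j)
{
    int k, m = 0;
    // static, so that the buffer still exists after the function returns;
    // its contents are overwritten by the next call
    static char STRRES[80];
    for (k = i - 1; k <= i + j - 2 && m < 79; k++)
    {
        STRRES[m] = STR[k];
        m = m + 1;
    }
    STRRES[m] = '\0';
    return STRRES;
}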

INDEXING:
Indexing, also called pattern matching, refers to finding the position where a string pattern P
first appears in a given string text T. This operation INDEX is written as:
INDEX(text, pattern)
If the pattern P does not appear in the text T, then INDEX is assigned the value 0. The arguments
"text" and "pattern" can be either string constants or string variables.

Note: Since 0 is a valid index location in C, -1 is used to denote instances where pattern does not
match the text.
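In C, the INDEX operation can be sketched with the standard library function strstr(), which returns a pointer to the first occurrence of the pattern or NULL when there is none. The wrapper name index_of is an assumption for the example; it follows the C convention from the note above (0-based position, -1 for no match).

```c
#include <string.h>

/* Return the 0-based position at which pattern first appears in text,
   or -1 if pattern does not appear in text. */
int index_of(const char *text, const char *pattern)
{
    const char *p = strstr(text, pattern);
    if (p == NULL)
        return -1;
    return (int)(p - text);   /* distance from the start of text */
}
```

For the text "TO BE OR NOT TO BE" and the pattern "BE" this gives 3, which corresponds to INDEX = 4 in the 1-based notation used in these notes.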

PATTERN MATCHING ALGORITHMS


Pattern matching is the problem of deciding whether or not a given string pattern P appears in a
string text T. We assume that the length of P does not exceed the length of T.
Note: In pattern matching algorithms,
1) Characters are sometimes denoted by lowercase letters (a, b, c, ...) and exponents may be
used to denote repetition. Example:
i) a^2 b^3 a b^2 may be used for aabbbabb
ii) (cd)^3 may be used for cdcdcd
2) The empty string may be denoted by λ, the Greek letter lambda.
3) The concatenation of strings X and Y may be denoted by X.Y or simply XY.
First Pattern Matching Algorithm
In the first pattern matching algorithm we compare a given pattern P with each of the substrings
of T, moving from left to right, until we get a match. Let
WK = SUBSTRING(T, K, LENGTH(P))

That is, WK denotes the substring of T having the same length as P and beginning with the Kth
character of T.
First we compare P, character by character, with the first substring W1. If all the characters are the
same, then P = W1 and so P appears in T and INDEX(T, P) = 1. However, if we find that some
character of P is not the same as the corresponding character of W1, then P ≠ W1 and we can
immediately move on to the next substring W2.
That is, we next compare P with W2. If P ≠ W2, then we compare P with W3, and so on.
The process stops when
i) We find a match of P with some substring WK, and so P appears in T and INDEX(T, P) = K,
or
ii) We exhaust all the WK's with no match and hence P does not appear in T. The maximum
value MAX of the subscript K is equal to LENGTH(T) - LENGTH(P) + 1.
Example: Assume that T is a 20-character string and P is a 4-character string, and that T and P
are stored in memory as arrays with one character per element. That is,
T = T[1]T[2]T[3]...T[19]T[20] and
P = P[1]P[2]P[3]P[4]
P is compared with each of the following 4-character substrings of T:
W1 = T[1]T[2]T[3]T[4], W2 = T[2]T[3]T[4]T[5], ..., W17 = T[17]T[18]T[19]T[20]
There are MAX = 20 - 4 + 1 = 17 such substrings of T.
The following algorithm assumes that T is an S-character string and P is an R-character string.
Algorithm: First Pattern Matching Algorithm:
Here P and T are strings with lengths R and S, respectively, and are stored as arrays with one
character per element. This algorithm finds the INDEX of P in T.
Step 1: [Initialize] SET K := 1 and MAX := S-R+1
Step 2: Repeat Steps 3 to 5 while K <= MAX
Step 3: Repeat for L = 1 to R [Tests each character of P]
If P[L] ≠ T[K+L-1], then Go to Step 5
[End of inner loop]
Step 4: [Success] SET INDEX = K and EXIT
Step 5: SET K := K+1
[End of Step 2 loop]
Step 6: [Failure] SET INDEX = 0
Step 7: EXIT
This algorithm consists of two loops, one inside the other. The outer loop runs through
each successive R-character substring
WK = T[K]T[K+1]…T[K+R-1]
of T. The inner loop compares P with WK, character by character. If any character does not match, then
control transfers to Step 5, which increments K and then leads to the next substring of T. If all
the R characters of P do match those of some WK, then P appears in T and K is the INDEX of P in T.
On the other hand, if the outer loop completes all of its cycles, then P does not appear in T and so
INDEX = 0.
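The steps above can be sketched in C as follows. This is one possible rendering and not the notes' own program: the function name first_pattern_match is an assumption, and because C arrays are indexed from 0, the 1-based element T[K+L-1] of the algorithm becomes T[K + L - 2] in the code.

```c
#include <string.h>

/* First (brute-force) pattern matching algorithm.
 * Returns the 1-based INDEX of P in T, or 0 if P does not appear in T,
 * following the convention of the algorithm in the text. */
int first_pattern_match(const char *T, const char *P)
{
    int S = (int)strlen(T);          /* length of the text    */
    int R = (int)strlen(P);          /* length of the pattern */
    int MAX = S - R + 1;             /* last possible starting position */

    for (int K = 1; K <= MAX; K++) {
        int L = 1;
        /* Compare P with WK = T[K]..T[K+R-1], one character at a time. */
        while (L <= R && P[L - 1] == T[K + L - 2])
            L++;
        if (L > R)
            return K;                /* all R characters matched */
    }
    return 0;                        /* every WK was tried without success */
}
```

For instance, first_pattern_match("ababaaba", "aaba") returns 5, since the pattern first appears at W5.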

The complexity of this pattern matching algorithm is measured by the number C of
comparisons between characters in the pattern P and characters of the text T.
Let NK denote the number of comparisons that take place in the inner loop when P is compared
with WK.
Then,
C = N1 + N2 + … + NL
where L is the position in T where P first appears, or
L = MAX if P does not appear in T.
The following examples compute C for some specific P and T where
LENGTH(P) = 4 and
LENGTH(T) = 20.
So, MAX = 20 - 4 + 1 = 17.
Example 3.11
a) Suppose P = aaba and T = cdcd...cd = (cd)^10.

P does not occur in T. Hence, in each of the 17 cycles, the first character of P does not match
the first character of WK, so NK = 1. Therefore,
C = 1 + 1 + 1 + ... + 1 = 17
b) Suppose P = aaba and T = ababaaba...
Comparing P with W1 = abab, N1 = 2, since the first letters match but the second letters do not.
Comparing P with W2 = baba, N2 = 1, since the first letters do not match.
Similarly, N3 = 2 and N4 = 1.
W5 = P; i.e. P is a substring of T and N5 = 4.
Hence, C = 2 + 1 + 2 + 1 + 4 = 10
c) Suppose P = aaab and T = aa...a = a^20.
Here P does not appear in T.
Also, every WK = aaaa. Hence NK = 4, since the first three letters of P match and the fourth does not.
Hence, C = 4 + 4 + ... + 4 = 17 × 4 = 68
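The three counts can be checked with a small C sketch that runs the first pattern matching algorithm and tallies every character comparison. The function name count_comparisons is an assumption made for this illustration. For case (b) only the eight characters of T shown in the text are needed, because the search stops at W5.

```c
#include <string.h>

/* Runs the first pattern matching algorithm and returns C, the total
 * number of comparisons made between characters of P and of T. */
int count_comparisons(const char *T, const char *P)
{
    int S = (int)strlen(T);
    int R = (int)strlen(P);
    int C = 0;

    for (int K = 1; K <= S - R + 1; K++) {
        int L;
        for (L = 1; L <= R; L++) {
            C++;                          /* P[L] is compared with T[K+L-1] */
            if (P[L - 1] != T[K + L - 2])
                break;                    /* mismatch: go to the next WK */
        }
        if (L > R)
            break;                        /* match found: stop counting */
    }
    return C;
}
```

Calling it with (cd)^10 and aaba, with ababaaba and aaba, and with a^20 and aaab gives 17, 10 and 68 respectively, in agreement with cases (a), (b) and (c).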
In general, when P is an r-character string and T is an s-character string, the data size for the
algorithm is n = r + s.
The worst case occurs when every character of P except the last matches every substring WK, as
in part (c) of the example above. In this case, C(n) = r(s - r + 1).
For fixed n, s = n - r.
Therefore, C(n) = r(n - r - r + 1) = r(n - 2r + 1).
The maximum value of C(n) occurs when r = (n + 1)/4.
Substituting this value for r,
C(n) = (n + 1)/4 * (n - 2(n + 1)/4 + 1) = (n + 1)/4 * (n + 1)/2 = (n + 1)^2/8 = O(n^2)
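The value r = (n + 1)/4 is stated without derivation. One way to obtain it, assuming r is treated as a continuous variable for the purpose of the estimate, is to set the derivative of C with respect to r to zero:

```latex
\begin{aligned}
C(r) &= r\,(n - 2r + 1) = (n+1)\,r - 2r^{2} \\
\frac{dC}{dr} &= (n+1) - 4r = 0 \quad\Longrightarrow\quad r = \frac{n+1}{4} \\
C_{\max} &= \frac{n+1}{4}\left(n + 1 - \frac{n+1}{2}\right)
          = \frac{n+1}{4}\cdot\frac{n+1}{2}
          = \frac{(n+1)^{2}}{8}
\end{aligned}
```

The second derivative is -4, which is negative, so this stationary point is a maximum.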
The complexity of the average case in any actual situation depends on certain probabilities which
are usually unknown. When the characters of P and T are randomly selected from some finite
alphabet, the average case is still not easy to analyze, but its complexity is still a factor of
the worst case.
Accordingly, the complexity of this pattern matching algorithm is equal to O(n^2). In other words,
the time required to execute this algorithm in the worst case is proportional to n^2.

Second Pattern Matching Algorithm:


This algorithm uses a table which is derived from the pattern but is independent of the text T.
For example: Suppose
P = aaba and
T = T1T2T3 … , where Ti denotes the ith character of T.
Suppose the first two characters of T match those of P, i.e. suppose T = aa….
Then T has one of the following forms:
i) T = aab…
ii) T = aaa…
iii) T = aax…, where x is any character different from a or b.
Suppose we read T3.
i) Suppose T3 = b. Then we next read T4 to see if T4 = a, which will give a match of P with
W1.
ii) Suppose T3 = a. Then we know that P ≠ W1; but we also know that W2 = aa…, i.e. the first
two characters of the substring W2 match those of P. Hence we next read T4 to see if T4 =
b.
iii) Suppose T3 = x. Then we know that P ≠ W1; but we also know that P ≠ W2 and P ≠ W3, since
x does not appear in P. Hence we read T4 to see if T4 = a, i.e. to see if the first character of
W4 matches the first character of P.
There are two important points in this procedure.
i) When we read T3, we need to compare T3 only with those characters which appear in P. If
none of these match, then we are in the last case, of a character x which does not appear
in P.
ii) After reading and checking T3, we next read T4; we do not have to go back again in the
text T.
Fig. 3.8(a) contains the table that is used in the second pattern matching algorithm for
the pattern P = aaba. Let Qi denote the initial substring of P of length i.
Q0 = λ, Q1 = a, Q2 = a^2, Q3 = a^2b, Q4 = a^2ba = P
Q0 = λ is the empty string.

The rows of the table are labelled by these initial substrings of P, excluding P itself. The columns
of the table are labelled a, b and x, where x represents any character that doesn't appear in the
pattern P.
Let f be the function determined by the table, i.e. f(Qi, t) denotes the entry in the table in row Qi and
column t (where t is any character). The entry f(Qi, t) is defined to be the largest Q that appears as
a terminal substring in the string Qit, the concatenation of Qi and t.
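The definition of f can be turned directly into code. The sketch below is an illustration under stated assumptions, not the construction used in the notes: the function name next_state is hypothetical, and a state Qj is represented by its length j, so that the value strlen(P) stands for P itself.

```c
#include <string.h>

/* Returns the length j of the largest initial substring Qj of P that is
 * a terminal substring of Qi followed by the character t. */
int next_state(const char *P, int i, char t)
{
    int R = (int)strlen(P);
    char s[64];                       /* holds the string Qi t */

    if (i < 0 || i > R || i >= (int)sizeof s)
        return 0;                     /* outside what this sketch supports */

    memcpy(s, P, (size_t)i);          /* Qi is the first i characters of P */
    s[i] = t;
    int len = i + 1;                  /* length of Qi t */

    for (int j = (len < R ? len : R); j > 0; j--) {
        /* Does Qj equal the last j characters of Qi t? */
        if (memcmp(P, s + len - j, (size_t)j) == 0)
            return j;
    }
    return 0;                         /* only the empty string Q0 fits */
}
```

For P = aaba this gives, for example, next_state("aaba", 2, 'a') = 2 and next_state("aaba", 2, 'b') = 3, which agrees with the earlier discussion of the cases T3 = a and T3 = b. Any character outside the pattern leads back to Q0.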

This table can also be pictured by the labelled directed graph in Fig. 3.8(b). Let T =
T1T2T3…Tn denote the n-character text which is searched for the pattern P. Beginning with
the initial state Q0 and using the text T, we obtain a sequence of states S1, S2, S3, … We let S1
= Q0 and read the first character T1. From the table or the graph, the pair (S1, T1) yields a
second state S2, i.e. f(S1, T1) = S2. We read the next character T2. The pair (S2, T2) yields a state
S3, and so on. There are two possibilities:
i) Some state SK = P, the desired pattern. In this case P appears in T and its index is
K - LENGTH(P).
ii) No state S1, S2, S3, …, Sn+1 is equal to P. In this case, P does not appear in T.
Algorithm: Pattern Matching
The pattern matching table f(Qi, t) of a pattern P is in memory, and the input is an n-character string
T = T1T2T3…Tn. This algorithm finds the INDEX of P in T.
Step 1: [Initialize] Set K := 1 and S1 := Q0.
Step 2: Repeat Steps 3 to 5 while SK ≠ P and K <= n.
Step 3: Read TK.
Step 4: Set SK+1 := f(SK, TK). [Finds next state]
Step 5: Set K := K+1. [Updates counter]
[End of Step 2 loop]
Step 6: [Successful?]
If SK = P, then
INDEX = K - LENGTH(P)
Else:
INDEX = 0
[End of IF structure]
Step 7: Exit

The running time of the above algorithm is proportional to the number of times the Step 2 loop is
executed. The worst case occurs when all of the text T is read, i.e. when the loop is executed n =
LENGTH(T) times. Hence, the complexity of this pattern matching algorithm is equal to O(n).
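A minimal C sketch of the second algorithm for the single pattern P = aaba is given below. The table entries are worked out from the definition of f(Qi, t) rather than copied from Fig. 3.8(a), which is not reproduced in these notes, and the names F, column and second_pattern_match_aaba are assumptions made for this illustration. A state Qj is stored as the integer j, with 4 standing for P.

```c
/* Transition table f for the pattern P = aaba.
 * Columns: 0 = 'a', 1 = 'b', 2 = any other character (x). */
static const int F[4][3] = {
    {1, 0, 0},   /* Q0 = empty string */
    {2, 0, 0},   /* Q1 = a            */
    {2, 3, 0},   /* Q2 = aa           */
    {4, 0, 0},   /* Q3 = aab          */
};

/* Maps a text character to its column in the table. */
static int column(char t)
{
    return t == 'a' ? 0 : t == 'b' ? 1 : 2;
}

/* Returns the 1-based INDEX of aaba in the n-character text T,
 * or 0 if the pattern does not appear. */
int second_pattern_match_aaba(const char *T, int n)
{
    const int LEN_P = 4;
    int K = 1;
    int state = 0;                            /* S1 = Q0 */

    while (state != LEN_P && K <= n) {
        state = F[state][column(T[K - 1])];   /* SK+1 = f(SK, TK) */
        K++;
    }
    return state == LEN_P ? K - LEN_P : 0;
}
```

Each character of T is read exactly once and the loop never moves backwards in the text, which is why the running time grows linearly with n. For the text ababaaba the function returns 5.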
