Dictionaries: Sets
DICTIONARIES
A dictionary is a container of elements from a totally ordered universe that supports the
basic operations of inserting/deleting elements and searching for a given element.
In this chapter, we first introduce the abstract data type Set, which includes dictionaries,
priority queues, etc. as subclasses.
Sets:
The set is the most fundamental data model of mathematics.
A set is a collection of well-defined elements. The members of a set are all distinct.
There are special operations that are commonly performed on sets, such as union,
intersection, and difference.
1. The union of two sets S and T, denoted S ∪ T, is the set containing the
elements that are in S or T, or both.
2. The intersection of sets S and T, written S ∩ T, is the set containing the
elements that are in both S and T.
3. The difference of sets S and T, denoted S − T, is the set containing those
elements that are in S but not in T.
For example:
Let S be the set {1, 2, 3} and T the set {3, 4, 5}. Then
S ∪ T = {1, 2, 3, 4, 5}, S ∩ T = {3}, and S − T = {1, 2}
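These operations map directly onto the set types built into most languages. A minimal Python
illustration of the example above:

# Set operations in Python, mirroring the example above
S = {1, 2, 3}
T = {3, 4, 5}
print(S | T)    # union:        {1, 2, 3, 4, 5}
print(S & T)    # intersection: {3}
print(S - T)    # difference:   {1, 2}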
Set implementation:
Possible data structures include
Bit Vector
Array
Linked List
o Unsorted
o Sorted
Dictionaries:
A dictionary is a dynamic set ADT with the operations:
Search(S, k) – an access operation that returns a pointer x to an element where x.key = k
Insert(S, x) – a manipulation operation that adds the element pointed to by x to S
Delete(S, x) – a manipulation operation that removes the element pointed to by x from S
Implementation:
1. Fixed Length arrays
2. Linked lists: sorted, unsorted, skip-lists
3. Hash Tables: open, closed
4. Trees
Binary Search Trees (BSTs)
Balanced BSTs
o AVL Trees
o Red-Black Trees
Splay Trees
Multiway Search Trees
o 2-3 Trees
o B Trees
Tries
Let n be the number of elements in a dictionary D. The following is a summary of the
performance of some basic implementation methods:
(Table: worst-case complexity of search, insert, and delete for each implementation method.)
Among these, the sorted list has the best average case performance.
In this chapter, we discuss two data structures for dictionaries, namely Hash Tables and Skip
Lists.
HASHING
Division Method:
One common method of determining a hash key is the division method of hashing: the key is
divided by the table size and the remainder is taken as the hash address, i.e.
h(key) = key mod tableSize.
(Figure: an example of hashing keys with the division method.)
The division method is generally a reasonable strategy, unless the key happens to have
some undesirable properties.
Note: if the table size is 10 and all of the keys end in zero, then every key hashes to the
same address.
In the above example, 42 mod 8 = 2 is an already filled position in the hash table. This is
known as a collision, i.e. two or more record keys map to the same array index.
In this case, the choice of hash function and table size needs to be carefully considered. The
best table sizes are prime numbers.
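A minimal Python sketch of the division method; the table sizes passed in below are
illustrative, following the note above that prime sizes work best:

def division_hash(key, m):
    # division method: the remainder of key / table size is the home address
    return key % m

print(division_hash(42, 8))     # 2, the collision case mentioned above
print(division_hash(42, 11))    # 9, with a prime table size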
Multiplication method:
The simplest situation is when the keys are floating-point numbers known to be in a fixed
range.
For example:
If the keys are numbers that are greater than 0 and less than 1, we can just multiply by m
(the table size) and truncate to an integer to get an address between 0 and m−1.
Algorithm:
Mathematically: choose m = 2^p and a constant s with 0 < s < 2^w, where w is the word size.
Compute k·s and write it as k·s = r1·2^w + r0; the hash value h(k) is given by the p most
significant bits of r0.
Example:
m = 8 (implies m = 2^3, p = 3)
w = 5
k = 21
s = 13
k·s = 21 · 13 = 273 = 8 · 2^5 + 17, so that r1 = 8 and r0 = 17
Written in w = 5 bits, r0 = 10001₂. The p = 3 most significant bits of r0 are 100₂ = 4₁₀,
so h(21) = 4.
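A minimal Python sketch of the computation in this example; the function name and argument
order are illustrative:

def multiplication_hash(k, s, w, p):
    # write k*s as r1 * 2^w + r0; the hash is the p most significant bits of r0
    r0 = (k * s) % (1 << w)     # the low-order w bits of k*s
    return r0 >> (w - p)        # keep the p most significant bits of r0

print(multiplication_hash(21, 13, 5, 3))    # 4, matching the worked example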
Exercise Example:
m = 4 (implies m = 2^2, p = 2)
w = 3
k = 12
s = 5 (0 < s < 2^w = 2^3 = 8)
k·s = 12 · 5 = ? = ? · 2^3 + ?, so r1 = ? and r0 = ?
Written in w = 3 bits, r0 = ?₂
The p = 2 most significant bits of r0 are ?₂ = ?₁₀, so h(12) = ?.
Universal method:
Hashing is a fun idea that has lots of unexpected uses. Here, we look at a novel type of hash
function that makes it easy to create a family of universal hash functions. The method is
based on a random binary matrix and is very simple to implement.
The idea is very simple. Suppose you have an input data item with m bits and you want a
hash function that produces n bits: first generate a random binary matrix M of order n×m.
For example, suppose you have the key 11, whose binary form is 1011; it is a four-bit input
value (m = 4) and we want to generate a three-bit hash value (n = 3).
        ( 0 1 0 0 )
M =     ( 1 0 1 1 )
        ( 1 1 0 1 )
and if the data value was 1011 the hash value would be computed, over the bits mod 2, as:
             ( 0 1 0 0 )   (1)   (0)
h(x) = M·x = ( 1 0 1 1 ) · (0) = (1)
             ( 1 1 0 1 )   (1)   (0)
                           (1)
There are a number of other ways to look at the way the arithmetic is done that suggest
different ways of implementing the algorithm.
The first is to notice that what you are doing is ANDing each row with the data column
vector. Taking the second row as an example: (1 0 1 1) AND (1 0 1 1) = (1 0 1 1),
and then you add up the bits of the result modulo 2: 1 + 0 + 1 + 1 = 3 ≡ 1 (mod 2).
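A minimal Python sketch of the matrix method, assuming each row of M is stored as an integer
bit mask; the row-times-vector product is an AND followed by a parity check (sum of the bits
mod 2):

def matrix_hash(x, rows):
    # x: the m-bit input value; rows: the rows of the random n x m binary matrix M
    h = 0
    for row in rows:
        parity = bin(row & x).count("1") & 1    # AND the row with x, add bits mod 2
        h = (h << 1) | parity                   # append this output bit to the hash
    return h

M = [0b0100, 0b1011, 0b1101]                    # the example matrix, one integer per row
print(bin(matrix_hash(0b1011, M)))              # 0b10, i.e. the hash bits 0 1 0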
Hashing gives an alternative approach that is often the fastest and most convenient way of
solving these problems like AI – search programs, cryptography, networks, complexity
theory.
In general, a hashing function can map several keys into the same address. That leads to a
collision. The colliding records must be stored and accessed as determined by a
collision-resolution technique.
There are two broad classes of such techniques: open hashing and closed hashing.
The difference between the two has to do with whether colliding records are stored outside
the table (open hashing), or whether a collision results in storing one of the records at
another slot in the table (closed hashing).
The particular hashing method that one uses depends on many factors. One important
factor is the ratio of the number of keys in the table to the number of hash addresses. It is
called the load factor and is given by: load factor (α) = n/m,
where n is the number of keys in the table and m is the number of hash addresses (the table size).
Open Hashing:
The simplest form of open hashing defines each slot in the hash table to be the head of a
linked list. All records that hash to a particular slot are placed on that slot’s linked list.
The below figure illustrates a hash table where each slot stores one record and a link pointer
to the rest of the list.
Any keys that hash to the same index are simply added to that slot's linked list; there is no
need to search for empty cells in the array. This method is called separate chaining.
Closed Hashing (Open Addressing):
Closed hashing resolves collisions in the prime area, that is, the area that contains all of
the home addresses;
i.e. when a data item cannot be placed at the index calculated by the hash function, another
location in the array is sought.
There are different methods of open addressing, which vary in the method used to find the
next vacant cell.
They are (i) Linear probing
(ii) Quadratic probing
(iii) Pseudo random probing
(iv) Double hashing
(v) Key – offset
Linear Probing:
In linear probing, when the home address is occupied, the search moves forward one slot at a
time. If the physical end of the table is reached during the linear search, the search will
wrap around to the beginning of the table and continue from there.
If an empty slot is not found before reaching the point of collision, the table is full.
A problem with the linear probe method is that it is possible for blocks of data to form when
collisions are resolved. This is known as primary clustering.
This means that any key that hashes into the cluster will require several attempts to resolve
the collision.
Linear probes have two advantages: first, they are quite simple to implement; second, data
tend to remain near their home address.
Exercise Example:
Insert the nodes 89, 18, 49, 58, and 69 into a hash table that holds 10 items using the
division method.
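A minimal Python sketch of linear probing applied to this exercise; the resulting table
matches a hand trace of the division method with step-by-one collision resolution:

def insert_linear(keys, m):
    table = [None] * m
    for k in keys:
        i = k % m                       # home address by the division method
        while table[i] is not None:     # collision: move one slot at a time
            i = (i + 1) % m             # wrap around at the physical end
        table[i] = k
    return table

print(insert_linear([89, 18, 49, 58, 69], 10))
# [49, 58, 69, None, None, None, None, None, 18, 89]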
Quadratic Probing:
In this probe, rather than always moving one spot, we move i² spots from the point of
collision, where i is the number of attempts to resolve the collision.
In a linear probe, if the primary hash index is x, subsequent probes go to x+1, x+2, x+3, and
so on. In quadratic probing, probes go to x+1, x+4, x+9, and so on; the distance from the
initial probe is the square of the step number: x+1², x+2², x+3², x+4², and so on.
That is, at first it picks the adjacent cell; if that is occupied, it tries 4 cells away; if
that is occupied, it tries 9 cells away, and so on. It eliminates the primary clustering
problem of the linear probe.
Consider the above exercise problem, keys 89, 18, 49, 58, 69 with table size 10.
Here, all keys that hash to the same location follow the same, increasingly long probe
sequence. This phenomenon is called secondary clustering. It is not a serious problem, but
quadratic probing is not often used because there is a slightly better solution.
Pseudo – Random Probing:
This method uses a pseudo-random number to resolve the collision, i.e. the probe function
selects the next position in the probe sequence at random from among the unvisited slots, so
that the probe sequence is a random permutation of the hash table positions.
Unfortunately, we cannot actually select the next position in the probe sequence at random,
because then we would not be able to duplicate the same probe sequence when searching for
the key.
In this probing, the ith slot in the probe sequence is (h(key) + rᵢ) mod M, where rᵢ is the
ith value in a random permutation of the numbers from 1 to M−1.
All insertions and searches use the same sequence of random numbers.
36 % 8 = 4
18 % 8 = 2
72 % 8 = 0 now insert 60
43 % 8 = 3 60 % 8 = 4; is a collision
6 % 8 = 6
Pseudo-random numbers are a relatively simple solution, but they have one significant
limitation: all keys follow the same collision-resolution path through the list. Because of
this, pseudo-random probing can create significant secondary clustering.
Double Hashing:
Double hashing uses the idea of applying a second hash function to the key when a collision
occurs. The result of the second hash function will be the number of positions from the
point of collision to insert.
Table size = 10
Hash1 (key) = key % 10
Hash2 (key) = 7 – (key % 7)
because 7 is a prime number smaller than the size of the table.
Hash (89) = 89 % 10 = 9
Hash (18) = 18 % 10 = 8
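A minimal Python sketch of double hashing with the two functions above; hash2 gives the
number of positions to move from the point of collision:

def insert_double(keys, m):
    table = [None] * m
    for k in keys:
        i = k % m                       # hash1: the home address
        step = 7 - (k % 7)              # hash2: distance to move on each collision
        while table[i] is not None:
            i = (i + step) % m
        table[i] = k
    return table

print(insert_double([89, 18, 49, 58, 69], 10))
# [69, None, None, 58, None, None, 49, None, 18, 89]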
Key – offset:
Key offset is a double-hashing method that produces different collision paths for different
keys. Whereas the pseudo-random number generator produces a new address as a function of the
previous address alone, key offset calculates the new address as a function of the old
address and the key.
One of the simplest versions simply adds the quotient of the key divided by the list size to
the old address to determine the next collision-resolution address, as shown below:
offset = ⌊key / listSize⌋; next address = (old address + offset) mod listSize
For example:
The key is 166702 and the list size is 307. The modulo-division hashing method generates an
address of 1 (166702 mod 307 = 1); it is a collision because key 070918 is already stored
there. Using key offset to calculate the next address, we get 237:
offset = ⌊166702 / 307⌋ = 543; next address = (1 + 543) mod 307 = 237
If 237 were also a collision, we would repeat the process to locate the next address:
next address = (237 + 543) mod 307 = 166
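A minimal Python sketch of the key-offset computation; the function name is illustrative:

def key_offset_next(key, address, list_size):
    offset = key // list_size                   # quotient of key / list size
    return (address + offset) % list_size       # next collision-resolution address

home = 166702 % 307                             # 1, a collision with key 070918
nxt = key_offset_next(166702, home, 307)        # (1 + 543) mod 307 = 237
print(nxt, key_offset_next(166702, nxt, 307))   # 237 166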
Skip Lists:
A skip list is a type of data structure that can be used as an alternative to balanced
(binary) trees or B-trees. Compared to a binary tree, skip lists allow quick search,
insertion and deletion of elements with simple algorithms. This is achieved by using
probabilistic balancing rather than the strictly enforced balancing of B-trees.
Skip lists are also significantly faster than equivalent algorithms for B-trees.
A skip list is basically a linked list with additional pointers that allow intermediate nodes to
be skipped, hence the name skip list.
In a simple linked list that consists of ‘n’ elements, to perform a search ‘n’ comparisons are
required in the worst case.
For example:
If a second pointer pointing two nodes ahead is added to every other node, the number of
comparisons goes down to ⌈n/2⌉ + 1 in the worst case.
Consider a sorted list where every other node has an additional pointer to the node two
ahead of it in the list.
In the list of the figure above, every second node has a pointer two ahead of it, and every
fourth node has a pointer four ahead of it. Here we need to examine no more than ⌈n/4⌉ + 2
nodes.
In the figure below, every (2^i)th node has a pointer 2^i nodes ahead (i = 1, 2, ...); then
the number of nodes to be examined can be reduced to ⌈log₂ n⌉ while only doubling the number
of pointers.
Here, every (2^i)th node has a pointer to a node 2^i nodes ahead (i = 1, 2, ...).
A node that has k forward pointers is called a level k node. If every (2^i)th node has a
pointer 2^i nodes ahead, then
# of level 1 nodes: 50%
# of level 2 nodes: 25%
# of level 3 nodes: 12.5%
Such a data structure can be used for fast searching but insertions and deletions will
be extremely cumbersome, since levels of nodes will have to change.
What would happen if the levels of nodes were randomly chosen but in the same
proportions (below figure)?
o level of a node is chosen randomly when the node is inserted
o A node's ith pointer, instead of pointing to a node that is 2^(i−1) nodes ahead,
points to the next node of level i or higher.
o In this case, insertions and deletions will not change the level of any node.
o Some arrangements of levels would give poor execution times but it can be
shown that such arrangements are rare.
Such a linked representation is called a skip list.
Each element is represented by a node the level of which is chosen randomly when
the node is inserted, without regard for the number of elements in the data
structure.
A level i node has i forward pointers, indexed 1 through i. There is no need to store
the level of a node in the node.
Maxlevel is the maximum number of levels in a node.
o Level of a list = Maxlevel
o Level of empty list = 1
o Level of header = Maxlevel
(Figure: an example skip list.)
Initialization:
An element NIL is allocated and given a key greater than any legal key. All levels of all lists
are terminated with NIL. A new list is initialized so that the level of list = maxlevel and all
forward pointers of the list's header point to NIL
Search:
We search for an element by traversing forward pointers that do not overshoot the node
containing the element being searched for. When no more progress can be made at the
current level of forward pointers, the search moves down to the next level. When we can
make no more progress at level 1, we must be immediately in front of the node that
contains the desired element (if it is in the list).
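A minimal Python sketch of this search loop, assuming each node stores a key and a list of
forward pointers and that every level ends at the NIL sentinel:

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level       # forward[i]: next node at level i + 1

NIL = Node(float("inf"), 0)                 # key greater than any legal key

def search(header, level, key):
    x = header
    for i in range(level - 1, -1, -1):      # start at the highest level, move down
        while x.forward[i].key < key:       # advance without overshooting the key
            x = x.forward[i]
    x = x.forward[0]                        # now immediately in front of the node
    return x if x.key == key else None

# a tiny hand-built list: level 1 chain 3 -> 7, level 2 chain 7
head, n3, n7 = Node(None, 2), Node(3, 1), Node(7, 2)
head.forward, n3.forward, n7.forward = [n3, n7], [n7], [NIL, NIL]
print(search(head, 2, 7).key)               # 7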
(Figure: an example skip list with 9 elements at level 1, 3 elements at level 2, 3 elements
at level 3, and 1 element at level 6.)
With p = 1/2, the expected number of levels is L(n) = log₂ n; in general L(n) = log_{1/p} n,
where p is the fraction of the nodes with level i pointers which also have level (i + 1)
pointers.
However, starting at the highest level does not alter the efficiency in a significant
way.
Another important question to ask is: what should MaxLevel be? A good choice is
MaxLevel = L(N) = log_{1/p} N, where N is an upper bound on the number of elements in the
skip list.
Complexity of search, delete, insert is dominated by the time required to search for
the appropriate element. This in turn is proportional to the length of the search
path. This is determined by the pattern in which elements with different levels
appear as we traverse the list.
Insert and delete involve additional cost proportional to the level of the node being
inserted or deleted.
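The level of a new node can be drawn with the coin-flipping scheme implied by the
proportions above (each level carries on to the next with probability p); a minimal sketch,
assuming p = 1/2:

import random

def random_level(max_level, p=0.5):
    level = 1
    while random.random() < p and level < max_level:
        level += 1                          # with probability p, promote one more level
    return level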
BALANCED TREES
Introduction:
Tree:
A tree consists of a finite set of elements, called nodes, and a set of directed lines,
called branches, that connect the nodes.
The no. of branches associated with a node is the degree of the node.
i.e. in - degree and out - degree.
Binary tree:
A binary tree is a tree in which no node can have more than two sub trees designated as the
left sub tree and the right sub tree.
Balance factor:
The balance factor of a binary tree is the difference in height between its left and right
subtrees: balance factor = H_L − H_R.
In a balanced binary tree, the heights of its subtrees differ by no more than one (its
balance factor is −1, 0, or +1), and its subtrees are also balanced.
In the design of the linear list structure, we had two choices: an array or a linked list.
The array structure provides a very efficient search algorithm, but its insertion and
deletion algorithms are very inefficient.
The linked list structure provides efficient insertion and deletion, but its search
algorithm is very inefficient.
What we need is a structure that provides an efficient search and, at the same time,
efficient insertion and deletion algorithms.
The binary search tree and the AVL tree provide that structure.
A binary search tree (BST) is a binary tree with the following properties:
All items in the left sub tree are less than the root.
All items in the right sub tree are greater than or equal to the root.
Each sub tree is itself a binary search tree.
While the binary search tree is simple and easy to understand, it has one major problem:
it is not balanced.
To overcome this problem, AVL trees were designed; they are balanced.
AVL TREES
In 1962, two Russian mathematicians, G. M. Adelson-Velskii and E. M. Landis, created the
balanced binary tree structure that is named after them: the AVL tree.
An AVL tree is a search tree in which the heights of the sub trees differ by no more than 1. It
is thus a balanced binary tree.
An AVL tree is a binary tree that either is empty or consists of two AVL subtrees, T_L and
T_R, whose heights differ by no more than 1:
|H_L − H_R| ≤ 1
where H_L is the height of the left subtree and H_R is the height of the right subtree; the
bar symbols indicate absolute value.
The balance factor for any node in an AVL tree must be +1, 0, or −1.
Balancing Trees:
Whenever we insert a node into a tree or delete a node from a tree, the resulting tree may
be unbalanced; then we must rebalance it.
AVL trees are balanced by rotating nodes either to the left or to the right.
Now, we are going to discuss the basic balancing algorithms. They are
1. Left of left:
A sub tree of a tree that is left high has also become left high
2. Right of right:
A sub tree of a tree that is right high has also become right high
3. Right of left:
A sub tree of a tree that is left high has become right high
4. Left of right:
A sub tree of a tree that is right high has become left high
KHIT Page 21
ADVANCED DATA STRUCTURES UNIT - II CSE
Left of left:
When the out – of – balance condition has been created by a left high sub tree of a left high
tree, we must balance the tree by rotating the out – of – balance node to the right.
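A minimal Python sketch of this simple right rotation; the node fields are illustrative. The
right-of-right case is the mirror image (a left rotation with the roles of the children
swapped):

class AVLNode:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(root):
    # left of left: the out-of-balance node is rotated to the right
    new_root = root.left                # the left child becomes the subtree root
    root.left = new_root.right          # its right subtree is re-attached
    new_root.right = root
    return new_root                     # the caller re-links this into the parent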
Complex case:
Right of Right:
This case is the mirror of the previous case; it requires a simple left rotation.
Right of Left:
The previous two cases required a single rotation to balance the tree. Now we study two
out – of – balance conditions in which we need to rotate two nodes, one to the left and one
to the right, to balance the tree.
Here, an out – of – balance tree in which the root is left high and left sub tree is right high –
a right of left tree.
To balance this tree, we first rotate the left sub tree to the left, then we rotate the root to
the right, making the left node the new root.
Left of Right:
Complex case:
NOTE:
In both cases, i.e. Right of left and Left of right, we have double rotations.
What is the maximum height of an AVL tree having exactly n nodes? To answer this question,
we pose the following question:
What is the minimum number of nodes (the sparsest possible AVL tree) an AVL tree of height h
can have?
Let F_h be an AVL tree of height h having the minimum number of nodes. F_h can be visualized
as in the figure.
Let F_l and F_r be the AVL trees that are the left subtree and right subtree, respectively,
of F_h. Then one of F_l and F_r must have height h−1 and the other height h−2.
Suppose F_l has height h−1, so that F_r has height h−2. F_l has to be an AVL tree having the
minimum number of nodes among all AVL trees with height h−1. Similarly, F_r will have the
minimum number of nodes among all AVL trees of height h−2. Thus we have
|F_h| = |F_{h−1}| + |F_{h−2}| + 1
where |F| denotes the number of nodes in F. Such trees are called Fibonacci trees; some
Fibonacci trees are shown in the figure. Note that |F_0| = 1 and |F_1| = 2.
Thus the numbers |F_h| + 1 are Fibonacci numbers. Using the approximate formula
F_k ≈ φ^k / √5 for the Fibonacci numbers, we get
|F_h| + 1 ≈ φ^{h+3} / √5, and hence
h ≈ 1.44 log₂ |F_h|
so the maximum height of an AVL tree with n nodes is about 1.44 log₂ n.
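A minimal Python sketch that tabulates |F_h| from the recurrence above and shows the height
staying under roughly 1.44 log₂ n:

import math

def min_avl_nodes(h):
    # |F_h| = |F_(h-1)| + |F_(h-2)| + 1, with |F_0| = 1 and |F_1| = 2
    a, b = 1, 2
    for _ in range(h - 1):
        a, b = b, a + b + 1
    return a if h == 0 else b

for h in range(1, 8):
    n = min_avl_nodes(h)
    print(h, n, round(1.44 * math.log2(n), 2))   # height h vs. the 1.44 log bound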
2 – 3 TREES
The basic idea behind maintaining a search tree is to make the insertion, deletion and
searching operations efficient.
In AVL trees the searching operation is efficient. However, insertion and deletion involve
rotations, which make those operations complicated.
To build a 2 – 3 tree there are certain rules that need to be followed. These rules are as
follows:
All the non – leaf nodes in a 2 – 3 tree must always have two or three non – empty
child nodes that are again 2 – 3 trees.
The level of all the leaf nodes must always be the same.
If a node has two children (left and right), then that node contains a single data value.
The data occurring in the left subtree of that node are less than the data of the node, and
the data occurring in the right subtree of that node are greater than the data of the node.
If any node has three children (left, middle, right), then that node contains two data
values, say i and j where i < j; the data of all nodes in the left subtree are less than i,
the data of all nodes in the middle subtree are greater than i but less than j, and the data
of all nodes in the right subtree are greater than j.
Example of 2 – 3 Trees:
Insertions in 2 – 3 Trees:
Let us now try to understand the process of inserting a value into a 2 – 3 tree. To insert a
value into a 2 – 3 tree, we first search for the position where the value can be inserted,
and then the value and the node in which the value is to be inserted are adjusted.
Algorithm:
Example:
Insert 5
Insert 21
Insert 8
Insert 63
Insert 69
Insert 32
Insert 7, 9, 25
Deletions in 2 – 3 Trees:
In the case of insertion, the node where the data is to be inserted is split if it already
contains the maximum number of values. In the case of deletion, two nodes are merged if the
node of the value to be deleted contains the minimum number of values (i.e. only one value).
Example 1:
Consider a 2 – 3 trees
Delete 47
Delete 63
Example 2:
Delete 47
PRIORITY QUEUES
A priority queue is an important data type in computer science. The major operations
supported by priority queues are insert and delete-min:
insert, which does the obvious thing, and delete-min, which finds, returns, and removes the
minimum element in the priority queue.
Simple Implementation:
Performing insertions at the front of a linked list takes O(1), but finding the minimum
requires traversing the list, which takes O(N) time.
To delete the minimum cheaply, we could insist that the list always be kept sorted: this
makes insertions expensive, O(N), and delete-mins cheap, O(1).
Another way of implementing priority queues would be to use a binary search tree.
The basic data structure we will use requires no pointers and supports both operations in
O(log N) worst-case time. The implementation we will use is known as a binary heap.
Binary – Heaps:
Heaps (occasionally called partially ordered trees) are a very popular data structure for
implementing priority queues.
Binary heaps are often referred to simply as heaps. Like binary search trees, heaps have two
properties, namely, a structure property and a heap-order property.
Structure property:
A heap is a binary tree that is completely filled, with the possible exception of the bottom
level, which is filled from left to right; such a tree is known as a complete binary tree, as
shown below.
A binary heap is a complete binary tree with elements from a partially ordered set, such that
the element at every node is less than (or equal to) the element at its left child and the
element at its right child.
It is easy to show that a complete binary tree of height h has between 2^h and 2^(h+1) − 1
nodes. This implies that the height of a complete binary tree is ⌊log₂ N⌋, which is clearly
O(log N).
One important observation is that because a complete binary tree is so regular, it can be
represented in an array and no pointers are necessary.
Since a heap is a complete binary tree, the elements can be conveniently stored in an array.
If an element is at position i in the array, then the left child is in position 2i, the
right child is in position 2i + 1, and the parent is in position ⌊i/2⌋.
The only problem with this implementation is that an estimate of the maximum heap size is
required in advance, but typically this is not a problem.
Because of the heap property, the minimum element will always be present at the root of
the heap. Thus the find min operation will have worst case O(1) running time.
Heap Order Property:
The heap-order property is what allows operations to be performed quickly. Since we want to
be able to find the minimum quickly, it makes sense that the smallest element should be at
the root. If we consider that any subtree should also be a heap, then any node should be
smaller than all of its descendants.
Applying this logic, we arrive at the heap-order property: in a heap, for every node X, the
key in the parent of X is smaller than (or equal to) the key in X, with the exception of the
root (which has no parent).
It is easy to perform the two required operations. All the work involves ensuring that the
heap order property is maintained.
Insert:
To insert an element, say x, into a heap with n elements, we first create a hole in position
n + 1 and see whether the heap property is violated by putting x into the hole. If the heap
property is not violated, then we have found the correct position for x. Otherwise we push
up, or percolate up, x until the heap property is restored.
To do this we slide the element that is in the hole's parent node into the hole, thus
bubbling the hole up toward the root. We continue this process until x can be placed in the
hole.
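A minimal Python sketch of this percolate-up insertion, using the 1-based array layout from
the structure-property discussion (children at 2i and 2i + 1, parent at ⌊i/2⌋); index 0 is
unused:

def heap_insert(heap, x):
    heap.append(x)                      # create a hole in position n + 1
    i = len(heap) - 1
    while i > 1 and heap[i // 2] > x:   # parent larger: slide it down into the hole
        heap[i] = heap[i // 2]
        i //= 2
    heap[i] = x                         # the correct location for x

h = [None]                              # 1-based heap; slot 0 is unused
for k in [31, 14]:
    heap_insert(h, k)
print(h)                                # [None, 14, 31]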
We create a hole in the next available heap location. Inserting 14 in the hole would violate
the heap order property so 31 is slid down into the hole. This strategy is continued until the
correct location for 14 is found.
This general strategy is known as a percolate up. i.e. the new element is percolated up the
heap until the correct location is found.
NOTE: The worst-case complexity of insert is O(h), where h is the height of the heap. Thus
insertions are O(log n), where n is the number of elements in the heap.
Delete min:
When the minimum is deleted, a hole is created at the root level. Since the heap now has one
less element and the heap is a complete binary tree, the element in the last position must
be relocated.
This we first do by placing the last element in the hole created at the root. This will leave the
heap property possibly violated at the root level.
We now push down, or percolate down, the hole at the root until the violation of the heap
property is stopped. While pushing down the hole it is important to slide it down to the
smaller of its two children (pushing the latter up). This is done so as not to create
another violation of the heap property.
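A minimal Python sketch of this percolate-down delete-min in the same 1-based array layout;
it assumes a non-empty heap:

def delete_min(heap):
    minimum, last = heap[1], heap.pop()  # hole at the root; last element to relocate
    n = len(heap) - 1
    if n == 0:
        return minimum                   # the heap is now empty
    i = 1
    while 2 * i <= n:
        child = 2 * i
        if child < n and heap[child + 1] < heap[child]:
            child += 1                   # the smaller of the two children
        if heap[child] < last:
            heap[i] = heap[child]        # slide the smaller child up into the hole
            i = child
        else:
            break
    heap[i] = last
    return minimum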
This general strategy is known as a percolate down. We use the same technique as in the
insert routine to avoid the use of swaps in this routine.
NOTE: The worst case running time of delete min is O(log n) where n is the no. of elements
in the heap.
Creating Heap:
The build-heap operation takes as input n elements. The problem is to create a heap of these
elements, i.e. to place them into an initially empty heap.
The obvious approach is to insert the n elements one at a time into an initially empty heap.
Since each insert takes O(1) average and O(log n) worst-case time, the total running time of
this approach is O(n) average but O(n log n) worst case. A better approach, traced in the
figures below, percolates down from the last internal node to the root.
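A minimal Python sketch of that percolate-down construction (1-based array, as before):

def build_heap(items):
    heap = [None] + list(items)
    n = len(heap) - 1
    for i in range(n // 2, 0, -1):       # percolate down each internal node in turn
        x, j = heap[i], i
        while 2 * j <= n:
            c = 2 * j
            if c < n and heap[c + 1] < heap[c]:
                c += 1                   # the smaller child
            if heap[c] < x:
                heap[j] = heap[c]        # slide the smaller child up
                j = c
            else:
                break
        heap[j] = x
    return heap

print(build_heap([9, 7, 5, 3, 1]))       # [None, 1, 3, 5, 9, 7]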
Left: after percolate – down (6) Right: after percolate – down (5)
Left: after percolate – down (4) Right: after percolate – down (3)
Left: after percolate – down (2) Right: after percolate – down (1)
Each dashed line corresponds to two comparisons: one to find the smallest child and one to
compare the smaller child with the node.
Notice that there are only 10 dashed lines in the entire algorithm (there could have been an
11th; where?), corresponding to 20 comparisons.
To bound the running time of build-heap, we must bound the number of dashed lines. This can
be done by computing the sum of the heights of all the nodes in the heap, which is the
maximum number of dashed lines. What we would like to show is that this sum is O(n).
THEOREM:
For the perfect binary tree of height h containing 2^(h+1) − 1 nodes, the sum of the heights
of the nodes is 2^(h+1) − 1 − (h + 1).
Proof:
It is easy to see that this tree consists of 1 node at height h, 2 nodes at height h−1, 2²
nodes at height h−2, and in general 2^i nodes at height h−i.
The sum of the heights of all the nodes is then
S = Σ_{i=0}^{h} 2^i (h − i)
  = h + 2(h − 1) + 4(h − 2) + 8(h − 3) + 16(h − 4) + ... + 2^{h−1} · 1
Multiplying by 2 and subtracting the first equation from the second gives
S = −h + 2 + 4 + 8 + 16 + ... + 2^{h−1} + 2^h = 2^{h+1} − (h + 2)
It is easy to see that the above is an upper bound on the sum of the heights of the nodes of
a complete binary tree. Since a complete binary tree of height h has between 2^h and
2^(h+1) nodes, the above sum is therefore O(n),
where n is the number of nodes in the heap.
Since the worst-case complexity of the heap-building algorithm is of the order of the sum of
the heights of the nodes of the heap built, the worst-case complexity of heap building is
O(n).
Binomial Queues:
The binary heap supports insertion and delete-min effectively in O(log n) time per
operation, with insertion taking constant average time, but it does not support merging
efficiently.
Binomial queues support all three operations, merge, insertion, and delete-min, in O(log n)
worst-case time per operation, and insertions take constant time on average.
A binomial queue differs from all the other priority queue implementations in that it is not
a single heap-ordered tree but rather a collection of heap-ordered trees, known as a forest.
Each of the heap-ordered trees is of a constrained form known as a binomial tree. There is
at most one binomial tree of every height.
B0 B1 B2 B3
The above diagram shows the binomial trees B0, B1, B2, and B3. From the diagram we see that
a binomial tree B_k consists of a root with children B0, B1, ..., B_{k−1}.
NOTE: A binomial tree of height k has exactly 2^k nodes, and the number of nodes at depth d
is the binomial coefficient C(k, d).
NOTE: If we impose heap order on the binomial tree and allow at most one binomial tree of
any height we can uniquely represent a priority queue of any size by a collection of binomial
trees (forest).
(Figure: a binomial queue H1, a forest of heap-ordered binomial trees.)
Find – min:
This is implemented by scanning the roots of all the trees. Since there are at most log n
different trees, the minimum can be found in O(log n) time.
Alternatively, one can keep track of the current minimum and perform find-min in O(1) time,
if we remember to update the minimum whenever it changes during other operations.
Merge:
Merging two binomial queues is a conceptually easy operation, which we will describe by
example.
Consider the two binomial queues H1 and H2 with six and seven elements respectively as
shown below.
Since H1 has no binomial tree of height 0 and H2 does, we can just use the binomial tree of
height 0 in H2 as part of H3.
Since both H1 and H2 have a binomial tree of height 1, we merge them by making the larger
root a subtree of the smaller, creating a binomial tree of height 2.
Thus, H3 will not have a binomial tree of height 1, as shown in the above diagrams.
There are now three binomial trees of height 2, namely, the original trees of both H1 and H2
plus the tree formed by merging the two height-1 trees.
We keep one binomial tree of height 2 in H3 and merge the other two, creating a binomial
tree of height 3.
Since H1 and H2 have no trees of height 3, this tree becomes part of H3 and we are finished.
The resulting binomial queue is as shown in above figure.
Since merging two binomial trees takes constant time with almost any reasonable
implementation, and there are O(log n) binomial trees, the merge takes O(log n) time in the
worst case.
To make this operation efficient, we need to keep the trees in the binomial queue sorted by
height, which is certainly a simple thing to do.
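Merging two binomial trees of the same height is the constant-time building block: the tree
with the larger root becomes a child of the other. A minimal Python sketch with illustrative
node fields:

class BinomialNode:
    def __init__(self, key):
        self.key = key
        self.children = []               # the children B0, B1, ..., Bk-1 of a Bk root

def link(t1, t2):
    # make the larger root a subtree of the smaller, producing a tree one height taller
    if t2.key < t1.key:
        t1, t2 = t2, t1
    t1.children.append(t2)
    return t1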
Insertion:
Insertion is a special case of merging, since we merely create a one-node tree and perform
a merge.
More precisely, if the priority queue into which the element is being inserted has the
property that the smallest nonexistent binomial tree is B_i, the running time is
proportional to i + 1.
For example:
In the previous example, H3 is missing a binomial tree of height 1, so the insertion will
terminate in two steps. Since each tree in a binomial queue is present with probability 1/2,
if we define the random variable X as the number of steps in an insert operation, then
E(X) = Σ_{i≥1} i (1/2)^i = 2.
Thus we expect an insertion to terminate in two steps on average. Furthermore, performing n
inserts on an initially empty binomial queue takes O(n) worst-case time.
Consider an example, the binomial queue that are formed by inserting 1 through 7 in order.
After 1 is inserted:
After 2 is inserted:
After 3 is inserted:
After 4 is inserted:
After 5 is inserted:
After 6 is inserted:
After 7 is inserted:
If we insert 8 then
Inserting 4 shows off a bad case: we merge 4 with B0, obtaining a new tree of height 1; we
then merge this tree with B1, obtaining a tree of height 2, which is the new priority queue.
The next insertion, after 7 has been inserted, is another bad case and would require three
merges.
Delete min:
A delete min can be performed by first finding the binomial tree with the smallest
root.
Let this tree be Bk and let the original priority queue be H
Remove the binomial tree Bk from the forest of trees in H forming the new binomial
queue H′
Now remove the root of B_k, creating binomial trees B0, B1, ..., B_{k−1}, which collectively
form the priority queue H″.
Finish the operation by merging H′ & H″
The binomial queue that results from merging H′ & H″ is as shown below
NOTE: The entire delete min operation takes O(log n) worst case time
Result 1:
The above analysis will not help when we try to analyze a sequence of operations that
include more than just insertions.
Amortized Analysis
o If there is no B0 tree, then the insertion costs one time unit. The result of
insertion is that there is now a B0 tree and the forest has one more tree.
o If there is a B0 tree but not B1 tree, then insertion costs 2 time units. The new
forest will have a B1 tree but not a B0 tree. Thus number of trees in the forest is
unchanged.
o An insertion that costs 3 time units will create a B2 tree but destroy a B0 and
B1, yielding one less tree in the forest.
o In general, an insertion that costs c time units results in a net increase of 2 − c trees
in the forest, since
a B_{c−1} tree is created and
all B_i trees, 0 ≤ i ≤ c − 2, are removed.
Thus expensive insertions remove trees and cheap insertions create trees.
Let t_i = the actual time taken by the ith insertion, and
c_i = the number of trees in the forest after the ith insertion.
We have c_0 = 0 and, for every insertion,
t_i + (c_i − c_{i−1}) = 2
Result 2:
The amortized running times of insert, delete-min, and merge are O(1), O(log n), and
O(log n), respectively.
To prove this result we choose the potential function:
potential = the number of trees in the queue.
Insertion:
With t_i and c_i as above, the amortized time of the ith insertion is
a_i = t_i + (c_i − c_{i−1}) = 2 for all i.
Summing over n insertions gives
Σ t_i = 2n − (c_n − c_0)
As long as (c_n − c_0) is positive, we are done.
In any case (c_n − c_0) is bounded by log n if we start with an empty queue.
Merge:
Assume that the two queues to be merged have n1 and n2 nodes with T1 and T2 trees. Let
n = n1 + n2. The actual time to perform the merge is given by:
t_i = O(log n1 + log n2) = O(max(log n1, log n2)) = O(log n)
(c_i − c_{i−1}) is at most log n, since there can be at most log n trees after the merge.
Deletemin:
The analysis here follows the same argument as for merge.
It can be shown that in a Fibonacci heap any node of rank r ≥ 1 has at least F_{r+1}
descendants.
GRAPHS
In this chapter, we turn our attention to a data structure, the graph, that differs from all
of the others in one major concept: each node may have multiple predecessors as well as
multiple successors.
Graphs are very useful structures. They can be used to solve complex routing problems,
such as designing and routing airlines among the airports they serve. Similarly, they can be
used to route messages over a computer network from one node to another.
Basic Concepts:
A graph is a collection of nodes, called vertices, and a collection of line segments, called
lines, connecting pairs of vertices. In other words, a graph consists of two sets: a set of
vertices and a set of lines.
A directed graph, or digraph, is a graph in which each line has a direction (arrowhead)
to its successor. A line in a directed graph is known as an arc. The flow along an arc
between two vertices can follow only the indicated direction.
A path is a sequence of vertices in which each vertex is adjacent to the next one.
Two vertices in a graph are said to be adjacent vertices (or neighbors) if there is a path of
length 1 connecting them.
A cycle is a path that starts and ends with the same vertex.
Example:
A-B-C-A is a cycle
A loop is a special case of cycle in which a single arc begins and ends with the same vertex.
In a loop the end points of the line are the same.
Two vertices are said to be connected if there is a path between them. A graph is said to be
connected if, ignoring direction, there is a path from any vertex to any other vertex.
A directed graph is strongly connected if there is a path from each vertex to every other
vertex in the digraph.
A directed graph is weakly connected if at least two vertices are not connected by a
directed path, even though the graph is connected when direction is ignored. (A connected
undirected graph would always be strongly connected, so the concept is not normally used
with undirected graphs.)
The out-degree of a vertex in a digraph is the number of arcs leaving the vertex; the
in-degree is the number of arcs entering it.
NOTE: A tree is a graph in which each vertex has only one predecessor; a graph, however, is
not necessarily a tree.
Operations on Graphs:
There are six primitive graph operations that provide the basic modules needed to maintain
a graph. They are
1. Insert a vertex
2. Delete a vertex
3. Add an edge
4. Delete an edge
5. Find a vertex
6. Traverse a graph
Vertex insertion:
When a vertex is inserted, it is disjoint; it is not connected to any other vertices in the
graph. The diagram below shows a graph before and after a new vertex is added.
Algorithm:
Vertex deletion:
Delete vertex removes a vertex from the graph. When a vertex is deleted, all connecting
edges are also removed.
Algorithm:
Return +1 if successful
-1 if degree not zero
-2 if key is not found
if (empty graph)
return -2;
end if
search for vertex to be deleted
if (not found)
return -2;
end if
if (vertex indegree > 0 or outdegree > 0)
return -1;
end if
delete vertex
decrement graph count
return 1;
end delete vertex
Edge addition:
Add edge connects a source vertex to a destination vertex. If a vertex requires multiple
edges, add edge must be called once for each adjacent vertex. To add an edge, two vertices
must be specified. If the graph is a digraph, one of the vertices must be specified as the
source and one as the destination.
Algorithm:
Return +1 if successful
-2 if fromkey not found
-3 if tokey not found
Allocate memory for new arc
Search and set fromvertex
if (fromvertex not found)
return -2;
end if
search and set tovertex
if (tovertex not found)
return -3;
end if
increment fromvertex outdegree
increment tovertex indegree
set arc destination to tovertex
if ( fromvertex arc list empty)
set fromvertex firstArc to new arc
set new arc nextArc to null
return 1;
end if
find insertion point in arc list
if (insert at beginning of arc list)
set fromvertex firstArc to new arc
else
insert in arc list
end if
return 1;
end insertArc
Edge deletion:
Below diagram shows that deleted the edge {B, E} from the graph
Algorithm:
Return +1 if successful
-2 if fromkey not found
-3 if tokey not found
if (empty graph)
return -2;
end if
search and set fromvertex to the vertex with key equal to fromkey
if (fromvertex not found)
return -2;
end if
if (fromvertex arc list null)
return -3;
end if
search and find arc with key equal to tokey
if (tokey not found)
return -3;
end if
set tovertex to arc destination
delete arc
decrement fromvertex outdegree
decrement tovertex indegree
return 1;
end deleteArc
Find vertex:
Find vertex traverses the graph, looking for a specified vertex. If the vertex is found, its
data are returned; if it is not found, an error is indicated.
The below figure shows find vertex traverses the graph, looking for vertex C
Algorithm:
To represent a graph, we need to store two sets. The first set represents the vertices of
the graph and the second set represents the edges or arcs. The two most common structures
used to store these sets are arrays and linked lists. Although arrays offer some simplicity,
their fixed size is a major limitation.
Adjacency Matrix:
The adjacency matrix uses a vector (one-dimensional array) for the vertices and a matrix
(two-dimensional array) to store the edges. If two vertices are adjacent, that is, if there
is an edge between them, the matrix intersection is set to 1; if there is no edge between
them, it is set to 0.
If the graph is directed, the intersection in the adjacency matrix indicates the direction.
In the below diagram, there is an arc from source vertex B to destination vertex C. In the
adjacency matrix, this arc is seen as a 1 in the intersection from B (on the left) to C (on
the top). Because there is no arc from C to B, however, the intersection from C to B is 0.
NOTE: In adjacency matrix representation, we use a vector to store the vertices and a matrix
to store the edges.
In addition to the limitation that the size of the graph must be known before the program
starts, there is another serious limitation of the adjacency matrix: only one edge can be
stored between any two vertices. Although this limitation does not prevent many graphs from
using the matrix format, some network structures require multiple lines between vertices.
Adjacency list:
The adjacency list uses a two – dimensional ragged array to store the edges. An adjacency
list is shown below.
The vertex list is a singly linked list of the vertices in the graph. Depending on the
application, it could also be implemented using doubly linked lists or circularly linked
lists.
the left of the list links the vertex entries. The pointer at the right in the vertex is a head
pointer to a linked list of edges from the vertex. Thus, in the non – directed graph on the left
in above figure there is a path from vertex B to vertices A, C, and E. To find these edges in
the adjacency list, we start at B’s vertex list entry and traverse the linked list to A, then to C,
and finally to E.
NOTE: In the adjacency list, we use a linked list to store the vertices and a two – dimensional
linked list to store the arcs.
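A minimal Python sketch of both representations for a small digraph; dictionaries and lists
stand in for the linked vertex and arc lists:

vertices = ["A", "B", "C"]
arcs = [("A", "B"), ("B", "C"), ("A", "C")]
index = {v: i for i, v in enumerate(vertices)}

# adjacency matrix: matrix[i][j] = 1 iff there is an arc from vertex i to vertex j
n = len(vertices)
matrix = [[0] * n for _ in range(n)]
for u, v in arcs:
    matrix[index[u]][index[v]] = 1

# adjacency list: one list of destination vertices per source vertex
adj = {v: [] for v in vertices}
for u, v in arcs:
    adj[u].append(v)

print(matrix)    # [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
print(adj)       # {'A': ['B', 'C'], 'B': ['C'], 'C': []}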
Traverse graph:
There is always at least one application that requires that all the vertices in a given
graph be visited; as we traverse the graph, we set the visited flag on to indicate that the
data have been processed.
Traversal of a graph means visiting each of its nodes exactly once. This is accomplished by
visiting the nodes in a systematic manner.
There are two standard graph traversals: depth-first and breadth-first. Both use the visited
flag.
In the depth – first traversal, we process all of a vertex’s descendants before we move to an
adjacent vertex. This concept is most easily seen when the graph is a tree
In the below figure we show the tree pre – order traversal processing sequence, one of the
standard depth – first traversals
In a similar manner, the depth-first traversal of a graph starts by processing the first
vertex; we then select any vertex adjacent to the first vertex and process it. This
continues until no unvisited adjacent vertices remain.
This is similar to reaching a leaf in a tree. We require a stack to complete the traversal,
giving last-in, first-out (LIFO) order.
Let’s trace a depth – first traversal through the graph in below figure the numbering in the
box next to a vertex indicates the processing order
1. We begin by pushing the first vertex into the stack.
2. We then loop: pop the stack and, after processing the vertex, push all of its adjacent
vertices into the stack.
NOTE: In the depth – first traversal, all of a node’s descendents are processed before
moving to an adjacent node
Push the nodes adjacent to H into the stack; G is already in the waiting state, so push
nodes E and P.
Push the adjacent nodes; H is already visited, so push Y and M into the stack.
In the breadth – first traversal of a graph, we process all adjacent vertices of a vertex before
going to the next level. We first saw the breadth – first traversal of a tree as shown in below
This traversal starts at level 0 and then processes all the vertices in level 1 before going on
to process the vertices in level 2.
The breadth-first traversal of a graph follows the same concept: begin by picking a starting
vertex A; after processing it, process all of its adjacent vertices, and continue this
process level by level until no unprocessed vertices remain.
The breadth – first traversal uses a queue rather than a stack. As we process each vertex, we
place all of its adjacent vertices in the queue. Then to select the next vertex to be processed,
we delete a vertex from the queue and process it.
1. We begin by placing the starting vertex in the queue.
2. We then loop, dequeuing the queue and processing the vertex taken from the front of the
queue. After processing the vertex, we place all of its adjacent vertices into the queue.
Thus in the above diagram we dequeue vertex X, process it, and then place vertices G and H
in the queue.
NOTE: In the breadth – first traversal, all adjacent vertices are processed before processing
the descendents of a vertex.
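A minimal Python sketch of both traversals over an adjacency list, using a stack for
depth-first and a queue for breadth-first, as described above:

from collections import deque

def depth_first(adj, start):
    visited, stack, order = set(), [start], []
    while stack:
        v = stack.pop()                  # LIFO: the most recently pushed vertex
        if v not in visited:
            visited.add(v)
            order.append(v)
            stack.extend(adj[v])         # push all adjacent vertices
    return order

def breadth_first(adj, start):
    visited, queue, order = {start}, deque([start]), []
    while queue:
        v = queue.popleft()              # FIFO: process vertices level by level
        order.append(v)
        for w in adj[v]:
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order

g = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(depth_first(g, "A"))               # ['A', 'C', 'D', 'B']
print(breadth_first(g, "A"))             # ['A', 'B', 'C', 'D']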
Algorithms:
Breadth-First Search:
Exercise problems:
GRAPH ALGORITHMS
Trees are a special case of graphs: a tree may be defined as a connected graph without any
cycles.
A spanning tree of a graph is a subgraph that contains all the vertices of the graph and is
basically a tree, containing no cycles.
A network is a graph whose lines are weighted. It is also known as a weighted graph. The
weight is an attribute of an edge. In an adjacency matrix, the weight is stored as the
intersection value. In an adjacency list, it is stored as the value in the adjacency linked list.
A minimum – cost spanning tree is a spanning tree in which the total weight of lines is
guaranteed to be the minimum of all possible trees in the graph.
Example:
Before going into the algorithms used to obtain a minimum spanning tree of a graph, note
that we can start with any vertex; because the vertex list is usually key sequenced, we
generally start with the first vertex.
The above example shows a graph and one of its minimum-cost spanning trees. The
identification of a minimum-cost spanning tree involves the selection of a subset of the
edges.
Spanning trees have wide applications in many areas, such as network design.
There are two popular techniques to construct a minimum – cost spanning tree from a
weighted graph. One such method is known as Prim’s algorithm and another one is
Kruskal’s algorithm
Prim’s Algorithm:
Prim's algorithm is implemented using the adjacency matrix of a graph. This matrix is
denoted by adjMatrix[i, j], where i and j range from 0 to n−1 for an n-node weighted
undirected graph.
We can represent a weighted graph by an adjacency matrix to store the set of edges; an entry
(i, j) in the adjacency matrix contains information on the edge that goes from vertex i to
vertex j. Each matrix entry contains the weight of the corresponding edge.
Now the sequence of figures below shows the working of Prim's algorithm; first, a sketch of
the algorithm itself.
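A minimal Python sketch of Prim's algorithm over such an adjacency matrix, assuming None
marks a missing edge and the graph is connected; the tree is grown from vertex 0:

def prim(adjMatrix):
    n = len(adjMatrix)
    in_tree = [False] * n
    in_tree[0] = True                            # start from vertex 0
    tree = []
    for _ in range(n - 1):
        best = None
        for i in range(n):
            if in_tree[i]:
                for j in range(n):               # cheapest edge leaving the tree
                    w = adjMatrix[i][j]
                    if w is not None and not in_tree[j] and (best is None or w < best[0]):
                        best = (w, i, j)
        w, i, j = best
        in_tree[j] = True
        tree.append((i, j, w))
    return tree

m = [[None, 6, 1, 5], [6, None, 5, None], [1, 5, None, 8], [5, None, 8, None]]
print(prim(m))    # [(0, 2, 1), (0, 3, 5), (2, 1, 5)]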
Kruskal’s Algorithm:
Like Prim's algorithm, Kruskal's algorithm constructs the minimum spanning tree of a graph
by adding edges to the spanning tree one by one. At all points during its execution, the set
of edges selected by Prim's algorithm forms exactly one tree; on the other hand, the set of
edges selected by Kruskal's algorithm forms a forest of trees.
The edges of the graph are arranged in increasing order of weight, as shown below. Initially
the spanning tree T is empty. We select the edge with the smallest weight and include it in
T; if the selected edge creates a cycle, it is discarded. We repeat these two steps until
the tree T contains n−1 edges (where n is the number of vertices in the graph).
If the tree contains fewer than n−1 edges and the edge list is empty, then no spanning tree
is possible for the graph.
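A minimal Python sketch of Kruskal's algorithm; a simple union-find structure detects the
cycles mentioned above:

def kruskal(n, edges):
    parent = list(range(n))
    def find(v):                                 # root of v's tree in the forest
        while parent[v] != v:
            v = parent[v]
        return v
    tree = []
    for w, u, v in sorted(edges):                # increasing order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                             # different trees: no cycle created
            parent[ru] = rv
            tree.append((u, v, w))
        if len(tree) == n - 1:
            break
    return tree

edges = [(6, 0, 1), (1, 0, 2), (5, 0, 3), (5, 1, 2), (8, 2, 3)]
print(kruskal(4, edges))    # [(0, 2, 1), (0, 3, 5), (1, 2, 5)]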
A minimum spanning tree gives no indication about the shortest path between two vertices.
Rather only the overall cost or weight is minimized. In real life, we are required to find the
shortest path between two cities.
For example: one would be interested in finding most economical route between any two
cities in a given railways network.
We are given a directed graph G in which every edge has a weight, and our problem is to find
a path from one vertex v to another vertex w such that the sum of the weights on the path is
as small as possible. We shall call such a path a shortest path.
The shortest path from vertex A to vertex E is A-D-C-E and has a total cost of 12, compared
to the cost of 20 for the edge directly from A to E and the cost of 14 for the path A-B-C-E.
It turns out that it is just as easy to solve the more general problem of starting at one
vertex, called the source, and finding the shortest path to every other vertex, instead of
to just one destination vertex. For simplicity, we take the source to be vertex A, and our
problem then consists of finding the shortest path from vertex A to every other vertex in
the graph.
Dijkstra’s Algorithm:
The solution we will show for the shortest path problem is called Dijkstra’s Algorithm, after
Edsger Dijkstra, who first described it in 1959. This algorithm is based on the adjacency
matrix representation of a graph. Somewhat surprisingly, it finds not only the shortest path
from one specified vertex to another, but the shortest paths from the specified vertex to all
the other vertices.
The algorithm works by maintaining a set S of vertices whose shortest distance from the
source is already known, together with an array d, where d[i] contains the length of the
current shortest path from the source vertex to vertex i.
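A minimal Python sketch of Dijkstra's algorithm over an adjacency matrix, with weights
chosen to reproduce the A..E example above (A-D-C-E = 12, direct A-E = 20, A-B-C-E = 14);
None marks a missing edge and vertex 0 is the source A:

def dijkstra(adjMatrix, source=0):
    n = len(adjMatrix)
    d = [float("inf")] * n
    d[source] = 0
    S = set()                                    # vertices whose distance is final
    while len(S) < n:
        u = min((v for v in range(n) if v not in S), key=lambda v: d[v])
        S.add(u)                                 # the closest remaining vertex is final
        for v in range(n):
            w = adjMatrix[u][v]
            if w is not None and d[u] + w < d[v]:
                d[v] = d[u] + w                  # a shorter path to v through u
    return d

m = [[None, 3, None, 4, 20],       # A -> B, D, E
     [None, None, 8, None, None],  # B -> C
     [None, None, None, None, 3],  # C -> E
     [None, None, 5, None, None],  # D -> C
     [None, None, None, None, None]]
print(dijkstra(m))                 # [0, 3, 9, 4, 12]: A-D-C-E beats the direct A-E edge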
Let us apply Dijkstra's algorithm to the given digraph and follow the stages of the
algorithm.
Exercise problems:
1. Construct a minimum – cost spanning tree of a given graph by using prim’s algorithm
and kruskal’s algorithm
2. Find the minimum – cost spanning tree for the below graphs using prim’s and
kruskal’s algorithms
3. Find the shortest path from A to all other vertices using Dijkstra’s algorithm for the
graph as shown below
4. Find the shortest path by using Dijkstra’s algorithm on the following edge weighted
directed graph with the vertex P as the source
PATTERN MATCHING
Computers are well recognized for performing numerical computations, but they are equally
capable of processing textual data. A document generally contains textual data, or simply
text. Computers are used to edit documents, to search documents, to transmit documents over
the Internet, to display documents on monitors, and to print documents on printers.
The main concerns in text processing center on the manipulation or movement of characters
and on searching for patterns or words.
Before proceeding to the search algorithms for text processing, we recall the operations on
strings. Representing a string as an array of characters is simple and efficient.
The pattern-searching problem and its related variations are commonly encountered in
computing.
For example: We may wish to determine whether or not the substring “DATA” occurs in the
text: ADVANCED DATA STRUCTURES
Two substrings can be said to match when they are equal character by character from the
first to the last character. It follows that a match occurs when the number of
character-for-character matches is equal to the length of both the substring sought and the
text substring.
A mismatch between two substrings therefore means that the two substrings are not the same
character for character, or in other words the number of character-for-character matches is
less than the length of the two substrings.
This section uses the following notation: for a text string T of length n and a pattern
string P of length m, we want to find whether P is a substring of T. The concept of a match
is that there is a substring of T starting at some index i that matches P character by
character, so that T[i] = P[0], T[i+1] = P[1], ..., T[i+m−1] = P[m−1]:
T: A D V A N C E D   D A T A   S T R U C T U R E S
P:                   D A T A
Brute Force Algorithm:
A brute-force algorithm solves a problem in the most simple, direct, or obvious way.
A simple brute-force method of string matching starts the comparison of P (the pattern
string) and T (the text string) from the first character of T and the first character of P.
If there is a mismatch, the comparison restarts from the second character of T, and so on.
The running time of the brute-force pattern matching algorithm is not efficient in the worst
case: the running time of the algorithm is O(nm). The algorithm has a quadratic running
time, O(n²), when m = n/2.
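A minimal Python sketch of the brute-force scan:

def brute_force(T, P):
    n, m = len(T), len(P)
    for i in range(n - m + 1):           # candidate starting index in the text
        if T[i:i + m] == P:              # character-by-character comparison
            return i
    return -1                            # no match

print(brute_force("ADVANCED DATA STRUCTURES", "DATA"))    # 9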
Boyer – Moore Algorithm:
This algorithm scans the characters of the search pattern from right to left. If a match is
not found, then a shift is made by some number of characters. This approach is also called
the looking-glass heuristic.
Algorithm:
j m-1;
return -1; //no match
}
Program Logic:
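A minimal Python sketch of the program logic, implementing the simplified Boyer – Moore
search with the last-occurrence (looking-glass) heuristic described above; the full
algorithm also uses a good-suffix rule that is not shown here:

def last_table(P):
    return {c: i for i, c in enumerate(P)}     # last(c): rightmost index of c in P

def boyer_moore(T, P):
    n, m, last = len(T), len(P), last_table(P)
    i = j = m - 1                              # scan the pattern right to left
    while i < n:
        if T[i] == P[j]:
            if j == 0:
                return i                       # match found starting at text index i
            i, j = i - 1, j - 1
        else:
            l = last.get(T[i], -1)             # last(c) = -1 if c is not in the pattern
            i += m - min(j, 1 + l)             # shift the pattern past the mismatch
            j = m - 1
    return -1

print(boyer_moore("XYXZXXYXTZXYXZXYYXXY", "XYXZXY"))    # 10, as in the example below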
Analysis:
The computation of the last function takes O(m + |Σ|) time and the actual search takes
O(mn) time. Therefore the worst-case running time of the Boyer-Moore algorithm is
O(nm + |Σ|). This implies that the worst-case running time is quadratic (in the case n = m),
the same as the naïve algorithm.
The Boyer-Moore algorithm is extremely fast on large alphabets (relative to the length of
the pattern).
Example:
Consider a text T = X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
We will first build the last table using the last(c) function, where c represents a
character of the pattern P = X Y X Z X Y: last(X) = 4, last(Y) = 5, last(Z) = 3.
The only remaining character in the text string is T, which is not present in the pattern,
i.e. last(T) = −1.
STEP 1:
X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
Mismatch
X Y X Z X Y
0 1 2 3 4 5 j
STEP 2:
X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
Mismatch
X Y X Z X Y
0 1 2 3 4 5 j
l = last(X) = 4, j = 3
Since j < l + 1 (3 < 4 + 1), the pattern is shifted by l − j positions, i.e. by 1 position.
STEP 3:
X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
Mismatch
X Y X Z X Y
0 1 2 3 4 5 j
STEP 4:
X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
Mismatch
X Y X Z X Y
0 1 2 3 4 5 j
STEP 5:
X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
Mismatch
X Y X Z X Y
0 1 2 3 4 5 j
STEP 6:
X Y X Z X X Y X T Z X Y X Z X Y Y X X Y
Matched
X Y X Z X Y
0 1 2 3 4 5 j
Now the match for the given pattern is found in the given string
Knuth – Morris – Pratt (KMP) Algorithm:
The repeated comparison of characters in the basic methods above ultimately reduces the
efficiency of pattern matching. Hence the K – M – P algorithm came up, which avoids the
repeated comparison of characters.
The basic idea behind this algorithm is to build a prefix array, sometimes called the π
array. This prefix array is built using the prefix and suffix information of the pattern.
The KMP algorithm achieves an efficiency of O(m + n), which is optimal in the worst case,
where n is the length of the text and m is the length of the pattern.
Let us first understand how to compute the prefix array for a given pattern.
Algorithm 1:
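Algorithm 1 follows the standard prefix-function computation; a minimal Python sketch,
checked against the table built step by step below:

def prefix_table(P):
    m = len(P)
    pi = [0] * m
    k = 0                                   # length of the current matched prefix
    for i in range(1, m):
        while k > 0 and P[i] != P[k]:
            k = pi[k - 1]                   # fall back to the next shorter prefix
        if P[i] == P[k]:
            k += 1
        pi[i] = k
    return pi

print(prefix_table("abadab"))               # [0, 0, 1, 0, 1, 2]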
Example:
STEP 1:
0 1 2 3 4 5
a b a d a b
STEP 2:
0 1 2 3 4 5
a b a d a b
0 0
STEP 3:
0 1 2 3 4 5
a b a d a b
0 0 1
STEP 4:
0 1 2 3 4 5
a b a d a b
0 0 1 0
STEP 5:
suffix: a , d a , a d a , b a d a
0 1 2 3 4 5
a b a d a b
0 0 1 0 1
STEP 6:
suffix: b , a b , d a b , a d a b , b a d a b
0 1 2 3 4 5
a b a d a b
0 0 1 0 1 2
NOTE:
If there is more than one matching prefix – suffix pair, make the entry in the prefix table
with the largest matching length.
E.g.: string a b a b a
Prefix: a, a b, a b a, a b a b
Suffix: a, b a, a b a, b a b a
The entry made in the prefix table is the largest matching length, which is 3 (for a b a).
Tries:
A trie (pronounced "try", from retrieval) is a tree-based data structure for storing strings
in order to support fast pattern matching. The main application for tries is in information
retrieval. The trie uses the digits in the keys to organize and search the dictionary.
(Figure: a trie storing the words allot, alone, and, are, bat, and bad, with shared prefixes
on common paths.)
The above trie shows words like allot, alone, and, are, bat, bad. The idea is that all
strings sharing a common prefix should come from a common node. Tries are used in
spell-checking programs.
A trie is a data structure that supports pattern matching queries in time proportional to the
pattern size.
Advantages:
In tries the keys are searched using common prefixes, so lookup is fast; in a binary search
tree, the lookup time depends on the height of the tree.
Tries take less space when they contain a large number of short strings, because nodes are
shared between the keys.
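A minimal Python sketch of trie insertion and lookup using nested dictionaries; the "$"
end-of-word marker is an illustrative convention:

def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})   # shared prefixes reuse existing nodes
    node["$"] = True                     # mark the end of a complete word

def trie_contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = {}
for w in ["allot", "alone", "and", "are", "bat", "bad"]:
    trie_insert(trie, w)
print(trie_contains(trie, "alone"), trie_contains(trie, "all"))    # True False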
Digital Search Trees:
A digital search tree is a binary tree in which each node contains one element.
Every element is attached as a node using its binary representation; the bits are read from
left to right.
All the keys in the left subtree of a node at level i have bit 0 in the ith position;
similarly, all the keys in the right subtree of a node at level i have bit 1 in the ith
position.
Example:
Consider following stream of keys with binary representation to construct a digital search
tree
A T F R C H I N
(Figure: the digital search tree built by inserting these keys according to their binary
codes.)
Binary trie:
Example:
Consider the elements 0000, 0010, 0001, 1001, 1000, and 1100. The binary trie can be built
as shown below; the numbers in the squares represent bit positions.
(Figure: the binary trie for these elements, with levels 1 through 4 marked.)
The binary trie may contain branch nodes whose degree is one. To create a compressed binary
trie, eliminate the degree-one nodes. The compressed form of the binary trie above is as
follows.
Patricia:
Patricia stands for Practical Algorithm To Retrieve Information Coded In Alphanumeric.
Building a Patricia tree is quite simple.
In Patricia every node will have a bit index. This number is written at every node. Based on
this number the trie will be searched.
Let us understand the procedure of building Patricia with the help of an example
Index:  4 3 2 1 0
A:      0 0 0 0 1
S:      1 0 0 1 1
E:      0 0 1 0 1
R:      1 0 0 1 0
C:      0 0 0 1 1
H:      0 1 0 0 0
I:      0 1 0 0 1
STEP 1:
As this is the very first node, we simply create it as the root node. To obtain its bit
index, we search for the index of the leftmost 1:
index: 4 3 2 1 0
A:     0 0 0 0 1
The leftmost 1 is at index 0, hence the bit index of A is 0. The bit at index 0 of A has
value 1, hence we have a right link up to the self node.
STEP 2: Insert S: 1 0 0 1 1
We start searching for S in the existing trie. A's bit index is 0, and the bit at index 0 of
S is 1; that means S should be attached as a right child of A. But before attaching the node
S we must find the bit index of S. As S is to be attached to A, A is the closest node to S;
hence compare S and A.
STEP 3: Insert E: 0 0 1 0 1
For inserting E into the existing trie we search from the root. The bit index at the root is
4, and the bit at index 4 of E is 0, so we move to the left branch. At A, the bit index is
0, and the bit at index 0 of E is 1.
Hence E can be attached as a right child of A; but before attaching E to A, we must find the
bit index of E. The closest node to E is A, hence compare E and A.
As the bit index of E is 2, we cannot attach E as a child of A (since the bit index of A is
0). Hence we traverse upwards. As the bit index of S is 4, we must attach E as a child of S.
The bit index of E is 2 and the bit at index 2 is 1, hence a right link up to the self node.
STEP 4: Insert R: 1 0 0 1 0
We start from the root, whose bit index is 4. The bit at index 4 of R is 1, so R is to be
attached as a right child of S; S is the nearest neighbor of R, hence compare S and R.
STEP 5: Insert C: 0 0 0 1 1
The search path will be S – E – A. The bit at index 0 of A is 1, hence C can be attached as
a right child of A; A is the nearest node to C.
But the bit index of A is 0 and the bit index of C is 1, hence C cannot be the child of A.
We traverse up; the bit at index 1 of C is 1, so the right link of C is up to the self node.
STEP 6: Insert H: 0 1 0 0 0
We cannot attach H as a child of A, so we traverse up, towards S. The bit at index 3 of H is
1, so there is a right link up to the self node.
STEP 7: Insert I: 0 1 0 0 1