Unit 3 Tries

Uploaded by

vani.cs4014

12.5 Tries

Objective: You will learn how preprocessing the text using data structures like
tries can optimize pattern matching, especially in scenarios where multiple queries
are performed on a fixed text.

1. Preprocessing Text for Pattern Matching

In traditional pattern matching algorithms like Knuth-Morris-Pratt (KMP), the
preprocessing step is focused on the pattern, not the text. This preprocessing helps
in speeding up the search process when the pattern is matched against the text.

However, in cases where we have a fixed text and multiple patterns to search for, a
different strategy is beneficial. Instead of preprocessing the pattern for each query,
you preprocess the text to make each query faster. This is particularly useful in
scenarios like:

 Web search engines: Where a fixed set of web pages (the text) is queried
multiple times with different search terms (patterns).
 Genomic databases: Where a fixed DNA sequence (the text) is queried
multiple times with different DNA patterns.

2. Trie Data Structure

A trie (pronounced “try”) is a specialized tree-based data structure that is
particularly effective for string storage and retrieval. It is also known as a prefix
tree. The key features and operations of tries are:

 Structure:
o Each node in a trie represents a character of a string.
o The path from the root to a node represents a prefix of some strings
stored in the trie.
o A path from the root to a node that marks the end of a string
represents a complete string stored in the trie.
 Insertion:
o To insert a string into the trie, you start at the root and create nodes for
each character in the string, if they do not already exist.
 Search:
o To search for a string or prefix, you start at the root and follow the
path dictated by the characters of the string. If you reach the end of
the string and are still in the trie, the string or prefix exists in the trie.
 Prefix Matching:
o Tries support efficient prefix matching. For a given prefix, you can
quickly find all strings in the trie that start with that prefix.
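
The operations listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class and method names (TrieNode, Trie, contains, has_prefix, with_prefix) and the is_end flag used to mark where a stored string ends are assumptions made for this sketch.

```python
class TrieNode:
    """One trie node: child links keyed by character, plus an end-of-string flag."""

    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Walk from the root, creating nodes only where they are missing."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True

    def _walk(self, s):
        """Follow the path spelled by s; return the last node, or None."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        """True if word was inserted as a complete string."""
        node = self._walk(word)
        return node is not None and node.is_end

    def has_prefix(self, prefix):
        """True if at least one stored string starts with prefix."""
        return self._walk(prefix) is not None

    def with_prefix(self, prefix):
        """Return all stored strings that start with prefix, sorted."""
        start = self._walk(prefix)
        if start is None:
            return []
        found = []
        stack = [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_end:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)


t = Trie()
for w in ["car", "cat", "cab"]:
    t.insert(w)
print(t.contains("car"), t.contains("ca"), t.has_prefix("ca"))
print(t.with_prefix("ca"))
```

Storing children in a dictionary keeps each step of a lookup close to constant time; other layouts (arrays indexed by character, sorted lists) trade memory against lookup cost.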

3. Applications and Use Cases

Information Retrieval:

 Web Search Engines:
o When a search engine indexes web pages, it constructs a trie for the
keywords and phrases present in those pages. Subsequent search
queries can then be matched against this preprocessed trie quickly.
 Genomic Databases:
o In bioinformatics, large DNA sequences are often queried for specific
patterns. A trie can be built to store all possible patterns or substrings
of a DNA sequence, allowing rapid queries to find whether a specific
DNA pattern exists or is a prefix of any stored sequences.

Example:

Consider a text set S with the following strings:

1. "banana"
2. "bandana"
3. "band"

To build a trie for these strings:

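 Insert "banana":
o Root → 'b' → 'a' → 'n' → 'a' → 'n' → 'a'
 Insert "bandana":
o Root → 'b' → 'a' → 'n' → 'd' → 'a' → 'n' → 'a' (the first three nodes are shared with "banana")
 Insert "band":
o Root → 'b' → 'a' → 'n' → 'd' (no new nodes are needed; the existing 'd' node is marked as the end of a string)

Now, if we query the trie for the prefix "ban", we’ll find all strings in S that start
with "ban", which includes "banana", "bandana", and "band".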
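
The example can be checked with a short sketch. Because "band" is a prefix of "bandana", the sketch appends an end marker to every string; the marker character "$" and the function names build_trie and words_with_prefix are assumptions made for illustration.

```python
END = "$"  # assumed end-of-string marker, not part of the alphabet


def build_trie(words):
    """Build a nested-dict trie; each key is a character, each value a child node."""
    root = {}
    for word in words:
        node = root
        for ch in word + END:
            node = node.setdefault(ch, {})
    return root


def words_with_prefix(root, prefix):
    """Return all stored words that start with prefix, sorted."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]

    results = []

    def collect(current, text):
        for ch, child in current.items():
            if ch == END:
                results.append(text)       # a stored word ends here
            else:
                collect(child, text + ch)  # keep descending

    collect(node, prefix)
    return sorted(results)


trie = build_trie(["banana", "bandana", "band"])
print(words_with_prefix(trie, "ban"))   # ['banana', 'band', 'bandana']
print(words_with_prefix(trie, "band"))  # ['band', 'bandana']
```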

12.5.1 Standard Tries

What is a Standard Trie?

A trie (pronounced “try”) is a tree-based data structure used to efficiently store and
retrieve strings, particularly when there is a need for quick prefix-based searches.
The standard trie is a specialized form where certain properties and constraints are
applied.

Key Properties of Standard Tries

1. Structure:
o Ordered Tree: A standard trie is an ordered tree with each node
representing a character from an alphabet Σ.
o Nodes and Labels: Each node, except the root, is labeled with a
character from Σ. The root does not carry a label.
2. Canonical Ordering:
o The children of each internal node are ordered according to a
canonical ordering of the alphabet Σ. This ordering ensures a
consistent and predictable structure within the trie.
3. External Nodes:
o External Nodes: The trie has exactly s external (leaf) nodes, each
corresponding to one of the s strings in the set S.
o String Representation: The path from the root to an external node
represents the string associated with that node. The labels along this
path concatenate to form the string from S associated with that
external node.
4. Uniqueness:
o No Prefix Overlap: It is assumed that no string in S is a prefix of
another string. This ensures that each string in S has a unique path
from the root to an external node in the trie.
o Special Character: To satisfy this assumption in practice, a special
character not in the original alphabet Σ is appended to each string.
This special character marks the end of each string and avoids any
prefix overlap.
5. Children and Depth:
o Children Count: An internal node can have between 1 and d
children, where d is the size of the alphabet Σ.
o Path Representation: A path from the root to an internal node at
depth i represents an i-character prefix of a string in S. For each
possible character that can follow this prefix, there is a corresponding
child of the internal node labeled with that character.
6. Multi-way and Binary Tries:
o General Case: If there are d characters in the alphabet, the trie is a
multi-way tree, where internal nodes have between 1 and d children.
o Binary Trie: For an alphabet with only two characters, the trie
effectively becomes a binary tree. Internal nodes may have only one
child, leading to an improper binary tree structure.

Example of a Standard Trie

Consider a set S with the following strings:

1. "car"
2. "cat"
3. "cab"

Here's how we can construct the trie:

1. Insert "car":
o Root → 'c' → 'a' → 'r'
2. Insert "cat":
o Root → 'c' → 'a' → 't'
3. Insert "cab":
o Root → 'c' → 'a' → 'b'

The trie for this set of strings would look like this:

 The path "c" → "a" → "r" corresponds to "car".
 The path "c" → "a" → "t" corresponds to "cat".
 The path "c" → "a" → "b" corresponds to "cab".

Structural Properties

1. Prefix Storage:
o The trie stores common prefixes efficiently. For example, all three
strings share the prefix "ca".
2. Space Complexity:
o The space used by a trie depends on the number of strings and their
shared prefixes. Although a trie can be space-intensive, each shared
prefix is stored only once, so it can be more space-efficient than
storing every string separately.
3. Time Complexity:
o Insertion and Search: Both operations take O(dm) time, where m is
the length of the string being inserted or searched for and d is the
size of the alphabet. For a constant-size alphabet this is O(m),
regardless of the shape of the trie.

Proposition 12.8 provides key properties of a standard trie (prefix tree) that
stores a collection S of s strings from an alphabet of size d.

1. Every Internal Node Has at Most d Children

 Explanation: In a standard trie, each internal node represents a character in
one of the strings. Since the alphabet has size d, each node can have at most
d children. This is because each child corresponds to one of the d possible
characters in the alphabet.
o Example: If the alphabet is {a, b, c}, then each internal node in the
trie can have at most three children, corresponding to each of these
characters.

2. The Trie Has s External Nodes

 Explanation: External nodes (or leaf nodes) in the trie are those where a
string ends. For s strings, there are s external nodes because each string will
end at a unique node in the trie.
o Example: If the strings are "cat", "dog", and "bat", then the trie will
have exactly three external nodes, one for each string, marking the
end of each string.

3. The Height of the Trie Is Equal to the Length of the Longest String in S

 Explanation: The height of a trie is determined by the length of the longest
string stored in it. This is because the trie must be deep enough to
accommodate the longest string, with each level of the trie corresponding to
a character position in the string.
o Example: If the longest string in the collection is "elephant" (which
has 8 characters), the height of the trie will be 8.

4. The Number of Nodes in the Trie Is O(n)

 Explanation: The total number of nodes in the trie is at most proportional to the
total length of all strings, n, where n is the sum of the lengths of all strings.
Each character of each string creates at most one new node, so the trie has at
most n + 1 nodes (counting the root). The bound is reached when no two strings
share a prefix.
o Example: If the collection of strings has a total length of 100
characters, then the trie has at most 101 nodes, and fewer whenever
strings share common prefixes.

Summary

 Internal Node Children: Each internal node can have up to d children,
corresponding to the number of characters in the alphabet.
 External Nodes: The number of external nodes (leaves) equals the number
of strings s.
 Height: The height of the trie corresponds to the length of the longest string
in the collection.
 Number of Nodes: The total number of nodes in the trie is proportional to
the total length n of all strings combined, making it O(n).
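
The four properties can be verified on a small trie. The sketch below assumes a prefix-free set of strings (so no end marker is needed) and uses nested dictionaries as nodes; the function names build_standard_trie and trie_stats are illustrative only.

```python
def build_standard_trie(strings):
    """Nested-dict trie for a prefix-free set of strings."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def trie_stats(root):
    """Return (node count, leaf count, height, max children of any node)."""
    nodes = leaves = height = max_children = 0
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        nodes += 1
        height = max(height, depth)
        max_children = max(max_children, len(node))
        if not node:          # no children: external node
            leaves += 1
        for child in node.values():
            stack.append((child, depth + 1))
    return nodes, leaves, height, max_children


S = ["car", "cat", "cab"]
nodes, leaves, height, max_children = trie_stats(build_standard_trie(S))
n = sum(len(s) for s in S)            # total length of all strings
d = len(set("".join(S)))              # alphabet actually used: {a, b, c, r, t}

print(max_children <= d)              # property 1
print(leaves == len(S))               # property 2
print(height == max(map(len, S)))     # property 3
print(nodes <= n + 1)                 # property 4
```

For this set the trie has 6 nodes (the root, 'c', 'a', and the three leaves), well below the bound n + 1 = 10, because all three strings share the prefix "ca".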

Using a Trie as a Dictionary

A trie is an efficient data structure for storing and searching a set of strings, and it
can be used as a dictionary where each string is a key.

Search Operation in a Trie

1. Searching for a String:
o To search for a string X in the trie:
 Trace the Path: Start at the root node and follow the path
indicated by each character in X. Move from node to node
according to the characters of X.
 Check the End: If you can trace the entire path of X and end at
an external (leaf) node, then X is present in the trie.
 Path Issues: If you cannot trace the path completely (because a
required character node is missing) or if the path ends at an
internal node (not a leaf), then X is not present.
o Example:
 Suppose we have a trie with strings "bull", "bat", and "bet".
 To search for "bull", follow the path from the root: 'b' → 'u' →
'l' → 'l', ending at an external node. Hence, "bull" is found.
 To search for "bet", follow the path from the root: 'b' → 'e' →
't'. The path exists and ends at an external node, so "bet" is
found.
 For "be", the path exists but ends at an internal node, so "be" is
not a complete string in the trie.
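
The search rule can be sketched as follows, assuming a prefix-free set of strings stored in a nested-dict trie. The names build_trie and search are illustrative; a node with no children plays the role of an external node.

```python
def build_trie(strings):
    """Nested-dict trie for a prefix-free set of strings."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def search(root, x):
    """True only if the path for x exists and ends at an external (leaf) node."""
    node = root
    for ch in x:
        if ch not in node:
            return False   # path breaks: a required child is missing
        node = node[ch]
    return not node        # empty dict = no children = external node


trie = build_trie(["bull", "bat", "bet"])
print(search(trie, "bull"))  # True
print(search(trie, "bet"))   # True
print(search(trie, "be"))    # False: the path ends at an internal node
print(search(trie, "bid"))   # False: no 'i' child under 'b'
```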

Time Complexity for Search

 Time Complexity: Searching for a string of length m involves:
o Visiting up to m+1 nodes (one node per character plus the root node).
o Each node can have up to d children, where d is the size of the
alphabet. Thus, at each node, the operation involves checking one of d
possible children.
o Overall Time Complexity: The time spent at each node is O(d), so
the total time for searching a string of length m is O(dm). For fixed-
size alphabets (constant d), this simplifies to O(m).
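
The O(dm) bound comes from scanning up to d children at each of the m steps. The sketch below makes that cost visible by storing children in an ordered list and counting label comparisons; the Node class and the function search_counting are assumptions made for this illustration.

```python
class Node:
    def __init__(self, label=None):
        self.label = label
        self.children = []  # child nodes, kept in alphabetical order


def insert(root, word):
    """Insert word, scanning each child list linearly."""
    node = root
    for ch in word:
        for child in node.children:
            if child.label == ch:
                node = child
                break
        else:
            # No matching child: create one and keep the list ordered.
            new = Node(ch)
            node.children.append(new)
            node.children.sort(key=lambda c: c.label)
            node = new


def search_counting(root, word):
    """Return (found, comparisons), where comparisons counts label checks."""
    node, comparisons = root, 0
    for ch in word:
        nxt = None
        for child in node.children:
            comparisons += 1
            if child.label == ch:
                nxt = child
                break
        if nxt is None:
            return False, comparisons
        node = nxt
    return not node.children, comparisons


root = Node()
for w in ["bull", "bat", "bet"]:
    insert(root, w)

print(search_counting(root, "bull"))  # (True, 6)
print(search_counting(root, "bat"))   # (True, 3)
```

Searching for "bull" costs 1 + 3 + 1 + 1 = 6 comparisons: the 'u' child is the third entry under 'b'. In general the count never exceeds d·m; with a hash map or an array indexed by character, the per-node cost drops to O(1).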

Using a Trie for Pattern Matching

 Word Matching: Involves checking if a pattern exactly matches one of the
words in the dictionary.
o With a trie, this can be done in O(dm) time, where m is the length of
the pattern and d is the size of the alphabet. The time complexity is
independent of the text size and depends only on the length of the
pattern and the alphabet size.
 Prefix Matching: A variant where you check if a pattern matches the
beginning of any word in the dictionary. This can also be efficiently handled
using a trie.
 Limitations: The trie cannot efficiently handle patterns that span multiple
words or are proper suffixes of words.

Constructing a Trie

 Insertion Process:
o Incremental Insertion: Insert strings one at a time into the trie.
o Path Tracing: For each string X:
 Trace the path of X in the trie.
 Because no string in S is a prefix of another, the trace stops at an
internal node before X has been fully traced.
 Create a new chain of nodes to store the remaining characters of
X from the point where you stopped.
o Time Complexity:
 Insertion of One String: Inserting a string of length m requires
O(dm) time.
 Constructing Trie for All Strings: With s strings and total
length n, the total construction time is O(dn).
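
The incremental insertion can be sketched as a trace step followed by a chain-building step. The function name insert and its return value (the number of nodes created) are choices made for this illustration, and nodes are plain nested dictionaries.

```python
def insert(root, x):
    """Insert x into a nested-dict trie; return how many new nodes were created."""
    node, i = root, 0
    # Trace the existing path for as long as it matches x.
    while i < len(x) and x[i] in node:
        node = node[x[i]]
        i += 1
    # Build a fresh chain of nodes for the remaining characters x[i:].
    created = 0
    for ch in x[i:]:
        node[ch] = {}
        node = node[ch]
        created += 1
    return created


root = {}
counts = [insert(root, w) for w in ["bear", "bell", "bid", "bull"]]
print(counts)  # [4, 2, 2, 3]
```

"bell" reuses the nodes for "be" and adds only two nodes; "bid" and "bull" reuse the 'b' node. The total number of created nodes (11) is below the total length of the strings (15) because of this sharing.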

12.5.2 Compressed Tries

A compressed trie is a variation of a standard trie data structure designed to reduce
its size by simplifying certain parts of the tree.

Standard Trie

A standard trie (or prefix tree) is a tree-like data structure used to store strings,
where each node represents a single character of the string. Internal nodes can have
one or more children, and each path from the root to a leaf node represents a
distinct string.

Compression in Tries

In a standard trie, it's possible to have chains of nodes where each node (except the
last one) has only one child. This can lead to a lot of redundant nodes and edges. A
compressed trie addresses this redundancy by combining these chains into single
edges, effectively compressing the trie.

Redundant Nodes

A node in the trie is considered redundant if:

 It has exactly one child.
 It is not the root node.

For example, in a trie where nodes represent individual characters and you have a
sequence of nodes where each node only leads to one other node, those nodes are
redundant.

Redundant Chains

A redundant chain is a sequence of k ≥ 2 edges:

 (v0, v1), (v1, v2), ..., (vk-1, vk), where vi is redundant for i = 1, ..., k-1.
 v0 (the starting node) and vk (the ending node) are not redundant.

In other words, the chain starts and ends with non-redundant nodes and has a series
of redundant nodes in between.

Transformation to a Compressed Trie

To transform a standard trie T into a compressed trie:

1. Identify Redundant Chains: Locate chains of redundant nodes.
2. Replace Chains: For each identified chain (v0, v1, ..., vk), replace the
sequence of edges (v0, v1), (v1, v2), ..., (vk-1, vk) with a single edge (v0,
vk).
3. Relabel the Edge: The new edge (v0, vk) should be labeled with the
concatenation of the labels of the nodes from v1 to vk.

Example

Suppose you have a standard trie where:

 v0 leads to v1
 v1 leads to v2
 v2 leads to v3

If v1 and v2 are redundant (i.e., they each have only one child and are not the root),
you can compress this chain into a single edge from v0 to v3. The label on this new
edge would be the concatenation of the labels from v1 to v3.
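
The transformation can be sketched on a nested-dict trie. The sketch assumes a prefix-free set of strings (otherwise a string ending in the middle of a chain would be lost) and stores each merged label as the dictionary key of the child it leads to. The function names build_trie, compress, and count_nodes are illustrative.

```python
def build_trie(strings):
    """Standard nested-dict trie: one character per node."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Return a compressed copy in which each redundant chain is one label."""
    out = {}
    for ch, child in node.items():
        label = ch
        # Absorb redundant nodes: non-root nodes with exactly one child.
        while len(child) == 1:
            next_ch, next_child = next(iter(child.items()))
            label += next_ch
            child = next_child
        out[label] = compress(child)
    return out


def count_nodes(node):
    """Count this node plus all of its descendants."""
    return 1 + sum(count_nodes(c) for c in node.values())


S = ["bear", "bell", "bid", "bull", "stock", "stop"]
standard = build_trie(S)
compressed = compress(standard)

print(compressed)
print(count_nodes(standard), count_nodes(compressed))  # 18 10
```

For these six strings the compressed trie has labels "b", "e", "ar", "ll", "id", "ull", "sto", "ck", and "p", and the node count drops from 18 to 10.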

Benefits

The compressed trie reduces the number of nodes and edges by eliminating
redundancy, which can save memory and make operations like search and insert
more efficient.

Proposition 12.9: A compressed trie T storing a collection S of s strings from an
alphabet of size d has the following properties:
• Every internal node of T has at least two children and at most d children.
• T has s external nodes.
• The number of nodes of T is O(s).

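Example: Representing a compressed trie

The strings are stored in an array S:
S[0] = bear
S[1] = bell
S[2] = bid
S[3] = bull
S[4] = stock
S[5] = stop

Each node is represented by a triple (i, j, k), which stands for the substring
S[i][j..k] that labels the node, where
i is the index in S of a string that contains the label,
j is the position in S[i] at which the label starts, and
k is the position in S[i] at which the label ends.
For example, the node labeled "sto" can be stored as (4, 0, 2), since S[4][0..2] = "sto".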
This additional compression scheme allows us to reduce the total space for the trie
itself from O(n) for the standard trie to O(s) for the compressed trie, where n is the
total length of the strings in S and s is the number of strings in S
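
The triple representation can be sketched as follows. The triples listed here were worked out by hand for the array S above and are one valid choice among several (for instance, the label "b" could equally be taken from S[1], S[2], or S[3]); the function name decode is an assumption.

```python
S = ["bear", "bell", "bid", "bull", "stock", "stop"]


def decode(triple):
    """Turn a triple (i, j, k) back into the label S[i][j..k], k inclusive."""
    i, j, k = triple
    return S[i][j:k + 1]


# One possible set of triples for the nodes of the compressed trie of S.
triples = {
    "b":   (0, 0, 0),
    "e":   (0, 1, 1),
    "ar":  (0, 2, 3),
    "ll":  (1, 2, 3),
    "id":  (2, 1, 2),
    "ull": (3, 1, 3),
    "sto": (4, 0, 2),
    "ck":  (4, 3, 4),
    "p":   (5, 3, 3),
}

for label, triple in triples.items():
    print(triple, "->", decode(triple))
```

Each node then needs only three integers regardless of how long its label is, which is what keeps the space for the trie itself at O(s).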

12.5.3 Suffix tries


A suffix trie (also known as a suffix tree or position tree) is a specialized form of
trie used to represent all suffixes of a given string X. It is particularly useful in
string processing and various applications like substring searches, pattern
matching, and more.

A suffix trie for a string X is a trie constructed for all suffixes of X. Each suffix is
a substring that starts at some position i and extends to the end of the string. For
example, if X="minimize", its suffixes include "minimize", "inimize", "nimize",
"imize", "mize", "ize", "ze", and "e".

To build a suffix trie for X, you would include all suffixes of X as strings in the
trie. Specifically, for a string X of length n, you build the trie for the set of strings
X[i..n−1] for i=0,1,…,n−1. This means you add each suffix starting from every
possible position in X.
The suffix trie can then be compressed, and each node label, which is a substring
X[j..k] of X, can be represented compactly by the pair of indices (j, k).
We can construct the suffix trie for a string of length n with an incremental
algorithm like the one given in Section 12.5.1. This construction takes O(dn^2) time
because the total length of the suffixes is quadratic in n. However, the (compact)
suffix trie for a string of length n can be constructed in O(n) time with a
specialized algorithm, different from the one for general tries.
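
The incremental construction can be sketched directly: insert every suffix of X into an uncompressed nested-dict trie. This is the quadratic-time approach, not the specialized linear-time algorithm, and the function names build_suffix_trie and is_substring are illustrative. Since every substring of X is a prefix of some suffix, a pattern occurs in X exactly when its path can be traced from the root.

```python
def build_suffix_trie(x):
    """Insert every suffix x[i:] into a nested-dict trie (quadratic in len(x))."""
    root = {}
    for i in range(len(x)):
        node = root
        for ch in x[i:]:
            node = node.setdefault(ch, {})
    return root


def is_substring(root, pattern):
    """True if pattern can be traced from the root, i.e. it occurs in x."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("minimize")
print(is_substring(trie, "nimi"))  # True
print(is_substring(trie, "mize"))  # True
print(is_substring(trie, "zen"))   # False
```

Once the trie is built, each query takes time proportional to the pattern length, independent of the length of X.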

12.5.4 Search Engines


The World Wide Web contains a huge collection of text documents (Web pages).
Information about these pages is gathered by a program called a Web crawler,
which then stores this information in a special dictionary database.
A Web search engine allows users to retrieve relevant information from this
database, thereby identifying relevant pages on the Web containing given
keywords.

Inverted Files / Inverted Index:

 Purpose: The core data structure used by search engines to efficiently locate
documents (web pages) that contain specific words or keywords.
 Structure: The inverted index (or inverted file) stores key-value pairs:
o Key: A word (or index term).
o Value: A collection of web pages (or occurrence list) where the word
appears.

The keys (words) in this dictionary are called index terms and should be a set of
vocabulary entries and proper nouns as large as possible. The elements in this
dictionary are called occurrence lists and should cover as many Web pages as
possible.

We can efficiently implement an inverted index with a data structure consisting of:
1. An array storing the occurrence lists of the terms (in no particular order)
2. A compressed trie for the set of index terms, where each external node stores the
index of the occurrence list of the associated term

The reason for storing the occurrence lists outside the trie is to keep the size of the
trie data structure sufficiently small to fit in internal memory. Instead, because of
their large total size, the occurrence lists have to be stored on disk.

When multiple keywords are given and the desired output is the set of pages containing
all the given keywords, we retrieve the occurrence list of each keyword using the
trie and return their intersection.
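
The two-part layout and the multi-keyword query can be sketched as follows. To keep the example short, a plain Python dictionary stands in for the compressed trie that maps each index term to the position of its occurrence list. The page numbers, the terms, and the names occurrence_lists, term_index, and pages_with_all are invented for illustration.

```python
# Part 1: array of occurrence lists (sets of hypothetical page ids).
occurrence_lists = [
    {1, 2, 5},  # index 0
    {2, 3, 5},  # index 1
    {2, 4},     # index 2
]

# Part 2: index term -> position in occurrence_lists.
# A dict is used here as a stand-in for the compressed trie.
term_index = {"trie": 0, "suffix": 1, "search": 2}


def pages_with_all(terms):
    """Return the set of pages that contain every term in terms."""
    result = None
    for term in terms:
        idx = term_index.get(term)
        if idx is None:
            return set()  # an unknown term matches no page
        pages = occurrence_lists[idx]
        result = set(pages) if result is None else result & pages
    return result if result is not None else set()


print(pages_with_all(["trie", "suffix"]))            # {2, 5}
print(pages_with_all(["trie", "suffix", "search"]))  # {2}
```

In practice the occurrence lists are often kept sorted so that the intersection can be computed by a merge-style scan, which suits lists that live on disk.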
