Unit 3 Tries
5 Tries
Objective: You will learn how preprocessing the text using data structures like
tries can optimize pattern matching, especially in scenarios where multiple queries
are performed on a fixed text.
When a fixed text is searched for many different patterns, preprocessing each
pattern is wasteful. Instead, you preprocess the text once so that every query
is faster. This is particularly useful in scenarios like:
Web search engines: Where a fixed set of web pages (the text) is queried
multiple times with different search terms (patterns).
Genomic databases: Where a fixed DNA sequence (the text) is queried
multiple times with different DNA patterns.
Structure:
o Each node in a trie represents a character of a string.
o The path from the root to a node represents a prefix of some strings
stored in the trie.
o A path from the root to a node marked as the end of a string
represents a complete string stored in the trie.
Insertion:
o To insert a string into the trie, you start at the root and create nodes for
each character in the string, if they do not already exist.
Search:
o To search for a string or prefix, you start at the root and follow the
path dictated by the characters of the string. If you reach the end of
the string without leaving the trie, the prefix exists in the trie. A
complete string is present only if the final node also marks the end
of a stored string.
Prefix Matching:
o Tries support efficient prefix matching. For a given prefix, you can
quickly find all strings in the trie that start with that prefix.
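The operations above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names TrieNode, Trie, contains, and has_prefix are hypothetical, and children are kept in a dictionary for brevity.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_end = False   # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Walk from the root, creating nodes only where they are missing."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True

    def _walk(self, text):
        """Follow the path spelled by text; return the final node or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        """True only if word was inserted as a complete string."""
        node = self._walk(word)
        return node is not None and node.is_end

    def has_prefix(self, prefix):
        """True if at least one stored string starts with prefix."""
        return self._walk(prefix) is not None
```

Note that contains("ban") is False after inserting only "band", while has_prefix("ban") is True: the end-of-string flag is what separates a stored string from a mere prefix.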
Information Retrieval:
Example: Consider the set S of strings:
1. "banana"
2. "bandana"
3. "band"
Insert "banana":
o Root → 'b' → 'a' → 'n' → 'a' → 'n' → 'a'
Insert "bandana":
o Root → 'b' → 'a' → 'n' → 'd' → 'a' → 'n' → 'a'
Insert "band":
o Root → 'b' → 'a' → 'n' → 'd'
Now, if we query the trie for the prefix "ban", we’ll find all strings in S that start
with "ban", which includes "banana", "bandana", and "band".
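The prefix query in this example can be reproduced with a small sketch. It assumes a nested-dictionary trie and a hypothetical end marker key "$" that never occurs in the stored strings; the function names build_trie and words_with_prefix are illustrative only.

```python
END = "$"  # hypothetical end-of-string marker; assumed absent from the words


def build_trie(words):
    """Build a nested-dict trie; each key is one character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def words_with_prefix(root, prefix):
    """Return, in sorted order, every stored word that starts with prefix."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    results = []
    stack = [(node, prefix)]
    while stack:
        current, text = stack.pop()
        for key, child in current.items():
            if key == END:
                results.append(text)       # a stored word ends here
            else:
                stack.append((child, text + key))
    return sorted(results)


trie = build_trie(["banana", "bandana", "band"])
print(words_with_prefix(trie, "ban"))   # ['banana', 'band', 'bandana']
print(words_with_prefix(trie, "bana"))  # ['banana']
```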
A trie (pronounced “try”) is a tree-based data structure used to efficiently store and
retrieve strings, particularly when there is a need for quick prefix-based searches.
The standard trie is a specialized form where certain properties and constraints are
applied.
1. Structure:
o Ordered Tree: A standard trie is an ordered tree with each node
representing a character from an alphabet Σ
o Nodes and Labels: Each node, except the root, is labeled with a
character from Σ. The root does not carry a label.
2. Canonical Ordering:
o The children of each internal node are ordered according to a
canonical ordering of the alphabet Σ. This ordering ensures a
consistent and predictable structure within the trie.
3. External Nodes:
o External Nodes: The trie has exactly s external (leaf) nodes, each
corresponding to one of the s strings in the set S.
o String Representation: The path from the root to an external node
represents the string associated with that node. The labels along this
path concatenate to form the string from S associated with that
external node.
4. Uniqueness:
o No Prefix Overlap: It is assumed that no string in S is a prefix of
another string. This ensures that each string in S has a unique path
from the root to an external node in the trie.
o Special Character: To satisfy this assumption in practice, a special
character not in the original alphabet Σ is appended to each string.
This special character marks the end of each string and avoids any
prefix overlap.
5. Children and Depth:
o Children Count: An internal node can have between 1 and d
children, where d is the size of the alphabet Σ.
o Path Representation: A path from the root to an internal node at
depth i represents an i-character prefix of a string in S. For each
possible character that can follow this prefix, there is a corresponding
child of the internal node labeled with that character.
6. Multi-way and Binary Tries:
o General Case: If there are d characters in the alphabet, the trie is a
multi-way tree, where internal nodes have between 1 and d children.
o Binary Trie: For an alphabet with only two characters, the trie
effectively becomes a binary tree. Internal nodes may have only one
child, leading to an improper binary tree structure.
1. "car"
2. "cat"
3. "cab"
1. Insert "car":
o Root → 'c' → 'a' → 'r'
2. Insert "cat":
o Root → 'c' → 'a' → 't'
3. Insert "cab":
o Root → 'c' → 'a' → 'b'
The trie for this set of strings would look like this (children in alphabetical
order):
root
 └─ c
     └─ a
         ├─ b
         ├─ r
         └─ t
Structural Properties
1. Prefix Storage:
o The trie stores common prefixes efficiently. For example, all three
strings share the prefix "ca".
2. Space Complexity:
o The space used by a trie depends on the number of strings and their
shared prefixes. Because a shared prefix is stored only once, a trie
can use less space than storing every string separately, although
per-node overhead can still make it space-intensive.
3. Time Complexity:
o Insertion and Search: Both operations take O(m) time, where m is
the length of the string being inserted or searched for, assuming a
constant-size alphabet. The cost does not depend on how many
strings the trie stores.
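The prefix-sharing claim can be checked by counting nodes. The sketch below assumes a nested-dictionary trie and a hypothetical helper count_nodes that returns the number of non-root nodes created.

```python
def count_nodes(words):
    """Insert every word and return how many non-root nodes were created."""
    root = {}
    created = 0
    for word in words:
        node = root
        for ch in word:
            if ch not in node:
                node[ch] = {}
                created += 1   # a new node is needed only off the shared prefix
            node = node[ch]
    return created


words = ["car", "cat", "cab"]
print(sum(len(w) for w in words))  # 9 characters in total
print(count_nodes(words))          # 5 nodes: c, a, r, t, b
```

The shared prefix "ca" is stored once, so nine characters need only five nodes.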
Proposition 12.8 provides key properties of a standard trie (prefix tree) that
stores a collection S of s strings from an alphabet of size d.
2. The Trie Has s External Nodes
Explanation: External nodes (or leaf nodes) in the trie are those where a
string ends. For s strings, there are s external nodes because each string will
end at a unique node in the trie.
o Example: If the strings are "cat", "dog", and "bat", then the trie will
have exactly three external nodes, one for each string, marking the
end of each string.
3. The Height of the Trie Is Equal to the Length of the Longest String in S
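Both properties can be verified on the "cat", "dog", "bat" example. This is a sketch that assumes no string is a prefix of another (so no end marker is needed); count_external and height are hypothetical helper names.

```python
def build_trie(words):
    """Nested-dict trie; assumes no word is a prefix of another word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def count_external(node):
    """Number of external (leaf) nodes below and including node."""
    if not node:
        return 1
    return sum(count_external(child) for child in node.values())


def height(node):
    """Number of edges on the longest root-to-leaf path."""
    if not node:
        return 0
    return 1 + max(height(child) for child in node.values())


trie = build_trie(["cat", "dog", "bat"])
print(count_external(trie))  # 3, one external node per string
print(height(trie))          # 3, the length of the longest string
```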
Summary
A trie is an efficient data structure for storing and searching a set of strings, and it
can be used as a dictionary where each string is a key.
Constructing a Trie
Insertion Process:
o Incremental Insertion: Insert strings one at a time into the trie.
o Path Tracing: For each string X:
Trace the path of X in the trie.
Because no string in S is a prefix of another, the trace stops at an
internal node before X is fully traced.
Create a new chain of nodes to store the remaining characters of
X from the point where you stopped.
o Time Complexity:
Insertion of One String: Inserting a string of length m requires
O(dm) time.
Constructing Trie for All Strings: With s strings and total
length n, the total construction time is O(dn).
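The factor d in these bounds comes from searching the children of a node. The sketch below assumes each node keeps its children in a list sorted by label and scans that list linearly; Node, insert, and build are hypothetical names, and the end-of-string character is omitted for brevity.

```python
class Node:
    def __init__(self, label=None):
        self.label = label    # character stored at this node; None for the root
        self.children = []    # child nodes, kept sorted by label


def insert(root, word):
    """Trace word from the root, then extend the trie with what remains."""
    node = root
    for ch in word:
        position = 0
        # Linear scan over at most d children: this is the factor d in O(dm).
        while (position < len(node.children)
               and node.children[position].label < ch):
            position += 1
        if (position < len(node.children)
                and node.children[position].label == ch):
            node = node.children[position]         # still tracing an existing path
        else:
            child = Node(ch)                       # start or continue the new chain
            node.children.insert(position, child)  # keeps canonical ordering
            node = child


def build(words):
    """Incrementally insert the words one at a time."""
    root = Node()
    for word in words:
        insert(root, word)
    return root


root = build(["car", "cat", "cab"])
a_node = root.children[0].children[0]
print([child.label for child in a_node.children])  # ['b', 'r', 't']
```

Replacing the sorted list with a hash map would remove the factor d in expectation, at the cost of losing the canonical child ordering.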
12.5.2 Compressed Tries
Standard Trie
A standard trie (or prefix tree) is a tree-like data structure used to store strings,
where each node represents a single character of the string. Internal nodes can have
one or more children, and each path from the root to a leaf node represents a
distinct string.
Compression in Tries
In a standard trie, it's possible to have chains of nodes where each node (except the
last one) has only one child. This can lead to a lot of redundant nodes and edges. A
compressed trie addresses this redundancy by combining these chains into single
edges, effectively compressing the trie.
Redundant Nodes
For example, in a trie where nodes represent individual characters and you have a
sequence of nodes where each node only leads to one other node, those nodes are
redundant.
Redundant Chains
A chain of edges (v0, v1), (v1, v2), ..., (v(k-1), vk) is redundant if vi is
redundant for i = 1, ..., k-1, while v0 (the starting node) and vk (the ending
node) are not redundant.
In other words, the chain starts and ends with non-redundant nodes and has a series
of redundant nodes in between.
v0 leads to v1
v1 leads to v2
v2 leads to v3
If v1 and v2 are redundant (i.e., they each have only one child and are not the root),
you can compress this chain into a single edge from v0 to v3. The label on this new
edge would be the concatenation of the labels from v1 to v3.
Benefits
The compressed trie reduces the number of nodes and edges by eliminating
redundancy, which can save memory and make operations like search and insert
more efficient.
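The compression step can be sketched as follows. The code assumes a nested-dictionary trie in which every string ends with a hypothetical terminator "$", so each string ends at a leaf; build_trie and compress are illustrative names.

```python
def build_trie(words, terminator="$"):
    """Standard trie as nested dicts; the terminator is assumed unused in words."""
    root = {}
    for word in words:
        node = root
        for ch in word + terminator:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Return a copy of the trie in which every redundant chain is one edge."""
    compressed = {}
    for label, child in node.items():
        # A non-root node with exactly one child is redundant: absorb it
        # into the edge label and keep following the chain.
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        compressed[label] = compress(child)
    return compressed


print(compress(build_trie(["car", "cat", "cab"])))
# {'ca': {'r$': {}, 't$': {}, 'b$': {}}}
```

The root is never merged because the loop only absorbs nodes that sit below an edge, which matches the rule that the root is not redundant.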
A suffix trie for a string X is a trie constructed for all suffixes of X. Each suffix is
a substring that starts at some position i and extends to the end of the string. For
example, if X="minimize", its suffixes include "minimize", "inimize", "nimize",
"imize", "mize", "ize", "ze", and "e".
To build a suffix trie for X, you would include all suffixes of X as strings in the
trie. Specifically, for a string X of length n, you build the trie for the set of strings
X[i..n−1] for i=0,1,…,n−1. This means you add each suffix starting from every
possible position in X.
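A short sketch of this construction follows. It also shows why the structure helps with pattern matching: a pattern occurs in X exactly when it is a prefix of some suffix of X. The names suffixes, build_suffix_trie, and is_substring are hypothetical, and "$" is an assumed terminator that does not occur in X.

```python
def suffixes(text):
    """All suffixes text[i..n-1] for i = 0, 1, ..., n-1."""
    return [text[i:] for i in range(len(text))]


def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text (plus terminator) into a nested-dict trie."""
    root = {}
    for suffix in suffixes(text + terminator):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
    return root


def is_substring(root, pattern):
    """True if pattern can be traced from the root of the suffix trie."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


print(suffixes("minimize"))
trie = build_suffix_trie("minimize")
print(is_substring(trie, "nimi"))  # True
print(is_substring(trie, "zen"))   # False
```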
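Next, we compress the suffix trie and label each node with a pair of indices
(j, k) instead of an explicit string. The pair (j, k) stands for the substring
X[j..k], so every label takes constant space.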
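The index-pair labels can be illustrated with two small helpers. This sketch assumes 0-based, inclusive indices, so (j, k) denotes X[j..k]; to_pair and from_pair are hypothetical names, and to_pair simply picks the first occurrence of the label.

```python
X = "minimize"


def to_pair(text, label):
    """Return (j, k) with text[j..k] == label, using the first occurrence."""
    j = text.find(label)
    if j < 0:
        raise ValueError("label is not a substring of text")
    return (j, j + len(label) - 1)


def from_pair(text, pair):
    """Recover the substring that the pair (j, k) stands for."""
    j, k = pair
    return text[j:k + 1]


print(to_pair(X, "nimize"))   # (2, 7)
print(to_pair(X, "ze"))       # (6, 7)
print(from_pair(X, (0, 1)))   # mi
```

Whatever the length of the label, the node stores only two integers.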
We can construct the suffix trie for a string of length n with an incremental
algorithm like the one given in Section 12.5.1. This construction takes O(dn²) time
because the total length of the suffixes is quadratic in n. However, the (compact)
suffix trie for a string of length n can be constructed in O(n) time with a
specialized algorithm, different from the one for general tries.
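The quadratic total length is easy to confirm: the suffixes have lengths n, n−1, ..., 1, which sum to n(n+1)/2. The helper name total_suffix_length below is hypothetical.

```python
def total_suffix_length(text):
    """Sum of the lengths of all suffixes of text."""
    return sum(len(text[i:]) for i in range(len(text)))


n = len("minimize")
print(total_suffix_length("minimize"))  # 36
print(n * (n + 1) // 2)                 # 36
```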
In a search engine's inverted index, a dictionary that maps each word to the pages
containing it, the keys (words) are called index terms and should be a set of
vocabulary entries and proper nouns as large as possible. The elements in this
dictionary are called occurrence lists and should cover as many Web pages as
possible.
We can efficiently implement an inverted index with a data structure consisting of:
1. An array storing the occurrence lists of the terms (in no particular order)
2. A compressed trie for the set of index terms, where each external node stores the
index of the occurrence list of the associated term
The reason for storing the occurrence lists outside the trie is to keep the size of the
trie data structure sufficiently small to fit in internal memory. Instead, because of
their large total size, the occurrence lists have to be stored on disk.
When multiple keywords are given and the desired output is the set of pages containing
all the given keywords, we retrieve the occurrence list of each keyword using the
trie and return their intersection.
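The two-part layout and the multi-keyword query can be sketched as below. Several simplifications are assumptions of this sketch, not part of the design described above: a plain (uncompressed) nested-dictionary trie stands in for the compressed trie, the occurrence lists are in-memory sets rather than disk-resident lists, and terms are split on whitespace. The names INDEX, build_index, lookup, and search_all are hypothetical.

```python
INDEX = "$index"  # hypothetical key holding the occurrence-list index at a node


def build_index(pages):
    """pages maps a page id to its text; returns (trie, occurrence_lists)."""
    occurrence_lists = []   # array of occurrence lists, in no particular order
    trie = {}               # uncompressed trie over the index terms
    for page_id, text in pages.items():
        for term in set(text.lower().split()):
            node = trie
            for ch in term:
                node = node.setdefault(ch, {})
            if INDEX not in node:
                node[INDEX] = len(occurrence_lists)
                occurrence_lists.append(set())
            occurrence_lists[node[INDEX]].add(page_id)
    return trie, occurrence_lists


def lookup(trie, occurrence_lists, term):
    """Follow term in the trie and fetch its occurrence list from the array."""
    node = trie
    for ch in term.lower():
        node = node.get(ch)
        if node is None:
            return set()
    position = node.get(INDEX)
    return set() if position is None else occurrence_lists[position]


def search_all(trie, occurrence_lists, keywords):
    """Pages containing every keyword: intersect the occurrence lists."""
    lists = [lookup(trie, occurrence_lists, word) for word in keywords]
    return set.intersection(*lists) if lists else set()


pages = {1: "trie data structure", 2: "suffix trie", 3: "data compression"}
trie, lists = build_index(pages)
print(search_all(trie, lists, ["trie", "data"]))  # {1}
```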