LIPIcs MFCS 2019 71
Philip Bille
Technical University of Denmark, DTU Compute, Denmark
phbi@dtu.dk
Inge Li Gørtz
Technical University of Denmark, DTU Compute, Denmark
inge@dtu.dk
Abstract
Given a regular expression R and a string Q, the regular expression parsing problem is to determine
if Q matches R and if so, determine how it matches, e.g., by a mapping of the characters of Q to the
characters in R. Regular expression parsing makes finding matches of a regular expression even more
useful by allowing us to directly extract subpatterns of the match, e.g., for extracting IP-addresses
from internet traffic analysis or extracting subparts of genomes from genetic data bases. We present
a new general technique for efficiently converting a large class of algorithms that determine if a
string Q matches regular expression R into algorithms that can construct a corresponding mapping.
As a consequence, we obtain the first efficient linear space solutions for regular expression parsing.
2012 ACM Subject Classification Theory of computation → Design and analysis of algorithms
Keywords and phrases regular expressions, finite automata, regular expression parsing, algorithms
Funding Supported by the Danish Research Council (DFF – 4005-00267, DFF – 1323-00178)
1 Introduction
A regular expression specifies a set of strings formed by characters combined with concatenation,
union (|), and Kleene star (*) operators. For instance, (a|(ba))* describes the
set of strings of as and bs, where every b is followed by an a. Regular expressions are a
fundamental concept in formal language theory and a basic tool in computer science for
specifying search patterns. Regular expression search appears in diverse areas such as internet
traffic analysis [14, 27, 17], data mining [11], data bases [19, 21], computational biology [23],
and human computer interaction [16].
Given a regular expression R and a string Q, the regular expression parsing problem [15,
8, 10, 24, 25, 18] is to determine if Q matches a string in the set of strings described by R
and if so, determine how it matches by computing the corresponding sequence of positions of
characters in R, i.e., the mapping of each character in Q to a character in R corresponding to
the match. For instance, if R = (a|(ba))* and Q = aaba, then Q matches R and 1, 1, 2, 3
is a corresponding parse specifying that Q[1] and Q[2] match the first a in R, Q[3] matches the
b in R, and Q[4] matches the last a in R (see footnote 1). Regular expression parsing makes finding matches
of a regular expression even more useful by allowing us to directly extract subpatterns of the
match, e.g., for extracting IP-addresses from internet traffic analysis or extracting subparts
of genomes from genetic data bases.
Footnote 1. Another typical definition of parsing is to compute a parse tree (or a variant thereof) of the
derivation of Q on R. Our definition simplifies our presentation and it is straightforward to derive a
parse tree from our parses.
© Philip Bille and Inge Li Gørtz;
licensed under Creative Commons License CC-BY
44th International Symposium on Mathematical Foundations of Computer Science (MFCS 2019).
Editors: Peter Rossmanith, Pinar Heggernes, and Joost-Pieter Katoen; Article No. 71; pp. 71:1–71:14
Leibniz International Proceedings in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany
71:2 From Regular Expression Matching to Parsing
To state the existing bounds, let n and m be the length of the string and the regular
expression, respectively. As a starting point consider the simpler regular expression matching
problem, that is, to determine if Q matches a string in the set of strings described by R
(without necessarily providing a mapping from characters in Q to characters in R). A
classic textbook algorithm to matching, due to Thompson [26], constructs and simulates a
non-deterministic finite automaton (NFA) in O(nm) time and O(m) space. An immediate
approach to solve the parsing problem is to combine Thompson’s algorithm with backtracking.
To do so, we store all state-sets produced during the NFA simulation and then process these
in reverse order to recover an accepting path in the NFA matching Q. From the path we
then immediately obtain the corresponding parse of Q since each transition labeled by a
character uniquely corresponds to a character in R. This algorithm uses O(nm) time and
space. Hence, we achieve the same time bound as matching but increase the space by an
Ω(n) factor. We can improve the time by polylogarithmic factors using faster algorithms
for matching [22, 3, 4, 6, 7], but by a recent conditional lower bound [2] we cannot hope to
achieve O((nm)^(1−ε)) time, for any constant ε > 0, assuming the strong exponential time hypothesis. Other direct
approaches to regular expression parsing [15, 8, 10, 24, 25, 18] similarly achieve Θ(nm) time
and space (ignoring polylogarithmic factors), leaving a substantial gap between linear space
for matching and Θ(nm) space for parsing. The goal of this paper is to address this gap.
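The backtracking approach described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the automaton is a small hand-built ε-NFA for R = (a|(ba))* (a simplified stand-in for the NFA produced by Thompson's construction), and the names `EPS`, `CHAR`, `closure`, and `parse_by_backtracking` are hypothetical. The forward pass keeps every state-set, which is exactly where the O(nm) space goes.

```python
from collections import defaultdict

# Hand-built epsilon-NFA for R = (a|(ba))* (an assumption for illustration).
# Each character transition is (character, position in R, target state).
START, ACCEPT = 0, 5
EPS = {0: [1, 5], 4: [1, 5]}
CHAR = {1: [("a", 1, 4), ("b", 2, 3)], 3: [("a", 3, 4)]}

def closure(states, eps):
    """All states reachable from `states` using only epsilon-transitions."""
    seen, stack = set(states), list(states)
    while stack:
        u = stack.pop()
        for v in eps.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def parse_by_backtracking(q):
    # Forward pass: store all n + 1 state-sets (the O(nm) space cost).
    sets = [closure({START}, EPS)]
    for c in q:
        moved = {v for u in sets[-1] for ch, _, v in CHAR.get(u, []) if ch == c}
        sets.append(closure(moved, EPS))
    if ACCEPT not in sets[-1]:
        return None  # Q does not match R
    # Reversed epsilon-transitions, used to walk the accepting path backwards.
    rev = defaultdict(list)
    for u, targets in EPS.items():
        for v in targets:
            rev[v].append(u)
    parse, target = [], ACCEPT
    for i in range(len(q), 0, -1):
        can_reach = closure({target}, rev)  # states with an epsilon-path to target
        # Pick a character transition on q[i-1] leaving a state of the previous set.
        pos, target = next(
            (pos, u)
            for u in sets[i - 1]
            for ch, pos, v in CHAR.get(u, [])
            if ch == q[i - 1] and v in can_reach
        )
        parse.append(pos)
    return parse[::-1]
```

For the running example, `parse_by_backtracking("aaba")` yields the parse 1, 1, 2, 3 from the introduction, and a non-matching string such as `"ab"` yields `None`.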
1.1 Results
We present a new technique to efficiently extend the classic state-set transition algorithms
for matching to also solve parsing in the same time complexity while only using linear space.
Specifically, we obtain the following main result based on Thompson’s algorithm:
Theorem 1. Given a regular expression of length m and a string of length n, we can solve
the regular expression parsing problem in O(nm) time and O(n + m) space.
This is the first bound to significantly improve upon the combination of Θ(nm) time and
space. The result holds on a comparison-based, pointer machine model of computation. Our
techniques are sufficiently general to also handle the more recent faster state-set transition
algorithms [22, 4, 3] and we also obtain a similar space improvement for these.
1.2 Techniques
Our overall approach is similar in spirit to the classic divide and conquer algorithm by
Hirschberg [13] for computing a longest common subsequence of two strings in linear space.
Let A be the Thompson NFA (TNFA) for R built according to Thompson’s rules [26] (see
also Figure 1) with m states, and let Q be the string of length n.
We first decompose A using standard techniques into a pair of nested subTNFAs called
the inner subTNFA and the outer subTNFA. Each has roughly at most 2/3 of the states of
A, and the two overlap in at most 2 boundary states. We then show how to carefully simulate A to
decompose Q into substrings corresponding to subparts of an accepting path in each of the
subTNFAs. The key challenge here is to efficiently handle cyclic dependencies between the
subTNFAs. From this we construct a sequence of subproblems for each of the substrings
corresponding to the inner subTNFAs and a single subproblem for the outer subTNFA. We
recursively solve these to construct a complete accepting path in A. This strategy leads
to an O(nm) time and O(n log m + m) space solution. We show how to tune and organize
the recursion to avoid storing intermediate substrings leading to the linear space solution
in Theorem 1. Finally, we show how to extend our solution to obtain linear space parsing
solutions for other state-set transition algorithms.
P. Bille and I. Li Gørtz 71:3
2 Preliminaries
Strings. A string Q of length n = |Q| is a sequence Q[1] . . . Q[n] of n characters drawn
from an alphabet Σ. The string Q[i] . . . Q[j] denoted Q[i, j] is called a substring of Q. The
substrings Q[1, i] and Q[j, n] are the ith prefix and the jth suffix of Q, respectively. The
string ε is the unique empty string of length zero.
Regular Expressions. First we briefly review the classical concepts used in the paper. For
more details see, e.g., Aho et al. [1]. We consider the set of non-empty regular expressions over
an alphabet Σ, defined recursively as follows. If α ∈ Σ ∪ {ε} then α is a regular expression,
and if S and T are regular expressions then so is the concatenation, (S) · (T), the union,
(S)|(T), and the star, (S)*. The language L(R) generated by R is defined as follows. If
α ∈ Σ ∪ {ε}, then L(α) is the set containing the single string α. If S and T are regular
expressions, then L(S · T) = L(S) · L(T), that is, any string formed by the concatenation of
a string in L(S) with a string in L(T), L(S|T) = L(S) ∪ L(T), and $L(S^*) = \bigcup_{i \ge 0} L(S)^i$,
where $L(S)^0 = \{\varepsilon\}$ and $L(S)^i = L(S)^{i-1} \cdot L(S)$, for i > 0. The parse tree $T^P(R)$ of R (not
to be confused with the parse of Q with respect to R) is the rooted, binary tree representing the
hierarchical structure of R. The leaves of $T^P(R)$ are labeled by a character from Σ or ε, and
internal nodes are labeled by either ·, |, or *.
Footnote 2. Sometimes NFAs are allowed a set of accepting states, but this is not necessary for our purposes.
[Figure 1 (graphic not recoverable): four panels (a), (b), (c), and (d) showing the automata for α, N(S) followed by N(T), the union of N(S) and N(T), and the star of N(S), connected by ε-transitions.]
Figure 1 Thompson’s recursive NFA construction. The regular expression α ∈ Σ ∪ {ε} corresponds
to NFA (a). If S and T are regular expressions then N(ST), N(S|T), and N(S*) correspond to
NFAs (b), (c), and (d), respectively. In each of these figures, the leftmost state θ and rightmost state
φ are the start and the accept nodes, respectively. For the top recursive calls, these are the start
and accept states of the overall automaton. In the recursions indicated, e.g., for N(ST) in (b), we
take the start state of the subautomaton N(S) and identify it with the state immediately to the left of
N(S) in (b). Similarly, the accept state of N(S) is identified with the state immediately to the right
of N(S) in (b).
Thompson NFA. Given a regular expression R, we can construct an NFA accepting precisely
the strings in L(R) by several classic methods [20, 12, 26]. In particular, Thompson [26]
gave the simple well-known construction shown in Figure 1. We will call an NFA constructed
with these rules a Thompson NFA (TNFA). A TNFA N(R) for R has at most 2m states,
at most 4m transitions, and can be computed in O(m) time. Note that each character in
R corresponds to a unique character transition in N(R) and hence a parse of a string Q
for N(R) directly corresponds to a parse of Q for R. The parse tree of a TNFA N(R) is
the parse tree of R. With a breadth-first search of a TNFA A we can compute a state-set transition
for a single character in O(m) time. By our above discussion, it follows that we can solve
regular expression matching in O(nm) time and O(m) space, and regular expression parsing
in O(nm) time and O(nm) space.
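To make the construction and the state-set simulation concrete, here is a minimal sketch. It assumes a regular expression is given as a nested tuple (`("char", c)`, `("cat", S, T)`, `("alt", S, T)`, `("star", S)`); this encoding, the class `TNFA`, and the functions `build`, `closure`, and `matches` are hypothetical names chosen for illustration. The construction follows the general shape of Thompson's rules, but the exact way states are identified may differ from Figure 1. Only the current state-set is kept during matching, which is why matching needs O(m) space.

```python
class TNFA:
    """States are integers; each state has epsilon-edges and at most one character edge."""
    def __init__(self):
        self.eps = []    # eps[u]  = list of epsilon-successors of u
        self.char = []   # char[u] = (character, successor) or None

    def new_state(self):
        self.eps.append([])
        self.char.append(None)
        return len(self.eps) - 1

def build(nfa, node, start):
    """Add the automaton for `node` beginning at `start`; return its accept state."""
    kind = node[0]
    if kind == "char":
        accept = nfa.new_state()
        nfa.char[start] = (node[1], accept)
        return accept
    if kind == "cat":                       # accept of S is the start of T
        middle = build(nfa, node[1], start)
        return build(nfa, node[2], middle)
    if kind == "alt":
        accept = nfa.new_state()
        for sub in node[1:]:
            sub_start = nfa.new_state()
            nfa.eps[start].append(sub_start)
            nfa.eps[build(nfa, sub, sub_start)].append(accept)
        return accept
    if kind == "star":
        sub_start, accept = nfa.new_state(), nfa.new_state()
        sub_accept = build(nfa, node[1], sub_start)
        nfa.eps[start] += [sub_start, accept]        # enter the loop or skip it
        nfa.eps[sub_accept] += [sub_start, accept]   # repeat the loop or leave it
        return accept
    raise ValueError(f"unknown node kind: {kind}")

def closure(nfa, states):
    """Epsilon-closure by graph search: O(m) time per call."""
    seen, stack = set(states), list(states)
    while stack:
        u = stack.pop()
        for v in nfa.eps[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def matches(regex, q):
    nfa = TNFA()
    start = nfa.new_state()
    accept = build(nfa, regex, start)
    current = closure(nfa, {start})
    for c in q:                             # one state-set transition per character
        moved = {nfa.char[u][1] for u in current
                 if nfa.char[u] is not None and nfa.char[u][0] == c}
        current = closure(nfa, moved)
    return accept in current

# R = (a|(ba))* from the introduction.
R = ("star", ("alt", ("char", "a"), ("cat", ("char", "b"), ("char", "a"))))
```

With this encoding, `matches(R, "aaba")` is true while `matches(R, "ab")` is false, since the final b is not followed by an a.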
TNFA Decomposition. We need the following decomposition result for TNFAs (see Figure 2).
Similar decompositions are used in [22, 3]. Given a TNFA A with m > 2 states, we
decompose A into an inner subTNFA AI and an outer subTNFA AO. The inner subTNFA
consists of a pair of boundary states θAI and φAI and all states and transitions that are
reachable from θAI without going through φAI. Furthermore, if there is a path of ε-transitions
from φAI to θAI in AO, we add an ε-transition from φAI to θAI in AI (following the rules
from Thompson’s construction). The outer subTNFA is obtained by removing all states
and transitions of AI except θAI and φAI. Between θAI and φAI we add a special transition
labeled βAI ∉ Σ and if AI accepts the empty string we also add an ε-transition (corresponding
to the regular expression (βAI | ε)). The decomposition has the following properties. Similar
results are proved in [22, 3] (see also the full version [5] for the proof).
Lemma 2. Let A be any TNFA with m > 2 states. In O(m) time we can decompose A
into inner and outer subTNFAs AI and AO such that
(i) AO and AI have at most (2/3)m + 8 states each, and
(ii) any path from AO to AI crosses θAI and any path from AI to AO crosses φAI.
3 String Decompositions
Let A be a TNFA decomposed into subTNFAs AO and AI and Q be a string accepted by A.
We show how to efficiently decompose Q into substrings corresponding to subpaths matched
in each subTNFA. The algorithm will be a key component in our recursive algorithm in the
next section.
[Figure 2 (graphic not recoverable): a TNFA A and its decomposition into the outer subTNFA AO and the inner subTNFA AI, with the boundary states θAI and φAI marked.]
An immediate idea would be to process Q from left to right using state-set transitions
and “collapse” the state set to a boundary state b of AI whenever the state set contains b and
there is a path from b to φA matching the rest of Q. Since AO and AI only interact at the
boundary states, this effectively corresponds to alternating the simulation of A between AO
and AI. However, because of potential cyclic dependencies from paths of ε-transitions from
φAI to θAI in AO and from θAI to φAI in AI, we cannot immediately determine which subTNFA
we should proceed in and hence we cannot correctly compute the string decomposition. For
instance, consider the string Q = aaacdaabaacdacdaabab from Figure 3. After processing
the first two characters (aa) both θAI and φAI are in the state set, and there is a path from
both these states to φA matching the rest of Q. The same is true after processing the first
six characters (aaacda). In the first case the substring consisting of the next three characters
(acd) only matches a path in AI , whereas in the second case the substring consisting of
the next two characters (ab) only matches a path in AO . A technical contribution in our
algorithm in the next section is to efficiently overcome these issues by a two-step approach
that first decomposes the string into substrings and labels the substrings greedily to find a
correct string decomposition.
[Figure 3 (graphic not recoverable): the string Q = aaacdaabaacdacdaabab, with rows marking the positions in Prefix(θAI)/Prefix(φAI), Suffix(θAI)/Suffix(φAI), and Match(θAI)/Match(φAI), followed by the partition of Q and its labeling into inner (I) and outer (O) substrings.]
Thus, Match(θAI ) and Match(φAI ) are the positions in Q that correspond to a valid pair
for the boundary states θAI and φAI , respectively. To compute these, we first compute the
prefix match sets, Prefix(s), and suffix match sets, Suffix(s), for s ∈ {θAI , φAI }. A position i
is in Prefix(s) if there is a path from θA to s accepting the prefix Q[1, i], and in Suffix(s) if
there is a path from s to φA accepting the suffix Q[i + 1, n]. To compute the prefix match
sets we perform state-set transitions on Q and A and whenever the current state-set contains
either θAI or φAI we add the corresponding position to Prefix(s). We compute the suffix
match sets in the same way, but now we perform the state-set transitions on Q from right to
left and A with the direction of all transitions reversed. Each step of the state-set transition
takes O(m) time and hence we use O(nm) time in total.
Finally, we compute the match sets Match(s), for s ∈ {θAI , φAI }, by taking the intersec-
tion of Prefix(s) and Suffix(s). In total we use O(mn) time and O(n + m) space to compute
and store the match sets.
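The computation of the match sets can be sketched as below. This is an illustration under stated assumptions rather than the paper's code: the automaton is given as a list of edges `(u, label, v)` with `label=None` for ε-transitions, the example automaton is a hand-built ε-NFA for (a|(ba))*, and the choice of states 1 and 4 as boundary states, as well as the names `simulate` and `match_sets`, are hypothetical. The suffix sets come from running the same simulation on the reversed automaton and the reversed string.

```python
from collections import defaultdict

def simulate(edges, start, q):
    """Yield the state-set reached after reading q[:i], for i = 0, ..., len(q)."""
    eps, char = defaultdict(list), defaultdict(list)
    for u, label, v in edges:
        (eps[u] if label is None else char[(u, label)]).append(v)

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            u = stack.pop()
            for v in eps[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    current = closure({start})
    yield current
    for c in q:
        current = closure({v for u in current for v in char[(u, c)]})
        yield current

def match_sets(edges, start, accept, q, boundary):
    """Return {s: Match(s)} for every boundary state s."""
    n = len(q)
    prefix = {s: set() for s in boundary}
    suffix = {s: set() for s in boundary}
    # Left-to-right pass: i is in Prefix(s) if s is reached after reading Q[1, i].
    for i, states in enumerate(simulate(edges, start, q)):
        for s in boundary:
            if s in states:
                prefix[s].add(i)
    # Right-to-left pass on the reversed automaton: after j characters of the
    # reversed string we have consumed the suffix Q[n - j + 1, n].
    reversed_edges = [(v, label, u) for u, label, v in edges]
    for j, states in enumerate(simulate(reversed_edges, accept, q[::-1])):
        for s in boundary:
            if s in states:
                suffix[s].add(n - j)
    return {s: prefix[s] & suffix[s] for s in boundary}

# Hand-built epsilon-NFA for (a|(ba))*: start 0, accept 5 (illustrative only).
EDGES = [(0, None, 1), (0, None, 5), (1, "a", 4), (1, "b", 3),
         (3, "a", 4), (4, None, 1), (4, None, 5)]
MATCH = match_sets(EDGES, 0, 5, "aaba", boundary=(1, 4))
```

Because `simulate` is a generator, only one state-set is alive at a time, so the sketch stores O(m) for the automaton plus O(n) for the position sets.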
such that 0 ≤ i1 < · · · < ik ≤ n and Xj ⊆ {θAI , φAI }, with the property that the states
of all pairs intersect a single accepting path P in A, and every place where P passes through
either θAI or φAI corresponds to a valid pair in V.
To compute the sequence we run a slightly modified state-set transition algorithm: For
i = 0, 1, . . . , n we set Si = δA (Si−1 , Q[i]) (for i = 0 set S0 := δA (θA , ε)) and compute the set
X := {s ∈ Si ∩ {θAI , φAI } | i ∈ Match(s)}.
Thus X is the set of boundary states in Si that correspond to a valid pair computed in
Step 1. If X ≠ ∅ we add (i, X) to the sequence V and set Si := X.
We argue that this produces a sequence V of valid pairs with the required properties.
First note that by definition of X we inductively produce state-sets S0 , . . . , Sn such that Si
contains the set of states reachable from θA that match Q[1, i] and the paths used to reach
Si intersect the states of the valid pairs produced up to this point. Furthermore, we include
all positions in V where Si contains θAI or φAI . It follows that V satisfies the properties.
Each modified state-set transition uses O(m) time and hence we use O(nm) time in
total. The sequence V uses O(n) space. In addition to this we store the match sets and a
constant number of active state-sets using O(n + m) space.
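The modified state-set transition can be sketched as follows. The sketch assumes that a pair (i, s) is valid exactly when i ∈ Match(s), and that the match sets have already been computed; here they are hard-coded for a hand-built ε-NFA for (a|(ba))* with hypothetical boundary states 1 and 4. The function name `valid_pair_sequence` and the edge-list encoding are assumptions made for illustration.

```python
from collections import defaultdict

def valid_pair_sequence(edges, start, q, boundary, match):
    """Return the sequence V = [(i, X), ...] of positions with their boundary states."""
    eps, char = defaultdict(list), defaultdict(list)
    for u, label, v in edges:
        (eps[u] if label is None else char[(u, label)]).append(v)

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            u = stack.pop()
            for v in eps[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    sequence = []
    current = closure({start})                       # S_0
    for i in range(len(q) + 1):
        if i > 0:                                    # S_i = delta(S_{i-1}, Q[i])
            moved = {v for u in closure(current) for v in char[(u, q[i - 1])]}
            current = closure(moved)
        # Boundary states in S_i that form a valid pair with position i.
        x = {s for s in boundary if s in current and i in match[s]}
        if x:
            sequence.append((i, x))
            current = x                              # collapse S_i to X
    return sequence

# Illustrative input: epsilon-NFA for (a|(ba))* (start 0, accept 5) and
# precomputed match sets for the assumed boundary states 1 and 4 on Q = aaba.
EDGES = [(0, None, 1), (0, None, 5), (1, "a", 4), (1, "b", 3),
         (3, "a", 4), (4, None, 1), (4, None, 5)]
MATCH = {1: {0, 1, 2}, 4: {1, 2, 4}}
V = valid_pair_sequence(EDGES, 0, "aaba", (1, 4), MATCH)
```

Note that the closure is recomputed before each character step, since collapsing the state-set to X discards the states reachable from X by ε-transitions.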
Labeling. We label the substrings as follows. First label q0 and qk+1 with outer. For the
rest of the substrings, if Xi = {θAI } and Xi+1 = {φAI } then label qi with inner, and if
Xi = {φAI } and Xi+1 = {θAI } then label qi with outer. If either Xi or Xi+1 contains more
than one boundary state then we use state-set transitions in AI and AO to determine if AI
accepts qi or if there is a path in AO from φAI to θAI that matches qi . If a substring is
accepted by AI then it can be an inner substring and if there is a path in AO from φAI to
θAI that matches qi then it can be an outer substring. If a substring can only be either an
inner or an outer substring then it is labeled with inner or outer, respectively. Let qi be a
substring that can be both an inner and an outer substring. We divide this into two cases. If
there is an ε-path from φAI to θAI then label all such qi with inner. Otherwise label all such
qi with outer. See also Algorithm 1.
Algorithm 1 Labeling.
Input: A sequence V of valid pairs (i1 , X1 ), . . . , (ik , Xk ) and the corresponding
partition q0 , . . . , qk+1 of Q.
Output: A labeling of the partition.
The (possibly empty) substrings q0 and qk+1 are labeled outer.
for i = 1 to k do
    if Xi = {θAI } and Xi+1 = {φAI } then /* Case 1 */
        label qi inner.
    else if Xi = {φAI } and Xi+1 = {θAI } then /* Case 2 */
        label qi outer.
    else if Xi or Xi+1 contains more than one boundary node then /* Case 3 */
        Use standard state-set transitions in AI and AO to determine if AI accepts qi
        or if there is a path in AO from φAI to θAI that matches qi .
        if qi is only accepted by AI then /* Subcase 3a */
            label qi inner.
        else if qi is only accepted by AO then /* Subcase 3b */
            label qi outer.
        else /* qi is accepted by both AI and AO */
            There are two cases.
            if there is an ε-path from φAI to θAI then /* Subcase 3c */
                label qi inner.
            else label qi outer. /* Subcase 3d */
        end
    end
end
For correctness first note that q0 and qk+1 are always (possibly empty) outer substrings.
The cases where |Xi | = |Xi+1 | = 1 (cases 1 and 2) are correct by the correctness of the
sequence of valid pairs V . Due to cyclic dependencies we may have that Xi and Xi+1 contain
more than one boundary state. This can happen if there is an ε-path from θAI to φAI and/or
there is an ε-path from φAI to θAI . If a substring is accepted by only one of AI (case 3a)
or AO (case 3b) then it follows from the correctness of V that the labeling is correct. It
remains to argue that the labeling in the case where qi is accepted by both AI and AO is
correct. To see why the labeling in this case is consistent with a string decomposition of the
accepting path consider case 3c. Here, it is safe to label qi with inner, since if we are in φAI
after having read qi−1 we can just follow the ε-path from φAI to θAI and then start reading
qi from here. The argument for case 3d is similar.
Except for the state-set transitions in case 3 all cases take constant time. The total time
of all the state-set transitions is O(nm). The sequence V , the partition, and the labeling
together use O(n) space.
String decomposition. Now every substring has a label that is either inner or outer. We
then merge adjacent substrings that have the same label. This produces an alternating
sequence of inner and outer substrings, which is the final string decomposition. Such an
alternating subsequence must always exist since each pair in V intersects an accepting path.
In summary, we have the following result.
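The merging step is simple enough to sketch directly. The function name `merge_labeled` and the example partition are hypothetical; the labels are assumed to have been produced by the labeling procedure.

```python
from itertools import groupby

def merge_labeled(substrings, labels):
    """Merge adjacent substrings that carry the same label.

    Returns a list of (label, merged_substring) pairs in which consecutive
    labels always differ, i.e. an alternating inner/outer sequence.
    """
    merged = []
    pairs = zip(labels, substrings)
    for label, group in groupby(pairs, key=lambda pair: pair[0]):
        merged.append((label, "".join(s for _, s in group)))
    return merged

# Hypothetical labeled partition of Q = aaba.
PARTS = ["", "a", "a", "ba", ""]
LABELS = ["outer", "inner", "inner", "inner", "outer"]
DECOMPOSITION = merge_labeled(PARTS, LABELS)
```

Here the three adjacent inner substrings collapse into a single inner substring `"aaba"` flanked by two empty outer substrings. The merge takes time linear in the total length of the substrings.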
Lemma 3. Given a string Q of length n and a TNFA A with m states decomposed into AO
and AI , we can compute a string decomposition with respect to AI in O(nm) time and O(n + m) space.
Step 2: Recurse. We build a single substring corresponding to all the subpaths in AO and
ℓ substrings for AI (one for each subpath in AI ) and recursively compute the compressed
paths. To do so, construct $\bar{q} = \bar{q}_1 \cdot \beta_{A_I} \cdot \bar{q}_2 \cdot \beta_{A_I} \cdots \beta_{A_I} \cdot \bar{q}_{\ell+1}$. Recall that βAI is the label
of the special transition we added between θAI and φAI in AO . Then, recursively compute
the compressed paths
$\bar{p} = \mathrm{Path}(A_O, \bar{q})$
$p_i = \mathrm{Path}(A_I, q_i)$, $1 \le i \le \ell$
$P = \bar{p}_1 \cdot p_1 \cdot \bar{p}_2 \cdot p_2 \cdots p_\ell \cdot \bar{p}_{\ell+1}$
Inductively, it directly follows that the returned compressed path is a compressed accepting
path for Q in A.
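Constructing the two kinds of subproblems from a string decomposition can be sketched as follows. The sketch assumes the decomposition is an alternating list of strings that starts and ends with an outer substring; the function name `build_subproblems` and the sentinel `BETA` (standing in for the special label βAI, which must not be a character of Σ) are hypothetical.

```python
BETA = object()  # sentinel for the special label; distinct from every character

def build_subproblems(decomposition):
    """Split an alternating [outer, inner, outer, ..., outer] decomposition.

    Returns (outer_string, inner_strings): the outer parts joined by BETA,
    as a list of symbols, and the list of inner strings, one per subpath
    in the inner subTNFA.
    """
    outer_parts = decomposition[0::2]
    inner_strings = decomposition[1::2]
    outer_string = []
    for k, part in enumerate(outer_parts):
        if k > 0:
            outer_string.append(BETA)   # one separator per inner substring
        outer_string.extend(part)
    return outer_string, inner_strings

# Hypothetical decomposition with two inner substrings.
OUTER, INNER = build_subproblems(["a", "cd", "b", "cd", ""])
```

The outer string is kept as a list of symbols so that the sentinel cannot collide with an input character. Its length is the total length of the outer parts plus the number of inner substrings.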
4.1 Analysis
We now show that the total time T(n, m) of the algorithm is O(nm). If n < γn or m < γm ,
we run the backtracking algorithm using O(nm) = O(n + m) time and space. If n ≥ γn
and m ≥ γm , we implement the recursive step of the algorithm using O(nm) time. Let
$n_i$ be the length of the inner string $q_i$ in the string decomposition and let $n_0 = \sum_{i=1}^{\ell+1} |\bar{q}_i|$.
Thus, $n = \sum_{i=0}^{\ell} n_i$ and $|\bar{q}| = n_0 + \ell$. In step 2, the recursive call to compute $\bar{p}$ takes
$T(n_0 + \ell, \frac{2}{3}m + 8)$ time and the recursive calls to compute $p_1, \ldots, p_\ell$ take $\sum_{i=1}^{\ell} T(n_i, \frac{2}{3}m + 8)$
time. The remaining steps of the algorithm all take O(nm) time. Hence, we have the following
recurrence for T(n, m).
$$T(n, m) = \begin{cases} \sum_{i=1}^{\ell} T(n_i, \frac{2}{3}m + 8) + T(n_0 + \ell, \frac{2}{3}m + 8) + O(mn) & m \ge \gamma_m \text{ and } n \ge \gamma_n \\ O(m + n) & m < \gamma_m \text{ or } n < \gamma_n \end{cases}$$
It follows that T(n, m) = O(nm) for γn = 2 and γm = 25 (see full version [5] for a detailed
proof).
Next, we consider the space complexity. First, note that the total space for storing R and
Q is O(n + m). To analyse the space during the recursive calls of the algorithm, consider
the recursion tree $T_{rec}$ for Path(A, Q). For a node v in $T_{rec}$, we define $Q_v$ of length $n_v$ to be
the string and $A_v$ with $m_v$ states to be the TNFA at v. Consider a path $v_1, \ldots, v_j$ of nodes
in $T_{rec}$ from the root to a leaf $v_j$ corresponding to a sequence of nested recursive calls of the
algorithm. If we naively store the subTNFAs, the string decompositions, and the compressed
paths, we use $O(n_{v_i} + m_{v_i})$ space at each node $v_i$, $1 \le i \le j$. By Lemma 2(i) the
sizes of the subTNFAs on a path form a geometrically decreasing sequence and hence the
total space for the subTNFAs is $\sum_{i=1}^{j} m_{v_i} = O(m)$. However, since each string (and hence
compressed path) at each node $v_i$, $1 \le i \le j$, may have length Ω(n) we may need Ω(n log m)
space in total for these. We show how to improve this to O(n + m) space in the next section.
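The O(m) bound on the subTNFA sizes along a root-to-leaf path can be checked with a short calculation. The following is a sketch under the assumptions that the root has m ≥ γm = 25 states, that the bound of Lemma 2(i) applies at every recursive step, and that every node other than the leaf has at least γm states (otherwise the recursion would have stopped there). The key observation is that 24 is the fixed point of the map x ↦ (2/3)x + 8.

```latex
\begin{align*}
m_{v_{i+1}} \le \tfrac{2}{3}\, m_{v_i} + 8
  &\;\Longrightarrow\; m_{v_{i+1}} - 24 \le \tfrac{2}{3}\,(m_{v_i} - 24) \\
  &\;\Longrightarrow\; m_{v_i} \le 24 + \left(\tfrac{2}{3}\right)^{i-1} (m - 24), \\[4pt]
\sum_{i=1}^{j} m_{v_i}
  &\le 24\, j + (m - 24) \sum_{i \ge 0} \left(\tfrac{2}{3}\right)^{i}
   = 24\, j + 3\,(m - 24).
\end{align*}
```

Since every non-leaf node satisfies 25 ≤ m_{v_i} ≤ 24 + (2/3)^{i−1}(m − 24), the depth is bounded by j ≤ log_{3/2}(m − 24) + 2 = O(log m), so the whole sum is O(m).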
To analyse the space of the modified algorithm, consider a path $v_1, \ldots, v_j$ of nodes
in $T_{rec}$ from the root to a leaf $v_j$. We now have that only nodes $v_i$, $1 \le i < j$, will explicitly
store a string if $v_{i+1}$ is a light child of $v_i$. By Lemma 4 the sum of these lengths forms a
geometrically decreasing sequence and hence the total space is now O(n). In summary, we
have shown the following result.
Theorem 5. Given a TNFA A with m states and a string Q of length n, we can compute a
compressed accepting path for Q in A in O(nm) time and O(n + m) space.
With the universal table, we process each micro TNFA in constant time, leading to
an algorithm using O(|M S|/t + n + m) = O(nm/t + n + m) time and O(2^t + m) space.
Setting t = ε log n produces the stated result. Note that each state-set uses O(⌈m/t⌉) space.
To handle general alphabets, we store dictionaries for each micro TNFA with bit masks
corresponding to characters appearing in the TNFA and combine these with an additional
masking step in state-set transition. This leads to a general solution with the same time and
space bounds as above (see footnote 3).
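The effect of the choice t = ε log n on the bounds above can be spelled out in one line. This is a sketch assuming a constant 0 < ε < 1 and logarithms to base 2; the result it refers to is stated in a part of the paper not reproduced here.

```latex
\begin{align*}
t = \varepsilon \log n
  \;\Longrightarrow\;
  2^{t} = 2^{\varepsilon \log n} = n^{\varepsilon},
  \qquad
  \frac{nm}{t} = \frac{nm}{\varepsilon \log n} = O\!\left(\frac{nm}{\log n}\right).
\end{align*}
```

Under these assumptions the time becomes O(nm/log n + n + m) and the space becomes O(n^ε + m).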
Similar to Section 4.1 it follows that T (n, m) = O(nm/t + n + m), for 25 ≤ t < w. The space
is linear as before. Plugging in t = log n and including the preprocessing time and space for
the universal tables we obtain the following logarithmic improvement of Theorem 1.
Theorem 6. Given a regular expression of length m and a string of length n, we can solve the
regular expression parsing problem in O(nm/ log n + n + m) time and O(n + m) space.
Footnote 3. Note that the time bound in the original paper has an additional m log m term [4]. Using atomic
heaps [9] to represent dictionaries for micro TNFAs this term is straightforward to improve to O(m).
See also Bille and Thorup [6, Appendix A].
References
1 Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: principles, techniques, and
tools. Addison-Wesley Longman Publishing Co., Inc., 1986.
2 Arturs Backurs and Piotr Indyk. Which Regular Expression Patterns are Hard to Match? In
Proc. 57th FOCS, pages 457–466, 2016.
3 Philip Bille. New Algorithms for Regular Expression Matching. In Proc. of the 33rd ICALP,
pages 643–654, 2006.
4 Philip Bille and Martin Farach-Colton. Fast and compact regular expression matching. Theor.
Comput. Sci., 409(3):486–496, 2008.
5 Philip Bille and Inge Li Gørtz. From Regular Expression Matching to Parsing. Arxiv preprint
arXiv:1804.02906, 2019.
6 Philip Bille and Mikkel Thorup. Faster Regular Expression Matching. In Proc. 36th ICALP,
pages 171–182, 2009. Full version with appendix available at
http://www2.compute.dtu.dk/~phbi/files/publications/2009fremC.pdf.
7 Philip Bille and Mikkel Thorup. Regular Expression Matching with Multi-Strings and Intervals.
In Proc. 21st SODA, pages 1297–1308, 2010.
8 Danny Dubé and Marc Feeley. Efficiently building a parse tree from a regular expression. Acta
Informatica, 37(2):121–144, 2000.
9 Michael L. Fredman and Dan E. Willard. Trans-dichotomous algorithms for minimum spanning
trees and shortest paths. J. Comput. System Sci., 48(3):533–551, 1994.
10 Alain Frisch and Luca Cardelli. Greedy regular expression matching. In Proc. 31st ICALP,
volume 3142, pages 618–629, 2004.
11 Minos N Garofalakis, Rajeev Rastogi, and Kyuseok Shim. SPIRIT: Sequential pattern mining
with regular expression constraints. In Proc. 25th VLDB, pages 223–234, 1999.
12 Victor M. Glushkov. The Abstract Theory of Automata. Russian Math. Surveys, 16(5):1–53,
1961.
13 D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences.
Commun. ACM, 18(6):341–343, 1975.
14 Theodore Johnson, S. Muthukrishnan, and Irina Rozenbaum. Monitoring Regular Expressions
on Out-of-Order Streams. In Proc. 23rd ICDE, pages 1315–1319, 2007.
15 Steven M Kearns. Extending regular expressions with context operators and parse extraction.
Software: Practice and Experience, 21(8):787–804, 1991.
16 Kenrick Kin, Björn Hartmann, Tony DeRose, and Maneesh Agrawala. Proton: multitouch
gestures as regular expressions. In Proc. SIGCHI, pages 2885–2894, 2012.
17 Sailesh Kumar, Sarang Dharmapurikar, Fang Yu, Patrick Crowley, and Jonathan Turner.
Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In
Proc. SIGCOMM, pages 339–350, 2006.
18 Ville Laurikari. NFAs with tagged transitions, their conversion to deterministic automata and
application to regular expressions. In Proc. 7th SPIRE, pages 181–187, 2000.
19 Quanzhong Li and Bongki Moon. Indexing and Querying XML Data for Regular Path
Expressions. In Proc. 27th VLDB, pages 361–370, 2001.
20 R. McNaughton and H. Yamada. Regular Expressions and State Graphs for Automata. IRE
Trans. on Electronic Computers, 9(1):39–47, 1960.
21 Makoto Murata. Extended path expressions of XML. In Proc. 20th PODS, pages 126–137,
2001.
22 E. W. Myers. A Four-Russian Algorithm for Regular Expression Pattern Matching. J. ACM,
39(2):430–448, 1992.
23 Gonzalo Navarro and Mathieu Raffinot. Fast and Simple Character Classes and Bounded Gaps
Pattern Matching, with Applications to Protein Searching. J. Comp. Biology, 10(6):903–923,
2003.
24 Lasse Nielsen and Fritz Henglein. Bit-coded Regular Expression Parsing. In Proc. 5th LATA,
pages 402–413, 2011.
25 Martin Sulzmann and Kenny Zhuo Ming Lu. Regular expression sub-matching using partial
derivatives. In Proc. 14th PPDP, pages 79–90, 2012.
26 K. Thompson. Regular Expression Search Algorithm. Commun. ACM, 11:419–422, 1968.
27 Fang Yu, Zhifeng Chen, Yanlei Diao, T. V. Lakshman, and Randy H. Katz. Fast and memory-
efficient regular expression matching for deep packet inspection. In Proc. ANCS, pages 93–102,
2006.