ADMA2013 MaxSP Maximal Sequential Patterns
Maintenance
philippe.fournier-viger@umoncton.ca,
silvemoonfox@hotmail.com,tsengsm@mail.ncku.edu.tw
Abstract. Sequential pattern mining is an important data mining task with wide
applications. However, it may present too many sequential patterns to users,
which degrades the performance of the mining task in terms of execution time
and memory requirement, and makes it difficult for users to comprehend the results.
The problem becomes worse when dealing with dense or long sequences.
As a solution, several studies were performed on mining maximal sequential
patterns. However, previous algorithms are not memory efficient since they
need to maintain a large amount of intermediate candidates in main memory
during the mining process. To address these problems, we present MaxSP (Maximal
Sequential Pattern miner), an algorithm that is efficient in both time and memory
and that computes all maximal sequential patterns without storing intermediate
candidates in main memory. Experimental results on real datasets show that MaxSP serves as an
efficient solution for mining maximal sequential patterns.
1 Introduction
Mining useful patterns in sequential data is a challenging task in data mining. Many
studies have been proposed for mining interesting patterns in sequence databases [1,
2, 3]. Sequential pattern mining is probably the most popular research topic among
them. A sub-sequence is called a sequential pattern or frequent sequence if it frequently
appears in a sequence database, and its frequency is no less than a user-specified minimum
support threshold minsup. Sequential pattern mining plays an important role in
data mining and is essential to a wide range of applications such as the analysis of
web click-streams, program executions, medical data, biological data and e-learning
data [1, 2, 3]. Several algorithms have been proposed for sequential pattern mining
such as SPAM [4], PrefixSpan [5] and SPADE [6]. However, a drawback of these
algorithms is that they may present too many sequential patterns to users. A very large
number of sequential patterns makes it difficult for users to analyze results to gain
insightful knowledge. It may also cause the algorithms to become inefficient in terms
of time and memory because the more sequential patterns the algorithms produce, the
more resources they consume. The problem becomes worse when the database contains
long sequential patterns. For example, consider a sequence database containing a
sequential pattern having 20 distinct items. A sequential pattern mining algorithm will
present the sequential pattern as well as its 2^20 - 1 subsequences to users. This will most
likely make the algorithm fail to terminate in reasonable time and run out of storage
space. For example, the well-known sequential pattern mining algorithm PrefixSpan
would have to perform 2^20 database projection operations to produce the results.
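The count quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it assumes a pattern made of distinct single items, so that every non-empty selection of items (the pattern itself included) yields a different subsequence.

```python
from itertools import combinations


def count_nonempty_subsequences(pattern):
    """Closed-form count for a pattern whose items are all distinct."""
    n = len(pattern)
    # Each item is either kept or dropped; the empty selection is excluded.
    return 2 ** n - 1


def enumerate_subsequences(pattern):
    """Brute-force enumeration, only practical for short patterns."""
    return [tuple(c)
            for k in range(1, len(pattern) + 1)
            for c in combinations(pattern, k)]


small = ("a", "b", "c", "d")
# The enumeration and the closed form agree on a small pattern.
assert len(enumerate_subsequences(small)) == count_nonempty_subsequences(small)

print(count_nonempty_subsequences(range(20)))  # 1048575
```

Even for a single pattern of 20 items, an algorithm that reports every frequent subsequence has over a million patterns to output.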
To reduce the computational cost of the mining task and present fewer but representative
patterns to users, many studies focused on developing concise representations
of sequential patterns. One of the representations that has been proposed is
closed sequential patterns [7, 8, 9]. A closed sequential pattern is a sequential pattern
that is not strictly included in another pattern having the same frequency. Several
approaches have been proposed for mining closed sequential patterns in sequence
databases such as BIDE [7], CloSpan [8] and ClaSP [9]. Although these algorithms
mine a compact set of sequential patterns, the set of closed patterns is still too large
for dense databases or databases containing long sequences.
To address this problem, it was proposed to mine maximal sequential patterns [10,
11, 12, 13, 14]. A maximal sequential pattern is a closed pattern that is not strictly
included in another closed pattern. The set of maximal sequential patterns is thus
generally a very small subset of the set of (closed) sequential patterns. It is widely
recognized that mining maximal patterns can be faster than mining all (closed) pat-
terns. Besides, the set of maximal sequential patterns is representative because all
sequential patterns can be derived from it. Furthermore, the exact frequency of the
sequential patterns can be obtained with a single database pass. This method thus
provides an alternative solution to find all sequential patterns when the traditional
algorithms cannot successfully mine (closed) sequential patterns in the databases.
Maximal sequential pattern mining is important and has been adopted for many applications
such as discovering frequent longest common subsequences in a text, analysis
of DNA sequences, data compression and web log mining [10].
Although maximal sequential pattern mining is desirable and useful in many applications,
it is still a challenging data mining task that has not been deeply explored.
Only a few algorithms have been proposed for efficiently mining maximal sequential
patterns. MSPX [12] is an approximate algorithm: it provides an incomplete
set of maximal patterns to the user, and thus may omit important information.
The DIMASP algorithm [10] is designed for the special case where sequences are strings
(no more than one item can appear at the same time) and where no pair of contiguous
items appears more than once in each sequence. The AprioriAdjust algorithm [13] is an
apriori-like algorithm, which may suffer from the drawbacks of the candidate
generation-and-test paradigm. In other words, it may generate a large number of candidate
patterns that do not appear in the input database and may require scanning the original
database many times. The MSPX [12] and MFSPAN [14] algorithms need to maintain a large
amount of intermediate candidates in main memory during the mining process. Although
the above algorithms are pioneers for maximal sequential pattern mining, they
are not memory efficient since they need to maintain a large amount of intermediate
candidates in main memory during the mining process [10, 11, 12, 13, 14].
To address the above issues, we propose an algorithm that is efficient in both time and
memory, named MaxSP (Maximal Sequential Pattern miner), to mine maximal
sequential patterns in sequence databases. The proposed algorithm is developed
for the general case of a sequence database rather than strings and it can capture the
complete set of maximal sequential patterns with only two database scans. Moreover,
it discovers all maximal sequential patterns without producing redundant candidates
or storing intermediate candidates in main memory. Whenever a maximal pattern is
discovered by MaxSP, it can be output immediately. We performed an experimental
study with five real-life datasets to evaluate the performance of MaxSP. We
compared its performance with the BIDE algorithm [7], one of the current best algorithms
for mining closed sequential patterns without storing intermediate candidates
in memory. Results show that MaxSP outperforms BIDE in terms of execution time and
memory consumption and that the set of maximal patterns is much smaller than the
set of closed patterns.
The rest of the paper is organized as follows. Section 2 formally defines the prob-
lem of maximal sequential pattern mining and its relationship to sequential pattern
mining. Section 3 describes the MaxSP algorithm. Section 4 presents the experimental
study. Finally, Section 5 presents the conclusion and future work.
2 Problem Definition
The problem of sequential pattern mining was proposed by Agrawal and Srikant [2].
A sequence database SDB is a set of sequences S = {s1, s2, …, ss} together with a set of items I =
{i1, i2, …, im} occurring in these sequences. An item is a symbolic value. An itemset
is an unordered set of distinct items drawn from I. For example, the itemset {a, b,
c} represents the set of items a, b and c. A sequence is an ordered list of itemsets s =
〈I1, I2, …, In〉 such that Ik ⊆ I (1 ≤ k ≤ n). For example, consider the sequence database
depicted in Figure 1. It contains four sequences having respectively the sequence ids
(SIDs) 1, 2, 3 and 4. In this example, each single letter represents an item. Items between
curly brackets represent an itemset. For instance, the sequence 〈{a,
b},{c},{f},{g},{e}〉 indicates that items a and b occurred at the same time, and were
followed successively by c, f, g and lastly e. A sequence sa = 〈A1, A2, …, An〉 is contained
in another sequence sb = 〈B1, B2, …, Bm〉 iff there exist integers 1 ≤ i1 < i2 < … < in ≤
m such that A1 ⊆ Bi1, A2 ⊆ Bi2, …, An ⊆ Bin (denoted as sa ⊑ sb). The support of a
subsequence sa in a sequence database SDB is defined as the number of sequences s ∈
S such that sa ⊑ s and is denoted by sup(sa).
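The containment relation and the support measure defined above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the helper names (seq, is_contained, support) are hypothetical, sequences are modelled as lists of frozensets, and apart from the sequence 〈{a, b},{c},{f},{g},{e}〉 quoted in the text, the toy database is invented for the example (it is not the database of Figure 1).

```python
def seq(*itemsets):
    """Build a sequence (list of itemsets) from strings, e.g. seq("ab", "c")."""
    return [frozenset(x) for x in itemsets]


def is_contained(sa, sb):
    """Return True if sa is contained in sb.

    Greedy leftmost matching: each itemset of sa is matched to the earliest
    remaining itemset of sb that is a superset of it.
    """
    j = 0
    for a in sa:
        while j < len(sb) and not a <= sb[j]:
            j += 1
        if j == len(sb):
            return False
        j += 1  # the next itemset of sa must match strictly later in sb
    return True


def support(pattern, database):
    """Number of sequences of the database that contain the pattern."""
    return sum(1 for s in database if is_contained(pattern, s))


# First sequence taken from the text; the other two are hypothetical.
toy_db = [
    seq("ab", "c", "f", "g", "e"),
    seq("a", "c", "e"),
    seq("b", "f"),
]

print(support(seq("a", "e"), toy_db))  # 2
print(support(seq("ab"), toy_db))      # 1
```

Greedy leftmost matching is sufficient here because matching an itemset as early as possible never prevents the remaining itemsets from being matched later.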
and find representative patterns, it was proposed to mine closed and maximal sequen-
tial patterns.
Property 1. Maximal patterns are a subset of the set of
closed patterns, and closed patterns are a subset of the set of frequent sequential
patterns. Rationale. This follows directly from the above definitions. Example. Consider
the database of Figure 1 and minsup = 2. There are 29 sequential patterns
(shown in Figure 2), of which 15 are closed (identified by the letter ‘C’) and only 10
are maximal (identified by the letter ‘M’).
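Property 1 can be illustrated by brute force on a small example. The sketch below is an assumption-laden illustration: it handles only strings (one item per itemset), the function names are hypothetical, and the three-sequence database is invented for the example rather than taken from Figure 1. It enumerates every frequent pattern, then keeps as closed those with no strict supersequence of equal support, and as maximal those with no frequent strict supersequence at all.

```python
from itertools import combinations


def is_subseq(p, s):
    """True if p is a subsequence of s (single-item itemsets only)."""
    it = iter(s)
    return all(x in it for x in p)


def frequent_patterns(db, minsup):
    """Brute force: test every subsequence of every sequence in db."""
    candidates = set()
    for s in db:
        for k in range(1, len(s) + 1):
            for idx in combinations(range(len(s)), k):
                candidates.add(tuple(s[i] for i in idx))
    freq = {}
    for p in candidates:
        sup = sum(is_subseq(p, s) for s in db)
        if sup >= minsup:
            freq[p] = sup
    return freq


def closed_and_maximal(freq):
    closed, maximal = set(), set()
    for p, sup in freq.items():
        supers = [q for q in freq if q != p and is_subseq(p, q)]
        if all(freq[q] != sup for q in supers):
            closed.add(p)       # no strict supersequence with the same support
        if not supers:
            maximal.add(p)      # no frequent strict supersequence at all
    return closed, maximal


toy_db = ["abc", "abc", "ab"]   # hypothetical database
freq = frequent_patterns(toy_db, minsup=2)
closed, maximal = closed_and_maximal(freq)
print(len(freq), sorted(closed), sorted(maximal))
```

On this toy database there are 7 frequent patterns, 2 closed patterns (〈a, b〉 and 〈a, b, c〉) and a single maximal pattern (〈a, b, c〉), so the two inclusions stated by the property hold.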
Fig. 3. Algorithm to recover all frequent sequential patterns from maximal patterns
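The body of the figure is not reproduced here, so the following Python sketch shows only one plausible way to carry out the recovery described in the introduction: generate every non-empty subsequence of each maximal pattern, then compute exact supports in a single pass over the database. It is restricted to strings (one item per itemset), and the function names and the toy data are hypothetical.

```python
from itertools import combinations


def is_subseq(p, s):
    """True if p is a subsequence of s (single-item itemsets only)."""
    it = iter(s)
    return all(x in it for x in p)


def recover_frequent_patterns(maximal_patterns, db):
    """Derive all frequent patterns and their supports from the maximal ones."""
    # Step 1: every frequent pattern is a subsequence of some maximal pattern.
    candidates = set()
    for m in maximal_patterns:
        for k in range(1, len(m) + 1):
            for idx in combinations(range(len(m)), k):
                candidates.add(tuple(m[i] for i in idx))
    # Step 2: one pass over the database to obtain exact supports.
    supports = dict.fromkeys(candidates, 0)
    for s in db:
        for p in candidates:
            if is_subseq(p, s):
                supports[p] += 1
    return supports


toy_db = ["abc", "abc", "ab"]          # hypothetical database
maximal = [("a", "b", "c")]            # its only maximal pattern for minsup = 2
supports = recover_frequent_patterns(maximal, toy_db)
print(supports[("a", "b")], supports[("a", "c")])  # 3 2
```

No support threshold is needed in the second step: by anti-monotonicity, every subsequence of a frequent pattern is itself frequent.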
Note that performing a database projection does not require making a physical
copy of the database. For memory efficiency, a projected database is instead represented
by a set of pointers to the original database (this optimization is called
pseudo-projection) [5]. Also, note that the pseudo-code presented in Figure 4 is simplified.
The actual PrefixSpan algorithm needs to consider that an item can be appended to
the current prefix P by i-extension or s-extension when counting the support of single
items. An i-extension appends an item to the last itemset of prefix P. An s-extension
appends an item as a new itemset after the last itemset of prefix P. The
interested reader is referred to [5] for more details. The PrefixSpan algorithm is correct
and complete. It enumerates all frequent sequential patterns thanks to the anti-monotonicity
property, which states that the support of a proper supersequence X of a
sequential pattern S can only be lower than or equal to the support of S [2]. PrefixSpan is
said to discover sequential patterns without candidate generation because only frequent
items are concatenated with the current prefix P to generate patterns at each
recursive call of the algorithm.
PrefixSpan (a sequence database SDB, a threshold minsup, a prefix P initially set to 〈〉)
1. Scan SDB once to count the support of each item.
2. FOR each item i with a support ≥ minsup
3.     P’ := Concatenate(P, i).
4.     Output the pattern P’.
5.     SDBi := DatabaseProjection(SDB, i).
6.     PrefixSpan(SDBi, minsup, P’).
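The simplified procedure above can be sketched in Python as follows. This is an illustrative sketch rather than the actual PrefixSpan implementation: it handles only s-extensions on strings (one item per itemset), and the projected database is represented by (sequence index, start offset) pointers, which is one possible encoding of pseudo-projection. The function name and the toy database are hypothetical.

```python
def prefixspan(db, minsup):
    """Return a list of (pattern, support) pairs for all frequent patterns."""
    results = []

    def mine(prefix, projected):
        # Scan the projected database once to count the support of each item.
        counts = {}
        for sid, start in projected:
            for item in set(db[sid][start:]):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            if counts[item] < minsup:
                continue
            new_prefix = prefix + (item,)
            results.append((new_prefix, counts[item]))
            # Pseudo-projection: keep a pointer just after the first
            # occurrence of the item instead of copying the suffix.
            new_projected = []
            for sid, start in projected:
                s = db[sid]
                for pos in range(start, len(s)):
                    if s[pos] == item:
                        new_projected.append((sid, pos + 1))
                        break
            mine(new_prefix, new_projected)

    mine((), [(sid, 0) for sid in range(len(db))])
    return results


toy_db = ["abc", "abc", "ab"]   # hypothetical database
for pattern, sup in prefixspan(toy_db, minsup=2):
    print("".join(pattern), sup)
```

Only items that are frequent in the current projected database are appended to the prefix, which is why no candidate pattern absent from the database is ever generated.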
The main challenge to design a maximal sequential pattern mining algorithm based on
PrefixSpan is how to determine if a given frequent sequential pattern generated by the
PrefixSpan is maximal. A naïve approach would be to keep all frequent sequential
patterns found until now into memory. Then, every time that a new frequent sequen-
tial pattern would be found, the algorithm would compare the pattern with previously
found patterns to determine if (1) the new pattern is included in a previously found
pattern or (2) if some previously found pattern(s) are included in the new pattern. The
first case would indicate that the new pattern is not maximal. The second case would
indicate that some previously found pattern(s) are not maximal. This approach is used
for example in CloSpan [8] for closed sequential pattern mining. The drawback of this
approach is that it can consume a large amount of memory if the number of patterns is
large, and it becomes very time consuming if a very large number of patterns is
found, because a very large number of comparisons would have to be performed [7].
In this paper, we present a new checking mechanism that can determine if a pattern is
maximal without having to compare a new pattern with previously found patterns.
The mechanism is inspired by the mechanism used in the BIDE algorithm for checking
if a pattern is closed [7]. In this subsection, we first introduce important definitions
and then we present our solution. Note that in the following, we use sequences
where itemsets contain single items (strings) for the sake of simplicity. Nevertheless,
the definitions that we present can be easily extended to the general case of a sequence
of itemsets containing multiple items (our implementation handles the general
case of itemsets).
Definition 4. Let P be a prefix and S a sequence containing P. The first instance of
the prefix P in S is the subsequence of S starting from the first item in S until the end
of the first occurrence of P in S. For example, the first instance of 〈{a},{b}〉 in
〈{a},{a},{b},{e}〉 is 〈{a},{a},{b}〉.
Definition 5. Let P be a prefix and S a sequence containing P. The last instance of the
prefix P in S is the subsequence of S starting from the first item in S until the end of
the last occurrence of P in S. For example, the last instance of 〈{a},{b}〉 in
〈{a},{b},{b},{e}〉 is 〈{a},{b},{b}〉.
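The two definitions can be made concrete with a short Python sketch for the string case (one item per itemset). The function names are hypothetical. The last instance is read here as the portion of S from its first item up to the last occurrence of the final item of P, which is the reading consistent with the example given in Definition 5.

```python
def first_instance(prefix, s):
    """Portion of s from its start to the end of the leftmost match of prefix.

    Returns None if s does not contain prefix.
    """
    pos = 0
    for item in prefix:
        while pos < len(s) and s[pos] != item:
            pos += 1
        if pos == len(s):
            return None
        pos += 1
    return s[:pos]


def last_instance(prefix, s):
    """Portion of s from its start to the last occurrence of prefix's final item.

    Returns None if s does not contain prefix.
    """
    if first_instance(prefix, s) is None:
        return None
    if not prefix:
        return s[:0]
    # Since s contains prefix, the last occurrence of the final item lies at
    # or after the end of the first instance, so the prefix still matches.
    last = max(i for i, x in enumerate(s) if x == prefix[-1])
    return s[:last + 1]


print(first_instance("ab", "aabe"))  # aab
print(last_instance("ab", "abbe"))   # abb
```

The two printed values correspond to the examples of Definitions 4 and 5, written as strings.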
We now demonstrate that the maximal-forward-extension check and the maximal-backward-extension
check are sufficient for determining if a pattern is maximal.
3.3 Optimizations
We performed three optimizations to improve the performance of MaxSP. First, we
use pseudo-projections instead of projections to avoid the cost of making physical
database copies (as suggested in PrefixSpan [5]). A second optimization is to remove
infrequent items from the database immediately after the first database scan because
they will not appear in any maximal sequential pattern. A third optimization concerns
the process of searching for the maximal-backward-extensions of a prefix by scanning
maximum periods. During the scan, item supports are accumulated. The scan can be
stopped as soon as there is an item known to appear in minsup maximum periods,
because it means that the prefix is not maximal. For large databases containing many
sequences, we found that this optimization increases performance by about a factor of
two. Note that this optimization is similar to the ScanSkip optimization proposed
in the BIDE algorithm, which stops scanning sequences as soon as a pattern is determined
to be non-closed [7].
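The early termination used by the third optimization can be sketched as follows. This is a hedged illustration only: the function name is hypothetical, and the maximum periods are assumed to have been extracted already (one collection of items per sequence, for a given position of the prefix), since their construction is not shown in this section.

```python
def scan_with_early_stop(periods, minsup):
    """Accumulate item supports over periods, stopping at the first item
    that reaches minsup.

    Returns (found, scanned): whether such an item exists, and how many
    periods were examined before the scan ended.
    """
    counts = {}
    scanned = 0
    for period in periods:
        scanned += 1
        for item in set(period):          # count each item once per period
            counts[item] = counts.get(item, 0) + 1
            if counts[item] >= minsup:
                return True, scanned      # the prefix is not maximal
    return False, scanned


# Hypothetical periods, one per sequence.
print(scan_with_early_stop(["xy", "y", "z", "y"], minsup=2))  # (True, 2)
print(scan_with_early_stop(["x", "y", "z"], minsup=2))        # (False, 3)
```

In the first call the scan stops after two of the four periods, which is where the savings come from when the database contains many sequences.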
MaxSP (a sequence database SDB, a threshold minsup, a prefix P initially set to 〈〉)
1. largestSupport := 0.
2. Scan SDB once to count the support of each item.
3. FOR each item i with a support ≥ minsup
4. P’ := Concatenate(P, i).
5. SDBi := DatabaseProjection(SDB, i ).
6. IF the pattern P’ has no maximal-backward-extension in SDBi THEN
7. maximumSupport := MaxSP (SDBi, minsup, P’).
8. IF maximumSupport < minsup THEN OUTPUT the pattern P’.
9. IF support(P’) > largestSupport THEN largestSupport := support(P’)
10. RETURN largestSupport.
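The role of the value returned by the recursion can be illustrated with the Python sketch below. It is not the MaxSP procedure itself and should be read under explicit assumptions: it handles only strings (one item per itemset), it recomputes supports directly instead of using projected databases, it always recurses, and the maximal-backward-extension check based on maximum periods is replaced by a naive stand-in that tries to insert every item before the last position of the pattern. All function names and the toy database are hypothetical.

```python
def is_subseq(p, s):
    """True if p is a subsequence of s (single-item itemsets only)."""
    it = iter(s)
    return all(x in it for x in p)


def support(p, db):
    return sum(is_subseq(p, s) for s in db)


def has_backward_extension(pattern, db, items, minsup):
    """Naive stand-in: is some pattern obtained by inserting one item
    anywhere before the end of `pattern` still frequent?"""
    for pos in range(len(pattern)):
        for item in items:
            candidate = pattern[:pos] + (item,) + pattern[pos:]
            if support(candidate, db) >= minsup:
                return True
    return False


def maximal_patterns(db, minsup):
    items = sorted({x for s in db for x in s})
    found = []

    def grow(prefix):
        """Return the largest support among frequent one-item forward
        extensions of prefix (0 if there is none)."""
        largest = 0
        for item in items:
            p = prefix + (item,)
            sup = support(p, db)
            if sup < minsup:
                continue
            forward = grow(p)
            # No frequent forward extension and no backward extension:
            # the pattern is maximal and can be output immediately.
            if forward < minsup and not has_backward_extension(p, db, items, minsup):
                found.append(p)
            largest = max(largest, sup)
        return largest

    grow(())
    return found


toy_db = ["xab", "xab", "abc", "abc"]   # hypothetical database
print(maximal_patterns(toy_db, minsup=2))
```

On this toy database the sketch reports 〈a, b, c〉 and 〈x, a, b〉. Each pattern is emitted as soon as both checks succeed, without comparing it against previously found patterns.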
4 Experimental Evaluation
Table 1. Datasets’ Characteristics
[Figure: pattern count (maximal patterns vs. closed patterns) as a function of minsup; legible panel titles: Sign, Fifa, Leviathan.]
As can be seen from the results, the number of maximal patterns is always considerably
smaller than the number of closed patterns, and the gap increases quickly as
minsup decreases. For example, for the Sign dataset, only 25% of the closed sequential
patterns are maximal for minsup = 47. Another example is Snake, where only 28%
of the closed patterns are maximal for minsup = 136. This confirms that mining
maximal sequential patterns is more efficient in terms of storage space.
With respect to execution time, we can see that MaxSP is always faster than BIDE.
The difference is large for sparse datasets such as Sign and FIFA. For example, for
Sign, MaxSP was up to five times faster than BIDE. There are two reasons why
MaxSP is faster. The first reason is how the maximal-backward-extension checking is
performed. For each pattern found, MaxSP looks for items that could extend it with a
support no less than minsup, while BIDE looks for items with a support equal to the
support of the prefix. As soon as MaxSP or BIDE finds an item meeting its respective
condition, it stops searching for backward extensions (the third optimization in
MaxSP and the ScanSkip optimization in BIDE). Because the condition verified by
BIDE is more specific and thus harder to meet, BIDE needs to analyze more sequences
on average for backward extension checking, and this makes BIDE slower. The
second reason is that BIDE needs to perform more write operations to disk for storing
patterns because the set of closed sequential patterns is larger.
For the memory usage (cf. Figure 8), similar conclusions can be drawn. MaxSP
generally uses less memory than BIDE. This is due to the fact that fewer sequences
need to be scanned (as explained previously) and fewer patterns need to be created by
MaxSP. Note that we did not show the memory usage for FIFA and BMS due to
space limitations, but results are similar to those of Leviathan, Snake and Sign.
[Figure: runtime (s) of BIDE and MaxSP as a function of minsup; panels: BMS, Snake, Sign, Leviathan, FIFA.]
5 Conclusion
Maximal sequential pattern mining is an important data mining task that is essential
for a wide range of applications. The set of maximal sequential patterns is a compact
representation of sequential patterns. Several algorithms have been proposed for max-
imal sequential pattern mining. However, they are not memory efficient since they
may produce too many redundant candidates and need to maintain a large amount of
intermediate candidates in main memory during the mining process. To address these
problems, we proposed a memory efficient algorithm to mine maximal sequential
patterns named MaxSP (Maximal Sequential Pattern miner). It incorporates a novel
checking mechanism consisting of verifying maximal-backward-extensions and
maximal-forward-extensions, which allows discovering all maximal sequential patterns
without storing intermediate candidates in memory or producing redundant candidates.
An experimental study on five real datasets shows that MaxSP is more memory
efficient and up to five times faster than BIDE, a state-of-the-art algorithm for closed
sequential pattern mining, and that the number of maximal patterns is generally much
smaller than the number of closed sequential patterns. The source code of MaxSP and
BIDE can be downloaded from http://goo.gl/hDtdt as part of the SPMF open-source data
mining software.
For future work, we plan to develop new algorithms for mining concise representa-
tions of sequential patterns and sequential rules [15, 16].
References
1. Han, J. and Kamber, M.: Data Mining: Concepts and Techniques, 2nd ed., San Francisco,
Morgan Kaufmann (2006)
2. Agrawal, R. and Srikant, R.: Mining Sequential Patterns, Proc. Int. Conf. on Data Engi-
neering, pp. 3-14 (1995)
3. Mabroukeh, N. R. and Ezeife, C. I.: A taxonomy of sequential pattern mining algorithms,
ACM Computing Surveys, vol. 43, no. 1, pp. 1-41 (2010)
4. Ayres, J., Flannick, J., Gehrke, J. and Yiu, T.: Sequential PAttern mining using a bitmap
representation, Proc. KDD 2002, Edmonton, Alberta, pp. 429-435 (2002)
5. Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu, M.: Mining Se-
quential Patterns by Pattern-Growth: The PrefixSpan Approach, IEEE Trans. Knowledge
and Data Engineering, vol. 16, no. 10, pp. 1-17 (2001)
6. Zaki, M. J.: SPADE: An efficient algorithm for mining frequent sequences, Machine learn-
ing, vol. 42. no. 1-2, pp. 31-60 (2001)
7. Wang, J., Han, J., Li, C.: Frequent Closed Sequence Mining without Candidate Mainte-
nance, IEEE Trans. on Know. and Data Engineering, vol. 19, no. 8, pp.1042-1056 (2007)
8. Yan, X., Han, J. and Afshar, R.: CloSpan: Mining closed sequential patterns in large data-
sets, Proc. of the third SIAM International Conference on Data Mining, May 1-3, San
Francisco, California, ISBN 0-89871-545-8. (2003)
9. Gomariz, A., Campos, M., Marin, R., Goethals, B.: ClaSP: An Efficient Algorithm for
Mining Frequent Closed Sequences, Proc. PAKDD 2013, LNAI 7818, pp. 50-61 (2013)
10. García-Hernández, R. A., Martínez-Trinidad, J. F., Carrasco-Ochoa, J. A.: A new algo-
rithm for fast discovery of maximal sequential patterns in a document collection. Comp.
Linguistics and Intelligent Text Processing, Springer LNCS 3878, pp. 514-523 (2006)
11. Lin, N. P., Hao, W.-H., Chen, H.-J., Chueh, H.-E., Chang, C.-I.: Fast Mining Maximal Se-
quential Patterns. Proc. of the 7th International Conference on Simulation, Modeling and
Optimization, September 15-17, Beijing, China, pp.405-408 (2007)
12. Luo, C., Chung, S.: Efficient mining of maximal sequential patterns using multiple samples.
Proc. 5th SIAM int’l conf. on data mining, Newport Beach, California. (2005)
13. Lu, S., Li, C.: AprioriAdjust: An Efficient Algorithm for Discovering the Maximum Se-
quential Patterns, Proc. 2nd Int’l Workshop Knowl. Grid and Grid Intell. (2004)
14. Guan, E.-Z., Chang, X.-Y., Wang, Z., Zhou, C.-G.: Mining Maximal Sequential Patterns,
Proc of the second Int’l Conf. Neural Networks and Brain, pp.525-528 (2005)
15. Fournier-Viger, P., Nkambou, R., Tseng, V. S.: RuleGrowth: Mining Sequential Rules
Common to Several Sequences by Pattern-Growth. Proc. of the 26th Symposium on Ap-
plied Computing. Tainan, Taiwan, pp. 954-959, ACM Press (2011).
16. Fournier-Viger, P., Faghihi, U., Nkambou, R., Mephu Nguifo, E.: CMRules: Mining Se-
quential Rules Common to Several Sequences. Knowledge-based Systems, Elsevier, vol.
25, no. 1, pp. 63-76 (2012)