Mannila 1997
Heikki Mannila*
1 Introduction
Knowledge discovery in databases (KDD), often called data mining, aims at the
discovery of useful information from large collections of data. The discovered
knowledge can be rules describing properties of the data, frequently occurring
patterns, clusterings of the objects in the database, etc. Current technology makes
it fairly easy to collect data, but data analysis tends to be slow and expensive.
There is a suspicion that there might be nuggets of useful information hiding in
the masses of unanalyzed or underanalyzed data, and therefore semiautomatic
methods for locating interesting information from data would be useful. Data
mining has in the 1990's emerged as a visible research and development area; see
[11] for a recent overview of the area.
This tutorial describes some methods of data mining and also lists a variety
of open problems, both in the theory of data mining and in the systems side of it.
We start in Section 2 by briefly discussing the KDD process, basic data mining
techniques, and listing some prominent applications.
Section 3 moves to a generic data mining algorithm, and discusses some of
the architectural issues in data mining systems. Section 4 considers the specific
problems of finding association rules, episodes, or keys from large data sets.
In Section 5 we consider the possibilities of specifying KDD tasks in a high-level
level language and compiling these specifications to efficient discovery algorithms.
* Part of this work was done while the author was visiting the Max-Planck-Institut für
Informatik in Saarbrücken, Germany. Work supported by the Academy of Finland
and by the Alexander von Humboldt Stiftung.
Section 6 studies the use of sampling in data mining, and Section 7 considers the
possibilities of representing large data sets by smaller condensed representations.
Section 8 gives a list of open problems in data mining.
Before starting on the KDD process, I digress briefly to some other topics not
treated in this paper. An important issue in data mining is its relationship with
machine learning and statistics. I refer to [11, 31] for some discussions on this.
Visualization of data is an important technique for obtaining useful information
from large masses of data. The area is large; see [25] for an overview. Visualization
can also be useful for making the discovered patterns easier to understand.
Clustering is obviously a central technique in analyzing large data collections.
The literature on the area is huge, and too wide to even scratch here.
Similarity searches are often needed in data mining applications: how does
one find objects that are roughly similar to a given query point? Again, the literature
is vast, and we provide only two recent pointers: [4, 49].
The goal of knowledge discovery is to obtain useful knowledge from large col-
lections of data. Such a task is inherently interactive and iterative: one cannot
expect to obtain useful knowledge simply by pushing a lot of data to a black
box. The user of a KDD system has to have a solid understanding of the domain
in order to select the right subsets of data, suitable classes of patterns, and good
criteria for interestingness of the patterns. Thus KDD systems should be seen as
interactive tools, not as automatic analysis systems.
Discovering knowledge from data should therefore be seen as a process
consisting of several steps.
In industry, the success of KDD is partly related to the rise of the concepts
of data warehousing and on-line analytical processing (OLAP). These strategies
for the storage and processing of the accumulated data in an organization have
become popular in recent years. KDD and data mining can be viewed as ways
of realizing some of the goals of data warehousing and OLAP.
A fairly large class of data mining tasks can be described as the search for
interesting and frequently occurring patterns in the data. That is, we are given
a class P of patterns or sentences that describe properties of the data, and we
can specify whether a pattern p ∈ P occurs frequently enough and is otherwise
interesting. That is, the generic data mining task is to find the set

    PI(d, P) = { p ∈ P | p occurs sufficiently often in d and p is interesting },

where d denotes the data set.
4 Examples
In this section we discuss three data mining problems where instantiations of the
above algorithm can be used.
4. while C ≠ ∅ do
5.    F_i := the sets X ∈ C that are frequent;
6.    add F_i to F;
7.    C := sets Y of size i + 1 such that
8.         each subset W of Y of size i is frequent;
9.    i := i + 1;
10. od;
The algorithm has to read the database at most K + 1 times, where K is the
size of the largest frequent set. In the applications, K is small, typically at most
10, so the number of passes through the data is reasonable.
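As an illustration, the levelwise search above can be sketched in Python. This is a sketch under the 0/1-row model of the text, with rows represented as sets of attributes; the function names and data representation are mine, not the paper's.

```python
from itertools import combinations

def frequent_sets(rows, min_support):
    """Levelwise search for frequent attribute sets. `rows` is a list of
    sets of attributes; a set X is frequent if X is contained in at least
    `min_support` rows."""
    def is_frequent(x):
        return sum(1 for t in rows if x <= t) >= min_support

    result = []
    # Level 1: candidate sets of size 1.
    attributes = set().union(*rows) if rows else set()
    candidates = [frozenset([a]) for a in attributes]
    i = 1
    while candidates:
        level = [x for x in candidates if is_frequent(x)]
        result.extend(level)
        # Candidates of size i + 1: every i-subset must be frequent.
        freq = set(level)
        candidates = []
        for x, y in combinations(level, 2):
            z = x | y
            if len(z) == i + 1 and all(frozenset(w) in freq
                                       for w in combinations(z, i)):
                if z not in candidates:
                    candidates.append(z)
        i += 1
    return result
```

Note that the candidate generation step is exactly the pruning observation of the text: a set of size i + 1 is considered only if all its i-subsets were found frequent on the previous level.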
A modification of the above method is obtained by computing for each frequent
set X the subrelation r_X ⊆ r consisting of those rows t ∈ r such that
t[A] = 1 for all A ∈ X. Then it is easy to see that, for example, r_{A,B,C} =
r_{A,B} ∩ r_{B,C}. Thus the relation r_X for a set X of size k can be obtained
from the relations r_{X'} and r_{X''}, where X' = X \ {A} and X'' = X \ {B} for
some A, B ∈ X with A ≠ B. This method has the advantage that rows that do
not contribute to any frequent set will not be inspected more than once. For
comparisons of the two approaches, see [2, 18, 43].
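The subrelation idea can be illustrated with sets of row identifiers. This is a minimal sketch; the `tidlists` helper and the row encoding are my assumptions, not the paper's.

```python
def tidlists(rows):
    """Row-identifier set for each single attribute:
    tid[A] = { i : A in rows[i] }."""
    tid = {}
    for i, t in enumerate(rows):
        for a in t:
            tid.setdefault(a, set()).add(i)
    return tid

# The identity r_{A,B,C} = r_{A,B} ∩ r_{B,C} from the text:
rows = [{'A', 'B', 'C'}, {'A', 'B'}, {'B', 'C'}, {'A', 'C'}]
tid = tidlists(rows)
r_ab = tid['A'] & tid['B']   # rows supporting {A, B}
r_bc = tid['B'] & tid['C']   # rows supporting {B, C}
r_abc = r_ab & r_bc          # equals the rows supporting {A, B, C}
```

The intersection of two (k-1)-subrelations never touches rows outside them, which is the advantage claimed in the text.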
The algorithms described above work quite nicely on large input relations.
Their running time is approximately O(NF), where N = np is the size of the
input and F is the sum of the sizes of the sets in the candidate collection C during
the operation of the algorithm [37]. This is nearly linear, and the algorithms seem
to scale nicely to tens of millions of examples. Typically the only case when they
fail is when the output is too large, i.e., there are too many frequent sets.
The methods for finding frequent sets are simple: they are based on one
nice but simple observation (subsets of frequent sets must be frequent), and use
straightforward implementation techniques.
A naive implementation of the algorithms on top of a relational database
system would be easy: we need to pose to the database management system
queries of the form "What is s({A_1, ..., A_k}, r)?", or in SQL

    select count(*) from r where A_1 = 1 and ... and A_k = 1;
sense that there are no frequent sets that contain more than about 15 attributes.
Namely, the framework of finding all association rules typically generates at least
as many rules as there are frequent sets, and if there is a frequent set of size K,
there will be at least 2^K frequent sets.
The information about the frequent sets can actually be used to approximate
fairly accurately the confidences and supports of a far wider set of rules, including
negation and disjunction [36].
The approach of computing PI(d, P) presented in the previous section provides
one possible framework. A data mining task could be given by specifying the class
P and the interestingness predicate. That is, the user of a KDD system makes a
pattern query. Mimicking the behavior of an ordinary DBMS, the KDD system
compiles and optimizes the pattern query, and executes it by searching for the
patterns that satisfy the user's specifications and have sufficient frequency in the
data.
As an example, consider the simple case of mining for association rules in a
course enrollment database. The user might say that he/she is interested only
in rules that have the "Data Management" course on the left-hand side. This re-
striction can be utilized in the algorithm for finding frequent sets: only candidate
sets that contain "Data Management" need to be considered.
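The restriction described above can be sketched as a simple filter on the candidate collection. This is an illustration of the pruning rule stated in the text; the function name and representation are mine.

```python
def constrained_candidates(candidates, required):
    """Prune candidate sets for a pattern query that fixes attributes on
    the left-hand side: keep only candidates containing all of them."""
    return [x for x in candidates if required <= x]
```

In the levelwise algorithm this filter would be applied to the candidate collection C at every level; since both the left-hand side and the whole rule must contain the fixed attribute, the restricted rule class is still found exactly.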
Developing methods for such pattern queries and their optimization is cur-
rently one of the most interesting research topics in data mining. So far, even the
simple techniques such as the above have not been sufficiently studied. It is not
clear how far one can get by using such methods, but the possible payoff is large.
Such queries have some similarities with the strategies adopted in OLAP. The
difference is mainly that OLAP is verification-driven, in the sense that the ques-
tions are fairly well fixed in advance. Data mining, on the other hand, is discovery-
driven: one does not want to specify in advance what exactly is searched for.
In addition to developing query processing strategies for data mining applica-
tions, changes in the underlying storage model can also have a strong effect on the
performance. A very interesting experiment in this direction is the work on the
Monet database server developed at CWI in the Netherlands by Martin Kersten
and others [5, 19]. The Monet system is based on the vertical partitioning of the
relations: a relation with k attributes is decomposed into k relations, each with
two attributes: the OID and one of the original attributes. The system is built on
the extensive use of main memory, has an extensible set of basic operations, and
supports shared-memory parallelism. Experiments with Monet on data mining
applications have produced quite good results [18, 19].
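The vertical partitioning scheme can be sketched as follows. This is an illustration of the storage idea described for Monet, not Monet's actual interface.

```python
def vertical_partition(relation, attributes):
    """Decompose a relation (here, a list of dicts) into one binary table
    per attribute, each holding (OID, value) pairs."""
    return {a: [(oid, row[a]) for oid, row in enumerate(relation)]
            for a in attributes}
```

A query touching few attributes then reads only the corresponding binary tables, which is why the decomposition can help scan-heavy data mining workloads.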
6 Sampling
Data mining is often difficult for at least two reasons: first, there are lots of data,
and second, the data is multidimensional. The hypothesis or pattern space is in
most cases exponential in the number of attributes, so the multidimensionality
can actually be the harder problem.
A simple way of alleviating the problems caused by the volume of data (i.e.,
the number of rows) is to use sampling. Even small samples can give quite good
approximations of the association rules [2, 48] or functional dependencies [28]
that hold in a relation. See [27] for a general analysis on the relationship between
the logical form of the discovered knowledge and the sample sizes needed for
discovering it.
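The basic idea can be sketched as follows, assuming uniform sampling without replacement; the function and its parameters are illustrative, not from the text.

```python
import random

def estimated_support(rows, x, sample_size, seed=0):
    """Estimate the fraction of rows containing attribute set x from a
    uniform random sample of the rows."""
    rnd = random.Random(seed)
    sample = rnd.sample(rows, min(sample_size, len(rows)))
    return sum(1 for t in sample if x <= t) / len(sample)
```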
The problem with using sampling is that the results can be wrong, with a
small probability. A possibility is to first use a sample and then verify (and, if
necessary, correct) the results against the whole data set. For instances of this
scheme, see [28, 48]; also the generic algorithm can be modified to correspond to
this approach. We give the sample-and-correct algorithm for finding functional
dependencies.
Algorithm 3 Finding the keys of a relation by sampling and correcting.
Input. A relation r over schema R.
Output. The set of keys of r.
Method.
1. s := a sample of r;
2. K := keys(s);
3. while there is a set X ∈ K such that X is not a key of r do
4.    add some rows u, v ∈ r with u[X] = v[X] to s;
5.    K := keys(s);
6. od;
7. output K.
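A direct transcription of Algorithm 3 can be sketched in Python. The `keys` subroutine below is a naive exponential enumeration used only as a placeholder for a real key-discovery procedure; the names are mine.

```python
from itertools import combinations

def is_key(rel, x):
    """X is a key if no two rows of rel agree on all attributes in X."""
    seen = set()
    for row in rel:
        proj = tuple(row[a] for a in x)
        if proj in seen:
            return False
        seen.add(proj)
    return True

def keys(rel, schema):
    """All minimal keys, by naive subset enumeration (exponential)."""
    found = []
    for i in range(1, len(schema) + 1):
        for x in combinations(sorted(schema), i):
            if is_key(rel, x) and not any(set(y) <= set(x) for y in found):
                found.append(x)
    return found

def keys_by_sampling(rel, schema, sample):
    """Algorithm 3: compute keys of a sample, then correct against rel by
    adding a violating row pair to the sample and repeating."""
    s = list(sample)
    k = keys(s, schema)
    while True:
        bad = next((x for x in k if not is_key(rel, x)), None)
        if bad is None:
            return k
        # Add two rows u, v of rel with u[bad] = v[bad] to the sample.
        seen = {}
        for row in rel:
            proj = tuple(row[a] for a in bad)
            if proj in seen:
                for u in (seen[proj], row):
                    if u not in s:
                        s.append(u)
                break
            seen[proj] = row
        k = keys(s, schema)
```

Each correction step adds a pair witnessing that some candidate is not a key, so the loop terminates once the sample refutes all false keys.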
7 Condensed representations
8 Open problems
Data mining is an applied field, and research in the area seems to succeed best
when done in cooperation with the appliers. Hence I am a little hesitant to offer
a list of research questions or open problems; on the other hand, it seems to me
that such a list gives a reasonable sample of the research issues in the area.
Thus, this section contains a list of research topics in data mining that I
consider interesting or important. The problems vary widely, from architectural
issues to specific algorithmic questions. For brevity, the descriptions are
quite succinct, and I also provide only a couple of references.
Framework and general theory
1. Develop a general theory of data mining. Possible starting points are [6, 7,
23, 26, 30, 37]. (One might call this the theory of inductive databases.)
2. What is the relationship between the logical form of sentences to be dis-
covered and the computational complexity of the discovery task? (The issue
of logical form vs. sample size is considered in [27].)
3. Prove or disprove the CPM principle (Section 3).
4. What can be said about the performance of Algorithm 3 and its analogues
for other problems?
5. How useful is the concept of condensed representation? [36]
System and language issues
12. Design an algorithm for the key finding problem that works in polynomial
time with respect to the size of the output and the number of attributes, and
in subquadratic time in the number of rows. Solutions to the two following
problems would imply considerable progress for this problem.
13. Finding the keys of a relation contains as a subproblem the problem of finding
transversals of hypergraphs [33, 8, 36]. Given a hypergraph H, can the set
Tr(H) of its transversals be computed in time polynomial in |H| and |Tr(H)|?
14. When one reduces the problem of finding keys to transversals of hypergraphs,
one has to solve the following preliminary problem. Given a relation r over
schema R and two rows u, v ∈ r, denote ag(u, v) = { A ∈ R | u[A] = v[A] },
and let disag(u, v) = R \ ag(u, v). Denote ag(r) = { ag(u, v) | u, v ∈ r, u ≠ v }
and disag(r) = { disag(u, v) | u, v ∈ r, u ≠ v }. Further, let maxag(r) be the
collection of maximal elements of ag(r) and mindisag(r) the collection of
minimal elements of disag(r). Given a relation r, compute mindisag(r) in
time O(q(|mindisag(r)|) f(|r|)), where q is a polynomial and f = o(|r|^2).
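A naive quadratic-time baseline for these notions can be sketched as follows; the o(|r|^2) requirement of the problem is precisely what this sketch does not achieve.

```python
from itertools import combinations

def agree_sets(rel, schema):
    """ag(r): the attribute sets on which pairs of distinct rows agree."""
    return {frozenset(a for a in schema if u[a] == v[a])
            for u, v in combinations(rel, 2) if u != v}

def minimal_disagree_sets(rel, schema):
    """mindisag(r): minimal elements of { R \\ ag(u, v) }, computed by
    examining all row pairs."""
    disag = {frozenset(schema) - a for a in agree_sets(rel, schema)}
    return [d for d in disag if not any(e < d for e in disag)]
```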
15. In the analysis of event sequences, patterns are typically subsequences. Given
a sequence s = a_1 ... a_n, a pattern p = b_1 ... b_k, and a window width W,
decide in o(nk) time whether p occurs as a subsequence in s within a window
of width W, i.e., whether there are indices 1 ≤ i_1 < i_2 < ... < i_k ≤ n such
that i_k - i_1 < W and for all j = 1, ..., k we have a_{i_j} = b_j [38, 35]. (The
solution should work for very large alphabets.)
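For reference, the straightforward scan that the problem asks to beat can be sketched as follows (worst case roughly O(nW) here); this is an illustration, not a solution to the open problem.

```python
def occurs_in_window(s, p, w):
    """Check whether p occurs as a subsequence of s within a window of
    width w: indices i_1 < ... < i_k with i_k - i_1 < w and s[i_j] == p[j].
    Greedy scan from each possible starting position."""
    n, k = len(s), len(p)
    for start in range(n):
        if s[start] != p[0]:
            continue
        j, i = 1, start + 1
        while j < k and i < n and i - start < w:
            if s[i] == p[j]:
                j += 1
            i += 1
        if j == k:
            return True
    return False
```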
Acknowledgements
Comments from Dimitrios Gunopulos and Hannu Toivonen are gratefully ac-
knowledged.
References
1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of
items in large databases. In Proceedings of ACM SIGMOD Conference on Man-
agement of Data (SIGMOD'93), pages 207-216, May 1993.
2. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast dis-
covery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and
R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages
307-328. AAAI Press, Menlo Park, CA, 1996.
3. R. Agrawal and K. Shim. Developing tightly-coupled data mining applications on
a relational database system. In Proc. of the 2nd Int'l Conference on Knowledge
Discovery in Databases and Data Mining, pages 287-290, 1996.
4. S. Berchtold, D. A. Keim, and H. P. Kriegel. The X-tree: An index structure for
high-dimensional data. In Proceedings of the 22nd International Conference on
Very Large Data Bases (VLDB'96), pages 28-39, Mumbai, India, 1996. Morgan
Kaufmann.
5. P. A. Boncz, W. Quak, and M. L. Kersten. Monet and its geographical exten-
sions: a novel approach to high-performance GIS processing. In P. M. G. Apers,
M. Bouzeghoub, and G. Gardarin, editors, Advances in Database Technology -
EDBT'96, pages 147-166, 1996.
6. L. De Raedt and M. Bruynooghe. A theory of clausal discovery. In Proceedings of
the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI-
93), pages 1058-1063, Chambéry, France, 1993. Morgan Kaufmann.
23. M. Jaeger, H. Mannila, and E. Weydert. Data mining as selective theory extraction
in probabilistic logic. In R. Ng, editor, SIGMOD'96 Data Mining Workshop, The
University of British Columbia, Department of Computer Science, TR 96-08, pages
41-46, 1996.
24. M. Kantola, H. Mannila, K.-J. Räihä, and H. Siirtola. Discovering functional and
inclusion dependencies in relational databases. International Journal of Intelligent
Systems, 7(7):591-607, Sept. 1992.
25. D. Keim and H. Kriegel. Visualization techniques for mining large databases: A
comparison. IEEE Transactions on Knowledge and Data Engineering, 1996. To
appear.
26. J.-U. Kietz and S. Wrobel. Controlling the complexity of learning in logic through
syntactic and task-oriented models. In S. Muggleton, editor, Inductive Logic Pro-
gramming, pages 335-359. Academic Press, London, 1992.
27. J. Kivinen and H. Mannila. The power of sampling in knowledge discovery. In
Proceedings of the Thirteenth ACM SIGACT-SIGMOD-SIGART Symposium on
Principles of Database Systems (PODS'94), pages 77-85, Minneapolis, MN, May
1994.
28. J. Kivinen and H. Mannila. Approximate dependency inference from relations.
Theoretical Computer Science, 149(1):129-149, 1995.
29. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo.
Finding interesting rules from large sets of discovered association rules. In Proceed-
ings of the Third International Conference on Information and Knowledge Man-
agement (CIKM'94), pages 401-407, Gaithersburg, MD, Nov. 1994. ACM.
30. W. Kloesgen. Efficient discovery of interesting statements in databases. Journal
of Intelligent Information Systems, 4(1):53-69, 1995.
31. H. Mannila. Data mining: machine learning, statistics, and databases. In Pro-
ceedings of the 8th International Conference on Scientific and Statistical Database
Management, Stockholm, pages 1-6, 1996.
32. H. Mannila and K.-J. Räihä. Design by example: An application of Armstrong
relations. Journal of Computer and System Sciences, 33(2):126-141, 1986.
33. H. Mannila and K.-J. Räihä. Design of Relational Databases. Addison-Wesley
Publishing Company, Wokingham, UK, 1992.
34. H. Mannila and K.-J. Räihä. On the complexity of dependency inference. Discrete
Applied Mathematics, 40:237-243, 1992.
35. H. Mannila and H. Toivonen. Discovering generalized episodes using minimal oc-
currences. In Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining (KDD'96), pages 146-151, Portland, Oregon, Aug.
1996. AAAI Press.
36. H. Mannila and H. Toivonen. Multiple uses of frequent sets and condensed rep-
resentations. In Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining (KDD'96), pages 189-194, Portland, Oregon, Aug.
1996. AAAI Press.
37. H. Mannila and H. Toivonen. On an algorithm for finding all interesting sentences.
In Cybernetics and Systems, Volume II, The Thirteenth European Meeting on Cy-
bernetics and Systems Research, pages 973-978, Vienna, Austria, Apr. 1996.
38. H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in
sequences. In Proceedings of the First International Conference on Knowledge
Discovery and Data Mining (KDD'95), pages 210-215, Montreal, Canada, Aug.
1995.