1 - Introduction To DM


1 - Introduction to Data Mining
Tran Thi Oanh
Outline
➢Why Data Mining?
➢What Is Data Mining?
➢A Multi-Dimensional View of Data Mining
➢What Kind of Data Can Be Mined?
➢What Kinds of Patterns Can Be Mined?
➢What Technologies Are Used?
➢What Kinds of Applications Are Targeted?
➢Major Issues in Data Mining
➢A Brief History of Data Mining and Data Mining Society
➢Summary
Why Data Mining?

➢The Explosive Growth of Data: from terabytes to petabytes


o Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
o Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
➢We are drowning in data, but starving for knowledge!
➢“Necessity is the mother of invention”: data mining is the automated analysis
of massive data sets
Evolution of Sciences
➢ Before 1600, empirical science
➢ 1600-1950s, theoretical science
o Each discipline has grown a theoretical component. Theoretical models often motivate experiments and
generalize our understanding.
➢ 1950s-1990s, computational science
o Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical,
theoretical, and computational ecology, or physics, or linguistics.)
o Computational Science traditionally meant simulation. It grew out of our inability to find closed-form
solutions for complex mathematical models.
➢ 1990-now, data science
o The flood of data from new scientific instruments and simulations
o The ability to economically store and manage petabytes of data online
o The Internet and computing Grid that makes all these archives universally accessible
o Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly
with data volumes. Data mining is a major new challenge!
➢ Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology
➢ 1960s:
o Data collection, database creation, IMS and network DBMS
➢ 1970s:
o Relational data model, relational DBMS implementation
➢ 1980s:
o RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
o Application-oriented DBMS (spatial, scientific, engineering, etc.)
➢ 1990s:
o Data mining, data warehousing, multimedia databases, and Web databases
➢ 2000s
o Stream data management and mining
o Data mining and its applications
o Web technology (XML, data integration) and global information systems

What Is Data Mining?

➢Data mining (knowledge discovery from data)


o Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
o Data mining: a misnomer?

➢Alternative names
o Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.

➢Watch out: Is everything “data mining”? No, the following are not:
o Simple search and query processing
o (Deductive) expert systems

Knowledge Discovery (KDD) Process
➢ This is a view from typical database systems and data warehousing communities
➢ Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process, bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
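The stages of the KDD process can be sketched as a chain of small functions. This is only an illustrative sketch under stated assumptions: the function names, the toy records, and the "mining" step (a plain frequency count) are all hypothetical and stand in for whatever a real system would do at each stage.

```python
from collections import Counter

def integrate(*sources):
    """Data integration: merge several sources into one list of records."""
    return [record for source in sources for record in source]

def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    """Selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def mine(records, field):
    """Data mining (toy stand-in): count how often each value occurs."""
    return Counter(r[field] for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns that occur often enough."""
    return {value: n for value, n in patterns.items() if n >= min_count}

# Hypothetical source databases.
store_a = [
    {"customer": "c1", "item": "beer", "city": "Hanoi"},
    {"customer": "c2", "item": None, "city": "Hanoi"},  # dirty record
]
store_b = [
    {"customer": "c3", "item": "beer", "city": "Hue"},
    {"customer": "c4", "item": "milk", "city": "Hue"},
]

warehouse = clean(integrate(store_a, store_b))
task_relevant = select(warehouse, ["item"])
patterns = mine(task_relevant, "item")
knowledge = evaluate(patterns, min_count=2)
print(knowledge)  # {'beer': 2}
```

Each function maps to one box in the process; only the `mine` step would change when a real mining algorithm is plugged in.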
Example: A Web Mining Framework

➢Web mining usually involves


o Data cleaning
o Data integration from multiple sources
o Warehousing the data
o Data cube construction
o Data selection for data mining
o Data mining
o Presentation of the mining results
o Patterns and knowledge to be used or stored into knowledge-base

Data Mining in Business Intelligence
[Figure: pyramid of layers; potential to support business decisions increases from bottom to top]
Layers, bottom to top:
o Data Sources: paper, files, Web documents, scientific experiments, database systems
o Data Preprocessing/Integration, Data Warehouses
o Data Exploration: statistical summary, querying, and reporting
o Data Mining: information discovery
o Data Presentation: visualization techniques
o Decision Making
Roles shown alongside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA
Example: Mining vs. Data Exploration

➢Business intelligence view


o Warehouse, data cube, reporting but not much mining
➢Business objects vs. data mining tools
➢Supply chain example: tools
➢Data presentation
➢Exploration

KDD Process: A Typical View from ML and Statistics
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
o Data pre-processing: data integration, normalization, feature selection, dimension reduction
o Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
o Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
➢ This is a view from typical machine learning and statistics communities

Example: Medical Data Mining
➢Health care and medical data mining often adopts this statistics and
machine learning view
➢Preprocessing of the data (including feature extraction and
dimension reduction)
➢Classification and/or clustering processes
➢Post-processing for presentation

Multi-Dimensional View of Data Mining
➢ Data to be mined
o Database data (extended-relational, object-oriented, heterogeneous, legacy), data
warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text
and web, multi-media, graphs & social and information networks
➢ Knowledge to be mined (or: Data mining functions)
o Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
o Descriptive vs. predictive data mining
o Multiple/integrated functions and mining at multiple levels
➢ Techniques utilized
o Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern
recognition, visualization, high-performance, etc.
➢ Applications adapted
o Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market
analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?

➢Database-oriented data sets and applications


o Relational database, data warehouse, transactional database
➢Advanced data sets and advanced applications
o Data streams and sensor data
o Time-series data, temporal data, sequence data (incl. bio-sequences)
o Structured data, graphs, social networks and multi-linked data
o Object-relational databases
o Heterogeneous databases and legacy databases
o Spatial data and spatiotemporal data
o Multimedia databases
o Text databases
o The World-Wide Web
Data Mining Function: (1) Generalization

➢Information integration and data warehouse construction


o Data cleaning, transformation, integration, and multidimensional
data model
➢Data cube technology
o Scalable methods for computing (i.e., materializing)
multidimensional aggregates
o OLAP (online analytical processing)
➢Multidimensional concept description: Characterization and
discrimination
o Generalize, summarize, and contrast data characteristics, e.g., dry vs.
wet region
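Materializing multidimensional aggregates can be illustrated with a tiny in-memory data cube. The sketch below is a naive version that assumes the data fits in memory; the `data_cube` function, the `sales` rows, and the choice of SUM as the aggregate are hypothetical, and real data cube methods are far more scalable than this exhaustive loop.

```python
from collections import defaultdict
from itertools import combinations

def data_cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims` (one cuboid per subset)."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cube[group] = dict(totals)
    return cube

# Hypothetical fact table: sales amount by region and quarter.
sales = [
    {"region": "dry", "quarter": "Q1", "amount": 10.0},
    {"region": "dry", "quarter": "Q2", "amount": 20.0},
    {"region": "wet", "quarter": "Q1", "amount": 5.0},
    {"region": "wet", "quarter": "Q2", "amount": 15.0},
]

cube = data_cube(sales, ("region", "quarter"), "amount")
print(cube[()])            # grand total: {(): 50.0}
print(cube[("region",)])   # {('dry',): 30.0, ('wet',): 20.0}
```

Comparing the `("region",)` cuboid entries for "dry" and "wet" is a minimal form of the characterization-and-discrimination contrast mentioned above.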
Data Mining Function: (2) Association and Correlation
Analysis
➢Frequent patterns (or frequent itemsets)
o What items are frequently purchased together in your Walmart?
➢Association, correlation vs. causality
o A typical association rule
• Diaper → Beer [0.5%, 75%] (support, confidence)
o Are strongly associated items also strongly correlated?
➢How to mine such patterns and rules efficiently in large datasets?
➢How to use such patterns for classification, clustering, and other
applications?
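The support and confidence of a rule such as Diaper → Beer can be computed directly from transaction counts. The sketch below uses a hypothetical list of baskets; the `lift` function is an addition not named on the slide, included as one common way to check whether an association also reflects correlation (lift above 1 suggests positive correlation).

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)

def support(transactions, itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions with `antecedent`, the fraction that also have `consequent`."""
    both = frozenset(antecedent) | frozenset(consequent)
    n_antecedent = count(transactions, antecedent)
    return count(transactions, both) / n_antecedent if n_antecedent else 0.0

def lift(transactions, antecedent, consequent):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical market baskets.
baskets = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"diaper", "beer", "bread"}),
]

rule_support = support(baskets, {"diaper", "beer"})          # 3 of 5 baskets
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})  # 3 of 4 diaper baskets
rule_lift = lift(baskets, {"diaper"}, {"beer"})
print(rule_support, rule_confidence, rule_lift)
```

In this toy data the rule has 60% support and 75% confidence, and its lift of about 1.25 indicates that beer is somewhat more likely in baskets that contain diapers.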

Data Mining Function: (3) Classification

➢ Classification and label prediction


o Construct models (functions) based on some training examples
o Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on (gas
mileage)
o Predict some unknown class labels
➢ Typical methods
o Decision trees, naïve Bayesian classification, support vector machines, neural
networks, rule-based classification, pattern-based classification, logistic regression, …
➢ Typical applications:
o Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …
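Constructing a model from training examples and using it to predict unknown labels can be shown with a decision stump, that is, a decision tree with a single split. This is a deliberately minimal sketch: it assumes one numeric feature and exactly two classes, and the gas-mileage values and the class names "guzzler" and "economy" are hypothetical.

```python
def fit_stump(values, labels):
    """Find the threshold on one numeric feature with the fewest training errors."""
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch handles exactly two classes"
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:]) if a != b]
    best = None
    for t in candidates:
        for low, high in ((classes[0], classes[1]), (classes[1], classes[0])):
            errors = sum(1 for v, y in pairs if (low if v <= t else high) != y)
            if best is None or errors < best[0]:
                best = (errors, t, low, high)
    _, threshold, low, high = best
    return threshold, low, high

def predict(stump, value):
    """Apply the learned split to a new, unlabeled value."""
    threshold, low, high = stump
    return low if value <= threshold else high

# Hypothetical training set: cars described by gas mileage (mpg).
mpg = [15.0, 18.0, 21.0, 30.0, 34.0, 38.0]
label = ["guzzler", "guzzler", "guzzler", "economy", "economy", "economy"]

stump = fit_stump(mpg, label)
print(stump)                 # (25.5, 'guzzler', 'economy')
print(predict(stump, 17.0))  # guzzler
print(predict(stump, 40.0))  # economy
```

The methods listed above (full decision trees, naïve Bayes, support vector machines, and so on) follow the same fit-then-predict pattern with richer models.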
Data Mining Function: (4) Cluster Analysis

➢Unsupervised learning (i.e., the class label is unknown)
➢Group data to form new categories (i.e., clusters), e.g., cluster houses to
find distribution patterns
➢Principle: Maximizing intra-class similarity & minimizing inter-class
similarity
➢Many methods and applications
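One of the many clustering methods is k-means, which groups points around centroids so that members of a cluster are close to each other. The sketch below is a basic version under stated assumptions: the caller supplies the initial centroids (real implementations usually pick them randomly or heuristically), and the house coordinates are hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic k-means: alternate assignment and centroid update until stable."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical house locations (x, y) forming two neighborhoods.
houses = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(houses, centroids=[(1, 1), (8, 8)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```

No class labels are given to the function; the two groups emerge only from the distances between points.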

Data Mining Function: (5) Outlier Analysis

➢Outlier analysis
o Outlier: A data object that does not comply with the general behavior
of the data
o Noise or exception? ― One person’s garbage could be another
person’s treasure
o Methods: by-product of clustering or regression analysis, …
o Useful in fraud detection, rare events analysis
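A very simple way to flag objects that do not comply with the general behavior of the data is a z-score rule. Note that this is a plain statistical rule swapped in for illustration, not the clustering- or regression-based methods listed above; the threshold of 2.0 and the transaction amounts are assumptions.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 500.0]
print(zscore_outliers(amounts))  # [500.0]
```

Whether the flagged value is noise or a meaningful exception (for example, fraud) is a separate judgment that the rule itself cannot make.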

Time and Ordering: Sequential Pattern, Trend and Evolution
Analysis

➢Sequence, trend and evolution analysis


o Trend, time-series, and deviation analysis: e.g., regression and value
prediction
o Sequential pattern mining
• e.g., first buy digital camera, then buy large SD memory cards
o Periodicity analysis
o Motifs and biological sequence analysis
• Approximate and consecutive motifs
o Similarity-based analysis
➢Mining data streams
o Ordered, time-varying, potentially infinite data streams
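Trend analysis by regression and value prediction can be sketched with an ordinary least-squares line fitted to a time series. The sketch assumes equally spaced observations indexed 0, 1, 2, …; the function names and the sales figures are hypothetical.

```python
def fit_trend(ys):
    """Least-squares line y = slope * t + intercept over t = 0, 1, 2, ..."""
    n = len(ys)
    ts = range(n)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    s_tt = sum((t - mean_t) ** 2 for t in ts)
    s_ty = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = s_ty / s_tt
    intercept = mean_y - slope * mean_t
    return slope, intercept

def forecast(ys, steps_ahead=1):
    """Extend the fitted line past the last observation."""
    slope, intercept = fit_trend(ys)
    return slope * (len(ys) - 1 + steps_ahead) + intercept

# Hypothetical monthly sales with a steady upward trend.
monthly_sales = [10.0, 12.0, 14.0, 16.0, 18.0]
slope, intercept = fit_trend(monthly_sales)
next_value = forecast(monthly_sales)
print(slope, intercept, next_value)  # 2.0 10.0 20.0
```

Large differences between observed values and the fitted line are one simple signal for the deviation analysis mentioned above.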

Structure and Network Analysis

➢Information network analysis


o Social networks: actors (objects, nodes) and relationships (edges)
• e.g., author networks in CS, terrorist networks
o Multiple heterogeneous networks
• A person could be in multiple information networks: friends, family,
classmates, …
o Links carry a lot of semantic information: Link mining
➢Web mining
o Web is a big information network: from PageRank to Google
o Analysis of Web information networks
• Web community discovery, opinion mining, usage mining, …
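The idea behind PageRank, that a page is important when important pages link to it, can be sketched with power iteration over a small link graph. This is a textbook-style sketch, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the four-page graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of linked pages."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page with no out-links: spread its rank evenly.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical web graph: page -> pages it links to.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C, which receives links from three pages, ends up with the highest rank, while D, which nobody links to, keeps only the baseline share.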

Evaluation of Knowledge
➢Is all mined knowledge interesting?
o One can mine a tremendous number of “patterns” and knowledge
o Some may fit only a certain dimension space (time, location, …)
o Some may not be representative, may be transient, …

➢Evaluation of mined knowledge → directly mine only interesting knowledge?
o Descriptive vs. predictive
o Coverage
o Typicality vs. novelty
o Accuracy
o Timeliness
o …
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Visualization, High-Performance Computing, Database Technology, Algorithm, and Applications]
Why Confluence of Multiple Disciplines?

➢Tremendous amount of data


o Algorithms must be highly scalable to handle terabytes of data
➢High-dimensionality of data
o Microarray data may have tens of thousands of dimensions
➢High complexity of data
o Data streams and sensor data
o Time-series data, temporal data, sequence data
o Structured data, graphs, social networks and multi-linked data
o Heterogeneous databases and legacy databases
o Spatial, spatiotemporal, multimedia, text and Web data
o Software programs, scientific simulations
➢New and sophisticated applications
Applications of Data Mining
➢Web page analysis: from web page classification and clustering to the PageRank
and HITS algorithms
➢Collaborative analysis & recommender systems
➢Basket data analysis to targeted marketing
➢Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
➢Data mining and software engineering (e.g., IEEE Computer, Aug. 2009
issue)
➢From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server
Analysis Manager, Oracle Data Mining Tools) to invisible data mining
Major Issues in Data Mining (1)

➢Mining Methodology
o Mining various and new kinds of knowledge
o Mining knowledge in multi-dimensional space
o Data mining: An interdisciplinary effort
o Boosting the power of discovery in a networked environment
o Handling noise, uncertainty, and incompleteness of data
o Pattern evaluation and pattern- or constraint-guided mining
➢User Interaction
o Interactive mining
o Incorporation of background knowledge
o Presentation and visualization of data mining results
Major Issues in Data Mining (2)

➢Efficiency and Scalability


o Efficiency and scalability of data mining algorithms
o Parallel, distributed, stream, and incremental mining methods
➢Diversity of data types
o Handling complex types of data
o Mining dynamic, networked, and global data repositories
➢Data mining and society
o Social impacts of data mining
o Privacy-preserving data mining
o Invisible data mining
A Brief History of Data Mining Society

➢1989 IJCAI Workshop on Knowledge Discovery in Databases
o Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
➢1991-1994 Workshops on Knowledge Discovery in Databases
o Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro,
P. Smyth, and R. Uthurusamy, 1996)
➢1995-1998 International Conferences on Knowledge Discovery in Databases
and Data Mining (KDD’95-98)
o Journal of Data Mining and Knowledge Discovery (1997)
➢ACM SIGKDD conferences since 1998 and SIGKDD Explorations
➢More conferences on data mining
o PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001),
etc.
➢ACM Transactions on KDD starting in 2007
Conferences and Journals on Data Mining

➢KDD Conferences
o ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
o SIAM Data Mining Conf. (SDM)
o (IEEE) Int. Conf. on Data Mining (ICDM)
o European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
o Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
o Int. Conf. on Web Search and Data Mining (WSDM)
➢Other related conferences
o DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
o Web and IR conferences: WWW, SIGIR, WSDM
o ML conferences: ICML, NIPS
o PR conferences: CVPR,
➢Journals
o Data Mining and Knowledge Discovery (DAMI or DMKD)
o IEEE Trans. On Knowledge and Data Eng. (TKDE)
o KDD Explorations
o ACM Trans. on KDD
Where to Find References? DBLP, CiteSeer, Google

➢ Data mining and KDD (SIGKDD: CDROM)


o Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
o Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
➢ Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
o Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
o Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
➢ AI & Machine Learning
o Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
o Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI,
etc.
➢ Web and IR
o Conferences: SIGIR, WWW, CIKM, etc.
o Journals: WWW: Internet and Web Information Systems,
➢ Statistics
o Conferences: Joint Stat. Meeting, etc.
o Journals: Annals of statistics, etc.
➢ Visualization
o Conference proceedings: CHI, ACM-SIGGraph, etc.
o Journals: IEEE Trans. visualization and computer graphics, etc.
Summary

➢Data mining: Discovering interesting patterns and knowledge from
massive amounts of data
➢A natural evolution of database technology, in great demand, with
wide applications
➢A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
➢Mining can be performed on a variety of data
➢Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
➢Data mining technologies and applications
➢Major issues in data mining
Thank you!
Q&A
Bias is

A. A class of learning algorithms that try to find an optimum classification of a set of
examples using probability theory
B. Any mechanism employed by a learning system to constrain the search space of a
hypothesis
C. An approach to the design of learning algorithms that is inspired by the fact that when
people encounter new situations, they often explain them by reference to familiar
experiences, adapting the explanations to fit the new situation.
D. None of these
Algorithm is

A. It uses machine-learning techniques. Here programs can learn from past experience and
adapt themselves to new situations
B. Computational procedure that takes some value as input and produces some value as
output
C. Science of making machines perform tasks that would require intelligence when
performed by humans
D. None of these
Bayesian classifiers are

A. A class of learning algorithms that try to find an optimum classification of a set of
examples using probability theory.
B. Any mechanism employed by a learning system to constrain the search space of a
hypothesis
C. An approach to the design of learning algorithms that is inspired by the fact that when
people encounter new situations, they often explain them by reference to familiar
experiences, adapting the explanations to fit the new situation.
D. None of these
Classification is

A. A subdivision of a set of examples into a number of classes


B. A measure of the accuracy of the classification of a concept that is given by a certain
theory
C. The task of assigning a classification to a set of examples
D. None of these
Binary attributes are

A. Attributes that take only two values. In general, these values will be 0 and 1, and they
can be coded as one bit
B. The natural environment of a certain species
C. Systems that can be used without knowledge of internal operations
D. None of these
Classification accuracy is

A. A subdivision of a set of examples into a number of classes


B. Measure of the accuracy of the classification of a concept that is given by a certain theory
C. The task of assigning a classification to a set of examples
D. None of these
Cluster is

A. Group of similar objects that differ significantly from other objects


B. Operations on a database to transform or simplify data in order to prepare it for a
machine-learning algorithm
C. Symbolic representation of facts or ideas from which information can potentially be
extracted
D. None of these
Data mining is

A. The actual discovery phase of a knowledge discovery process


B. The stage of selecting the right data for a KDD process
C. A subject-oriented, integrated, time-variant, non-volatile collection of data in support of
management
D. None of these
Data selection is

A. The actual discovery phase of a knowledge discovery process


B. The stage of selecting the right data for a KDD process
C. A subject-oriented, integrated, time-variant, non-volatile collection of data in support of
management
D. None of these
Classification task refers to

A. A subdivision of a set of examples into a number of classes
B. A measure of the accuracy of the classification of a concept that is given by a certain
theory
C. The task of assigning a classification to a set of examples
D. None of these
Discovery is

A. It is hidden within a database and can only be recovered if one is given certain clues
(an example is encrypted information).
B. The process of extracting implicit, previously unknown, and potentially useful information
from data
C. An extremely complex molecule that occurs in human chromosomes and that carries
genetic information in the form of genes.
D. None of these
Euclidean distance measure is

A. A stage of the KDD process in which new data is added to the existing selection.
B. The process of finding a solution for a problem simply by enumerating all possible
solutions according to some pre-defined order and then testing them
C. The distance between two points as calculated using the Pythagorean theorem
D. None of these
KDD (Knowledge Discovery in Databases) refers to

A. Non-trivial extraction of implicit, previously unknown, and potentially useful information
from data
B. Set of columns in a database table that can be used to identify each record within this
table uniquely
C. Collection of interesting and useful patterns in a database
D. None of these
Learning is
A. The process of finding the right formal representation of a certain body of knowledge in
order to represent it in a knowledge-based system
B. It automatically maps an external signal space into a system's internal representational
space. They are useful in the performance of classification tasks.
C. A process where an individual learns how to carry out a certain task when making a
transition from a situation in which the task cannot be carried out to a situation in which
the same task under the same circumstances can be carried out.
D. None of these
Learning algorithm refers to

A. An algorithm that can learn


B. A sub-discipline of computer science that deals with the design and implementation of
learning algorithms
C. A machine-learning approach that abstracts from the actual strategy of an individual
algorithm and can therefore be applied to any other form of machine learning.
D. None of these
Knowledge refers to

A. Non-trivial extraction of implicit, previously unknown, and potentially useful information
from data
B. Set of columns in a database table that can be used to identify each record within this table
uniquely
C. Collection of interesting and useful patterns in a database
D. None of these
Noise is

A. A component of a network
B. In the context of KDD and data mining, this refers to random errors in a database table.
C. One of the defining aspects of a data warehouse
D. None of these
Prediction is

A. The result of the application of a theory or a rule in a specific case


B. One of several possible entries within a database table that is chosen by the designer as
the primary means of accessing the data in the table.
C. Discipline in statistics that studies ways to find the most interesting projections of
multi-dimensional spaces.
D. None of these