Unit 3
Business Intelligence
Unit-3: Introduction to Data Mining (DM)
Outline
Motivation: Why data mining?
What is data mining?
Data mining functionalities
Classification of Data mining systems
Data Mining: On what kind of data?
Data Mining Architecture
KDD Process
Data mining issues
Motivation: Why Data Mining?
Problem: Data explosion
Solution: Data mining
“Data Mining”: Extraction of interesting knowledge from data in large databases
Netflix collects user ratings of movies (data) => What types of movies you will
like (knowledge) => Recommend new movies to you (action) => Users stay
with Netflix (goal)
Gene sequences of cancer patients (data) => Which genes lead to cancer?
(knowledge) => Appropriate treatment (action) => Save lives (goal)
Road traffic (data) => Which road is likely to be congested? (knowledge) =>
Suggest better routes to drivers (action) => Save time and energy (goal)
KDD Process
The knowledge discovery process is an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to
extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
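The seven steps above can be sketched as a small pipeline. This is only an illustrative sketch, not a standard implementation: the toy records, the function names, and the use of item support as the interestingness measure are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy records from two sources; None marks a missing value.
STORE_A = [
    {"tid": 1, "customer": "c1", "item": "milk", "price": 2.0},
    {"tid": 2, "customer": "c2", "item": "bread", "price": None},
]
STORE_B = [
    {"tid": 3, "customer": "c1", "item": "bread", "price": 1.5},
    {"tid": 4, "customer": "c3", "item": "eggs", "price": 3.0},
    {"tid": 5, "customer": "c3", "item": "bread", "price": 1.5},
]

def clean(records):
    """Step 1: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(sources):
    """Step 2: combine several sources into one list of records."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3: keep only the fields relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: consolidate records into one item set (basket) per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Step 5: count how many baskets contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support):
    """Step 6: keep items whose support reaches the chosen threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

def present(patterns):
    """Step 7: format the surviving patterns for the user."""
    return [f"{item}: support={s:.2f}" for item, s in sorted(patterns.items())]

def run_kdd(sources, min_support=0.6):
    cleaned = [clean(s) for s in sources]
    combined = integrate(cleaned)
    relevant = select(combined, ["customer", "item"])
    baskets = transform(relevant)
    counts = mine(baskets)
    patterns = evaluate(counts, len(baskets), min_support)
    return present(patterns)
```

With the toy data, `run_kdd([STORE_A, STORE_B])` returns `["bread: support=1.00"]`: the record with the missing price is removed during cleaning, and only bread appears in both remaining baskets.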
KDD Process (Cont..)
[Figure: KDD process flow. Data → Preprocessing → Transformation → Transformed Data → Data Mining → Patterns → Pattern Evaluation → Knowledge]
Unit: 3 – Introduction to Data Mining (DM) 11 Darshan Institute of Engineering & Technology
Domains of Data Mining Systems
Data mining is an interdisciplinary field that draws on a set of
disciplines, including database systems, statistics, machine
learning, visualization, and information science.
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
Data Mining—On what kind of data?
Relational Databases:
• A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a set
of software programs to manage and access the data.
• E.g.: SQL Server, Oracle, etc.
Data Warehouses:
• A data warehouse is a repository of information collected from multiple
sources.
• Data warehouses are constructed via a process of data cleaning, data
integration, data transformation, data loading, and periodic data refreshing.
• E.g.: Data warehouses maintained by stock markets and by retail chains such as D-Mart and Big Bazaar.
Data Mining—On what kind of data? (Cont..)
Transactional Databases:
• A transactional database consists of a file where each record represents a
transaction.
• A transaction typically includes a unique transaction identity number (TID)
and a list of the items making up the transaction (such as items purchased
in a store).
• E.g.: Online shopping sites such as Flipkart and Amazon.
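As a rough illustration of the TID-plus-item-list structure described above, the sketch below stores a few transactions in a dictionary and counts how often items and item pairs occur. The TIDs, item names, and helper functions are hypothetical and chosen only for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional database: TID -> list of items purchased.
TRANSACTIONS = {
    "T100": ["bread", "milk", "eggs"],
    "T200": ["bread", "milk"],
    "T300": ["milk", "eggs"],
    "T400": ["bread", "milk", "butter"],
}

def item_counts(db):
    """Count the number of transactions that contain each item."""
    return Counter(item for items in db.values() for item in set(items))

def pair_counts(db):
    """Count the number of transactions that contain each pair of items."""
    counts = Counter()
    for items in db.values():
        # Sort so that each pair has one canonical order, e.g. ("bread", "milk").
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts
```

Here `pair_counts(TRANSACTIONS)[("bread", "milk")]` is 3, meaning bread and milk were bought together in three of the four transactions.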
Other Data
• Spatial data (Maps or Location)
• Engineering design data (design of buildings, office structures)
• Hypertext and multimedia data (including text, image, video, and audio
data)
• The World Wide Web (a huge, widely distributed information
repository made available on the Internet)
Data Mining Issues
Data mining issues can be classified into five categories:
1. Mining Methodology
2. User Interaction
3. Efficiency and Scalability
4. Diversity of Database Types
5. Data Mining and Society
1) Mining Methodology
Mining various and new kinds of knowledge
• Data mining covers a wide spectrum of data analysis and knowledge
discovery tasks, so these tasks may use the same database in different ways
and require the development of numerous data mining techniques.
Mining knowledge in multidimensional space
• When searching for knowledge in large data sets, we can explore the data
in multidimensional space.
• That is, we can search for interesting patterns among combinations of
dimensions (attributes) at varying levels of abstraction. Such mining is
known as (exploratory) multidimensional data mining.
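One way to picture searching among combinations of dimensions is to aggregate the same records under every subset of attributes. The sketch below assumes a small, made-up sales table with three dimensions (`city`, `category`, `quarter`) and one measure (`amount`); the data and function names are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales records with three dimensions and one measure.
SALES = [
    {"city": "Rajkot", "category": "grocery", "quarter": "Q1", "amount": 100},
    {"city": "Rajkot", "category": "grocery", "quarter": "Q2", "amount": 150},
    {"city": "Rajkot", "category": "clothing", "quarter": "Q1", "amount": 80},
    {"city": "Surat", "category": "grocery", "quarter": "Q1", "amount": 60},
    {"city": "Surat", "category": "clothing", "quarter": "Q2", "amount": 120},
]

def aggregate(records, dims, measure="amount"):
    """Sum the measure for each combination of values of the given dimensions."""
    totals = defaultdict(int)
    for r in records:
        key = tuple(r[d] for d in dims)
        totals[key] += r[measure]
    return dict(totals)

def all_groupings(records, dimensions, measure="amount"):
    """Aggregate over every subset of dimensions, from the grand total
    (no dimensions) down to the most detailed level (all dimensions)."""
    result = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            result[dims] = aggregate(records, dims, measure)
    return result
```

For example, `aggregate(SALES, ("city",))` gives `{("Rajkot",): 330, ("Surat",): 180}`, while `aggregate(SALES, ("city", "category"))` drills down to a finer level of abstraction.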
1) Mining Methodology (Cont..)
Data mining—an interdisciplinary effort
• The power of data mining can be substantially enhanced by integrating new
methods from multiple disciplines.
• For example, to mine data with natural language text, it makes sense to
fuse data mining methods with methods of information retrieval and natural
language processing.
Handling uncertainty, noise, or incompleteness of data
• Data often contain noise, errors, exceptions, or uncertainty, or are incomplete.
• Errors and noise may confuse the data mining process, leading to the
derivation of erroneous patterns.
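A minimal sketch of one possible way to handle incomplete and noisy values before mining: fill missing entries with the median and flag values that lie far from the median. The sample readings, the threshold `k`, and the choice of the median absolute deviation rule are assumptions for illustration, not a method prescribed by the slides.

```python
from statistics import median

def impute_missing(values):
    """Replace missing entries (None) with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]

def flag_outliers(values, k=3.0):
    """Flag values whose distance from the median exceeds k times the
    median absolute deviation (MAD)."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)
    if mad == 0:
        # No spread to compare against; this sketch flags nothing in that case.
        return [False] * len(values)
    return [d > k * mad for d in deviations]

# Hypothetical sensor readings with one missing value and one noisy value.
readings = [21, 22, None, 23, 250, 22]
filled = impute_missing(readings)
flags = flag_outliers(filled)
```

With these readings, the missing entry is filled with 22 and only the value 250 is flagged, so it can be reviewed or excluded before patterns are derived.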
2) User Interaction
Interactive mining
• The data mining process should be highly interactive. Thus, it is important
to build flexible user interfaces and an exploratory mining environment,
facilitating the user’s interaction with the system.
Incorporation of background knowledge
• Background knowledge, constraints, rules, and other information
regarding the domain under study should be incorporated into the
knowledge discovery process.
Presentation and visualization of data mining results
• How can a system present data mining results vividly (as a clear image in
the mind) and flexibly, so that the discovered knowledge can be easily
understood and directly used by humans?
3) Efficiency and Scalability
Efficiency and scalability of data mining algorithms
• Data mining algorithms must be efficient and scalable in order to
effectively extract information from the huge amounts of data that lie in
many data repositories or in dynamic data streams.
• In other words, the running time of a data mining algorithm must be
predictable, short, and acceptable to applications.
• Efficiency, scalability, performance, optimization, and the ability to execute
in real time are key criteria for new mining algorithms.
Parallel, distributed, and incremental mining algorithms
• The giant size of many data sets, the wide distribution of data, and the
computational complexity of some data mining methods are factors that
motivate the development of parallel and distributed data-intensive mining
algorithms.
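The idea behind partitioned and incremental mining can be sketched with simple item counting: each partition is counted independently (as separate workers or sites might do), the partial results are merged, and a newly arrived batch updates the totals without rescanning older data. The partitions and function names below are hypothetical, and the sketch runs sequentially rather than on real parallel hardware.

```python
from collections import Counter

def count_partition(transactions):
    """Count, within one partition, how many transactions contain each item."""
    return Counter(item for t in transactions for item in set(t))

def merge_counts(partials):
    """Merge per-partition counts into global counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical data split across two partitions (e.g. two sites).
PART_1 = [["a", "b"], ["a"]]
PART_2 = [["b", "c"], ["a", "c"]]

# Distributed style: count each partition separately, then merge.
global_counts = merge_counts([count_partition(PART_1), count_partition(PART_2)])

# Incremental style: fold in a new batch without recounting the old data.
NEW_BATCH = [["c"]]
updated_counts = global_counts.copy()
updated_counts.update(count_partition(NEW_BATCH))
```

The merged counts equal what a single pass over all the data would produce, which is what makes the partitioned and incremental versions safe for this particular task; more complex mining methods generally need more care when merging partial results.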
4) Diversity of Database Types
Handling complex types of data
• The challenge is how to uncover knowledge from stream, time-series,
sequence, graph, social network, and multirelational data.
• Databases and datasets contain many different types of attributes and
data, all of which mining methods must be able to handle.
Mining dynamic, networked, and global data repositories
• Data from multiple sources are connected by the Internet and various kinds
of networks, forming distributed and heterogeneous global information
systems.
• Discovering knowledge from different sources of structured,
semi-structured, or unstructured data is challenging.
• Web Mining, multisource data mining and information network mining
have become challenging and fast-evolving data mining fields.
5) Data Mining and Society
Social impacts of data mining
• With data mining penetrating our everyday lives, it is important to study the
impact of data mining on society. How can we use data mining technology to
benefit our society? How can we guard against its misuse?
Privacy-preserving data mining
• Data mining will help in scientific discovery, business management, economy
recovery, and security protection (e.g., the real-time discovery of intruders and
cyber attacks).
• However, it poses the risk of disclosing an individual’s personal information.
Invisible data mining
• We cannot expect everyone in society to learn and master data mining
techniques.
• For example, when purchasing items online, users may be unaware that the store is
likely collecting data on the buying patterns of its customers, which may be used to
recommend other items for purchase in the future.