DM NOTES

A data warehouse is a centralized repository that integrates data from multiple sources for analysis and decision-making, focusing on historical data and supporting Online Analytical Processing (OLAP). Data mining, on the other hand, involves extracting valuable insights and patterns from large datasets, often utilizing the data stored in a data warehouse. The knowledge discovery process (KDD) encompasses various steps including data cleaning, integration, selection, transformation, mining, evaluation, and presentation to convert raw data into useful information.

DATA WAREHOUSE:

Data warehousing refers to the process of compiling and organizing data into
one common database.

The data from multiple sources are integrated into a common repository
known as a Data Warehouse.

Data Warehouse:

A Data Warehouse is a place where data can be stored for useful
mining. It is like a fast computer system with exceptionally large data
storage capacity. Data from an organization's various systems are copied to
the warehouse, where they can be fetched and conformed to remove errors.
Complex analytical queries can then be run against the warehoused data.

Data warehouses and databases are both relational data systems, but they
are built to serve different purposes. A data warehouse is built to store a
huge amount of historical data, typically using Online Analytical
Processing (OLAP). A database is made to store current transactions and
allow quick access to specific transactions, commonly known as Online
Transaction Processing (OLTP).

Important Features of Data Warehouse

1. Subject Oriented

A data warehouse is subject-oriented. It provides useful data about a
subject instead of the company's ongoing operations. A data warehouse
usually focuses on modeling and analysis of data that helps the business
organization to make data-driven decisions.
2. Time-Variant:

The data present in the data warehouse provide information for a
specific time period.

3. Integrated

A data warehouse is built by joining data from heterogeneous sources, such
as relational databases, flat files, documents, etc.

4. Non-Volatile

It means that once data is entered into the warehouse, it cannot be changed.

Advantages of Data Warehouse:


o More accurate data access
o Improved performance
o Cost-efficient
o Consistent and quality data

DATA MINING:

Data mining refers to the process of extracting useful data from the
databases. The data mining process depends on the data compiled in the
data warehousing phase to recognize meaningful patterns. A data
warehouse is created to support management decision-making.

Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques.
The primary goal of data mining is to discover hidden patterns and
relationships in the data that can be used to make informed decisions or
predictions.
or
Data mining is one of the most useful techniques that help individuals to
extract valuable information from huge sets of data.
or
Data mining refers to the analysis of data. It is the process of extracting
important patterns from large datasets.

Data mining is a crucial component of successful analytics initiatives in
organizations. The information it generates can be used in business
intelligence (BI) and advanced analytics applications that involve analysis of
historical data, as well as real-time analytics applications that examine
streaming data as it's created or collected.

Effective data mining aids in various aspects of planning business strategies
and managing operations. That includes customer-facing functions such as
marketing, advertising, sales and customer support, plus manufacturing,
supply chain management, finance and HR. Data mining supports fraud
detection, risk management, Cyber Security Planning and many other
critical business use cases. It also plays an important role in healthcare,
government, scientific research, mathematics, sports and more.

Data Analysis
It is the process of analysing and organizing raw data in order to determine
useful information and decisions.
Data Mining and Knowledge Discovery (KDD)
Data mining is also called Knowledge Discovery in Databases (KDD). The
knowledge discovery process includes Data cleaning, Data integration, Data
selection, Data transformation, Data mining, Pattern evaluation, and
Knowledge presentation.
KDD is the overall process of converting raw data into useful information.
 This process consists of a series of transformation steps, from data pre-
processing to post processing of data mining results.

Input data
Stored in a variety of formats (flat files, spreadsheets, or relational tables)
Pre-processing
It transforms the raw input data into an appropriate format for subsequent
analysis.
Steps involved in pre-processing:
 Combining data from multiple sources
 Cleaning data to remove noise and duplicate observations
 Selecting records and features that are relevant to the data mining task.
Post processing
 This step that ensures that only valid and useful results are
incorporated into the decision support system.
 Statistical measures or testing methods applied to eliminate false data
mining results. An example of post-processing:
 Visualization, which allows analysts to explore the data and the data
mining results from a variety of viewpoints. The final output of the overall
process is information.

Data Mining Architecture

The significant components of data mining systems are a data source, data
mining engine, data warehouse server, the pattern evaluation module,
graphical user interface, and knowledge base.

Data Source:

The actual sources of data are databases, data warehouses, the World Wide
Web (WWW), text files, and other documents. You need a huge amount of
historical data for data mining to be successful. Organizations typically store
data in databases or data warehouses.

Different processes:

Before passing the data to the database or data warehouse server, the data
must be cleaned, integrated, and selected. As the information comes from
various sources and in different formats, it can't be used directly for the
data mining procedure because the data may not be complete and accurate.
So, the data first needs to be cleaned and unified into a common schema. More
information than needed will be collected from various data sources, and
only the data of interest will have to be selected and passed to the server.
These procedures are not as easy as we think. Several methods may be
performed on the data as part of selection, integration, and cleaning.

Database or Data Warehouse Server:

The database or data warehouse server consists of the original data that is
ready to be processed. Hence, the server is responsible for retrieving the
relevant data, based on the user's data mining request.

Data Mining Engine:

The data mining engine is a major component of any data mining system. It
contains several modules for operating data mining tasks, including
association, characterization, classification, clustering, prediction, time-
series analysis, etc.

In other words, the data mining engine is the core of the data mining
architecture. It comprises instruments and software used to obtain insights
and knowledge from data collected from various data sources and stored
within the data warehouse.

Pattern Evaluation Module:

The pattern evaluation module is primarily responsible for measuring how
interesting a discovered pattern is, using a threshold value. It collaborates with
the data mining engine to focus the search on interesting patterns.

The pattern evaluation module might be coordinated with the mining
module, depending on the implementation of the data mining techniques
used.
Graphical User Interface:

The graphical user interface (GUI) module communicates between the data
mining system and the user. This module helps the user to easily and
efficiently use the system without knowing the complexity of the process.
This module cooperates with the data mining system when the user specifies
a query or a task and displays the results.

Knowledge Base:

The knowledge base is helpful in the entire process of data mining. It might
be helpful to guide the search or evaluate the result patterns. The knowledge
base may even contain user views and data from user experiences that
might be helpful in the data mining process. The data mining engine may
receive inputs from the knowledge base to make the result more accurate
and reliable. The pattern evaluation module regularly interacts with the
knowledge base to get inputs and also to update it.

Data Mining - Knowledge Discovery

The term KDD stands for Knowledge Discovery in Databases.

The main objective of the KDD process is to extract information from data in
the context of large databases. It does this by using Data Mining algorithms
to identify what is deemed knowledge. KDD is the organized procedure of
recognizing valid, useful, and understandable patterns from huge and
complex data sets. Data Mining is the root of the KDD procedure.

Here is the list of steps involved in the knowledge discovery process :


 Data Cleaning − In this step, noise and inconsistent data are removed.
 Data Integration − In this step, multiple data sources are combined.
 Data Selection − In this step, data relevant to the analysis task are
retrieved from the database.
 Data Transformation − In this step, data is transformed or consolidated
into forms appropriate for mining by performing summary or aggregation
operations.
 Data Mining − In this step, intelligent methods are applied in order to
extract data patterns.
 Pattern Evaluation − In this step, data patterns are evaluated.
 Knowledge Presentation − In this step, knowledge is represented.

The following diagram shows the process of knowledge discovery:
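
In addition to the diagram, here is a minimal, hedged sketch of the KDD chain in Python using pandas and scikit-learn. The file names (sales.csv, customers.csv) and column names are hypothetical placeholders, not part of these notes.

# Minimal sketch of the KDD steps; file and column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# 1. Data cleaning: remove duplicates, fill missing numeric values with the mean
raw = pd.read_csv("sales.csv")
clean = raw.drop_duplicates()
clean = clean.fillna(clean.mean(numeric_only=True))

# 2. Data integration: merge a second (hypothetical) source on a shared key
customers = pd.read_csv("customers.csv")
data = clean.merge(customers, on="customer_id", how="inner")

# 3. Data selection: keep only the attributes relevant to the mining task
selected = data[["age", "annual_spend"]]

# 4. Data transformation: scale the numeric attributes to [0, 1]
features = MinMaxScaler().fit_transform(selected)

# 5. Data mining: here, cluster the customers into three groups
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# 6-7. Pattern evaluation and knowledge presentation: inspect the cluster sizes
print(pd.Series(labels).value_counts())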

Types of Data

There are mainly three types of data stores on which mining can be
performed. These are the different sources of data used in the data mining
process:
 Data Base Data
 Data warehouse Data
 Transactional Data

DATABASE DATA
A database, together with the software that manages it, is called a database
management system (DBMS). Every DBMS stores data that are related to
one another in one way or another. It also has a set of software programs
that are used to manage data and provide easy access to it.
 A Relational database is defined as the collection of data organized in
tables with rows and columns.

 Physical schema in Relational databases is a schema which defines the


structure of tables.
 Logical schema in Relational databases is a schema which defines the
relationship among tables.

 Standard API of relational database is SQL.

 A relational database is a type of structured data that organizes data


into one or more tables, with each table consisting of rows and
columns. The rows represent individual records, and the columns
represent fields or attributes within those records.

 Relational databases are widely used in many different industries,


such as finance, healthcare, retail and e-commerce. They are also used
to support transactional systems, data warehousing, and business
intelligence.

 Relational databases are typically managed by a database management


system (DBMS) such as MySQL, Oracle, SQL Server, and Postgre SQL.

 Some advantages of relational databases are Data Integrity, Data
Consistency, Data Security, Efficient Data Retrieval, and Scalability.

Data warehouse data:

A data warehouse is a single data storage location that collects data from
multiple sources and then stores it in the form of a unified plan. When
data is stored in a data warehouse, it undergoes cleaning, integration,
loading, and refreshing. Data stored in a data warehouse is organized in
several parts. If you want information on data that was stored 6 or 12
months back, you will get it in the form of a summary.

Transactional Databases:

A transactional database stores records that are captured as transactions.
These transactions include flight bookings, customer purchases, clicks on a
website, and others. Every transaction record has a unique ID. It also lists
all the items that made up the transaction.

Application: Banking, Distributed systems, Object databases, etc.

Advanced Databases and Information Systems and Advanced Applications

Object and Object-Relational Databases

There are two major types: object-oriented databases and object-relational


databases.

Object-oriented databases (OODBs)

Object-oriented databases (OODBs) are designed to store and manipulate
objects. They are similar to object-oriented programming languages, i.e. Java
and Python. Objects can contain data, methods, and relationships to other
objects.
In OODB, the object itself is the storage rather than the representation of
the data. This allows for more efficient and natural handling of complex
data structures and relationships between objects.

Advantages of OODBs
Object-oriented databases (OODBs) have many advantages:
They work well with object-oriented programming languages.
They are easy to model.
They are fast for object-oriented workloads.
They can handle complex data structures well.

Object-relational databases (ORDBs)

Object-relational databases (ORDBs) are a hybrid between traditional
relational databases and OODBs. ORDBs are designed to handle both
structured and unstructured data, much like OODBs, but they also support
SQL queries and transactions, much like traditional relational databases.

An Object-relational model is a combination of an Object-oriented database
model and a Relational database model. So, it supports objects, classes,
inheritance etc. just like Object-Oriented models and has support for data
types, tabular structures etc. like the Relational data model.

One of the major goals of the Object-relational data model is to close the gap
between relational databases and the object-oriented practices frequently
used in many programming languages such as C++, C#, Java etc.
Multimedia Databases:
 Multimedia databases consist of audio, video, image and text
media.
 They can be stored on Object-Oriented Databases.
 They are used to store complex information in pre-specified
formats.
 Application: Digital libraries, video-on demand, news-on
demand, musical database, etc.
Spatial Database
 Store geographical information.
 Stores data in the form of coordinates, topology, lines,
polygons, etc.
 Application: Maps, Global positioning, etc.
Time-series Databases
 Time series databases contain stock exchange data and user
logged activities.
 Handles array of numbers indexed by time, date, etc.
 It requires real-time analysis.
 Examples: eXtremeDB, Graphite, InfluxDB, etc.
WWW
 The WWW (World Wide Web) is a collection of documents and
resources like audio, video, text, etc., which are identified by
Uniform Resource Locators (URLs), accessed through web browsers,
linked by HTML pages, and accessible via the Internet.
 It is the most heterogeneous repository as it collects data from
multiple resources.
 It is dynamic in nature as Volume of data is continuously
increasing and changing.
 Application: Online shopping, Job search, Research, studying,
etc.
Structured Data: This type of data is organized into a specific format,
such as a database table or spreadsheet. Examples include
transaction data, customer data, and inventory data.
Semi-Structured Data: This type of data has some structure, but not
as much as structured data. Examples include XML and JSON files,
and email messages.
Unstructured Data: This type of data does not have a specific format,
and can include text, images, audio, and video. Examples include
social media posts, customer reviews, and news articles.

External Data: This type of data is obtained from external sources such
as government agencies, industry reports, weather data, satellite images,
GPS data, etc.

Heterogeneous Database
It consists of data from multiple dissimilar sources.
These sources may include different types of databases, such as relational
databases, flat files, etc.

Legacy Database
It is a group of heterogeneous databases that combines different kinds of
data stores: relational or object-oriented databases, hierarchical and network
databases, spreadsheets, multimedia databases, etc.
The Heterogeneous Database in a legacy database can be connected by
intra or inter computer networks.

Time-Series Data: This type of data is collected over time, such as stock
prices, weather data, and website visitor logs.

Integrating a Data Mining System with a DB/DW System


Data Integration in Data Mining

 Data integration in data mining refers to the process of combining data


from multiple sources into a single, unified view. This can involve
cleaning and transforming the data, as well as resolving any
inconsistencies or conflicts that may exist between the different sources.
The goal of data integration is to make the data more useful and
meaningful for the purposes of analysis and decision making.
Techniques used in data integration include data warehousing, ETL
(extract, transform, load) processes, and data federation.
If a data mining system is not integrated with a database or a data
warehouse system, then there will be no system to communicate with. This
scheme is known as the non-coupling scheme. In this scheme, the main
focus is on data mining design and on developing efficient and effective
algorithms for mining the available data sets.
The list of Integration Schemes is as follows −
 No Coupling − In this scheme, the data mining system does not utilize
any of the database or data warehouse functions. It fetches the data
from a particular source and processes that data using some data
mining algorithms. The data mining result is stored in another file.

 Loose Coupling − In this scheme, the data mining system may use
some of the functions of database and data warehouse system. It fetches
the data from the data repository managed by these systems and
performs data mining on that data. It then stores the mining result
either in a file or in a designated place in a database or in a data
warehouse.

 Semi−tight Coupling − In this scheme, the data mining system is


linked with a database or a data warehouse system and in addition to
that, efficient implementations of a few data mining primitives can be
provided in the database.
 Tight coupling − In this coupling scheme, the data mining system is
smoothly integrated into the database or data warehouse system. The
data mining subsystem is treated as one functional component of an
information system.

Major Issues in Data Mining:


Data mining is not an easy task, as the algorithms used can get very
complex and data is not always available at one place. It needs to be
integrated from various heterogeneous data sources. These factors also
create some issues. Here in this tutorial, we will discuss the major issues
regarding −
 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues
The following diagram describes the major issues.

Mining Methodology and User Interaction Issues


It refers to the following kinds of issues −

 Mining different kinds of knowledge in databases − Different users
may be interested in different kinds of knowledge. Therefore it is
necessary for data mining to cover a broad range of knowledge
discovery task.
 Interactive mining of knowledge at multiple levels of abstraction −
The data mining process needs to be interactive because it allows
users to focus the search for patterns, providing and refining data
mining requests based on the returned results.
 Incorporation of background knowledge − To guide discovery
process and to express the discovered patterns, the background
knowledge can be used. Background knowledge may be used to
express the discovered patterns .
 Data mining query languages and ad hoc data mining − Data
Mining Query language that allows the user to describe ad hoc mining
tasks, should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.

 Presentation and visualization of data mining results − Once the


patterns are discovered it needs to be expressed in high level
languages, and visual representations. These representations should
be easily understandable.

 Handling noisy or incomplete data − The data cleaning methods are


required to handle the noise and incomplete objects. If the data
cleaning methods are not there then the accuracy of the discovered
patterns will be poor.

 Pattern evaluation − The patterns discovered should be interesting.

Performance Issues
There can be performance-related issues such as follows:
 Efficiency and scalability of data mining algorithms − In order to
effectively extract the information from huge amount of data in
databases, data mining algorithm must be efficient and scalable.

 Parallel, distributed, and incremental mining algorithms − The


factors such as huge size of databases, wide distribution of data, and
complexity of data mining methods motivate the development of
parallel and distributed data mining algorithms. These algorithms
divide the data into partitions which is further processed in a parallel
fashion. Then the results from the partitions are merged.
Incremental algorithms update the mining results without mining the data
again from scratch.
Diverse Data Types Issues:
 Handling of relational and complex types of data − The database
may contain complex data objects, multimedia data objects, spatial
data, temporal data etc. It is not possible for one system to mine all
these kind of data.
 Mining information from heterogeneous databases and global
information systems − The data is available at different data sources
on a LAN or WAN. These data sources may be structured, semi-
structured or unstructured. Therefore, mining knowledge from
them adds challenges to data mining.
Classification of Data Mining Systems
Classification of the data mining system helps users to understand
the system and match their requirements with such systems.
1.Classification according to the kinds of databases mined
A database system can be classified according to the type of data it holds,
the data model it uses, or the application of the data.
A database system can be further segmented based on distinct
principles, such as data models, types of data, etc., which further assist
in classifying a data mining system.
For example, if we want to classify a database based on the data model,
we need to select relational, transactional, object-relational or data
warehouse mining systems.
2. Classification according to the kinds of knowledge mined
This is based on functionalities such as characterization, association,
discrimination and correlation, prediction etc.

3. Classification according to the kinds of techniques utilized


This technique involves the degree of user interaction or the technique
of data analysis involved. For example, machine learning, visualization,
pattern recognition, neural networks, database-oriented or data-
warehouse oriented techniques.
4. Classification according to the application adapted

This involves domain-specific application. For example, the data


mining systems can be tailored accordingly for telecommunications,
finance, stock markets, e-mails and so on.
Data Mining Task Primitives

A data mining task can be specified in the form of a data mining query,
which is input to the data mining system. A data mining query is defined in
terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system. The data mining
primitives specify the following.

1. Set of task-relevant data to be mined.
2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern evaluation.
5. Representation for visualizing the discovered patterns.

A data mining query language can be designed to incorporate these


primitives, allowing users to interact with data mining systems flexibly.

Having a data mining query language provides a foundation on which user-


friendly graphical interfaces can be built.

This facilitates a data mining system's communication with other


information systems and integrates with the overall information processing
environment.

1. The set of task-relevant data to be mined

This specifies the portions of the database or the set of data in which the
user is interested.

This includes the database attributes or data warehouse dimensions of


interest (the relevant attributes or dimensions).

In a relational database, the set of task-relevant data can be collected via a


relational query involving operations like selection, projection, join, and
aggregation.

2. The kind of knowledge to be mined


This specifies the data mining functions to be performed, such as
characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.

3. The background knowledge to be used in the discovery process

This knowledge about the domain to be mined is useful for guiding the
knowledge discovery process and evaluating the patterns found.

Concept hierarchies are a popular form of background knowledge, which


allows data to be mined at multiple levels of abstraction.

o Rolling Up - Generalization of data: Allow to view data at more


meaningful and explicit abstractions and makes it easier to
understand. It compresses the data, and it would require fewer
input/output operations.
o Drilling Down - Specialization of data: Concept values replaced by
lower-level concepts. Based on different user viewpoints, there may be
more than one concept hierarchy for a given attribute or dimension.

4. The interestingness measures and thresholds for pattern evaluation

Different kinds of knowledge may have different interesting measures. They


may be used to guide the mining process or, after discovery, to evaluate the
discovered patterns.

For example, interesting measures for association rules include support and
confidence. Rules whose support and confidence values are below user-
specified thresholds are considered uninteresting.

o Simplicity: A factor contributing to the interestingness of a pattern is


the pattern's overall simplicity for human comprehension. For
example, the more complex the structure of a rule is, the more
difficult it is to interpret, and hence, the less interesting it is likely to
be. Objective measures of pattern simplicity can be viewed as
functions of the pattern structure, defined in terms of the pattern size
in bits or the number of attributes or operators appearing in the
pattern.
o Certainty (Confidence): Each discovered pattern should have a
measure of certainty associated with it that assesses the validity or
"trustworthiness" of the pattern.
o Utility (Support): The potential usefulness of a pattern is a factor
defining its interestingness. It can be estimated by a utility function,
such as support.
o Novelty: Novel patterns are those that contribute new information or
increased performance to the given pattern set. It can be based on
inlier or outlier analysis.

5. The expected representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed,


which may include rules, tables, cross tabs, charts, graphs, decision trees,
cubes, or other visual representations.

Functionalities of Data Mining

Data mining functionalities are used to represent the types of patterns that
have to be discovered in data mining tasks. In general, data mining tasks
can be classified into two types: descriptive and predictive.
Descriptive mining tasks define the common features of the data in the
database.
Predictive mining tasks perform inference on the current information to
develop predictions.
Data mining is extensively used in many areas or sectors. It is used to
predict and characterize data. But the ultimate objective of data mining
functionalities is to observe the various trends in the data. There are
several data mining functionalities that the organized and scientific methods
offer, such as:
1. Class/Concept Descriptions

A class or concept implies there is a data set or set of features that define
the class or a concept. A class can be a category of items on a shop floor,
and a concept could be the abstract idea on which data may be categorized
like products to be put on clearance sale and non-sale products. There are
two concepts here, one that helps with grouping and the other that helps in
differentiating.

o Data Characterization: This refers to the summary of general


characteristics or features of the class.
o Data Discrimination: Discrimination is used to separate distinct data
sets based on the disparity in attribute values. It compares features of
a class with features of one or more contrasting classes. eg., bar
charts, curves and pie charts.

2. Mining Frequent Patterns One of the functions of data mining is


finding data patterns. Frequent patterns are things that are
discovered to be most common in data. Various types of frequency can
be found in the dataset.

o Frequent item set: This term refers to a group of items that are
commonly found together, such as milk and sugar.
o Frequent substructure: It refers to the various types of data
structures that can be combined with an item set or subsequences,
such as trees and graphs.
o Frequent Subsequence: A regular pattern series, such as buying a
phone followed by a cover

3. Association Analysis

In data mining, association rules are if-then statements that identify


relationships between data elements. Support and confidence criteria are
used to assess the relationships -- support measures how frequently the
related elements appear in a data set, while confidence reflects the
number of times an if-then statement is accurate.

It analyses the set of items that generally occur together in a transactional


dataset. It is also known as Market Basket Analysis for its wide use in retail
sales. Two parameters are used for determining the association rules are
Support and Confidence.

o Support identifies the item sets that commonly occur together in the
database.
o Confidence is the conditional probability that an item occurs when
another item occurs in a transaction.

Correlation is a mathematical technique for determining whether and how
strongly two attributes are related to one another.

4. Classification

Classification is a data mining technique that categorizes items in a


collection based on some predefined properties. It uses methods like if-then,
decision trees or neural networks to predict a class or essentially classify a
collection of items. A training set containing items whose properties are
known is used to train the system to predict the category of items from an
unknown collection of items.
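
As a hedged illustration of the idea above (not part of the original notes), the following sketch trains a decision tree on a tiny invented training set with known classes and then predicts the class of a new item; the feature values and labels are made up.

# Classification sketch: learn from items with known classes, then predict.
from sklearn.tree import DecisionTreeClassifier

# Training set: [price, weight] -> class label (toy, made-up data)
X_train = [[5, 0.2], [7, 0.3], [40, 1.5], [55, 2.0]]
y_train = ["clearance", "clearance", "regular", "regular"]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Predict the category of an item from an "unknown" collection
print(model.predict([[6, 0.25]]))   # -> ['clearance']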

5. Prediction/Predictive

Prediction is used to estimate unavailable data values or upcoming trends. An
object can be anticipated based on the attribute values of the object and the
attribute values of the classes. It can be a prediction of missing numerical
values or increase or decrease trends in time-related information. There are
primarily two types of predictions in data mining: numeric and class
predictions.

o Numeric predictions are made by creating a linear regression model


that is based on historical data. Prediction of numeric values helps
businesses ramp up for a future event that might impact the business
positively or negatively.
o Class predictions are used to fill in missing class information for
products using a training data set where the class for products is
known.

6. Cluster Analysis

In image processing, pattern recognition and bioinformatics, clustering is a


popular data mining functionality. It is similar to classification, but the
classes are not predefined. Data attributes represent the classes. Similar
data are grouped together, with the difference being that a class label is not
known. Clustering algorithms group data based on similar features and
dissimilarities.
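
A minimal clustering sketch, assuming scikit-learn is available; the 2-D points are made up, and no class labels are given in advance, they are discovered by the algorithm.

# Clustering sketch: group toy 2-D points without predefined class labels.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [8.1, 7.9], [0.9, 1.0]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster label assigned to each point
print(kmeans.cluster_centers_)  # centroid of each discovered group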

7. Outlier Analysis

Outlier analysis is important to understand the quality of data. If there are


too many outliers, you cannot trust the data or draw patterns. An outlier
analysis determines if there is something out of turn in the data and
whether it indicates a situation that a business needs to consider and take
measures to mitigate. Data that cannot be grouped into any class by the
algorithms is surfaced by outlier analysis.
8. Evolution Analysis

Evolution Analysis pertains to the study of data sets that change over
time. Evolution analysis models are designed to capture evolutionary
trends in data helping to characterize, classify, cluster or discriminate
time-related data.

INTERESTINGNESS PATTERNS

A data mining system has the potential to generate thousands or even


millions of patterns, or rules. This raises some serious questions for data
mining: A pattern is interesting if

(1) it is easily understood by humans

(2) valid on new or test data with some degree of certainty

(3) potentially useful

(4) novel.

A pattern is also interesting if it validates a hypothesis that the user sought


to confirm. An interesting pattern represents knowledge.
Several objective measures of pattern interestingness exist. These are
based on the structure of discovered patterns and the statistics underlying
them. An objective measure for association rules of the form X ⇒ Y is rule
support, representing the percentage of transactions from a transaction
database that the given rule satisfies.
This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a
transaction contains both X and Y, that is, the union of itemsets X and Y.
Another objective measure for association rules is confidence, which
assesses the degree of certainty of the detected association. This is taken to
be the conditional probability P(Y | X), that is, the probability that a
transaction containing X also contains Y. More formally, support and
confidence are defined as

support(X ⇒ Y) = P(X ∪ Y)          confidence(X ⇒ Y) = P(Y | X)


In general, each interestingness measure is associated with a threshold,
which may be controlled by the user. For example, rules that do not satisfy
a confidence threshold of, say, 50% can be considered uninteresting. Rules
below the threshold likely reflect noise, exceptions, or minority
cases and are probably of less value.
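
A small worked sketch of these two measures in plain Python; the five transactions and the rule {milk} ⇒ {bread} are invented for illustration.

# support(X => Y) = P(X ∪ Y) and confidence(X => Y) = P(Y | X) on toy data.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]

def support(itemset):
    # fraction of transactions that contain every item of `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"milk"}, {"bread"}
print(support(X | Y))               # support    = 3/5 = 0.60
print(support(X | Y) / support(X))  # confidence = 3/4 = 0.75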

UNIT-II

ASSOCIATION RULES

Introduction:
• The goal of ARM (Association Rule Mining) is to find association rules, frequent patterns, subsequences, or
correlation relationships among large sets of data items that satisfy a predefined
minimum support and confidence in a given database.

• Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that


appear in a data set frequently.
Frequent item sets, from which association rules are derived, are a fundamental concept in
association rule mining, which is a technique used in data mining to discover relationships
between items in a dataset.

The goal of association rule mining is to identify relationships between items in a


dataset that occur frequently together.

A frequent item set is a set of items that occur together frequently in a dataset.

The frequency of an item set is measured by its support count, which is the
number of transactions or records in the dataset that contain the item set.

Support: It is one of the measures of interestingness. It tells about the usefulness and
certainty of rules. A support of 5% means that 5% of all transactions in the database follow the rule.

Support(A -> B) = Support_count(A ∪ B) / N, where N is the total number of transactions

Confidence: It is one of the measures of interestingness A confidence of 60% means that


60% of the customers who purchased a milk and bread also bought butter.

Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)

For example, a set of items, such as milk and bread, that appear frequently
together in a transaction data set is a frequent itemset.

A subsequence, such as buying first a PC, then a digital camera, and then a
memory card, if it occurs frequently in a shopping history database, is a (frequent)
sequential pattern.

A substructure can refer to different structural forms such as subgraphs, subtrees,


or sublattices, which may be combined with itemsets or subsequences.

 Support_count(X): Number of transactions in which X appears. If X is A union B then it is


the number of transactions in which A and B both are present.

 Maximal Itemset: An itemset is maximal frequent if none of its supersets are frequent.

 Closed Itemset:

A (frequent) itemset is called closed if it has no (frequent) superset having the same support.

An association rule is an expression A ⇒ B, where A and B are itemsets, and A ∩ B = ∅.

The support of the rule is the joint probability of a transaction containing both A and B, given
as sup(A ⇒ B) = P(A ∧ B) = sup(A ∪ B).

The confidence of a rule is the conditional probability.

 K-Itemset: An itemset which contains K items is a K-itemset. An itemset is
frequent if its support count is greater than or equal to the minimum support count.
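
To illustrate the definitions above, here is a brute-force sketch (not the Apriori algorithm) that enumerates every k-itemset of a toy transaction set and keeps those meeting a minimum support count; the items and the threshold are invented.

# Brute-force enumeration of frequent k-itemsets on made-up transactions.
from itertools import combinations

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
min_support_count = 2
items = sorted(set().union(*transactions))

def support_count(itemset):
    return sum(set(itemset) <= t for t in transactions)

frequent = {}
for k in range(1, len(items) + 1):          # k-itemsets for every k
    for candidate in combinations(items, k):
        count = support_count(candidate)
        if count >= min_support_count:      # frequent if count >= min support count
            frequent[candidate] = count

for itemset, count in sorted(frequent.items()):
    print(itemset, count)   # e.g. ('bread', 'milk') appears in 3 transactions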

What Is Association Mining?

Association Mining searches for frequent items in the data set.

In frequent mining usually, interesting associations and correlations between item sets in
transactional and relational databases are found.

Frequent Mining shows which items appear together in a transaction or relationship.

 Association rule mining:


 Finding frequent patterns, associations, correlations, or causal structures
among sets of items or objects in transaction databases, relational databases,
and other information repositories.
 Applications:
 Basket data analysis, cross-marketing, catalog design, loss-leader analysis,

clustering, classification, etc.

Data Warehousing and OLAP

• Data warehouses generalize and consolidate data in multidimensional


space.
• The construction of data warehouses involves data cleaning, data
integration, and data transformation, which can be viewed as an important
preprocessing step for data mining.
• data warehouses provide online analytical processing (OLAP) tools
for the interactive analysis of multidimensional data of varied
granularities, which facilitates effective data mining.

What Is a Data Warehouse?

• Data warehousing provides architectures and tools for business


executives to systematically organize, understand, and use
their data to make strategic decisions.

• A data warehouse is a data repository maintained separately from an
organization's operational databases.
Definition
• “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of
management’s decision making process” .
Subject-oriented:
• A data warehouse is organized around major subjects such as
customer, supplier, product, and sales.
• Rather than concentrating on the day-to-day operations and
transaction processing of an organization.
• a data warehouse focuses on the modeling and analysis of data for
decision makers.
• Integrated: A data warehouse is usually constructed by
integrating multiple heterogeneous sources, such as relational
databases, flat files, and online transaction records.
• Time-variant: Data are stored to provide information from a historical
perspective (e.g., the past 5–10 years).
• Nonvolatile: A data warehouse is a physically separate store of data
transformed from the application data found in the operational
environment.
• Due to this separation, a data warehouse does not require
transaction processing, and recovery.
• It usually requires only two operations in data accessing: initial
loading of data and access of data.

Differences between Operational Database Systems and Data Ware houses


• OLTP (on-line transaction processing)
– The major task is to perform online transaction and query processing.
– Day-to-day operations:
• purchasing, banking, manufacturing, registration,
accounting, etc.
• OLAP (on-line analytical processing)
– organize and present data in various formats to fulfill diverse
needs of different users.
– Data analysis and decision making

Users and system orientation:

• An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and
IT professionals.

• An OLAP system is market-oriented and is used for data


analysis by knowledge workers, including managers, executives,
and analysts.
• Data contents:
An OLTP system manages current data which is too detailed to be
easily used for decision making.

An OLAP system manages large amounts of historic data,


provides facilities for summarization and aggregation, and stores
and manages information at different levels of granularity.

• Database design:

An OLTP system usually adopts an entity-relationship (ER) data


model and an application-oriented database design.

An OLAP system typically adopts either a star or a snowflake


model.
• View:

An OLTP system focuses mainly on the current data within an


enterprise or department, without referring to historic data or data
in different organizations.
• Access patterns:

An OLTP system consists mainly of short, atomic


transactions. Such a system requires concurrency control and
recovery mechanisms.

OLTP Vs OLAP
DataWarehousing: A Multitiered Architecture

• Data warehouses often adopt a three-tier architecture. (Figure: a three-tier
data warehousing architecture.)
1. BOTTOM TIER

• This is a warehouse database server which is like a relational database


system.
• Back-end tools are used to feed data into the bottom tier from
operational databases
• The data are extracted using application program interfaces known as
gateways.

• A gateway is supported by the DBMS and allows client programs to


generate SQL code to be executed at a server.
• Examples of gateways: (given by Microsoft)
– ODBC (Open Database Connectivity)
• This tier also contains a metadata repository
• The middle tier is an OLAP server that is typically implemented
using either

(1) a relational OLAP(ROLAP) model (i.e., an extended relational


DBMS that maps operations on multidimensional data to
standard relational operations);
Or

(2) a multidimensional OLAP (MOLAP) model (i.e., a special- purpose


server that directly implements multidimensional data).

• The top tier is a front-end client layer, which contains query and
reporting tools, analysis tools, and/or data mining tools.
Data Warehouse Modeling: A Multidimensional Data Model
• What is a data cube?”
• A data cube allows data to be modeled and viewed in multiple dimensions.
• It is defined by dimensions and facts.
• In general terms, dimensions are the entities with respect to
which an organization wants to keep records.
• Each dimension may have a table associated with it, called a
dimension table.
• A multidimensional data model is typically organized around a
central theme, such as sales.
• This theme is represented by a fact table. Facts are numeric
measures.

• (Figure) A 2-D view of sales data for All Electronics, with dimensions
time and item, and the measure dollars_sold (in thousands).

• Now, suppose that we would like to view the sales data with a third
dimension. (Figure) A 3-D view of sales data for All Electronics,
according to time, item, and location.
• Suppose that we would now like to view our sales data with an
additional fourth dimension, such as supplier.

• Given a set of dimensions, we can generate a cuboid for each of the
possible subsets of the given dimensions (see the sketch after this list).

• The result would form a lattice of cuboids, each showing the data
at a different level of summarization.

• The cuboid that holds the lowest level of summarization is called the
"base cuboid“.

• The 0-D cuboid, which holds the highest level of summarization, is called
the "apex cuboid". (Figure: lattice of cuboids forming a 4-D data cube.)
Stars, Snowflakes, and Fact Constellations: Schemas for
Multidimensional Data Models
• The most popular data model for a data warehouse is a
multidimensional model.

• Such a model can exist in the form of


1) star schema
2) snowflake schema
3) fact constellation schema.
OLAP Operations in the Multidimensional Data Model

• In the multidimensional model, data are organized into multiple


dimensions and each dimension contains multiple levels of
abstraction defined by concept hierarchies.

• This organization provides users with the flexibility to view data


from different perspectives.

•OLAP provides a user-friendly environment for interactive data analysis

•OLAP operations on multidimensional data are

1) Roll-up
2) Drill-down
3) Slice and dice
4) Pivot (rotate)
OLAP Operations on Multidimensional Data
1) Roll-up :
The roll-up operation (also called drill-up) performs
aggregation on a data cube, either by climbing up a concept
hierarchy for a dimension or by dimension reduction.
Ex: roll-up from cities to country
2) Drill-down:
Drill-down is the reverse of roll-up.
Drill-down can be realized by either stepping down a concept
hierarchy for a dimension or by introducing additional dimensions.
EX: drill-down from quarter to months

3) Slice and dice(project and select):


The slice operation performs a selection on one dimension of the
given cube, resulting in a subcube.
Ex: Slice for time = "Q1"
Dice of (location = “Toronto” or “Vancouver”) and
(time = “Q1” or “Q2”) and
(item =“home entertainment” or “computer”).

4) Pivot (rotate):
Pivot (also called rotate) is a visualization operation that rotates the
data axes to provide an alternative presentation of the data.
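
A hedged pandas analogy of the four operations on a made-up sales table; the DataFrame, its columns, and the values are invented, and real OLAP servers implement these operations directly on the cube.

# Rough pandas analogies of roll-up, drill-down, slice/dice, and pivot.
import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "New York", "Chicago"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "item":    ["computer", "phone", "computer", "phone"],
    "dollars_sold": [1000, 600, 1500, 700],
})

# Roll-up: aggregate from the city level up to the country level
rollup = sales.groupby("country")["dollars_sold"].sum()

# Drill-down (the reverse): go back to the finer (country, city, quarter) level
drilldown = sales.groupby(["country", "city", "quarter"])["dollars_sold"].sum()

# Slice: select a single value on one dimension (time = "Q1")
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once
dice = sales[sales["city"].isin(["Toronto", "Vancouver"]) &
             sales["quarter"].isin(["Q1", "Q2"])]

# Pivot (rotate): present the same measure along different axes
pivot = sales.pivot_table(index="item", columns="quarter",
                          values="dollars_sold", aggfunc="sum")
print(rollup, slice_q1, pivot, sep="\n\n")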
Data Preprocessing

• Data Quality and Major Tasks in Data Preprocessing


• Data Cleaning
• Data Integration
• Data Transformation and Data Discretization
• Data Reduction

Data Quality: Why Preprocess the Data?
• Data have quality if they satisfy the requirements of the intended use.

• Measures for data quality: A multidimensional view


– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how much the data are trusted to be correct?
– Interpretability: how easily the data can be understood?

Major Tasks in Data Preprocessing
• Data cleaning can be applied to remove noise and correct inconsistencies in the data.
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration merges data from multiple sources into a coherent data store, such
as a data warehouse.
– Integration of multiple databases, data cubes, or files
• Data reduction can reduce the data size by aggregating, eliminating redundant
features, or clustering.
– Dimensionality reduction, Numerosity reduction, Data compression
• Data transformations and Data Discretization, such as normalization, may be
applied.
– For example, normalization may improve the accuracy and efficiency of mining algorithms
involving distance measurements.
– Concept hierarchy generation

Major Tasks in Data Preprocessing
Data Cleaning
• Data cleaning routines work to “clean” the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
– If users believe the data are dirty, they are unlikely to trust the results of any data mining
that has been applied to it.
– Dirty data can cause confusion for the mining procedure, resulting in unreliable output

Major Tasks in Data Preprocessing
Data Integration
• Data integration merges data from multiple sources into a coherent data store, such
as a data warehouse.

Major Tasks in Data Preprocessing
Data Reduction
• Data reduction obtains a reduced representation of the data set that is much smaller
in volume, yet produces the same (or almost the same) analytical results.
• Data reduction strategies include dimensionality reduction and numerosity
reduction.

Major Tasks in Data Preprocessing
Data transformations and Data Discretization
• The data are transformed or consolidated so that the resulting mining process may be
more efficient, and the patterns found may be easier to understand.

• Data discretization is a form of data transformation.


– Data discretization transforms numeric data by mapping values to interval or concept
labels.

• Data Transformation: Normalization

• Data Quality and Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Transformation and Data Discretization
• Data Reduction

Data Cleaning
• Data in the real world is dirty: Lots of potentially incorrect data, e.g., instrument
faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
• e.g., Occupation = “ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary = “−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age = “42”, Birthday = “03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– intentional: (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data.

• Missing data may be due to


– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data

• Missing data may need to be inferred.

How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same class: smarter
– the most probable value: inference-based such as Bayesian formula or decision tree.
• a popular strategy.
• In comparison to the other methods, it uses the most information from the present data to predict
missing values.
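
The strategies listed above can be sketched with pandas on a toy table; the column names and values are invented.

# Illustrating the fill strategies on a toy DataFrame with missing income.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, np.nan, 30.0, 34.0, np.nan],
})

dropped = df.dropna()                                     # ignore the tuple
const_filled = df.fillna({"income": 0.0})                 # a global constant
mean_filled = df.fillna({"income": df["income"].mean()})  # the attribute mean

# the attribute mean for all samples belonging to the same class (smarter)
class_filled = df.copy()
class_filled["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(mean_filled, class_filled, sep="\n\n")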

Noisy Data and
How to Handle Noisy Data?
• Noise: random error or variance in a measured variable
• Outliers may represent noise.
• Given a numeric attribute such as, say, price, how can we “smooth” out the data to
remove the noise?

Data Smoothing Techniques:


• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with possible outliers)

Binning Methods for Data Smoothing
• Binning methods smooth a sorted data by distributing them into bins (buckets).

Smoothing by bin means:


• Each value in a bin is replaced by the mean value of the bin.

Smoothing by bin medians:


• Each bin value is replaced by the bin median.

Smoothing by bin boundaries:


• The minimum and maximum values in a given bin are identified as the bin
boundaries.
• Each bin value is then replaced by the closest boundary value.

Binning Methods for Data Smoothing: Example
• Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
• Partition into (equal-frequency) bins:
– Bin 1: 4, 8, 15
– Bin 2: 21, 21, 24
– Bin 3: 25, 28, 34
• Smoothing by bin means:
– Bin 1: 9, 9, 9
– Bin 2: 22, 22, 22
– Bin 3: 29, 29, 29
• Smoothing by bin medians:
– Bin 1: 8, 8, 8
– Bin 2: 21, 21, 21
– Bin 3: 28, 28, 28
• Smoothing by bin boundaries:
– Bin 1: 4, 4, 15
– Bin 2: 21, 21, 24
– Bin 3: 25, 25, 34
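
The worked example above can be reproduced with a few lines of plain Python (equal-frequency bins of size 3):

# Equal-frequency bins of size 3, then smoothing by means, medians, boundaries.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]            # already sorted
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]
by_boundaries = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
                 for b in bins]

print(by_means)       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_medians)     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]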
Data Smoothing
• Many methods for data smoothing are also methods for data reduction involving
discretization.
– For example, the binning techniques reduce the number of distinct values per attribute.
• This acts as a form of data reduction for logic-based data mining methods, such as decision tree
induction, which repeatedly make value comparisons on sorted data.

• Concept hierarchies are a form of data discretization that can also be used for data
smoothing.
– A concept hierarchy for price, for example, may map real price values into inexpensive,
moderately priced, and expensive, thereby reducing the number of data values to be
handled by the mining process.

Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check uniqueness rule, consecutive rule and null rule
– For example, values that are more than two standard deviations away from the mean for a
given attribute may be flagged as potential outliers (see the sketch after this list).
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and
make corrections
• Data auditing: by analyzing data to discover rules and relationship to detect violators (e.g.,
correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations
through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter's Wheel is a data cleaning tool)
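
As mentioned in the discrepancy-detection item above, a minimal sketch of the two-standard-deviation check; the values are made up.

# Flag values more than two standard deviations away from the attribute mean.
from statistics import mean, stdev

values = [52, 55, 49, 51, 48, 120, 50, 53]
mu, sigma = mean(values), stdev(values)

outliers = [v for v in values if abs(v - mu) > 2 * sigma]
print(outliers)   # -> [120]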

Data Integration
• Data integration:
– Combines data from multiple sources into a coherent source.
– Careful integration can help reduce and avoid redundancies and inconsistencies.
• Schema integration:
– Integrate metadata from different sources
– e.g., A.cust-id ≡ B.cust-#
• Entity identification problem:
– Identify real world entities from multiple data sources,
– e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are different
– Possible reasons: different representations, different scales, e.g., metric vs. British units

Data Transformation
• In data transformation, the data are transformed or consolidated into forms
appropriate for mining.
• In data transformation, a function that maps the entire set of values of a given attribute
to a new set of replacement values such that each old value can be identified with one
of the new values.

Some of data transformation strategies:


• Normalization: The attribute data are scaled so as to fall within a small specified
range.
• Discretization: A numeric attribute is replaced by a categorical attribute.
• Other data transformation strategies
– Smoothing: Remove noise from data. Smoothing is also a data cleaning method.
– Attribute/feature construction: New attributes constructed from the given ones. Attribute
construction is also a data reduction method.
– Aggregation: Summarization, data cube construction. Aggregation is also a data
reduction method.
Normalization
• An attribute is normalized by scaling its values so that they fall within a small
specified range.
• A larger range of an attribute gives a greater effect (weight) to that attribute.
– This means that an attribute with a larger range can have greater weight in data mining
tasks than an attribute with a smaller range.
• Normalizing the data attempts to give all attributes an equal weight.
– Normalization is particularly useful for classification algorithms involving neural networks
or distance measurements such as nearest-neighbor classification and clustering.

Some Normalization Methods:


• Min-max normalization
• Z-score normalization
• Normalization by decimal scaling

Min-Max Normalization
• Min-max normalization performs a linear transformation on the original data.

• Suppose that minA and maxA are minimum and maximum values of an attribute A.
• Min-max normalization maps a value vi of an attribute A to v′i in the range
[new_minA, new_maxA] by computing:

v′i = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

• Min-max normalization preserves the relationships among the original data values.
• We can standardize the range of all the numerical attributes to [0,1] by applying
min-max normalization with newmin=0 and newmax=1 to all the numeric attributes.
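As a concrete illustration, the following is a minimal Python sketch of min-max normalization (not part of the original notes); the income values and the target ranges are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale numeric attribute values linearly into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # All values are equal: map everything to new_min to avoid division by zero.
        return [new_min for _ in values]
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]

# Hypothetical income values (in thousands)
incomes = [12, 30, 45, 73, 98]
print(min_max_normalize(incomes))          # scaled to [0, 1]
print(min_max_normalize(incomes, 0, 10))   # scaled to [0, 10]
```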

Discretization
Discretization: To transform a numeric (continuous) attribute into a categorical attribute.

• Some data mining algorithms require that data be in the form of categorical attributes.
• In discretization:
– The range of a continuous attribute is divided into intervals.
– Then, interval labels can be used to replace actual data values to obtain a categorical
attribute.

Simple Discretization Example: income attribute is discretized into a categorical attribute.


– Target categories (low, medium, high).
– Calculate average income: AVG.
• If income> 2* AVG, new_income_value = “high”.
• If income < 0.5* AVG, new_income_value = “low”.
• Otherwise, new_income_value = “medium”.
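A minimal Python sketch of this average-based discretization rule follows; the income values are hypothetical and not from the notes.

```python
def discretize_income(incomes):
    """Map each income to 'low', 'medium' or 'high' relative to the average income."""
    avg = sum(incomes) / len(incomes)
    labels = []
    for income in incomes:
        if income > 2 * avg:
            labels.append("high")
        elif income < 0.5 * avg:
            labels.append("low")
        else:
            labels.append("medium")
    return labels

# Hypothetical income values; the average is 83, so 2*AVG = 166 and 0.5*AVG = 41.5
print(discretize_income([10, 40, 55, 60, 250]))   # ['low', 'low', 'medium', 'medium', 'high']
```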

Discretization Methods
• A basic distinction between discretization methods for classification is whether class
information is used (supervised) or not (unsupervised).

• Some of discretization methods are as follows:

Unsupervised Discretization: If class information is not used, then relatively simple


approaches are common.
• Binning
• Clustering analysis

Supervised Discretization:
• Classification (e.g., decision tree analysis)
• Correlation (e.g., χ²) analysis

Discretization by Binning
• Attribute values can be discretized by applying equal-width or equal-frequency
binning.
• Binning approaches sort the attribute values first, then partition them into bins.
– equal width approach divides the range of the attribute into a user-specified number of
intervals each having the same width.
– equal frequency (equal depth) approach tries to put the same number of objects into
each interval.
• After bins are determined, all values are replaced by bin labels to discretize that
attribute.
– Instead of bin labels, values may be replaced by bin means (or medians).

• Binning does not use class information and is therefore an unsupervised


discretization technique.
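The sketch below illustrates both binning approaches in Python; it is an illustrative simplification rather than the notes' own procedure, and it uses price values consistent with the binning example shown earlier in this unit.

```python
def equal_width_bins(values, k):
    """Divide the attribute range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), k - 1) if width > 0 else 0
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Put (roughly) the same number of sorted values into each of k bins."""
    data = sorted(values)
    n = len(data)
    return [data[i * n // k:(i + 1) * n // k] for i in range(k)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))        # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
print(equal_frequency_bins(prices, 3))    # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```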

Data Reduction
• Data reduction: Obtain a reduced representation of the data set that is much smaller
in volume but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.

• Data reduction strategies:


– Dimensionality reduction: e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
– Numerosity reduction:
• Data cube aggregation
• Sampling
• Clustering, …

Data Reduction: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse.
– Density and distance between points, which is critical to clustering, outlier analysis,
becomes less meaningful.
– The possible combinations of subspaces will grow exponentially.
• Dimensionality reduction
– Avoid the curse of dimensionality.
– Help eliminate irrelevant features and reduce noise.
– Reduce time and space required in data mining.
– Allow easier visualization.
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)

Dimensionality Reduction
Attribute Subset Selection
• Data sets for analysis may contain hundreds of attributes, many of which may be
irrelevant to the mining task or redundant.
• Redundant Attributes duplicate much or all of the information contained in one or
more other attributes.
– price of a product and the sales tax paid contain much of the same information.
• Irrelevant Attributes contain almost no useful information for the data mining task.
– students' IDs are irrelevant to predict students' grade.
• Attribute Subset Selection reduces the data set size by removing irrelevant or
redundant attributes.
– The goal of attribute subset selection is to find a minimum set of attributes such that the
resulting probability distribution of the data classes is as close as possible to the original
distribution obtained using all attributes.
– Attribute subset selection reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.

Dimensionality Reduction
Attribute Subset Selection
Attribute Subset Selection Techniques:
• Brute-force approach:
– Try all possible feature subsets as input to data mining algorithm.

• Embedded approaches:
– Feature selection occurs naturally as part of the data mining algorithm.

• Filter approaches:
– Features are selected before data mining algorithm is run.

Numerosity Reduction
Sampling
Simple Random Sampling
– There is an equal probability of selecting any particular item
Sampling without replacement
• As each item is selected, it is removed from the population
Sampling with replacement
• Objects are not removed from the population as they are selected for the sample.

Stratified Sampling
– Split the data into several partitions; then draw random samples from each partition.
• In the simplest version, equal numbers of objects are drawn from each group even though the
groups are of different sizes.
• In another variation, the number of objects drawn from each group is proportional to the size of
that group.
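A minimal Python sketch of these sampling schemes follows; the population and strata are hypothetical.

```python
import random

def srs_without_replacement(population, n):
    """Simple random sample: each selected object is removed from the population."""
    return random.sample(population, n)

def srs_with_replacement(population, n):
    """Simple random sample: objects stay in the population and may be drawn again."""
    return [random.choice(population) for _ in range(n)]

def stratified_sample(strata, fraction):
    """Proportional stratified sampling: draw the same fraction from every partition."""
    sample = []
    for name, members in strata.items():
        k = max(1, round(len(members) * fraction))
        sample.extend((name, m) for m in random.sample(members, k))
    return sample

# Hypothetical customer IDs partitioned by region (the strata)
strata = {"north": list(range(100)), "south": list(range(100, 140)), "east": list(range(140, 150))}
print(stratified_sample(strata, 0.1))   # roughly 10 north, 4 south, 1 east
```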

Data discretization refers to a method of converting a large number of data values into a smaller number of intervals or categories so that the evaluation and management of the data become easier.
Discretization is one form of data transformation technique. It transforms numeric values into interval labels or conceptual labels.
Binning, also known as discretization or bucketing, is a data preprocessing
technique used in data mining. It involves dividing a continuous variable into a set
of smaller intervals or bins and replacing the original values with the corresponding
bin labels.

Correlation Analysis

The word correlation is used in everyday life to denote some form of association.
Correlation analysis in market research is a statistical method that identifies the
strength of a relationship between two or more variables.
The correlation coefficient is measured on a scale that varies from + 1 through 0 to –
1. Complete correlation between two variables is expressed by either + 1 or -1.
When one variable increases as the other increases the correlation is positive; when
one decreases as the other increases it is negative. Complete absence of correlation
is represented by 0.
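The following Python sketch computes the (Pearson) correlation coefficient on hypothetical figures; it is only an illustration of the +1 / 0 / −1 scale described above.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists (ranges from -1 to +1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0   # one variable does not vary at all
    return cov / (sx * sy)

# Hypothetical advertising spend vs. sales: as one increases, so does the other
spend = [10, 20, 30, 40, 50]
sales = [12, 25, 31, 45, 48]
print(correlation(spend, sales))   # close to +1 (strong positive correlation)
```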

Constraint-Based Association Mining


A data mining procedure can uncover thousands of rules from a given data set, most of which end up being uninteresting or redundant to the users. Users often have a good sense of which "direction" of mining may lead to interesting patterns and of the "form" of the patterns or rules they would like to discover.
Therefore, a good heuristic is to have the users specify such intuition or expectations as constraints that restrict the search space. This strategy is called constraint-based mining.
Constraint-based algorithms use constraints to reduce the search space in the frequent itemset generation step (the association rule generation step is the same as in exhaustive algorithms). The most common constraint is the minimum support threshold.
The importance of constraints is that they produce only association rules that are interesting to users; the method is straightforward, and the rule space is reduced so that the remaining rules satisfy the constraints.
Constraint-based clustering discovers clusters that satisfy user-defined preferences or constraints. Depending on the characteristics of the constraints, constraint-based clustering may adopt different approaches.
The constraints can include the following:
Knowledge type constraints − These specify the type of knowledge to be mined, such as association or correlation.
Data constraints − These specify the set of task-relevant data to be mined.
Dimension/level constraints − These specify the desired dimensions (or attributes) of the data, or levels of the concept hierarchies, to be used in mining.
Interestingness constraints − These specify thresholds on statistical measures of rule interestingness, such as support, confidence, and correlation.
Rule constraints − These specify the form of rules to be mined. Such constraints may be expressed as metarules (rule templates), as the maximum or minimum number of predicates that can occur in the rule antecedent or consequent, or as relationships among attributes, attribute values, and/or aggregates.
These constraints can be specified using a high-level declarative data mining query language and user interface. This form of constraint-based mining allows users to describe the rules that they would like to uncover, thereby making the data mining process more effective.
Furthermore, a sophisticated mining query optimizer can be used to exploit the constraints specified by the user, thereby making the mining process more efficient.
Constraint-based mining encourages interactive exploratory mining and analysis.
1. Metarule-Guided Mining of Association Rules
“How are metarules useful?” Metarules allow users to specify the syntactic form
of rules that they are interested in mining. The rule forms can be used as
constraints to help improve the efficiency of the mining process. Metarules may
be based on the analyst’s experience, expectations, or intuition regarding the
data or may be automatically generated based on the database schema.

A metarule can be used to specify the form of rules you are interested in finding. An example of such a metarule is

P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "office software")

where P1 and P2 are predicate variables that are instantiated to attributes from the
given database during the mining process, X is a variable representing a customer,
and Y and W take on values of the attributes assigned to P1 and P2, respectively.
Typically, a user will specify a list of attributes to be considered for instantiation
with P1 and P2. Otherwise, a default set may be used.

2. Constraint Pushing: Mining Guided by Rule Constraints

Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant initiation of variables, and aggregate functions. Users typically employ their knowledge of the application or data to specify rule constraints for the mining task. These rule constraints may be used together with, or as an alternative to, metarule-guided mining. Such constraints can be expressed in the DMQL data mining query language.

UNIT – II (DATA MINING)

Association Rule Mining:


Mining Frequent Patterns–Associations and correlations – Mining
Methods –Mining various kinds of Association Rules– Correlation
Analysis– Constraint based Association mining. Graph Pattern Mining,
SPM.

Basic Concepts:
Frequent Item sets, Closed Item sets, and Association Rules, Frequent
Item set Mining Methods: Apriori Algorithm, Generating Association
Rules from Frequent Item sets, A Pattern-Growth Approach for Mining
Frequent Item sets.
Market Basket Analysis: A Motivating Example
• Frequent item set mining leads to the discovery of associations
and correlations among items in large transactional data sets.
• The discovery of interesting correlation relationships among huge
amounts of business transaction records can help in many
business decision-making processes.
• A typical example of frequent item set mining is market basket
analysis.
• This process analyzes customer buying habits by finding
associations between the different items that customers place in
their “shopping baskets”

Market basket analysis

• Analyze the buying patterns that reflect items that are frequently purchased together.
• These patterns can be represented in the form of Association rules.
• Rule form: “A => B [support, confidence]”.
• For example, the information that customers who purchase computers also tend to buy software at the same time is represented as:

Computer=> financial_management_software
[ support=2%, confidence= 60%]
• Typically, association rules are considered interesting if they
satisfy both a minimum support threshold and a minimum
confidence threshold such thresholds can be set by the users or
domain experts.
Frequent Item sets, Closed Item sets, and Association Rules:
• Let I be a set of items {I1, I2, I3, …, Im}, and let D be a set of database transactions where each transaction T is a set of items such that T ⊆ I.
• Each transaction is associated with an identifier, called TID.
• Let A be a set of items. A transaction T is said to contain A if and
only if A ⊆ T.
• An association rule is an implication of the form A ⇒ B

Where A ⊆ I, B ⊆ I, and A∩B= Φ.

• The rule A=> B holds in the transaction set D with support s,


where s is the percentage of transactions in D that contain A ∪ B (i.e., both A and B).
• This is taken to be the probability, P(A ∪ B).
• The rule A=>B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B|A).
• That is, Support (A=>B) = P(A ∪ B).
• Confidence (A=>B) = P(B|A).
• For example, for a rule milk => cheese on a small transaction set: a support of 67% says that 67% of the customers purchased both milk and cheese.
• A confidence of 100% says that 100% of the customers who bought milk also bought cheese.

Strong Association Rules : Rules that satisfy both a minimum support


threshold (min_sup) and a minimum confidence threshold (min_conf) are called
strong.
• By convention, we write support and confidence values as percentages between 0% and 100% rather than as numbers between 0 and 1.0.
Item set: A set of items is referred to as an item set. An item set that
contains k items is a k-item set.
Ex: The set {computer, software} is a 2-item set.
Support count: The frequency of an item set is the number of transactions that contain the item set.
This is also known as the frequency, support count, or count of the item set.
Frequent item set: If an item set satisfies minimum support, then it
is a frequent item set .
The set of frequent k-item sets is commonly denoted by Lk.
• In general, association rule mining can be viewed as a two-step process:
1. Find all frequent item sets: By definition, each of these item sets should
satisfy minimum support count, min sup.
2. Generate strong association rules from the frequent itemsets: By
definition, these rules must satisfy minimum support and minimum confidence
Support & confidence:
• Support(s) of an association rule is defined as the percentage of
records that contain A U B to the total number of records in the
database.
Support (A->B) = P(A ∪ B) = support_count(A ∪ B) / total number of transactions in D
• Confidence of an association rule is defined as the percentage of the number of transactions that contain A ∪ B to the total number of records that contain A.
Confidence (A->B) = P(B|A) = support(A ∪ B) / support(A)
= support_count(A ∪ B) / support_count(A)


Example : Data set D

Support count, Support and Confidence:
Support count(1,3) = 2
|D| = 4
Support (1->3)=2/4 = 0.5
Support (3->2)=2/4 =0.5
Confidence (3->2)= count(2 U 3) / count(3)
= 2/3
= 0.67
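The data set D itself is shown only as a figure in the original; the four transactions in the Python sketch below are hypothetical but reproduce the counts used above (|D| = 4, support_count(1,3) = 2, support_count(3) = 3), so the sketch recomputes the same support and confidence values.

```python
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   # hypothetical transactions

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def support(a, b, transactions):
    return support_count(a | b, transactions) / len(transactions)

def confidence(a, b, transactions):
    return support_count(a | b, transactions) / support_count(a, transactions)

print(support({1}, {3}, D))       # 2/4 = 0.5
print(support({3}, {2}, D))       # 2/4 = 0.5
print(confidence({3}, {2}, D))    # 2/3 ≈ 0.67
```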

• Closed Item set: An item set is closed if none of its immediate supersets has the same support count as the item set.

• A maximal frequent item set is a frequent item set for which none of its immediate supersets is frequent.

Frequent Item set Mining Methods

Apriori Algorithm:
Finding Frequent Itemsets by Confined Candidate Generation

• The name of the algorithm is based on the fact that the


algorithm uses prior knowledge of frequent item set properties.

• Uses an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.

• First, the set of frequent 1-itemsets is found.

• This set is denoted by L1. L1 is used to find L2 , the set of


frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.

• The finding of each LK requires one full scan of the database.

• Apriori property: All subsets of a frequent itemset must also be


Frequent.

4
SBIT CSE DM _____________________________________________________________________

The Apriori property is based on

• If an itemset I does not satisfy the minimum support threshold (min_sup), then I is not frequent.
• If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either.
• This property belongs to a special category of properties called anti-monotonic: if a set cannot pass a test, all of its supersets will fail the same test as well.
• Using the apriori property in the algorithm:
• Let us look at how Lk-1 is used to find Lk, for k>=2

Two steps:

• Join: Ck is generated by joining Lk-1 with itself.
• Prune: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
• Scanning the database to determine the count of each candidate in Ck.
• To reduce the size of Ck, the Apriori property is used: if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent, so it can be removed from Ck.

Example: Transactional data for an All Electronics branch

TID List of items


T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1,I2,I4
T500 I1,I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1,I2,I3

• Generation of the candidate itemsets and frequent itemsets,where the


minimum support count is 2.


The generation of candidate item sets and frequent item sets, with minimum support count 2, proceeds as follows:
• Scan D for count of each candidate
• C1: I1 – 6, I2 – 7, I3 -6, I4 – 2, I5 - 2
• Compare candidate support count with minimum support count
(min_sup=2)

• L1: I1 – 6, I2 – 7, I3 -6, I4 – 2, I5 - 2

• Generate C2 candidates from L1 and scan D for count of each

candidate

• C2: {I1,I2} – 4, {I1, I3} – 4, {I1, I4} – 1, …

• Compare candidate support count with minimum support count

• L2: {I1,I2} – 4, {I1, I3} – 4, {I1, I5} – 2, {I2, I3} – 4, {I2, I4} - 2, {I2, I5} – 2

• Generate C3 candidates from L2 using the join and prune steps:

• Join: C3 = L2 x L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}

• Prune: C3: {I1, I2, I3}, {I1, I2, I5}

• Scan D for count of each candidate

• C3: {I1, I2, I3} - 2, {I1, I2, I5} – 2


• Compare candidate support count with minimum support count

• L3: {I1, I2, I3} – 2, {I1, I2, I5} – 2

• Generate C4 candidates from L3

• C4=L3xL3={I1, I2, I3, I5}

• This itemset is pruned, because its subset {{I2, I3, I5}} is not

frequent => C4=null
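The sketch below is a minimal Python rendition of the level-wise search on the AllElectronics transactions listed above (minimum support count 2). It is an illustrative simplification, not the notes' own implementation: the join step simply unions pairs of frequent k-itemsets instead of the prefix-based join of the formal algorithm.

```python
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
MIN_SUP = 2

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def apriori():
    """Level-wise search: find L1, then L2, ... until no frequent k-itemset remains."""
    items = sorted({i for t in transactions for i in t})
    Lk = {frozenset([i]) for i in items if support_count(frozenset([i])) >= MIN_SUP}
    frequent = {}
    k = 1
    while Lk:
        frequent.update({s: support_count(s) for s in Lk})
        # Join: build (k+1)-item candidates from pairs of frequent k-itemsets
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent (Apriori property)
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Support counting: one scan of the database per level
        Lk = {c for c in candidates if support_count(c) >= MIN_SUP}
        k += 1
    return frequent

for itemset, count in sorted(apriori().items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)   # reproduces L1, L2 and L3 from the walkthrough above
```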


Generating association rules from frequent itemsets


Generating strong association rules:
• For each frequent itemset l, generate all nonempty subsets of l.
• For every nonempty subset s of l, output the rule s => (l − s) if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
• Confidence (A=>B) = P(B|A) = support_count(A ∪ B) / support_count(A)
• support_count(A ∪ B) – number of transactions containing the itemset A ∪ B
• support_count(A) – number of transactions containing the itemset A
• Generating association rules: let's try an example based on the transactional data for AllElectronics.
• The data contain the frequent itemset X = {I1, I2, I5}.
• What are the association rules that can be generated from X?
• The nonempty subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.
• If the minimum confidence threshold is 70%, then only the second, third, and last of the resulting rules are strong (the rules and their confidences are enumerated in the sketch below).
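As a concrete sketch (Python, reusing the transaction table from the Apriori example above), the rules derivable from X = {I1, I2, I5} and their confidences can be enumerated as follows; only the rules reaching 70% confidence are reported as strong.

```python
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(l, min_conf):
    """For every nonempty proper subset s of l, emit s => (l - s) with its confidence."""
    l = frozenset(l)
    for r in range(1, len(l)):
        for subset in combinations(sorted(l), r):
            s = frozenset(subset)
            conf = support_count(l) / support_count(s)
            tag = "strong" if conf >= min_conf else "not strong"
            print(f"{sorted(s)} => {sorted(l - s)}  confidence = {conf:.0%} ({tag})")

rules_from_itemset({"I1", "I2", "I5"}, min_conf=0.70)
```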


Drawbacks of Apriori

• It still needs to generate a huge number of candidate sets.
• It needs to repeatedly scan the whole database.

A Pattern-Growth Approach for Mining Frequent Itemsets

Two Steps:
• Scan the transaction DB for the first time, find frequent items (single
item) and order them into a list L in descending order.
- In the format of (item-name, support)
• For each transaction, order its frequent items according to the order
in L; Scan DB the second time, construct FP-tree by putting each
frequency ordered transaction onto it.

Example:
TID List of items

T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1,I2,I4
T500 I1,I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1,I2,I3

• The first scan of the database is the same as Apriori, which


derives the set of frequent (1-itemsets) and their support counts.

• Let the minimum support count be 2.

• The set of frequent items is sorted in the order of descending


support count.

• This resulting set or list is denoted by L.


Example: each transaction's items reordered according to L (descending support count):

TID     List of items       Ordered frequent items
T100    I1,I2,I5            I2,I1,I5
T200    I2,I4               I2,I4
T300    I2,I3               I2,I3
T400    I1,I2,I4            I2,I1,I4
T500    I1,I3               I1,I3
T600    I2,I3               I2,I3
T700    I1,I3               I1,I3
T800    I1,I2,I3,I5         I2,I1,I3,I5
T900    I1,I2,I3            I2,I1,I3
FP-Tree Definition:
• FP-tree is a frequent pattern tree.
1. One root labeled as “null"
2. Each node in the item prefix sub-trees has three fields:
• – item-name
• – count, the number of transactions represented by the portion of the
path reaching this node,
• – node-link that links to the next node in the FP-tree carrying
the same item-name.
3. Each entry in the frequent-item header table has two fields:
• – item-name, and
• – head of node-link, which points to the first node in the FP-tree carrying the item-name.
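The following Python sketch builds an FP-tree for the reordered transactions above with minimum support count 2. It is a simplification of the structure just described: the header table here keeps a plain list of the nodes carrying each item rather than chained node-links, and the recursive mining step is omitted.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}              # item-name -> child FPNode

def build_fp_tree(transactions, min_sup):
    """First scan: count items. Second scan: insert frequency-ordered transactions."""
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    order = {item: c for item, c in counts.items() if c >= min_sup}
    root = FPNode(None, None)           # the root is labeled "null"
    header = defaultdict(list)          # item-name -> nodes carrying that item
    for t in transactions:
        items = sorted((i for i in t if i in order), key=lambda i: (-order[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
root, header = build_fp_tree(transactions, min_sup=2)

# Prefix paths ending in I5 (these form I5's conditional pattern base)
for node in header["I5"]:
    path, p = [], node.parent
    while p.item is not None:
        path.append(p.item)
        p = p.parent
    print(list(reversed(path)), node.count)    # ['I2', 'I1'] 1 and ['I2', 'I1', 'I3'] 1
```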

Mining Frequent Patterns Using FP-tree


The FP-tree is mined as follows.
• Start from each frequent length-1 pattern (as an initial suffix
pattern), construct its conditional pattern base
• Then construct its (conditional) FP-tree, and perform mining
recursively on the tree.
• The pattern growth is achieved by the concatenation of the suffix
pattern with the frequent patterns generated from a conditional FP-tree.

Mining the FP-Tree by Creating Conditional (sub) pattern bases


Steps:
1. Start from each frequent length-1 pattern (as an initial suffix
Pattern).
2. Construct its conditional pattern base
3. Then, Construct its conditional FP-Tree & perform mining on such
a tree.
4. The pattern growth is achieved by concatenation of the suffix
pattern with the frequent patterns generated from a conditional
FP-Tree.

The union of all frequent patterns gives the required frequent itemset.
Now, let's start from I5. I5 is involved in 2 branches, namely
{I2 I1 I5: 1} and {I2 I1 I3 I5: 1}.
• Therefore considering I5 as suffix, its 2 corresponding prefix paths
would be {I2 I1: 1} and {I2 I1 I3: 1}, which forms its conditional
pattern base.
• Out of these, only I2 and I1 are selected in the conditional FP-tree,
because I3 does not satisfy the minimum support count.

Sequential Pattern Mining


Sequential pattern mining is a very important concept in data mining, and it is
an extension of association rule mining.
Sequential pattern mining was first introduced by Agrawal and Srikant in 1995.
“Given a set of sequences, where each sequence consists of a list of elements and
each element consists of a set of items and given a user specified min_support threshold.
Sequential pattern mining is to find all frequent subsequences, i.e., the
subsequences whose occurrence frequency in the set of sequences is no less than
min_support”.
Sequential pattern mining captures relationships between different transactions, while association rule mining captures relationships among items within the same transaction.
Association rule mining finds items that are frequently purchased together within the same transaction, while sequential pattern mining finds items that are purchased in a particular order by a single customer across several transactions.
 A sequence is an ordered list of elements (transactions)
 s = < e1 e2 e3 … >
o Each element contains a collection of events (items)
 ei = {i1, i2, …, ik}
o Each element is attributed to a specific time or location.
oLength of a sequence, |s|, is given by the number of elements of the sequence.
oA k-sequence is a sequence that contains k events (items) .
o A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n)
if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin (see the sketch after this list).
o A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ min_sup).
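A minimal Python sketch of the containment test just defined follows; the sequences are hypothetical.

```python
def is_subsequence(sub, seq):
    """True if each element (itemset) of `sub` is a subset of a later element of `seq`,
    with the order of elements preserved."""
    i = 0                                 # current position in seq
    for a in sub:
        while i < len(seq) and not set(a) <= set(seq[i]):
            i += 1
        if i == len(seq):
            return False
        i += 1                            # the next element of sub must match a strictly later element
    return True

# Hypothetical customer sequence: three transactions in time order
s = [{"a"}, {"b", "x"}, {"c"}]
print(is_subsequence([{"a"}, {"c"}], s))   # True:  <{a}{c}> is contained in s
print(is_subsequence([{"c"}, {"a"}], s))   # False: the order is not preserved
```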

 A huge number of possible sequential patterns are hidden in databases


o A mining algorithm should
 find the complete set of patterns, when possible, satisfying the minimum
support (frequency) threshold
 be highly efficient, scalable, involving only a small number of database
scans
 be able to incorporate various kinds of user-specific constraints
 Sequence: A sequence is formally defined as the ordered set of items {s1, s2, s3, …,
sn}. As the name suggests, it is the sequence of items occurring together. It can be
considered as a transaction or purchased items together in a basket.


 Subsequence: The subset of the sequence is called a subsequence. Suppose {a, b, g, q,


y, e, c} is a sequence. The subsequence of this can be {a, b, c} or {y, e}. Observe that the
subsequence is not necessarily consecutive items of the sequence. From the sequences
of databases, subsequences are found from which the generalized sequence patterns
are found at the end.

 Sequence pattern: A sub-sequence is called a pattern when it is found in multiple sequences. The database consists of the sequences. A subsequence is a sequence pattern when its frequency is equal to or greater than the "support" value. For example, the pattern <a, b> is a sequence pattern mined from the sequences {b, x, c, a}, {a, b, q}, and {a, u, b}.

 Given n events: i1, i2, i3, …, in


 Candidate 1-subsequences:
<{i1}>, <{i2}>, <{i3}>, …, <{in}>
 Candidate 2-subsequences:
<{i1, i2}>, <{i1, i3}>, …, <{i1} {i1}>, <{i1} {i2}>, …, <{in-1} {in}>
 Candidate 3-subsequences:
<{i1, i2 , i3}>, <{i1, i2 , i4}>, …, <{i1, i2} {i1}>, <{i1, i2} {i2}>, …,
<{i1} {i1 , i2}>, <{i1} {i1 , i3}>, …, <{i1} {i1} {i1}>, <{i1} {i1} {i2}>, …


Sequential Pattern (GSP) Mining uses:

Sequential pattern mining, also known as GSP (Generalized Sequential Pattern)


mining, is a technique used to identify patterns in sequential data.
The goal of GSP mining is to discover patterns in data that occur over time, such
as customer buying habits, website navigation patterns, or sensor data.


SPADE (Sequential PAttern Discovery using Equivalence classes), for discovering the set
of all frequent sequences. The key features of our approach are as follows:
1. We use a vertical id-list database format, where we associate with each sequence a list
of objects in which it occurs, along with the time-stamps. We show that all frequent
sequences can be enumerated via simple temporal joins (or intersections) on id-lists.
2. We use a lattice-theoretic approach to decompose the original search space (lattice)
into smaller pieces (sub-lattices) which can be processed independently in main-memory. Our approach usually requires three database scans, or only a single scan with
some pre-processed information, thus minimizing the I/O costs.
3. We decouple the problem decomposition from the pattern search. We propose two
different search strategies for enumerating the frequent sequences within each
sublattice: breadth-first search and depth-first search.
SPADE not only minimizes I/O costs by reducing database scans, but also minimizes
computational costs by using efficient search schemes.
SPADE scales linearly in the database size, and a number of other database parameters.


The sequential pattern mining problem was first addressed by


Agrawal and Srikant [1995] .

They considered a given sequential database in which each sequence
consists of a list of transactions.

All these transactions are ordered by transaction time and each


transaction is a set of items.

Sequential pattern mining is made in order to discover all


sequential patterns based on user-defined minimum support.

The support of a pattern is calculated as the number of data-sequences that contain the pattern.

Sequential Pattern Mining is a well-known data mining technique which consists of finding sub-sequences (patterns) that appear very often in a given set of sequences.

The PrefixSpan algorithm, proposed by Jian Pei et al., is widely used to find sequential patterns.

It avoids huge candidate sequence generation, thus improving execution time and memory utilization.


A pattern-growth method based on projection is used in


Prefix Span algorithm for mining sequential patterns.

The basic idea behind this method is, rather than projecting
sequence databases by evaluating the frequent occurrences of
sub-sequences, the projection is made on frequent prefix.

This helps to reduce the processing time which ultimately increases


the algorithm efficiency.

The Prefix Span algorithm is run on different datasets and results are
drawn based on minimum support value. One new parameter maximum
prefix length is also considered while running the algorithm.


Data mining includes different techniques of data analysis. The main aim is to find hidden patterns that are useful in a given large data set.
Graph mining has gained much attention and has become a unique approach in which the data to be mined are represented as graphs.
In graph mining, a graph structure is significant for displaying complex structures, particularly the interactions associated with them.

Graph Mining is the set of tools and techniques used to


(a) Analyze the properties of real-world graphs
(b) Predict how the structure and properties of a given graph might affect
some application
(c) Develop models that can generate realistic graphs that match the
patterns found in real-world graphs of interest.

Graph Pattern Mining


 Frequent graph patterns
 Pattern summarization
 Optimal graph patterns
 Graph patterns with constraints
 Approximate graph patterns
Graph Classification
 Pattern-based approach
 Decision tree
 Decision stumps
 Graph Compression
Discovering subgraphs that occur often in a graph
• Algorithm
1. Candidate generation


2. Candidate pruning
3. Support counting

UNIT-III

Classification and Prediction

Classification:
o predicts categorical class labels
o classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in
classifying new data
Prediction
models continuous-valued functions, i.e., predicts unknown or missing
values
Typical applications

o Credit approval
o Target marketing
o Medical diagnosis
o Fraud detection

Classification: Basic Concepts


Supervised learning (classification)
o Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations. New data
is classified based on the training set.
Unsupervised learning (clustering)
o The class labels of training data are unknown
o Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data

Classification vs. Numeric Prediction


Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in
classifying new data

Numeric Prediction

• models continuous-valued functions, i.e., predicts unknown or


missing values

Typical applications

• Credit/loan approval:
• Medical diagnosis: if a tumor is cancerous or benign
• Fraud detection: if a transaction is fraudulent
• Web page categorization: which category it is

Classification—A Two-Step Process

Model construction: describing a set of predetermined classes

• Each tuple/sample is assumed to belong to a predefined class, as


determined by the class label attribute
• The set of tuples used for model construction: training set
• The model is represented as classification rules, decision trees, or
mathematical formulae

Model usage: for classifying future or unknown objects


Estimate accuracy of the model

• The known label of test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set, otherwise over-fitting will
occur
Process (1): Model Construction

Process (2): Using the Model in Prediction


Issues Regarding Classification and Prediction

Data cleaning: This refers to the preprocessing of data in order to remove


or reduce noise (by applying smoothing techniques, for example) and the
treatment of missing values (e.g., by replacing a missing value with the
most commonly occurring value for that attribute, or with the most
probable value based on statistics).

Relevance analysis: Many of the attributes in the data may be redundant.


Correlation analysis can be used to identify whether any two given
attributes are statistically related.

Data transformation and reduction: The data may be transformed by


normalization, particularly when neural networks or methods involving
distance measurements are used in the learning step. Normalization
involves scaling all values for a given attribute so that they fall within a
small specified range, such as -1.0 to 1.0, or 0.0 to 1.0. In methods that
use distance measurements, this prevents attributes with initially large
ranges from outweighing attributes with initially smaller ranges.

Comparing Classification and Prediction Methods


Classification and prediction methods can be compared and evaluated
according to the following criteria:

o Accuracy
o Speed
o Robustness
o Scalability
o Interpretability
Classification by Decision Tree Induction

Decision tree
A flow-chart-like tree structure
Internal node (nonleaf node) denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution(Terminal node)
The topmost node in a tree is the root node.

Decision tree generation consists of two phases


o Tree construction
– At start, all the training examples are at the root
– Partition examples recursively based on selected attributes
o Tree pruning
– Identify and remove branches that reflect noise or outliers

A typical decision tree is shown in Figure. It represents the concept buys


computer, that is, it predicts whether a customer at AllElectronics is likely to
purchase a computer. Internal nodes are denoted by rectangles, and leaf
nodes are denoted by ovals. Some decision tree algorithms produce
only binary trees (where each internal node branches to exactly two other
nodes), whereas others can produce non binary trees.

“How are decision trees used for classification?” Given a tuple, X, for which
the associated class label is unknown, the attribute values of the tuple are
tested against the decision tree. A path is traced from the root to a leaf
node, which holds the class prediction for that tuple. Decision trees can
easily be converted to classification rules.

Decision Tree Induction

The tree starts as a single node, N, representing the training tuples


in D (step 1)
If the tuples in D are all of the same class, then node N becomes a leaf and
is labeled with that class (steps 2 and 3). Note that steps 4 and 5 are
terminating conditions. All of the terminating conditions are explained at
the end of the algorithm.
Otherwise, the algorithm calls Attribute selection method to determine the
splitting criterion. The splitting criterion tells us which attribute to test at
node N by determining the "best" way to separate or partition the tuples
in D into individual classes(step 6). The splitting criterion also tells us
which branches to grow from node N with respect to the outcomes of the
chosen test. More specifically, the splitting criterion indicates the splitting
attribute and may also indicate either a split-point or a splitting subset.
The splitting criterion is determined so that, ideally, the resulting partitions
at each branch are as "pure" as possible.
A partition is pure if all of the tuples in it belong to the same class. In other
words, if we were to split up the tuples in D according to the mutually
exclusive outcomes of the splitting criterion, we hope for the resulting
partitions to be as pure as possible.
The node N is labeled with the splitting criterion, which serves as a test at
the node (step 7). A branch is grown from node N for each of the outcomes
of the splitting criterion. The tuples in D are partitioned accordingly (steps
10 to 11). There are three possible scenarios, as illustrated in Figure.
Let A be the splitting attribute. A has v distinct values, {a1, a2, : : : , av},
based on the training data.
Attribute Selection Measures

An attribute selection measure is a heuristic for selecting the splitting


criterion that "best" separates a given data partition, D, of class-labeled
training tuples into individual classes. If we were to split D into smaller
partitions according to the outcomes of the splitting criterion, If the
splitting attribute is continuous-valued or if we are restricted to binary
trees then, respectively, either a split point or a splitting subset must also be
determined as part of the splitting criterion This section describes three
popular attribute selection measures—information gain, gain ratio,
and the Gini index.

Information gain: ID3 uses information gain as its attribute selection measure.
Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes) and
the new requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) − Info_A(D)

In other words, Gain(A) tells us how much would be gained by branching


on A. It is the expected reduction in the information requirement caused by
knowing the value of A. The attribute A with the highest information gain,
(Gain(A)), is chosen as the splitting attribute at node N.
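A minimal Python sketch of the information gain computation follows. The tuples here are a small hypothetical slice in the spirit of the buys_computer data, not Table 6.1 itself.

```python
import math
from collections import Counter

def info(labels):
    """Expected information (entropy) needed to classify a tuple: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Gain(A) = Info(D) - Info_A(D), weighting each partition by its relative size."""
    labels = [r[target] for r in rows]
    gain = info(labels)
    for value in {r[attr] for r in rows}:
        part = [r[target] for r in rows if r[attr] == value]
        gain -= len(part) / len(rows) * info(part)
    return gain

rows = [
    {"age": "youth",  "student": "no",  "buys": "no"},
    {"age": "youth",  "student": "yes", "buys": "yes"},
    {"age": "middle", "student": "no",  "buys": "yes"},
    {"age": "senior", "student": "no",  "buys": "no"},
    {"age": "senior", "student": "yes", "buys": "yes"},
]
print(info_gain(rows, "age", "buys"))       # ~0.17
print(info_gain(rows, "student", "buys"))   # ~0.42 -> student would be chosen here
```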

Example: Induction of a decision tree using information gain.

Table 6.1 presents a training set, D, of class-labeled tuples randomly


selected from the AllElectronics customer database. (The data are adapted
from [Qui86]. In this example, each attribute is discrete-valued.
Continuous-valued attributes have been generalized.) The class label
attribute, buys computer, has two distinct values (namely, {yes, no});
therefore, there are two distinct classes (that is, m = 2). Let class C1
correspond to yes and class C2 correspond to no. There are nine tuples of
class yes and five tuples of class no. A (root) node N is created for the tuples
in D. To find the splitting criterion for these tuples, we must compute the
information gain of each attribute. We first use Equation (6.1) to compute
the expected information needed to classify a tuple in D:

Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits

The expected information needed to classify a tuple in D if the tuples are
partitioned according to age is Info_age(D) = 0.694 bits.

Hence, the gain in information from such a partitioning would be
Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits.

Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151


bits, and Gain(credit rating) = 0.048 bits. Because age has the highest
information gain among the attributes, it is selected as the splitting
attribute. Node N is labeled with age, and branches are grown for each of
the attribute’s values. The tuples are then partitioned accordingly, as
shown in Figure 6.5. Notice that the tuples falling into the partition for age
= middle aged all belong to the same class. Because they all belong to
class “yes,” a leaf should therefore be created at the end of this branch and
labeled with “yes.” The final decision tree returned by the algorithm is
shown in Figure 6.5.

Tree Pruning

Introduction: When a decision tree is built, many of the branches will


reflect anomalies in the training data due to noise or outliers. Tree pruning
methods address this problem of over fitting the data. Such methods
typically use statistical measures to remove the least reliable branches. An
un-pruned tree and a pruned version of it are shown in Figure 6.6. Pruned
trees tend to be smaller and less complex and, thus, easier to comprehend.
They are usually faster and better at correctly classifying independent test
data (i.e., of previously unseen tuples) than un-pruned trees.

“How does tree pruning work?” There are two common approaches to tree
pruning: pre pruning and post pruning.

In the pre pruning approach, a tree is “pruned” by halting its construction


early (e.g., by deciding not to further split or partition the subset of training
tuples at a given node).
Upon halting, the node becomes a leaf. The leaf may hold the most frequent
class among the subset tuples or the probability distribution of those
tuples.
When constructing a tree, measures such as statistical significance,
information gain, Gini index, and so on can be used to assess the goodness
of a split. If partitioning the tuples at a node would result in a split that falls
below a pre specified threshold, then further partitioning of the given subset
is halted. There are difficulties, however, in choosing an appropriate
threshold. High thresholds could result in oversimplified trees, whereas low
thresholds could result in very little simplification.

The second and more common approach is post pruning, which removes
subtrees from a “fully grown” tree. A subtree at a given node is pruned by
removing its branches and replacing it with a leaf. The leaf is labeled with
the most frequent class among the subtree being replaced. For example,
notice the subtree at node “A3?” in the un pruned tree of Figure 6.6.
Suppose that the most common class within this subtree is “class B.” In the
pruned version of the tree, the sub tree in question is pruned by replacing
it with the leaf “class B.”

Bayesian Classification

“What are Bayesian classifiers?” Bayesian classifiers are statistical


classifiers. They can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.
Bayesian classification is based on Bayes' theorem. A simple Bayesian
classifier is known as the naïve Bayesian classifier. Bayesian classifiers have
also exhibited high accuracy and speed when applied to large databases.

1. Bayes’ Theorem

Let X be a data tuple. In Bayesian terms, X is considered "evidence." As


usual, it is described by measurements made on a set of n attributes.
Let H be some hypothesis, such as that the data tuple X belongs to a
specified class C. For classification problems, we want to determine P(H/X),
the probability that the hypothesis H holds given the "evidence" or
observed data tuple X. In other words, we are looking for the probability
that tuple X belongs to class C, given that we know the attribute
description of X.

“How are these probabilities estimated?” P(H), P(X/H), and P(X) may be
estimated from the given data, as we shall see below. Bayes’ theorem is
useful in that it provides a way of calculating the posterior
probability, P(H/X), from P(H), P(X/H), and P(X).
Bayes' theorem is

P(H|X) = P(X|H) P(H) / P(X)
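A minimal numeric sketch of applying Bayes' theorem follows; the hypothesis, evidence, and probabilities are hypothetical.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical values: H = "customer buys a computer", X = the observed attribute description
p_h = 0.40            # P(H): prior probability that a customer buys a computer
p_x_given_h = 0.30    # P(X|H): probability of the evidence among buyers
p_x = 0.20            # P(X): probability of the evidence over all customers
print(posterior(p_h, p_x_given_h, p_x))   # P(H|X) = 0.30 * 0.40 / 0.20 = 0.6
```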
2. Naïve Bayesian Classification
Bayesian Belief Networks
A belief network is defined by two components—a directed acyclic
graph and a set of conditional probability tables (Figure 6.11). Each node in
the directed acyclic graph represents a random variable. The variables may
be discrete or continuous-valued. They may correspond to actual attributes
given in the data or to "hidden variables" believed to form a relationship
(e.g., in the case of medical data, a hidden variable may indicate a
syndrome, representing a number of symptoms that, together, characterize
a specific disease). Each arc represents a probabilistic dependence. If an
arc is drawn from a node Y to a node Z, then Y is a parent or immediate
predecessor of Z, and Z is a descendant of Y. Each variable is conditionally
independent of its non descendants in the graph, given its parents.

A belief network has one conditional probability table (CPT) for each
variable. The CPT for a variable Y specifies the conditional
distribution P(Y | Parents(Y)), where Parents(Y) are the parents of Y. Figure (b)
shows a CPT for the variable LungCancer. The conditional probability for
each known value of LungCancer is given for each possible combination of
values of its parents. For instance, from the upper leftmost and bottom
rightmost entries, respectively, we see that
Let X = (x1, : : : , xn) be a data tuple described by the variables or
attributes Y1, : : : , Yn, respectively. Recall that each variable is
conditionally independent of its non descendants in the network graph,
given its parents. This allows the network to provide a complete
representation of the existing joint probability distribution with the
following equation:

P(x1, …, xn) = ∏ i=1..n P(xi | Parents(Yi))

Rule Based Classification

Using IF-THEN Rules for Classification

Represent the knowledge in the form of IF-THEN rules

R: IF age = youth AND student = yes THEN buys_computer = yes

Rule antecedent/precondition vs. rule consequent


Assessment of a rule: coverage and accuracy

o ncovers = # of tuples covered by R
o ncorrect = # of tuples correctly classified by R
o coverage(R) = ncovers / |D|   /* D: training data set */
o accuracy(R) = ncorrect / ncovers
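A minimal Python sketch of computing coverage and accuracy for the IF-THEN rule R shown above, on a few hypothetical tuples:

```python
def coverage_and_accuracy(condition, predicted_class, data, class_attr):
    """coverage(R) = ncovers / |D|;  accuracy(R) = ncorrect / ncovers."""
    covered = [t for t in data if condition(t)]
    if not covered:
        return 0.0, 0.0
    correct = [t for t in covered if t[class_attr] == predicted_class]
    return len(covered) / len(data), len(correct) / len(covered)

# Hypothetical training tuples
D = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "yes", "buys_computer": "no"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]
R = lambda t: t["age"] == "youth" and t["student"] == "yes"
print(coverage_and_accuracy(R, "yes", D, "buys_computer"))   # coverage 0.5, accuracy 0.5
```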
If more than one rule is triggered, need conflict resolution

• Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests).
• Class-based ordering: decreasing order of prevalence or
misclassification cost per class
• Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts

Rule Extraction from a Decision Tree

• Rules are easier to understand than large trees

• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction: the leaf
holds the class prediction
• Rules are mutually exclusive and exhaustive

Example: Rule extraction from our buys_computer decision-tree

Rule Extraction from the Training Data


Sequential covering algorithm: Extracts rules directly from training data

• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER


• Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
o Steps:

• Rules are learned one at a time


• Each time a rule is learned, the tuples covered by the rules are
removed
• The process repeats on the remaining tuples unless termination
condition, e.g., when no more training examples or when the quality
of a rule returned is below a user-specified threshold
• Compared with decision-tree induction: decision-tree induction learns a set of
rules simultaneously, whereas sequential covering learns them one at a time.

Lazy Learners (or Learning from Your Neighbors)

The classification methods discussed so far in this chapter—decision tree


induction, Bayesian classification, rule-based classification, classification
by backpropagation, support vector machines, and classification based on
association rule mining—are all examples of eager learners. Eager learners,
when given a set of training tuples, will construct a generalization (i.e.,
classification) model before receiving new (e.g., test) tuples to classify. We
can think of the learned model as being ready and eager to classify
previously unseen tuples.

k-Nearest-Neighbor Classifiers
The k-nearest-neighbor method was first described in the early 1950s. The
method is labor intensive when given large training sets, and did not gain
popularity until the 1960s when increased computing power became
available. It has since been widely used in the area of pattern recognition.

Nearest-neighbor classifiers are based on learning by analogy, that is, by


comparing a given test tuple with training tuples that are similar to it. The
training tuples are described by n attributes. Each tuple represents a point
in an n-dimensional space. In this way, all of the training tuples are stored
in an n-dimensional pattern space. When given an unknown tuple, a k-
nearest-neighbor classifier searches the pattern space for the k training
tuples that are closest to the unknown tuple. These k training tuples are
the k "nearest neighbors" of the unknown tuple.

"Closeness" is defined in terms of a distance metric, such as Euclidean
distance. The Euclidean distance between two points or tuples, say,
X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is

dist(X1, X2) = sqrt( Σ i=1..n (x1i − x2i)² )
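A minimal Python sketch of a k-nearest-neighbor classifier using this Euclidean distance follows; the training tuples and query are hypothetical (and assumed to be already normalized).

```python
import math
from collections import Counter

def euclidean(x1, x2):
    """dist(X1, X2) = sqrt(sum_i (x1i - x2i)^2)"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train, query, k=3):
    """Return the majority class among the k training tuples closest to the query tuple."""
    neighbors = sorted(train, key=lambda tl: euclidean(tl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training tuples: (normalized attribute vector, class label)
train = [((0.10, 0.20), "A"), ((0.15, 0.25), "A"), ((0.90, 0.80), "B"), ((0.85, 0.90), "B")]
print(knn_classify(train, (0.20, 0.20), k=3))   # 'A'
```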

Case-Based Reasoning
Case-based reasoning (CBR) classifiers use a database of problem solutions
to solve new problems. Unlike nearest-neighbor classifiers, which store
training tuples as points in Euclidean space, CBR stores the tuples or
"cases" for problem solving as complex symbolic descriptions. Business
applications of CBR include problem resolution for customer service help
desks, where cases describe product-related diagnostic problems. CBR has
also been applied to areas such as engineering and law, where cases are
either technical designs or legal rulings, respectively. Medical education is
another area for CBR, where patient case histories and treatments are used
to help diagnose and treat new patients.
When given a new case to classify, a case-based reasoner will first check if
an identical training case exists. If one is found, then the accompanying
solution to that case is returned. If no identical case is found, then the
case-based reasoner will search for training cases having components that
are similar to those of the new case. Conceptually, these training cases may
be considered as neighbors of
the new case. If cases are represented as graphs, this involves searching for
subgraphs that are similar to subgraphs within the new case. The case-
based reasoner tries to combine the solutions of the neighbouring training
cases in order to propose a solution for the new case. If incompatibilities
arise with the individual solutions, then backtracking to search for other
solutions may be necessary. The case-based reasoner may employ
background knowledge and problem-solving strategies in order to propose a
feasible combined solution.
Other Classification Methods

Genetic Algorithms
Genetic Algorithm: based on an analogy to biological evolution
• An initial population is created consisting of randomly generated
rules

o Each rule is represented by a string of bits


• E.g., if A1 and ¬A2 then C2 can be encoded as 100
o If an attribute has k > 2 values, k bits can be used
• Based on the notion of survival of the fittest, a new population is
formed to consist of the fittest rules and their offsprings
• The fitness of a rule is represented by its classification accuracy on a
set of training examples
• Offsprings are generated by crossover and mutation
• The process continues until a population P evolves when each rule
in P satisfies a prespecified threshold
• Slow but easily parallelizable

Rough Set Approach:


o Rough sets are used to approximately or "roughly" define equivalence
classes
A rough set for a given class C is approximated by two sets: a lower
approximation (certain to be in C) and an upper approximation (cannot be
described as not belonging to C)

• Finding the minimal subsets (reducts) of attributes for feature


reduction is NP-hard but a discernibility matrix (which stores the
differences between attribute values for each pair of data tuples) is
used to reduce the computation intensity

Figure: A rough set approximation of the set of tuples of the class C
using lower and upper approximation sets of C. The rectangular
regions represent equivalence classes.
Fuzzy Set approaches

• Fuzzy logic uses truth values between 0.0 and 1.0 to represent the
degree of membership (such as using fuzzy membership graph)
• Attribute values are converted to fuzzy values

e.g., income is mapped into the discrete categories {low, medium, high} with
fuzzy values calculated

• For a given new sample, more than one fuzzy value may apply

• Each applicable rule contributes a vote for membership in the


categories
• Typically, the truth values for each predicted category are summed,
and these sums are combined
Cluster Analysis

4.1 Cluster Analysis:


The process of grouping a set of physical or abstract objects into classes of similar objects
is called clustering.
A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters.
A cluster of data objects can be treated collectively as one group and so may be considered
as a form of data compression.
Cluster analysis tools based on k-means, k-medoids, and several other methods have also been
built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and
SAS.

4.1.1 Applications:
Cluster analysis has been widely used in numerous applications, including market research,
pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value,and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection. Applications of outlier detection include
the detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.

4.1.2 Typical Requirements Of Clustering InData Mining:


 Scalability:
Many clustering algorithms work well on small data sets containing fewer than several
hundred data objects; however, a large database may contain millions of objects. Clustering
on a sample of a given large data set may lead to biased results.
Highly scalable clustering algorithms are needed.
 Ability to deal with different types of attributes:
Many algorithms are designed to cluster interval-based (numerical) data. However,
applications may require clustering other types of data, such as binary, categorical
(nominal), and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shape:
Many clustering algorithms determine clusters based on Euclidean or Manhattan distance
measures. Algorithms based on such distance measures tend to find spherical clusters with
similar size and density.
However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
 Minimal requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to input certain parameters in cluster analysis
(such as the number of desired clusters). The clustering results can be quite sensitive to
input parameters. Parameters are often difficult to determine, especially for data sets
containing high-dimensional objects. This not only burdens users, but it also makes the
quality of clustering difficult to control.
 Ability to deal with noisy data:
Most real-world databases contain outliers or missing, unknown, or erroneous data.
Some clustering algorithms are sensitive to such data and may lead to clusters of poor
quality.
 Incremental clustering and insensitivity to the order of input records:
Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates)
into existing clustering structures and, instead, must determine a new clustering from
scratch. Some clustering algorithms are sensitive to the order of input data.
That is, given a set of data objects, such an algorithm may return dramatically different
clusterings depending on the order of presentation of the input objects.
It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.
 High dimensionality:
A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
 Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of constraints.
Suppose that your job is to choose the locations for a given number of new automatic
banking machines (ATMs) in a city. To decide upon this, you may cluster households
while considering constraints such as the city’s rivers and highway networks, and the type
and number of customers per cluster. A challenging task is to find groups of data with good
clustering behavior that satisfy specified constraints.
 Interpretability and usability:
Users expect clustering results to be interpretable, comprehensible, and usable. That is,
clustering may need to be tied to specific semantic interpretations and applications. It is
important to study how an application goal may influence the selection of clustering
features and methods.

4.2 Major Clustering Methods:


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods

4.2.1 Partitioning Methods:


A partitioning method constructs k partitions of the data, where each partition represents a
cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the
following requirements:
Each group must contain at least one object, and
Each object must belong to exactly one group.

A partitioning method creates an initial partitioning. It then uses an iterative relocation


technique that attempts to improve the partitioning by moving objects from one group to
another.

The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.

4.2.2 Hierarchical Methods:


A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.

 The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.
 The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices.

There are two approachesto improving the quality of hierarchical clustering:

 Perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or
 Integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation.
4.2.3 Density-based methods:
 Most partitioning methods cluster objects based on the distance between objects. Such
methods can find only spherical-shaped clusters and encounter difficulty at discovering
clusters of arbitrary shapes.
 Other clustering methods have been developed based on the notion of density. Their
general idea is to continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold; that is, for each data point within a given
cluster, the neighborhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise (outliers)and discover clusters of
arbitrary shape.
 DBSCAN and its extension, OPTICS, are typical density-based methods that
growclusters according to a density-based connectivity analysis. DENCLUE is a
methodthat clusters objects based on the analysis of the value distributions of density
functions.
4.2.4 Grid-Based Methods:
 Grid-based methods quantize the object space into a finite number of cells that form a
grid structure.
 All of the clustering operations are performed on the grid structure i.e., on the quantized
space. The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the number
of cells in each dimension in the quantized space.
 STING is a typical example of a grid-based method. Wave Cluster applies wavelet
transformation for clustering analysis and is both grid-based and density-based.

4.2.5 Model-Based Methods:


 Model-based methods hypothesize a model for each of the clusters and find the best fit
of the data to the given model.
 A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
 It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods.

4.3 Tasks in Data Mining:


 Clustering High-Dimensional Data
 Constraint-Based Clustering
4.3.1 Clustering High-Dimensional Data:
It is a particularly important task in cluster analysis because many applications
require the analysis of objects containing a large number of features or dimensions.
For example, text documents may contain thousands of terms or keywords as
features, and DNA micro array data may provide information on the expression
levels of thousands of genes under hundreds of conditions.
Clustering high-dimensional data is challenging due to the curse of dimensionality.
Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. Therefore, a different clustering methodology needs to be developed for high-dimensional data.
CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces of the data, rather than over the entire data space.
Frequent pattern-based clustering, another clustering methodology, extracts distinct frequent patterns among subsets of dimensions that occur frequently. It uses such patterns to group objects and generate meaningful clusters.

4.3.2 Constraint-Based Clustering:


It is a clustering approach that performs clustering by incorporation of user-specified
or application-oriented constraints.
A constraint expresses a user’s expectation or describes properties of the desired
clustering results, and provides an effective means for communicating with the
clustering process.
Various kinds of constraints can be specified, either by a user or as per application
requirements.
Spatial clustering deals with the existence of obstacles and with clustering under user-specified constraints. In addition, semi-supervised clustering employs pairwise constraints in order to improve the quality of the resulting clustering.

4.4 Classical Partitioning Methods:


The mostwell-known and commonly used partitioningmethods are
 The k-Means Method
 k-Medoids Method
4.4.1 Centroid-Based Technique: The K-Means Method:
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
Cluster similarity is measured in regard to the mean value of the objects in a cluster, which
can be viewed as the cluster’s centroid or center of gravity.
The k-means algorithm proceeds as follows.
First, it randomly selects k of the objects, each of which initially represents a cluster
mean or center.
For each of the remaining objects, an object is assigned to the cluster to which it is the
most similar, based on the distance between the object and the cluster mean.
It then computes the new mean for each cluster.
This process iterates until the criterion function converges.

Typically, the square-error criterion is used, defined as

E = Σ_{i=1..k} Σ_{p ∈ Ci} |p − mi|²

where E is the sum of the square error for all objects in the data set,
p is the point in space representing a given object, and
mi is the mean of cluster Ci.

4.4.1 The k-means partitioning algorithm:


The k-means algorithm for partitioning, where each cluster’s center is represented by the mean
value of the objects in the cluster.
Clustering of a set of objects based on the k-means method
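The following is a minimal Python sketch of the k-means procedure just described (random initial means, assignment to the closest mean, recomputation of means). It assumes Euclidean distance and a fixed number of iterations instead of a full convergence test; the sample points are illustrative.

import random

def kmeans(points, k, iters=20):
    means = random.sample(points, k)                  # randomly select k initial cluster means
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # assign each object to the most similar (closest) mean
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            clusters[i].append(p)
        for i, c in enumerate(clusters):              # compute the new mean for each cluster
            if c:
                means[i] = tuple(sum(vals) / len(c) for vals in zip(*c))
    return means, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
means, clusters = kmeans(points, k=2)
print(means)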

4.4.2 The k-Medoids Method:

The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data. This effect is particularly
exacerbated due to the use of the square-error function.
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

E = Σ_{j=1..k} Σ_{p ∈ Cj} |p − oj|

where E is the sum of the absolute error for all objects in the data set,
p is the point in space representing a given object in cluster Cj, and
oj is the representative object of Cj.


The initial representative objects are chosen arbitrarily. The iterative process of replacing representative objects by nonrepresentative objects continues as long as the quality of the resulting clustering is improved.
This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster.
To determine whether a nonrepresentative object, o_random, is a good replacement for a current representative object, oj, the following four cases are examined for each of the nonrepresentative objects.

Case 1:

p currently belongs to representative object oj. If oj is replaced by o_random as a representative object and p is closest to one of the other representative objects, oi, i ≠ j, then p is reassigned to oi.

Case 2:

p currently belongs to representative object oj. If oj is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.

Case 3:

p currently belongs to representative object oi, i ≠ j. If oj is replaced by o_random as a representative object and p is still closest to oi, then the assignment does not change.

Case 4:

p currently belongs to representative object oi, i ≠ j. If oj is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.
Four cases of the cost function for k-medoids clustering

4.4.2 Thek-MedoidsAlgorithm:

The k-medoids algorithm for partitioning based on medoid or central objects.


The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
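A minimal PAM-style Python sketch of the k-medoids idea: actual objects serve as representatives, and a swap between a medoid and a nonrepresentative object is kept only if it lowers the absolute-error cost. Euclidean distance, exhaustive swap evaluation and the sample points are assumptions made here for illustration, so this is only suitable for tiny data sets.

from itertools import product
from math import dist

def total_cost(points, medoids):
    # Absolute-error criterion: sum of distances to the nearest representative object.
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k):
    medoids = list(points[:k])                        # arbitrary initial representative objects
    improved = True
    while improved:
        improved = False
        for m, o in product(list(medoids), points):
            if o in medoids:
                continue
            candidate = [o if x == m else x for x in medoids]
            if total_cost(points, candidate) < total_cost(points, medoids):
                medoids = candidate                   # keep the swap only if it improves the clustering
                improved = True
    return medoids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (100, 100)]   # (100, 100) acts as an outlier
print(k_medoids(points, k=2))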

4.5 Hierarchical Clustering Methods:

A hierarchical clustering method works by grouping data objects into a tree of clusters.
The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. That is, if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it.

Hierarchical clustering methods can be further classified as either agglomerative or divisive,


depending on whether the hierarchical decomposition is formed in a bottom-up or top-down
fashion.

4.5.1 Agglomerative hierarchical clustering:


This bottom-up strategy starts by placing each object in its own cluster and then merges
these atomic clusters into larger and larger clusters, until all of the objects are in a single
cluster or until certain termination conditions are satisfied.
Most hierarchical clustering methods belong to this category. They differ only in their
definition of intercluster similarity.
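As a concrete illustration, the bottom-up strategy can be run with SciPy's hierarchical clustering routines. This minimal sketch assumes single linkage as the intercluster similarity definition and uses illustrative sample points.

from scipy.cluster.hierarchy import linkage, fcluster

points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
# Each object starts as its own atomic cluster; the closest clusters are merged step by step.
Z = linkage(points, method="single")
# Cut the resulting tree (dendrogram) into two clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)          # e.g. [1 1 1 2 2 2]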

4.5.2 Divisive hierarchical clustering:


This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster.
It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters is obtained or the diameter of each cluster is within a certain threshold.
4.6 Constraint-Based Cluster Analysis:
Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering may adopt rather different approaches.
There are a few categories of constraints.
 Constraints on individual objects:

We can specify constraints on the objects to be clustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the set of objects to be clustered. It can easily be handled by preprocessing, after which the problem reduces to an instance of unconstrained clustering.

 Constraints on the selection of clustering parameters:

A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm, or the radius and the minimum number of points in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine tuning and processing are usually not considered a form of constraint-based clustering.
 Constraints on distance or similarity functions:

We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects. When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process.
 User-specified constraints on the properties of individual clusters:
A user may like to specify desired characteristics of the resulting clusters, which may strongly influence the clustering process.
 Semi-supervised clustering based on partial supervision:
The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different cluster). Such a constrained clustering process is called semi-supervised clustering.

4.7 Outlier Analysis:

There exist data objects that do not comply with the general behavior or model of the data.
Such data objects, which are grossly different from or inconsistent with the remaining set
of data, are called outliers.
Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information because one person's noise could be another person's signal. In other words, the outliers may be of particular interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.
It can be used in fraud detection, for example, by detecting unusual usage of credit cards or telecommunication services. In addition, it is useful in customized marketing for identifying the spending behavior of customers with extremely low or extremely high incomes, or in medical analysis for finding unusual responses to various medical treatments.

Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data. The outlier mining problem can be viewed as two subproblems:
Define what data can be considered as inconsistent in a given data set, and
Find an efficient method to mine the outliers so defined.
Types of outlier detection:
 Statistical Distribution-Based Outlier Detection
 Distance-Based Outlier Detection
 Density-Based Local Outlier Detection
 Deviation-Based Outlier Detection

4.7.1 Statistical Distribution-Based Outlier Detection:


The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers.
A statistical discordancy test examines two hypotheses:
A working hypothesis
An alternative hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,

H: oi ∈ F, where i = 1, 2, …, n
The hypothesis is retained if there is no statistically significant evidence supporting its


rejection. A discordancy test verifies whether an object, oi, is significantly large (or
small) in relation to the distribution F. Different test statistics have been proposed for use
as a discordancy test, depending on the available knowledge of the data. Assuming that
some statistic, T, has been chosen for discordancy testing, and the value of the statistic
for object oi is vi, then the distribution of T is constructed. Significance probability,
SP(vi)=Prob(T > vi), is evaluated. If SP(vi) is sufficiently small, then oi is discordant and
the working hypothesis is rejected.
An alternative hypothesis, H, which states that oi comes from another distribution model, G, is adopted. The result is very much dependent on which model F is chosen, because oi may be an outlier under one model and a perfectly valid value under another. The alternative distribution is very important in determining the power of the test, that is, the probability that the working hypothesis is rejected when oi is really an outlier.
There are different kinds of alternative distributions.
Inherent alternative distribution:
In this case, the working hypothesis that all of the objects come from distribution F is
rejected in favor of the alternative hypothesis that all of the objects arise from another
distribution, G:
H: oi ∈ G, where i = 1, 2, …, n
F and G may be different distributions or differ only in parameters of the same
distribution.
There are constraints on the form of the G distribution in that it must have potential to
produce outliers. For example, it may have a different mean or dispersion, or a longer
tail.
Mixture alternative distribution:
The mixture alternative states that discordant values are not outliers in the F population, but contaminants from some other population, G. In this case, the alternative hypothesis is

H: oi ∈ (1 − λ)F + λG, where i = 1, 2, …, n
Slippage alternative distribution:


This alternative states that all of the objects (apart from some prescribed small number)
arise independently from the initial model, F, with its given parameters, whereas the
remaining objects are independent observations from a modified version of F in which
the parameters have been shifted.
There are two basic types of procedures for detecting outliers:
Block procedures:
In this case, either all of the suspect objects are treated as outliers or all of them are accepted as consistent.
Consecutive procedures:
An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least likely to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.

4.7.2 Distance-Based Outlier Detection:


The notion of distance-based outliers was introduced to counter the main limitations imposed by statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with parameters pct and dmin, that is, a DB(pct, dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o. In other words, rather than relying on statistical tests, we can think of distance-based outliers as those objects that do not have enough neighbors, where neighbors are defined based on distance from the given object. In comparison with statistical-based methods, distance-based outlier detection generalizes the ideas behind discordancy testing for various standard distributions. Distance-based outlier detection avoids the excessive computation that can be associated with fitting the observed distribution into some standard distribution and in selecting discordancy tests.
For many discordancy tests, it can be shown that if an object, o, is an outlier according to the given test, then o is also a DB(pct, dmin)-outlier for some suitably defined pct and dmin. For example, if objects that lie three or more standard deviations from the mean are considered to be outliers, assuming a normal distribution, then this definition can be generalized by a DB(0.9988, 0.13σ)-outlier.
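A minimal brute-force Python sketch of the DB(pct, dmin) definition above; it simply checks, for every object, what fraction of the other objects lie farther than dmin away. The 2-D sample data and parameter values are illustrative, and the quadratic scan is only practical for small in-memory data sets.

from math import dist

def db_outliers(points, pct, dmin):
    outliers = []
    for o in points:
        far = sum(1 for q in points if q is not o and dist(o, q) > dmin)
        if far / (len(points) - 1) >= pct:            # at least a fraction pct lies farther than dmin
            outliers.append(o)
    return outliers

data = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
print(db_outliers(data, pct=0.9, dmin=3.0))           # reports (10, 10)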
Several efficient algorithms for mining distance-based outliers have been developed.
Index-based algorithm:
Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around that object. Let M be the maximum number of objects within the dmin-neighborhood of an outlier. Therefore, once M + 1 neighbors of object o are found, it is clear that o is not an outlier. This algorithm has a worst-case complexity of O(n²k), where n is the number of objects in the data set and k is the dimensionality. The index-based algorithm scales well as k increases. However, this complexity evaluation takes only the search time into account, even though the task of building an index in itself can be computationally intensive.
Nested-loop algorithm:
The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os. It divides the memory buffer space into two halves and the data set into several logical blocks. By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
Cell-based algorithm:
To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality.

In this method, the data space is partitioned into cells with a side length equal to dmin/(2√k). Each cell has two layers surrounding it. The first layer is one cell thick, while the second is ⌈2√k − 1⌉ cells thick, rounded up to the closest integer. The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis. For a given cell, it accumulates three counts: the number of objects in the cell, in the cell and the first layer together, and in the cell and both layers together. Let's refer to these counts as cell count, cell + 1 layer count, and cell + 2 layers count, respectively.

Let M be the maximum number of outliers that can exist in the dmin-neighborhood of an outlier.
An object, o, in the current cell is considered an outlier only if cell + 1 layer count is less than or equal to M. If this condition does not hold, then all of the objects in the cell can be removed from further investigation as they cannot be outliers.
If cell + 2 layers count is less than or equal to M, then all of the objects in the cell are considered outliers. Otherwise, if this number is more than M, then it is possible that some of the objects in the cell may be outliers. To detect these outliers, object-by-object processing is used where, for each object, o, in the cell, objects in the second layer of o are examined. For objects in the cell, only those objects having no more than M points in their dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists of the object's cell, all of its first layer, and some of its second layer.
A variation of the algorithm is linear with respect to n and guarantees that no more than three passes over the data set are required. It can be used for large disk-resident data sets, yet does not scale well for high dimensions.

4.7.3 Density-Based Local Outlier Detection:


Statistical and distance-based outlier detection both depend on the overall or global distribution of the given set of data points, D. However, data are usually not uniformly distributed. These methods encounter difficulties when analyzing data with rather different density distributions.
To define the local outlier factor of an object, we need to introduce the concepts of k-distance, k-distance neighborhood, reachability distance, and local reachability density. These are defined as follows:
The k-distance of an object p is the maximal distance that p gets from its k-nearest neighbors. This distance is denoted as k-distance(p). It is defined as the distance, d(p, o), between p and an object o ∈ D, such that for at least k objects, o' ∈ D, it holds that d(p, o') ≤ d(p, o). That is, there are at least k objects in D that are as close as or closer to p than o, and for at most k − 1 objects, o'' ∈ D, it holds that d(p, o'') < d(p, o).

That is, there are at most k − 1 objects that are closer to p than o. You may be wondering at this point how k is determined. The LOF method links to density-based clustering in that it sets k to the parameter MinPts, which specifies the minimum number of points for use in identifying clusters based on density.
Here, MinPts (as k) is used to define the local neighborhood of an object, p.
The k-distance neighborhood of an object p is denoted N_{k-distance(p)}(p), or N_k(p) for short. By setting k to MinPts, we get N_MinPts(p). It contains the MinPts-nearest neighbors of p. That is, it contains every object whose distance is not greater than the MinPts-distance of p.
The reachability distance of an object p with respect to object o (where o is within the MinPts-nearest neighbors of p) is defined as reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}.
Intuitively, if an object p is far away from o, then the reachability distance between the two is simply their actual distance. However, if they are sufficiently close (i.e., where p is within the MinPts-distance neighborhood of o), then the actual distance is replaced by the MinPts-distance of o. This helps to significantly reduce the statistical fluctuations of d(p, o) for all of the p close to o.
The higher the value of MinPts is, the more similar is the reachability distance for objects within the same neighborhood.
Intuitively, the local reachability density of p is the inverse of the average reachability distance based on the MinPts-nearest neighbors of p. It is defined as

lrd_MinPts(p) = |N_MinPts(p)| / Σ_{o ∈ N_MinPts(p)} reach_dist_MinPts(p, o)

The local outlier factor (LOF) of p captures the degree to which we call p an outlier. It is defined as

LOF_MinPts(p) = ( Σ_{o ∈ N_MinPts(p)} lrd_MinPts(o) / lrd_MinPts(p) ) / |N_MinPts(p)|

It is the average of the ratio of the local reachability density of p and those of p's MinPts-nearest neighbors. It is easy to see that the lower p's local reachability density is, and the higher the local reachability densities of p's MinPts-nearest neighbors are, the higher LOF(p) is.
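In practice, LOF scores like the ones described above can be obtained with scikit-learn's LocalOutlierFactor. A minimal sketch, assuming MinPts corresponds to the n_neighbors parameter and using illustrative sample data:

from sklearn.neighbors import LocalOutlierFactor

X = [[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9], [5.0, 5.0]]
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)               # -1 marks objects flagged as local outliers
scores = -lof.negative_outlier_factor_    # higher score => more outlier-like (LOF(p) values)
print(labels)
print(scores)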
4.7.4 Deviation-Based Outlier Detection:
Deviation-based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of objects in a group. Objects that "deviate" from this description are considered outliers. Hence, in this approach the term deviations is typically used to refer to outliers. In this section, we study two techniques for deviation-based outlier detection. The first sequentially compares objects in a set, while the second employs an OLAP data cube approach.

Sequential Exception Technique:


The sequential exception technique simulates the way in which humans can distinguish unusual objects from among a series of supposedly similar objects. It uses implicit redundancy of the data. Given a data set, D, of n objects, it builds a sequence of subsets, {D1, D2, …, Dm}, of these objects with 2 ≤ m ≤ n, such that

Dj−1 ⊂ Dj, where Dj ⊆ D
Dissimilarities are assessed between subsets in the sequence. The technique introduces the following key terms.
Exception set:
This is the set of deviations or outliers. It is defined as the smallest subset of objects whose removal results in the greatest reduction of dissimilarity in the residual set.
Dissimilarity function:
This function does not require a metric distance between the objects. It is any function that, if given a set of objects, returns a low value if the objects are similar to one another. The greater the dissimilarity among the objects, the higher the value returned by the function. The dissimilarity of a subset is incrementally computed based on the subset prior to it in the sequence. Given a subset of n numbers, {x1, …, xn}, a possible dissimilarity function is the variance of the numbers in the set, that is,

(1/n) Σ_{i=1..n} (xi − x̄)²

where x̄ is the mean of the n numbers in the set. For character strings, the dissimilarity function may be in the form of a pattern string (e.g., containing wildcard characters) that is used to cover all of the patterns seen so far. The dissimilarity increases when the pattern covering all of the strings in Dj−1 does not cover any string in Dj that is not in Dj−1.
Cardinality function:
This is typically the count of the number of objects in a given set.
Smoothing factor:
This function is computed for each subset in the sequence. Itassesses how much the
dissimilarity can be reduced by removing the subset from theoriginal set of objects.
Types Of Data In Cluster Analysis Are:
Interval-Scaled Variables
Interval-scaled variables are continuous measurements of a roughly linear scale.
Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature.
Binary Variables
A binary variable is a variable that can take only 2 values.
For example, generally, gender variables can take 2 variables male and female.
Contingency Table For Binary Data
Let us consider binary values 0 and 1. For two objects i and j, let
a = number of variables that equal 1 for both i and j,
b = number of variables that equal 1 for i and 0 for j,
c = number of variables that equal 0 for i and 1 for j,
d = number of variables that equal 0 for both i and j.
Let p = a + b + c + d.
Simple matching coefficient (invariant, if the binary variable is symmetric): d(i, j) = (b + c) / (a + b + c + d)
Jaccard coefficient (noninvariant, if the binary variable is asymmetric): d(i, j) = (b + c) / (a + b + c)
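A minimal Python sketch of the two coefficients above, assuming objects are represented as 0/1 tuples over the same set of binary variables:

def binary_dissimilarity(i, j, symmetric=True):
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    if symmetric:
        return (b + c) / (a + b + c + d)   # simple matching coefficient
    return (b + c) / (a + b + c)           # Jaccard coefficient (negative matches d are ignored)

print(binary_dissimilarity((1, 0, 1, 0), (1, 1, 0, 0)))                   # 0.5
print(binary_dissimilarity((1, 0, 1, 0), (1, 1, 0, 0), symmetric=False))  # ~0.67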
Nominal or Categorical Variables
A generalization of the binary variable in that it can take more than 2 states,
e.g., red, yellow, blue, green.
Method 1: Simple matching
The dissimilarity between two objects i and j can be computed based on simple matching:
d(i, j) = (p − m) / p
where m is the number of matches (i.e., the number of variables for which i and j are in the same state) and p is the total number of variables.
Method 2: use a large number of binary variables
Creating a new binary variable for each of the M nominal states.
Ordinal Variables
An ordinal variable can be discrete or continuous.
In this case, order is important, e.g., rank.
It can be treated like an interval-scaled variable:
replace each value xif by its rank r_if ∈ {1, …, M_f}, and
map the range of each variable onto [0, 1] by replacing the rank of the i-th object in the f-th variable with z_if = (r_if − 1) / (M_f − 1).
Then compute the dissimilarity using methods for interval-scaled variables.
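A minimal Python sketch of this rank-and-normalize treatment; the category order used here is illustrative.

def normalize_ordinal(values, order):
    ranks = {v: i + 1 for i, v in enumerate(order)}    # replace each value x_if by its rank r_if
    M = len(order)
    return [(ranks[v] - 1) / (M - 1) for v in values]  # z_if = (r_if - 1) / (M_f - 1)

print(normalize_ordinal(["bronze", "gold", "silver"], order=["bronze", "silver", "gold"]))
# [0.0, 1.0, 0.5] -- these values can now be fed to interval-scaled dissimilarity measures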
Ratio-Scaled Variables
A ratio-scaled variable is a positive measurement on a nonlinear scale, approximately at an exponential scale, such as Ae^(Bt) or Ae^(−Bt).
MULTIMEDIA MINING
Multimedia data mining is used for extracting interesting information from multimedia data sets, such as audio, video, images, graphics, speech, text and combinations of several types of data, which are all converted from different formats into digital media.
Mining in multimedia is also referred to as automatic annotation or annotation mining.
Multimedia data is a combination of a number of media objects (i.e., text, graphics, sound,
animation, video, etc.) that must be presented in a coherent, synchronized manner.
It must contain at least one discrete medium and one continuous medium.

Similarity Search in Multimedia Data: Searching for similarities in multimedia data is based on either the data description or the data content.
For similarity searching in multimedia data, we consider two main families of multimedia indexing and retrieval systems: description-based retrieval systems and content-based retrieval systems.
Mining Time-Series and Sequence Data
A time-series database consists of sequences of values or events changing with time.
The values are typically measured at equal time intervals.
Time-series databases are popular in many applications, such as stock market, production
process, scientific experiments, medical treatments, and so on.
A time-series database is also a sequence database.
A sequence database is any database that consists of sequences of ordered events, with or without concrete notions of time.
There are four major components or movements that are used to characterize time-series data.
Long-term or trend movements: These indicate the general direction in which a
time-series graph is moving over a long interval of time.
Cyclic movements or cyclic variations: These refer to the cycles, that is, the long-term oscillations. The cycles need not necessarily follow exactly similar patterns after equal intervals of time.
Seasonal movements or seasonal variations: These movements are due to events
that recur annually.
Irregular or random movements: These characterize the sporadic motion of time
series due to random or chance events, such as labor disputes, floods, or announced personnel
changes within companies.
The above trend, cyclic, seasonal and irregular movements are represented by
variables T, C, S, I, respectively.
Time-series analysis is also referred to as the decomposition of a time series into these
four basic movements.
The time-series variable Y can be modeled as either the product of the four variables (i.e., Y = T × C × S × I) or their sum.
Similarity Search in Time-Series Analysis
A similarity search finds data sequences that differ only slightly from the given query
sequence. Given a set of time-series sequences, there are two types of similarity search.
Subsequence matching finds all of the data sequences that contain subsequences similar to a given query sequence.
Whole sequence matching finds those sequences that are similar to one another.
Similarity search in time-series analysis is useful for the analysis of financial markets (e.g., stock data analysis), medical diagnosis (e.g., cardiogram analysis), and scientific or engineering databases (e.g., power consumption analysis).
Data Transformation: From Time Domain to Frequency Domain
For similarity analysis of time-series data, Euclidean distance is typically used as a
similarity measure. Many techniques for signal analysis require the data to be in the
frequency domain.
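A minimal Python sketch of moving time-series data from the time domain to the frequency domain with the discrete Fourier transform and comparing sequences there. The number of retained coefficients and the synthetic series are illustrative assumptions.

import numpy as np

def dft_distance(x, y, n_coeffs=4):
    fx = np.fft.fft(x)[:n_coeffs]          # keep only the first few frequency-domain coefficients
    fy = np.fft.fft(y)[:n_coeffs]
    return np.linalg.norm(fx - fy)         # Euclidean distance in the frequency domain

t = np.arange(32)
query = np.sin(0.2 * t)
similar = np.sin(0.2 * t) + 0.05 * np.random.randn(32)
different = np.cos(0.7 * t)
print(dft_distance(query, similar) < dft_distance(query, different))   # True: the similar series is closer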

Multimedia Mining Process

CATEGORIES OF MULTIMEDIA DATA MINING


The multimedia data mining is classified into two broad categories as static media and
dynamic media.
Static media contains text (digital library, creating SMS and MMS) and images
(photos and medical images).
Dynamic media contains Audio (music and MP3 sounds) and Video (movies).
Multimedia mining refers to the analysis of large amounts of multimedia information in order to extract patterns based on their statistical relationships.

Categories of Multimedia Data Mining


Models for Multimedia Mining

The models which are used to perform multimedia data are very important in mining.
Commonly four different multimedia mining models have been used. These are
classification, association rule, clustering and statistical modelling.

1. Classification: Classification is a technique for multimedia data analysis that can learn from every property of a specified set of multimedia data. Each data item is assigned to a predefined class label to achieve the purpose of classification.

Classification is the process of organizing data into categories for more effective and efficient use; it creates a function that maps a data item into one of several predefined classes.

Decision tree classification has an intuitive nature and uses a conceptual model without loss of exactness.

2. Association Rule: Association Rule is one of the most important data mining
techniques that help find relations between data items in huge databases.

There are two types of associations in multimedia mining: associations involving image content and associations involving non-image content features. Mining the frequently occurring patterns between different images becomes mining the repeated patterns in a set of transactions. Multi-relational association rule mining displays multiple reports for the same image. In image classification also, multiple-level association rule techniques are used.

3. Clustering: Cluster analysis divides the data objects into multiple groups or clusters, grouping together objects that are similar to one another.

In multimedia mining, the clustering technique can be applied to group similar


images, objects, sounds, videos and texts. Clustering algorithms can be divided into
several methods: hierarchical methods, density-based methods, grid-based methods,
model-based methods, k-means algorithms, and graph-based models.

4. Statistical Modeling: Statistical mining models regulate the statistical validity of test
parameters and have been used to test hypotheses, undertake correlation studies, and
transform and make data for further analysis. This is used to establish links between
words and partitioned image regions to form a simple co-occurrence model.

Issues in Multimedia Mining


Major Issues in multimedia data mining contains content-based retrieval, similarity search,
dimensional analysis, classification, prediction analysis and mining associations in
multimedia data.

1. Content-based retrieval and Similarity search

Content-based retrieval in multimedia is a challenging problem since multimedia data may require detailed interpretation from pixel values. We consider two main families of multimedia retrieval systems for similarity search in multimedia data:

o Description-based retrieval systems create indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and creation time.
o Content-based retrieval system supports image content retrieval, for example, colour
histogram, texture, shape, objects, and wavelet transform.
o Use of content-based retrieval system: Visual features index images and promote
object retrieval based on feature similarity; it is very desirable in various applications.
These applications include diagnosis, weather prediction, TV production and internet
search engines for pictures and e-commerce.

2. Multidimensional Analysis

To perform multidimensional analysis of large multimedia databases, multimedia data cubes


may be designed and constructed similarly to traditional data cubes from relational data. A
multimedia data cube has several dimensions.

A multimedia data cube can have additional dimensions and measures for multimedia data,
such as colour, texture, and shape.

The Multimedia data mining system prototype is Multimedia Miner, the extension of the
DBMiner system that handles multimedia data.

The Image Excavator component of Multimedia Miner uses image contextual information,
like HTML tags on Web pages, to derive keywords. By navigating online directory
structures, like Yahoo! directory, it is possible to build hierarchies of keywords mapped on
the directories in which the image was found.

3. Classification and Prediction Analysis

Classification and predictive analysis has been used for mining multimedia data, particularly
in scientific analysis like astronomy, seismology, and geoscientific analysis.

Decision tree classification is an important method for reported image data mining
applications.

Image data mining classification and clustering are carefully connected to image analysis and
scientific data mining. The image data are frequently in large volumes and need substantial
processing power, such as parallel and distributed processing. Hence, many image analysis
techniques and scientific data analysis methods could be applied to image data mining.

4. Mining Associations in Multimedia Data


Association rules involving multimedia objects have been mined in image and video databases. Three categories can be observed:

o Associations between image content and non-image content features
o Associations among image contents that are not related to spatial relationships
o Associations among image contents related to spatial relationships

First, an image contains multiple objects, each with various features such as colour, shape,
texture, keyword, and spatial locations, so that many possible associations can be made.

Second, a picture containing multiple repeated objects is essential in image analysis. The
recurrence of similar objects should not be ignored in association analysis.

Third, associations between spatial relationships and multimedia content can be used to discover object associations and correlations.

With the associations between multimedia objects, we can treat every image as a transaction
and find commonly occurring patterns among different images.

Architecture for Multimedia Data Mining

Multimedia mining architecture is given in the below image. The architecture has several components. Important components are Input, Multimedia Content, Spatio-temporal Segmentation, Feature Extraction, Finding Similar Patterns, and Evaluation of Results.

1. The input stage comprises a multimedia database used to find the patterns and
perform the data mining.
2. Multimedia Content is the data selection stage that requires the user to select the
databases, subset of fields, or data for data mining.
3. Spatio-temporal segmentation deals with moving objects in image sequences in the videos, and it is useful for object segmentation.
4. Feature extraction is the preprocessing step that involves integrating data from various sources and making choices about how to characterize or code the data.
5. Finding similar patterns is the heart of the whole data mining process. The hidden patterns and trends in the data are uncovered in this stage. Approaches to finding similar patterns include association, classification, clustering, regression, time-series analysis and visualization.
6. Evaluation of Results is the stage in which the mining results are evaluated; it is important for determining whether a prior stage must be revisited. This stage consists of reporting and using the extracted knowledge to produce new actions, products, services, or marketing strategies.
Text Mining:
Text mining is a component of data mining that deals specifically with unstructured text data.
Text mining is also referred to as Text Data Mining and Knowledge Discovery in Textual Databases.
It involves the use of natural language processing (NLP) techniques to extract useful information and insights from large amounts of unstructured text data.
Text mining can be used as a preprocessing step for data mining.
Text mining is widely used in various fields, such as natural language processing, information retrieval, and social media analysis.

“Text Mining is the procedure of synthesizing information, by analyzing relations,


patterns, and rules among textual data.”

By using text mining, the unstructured text data can be transformed into structured data that
can be used for data mining tasks such as classification, clustering, and association rule
mining.
Process of Text Mining

Gathering unstructured information from various sources accessible in various document


organizations, for example, plain text, web pages, PDF records, etc.

o Text Pre-processing is a significant task and a critical step in Text Mining, Natural
Language Processing (NLP), and information retrieval(IR).
In the field of text mining, data pre-processing is used for extracting useful
information and knowledge from unstructured text data.
Information Retrieval (IR) is a matter of choosing which documents in a collection
should be retrieved to fulfill the user's need.

o Feature selection:
Feature selection is a significant part of data mining. Feature selection can be defined as the process of reducing the input of processing or finding the essential information sources. Feature selection is also called variable selection.
o Data Mining:
Now, in this step, the text mining procedure merges with the conventional process. Classic data mining procedures are applied to the structured database.
o Evaluate:
Afterward, the results are evaluated; results that are not useful are abandoned.
o Applications: Customer care, risk management, social media analysis etc ..
Pre-processing and data cleansing tasks are performed to distinguish and eliminate inconsistency in the data. The data cleansing process makes sure to capture the genuine text, and it is performed to eliminate stop words, perform stemming (the process of identifying the root of a certain word), and index the data.

Processing and controlling tasks are applied to review and further clean the data set.
Pattern analysis is implemented in Management Information System.

Information processed in the above steps is utilized to extract important and applicable
data for a powerful and convenient decision-making process and trend analysis.

Procedures for Analyzing Text Mining


Text Summarization: To extract its partial content and reflect its whole content automatically.
Text Categorization: To assign a category to the text among categories predefined by users.
Text Clustering: To segment texts into several clusters, depending on the substantial
relevance.


Text Retrieval Methods

Text retrieval is the process of transforming unstructured text into a structured format to
identify meaningful patterns and new insights.
This is done by using advanced analytical techniques, including Naïve Bayes, Support Vector Machines (SVM), and other deep learning algorithms.
There are two methods of text retrieval methods
Document Selection − In document selection methods, the query is regarded as specifying constraints for choosing relevant documents.
A general approach of this category is the Boolean retrieval model, in which a document is
defined by a set of keywords and a user provides a Boolean expression of keywords, such as
car and repair shops, tea or coffee, or database systems but not Oracle.
The retrieval system can take such a Boolean query and return records that satisfy the
Boolean expression. The Boolean retrieval techniques usually only work well when the user
understands a lot about the document set and can formulate the best query in this way.
Document ranking − Document ranking methods use the query to rank all records in the
order of applicability.
There are several ranking methods based on a huge spectrum of numerical foundations, such
as algebra, logic, probability, and statistics.
The common intuition behind all of these techniques is that it can connect the keywords in a
query with those in the records and score each record depending on how well it matches the
query.
The degree of relevance of a record is given by a score computed from information such as the frequency of words in the document and in the whole collection.
For ordinary users and exploratory queries, these techniques are more suitable than document
selection methods.
The most popular approach of this method is the vector space model.
The basic idea of the vector space model is the following:
It can represent a document and a query both as vectors in a high-dimensional space
corresponding to all the keywords and use an appropriate similarity measure to evaluate the
similarity among the query vector and the record vector.
The similarity values can then be used for ranking documents.
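A minimal Python sketch of this vector space idea, assuming whitespace tokenization and raw term counts as vector weights; the documents and query are illustrative.

import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["car repair shops", "coffee and tea shops", "database systems"]
query = Counter("car repair".split())
doc_vectors = [Counter(d.split()) for d in docs]
# Rank documents by their similarity to the query vector.
ranking = sorted(range(len(docs)), key=lambda i: cosine(query, doc_vectors[i]), reverse=True)
print(ranking)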

TF-IDF
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical
method in natural language processing and information retrieval.
It measures how important a term is within a document relative to a collection of
documents (i.e., relative to a corpus).
Words within a text document are transformed into importance numbers by a text
vectorization process. There are many different text vectorization scoring schemes,
TF-IDF being one of the most common.

Term Frequency: TF of a term measures how frequently the term occurs in a given document.
Inverse Document Frequency: IDF of a term reflects the proportion of documents in the corpus that contain the term.
Words unique to a small percentage of documents receive higher importance values than words common across all documents.

The TF-IDF of a term is calculated by multiplying its TF and IDF scores: TF-IDF(t, d) = TF(t, d) × IDF(t).

The resulting TF-IDF score reflects the importance of a term for a document in the corpus.
TF-IDF is useful in many natural language processing applications.
For example, Search Engines use TF-IDF to rank the relevance of a document for a query.
TF-IDF is also employed in text classification, text summarization, and topic modeling.
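A minimal Python sketch of the TF-IDF weighting, assuming TF is the raw count of a term in a document and IDF = log(N / document frequency); smoothing variants used by real libraries are omitted, and the tiny corpus is illustrative.

import math
from collections import Counter

docs = ["data mining extracts patterns",
        "text mining handles unstructured text",
        "mining patterns from data"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = Counter(doc_tokens)[term]                       # term frequency in this document
    df = sum(1 for d in tokenized if term in d)          # number of documents containing the term
    return tf * math.log(N / df) if df else 0.0

print(tf_idf("mining", tokenized[0]))    # 0.0: the term appears in every document
print(tf_idf("extracts", tokenized[0]))  # higher: the term is unique to one document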
Dimensionality Reduction

Dimensionality reduction is a technique used to reduce the number of features in a dataset


while retaining as much of the important information as possible.
It is a process of transforming high-dimensional data into a lower-dimensional space that
still preserves the essence of the original data.
Dimensionality reduction technique can be defined as,
"It is a way of converting the higher dimensions dataset into lesser dimensions
dataset ensuring that it provides similar information."
Latent Semantic Analysis

Latent Semantic Analysis is a natural language processing method that uses the statistical
approach to identify the association among the words in a document.
Singular Value Decomposition:
Singular Value Decomposition is the statistical method that is used to find the latent(hidden)
semantic structure of words spread across the document.
Let C be the collection of documents, d the number of documents, and n the number of unique words in the whole collection, so that M is a d × n matrix.
The SVD decomposes the matrix M, i.e., the word-to-document matrix, into three matrices as follows:

M = U Σ Vᵀ

where
U = distribution of words across the different contexts,
Σ = diagonal matrix of the association among the contexts, and
Vᵀ = distribution of contexts across the different documents.

A very significant feature of SVD is that it allows us to truncate the contexts that are not necessarily required.
The ∑ matrix provides us with the diagonal values which represent the significance of the
context from highest to the lowest.
By using these values we can reduce the dimensions and hence this can be used as a
dimensionality reduction technique too.
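A minimal NumPy sketch of this truncation step; the small count matrix M is illustrative.

import numpy as np

# Illustrative document-by-word count matrix M (4 documents x 4 words).
M = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1]], dtype=float)

U, S, Vt = np.linalg.svd(M, full_matrices=False)   # M = U Σ V^T
k = 2                                              # keep only the k most significant contexts
M_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]        # reduced (rank-k) approximation of M
print(np.round(S, 2))                              # diagonal values, from highest to lowest significance
print(np.round(M_k, 2))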
Text Mining Approaches in Data Mining:

These are the following text mining approaches that are used in data mining.

1. Keyword-based Association Analysis:

It collects sets of keywords or terms that frequently occur together and then discovers the association relationships among them.

First, it preprocesses the text data by parsing, stemming, removing stop words, etc.
Once the data have been pre-processed, association mining algorithms are applied. Here, human effort is not required, so the number of unwanted results and the execution time are reduced.

2. Document Classification Analysis:

Automatic document classification:

This analysis is used for the automatic classification of the huge number of online text documents like
web pages, emails, etc.
Text document classification varies with the classification of relational data as document databases
are not organized according to attribute values pairs.
What is Web Mining?
● Web mining is the use of data mining techniques to extract knowledge from web data.
● Web data includes :
○ web documents
○ hyperlinks between documents
○ usage logs of web sites
● The WWW is huge, widely distributed, global information service centre and, therefore,
constitutes a rich source for data mining.
Data Mining vs Web Mining
● Data Mining : It is a concept of identifying a significant pattern from the data that gives a better
outcome.
● Web Mining : It is the process of performing data mining in the web. Extracting the web
documents and discovering the patterns from it.
Web Data Mining Process

https://ieeexplore.ieee.org/document/5485404/
Issues
● Web data sets can be very large
○ Tens to hundreds of terabyte
● Cannot mine on a single server
○ Need large farms of servers
● Proper organization of hardware and software to mine multi-terabyte data sets
● Difficulty in finding relevant information
● Extracting new knowledge from the web
Web Mining Taxonomy

https://www.researchgate.net/figure/Taxonomy-of-Web-Mining-Source-Kavita-et-al-2011_fig1_282357293
Web Content Mining - Introduction
● Mining, extraction and integration of useful data, information and knowledge from Web page
content.
● Web content mining is related but different from data mining and text mining.
● Web data are mainly semi-structured and/or unstructured, while data mining deals primarily
with structured data.
What is Web Structure Mining?
● Web structure mining is the process of discovering structure information from the web.
● The structure of typical web graph consists of Web pages as nodes, and hyperlinks as edges
connecting between two related pages.

Hyperlink

Web document
Web Structure Mining (cont.)
● This type of mining can be performed either at the document level(intra-page) or at the
hyperlink level(inter-page).
● The research at the hyperlink level is called Hyperlink analysis.
● Hyperlink structure can be used to retrieve useful information on the web.

There are two main approaches:

● PageRank
● Hubs and Authorities - HITS
PageRank
● Used to discover the most important pages on the web.
● Prioritize pages returned from search by looking at web structure.
● The importance of a page is calculated based on the number of pages that point to it (backlinks).
● Weighting is used to give more importance to backlinks coming from important pages.
● PR(p) = (1 - d) + d (PR(p1)/N(p1) + … + PR(pn)/N(pn)), where p1, …, pn are the pages linking to p.
○ PR(pi): PageRank of a page pi which points to the target page p.
○ N(pi): number of links going out of page pi.
○ d: damping factor, a constant between 0 and 1 (commonly set to around 0.85).
○ (1 - d): gives every page a small base rank, so pages with no backlinks still receive some PageRank.
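A minimal iterative PageRank sketch in Python, directly following the formula above (the tiny link graph, the damping factor d = 0.85 and the fixed number of iterations are illustrative assumptions):

# Illustrative web graph: page -> set of pages it links to.
links = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},
}
pages = list(links)
d = 0.85  # damping factor

# backlinks[p] = pages that point to p
backlinks = {p: [q for q in pages if p in links[q]] for p in pages}

# Start every page with rank 1 and iterate
# PR(p) = (1 - d) + d * sum(PR(q) / N(q)) over the backlinks q of p.
pr = {p: 1.0 for p in pages}
for _ in range(50):
    pr = {p: (1 - d) + d * sum(pr[q] / len(links[q]) for q in backlinks[p])
          for p in pages}

for p, score in sorted(pr.items(), key=lambda kv: -kv[1]):
    print(p, round(score, 3))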
Hubs and Authorities
● Authoritative pages
○ An authority is defined as the best source for a given request.
○ Highly important pages.
○ Best source for requested information.
● Hub pages
○ Contain links to highly important (authority) pages.

(Figure: hub pages linking to authority pages)
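A minimal sketch of the HITS hub/authority updates in Python (the small link graph and the number of iterations are illustrative assumptions):

# Illustrative link graph: page -> pages it links to.
links = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"A", "C"},
}
pages = list(links)

hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # Authority score: sum of hub scores of the pages that link to it.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub score: sum of authority scores of the pages it links to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize so the scores stay bounded.
    a_norm = sum(v * v for v in auth.values()) ** 0.5
    h_norm = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("authorities:", {p: round(v, 3) for p, v in auth.items()})
print("hubs       :", {p: round(v, 3) for p, v in hub.items()})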
Web Structure Mining applications
● Information retrieval in social networks.
● To find out the relevance of each web page.
● Measuring the completeness of Web sites.
● Used in search engines to find out the relevant information.
Web Usage Mining
● Web usage mining: automatic discovery of patterns in clickstreams and associated data
collected or generated as a result of user interactions with one or more Web sites.
● Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.
● The discovered patterns are usually represented as collections of pages, objects, or resources that
are frequently accessed by groups of users with common interests.
● Data in Web Usage Mining:
a. Web server logs
b. Site contents
c. Data about the visitors, gathered from external channels
Three Phases

Raw server log → Pre-Processing → Preprocessed data → Pattern Discovery → Rules and Patterns → Pattern Analysis → Interesting Knowledge
Data Preparation
● Data cleaning
○ By checking the suffix of the requested URL; for example, log entries for embedded resources with
filename suffixes such as .gif, .jpeg, etc. can be removed
● User identification
○ If a page is requested that is not directly linked to the previous pages, multiple users are
assumed to exist on the same machine
○ Other heuristics involve using a combination of IP address, machine name, browser agent,
and temporal information to identify users
● Transaction identification
○ All of the page references made by a user during a single visit to a site
○ Size of a transaction can range from a single page reference to all of the page references
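A minimal data-preparation sketch in Python (the log format, the image-suffix filter and the 30-minute session timeout are all illustrative assumptions):

from collections import defaultdict

# Illustrative, already-parsed log entries: (ip, user_agent, timestamp_seconds, url).
log = [
    ("10.0.0.1", "Firefox", 0,    "/index.html"),
    ("10.0.0.1", "Firefox", 5,    "/logo.gif"),
    ("10.0.0.1", "Firefox", 40,   "/products.html"),
    ("10.0.0.2", "Chrome",  50,   "/index.html"),
    ("10.0.0.1", "Firefox", 4000, "/support.html"),
]

# Data cleaning: drop requests for embedded images by URL suffix.
clean = [e for e in log if not e[3].endswith((".gif", ".jpeg", ".jpg", ".png"))]

# User identification: combine IP address and browser agent as a simple heuristic.
by_user = defaultdict(list)
for ip, agent, ts, url in clean:
    by_user[(ip, agent)].append((ts, url))

# Transaction identification: start a new transaction when the gap between
# consecutive requests of the same user exceeds a timeout.
TIMEOUT = 30 * 60  # 30 minutes
transactions = []
for user, hits in by_user.items():
    hits.sort()
    session = [hits[0][1]]
    for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
        if ts - prev_ts > TIMEOUT:
            transactions.append((user, session))
            session = []
        session.append(url)
    transactions.append((user, session))

print(transactions)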
Pattern Discovery Tasks
● Clustering and Classification
○ Clustering of users helps to discover groups of users with similar navigation patterns =>
provide personalized Web content (a small clustering sketch follows the examples below)
○ Clustering of pages helps to discover groups of pages having related content => search
engine
○ E.g. clients who often access webminer software products tend to be from educational
institutions.
○ clients who placed an online order for software tend to be students in the 20-25 age group
and live in the United States.
○ 75% of clients who download software and visit between 7:00 and 11:00 pm on weekends
are engineering students
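A minimal sketch of clustering users by their navigation patterns, assuming scikit-learn; the user-page visit matrix and the number of clusters are illustrative assumptions:

from sklearn.cluster import KMeans
import numpy as np

# Illustrative user-session matrix: rows = users, columns = pages,
# values = number of visits to each page.
pages = ["/index", "/products", "/download", "/support"]
visits = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 1, 6, 5],
    [0, 0, 5, 6],
], dtype=float)

# Group users with similar navigation patterns.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(visits)
for user, label in enumerate(km.labels_):
    print(f"user {user} -> cluster {label}")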
Pattern Discovery Tasks
● Sequential Patterns:
○ Extract frequently occurring inter-session patterns, i.e. the presence of a set of items
followed by another item in time order
○ Used to predict future user visit patterns => placing ads or recommendations

● Association Rules:
○ Discover correlations among pages accessed together by a client
○ Help restructure the Web site
○ Develop e-commerce marketing strategies - Grocery Mart (a small rule-mining sketch follows)
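A minimal pure-Python sketch of discovering simple page-to-page association rules from user transactions (the sessions and the support/confidence thresholds are illustrative assumptions):

from itertools import permutations

# Illustrative user transactions: the set of pages accessed in each session.
sessions = [
    {"/index", "/products", "/cart"},
    {"/index", "/products"},
    {"/index", "/support"},
    {"/products", "/cart"},
]

min_support = 0.5     # rule must hold in at least half of the sessions
min_confidence = 0.7

n = len(sessions)
all_pages = {p for s in sessions for p in s}
for a, b in permutations(all_pages, 2):
    both = sum(1 for s in sessions if a in s and b in s)
    only_a = sum(1 for s in sessions if a in s)
    support = both / n
    confidence = both / only_a if only_a else 0.0
    if support >= min_support and confidence >= min_confidence:
        print(f"{a} => {b}  (support={support:.2f}, confidence={confidence:.2f})")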
Pattern Analysis Tasks
● Pattern Analysis is the final stage of WUM, which involves the validation and
interpretation of the mined pattern

● Validation:
○ to eliminate the irrelevant rules or patterns and to extract the interesting rules or
patterns from the output of the pattern discovery process

● Interpretation:
○ the output of mining algorithms is mainly in mathematical form and is not suitable for
direct human interpretation
