DM NOTES
Data warehousing refers to the process of compiling and organizing data into one common database. The data from multiple sources is integrated into a common repository known as a Data Warehouse.
Data Warehouse:
A Data Warehouse is a place where data can be stored for useful mining. It is like a fast computer system with exceptionally large data storage capacity. Data from the organization's various systems is copied to the Warehouse, where it can be fetched and conformed to remove errors. Here, advanced queries can be run against the warehouse's store of data.
Data warehouses and databases are both relational data systems, but they are built to serve different purposes. A data warehouse is built to store a huge amount of historical data and typically uses Online Analytical Processing (OLAP). A database is made to store current transactions and allow quick access to specific transactions, commonly known as Online Transaction Processing (OLTP).
1. Subject Oriented
2. Time Variant: The different data present in the data warehouse provide information for a specific period.
3. Integrated
4. Non-Volatile
DATA MINING:
Data mining refers to the process of extracting useful data from the
databases. The data mining process depends on the data compiled in the
data warehousing phase to recognize meaningful patterns. A data
warehouse is created to support management decision systems.
Effective data mining aids in various aspects of planning business strategies
and managing operations. That includes customer-facing functions such as marketing, advertising, sales and customer support, plus manufacturing,
supply chain management, finance and HR. Data mining supports fraud
detection, risk management, Cyber Security Planning and many other
critical business use cases. It also plays an important role in healthcare,
government, scientific research, mathematics, sports and more.
Data Analysis
It is the process of analysing and organizing raw data in order to determine useful information and decisions.
Data Mining and Knowledge Discovery KDD
Data mining is also called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern evaluation, and Knowledge presentation.
KDD is the overall process of converting raw data into useful information.
This process consists of a series of transformation steps, from data pre-
processing to post processing of data mining results.
Input data
Stored in a variety of formats (flat files, spreadsheets, or relational tables)
Pre-processing
It transforms the raw input data into an appropriate format for subsequent
analysis.
Steps involved in pre-processing:
Combining data from multiple sources
Cleaning data to remove noise and duplicate observations
Selecting records and features that are relevant to the data mining task (a short sketch of these steps follows below)
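A rough sketch of these pre-processing steps using pandas is shown below; the file names and column names (sales_branch1.csv, customer_id, item, amount) are hypothetical and only for illustration.

import pandas as pd

# Hypothetical input files and column names, for illustration only.
branch1 = pd.read_csv("sales_branch1.csv")
branch2 = pd.read_csv("sales_branch2.csv")

# Combine data from multiple sources.
data = pd.concat([branch1, branch2], ignore_index=True)

# Clean: drop duplicate observations and rows with missing key values.
data = data.drop_duplicates()
data = data.dropna(subset=["customer_id", "amount"])

# Select only the records and features relevant to the mining task.
relevant = data.loc[data["amount"] > 0, ["customer_id", "item", "amount"]]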
Post processing
This step ensures that only valid and useful results are incorporated into the decision support system.
Statistical measures or testing methods are applied to eliminate false data mining results. An example of post-processing is visualization, which allows analysts to explore the data and the data mining results from a variety of viewpoints.
The significant components of data mining systems are a data source, data
mining engine, data warehouse server, the pattern evaluation module,
graphical user interface, and knowledge base.
Data Source:
The actual source of data is the Database, data warehouse, World Wide Web
(WWW), text files, and other documents. You need a huge amount of
historical data for data mining to be successful. Organizations typically store
data in databases or data warehouses.
Different processes:
Before passing the data to the database or data warehouse server, the data
must be cleaned, integrated, and selected. As the information comes from
various sources and in different formats, it can't be used directly for the
data mining procedure because the data may not be complete and accurate.
So, the data first needs to be cleaned and unified (schema integration). More
information than needed will be collected from various data sources, and
only the data of interest will have to be selected and passed to the server.
These procedures are not as easy as we think. Several methods may be
performed on the data as part of selection, integration, and cleaning.
The database or data warehouse server consists of the original data that is
ready to be processed. Hence, the server is responsible for retrieving the relevant data for data mining as per the user's request.
The data mining engine is a major component of any data mining system. It
contains several modules for operating data mining tasks, including
association, characterization, classification, clustering, prediction, time-
series analysis, etc.
In other words, the data mining engine is the core of the data mining architecture. It comprises instruments and software used to obtain insights
and knowledge from data collected from various data sources and stored
within the data warehouse.
The graphical user interface (GUI) module communicates between the data
mining system and the user. This module helps the user to easily and
efficiently use the system without knowing the complexity of the process.
This module cooperates with the data mining system when the user specifies
a query or a task and displays the results.
Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It might
be helpful to guide the search or evaluate the result patterns. The knowledge
base may even contain user views and data from user experiences that
might be helpful in the data mining process. The data mining engine may
receive inputs from the knowledge base to make the result more accurate
and reliable. The pattern assessment module regularly interacts with the
knowledge base to get inputs, and also update it.
The main objective of the KDD process is to extract information from data in
the context of large databases. It does this by using Data Mining algorithms
to identify what is deemed knowledge. KDD is the organized procedure of
recognizing valid, useful, and understandable patterns from huge and
complex data sets. Data Mining is the root of the KDD procedure.
The following diagram shows the process of knowledge discovery:
There are mainly three types of data stores on which mining can be performed. The different sources of data used in the data mining process are:
Data Base Data
Data warehouse Data
Transactional Data
DATABASE DATA
A database is also called a database management system or DBMS. Every DBMS stores data that are related to each other in one way or another. It
also has a set of software programs that are used to manage data and
provide easy access to it.
A Relational database is defined as the collection of data organized in
tables with rows and columns.
A data warehouse is a single data storage location that collects data from
multiple sources and then stores it in the form of a unified plan. When
data is stored in a data warehouse, it undergoes cleaning, integration,
loading, and refreshing. Data stored in a data warehouse is organized in
several parts. If you want information on data that was stored 6 or 12
months back, you will get it in the form of a summary.
Transactional Databases:
A transactional database consists of transaction records, where each transaction typically has an identifier (TID) and a list of the items involved in it (for example, the items purchased in one store visit).
Object-oriented databases (OODBs)
Object-oriented databases (OODBs) are designed to store and manipulate objects. They are similar to object-oriented programming languages, e.g. Java and Python. Objects can contain data, methods, and relationships to other objects.
In OODB, the object itself is the storage rather than the representation of
the data. This allows for more efficient and natural handling of complex
data structures and relationships between objects.
Advantages of OODBs
Object-oriented databases (OODBs) have many advantages:
They work well with object-oriented programming languages.
They are easy to model.
They are fast for object-oriented workloads.
They can handle complex data structures well.
Object-relational databases (ORDBs)
Object-relational databases (ORDBs) are a hybrid between traditional relational databases and OODBs. ORDBs are designed to handle both structured and unstructured data, much like OODBs, but they also support SQL queries and transactions, much like traditional relational databases.
One of the major goals of Object relational data model is to close the gap
between relational databases and the object oriented practices frequently
used in many programming languages such as C++, C#, Java etc.
Multimedia Databases:
Multimedia databases consist of audio, video, images and text media.
They can be stored on Object-Oriented Databases.
They are used to store complex information in pre-specified
formats.
Application: Digital libraries, video-on-demand, news-on-demand, musical databases, etc.
Spatial Database
Store geographical information.
Stores data in the form of coordinates, topology, lines,
polygons, etc.
Application: Maps, Global positioning, etc.
Time-series Databases
Time series databases contain stock exchange data and user
logged activities.
Handles array of numbers indexed by time, date, etc.
It requires real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.
WWW
The WWW (World Wide Web) is a collection of documents and resources such as audio, video, text, etc., which are identified by Uniform Resource Locators (URLs) through web browsers, linked by HTML pages, and accessible via the Internet.
It is the most heterogeneous repository as it collects data from
multiple resources.
It is dynamic in nature as Volume of data is continuously
increasing and changing.
Application: Online shopping, Job search, Research, studying,
etc.
Structured Data: This type of data is organized into a specific format,
such as a database table or spreadsheet. Examples include
transaction data, customer data, and inventory data.
Semi-Structured Data: This type of data has some structure, but not
as much as structured data. Examples include XML and JSON files,
and email messages.
Unstructured Data: This type of data does not have a specific format,
and can include text, images, audio, and video. Examples include
social media posts, customer reviews, and news articles.
External Data: This type of data is obtained from external sources such
as government agencies, industry reports, weather data, satellite images,
GPS data, etc.
Heterogeneous Database
It consists of data from multiple dissimilar sources.
These sources may include different types of databases, flat files, etc.
Legacy Database
It is a group of heterogeneous databases that combines different kinds of data sources: relational or object-oriented databases, hierarchical and network databases, spreadsheets, multimedia, etc.
The Heterogeneous Database in a legacy database can be connected by
intra or inter computer networks.
Time-Series Data: This type of data is collected over time, such as stock
prices, weather data, and website visitor logs.
Loose Coupling − In this scheme, the data mining system may use
some of the functions of database and data warehouse system. It fetches
the data from the data repository managed by these systems and
performs data mining on that data. It then stores the mining result
either in a file or in a designated place in a database or in a data
warehouse.
Mining different kinds of knowledge in databases − Different users
may be interested in different kinds of knowledge. Therefore it is
necessary for data mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction −
The data mining process needs to be interactive because it allows
users to focus the search for patterns, providing and refining data
mining requests based on the returned results.
Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns.
Data mining query languages and ad hoc data mining − Data
Mining Query language that allows the user to describe ad hoc mining
tasks, should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.
Performance Issues
There can be performance-related issues such as follows:
Efficiency and scalability of data mining algorithms − In order to
effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
A data mining task can be specified in the form of a data mining query,
which is input to the data mining system. A data mining query is defined in
terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system. The data mining
primitives specify the following.
1. Set of task-relevant data to be mined.
2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern evaluation.
5. Representation for visualizing the discovered patterns.
This specifies the portions of the database or the set of data in which the
user is interested.
This knowledge about the domain to be mined is useful for guiding the
knowledge discovery process and evaluating the patterns found.
For example, interestingness measures for association rules include support and
confidence. Rules whose support and confidence values are below user-
specified thresholds are considered uninteresting.
5. The expected representation for visualizing the discovered patterns
A class or concept implies there is a data set or set of features that define
the class or a concept. A class can be a category of items on a shop floor,
and a concept could be the abstract idea on which data may be categorized
like products to be put on clearance sale and non-sale products. There are
two concepts here, one that helps with grouping and the other that helps in
differentiating.
o Frequent item set: This term refers to a group of items that are
commonly found together, such as milk and sugar.
o Frequent substructure: It refers to the various types of data
structures that can be combined with an item set or subsequences,
such as trees and graphs.
o Frequent Subsequence: A regular pattern series, such as buying a phone followed by a cover.
3. Association Analysis
4. Classification
5. Prediction/Predictive
6. Cluster Analysis
7. Outlier Analysis
Evolution Analysis pertains to the study of data sets that change over
time. Evolution analysis models are designed to capture evolutionary
trends in data helping to characterize, classify, cluster or discriminate
time-related data.
INTERESTINGNESS PATTERNS
A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data with some degree of certainty, (3) potentially useful, and (4) novel.
UNIT-II
ASSOCIATION RULES
Introduction:
• The goal of Association Rule Mining (ARM) is to find association rules, frequent patterns, subsequences, or correlation relationships among large sets of data items that satisfy a predefined minimum support and confidence in a given database.
A frequent item set is a set of items that occur together frequently in a dataset.
The frequency of an item set is measured by its support count, which is the number of transactions or records in the dataset that contain the item set.
Support : It is one of the measures of interestingness. This tells about the usefulness and
certainty of rules. 5% Support means total 5% of transactions in the database follow the rule.
For example, a set of items, such as milk and bread, that appear frequently
together in a transaction data set is a frequent itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a
memory card, if it occurs frequently in a shopping history database, is a (frequent)
sequential pattern.
Maximal Itemset: An itemset is maximal frequent if none of its supersets are frequent.
Closed Itemset:
An (frequent) itemset is called closed if it has no (frequent) superset having the same support.
The support of the rule is the joint probability of a transaction containing both A and B, given
as sup(A ⇒ B) = P(A ∧ B) = sup(A ∪ B).
K- Itemset: Itemset which contains K items is a K-itemset. So it can be said that an itemset is
frequent if the corresponding support count is greater than the minimum support count.
In frequent mining usually, interesting associations and correlations between item sets in
transactional and relational databases are found.
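As a small illustration of support and confidence, the sketch below counts them for a rule A ⇒ B over a made-up transaction list (not taken from the notes):

# Toy transactions, made up for illustration.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: {milk} => {bread}
a, b = {"milk"}, {"bread"}
sup = support(a | b, transactions)                               # P(A and B)
conf = support(a | b, transactions) / support(a, transactions)   # P(B | A)
print(f"support = {sup:.2f}, confidence = {conf:.2f}")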
Data Warehousing and OLAP
• Database design:
OLTP Vs OLAP
DataWarehousing: A Multitiered Architecture
• The top tier is a front-end client layer, which contains query and
reporting tools, analysis tools, and/or data mining tools.
Data Warehouse Modeling: A Multidimensional Data Model
• What is a data cube?”
• A data cube allows data to be modeled and viewed in multiple dimensions.
• It is defined by dimensions and facts.
• In general terms, dimensions are the entities with respect to
which an organization wants to keep records.
• Each dimension may have a table associated with it, called a
dimension table.
• A multidimensional data model is typically organized around a
central theme, such as sales.
• This theme is represented by a fact table. Facts are numeric
measures.
• Now, suppose that we would like to view the sales data with a third dimension. (Figure: 3-D view of sales data for AllElectronics according to time, item, and location.)
• Suppose that we would now like to view our sales data with an
additional fourth dimension, such as supplier.
• The result would form a lattice of cuboids, each showing the data
at a different level of summarization.
• The cuboid that holds the lowest level of summarization is called the
"base cuboid“.
• The 0-D cuboid, which holds the highest level of summarization, is called
the "apex cuboid". Lattice of cuboids with 4D data cube Stars,
Snowflakes, and Fact Constellations: Schemas for
ultidimensional Data Models
• The most popular data model for a data warehouse is a
multidimensional model.
1) Roll-up
2) Drill-down
3) Slice and dice
4) Pivot (rotate)
OLAP Operations on Multidimensional Data
Operations
1) Roll-up :
The roll-up operation (also called drill-up) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
Ex: roll-up from cities to country
2) Drill-down:
Drill-down is the reverse of roll-up.
Drill-down can be realized by either stepping down a concept hierarchy for a dimension or by introducing additional dimensions.
EX: drill-down from quarter to months
3) Slice and dice:
The slice operation performs a selection on one dimension of the given cube, resulting in a subcube; the dice operation defines a subcube by performing a selection on two or more dimensions.
4) Pivot (rotate):
Pivot (also called rotate) is a visualization operation that rotates the
data axes to provide an alternative presentation of the data.
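These OLAP operations can be imitated on a flat table with pandas; the sales table below is made up purely for illustration:

import pandas as pd

# Made-up sales records with time, location and item dimensions.
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "city":    ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "item":    ["phone", "phone", "laptop", "laptop"],
    "amount":  [100, 150, 200, 250],
})

# Roll-up: aggregate from (quarter, city) up to quarter only.
rollup = sales.groupby("quarter")["amount"].sum()

# Drill-down is the reverse: go back to the finer (quarter, city) level.
drilldown = sales.groupby(["quarter", "city"])["amount"].sum()

# Slice: fix one dimension (city = Delhi); dice would fix several.
slice_delhi = sales[sales["city"] == "Delhi"]

# Pivot (rotate): swap the row and column axes of the presentation.
pivoted = sales.pivot_table(index="city", columns="quarter",
                            values="amount", aggfunc="sum")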
Data Preprocessing
Data Quality: Why Preprocess the Data?
• Data have quality if they satisfy the requirements of the intended use.
Major Tasks in Data Preprocessing
• Data cleaning can be applied to remove noise and correct inconsistencies in the data.
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration merges data from multiple sources into a coherent data store, such
as a data warehouse.
– Integration of multiple databases, data cubes, or files
• Data reduction can reduce the data size by aggregating, eliminating redundant
features, or clustering.
– Dimensionality reduction, Numerosity reduction, Data compression
• Data transformations and Data Discretization, such as normalization, may be
applied.
– For example, normalization may improve the accuracy and efficiency of mining algorithms
involving distance measurements.
– Concept hierarchy generation
Data Cleaning
• Data cleaning routines work to “clean” the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
– If users believe the data are dirty, they are unlikely to trust the results of any data mining
that has been applied to it.
– Dirty data can cause confusion for the mining procedure, resulting in unreliable output
Data Integration
• Data integration merges data from multiple sources into a coherent data store, such
as a data warehouse.
Data Reduction
• Data reduction obtains a reduced representation of the data set that is much smaller
in volume, yet produces the same (or almost the same) analytical results.
• Data reduction strategies include dimensionality reduction and numerosity
reduction.
Data transformations and Data Discretization
• The data are transformed or consolidated so that the resulting mining process may be
more efficient, and the patterns found may be easier to understand.
• Data Quality and Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Transformation and Data Discretization
• Data Reduction
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
• e.g., Occupation = “ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary = “−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age = “42”, Birthday = “03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– intentional: (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data.
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same class: smarter
– the most probable value: inference-based such as Bayesian formula or decision tree.
• a popular strategy.
• In comparison to the other methods, it uses the most information from the present data to predict
missing values.
Noisy Data and
How to Handle Noisy Data?
• Noise: random error or variance in a measured variable
• Outliers may represent noise.
• Given a numeric attribute such as, say, price, how can we “smooth” out the data to
remove the noise?
Binning Methods for Data Smoothing
• Binning methods smooth a sorted data by distributing them into bins (buckets).
Binning Methods for Data Smoothing: Example
• Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
• Partition into (equal-frequency) bins:
– Bin 1: 4, 8, 15
– Bin 2: 21, 21, 24
– Bin 3: 25, 28, 34
• Smoothing by bin means:
– Bin 1: 9, 9, 9
– Bin 2: 22, 22, 22
– Bin 3: 29, 29, 29
• Smoothing by bin medians:
– Bin 1: 8, 8, 8
– Bin 2: 21, 21, 21
– Bin 3: 28, 28, 28
• Smoothing by bin boundaries:
– Bin 1: 4, 4, 15
– Bin 2: 21, 21, 24
– Bin 3: 25, 25, 34
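The example above can be reproduced with a short Python sketch (equal-frequency bins of size 3, then smoothing by bin means, medians, and boundaries):

import statistics

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]  # equal-frequency bins

# Smoothing by bin means.
by_means = [[round(statistics.mean(b))] * len(b) for b in bins]

# Smoothing by bin medians.
by_medians = [[statistics.median(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to its closest boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_medians)  # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]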
Data Smoothing
• Many methods for data smoothing are also methods for data reduction involving
discretization.
– For example, the binning techniques reduce the number of distinct values per attribute.
• This acts as a form of data reduction for logic-based data mining methods, such as decision tree
induction, which repeatedly make value comparisons on sorted data.
• Concept hierarchies are a form of data discretization that can also be used for data
smoothing.
– A concept hierarchy for price, for example, may map real price values into inexpensive,
moderately priced, and expensive, thereby reducing the number of data values to be
handled by the mining process.
Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check uniqueness rule, consecutive rule and null rule
– For example, values that are more than two standard deviations away from the mean for a
given attribute may be flagged as potential outliers.
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and
make corrections
• Data auditing: by analyzing data to discover rules and relationship to detect violators (e.g.,
correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations
through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter’s Wheel is a data cleaning tool)
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent source.
– Careful integration can help reduce and avoid redundancies and inconsistencies.
• Schema integration:
– Integrate metadata from different sources
– e.g., A.cust-id ≡ B.cust-#
• Entity identification problem:
– Identify real world entities from multiple data sources,
– e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are different
– Possible reasons: different representations, different scales, e.g., metric vs. British units
Data Transformation
• In data transformation, the data are transformed or consolidated into forms
appropriate for mining.
• In data transformation, a function that maps the entire set of values of a given attribute
to a new set of replacement values such that each old value can be identified with one
of the new values.
Min-Max Normalization
• Min-max normalization performs a linear transformation on the original data.
• Suppose that minA and maxA are minimum and maximum values of an attribute A.
• Min-max normalization maps a value, vi, of an attribute A to v′i in the range [new_minA, new_maxA] by computing:
v′i = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
• Min-max normalization preserves the relationships among the original data values.
• We can standardize the range of all the numerical attributes to [0,1] by applying
min-max normalization with newmin=0 and newmax=1 to all the numeric attributes.
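A minimal sketch of min-max normalization to [0, 1] (new_min = 0, new_max = 1); the income values are made up:

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # v' = (v - min) / (max - min) * (new_max - new_min) + new_min
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [12000, 35000, 54000, 73600, 98000]   # made-up values
print(min_max_normalize(incomes))               # all results fall in [0, 1]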
Discretization
Discretization: To transform a numeric (continuous) attribute into a categorical attribute.
• Some data mining algorithms require that data be in the form of categorical attributes.
• In discretization:
– The range of a continuous attribute is divided into intervals.
– Then, interval labels can be used to replace actual data values to obtain a categorical
attribute.
Discretization Methods
• A basic distinction between discretization methods for classification is whether class
information is used (supervised) or not (unsupervised).
Supervised Discretization:
• Classification (e.g., decision tree analysis)
• Correlation (e.g., χ2) analysis
Discretization by Binning
• Attribute values can be discretized by applying equal-width or equal-frequency
binning.
• Binning approaches sort the attribute values first, then partition them into bins.
– equal width approach divides the range of the attribute into a user-specified number of
intervals each having the same width.
– equal frequency (equal depth) approach tries to put the same number of objects into
each interval.
• After bins are determined, all values are replaced by bin labels to discretize that
attribute.
– Instead of bin labels, values may be replaced by bin means (or medians).
Data Reduction
• Data reduction: Obtain a reduced representation of the data set that is much smaller
in volume but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
Data Reduction: Dimensionality Reduction
• Curse of dimensionality
– When dimensionality increases, data becomes increasingly sparse.
– Density and distance between points, which is critical to clustering, outlier analysis,
becomes less meaningful.
– The possible combinations of subspaces will grow exponentially.
• Dimensionality reduction
– Avoid the curse of dimensionality.
– Help eliminate irrelevant features and reduce noise.
– Reduce time and space required in data mining.
– Allow easier visualization.
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
– Supervised and nonlinear techniques (e.g., feature selection)
Dimensionality Reduction
Attribute Subset Selection
• Data sets for analysis may contain hundreds of attributes, many of which may be
irrelevant to the mining task or redundant.
• Redundant Attributes duplicate much or all of the information contained in one or
more other attributes.
– price of a product and the sales tax paid contain much of the same information.
• Irrelevant Attributes contain almost no useful information for the data mining task.
– students' IDs are irrelevant to predict students' grade.
• Attribute Subset Selection reduces the data set size by removing irrelevant or
redundant attribute.
– The goal of attribute subset selection is to find a minimum set of attributes such that the
resulting probability distribution of the data classes is as close as possible to the original
distribution obtained using all attributes.
– Attribute subset selection reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.
Dimensionality Reduction
Attribute Subset Selection
Attribute Subset Selection Techniques:
• Brute-force approach:
– Try all possible feature subsets as input to data mining algorithm.
• Embedded approaches:
– Feature selection occurs naturally as part of the data mining algorithm.
• Filter approaches:
– Features are selected before data mining algorithm is run.
Numerosity Reduction
Sampling
Simple Random Sampling
– There is an equal probability of selecting any particular item
Sampling without replacement
• As each item is selected, it is removed from the population
Sampling with replacement
• Objects are not removed from the population as they are selected for the sample.
Stratified Sampling
– Split the data into several partitions; then draw random samples from each partition.
• In the simplest version, equal numbers of objects are drawn from each group even though the
groups are of different sizes.
• In an other variation, the number of objects drawn from each group is proportional to the size of
that group.
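A short sketch of these sampling schemes with pandas; the DataFrame and its group column are made up for illustration:

import pandas as pd

# Hypothetical data set with a grouping attribute.
df = pd.DataFrame({"group": ["A"] * 8 + ["B"] * 2, "value": range(10)})

# Simple random sampling without replacement (each item selected at most once).
srs_without = df.sample(n=4, replace=False, random_state=0)

# Simple random sampling with replacement (items may be selected repeatedly).
srs_with = df.sample(n=4, replace=True, random_state=0)

# Stratified sampling: draw from each group in proportion to its size.
stratified = df.groupby("group", group_keys=False).apply(
    lambda g: g.sample(frac=0.5, random_state=0))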
ECLAT
Correlation Analysis
The word correlation is used in everyday life to denote some form of association.
Correlation analysis in market research is a statistical method that identifies the
strength of a relationship between two or more variables.
The correlation coefficient is measured on a scale that varies from + 1 through 0 to –
1. Complete correlation between two variables is expressed by either + 1 or -1.
When one variable increases as the other increases the correlation is positive; when
one decreases as the other increases it is negative. Complete absence of correlation
is represented by 0.
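The correlation coefficient described above can be computed directly, for example with NumPy; the two variables below are made up:

import numpy as np

# Made-up paired observations of two variables.
advertising = np.array([10, 20, 30, 40, 50])
sales       = np.array([12, 25, 31, 45, 52])

# Pearson correlation coefficient, ranging from -1 through 0 to +1.
r = np.corrcoef(advertising, sales)[0, 1]
print(round(r, 3))   # close to +1: sales rise as advertising rises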
A metarule can be used to specify this information describing the form of rules you
are interested in finding. An example of such a metarule is
where P1 and P2 are predicate variables that are instantiated to attributes from the
given database during the mining process, X is a variable representing a customer,
and Y and W take on values of the attributes assigned to P1 and P2, respectively.
Typically, a user will specify a list of attributes to be considered for instantiation
with P1 and P2. Otherwise, a default set may be used.
constraints for the mining task. These rule constraints may be used together with,
or as an alternative to, metarule-guided mining.
This can be expressed in the DMQL data mining query language as follows,
Basic Concepts:
Frequent Item sets, Closed Item sets, and Association Rules, Frequent
Item set Mining Methods: Apriori Algorithm, Generating Association
Rules from Frequent Item sets, A Pattern-Growth Approach for Mining
Frequent Item sets.
Market Basket Analysis: A Motivating Example
• Frequent item set mining leads to the discovery of associations
and correlations among items in large transactional data sets.
• The discovery of interesting correlation relationships among huge
amounts of business transaction records can help in many
business decision-making processes.
• A typical example of frequent item set mining is market basket
analysis.
• This process analyzes customer buying habits by finding
associations between the different items that customers place in
their “shopping baskets”
• Analyze the buying patterns that reflect items that are frequently purchased together.
• These patterns can be represented in the form of Association rules.
• Rule form: “A => B [support, confidence]”.
• For example, the information that customers who purchase
computers Also tend to buy software at the same time is
represented as:
Computer=> financial_management_software
[ support=2%, confidence= 60%]
• Typically, association rules are considered interesting if they
satisfy both a minimum support threshold and a minimum
confidence threshold such thresholds can be set by the users or
domain experts.
Frequent Item sets, Closed Item sets, and Association Rules:
• Let I be a set of items {I1, I2, I3, …, Im}. Let D be a set of database transactions where each transaction T is a set of items such that T ⊆ I.
• Each transaction is associated with an identifier, called TID.
• Let A be a set of items. A transaction T is said to contain A if and
only if A ⊆ T.
• An association rule is an implication of the form A ⇒ B
• Closed Item set: An item set is closed if none of its immediate supersets has the same support count as the item set.
Apriori Algorithm:
Finding Frequent Itemsets by Confined Candidate Generation
• Anti-Monotone property
• Using the apriori property in the algorithm:
• Let us look at how Lk-1 is used to find Lk, for k>=2
Two steps: the join step and the prune step.
Example: Transactional data for an All Electronics branch
• L1: I1 – 6, I2 – 7, I3 – 6, I4 – 2, I5 – 2
• L2: {I1,I2} – 4, {I1,I3} – 4, {I1,I5} – 2, {I2,I3} – 4, {I2,I4} – 2, {I2,I5} – 2
• Join: C3 = L2 x L2 = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}
• Prune: the candidates {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5} and {I2,I4,I5} are pruned, because each contains a 2-item subset (such as {I3,I5}, {I3,I4} or {I4,I5}) that is not frequent, leaving C3 = {{I1,I2,I3}, {I1,I2,I5}}.
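A minimal, unoptimized Python sketch of the Apriori join, prune, and support-counting steps on the transactions of this example:

from itertools import combinations

transactions = [  # the nine AllElectronics transactions from the example
    {"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
    {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"},
]
min_sup = 2  # minimum support count

def count(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
L = [frozenset([i]) for i in items if count(frozenset([i])) >= min_sup]

k = 2
while L:
    print(f"L{k-1}:", [sorted(s) for s in L])
    # Join step: combine frequent (k-1)-itemsets to form k-item candidates.
    candidates = {a | b for a in L for b in L if len(a | b) == k}
    # Prune step: drop candidates with an infrequent (k-1)-subset (Apriori property).
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(L) for s in combinations(c, k - 1))}
    # Support counting: keep candidates meeting the minimum support count.
    L = [c for c in candidates if count(c) >= min_sup]
    k += 1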
Drawbacks of Apriori
Two Steps:
• Scan the transaction DB for the first time, find frequent items (single
item) and order them into a list L in descending order.
- In the format of (item-name, support)
• For each transaction, order its frequent items according to the order
in L; Scan DB the second time, construct FP-tree by putting each
frequency ordered transaction onto it.
Example:
TID List of items
T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1,I2,I4
T500 I1,I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1,I2,I3
The union of all frequent patterns gives the required frequent itemset.
Now, let's start from I5. I5 is involved in 2 branches, namely {I2 I1 I5: 1} and {I2 I1 I3 I5: 1}.
• Therefore considering I5 as suffix, its 2 corresponding prefix paths
would be {I2 I1: 1} and {I2 I1 I3: 1}, which forms its conditional
pattern base.
• Out of these, only I1 and I2 are selected in the conditional FP-tree, because I3 does not satisfy the minimum support count.
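To make the conditional pattern base concrete, the sketch below re-orders each transaction by descending item frequency and collects the prefix paths that end in I5. It is a simplified view that works on the transaction list rather than on an actual FP-tree:

from collections import Counter

transactions = [
    ["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
    ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"],
]
min_sup = 2

# Scan 1: frequent single items, ordered by descending support (the list L).
counts = Counter(i for t in transactions for i in t)
order = {item: rank for rank, (item, c) in
         enumerate(counts.most_common()) if c >= min_sup}

# Scan 2: order each transaction's frequent items according to L.
ordered = [sorted([i for i in t if i in order], key=order.get) for t in transactions]

# Conditional pattern base of I5: the prefix paths that precede I5.
cpb_I5 = [t[:t.index("I5")] for t in ordered if "I5" in t]
print(cpb_I5)   # [['I2', 'I1'], ['I2', 'I1', 'I3']]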
SPADE (Sequential PAttern Discovery using Equivalence classes), for discovering the set
of all frequent sequences. The key features of our approach are as follows:
1. We use a vertical id-list database format, where we associate with each sequence a list
of objects in which it occurs, along with the time-stamps. We show that all frequent
sequences can be enumerated via simple temporal joins (or intersections) on id-lists.
2. We use a lattice-theoretic approach to decompose the original search space (lattice)
into smaller pieces (sub-lattices) which can be processed independently in main-memory. Our approach usually requires three database scans, or only a single scan with
some pre-processed information, thus minimizing the I/O costs.
3. We decouple the problem decomposition from the pattern search. We propose two
different search strategies for enumerating the frequent sequences within each
sublattice: breadth-first search and depth-first search.
SPADE not only minimizes I/O costs by reducing database scans, but also minimizes
computational costs by using efficient search schemes.
SPADE scales linearly in the database size, and a number of other database parameters.
For a given sequential database, each sequence consists of a list of transactions.
The basic idea behind this method is, rather than projecting
sequence databases by evaluating the frequent occurrences of
sub-sequences, the projection is made on frequent prefix.
The Prefix Span algorithm is run on different datasets and results are
drawn based on minimum support value. One new parameter maximum
prefix length is also considered while running the algorithm.
2. Candidate pruning
3. Support counting
UNIT-III
Classification:
o predicts categorical class labels
o classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in
classifying new data
Prediction
models continuous-valued functions, i.e., predicts unknown or missing
values
Typical applications
o Credit approval
o Target marketing
o Medical diagnosis
o Fraud detection
Numeric Prediction
Typical applications
• Credit/loan approval:
• Medical diagnosis: if a tumor is cancerous or benign
• Fraud detection: if a transaction is fraudulent
• Web page categorization: which category it is
• The known label of test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set, otherwise over-fitting will
occur
Process (1): Model Construction
o Accuracy
o Speed
o Robustness
o Scalability
o Interpretability
Classification by Decision Tree Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute node (nonleaf node)
denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution(Terminal node)
The topmost node in a tree is the root node.
“How are decision trees used for classification?” Given a tuple, X, for which
the associated class label is unknown, the attribute values of the tuple are
tested against the decision tree. A path is traced from the root to a leaf
node, which holds the class prediction for that tuple. Decision trees can
easily be converted to classification rules.
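A minimal sketch of decision tree classification with scikit-learn; the tiny training set is made up:

from sklearn.tree import DecisionTreeClassifier

# Made-up training tuples: [age, income] -> buys_computer (1 = yes, 0 = no).
X_train = [[25, 30000], [45, 60000], [35, 40000], [50, 80000], [23, 20000]]
y_train = [0, 1, 1, 1, 0]

# Construct the model from the training set.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Classify a new tuple X whose class label is unknown:
# a path is traced from the root to a leaf holding the prediction.
print(tree.predict([[30, 35000]]))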
Tree Pruning
“How does tree pruning work?” There are two common approaches to tree
pruning: pre pruning and post pruning.
The second and more common approach is post pruning, which removes
subtrees from a “fully grown” tree. A subtree at a given node is pruned by
removing its branches and replacing it with a leaf. The leaf is labeled with
the most frequent class among the subtree being replaced. For example,
notice the subtree at node “A3?” in the unpruned tree of Figure 6.6.
Suppose that the most common class within this subtree is “class B.” In the
pruned version of the tree, the sub tree in question is pruned by replacing
it with the leaf “class B.”
Bayesian Classification
1. Bayes’ Theorem
“How are these probabilities estimated?” P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see below. Bayes’ theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X).
Bayes’ theorem is
P(H|X) = P(X|H) P(H) / P(X)
2. Naïve Bayesian Classification
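Naive Bayesian classification applies Bayes’ theorem under the simplifying assumption that attributes are conditionally independent given the class. A minimal sketch with scikit-learn’s GaussianNB, on made-up data:

from sklearn.naive_bayes import GaussianNB

# Made-up training tuples: [age, income] -> class label.
X_train = [[25, 30000], [45, 60000], [35, 40000], [50, 80000], [23, 20000]]
y_train = [0, 1, 1, 1, 0]

model = GaussianNB()
model.fit(X_train, y_train)

# Posterior probabilities P(H | X) for each class, then the predicted class.
print(model.predict_proba([[30, 35000]]))
print(model.predict([[30, 35000]]))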
Bayesian Belief Networks
A belief network is defined by two components—a directed acyclic
graph and a set of conditional probability tables (Figure 6.11). Each node in
the directed acyclic graph represents a random variable. The variables may
be discrete or continuous-valued. They may correspond to actual attributes
given in the data or to ―hidden variables‖ believed to form a relationship
(e.g., in the case of medical data, a hidden variable may indicate a
syndrome, representing a number of symptoms that, together, characterize
a specific disease). Each arc represents a probabilistic dependence. If an
arc is drawn from a node Y to a node Z, then Y is a parent or immediate
predecessor of Z, and Z is a descendant of Y. Each variable is conditionally
independent of its non descendants in the graph, given its parents.
A belief network has one conditional probability table (CPT) for each
variable. The CPT for a variable Y specifies the conditional distribution P(Y | Parents(Y)), where Parents(Y) are the parents of Y. Figure (b)
shows a CPT for the variable LungCancer. The conditional probability for
each known value of LungCancer is given for each possible combination of
values of its parents. For instance, from the upper leftmost and bottom
rightmost entries, respectively, we see that
Let X = (x1, …, xn) be a data tuple described by the variables or attributes Y1, …, Yn, respectively. Recall that each variable is conditionally independent of its non-descendants in the network graph, given its parents. This allows the network to provide a complete representation of the existing joint probability distribution with the following equation:
P(x1, …, xn) = Π(i = 1..n) P(xi | Parents(Yi))
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction: the leaf
holds the class prediction
• Rules are mutually exclusive and exhaustive
k-Nearest-Neighbor Classifiers
The k-nearest-neighbor method was first described in the early 1950s. The
method is labor intensive when given large training sets, and did not gain
popularity until the 1960s when increased computing power became
available. It has since been widely used in the area of pattern recognition.
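A minimal k-nearest-neighbor sketch with scikit-learn (k = 3); the training tuples are made up:

from sklearn.neighbors import KNeighborsClassifier

# Made-up training tuples and class labels.
X_train = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [0.9, 1.0], [4.9, 5.0]]
y_train = ["A", "A", "B", "B", "A", "B"]

# Classify an unknown tuple by the majority class among its k nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[1.1, 1.0], [5.0, 5.1]]))   # expected: ['A', 'B']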
Case-Based Reasoning
Case-based reasoning (CBR) classifiers use a database of problem solutions
to solve new problems. Unlike nearest-neighbor classifiers, which store
training tuples as points in Euclidean space, CBR stores the tuples or
―cases‖ for problem solving as complex symbolic descriptions. Business
applications of CBR include problem resolution for customer service help
desks, where cases describe product-related diagnostic problems. CBR has
also been applied to areas such as engineering and law, where cases are
either technical designs or legal rulings, respectively. Medical education is
another area for CBR, where patient case histories and treatments are used
to help diagnose and treat new patients.
When given a new case to classify, a case-based reasoner will first check if
an identical training case exists. If one is found, then the accompanying
solution to that case is returned. If no identical case is found, then the
case-based reasoner will search for training cases having components that are similar to those of the new
case. Conceptually, these training cases may be considered as neighbors of
the new case. If cases are represented as graphs, this involves searching for
subgraphs that are similar to subgraphs within the new case. The case-
based reasoner tries to combine the solutions of the neighbouring training
cases in order to propose a solution for the new case. If incompatibilities
arise with the individual solutions, then backtracking to search for other
solutions may be necessary. The case-based reasoner may employ
background knowledge and problem-solving strategies in order to propose a
feasible combined solution.
Other Classification Methods
Genetic Algorithms
Genetic Algorithm: based on an analogy to biological evolution
• An initial population is created consisting of randomly generated
rules
• Fuzzy logic uses truth values between 0.0 and 1.0 to represent the
degree of membership (such as using fuzzy membership graph)
• Attribute values are converted to fuzzy values
e.g., income is mapped into the discrete categories {low, medium, high} with
fuzzy values calculated
• For a given new sample, more than one fuzzy value may apply
4.1.1 Applications:
Cluster analysis has been widely used in numerous applications, including market research,
pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value,and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection,Applications of outlier detection include
the detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.
The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.
The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.
The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices.
E = Σ(i = 1..k) Σ(p ∈ Ci) |p − mi|²
where E is the sum of the square error for all objects in the data set,
p is the point in space representing a given object, and
mi is the mean of cluster Ci.
The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data. This effect is particularly
exacerbated due to the use of the square-error function.
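A minimal k-means sketch with scikit-learn, which minimizes the square-error criterion E above; the points are made up, and the extreme point illustrates how an outlier can distort a cluster mean:

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points: two compact groups plus one extreme outlier.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [50, 50]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # cluster means m_i
print(kmeans.inertia_)          # E: sum of squared distances to cluster means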
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as
E = Σ(j = 1..k) Σ(p ∈ Cj) |p − oj|
where E is the sum of the absolute error for all objects in the data set and oj is the representative object of cluster Cj.
Case 1:
p currently belongs to representative object oj. If oj is replaced by orandom as a representative object and p is closest to one of the other representative objects oi, i ≠ j, then p is reassigned to oi.
Case 2:
p currently belongs to representative object oj. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.
Case 3:
p currently belongs to representative object oi, i ≠ j. If oj is replaced by orandom as a representative object and p is still closest to oi, then the assignment does not change.
Case 4:
p currently belongs to representative object oi, i ≠ j. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.
Four cases of the cost function for k-medoids clustering
4.4.2 The k-Medoids Algorithm:
A hierarchical clustering method works by grouping data objects into a tree of clusters.
The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. That is, if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it.
We can specify constraints on the objects to be clustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the set of objects to be clustered. It can easily be handled by preprocessing, after which the problem reduces to an instance of unconstrained clustering.
A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm, or ε, the radius, and the minimum number of points in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine tuning and processing are usually not considered a form of constraint-based clustering.
Constraints on distance or similarity functions:
We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects. When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process.
User-specified constraints on the properties of individual clusters:
A user may like to specify desired characteristics of the resulting clusters, which may strongly influence the clustering process.
Semi-supervised clustering based on partial supervision:
The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different cluster). Such a constrained clustering process is called semi-supervised clustering.
There exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information, because one person's noise could be another person's signal. In other words, the outliers may be of particular interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.
It can be used in fraud detection, for example, by detecting unusual usage of credit cards or telecommunication services. In addition, it is useful in customized marketing for identifying the spending behavior of customers with extremely low or extremely high incomes, or in medical analysis for finding unusual responses to various medical treatments.
Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data. The outlier mining problem can be viewed as two subproblems:
Define what data can be considered as inconsistent in a given data set, and
Find an efficient method to mine the outliers so defined.
Types of outlier detection:
Statistical Distribution-Based Outlier Detection
Distance-Based Outlier Detection
Density-Based Local Outlier Detection
Deviation-Based Outlier Detection
In this method, the data space is partitioned into cells with a side length equal to
Each cell has two layers surrounding it. The first layer is one cell thick, while the second is
Let M be the maximum number of outliers that can exist in the dmin-neighborhood of an outlier.
An object, o, in the current cell is considered an outlier only if the cell + 1 layer count is less than or equal to M. If this condition does not hold, then all of the objects in the cell can be removed from further investigation as they cannot be outliers.
If the cell + 2 layers count is less than or equal to M, then all of the objects in the cell are considered outliers. Otherwise, if this number is more than M, then it is possible that some of the objects in the cell may be outliers. To detect these outliers, object-by-object processing is used where, for each object, o, in the cell, objects in the second layer of o are examined. For objects in the cell, only those objects having no more than M points in their dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists of the object's cell, all of its first layer, and some of its second layer.
A variation to the algorithm is linear with respect to n and guarantees that no more than three passes over the data set are required. It can be used for large disk-resident data sets, yet does not scale well for high dimensions.
The k-distance of an object p is the distance, d(p, o), between p and its kth-nearest neighbor o.
That is, there are at most k-1 objects that are closer to p than o. You may be wondering at this
point how k is determined. The LOF method links to density-based clustering in that it sets k
to the parameter MinPts, which specifies the minimum number of points for use in identifying
clusters based on density.
Here, MinPts (as k) is used to define the local neighborhood of an object, p.
The k-distance neighborhood of an object p is denoted N_k-distance(p)(p), or N_k(p) for short. By
setting k to MinPts, we get N_MinPts(p). It contains the MinPts-nearest neighbors of p. That is, it
contains every object whose distance from p is not greater than the MinPts-distance of p.
The reachability distance of an object p with respect to object o (where o is within
the MinPts-nearest neighbors of p) is defined as
reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}.
Intuitively, if an object p is far away from o, then the reachability distance between the two is simply
their actual distance. However, if they are sufficiently close (i.e., where p is within the
MinPts-distance neighborhood of o), then the actual distance is replaced by the MinPts-
distance of o. This helps to significantly reduce the statistical fluctuations of d(p, o) for all of
the p close to o.
The higher the value of MinPts is, the more similar is the reachability distance for objects
within the same neighborhood.
Intuitively, the local reachability density of p is the inverse of the average reachability
distance based on the MinPts-nearest neighbors of p. It is defined as
lrd_MinPts(p) = |N_MinPts(p)| / Σ_{o ∈ N_MinPts(p)} reach_dist_MinPts(p, o).
The local outlier factor (LOF) of p captures the degree to which we call p an outlier.
It is defined as
LOF(p) = ( Σ_{o ∈ N_MinPts(p)} lrd_MinPts(o) / lrd_MinPts(p) ) / |N_MinPts(p)|.
It is the average of the ratio of the local reachability density of p and those of p's
MinPts-nearest neighbors. It is easy to see that the lower p's local reachability density
is, and the higher the local reachability densities of p's MinPts-nearest neighbors are,
the higher LOF(p) is.
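As a minimal, hedged sketch of the LOF idea described above (not part of the original notes), scikit-learn's LocalOutlierFactor can be used, with n_neighbors playing the role of MinPts; the toy data set here is an assumption.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Small 2-D data set with one isolated point
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [1.0, 1.2], [1.2, 1.0], [6.0, 6.0]])

# n_neighbors plays the role of MinPts in the discussion above
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)            # -1 marks outliers, 1 marks inliers

# scikit-learn stores the *negated* LOF values; negate to recover LOF(p)
scores = -lof.negative_outlier_factor_
for point, label, score in zip(X, labels, scores):
    print(point, "outlier" if label == -1 else "inlier", round(score, 2))
```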
4.7.4 Deviation-Based Outlier Detection:
Deviation-based outlier detection does not use statistical tests or distance-based measures to
identify exceptional objects. Instead, it identifies outliers by examining the main
characteristics of objects in a group. Objects that "deviate" from this description are considered
outliers. Hence, in this approach the term deviations is typically used to refer to outliers. In
this section, we study two techniques for deviation-based outlier detection. The first
sequentially compares objects in a set, while the second employs an OLAP data cube
approach.
Dissimilarities are assessed between subsets in the sequence. The technique introduces the
following key terms.
Exception set:
This is the set of deviations or outliers. It is defined as the smallest subset of objects whose
removal results in the greatest reduction of dissimilarity in the residual set.
Dissimilarity function:
This function does not require a metric distance between the objects. It is any function that, if
given a set of objects, returns a low value if the objects are similar to one another. The greater
the dissimilarity among the objects, the higher the value returned by the function. The
dissimilarity of a subset is incrementally computed based on the subset prior to it in the
sequence. Given a subset of n numbers, {x1, …, xn}, a possible dissimilarity function is the
variance of the numbers in the set, that is,
(1/n) Σ_{i=1..n} (xi − x̄)²,
where x̄ is the mean of the n numbers in the set. For character strings, the dissimilarity function
may be in the form of a pattern string (e.g., containing wildcard characters) that is used to cover
all of the patterns seen so far. The dissimilarity increases when the pattern covering all of the
strings in Dj-1 does not cover any string in Dj that is not in Dj-1.
Cardinality function:
This is typically the count of the number of objects in a given set.
Smoothing factor:
This function is computed for each subset in the sequence. It assesses how much the
dissimilarity can be reduced by removing the subset from the original set of objects.
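A toy illustration of the variance dissimilarity function used by the sequential exception technique: removing a strongly deviating object reduces the dissimilarity of the residual set far more than removing an ordinary object. The data values are assumptions, and the cardinality-based smoothing factor itself is omitted here for brevity.

```python
import numpy as np

def dissimilarity(values):
    """Variance of a set of numbers, used as the dissimilarity function."""
    values = np.asarray(values, dtype=float)
    return float(values.var()) if len(values) else 0.0

data = [10, 11, 9, 10, 12, 95]           # 95 is the obvious deviation

# For each candidate element, see how much its removal reduces the
# dissimilarity of the residual set; the element giving the greatest
# reduction is the strongest candidate for the exception set.
for candidate in data:
    residual = [v for v in data if v != candidate]
    reduction = dissimilarity(data) - dissimilarity(residual)
    print(candidate, round(reduction, 2))
```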
Types Of Data In Cluster Analysis Are:
Interval-Scaled Variables
Interval-scaled variables are continuous measurements on a roughly linear scale.
Typical examples include weight and height, latitude and longitude coordinates
(e.g., when clustering houses), and weather temperature.
Binary Variables
A binary variable is a variable that can take only two values.
For example, a gender variable can generally take the two values male and female.
Contingency Table For Binary Data
Let us consider binary values 0 and 1. For two objects i and j:
a = number of variables where i = 1 and j = 1
b = number of variables where i = 1 and j = 0
c = number of variables where i = 0 and j = 1
d = number of variables where i = 0 and j = 0
Let p = a + b + c + d.
Simple matching coefficient (invariant, if the binary variable is symmetric):
d(i, j) = (b + c) / (a + b + c + d)
Jaccard coefficient (noninvariant if the binary variable is asymmetric):
d(i, j) = (b + c) / (a + b + c)
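A small sketch (with assumed example data) of computing the simple matching and Jaccard dissimilarities from the a, b, c, d counts defined above:

```python
def binary_dissimilarity(x, y):
    """Simple matching and Jaccard dissimilarities for two binary vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    simple_matching = (b + c) / (a + b + c + d)            # symmetric binary variables
    jaccard = (b + c) / (a + b + c) if (a + b + c) else 0  # asymmetric binary variables
    return simple_matching, jaccard

# e.g. test results for two patients (1 = positive, 0 = negative)
print(binary_dissimilarity([1, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0]))
```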
Nominal or Categorical Variables
A generalization of the binary variable in that it can take more than 2 states,
e.g., red, yellow, blue, green.
Method 1: Simple matching
The dissimilarity between two objects i and j can be computed based on simple matching:
d(i, j) = (p − m) / p
m: the number of matches (i.e., the number of variables for which i and j are in
the same state).
p: the total number of variables.
Method 2: use a large number of binary variables
Create a new binary variable for each of the M nominal states.
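A minimal sketch of Method 1 (simple matching) for nominal data; the attribute values are assumptions made for the example:

```python
def nominal_dissimilarity(x, y):
    """Simple-matching dissimilarity d(i, j) = (p - m) / p for nominal data."""
    p = len(x)                                        # total number of variables
    m = sum(1 for xi, yi in zip(x, y) if xi == yi)    # number of matches
    return (p - m) / p

# Two objects described by three nominal attributes (colour, shape, size)
print(nominal_dissimilarity(["red", "round", "small"],
                            ["red", "square", "large"]))   # -> 2/3
```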
Ordinal Variables
An ordinal variable can be discrete or continuous.
In this, order is important, e.g., rank.
It can be treated like interval-scaled.
Replace x_if by its rank r_if ∈ {1, …, M_f}.
Map the range of each variable onto [0, 1] by replacing the rank of the i-th object in
the f-th variable with z_if = (r_if − 1) / (M_f − 1).
Then compute the dissimilarity using methods for interval-scaled variables.
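A minimal sketch of the rank-and-normalize treatment of ordinal variables described above; the ordering "poor < fair < good < excellent" is an assumed example scale:

```python
def ordinal_to_interval(values):
    """Map ordinal values to [0, 1] via z = (rank - 1) / (M - 1)."""
    order = ["poor", "fair", "good", "excellent"]     # assumed ordering of the states
    M = len(order)
    ranks = [order.index(v) + 1 for v in values]      # ranks r in {1, ..., M}
    return [(r - 1) / (M - 1) for r in ranks]

scores = ["good", "poor", "excellent"]
print(ordinal_to_interval(scores))   # [0.666..., 0.0, 1.0]
# These values can now be compared with interval-scaled distance measures.
```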
Ratio-Scaled Variables
Ratio-scaled variable: It is a positive measurement on a nonlinear scale,
approximately at an exponential scale, such as Ae^(Bt) or Ae^(−Bt).
MULTIMEDIA MINING
Multimedia data mining is used for extracting interesting information from multimedia data
sets, such as audio, video, images, graphics, speech, text and combinations of several types of
data, which are all converted from different formats into digital media.
Mining in multimedia is also referred to as automatic annotation or annotation mining.
Multimedia data is a combination of a number of media objects (i.e., text, graphics, sound,
animation, video, etc.) that must be presented in a coherent, synchronized manner.
It must contain at least one discrete medium and one continuous medium.
The models which are used to perform multimedia data mining are very important.
Commonly, four different multimedia mining models have been used. These are
classification, association rule, clustering and statistical modelling.
1. Classification: Classification is the process of organizing data into categories for its
better, more effective and efficient use; it creates a function that maps a data item into
one of several predefined classes.
Decision tree classification has a perceptive nature that uses a conceptual model
without loss of exactness.
2. Association Rule: Association Rule is one of the most important data mining
techniques that help find relations between data items in huge databases.
There are two types of associations in multimedia mining: image content and non-
image content features. Mining the frequently occurring patterns between different
images becomes mining the repeated patterns in a set of transactions. Multi-relational
association rule mining displays multiple reports for the same image. In image
classification also, multiple-level association rule techniques are used.
3. Clustering: Cluster analysis divides the data objects into multiple groups or clusters.
Cluster analysis groups objects so that objects within the same cluster are similar to
one another and dissimilar to objects in other clusters.
4. Statistical Modeling: Statistical mining models regulate the statistical validity of test
parameters and have been used to test hypotheses, undertake correlation studies, and
transform and make data for further analysis. This is used to establish links between
words and partitioned image regions to form a simple co-occurrence model.
2. Multidimensional Analysis
A multimedia data cube can have additional dimensions and measures for multimedia data,
such as colour, texture, and shape.
The Multimedia data mining system prototype is Multimedia Miner, the extension of the
DBMiner system that handles multimedia data.
The Image Excavator component of Multimedia Miner uses image contextual information,
like HTML tags on Web pages, to derive keywords. By navigating online directory
structures, like Yahoo! directory, it is possible to build hierarchies of keywords mapped on
the directories in which the image was found.
Classification and predictive analysis have been used for mining multimedia data, particularly
in scientific analysis like astronomy, seismology, and geoscientific analysis.
Decision tree classification is an important method for reported image data mining
applications.
Image data mining classification and clustering are carefully connected to image analysis and
scientific data mining. The image data are frequently in large volumes and need substantial
processing power, such as parallel and distributed processing. Hence, many image analysis
techniques and scientific data analysis methods could be applied to image data mining.
First, an image contains multiple objects, each with various features such as colour, shape,
texture, keyword, and spatial locations, so that many possible associations can be made.
Second, a picture containing multiple repeated objects is essential in image analysis. The
recurrence of similar objects should not be ignored in association analysis.
Third, associations between spatial relationships and multimedia objects can be used to
discover object associations and correlations.
With the associations between multimedia objects, we can treat every image as a transaction
and find commonly occurring patterns among different images.
Multimedia mining architecture is given in the image below. The architecture has several
components. Important components are Input, Multimedia Content, Spatiotemporal
Segmentation, Feature Extraction, Finding Similar Patterns, and Evaluation of Results.
1. The input stage comprises a multimedia database used to find the patterns and
perform the data mining.
2. Multimedia Content is the data selection stage that requires the user to select the
databases, subset of fields, or data for data mining.
3. Spatio-temporal segmentation is nothing but the handling of moving objects in image
sequences in videos, and it is useful for object segmentation.
4. Feature extraction is the preprocessing step that involves integrating data from
various sources and making choices about how to characterize or represent the data.
5. Finding similar patterns is the heart of the whole data mining process. The hidden
patterns and trends in the data are basically uncovered in this stage. Some approaches
to the finding-similar-patterns stage include association, classification, clustering,
regression, time-series analysis and visualization.
6. Evaluation of Results is the stage used to evaluate the results, and it is important for
determining whether the prior stage must be revisited or not. This stage consists of
reporting and using the extracted knowledge to produce new actions, products,
services, or marketing strategies.
Text Mining:
Text mining is a component of data mining that deals specifically with unstructured text
data.
Text Mining is also referred to as Text Data Mining and Knowledge Discovery in Textual
Databases.
It involves the use of natural language processing (NLP) techniques to extract useful
information and insights from large amounts of unstructured text data.
Text mining can be used as a preprocessing step for data mining.
Text mining is widely used in various fields, such as natural language processing, information retrieval,
and social media analysis.
By using text mining, the unstructured text data can be transformed into structured data that
can be used for data mining tasks such as classification, clustering, and association rule
mining.
Process of Text Mining
o Text Pre-processing is a significant task and a critical step in Text Mining, Natural
Language Processing (NLP), and information retrieval (IR).
In the field of text mining, data pre-processing is used for extracting useful
information and knowledge from unstructured text data.
Information Retrieval (IR) is a matter of choosing which documents in a collection
should be retrieved to fulfill the user's need.
o Feature selection:
Feature selection is a significant part of data mining. Feature selection can be defined
as the process of reducing the input of processing or finding the essential information
sources. The feature selection is also called variable selection.
o Data Mining:
Now, in this step, the text mining procedure merges with the conventional process.
Classic Data Mining procedures are used in the structural database.
o Evaluate:
Afterward, the results are evaluated. Once the results have been evaluated, those that
are not useful are abandoned.
o Applications: Customer care, risk management, social media analysis, etc.
Pre-processing and data cleansing tasks are performed to distinguish and eliminate
inconsistency in the data. The data cleansing process makes sure to capture the genuine
text; it is performed to eliminate stop words, apply stemming (the process of identifying the
root of a certain word) and index the data.
Processing and controlling tasks are applied to review and further clean the data set.
Pattern analysis is implemented in Management Information System.
Information processed in the above steps is utilized to extract important and applicable
data for a powerful and convenient decision-making process and trend analysis.
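A simplified, self-contained sketch of the pre-processing steps mentioned above (tokenization, stop-word removal, stemming); the tiny stop-word list and the crude suffix-stripping rule are assumptions, and real systems would use a full NLP library instead:

```python
import re

# A tiny stop-word list for illustration; real systems use much larger lists.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in", "from"}

def preprocess(text):
    """Simplified text pre-processing: tokenize, remove stop words,
    and apply a crude suffix-stripping 'stemmer'."""
    tokens = re.findall(r"[a-z]+", text.lower())              # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]       # stop-word removal
    stems = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]   # naive stemming
    return stems

doc = "Text mining is the process of extracting useful patterns from texts."
print(preprocess(doc))
```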
Text retrieval is the process of transforming unstructured text into a structured format to
identify meaningful patterns and new insights, by using advanced analytical techniques such
as Naïve Bayes, Support Vector Machines (SVM), and deep learning algorithms.
There are two text retrieval methods:
Document Selection − In document selection methods, the query is regarded as specifying
constraints for selecting relevant documents.
A general approach of this category is the Boolean retrieval model, in which a document is
defined by a set of keywords and a user provides a Boolean expression of keywords, such as
car and repair shops, tea or coffee, or database systems but not Oracle.
The retrieval system can take such a Boolean query and return records that satisfy the
Boolean expression. The Boolean retrieval techniques usually only work well when the user
understands a lot about the document set and can formulate the best query in this way.
Document ranking − Document ranking methods use the query to rank all records in order
of relevance.
There are several ranking methods based on a huge spectrum of numerical foundations, such
as algebra, logic, probability, and statistics.
The common intuition behind all of these techniques is to connect the keywords in a query
with those in the records and score each record depending on how well it matches the
query.
The degree of relevance of a record is given by a score computed from information such as
the frequency of words in the document and in the whole document set.
For ordinary users and exploratory queries, these techniques are more suitable than document
selection methods.
The most popular approach of this method is the vector space model.
The basic idea of the vector space model is the following:
Both a document and a query are represented as vectors in a high-dimensional space
corresponding to all the keywords, and an appropriate similarity measure is used to evaluate
the similarity between the query vector and the document vector.
The similarity values can then be used for ranking documents.
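A minimal sketch of the vector space model idea: documents and the query are represented as term-count vectors and ranked by cosine similarity. The documents and query are assumed examples:

```python
import math
from collections import Counter

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-count vectors."""
    common = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in common)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = ["car repair shops", "coffee and tea shops", "database systems"]
query = "car repair"

doc_vectors = [Counter(d.split()) for d in docs]
query_vector = Counter(query.split())

# Rank documents by similarity to the query vector
ranked = sorted(range(len(docs)),
                key=lambda i: cosine_similarity(query_vector, doc_vectors[i]),
                reverse=True)
for i in ranked:
    print(docs[i], round(cosine_similarity(query_vector, doc_vectors[i]), 2))
```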
TF-IDF
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical
method in natural language processing and information retrieval.
It measures how important a term is within a document relative to a collection of
documents (i.e., relative to a corpus).
Words within a text document are transformed into importance numbers by a text
vectorization process. There are many different text vectorization scoring schemes,
with TF-IDF being one of the most common.
Term Frequency: TF of a term is how frequently the term appears in a document (the raw
count, often normalized by the length of the document).
Inverse Document Frequency: IDF of a term reflects the proportion of documents in the
corpus that contain the term.
Words unique to a small percentage of documents receive higher importance values than
words common across all documents.
The resulting TF-IDF score reflects the importance of a term for a document in the corpus.
TF-IDF is useful in many natural language processing applications.
For example, Search Engines use TF-IDF to rank the relevance of a document for a query.
TF-IDF is also employed in text classification, text summarization, and topic modeling.
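A short sketch of computing TF-IDF weights with scikit-learn's TfidfVectorizer (which uses a smoothed variant of the classic TF-IDF formula); the example documents are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "data mining extracts patterns from data",
    "text mining works on unstructured text",
    "search engines rank documents for a query",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)        # documents x terms weight matrix

# Show the highest-weighted terms of the first document
terms = vectorizer.get_feature_names_out()
row = tfidf[0].toarray().ravel()
top = sorted(zip(terms, row), key=lambda x: x[1], reverse=True)[:3]
print(top)
```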
Dimensionality Reduction
Latent Semantic Analysis (LSA) is a natural language processing method that uses a
statistical approach to identify associations among the words in a document.
Singular Value Decomposition:
Singular Value Decomposition is the statistical method that is used to find the latent(hidden)
semantic structure of words spread across the document.
Let C be the collection of documents, d the number of documents, and n the number of
unique words in the whole collection. Then M is a d × n word-to-document matrix.
The SVD decomposes the matrix M, i.e., the word-to-document matrix, into three matrices as
follows:
M = U Σ V^T
A very significant feature of SVD is that it allows us to truncate the contexts that are not
necessarily required.
The Σ matrix provides the diagonal (singular) values, which represent the significance of each
context from highest to lowest.
By using these values we can reduce the dimensions and hence this can be used as a
dimensionality reduction technique too.
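A hedged sketch of using truncated SVD for the dimensionality reduction described above, here via scikit-learn's TruncatedSVD on a small assumed document collection:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors trade stocks",
]

# Build the d x n document-term matrix M described above
tfidf = TfidfVectorizer().fit_transform(docs)

# Keep only the 2 most significant "contexts" (largest singular values)
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)            # d x 2 reduced representation

print(np.round(reduced, 2))                   # pet-related vs finance-related documents
print(np.round(svd.singular_values_, 2))      # diagonal values of the Sigma matrix
```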
Text Mining Approaches in Data Mining:
The following text mining approaches are used in data mining.
Keyword-based association analysis:
It collects sets of keywords or terms that often occur together and then discovers the
association relationships among them.
First, it preprocesses the text data by parsing, stemming, removing stop words, etc.
Once the data have been pre-processed, association mining algorithms are applied. Here,
human effort is not required, so the number of unwanted results and the execution time are
reduced.
Document classification analysis:
This analysis is used for the automatic classification of huge numbers of online text documents,
like web pages, emails, etc.
Text document classification differs from the classification of relational data, as document
databases are not organized according to attribute-value pairs.
What is Web Mining?
● Web mining is the use of data mining techniques to extract knowledge from web data.
● Web data includes :
○ web documents
○ hyperlinks between documents
○ usage logs of web sites
● The WWW is huge, widely distributed, global information service centre and, therefore,
constitutes a rich source for data mining.
Data Mining vs Web Mining
● Data Mining : It is the concept of identifying significant patterns from data that lead to a better
outcome.
● Web Mining : It is the process of performing data mining on the web, extracting web
documents and discovering patterns from them.
Web Data Mining Process
(Figure: Web data mining process; source: https://ieeexplore.ieee.org/document/5485404/)
Issues
● Web data sets can be very large
○ Tens to hundreds of terabytes
● Cannot mine on a single server
○ Need large farms of servers
● Proper organization of hardware and software to mine multi-terabyte data sets
● Difficulty in finding relevant information
● Extracting new knowledge from the web
Web Mining Taxonomy
(Figure: Taxonomy of Web Mining; source: https://www.researchgate.net/figure/Taxonomy-of-Web-Mining-Source-Kavita-et-al-2011_fig1_282357293)
Web Content Mining - Introduction
● Mining, extraction and integration of useful data, information and knowledge from Web page
content.
● Web content mining is related but different from data mining and text mining.
● Web data are mainly semi-structured and/or unstructured, while data mining deals primarily
with structured data.
What is Web Structure Mining?
● Web structure mining is the process of discovering structure information from the web.
● The structure of typical web graph consists of Web pages as nodes, and hyperlinks as edges
connecting between two related pages.
Web Structure Mining (cont.)
● This type of mining can be performed either at the document level (intra-page) or at the
hyperlink level (inter-page).
● The research at the hyperlink level is called Hyperlink analysis.
● Hyperlink structure can be used to retrieve useful information on the web.
● PageRank
● Hubs and Authorities - HITS
PageRank
● Used to discover the most important pages on the web.
● Prioritize pages returned from search by looking at web structure.
● Importance of pages is calculated based on the number of pages which point to it (backlinks).
● Weighting is used to provide more importance to backlinks coming from important pages.
● PR(p) = (1 − d) + d (PR(1)/N1 + … + PR(n)/Nn)
○ PR(i): PageRank of a page i which points to the target page p.
○ Ni: Number of links going out of page i.
○ d: damping factor, a constant value between 0 and 1 (commonly set to 0.85).
○ (1 − d): constant term that guarantees every page a minimum rank; in the normalized form of
the formula it makes the PageRanks of all web pages sum to one.
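A minimal iterative sketch of the PageRank formula given above (the un-normalized form); the tiny web graph is an assumption made for illustration:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterative PageRank following PR(p) = (1 - d) + d * sum(PR(i) / N_i)
    over all pages i that link to p. `links` maps each page to the pages it
    points to; every page is assumed to have at least one outgoing link."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                  # initial ranks
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            backlinks = [q for q in pages if p in links[q]]
            new_pr[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in backlinks)
        pr = new_pr
    return pr

# Tiny web graph: A -> B, A -> C, B -> C, C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print({p: round(r, 3) for p, r in pagerank(graph).items()})
```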
Hubs and Authorities
● Authoritative pages
○ An authority is defined as the best source for the request.
○ Highly important pages.
○ Best source for requested information.
● Hub pages
○ Contains links to highly important pages.
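A minimal sketch of the HITS idea (hub and authority scores reinforcing each other); the graph, iteration count, and normalization are assumptions made for illustration:

```python
import math

def hits(links, iterations=50):
    """Simple HITS: a page's authority score comes from the hubs pointing to it,
    and its hub score comes from the authorities it points to, normalised each round."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority: sum of hub scores of pages linking to p
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # hub: sum of authority scores of the pages p links to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

graph = {"A": ["B", "C"], "B": ["C"], "C": []}
hubs, auths = hits(graph)
print({p: round(v, 2) for p, v in hubs.items()})    # A is the strongest hub
print({p: round(v, 2) for p, v in auths.items()})   # C is the strongest authority
```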
Web Structure Mining applications
● Information retrieval in social networks.
● To find out the relevance of each web page.
● Measuring the completeness of Web sites.
● Used in search engines to find out the relevant information.
Web Usage Mining
● Web usage mining: automatic discovery of patterns in clickstreams and associated data
collected or generated as a result of user interactions with one or more Web sites.
● Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.
● The discovered patterns are usually represented as collections of pages, objects, or resources that
are frequently accessed by groups of users with common interests.
● Data in Web Usage Mining:
a. Web server logs
b. Site contents
c. Data about the visitors, gathered from external channels
Three Phases
Web usage mining proceeds in three phases, preprocessing (data preparation), pattern
discovery, and pattern analysis:
Raw server log → Preprocessed data → Rules and patterns → Interesting knowledge
Data Preparation
● Data cleaning
○ By checking the suffix of the URL name: for example, all log entries with filename
suffixes such as .gif, .jpeg, etc., are removed (see the sketch after this list)
● User identification
○ If a page is requested that is not directly linked to the previous pages, multiple users are
assumed to exist on the same machine
○ Other heuristics involve using a combination of IP address, machine name, browser agent,
and temporal information to identify users
● Transaction identification
○ All of the page references made by a user during a single visit to a site
○ Size of a transaction can range from a single page reference to all of the page references
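A rough sketch of the data cleaning and user identification steps above: drop image requests by URL suffix and group the remaining page references by IP address. The log format regex and the sample log lines are assumptions:

```python
import re
from collections import defaultdict

# Minimal Common Log Format parser: IP address, timestamp, requested URL
LOG_PATTERN = re.compile(r'(\S+) \S+ \S+ \[(.*?)\] "(?:GET|POST) (\S+)')

IGNORED_SUFFIXES = (".gif", ".jpeg", ".jpg", ".png", ".css", ".js")

def clean_and_group(log_lines):
    """Data cleaning and (very rough) user identification: drop image/style
    requests and group the remaining page references by IP address."""
    sessions = defaultdict(list)
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue
        ip, _timestamp, url = match.groups()
        if url.lower().endswith(IGNORED_SUFFIXES):
            continue                      # remove .gif, .jpeg, ... entries
        sessions[ip].append(url)
    return sessions

logs = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:01 +0000] "GET /logo.gif HTTP/1.1" 200 90',
    '10.0.0.2 - - [01/Jan/2024:10:05:00 +0000] "GET /products.html HTTP/1.1" 200 1024',
]
print(dict(clean_and_group(logs)))
```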
Pattern Discovery Tasks
● Clustering and Classification
○ Clustering of users helps to discover groups of users with similar navigation patterns =>
provide personalized Web content
○ Clustering of pages helps to discover groups of pages having related content => search
engines
○ E.g. clients who often access webminer software products tend to be from educational
institutions.
○ clients who placed an online order for software tend to be students in the 20-25 age group
and live in the United States.
○ 75% of clients who download software and visit between 7:00 and 11:00 pm on weekend
are engineering students
Pattern Discovery Tasks
● Sequential Patterns:
○ extract frequently occurring inter-session patterns, such that the presence of a set of items
is followed by another item in time order
○ Used to predict future user visit patterns=>placing ads or recommendations
● Association Rules:
○ Discover correlations among pages accessed together by a client
○ Help the restructure of Web site
○ Develop e-commerce marketing strategies - Grocery Mart
Pattern Analysis Tasks
● Pattern Analysis is the final stage of WUM, which involves the validation and
interpretation of the mined patterns
● Validation:
○ to eliminate the irrelevant rules or patterns and to extract the interesting rules or
patterns from the output of the pattern discovery process
● Interpretation:
○ the output of mining algorithms is mainly in mathematical form and is not suitable for
direct human interpretation