DWDM Notes
Some people treat Knowledge Discovery from Data (KDD) as a synonym for data mining, while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps
involved in the knowledge discovery process −
Data Cleaning − In this step, the noise and inconsistent data is removed.
Data Integration − In this step, multiple data sources are combined.
Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
Data Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation − In this step, data patterns are evaluated.
Knowledge Presentation − In this step, knowledge is represented.
3. DataWarehouse
A data warehouse is defined as a collection of data integrated from multiple
sources that supports querying and decision making.
There are three types of data warehouse: Enterprise data warehouse, Data
Mart and Virtual Warehouse.
Two approaches can be used to integrate data into a data warehouse: the query-
driven approach and the update-driven approach.
Application: Business decision making, Data mining, etc.
4. Transactional Databases
Transactional databases are collections of data organized by time stamps, dates,
etc., to represent transactions in databases.
This type of database has the capability to roll back or undo an operation when a
transaction is not completed or committed.
It is a highly flexible system where users can modify information without changing any
sensitive information.
It follows the ACID properties of a DBMS.
Application: Banking, Distributed systems, Object databases, etc.
5. Multimedia Databases
Multimedia databases consist of audio, video, image and text media.
They can be stored in object-oriented databases.
They are used to store complex information in pre-specified formats.
Application: Digital libraries, video-on demand, news-on demand, musical
database, etc.
6. Spatial Database
Spatial databases store geographical information.
They store data in the form of coordinates, topology, lines, polygons, etc.
Application: Maps, Global positioning, etc.
7. Time-series Databases
Time-series databases contain data such as stock exchange data and user-logged activities.
They handle arrays of numbers indexed by time, date, etc.
They often require real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
The World Wide Web (WWW) is a collection of documents and resources such as
audio, video and text that are identified by Uniform Resource Locators (URLs),
linked by HTML pages, accessed through web browsers, and available over the Internet.
It is the most heterogeneous repository, as it collects data from multiple sources.
It is dynamic in nature, as the volume of data is continuously increasing and changing.
Application: Online shopping, Job search, Research, studying, etc.
a) Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −
1. Class/Concept Description
2. Mining of Frequent Patterns
3. Mining of Associations
4. Mining of Correlations
5. Mining of Clusters
1. Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For
example, in a company, the classes of items for sales include computer and printers, and
concepts of customers include big spenders and budget spenders. Such descriptions of a class
or a concept are called class/concept descriptions. These descriptions can be derived in the
following two ways −
Data Characterization − This refers to summarizing the data of the class under study. This
class under study is called the target class.
Data Discrimination − This refers to the comparison of the target class with one or more
predefined contrasting groups or classes.
4. Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical
correlations between associated attribute-value pairs or between two itemsets, in order to
determine whether they have a positive, negative or no effect on each other.
5. Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from the objects
in other clusters.
3. Decision Trees − A decision tree is a structure that includes a root node, branches,
and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes
the outcome of a test, and each leaf node holds a class label.
6. Outlier Analysis − Outliers may be defined as the data objects that do not comply
with the general behavior or model of the data available.
Note − These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
1. Statistics:
It uses mathematical analysis to express representations, models and summaries of
empirical data or real-world observations.
Statistical analysis involves a collection of methods, applicable to large amounts of
data, used to draw conclusions and report trends.
2. Machine learning
Arthur Samuel defined machine learning as a field of study that gives computers the
ability to learn without being explicitly programmed.
When new data are entered into the computer, machine learning algorithms allow the
model to grow or change accordingly.
In machine learning, an algorithm is constructed to predict outcomes from the available
database (predictive analysis).
It is related to computational statistics.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a
major issue. The increased usage of the Internet and the availability of tools and tricks for
intruding into and attacking networks have prompted intrusion detection to become a critical
component of network administration.
Major Issues in data mining:
Data mining is a dynamic and fast-expanding field with great strengths. The major issues
can be divided into five groups:
a) Mining Methodology
b) User Interaction
c) Efficiency and scalability
d) Diverse Data Types Issues
e) Data mining society
a) Mining Methodology:
It refers to the following kinds of issues −
Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore it is necessary for data mining
to cover a broad range of knowledge discovery tasks.
Mining knowledge in multidimensional space − When searching for knowledge in
large datasets, we can explore the data in multidimensional space.
Handling noisy or incomplete data − Data cleaning methods are required to
handle noise and incomplete objects while mining data regularities. If such data
cleaning methods are not applied, the accuracy of the discovered patterns will be
poor.
Pattern evaluation − The patterns discovered should be interesting; patterns that
merely represent common knowledge or lack novelty are not interesting.
b) User Interaction:
Interactive mining of knowledge at multiple levels of abstraction − The data
mining process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.
Incorporation of background knowledge − To guide discovery process and to
express the discovered patterns, the background knowledge can be used. Background
knowledge may be used to express the discovered patterns not only in concise terms
but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining − Data Mining Query
language that allows the user to describe ad hoc mining tasks, should be integrated
with a data warehouse query language and optimized for efficient and flexible data
mining.
Presentation and visualization of data mining results − Once the patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
2. Binary Attributes: Binary data has only 2 values/states. For Example yes or no,
affected or unaffected, true or false.
i) Symmetric: Both values are equally important (Gender).
ii) Asymmetric: Both values are not equally important (Result).
3. Ordinal Attributes: An ordinal attribute has values with a meaningful sequence or
ranking (order) between them, but the magnitude between successive values is not
known; the order shows what is important but does not indicate how important it is.
Attribute Values
Grade O, S, A, B, C, D, F
5. Discrete: Discrete data have finite values; they can be numerical or categorical.
These attributes have a finite or countably infinite set of values.
Example
Attribute      Values
Profession     Teacher, Businessman, Peon
ZIP Code       521157, 521301
6. Continuous: Continuous data have an infinite number of states and are of float type.
There can be many values between 2 and 3.
Example:
Attribute Values
Height 5.4, 5.7, 6.2, etc.,
Weight 50, 65, 70, 73, etc.,
a) Mean: The mean of a set of N values $x_1, x_2, \ldots, x_N$ is

Mean $= \bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$

Sometimes, each value $x_i$ in the set may be associated with a weight $w_i$, for $i = 1, 2, \ldots, N$;
then the mean is as follows

Mean $= \bar{x} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$

This is called the weighted arithmetic mean or weighted average.
b) Median: The median is the middle value among all values when they are sorted.
For an odd number N of values, the median is the $\frac{N+1}{2}$th value.
For an even number N of values, the median is the average of the $\frac{N}{2}$th and $(\frac{N}{2}+1)$th values.
c) Mode:
The mode is another measure of central tendency; it is the value that occurs most
frequently in the data set.
Datasets with one, two, or three modes are called unimodal, bimodal, and trimodal,
respectively.
A dataset with two or more modes is multimodal. If each value occurs only once,
then there is no mode.
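As a quick illustration of these measures, here is a minimal Python sketch; the sample values and weights are made up for demonstration.

```python
import statistics

values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # illustrative data
weights = [1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1]           # illustrative weights w_i

mean = sum(values) / len(values)
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
median = statistics.median(values)        # averages the two middle values when N is even
modes = statistics.multimode(values)      # all most-frequent values (handles multimodal data)

print(mean, weighted_mean, median, modes)
```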
Data values can also be represented graphically as bar charts, pie charts, line graphs, etc.
Quantile plots:
A quantile plot is a simple and effective way to have a first look at a univariate data
distribution.
Plots quantile information
For data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$
of the data are below or equal to the value $x_i$.
Note that
the 0.25 quantile corresponds to quartile Q1,
the 0.50 quantile is the median, and
the 0.75 quantile is Q3.
Scatter Plot:
A scatter plot is one of the most effective graphical methods for determining if there appears to
be a relationship, clusters of points, or outliers between two numerical attributes.
Each pair of values is treated as a pair of coordinates and plotted as points in the plane.
Data Visualization:
Visualization is the use of computer graphics to create visual images which aid in the
understanding of complex, often massive representations of data.
Categorization of visualization methods:
a) Pixel-oriented visualization techniques
b) Geometric projection visualization techniques
c) Icon-based visualization techniques
d) Hierarchical visualization techniques
e) Visualizing complex data and relations
a) Pixel-oriented visualization techniques
For a data set of m dimensions, create m windows on the screen, one for each
dimension
The m dimension values of a record are mapped to m pixels at the corresponding
positions in the windows
The colors of the pixels reflect the corresponding values
To save space and show the connections among multiple dimensions, space filling is
often done in a circle segment
a) Euclidean Distance
Assume that we have measurements $x_{ik}$, $i = 1, \ldots, N$, on variables $k = 1, \ldots, p$ (also
called attributes).
The Euclidean distance between the $i$th and $j$th objects is
$d_E(i, j) = \left(\sum_{k=1}^{p} (x_{ik} - x_{jk})^2\right)^{1/2}$
Note that λ (the order used in the more general Minkowski distance) and p are two different
parameters; the dimension of the data matrix remains finite.
c) Mahalanobis Distance
Let X be an N × p matrix. Then the $i$th row of X is $x_i^{T} = (x_{i1}, \ldots, x_{ip})$, and the
Mahalanobis distance between the $i$th and $j$th objects is
$d_{MH}(i, j) = \left((x_i - x_j)^{T} S^{-1} (x_i - x_j)\right)^{1/2}$
where S is the p × p sample covariance matrix.
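These distance measures can be computed directly with NumPy/SciPy. The sketch below is illustrative only; the small data matrix is made up.

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

# Illustrative N x p data matrix (N = 4 objects, p = 2 variables)
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 1.0], [5.0, 6.0]])

# Euclidean distance between the 1st and 2nd objects
d_euclid = euclidean(X[0], X[1])            # sqrt of sum_k (x_1k - x_2k)^2

# Mahalanobis distance uses the inverse of the sample covariance matrix
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d_mahal = mahalanobis(X[0], X[1], S_inv)

print(d_euclid, d_mahal)
```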
DATA PREPROCESSING
1. Preprocessing
Real-world databases are highly susceptible to noisy, missing, and inconsistent data
due to their typically huge size (often several gigabytes or more) and their likely origin from
multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results, so
data preprocessing techniques are applied first.
Data Preprocessing Techniques
* Data cleaning can be applied to remove noise and correct inconsistencies in the data.
* Data integration merges data from multiple sources into a coherent data store, such as
a data warehouse.
* Data reduction can reduce the data size by aggregating, eliminating redundant
features, or clustering, for instance. These techniques are not mutually exclusive; they
may work together.
* Data transformations, such as normalization, may be applied.
Need for preprocessing
Incomplete, noisy and inconsistent data are commonplace properties of large real-world
databases and data warehouses.
Incomplete data can occur for a number of reasons:
Attributes of interest may not always be available
Relevant data may not be recorded due to misunderstanding, or because of equipment
malfunctions.
Data that were inconsistent with other recorded data may have been deleted.
Missing data, particularly for tuples with missing values for some attributes, may
need to be inferred.
The data collection instruments used may be faulty.
There may have been human or computer errors occurring at data entry.
Errors in data transmission can also occur.
There may be technology limitations, such as limited buffer size for coordinating
synchronized data transfer and consumption.
Data cleaning routines work to "clean" the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Data integration is the process of integrating multiple databases, data cubes, or files.
Attributes representing a given concept may have different names in different databases,
causing inconsistencies and redundancies.
Data transformation operations, such as normalization and aggregation, are additional
data preprocessing procedures that contribute toward the success of the mining process.
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
2. DATA CLEANING
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying
outliers and correct inconsistencies in the data.
2.1 Missing Values
Many tuples have no recorded value for several attributes, such as customer income, so we
need to fill in the missing values for these attributes.
The following methods are useful for handling missing values in these attributes:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective unless the
tuple contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.
2. Fill in the missing values manually: This approach is time-consuming and may not
be feasible for a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like "unknown" or −∞.
4. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of customers is $56,000. Use this value to replace any missing value
for income.
5. Use the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer attributes in the data set, a decision
tree is constructed to predict the missing values for income. (A short code sketch of
some of these strategies follows below.)
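A hedged sketch of strategies 1, 3 and 4 with pandas; the column names and values are made-up stand-ins.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [56000, np.nan, 42000, np.nan, 70000],
                   "age":    [25, 37, np.nan, 45, 29]})

df_drop = df.dropna()                              # 1. ignore tuples with missing values
df_const = df.fillna({"income": -1})               # 3. fill with a global constant (sentinel)
df_mean = df.fillna(df.mean(numeric_only=True))    # 4. fill with the attribute mean
# 5. a model-based fill (regression or decision tree induction) would instead predict
#    the missing income from the other attributes rather than use a single constant.
print(df_mean)
```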
In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins
may be of equal width, where the interval range of values in each bin is constant.
Example 2: Remove the noise in the following price data (in dollars) using smoothing techniques.
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8,9,15
Bin 2: 21,21,24,25
Bin 3: 26,28,29,34
Smoothing by bin means:
Bin 1: 9,9,9,9
Bin 2: 23,23,23,23
Bin 3: 29,29,29,29
Smoothing by bin boundaries:
Bin 1: 4, 4,4,15
Bin 2: 21,21,25,25
Bin3: 26,26,26,34
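A minimal pure-Python sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, reproducing the example above.

```python
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
depth = 4                                   # equal-frequency (equi-depth) bins of 4 values
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the closer of min/max of its bin
by_bounds = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```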
Regression: Data can be smoothed by fitting the data to a function, such as with regression.
Linear regression involves finding the "best" line to fit two attributes (or variables), so that
one attribute can be used to predict the other. Multiple linear regression is an extension of
linear regression, where more than two attributes are involved and the data are fit to a
multidimensional surface.
Clustering: Outliers may be detected by clustering, where similar values are organized into
groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be
considered outliers.
3. Data Integration
Data mining often requires data integration − the merging of data from multiple data stores
into a coherent data store, as in data warehousing. These sources may include multiple
databases, data cubes, or flat files.
Issues in Data Integration
a) Schema integration & object matching.
b) Redundancy.
c) Detection & Resolution of data value conflict
a) Schema Integration & Object Matching
Schema integration and object matching can be tricky because the same entity can be
represented in different forms in different tables. This is referred to as the entity identification
problem. Metadata can be used to help avoid errors in schema integration. The metadata may
also be used to help transform the data.
b) Redundancy:
Redundancy is another important issue: an attribute (such as annual revenue, for
instance) may be redundant if it can be "derived" from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting
data set. Some redundancies can be detected by correlation analysis and covariance analysis.
For nominal data, we use the χ² (chi-square) test.
For numeric attributes, we can use the correlation coefficient and covariance.
Correlation analysis for nominal data (χ² test):
For nominal data, a correlation relationship between two attributes, A and B, can be
discovered by a χ² (chi-square) test. Suppose A has c distinct values, namely $a_1, a_2, a_3,
\ldots, a_c$, and B has r distinct values, namely $b_1, b_2, b_3, \ldots, b_r$. The data tuples are
described by a contingency table.
The χ² value is computed as
$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
where $o_{ij}$ is the observed frequency of the joint event $(A_i, B_j)$ and $e_{ij}$ is the expected
frequency of $(A_i, B_j)$, which can be computed as
$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$
For example, consider the following contingency table:

              Male    Female    Total
Fiction        250      200       450
Non_fiction     50     1000      1050
Total          300     1200      1500

$e_{11} = \frac{count(male) \times count(fiction)}{n} = \frac{300 \times 450}{1500} = 90$
$e_{12} = \frac{count(male) \times count(non\_fiction)}{n} = \frac{300 \times 1050}{1500} = 210$
$e_{21} = \frac{count(female) \times count(fiction)}{n} = \frac{1200 \times 450}{1500} = 360$
$e_{22} = \frac{count(female) \times count(non\_fiction)}{n} = \frac{1200 \times 1050}{1500} = 840$
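The same χ² computation can be done with SciPy. A minimal sketch using the contingency table above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: fiction / non_fiction; columns: male / female (observed counts from the example)
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)   # [[90, 360], [210, 840]] -- the same expected frequencies as above
print(chi2)       # a large chi-square value indicates the two attributes are correlated
```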
4. Data Reduction:
Obtain a reduced representation of the data set that is much smaller in volume but yet
produces the same (or almost the same) analytical results.
Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
Attribute subset selection: Greedy (heuristic) methods make locally optimal choices in the
hope that these will lead to a globally optimal solution. Many other attribute evaluation
measures can be used, such as the information gain measure used in building decision trees
for classification. Basic heuristic methods of attribute subset selection include the following:
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of original attributes is determined and added to the reduced set. At
each subsequent iteration or step, the best of the remaining original attributes is added to the
set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each
step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the
procedure selects the best attribute and removes the worst from among the remaining
attributes.
4. Decision tree induction: Decision tree induction constructs a flowchart-like structure
where each internal node denotes a test on an attribute, each branch corresponds to an
outcome of the test, and each leaf node denotes a class prediction. At each node, the
algorithm chooses the "best" attribute to partition the data into individual classes. A tree is
constructed from the given data. All attributes that do not appear in the tree are assumed to be
irrelevant; the set of attributes appearing in the tree form the reduced subset of attributes.
A threshold measure can be used as the stopping criterion.
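A hedged sketch of stepwise forward selection using scikit-learn's SequentialFeatureSelector; the iris data set is only a stand-in for any class-labeled feature matrix.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Greedy forward selection: start from an empty set and add the best attribute each step
selector = SequentialFeatureSelector(DecisionTreeClassifier(random_state=0),
                                     n_features_to_select=2, direction="forward")
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected (reduced) attribute set

# direction="backward" would instead start from the full set and remove the worst attribute
```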
4.3Numerosity Reduction:
Numerosity reduction is used to reduce the data volume by choosing alternative, smaller
forms of the data representation
Techniques for Numerosity reduction:
Parametric - In this model only the data parameters need to be stored, instead of the
actual data. (e.g.,) Log-linear models, Regression
Parametric model
1. Regression
Linear regression
In linear regression, the data are modeled to fit a straight line. For example, a
random variable Y (called a response variable) can be modeled as a linear
function of another random variable X (called a predictor variable), with the
equation Y = αX + β,
where the variance of Y is assumed to be constant. The coefficients α and β
(called regression coefficients) specify the slope of the line and the Y-intercept,
respectively.
Multiple- linear regression
Multiple linear regression is an extension of (simple) linear regression, allowing a
response variable Y, to be modeled as a linear function of two or more predictor
variables.
2. Log-Linear Models
Log-Linear Models can be used to estimate the probability of each point in a
multidimensional space for a set of discretized attributes, based on a smaller
subset of dimensional combinations.
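A minimal NumPy sketch of the two regression models above; the sample points are made up.

```python
import numpy as np

# Simple linear regression: fit Y = alpha*X + beta by least squares
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
alpha, beta = np.polyfit(X, Y, deg=1)            # slope (alpha) and Y-intercept (beta)
print(alpha, beta)

# Multiple linear regression: Y as a linear function of two predictor variables
X2 = np.array([0.5, 1.0, 0.8, 1.5, 2.0])
A = np.column_stack([X, X2, np.ones_like(X)])    # design matrix with an intercept column
coef, residuals, rank, sv = np.linalg.lstsq(A, Y, rcond=None)
print(coef)                                      # [w1, w2, intercept]
```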
Nonparametric Model
1. Histograms
A histogram for an attribute A partitions the data distribution of A into disjoint
subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the
buckets are called singleton buckets.
Ex: The following data are a list of prices of commonly sold items at All Electronics. The
numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
2. Clustering
Clustering technique consider data tuples as objects. They partition the objects into
groups or clusters, so that objects within a cluster are similar to one another and dissimilar to
objects in other clusters. Similarity is defined in terms of how close the objects are in space,
based on a distance function. The quality of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster. Centroid distance is an alternative
measure of cluster quality and is defined as the average distance of each cluster object from
the cluster centroid.
3. Sampling:
Sampling can be used as a data reduction technique because it allows a large data set
to be represented by a much smaller random sample (or subset) of the data. Suppose that a
large data set, D, contains N tuples. One possible sample is a Simple Random Sample WithOut
Replacement (SRSWOR) of size n: this is created by drawing n of the N tuples from D (n < N),
where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely
to be sampled.
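A minimal sketch of SRSWOR in Python (the with-replacement variant is shown alongside for comparison); D here is just an illustrative list of tuple ids.

```python
import random

D = list(range(1, 101))   # stand-in for a data set of N = 100 tuples
n = 10

srswor = random.sample(D, n)                     # without replacement: n distinct tuples
srswr = [random.choice(D) for _ in range(n)]     # with replacement: duplicates possible
print(srswor, srswr)
```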
1. The length, L, of the input data vector must be an integer power of 2. This condition
can be met by padding the data vector with zeros as necessary (L >=n).
2. Each transform involves applying two functions
The first applies some data smoothing, such as a sum or weighted average.
The second performs a weighted difference, which acts to bring out the detailed
features of data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of
measurements (X2i , X2i+1). This results in two sets of data of length L/2. In general,
these represent a smoothed or low-frequency version of the input data and high
frequency content of it, respectively.
4. The two functions are recursively applied to the sets of data obtained in the previous
loop, until the resulting data sets obtained are of length 2.
In the figure, Y1 and Y2 are the principal components for the given set of data originally
mapped to the axes X1 and X2. This information helps identify groups or patterns within the
data. The sorted axes are such that the first axis shows the most variance among the data, the
second axis shows the next highest variance, and so on.
The size of the data can be reduced by eliminating the weaker components.
Advantage of PCA
PCA is computationally inexpensive
Multidimensional data of more than two dimensions can be handled by reducing the
problem to two dimensions.
Principal components may be used as inputs to multiple regression and cluster analysis.
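A hedged sketch of PCA with scikit-learn; the data matrix is a random stand-in.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # stand-in data: 100 tuples, 5 attributes

pca = PCA(n_components=2)              # keep only the two strongest components
X_reduced = pca.fit_transform(X)       # project the data onto Y1, Y2

print(pca.explained_variance_ratio_)   # variance captured by each retained component
print(X_reduced.shape)                 # (100, 2): the reduced representation
```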
Min-max normalization maps a value v of an attribute A to v' in the range [new_min_A, new_max_A]
by computing
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Min-max normalization preserves the relationships among the original data values. It will
encounter an "out-of-bounds" error if a future input case for normalization falls outside of the
original data range for A.
Example (min-max normalization): Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively. We would like to map income
to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income
is transformed to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
b) Z-Score Normalization
The values for an attribute, A, are normalized based on the mean (i.e., average) and standard
deviation of A. A value, v_i, of A is normalized to v_i' by computing
v_i' = (v_i − Ā) / σ_A
where Ā and σ_A are the mean and standard deviation of A, respectively.
c) Normalization by Decimal Scaling
Decimal scaling normalizes by moving the decimal point of the values of A: a value v_i is
normalized to v_i' = v_i / 10^j, where j is the smallest integer such that max(|v_i'|) < 1.
Example (decimal scaling): Suppose that the recorded values of A range from −986 to 917.
The maximum absolute value of A is 986. To normalize by decimal scaling, we
therefore divide each value by 1,000 (i.e., j = 3) so that −986 normalizes to −0.986 and
917 normalizes to 0.917.
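A minimal NumPy sketch of the three normalization methods, reproducing the worked figures above.

```python
import numpy as np

income = np.array([12000.0, 73600.0, 98000.0])

# Min-max normalization to [0.0, 1.0]
mn, mx = income.min(), income.max()
minmax = (income - mn) / (mx - mn) * (1.0 - 0.0) + 0.0      # 73,600 -> ~0.716

# Z-score normalization
zscore = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j so the largest absolute value is below 1
A = np.array([-986.0, 917.0])
j = int(np.ceil(np.log10(np.abs(A).max())))                  # j = 3 here
decimal_scaled = A / 10 ** j                                 # -0.986 and 0.917

print(minmax, zscore, decimal_scaled)
```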
state their partial ordering. The system can then try to automatically generate the
attribute ordering so as to construct a meaningful concept hierarchy.
4. Specification of only a partial set of attributes: Sometimes a user can be
careless when defining a hierarchy, or have only a vague idea about what should be
included in a hierarchy. Consequently, the user may have included only a small subset
of the relevant attributes in the hierarchy specification.
Data cleaning routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data.
Data integration combines data from multiple sources to form a coherent data store.
The resolution of semantic heterogeneity, metadata, correlation analysis, tuple
duplication detection, and data conflict detection contribute to smooth data integration.
Data reduction techniques obtain a reduced representation of the data while
minimizing the loss of information content. These include methods of
dimensionality reduction, numerosity reduction, and data compression.
Data transformation routines convert the data into appropriate forms for mining. For
example, in normalization, attribute data are scaled so as to fall within a small range
such as 0.0 to 1.0. Other examples are data discretization and concept hierarchy
generation.
Data discretization transforms numeric data by mapping values to interval or
concept labels. Such methods can be used to automatically generate concept
hierarchies for the data, which allows for mining at multiple levels of granularity.
DATA CLASSIFICATION
Classification is a form of data analysis that extracts models describing important data
classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels.
For example, we can build a classification model to categorize bank loan applications as
either safe or risky. Such analysis can help provide us with a better understanding of the data
at large. Many classification methods have been proposed by researchers in machine learning,
pattern recognition, and statistics.
During tree construction, attribute selection measures are used to select the attribute
that best partitions the tuples into distinct classes. When decision trees are built, many of the
branches may reflect noise or outliers in the training data. Tree pruning attempts to identify
and remove such branches, with the goal of improving classification accuracy on unseen data.
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning,
developed a decision tree algorithm known as ID3 (Iterative Dichotomiser).
This work expanded on earlier work on concept learning systems, described by E. B. Hunt, J.
Marin, and P. T. Stone. Quinlan later presented C4.5 (a successor of ID3), which became a
benchmark to which newer supervised learning algorithms are often compared.
In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone)
published the book Classification and Regression Trees (CART), which described the
generation of binary decision trees.
Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data
partition, D.
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the splitting criterion that “best”
partitions the data tuples into individual classes. This criterion consists of a splitting
attribute and, possibly, either a split-point or splitting subset.
Output: A decision tree.
Method:
1) create a node N;
2) if tuples in D are all of the same class, C, then
3) return N as a leaf node labeled with the class C;
4) if attribute list is empty then
5) return N as a leaf node labeled with the majority class in D; // majority voting
6) apply Attribute selection method(D, attribute list) to find the “best” splitting
criterion;
7) label node N with splitting criterion;
8) if splitting attribute is discrete-valued and
multiway splits allowed then // not restricted to binary trees
9) attribute list ← attribute list − splitting attribute; // remove splitting attribute
10) for each outcome j of splitting criterion
// partition the tuples and grow subtrees for each partition
11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
12) if Dj is empty then
13) attach a leaf labeled with the majority class in D to node N;
14) else attach the node returned by Generate decision tree(Dj , attribute list) to node N;
endfor
15) return N;
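The pseudocode above is the generic recursive procedure; in practice a library implementation is often used. A hedged sketch with scikit-learn, where the integer-coded rows are a small stand-in for a buys_computer-style training set:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in training data: each row is [age_code, income_code, student, credit_code]
X = [[0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0],
     [2, 0, 1, 0], [2, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]   # buys_computer labels

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)  # information-gain style
clf.fit(X, y)
print(export_text(clf, feature_names=["age", "income", "student", "credit_rating"]))
```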
Binary Attributes: The test condition for a binary attribute generates two potential
outcomes.
Nominal Attributes: These can have many values and can be represented in two ways.
Ordinal Attributes: These can produce binary or multiway splits. The values can be grouped
as long as the grouping does not violate the order property of the attribute values.
The expected information needed to classify a tuple in D is given by
$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
where $p_i$ is the nonzero probability that an arbitrary tuple in D belongs to class $C_i$ and is
estimated by $|C_{i,D}|/|D|$. A log function to the base 2 is used because the information is
encoded in bits. Info(D) is also known as the entropy of D.
Information gain is defined as the difference between the original information requirement
(i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after
partitioning on A). That is,
$Gain(A) = Info(D) - Info_A(D)$, where $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
The attribute A with the highest information gain, Gain(A), is chosen as the
splitting attribute at node N. This is equivalent to saying that we want to partition on the
attribute A that would do the "best classification," so that the amount of information still
required to finish classifying the tuples is minimal.
3.6.2 Gain Ratio
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio,
which attempts to overcome the bias of information gain toward attributes with many values.
It applies a kind of normalization to information gain using a "split information" value
defined analogously to Info(D) as
$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$
This value represents the potential information generated by splitting the training data set, D,
into v partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each
outcome, it considers the number of tuples having that outcome with respect to the total
number of tuples in D. It differs from information gain, which measures the information with
respect to classification that is acquired based on the same partitioning. The gain ratio is
defined as
$GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}$
3.6.3 Gini Index
The Gini index measures the impurity of D as
$Gini(D) = 1 - \sum_{i=1}^{m} p_i^2$
where $p_i$ is the nonzero probability that an arbitrary tuple in D belongs to class $C_i$ and
is estimated by $|C_{i,D}|/|D|$ over m classes.
Note: The Gini index considers a binary split for each attribute.
When considering a binary split, we compute a weighted sum of the impurity of
each resulting partition. For example, if a binary split on A partitions D into D1 and D2, the
Gini index of D given that partitioning is
$Gini_A(D) = \frac{|D_1|}{|D|}\,Gini(D_1) + \frac{|D_2|}{|D|}\,Gini(D_2)$
For each attribute, each of the possible binary splits is considered. For a discrete-valued
attribute, the subset that gives the minimum Gini index for that attribute is selected as its
splitting subset.
For continuous-valued attributes, each possible split-point must be considered. The
strategy is similar to that described earlier for information gain, where the midpoint
between each pair of (sorted) adjacent values is taken as a possible split-point.
The reduction in impurity that would be incurred by a binary split on a discrete- or
continuous-valued attribute A is
$\Delta Gini(A) = Gini(D) - Gini_A(D)$
The attribute that maximizes the reduction in impurity (equivalently, that has the minimum
Gini index) is selected as the splitting attribute.
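A minimal sketch of the entropy and Gini measures as plain Python functions; the class counts passed in are illustrative.

```python
from math import log2

def info(counts):
    """Entropy Info(D) from class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini(D) = 1 - sum_i p_i^2."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_binary_split(d1_counts, d2_counts):
    """Gini_A(D) for a binary split of D into partitions D1 and D2."""
    n1, n2 = sum(d1_counts), sum(d2_counts)
    n = n1 + n2
    return n1 / n * gini(d1_counts) + n2 / n * gini(d2_counts)

# Illustrative usage: D holds 9 'yes' and 5 'no' tuples; one candidate binary split of D
print(round(info([9, 5]), 3))                        # expected information, ~0.940
print(round(gini([9, 5]), 3))                        # Gini(D), ~0.459
print(round(gini_binary_split([6, 1], [3, 4]), 3))   # weighted Gini of the candidate split
```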
This set is independent of the training set used to build the unpruned tree and of any
test set used for accuracy estimation.
The algorithm generates a set of progressively pruned trees. In general, the smallest
decision tree that minimizes the cost complexity is preferred.
C4.5 uses a method called pessimistic pruning, which is similar to the cost
complexity method in that it also uses error rate estimates to make decisions regarding
subtree pruning.
3.8 Scalability of Decision Tree Induction:
"What if D, the disk-resident training set of class-labeled tuples, does not fit in
memory? In other words, how scalable is decision tree induction?" The efficiency of existing
decision tree algorithms, such as ID3, C4.5, and CART, has been well established for
relatively small data sets. Efficiency becomes an issue of concern when these algorithms are
applied to the mining of very large real-world databases. The pioneering decision tree
algorithms that we have discussed so far have the restriction that the training tuples should
reside in memory.
In data mining applications, very large training sets of millions of tuples are
common. Most often, the training data will not fit in memory! Therefore, decision tree
construction becomes inefficient due to swapping of the training tuples in and out of main
and cache memories. More scalable approaches, capable of handling training data that are too
large to fit in memory, are required. Earlier strategies to "save space" included discretizing
continuous-valued attributes and sampling data at each node. These techniques, however, still
assume that the training set can fit in memory.
Several scalable decision tree induction methods have been introduced in recent
studies. RainForest, for example, adapts to the amount of main memory available and
applies to any decision tree induction algorithm. The method maintains an AVC-set
(where "AVC" stands for "Attribute-Value, Classlabel") for each attribute, at each tree
node, describing the training tuples at the node. The AVC-set of an attribute A at node N gives
the class label counts for each value of A for the tuples at N. The set of all AVC-sets at a node
N is the AVC-group of N. The size of an AVC-set for attribute A at node N depends only on
the number of distinct values of A and the number of classes in the set of tuples at N.
Typically, this size should fit in memory, even for real-world data. RainForest also has
techniques, however, for handling the case where the AVC-group does not fit in memory.
Therefore, the method has high scalability for decision tree induction in very large data sets.
Solution:
Here the target class is buys_computer and its values are yes and no. Using the ID3
algorithm, we construct a decision tree.
For the ID3 algorithm, we calculate the information gain attribute selection measure.
Class P: buys_computer = "yes" → 9 tuples; Class N: buys_computer = "no" → 5 tuples; total 14 tuples.

$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

For age = "youth" (2 yes, 3 no): $I(2,3) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.970$
For age = "middle_aged" (4 yes, 0 no): $I(4,0) = 0$
For age = "senior" (3 yes, 2 no): $I(3,2) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.970$

$Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694$, so $Gain(age) = 0.940 - 0.694 = 0.246$.

Finally, since age has the highest information gain among the attributes, it is selected as the
splitting attribute. Node N is labeled with age, and branches are grown for each of the
attribute's values. The tuples are then partitioned accordingly (these figures are re-derived in
the sketch below).
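The figures above can be re-derived with a few lines of Python; this sketch only recomputes Info(D), the per-partition entropies for age, and Gain(age).

```python
from math import log2

def I(*counts):
    """Expected information (entropy) for the given class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

info_D = I(9, 5)
info_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
gain_age = info_D - info_age

print(round(I(2, 3), 3), round(I(4, 0), 3), round(I(3, 2), 3))
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
# Info(D) ~ 0.940, Info_age(D) ~ 0.694, Gain(age) ~ 0.247 (0.246 with rounded intermediates)
```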
Bayesian Classification:
Bayesian classifiers are statistical classifiers.
They can predict class membership probabilities, such as the probability that a given
tuple belongs to a particular class.
Bayesian classification is based on Bayes’ theorem.
Bayes’ Theorem:
Let X be a data tuple. In Bayesian terms, X is considered the "evidence" and it is
described by measurements made on a set of n attributes.
Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the
hypothesis H holds given the "evidence" or observed data tuple X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes’ theorem is useful in that it provides a way of calculating the posterior
probability, P(H|X), from P(H), P(X|H), and P(X).
$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are
equally likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize
P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci),
the naive assumption of class conditional independence is made. This presumes that
the values of the attributes are conditionally independent of one another, given the
class label of the tuple. Thus,
$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$
5. We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), : : : , P(xn|Ci) from the
training tuples.
6. For each attribute, we look at whether the attribute is categorical or continuous-
valued. For instance, to compute P(X|Ci), we consider the following:
If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the
value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
If Ak is continuous-valued, then we need to do a bit more work, but the calculation is
pretty straightforward.
Example:
We wish to predict the class label of a tuple using naïve Bayesian classification, given
the same training data above. The training data were shown above in Table. The data tuples
are described by the attributes age, income, student, and credit rating. The class label
attribute, buys computer, has two distinct values (namely, {yes, no}). Let C1 correspond to
the class buys computer=yes and C2 correspond to buys computer=no. The tuple we wish to
classify is
X={age= “youth”, income= “medium”, student= “yes”, credit_rating= “fair”}
We need to maximize P(X|Ci)P(Ci), for i=1,2. P(Ci), the prior probability of each
class, can be computed based on the training tuples:
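A hedged sketch of the naive Bayesian computation in plain Python; the training tuples below are a small made-up stand-in, not the full table from the text.

```python
from collections import Counter

# Made-up stand-in training tuples: (age, income, student, credit_rating, buys_computer)
train = [("youth", "high", "no", "fair", "no"),
         ("youth", "medium", "yes", "fair", "yes"),
         ("middle_aged", "high", "no", "fair", "yes"),
         ("senior", "medium", "yes", "fair", "yes"),
         ("senior", "low", "yes", "excellent", "no")]
attrs = ["age", "income", "student", "credit_rating"]

# Tuple to classify
X = {"age": "youth", "income": "medium", "student": "yes", "credit_rating": "fair"}

class_counts = Counter(t[-1] for t in train)
scores = {}
for c, n_c in class_counts.items():
    score = n_c / len(train)                    # P(Ci), estimated from the training tuples
    for k, attr in enumerate(attrs):
        match = sum(1 for t in train if t[-1] == c and t[k] == X[attr])
        score *= match / n_c                    # P(xk|Ci), class-conditional independence
    scores[c] = score                           # proportional to P(X|Ci) P(Ci)

print(max(scores, key=scores.get), scores)      # predicted class and the raw scores
```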
For example, having lung cancer is influenced by a person’s family history of lung
cancer, as well as whether or not the person is a smoker. Note that the variable PositiveXRay
is independent of whether the patient has a family history of lung cancer or is a smoker, given
that we know the patient has lung cancer.
In other words, once we know the outcome of the variable LungCancer, then the
variables FamilyHistory and Smoker do not provide any additional information regarding
PositiveXRay. The arcs also show that the variable LungCancer is conditionally independent
of Emphysema, given its parents, FamilyHistory and Smoker.
A belief network has one conditional probability table (CPT) for each variable. The
CPT for a variable Y specifies the conditional distribution P(Y|Parents(Y)), where Parents(Y)
are the parents of Y. Figure (b) shows a CPT for the variable LungCancer. The conditional
probability for each known value of LungCancer is given for each possible combination of
the values of its parents. For instance, the upper leftmost and bottom rightmost entries give
P(LungCancer = yes | FamilyHistory = yes, Smoker = yes) and
P(LungCancer = no | FamilyHistory = no, Smoker = no), respectively.
Association Analysis
Association:
Association mining aims to extract interesting correlations, frequent patterns,
associations or causal structures among sets of items or objects in transaction databases,
relational databases or other data repositories. Association rules are widely used in various
areas such as telecommunication networks, market and risk management, inventory control,
cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Examples:
Rule form: Body ⇒ Head [support, confidence]
buys(X, "Computer") ⇒ buys(X, "Software") [40%, 50%]
Association Rule:
An association rule is an implication expression of the form X ⇒ Y, where X and Y
are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be
measured in terms of its support and confidence. Support determines how often a rule is
applicable to a given data set, while confidence determines how frequently items in Y
appear in transactions that contain X. The formal definitions of these metrics are
Support, $s(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{N}$
Confidence, $c(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}$
Why Use Support and Confidence? Support is an important measure because a rule
that has very low support may occur simply by chance. A low support rule is also likely to
be uninteresting from a business perspective because it may not be profitable to promote items
that customers seldom buy together. For these reasons, support is often used to eliminate
uninteresting rules.
Confidence, on the other hand, measures the reliability of the inference made by a
rule. For a given rule X ⇒ Y, the higher the confidence, the more likely it is for Y to be present
in transactions that contain X. Confidence also provides an estimate of the conditional
probability of Y given X.
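A minimal sketch that computes support and confidence for a rule from a list of transactions; the transactions are illustrative.

```python
transactions = [{"computer", "software"}, {"computer", "printer"},
                {"computer", "software", "printer"}, {"software"}, {"printer"}]

def support(itemset):
    """sigma(itemset) / N: fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y):
    """c(X => Y) = sigma(X u Y) / sigma(X)."""
    return support(X | Y) / support(X)

print(support({"computer", "software"}))          # s(computer => software)
print(confidence({"computer"}, {"software"}))     # c(computer => software)
```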
a) Apriori Algorithm:
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for
mining frequent itemsets for Boolean association rules. The name of the algorithm is based
on the fact that the algorithm uses prior knowledge of frequent itemset properties, as we shall
see later. Apriori employs an iterative approach known as a level-wise search, where k-
itemsets are used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate
the count for each item, and collecting those items that satisfy minimum support. The
resulting set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets,
which is used to find L3, and so on, until no more frequent k-itemsets can be found. The
finding of each Lk requires one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an
important property called the Apriori property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an
itemset I does not satisfy the minimum support threshold, min sup, then I is not frequent, that
is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A)
cannot occur more frequently than I. Therefore, I ∪ A is not frequent either, that is, P(I ∪ A) <
min_sup.
This property belongs to a special category of properties called antimonotonicity in
the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It is
called antimonotonicity because the property is monotonic in the context of failing a test.
A two-step process is followed, consisting of join and prune actions.
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with
itself. This set of candidates is denoted Ck.
2. The prune step: Ck is a superset of Lk, that is, its members may or may not be frequent,
but all of the frequent k-itemsets are included in Ck. A database scan to determine the count
of each candidate in Ck would result in the determination of Lk.
Example:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-
itemsets, C1. The algorithm simply scans all of the transactions to count the number of
occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. (Here, we
are referring to absolute support because we are using a support count. The
corresponding relative support is 2/9 = 22%.) The set of frequent 1-itemsets, L1, can
then be determined. It consists of the candidate 1-itemsets satisfying minimum
support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to
generate a candidate set of 2-itemsets, C2. C2 consists of 2-itemsets. Note that no
candidates are removed from C2 during the prune step because each subset of the
candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate
itemset in C2 is accumulated, as shown in the middle table of the second row in Figure
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-
itemsets in C2 having minimum support.
6. The generation of the set of the candidate 3-itemsets, C3, is detailed in Figure From
the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2,
I3, I4}, {I2, I3, I5}, {I2, I4, I5}} Based on the Apriori property that all subsets of a
frequent itemset must also be frequent, we can determine that the four latter
candidates cannot possibly be frequent. We therefore remove them from C3, thereby
saving the effort of unnecessarily obtaining their counts during the subsequent scan of
D to determine L3.
7. The transactions in D are scanned to determine L3, consisting of those candidate 3-
itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although
the join results in {I1, I2, I3, I5}, this itemset is pruned because its subset
{I2, I3, I5} is not frequent. Thus, C4 = ∅, and the algorithm terminates, having found
all of the frequent itemsets.
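A compact, illustrative Apriori sketch in Python following the level-wise join/prune scheme described above; min_sup is an absolute support count, and the transaction list is a stand-in chosen to be consistent with the item support counts quoted in the FP-growth discussion below.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_sup}
    frequent = {s: sum(s <= t for t in transactions) for s in Lk}
    k = 2
    while Lk:
        # Join step: candidate k-itemsets built from frequent (k-1)-itemsets
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset (Apriori property)
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One database scan to count supports, then keep the frequent candidates
        counts = {c: sum(c <= t for t in transactions) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

txns = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
        ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
        ["I1", "I2", "I3"]]
print(apriori(txns, min_sup=2))
```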
b) FP-Growth:
The first scan of the database is the same as Apriori, which derives the set of frequent
items (1-itemsets) and their support counts (frequencies). Let the minimum support count be
2. The set of frequent items is sorted in the order of descending support count. This resulting
set or list is denoted by L. Thus, we have L = {{I2:7}, {I1:6}, {I3:6}, {I4:2}, {I5:2}}
An FP-tree is then constructed as follows. First, create the root of the tree, labeled
with “null.” Scan database D a second time. The items in each transaction are processed in L
order (i.e., sorted according to descending support count), and a branch is created for each
transaction.
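A small sketch of the preprocessing step just described: counting item supports, building the L order, and rewriting each transaction in descending support order before it is inserted into the FP-tree. The transaction list is the same stand-in used in the Apriori sketch above.

```python
from collections import Counter

txns = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
        ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
        ["I1", "I2", "I3"]]
min_sup = 2

counts = Counter(i for t in txns for i in t)
L = [i for i, c in counts.most_common() if c >= min_sup]   # e.g. I2, I1, I3, I4, I5
rank = {i: r for r, i in enumerate(L)}

# Each transaction, filtered to frequent items and sorted in L order,
# becomes one branch (or a shared prefix of a branch) of the FP-tree.
ordered = [sorted((i for i in t if i in rank), key=rank.get) for t in txns]
print(L)
print(ordered[0])   # ['I2', 'I1', 'I5']
```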
The FP-tree is mined as follows. Start from each frequent length-1 pattern (as an
initial suffix pattern), construct its conditional pattern base (a “sub-database,” which
consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then
construct its (conditional) FP-tree, and perform mining recursively on the tree. The pattern
growth is achieved by the concatenation of the suffix pattern with the frequent patterns
generated from a conditional FP-tree.
Finally, we can conclude that frequent itemsets are {I2, I1, I5} and {I2, I1, I3}.