Key For DM II Mid 2017 May Set A
Part A
(Compulsory)
1. a. Write a note on attribute oriented induction.
Attribute-Oriented Induction (AOI) is a data mining technique that produces simplified,
descriptive patterns. Classical AOI uses a predictive strategy to determine the distinct values of
an attribute, but generalises attributes indiscriminately, i.e. the top value 'ANY' is treated like
any other value, without restriction. AOI produces interesting rules only when interior
concepts of the attribute hierarchies are used. Attribute-oriented induction is not confined to
categorical data, nor to particular measures.
Procedure:
Collect the task-relevant data (initial relation) using a relational database query.
Perform generalization by attribute removal or attribute generalization.
Apply aggregation by merging identical, generalized tuples and accumulating their
respective counts.
Present the results interactively to the user.
Key points in Attribute Oriented Induction
Data focusing: task-relevant data, including dimensions; the result is the initial relation.
Attribute removal: remove attribute A if there is a large set of distinct values for A but (1)
there is no generalization operator on A, or (2) A's higher-level concepts are expressed in
terms of other attributes.
Attribute generalization: if there is a large set of distinct values for A, and there exists a set
of generalization operators on A, then select an operator and generalize A.
Attribute-threshold control: typically 2-8 distinct values, specified by the user or by default.
Generalized relation threshold control: controls the size of the final relation/rule set.
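A minimal Python sketch of the generalize-and-merge step of AOI. The concept hierarchy (city to country) and the age grouping below are made up for illustration; they are not from the exam material.

```python
from collections import Counter

# Hypothetical concept hierarchy; real AOI would take this from the
# data warehouse's dimension hierarchies.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Seattle": "USA", "Boston": "USA"}

def generalize_age(age):
    # Climb the age attribute one level up a (made-up) hierarchy.
    return "20-39" if age < 40 else "40-59"

# Task-relevant initial relation: (city, age) tuples gathered by a query.
initial_relation = [("Vancouver", 25), ("Toronto", 31),
                    ("Seattle", 45), ("Vancouver", 52)]

# Attribute generalization: climb each attribute in its hierarchy, then
# merge identical generalized tuples and accumulate their counts.
generalized = Counter(
    (city_to_country[city], generalize_age(age)) for city, age in initial_relation
)

for tup, count in generalized.items():
    print(tup, "count =", count)
```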
b. List out the characteristics of kNN classifier.
The k-nearest-neighbor method was first described in the early 1950s. The method is labor
intensive when given large training sets, and did not gain popularity until the 1960s when
increased computing power became available. It has since been widely used in the area of
pattern recognition. Nearest-neighbor classifiers are based on learning by analogy, that is, by
comparing a given test tuple with training tuples that are similar to it. The training tuples are
described by n attributes. Each tuple represents a point in an n-dimensional space. In this
way, all of the training tuples are stored in an n-dimensional pattern space. When given an
unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training
tuples that are closest to the unknown tuple. These k training tuples are the k nearest
neighbors of the unknown tuple. Closeness is defined in terms of a distance metric, such
as Euclidean distance.
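A small illustrative sketch of a k-nearest-neighbor classifier in Python, using Euclidean distance as the closeness measure. The data is a toy example, not from the exam.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training tuples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest tuples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two attributes, two classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.0], [5.8, 6.2]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([5.5, 5.9]), k=3))  # -> "B"
```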
c. What are Random Forests?
Random forest (or random forests) is an ensemble classifier that consists of many decision
trees and outputs the class that is the mode of the classes output by the individual trees. The term
came from random decision forests, which was first proposed by Tin Kam Ho of Bell Labs in
1995. The method combines Breiman's "bagging" idea and the random selection of features.
The advantages of random forest are:
It is one of the most accurate learning algorithms available. For many data sets, it
produces a highly accurate classifier.
It runs efficiently on large databases.
It can handle thousands of input variables without variable deletion.
It gives estimates of what variables are important in the classification.
It generates an internal unbiased estimate of the generalization error as the forest
building progresses.
It has an effective method for estimating missing data and maintains accuracy when a
large proportion of the data are missing.
It has methods for balancing error in class population unbalanced data sets.
Generated forests can be saved for future use on other data.
Prototypes are computed that give information about the relation between the variables
and the classification.
It computes proximities between pairs of cases that can be used in clustering, locating
outliers, or (by scaling) give interesting views of the data.
The capabilities of the above can be extended to unlabeled data, leading to
unsupervised clustering, data views and outlier detection.
It offers an experimental method for detecting variable interactions.
Disadvantages
Random forests have been observed to overfit for some datasets with noisy
classification/regression tasks.
For data including categorical variables with different numbers of levels, random
forests are biased in favor of those attributes with more levels. Therefore, the variable
importance scores from random forests are not reliable for this type of data.
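A short sketch using scikit-learn's RandomForestClassifier (assuming scikit-learn is available); it illustrates the internal out-of-bag error estimate and the variable importances mentioned among the advantages above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Bagging plus random feature selection at each split, as described above.
# oob_score=True gives the internal (out-of-bag) estimate of generalization error.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                oob_score=True, random_state=0)
forest.fit(X, y)

print("Out-of-bag accuracy:", forest.oob_score_)
print("Variable importances:", forest.feature_importances_)
```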
d. What is Apriori Principle?
Apriori principle (Main observation):
If an itemset is frequent, then all of its subsets must also be frequent
Apriori principle holds due to the following anti-monotone property of the support measure:
∀ X, Y : (X ⊆ Y) ⟹ s(X) ≥ s(Y)
(Figure: itemset lattice with candidate 2-itemsets AB ... DE, 3-itemsets ABC ... CDE, up to ABCDE; once an itemset is found to be infrequent, all of its supersets are pruned from the search space.)
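An illustrative Python sketch of the anti-monotone property and the pruning it allows, on a made-up transaction database (not the exam's data).

```python
from itertools import combinations

# Toy transaction database (illustrative only).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C", "D"},
    {"B", "C", "E"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Anti-monotone property: s(X) >= s(Y) whenever X is a subset of Y.
print(support({"A"}), ">=", support({"A", "B"}))

# Contrapositive used for pruning: if a 2-itemset is infrequent,
# no superset of it needs to be counted.
minsup = 0.5
infrequent_pairs = [set(p) for p in combinations("ABCDE", 2) if support(set(p)) < minsup]
candidates = [set(c) for c in combinations("ABCDE", 3)]
pruned = [c for c in candidates if any(p <= c for p in infrequent_pairs)]
print(len(pruned), "of", len(candidates), "3-itemset candidates pruned")
```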
Part-B
2. a. Explain Decision Tree Induction. Describe its implementation using Hunt's algorithm.
b. What are the measures for selecting the splitting attribute? Explain.
The major steps are as follows:
The tree starts as a single root node containing all of the training tuples.
If the tuples are all from the same class, then the node becomes a leaf, labeled with that class.
Else, an attribute selection method is called to determine the splitting criterion. Such a method may
use a heuristic or statistical measure (e.g., information gain or Gini index) to select the "best" way
to separate the tuples into individual classes. The splitting criterion consists of a splitting attribute
and may also indicate either a split-point or a splitting subset, as described below.
Next, the node is labeled with the splitting criterion, which serves as a test at the node. A branch is
grown from the node to each of the outcomes of the splitting criterion and the tuples are partitioned
accordingly. There are three possible scenarios for such partitioning.
1. If the splitting attribute is discrete-valued, then a branch is grown for each possible value of the attribute.
2. If the splitting attribute, A, is continuous-valued, then two branches are grown, corresponding to the
conditions A <= split point and A > split point.
3. If the splitting attribute is discrete-valued and a binary tree must be produced (e.g., if the Gini index was used
as the selection measure), then the test at the node is "A ∈ SA?", where SA is the splitting subset for A. It is a
subset of the known values of A. If a given tuple has value aj of A and aj ∈ SA, then the test at the node is
satisfied.
The algorithm recurses to create a decision tree for the tuples at each partition.
The stopping conditions are:
If all tuples at a given node belong to the same class, then transform that node into a leaf, labeled
with that class.
If there are no more attributes left to create more partitions, then majority voting can be used to
convert the given node into a leaf, labeled with the most common class among the tuples.
If there are no tuples for a given branch, a leaf is created with the majority class from the parent node.
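A small Python sketch of the two selection measures named above (information gain and Gini index), computed for a hypothetical binary split of 10 tuples; the split itself is made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class proportions in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(D) = 1 - sum p_i^2, an alternative impurity measure."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Gain = Info(D) minus the weighted Info of the partitions induced by a split."""
    n = len(labels)
    expected = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - expected

# Toy example: splitting 10 tuples on a hypothetical binary attribute.
labels = ["yes"] * 6 + ["no"] * 4
left, right = ["yes"] * 5 + ["no"] * 1, ["yes"] * 1 + ["no"] * 3
print("Info(D) =", round(entropy(labels), 3), " Gini(D) =", round(gini(labels), 3))
print("Gain of this split =", round(information_gain(labels, [left, right]), 3))
```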
(or)
3. a. Explain in detail neural network learning for classification using back propagation
algorithm.
Classification by Backpropagation
Backpropagation: a neural network learning algorithm. The field was originally kindled by psychologists and neurobiologists seeking to develop and test
computational analogues of neurons. A neural network is a set of connected input/output units in which each connection has a weight
associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label
of the input tuples. It is also referred to as connectionist learning due to the connections between units.
Strengths:
o High tolerance to noisy data
o Ability to classify untrained patterns
o Well-suited for continuous-valued inputs and outputs
o Successful on a wide array of real-world data
o Algorithms are inherently parallel
o Techniques have recently been developed for the extraction of rules from trained neural networks
A Neuron (= a perceptron)
o The inputs to the network correspond to the attributes measured for each training tuple
o Inputs are fed simultaneously into the units making up the input layer
o They are then weighted and fed simultaneously to a hidden layer
o The number of hidden layers is arbitrary, although usually only one
o The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction
o The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
o From a statistical point of view, networks perform nonlinear regression: Given enough hidden units and enough training samples, they can closely
approximate any function
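An illustrative NumPy sketch of backpropagation for a tiny feed-forward network with one hidden layer; the data (learning the logical OR of two inputs) and network size are made up, and a real implementation would add a stopping criterion, learning-rate schedule, etc.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy training tuples: 2 attributes, class label = logical OR of the inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [1]], dtype=float)

# One hidden layer with 3 units; weights start as small random values.
W1, b1 = rng.normal(0, 0.5, (2, 3)), np.zeros(3)
W2, b2 = rng.normal(0, 0.5, (3, 1)), np.zeros(1)
lr = 0.5

for _ in range(5000):
    # Forward pass: inputs are weighted and fed through the hidden layer to the output unit.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error and adjust weights and biases.
    d_out = (out - y) * out * (1 - out)      # error term at the output unit
    d_h = (d_out @ W2.T) * h * (1 - h)       # error propagated back to the hidden units
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 1]
```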
Row 4:
Row 4 contains B, D, A. Now we just increment the frequency of occurrences along the
existing branch: B:4, D, A:3.
Row 5:
The fifth row has only the item D. Now we have the opportunity to draw a new branch from the 'null'
node. See Figure 4.
Row 6:
B and D appear in row 6, so just change B:4 to B:5 and D:3 to D:4.
Row 7:
Attach two new nodes, A and E, to the D node hanging from the null node. Then
mark D, A, E as D:2, A:1 and E:1.
Step 6: Validation
After these steps the final FP-tree is as shown in Figure 5.
How do we know this is correct? Count the frequency of occurrence of each item in
the FP-tree and compare it with Table 2. If both counts are equal, that is a positive indication
that the tree is correct.
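A Python sketch of the FP-tree insertion step that the rows above walk through: counts are incremented along an existing branch, and new nodes are attached otherwise. The transactions below are illustrative, not the exam's Table 2, and header links/node links are omitted.

```python
class FPNode:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def insert(root, ordered_items):
    """Insert one transaction: walk/extend a branch from the null root,
    incrementing the count of every node on the path."""
    node = root
    for item in ordered_items:
        node = node.children.setdefault(item, FPNode(item))
        node.count += 1

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# Illustrative transactions, already sorted by descending item frequency.
root = FPNode(None)           # the 'null' node
for t in [["B", "D", "A"], ["B", "D", "A"], ["D"], ["B", "D"], ["D", "A", "E"]]:
    insert(root, t)
show(root)
```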
b. Explain k-means clustering with a suitable example. Also write a short note on BIRCH
and CURE clustering techniques.
K-Means Clustering
K-means is one of the simplest unsupervised learning algorithms that solve the well known clustering
problem. The procedure follows a simple and easy way to classify a given data set through a certain number
of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster.
These centroids should be placed in a cunning way, because different locations cause different results. So,
the better choice is to place them as far away from each other as possible. The next step is to take each
point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the
first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids
as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new
binding has to be done between the same data set points and the nearest new centroid. A loop has been
generated. As a result of this loop we may notice that the k centroids change their location step by step until
no more changes are done. In other words centroids do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The
objective function is
J = Σ(j=1..k) Σ(i=1..n) || xi(j) - cj ||²
where || xi(j) - cj ||² is a chosen distance measure between a data point xi(j) and the cluster centre cj; J is an
indicator of the distance of the n data points from their respective cluster centres.
Although it can be proved that the procedure will always terminate, the k-means algorithm does not
necessarily find the most optimal configuration, corresponding to the global objective function minimum.
The algorithm is also significantly sensitive to the initial randomly selected cluster centres. The k-means
algorithm can be run multiple times to reduce this effect.
K-means is a simple algorithm that has been adapted to many problem domains. As we are going to see, it is
a good candidate for extension to work with fuzzy feature vectors.
An example
Suppose that we have n sample feature vectors x1, x2, ..., xn all from the same class, and we know that they
fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. If the clusters are well
separated, we can use a minimum-distance classifier to separate them. That is, we can say that x is in cluster
i if || x - mi || is the minimum of all the k distances. This suggests the following procedure for finding the k
means: make initial guesses for the means m1, ..., mk; until there are no changes in any mean, use the
estimated means to classify the samples into clusters and then, for each cluster i, replace mi with the mean
of all the samples assigned to cluster i.
Here is an example showing how the means m1 and m2 move into the centers of two clusters.
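A compact NumPy sketch of the procedure just described (assign points to the nearest centroid, then recompute each centroid as the barycenter of its cluster), run on two made-up blobs of points.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate assignment and centroid update until nothing changes."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # initial guesses
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        # Update step: recompute each centroid as the mean of its points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs; the two means should move into their centers.
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```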
BIRCH
BIRCH begins by partitioning objects hierarchically using tree structures, and then applies other clustering
algorithms to refine the clusters.
BIRCH is local (instead of global). Each clustering decision is made without scanning all data points or
currently existing clusters.
BIRCH exploits the observation that the data space is usually not uniformly occupied, and therefore not
every data point is equally important for clustering purposes.
BIRCH makes full use of available memory to derive the finest possible subclusters while minimizing I/O
costs.
(Figure: BIRCH overview: the data is scanned into an initial CF tree, condensed into a smaller CF tree, clustered into good clusters, and then refined into better clusters.)
CURE
CURE is a hierarchical clustering algorithm that uses a fixed number of points as representatives of each
cluster, a middle ground between the following two approaches:
Centroid-based approach: uses one point to represent a cluster; too little information, sensitive to data shapes.
All-points approach: uses all points of a cluster; too much information, sensitive to outliers.
A constant number c of well-scattered points in a cluster is chosen and then shrunk toward the
center of the cluster by a specified fraction alpha (see the sketch below).
The clusters with the closest pair of representative points are merged at each step.
The algorithm stops when only k clusters are left, where k can be specified.
CURE's Advantages
More accurate: CURE adjusts well to clusters having non-spherical shapes and wide variances in size.
More efficient:
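An illustrative sketch of CURE's representative-point step: pick c well-scattered points of a cluster and shrink them toward the cluster center by a fraction alpha. The farthest-point selection used here is one plausible way to obtain scattered points, and the data is made up.

```python
import numpy as np

def cure_representatives(points, c=4, alpha=0.3):
    """Pick c well-scattered points of one cluster and shrink them toward the
    cluster center by the fraction alpha, as CURE does for each cluster."""
    center = points.mean(axis=0)
    # Greedy farthest-point selection gives well-scattered representatives.
    reps = [points[np.argmax(np.linalg.norm(points - center, axis=1))]]
    while len(reps) < min(c, len(points)):
        dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dists)])
    # Shrinking dampens the effect of outliers on the cluster boundary.
    return np.array([r + alpha * (center - r) for r in reps])

cluster = np.random.default_rng(0).normal(0, 1, (30, 2))
print(cure_representatives(cluster))
```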
(or)
5. a. Explain Association Rule Mining.
Association Rule Mining
Association rule learning is a popular and well researched method for discovering interesting relations
between variables in large databases. It describes analyzing and presenting strong rules discovered in
databases using different measures of interestingness. Based on the concept of strong rules, association
rules were introduced for discovering regularities between products in large-scale transaction data recorded
by point-of-sale (POS) systems in supermarkets. For example, the rule
{onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that if a
customer buys onions and potatoes together, he or she is likely to also buy hamburger meat. Such
information can be used as the basis for decisions about marketing activities such as promotional
pricing or product placements. In addition to the above example from market basket analysis, association
rules are employed today in many application areas including Web usage mining, intrusion detection and
bioinformatics. As opposed to sequence mining, association rule learning typically does not consider the
order of items either within a transaction or across transactions.
Formally, given a set of transactions, find rules that will predict the occurrence of an item based
on the occurrences of other items in the transaction.
Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ a minsup threshold and confidence ≥ a minconf threshold.
Brute-force approach: list all possible rules, compute the support and confidence of each, and prune
those failing the thresholds; this is computationally prohibitive.
Two-step approach:
1. Frequent Itemset Generation
Generate all itemsets whose support ≥ minsup.
2. Rule Generation
Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent
itemset
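A small Python sketch of the rule-generation step: every binary partition of a frequent itemset becomes a candidate rule, kept if its confidence meets the threshold. The support counts below are made up for illustration.

```python
from itertools import combinations

# Support counts for a hypothetical frequent itemset and its subsets.
support = {
    frozenset("ABC"): 2, frozenset("AB"): 3, frozenset("AC"): 2,
    frozenset("BC"): 3, frozenset("A"): 4, frozenset("B"): 5, frozenset("C"): 4,
}

def rules_from_itemset(itemset, minconf=0.6):
    """Each rule is a binary partition of the frequent itemset; keep those
    whose confidence = s(itemset) / s(antecedent) meets the threshold."""
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):
        for antecedent in map(frozenset, combinations(items, r)):
            conf = support[items] / support[antecedent]
            if conf >= minconf:
                rules.append((set(antecedent), set(items - antecedent), round(conf, 2)))
    return rules

for lhs, rhs, conf in rules_from_itemset("ABC"):
    print(lhs, "=>", rhs, "conf =", conf)
```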
The Self-Organizing Map (SOM) was developed by Professor Kohonen. The SOM has been proven useful in many
applications. The SOM algorithm is based on unsupervised, competitive learning. It provides a topology
preserving mapping from the high dimensional space to map units. Map units, or neurons, usually form a
two-dimensional lattice and thus the mapping is a mapping from high dimensional space onto a plane. The
property of topology preserving means that the mapping preserves the relative distance between the points.
Points that are near each other in the input space are mapped to nearby map units in the SOM. The SOM can
thus serve as a cluster analyzing tool of high-dimensional data. Also, the SOM has the capability to
generalize. Generalization capability means that the network can recognize or characterize inputs it has
never encountered before. A new input is assimilated with the map unit it is mapped to.
In the basic SOM algorithm, the topological relations and the number of neurons are fixed from the
beginning. This number of neurons determines the scale or the granularity of the resulting model. Scale
selection affects the accuracy and the generalization capability of the model. It must be taken into account
that the generalization and accuracy are contradictory goals. By improving the first, we lose on the second,
and vice versa.
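An illustrative NumPy sketch of one SOM training step: find the best-matching unit for an input and pull it and its lattice neighbours toward that input, which is what makes the mapping topology preserving. The grid size, learning rate and neighbourhood radius are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 5, 5, 3            # 5x5 lattice of neurons, 3-D inputs
weights = rng.random((grid_h, grid_w, dim))
coords = np.indices((grid_h, grid_w)).transpose(1, 2, 0)  # lattice positions

def train_step(x, lr=0.1, radius=1.5):
    """Competitive learning: find the best-matching unit (BMU) and move it
    and its lattice neighbours toward the input x."""
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Gaussian neighbourhood on the 2-D lattice preserves topology.
    lattice_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
    h = np.exp(-(lattice_dist ** 2) / (2 * radius ** 2))
    weights[...] = weights + lr * h[..., None] * (x - weights)

for _ in range(1000):
    train_step(rng.random(dim))          # random 3-D inputs, e.g. colours
print(weights[0, 0], weights[0, 1])      # adjacent units tend to have similar weights
```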
2. SVM
Support Vector Machine (SVM) finds an optimal solution. It maximizes the distance between the hyperplane and
the difficult points close to the decision boundary. If there are no points near the decision surface, then there are no
very uncertain classification decisions. SVMs maximize the margin around the separating hyperplane. The decision
function is fully specified by a subset of the training samples, the support vectors. Solving SVMs is a quadratic
programming problem. SVM is seen by many as the most successful current text classification method. The classifier is a
separating hyperplane. The most important training points are the support vectors; they define the hyperplane.
Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrangian
multipliers αi. Both in the dual formulation of the problem and in the solution, training points appear only inside
inner products:
Find α1, ..., αN such that
Q(α) = Σi αi - ½ Σi Σj αi αj yi yj (xi · xj) is maximized, and
(1) Σi αi yi = 0
(2) 0 ≤ αi ≤ C for all αi
The resulting classifier is f(x) = Σi αi yi (xi · x) + b
The feature space does not need to be represented explicitly; it is enough to define a kernel function.
The kernel function plays the role of the dot product in the feature space.
Properties of SVM
Flexibility in choosing a similarity function
Sparseness of solution when dealing with large data sets
- only support vectors are used to specify the separating hyperplane
Ability to handle large feature spaces
- complexity does not depend on the dimensionality of the feature space
Overfitting can be controlled by soft margin approach
Nice math property: a simple convex optimization problem which is guaranteed to converge
to a single global solution
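A short example using scikit-learn's SVC (assuming scikit-learn is available), showing the soft-margin parameter C from the formulation above and a kernel taking the role of the dot product.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# C is the soft-margin parameter (0 <= alpha_i <= C above); the RBF kernel
# plays the role of the dot product in an implicit feature space.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_tr, y_tr)

print("test accuracy:", clf.score(X_te, y_te))
print("support vectors per class:", clf.n_support_)
```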
Feature Selection
SVM Applications
SVM has been used successfully in many real-world problems
- text (and hypertext) categorization
- image classification
- bioinformatics (Protein classification,
Cancer classification)
- hand-written character recognition
Weakness of SVM
It is sensitive to noise
- A relatively small number of mislabeled examples can dramatically decrease the performance
It only considers two classes
- how to do multi-class classification with SVM?
- Answer:
1) with output arity m, learn m SVMs
SVM 1 learns Output==1 vs Output != 1
SVM 2 learns Output==2 vs Output != 2
:
SVM m learns Output==m vs Output != m
2)To predict the output for a new input, just predict with each SVM and find out which one puts
the prediction the furthest into the positive region.
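A short scikit-learn sketch of the one-vs-rest scheme just described: one SVM is learned per class, and the prediction is the class whose SVM puts the input furthest into its positive region. LinearSVC is used here purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)   # 3 classes, so 3 one-vs-rest SVMs are learned

# LinearSVC fits one "class m vs. rest" SVM per class internally.
ovr = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# Predict by checking which SVM gives the largest signed distance from its hyperplane.
scores = ovr.decision_function(X[:5])   # shape (5, 3): one score per class
print(np.argmax(scores, axis=1))        # chosen classes
print(ovr.predict(X[:5]))               # LinearSVC's own prediction (same rule)
```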
BIRCH
Refer 4.b.
ADITYA COLLEGE OF ENGINEERING
PUNGANUR ROAD, MADANAPALLE-517325
III-B.Tech(R13) II-Sem II-Internal Examinations May-2017 (Objective)
13A05603 Datamining (Computer Science & Engineering)
Support vector machines: choose the hyperplane based on support vectors. A support vector is a
critical point close to the decision boundary. (Degree-1) SVMs are linear classifiers.
Kernels: a powerful and elegant way to define a similarity metric. Perhaps the best performing text
classifier.
5. Name some graph-based clustering techniques.
Chameleon
CURE
BIRCH