Unit –II
Association Mining
• Given: (1) database of transactions, (2) each transaction is a list of items (purchased
by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of
items
– E.g., 98% of people who purchase tires and auto accessories also get
automotive services done
• Applications
– * ⇒ Maintenance Agreement (What should the store do to boost Maintenance
Agreement sales?)
– Home Electronics ⇒ * (What other products should the store stock up on?)
– Attached mailing in direct marketing
– Detecting "ping-pong"ing of patients, faulty "collisions"
• Find all rules X & Y ⇒ Z with minimum confidence and support
– support, s: probability that a transaction contains {X ∪ Y ∪ Z}
– confidence, c: conditional probability that a transaction containing {X ∪ Y} also
contains Z
Let minimum support be 50% and minimum confidence be 50%; for the transactions below we have
– A ⇒ C (support 50%, confidence 66.6%)
– C ⇒ A (support 50%, confidence 100%)
(these values are computed in the sketch after the table)
Transaction ID Items Bought
2000 A,B,C
1000 A,C
4000 A,D
5000 B,E,F
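A minimal Python sketch of these two measures, using the transaction table above; the function names are illustrative, not part of the notes.

```python
# Transactions taken from the example table above.
transactions = {2000: {"A", "B", "C"}, 1000: {"A", "C"},
                4000: {"A", "D"}, 5000: {"B", "E", "F"}}

def support(itemset):
    # Fraction of transactions that contain the whole itemset.
    return sum(1 for t in transactions.values() if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    # Conditional probability that a transaction with lhs also contains rhs.
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))          # 0.5   -> 50% support for A => C
print(confidence({"A"}, {"C"}))     # 0.666 -> 66.6% confidence for A => C
print(confidence({"C"}, {"A"}))     # 1.0   -> 100% confidence for C => A
```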
Apriori is the method that mines the complete set of frequent itemsets with candidate generation; it is based on the Apriori property and the Apriori algorithm.
Apriori property: all nonempty subsets of a frequent itemset must also be frequent; equivalently, no superset of an infrequent itemset can be frequent.
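A minimal level-wise Apriori sketch over the same four transactions; min_support = 2 corresponds to the 50% threshold in the example, and the structure is an illustration rather than a reference implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items]      # candidate 1-itemsets
    frequent, k = {}, 1
    while current:
        # Count support of the current candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join step + prune step (Apriori property: all (k-1)-subsets must be frequent).
        k += 1
        candidates = set()
        for a, b in combinations(list(survivors), 2):
            union = a | b
            if len(union) == k and all(frozenset(s) in survivors
                                       for s in combinations(union, k - 1)):
                candidates.add(union)
        current = list(candidates)
    return frequent

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(transactions, min_support=2))   # {A}:3, {B}:2, {C}:2, {A,C}:2
```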
FP-tree with header table (figure); benefits of the FP-tree structure:
• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more likely to be
shared
– never larger than the original database (not counting node-links and counts)
– Example: For Connect-4 DB, compression ratio could be over 100
Mining Frequent Patterns Using FP-tree
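A minimal sketch of FP-tree construction with a header table, reflecting the properties listed above (frequency-descending ordering, shared prefixes, node-links); the transactions reuse the earlier example table and min_support = 2 is an assumed threshold.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fptree(transactions, min_support):
    # First pass: count item frequencies and keep only the frequent items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)
    header_table = defaultdict(list)   # item -> list of nodes (node-links)

    # Second pass: insert each transaction with its frequent items sorted in
    # frequency-descending order, so common prefixes share tree nodes.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header_table[item].append(node.children[item])
            node = node.children[item]
    return root, header_table

transactions = [["A", "B", "C"], ["A", "C"], ["A", "D"], ["B", "E", "F"]]
root, header = build_fptree(transactions, min_support=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
```

Mining then proceeds by following the node-links of each header-table entry from the least frequent item upward.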
Concept hierarchy (figure): Food at the top level, with Milk and Bread at the next level, and brands such as Fraser and Sunset at the lowest level.
TID Items
T1 {111, 121, 211, 221}
T2 {111, 211, 222, 323}
T3 {112, 122, 221, 411}
T4 {111, 121}
T5 {111, 122, 211, 221, 413}
– If the same min_support is adopted across all levels, then toss item t from a
transaction if any of t's ancestors is infrequent (see the filtering sketch below).
– If a reduced min_support is adopted at lower levels, then examine only those
descendants whose ancestors are frequent/non-negligible.
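A minimal sketch of the shared-min_support filtering described above, using the encoded item IDs from the table; treating the first digit as the level-1 ancestor and min_support = 3 are assumptions for illustration.

```python
from collections import Counter

transactions = [
    ["111", "121", "211", "221"],
    ["111", "211", "222", "323"],
    ["112", "122", "221", "411"],
    ["111", "121"],
    ["111", "122", "211", "221", "413"],
]
min_support = 3

# Count level-1 ancestors (first digit of the encoded ID), once per transaction.
level1 = Counter(d for t in transactions for d in {i[0] for i in t})
frequent_ancestors = {a for a, c in level1.items() if c >= min_support}

# Toss item i from a transaction if its level-1 ancestor is infrequent.
filtered = [[i for i in t if i[0] in frequent_ancestors] for t in transactions]
print(filtered)   # items with ancestors 3** and 4** are dropped
```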
Correlation in detail.
Numeric correlation
• Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
• A constrained association query (CAQ) is of the form {(S1, S2) | C},
– where C is a set of constraints on S1, S2, including frequency constraints
• A classification of (single-variable) constraints:
– Class constraint: S ⊆ A, e.g., S ⊆ Item
– Domain constraint:
• S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100
• v θ S, θ is ∈ or ∉, e.g., snacks ∈ S.Type
• V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}
– e.g., {snacks, sodas} ⊆ S.Type
– Aggregation constraint: agg(S) θ v, where agg is in {min, max, sum, count,
avg}, and θ ∈ {=, ≠, <, ≤, >, ≥}
• e.g., count(S1.Type) = 1, avg(S2.Price) < 100
3. Convertible Constraint
• Suppose all items in patterns are listed in a total order R
• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint
implies that each suffix of S w.r.t. R also satisfies C
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies
that each pattern of which S is a suffix w.r.t. R also satisfies C
• Succinctness:
– For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
– Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is
based on A1, i.e., it contains a subset belonging to A1
• Example (illustrated in the sketch after this list):
– sum(S.Price) ≥ v is not succinct
– min(S.Price) ≤ v is succinct
• Optimization:
– If C is succinct, then C is pre-counting prunable. The satisfaction of the
constraint alone is not affected by the iterative support counting.
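A minimal sketch contrasting the two example constraints above; the item prices and the threshold v are illustrative assumptions.

```python
item_price = {"bread": 2, "milk": 3, "tv": 500, "camera": 250}
v = 100

# min(S.Price) <= v is succinct: every satisfying itemset must contain a member
# of A1, the size-1 itemsets that satisfy the constraint, so the candidates can
# be enumerated before any support counting starts.
A1 = {i for i, p in item_price.items() if p <= v}
print(A1)   # any S with min(S.Price) <= v contains one of these items

# sum(S.Price) >= v is not succinct: satisfaction cannot be decided from the
# single items alone, so it has to be re-checked as itemsets grow.
def satisfies_sum(itemset):
    return sum(item_price[i] for i in itemset) >= v

print(satisfies_sum({"bread", "milk"}), satisfies_sum({"bread", "tv"}))
```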
Classification:
• Typical applications
– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection
Classification—A Two-Step Process
– Step 1, model construction: build a model from a training set of tuples whose class labels are known.
– Step 2, model usage: estimate the model's accuracy on an independent test set, then use it to classify future or unknown tuples.
Training Dataset
• Example:
– Weather problem: build a decision tree to guide the decision about whether or
not to play tennis.
– Dataset
(weather.nominal.arff)
• Validation:
– Using the training set as the test set gives an overly optimistic estimate of classification accuracy.
– Expected accuracy on an independent test set will generally be lower.
– 10-fold cross validation is more robust than testing on the training set (a sketch follows this list).
• Divide data into 10 sets with about same proportion of class label
values as in original set.
• Run the classification 10 times independently, each time training on 9 of
the folds and testing on the held-out fold.
• Average the 10 accuracies.
– Ratio validation: 67% training set / 33% test set.
– Best: having a separate training set and test set.
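A minimal 10-fold cross-validation sketch, assuming scikit-learn is available; the iris data and the decision-tree learner are stand-ins for any dataset and classifier.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# StratifiedKFold keeps roughly the same class-label proportions in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)   # 10 independent train/test runs

print("fold accuracies:", scores.round(2))
print("average accuracy:", scores.mean().round(3))
```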
• Results:
– Classification accuracy (correctly classified instances).
– Errors (mean absolute error, root mean squared error, …)
– Kappa statistic (measures the agreement between predicted and observed
classifications after the agreement expected by chance has been excluded; it
ranges from -100% to 100%, where 100% is complete agreement and 0% means the
agreement is no better than chance)
• Results:
– TP (True Positive) rate per class label = TP / (TP + FN)
– FP (False Positive) rate = FP / (FP + TN)
– Precision = TP / (TP + FP)
– Recall = TP rate = TP / (TP + FN)
– F-measure = 2 × precision × recall / (precision + recall) (these measures are computed in the sketch below)
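A minimal sketch computing these per-class measures from raw counts; the TP/FP/FN/TN values are illustrative assumptions.

```python
TP, FP, FN, TN = 40, 10, 5, 45   # confusion-matrix counts for one class

tp_rate   = TP / (TP + FN)       # recall / sensitivity
fp_rate   = FP / (FP + TN)
precision = TP / (TP + FP)
recall    = tp_rate
f_measure = 2 * precision * recall / (precision + recall)

print(f"TP rate={tp_rate:.2f}  FP rate={fp_rate:.2f}  "
      f"precision={precision:.2f}  recall={recall:.2f}  F={f_measure:.2f}")
```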
• ID3 characteristics:
– Requires nominal values
– Improved into C4.5
• Dealing with numeric attributes
• Dealing with missing values
• Dealing with noisy data
• Generating rules from trees
• Methods:
– C5.0: target field must be categorical, predictor fields may be numeric or
categorical, provides multiple splits on the field that gives the maximum
information gain at each level (an information-gain sketch follows this list)
– QUEST: target field must be categorical, predictor fields may be numeric
ranges or categorical, statistical binary split
– C&RT: target and predictor fields may be numeric ranges or categorical,
statistical binary split based on regression
– CHAID: target and predictor fields may be numeric ranges or categorical,
statistical multiway split based on chi-square
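A minimal entropy and information-gain sketch in the spirit of ID3/C4.5/C5.0 split selection (the sketch referred to above); the data reproduce the standard weather (play / don't play) example used elsewhere in these notes.

```python
from collections import Counter
from math import log2

data = [  # (outlook, play)
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rain", "yes"),
    ("rain", "yes"), ("rain", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rain", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rain", "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(data, attr_index=0):
    labels = [row[-1] for row in data]
    base = entropy(labels)                       # entropy before the split
    remainder = 0.0
    for value in {row[attr_index] for row in data}:
        subset = [row[-1] for row in data if row[attr_index] == value]
        remainder += len(subset) / len(data) * entropy(subset)
    return base - remainder                      # expected reduction in entropy

print(round(info_gain(data), 3))   # gain of splitting on 'outlook' (~0.247)
```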
Bayesian Classification:
• Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
• Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct. Prior knowledge can be combined with
observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
• Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured
Bayesian Theorem
• Given training data D, the posterior probability of a hypothesis h, P(h|D), follows
Bayes' theorem: P(h|D) = P(D|h) P(h) / P(D)
• Naive Bayesian classification greatly reduces the computation cost: only the class
distribution and the per-attribute counts within each class need to be maintained.
Outlook        P     N          Temperature    P     N
sunny          2/9   3/5        hot            2/9   2/5
overcast       4/9   0          mild           4/9   2/5
rain           3/9   2/5        cool           3/9   1/5

Humidity       P     N          Windy          P     N
high           3/9   4/5        true           3/9   3/5
normal         6/9   1/5        false          6/9   2/5

Bayesian classification
• The classification problem may be formalized using a-posteriori probabilities:
• P(C|X) = probability that the sample tuple X = <x1, …, xk> is of class C
• E.g. P(class=N | outlook=sunny, windy=true, …)
• Idea: assign to sample X the class label C such that P(C|X) is maximal
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
• Problem: computing P(X|C) directly is infeasible! The naive Bayes sketch below assumes class-conditional independence of the attributes to make it tractable.
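A minimal naive Bayes sketch that reuses the conditional-probability tables above and compares P(X|C)·P(C) for the two classes; the query tuple is the classic sunny/cool/high/true example.

```python
prior = {"P": 9/14, "N": 5/14}
cond = {
    "P": {"outlook=sunny": 2/9, "temperature=cool": 3/9,
          "humidity=high": 3/9, "windy=true": 3/9},
    "N": {"outlook=sunny": 3/5, "temperature=cool": 1/5,
          "humidity=high": 4/5, "windy=true": 3/5},
}

X = ["outlook=sunny", "temperature=cool", "humidity=high", "windy=true"]

# P(C|X) is proportional to P(X|C)·P(C); P(X) is the same for both classes,
# and P(X|C) factors into per-attribute terms under the independence assumption.
scores = {}
for c in ("P", "N"):
    score = prior[c]
    for attr in X:
        score *= cond[c][attr]
    scores[c] = score

print(scores)                        # unnormalized posteriors
print(max(scores, key=scores.get))   # predicted class label ('N' here)
```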
Decision Tree Pruning
Discarding one or more subtrees and replacing them with leaves simplifies a decision tree,
and that is the main task in decision-tree pruning. In replacing a subtree with a leaf, the
algorithm expects to lower the predicted error rate and increase the quality of the classification
model. But computing the error rate is not simple. An error rate based only on the training data
set does not provide a suitable estimate. One possibility to estimate the predicted error rate is
to use a new, additional set of test samples if they are available, or to use the cross-validation
techniques. This technique divides initially available samples into equal sized blocks and, for
each block, the tree is constructed from all samples except this block and tested with a given
block of samples. With the available training and testing samples, the basic idea of decision
tree-pruning is to remove parts of the tree (subtrees) that do not contribute to the
classification accuracy of unseen testing samples, producing a less complex and thus more
comprehensible tree. There are two ways in which the recursive-partitioning method can be
modified:
1. Deciding not to divide a set of samples any further under some conditions. The
stopping criterion is usually based on some statistical tests, such as the χ2 test: If there
are no significant differences in classification accuracy before and after division, then
represent a current node as a leaf. The decision is made in advance, before splitting,
and therefore this approach is called prepruning.
2. Removing retrospectively some of the tree structure using selected accuracy criteria.
The decision in this process of postpruning is made after the tree has been built.
C4.5 follows the postpruning approach, but it uses a specific technique to estimate the
predicted error rate. This method is called pessimistic pruning. For every node in a tree, the
estimation of the upper confidence limit Ucf is computed using the statistical tables for the
binomial distribution (given in most textbooks on statistics). The parameter Ucf is a function of
|Ti| and E for a given node. C4.5 uses the default confidence level of 25%, and compares
U25%(|Ti|, E) for a given node Ti with the weighted confidence of its leaves, where the weights
are the total number of cases for each leaf. If the predicted error for the root node of a subtree
is less than the weighted sum of U25% for its leaves (the predicted error for the subtree), then
the subtree is replaced with its root node, which becomes a new leaf in the pruned tree.
Let us illustrate this procedure with one simple example. A subtree of a decision tree
is given in Figure, where the root node is the test x1 on three possible values {1, 2, 3} of the
attribute A. The children of the root node are leaves denoted with corresponding classes and
(|Ti|, E) parameters. The question is whether the subtree can be pruned and replaced with
its root node as a new, generalized leaf node.
To analyze the possibility of replacing the subtree with a leaf node it is necessary to
compute a predicted error PE for the initial tree and for a replaced node. Using default
confidence of 25%, the upper confidence limits for all nodes are collected from statistical
tables: U25% (6, 0) = 0.206, U25%(9, 0) = 0.143, U25%(1, 0) = 0.750, and U25%(16, 1) = 0.157.
Using these values, the predicted errors for the initial tree and the replaced node are
PE(subtree) = 6 × 0.206 + 9 × 0.143 + 1 × 0.750 ≈ 3.27 and PE(replaced leaf) = 16 × 0.157 ≈ 2.51.
Since the existing subtree has a higher value of predicted error than the replaced node,
it is recommended that the decision tree be pruned and the subtree replaced with the new leaf
node.
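A minimal sketch of this pruning decision, using the U25% values quoted above and the leaf case counts as weights.

```python
leaves = [(6, 0.206), (9, 0.143), (1, 0.750)]   # (|Ti|, U25%) for each leaf
root   = (16, 0.157)                            # (|Ti|, U25%) if replaced by one leaf

pe_subtree = sum(n * u for n, u in leaves)      # 6*0.206 + 9*0.143 + 1*0.750
pe_leaf    = root[0] * root[1]                  # 16*0.157

print(round(pe_subtree, 3), round(pe_leaf, 3))  # 3.273 vs 2.512
if pe_leaf < pe_subtree:
    print("prune: replace the subtree with a single leaf")
```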
Weakness
Long training time
Require a number of parameters typically best determined empirically, e.g.,
the network topology or "structure"
Poor interpretability: difficult to interpret the symbolic meaning behind the
learned weights and the "hidden units" in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on a wide array of real-world data
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of rules from
trained neural networks
A Neuron (= a perceptron)
The n-dimensional input vector x is mapped into variable y by means of the scalar
product and a nonlinear function mapping
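A minimal sketch of a single neuron: the scalar product of the input vector and the weights, plus a bias, passed through a nonlinear activation; the numeric values are illustrative assumptions.

```python
import numpy as np

def neuron_output(x, w, bias):
    net = np.dot(w, x) + bias           # scalar product of inputs and weights
    return 1.0 / (1.0 + np.exp(-net))   # nonlinear (sigmoid) activation

x = np.array([1.0, 0.5, -0.3])   # n-dimensional input vector
w = np.array([0.2, -0.4, 0.8])   # connection weights
print(neuron_output(x, w, bias=0.1))
```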
Backpropagation
Iteratively process a set of training tuples & compare the network's prediction with the
actual known target value
For each training tuple, the weights are modified to minimize the mean squared error
between the network's prediction and the actual target value
Modifications are made in the "backwards" direction: from the output layer, through
each hidden layer, down to the first hidden layer, hence "backpropagation"
Steps
Initialize weights (to small random #s) and biases in the network
Propagate the inputs forward (by applying activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
Efficiency of backpropagation: Each epoch (one iteration through the training set)
takes O(|D| × w) time, with |D| tuples and w weights, but the number of epochs can be
exponential in n, the number of inputs, in the worst case (a minimal sketch follows below)
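A minimal backpropagation sketch (the one referred to above) for a network with one hidden layer, sigmoid activations, and squared error; the XOR-style toy data, layer sizes, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: XOR-like problem with 2 inputs and 1 output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Initialize weights and biases with small random numbers.
W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)
lr = 0.5

for epoch in range(10000):
    # Propagate the inputs forward.
    h = sigmoid(X @ W1 + b1)        # hidden-layer activations
    out = sigmoid(h @ W2 + b2)      # network prediction

    # Backpropagate the error (gradient of the squared error).
    err_out = (out - y) * out * (1 - out)
    err_hid = (err_out @ W2.T) * h * (1 - h)

    # Update weights and biases in the "backwards" direction.
    W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_hid;  b1 -= lr * err_hid.sum(axis=0)

print(np.round(out.ravel(), 2))   # predictions typically approach [0, 1, 1, 0]
```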
Rule extraction from networks: network pruning
Simplify the network structure by removing weighted links that have the least
effect on the trained network
Then perform link, unit, or activation value clustering
The set of input and activation values are studied to derive rules describing the
relationship between the input and hidden unit layers
Sensitivity analysis: assess the impact that a given input variable has on a network
output. The knowledge gained from this analysis can be represented in rules
SVM—Support Vector Machines
A new classification method for both linear and nonlinear data
It uses a nonlinear mapping to transform the original training data into a higher
dimension
With the new dimension, it searches for the linear optimal separating hyperplane (i.e.,
the "decision boundary")
With an appropriate nonlinear mapping to a sufficiently high dimension, data from
two classes can always be separated by a hyperplane
SVM finds this hyperplane using support vectors ("essential" training tuples) and
margins (defined by the support vectors); see the sketch after this list
Features: training can be slow but accuracy is high owing to their ability to
model complex nonlinear decision boundaries (margin maximization)
Used both for classification and prediction
Applications:
handwritten digit recognition, object recognition, speaker identification,
benchmarking time-series prediction tests
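A minimal sketch using scikit-learn's SVC (assumed to be available); the RBF kernel supplies the nonlinear mapping, and the fitted support vectors define the margin. The iris data is a stand-in dataset.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

clf = svm.SVC(kernel="rbf", C=1.0)   # nonlinear mapping via the RBF kernel
clf.fit(X_tr, y_tr)

print("test accuracy:", clf.score(X_te, y_te))
print("support vectors per class:", clf.n_support_)
```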
SVM—General Philosophy
Associative Classification
Associative classification
Association rules are generated and analyzed for use in classification
Search for strong associations between frequent patterns (conjunctions of
attribute-value pairs) and class labels
Classification: based on evaluating a set of rules of the form
p1 ∧ p2 ∧ … ∧ pl → "Aclass = C" (conf, sup)
(a classification sketch using such rules follows below)
Why effective?
It explores highly confident associations among multiple attributes and may
overcome some constraints introduced by decision-tree induction, which
considers only one attribute at a time
In many studies, associative classification has been found to be more accurate than some
traditional classification methods, such as C4.5.
Associative Classification May Achieve High Accuracy and Efficiency (Cong et al.
SIGMOD05)
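A minimal sketch of classifying a tuple with class-association rules of the form above; the rules, their confidence/support values, and the tuple are illustrative assumptions, and the highest-confidence matching rule is applied.

```python
rules = [  # (antecedent attribute-value pairs, class label, confidence, support)
    ({"outlook": "sunny", "humidity": "high"}, "no",  0.90, 0.20),
    ({"outlook": "overcast"},                  "yes", 0.95, 0.25),
    ({"windy": "false"},                       "yes", 0.70, 0.40),
]

def classify(record, rules):
    # Keep only rules whose antecedent is satisfied by the record.
    matching = [r for r in rules
                if all(record.get(a) == v for a, v in r[0].items())]
    if not matching:
        return None                              # fall back to a default class in practice
    best = max(matching, key=lambda r: r[2])     # highest-confidence rule wins
    return best[1]

print(classify({"outlook": "sunny", "humidity": "high", "windy": "false"}, rules))
```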
Other Classification Methods
The k-Nearest Neighbor Algorithm
Genetic Algorithms
Figure: A rough set approximation of the set of tuples of class C using lower and
upper approximation sets of C. The rectangular regions represent equivalence classes.
QUESTIONS
UNIT II
PART - A
1. What is association rule mining? Explain with example.
2. How association rules are mined in large databases?
3. Design a method that mines the complete set of frequent item sets without candidate
generation?
4. Explain iceberg queries with example.
5. Differentiate mining quantitative association rules and distance based association rules.
6. Explain how to improve the efficiency of apriori algorithm?
7. List out different kinds of constraint based association mining.
8. How to transform from association analysis to correlation analysis?
9. What is the difference between classification and prediction?
10. Explain the issues involved in classification and prediction.
11. Explain classification-related concepts in terms of association rule mining.
12. What is a "decision tree"?
13. Define Bayes Theorem.
14. Explain K-nearest neighbor classifiers.
15. What is linear regression?
16. What is SVM?
17. Define Lazy Learners.
PART - B
1. How classification can be done using decision tree induction?
2. Explain in detail the Bayesian classification techniques.
3. Give an overview of the different classification approaches.
4. Explain market basket analysis with a motivating example for association rule mining.
5. Explain all the ways to classify association rule mining.
6. Write the Apriori algorithm with an example.
7. Explain mining multi level association rules from transactional database.
8. Explain how to mine multidimensional association rules from relational databases and
data warehouses?
9. Write an algorithm to mine frequent item set without candidate generation. Give
suitable example.
10. Explain: Classification by Back propagation and Associative Classification.
11. Explain in detail, the Decision Tree Induction.
12. Briefly describe how classification is done using Bayes Method.