Module 3 – Classification.
Q1 Explain Decision Tree Based Classification Approach With Example.
Ans.
Decision tree-based classification is a popular machine learning approach used for both predictive modeling and decision
support. A decision tree is a tree-structured model in which each internal node represents a test on an attribute, each
branch represents an outcome of that test, and each leaf node represents a class label (the final decision).
Decision Tree-Based Classification Process:
1. Data Collection:
• Collect a dataset with labeled examples. Each example consists of a set of attributes and the corresponding class
label.
2. Data Preprocessing:
• Preprocess the data by handling missing values, encoding categorical variables, and splitting the dataset into
training and testing sets.
3. Decision Tree Construction:
• Use a decision tree algorithm (e.g., ID3, C4.5, CART) to construct the tree. The algorithm selects the best
attribute to split on at each node based on a criterion such as information gain or Gini impurity (a small sketch of this criterion appears after this list).
4. Decision Tree Training:
• Train the decision tree on the training dataset. The tree is recursively grown by making decisions at each node,
splitting the data based on the selected attribute.
5. Decision Making (Classification):
• Once the decision tree is trained, it can be used to classify new, unseen instances. Starting from the root node,
each instance traverses the tree based on the attribute tests until it reaches a leaf node, which corresponds to the
predicted class label.
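As an illustration of the attribute-selection criterion mentioned in step 3, the following minimal Python sketch computes entropy and the information gain of splitting on one attribute. The function and variable names are chosen for this illustration only, and the rows are a small subset of the golf dataset shown in the example below.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target):
    """Information gain of splitting `rows` (a list of dicts) on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return base - remainder

# Tiny illustration using a few rows of the golf dataset shown below
rows = [
    {"Outlook": "Sunny", "Play Golf": "No"},
    {"Outlook": "Sunny", "Play Golf": "No"},
    {"Outlook": "Overcast", "Play Golf": "Yes"},
    {"Outlook": "Rainy", "Play Golf": "Yes"},
]
print(information_gain(rows, "Outlook", "Play Golf"))  # 1.0 for this subset
```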
Example:
Let's consider a simple example of classifying whether a person will play golf based on weather conditions. The dataset
includes the following attributes: Outlook, Temperature, Humidity, and Wind.
Dataset:
Outlook  | Temperature | Humidity | Wind   | Play Golf
---------|-------------|----------|--------|----------
Sunny    | Hot         | High     | Weak   | No
Sunny    | Hot         | High     | Strong | No
Overcast | Hot         | High     | Weak   | Yes
Rainy    | Mild        | High     | Weak   | Yes
Rainy    | Cool        | Normal   | Weak   | Yes
Rainy    | Cool        | Normal   | Strong | No
Overcast | Cool        | Normal   | Strong | Yes
Sunny    | Mild        | High     | Weak   | No
Sunny    | Cool        | Normal   | Weak   | Yes
Rainy    | Mild        | Normal   | Weak   | Yes
Sunny    | Mild        | Normal   | Strong | Yes
Overcast | Mild        | High     | Strong | Yes
Overcast | Hot         | Normal   | Weak   | Yes
Rainy    | Mild        | High     | Strong | No
Decision Tree:
Outlook?
├── Sunny    → Humidity?
│                ├── High   → No
│                └── Normal → Yes
├── Overcast → Yes
└── Rainy    → Wind?
                 ├── Weak   → Yes
                 └── Strong → No
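As a sketch (not part of the original example), such a tree could also be trained and queried with scikit-learn, assuming pandas and scikit-learn are installed. Because the attributes are categorical, they are one-hot encoded first, and the fitted tree may differ in detail from the hand-drawn one above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The 14-row golf dataset from the example above
data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
                    "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "PlayGolf":    ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                    "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# One-hot encode the categorical attributes so the tree can split on them
X = pd.get_dummies(data.drop(columns="PlayGolf"), dtype=int)
y = data["PlayGolf"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Classify a new, unseen day: Sunny, Cool, High humidity, Strong wind
new_day = pd.get_dummies(pd.DataFrame(
    [{"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "High", "Wind": "Strong"}]
), dtype=int).reindex(columns=X.columns, fill_value=0)
print(clf.predict(new_day))  # expected to be ['No'] for this training data
```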
Module 4 – Clustering.
Q1 Explain K-Means And K-Medoids Algorithm.
Ans.
K-Means Algorithm:
K-Means is a clustering algorithm that partitions a dataset into K clusters, where each data point belongs to the cluster
with the nearest mean (centroid). The algorithm aims to minimize the sum of squared distances between data points and
their assigned cluster centroids.
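Written as a formula, the quantity K-Means minimizes (the within-cluster sum of squares) is the following, where mu_k denotes the centroid of cluster C_k and the outer sum runs over the K clusters:

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2}
```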
Steps Of The K-Means Algorithm:
1. Initialization:
• Randomly select K initial centroids, one for each cluster.
2. Assignment:
• Assign each data point to the cluster whose centroid is the closest (usually using Euclidean distance).
3. Update Centroids:
• Recalculate the centroids as the mean of all data points in each cluster.
4. Repeat:
• Repeat steps 2 and 3 until convergence (when centroids no longer change significantly) or a specified number of
iterations is reached.
5. Output:
• The final clusters and their centroids.
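Below is a minimal NumPy sketch of these steps. The toy data and the choice K = 2 are made up for illustration; in practice a library implementation such as scikit-learn's KMeans would normally be used.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means: returns (labels, centroids) for data matrix X."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Step 5: return the final assignment and centroids
    return labels, centroids

# Toy 2-D data with two obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids, sep="\n")
```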
K-Medoids Algorithm:
K-Medoids is a variation of K-Means that, instead of using the mean as the centroid, uses the actual data point from the
cluster that minimizes the sum of distances to other points in the cluster. This makes K-Medoids more robust to outliers,
as the medoid is less sensitive to extreme values.
Steps Of The K-Medoids Algorithm:
1. Initialization:
• Randomly select K initial data points as medoids.
2. Assignment:
• For each data point, assign it to the cluster represented by the closest medoid (using a distance metric such as
Euclidean distance).
3. Update Medoids:
• For each cluster, select the data point that minimizes the sum of distances to other points in the cluster as the new
medoid.
4. Repeat:
• Repeat steps 2 and 3 until convergence or a specified number of iterations is reached.
5. Output:
• The final clusters and their medoids.
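A corresponding minimal NumPy sketch of these steps follows (a simplified, PAM-style medoid update; the toy data and K are hypothetical, and dedicated library implementations are considerably more efficient).

```python
import numpy as np

def kmedoids(X, k, n_iters=100, seed=0):
    """Minimal K-Medoids: medoids are actual data points, not means."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 1: pick K random points (by index) as the initial medoids
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest medoid
        labels = dist[:, medoids].argmin(axis=1)
        # Step 3: within each cluster, pick the point with the smallest
        # total distance to the other cluster members as the new medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[costs.argmin()]
        # Step 4: stop when the set of medoids no longer changes
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    # Step 5: return the final assignment and the medoid points themselves
    return labels, X[medoids]

# Toy 2-D data, including one extreme point
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [50.0, 50.0]])
labels, medoid_points = kmedoids(X, k=2)
print(labels, medoid_points, sep="\n")
```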
Applications: K-Means and K-Medoids are widely used for image segmentation, customer and market segmentation, anomaly detection, biological network analysis, document clustering, genetics and genomics, natural language processing, and many other tasks.
Module 5 – Mining Frequent Patterns And Association.
Q1 Explain Apriori Algorithm And Steps Of Apriori Algorithm.
Ans.
The Apriori algorithm is a popular algorithm for mining frequent itemsets and generating association rules from
transactional databases. It was proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994. The Apriori algorithm
works based on the "apriori property," which states that if an itemset is frequent, then all of its subsets must also be
frequent. The algorithm uses this property to efficiently discover frequent itemsets.
Steps of the Apriori Algorithm:
1. Initialize:
• Create a table to store the support count of each itemset.
• Scan the transaction database to count the support of each individual item.
2. Generate Frequent 1-Itemsets:
• Identify frequent 1-itemsets by filtering out items with support below a predefined threshold (minimum support).
3. Generate Candidate 2-Itemsets:
• Create candidate 2-itemsets by joining frequent 1-itemsets: for every pair of distinct frequent items {A} and {B},
generate the candidate {A, B}.
4. Scan Database for Support Count:
• Scan the transaction database to count the support of each candidate 2-itemset.
• Prune candidate 2-itemsets that do not meet the minimum support threshold.
5. Generate Candidate k-Itemsets:
• Create candidate k-itemsets by joining frequent (k-1)-itemsets (with items kept in a fixed, e.g. lexicographic, order).
Two frequent (k-1)-itemsets A and B are joined into a candidate k-itemset if their first (k-2) items are equal.
6. Scan Database for Support Count (Repeat):
• Scan the transaction database to count the support of each candidate k-itemset.
• Prune candidate k-itemsets that do not meet the minimum support threshold.
7. Repeat Until No More Frequent Itemsets:
• Repeat steps 5 and 6 to generate candidate k-itemsets and scan the database until no more frequent itemsets can be
found.
8. Generate Association Rules:
• Use the frequent itemsets to generate association rules that meet a predefined confidence threshold.
• An association rule has the form A -> B, where A and B are disjoint itemsets; the rule's confidence is the ratio of the
support of A ∪ B to the support of A.
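The following Python sketch implements the itemset-mining loop described above together with the confidence calculation from step 8. The transaction database and the minimum-support value are invented for illustration, and a production system would typically rely on an optimized library implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    # Steps 1-2: count individual items and keep the frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c / n >= min_support}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Steps 3/5: join frequent (k-1)-itemsets to form candidate k-itemsets
        prev = list(frequent)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune candidates with an infrequent (k-1)-subset (apriori property)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Steps 4/6: scan the database and keep candidates meeting min_support
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c / n >= min_support}
        all_frequent.update(frequent)
        k += 1  # Step 7: repeat until no new frequent itemsets are found
    return all_frequent

# Toy transaction database and a hypothetical minimum support of 50%
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
freq = apriori(transactions, min_support=0.5)

# Step 8: confidence of a rule A -> B is support(A ∪ B) / support(A)
A, B = frozenset({"bread"}), frozenset({"milk"})
print(freq[A | B] / freq[A])  # confidence of bread -> milk (about 0.67 here)
```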