Lecture 17-18
ICT-4261
By-
Dr. Jesmin Akhter
Professor
Institute of Information Technology
Jahangirnagar University
Contents
The course will mainly cover the following topics:
A Gentle Introduction to Machine Learning
Important Elements in Machine Learning
Linear Regression
Logistic Regression
Naive Bayes
Support Vector Machines
Decision Trees and Ensemble Learning
Clustering Fundamentals
Hierarchical Clustering
Neural Networks and Deep Learning
Unsupervised Learning
Outline
However, the dark-color block will contain both males and females (the targets we want to classify). This concept is expressed using the term purity. An ideal scenario is based on nodes whose impurity is null, so that all subsequent decisions are taken only on the remaining features.
In our example, we can simply start from the color block: the split based on the length of the hair is then not needed, and the residual mixing of classes it leaves behind is what we call impurity.
Impure nodes added to the tree only make the structure bigger and more complex without improving the classification.
DECISION TREE REPRESENTATION
A decision tree can contain categorical data (YES/NO) as well as numeric data.
The figure illustrates a typical learned decision tree. This decision tree classifies Saturday mornings according to whether they are suitable for playing tennis.
For example, the instance (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong) would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative instance (i.e., the tree predicts PlayTennis = No).
The decision tree shown in the figure corresponds to the expression
(Outlook = Sunny AND Humidity = Normal) OR (Outlook = Overcast) OR (Outlook = Rain AND Wind = Weak)
How do Decision Trees work?
The decision of making strategic splits heavily affects a tree's accuracy. The decision criteria are different for classification and regression trees.
Decision trees use multiple algorithms to decide whether to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes; in other words, the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
The algorithm selection is also based on the type of target variables. Let us look at some algorithms
used in Decision Trees:
ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
MARS → (multivariate adaptive regression splines)
How do Decision Trees work?
The ID3 algorithm builds decision trees using a top-down, greedy search through the space of possible branches, with no backtracking. A greedy algorithm, as the name suggests, always makes the choice that seems best at that moment.
Steps in the ID3 algorithm:
It begins with the original set S as the root node.
On each iteration, the algorithm iterates through every unused attribute of the set S and calculates the entropy H and information gain IG of that attribute.
It then selects the attribute with the smallest entropy (equivalently, the largest information gain).
The set S is then split by the selected attribute to produce subsets of the data.
The algorithm continues to recurse on each subset, considering only attributes never selected before.
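A minimal sketch in Python of the greedy selection step described above. This is an illustration rather than a full ID3 implementation; the names `examples` (a list of dicts with a "label" key) and `attributes` (a list of attribute names) are assumptions made only for this example.

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy of the class labels in a set of examples."""
    counts = Counter(ex["label"] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(examples, attributes):
    """Greedy step: pick the attribute with the largest information gain."""
    parent_entropy = entropy(examples)
    best, best_gain = None, -1.0
    for attr in attributes:
        # Partition the examples by the values of this attribute.
        partitions = {}
        for ex in examples:
            partitions.setdefault(ex[attr], []).append(ex)
        # Weighted average entropy of the subsets produced by the split.
        weighted = sum(len(sub) / len(examples) * entropy(sub)
                       for sub in partitions.values())
        gain = parent_entropy - weighted
        if gain > best_gain:
            best, best_gain = attr, gain
    return best, best_gain
```

ID3 would call `best_attribute` at the root, split S on the chosen attribute, and recurse on each subset with that attribute removed.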
Attribute Selection Measures
If the dataset has N attributes, then deciding which attribute to place at the root, or at the different levels of the tree as internal nodes, is a complicated step. Just randomly selecting any node to be the root does not solve the issue: a random approach may give bad results with low accuracy.
To solve this attribute selection problem, researchers devised several solutions. They suggested using criteria such as:
— Entropy,
— Information gain,
— Gini index,
— Gain Ratio,
— Reduction in Variance
— Chi-Square
These criteria calculate a value for every attribute. The values are sorted, and attributes are placed in the tree in that order:
– The attribute with the highest value (in the case of information gain) is placed at the root.
– When using information gain as the criterion, attributes are assumed to be categorical.
– For the Gini index, attributes are assumed to be continuous.
Entropy
Entropy measures the level of impurity in a group of examples. Impurity is the degree of randomness; it tells how random our data is. If the target attribute can take on n different values, then the entropy of S relative to this n-wise classification is defined as:
Entropy(S) = − Σ (i = 1..n) pi log2(pi)
where pi is the proportion of S belonging to class i. Note that the logarithm is base 2 because entropy is a measure of the expected encoding length measured in bits.
Written out, this is:
Entropy = −(p1*log2(p1) + p2*log2(p2) + … + pn*log2(pn))
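This formula translates directly into code. A minimal sketch in Python, assuming the proportions p1…pn sum to 1 (terms with p = 0 are treated as contributing 0):

```python
import math

def entropy(proportions):
    """Entropy = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn))."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))     # 1.0   -> maximum impurity for two classes
print(entropy([1.0]))          # 0.0   -> a pure node
print(entropy([9/14, 5/14]))   # ~0.94 -> the PlayTennis class split used later
```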
Entropy
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. Flipping a fair coin is an example of an action whose outcome is random.
From the graph of binary entropy, it is evident that the entropy H(X) is zero when the probability is either 0 or 1. The entropy is maximum when the probability is 0.5, because that reflects perfect randomness in the data and there is no chance of perfectly determining the outcome.
Entropy-Example 01
A pure sub-split means that you should be getting either all "yes" or all "no".
Suppose a feature has 8 "yes" and 4 "no" initially; after the first split the left node gets 5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no".
We see here that the split is not pure. Why? Because we can still see some negative classes in both nodes. In order to build a decision tree, we need to calculate the impurity of each split, and when the purity is 100% we make it a leaf node.
To check the impurity of the two child nodes, we take the help of the entropy formula.
For the left node (5 "yes", 2 "no"):
Entropy = −(5/7) log2(5/7) − (2/7) log2(2/7) ≈ 0.863
Entropy-Example 01
For the right node (3 "yes", 2 "no"):
Entropy = −(3/5) log2(3/5) − (2/5) log2(2/5) ≈ 0.971
We can clearly see from the tree itself that the left node has lower entropy (more purity) than the right node, since the left node has a greater number of "yes" and it is easier to decide here.
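The two node entropies above can be checked with a few lines of Python:

```python
import math

left  = -(5/7) * math.log2(5/7) - (2/7) * math.log2(2/7)   # ~0.863 (5 "yes", 2 "no")
right = -(3/5) * math.log2(3/5) - (2/5) * math.log2(2/5)   # ~0.971 (3 "yes", 2 "no")
print(left, right)   # the left node has the lower entropy, i.e. higher purity
```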
Information gain
Information gain (IG) is a statistical property that measures how well a given attribute separates the training examples according to their target classification.
Constructing a decision tree is all about finding the attribute that returns the highest information gain and the smallest entropy.
Information gain measures the reduction of uncertainty given some feature, and it is also the deciding factor for which attribute should be selected as a decision node or root node.
Information gain is used in both classification and regression decision trees. In classification, entropy is used as the measure of impurity, while in regression, variance is used as the measure of impurity.
Information gain
Information gain is a decrease in entropy. It computes the difference between the entropy before the split and the weighted average entropy after the split of the dataset, based on the given attribute values. The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses information gain.
Information Gain = Entropy(before) − Σ (j = 1..K) (|S(j, after)| / |S(before)|) × Entropy(j, after)
where "before" is the dataset before the split, K is the number of subsets generated by the split, and (j, after) is subset j after the split.
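A minimal sketch of this computation in Python; `before` is assumed to be the list of class labels in the parent node and `subsets` the K lists of labels produced by the split (both names are chosen only for the illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(before, subsets):
    """Entropy before the split minus the weighted average entropy after the split."""
    weighted_after = sum(len(s) / len(before) * entropy(s) for s in subsets)
    return entropy(before) - weighted_after
```

The subsets are assumed to partition `before`, so their sizes sum to `len(before)`.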
Information gain
It is just entropy of the full dataset – entropy of the dataset given some feature.
To understand this better let’s consider an example:
– Suppose our entire population has a total of 30 instances. The dataset is used to predict whether a person will go to the gym or not. Let's say 16 people go to the gym and 14 people don't.
– Now we have two features to predict whether he/she will go to the gym or not.
– Feature 1 is “Energy” which takes two values “high” and “low”
– Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral” and “Highly motivated”.
Information gain-Example 01
Let's see how our decision tree will be made using these two features. We'll use information gain to decide which feature should be the root node and which feature should be placed after the split.
Let's calculate the entropy of the parent node:
E(Parent) = −(16/30) log2(16/30) − (14/30) log2(14/30) ≈ 0.99
Information gain-Example 01
Now that we have the values of E(Parent) and E(Parent|Energy), the information gain will be:
Information Gain = E(Parent) − E(Parent|Energy) ≈ 0.99 − 0.62 = 0.37
Our parent entropy was near 0.99, and looking at this value of information gain, we can say that the entropy of the dataset will decrease by 0.37 if we make "Energy" our root node.
Similarly, we will do this with the other feature “Motivation” and calculate its information gain.
Information gain-Example 01
Let’s calculate the entropy here:
Now we have the value of E(Parent) and E(Parent|Motivation), information gain will be:
Information gain-Example 01
We now see that the "Energy" feature gives a larger reduction in entropy (0.37) than the "Motivation" feature. Hence we select the feature with the highest information gain and then split the node based on that feature.
In this example, "Energy" will be our root node, and we do the same for the sub-nodes. Here we can see that when the energy is "high" the entropy is low, so we can say a person will definitely go to the gym if he has high energy; but what if the energy is low? We then split the node again based on the new feature, which is "Motivation".
Entropy-Example 02
(Golf dataset table)
Entropy-Example 02
Always remember that the higher the Entropy, the lower will be the purity and the higher will be the
impurity.
The goal of machine learning is to decrease the uncertainty or impurity in the dataset. Entropy only gives us the impurity of a particular node; it does not tell us whether the entropy of that node has decreased relative to the parent.
For this, we introduce a new metric called "information gain", which tells us how much the parent entropy has decreased after splitting on some feature.
Important Terminology
Root Node: It represents the entire population or sample and this further gets divided into two or more
homogeneous sets.
Splitting: It is a process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: Nodes that do not split further are called leaf or terminal nodes.
Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be seen as the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.
Decision Tree Algorithm
The decision tree algorithm can be used for solving both regression and classification problems.
The goal of using a decision tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from the training data.
In decision trees, to predict a class label for a record we start from the root of the tree. We compare the value of the root attribute with the record's attribute. On the basis of the comparison, we follow the branch corresponding to that value and jump to the next node.
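The prediction walk described above can be sketched as a small loop. The nested-dict tree format here is an assumption made only for the illustration; the example tree is the PlayTennis tree built in the ID3 walkthrough below.

```python
def predict(tree, record):
    """Walk from the root, following the branch that matches the record's value."""
    while isinstance(tree, dict):          # internal node: keep descending
        value = record[tree["attribute"]]  # the record's value for this node's attribute
        tree = tree["branches"][value]     # jump to the next node
    return tree                            # a leaf holds the class label

play_tennis = {
    "attribute": "Outlook",
    "branches": {
        "Overcast": "Yes",
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Rain":  {"attribute": "Wind",
                  "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

print(predict(play_tennis, {"Outlook": "Sunny", "Temperature": "Hot",
                            "Humidity": "High", "Wind": "Strong"}))  # -> "No"
```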
Decision Tree Algorithm
The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided between the classes, it has an entropy of one.
Classification using the ID3 algorithm
Calculations
Step 1: Find the entropy of the class variable.
Here, 9 samples are Yes and 5 samples are No.
H(S) = −P(Yes) log2 P(Yes) − P(No) log2 P(No)
     = −(9/14) log2(9/14) − (5/14) log2(5/14)
     = 0.94
Step 2: Calculate the average weighted entropy for each attribute
For Outlook: Values(Outlook) = Sunny, Overcast, Rain
Entropy(S_Sunny): 2 Yes, 3 No → −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
Entropy(S_Overcast): 4 Yes, 0 No → 0
Entropy(S_Rain): 3 Yes, 2 No → 0.971
Step 2: Calculate the average weighted entropy for each attribute (Cont…)
For Temperature: Values(Temperature) = Hot, Mild, Cool
Entropy(S_Hot): 2 Yes, 2 No → −(2/4) log2(2/4) − (2/4) log2(2/4) = 1
Entropy(S_Mild): 4 Yes, 2 No → 0.9183
Entropy(S_Cool): 3 Yes, 1 No → 0.8113
Step 2: Calculate the average weighted entropy for each attribute (Cont…)
For Humidity: Values(Humidity) = High, Normal
Entropy(S_High): 3 Yes, 4 No → −(3/7) log2(3/7) − (4/7) log2(4/7) = 0.9852
Entropy(S_Normal): 6 Yes, 1 No → 0.5916
Step 2: Calculate the average weighted entropy for each attribute (Cont…)
For Wind: Values(Wind) = Weak, Strong
Entropy(S_Weak): 6 Yes, 2 No → 0.8113
Entropy(S_Strong): 3 Yes, 3 No → 1
Step 3: Find the information gain: Outlook
Information gain is the difference between the parent entropy and the average weighted entropy found above.
Gain(S, Outlook) = H(S) − (5/14) Entropy(S_Sunny) − (4/14) Entropy(S_Overcast) − (5/14) Entropy(S_Rain)
Gain(S, Outlook) = 0.94 − (5/14) × 0.971 − (4/14) × 0.0 − (5/14) × 0.971 = 0.247
Gain: Temperature
Gain(S, Temperature) = 0.0289
Gain: Humidity
Gain(S, Humidity) = 0.94 − 0.788 = 0.152
Gain: Wind
Gain(S, Wind) = 0.94 − 0.8932 = 0.048
Information gain
– IG(S, Outlook) = 0.94 - 0.693 = 0.247
– IG(S, Temperature) = 0.940 - 0.911 = 0.029
– IG(S, Humidity) = 0.940 - 0.788 = 0.152
– IG(S, Wind) = 0.940 - 0.8932 = 0.048
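These four gains can be reproduced from the (Yes, No) counts used in Step 2; a quick Python check (helper names are illustrative):

```python
import math

def H(yes, no):
    """Entropy of a node holding the given Yes/No counts."""
    total = yes + no
    return -sum((c / total) * math.log2(c / total) for c in (yes, no) if c > 0)

def gain(parent, splits):
    """Parent entropy minus the weighted average entropy of the split subsets."""
    n = sum(y + m for y, m in splits)
    return H(*parent) - sum((y + m) / n * H(y, m) for y, m in splits)

parent = (9, 5)
print(gain(parent, [(2, 3), (4, 0), (3, 2)]))  # Outlook     ~0.247
print(gain(parent, [(2, 2), (4, 2), (3, 1)]))  # Temperature ~0.029
print(gain(parent, [(3, 4), (6, 1)]))          # Humidity    ~0.152
print(gain(parent, [(6, 2), (3, 3)]))          # Wind        ~0.048
```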
Finding the Root Node
Since Outlook has the highest information gain, Outlook is the root node, and it has three child branches (Sunny, Overcast, Rain).
Calculate IG of Temperature (Sunny branch)
For the Sunny subset: 2 Yes, 3 No
H(S_Sunny) = −P(Yes) log2 P(Yes) − P(No) log2 P(No) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
For Temperature: Values(Temperature) = Hot, Mild, Cool
Entropy(S_Hot): 0 Yes, 2 No → 0
Entropy(S_Mild): 1 Yes, 1 No → 1
Entropy(S_Cool): 1 Yes, 0 No → 0
Calculate IG of Humidity (Sunny branch)
For the Sunny subset: 2 Yes, 3 No
H(S_Sunny) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
For Humidity: Values(Humidity) = High, Normal
Entropy(S_High): 0 Yes, 3 No → 0
Entropy(S_Normal): 2 Yes, 0 No → 0
Calculate IG of Wind (Sunny branch)
For the Sunny subset: 2 Yes, 3 No
H(S_Sunny) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
For Wind: Values(Wind) = Weak, Strong
Entropy(S_Weak): 1 Yes, 2 No → −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.9183
Entropy(S_Strong): 1 Yes, 1 No → 1
Information gain (Sunny branch)
Gain(S_Sunny, Temperature) = 0.971 − [(2/5) × 0 + (2/5) × 1 + (1/5) × 0] = 0.571
Gain(S_Sunny, Humidity) = 0.971 − [(3/5) × 0 + (2/5) × 0] = 0.971
Gain(S_Sunny, Wind) = 0.971 − [(3/5) × 0.9183 + (2/5) × 1] = 0.020
Humidity has the highest information gain, so Humidity is selected as the decision node under the Sunny branch.
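The same quick check for the Sunny branch (parent: 2 Yes, 3 No), using the counts from the slides above:

```python
import math

def H(yes, no):
    """Entropy of a node holding the given Yes/No counts."""
    total = yes + no
    return -sum((c / total) * math.log2(c / total) for c in (yes, no) if c > 0)

def gain(parent, splits):
    """Parent entropy minus the weighted average entropy of the split subsets."""
    n = sum(y + m for y, m in splits)
    return H(*parent) - sum((y + m) / n * H(y, m) for y, m in splits)

sunny = (2, 3)
print(gain(sunny, [(0, 2), (1, 1), (1, 0)]))  # Temperature ~0.571
print(gain(sunny, [(0, 3), (2, 0)]))          # Humidity    ~0.971
print(gain(sunny, [(1, 2), (1, 1)]))          # Wind        ~0.020
```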
A decision tree for the concept PlayTennis
An example is classified by sorting it through the tree to the appropriate leaf node, then returning the classification associated with that leaf (in this case, Yes or No).
This tree classifies Saturday mornings according to whether or not they are suitable for playing tennis.
Thank You