Lecture 17-18
ICT-4261
By-
Dr. Jesmin Akhter
Professor
Institute of Information Technology
Jahangirnagar University
Contents
The course will mainly cover the following topics:
A Gentle Introduction to Machine Learning
Important Elements in Machine Learning
Linear Regression
Logistic Regression
Naive Bayes
Support Vector Machines
Decision Trees and Ensemble Learning
Clustering Fundamentals
Hierarchical Clustering
Neural Networks and Deep Learning
Unsupervised Learning
Outline
However, the dark-color block will contain both males and females (the targets we want to classify). This concept is expressed using the term purity. An ideal scenario is based on nodes whose impurity is null, so that all subsequent decisions are taken only on the remaining features.
In our example, we can simply start from the color block: the split based on the length of the hair is then not needed, and the residual mixing of classes it leaves behind is what we call impurity.
Impure nodes added to the tree only make the structure bigger and more complex without improving the classification.
DECISION TREE REPRESENTATION
A decision tree can contain categorical data (YES/NO) as well as numeric data.
The figure illustrates a typical learned decision tree. This decision tree classifies Saturday mornings according to whether they are suitable for playing tennis.
For example, the instance (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong) would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative instance (i.e., the tree predicts PlayTennis = No).
The decision tree shown in the figure corresponds to the expression
(Outlook = Sunny AND Humidity = Normal) OR (Outlook = Overcast) OR (Outlook = Rain AND Wind = Weak)
How do Decision Trees work?
The decision of making strategic splits heavily affects a tree's accuracy. The decision criteria are different for classification and regression trees.
Decision trees use multiple algorithms to decide whether to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes; in other words, the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
The algorithm selection is also based on the type of target variables. Let us look at some algorithms
used in Decision Trees:
ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
MARS → (multivariate adaptive regression splines)
How do Decision Trees work?
The ID3 algorithm builds decision trees using a top-down, greedy search through the space of possible branches, with no backtracking. A greedy algorithm, as the name suggests, always makes the choice that seems best at that moment.
Steps in the ID3 algorithm:
It begins with the original set S as the root node.
On each iteration, the algorithm iterates through every unused attribute of the set S and calculates the entropy H and information gain IG of that attribute.
It then selects the attribute with the smallest entropy (equivalently, the largest information gain).
The set S is then split by the selected attribute to produce subsets of the data.
The algorithm continues to recurse on each subset, considering only attributes never selected before.
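A minimal sketch in Python of the greedy selection step described above. This is an illustration rather than a full ID3 implementation; the names `examples` (a list of dicts with a "label" key) and `attributes` (a list of attribute names) are assumptions made only for this example.

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy of the class labels in a set of examples."""
    counts = Counter(ex["label"] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(examples, attributes):
    """Greedy step: pick the attribute with the largest information gain."""
    parent_entropy = entropy(examples)
    best, best_gain = None, -1.0
    for attr in attributes:
        # Partition the examples by the values of this attribute.
        partitions = {}
        for ex in examples:
            partitions.setdefault(ex[attr], []).append(ex)
        # Weighted average entropy of the subsets produced by the split.
        weighted = sum(len(sub) / len(examples) * entropy(sub)
                       for sub in partitions.values())
        gain = parent_entropy - weighted
        if gain > best_gain:
            best, best_gain = attr, gain
    return best, best_gain
```

ID3 would call `best_attribute` at the root, split S on the chosen attribute, and recurse on each subset with that attribute removed.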
Attribute Selection Measures
If the dataset has N attributes, then deciding which attribute to place at the root, or at the different levels of the tree as internal nodes, is a complicated step. Just randomly selecting any node to be the root does not solve the issue: a random approach may give bad results with low accuracy.
To solve this attribute selection problem, researchers devised several solutions. They suggested using criteria such as:
— Entropy,
— Information gain,
— Gini index,
— Gain Ratio,
— Reduction in Variance
— Chi-Square
These criteria calculate a value for every attribute. The values are sorted, and attributes are placed in the tree in that order:
– The attribute with the highest value (in the case of information gain) is placed at the root.
– When using information gain as the criterion, attributes are assumed to be categorical.
– For the Gini index, attributes are assumed to be continuous.
Entropy
Entropy measures the level of impurity in a group of examples. Impurity is the degree of randomness; it tells how random our data is. If the target attribute can take on n different values, then the entropy of S relative to this n-wise classification is defined as:
Entropy(S) = − Σ (i = 1..n) pi log2(pi)
where pi is the proportion of S belonging to class i. Note that the logarithm is base 2 because entropy is a measure of the expected encoding length measured in bits.
Written out, this is:
Entropy = −(p1*log2(p1) + p2*log2(p2) + … + pn*log2(pn))
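This formula translates directly into code. A minimal sketch in Python, assuming the proportions p1…pn sum to 1 (terms with p = 0 are treated as contributing 0):

```python
import math

def entropy(proportions):
    """Entropy = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn))."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))     # 1.0   -> maximum impurity for two classes
print(entropy([1.0]))          # 0.0   -> a pure node
print(entropy([9/14, 5/14]))   # ~0.94 -> the PlayTennis class split used later
```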
Entropy
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. Flipping a fair coin is an example of an action whose outcome is random.
From the graph of binary entropy, it is evident that the entropy H(X) is zero when the probability is either 0 or 1. The entropy is maximum when the probability is 0.5, because that reflects perfect randomness in the data and there is no chance of perfectly determining the outcome.
Entropy-Example 01
A pure sub-split means that you should be getting either all "yes" or all "no".
Suppose a feature has 8 "yes" and 4 "no" initially; after the first split the left node gets 5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no".
We see here that the split is not pure. Why? Because we can still see some negative classes in both nodes. In order to build a decision tree, we need to calculate the impurity of each split, and when the purity is 100% we make it a leaf node.
To check the impurity of the two child nodes, we take the help of the entropy formula.
For the left node (5 "yes", 2 "no"):
Entropy = −(5/7) log2(5/7) − (2/7) log2(2/7) ≈ 0.863
Entropy-Example 01
For the right node (3 "yes", 2 "no"):
Entropy = −(3/5) log2(3/5) − (2/5) log2(2/5) ≈ 0.971
We can clearly see from the tree itself that the left node has lower entropy (more purity) than the right node, since the left node has a greater number of "yes" and it is easier to decide here.
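The two node entropies above can be checked with a few lines of Python:

```python
import math

left  = -(5/7) * math.log2(5/7) - (2/7) * math.log2(2/7)   # ~0.863 (5 "yes", 2 "no")
right = -(3/5) * math.log2(3/5) - (2/5) * math.log2(2/5)   # ~0.971 (3 "yes", 2 "no")
print(left, right)   # the left node has the lower entropy, i.e. higher purity
```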
Information gain
Information gain (IG) is a statistical property that measures how well a given attribute separates the training examples according to their target classification.
Constructing a decision tree is all about finding the attribute that returns the highest information gain and the smallest entropy.
Information gain measures the reduction of uncertainty given some feature, and it is also the deciding factor for which attribute should be selected as a decision node or root node.
Information gain is used in both classification and regression decision trees. In classification, entropy is used as the measure of impurity, while in regression, variance is used as the measure of impurity.
Information gain
Information gain is a decrease in entropy. It computes the difference between the entropy before the split and the weighted average entropy after the split of the dataset, based on the given attribute values. The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses information gain.
Information Gain = Entropy(before) − Σ (j = 1..K) (|S(j, after)| / |S(before)|) × Entropy(j, after)
where "before" is the dataset before the split, K is the number of subsets generated by the split, and (j, after) is subset j after the split.
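A minimal sketch of this computation in Python; `before` is assumed to be the list of class labels in the parent node and `subsets` the K lists of labels produced by the split (both names are chosen only for the illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(before, subsets):
    """Entropy before the split minus the weighted average entropy after the split."""
    weighted_after = sum(len(s) / len(before) * entropy(s) for s in subsets)
    return entropy(before) - weighted_after
```

The subsets are assumed to partition `before`, so their sizes sum to `len(before)`.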
Information gain
It is just entropy of the full dataset – entropy of the dataset given some feature.
To understand this better let’s consider an example:
– Suppose our entire population has a total of 30 instances. The dataset is used to predict whether a person will go to the gym or not. Let's say 16 people go to the gym and 14 people don't.
– Now we have two features to predict whether he/she will go to the gym or not.
– Feature 1 is “Energy” which takes two values “high” and “low”
– Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral” and “Highly motivated”.
Information gain-Example 01
Let's see how our decision tree will be made using these two features. We'll use information gain to decide which feature should be the root node and which feature should be placed after the split.
Let's calculate the entropy of the parent node:
E(Parent) = −(16/30) log2(16/30) − (14/30) log2(14/30) ≈ 0.99
Information gain-Example 01
Now that we have the values of E(Parent) and E(Parent|Energy), the information gain will be:
Information Gain = E(Parent) − E(Parent|Energy) ≈ 0.99 − 0.62 = 0.37
Our parent entropy was near 0.99, and looking at this value of information gain, we can say that the entropy of the dataset will decrease by 0.37 if we make "Energy" our root node.
Similarly, we will do this with the other feature “Motivation” and calculate its information gain.
Information gain-Example 01
Let’s calculate the entropy here:
Now we have the value of E(Parent) and E(Parent|Motivation), information gain will be:
Information gain-Example 01
We now see that the "Energy" feature gives a larger reduction in entropy (0.37) than the "Motivation" feature. Hence we select the feature with the highest information gain and then split the node based on that feature.
In this example, "Energy" will be our root node, and we do the same for the sub-nodes. Here we can see that when the energy is "high" the entropy is low, so we can say a person will definitely go to the gym if he has high energy; but what if the energy is low? We then split the node again based on the new feature, which is "Motivation".
Entropy-Example 02
(Golf dataset table)
Entropy-Example 02
Always remember that the higher the Entropy, the lower will be the purity and the higher will be the
impurity.
The goal of machine learning is to decrease the uncertainty or impurity in the dataset. Entropy only gives us the impurity of a particular node; it does not tell us whether the entropy of that node has decreased relative to the parent.
For this, we introduce a new metric called "information gain", which tells us how much the parent entropy has decreased after splitting on some feature.
Important Terminology
Root Node: It represents the entire population or sample and this further gets divided into two or more
homogeneous sets.
Splitting: It is a process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: Nodes that do not split further are called leaf or terminal nodes.
Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be seen as the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.
Decision Tree Algorithm
The decision tree algorithm can be used for solving both regression and classification problems.
The goal of using a decision tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from the training data.
In decision trees, to predict a class label for a record we start from the root of the tree. We compare the value of the root attribute with the record's attribute. On the basis of the comparison, we follow the branch corresponding to that value and jump to the next node.
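The prediction walk described above can be sketched as a small loop. The nested-dict tree format here is an assumption made only for the illustration; the example tree is the PlayTennis tree built in the ID3 walkthrough below.

```python
def predict(tree, record):
    """Walk from the root, following the branch that matches the record's value."""
    while isinstance(tree, dict):          # internal node: keep descending
        value = record[tree["attribute"]]  # the record's value for this node's attribute
        tree = tree["branches"][value]     # jump to the next node
    return tree                            # a leaf holds the class label

play_tennis = {
    "attribute": "Outlook",
    "branches": {
        "Overcast": "Yes",
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Rain":  {"attribute": "Wind",
                  "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

print(predict(play_tennis, {"Outlook": "Sunny", "Temperature": "Hot",
                            "Humidity": "High", "Wind": "Strong"}))  # -> "No"
```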
Decision Tree Algorithm
The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided between the classes, it has an entropy of one.
Classification using the ID3 algorithm
Calculations
Step 1: Find the entropy of the class variable.
Here, 9 samples are Yes and 5 samples are No.
H(S) = −P(Yes) log2 P(Yes) − P(No) log2 P(No)
     = −(9/14) log2(9/14) − (5/14) log2(5/14)
     = 0.94
Step 2: Calculate the average weighted entropy for each attribute
For Outlook: Values(Outlook) = Sunny, Overcast, Rain
Entropy(S_Sunny): 2 Yes, 3 No → −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
Entropy(S_Overcast): 4 Yes, 0 No → 0
Entropy(S_Rain): 3 Yes, 2 No → 0.971
Step 2: Calculate the average weighted entropy for each attribute (Cont…)
For Temperature: Values(Temperature) = Hot, Mild, Cool
Entropy(S_Hot): 2 Yes, 2 No → −(2/4) log2(2/4) − (2/4) log2(2/4) = 1
Entropy(S_Mild): 4 Yes, 2 No → 0.9183
Entropy(S_Cool): 3 Yes, 1 No → 0.8113
Step 2: Calculate the average weighted entropy for each attribute (Cont…)
For Humidity: Values(Humidity) = High, Normal
Entropy(S_High): 3 Yes, 4 No → −(3/7) log2(3/7) − (4/7) log2(4/7) = 0.9852
Entropy(S_Normal): 6 Yes, 1 No → 0.5916
Step 2: Calculate the average weighted entropy for each attribute (Cont…)
For Wind: Values(Wind) = Weak, Strong
Entropy(S_Weak): 6 Yes, 2 No → 0.8113
Entropy(S_Strong): 3 Yes, 3 No → 1
Step 3: Find the information gain: Outlook
Information gain is the difference between the parent entropy and the average weighted entropy found above.
Gain(S, Outlook) = H(S) − (5/14) Entropy(S_Sunny) − (4/14) Entropy(S_Overcast) − (5/14) Entropy(S_Rain)
Gain(S, Outlook) = 0.94 − (5/14) × 0.971 − (4/14) × 0.0 − (5/14) × 0.971 = 0.247
Gain: Temperature
Gain(S, Temperature) = 0.0289
Gain: Humidity
Gain(S, Humidity) = 0.94 − 0.788 = 0.152
Gain: Wind
Gain(S, Wind) = 0.94 − 0.8932 = 0.048
Information gain
– IG(S, Outlook) = 0.94 - 0.693 = 0.247
– IG(S, Temperature) = 0.940 - 0.911 = 0.029
– IG(S, Humidity) = 0.940 - 0.788 = 0.152
– IG(S, Wind) = 0.940 - 0.8932 = 0.048
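These four gains can be reproduced from the (Yes, No) counts used in Step 2; a quick Python check (helper names are illustrative):

```python
import math

def H(yes, no):
    """Entropy of a node holding the given Yes/No counts."""
    total = yes + no
    return -sum((c / total) * math.log2(c / total) for c in (yes, no) if c > 0)

def gain(parent, splits):
    """Parent entropy minus the weighted average entropy of the split subsets."""
    n = sum(y + m for y, m in splits)
    return H(*parent) - sum((y + m) / n * H(y, m) for y, m in splits)

parent = (9, 5)
print(gain(parent, [(2, 3), (4, 0), (3, 2)]))  # Outlook     ~0.247
print(gain(parent, [(2, 2), (4, 2), (3, 1)]))  # Temperature ~0.029
print(gain(parent, [(3, 4), (6, 1)]))          # Humidity    ~0.152
print(gain(parent, [(6, 2), (3, 3)]))          # Wind        ~0.048
```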
Finding the Root Node
Since Outlook has the highest information gain, Outlook is the root node, and it has three child branches (Sunny, Overcast, Rain).
Calculate IG of Temperature (Sunny branch)
For the Sunny subset: 2 Yes, 3 No
H(S_Sunny) = −P(Yes) log2 P(Yes) − P(No) log2 P(No) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
For Temperature: Values(Temperature) = Hot, Mild, Cool
Entropy(S_Hot): 0 Yes, 2 No → 0
Entropy(S_Mild): 1 Yes, 1 No → 1
Entropy(S_Cool): 1 Yes, 0 No → 0
Calculate IG of Humidity (Sunny branch)
For the Sunny subset: 2 Yes, 3 No
H(S_Sunny) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
For Humidity: Values(Humidity) = High, Normal
Entropy(S_High): 0 Yes, 3 No → 0
Entropy(S_Normal): 2 Yes, 0 No → 0
Calculate IG of Wind (Sunny branch)
For the Sunny subset: 2 Yes, 3 No
H(S_Sunny) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
For Wind: Values(Wind) = Weak, Strong
Entropy(S_Weak): 1 Yes, 2 No → −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.9183
Entropy(S_Strong): 1 Yes, 1 No → 1
Information gain (Sunny branch)
Gain(S_Sunny, Temperature) = 0.971 − [(2/5) × 0 + (2/5) × 1 + (1/5) × 0] = 0.571
Gain(S_Sunny, Humidity) = 0.971 − [(3/5) × 0 + (2/5) × 0] = 0.971
Gain(S_Sunny, Wind) = 0.971 − [(3/5) × 0.9183 + (2/5) × 1] = 0.020
Humidity has the highest information gain, so Humidity is selected as the decision node under the Sunny branch.
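The same quick check for the Sunny branch (parent: 2 Yes, 3 No), using the counts from the slides above:

```python
import math

def H(yes, no):
    """Entropy of a node holding the given Yes/No counts."""
    total = yes + no
    return -sum((c / total) * math.log2(c / total) for c in (yes, no) if c > 0)

def gain(parent, splits):
    """Parent entropy minus the weighted average entropy of the split subsets."""
    n = sum(y + m for y, m in splits)
    return H(*parent) - sum((y + m) / n * H(y, m) for y, m in splits)

sunny = (2, 3)
print(gain(sunny, [(0, 2), (1, 1), (1, 0)]))  # Temperature ~0.571
print(gain(sunny, [(0, 3), (2, 0)]))          # Humidity    ~0.971
print(gain(sunny, [(1, 2), (1, 1)]))          # Wind        ~0.020
```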
A decision tree for the concept PlayTennis
An example is classified by sorting it through the tree to the appropriate leaf node, then returning the classification associated with that leaf (in this case, Yes or No).
This tree classifies Saturday mornings according to whether or not they are suitable for playing tennis.
Thank You