
Ensemble Learning

Reference: https://www.mygreatlearning.com/blog/adaboost-algorithm/
What is Ensemble Learning?
• A common practice nowadays is to check the reviews of items before buying
them.
• And when checking reviews, you often look for items with a large number
of reviews so you can be more confident about the rating.
• After going through the reviews from multiple people you decide whether to
buy the item or not.
• Ensemble models in machine learning operate on a similar idea. They combine
the decisions from multiple models to improve the overall performance.
• This approach allows for better predictive performance compared to a single
model.
• This is why ensemble methods have placed first in many prestigious
machine learning competitions, such as the Netflix Competition,
KDD 2009, and Kaggle.
Ensemble learning
• Using multiple learning algorithms together for the same task.
• In statistics and machine learning, ensemble methods use multiple learning
algorithms to obtain better predictive performance than could be obtained
from any of the constituent learning algorithms alone
• Better predictions than individual learning models.
• Ensemble methods are techniques that create multiple models and then
combine them to produce improved results.
• The main causes of error in learning models are noise, bias, and variance.
Voting or Averaging of predictions of multiple models
Ensemble Model Example of Regression
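• As a concrete sketch of the voting/averaging idea on a regression task, the snippet below uses scikit-learn's VotingRegressor (an assumed tool choice; the base models and the synthetic data are purely illustrative) to average the predictions of three different regressors.

```python
# A minimal sketch of "averaging the predictions of multiple models" for regression.
# scikit-learn is assumed; the base models and synthetic data are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Each base model is trained on the same task; the ensemble averages their outputs.
ensemble = VotingRegressor(estimators=[
    ("lr", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))  # averaged predictions of the three models
```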
Significance of Ensemble Models
• Why Use Ensemble Models?
• The accuracy of an ensemble learner is expected to be better than that of the individual algorithms
• Better Accuracy (Low Error)
• Higher Consistency (Avoids Overfitting)
• Reduced Bias and Variance Errors
• Bias and variance reference: https://www.bmc.com/blogs/bias-variance-machine-learning/

• WHEN AND WHERE TO USE ENSEMBLE MODELS?


• A single model overfits
• The result is worth the extra training cost
• Can be used for classification as well as regression
TWO GROUPS OF ENSEMBLE METHODS
• Sequential Ensemble methods: the base learners are generated sequentially
(e.g. AdaBoost)
• To exploit the dependence between the base learners.
• The overall performance can be boosted by weighting previously mislabeled
examples more heavily.
• Parallel ensemble methods: the base learners are generated in parallel (e.g.
Random Forest)
• The basic motivation of parallel methods is to exploit independence
between the base learners
• The error can be reduced dramatically by averaging
ENSEMBLE METHODS
• Homogeneous Ensembles: a single base learning algorithm is used
to produce homogeneous base learners, i.e. learners of the
same type, leading to homogeneous ensembles
• Heterogeneous Ensembles: learners of different types are combined,
leading to heterogeneous ensembles
• In order for an ensemble method to be more accurate than
any of its individual members, the base learners have to be
as accurate as possible and as diverse as possible.
Bagging or Bootstrap Aggregation
• Bagging or Bootstrap Aggregation was formally introduced by Leo Breiman in
1996.
• Bagging is an ensemble learning technique which aims to reduce learning
error by combining a set of homogeneous machine learning algorithms.
• The key idea of bagging is the use of multiple base learners which are trained
separately with a random sample from the training set, which through a
voting or averaging approach, produce a more stable and accurate model.
Bagging or Bootstrap Aggregation
Bagging or Bootstrap Aggregation
• The two main components of the bagging technique are: random sampling with
replacement (bootstrapping) and the set of homogeneous machine learning
algorithms (ensemble learning).
• The bagging process is quite easy to understand: first, “n” subsets are extracted
from the training set, then these subsets are used to train “n” base learners of
the same type. Bagging is a parallel ensemble approach.
• For making a prediction, each of the “n” learners is fed the test
sample, and their outputs are averaged (in the case of regression) or voted on
(in the case of classification).
• The most famous such approach is “bagging” (standing for “bootstrap
aggregating”) that aims at producing an ensemble model that is more
robust than the individual models composing it.
• Bootstrapping
• Let’s begin by defining bootstrapping. This statistical technique
consists in generating samples of size B (called bootstrap samples)
from an initial dataset of size N by randomly drawing B observations
with replacement.
What Is Bootstrapping?
• Bootstrapping is a statistical technique that consists in generating samples of size B (called bootstrap samples)
from an initial dataset of size N by randomly drawing B observations with replacement.
What is a Bootstrap Sample?

A bootstrap sample is a smaller sample that is “bootstrapped” from a larger sample. Bootstrapping is a type of
resampling where large numbers of smaller samples of the same size are repeatedly drawn, with replacement,
from a single original sample.

For example, let’s say your sample was made up of ten numbers: 49, 34, 21, 18, 10, 8, 6, 5, 2, 1. You randomly
draw three numbers 5, 1, and 49. You then replace those numbers into the sample and draw three numbers
again. Repeat the process of drawing x numbers B times. Usually, original samples are much larger than this
simple example, and B can reach into the thousands. After a large number of iterations, the bootstrap statistics
are compiled into a bootstrap distribution. You’re replacing your numbers back into the pot, so your resamples
can have the same item repeated several times (e.g. 49 could appear a dozen times in a dozen resamples).
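To make the ten-number example above concrete, here is a small sketch of bootstrap resampling; numpy is assumed, and the choice of B and of the mean as the bootstrap statistic is illustrative.

```python
# Bootstrap resampling of the ten-number sample described above.
import numpy as np

rng = np.random.default_rng(0)
original = np.array([49, 34, 21, 18, 10, 8, 6, 5, 2, 1])

B = 1000                      # number of bootstrap samples
boot_means = np.empty(B)
for b in range(B):
    # draw with replacement, so the same value may appear several times
    sample = rng.choice(original, size=original.size, replace=True)
    boot_means[b] = sample.mean()

# the B statistics together form the bootstrap distribution of the mean
print(boot_means.mean(), boot_means.std())
```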


• Advantages of Bagging in Machine
Learning
• Bagging minimizes the overfitting of
data
• It improves the model’s accuracy
• It deals with higher dimensional data
efficiently
Steps to Perform Bagging

• Consider there are n observations and m features in the training set. You
need to select a random sample from the training dataset with
replacement (a bootstrap sample)
• A subset of the m features is chosen randomly to create a model using the
sampled observations
• The feature offering the best split out of the lot is used to split the nodes
• The tree is grown so that you get the best split at each node
• The above steps are repeated n times. The outputs of the individual
decision trees are aggregated to give the best prediction (see the sketch below).
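• A minimal sketch of these steps with scikit-learn's BaggingClassifier (assumed available); the dataset and parameter values are illustrative only.

```python
# Bagging with decision trees as homogeneous base learners.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # ("base_estimator" in older scikit-learn)
    n_estimators=50,                     # 50 bootstrap samples -> 50 trees
    bootstrap=True,                      # sample the training set with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)
print("test accuracy:", bagging.score(X_test, y_test))  # majority vote of the 50 trees
```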
Bagging Model
• Advantages of a Bagging Model:
• 1. Bagging significantly decreases the variance without increasing bias.
• 2. Bagging methods work so well because of diversity in the training data since the sampling is done
by bootstrapping.
• 3. Also, if the training set is very large, bagging can save computational time by training each model on a
relatively smaller data set while still increasing the accuracy of the model.
• 4. Works well with small datasets as well.
• Disadvantages of a Bagging Model:
• 1. The main disadvantage of Bagging is that it improves the accuracy of the model at the expense of
interpretability i.e., if a single tree was being used as the base model, then it would have a more
attractive and easily interpretable diagram, but with the use of bagging this interpretability gets
lost.
• 2. Another disadvantage of Bootstrap Aggregation is that during sampling, we cannot interpret
which features are being selected i.e., there are chances that some features are never used, which
may result in a loss of important information.
Random Forest
• Random forest is one of the most popular tree-based supervised learning
algorithms. It is also the most flexible and easy to use.
• The algorithm can be used to solve both classification and regression
problems. Random forest combines hundreds of decision trees and
trains each decision tree on a different bootstrap sample of the observations.
• The final predictions of the random forest are made by averaging the
predictions of each individual tree.
• The benefits of random forests are numerous. The individual decision trees
tend to overfit to the training data but random forest can mitigate that
issue by averaging the prediction results from different trees. This gives
random forests a higher predictive accuracy than a single decision tree.
Random forest Application
• Random forest has been used in a variety of applications, for example
to provide recommendations of different products to customers in
e-commerce.
• In medicine, a random forest algorithm can be used to identify the
patient’s disease by analyzing the patient’s medical record.
• Also in the banking sector, it can be used to easily determine whether
the customer is fraudulent or legitimate.
Random Forest
How does the Random Forest algorithm
work?
• Steps involved in random forest algorithm:
• Step 1: In Random forest n number of random records are taken from
the data set having k number of records.
• Step 2: Individual decision trees are constructed for each sample.
• Step 3: Each decision tree will generate an output.
• Step 4: The final output is decided based on majority voting (for
classification) or averaging (for regression).
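• The four steps above can be seen directly in scikit-learn's RandomForestClassifier (an assumed tool choice; the dataset and parameters are illustrative): every fitted tree votes, and the forest reports the majority class.

```python
# Each tree in the forest votes; the forest returns the majority class.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)  # each tree is trained on its own bootstrap sample of the records

# Collect the vote of every individual tree for one sample, then compare with the forest.
tree_votes = np.array([tree.predict(X[:1])[0] for tree in forest.estimators_])
print("votes per class:", np.bincount(tree_votes.astype(int)))
print("forest prediction:", forest.predict(X[:1]))
```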
Random Forest algorithm -Example
• For example: consider the fruit basket as the data as shown in the figure below.
• Now n number of samples are taken from the fruit basket and an individual
decision tree is constructed for each sample.
• Each decision tree will generate an output as shown in the figure.
• The final output is considered based on majority voting.
• In the figure below you can see that the majority of the decision trees give the output as an
apple rather than a banana, so the final output is taken as an apple.
Difference Between Decision Tree &
Random Forest
• Random forest is a collection of decision trees; still, there are a lot of differences in their behavior

Decision Trees
1. Decision trees normally suffer from the problem of overfitting if allowed to grow without any control.
2. A single decision tree is faster in computation.
3. When a data set with features is taken as input by a decision tree, it will formulate some set of rules to do prediction.

Random Forest
1. Random forests are created from subsets of data and the final output is based on average or majority ranking, and hence the problem of overfitting is taken care of.
2. It is comparatively slower.
3. Random forest randomly selects observations, builds a decision tree, and the average result is taken. It doesn't use any set of formulas.
Important Features of Random Forest
• 1. Diversity- Not all attributes/variables/features are considered while
making an individual tree, each tree is different.
• 2. Immune to the curse of dimensionality- Since each tree does not consider
all the features, the feature space is reduced.
• 3. Parallelization-Each tree is created independently out of different data
and attributes. This means that we can make full use of the CPU to build
random forests.
• 4. Train-Test split- In a random forest we don’t have to segregate the data
into train and test sets, as roughly one-third of the data (the out-of-bag
samples) is never seen by a given decision tree.
• 5. Stability- Stability arises because the result is based on majority voting/
averaging
Important Hyperparameters of Random
Forest
• Hyperparameters are used in random forests to either enhance the performance and predictive power of
models or to make the model faster.
• The following hyperparameters increase the predictive power:
• 1. n_estimators– the number of trees the algorithm builds before averaging the predictions.
• 2. max_features– the maximum number of features random forest considers when splitting a node.
• 3. min_samples_leaf– the minimum number of samples required to be at a leaf node.
• The following hyperparameters increase the speed:

• 1. n_jobs– it tells the engine how many processors it is allowed to use. If the value is 1, it can use only one
processor but if the value is -1 there is no limit.
• 2. random_state– controls randomness of the sample. The model will always produce the same results if it has
a definite value of random state and if it has been given the same hyperparameters and the same training data.
• 3. oob_score– OOB means out of bag. It is a random-forest cross-validation method in which roughly one-third
of the samples are not used to train a given tree and are instead used to evaluate its performance. These
samples are called out-of-bag samples.
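• The hyperparameters listed above map directly onto scikit-learn's RandomForestClassifier (note the scikit-learn spelling min_samples_leaf); the values below are a sketch, not tuned recommendations.

```python
# The hyperparameters discussed above, set on a RandomForestClassifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees built before averaging/voting
    max_features="sqrt",  # features considered when splitting a node
    min_samples_leaf=2,   # minimum samples required at a leaf node
    n_jobs=-1,            # -1: use all available processors
    random_state=42,      # makes the sampling reproducible
    oob_score=True,       # evaluate on the out-of-bag samples
)
forest.fit(X, y)
print("OOB accuracy:", forest.oob_score_)
```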
BOOSTING
• Boosting refers to a family of algorithms that are able to convert weak
learners to strong learners.
• The main principle of boosting is to fit a sequence of weak learners
(models that are only slightly better than random guessing, such as small
decision trees) to weighted versions of the data.
• More weight is given to examples that were misclassified by earlier rounds.
• The predictions are then combined through a weighted majority vote
(classification) or a weighted sum (regression) to produce the final
prediction.
• The principal difference between boosting and the committee methods,
such as bagging, is that base learners are trained in sequence on a
weighted version of the data.
Boosting
• The boosting algorithm combines these multiple weak rules into a single prediction
rule that reduces the variance and bias of the individual rules, such that it
is much more accurate than any one of the weak rules.
• The basic principle behind the working of the boosting algorithm is to generate
multiple weak learners and combine their predictions to form one strong rule.
• These weak rules are generated by applying base Machine Learning algorithms on
different distributions of the data set.
• These algorithms generate weak rules for each iteration.
• After multiple iterations, the weak learners are combined to form a strong learner
that will predict a more accurate outcome.
• Two fundamental approaches for effective implementation of Boosting algorithm:
• Choosing the different subsets from training dataset for different iterations:
• To increase the efficiency of the base learner predictions, higher weight is placed on the
examples that were misclassified by earlier weak learners.
• How to combine the weak learners together:
• Taking a weighted majority vote of the predictions.
AdaBoost

• Reference- https://www.youtube.com/watch?v=LsK-xG1cLYA
• The idea of the boosting method is that, instead of using a single simple algorithm, which
is not strong enough to make accurate predictions alone because of its high
variance and error rate, we combine multiple simple learning algorithms
rather than searching for a single highly accurate prediction rule.
• Hence, we can say that the boosting algorithm is adaptive.
• These simple algorithms are known as weak learners, and the boosting algorithm
calls the weak learner multiple times, feeding it a different subset of the
training samples each time, so that the base learning algorithm generates a
new weak prediction rule in every round.
How the Ada Boost algorithm works:

• Step 1: The base algorithm reads the data and assigns equal weight to each
sample observation.
• Step 2: False predictions made by the base learner are identified. In the
next iteration, these false predictions are assigned to the next base learner
with a higher weightage on these incorrect predictions.
• Step 3: Repeat step 2 until the algorithm can correctly classify the output.
• Therefore, the main aim of boosting is to focus more on misclassified
predictions.
• Now that we know how the boosting algorithm works, let’s understand the
different types of boosting techniques.
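• As a minimal sketch of the three steps above, scikit-learn's AdaBoostClassifier (assumed available) boosts depth-1 trees, i.e. stumps, the conventional weak learner; the dataset and parameter values are illustrative.

```python
# AdaBoost over decision stumps: each round re-weights the misclassified samples.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a stump as the weak learner
    n_estimators=100,                               # number of boosting rounds
    learning_rate=1.0,
    random_state=0,                                 # ("base_estimator" in older scikit-learn)
)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))
```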
AdaBoost

• The boosting technique follows a sequential order. The output of one base
learner will be input to another.
• If a training instance is misclassified by a base classifier (red box), its weight is increased
(over-weighting) so that the next base learner classifies it more correctly.
• The next logical step is to combine the classifiers to predict the results.
• Gradient Boosting, AdaBoost, and XGBoost are some extensions of
boosting methods.
• Gradient boosting minimizes the loss but adds gradient optimization in the
iteration, whereas Adaptive Boosting, or AdaBoost, tweaks the instance of
weights for every new predictor.
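• A brief sketch contrasting the two boosting flavours just mentioned, using scikit-learn's implementations (an assumed tool choice; the dataset and parameters are illustrative).

```python
# AdaBoost re-weights training instances; gradient boosting fits each new tree
# to the gradient of the loss instead.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

ada = AdaBoostClassifier(n_estimators=100, random_state=0)
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)

for name, model in [("AdaBoost", ada), ("GradientBoosting", gbm)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```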
AdaBoost
• First of all, AdaBoost is short for Adaptive Boosting. Basically, AdaBoost
was the first really successful boosting algorithm developed
for binary classification.
• Generally, AdaBoost is used with short decision trees. After the
first tree is created, its performance on each training
instance is used to weight how much attention the next tree
should pay to each training instance.
• Hence, training data that is hard to predict is given more weight,
whereas easy-to-predict instances are given less weight.
Data Preparation for AdaBoost

• Quality Data:
• Because the ensemble method attempts to correct misclassifications in
the training data, you need to make sure that the training data is
of high quality.
• Outliers:
• Outliers will force the ensemble down the rabbit hole of working hard
to correct cases that are unrealistic. These
could be removed from the training dataset.
• Noisy Data:
• Noisy data, specifically noise in the output variable, can be
problematic. If possible, attempt to isolate and clean it from your
training dataset.
How Does AdaBoost Work?

• Reference
-https://www.mygreatlearning.com/blog/adaboost-algorithm/
• First, let us discuss how boosting works. It makes ‘n’ number of
decision trees during the data training period.
• As the first decision tree/model is made, the incorrectly classified
record in the first model is given priority.
• Only these records are sent as input for the second model.
• The process goes on until we specify a number of base learners we
want to create.
• Remember, repetition of records is allowed with all boosting
techniques.
• This figure shows how the first model is made and errors from the first model are noted by
the algorithm. The record which is incorrectly classified is used as input for the next model.
• This process is repeated until the specified condition is met.
• As you can see in the figure, there are ‘n’ number of models made by taking the errors from
the previous model.
• This is how boosting works. The models 1,2, 3,…, N are individual models that can be known as
decision trees. All types of boosting models work on the same principle. 
Adaboost
• Since we now know the boosting principle, it will be easy to understand the AdaBoost algorithm.
• Let’s dive into AdaBoost’s working. When the random forest is used, the algorithm makes an ‘n’
number of trees.
• It makes proper trees that consist of a root node with several leaf nodes. Some trees might be bigger
than others, but there is no fixed depth in a random forest.
• With AdaBoost, however, the algorithm only makes a node with two leaves, known as a stump.
• The figure here represents the stump. It can be seen clearly that it has only one node with two leaves.
These stumps are weak learners and boosting techniques prefer this. The order of stumps is very
important in AdaBoost
• Here’s a sample dataset consisting of only three features
where the output is in categorical form.
• The image shows the actual representation of the dataset. As
the output is in binary/categorical form, it becomes a
classification problem.
• In real life, the dataset can have any number of records and
features in it.
• Let us consider 5 records for explanation purposes. The
output is in categorical form, here in the form of Yes or No. All
these records will be assigned a sample weight.
• The formula used for this is ‘W = 1/N’, where N is the number of
records. In this dataset, there are only 5 records, so the
sample weight is initially 1/5.
• Every record gets the same weight. In this case, it’s 1/5. 
• Step 1 – Creating the First Base Learner
• To create the first learner, the algorithm takes the first feature, i.e., feature 1 and creates the first stump, f1. It
will create the same number of stumps as the number of features. In the case below, it will create 3 stumps as
there are only 3 features in this dataset.
• From these stumps, it will create three decision trees. This process can be called the stumps-base learner
model. Out of these 3 models, the algorithm selects only one. Two properties are considered while selecting a
base learner – Gini and Entropy.
• We must calculate Gini or Entropy the same way it is calculated for decision trees. The stump with the least
value will be the first base learner. In the figure below, all the 3 stumps can be made with 3 features.
• The number below the leaves represents the correctly and incorrectly classified records. By using these records,
the Gini or Entropy index is calculated.
• The stump that has the least Entropy or Gini will be selected as the base learner. Let’s assume that the entropy
index is the least for stump 1. So, let’s take stump 1, i.e., feature 1 as our first base learner.
• Here, feature (f1) has classified 2 records correctly and 1 incorrectly. The row in the figure that is marked red is
incorrectly classified. For this, we will be calculating the total error.
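• To make the stump-selection criterion above concrete, here is a small helper sketch (numpy assumed) that computes the weighted Gini and entropy of one candidate stump; the leaf counts are purely illustrative, not the ones from the figure.

```python
# Impurity helpers for comparing candidate stumps: pick the stump with the
# lowest weighted Gini (or entropy) as the first base learner.
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# illustrative ("Yes", "No") counts in the two leaves of one candidate stump
leaves = [(2, 1), (1, 1)]
n_total = sum(sum(leaf) for leaf in leaves)
print("weighted Gini:   ", sum(sum(l) / n_total * gini(l) for l in leaves))
print("weighted entropy:", sum(sum(l) / n_total * entropy(l) for l in leaves))
```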
• Step 2 – Calculating the Total Error (TE)
• The total error is the sum of the sample weights of all the incorrectly
classified records. In our case, there is only 1 error,
so Total Error (TE) = 1/5.
• Step 3 – Calculating the Performance of the Stump
• The formula for calculating the Performance of the Stump is:
• Performance of Stump = (1/2) * ln((1 - TE) / TE)
• where ln is the natural log and TE is the Total Error.
• In our case, TE is 1/5. By substituting the value of total error in the above formula and solving it, we get
the value for the performance of the stump as 0.693. Why is it necessary to calculate the TE and
performance of a stump?
• The answer is, we must update the sample weight before proceeding to the next model or stage
because if the same weight is applied, the output received will be from the first model.
• In boosting, only the wrong records/incorrectly classified records would get more preference than the
correctly classified records. Thus, only the wrong records from the decision tree/stump are passed on
to another stump.
• Whereas in AdaBoost, both correctly and incorrectly classified records are allowed to pass, and the wrong
records are repeated more often than the correct ones.
• We must increase the weight for the wrongly classified records and decrease the weight for the
correctly classified records. In the next step, we will be updating the weights based on the performance
of the stump.
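• A quick numerical check of Steps 2 and 3 with the values used above (numpy assumed):

```python
# Total error and stump performance for the 5-record example.
import numpy as np

sample_weight = 1 / 5                 # initial weight of each of the 5 records
total_error = 1 * sample_weight       # one misclassified record -> TE = 1/5
performance = 0.5 * np.log((1 - total_error) / total_error)
print(performance)                    # ~0.693, matching the value quoted above
```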
• Step 4 – Updating Weights
• For incorrectly classified records, the formula for updating weights is:
• New Sample Weight = Sample Weight * e^(Performance)
• In our case Sample weight = 1/5 so, 1/5 * e^ (0.693) = 0.399
• For correctly classified records, we use the same formula with the
performance value being negative. This leads the weight for correctly
classified records to be reduced as compared to the incorrectly
classified ones. The formula is:
• New Sample Weight = Sample Weight * e^- (Performance)
• Putting the values, 1/5 * e^-(0.693) = 0.100
• The updated weights for all the records can be seen in the figure. As is known, the total
sum of all the weights should be 1.
• In this case, however, the total updated weight of all the records is not 1, it is 0.799.
To bring the sum to 1, every updated weight must be divided by the total sum of the
updated weights.
• For example, if our updated weight is 0.399 and we divide it by 0.799,
i.e. 0.399/0.799 = 0.50, then 0.50 is the normalized weight. In the figure we can
see all the normalized weights, and their sum is approximately 1.
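• Step 4 and the normalization above, reproduced numerically (numpy assumed; the single misclassified record is placed at index 1, as in the example):

```python
# Update and normalize the sample weights for the 5-record example.
import numpy as np

performance = 0.693
weights = np.full(5, 1 / 5)
misclassified = np.array([False, True, False, False, False])

weights[misclassified] *= np.exp(performance)    # 0.2 * e^0.693  ~= 0.399
weights[~misclassified] *= np.exp(-performance)  # 0.2 * e^-0.693 ~= 0.100

print(weights.sum())                 # ~0.799, so the weights no longer sum to 1
normalized = weights / weights.sum()
print(normalized)                    # the wrong record now carries ~0.50 of the weight
```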
• Step 5 – Creating a New Dataset
• Now, it’s time to create a new dataset from our previous one. In the new dataset, the
frequency of incorrectly classified records will be more than the correct ones. The new
dataset has to be created using and considering the normalized weights. The sampling will most probably
pick the wrongly classified records for training purposes, and the new dataset will be used to build the
second decision tree/stump. To make a new dataset based on the normalized weights, the algorithm divides the weights into buckets.
• So, our first bucket is from 0 – 0.13, second will be from 0.13 – 0.63(0.13+0.50), third will be
from 0.63 – 0.76(0.63+0.13), and so on. After this the algorithm will run 5 iterations to select
different records from the older dataset. Suppose in the 1st iteration, the algorithm will take
a random value 0.46 to see which bucket that value falls into and select that record in the
new dataset. It will again select a random value, see which bucket it is in and select that
record for the new dataset. The same process is repeated 5 times.
• There is a high probability for wrong records to get selected several times. This will form the
new dataset. It can be seen in the image below that row number 2 has been selected
multiple times from the older dataset as that row is incorrectly classified in the previous one.
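• The bucket-based resampling can be sketched as follows (numpy assumed); the normalized weights approximate the values from the example, and the misclassified record sits at index 1.

```python
# Cumulative normalized weights define the buckets; uniform random draws pick
# the records that go into the new dataset.
import numpy as np

rng = np.random.default_rng(0)
normalized = np.array([0.125, 0.50, 0.125, 0.125, 0.125])  # ~ the values above

buckets = np.cumsum(normalized)             # 0.125, 0.625, 0.75, 0.875, 1.0
draws = rng.random(5)                       # e.g. a draw of 0.46 falls in bucket 2
new_rows = np.searchsorted(buckets, draws)  # indices of the selected records
print(new_rows)  # the misclassified record (index 1) tends to be selected repeatedly
```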
• Based on this new dataset, the algorithm will create a new decision
tree/stump, and it will repeat the same process from step 1, stump after
stump, until the error becomes small compared to what it was in the
initial stage.
Stacking
• Model stacking is an efficient ensemble method in which the predictions,
generated by using various machine learning algorithms, are used as inputs in
a second-layer learning algorithm.
• This second-layer algorithm is trained to optimally combine the model
predictions to form a new set of predictions.
• For example, when linear regression is used as the second-layer model, it
estimates these weights by minimizing the sum of squared errors.
• However, the second-layer modeling is not restricted to only linear models; the
relationship between the predictors can be more complex, opening the door
to employing other machine learning algorithms.
Stacking

Model stacking uses a second-level algorithm to estimate prediction weights in the ensemble model.
Stacking
• In the standard stacking procedure, the first-level classifiers are fit to the same
training set that is used to prepare the inputs for the second-level classifier, which
may lead to overfitting.
• The StackingCVClassifier, however, uses the concept of cross-validation: the
dataset is split into k folds, and in k successive rounds, k-1 folds are used to fit
the first-level classifiers; in each round, the first-level classifiers are then applied
to the remaining fold that was not used for model fitting in that iteration.
• The resulting predictions are then stacked and provided -- as input data -- to the
second-level classifier. After the training of the StackingCVClassifier, the
first-level classifiers are fit to the entire dataset as illustrated in the figure
below.
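• The slide names mlxtend's StackingCVClassifier; a minimal sketch of the same cross-validated stacking idea is shown below with scikit-learn's StackingClassifier (an assumed substitution, chosen because it exposes the same behaviour through its cv parameter). The base models and dataset are illustrative.

```python
# Cross-validated stacking: out-of-fold predictions of the level-one models
# are used to train the final (second-level) estimator.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

stack = StackingClassifier(
    estimators=[                       # first-level (level-one) classifiers
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", SVC(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # second-level learner
    cv=5,   # the final estimator is trained on out-of-fold predictions
)
stack.fit(X, y)                        # level-one models are then refit on all data
print("training accuracy:", stack.score(X, y))
```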
How Does the Algorithm Decide Output for Test
Data?
• Suppose with the above dataset, the algorithm constructed 3 decision
trees or stumps. The test dataset will pass through all the stumps
which have been constructed by the algorithm. While passing through
the 1st stump, the output it produces is 1. Passing through the 2nd
stump, the output generated once again is 1. While passing through
the 3rd stump it gives the output as 0. In the AdaBoost algorithm too,
a (weighted) majority vote takes place between the stumps, similar in
spirit to the voting in random forests. In this case, the final output
will be 1. This is how the output for test data is decided.
Stacking
The simplest form of stacking can be described as an ensemble learning technique where the predictions of
multiple classifiers (referred to as level-one classifiers) are used as new features to train a meta-classifier. The
meta-classifier can be any classifier of your choice.
Figure 1 shows how three different classifiers get trained. Their predictions get stacked and are used as
features to train the meta-classifier which makes the final prediction
Stacking
1. The level-one predictions should come from a subset of the training data that was not used to train the level-one classifiers.

A simple way to achieve this is to split your training set in half. Use the first half of your training data to train the level-one classifiers. Then use the trained level-one classifiers to make predictions on the second half of the training data. These predictions should then be used to train the meta-classifier.

A more robust way to do this is to use k-fold cross-validation to generate the level-one predictions. Here, the training data is split into k folds. Then the first k-1 folds are used to train the level-one classifiers. The validation fold is then used to generate a subset of the level-one predictions. The process is repeated for each unique group. Figure 2 illustrates this process.
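A sketch of generating level-one predictions with k-fold cross-validation, as described above, using scikit-learn's cross_val_predict (an assumed tool choice; the base models and meta-classifier are illustrative).

```python
# Out-of-fold level-one predictions feed the meta-classifier.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
level_one = [RandomForestClassifier(random_state=0), KNeighborsClassifier()]

# Each row is predicted by a model that never saw it during training.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in level_one
])

meta = LogisticRegression().fit(meta_features, y)  # the meta-classifier
# At prediction time, the level-one models (refit on all the data) produce the
# features that are passed to the meta-classifier.
```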
