Ensemble Learning
Reference: https://www.mygreatlearning.com/blog/adaboost-algorithm/
What is Ensemble Learning?
• A common practice nowadays is to check the reviews of items before buying
them.
• And when checking reviews, you often look for items with a large number of reviews so you can be more confident about their rating.
• After going through the reviews from multiple people you decide whether to
buy the item or not.
• Ensemble models in machine learning operate on a similar idea. They combine
the decisions from multiple models to improve the overall performance.
• This approach allows for better predictive performance compared to a single
model.
• This is the reason why ensemble methods were placed first in many
prestigious machine learning competitions, such as the Netflix Competition,
KDD 2009, and Kaggle.
Ensemble learning
• Using multiple learning algorithms together for the same task.
• In statistics and machine learning, ensemble methods use multiple learning
algorithms to obtain better predictive performance than could be obtained
from any of the constituent learning algorithms alone
• Better predictions than individual learning models.
• Ensemble methods are techniques that create multiple models and then
combine them to produce improved results.
• The main causes of error in learning models are noise, bias, and variance.
Voting or Averaging of predictions of multiple models
Ensemble Model Example of Regression
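As a minimal sketch of averaging the predictions of multiple models for regression (a synthetic dataset and two arbitrarily chosen base regressors are assumed here, not the exact example from the original figure):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic regression data (stand-in for the example dataset)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two individual models
lin = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# Ensemble prediction: simple average of the two models' predictions
ensemble_pred = (lin.predict(X_test) + tree.predict(X_test)) / 2

for name, pred in [("linear", lin.predict(X_test)),
                   ("tree", tree.predict(X_test)),
                   ("average ensemble", ensemble_pred)]:
    print(name, mean_squared_error(y_test, pred))
```

The averaged prediction typically has a lower error than at least one of the individual models, which is the point the example illustrates.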
Significance of Ensemble Models
• Why Use Ensemble Models?
• Accuracy of an ensemble learner is supposed to be better than that of the individual algorithms
• Better Accuracy (Low Error)
• Higher Consistency (Avoids Overfitting)
• Reduced Bias and Variance Errors
• Bias and variance reference: https://www.bmc.com/blogs/bias-variance-machine-learning/
A bootstrap sample is a smaller sample that is “bootstrapped” from a larger sample. Bootstrapping is a type of
resampling where large numbers of smaller samples of the same size are repeatedly drawn, with replacement,
from a single original sample.
For example, let’s say your sample was made up of ten numbers: 49, 34, 21, 18, 10, 8, 6, 5, 2, 1. You randomly
draw three numbers 5, 1, and 49. You then replace those numbers into the sample and draw three numbers
again. Repeat the process of drawing x numbers B times. Usually, original samples are much larger than this
simple example, and B can reach into the thousands. After a large number of iterations, the bootstrap statistics
are compiled into a bootstrap distribution. You’re replacing your numbers back into the pot, so your resamples
can have the same item repeated several times (e.g. 49 could appear a dozen times in a dozen resamples).
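A minimal sketch of this resampling idea in Python (the ten-number sample and the choice of drawing 3 numbers B times mirror the example above; the statistic computed here, the mean, is only an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
original_sample = np.array([49, 34, 21, 18, 10, 8, 6, 5, 2, 1])

B = 1000          # number of bootstrap resamples
draw_size = 3     # numbers drawn per resample, as in the example

# Draw with replacement, so the same item can appear several times
bootstrap_stats = [rng.choice(original_sample, size=draw_size, replace=True).mean()
                   for _ in range(B)]

# The compiled statistics form the bootstrap distribution
print(np.mean(bootstrap_stats), np.percentile(bootstrap_stats, [2.5, 97.5]))
```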
• Advantages of Bagging in Machine Learning
• Bagging minimizes the overfitting of data
• It improves the model's accuracy
• It deals with higher-dimensional data efficiently
Steps to Perform Bagging
• Consider there are n observations and m features in the training set. You need to select a random sample from the training dataset with replacement (a bootstrap sample)
• A subset of the m features is chosen randomly to create a model using the sample observations
• The feature offering the best split out of the lot is used to split the nodes
• The tree is grown, so you have the best root nodes
• The above steps are repeated to build multiple trees; the outputs of the individual decision trees are aggregated to give the best prediction (see the sketch after this list)
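A minimal sketch of these steps using scikit-learn's BaggingClassifier (the synthetic dataset and hyperparameter values are illustrative assumptions; the base learner defaults to a decision tree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 50 base trees is trained on a bootstrap sample of the
# observations (bootstrap=True draws with replacement) and a random
# subset of the features; their predictions are aggregated by voting.
bag = BaggingClassifier(
    n_estimators=50,
    max_samples=0.8,      # fraction of observations per bootstrap sample
    max_features=0.8,     # fraction of features per base learner
    bootstrap=True,
    random_state=42,
)
bag.fit(X_train, y_train)
print("bagging accuracy:", bag.score(X_test, y_test))
```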
Bagging Model
• Advantages of a Bagging Model:
• 1. Bagging significantly decreases the variance without increasing bias.
• 2. Bagging methods work so well because of diversity in the training data since the sampling is done
by bootstrapping.
• 3. Also, if the training set is very large, it can save computational time by training each model on a relatively smaller data set and still increase the accuracy of the model.
• 4. Works well with small datasets as well.
• Disadvantages of a Bagging Model:
• 1. The main disadvantage of Bagging is that it improves the accuracy of the model at the expense of
interpretability i.e., if a single tree was being used as the base model, then it would have a more
attractive and easily interpretable diagram, but with the use of bagging this interpretability gets
lost.
• 2. Another disadvantage of Bootstrap Aggregation is that during sampling, we cannot interpret
which features are being selected i.e., there are chances that some features are never used, which
may result in a loss of important information.
Random Forest
• Random forest is one of the most popular tree-based supervised learning
algorithms. It is also the most flexible and easy to use.
• The algorithm can be used to solve both classification and regression
problems. Random forest tends to combine hundreds of decision trees and
then trains each decision tree on a different sample of the observations.
• The final predictions of the random forest are made by averaging the
predictions of each individual tree.
• The benefits of random forests are numerous. The individual decision trees
tend to overfit to the training data but random forest can mitigate that
issue by averaging the prediction results from different trees. This gives
random forests a higher predictive accuracy than a single decision tree.
Random forest Application
• Random forest has been used in a variety of applications, for example
to provide recommendations of different products to customers in
e-commerce.
• In medicine, a random forest algorithm can be used to identify the
patient’s disease by analyzing the patient’s medical record.
• Also in the banking sector, it can be used to easily determine whether
the customer is fraudulent or legitimate.
Random Forest
How does the Random Forest algorithm
work?
• Steps involved in random forest algorithm:
• Step 1: In Random forest n number of random records are taken from
the data set having k number of records.
• Step 2: Individual decision trees are constructed for each sample.
• Step 3: Each decision tree will generate an output.
• Step 4: The final output is decided based on majority voting for classification or averaging for regression.
Random Forest algorithm -Example
• For example: consider the fruit basket as the data as shown in the figure below.
• Now n number of samples are taken from the fruit basket and an individual
decision tree is constructed for each sample.
• Each decision tree will generate an output as shown in the figure.
• The final output is considered based on majority voting.
• In the below figure you can see that the majority of the decision trees give the output as an apple rather than a banana, so the final output is taken as an apple.
Difference Between Decision Tree &
Random Forest
• Random forest is a collection of decision trees; still, there are a lot of differences in their behavior. Important hyperparameters of the random forest implementation include:
• 1. n_jobs– it tells the engine how many processors it is allowed to use. If the value is 1, it can use only one
processor but if the value is -1 there is no limit.
• 2. random_state– controls randomness of the sample. The model will always produce the same results if it has
a definite value of random state and if it has been given the same hyperparameters and the same training data.
• 3. oob_score – OOB means out of bag. It is a random forest cross-validation method in which roughly one-third of the samples are not used to train each tree and are instead used to evaluate its performance. These samples are called out-of-bag samples (a short usage sketch follows this list).
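A short usage sketch showing these hyperparameters on scikit-learn's RandomForestClassifier (the synthetic dataset and values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,
    n_jobs=-1,        # use all available processors
    random_state=42,  # fixed seed -> reproducible results
    oob_score=True,   # evaluate on the out-of-bag samples
)
rf.fit(X, y)
print("OOB score:", rf.oob_score_)
```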
BOOSTING
• Boosting refers to a family of algorithms that are able to convert weak
learners to strong learners.
• The main principle of boosting is to fit a sequence of weak learners−
models that are only slightly better than random guessing, such as small
decision trees− to weighted versions of the data.
• More weight is given to examples that were misclassified by earlier rounds.
• The predictions are then combined through a weighted majority vote
(classification) or a weighted sum (regression) to produce the final
prediction.
• The principal difference between boosting and the committee methods,
such as bagging, is that base learners are trained in sequence on a
weighted version of the data.
Boosting
• The boosting algorithm then combines these multiple weak rules into a single prediction rule that reduces the variance and bias of the individual rules, such that it is much more accurate than any one of the weak rules.
• The basic principle behind the working of the boosting algorithm is to generate
multiple weak learners and combine their predictions to form one strong rule.
• These weak rules are generated by applying base Machine Learning algorithms on
different distributions of the data set.
• These algorithms generate weak rules for each iteration.
• After multiple iterations, the weak learners are combined to form a strong learner
that will predict a more accurate outcome.
• Two fundamental approaches for effective implementation of Boosting algorithm:
• Choosing the different subsets from training dataset for different iterations:
• To increase the efficiency of the base learner predictions, high weightage is placed on the
examples that were misclassified by earlier weak learner.
• How to combine weak learners together:
• Taking a weighted majority vote of the predictions.
AdaBoost
• Reference- https://www.youtube.com/watch?v=LsK-xG1cLYA
• The idea of the boosting method is that, instead of using a single simple algorithm that is not strong enough to make accurate predictions alone (because of its high variance and error rate), we combine multiple simple learning algorithms rather than searching for a single highly accurate prediction rule.
• Hence, we can say that the boosting algorithm is adaptive.
• These simple algorithms are known as weak learners. The boosting algorithm calls a weak learner multiple times, feeding it a different subset of the training samples each time, so that the base learning algorithm generates a new weak prediction rule on each call.
How the Ada Boost algorithm works:
• Step 1: The base algorithm reads the data and assigns equal weight to each
sample observation.
• Step 2: False predictions made by the base learner are identified. In the next iteration, these misclassified observations are passed to the next base learner with a higher weight on the incorrect predictions.
• Step 3: Repeat step 2 until the algorithm can correctly classify the output.
• Therefore, the main aim of boosting is to focus more on misclassified predictions.
• Now that we know how the boosting algorithm works, let’s understand the
different types of boosting techniques.
AdaBoost
• The boosting technique follows a sequential order. The output of one base
learner will be input to another.
• If an instance is misclassified by a base classifier (red box), its weight gets increased (over-weighting) so that the next base learner classifies it more correctly.
• The next logical step is to combine the classifiers to predict the results.
• Gradient Descent Boosting, AdaBoost, and XGBoost are some extensions of boosting methods.
• Gradient boosting minimizes the loss by adding gradient optimization in each iteration, whereas Adaptive Boosting, or AdaBoost, tweaks the instance weights for every new predictor (see the sketch below).
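Gradient boosting is only mentioned in passing here; a minimal scikit-learn sketch, with an illustrative synthetic dataset and assumed hyperparameter values, might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Each of the 100 shallow trees is fit to the gradient of the loss on the
# current ensemble's predictions, and its contribution is shrunk by the
# learning rate before being added to the ensemble.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=42)
print("CV accuracy:", cross_val_score(gbm, X, y, cv=5).mean())
```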
AdaBoost
• First of all, AdaBoost is short for Adaptive Boosting. Basically, Ada
Boosting was the first really successful boosting algorithm developed
for binary classification
• Generally, AdaBoost is used with short decision trees. After the first tree is created, its performance on each training instance is used to weight how much attention the next tree should pay to each training instance.
• Hence, training data that is hard to predict is given more weight, whereas instances that are easy to predict are given less weight (see the sketch below).
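A minimal usage sketch of AdaBoost in scikit-learn (the dataset is synthetic and the hyperparameter values are assumptions; in scikit-learn the default base learner is already a depth-1 decision tree, i.e. a stump):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 50 stumps are trained in sequence; after each stump the sample
# weights of misclassified instances are increased so the next stump
# pays more attention to the hard-to-predict records.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```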
Data Preparation for AdaBoost
• Quality Data:
• Because the ensemble method attempts to correct misclassifications in the training data, you need to be careful that the training data is of high quality.
• Outliers:
• Outliers can force the ensemble down the rabbit hole of working hard to correct cases that are unrealistic. These could be removed from the training dataset.
• Noisy Data:
• Noisy data, specifically noise in the output variable, can be problematic. If possible, attempt to isolate and clean these cases from your training dataset.
How Does AdaBoost Work?
• Reference
-https://www.mygreatlearning.com/blog/adaboost-algorithm/
• First, let us discuss how boosting works. It makes ‘n’ number of
decision trees during the data training period.
• As the first decision tree/model is made, the incorrectly classified
record in the first model is given priority.
• Only these records are sent as input for the second model.
• The process goes on until the specified number of base learners has been created.
• Remember, repetition of records is allowed with all boosting
techniques.
• This figure shows how the first model is made and errors from the first model are noted by
the algorithm. The record which is incorrectly classified is used as input for the next model.
• This process is repeated until the specified condition is met.
• As you can see in the figure, there are ‘n’ number of models made by taking the errors from
the previous model.
• This is how boosting works. The models 1, 2, 3, …, N are individual models, such as decision trees. All types of boosting models work on the same principle.
AdaBoost
• Since we now know the boosting principle, it will be easy to understand the AdaBoost algorithm.
• Let’s dive into AdaBoost’s working. When the random forest is used, the algorithm makes an ‘n’
number of trees.
• It makes proper trees that consist of a root node with several leaf nodes. Some trees might be bigger than others, but there is no fixed depth in a random forest.
• With AdaBoost, however, the algorithm only makes a node with two leaves, known as a stump.
• The figure here represents the stump. It can be seen clearly that it has only one node with two leaves.
These stumps are weak learners and boosting techniques prefer this. The order of stumps is very
important in AdaBoost
• Here’s a sample dataset consisting of only three features
where the output is in categorical form.
• The image shows the actual representation of the dataset. As
the output is in binary/categorical form, it becomes a
classification problem.
• In real life, the dataset can have any number of records and
features in it.
• Let us consider a dataset of 5 records for explanation purposes. The output is in categorical form, here in the form of Yes or No. All these records will be assigned a sample weight.
• The formula used for this is ‘W=1/N’ where N is the number of
records. In this dataset, there are only 5 records, so the
sample weight becomes 1/5 initially.
• Every record gets the same weight. In this case, it’s 1/5.
• Step 1 – Creating the First Base Learner
• To create the first learner, the algorithm takes the first feature, i.e., feature 1 and creates the first stump, f1. It
will create the same number of stumps as the number of features. In the case below, it will create 3 stumps as
there are only 3 features in this dataset.
• From these stumps, it will create three decision trees. This process can be called the stumps-base learner
model. Out of these 3 models, the algorithm selects only one. Two properties are considered while selecting a
base learner – Gini and Entropy.
• We must calculate Gini or Entropy the same way it is calculated for decision trees. The stump with the least
value will be the first base learner. In the figure below, all the 3 stumps can be made with 3 features.
• The number below the leaves represents the correctly and incorrectly classified records. By using these records,
the Gini or Entropy index is calculated.
• The stump that has the least Entropy or Gini will be selected as the base learner. Let’s assume that the entropy
index is the least for stump 1. So, let’s take stump 1, i.e., feature 1 as our first base learner.
• Here, feature (f1) has classified 2 records correctly and 1 incorrectly. The row in the figure that is marked red is
incorrectly classified. For this, we will be calculating the total error.
• Step 2 – Calculating the Total Error (TE)
• The total error is the sum of the sample weights of all the incorrectly classified records. In our case, there is only 1 error, so Total Error (TE) = 1/5.
• Step 3 – Calculating Performance of the Stump
• The formula for calculating the performance of the stump is:
• Performance of the Stump = ½ × ln((1 − TE) / TE)
• where ln is the natural log and TE is the Total Error.
• In our case, TE is 1/5. By substituting the value of total error in the above formula and solving it, we get
the value for the performance of the stump as 0.693. Why is it necessary to calculate the TE and
performance of a stump?
• The answer is, we must update the sample weight before proceeding to the next model or stage
because if the same weight is applied, the output received will be from the first model.
• In boosting, only the wrong records/incorrectly classified records would get more preference than the
correctly classified records. Thus, only the wrong records from the decision tree/stump are passed on
to another stump.
• Whereas, in AdaBoost, both records were allowed to pass and the wrong records are repeated more
than the correct ones.
• We must increase the weight for the wrongly classified records and decrease the weight for the
correctly classified records. In the next step, we will be updating the weights based on the performance
of the stump.
• Step 4 – Updating Weights
• For incorrectly classified records, the formula for updating weights is:
• New Sample Weight = Sample Weight * e^(Performance)
• In our case Sample weight = 1/5 so, 1/5 * e^ (0.693) = 0.399
• For correctly classified records, we use the same formula with the
performance value being negative. This leads the weight for correctly
classified records to be reduced as compared to the incorrectly
classified ones. The formula is:
• New Sample Weight = Sample Weight * e^- (Performance)
• Putting the values, 1/5 * e^-(0.693) = 0.100
The updated weight for all the records can be seen in the figure. Since the total sum of all the weights should be 1, the updated weights are then normalized by dividing each one by the sum of the updated weights. A small numeric check of these steps is sketched below.
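A small numeric check of Steps 2–4 in Python (which record is misclassified is an assumption made purely for illustration; the numbers reproduce the 1/5, 0.693, 0.399, and 0.100 values above):

```python
import numpy as np

n = 5
weights = np.full(n, 1 / n)                      # equal sample weights, 1/5 each
misclassified = np.array([0, 0, 1, 0, 0], bool)  # assume one record is wrong

te = weights[misclassified].sum()                # Step 2: Total Error = 1/5
performance = 0.5 * np.log((1 - te) / te)        # Step 3: ~0.693

# Step 4: increase weights of wrong records, decrease weights of correct ones
new_weights = np.where(misclassified,
                       weights * np.exp(performance),    # ~0.399
                       weights * np.exp(-performance))   # ~0.100

normalized = new_weights / new_weights.sum()     # re-scale so the weights sum to 1
print(te, round(performance, 3), np.round(new_weights, 3), np.round(normalized, 3))
```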
Model stacking uses a second-level algorithm to estimate prediction weights in the ensemble model.
Stacking
• In the standard stacking procedure, the first-level classifiers are fit to the same training set that is used to prepare the inputs for the second-level classifier, which may lead to overfitting.
• The StackingCVClassifier, however, uses the concept of cross-validation: the dataset is split into k folds, and in k successive rounds, k−1 folds are used to fit the first-level classifiers; in each round, the first-level classifiers are then applied to the remaining fold that was not used for model fitting in that iteration.
• The resulting predictions are then stacked and provided as input data to the second-level classifier. After the training of the StackingCVClassifier, the first-level classifiers are fit to the entire dataset, as illustrated in the figure below (a minimal usage sketch also follows).
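A minimal usage sketch of the StackingCVClassifier from the mlxtend library (the choice of base classifiers, meta-classifier, and synthetic dataset are illustrative assumptions; mlxtend must be installed):

```python
from mlxtend.classifier import StackingCVClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# First-level classifiers produce out-of-fold predictions (cv=5);
# those predictions are stacked and used to train the meta-classifier.
stack = StackingCVClassifier(
    classifiers=[KNeighborsClassifier(), RandomForestClassifier(random_state=42)],
    meta_classifier=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))
```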
How Does the Algorithm Decide Output for Test
Data?
• Suppose with the above dataset, the algorithm constructed 3 decision
trees or stumps. The test dataset will pass through all the stumps
which have been constructed by the algorithm. While passing through
the 1st stump, the output it produces is 1. Passing through the 2nd
stump, the output generated once again is 1. While passing through
the 3rd stump, it gives the output as 0. In the AdaBoost algorithm too, a majority vote takes place between the stumps, in the same way as in random forests. In this case, the final output will be 1. This is how the output for test data is decided.
Stacking
The simplest form of stacking can be described as an ensemble learning technique where the predictions of multiple classifiers (referred to as level-one classifiers) are used as new features to train a meta-classifier. The meta-classifier can be any classifier of your choice.
Figure 1 shows how three different classifiers get trained. Their predictions get stacked and are used as features to train the meta-classifier, which makes the final prediction.
Stacking
1. The level one predictions should come from a subset of the training data that was not used to train the level one
classifiers.
A simple way to achieve this is to split your training set in half. Use the first half of your training data to train the level one classifiers. Then use the trained level one classifiers to make predictions on the second half of the training data. These predictions are then used as the features to train the meta-classifier (sketched below).
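A minimal sketch of this half-split approach (the classifier choices and the synthetic dataset are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Split the training data in half: half A trains the level-one classifiers,
# half B provides the predictions that train the meta-classifier.
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=42)

level_one = [KNeighborsClassifier(), RandomForestClassifier(random_state=42)]
for clf in level_one:
    clf.fit(X_a, y_a)

# Stack the level-one predictions on half B as new features
meta_features = np.column_stack([clf.predict(X_b) for clf in level_one])
meta_clf = LogisticRegression().fit(meta_features, y_b)

# At prediction time, new data goes through the level-one models first
new_meta = np.column_stack([clf.predict(X_b[:5]) for clf in level_one])
print(meta_clf.predict(new_meta))
```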