Machine Learning Unit-1
Introduction to machine learning:- Machine learning is a growing technology which enables computers to learn automatically from past data. Machine learning uses various algorithms for building mathematical models and making predictions using historical data or information. Currently, it is being used for tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.
In the real world, we are surrounded by humans who can learn everything from their experiences with their learning capability, and we have computers or machines which work on our instructions. But can a machine also learn from experiences or past data like a human does? This is where Machine Learning comes in. Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experiences on its own. The term machine learning was first introduced by Arthur Samuel in 1959. Machine learning enables a machine to automatically learn from data, improve its performance with experience, and predict things without being explicitly programmed. With the help of sample historical data, known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together for creating predictive models. Machine learning constructs or uses algorithms that learn from historical data. The more information we provide, the higher the performance will be.
A machine has the ability to learn if it can improve its performance by gaining more data.
The need for machine learning is increasing day by day. The reason behind this need is that machine learning is capable of doing tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot access and process huge amounts of data manually, so we need computer systems, and this is where machine learning makes things easy for us. We can train machine learning algorithms by providing them with huge amounts of data and letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by the cost function. With the help of machine learning, we can save both time and money. The importance of machine learning can be easily understood by its use cases; currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions by Facebook, etc. Top companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to analyze user interests and recommend products accordingly.
Machine learning models:- Machine learning can be classified into three types of methods: supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning:- Supervised learning is a type of machine learning method in which we provide sample labeled data to the machine learning system in order to train it, and on that basis, it predicts the output. The system creates a model using labeled data to understand the dataset and learn about each data point; once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output or not. The goal of supervised learning is to map input data to the output data. Supervised learning is based on supervision, and it is the same as when a student learns under the supervision of a teacher. An example of supervised learning is spam filtering. It can be grouped into two categories of algorithms (a minimal sketch follows the list below):
o Classification
o Regression
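The following is a minimal sketch of supervised learning (a classification task) using scikit-learn. The dataset (the bundled iris data) and the choice of logistic regression are illustrative assumptions, not something prescribed by these notes.

```python
# Supervised learning sketch: train on labeled data, then test on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # features X and labels y (labeled data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)         # supervised classifier
model.fit(X_train, y_train)                       # learn the input-to-output mapping
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```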
Unsupervised learning:- Unsupervised learning is a learning method in which a machine learns without any supervision.
The training is provided to the machine with a set of data that has not been labeled, classified, or categorized, and the algorithm needs to act on that data without any supervision. The goal of unsupervised learning is to restructure the input data into new features or groups of objects with similar patterns. It can be further divided into two categories of algorithms (a minimal sketch follows the list below):
o Clustering
o Association
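The following is a minimal sketch of unsupervised learning (clustering) with scikit-learn. The synthetic data and the choice of k = 3 clusters are illustrative assumptions.

```python
# Unsupervised learning sketch: group unlabeled points into clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # unlabeled data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)            # groups similar points without supervision
print("Cluster sizes:", [list(labels).count(c) for c in range(3)])
```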
Reinforcement learning:- Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each correct action and a penalty for each wrong one, and improves its behaviour from this feedback. A robotic dog, which automatically learns the movement of its legs, is an example of reinforcement learning.
There are some common methods used to find a possible hypothesis from the hypothesis space, where the hypothesis space is represented by uppercase H and a hypothesis by lowercase h. These are defined as follows:
Hypothesis (h):
It is defined as the approximate function that best describes the target in supervised machine learning algorithms. It is primarily based on the data as well as the bias and restrictions applied to the data. Hence, a hypothesis (h) is a single candidate function that maps inputs to the proper outputs and can be evaluated as well as used to make predictions. A hypothesis (h) can be formulated in machine learning as follows:
y = mx + b
Where,
y: range (predicted output)
m: slope of the line that divides the data, i.e. change in y divided by change in x
x: domain (input)
b: intercept (constant)
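The following is a minimal sketch of evaluating such a linear hypothesis in Python. The values of m, b and the sample inputs are illustrative assumptions, not values from these notes.

```python
# A single linear hypothesis h(x) = m*x + b from the hypothesis space of all lines.
def h(x, m=2.0, b=1.0):
    """Predict y from x using slope m and intercept b."""
    return m * x + b

# Evaluate the hypothesis on a few sample inputs from the domain.
for x in [0.0, 1.5, 3.0]:
    print(f"x = {x:>4} -> predicted y = {h(x)}")
```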
Example: Let's understand the hypothesis (h) and hypothesis space (H) with a two-dimensional coordinate plane showing the distribution of data. Assume we have some test data for which an ML algorithm must predict the outputs. Each way of dividing this coordinate plane so that it separates the outputs corresponds to one hypothesis (h), and the set of all such possible divisions forms the hypothesis space (H). (The accompanying figures showing the data distribution and the dividing line are omitted here.)
Bias and Variance:-
Machine learning is a branch of Artificial Intelligence, which allows machines to perform data analysis and make predictions. However, if the machine learning model is not accurate, it can make prediction errors, and these prediction errors are usually known as bias and variance. In machine learning, these errors will always be present, as there is always a slight difference between the model's predictions and the actual values. The main aim of ML/data science analysts is to reduce these errors in order to get more accurate results. In this topic, we are going to discuss bias and variance, the bias-variance trade-off, underfitting and overfitting. But before starting, let's first understand what errors in machine learning are.
Cross-Validation:-
In machine learning, there is always the need to test the stability of the model; we cannot judge a model only on the training dataset it was fitted on. For this purpose, we reserve a particular sample of the dataset which was not part of the training dataset. After that, we test our model on that sample before deployment, and this complete process comes under cross-validation. This is something different from the general train-test split.
Validation Set Approach: The input dataset is divided into a training set and a validation set, typically in a 50:50 ratio. A big disadvantage is that we are using only 50% of the dataset to train our model, so the model may miss out on capturing important information in the data. It also tends to give an underfitted model.
Leave-P-Out Cross-Validation
In this approach, p data points are left out of the training data. This means that if there are n data points in total in the original input dataset, then n-p data points are used as the training set and the p data points as the validation set. This complete process is repeated for all possible samples, and the average error is calculated to know the effectiveness of the model.
There is a disadvantage of this technique: it can be computationally expensive for large p.
Leave-One-Out Cross-Validation
This method is the special case of leave-p-out with p = 1 (a minimal sketch follows the list below):
o In this approach, the bias is minimal, as all the data points are used.
o The process is executed n times; hence the execution time is high.
o This approach leads to high variation in testing the effectiveness of the model, as we iteratively check against a single data point.
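Below is a minimal sketch of leave-one-out cross-validation with scikit-learn (LeavePOut with larger p follows the same pattern). The dataset and estimator are illustrative assumptions.

```python
# Leave-one-out CV: one model fit per data point, averaged at the end.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("Number of iterations (n):", len(scores))   # one fit per data point
print("Mean accuracy:", scores.mean())
```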
K-Fold Cross-Validation
The k-fold cross-validation approach divides the input dataset into k groups of samples of equal size. These samples are called folds. For each learning set, the prediction function uses k-1 folds, and the remaining fold is used as the test set. This approach is a very popular CV approach because it is easy to understand, and the output is less biased than with other methods.
Let's take an example of 5-fold cross-validation, so the dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train it. In the 2nd iteration, the second fold is used to test the model, and the rest are used to train it. This process continues until each fold has been used as the test fold.
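The following is a minimal sketch of 5-fold cross-validation with scikit-learn; the dataset and estimator are illustrative assumptions.

```python
# 5-fold CV: every fold is used once as the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)   # 5 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```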
Stratified K-Fold Cross-Validation
This technique is similar to k-fold cross-validation with some small changes. This approach works on the concept of stratification, which is the process of rearranging the data to ensure that each fold or group is a good representative of the complete dataset. It is one of the best approaches to deal with bias and variance.
It can be understood with the example of housing prices: the price of some houses can be much higher than that of other houses. To tackle such situations, a stratified k-fold cross-validation technique is useful.
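For a classification target, the same idea can be sketched with scikit-learn's StratifiedKFold, where each fold keeps roughly the class proportions of the full dataset; the dataset and estimator below are illustrative assumptions.

```python
# Stratified 5-fold CV: folds preserve the class distribution.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print("Mean accuracy:", scores.mean())
```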
Holdout Method
This method is the simplest cross-validation technique of all. In this method, we remove a subset of the data, train the model on the remaining part of the dataset, and use the held-out subset to get prediction results. The error that occurs in this process tells how well our model will perform on an unknown dataset. Although this approach is simple to perform, it still faces the issue of high variance, and it also sometimes produces misleading results.
o Train/test split: The input data is divided into two parts, a training set and a test set, in a ratio of 70:30, 80:20, etc. It provides high variance, which is one of its biggest disadvantages (a minimal sketch follows the list below).
o Training data: The training data is used to train the model, and the dependent variable is known.
o Test data: The test data is used to make predictions from the model that has already been trained on the training data. It has the same features as the training data but is not part of it.
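The following is a minimal sketch of the holdout method with a 70:30 train/test split; the dataset and the decision tree model are illustrative assumptions.

```python
# Holdout method: a single train/test split, here 70:30.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Holdout (test) accuracy:", model.score(X_test, y_test))
```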
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
o Under ideal conditions, it provides the optimum output, but for inconsistent data it may produce drastically different results. This is one of the big disadvantages of cross-validation, as there is no certainty about the type of data in machine learning.
o In predictive modeling, the data evolves over time, which may create differences between the training and validation sets. For example, if we create a model for the prediction of stock market values and train it on the previous 5 years' stock values, the realistic future values for the next 5 years may be drastically different, so it is difficult to expect correct output in such situations.
Applications of Cross-Validation
o This technique can be used to compare the performance of different predictive modeling
methods.
o It can also be used for meta-analysis, as it is already being used by data scientists in the field of medical statistics.
Dimensionality Reduction:- The number of input features, variables, or columns present in a given dataset is known as its dimensionality, and the process of reducing these features is called dimensionality reduction.
In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." These techniques are widely used in machine learning for obtaining a better-fitting predictive model while solving classification and regression problems.
It is commonly used in fields that deal with high-dimensional data, such as speech recognition, signal processing, bioinformatics, etc. It can also be used for data visualization, noise reduction, cluster analysis, etc.
Some benefits of applying a dimensionality reduction technique to a given dataset are given below:
o By reducing the dimensions of the features, the space required to store the dataset is also reduced.
o Less computation and training time is required for a reduced number of feature dimensions.
o Reduced feature dimensions help in visualizing the data quickly.
There are also some disadvantages of applying dimensionality reduction, which are given below:
o In the PCA dimensionality reduction technique, the number of principal components to retain is sometimes unknown.
There are two ways to apply dimensionality reduction techniques, which are given below:
Feature Selection
Feature selection is the process of selecting a subset of the relevant features and leaving out the irrelevant features present in a dataset to build a model of high accuracy. In other words, it is a way of selecting the optimal features from the input dataset.
1. Filter Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some common techniques of filter methods are (a minimal sketch follows the list below):
o Correlation
o Chi-Square Test
o ANOVA
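The following is a minimal sketch of a filter method: scoring each feature with the chi-square test and keeping the k best ones. The dataset and k = 2 are illustrative assumptions.

```python
# Filter method sketch: univariate chi-square scores, keep the top-k features.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)      # keep the 2 most relevant features
X_selected = selector.fit_transform(X, y)
print("Original shape:", X.shape, "-> reduced shape:", X_selected.shape)
print("Chi-square scores per feature:", selector.scores_)
```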
2. Wrapper Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and its performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more complex to apply. Some common techniques of wrapper methods are (a minimal sketch follows the list below):
o Forward Selection
o Backward Selection
o Bi-directional Elimination
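Below is a minimal sketch of a wrapper method: recursive feature elimination (RFE), which repeatedly fits a model and drops the weakest features. The dataset, the logistic regression estimator, and "keep 5 features" are illustrative assumptions.

```python
# Wrapper method sketch: RFE wrapped around a logistic regression model.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)       # scaling helps the model converge

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_scaled, y)                               # repeated fits, dropping weak features
print("Number of features kept:", rfe.n_features_)
print("Selected feature mask:", rfe.support_)
```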
3. Embedded Methods: Embedded methods check the different training iterations of the machine learning model and evaluate the importance of each feature. Some common techniques of embedded methods are (a minimal sketch follows the list below):
o LASSO
o Elastic Net
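The following is a minimal sketch of an embedded method: LASSO's L1 penalty is built into training and can drive some coefficients to exactly zero, which acts as feature selection. The dataset and alpha value are illustrative assumptions, and how many features are dropped depends on alpha.

```python
# Embedded method sketch: LASSO zeroes out weak features during training.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=2.0).fit(X_scaled, y)          # L1 penalty baked into the fit
print("Lasso coefficients:", np.round(lasso.coef_, 1))
print("Features kept (non-zero coefficient):", np.flatnonzero(lasso.coef_))
```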
Feature Extraction:
Feature extraction is the process of transforming a space with many dimensions into a space with fewer dimensions. This approach is useful when we want to keep the whole information but use fewer resources while processing it.
a. Principal Component Analysis (PCA)
Principal Component Analysis is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the principal components. It is one of the popular tools used for exploratory data analysis and predictive modeling. (Kernel PCA is a non-linear variant of the same idea.)
PCA works by considering the variance of each attribute, because high variance indicates a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
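Below is a minimal sketch of feature extraction with PCA in scikit-learn; KernelPCA exposes the same interface for non-linear data. The dataset and the choice of 2 components are illustrative assumptions.

```python
# PCA sketch: project 4 correlated features onto 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)       # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)                # 4 features -> 2 principal components
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_pca.shape)
```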
Backward Feature Elimination
The backward feature elimination technique is mainly used while developing linear regression or logistic regression models. The steps below are performed in this technique to reduce dimensionality or perform feature selection:
o Firstly, all n variables of the given dataset are taken to train the model.
o Then we remove one feature at a time and train the model on the remaining n-1 features, doing this n times, and compute the performance of the model each time.
o We find the variable whose removal causes the smallest (or no) change in the performance of the model and drop that variable; after that, we are left with n-1 features.
By repeating this process and choosing the optimum model performance and the maximum tolerable error rate, we can define the optimal number of features required for the machine learning algorithm.
Forward Feature Selection
Forward feature selection follows the inverse process of backward elimination. In this technique, we do not eliminate features; instead, we find the best features that produce the highest increase in the performance of the model. The steps below are performed in this technique (a minimal sketch follows the list below):
o We start with a single feature only, and progressively add one feature at a time.
o The process is repeated until adding features no longer gives a significant increase in the performance of the model.
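The following is a minimal sketch of forward feature selection using scikit-learn's SequentialFeatureSelector (passing direction="backward" gives backward elimination instead). The dataset, the linear regression estimator, and "select 4 features" are illustrative assumptions.

```python
# Forward selection sketch: add one feature at a time, keeping the best subset.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("Selected feature indices:", sfs.get_support(indices=True))
```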
Missing Value Ratio
If a dataset has too many missing values, we drop those variables, as they do not carry much useful information. To perform this, we can set a threshold level, and if a variable has more missing values than that threshold, we drop that variable. The choice of threshold controls how aggressive the reduction is.
Low Variance Filter
As with the missing value ratio technique, data columns with little variation in their values carry less information. Therefore, we calculate the variance of each variable, and all data columns with variance lower than a given threshold are dropped, because low-variance features will not affect the target variable.
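Below is a minimal sketch of both filters using pandas and scikit-learn. The toy DataFrame and the thresholds (50% missing, variance below 0.01) are illustrative assumptions.

```python
# Missing value ratio filter followed by a low variance filter.
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, np.nan, np.nan],   # 60% missing -> dropped
    "b": [1.0, 1.0, 1.0, 1.0, 1.0],            # zero variance -> dropped
    "c": [0.5, 1.7, 2.9, 4.1, 5.3],
})

missing_ratio = df.isna().mean()               # fraction of missing values per column
df = df.loc[:, missing_ratio <= 0.5]

vt = VarianceThreshold(threshold=0.01)         # drop near-constant columns
kept = vt.fit(df).get_support()
print("Columns kept:", list(df.columns[kept]))  # ['c']
```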
High Correlation Filter
High correlation refers to the case where two variables carry approximately the same information. This can degrade the performance of the model. The correlation between independent numerical variables is measured by the correlation coefficient; if this value is higher than a threshold value, we can remove one of the variables from the dataset. We keep the variables or features that show a high correlation with the target variable.
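The following is a minimal sketch of a high correlation filter with pandas; the 0.9 threshold and the synthetic data (where x2 is deliberately almost a copy of x1) are illustrative assumptions.

```python
# High correlation filter: drop one variable from each highly correlated pair.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 * 2 + rng.normal(scale=0.01, size=200),  # near-duplicate of x1
                   "x3": rng.normal(size=200)})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))    # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("Highly correlated columns to drop:", to_drop)                 # ['x2']
```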
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine learning. The algorithm has built-in feature importance, so we do not need to program it separately. In this technique, we generate a large set of trees against the target variable, and with the help of the usage statistics of each attribute, we find the relevant subset of features.
The random forest algorithm takes only numerical variables, so we need to convert the input data into numeric data using one-hot encoding.
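Below is a minimal sketch of ranking features by random forest importances; the dataset and the "top 5" cut-off are illustrative assumptions.

```python
# Random forest feature selection sketch: rank features by built-in importances.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

importances = forest.feature_importances_         # built-in feature importance
top5 = np.argsort(importances)[::-1][:5]
print("Top 5 feature indices:", top5)
print("Their importances:", np.round(importances[top5], 3))
```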
Factor Analysis
Factor analysis is a technique in which each variable is kept within a group according to its correlation with other variables; that is, variables within a group can have a high correlation among themselves, but a low correlation with variables of other groups.
We can understand it with an example: suppose we have two variables, income and spending. These two variables have a high correlation, which means people with high income spend more, and vice versa. So such variables are put into a group, and that group is known as a factor. The number of these factors is small compared to the original dimension of the dataset.
Auto-encoders
One of the popular methods of dimensionality reduction is the auto-encoder, which is a type of ANN (artificial neural network) whose main aim is to copy its inputs to its outputs. In an auto-encoder, the input is compressed into a latent-space representation, and the output is reconstructed from this representation. It has mainly two parts (a minimal sketch follows the list below):
o Encoder: The function of the encoder is to compress the input to form the latent-space representation.
o Decoder: The function of the decoder is to recreate the output from the latent-space representation.
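The following is a minimal sketch of an auto-encoder in Keras. The input dimension (20), latent dimension (3), random data, and training settings are all illustrative assumptions.

```python
# Auto-encoder sketch: encoder compresses to a latent space, decoder reconstructs.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 20, 3

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(latent_dim, activation="relu")(inputs)    # encoder
decoded = layers.Dense(input_dim, activation="linear")(encoded)  # decoder

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(500, input_dim)                 # placeholder data
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)        # learn to copy input

encoder = keras.Model(inputs, encoded)             # latent-space representation only
print(encoder.predict(X, verbose=0).shape)         # (500, 3)
```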
Subset selection:- A growing number of machine learning problems involve finding subsets of data points. Examples range from selecting subsets of labeled or unlabeled data points, to subsets of features or model parameters, to selecting subsets of pixels, key points, or sentences in image segmentation, correspondence, and summarization problems. The area encompasses a wide variety of topics, ranging from theoretical aspects of subset selection (e.g. coresets, submodularity, determinantal point processes) to several practical applications, e.g. time- and energy-efficient learning, learning under resource constraints, active learning, human-assisted learning, feature selection, model compression, and feature induction. Subset selection is naturally emerging and has often been considered in isolation in each of these applications; by connecting the theoretical foundations of subset selection (in areas such as coresets and submodularity) with its applications (such as feature selection, active learning, data-efficient learning, model compression, and human-assisted machine learning), many technical innovations can be reused across these subareas and applications.
Shrinkage Methods in Linear Regression:- Shrinkage methods (also known as regularization) apply a penalty term to the loss function used in the model. Minimizing the loss function is equivalent to maximizing the accuracy. To understand this better, we need to look at the loss function used in linear regression. Linear regression uses least squares to calculate the minimum error between the actual values and the predicted values: the aim is to minimize the squared difference between the actual and predicted values, so as to draw the best possible regression curve for the best prediction accuracy.
Shrinking the coefficient estimates significantly reduces their variance. When we perform shrinking, we essentially bring the coefficient estimates closer to 0. The need for shrinkage methods arises due to the issues of underfitting or overfitting the data. When we want to minimize the mean error (the Mean Squared Error (MSE) in the case of linear regression), we need to optimize the bias-variance trade-off.
The bias-variance trade-off indicates the level of underfitting or overfitting of the data with respect to the linear regression model applied to it. High bias with low variance means the model is underfitted, and low bias with high variance means the model is overfitted. We need to trade off between bias and variance to achieve the combination that gives the minimum Mean Squared Error, as shown by the graph described below.
(Figure omitted: in that plot, the green curve is the variance, the black curve is the squared bias, and the purple curve is the MSE, shown as functions of the regularization parameter lambda, which is covered later.)
Uses of shrinkage methods:-
The best-known shrinkage methods are ridge regression and lasso regression, which are often used in place of plain linear regression. Ridge regression, like linear regression, aims to minimize the Residual Sum of Squares (RSS), but with a slight change: a penalty on the size of the coefficients is added.
We now know that there are better methods than simple linear regression in the form of ridge regression and lasso regression, which account for the underfitting and overfitting of data.
How does Regularization Work?:- Regularization works by adding a penalty or complexity term to the complex model. Let's consider the simple linear regression equation:
y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b
Here β0, β1, ..., βn are the weights or magnitudes attached to the features, and b represents the intercept (bias) of the model. Linear regression models try to optimize these weights and b to minimize the cost function. The cost (loss) function for the linear model is the Residual Sum of Squares (RSS):
RSS = Σ (yi - ŷi)^2, i.e. the sum of the squared differences between the actual values yi and the values ŷi predicted by the model.
We then add a penalty term to this loss function and optimize the parameters so that the model can predict accurate values of y.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is
introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It is also called L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the ridge regression penalty. We can calculate it by multiplying lambda by the squared weight of each individual feature.
o The equation for the cost function in ridge regression is therefore:
Cost = Σ (yi - ŷi)^2 + λ Σ βj^2
o In the above equation, the penalty term regularizes the coefficients of the model; hence ridge regression reduces the magnitudes of the coefficients, which decreases the complexity of the model.
o As we can see from the above equation, if the values of λ tend to zero, the equation becomes
the cost function of the linear regression model. Hence, for the minimum value of λ, the model
will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to ridge regression except that the penalty term contains only the absolute values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a coefficient all the way to 0, whereas ridge regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of lasso regression is:
Cost = Σ (yi - ŷi)^2 + λ Σ |βj|
o Some of the features in this technique are completely neglected for model evaluation.
o Hence, lasso regression can help us reduce overfitting in the model as well as perform feature selection.
o Ridge regression is mostly used to reduce overfitting in the model, and it includes all the features present in the model. It reduces the complexity of the model by shrinking the coefficients.
o Lasso regression helps to reduce overfitting in the model and also performs feature selection. (A minimal sketch comparing the two follows below.)
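The contrast is easy to see numerically. Below is a minimal sketch using scikit-learn's Ridge and Lasso on synthetic data in which only 3 of 8 features are truly informative; the data, alpha values, and coefficient pattern are illustrative assumptions, not taken from these notes.

```python
# Ridge vs Lasso sketch: L1 sets some coefficients exactly to zero, L2 only shrinks them.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0])   # only 3 informative features
y = X @ true_coef + rng.normal(scale=0.5, size=200)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=10.0)),
                    ("Lasso (L1)", Lasso(alpha=0.5))]:
    model.fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))               # exact zeros only from Lasso
    print(f"{name:>10}: zeros = {n_zero}, coefs = {np.round(model.coef_, 2)}")
```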
Partial Least Squares regression (PLS) is a quick, efficient and optimal regression method based on covariance. It is recommended in cases of regression where the number of explanatory variables is high, and where it is likely that there is multicollinearity among the variables, i.e. that the explanatory variables are correlated.
XLSTAT provides a complete PLS regression method to model and predict your data in Excel. XLSTAT proposes several standard and advanced options that will let you gain a deep insight into your data.
Partial Least Squares regression (PLS) is a method which reduces the variables used for prediction to a smaller set of predictors. These predictors are then used to perform a regression.
The idea behind PLS regression is to create, starting from a table with n observations described by p variables, a set of h components, using the PLS 1 and PLS 2 algorithms.
Some programs differentiate PLS 1 from PLS 2. PLS 1 corresponds to the case where there is only one dependent variable, and PLS 2 to the case where there are several dependent variables. The algorithms used by XLSTAT are such that PLS 1 is only a particular case of PLS 2.
The equation of the PLS regression model can be written as:
Y = Th C'h + Eh = X W*h C'h + Eh
where Y is the matrix of the dependent variables and X is the matrix of the explanatory variables. Th, Ch, W*h, Wh and Ph are the matrices generated by the PLS algorithm, and Eh is the matrix of the residuals.
The matrix B of the regression coefficients of Y on X, with h components generated by the PLS regression algorithm, is given by:
B = Wh (P'h Wh)^-1 C'h
Note: the PLS regression leads to a linear model, as OLS and PCR do.
A great advantage of PLS regression over classic regression is the set of available charts that describe the data structure. Thanks to the correlation and loading plots, it is easy to study the relationships among the variables: relationships among the explanatory variables or among the dependent variables, as well as between explanatory and dependent variables. The score plot gives information about sample proximity and dataset structure. The biplot gathers all this information in one chart.
PLS regression is also used to build predictive models. XLSTAT enables you to predict new samples' values.
The three methods - Partial Least Squares regression (PLS), Principal Component regression (PCR), which is based on Principal Component Analysis (PCA), and Ordinary Least Squares regression (OLS), which is the regular linear regression - give the same results if the number of components obtained from the PCA (in PCR) or from the PLS regression is equal to the number of explanatory variables.
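The following is a minimal sketch of PLS regression using scikit-learn rather than XLSTAT; the synthetic, deliberately collinear data and the choice of 2 components are illustrative assumptions.

```python
# PLS regression sketch: many correlated predictors reduced to a few components.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=100)     # deliberately collinear columns
y = 3 * X[:, 0] - 2 * X[:, 5] + rng.normal(scale=0.1, size=100)

pls = PLSRegression(n_components=2)       # reduce the 10 predictors to 2 PLS components
pls.fit(X, y)
print("R^2 on training data:", round(pls.score(X, y), 3))
print("Shape of the component scores:", pls.transform(X).shape)   # (100, 2)
```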