MC4301 - ML Unit 4 (Parametric Machine Learning)
Logistic Regression
What are the differences between supervised learning, unsupervised learning &
reinforcement learning?
Machine learning algorithms are broadly classified into three categories - supervised
learning, unsupervised learning, and reinforcement learning.
In this imaginary example, the probability of a person being infected with COVID-19
could be based on factors such as viral load, symptoms, and the presence of antibodies.
Viral load, symptoms, and antibodies would be our factors (independent variables),
which would influence our outcome (dependent variable).
In linear regression, the outcome is continuous and can take any possible value.
In logistic regression, however, the predicted outcome is discrete and restricted
to a limited set of values.
For example, say we are trying to apply machine learning to the sale of a house. If we
are trying to predict the sale price based on the size, year built, and number of stories
we would use linear regression, as linear regression can predict a sale price of any
possible value. If we are using those same factors to predict if the house sells or not,
we would use logistic regression, as the possible outcomes here are restricted to yes or no.
Logistic regression is used to solve classification problems, and the most common use
case is binary logistic regression, where the outcome is binary (yes or no). In the real
world, you can see logistic regression applied across multiple areas and fields.
Are there other use cases for logistic regression aside from binary logistic regression?
Yes. There are two other types of logistic regression that depend on the number of
predicted outcomes.
1. Binary logistic regression - When we have two possible outcomes, like our
original example of whether a person is likely to be infected with COVID-19
or not.
2. Multinomial logistic regression - When we have multiple outcomes, say if
we build out our original example to predict whether someone may have the
flu, an allergy, a cold, or COVID-19.
3. Ordinal logistic regression - When the outcome is ordered, like if we build
out our original example to also help determine the severity of a COVID-19
infection, sorting it into mild, moderate, and severe cases.
Training data that satisfies the following assumptions is usually a good fit for logistic
regression:
The outcome is binary (or, for the other variants, multinomial or ordinal).
The observations are independent of one another.
There is little or no multicollinearity among the independent variables.
The independent variables are linearly related to the log odds of the outcome.
If your training data does not satisfy these assumptions, logistic regression may
not work for your use case.
Probability always ranges between 0 (the event does not happen) and 1 (the event
happens). Using our COVID-19 example, in the case of binary classification, the
probabilities of testing positive and not testing positive will sum to 1. We use the
logistic function, or sigmoid function, to calculate probability in logistic regression.
The logistic function is a simple S-shaped curve that converts any input value into a
value between 0 and 1.
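As a minimal sketch in plain Python (the test values are arbitrary), the sigmoid maps
any real-valued score to a probability between 0 and 1:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A large negative score gives a probability near 0,
# zero gives exactly 0.5, and a large positive score gives near 1.
print(sigmoid(-6))  # ~0.0025
print(sigmoid(0))   # 0.5
print(sigmoid(6))   # ~0.9975
```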
In Supervised Learning, the model learns by example. Along with our input variable,
we also give our model the corresponding correct labels. While training, the model
gets to look at which label corresponds to our data and hence can find patterns
between our data and those labels.
1. Spam detection, where we teach a model which emails are spam and which are not.
2. Speech recognition, where we teach a machine to recognize your voice.
What is Classification?
Classification algorithms used in machine learning utilize input training data to
predict the likelihood or probability that subsequent data will fall into one of the
predetermined categories. One of the most common applications of classification is
filtering emails into "spam" or "non-spam", as done by today's top email service
providers.
We will explore classification algorithms in detail, and discover how text analysis
software can perform actions like sentiment analysis, used for categorizing
unstructured text by opinion polarity (positive, negative, neutral, and the like).
Lazy Learners
A lazy learner first stores the training dataset and waits for the test dataset to
arrive. Classification is then carried out using the most closely related data in the
stored training set. Less time is spent on training, but more time is spent on
predictions. Examples include case-based reasoning and the k-NN algorithm.
Eager Learners
Eager learners build a classification model from the training dataset before
receiving a test dataset. They spend more time on training and less time on
prediction. Examples include artificial neural networks (ANN), Naive Bayes, and
decision trees.
Before diving into the four types of Classification Tasks in Machine Learning, let us
first discuss Classification Predictive Modeling.
From a modeling standpoint, classification requires a training dataset with many
examples of inputs and their corresponding outputs.
A model will determine the optimal way to map samples of input data to certain class
labels using the training dataset. The training dataset must therefore contain a large
number of samples of each class label and be suitably representative of the problem.
When providing class labels to a modeling algorithm, string values like "spam" or
"not spam" must first be converted to numeric values. Label encoding, which is
frequently used, assigns a distinct integer to each class label, for example
"spam" = 0, "not spam" = 1.
Some tasks may call for a class membership probability prediction for each example
rather than class labels. This provides a measure of uncertainty in the prediction,
which a user or application can then interpret. The ROC curve is a widely used
diagnostic for evaluating predicted probabilities.
There are four different types of Classification Tasks in Machine Learning, as
follows:
Binary Classification
Multi-Class Classification
Multi-Label Classification
Imbalanced Classification
Binary Classification
Classification tasks with only two class labels are referred to as binary
classification.
Typical examples, elaborated below, include email spam detection (spam or not) and
medical test results (cancer detected or not).
Binary classification problems often require two classes, one representing the normal
state and the other representing the abnormal state.
For instance, the normal condition is "not spam," while the abnormal state is "spam."
Another illustration is when a task involving a medical test has a normal condition of
"cancer not identified" and an abnormal state of "cancer detected."
Class label 0 is given to the class in the normal state, whereas class label 1 is given to
the class in the abnormal condition.
A model that forecasts a Bernoulli probability distribution for each case is frequently
used to represent a binary classification task.
The Bernoulli distribution is a discrete probability distribution covering the
situation where an event has a binary outcome of either 0 or 1. In terms of
classification, this means the model forecasts the probability that an example
falls within class 1, the abnormal state.
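In symbols, with p denoting the model's predicted probability of the abnormal state
(class 1), the Bernoulli distribution can be written as:

```latex
P(y) = p^{y}\,(1 - p)^{1 - y}, \qquad y \in \{0, 1\}
% hence P(y = 1) = p and P(y = 0) = 1 - p
```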
Popular algorithms that can be used for binary classification include:
Logistic Regression
Support Vector Machines
Naive Bayes
Decision Trees
Some algorithms, such as Support Vector Machines and Logistic Regression, were
created expressly for binary classification and do not by default support more than
two classes.
Multi-Class Classification
Multi-class classification refers to those classification tasks that have more than
two class labels.
Examples include:
Face classification.
Plant species classification.
Optical character recognition.
Unlike binary classification, multi-class classification does not have the notion of
normal and abnormal outcomes. Instead, instances are classified as belonging to one
of a range of known classes.
In some cases, the number of class labels could be rather high. In a facial recognition
system, for instance, a model might predict that a shot belongs to one of thousands or
tens of thousands of faces.
Text translation models and other problems involving word prediction may be treated
as a special case of multi-class classification. Each word in the sequence of words
to be predicted involves a multi-class classification where the number of possible
classes is the size of the vocabulary, which may range from tens of thousands to
hundreds of thousands of words.
Multi-class classification tasks are frequently modeled using a model that forecasts
a Multinoulli (categorical) probability distribution for each example.
Popular algorithms that can be used for multi-class classification include:
Gradient Boosting
Decision Trees
k-Nearest Neighbors
Random Forest
Naive Bayes
Multi-class problems can also be solved using algorithms created for binary
classification. To do this, the problem is split into multiple binary classification
problems using one of two strategies:
One-vs-One: For each pair of classes, fit a single binary classification model.
One-vs-Rest: Fit a single binary classification model for each class versus all
other classes.
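As a hedged sketch of both strategies with scikit-learn (OneVsRestClassifier and
OneVsOneClassifier are the library's standard wrappers; the toy data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Toy 3-class problem; logistic regression is natively binary,
# so we wrap it in a multi-class strategy.
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=3, n_classes=3,
                           random_state=0)

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)  # one model per class
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)   # one model per pair

print(ovr.predict(X[:5]))
print(ovo.predict(X[:5]))
```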
Multi-Label Classification
Multi-label classification problems are those that feature two or more class labels and
allow for the prediction of one or more class labels for each example.
Think about the photo classification example. Here a model can predict the existence
of many known things in a photo, such as "person", "apple", "bicycle", etc. A
particular photo may have multiple objects in the scene.
This greatly contrasts with multi-class classification and binary classification,
which predict a single class label for each example.
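A minimal multi-label sketch, assuming scikit-learn is available; the synthetic data
stands in for the photo-tagging example, where each example's label is a binary
indicator vector and several labels can be on at once:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic data: each row can carry any subset of 3 labels
# (think "person", "apple", "bicycle" present in the same photo).
X, Y = make_multilabel_classification(n_samples=200, n_features=10,
                                      n_classes=3, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(clf.predict(X[:3]))  # e.g. [[1 0 1] ...] - one column per label
```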
Imbalanced Classification
Imbalanced classification tasks are generally binary classification tasks in which
the majority of the training dataset's examples belong to the normal class and a
minority belong to the abnormal class.
Examples include fraud detection, outlier detection, and medical diagnostic tests,
where the event of interest is rare.
Although they may require specialized techniques, these problems are modeled as
binary classification tasks.
Specialized sampling techniques can be used to change the composition of the
training dataset, for example:
SMOTE Oversampling
Random Undersampling
Specialized modeling algorithms that pay more attention to the minority class, such
as cost-sensitive classifiers, may also be used.
Performance metrics that focus on the minority class may also be more appropriate,
for example:
F-Measure
Recall
Precision
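A short sketch of these metrics with scikit-learn on a made-up imbalanced prediction;
accuracy looks fine here even though the minority class is handled poorly, which is
why precision, recall, and the F-measure matter:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 10 examples, only 2 positives (the minority class); the model finds 1 of them.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.9 - misleadingly high
print("precision:", precision_score(y_true, y_pred))  # 1.0
print("recall   :", recall_score(y_true, y_pred))     # 0.5 - half the positives missed
print("f1       :", f1_score(y_true, y_pred))         # ~0.67
```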
You can apply many different classification methods to the dataset you are working
with, because the study of classification in statistics is extensive. Some of the
most popular classification algorithms are listed below.
1. Logistic Regression
Logistic regression predicts the probability that an input belongs to a class using
the logistic (sigmoid) function, as discussed earlier in this unit.
2. Naive Bayes
Naive Bayes calculates the probability of whether a data point belongs to a
particular category. In text analysis, it can be used to classify words or phrases
as belonging to a preset tag or not, as in the example below.
Text                                       Tag
"A great game"                             Sports
"The election is over"                     Not Sports
"What a great score"                       Sports
"A clean and unforgettable game"           Sports
"The spelling bee winner was a surprise"   Not Sports
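A hedged sketch of training a Naive Bayes text classifier on the tiny table above,
using scikit-learn's standard bag-of-words and Naive Bayes tools (a real model would
need far more data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["A great game",
         "The election is over",
         "What a great score",
         "A clean and unforgettable game",
         "The spelling bee winner was a surprise"]
tags = ["Sports", "Not Sports", "Sports", "Sports", "Not Sports"]

# Bag-of-words counts feed a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, tags)

print(model.predict(["A very close game"]))  # likely ['Sports']
```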
3. K-Nearest Neighbors
It calculates the likelihood that a data point belongs to a group based on which
groups the data points closest to it belong to. When using k-NN for classification,
you classify the data according to its nearest neighbors.
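A minimal sketch with scikit-learn; the 2-D points below are made up, and
classification is simply a majority vote among the k = 3 closest training points:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two made-up groups: small coordinates = class 0, large = class 1.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2], [9, 9]]))  # [0 1] - each point joins its nearest group
```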
4. Decision Tree and Random Forest
A decision tree classifies data by splitting it on feature values, with the features
at the internal nodes and the resulting classes at the leaves. The random forest
algorithm is an extension of the decision tree algorithm: you first create a number
of decision trees from the training data, and new data is then classified by
aggregating the predictions of these trees. Random forests are great for mitigating
the decision tree's problem of forcing data points unnecessarily within a category
(overfitting).
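A brief hedged sketch of a random forest with scikit-learn on synthetic data
(n_estimators sets the number of trees; all names are the library's standard API):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 decision trees, each grown on a bootstrap sample of the training
# data; their votes are aggregated into the final prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```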
1. Supervised Learning
The supervised learning approach explicitly trains algorithms under close human
supervision. Both the input and the output data are first provided to the algorithm.
The algorithm then develops rules that map the input to the output. The training
procedure is repeated until the desired level of performance is attained.
Regression
Classification
2. Unsupervised Learning
This approach is applied to examine data's inherent structure and derive insightful
information from it, by looking for patterns in unlabeled data that can produce
better results.
Clustering
Dimensionality reduction
3. Semi-supervised Learning
This approach combines a small amount of labeled data with a large amount of
unlabeled data during training.
4. Reinforcement Learning
Here, an agent learns by interacting with an environment through trial and error,
receiving rewards or penalties for its actions.
Classification Models
Consider a decision tree for deciding whether to play tennis. Depending on the
weather conditions, the humidity, and the wind, we can systematically decide if we
should play tennis or not. In decision trees, all the False statements branch off to
the left of the tree and the True statements branch off to the right. Knowing this,
we can make a tree which has the features at the nodes and the resulting classes at
the leaves.
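A hedged sketch of such a tree with scikit-learn; the play-tennis data and its 0/1
encoding below are made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up encoding: [sunny, humid, windy] as 0/1 flags.
X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 1]]
y = ["No", "Yes", "No", "Yes", "No", "Yes"]  # play tennis?

tree = DecisionTreeClassifier().fit(X, y)

# Print the learned splits: features at the nodes, classes at the leaves.
print(export_text(tree, feature_names=["sunny", "humid", "windy"]))
```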
K-Nearest Neighbors assumes that data points which are close to one another must be
similar; hence, a data point to be classified is grouped with the points closest
to it.
After our model is finished, we must assess its performance, whether it is a
regression or classification model. For a classification model, we have the
following options:
1. Confusion Matrix
The confusion matrix describes the model performance and gives us a matrix
or table as an output.
The error matrix is another name for it.
The matrix consists of the predictions' results in a summarized form, together with
the total number of correct and incorrect predictions.
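A minimal sketch with scikit-learn's confusion_matrix; the labels below are made up
(rows are actual classes, columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Layout for binary labels [0, 1]:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_true, y_pred))
# [[3 1]
#  [1 3]]
```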
2. Log Loss (Cross-Entropy Loss)
Log loss measures the quality of a classifier's predicted probabilities rather
than its hard labels.
For a single binary example with true label y and predicted probability p, the
loss is -(y log(p) + (1 - y) log(1 - p)), averaged over all examples; lower
values indicate better-calibrated predictions.
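A small sketch computing this by hand and cross-checking against scikit-learn's
log_loss (the labels and probabilities below are made up):

```python
import math
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
p_pred = [0.9, 0.2, 0.7, 0.6]  # predicted probabilities of class 1

# Average of -(y*log(p) + (1-y)*log(1-p)) over the examples.
manual = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
              for y, p in zip(y_true, p_pred)) / len(y_true)

print(manual)                    # ~0.30
print(log_loss(y_true, p_pred))  # same value
```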
3. AUC-ROC Curve
AUC stands for Area Under the Curve, and ROC stands for Receiver Operating
Characteristic curve.
It is a graph that displays the classification model's performance at various
thresholds.
The AUC-ROC curve is used to visualize how well the classification model performs;
for multi-class problems it is typically applied in a one-vs-rest fashion.
The TPR and FPR are used to draw the ROC curve, with the True Positive
Rate (TPR) on the Y-axis and the FPR (False Positive Rate) on the X-axis.
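A short ROC/AUC sketch with scikit-learn (roc_curve returns the FPR/TPR pairs across
thresholds and roc_auc_score summarizes them as a single area; the example labels
and scores below are made up):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # predicted probabilities of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(y_true, y_scores))  # 1.0 would be a perfect ranking
```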
There are many applications for classification algorithms. Here are a few of them:
Speech Recognition
Detecting Spam Emails
Categorization of Drugs
Cancer Tumor Cell Identification
Biometric Authentication, etc.
Representation
A machine learning model can't directly see, hear, or sense input examples. Instead,
you must create a representation of the data to provide the model with a useful
vantage point into the data's key qualities. That is, in order to train a model, you must
choose the set of features that best represent the data.
In the context of neural networks, Chollet says that layers extract representations.
The core building block of neural networks is the layer, a data-processing module that
you can think of as a filter for data. Some data goes in, and it comes out in a more
useful form. Specifically, layers extract representations out of the data fed into
them: hopefully, representations that are more meaningful for the problem at hand.
Most of deep learning consists of chaining together simple layers that will implement
a form of progressive data distillation. A deep-learning model is like a sieve for
data processing, made of a succession of increasingly refined data filters: the layers.
That makes me think that representations are the form that the training/test data
takes as it is progressively transformed. For example, words could initially be
represented as dense or sparse (one-hot encoded) vectors, and then their
representation changes one or more times as they are fed into a model.
Mitchell says that we need to choose a representation for the target function.
This makes me think that the 'representation' could be described as the architecture of
the model, or maybe a mathematical description of the model. With this definition, we
don't know the true representation (equation) of the target function (if we did we
would have nothing to learn). So it is our task to decide what equation we want to use
to best approximate the target function.
Cost function
Machine learning models require a high level of accuracy to work in the real world.
But how do you calculate how wrong or right your model is? This is where the cost
function comes into the picture. The cost function is important to understand
because it measures how well the model has estimated the relationship between your
input and output parameters.
After training your model, you need to see how well it is performing. While accuracy
metrics tell you how well the model is performing, they do not provide insight into
how to improve it. Hence, you need a corrective function that can help you find
where the model is most accurate, as you need to hit that sweet spot between an
undertrained model and an overtrained model.
A cost function is used to measure just how wrong the model is in finding a relation
between the input and output. It tells you how badly your model is behaving or
predicting.
Consider a robot trained to stack boxes in a factory. The robot might have to consider
certain changeable parameters, called Variables, which influence how it performs.
Let's say the robot comes across an obstacle, like a rock. The robot might bump into
the rock and realize that it was not the correct action.
Learning from this, it will avoid rocks next time. Hence, your machine uses
variables to better fit the data. The outcome of each obstacle further optimizes the
robot and helps it perform better, so that it generalizes and learns to avoid
obstacles in general, say a fire that might have broken out. The outcome acts as a
cost function, which helps you optimize the variables, to get the best variables and
fit for the model.
Gradient Descent is an algorithm that is used to optimize the cost function or the error
of the model. It is used to find the minimum value of error possible in your model.
Gradient Descent can be thought of as the direction you have to take to reach the least
possible error. The error in your model can be different at different points, and you
have to find the quickest way to minimize it, to prevent resource wastage.
Gradient descent can be visualized as a ball rolling down a hill: the ball will roll
to the lowest point on the hill. We can take this as the point where the error is
least, since for such a model the error is minimal at one point and increases again
on either side of it.
In gradient descent, you compute the error in your model for different values of the
variables. Repeating this, you see the error values get smaller and smaller, and
soon you arrive at the values of the variables for which the error is least and the
cost function is optimized.
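A minimal sketch of this "ball rolling downhill" idea in plain Python, minimizing
the made-up error curve f(x) = x², whose slope at x is 2x:

```python
# Gradient descent on f(x) = x**2, whose minimum is at x = 0.
x = 5.0             # arbitrary starting point
learning_rate = 0.1

for step in range(25):
    gradient = 2 * x                   # derivative of x**2
    x = x - learning_rate * gradient   # move against the slope

print(x)  # very close to 0, the point of least error
```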
A Linear Regression model uses a straight line to fit the data. This is done using
the equation for a straight line:
y = a + b·x
In the equation, two entities can have changeable values (variables): a, the
intercept, which is the point at which the line crosses the y-axis, and b, the
slope, which determines how steep the line is.
At first, if the variables are not properly optimized, you get a line that does not
fit the data well. As you optimize the values of the variables, you will eventually
get the best fit: a straight line running through most of the data points while
ignoring the noise and outliers.
For the Linear Regression model, the cost function is the Mean Squared Error (or its
square root, the Root Mean Squared Error): the average of the squared differences
between the predicted values and the actual values. Training aims to find the
variable values that minimize this cost:
MSE = (1/n) · Σ (y_actual - y_predicted)²
By the definition of gradient descent, you have to move in the direction in which
the error decreases. This is done by differentiating the cost function with respect
to each variable and subtracting a small fraction of that gradient, scaled by the
learning rate, from the variable's current value, moving it down the slope.
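Putting the pieces together, a hedged sketch in plain Python that fits the line
y = a + b·x to made-up data points by gradient descent on the MSE cost (the learning
rate and iteration count are chosen by hand):

```python
# Fit y = a + b*x by gradient descent on the mean squared error.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]   # roughly y = 1 + 2x with noise

a, b = 0.0, 0.0   # initial intercept and slope
lr = 0.01         # learning rate (step size)
n = len(xs)

for _ in range(5000):
    # Gradients of MSE = (1/n) * sum((a + b*x - y)**2) w.r.t. a and b.
    grad_a = (2 / n) * sum((a + b * x - y) for x, y in zip(xs, ys))
    grad_b = (2 / n) * sum((a + b * x - y) * x for x, y in zip(xs, ys))
    a -= lr * grad_a
    b -= lr * grad_b

print(a, b)  # should approach roughly 1 and 2
```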