
UNIT - III
SUPERVISED LEARNING
What is Supervised learning?
Supervised learning, as the name indicates, involves a supervisor acting as a teacher. In supervised learning we teach or train the machine using data that is well labelled, meaning some data is already tagged with the correct answer. The machine is then given a new set of examples (data) so that the supervised learning algorithm can analyse the training data (the set of training examples) and produce the correct outcome from the labeled data.
• Supervised learning is a type of machine learning where the algorithm learns from labeled data, meaning the data comes with correct answers or classifications.
Types of Supervised Learning
Supervised learning can be applied to two main types of problems:
• Classification: Where the output is a categorical variable (e.g., spam
vs. non-spam emails, yes vs. no).

• Regression: Where the output is a continuous variable (e.g., predicting


house prices, stock prices)

Classification
Classification teaches a machine to sort things into categories. It
learns by looking at examples with labels (like emails marked “spam”
or “not spam”). After learning, it can decide which category new items
belong to, like identifying if a new email is spam or not. For example a
classification model might be trained on dataset of images labeled as
either dogs or cats and it can be used to predict the class of new and
unseen images as dogs or cats based on their features such as color, texture
and shape.

X-axis: Feature 1 (e.g., color/texture)
Y-axis: Feature 2 (e.g., shape/size)
These are the input features used by the classification model.
What's shown:
• Colored background regions: the decision boundaries of the classifier, i.e., the areas where the model assigns a different class label (e.g., dog or cat).
• Colored dots: the data points (examples of dogs and cats), plotted by their feature values.
• Color bar on the right: indicates the predicted class labels, e.g., Class 0 (dogs), Class 1, Class 2 (cats).
Classification Tasks
1. Classifying emails as either Spam or Not Spam: Predicting whether
an email is spam (class 1) or not spam (class 0). The output variable is
discrete, taking on two distinct values (0 or 1) representing the two
classes.
2. Classifying Images of Fruits: Classifying images of fruits into categories such as apple (class 0), banana (class 1), or orange (class 2). The output variable is discrete, with multiple distinct values (0, 1, or 2) representing the three classes (apple, banana, orange).
3. Sentiment Analysis: Analyzing text data to determine the sentiment of a review (positive, negative, neutral). The sentiment labels (positive, negative, neutral) are discrete categories assigned to the input text.
4. Medical Diagnosis: Predicting the presence of a disease based on patient symptoms and test results. The diagnosis categories (e.g., disease present, no disease) are discrete labels assigned to the patient data.
5. Customer Retention: Predicting whether a customer will renew a
subscription or not (churn prediction).
Importance of Classification in ML
1. Solves Real World Decision Problems -Classification helps make automated
decisions based on historical data.
Examples:
Spam detection (email → spam or not)
Disease diagnosis (symptoms → disease type)
Loan approval (applicant data → approve/reject).

2. Enables Predictive Modeling - Once trained, a classification model can predict the category of new, unseen data, helping businesses and systems make forward-looking decisions. Example: a model can predict whether a customer will churn or stay based on past behavior.

3. Fast and Efficient Decision Making


With classification, systems can:
Make real-time decisions
Scale to large datasets
Minimize manual effort
Example: Face recognition systems unlock phones instantly using image classification .
4. Helps in Risk Management-Classification models can identify high-risk situations or individuals.
Example:
Fraud detection: Credit card transaction → fraud or not
Insurance: Claim prediction → high risk or low risk.

5. Drives Business Intelligence - By grouping data into categories, companies can:
Segment customers
Target marketing efforts
Personalize recommendations
Example: Netflix classifying users' viewing habits to recommend shows.
6. Crucial in Healthcare & Safety-In critical fields like medicine, security, and autonomous vehicles,
classification helps:
Classify medical images (e.g., cancer detection)
Detect anomalies (e.g., cyberattacks)
Identify objects on the road (e.g., pedestrian or stop sign).
7. Supports Automation & AI-Classification models power intelligent systems like: Chatbots understanding
intent
(question, complaint, etc.)
Smart assistants (e.g., classifying speech commands)Self-driving cars.
8. Versatility Across Domains- Classification is used in:
Finance: credit scoring
E-commerce: product recommendation
HR: resume screening
Agriculture: classifying crop diseases
Classification Algorithms in Machine Learning
These classification algorithms offer a diverse set of tools for solving a
wide range of classification tasks in machine learning. Each algorithm
has its strengths and is suitable for different types of data and problem
domains. Experimenting with these algorithms and understanding their
characteristics can help in selecting the most appropriate approach for a
given classification problem.
1. Logistic Regression:
Description: Logistic regression is a linear classification algorithm used
for binary classification tasks. It estimates the probability that a given
input belongs to a particular class.
Advantages: Simple, interpretable, works well for linearly separable data.
Application: Spam detection, customer churn prediction.
2. Support Vector Machines (SVM):
A Support Vector Machine (SVM) is a powerful supervised machine
learning algorithm used for classification, regression, and even outlier
detection. It works especially well for binary classification problems and
when the data is high-dimensional (lots of features).
Description: SVM is a versatile classification algorithm that finds the optimal
hyperplane to separate classes in the feature space. It can handle linear and non-linear
classification tasks.
Advantages: Effective in high-dimensional spaces, works well with clear margin of
separation.
Application: Text categorization, image recognition.
Youtube link- https://www.youtube.com/watch?v=Y6RRHw9uN9o&t=135s
3. Decision Trees:
Description: Decision trees are non-linear classifiers that recursively split the data
based on feature values to make predictions. They create a tree-like structure of
decisions.
Advantages: Easy to interpret, can handle both numerical and categorical data.
Application: Customer segmentation, medical diagnosis.
4. Random Forest:
Description: Random Forest is an ensemble method that consists of multiple decision
trees. It improves prediction accuracy and reduces overfitting by aggregating the predictions of
individual trees.
Advantages: Robust to overfitting, handles high-dimensional data well.
Application: Credit risk analysis, image classification.

5. K-Nearest Neighbors (KNN):


Description: KNN is an instance-based classifier that classifies data points based on the
majority class among their k nearest neighbors in the feature space.
Advantages: Simple, non-parametric, easy to implement.
Application: Pattern recognition, recommendation systems.

6. Neural Networks:
Description: Neural networks are deep learning classifiers that consist of interconnected
layers of nodes. They learn complex patterns in the data through training with
backpropagation.
Advantages: Capable of learning intricate patterns, suitable for large datasets.
Application: Image recognition, speech recognition.
Performance Evaluation Metrics in Classification
A confusion matrix is a simple table that shows how well a classification model is
performing by comparing its predictions to the actual results. It breaks down the
predictions into four categories: correct predictions for both classes (true
positives and true negatives) and incorrect predictions (false positives and false
negatives). This helps you understand where the model is making mistakes, so you
can improve it.
The matrix displays the number of instances produced by the model on the test
data.
• True Positive (TP): The model correctly predicted a positive outcome (the
actual outcome was positive).
• True Negative (TN): The model correctly predicted a negative outcome (the
actual outcome was negative).
• False Positive (FP): The model incorrectly predicted a positive outcome (the
actual outcome was negative). Also known as a Type I error.
• False Negative (FN): The model incorrectly predicted a negative outcome (the
actual outcome was positive). Also known as a Type II error.
1. Accuracy
Accuracy measures how often the model's predictions are correct overall. It gives a general idea of how well the model is performing.
Accuracy = (TP + TN) / (TP + TN + FP + FN)

2. Precision
Precision focuses on the quality of the model's positive predictions. It tells us how many of the instances predicted as positive are actually positive. Precision is important in situations where false positives need to be minimized.
Precision = TP / (TP + FP)

3. Recall
Recall measures how well the model identifies all actual positive cases. It shows the proportion of true positives detected out of all the actual positive instances.
Recall = TP / (TP + FN)

4. F1-Score
The F1 score is the harmonic mean of precision and recall. It provides a better sense of a model's overall performance, particularly for imbalanced datasets, and is helpful when both false positives and false negatives are important.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Example

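As a quick illustration (a minimal sketch, assuming scikit-learn is installed; the true and predicted labels below are made up), these metrics can be computed directly from a confusion matrix:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hypothetical ground-truth labels and model predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Confusion matrix: rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")

# The four metrics described above
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))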
Regression

Regression is a type of supervised learning technique in machine learning


where the goal is to predict continuous or quantitative outputs based on
input features. Unlike classification, where the output is categorical,
regression models predict a continuous value.
Regression is a process of finding the correlations between dependent and
independent variables. It helps in predicting the continuous variables such
as prediction of Market Trends, prediction of House prices, etc.
The task of the Regression algorithm is to find the mapping function to
map the input variable(x) to the continuous output variable(y).Regression
algorithms are used if there is a relationship between the input variable
and the output variable.
It is used for the prediction of continuous variables. The goal of regression
tasks is to predict a continuous number or a real number. If there is
continuity between possible outcomes, then the problem is a regression
problem.
Key Characteristics of Regression
1. Continuous Output - Regression models predict a continuous value, unlike classification, which predicts categories.
Example: Predicting a person's weight, house price, or exam score.

2. Relationship Modeling - It estimates the relationship between dependent and independent variables.
Example: How does the number of study hours affect exam scores?

3. Error Measurement with Metrics - Common evaluation metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and the R² score (coefficient of determination); see the short sketch after this list.

4. Linearity (in Simple Linear Regression) - Basic models assume a linear relationship between input and output. For more complex relationships, models like polynomial regression or random forest regression can be used.
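A minimal sketch of computing these regression metrics with scikit-learn; the actual and predicted values are invented for illustration:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs. predicted house prices (in thousands)
y_true = np.array([240, 320, 330, 295, 256])
y_pred = np.array([250, 310, 325, 300, 260])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)  # coefficient of determination

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R²={r2:.3f}")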
1. Simple Linear Regression
• What it is:
A basic type of regression that looks at the relationship between one
input and one output.
• Use Case:
Predicting a student’s exam score based on just their study hours.
• Example:
"If a student studies more hours, they are likely to score higher.“

2. Multiple Linear Regression


• What it is:
This type uses multiple inputs (features) to predict a single outcome.
• Use Case:
Estimating house prices using data like area size, number of bedrooms,
and location.
• Example:
"Bigger houses in better locations with more rooms usually cost more."
3. Polynomial Regression
• What it is:
A more flexible regression type that can fit curved patterns in the data. It’s
useful when the relationship isn't straight or linear.
• Use Case:
Modeling car fuel efficiency based on engine size, where the relationship
might rise and fall in a curve.
• Example:
"Fuel efficiency increases with engine size up to a point, then starts to drop.“

4. Logistic Regression
• What it is:
Despite the name, this is actually used for classification, not for predicting
continuous values. It predicts categories like yes/no or pass/fail.
• Use Case:
Predicting whether a student will pass or fail based on their attendance and
study habits.
• Example:
"If a student attends classes regularly and studies well, they're likely to pass."
Regression algorithms in Machine Learning
1. Linear Regression
• What it does: Predicts a continuous value using one or
more input features assuming a straight-line relationship.
• Applications:
• Predicting house prices
• Estimating sales revenue
• Forecasting temperature
• Advantages:
• Simple to implement and interpret
• Fast training and prediction
• Good for linearly correlated data
2. Ridge & Lasso Regression (Regularized Linear Models)
What they do: Improve linear regression by reducing overfitting
using penalties.
Applications:
Financial forecasting (e.g., stock prices)
Medical predictions (e.g., disease progression)
Advantages:
Prevents overfitting
Works well when features are correlated
Lasso helps with feature selection.
3. Polynomial Regression
What it does:
Polynomial regression is a type of regression analysis used in
statistics and machine learning when the relationship between
the independent variable (input) and the dependent variable
(output) is not linear.
Fits curved relationships by using powers of input features.
Applications:
Modeling growth trends (e.g., population, sales)
Physics-related predictions (e.g., motion trajectories)
Advantages:
Can capture more complex, non-linear patterns
More flexible than linear regression
4. Support Vector Regression (SVR)
What it does:
Uses the principles of Support Vector Machines to predict continuous values.
Applications:
Predicting stock market trends
Energy demand forecasting
Advantages:
Works well with non-linear and high-dimensional data
Robust to outliers.
5. Decision Tree Regression
What it does:
Splits data into tree-like structures to make predictions.
Applications:
Predicting student performance
Customer churn prediction
Advantages:
Easy to visualize and understand
Handles both numerical and categorical data
No need for feature scaling
6. Random Forest Regression
What it does:
Uses multiple decision trees (an ensemble) to improve accuracy
and reduce overfitting.
Applications:
Predicting real estate values
Medical risk assessment
Advantages:
High accuracy and robustness
Handles missing values well
Reduces overfitting compared to a single tree
7. Gradient Boosting (e.g., XGBoost, LightGBM)
What it does:
Builds trees one by one, where each new tree fixes errors made
by the previous ones.
Applications:
Credit scoring in finance
Click-through rate prediction in advertising
Advantages:
Excellent performance on structured data
Highly customizable and powerful
Often used in data science competitions
Importance and benefits of Regression in Machine Learning

1. Predictive Modeling: Regression is the backbone of predicting


numerical values, such as stock prices, sales figures, or weather
patterns.
2. Interpretability: Linear regression models are relatively simple,
making them ideal for understanding relationships between features
and outcomes.
3. Foundation for Advanced Techniques: Regression forms the basis
for more complex machine learning algorithms, such as neural
networks and decision trees.
4. Real-World Applicability: It is widely used across domains like
healthcare (e.g., predicting disease progression), economics (e.g.,
forecasting market trends), and engineering (e.g., optimizing resource
allocation).
5. Versatility: Applicable to numerous datasets and scenarios,
ranging from small-scale problems to large, complex systems.
6. Feature Analysis: Helps identify the most influential variables (features) in a dataset, aiding in decision-making and model optimization.
7. Robust Performance: Regression models are effective in handling noise in data and are highly customizable to suit different types of problems.
8. Scalability: Regression techniques can easily scale with growing data and are computationally efficient.
Some sample Datasets
1. UCI Machine Learning Repository

• A well-known repository of ML datasets, first created by researchers at the University of California, Irvine.
Dataset link - Iris Dataset
2. Kaggle
• Kaggle is very popular for its competitions and open datasets. It is an online platform with a community of data scientists, ML engineers, and researchers.
Dataset link - Housing Dataset
3. Open Data on AWS

• Open Data on AWS hosts publicly available datasets that users can download and access. It provides cloud-based access to facilitate research, analysis, and experimentation.
Dataset link - Amazon-Products DataSet
4. Google Dataset Search
• Google Dataset Search is a dataset search engine widely used by institutions, universities, and organizations.
Dataset link - RTA Dataset
5. Azure Open Datasets

• Azure Open Datasets are hosted on Microsoft's cloud platform. The datasets on the Azure platform are used in various domains such as finance, healthcare, environmental science, and more.

Dataset link - Air Conditioners Dataset


6. Earth data

• Earthdata provides open-access datasets organized by NASA. It supports data collection and promotes scientific progress for societal benefit.
Dataset link -
Environment_Temperature_change_E_All_Data_NOFLAG
7. IMDB dataset
• This is a large movie review dataset which can be used for training and testing ML models.
Dataset link - movies_metadata
K-Nearest neighbours (K-NN) Algorithm
K-nearest neighbors (KNN) algorithm is a type of supervised ML
algorithm which can be used for both classification as well as
regression predictive problems. However, it is mainly used for
classification predictive problems in industry. The main idea behind
KNN is to find the k-nearest data points to a given test data point and
use these nearest neighbors to make a prediction. The value of k is a
hyperparameter that needs to be tuned, and it represents the number of
neighbors to consider.
Properties of KNN are
1. Lazy Learning Algorithm
• What it means: KNN does not build a model ahead of time. It
stores the entire training dataset and makes decisions only when it
needs to make a prediction.
• How it works: When you input a new data point for classification,
KNN:
• Calculates the distance between this point and all the points in the training set.
• Picks the 'K' closest neighbors.
• Classifies based on the majority vote (for classification) or average (for
regression) of those neighbors.
• Why it's called “lazy”: Because it delays processing until a
prediction is requested—it doesn't "learn" in the traditional sense
during training
2. Non-parametric Algorithm
A non-parametric algorithm does not assume any specific
shape or form for the data. It’s more flexible and can adapt
to the data, no matter how complex or weird its shape is.
• Example: K-Nearest Neighbors (KNN)
• Doesn’t try to draw a line or curve through the data.
• It just looks at the neighbors of a new point and uses them to
predict.
• The model size can grow as you add more data.
Youtube link-
https://www.youtube.com/watch?v=hlHn7ZH8j8E
How Does K-Nearest Neighbors Algorithm Work?
K-nearest neighbors (KNN) algorithm uses 'feature similarity'
to predict the values of new datapoints which further means
that the new data point will be assigned a value based on how
closely it matches the points in the training set. We can
understand its working with the help of the following steps:
Step 1: Choose the number of neighbors, k.
Step 2: Calculate the distance (for example, the Euclidean distance) between the new data point and every point in the training set.
Step 3: Select the k training points that are closest to the new point.
Step 4: For classification, take a majority vote among the k neighbors; for regression, take their average.
Step 5: Assign the resulting class (or value) to the new data point.
How to choose the value of K


If k = 3, the new data point belongs to the square category; if k = 7, it belongs to the triangle category. So choosing the value of k is important: a very small k is sensitive to noise and outliers, while a very large k can blur the boundary between classes, so k is usually tuned (for example with cross-validation) rather than simply made as large as possible.
Example
The following is an example to understand the concept of K and
working of KNN algorithm
• Suppose we have a dataset which can be plotted as follows −

Now, we need to classify new data point with black dot (at
point 60,60) into blue or red class. We are assuming K = 3 i.e. it
would find three nearest data points. It is shown in the next
diagram −

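A minimal sketch of this kind of example with scikit-learn's KNeighborsClassifier; the coordinates of the blue and red training points are invented, since the exact points behind the plot are not listed:

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training points: two features per point, labelled "blue" or "red"
X_train = [[20, 25], [30, 40], [45, 50], [55, 65], [65, 70], [70, 80]]
y_train = ["blue", "blue", "blue", "red", "red", "red"]

# K = 3: the prediction is the majority class among the 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Classify the new point at (60, 60)
print(knn.predict([[60, 60]]))   # ['red'] for this toy data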
How to Calculate Euclidean Distance?
Euclidean distance is the most commonly used distance measure, and it is limited to real-valued vectors. It measures a straight line between the query point and the other point being measured.
• For Two Points (2D): the Euclidean distance between two points (x₁, y₁) and (x₂, y₂) in a two-dimensional space is given by:
d = √((x₂ − x₁)² + (y₂ − y₁)²)
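For instance, a short check of the 2D formula in Python (the two points are arbitrary):

import math

# Two arbitrary points in 2D space
x1, y1 = 60, 60   # query point
x2, y2 = 55, 65   # a training point

# d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
d = math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
print(round(d, 2))   # 7.07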
Linear Models
• Linear models in machine learning represent relationships as linear
combinations of input features. Their fundamental equation takes the
form:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Where:
• y is the target variable
• x₁ through xₙ are input features
• β₀ is the intercept (the value of y when all x values are zero)
• β₁, β₂, …, βₙ are coefficients (weights) that represent how much each predictor variable contributes to y
• ε (epsilon) represents the error term


The beauty of linear models lies in their mathematical
simplicity. The relationship between inputs and outputs follows
a straight line in two dimensions or a hyperplane in higher
dimensions. This characteristic makes them computationally
efficient and highly interpretable.

Classification and Regression Tasks with Linear Models


Classification and regression are two primary tasks in supervised machine
learning, where the key difference lies in the nature of the output:
classification deals with discrete outcomes (e.g., yes/no, categories),
while regression handles continuous values (e.g., price, temperature).
For example, a classification model (with a decision boundary) can determine whether an email is spam or not, classify images as "cat" or "dog," or predict weather conditions like "sunny," "rainy," or "cloudy." Regression models, by contrast, fit a straight line to predict house prices from features like size and location, or to forecast stock prices over time.
Characteristics of Linear Models
Simplicity: Linear models are straightforward to interpret and
implement.
Linearity: They assume that the relationship between predictors and
the target variable is additive and linear.
Scalability: Linear models work well with large datasets and high-
dimensional data.
Efficiency: These models are computationally efficient compared to
more complex models like deep neural networks.
Generalization: While they perform well for problems with linear
patterns, they may struggle with nonlinear relationships unless
transformed appropriately
Interpretable: Coefficients in linear models represent the effect of each
predictor, making them easy to understand.
Sensitivity to Outliers: Linear models can be significantly affected by
outliers, which can distort predictions.
Linear Regression
Linear regression is also a type of
supervised machine-learning algorithm that learns from the
labelled datasets and maps the data points with most optimized
linear functions which can be used for prediction on new
datasets. It computes the linear relationship between the
dependent variable and one or more independent features by
fitting a linear equation with observed data. It predicts the
continuous output variables based on the independent input
variable.
For simplicity, take the following data (single feature and single target):

Square Feet (X)   House Price (Y)
1300              240
1500              320
1700              330
1830              295
1550              256
2350              409
1450              319

In machine learning, linear regression uses a linear equation to model the relationship between a
dependent variable (Y) and one or more independent variables (X).

The main goal of the linear regression model is to find the best-fitting straight line (often called a
regression line) through a set of data points.
Line of Regression
• A straight line that shows a relation between the dependent
variable and independent variables is known as the line of
regression or regression line.

Furthermore, the linear relationship can be positive or negative
in nature as explained below −
1. Positive Linear Relationship
• A linear relationship will be called positive if both
independent and dependent variable increases. It can be
understood with the help of the following graph −

2. Negative Linear Relationship
• A linear relationship will be called negative if the
independent increases and the dependent variable decreases.
It can be understood with the help of the following graph −

How Linear Regression works?

1. Model Representation
This is how the model represents the relationship between input features and the
output.
Formula:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Where:
• y is the target variable
• x₁ through xₙ are input features
• β₀ is the intercept (the value of y when all x values are zero)
• β₁, β₂, …, βₙ are coefficients (weights) that represent how much each predictor variable contributes to y
• ε represents the error term


2. Objective
• The goal of linear regression is to find the best weights so
that the model’s predictions are as close as possible to the
actual outputs.
3. Training
Training is the process of adjusting the weights using the
training data.
How it’s done:
• The model starts with random weights.
• It predicts outputs for the training inputs.
• It calculates how wrong the predictions are (error).
• It uses an optimization technique (like Gradient Descent) to
update weights to reduce the error.
• This process repeats until the error is minimized.
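A minimal sketch of this training loop for simple linear regression with one feature, using plain NumPy gradient descent; the hours-versus-score data and the learning rate are assumptions for illustration:

import numpy as np

# Hypothetical training data: hours studied -> exam score
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([38.0, 46.0, 54.0, 62.0, 70.0])

# Start with arbitrary weights (intercept b0 and slope b1)
b0, b1 = 0.0, 0.0
lr = 0.01                      # learning rate

for _ in range(5000):
    y_pred = b0 + b1 * x       # predict with the current weights
    error = y_pred - y         # how wrong the predictions are
    # Gradient of the mean squared error with respect to b0 and b1
    b0 -= lr * 2 * error.mean()
    b1 -= lr * 2 * (error * x).mean()

print(round(b0, 2), round(b1, 2))   # approaches the underlying 30 and 8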
4. Prediction
Once the model is trained (i.e., it has learned good weights ), it
can make predictions for new inputs.
Example: Let's say the model learned:
β₁ = 8 (slope) and β₀ = 30 (intercept)
Now if a student studies for 5 hours (x = 5):
y = 8 × 5 + 30 = 70
Predicted exam score: 70
Example problem

Car Age (Years)   Mileage (km)   Price (USD)
2                 30,000         20,000
5                 70,000         15,000
7                 100,000        10,000
10                150,000        7,000

The goal is to use linear regression to predict the price of a car that is 6
years old and has 80,000 km mileage.

Step 1: Define the Linear Regression Model
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Where:
• y is the target variable
• x₁ through xₙ are input features
• β₀ is the intercept (the value of y when all x values are zero)
• β₁, β₂, …, βₙ are coefficients (weights) that represent how much each predictor variable contributes to y
• ε represents the error term


Step 2: Calculate the Coefficients
To solve this manually, we can use statistical formulas or a
tool like Python’s Scikit-Learn. Here’s a rough estimate:
Using regression analysis on this dataset, let's assume the
estimated regression equation is:
Price=25000−1000(Age)−0.05(Mileage)
• (Note: These coefficients are approximated for simplicity. Negative coefficients indicate an inverse relationship between an independent variable and the target variable.)
Step 3: Make the Prediction Now, substitute Age = 6 years and Mileage =
80,000 km into the equation:
Price=25000−1000(6)−0.05(80,000)
Price=25000−6000−4000
Price=15,000 USD

A linear regression model always includes an error term (ε), but in


simplified examples like this one, it is often omitted for clarity.
Thus, based on the linear regression model, the estimated price for a 6-
year-old car with 80,000 km mileage is $15,000.
A simple Python code for the same
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample car dataset
data = {
    "Age": [2, 5, 7, 10],                       # Car age in years
    "Mileage": [30000, 70000, 100000, 150000],  # Mileage in km
    "Price": [20000, 15000, 10000, 7000]        # Car price in USD
}

# Convert dictionary to DataFrame
df = pd.DataFrame(data)

# Define independent (X) and dependent (y) variables
X = df[["Age", "Mileage"]]
y = df["Price"]

# Train linear regression model
model = LinearRegression()
model.fit(X, y)

# Predict the price of a car (6 years old, 80,000 km)
new_car = np.array([[6, 80000]])   # Age = 6, Mileage = 80,000
predicted_price = model.predict(new_car)
print(f"Predicted Price: ${predicted_price[0]:,.2f}")
Logistic Regression
Logistic regression is a supervised machine learning
algorithm used for classification tasks where the goal is to
predict the probability that an instance belongs to a given class
or not. Logistic regression is a statistical algorithm which
analyze the relationship between two data factors.
Logistic regression is used for binary classification where we
use sigmoid function, that takes input as independent variables
and produces a probability value between 0 and 1.
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be
classified into three types:
1. Binomial: In binomial Logistic regression, there can be only
two possible types of the dependent variables, such as 0 or 1,
Pass or Fail, etc.

2. Multinomial: In multinomial Logistic regression, there can


be 3 or more possible unordered types of the dependent
variable, such as “cat”, “dogs”, or “sheep”

3. Ordinal: In ordinal Logistic regression, there can be 3 or


more possible ordered types of dependent variables, such as
“low”, “Medium”, or “High”.
Key Points:

• Logistic regression predicts the output of a categorical dependent


variable. Therefore, the outcome must be a categorical or discrete
value.

• It can be either Yes or No, 0 or 1, true or False, etc. but instead of


giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.

• In Logistic regression, instead of fitting a regression line, we fit an


"S" shaped logistic function, which predicts two maximum values (0
or 1).
Problem Statement
A university wants to predict whether a student will pass or fail based on
the number of hours they study.
Feature (X): Number of hours studied.
Target (Y): Pass (1) or Fail (0).
Step 1: Understanding Logistic Regression. Unlike linear regression,
logistic regression predicts probabilities and maps them to binary
outcomes (pass or fail) using the sigmoid function:

P(pass) = σ(z) = 1 / (1 + e^(−z)), where z = β₀ + β₁ × Hours_Studied

A negative intercept suggests that when study hours = 0, the probability of passing is very low. If the intercept were positive, it would imply that even without studying, a student has a high chance of passing, which may not be realistic.
Step 4: Interpretation
• The more hours studied, the higher the probability of passing.
• The decision boundary (typically at 0.5) separates passing
and failing students.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample dataset: Hours studied vs Pass/Fail outcome
data = {
    "Hours_Studied": [1, 3, 5, 7, 9, 2, 4, 6, 8, 10],
    "Pass": [0, 0, 1, 1, 1, 0, 0, 1, 1, 1]   # 1 = Pass, 0 = Fail
}

# Convert dictionary to DataFrame
df = pd.DataFrame(data)

# Define independent variable (X) and dependent variable (y)
X = df[["Hours_Studied"]]
y = df["Pass"]

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict outcomes on test data
y_pred = model.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Predict pass/fail for a student who studied 6 hours
new_student = np.array([[6]])   # Hours Studied = 6
prediction = model.predict(new_student)
print("Will the student pass?", "Yes" if prediction[0] == 1 else "No")
Multiclass classification using multinomial logistic
regression
Multinomial Logistic Regression is an extension of logistic regression
used for classifying data into more than two categories. Example use case:
Predicting the type of flower (Setosa, Versicolor, Virginica) using features
like petal length, sepal width, etc.This is a multiclass classification task.
Key Idea
• Binary logistic regression uses the sigmoid function to predict
probabilities of two classes.
• Multinomial logistic regression uses the softmax function to predict
the probabilities for three or more classes.
• The model gives a probability for each class, and the class with the
highest probability is chosen.
Example
• Let’s say we want to predict fruit type based on weight and
color score:


Fruit has 3 classes → Apple (0), Orange (1), Banana (2)


Suppose we want to predict the result for a fruit with weight =
150g, color score = 0.75:

Class    Score Before Softmax    Probability After Softmax
Apple    2.1                     0.60
Orange   1.2                     0.25
Banana   0.9                     0.15

Since Apple has the highest probability, the model predicts: Apple.
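A minimal sketch of the softmax step with NumPy, using the three raw scores from the table (the resulting probabilities come out close to the rounded values shown above):

import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then normalise
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

# Raw scores for Apple, Orange, Banana (from the table above)
scores = np.array([2.1, 1.2, 0.9])
probs = softmax(scores)

for fruit, p in zip(["Apple", "Orange", "Banana"], probs):
    print(f"{fruit}: {p:.2f}")
# The class with the highest probability (Apple) is the prediction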


Applications of Linear Models
Domain: Application (Linear Model Used)
• Predictive Modeling: forecasting trends, demand, or outcomes based on historical data (Linear Regression, Logistic Regression)
• Marketing & Business: customer segmentation, lead scoring, campaign effectiveness (Logistic Regression)
• Finance: credit risk scoring, stock price trend prediction, loan approval (Linear Regression, Logistic Regression)
• Healthcare: disease diagnosis (e.g., diabetes risk), survival prediction (Logistic Regression)
• Recommendation Systems: predicting user ratings or preferences (e.g., movie scores) (Linear Regression)
• NLP (Natural Language Processing): text classification (e.g., spam detection, sentiment analysis) (Logistic Regression with Bag-of-Words or TF-IDF)
Naive Bayes Classifiers
The main idea behind the Naive Bayes classifier is to
use Bayes’ Theorem to classify data based on the probabilities
of different classes given the features of the data. It is used
mostly in high-dimensional text classification
• The Naive Bayes Classifier is a simple probabilistic classifier
and it has very few number of parameters which are used to
build the ML models that can predict at a faster speed than other
classification algorithms.
• It is a probabilistic classifier because it assumes that one feature
in the model is independent of existence of another feature. In
other words, each feature contributes to the predictions with no
relation between each other.

• Naïve Bayes Algorithm is used in spam filtration, Sentimental


analysis, classifying articles and many more.
What is Bayes' Theorem?

Bayes’ Theorem is a fundamental concept in probability theory and statistics. It


tells us how to update the probability of a hypothesis based on new evidence.
In simple terms: What is the probability of a cause (hypothesis) given that we
observed an effect (evidence)?
1. Hypothesis
• A hypothesis is an assumption or a possible explanation we are testing.
Example- The patient has particular disease.
2. Evidence
• Evidence is the new observed data or outcome.
In Bayes' Theorem: The evidence is event B, the thing that we have observed.
Example: Evidence: The patient’s test result is positive.

In Bayes' Theorem, The hypothesis is typically event A, and we are


calculating 𝑃(𝐴∣𝐵), the probability of the hypothesis given the evidence. Example:
Hypothesis: The patient has a disease.
Example for Bayes Theorem
Problem Statement
A medical test is designed to detect a rare disease. The test results are
positive or negative. The test is not perfect, meaning there can be false
positives and false negatives.
Given:
• P(Disease) = 0.01 → 1% of the population has the disease.
• P(Test Positive | Disease) = 0.95 → 95% probability the test correctly
detects the disease (true positive).
• P(Test Positive | No Disease) = 0.05 → 5% probability the test
incorrectly shows positive for a healthy person (false positive).
Question: If a person’s test result is positive, what is the probability that
they actually have the disease?
Solution Using Bayes' Theorem
Bayes' Theorem is given by:
P(A|B) = P(B|A) · P(A) / P(B)
In Bayes' Theorem, A and B represent different events whose


probabilities we want to calculate or update based on new information.
For the medical test example, we define:
• A = The person has the disease.
• B = The test result is positive.
• Bayes' Theorem calculates the probability of A(having the disease)
given B(a positive test result):
P(A∣B)- P(Disease∣TestPositive) is what we want to calculate.
P(B∣A)- P(TestPositive∣Disease) is the probability of a true
positive.
P(A)- P(Disease) is the prior probability of having the disease.
P(B)- P(TestPositive) is the overall probability of a positive
test.
Step 1: Compute P(TestPositive)
P(TestPositive)=P(TestPositive∣Disease)⋅P(Disease)+
P(TestPositive∣NoDisease)⋅P(NoDisease)
Substituting values:
P(TestPositive)=(0.95×0.01)+(0.05×0.99)
P(TestPositive)=0.0095+0.0495=0.059
Step 2: Compute P(Disease∣TestPositive)
P(Disease∣TestPositive)=0.95×0.01/0.059
P(Disease∣TestPositive)=0.0095/0.059≈0.161
Step 3: Interpretation
• The probability that a person actually has the disease
given a positive test result is 16.1%, which is lower than
expected due to false positives. This means a positive test
does not guarantee the person has the disease—other
tests may be required for confirmation.

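The same calculation can be checked with a few lines of Python (just the arithmetic above):

# Numbers from the worked example above
p_disease = 0.01            # P(Disease)
p_pos_given_disease = 0.95  # P(Test Positive | Disease)
p_pos_given_healthy = 0.05  # P(Test Positive | No Disease)

# Total probability of a positive test
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(Disease | Test Positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_pos, 4), round(p_disease_given_pos, 3))   # 0.059  0.161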
Naive Bayes Classifier
The Naive Bayes Classifier is a simple yet powerful machine learning
algorithm used for classification tasks. It’s based on Bayes’ Theorem,
with a key assumption: all input features are independent given the
class (which is rarely true in real life — hence the name “naive”).
Few examples of Naive Bayes Algorithm
1. Spam Email Detection
• Hypothesis: Email is spam
• Evidence: Words like "free", "win", "money" appear
• Naive Bayes uses the presence or absence of words to classify emails.
2. Sentiment Analysis
• Hypothesis: A review is positive or negative
• Evidence: Words in the review like "great", "bad", "amazing"
• Used in movie reviews, product feedback, etc.
3. Medical Diagnosis
Hypothesis: Patient has a particular disease
Evidence: Symptoms like fever, cough, headache
Naive Bayes can estimate the likelihood of diseases based on
symptom patterns.
4. News Article Classification
Hypothesis: An article belongs to the “sports” or “politics”
category
Evidence: Words appearing in the article (e.g., "match",
"parliament")
Helps categorize large volumes of news automatically.
Types of Naive Bayes Classifier
1. Gaussian Naive Bayes
• Use When: Features are continuous and follow a normal (Gaussian)
distribution.
• Assumes: Data in each class is normally distributed.
• Formula: Uses the Gaussian probability density function to estimate
likelihoods.
Example: Predicting whether a person has a disease based on height, weight, or age.
2. Multinomial Naive Bayes
• Use When: Features are discrete, like word counts or frequency.
• Assumes: Data is generated from a multinomial distribution (bag-of-words
model).
• Most commonly used for text classification problems.
Example: Spam detection, where features are counts of words in an email.
3. Bernoulli Naive Bayes
• Use When: Features are binary (i.e., 0 or 1), representing
presence or absence of a feature.
• Assumes: Each feature follows a Bernoulli distribution
(yes/no).
• Also used for text classification, but considers only whether a
word appears or not.
Example: Classifying documents based on whether specific
keywords exist in them.
Applications of Naive Bayes Classifier
1. Spam Detection Classifies emails as spam or not spam
based on words, subject lines, and patterns.
Fast and works well even with thousands of emails.
2. Sentiment Analysis Determines if a product or movie
review is positive, negative, or neutral.
Common in customer feedback and social media monitoring.
3. Medical Diagnosis Predicts diseases based on symptoms.
Example: Given symptoms like fever, cough, and fatigue,
predict the likelihood of flu vs. COVID.
4. Document Classification
Automatically assigns categories to news articles, blogs, or research
papers.
Example: Labeling a document as “sports”, “politics”, or “technology”.
5. Recommendation Systems
Predicts user preferences based on past behavior.
Example: Suggesting products on Amazon or videos on YouTube.
6. Real-time Prediction Systems
Because Naive Bayes is very fast and lightweight, it is used in:
Fraud detection in banking
Real-time ad targeting in digital marketing
Risk classification in insurance
7. Face or Object Recognition
Helps in classifying images using presence or absence of certain
features (used with other techniques).
Working Principle of Naïve Bayes Theorem
For a classification problem with multiple features (X₁, X₂, …, Xₙ), the algorithm applies Bayes' theorem:
P(C | X₁, X₂, ..., Xₙ) = P(X₁, X₂, ..., Xₙ | C) · P(C) / P(X₁, X₂, ..., Xₙ)
Since exact computation of P(X₁, X₂, ..., Xₙ | C) is complex, Naïve Bayes assumes feature independence, simplifying the numerator to:
P(C | X₁, X₂, ..., Xₙ) ∝ P(X₁ | C) · P(X₂ | C) · ⋯ · P(Xₙ | C) · P(C)
How It Works in Practice
1. Calculate prior probabilities: Find P(C), which
represents how often each class appears in the dataset.
2. Compute likelihood: Calculate P(X | C), which measures
how likely feature X appears given class C.
3. Multiply the probabilities: Use the independence
assumption to multiply all individual feature probabilities.
4. Make a prediction: Assign the class with the highest
probability.
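A minimal sketch of these steps with scikit-learn's GaussianNB; the tiny symptom dataset is invented purely for illustration:

from sklearn.naive_bayes import GaussianNB

# Hypothetical patient data: [fever (°C), cough severity 0-10]
X = [[39.1, 8], [38.5, 7], [37.0, 2], [36.8, 1], [38.9, 9], [36.6, 0]]
y = [1, 1, 0, 0, 1, 0]   # 1 = disease present, 0 = no disease

model = GaussianNB()     # assumes each feature is normally distributed per class
model.fit(X, y)

# Predict for a new patient and show the per-class probabilities
new_patient = [[38.2, 6]]
print(model.predict(new_patient))        # predicted class
print(model.predict_proba(new_patient))  # P(no disease), P(disease)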
Decision Trees
A decision tree is a supervised learning algorithm used for both
classification and regression tasks. It models decisions as
a tree-like structure where internal nodes represent attribute
tests, branches represent attribute values and leaf nodes
represent final decisions or predictions.

Let’s consider a decision tree for predicting whether a customer will buy
a product based on age, income and previous purchases: Here’s how
the decision tree works:
1. Root Node (Income)
First Question: “Is the person’s income greater than $50,000?”
• If Yes, proceed to the next question.

• If No, predict “No Purchase” (leaf node).

Click to Edit 2. Internal Node (Age):


If the person’s income is greater than $50,000, ask: “Is the person’s
age above 30?”
• If Yes, proceed to the next question.

• If No, predict “No Purchase” (leaf node).


3. Internal Node (Previous Purchases):
• If the person is above 30 and has made previous
purchases, predict “Purchase” (leaf node).

• If the person is above 30 and has not made previous


purchases, predict “No Purchase” (leaf node).

Splitting of Decision Tree


Example: Predicting Whether a Customer Will Buy a Product


Using Two Decision Trees
Tree 1: Customer Demographics

First tree asks two questions:


1. “Income > $50,000?”
• If Yes, Proceed to the next question.

• If No, “No Purchase”

2. “Age > 30?”


• Yes: “Purchase”
• No: "No Purchase"

Tree 2: Previous Purchases

“Previous Purchases > 0?”


• Yes: “Purchase”

• No: “No Purchase”


Combining Trees: Final Prediction

• Once we have predictions from both trees, we can


combine the results to make a final prediction. If Tree
1 predicts “Purchase” and Tree 2 predicts “No
Purchase”, the final prediction might be “Purchase” or
“No Purchase” depending on the weight or confidence
assigned to each tree. This can be decided based on the
problem context.
Attribute Selection Measure(ASM)
The attribute selection measure (ASM) is a criterion used in decision tree
algorithms to select the best attribute for splitting the data at each node.
The ASM assigns a score to each attribute based on its ability to divide
the data into subsets that are more homogeneous in terms of the target
variable.
The attribute with the highest score is selected as the splitting attribute for
that node.The goal of the ASM is to find the attribute that provides the
most information gain or the best split for the data. The Gini Index, Gain
Ratio, and Information Gain are the most widely used selection
metrics.
Information Gain
In physics and mathematics, entropy is referred to as the randomness or
the impurity in a system. In information theory, it refers to the impurity in
a group of examples.
An entropy is a measure of the impurity or randomness of a dataset and it
is used in decision tree algorithms to evaluate the effectiveness of
attributes in partitioning the data into more homogeneous subsets with
respect to the target variable. A lower entropy indicates a more
homogeneous subset, while a higher entropy indicates a more
heterogeneous subset.
To calculate the information gain of the original dataset-
Information Gain tells you how much "useful information" a feature
gives about the target. If a feature splits the data into pure groups (i.e.,
mostly one class), it has high Information Gain.
Example:
Imagine you're classifying fruits based on texture and weight.
If splitting by texture results in:
One group with only Apples
Another with only Oranges→
Then texture has high information gain (because it removes uncertainty).
1. Calculate the entropy of the original dataset D:
Entropy(D) = −Σᵢ pᵢ log₂(pᵢ)
where pᵢ is the probability that an arbitrary tuple in D belongs to class Cᵢ.


The generic formula for calculating the entropy of a dataset D with
respect to a binary class label (example, "Yes" or "No") is given by:
Entropy(D) = −p(Yes) log₂ p(Yes) − p(No) log₂ p(No)
where:
p(Yes) is the proportion of examples in D that have a class label of "Yes",
p(No) is the proportion of examples in D that have a class label of "No",
log₂ is the base-2 logarithm.
This formula represents the measure of impurity or randomness in the dataset D with respect to the binary class label. A lower entropy value indicates a more homogeneous or pure dataset, while a higher entropy value indicates a more heterogeneous or impure dataset.
2. Calculate the average (weighted) entropy after partitioning the dataset based on the values of attribute A:
Entropy_A(S) = Σ_v (|S_v| / |S|) · Entropy(S_v)
• |S_v| = number of elements in the subset where attribute A has value v
• |S| = total number of elements in the full dataset
3. Compute the information gain using the formula:
Gain(S, A) = Entropy(S) − Entropy_A(S)
The attribute A with the highest information gain, Gain(S,A), is


chosen as the splitting attribute at a particular node in the
decision tree. This means that the attribute that provides the
most reduction in entropy or the most effective partitioning of
the data is selected for splitting at that node.
Example of Information Gain

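As an illustrative sketch, entropy and information gain can be computed for a small made-up dataset (10 examples, a binary class label, and one candidate attribute A) as follows:

import math
from collections import Counter

def entropy(labels):
    # Entropy(D) = -sum(p_i * log2(p_i)) over the classes present
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attribute, labels):
    total = len(labels)
    # Weighted average entropy of the subsets created by each attribute value
    remainder = 0.0
    for v in set(attribute):
        subset = [lab for a, lab in zip(attribute, labels) if a == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy data: attribute A and class labels (Yes/No)
A      = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]
labels = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes", "No"]

print(round(entropy(labels), 3))              # entropy of the full dataset (1.0)
print(round(information_gain(A, labels), 3))  # reduction in entropy after splitting on A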
Gini Index
The Gini Index is a measure of the impurity or inequality of a distribution, commonly used as an impurity measure in decision tree algorithms. In the context of decision trees, the Gini Index is used to determine the best feature to split the data on at every node of the tree.
The formula for the Gini Index is:
Gini(S) = 1 − Σᵢ pᵢ²
where pᵢ is the probability of an item belonging to a particular class.


Using Gini Index in Classification Problems
The Gini Index is commonly used as an impurity measure in decision tree algorithms for classification problems. In decision trees, every node represents a feature, and the objective is to split the data into subsets that are as pure as possible. The impurity measure (such as the Gini Index) is used to decide the best split at every node.
• To illustrate this, consider a decision tree for a binary classification problem. The tree has two features, age and income, and the objective is to predict whether an individual is likely to purchase a product. The tree is constructed using the Gini Index as the impurity measure.
• The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred over one with a high Gini index.
• CART creates only binary splits, and it uses the Gini index to choose them.
• The Gini index can be calculated using the formula:
Gini = 1 − Σⱼ pⱼ²
Gini Index vs Entropy
While entropy and the Gini Index are both commonly used as impurity measures in decision tree algorithms, they have different properties. Entropy is more sensitive to the distribution of class labels and tends to produce more balanced trees, while the Gini Index is less sensitive to the class distribution and tends to produce shorter trees with fewer splits. The choice of impurity measure depends on the particular problem and the characteristics of the data.
Example of Gini index
Consider a binary classification problem with a dataset of 10 examples and two classes: "Positive" and "Negative". Out of the 10 examples, 6 belong to the "Positive" class and 4 belong to the "Negative" class.
• To calculate the Gini Index of the dataset, we first calculate the probability of each class:
p_1 = 6/10 = 0.6 (Positive)
p_2 = 4/10 = 0.4 (Negative)
Then we apply the Gini Index formula to calculate the impurity of the dataset:
Gini(S) = 1 - (p_1^2 + p_2^2)
        = 1 - (0.6^2 + 0.4^2)
        = 0.48
So, the Gini Index of the dataset is 0.48.
Now suppose we want to split the dataset on a feature "X" that has two possible values, "A" and "B". We split the dataset into two subsets based on this feature:
Subset 1 (X = A): 4 Positive, 1 Negative
Subset 2 (X = B): 2 Positive, 3 Negative
To calculate the decrease in Gini Index for this split, we first calculate the Gini Index of each subset:
Gini(S_1) = 1 - (4/5)^2 - (1/5)^2 = 0.32
Gini(S_2) = 1 - (2/5)^2 - (3/5)^2 = 0.48
Then we use the weighted-average formula to calculate the decrease in Gini Index:
IG(S, X) = Gini(S) - ((5/10 * Gini(S_1)) + (5/10 * Gini(S_2)))
         = 0.48 - ((0.5 * 0.32) + (0.5 * 0.48))
         = 0.08
• So, the decrease in Gini Index for splitting the dataset on feature "X" is 0.08.
• If we calculate this reduction for all features and pick the one with the highest value, that feature is chosen as the best feature to split on at the root node of the decision tree.
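The same arithmetic can be checked with a short Python snippet (using the counts from the example above):

def gini(counts):
    # Gini(S) = 1 - sum(p_i^2)
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

g_full = gini([6, 4])   # 6 Positive, 4 Negative -> 0.48
g_s1   = gini([4, 1])   # subset X = A           -> 0.32
g_s2   = gini([2, 3])   # subset X = B           -> 0.48

# Reduction in Gini impurity for splitting on X
reduction = g_full - (5 / 10 * g_s1 + 5 / 10 * g_s2)
print(round(g_full, 2), round(g_s1, 2), round(g_s2, 2), round(reduction, 2))
# 0.48 0.32 0.48 0.08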
Decision Tree Algorithms
1. ID3 (Iterative Dichotomiser 3) Algorithm
ID3 is an algorithm used to construct decision trees for classification tasks. It takes a
dataset as input and produces a decision tree that can be used to classify new data
points. The algorithm works by iteratively selecting the most informative attribute at
each node and splitting the data based on that attribute.
2. C4.5
C4.5 uses a modified version of information gain called the gain ratio to reduce the
bias towards features with many values. The gain ratio is computed by dividing
the information gain by the intrinsic information which measures the amount of
data required to describe an attribute's values:
Gain Ratio(S, A) = Gain(S, A) / SplitInfo(S, A), where SplitInfo(S, A) = −Σ_v (|S_v| / |S|) log₂(|S_v| / |S|)
3. CART (Classification and Regression Trees)

CART is a widely used decision tree algorithm that is used


for classification and regression tasks.
• For classification, CART splits the data based on the Gini impurity, which measures the likelihood that a randomly selected data point would be incorrectly classified. The feature that minimizes the Gini impurity is selected for splitting at each node.

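A minimal sketch of CART-style training with scikit-learn, whose DecisionTreeClassifier uses Gini impurity by default; the age/income data is invented for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: [age, income in $1000s] -> bought the product (1) or not (0)
X = [[22, 25], [35, 60], [45, 80], [28, 52], [50, 40], [31, 75], [40, 30], [55, 90]]
y = [0, 1, 1, 0, 0, 1, 0, 1]

# criterion="gini" is the CART-style impurity measure described above
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

# Show the learned splits and classify a new applicant
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[38, 65]]))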
THANK YOU