
Module 3 - Introduction to ML


MACHINE LEARNING

Presenter: Dr. Amit Kumar Das


Professor,
Dept. of Computer Science and Engg.,
Institute of Engineering & Management.
WHAT IS LEARNING?
TYPES OF HUMAN LEARNING

 Learning through direct guidance from an expert – is just one form …
 Learning through indirect guidance
 Learning by self
WHAT IS MACHINE LEARNING?
TYPES OF MACHINE LEARNING
 Supervised learning – also called predictive learning
 Unsupervised learning – also called descriptive learning
 Reinforcement learning
MACHINE LEARNING PROCESS
 What was the most difficult subject in the last semester?
 What if you had a list of all possible questions with answers, and a photographic memory?
MACHINE LEARNING PROCESS
 Data Input – Past data or information is utilized as a basis for future decision-making
 Abstraction – The input data is represented in a broader way through the underlying algorithm
 Generalization – The abstracted representation is generalized to form a framework for making decisions
TYPICAL ML PROBLEMS
 Prediction of results of a game
 Predicting whether a tumor is malignant or benign
 Price prediction in domains like real estate, stocks, etc.
 Demand forecasting in retail
 Customer segmentation
 Self-driving cars
PROBLEMS NOT TO BE CONSIDERED FOR ML

 Bank interest calculation
 Inventory management (except the demand forecast module)
 Customer on-boarding (except the risk prediction module)
 Tasks in which humans are very effective or frequent human intervention is needed. For example, air traffic control
TYPES OF DATA
 Qualitative data (Categorical)
 Student Name, Blood group, Grade, etc.
 Quantitative data (Numerical)
 Temperature, Age, Weight, etc.
DATA EXPLORATION
 Understand the central tendency –
 Mean
 Median
 Mode

 Understand data spread


 Standard Deviation

 Understand data value position
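The three measures of central tendency listed above can be computed with Python's standard `statistics` module. This is a minimal sketch; the sample values are hypothetical and are not taken from the slides. They include one extreme value to show how the mean shifts while the median stays close to the bulk of the data.

```python
import statistics

# Hypothetical sample (an assumption for illustration): five typical values and one extreme value.
values = [18, 22, 22, 25, 30, 95]

mean = statistics.mean(values)      # arithmetic average, pulled upwards by the value 95
median = statistics.median(values)  # middle of the sorted data (average of the two central values)
mode = statistics.mode(values)      # most frequently occurring value

print(f"mean={mean:.2f}, median={median}, mode={mode}")
```

A large gap between mean and median, as in this sample, is often a hint that the attribute is skewed or contains outliers.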


DATA EXPLORATION – CENTRAL TENDENCY

Mean vs. Median for Auto MPG


DATA EXPLORATION – DATA SPREAD
 Consider the data values of two attributes
 Attribute 1 values – 44, 46, 48, 45 and 47
 Attribute 2 values – 34, 46, 59, 39 and 52
 Both sets of values have a mean and median of 46.
 The first set of values is more concentrated or clustered around the mean / median value
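The difference in spread between the two attributes can be checked numerically. The sketch below uses the values from this slide; the choice of the population variance and standard deviation (dividing by n rather than n - 1) is an assumption, since the slides do not say which form is intended.

```python
import statistics

attribute_1 = [44, 46, 48, 45, 47]
attribute_2 = [34, 46, 59, 39, 52]

def describe(values):
    """Return (mean, median, population variance, population standard deviation)."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.pvariance(values),  # assumption: population form, divides by n
        statistics.pstdev(values),
    )

for name, values in (("Attribute 1", attribute_1), ("Attribute 2", attribute_2)):
    mean, median, variance, std_dev = describe(values)
    print(f"{name}: mean={mean}, median={median}, variance={variance:.1f}, std dev={std_dev:.2f}")
```

Both attributes share the same centre, but the variance of the second attribute is far larger, which is what "less concentrated around the mean" means in numbers.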
DATA EXPLORATION – DATA VALUE POSITION

 Any numeric attribute of a data set can be summarized by five values
 Minimum
 First quartile (Q1)
 Median (Q2)
 Third quartile (Q3), and
 Maximum

(Figure: number line marking Minimum, Q1, Median (Q2), Q3 and Maximum)
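The five values can be obtained as follows. This is a sketch using `statistics.quantiles` from the Python standard library; the helper name `five_number_summary` is hypothetical, and the `"inclusive"` method is only one of several quartile conventions, so other tools may report slightly different Q1 and Q3 values. The data is Attribute 2 from the data spread slide.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # n=4 splits the data into quartiles; "inclusive" treats the data as the full population.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

summary = five_number_summary([34, 46, 59, 39, 52])
print(summary)
```

These are the same five values that a box plot draws: the box spans Q1 to Q3 with a line at the median.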
DATA EXPLORATION – BOX PLOT
DATA QUALITY

 The most commonly occurring data quality issues are:
 Missing values
 Outliers

Missing values of attribute “horsepower” in Auto MPG


REMEDIATE DATA ISSUES
 Remove missing values / outliers – If the number of affected records is small, remove them.
 Imputation – Impute the missing value with the mean, median or mode
 Capping – For values that lie outside the 1.5 × IQR limits, cap them by replacing the observations below the lower limit with the value of the 5th percentile and those above the upper limit with the value of the 95th percentile
 Estimate missing values – Assign attribute values of similar data points in place of the missing value
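Two of the remedies above, imputation and capping, can be sketched in a few lines of standard-library Python. The function names and the sample values are hypothetical; missing values are represented by `None`, and the limits are taken as Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, which is the usual reading of "1.5 × IQR limits" but is an assumption here.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def cap_outliers(values):
    """Cap values outside the 1.5 x IQR limits at the 5th / 95th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # n=20 yields cut points at 5%, 10%, ..., 95%; the first and last are the ones needed.
    percentiles = statistics.quantiles(values, n=20, method="inclusive")
    p5, p95 = percentiles[0], percentiles[-1]
    return [p5 if v < lower else p95 if v > upper else v for v in values]

horsepower = [130, 165, None, 150, 140]   # hypothetical values with one missing entry
imputed = impute_median(horsepower)

readings = [10, 12, 11, 13, 12, 11, 100]  # hypothetical values with one outlier
capped = cap_outliers(readings)
print(imputed, capped)
```

On a very small sample the 95th percentile is itself influenced by the outlier, so capping is more meaningful on larger data sets.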
ISSUES IN MACHINE LEARNING

 Relatively new and evolving technology
 In different countries, rules and regulations, cultural background and emotional maturity of people are drastically different
 Biggest fears – potential breach of privacy, discriminatory behaviour and the resulting discontent
WHAT IS MODELLING IN CONTEXT OF MACHINE LEARNING?
WHAT ARE THE DIFFERENT ML ALGORITHMS?

 Supervised
 Classification – KNN, Naive Bayes, Decision Tree, etc.
 Regression – Simple Linear Regression, Logistic Regression (which predicts a class probability, so it is commonly used for classification)
 Unsupervised
 Clustering – K-Means
 Market Basket Analysis
SUPERVISED LEARNING - CLASSIFICATION

(Diagram: Labelled Training Data → Classifier → Classification Model ← Test Data)
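As an illustration of the flow from labelled training data to a prediction on unseen data, here is a minimal k-nearest-neighbour sketch. kNN is one of the classifiers named in this module; the function name, the two-feature points and the win/loss labels are hypothetical and chosen only to keep the example small.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labelled training data: two features per record, one class label each.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
train_labels = ["loss", "loss", "loss", "win", "win", "win"]

# Hypothetical test records whose class is to be predicted.
prediction_a = knn_predict(train_points, train_labels, (1.8, 1.6))
prediction_b = knn_predict(train_points, train_labels, (6.4, 6.2))
print(prediction_a, prediction_b)
```

kNN keeps the training data itself as the "model", which makes it a convenient first example even though most classifiers build a more compact representation.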
SUPERVISED LEARNING - REGRESSION

y = α + βx
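The coefficients α and β of the line above are typically estimated by ordinary least squares. The sketch below assumes that method (the slide does not name one) and uses hypothetical data that lies exactly on a line, so the fitted values are easy to verify by hand.

```python
def fit_simple_linear_regression(x, y):
    """Estimate alpha (intercept) and beta (slope) of y = alpha + beta * x by least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # beta = covariance(x, y) / variance(x); the common 1/n factor cancels out.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    beta = s_xy / s_xx
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical data generated from y = 1 + 2x.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
alpha, beta = fit_simple_linear_regression(x, y)
print(f"y = {alpha} + {beta}x")
```

With real data the points do not fall on the line, and the same formulas give the line that minimizes the sum of squared errors.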
UNSUPERVISED LEARNING

(Diagram: Unlabelled Data → Unsupervised Learning Model → Grouped data / Clusters)


UNSUPERVISED LEARNING - CLUSTERING

(Figure: data points grouped into four clusters, Cluster 1 to Cluster 4)
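A clustering like the one pictured can be produced with k-means, the clustering algorithm named in this module. The following is a bare-bones sketch, assuming Euclidean distance, fixed starting centroids and a fixed number of iterations; the points and the function name `k_means` are hypothetical.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of the cluster; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabelled data with two visibly separate groups.
points = [(1, 1), (1.5, 2), (2, 1.5), (8, 8), (8.5, 9), (9, 8.5)]
centroids, clusters = k_means(points, centroids=[(1, 1), (9, 8.5)])
print(centroids)
print(clusters)
```

In practice the starting centroids are chosen at random and the loop stops once assignments no longer change; both are simplified away here.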
UNSUPERVISED LEARNING – MARKET BASKET ANALYSIS
SELECTING A MODEL

 Predictive models (supervised)
 Predict the value of a category or class
 Problems that can be solved: Prediction of win/loss, fraudulent transactions, etc.
 Examples: k-Nearest Neighbor (kNN), Naïve Bayes, Decision Tree, etc.
 Predict numerical values of the target
 Problems that can be solved: Prediction of revenue growth, rainfall amount, etc.
 Examples: Linear Regression, Logistic Regression (which predicts a class probability, so it is commonly used for classification), etc.
SELECTING A MODEL
 Descriptive models (unsupervised)
 Group together similar data instances
 Problems that can be solved: Customer grouping or segmentation based on social, demographic, ethnic and other factors
 Most popular model for clustering is k-Means
TRAIN A MODEL – HOLDOUT METHOD
(Diagram: Input Data is split into Training Data (70% - 80%), used to build the Trained Model, and Test Data (20% - 30%), used to measure Model Performance)
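The holdout split can be sketched as a shuffle followed by a cut. The function name `holdout_split`, the 70% default and the fixed seed are illustrative assumptions; the records here are just the integers 0 to 99.

```python
import random

def holdout_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into training and test partitions."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))
train, test = holdout_split(records, train_fraction=0.7)
print(len(train), len(test))
```

Shuffling before the cut matters when the input data is ordered, for example by date or by class; otherwise the test partition may not resemble the training partition.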
K-FOLD CROSS-VALIDATION – OVERALL APPROACH
K-FOLD CROSS-VALIDATION – DETAILED APPROACH
K-FOLD CROSS-VALIDATION (CONTD.)
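In k-fold cross-validation the data is divided into k folds, and each fold serves as the test set exactly once while the remaining k - 1 folds form the training set. The generator below is a sketch of that index bookkeeping only; the name `k_fold_indices` is hypothetical, and shuffling and model training are left out.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

The model's performance is then usually reported as the average over the k test folds.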
BOOTSTRAP SAMPLING / BOOTSTRAPPING
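Bootstrap sampling draws records at random with replacement, so the same record may appear several times in a sample while others are left out. The sketch below draws a sample of the same size as the input; returning the records that were never picked (often called out-of-bag records) as a candidate test set is a common convention and an assumption here, as is the function name.

```python
import random

def bootstrap_sample(records, seed=None):
    """Draw len(records) records with replacement; also return the records never drawn."""
    rng = random.Random(seed)
    chosen = [rng.randrange(len(records)) for _ in records]
    chosen_set = set(chosen)
    sample = [records[i] for i in chosen]
    out_of_bag = [r for i, r in enumerate(records) if i not in chosen_set]
    return sample, out_of_bag

records = list(range(20))
sample, out_of_bag = bootstrap_sample(records, seed=7)
print(sorted(sample))
print(out_of_bag)
```

Because sampling is with replacement, bootstrapping is mainly useful when the data set is too small for a holdout or k-fold split to leave enough training data.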
TRAIN A MODEL – UNDER VS. OVER FIT

(Figure: Under fit, Balanced fit and Over fit)


TRAIN A MODEL – BIAS VS. VARIANCE
EVALUATING A MODEL - CLASSIFICATION

                         Actual Outcome
                         Win                    Loss
Predicted      Win       True Positive (TP)     False Positive (FP)
Outcome        Loss      False Negative (FN)    True Negative (TN)

 True Positive (TP) – Predicted win, Actual win
 True Negative (TN) – Predicted loss, Actual loss
 False Positive (FP) – Predicted win, Actual loss
 False Negative (FN) – Predicted loss, Actual win
 For both TP and TN, predicted outcome matches actual outcome. Hence, they are correct classifications.
EVALUATING A MODEL – CLASSIFICATION (CONTD.)
                    Actual Win    Actual Loss
Predicted Win           85             4
Predicted Loss           2             9

Here TP = 85, FP = 4, FN = 2 and TN = 9.

The percentage of misclassifications is indicated using the error rate, which is measured as:

Error rate = (FP + FN) / (TP + FP + FN + TN)

In context of the above confusion matrix,

Error rate = (4 + 2) / (85 + 4 + 2 + 9) = 6 / 100 = 6%
EVALUATING A MODEL – CLASSIFICATION (CONTD.)
Kappa (κ) = (P(a) − P(pr)) / (1 − P(pr))

where P(a) = proportion of observed agreement between actual and predicted in overall data set = (TP + TN) / N, with N = TP + FP + FN + TN

P(pr) = proportion of expected agreement between actual and predicted data both in case of class of interest as well as the other classes = [(TP + FP) / N] × [(TP + FN) / N] + [(FN + TN) / N] × [(FP + TN) / N]

Note: Kappa value can be 1 at the maximum, which represents perfect agreement between model’s prediction and actual values.
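The error rate and Kappa value can be worked out for the win/loss confusion matrix shown earlier (85, 4, 2 and 9). The sketch assumes the standard definitions: error rate = (FP + FN) / total, and Kappa = (P(a) − P(pr)) / (1 − P(pr)), where P(pr) combines the chance agreement on the class of interest and on the other class.

```python
# Counts from the win/loss confusion matrix: rows are predicted, columns are actual.
tp, fp = 85, 4   # predicted win:  actual win, actual loss
fn, tn = 2, 9    # predicted loss: actual win, actual loss
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
error_rate = (fp + fn) / total

# Observed agreement: proportion of records where prediction and actual match.
p_a = (tp + tn) / total
# Expected agreement: chance that prediction and actual agree on "win" plus on "loss".
p_pr = ((tp + fp) / total) * ((tp + fn) / total) + ((fn + tn) / total) * ((fp + tn) / total)
kappa = (p_a - p_pr) / (1 - p_pr)

print(f"accuracy={accuracy:.2f}, error rate={error_rate:.2f}, kappa={kappa:.4f}")
```

Accuracy is high here partly because wins dominate the data; Kappa discounts the agreement that would be expected by chance alone, which is why it comes out noticeably lower than the accuracy.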
EVALUATING A MODEL (ROC CURVE)
TPR = TP / (TP + FN)

FPR = FP / (FP + TN)

Receiver Operating Characteristic curve
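An ROC curve plots the true positive rate against the false positive rate as the decision threshold of the classifier is varied. The sketch below assumes the usual definitions, TPR = TP / (TP + FN) and FPR = FP / (FP + TN); the labels, scores and thresholds are hypothetical.

```python
def roc_point(actual, scores, threshold):
    """Return (FPR, TPR) when every score >= threshold is predicted as positive (1)."""
    tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s >= threshold)
    fn = sum(1 for a, s in zip(actual, scores) if a == 1 and s < threshold)
    fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s >= threshold)
    tn = sum(1 for a, s in zip(actual, scores) if a == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical actual classes (1 = positive) and model scores for eight records.
actual = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

# Each threshold gives one point on the curve.
roc_points = [roc_point(actual, scores, t) for t in (1.01, 0.5, 0.0)]
print(roc_points)
```

A threshold above every score predicts nothing as positive and gives the point (0, 0); a threshold of zero predicts everything as positive and gives (1, 1). A good model bends the curve towards the top-left corner in between.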


EVALUATING A MODEL (REGRESSION)
(Plot: Value of the apartment unit against Area (in square feet); the Error is the gap between the Actual value and the Predicted value)


EVALUATING A MODEL (CLUSTERING)
“Clustering is in the eye of the beholder”

 Internal evaluation
 Silhouette width
 External evaluation
 Purity
EVALUATING A MODEL (CLUSTERING)
a(i)  Average distance between the i-th data instance and all other data instances belonging to the same cluster
b(i)  Lowest average distance between the i-th data instance and data instances of all other clusters

(Figure: Silhouette width calculation, showing Clusters 1 to 4)
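Given a(i) and b(i) as defined above, the silhouette width of a data instance is commonly computed as (b(i) − a(i)) / max(a(i), b(i)). That formula is the standard definition and is stated here as an assumption, since it does not survive in the slide text. The points and the function name below are hypothetical.

```python
import math

def silhouette_width(point, same_cluster_others, other_clusters):
    """Silhouette width of one data instance, using Euclidean distance."""
    # a(i): average distance to the other instances of the same cluster.
    a = sum(math.dist(point, p) for p in same_cluster_others) / len(same_cluster_others)
    # b(i): lowest average distance to the instances of any other cluster.
    b = min(
        sum(math.dist(point, p) for p in cluster) / len(cluster)
        for cluster in other_clusters
    )
    return (b - a) / max(a, b)

point = (0, 0)
same_cluster_others = [(0, 1), (1, 0)]                   # a(i) = 1.0
other_clusters = [[(3, 4), (6, 8)], [(0, 10), (0, 20)]]  # average distances 7.5 and 15.0
width = silhouette_width(point, same_cluster_others, other_clusters)
print(round(width, 4))
```

Values close to 1 indicate that the instance sits well inside its own cluster, while values near 0 or below suggest it lies between clusters or has been assigned to the wrong one.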


ENSEMBLE
THANK YOU &
STAY TUNED!
