Unit 1-1

The document provides an overview of machine learning fundamentals, including components of learning, the machine learning workflow, and the importance of training and testing datasets. It discusses concepts such as bias-variance tradeoff, overfitting, underfitting, and the role of Hoeffding's inequality and VC bounds in understanding generalization error. Additionally, it highlights the significance of validation techniques like k-fold cross-validation and the impact of noisy targets on model accuracy.


U18ECE0014

MACHINE LEARNING
Fundamentals of Machine Learning

S. Arun Kumar
Department of Electronics and Communication Engineering
Overview of this Lecture
• Components of Learning
• Fundamentals of Machine Learning
• Feasibility of Learning
• VC Bound
• Bias-Variance Tradeoff
• Learning Curves
Components of Learning

• Unknown target pattern to be learnt from the data
• Training examples (the data set D)
• g: the final hypothesis (function) learnt from the data, approximating the target f


Machine Learning Workflow

Machine learning uses data to compute a hypothesis g that approximates the target f.


Outside the Data Set
Another simple binary classification example

Can we “learn” from this data?


(Can we “infer” something outside the training data?)
Outside the Data Set
Learning process:
Check all the possible functions
Choose the one that fits all the data

Can’t make any prediction


Learning from D (to infer something outside D) is impossible if any f
can happen
Inferring Something Unknown
Consider a bin with red and green marbles
P[picking a red marble] = µ
P[picking a green marble] = 1 − µ

The value of µ is unknown to us


How to infer µ?
Pick N marbles independently
ν: the fraction of red marbles
Inferring with probability
Do you know µ? Does ν say something about µ?
No! The sample can be mostly green while the bin is mostly red (possible).
Can you say something about µ?
Yes! ν is “probably” close to µ if the sample is sufficiently large (probable).
Hoeffding’s Inequality
In big sample (large N), ν (sample mean) is probably close to µ:
P[|ν − µ| > ε] ≤ 2e^(−2ε²N)

This is called Hoeffding’s inequality


The statement “ν ≈ µ” is Probably Approximately Correct (PAC).

The event |ν − µ| > ε is a “bad event”; we want its probability to be low.

The bound is valid for all N, and ε is the margin of error.
The bound does not depend on µ (no need to know µ).
If we ask for a smaller margin ε, we have to increase the sample size N in order to keep the probability of the bad event |ν − µ| > ε small.
Hoeffding’s Inequality

P[|ν − µ| > ε] ≤ 2e^(−2ε²N)

Valid for all N and ϵ > 0


Does not depend on µ (no need to know µ)
Larger sample size N or looser gap ϵ
⇒ higher probability for µ ≈ ν
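
A quick way to see the inequality at work is to simulate the marble bin: draw many samples of size N and compare the observed frequency of the bad event |ν − µ| > ε with the Hoeffding bound. This is a minimal sketch assuming Python with NumPy; the value µ = 0.6 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N, eps, trials = 0.6, 1_000, 0.05, 10_000   # true fraction of red marbles (assumed)

# Draw `trials` independent samples of N marbles each; nu is the sample frequency of red.
nu = rng.binomial(N, mu, size=trials) / N

empirical = np.mean(np.abs(nu - mu) > eps)   # observed P[|nu - mu| > eps]
hoeffding = 2 * np.exp(-2 * eps**2 * N)      # Hoeffding upper bound 2e^(-2*eps^2*N)

print(f"observed bad-event rate: {empirical:.4f}")
print(f"Hoeffding bound:         {hoeffding:.4f}")
```

The observed rate should come out well below the bound, since Hoeffding holds for any µ and is not tight for any particular one.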
Connection to Learning
How to connect this to learning?
Each marble (uncolored) is a data point x ∈ X
Connection to Learning
How to connect this to learning?
Each marble (uncolored) is a data point x ∈ X
red marble: h(x) ≠ f(x) (h is wrong)
green marble: h(x) = f(x) (h is correct)

Both µ and ν depend on the particular hypothesis ℎ

ν → in-sample error 𝐸𝑖𝑛 (ℎ)


µ → out-of-sample error 𝐸𝑜𝑢𝑡(ℎ)

The out-of-sample error Eout(h) is the quantity that really matters.
Connection to Learning

• A hypothesis is an assumption, an idea that is proposed for the sake of argument so that
it can be tested to see if it might be true.
• A research hypothesis is a statement of expectation or prediction that will be tested by
research

• For any fixed h, we can probably infer the unknown Eout(h) from the known Ein(h).
For a particular hypothesis h:

Eout(h) = P[h(x) ≠ f(x)]   (out-of-sample, unknown)   ⇔ µ

Ein(h) = (1/N) Σ_{n=1}^{N} [h(x_n) ≠ y_n]   (in-sample, known)   ⇔ ν

Now we can infer Eout (h) from Ein (h)!
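
To make the correspondence ν ⇔ Ein(h) and µ ⇔ Eout(h) concrete, the hedged sketch below fixes one hypothesis h, samples N training points, and estimates Eout(h) on a large held-out sample. The target f and hypothesis h here are made-up toy functions for illustration, not anything from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda X: np.sign(X[:, 0] + X[:, 1] - 1.0)   # "unknown" target (known here only to simulate)
h = lambda X: np.sign(X[:, 0] - 0.5)             # one fixed hypothesis to verify

X_train = rng.uniform(0, 1, size=(100, 2))         # the N sampled data points
X_large = rng.uniform(0, 1, size=(1_000_000, 2))   # proxy for "outside the data set"

E_in  = np.mean(h(X_train) != f(X_train))   # in-sample error, corresponds to nu
E_out = np.mean(h(X_large) != f(X_large))   # out-of-sample error, corresponds to mu

print(f"E_in(h)  = {E_in:.3f}")
print(f"E_out(h) = {E_out:.3f}   |E_in - E_out| = {abs(E_in - E_out):.3f}")
```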


Verifying a Hypothesis
For any fixed h, when the sample size N is large:

P[|Ein(h) − Eout(h)| > ε] ≤ 2e^(−2ε²N)
Given a hypothesis h ⇒ sample N data ⇒ Ein(h) to “verify” the
quality of h
Can we apply this to multiple hypotheses?
Apply it to multiple bins (hypotheses).

The colouring in each bin depends on a different hypothesis.

Bingo when we get all green marbles?
A Simple Solution
When is learning successful?
When our Learning Algorithm A picks the hypothesis g:
P[|Ein(g) − Eout(g)| > ε] ≤ 2M e^(−2ε²N)

If M is small and N is large enough:


If A finds Ein(g) ≈ 0
⇒ Eout (g) ≈ 0 (Learning is successful!)
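
Why does choosing among many hypotheses need the extra factor M? A common way to see it is the coin-flipping analogy: each “hypothesis” is a fair coin (true error 0.5), yet with M of them some coin is very likely to look perfect on a small sample. The following is a hedged simulation of that effect, not something taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, trials = 10, 1_000, 2_000   # sample size, number of hypotheses, repetitions

hits = 0
for _ in range(trials):
    # Every hypothesis has true error 0.5; its in-sample error is the mean of N coin flips.
    Ein = rng.integers(0, 2, size=(M, N)).mean(axis=1)
    hits += Ein.min() == 0.0        # some hypothesis looks perfect on this sample

print("P[min Ein = 0 over M hypotheses] ≈", hits / trials)   # roughly 0.6 with these numbers
print("P[Ein = 0 for one fixed hypothesis] =", 0.5 ** N)     # about 0.001
```

Verifying one fixed h is easy; picking the best-looking g out of M hypotheses is what makes the bound grow by the factor M.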
Feasibility of Learning
P[|Ein(g) − Eout(g)| > ε] ≤ 2M e^(−2ε²N)

Two questions:
(1) Can we make sure Eout (g) ≈ Ein (g)?
(2) Can we make sure Ein (g) ≈ 0?
Feasibility of Learning – Tradeoff on M

P[|Ein(g) − Eout(g)| > ε] ≤ 2M e^(−2ε²N)

Two questions:
(1) Can we make sure Eout(g) ≈ Ein(g)?
(2) Can we make Ein(g) small enough?
M: complexity of the model
Small M: (1) holds, but (2) may not hold (too few choices) (under-fitting)
Large M: (1) is no longer guaranteed, but (2) may hold (over-fitting)
Training and Testing
• Training involves feeding a machine learning
algorithm with data and allowing it to learn the
relationships within that data.
• The dataset used in training is known as the training
dataset. It contains input-output pairs where the
input data is used by the model to learn and the
output data serves as the target or label.
• Testing evaluates the trained model's performance
on unseen data to gauge its generalization capability.
• The dataset used for testing is known as the testing
dataset or test set. It contains data that was not used
during the training phase
Overfitting and Underfitting
•Overfitting: When a model performs well on training data but
poorly on test data, it means the model has learned the noise
in the training data instead of the underlying pattern.
•Underfitting: When a model performs poorly on both training
and test data, it means the model is too simple to capture the
underlying pattern in the data.

•Validation Set: Sometimes, a separate validation set is used in addition to the training and testing sets. This helps in tuning the model and avoiding overfitting (a small numerical illustration of over- and underfitting follows below).
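
As an illustration of the two failure modes (not taken from the slides), the sketch below fits polynomials of increasing degree to a noisy sine target, assuming scikit-learn is available: a degree-1 fit underfits (both errors high), while a degree-15 fit overfits (low training error, higher test error).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, x.size)   # noisy target (toy example)

X_tr, X_te, y_tr, y_te = train_test_split(x.reshape(-1, 1), y, test_size=0.5, random_state=0)

for degree in (1, 3, 15):   # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    mse_tr = mean_squared_error(y_tr, model.predict(X_tr))
    mse_te = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree {degree:2d}:  train MSE = {mse_tr:.3f}   test MSE = {mse_te:.3f}")
```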
Validation
• K-fold cross-validation is a method where the dataset is divided
into 'k' equally-sized subsets or folds. The model is trained and
validated 'k' times, each time using a different fold as the
validation set and the remaining folds as the training set.
•Step 1: Split the dataset into 'k' folds (typically, k is chosen as 5
or 10).
•Step 2: For each of the 'k' iterations:
•Training: Use 'k-1' folds for training the model.
•Validation: Use the remaining 1 fold for validating the model.
•Step 3: Calculate the performance metric (such as accuracy,
precision, recall, etc.) for each iteration.
•Step 4: Average the performance metrics from all 'k' iterations to obtain the final performance estimate (a code sketch follows below).
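
The four steps above map directly onto a few lines of scikit-learn. This is a hedged sketch on a synthetic dataset, with k = 5 and accuracy as the performance metric; the estimator and dataset are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)      # Step 1: split into k = 5 folds
scores = []
for train_idx, val_idx in kf.split(X):                    # Step 2: k iterations
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                 # train on the k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))    # Step 3: metric on the held-out fold

print("fold accuracies:", np.round(scores, 3))
print("cross-validated accuracy:", np.mean(scores))       # Step 4: average over the k folds
```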
Error and Noisy Targets
Error
Error in machine learning refers to the difference between the
predicted values by the model and the actual target values
Training Error: The error measured on the training dataset. It
indicates how well the model fits the training data.
Test Error: The error measured on the test dataset. It indicates
how well the model generalizes to unseen data.
Sources of Error:
•Bias: Systematic error due to erroneous assumptions in the
learning algorithm. High bias can lead to underfitting.
•Variance: Error due to the model's sensitivity to small
fluctuations in the training dataset. High variance can lead to
overfitting.
Noisy Targets
Noisy targets refer to inaccuracies or inconsistencies in the target
labels (output values) of the training data. These can arise due to
various factors and can negatively impact model training.
Sources of Noise in Targets:
• Mistakes in the process of measuring or recording the target
values.
• Inaccuracies introduced by human annotators
• Natural variability in the phenomenon being measured, which
might not be fully captured by the features.
Effects of Noisy Targets:
• Reduced Model Accuracy
• Increased Variance: The model might overfit to the noise,
resulting in high variance.
• Misleading Evaluation Metrics
Theory of Generalization
Probability of a “bad event” is less than a
huge number ➔ not useful bound
The Hoeffding inequality becomes:

P[|Ein(g) − Eout(g)| > ε] ≤ 2M e^(−2ε²N)

where M is the number of hypotheses in ℋ → M can be infinite.

The quantity 𝐸out(𝑔) − 𝐸𝑖𝑛 (𝑔) is called the generalization error

To tame this bound, the number of hypotheses M can be replaced by a finite quantity mℋ(N) (called the growth function), which is eventually bounded by a polynomial.
Theory of Generalization
It turns out that the number of hypotheses M can be replaced by a quantity mℋ(N) (called the growth function) which is eventually bounded by a polynomial.

▪ This is due to the fact that many of the M hypotheses overlap heavily → they generate the same “classification dichotomy” on the data points.
Vapnik- Chervonenkis Bound
The VC bound is crucial for understanding the generalization ability of a machine learning model, which is its performance on unseen data.

P[|Ein(g) − Eout(g)| > ε] ≤ 4 mℋ(2N) e^(−(1/8)ε²N)
Vapnik Chervonenkis (VC) Bound

The VC bound provides a probabilistic upper bound on the difference between the true error (the error on the entire distribution of data) and the empirical error (the error on the training data). It helps in understanding how well a model trained on a finite sample is expected to perform on new data.
Choosing a Reduced Value of M
Effective Number of Lines
Dichotomies
Dichotomy refers to the division or separation of a dataset or problem into two
distinct or mutually exclusive classes or groups.
Growth Function
Break Point

2D PERCEPTRON
VC Dimension
VC Dimension
The VC-dimension is a single parameter that characterizes the growth function
Definition
The Vapnik-Chervonenkis dimension of a hypothesis set ℋ is the max number of
points for which the hypothesis can generate all possible classification dichotomies
VC Dimension
N = 3: the maximum number of dichotomies, 2^N = 8, can all be generated by a line.

Shattering: a set of N points is said to be shattered by a hypothesis space H if there are hypotheses h in H that separate the positive examples from the negative examples in each of the 2^N possible ways.

N = 4: the linear model is not able to provide all 2^4 = 16 dichotomies; we would need a nonlinear one.
OBSERVATIONS
If dVC is finite, then mℋ(N) ≤ N^dVC + 1 → this is a polynomial that will eventually be dominated by the decaying exponential → generalization guarantees.
For linear models (the perceptron in d dimensions), dVC = d + 1 → this can be interpreted as the number of parameters of the model.

The VC dimension of a straight-line model in the 2D plane is 3, verified numerically in the sketch below.
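
The claim that the 2D perceptron shatters 3 points but not 4 (so dVC = 3) can be checked numerically. The sketch below, assuming SciPy is available, tests each labeling for linear separability by solving a small feasibility linear program; the two point sets are arbitrary illustrative choices.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(X, y):
    """Linearly separable iff some (w, b) satisfies y_i (w.x_i + b) >= 1: a feasibility LP."""
    n, d = X.shape
    A_ub = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))   # rows: -y_i * [x_i, 1]
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0   # status 0 means a feasible separator was found

def count_dichotomies(X):
    """How many of the 2^N labelings can a linear separator realize on the points X?"""
    return sum(separable(X, np.array(lab)) for lab in product([-1.0, 1.0], repeat=len(X)))

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])               # 3 points in general position
X4 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 4 points (a square)

print(count_dichotomies(X3), "of", 2 ** 3)   # 8 of 8   -> the 3 points are shattered
print(count_dichotomies(X4), "of", 2 ** 4)   # 14 of 16 -> not shattered, so d_VC = 3
```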


M and dvc
Two Questions
Generalization Error
Analysis of the generalization bound: Eout(g) ≤ Ein(g) + Ω(N, ℋ, δ)
Eout − Ein ≤ Ω with probability ≥ 1 − δ

Eout ≤ Ein + Ω

1. Eout(g) ≥ Ein(g) − Ω(N, ℋ, δ) → not of much interest
2. Eout(g) ≤ Ein(g) + Ω(N, ℋ, δ) → bound on the out-of-sample error! ☺

Observations
• Ein(g) is known
• The penalty Ω can be computed if dVC(ℋ) is known and δ is chosen
Generalization Bound

Ω = √( (8/N) ln( 4 mℋ(2N) / δ ) )

Analysis of the generalization bound: Eout(g) ≤ Ein(g) + Ω(N, ℋ, δ)
The optimal model is a compromise between Ein and Ω.
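
Since the growth function is bounded by mℋ(N) ≤ N^dVC + 1, the penalty can be evaluated numerically once dVC and δ are chosen. A minimal sketch (the values dVC = 3 and δ = 0.05 are arbitrary example values), showing how Ω shrinks as N grows:

```python
import numpy as np

def growth_bound(N, dvc):
    """Polynomial upper bound on the growth function: m_H(N) <= N^dvc + 1."""
    return float(N) ** dvc + 1.0

def vc_penalty(N, dvc, delta):
    """Omega(N, H, delta) = sqrt((8/N) * ln(4 * m_H(2N) / delta))."""
    return np.sqrt((8.0 / N) * np.log(4.0 * growth_bound(2 * N, dvc) / delta))

for N in (100, 1_000, 10_000, 100_000):
    print(f"N = {N:6d}   Omega = {vc_penalty(N, dvc=3, delta=0.05):.3f}")
```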

Error vs. VC dimension: as model complexity (VC dimension) increases, the in-sample (training) error decreases while the out-of-sample (test) error first decreases and then rises; the bias term shrinks, the variance (model-complexity) term grows, moving from underfitting to overfitting.

Model complexity ≈ number of model parameters


VC Dimension for a Rectangle Classifier
The break point for an axis-aligned rectangle is k = 5, so dVC = k − 1 = 4.
Connection to REAL learning
In a learning scenario, the function h is not fixed a priori.
• The learning algorithm is used to explore the hypothesis space ℋ, to find the best hypothesis h ∈ ℋ that matches the sampled data → call this hypothesis g.
• With many hypotheses, there is a higher probability of finding a hypothesis g that looks good only by chance → the function can be perfect on the sampled data but bad on unseen data.

There is therefore an approximation - generalization tradeoff between:


• Perform well on the given (training) dataset
• Perform well on unseen data
Approximation vs. Generalization
The ultimate goal is to have a small Eout: a good approximation of f out of sample.

• More complex ℋ ⇒ better chance of approximating f in sample → if ℋ is too simple, we fail to approximate f and we end up with a large Ein.

• Less complex ℋ ⇒ better chance of generalizing out of sample → if ℋ is too complex, we fail to generalize well.

Ideal: ℋ = {f} → the winning lottery ticket ☺


Approximation vs. Generalization
The example shows:

• Perfect fit on the in-sample (training) data → Ein = 0

• Poor fit on the out-of-sample (test) data → Eout is huge
Approximation generalization
Tradeoff
VC analysis → the choice of H should strike a balance between approximating f on the training data and generalizing on new data. Since we do not know the target function, we tend toward a larger model that contains a good hypothesis.

H too simple → we fail to approximate f and Ein is large.

H too large → we fail to generalize well because of model complexity.

The selection of the hypothesis set H has two requirements:

1. H contains some hypothesis that can approximate f.
2. H enables the data to zoom in on the right hypothesis.

In the bias-variance decomposition, we split Eout into bias and variance.


Quantifying the tradeoff
VC analysis was one approach: 𝐸𝑜𝑢𝑡 ≤ 𝐸𝑖𝑛 + Ω

Bias-variance analysis is another: decomposing 𝐸𝑜𝑢𝑡 into:

1. How well ℋ can approximate 𝑓 → Bias


2. How well we can zoom in on a good ℎ ∈ ℋ, using the available data → Variance

It applies to real valued targets and uses squared error

The learning algorithm is not obliged to minimize squared error loss. However, we
measure its produced hypothesis’s bias and variance using squared error
Bias and Variance
• We can create a graphical visualization of bias and variance using a bulls-eye diagram.
• Imagine that the center of the target is a
model that perfectly predicts the correct
values.
• As we move away from the bulls-eye, our
predictions get worse and worse.
• Imagine we can repeat our entire model
building process to get a number of separate
hits on the target.
• Each hit represents an individual realization
of our model, given the chance variability in
the training data we gather.
• Sometimes we will get a good distribution of
training data so we predict very well and we
are close to the bulls-eye, while sometimes
our training data might be full of outliers or
non-standard values resulting in poorer
predictions.
• These different realizations result in a scatter of hits on the target.
(Graphical illustration of bias and variance.)
Bias and Variance
The out-of-sample error is (making explicit the dependence of g on the entire dataset 𝒟):

Eout(g^𝒟) = E_x[ (g^𝒟(x) − f(x))² ]   (mean squared error between the final hypothesis and the target)

The expected out-of-sample error of the learning model is independent of the particular realization of the data set used to find g^𝒟: in simple words, the final hypothesis will be different for different datasets 𝒟, so g depends on 𝒟, and we therefore take the expected value of the error with respect to 𝒟, E_𝒟[ Eout(g^𝒟) ].
Bias and Variance
Focus on E_𝒟[ Eout(g^𝒟) ]. Define the “average” hypothesis ḡ(x) = E_𝒟[ g^𝒟(x) ].

This average hypothesis can be derived by imagining many datasets 𝒟1, 𝒟2, …, 𝒟K and building it as ḡ(x) ≈ (1/K) Σ_{k=1}^{K} g^{𝒟k}(x)

→ this is a conceptual tool, and ḡ does not need to belong to the hypothesis set.
Bias and Variance

Therefore the expected out-of-sample error decomposes as
E_𝒟[ Eout(g^𝒟) ] = E_x[ bias(x) + var(x) ],
where bias(x) = (ḡ(x) − f(x))² and var(x) = E_𝒟[ (g^𝒟(x) − ḡ(x))² ].
Bias and Variance
Interpretation

• The bias term measures how much our learning model is biased away from the target function.
• In fact, ḡ has the benefit of learning from an unlimited number of datasets, so it is only limited in its ability to approximate f by the limitations of the learning model itself.
• The variance term measures the variance in the final hypothesis, depending on the data set, and can be thought of as how much the final chosen hypothesis differs from the “average” hypothesis ḡ.
Bias and Variance
Very small model. Since there is only one hypothesis, both the average hypothesis ḡ and the final hypothesis g^𝒟 will be the same, for any dataset. Thus, var = 0. The bias will depend solely on how well this single hypothesis approximates the target f, and unless we are extremely lucky, we expect a large bias.

Very large model. The target function is in ℋ. Different data sets will lead to different hypotheses that agree with f on the data set and are spread around f in the red region. Thus, bias ≈ 0 because ḡ is likely to be close to f. The var is large (heuristically represented by the size of the red region).
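
These two regimes can be made quantitative with a small Monte Carlo experiment in the spirit of the sinusoid example from Learning from Data: repeatedly draw tiny datasets, fit a line to each, and measure how far the average hypothesis ḡ is from f (bias) and how much the individual fits scatter around ḡ (variance). The target f(x) = sin(πx) and the 2-point datasets are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)      # target function (assumed for illustration)

x_grid = np.linspace(-1, 1, 201)     # grid over which errors are averaged
n_datasets, n_points = 10_000, 2     # many datasets D, each with only 2 samples

preds = np.empty((n_datasets, x_grid.size))
for k in range(n_datasets):
    x = rng.uniform(-1, 1, n_points)
    a, b = np.polyfit(x, f(x), deg=1)      # hypothesis g^D: the line fitted to this dataset
    preds[k] = a * x_grid + b

g_bar = preds.mean(axis=0)                               # the "average" hypothesis g-bar
bias = np.mean((g_bar - f(x_grid)) ** 2)                 # how far g-bar is from f
var  = np.mean(((preds - g_bar) ** 2).mean(axis=0))      # spread of g^D around g-bar

print(f"bias = {bias:.2f}   var = {var:.2f}   bias + var = {bias + var:.2f}")
```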
Learning Curves
How is it possible to know whether a model is suffering from bias or variance problems? The learning curves provide a graphical representation for assessing this, by plotting:

• the expected out-of-sample error E_𝒟[ Eout(g^𝒟) ]
• the expected in-sample error E_𝒟[ Ein(g^𝒟) ]

with respect to the number of data points N.

In practice, the curves are computed from one dataset, or by dividing it into more parts and taking the mean curve resulting from the various sub-datasets.
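
scikit-learn's learning_curve utility automates exactly this procedure (repeated splits, averaged errors at increasing training sizes). A hedged sketch on a synthetic regression problem; the estimator and dataset are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=2_000, n_features=20, noise=10.0, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
    scoring="neg_mean_squared_error")

E_in  = -train_scores.mean(axis=1)   # average in-sample error at each N
E_out = -val_scores.mean(axis=1)     # cross-validated estimate of the out-of-sample error

for n, ein, eout in zip(sizes, E_in, E_out):
    print(f"N = {n:5d}   E_in = {ein:8.1f}   E_out = {eout:8.1f}")
```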
Learning Curves
Expected error vs. number of data points N: simple model vs. complex model.
Learning Curves
Interpretation
• Bias can be present when the expected error is quite high and 𝐸𝑖𝑛 is similar to
𝐸𝑜𝑢𝑡
• When bias is present, getting more data is not likely to help
• Variance can be present when there is a gap between 𝐸𝑖𝑛 and 𝐸𝑜𝑢𝑡
• When variance is present, getting more data is likely to help

Fixing bias:
• Try adding more features
• Try polynomial features
• Try a more complex model
• Boosting

Fixing variance:
• Try a smaller set of features
• Get more training examples
• Regularization
• Bagging
Learning Curves: VC vs. Bias-Variance Analysis
References
1. Provost, Foster, and Tom Fawcett. “Data Science for Business: What you need to know about data mining
and data-analytic thinking”. O'Reilly Media, Inc., 2013.
2. Brynjolfsson, E., Hitt, L. M., and Kim, H. H. “Strength in numbers: How does data driven decision making affect firm performance?” Tech. rep., available at SSRN: http://ssrn.com/abstract=1819486, 2011.
3. Pyle, D. “Data Preparation for Data Mining”. Morgan Kaufmann, 1999.
4. Kohavi, R., and Longbotham, R. “Online experiments: Lessons learned”. Computer, 40 (9), 103–105, 2007.
5. Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. ”Learning from data”. AMLBook, 2012.
6. Andrew Ng. ”Machine learning”. Coursera MOOC. (https://www.coursera.org/learn/machine-learning)
7. Domingos, Pedro. “The Master Algorithm”. Penguin Books, 2016.
8. Christopher M. Bishop, “Pattern recognition and machine learning”, Springer-Verlag New York, 2006.
