
UCK358E – INTR. TO ARTIFICIAL INTELLIGENCE
SPRING ‘24

LECTURE 2
MACHINE LEARNING BASICS

Instructor: Barış Başpınar


Learning Algorithms: Tasks

 A learning algorithm is an algorithm that learns to perform a task from data.


 A learning problem usually has three components: Task, Performance and Experience
 Here are some common learning tasks
 Classification: Find a function that maps input data to one of k categories
 Example: Object recognition
 Classification with missing inputs: Find a function or a collection of functions that maps input data
to one of k categories, where the input vector might have missing values
 Example: Medical diagnosis
 Regression: Find a function that predicts a continuous output
 Example: Predict the expected claim amount an insured person will make
 Transcription: Observe some unstructured data and transcribe it into discrete, textual form
 Example: Optical character recognition, speech recognition
Learning Algorithms: Tasks

 Here are more learning tasks


 Machine Translation: Convert from one language to another
 Example: Translation for natural languages
 Structured Output: Any task where the output is a vector (or other structure) with important relationships among its elements
 Example: Parsing a sentence into nouns, verbs, etc.; image captioning
 Anomaly Detection: Flag unusual or atypical objects
 Example: Spam filtering, credit card fraud detection
 Data Completion: Estimate missing values in data
 Example: Product recommendation
 Denoising: Recover a clean example from a corrupt one
 Example: Image and video denoising
 Density Estimation: Estimate the probability distribution 𝑝(𝒙) that generated the data
 Example: Estimate the structure of a PGM
Learning Algorithms: Performance

 We need a quantitative measure 𝑃 to measure how well our algorithm is performing on task 𝑇
 For classification and translation tasks, we measure the performance by accuracy (see the small sketch below)
 Simply divide the number of correctly classified examples by the total number of examples

 Performance is evaluated on test data, data that has never been seen by the algorithm before
 This set is usually different from the training data

 Sometimes selecting the correct performance criteria can be a complicated task itself
 For translation, should we measure the accuracy on word by word basis or a sentence by sentence
basis?
 For regression, should we penalize frequent medium-sized mistakes more heavily, or rare large mistakes?
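
As a tiny illustration of the accuracy measure, a minimal Python sketch; the prediction and label arrays below are made up for illustration, not from the lecture:

    import numpy as np

    y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])   # ground-truth labels of the test set
    y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])   # model predictions on the test set

    accuracy = np.mean(y_pred == y_true)           # correctly classified / all examples
    print(accuracy)                                # prints 0.75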
Learning Algorithms: Experience

 Many learning algorithms are categorized as supervised or unsupervised based on the kind of data they experience during training
 Roughly speaking, unsupervised algorithms try to determine 𝑝(𝒙) from 𝒙
 Whereas supervised algorithms try to determine 𝑝( 𝒚 ∣ 𝒙 ) from (𝒙, 𝒚)

 However, sometimes the lines between the two categories get blurry
 In semi-supervised learning, for instance, the data is only partially labelled
 Unsupervised algorithms can also be used for supervised learning
 Simply estimate 𝑝( 𝒙, 𝒚 ) and then condition on 𝒙 to obtain 𝑝( 𝒚 ∣ 𝒙 )

 And vice versa:


 Use the chain rule $p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$ to turn an unsupervised problem into a collection of 𝑛 supervised problems
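
As a small worked instance of this factorization (added here for illustration), take 𝑛 = 3:

    $p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)$

so density estimation over three variables becomes three supervised problems: model $x_1$ on its own, predict $x_2$ from $x_1$, and predict $x_3$ from $(x_1, x_2)$.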
Learning Algorithms: Linear Regression Example

 Let’s use a linear regression example to clarify these subjects


 In linear regression, our model is $\hat{y} = \mathbf{w}^{T} \mathbf{x}$

 where 𝒘 ∈ ℝⁿ is a weight vector. Hence the task is to find a 𝒘 that does well according to some performance measure.

 A popular performance measure for these types of tasks is the mean squared error on the test set:

   $\mathrm{MSE}_{\mathrm{test}} = \frac{1}{m} \sum_{i} \left( \hat{y}^{(\mathrm{test})}_{i} - y^{(\mathrm{test})}_{i} \right)^{2}$

 The experience is gained by observing the training dataset $(\mathbf{X}^{(\mathrm{train})}, \mathbf{y}^{(\mathrm{train})})$

 It can be shown that the optimal solution is given by the normal equations:

   $\mathbf{w} = \left( \mathbf{X}^{(\mathrm{train})\,T} \mathbf{X}^{(\mathrm{train})} \right)^{-1} \mathbf{X}^{(\mathrm{train})\,T} \mathbf{y}^{(\mathrm{train})}$
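
A minimal NumPy sketch of this example (the synthetic data, noise level, and variable names are illustrative, not from the lecture):

    import numpy as np

    # Synthetic training data: y = 3*x1 - 2*x2 + noise (illustrative only)
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 2))
    y_train = X_train @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

    # Normal equations: w = (X^T X)^{-1} X^T y
    w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

    # Evaluate the mean squared error on held-out test data
    X_test = rng.normal(size=(50, 2))
    y_test = X_test @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=50)
    mse_test = np.mean((X_test @ w - y_test) ** 2)
    print(w, mse_test)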
Capacity, Overfitting and Underfitting: Generalization

 A central challenge in machine learning is to perform well on new, previously unseen inputs.
 This property is known as generalization. The error on the test data is referred to as generalization
error

 But how can we say something about the test data, if all we see is training data?
 If these datasets were generated arbitrarily, we can’t...
 However, if they come from the same distribution 𝑝(𝒙, 𝑦), then we can say something! For instance, their means should be equal

 In general, we want to find a 𝒘 that:

I. Makes the training error small
II. Makes the gap between training and test error small

 Unfortunately, we rarely do both well at the same time


 Failing on I is known as underfitting: our model’s capacity is not sufficient to capture 𝑝(𝒙, 𝑦)
 Succeeding on I but failing on II is known as overfitting: our model’s capacity is so high that it becomes overtuned to work only on the training data
Capacity, Overfitting and Underfitting: Capacity
Capacity, Overfitting and Underfitting: Regularization

 Regularization is any modification we make to the learning algorithm to reduce its generalization error.
 The most common form of regularization is to modify the loss function so that large parameter values are penalized (see the weight decay sketch below)

 There are other forms of regularization, for instance ‘dropout‘ (randomly dropping some of the
weights in your model) is very popular in deep learning
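
A concrete and standard instance of penalizing large parameters is weight decay, which adds the squared norm of the weights to the training objective (the formula below is the usual textbook form; λ is a hyperparameter chosen by the user):

    $J(\mathbf{w}) = \mathrm{MSE}_{\mathrm{train}} + \lambda\, \mathbf{w}^{T} \mathbf{w}$

Larger λ pushes the weights toward zero and lowers the effective capacity; λ = 0 recovers plain linear regression.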
Hyperparameters and Validation Sets

 Hyperparameters are the parameters of our model that control its capacity, as well as the behaviour of the learning algorithm
 Examples: degree of a polynomial, number of layers in a neural net, structure of a PGM...
 These are usually set by domain experts

 Can we ’learn’ those parameters as well?


 We can try optimizing them, this is called hyperparameter optimization

 However, optimizing hyperparameters on the whole training data does not make sense
 The optimization procedure would simply select the hyperparameters that result in the highest capacity (i.e.
overfitting)

 Similar to how we separated the whole data into test and training, we can further divide the
training data into two disjoint subsets
 The smaller part that is used for learning the hyperparameters is called the validation set
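
A minimal sketch of this workflow for a 1-D regression problem, where the hyperparameter is the polynomial degree (the data, split sizes, and candidate degrees are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, size=200)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=200)

    # Split the training data into a training subset and a validation subset
    x_tr, y_tr = x[:150], y[:150]
    x_val, y_val = x[150:], y[150:]

    best_degree, best_mse = None, np.inf
    for degree in [1, 2, 3, 5, 9, 15]:           # candidate hyperparameter values
        coeffs = np.polyfit(x_tr, y_tr, degree)  # fit on the training subset
        mse_val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
        if mse_val < best_mse:                   # keep the degree with lowest validation error
            best_degree, best_mse = degree, mse_val

    print("selected degree:", best_degree)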
Estimators, Bias and Variance: Estimation

 How can we quantify the concept of overfitting/underfitting?


 Classical statistics and estimation theory can help with that
 We will take the frequentist view for the rest of this section

 Point estimation is an attempt to provide a single best estimate for a quantity of interest. We denote the point estimate of 𝜽 as $\hat{\boldsymbol{\theta}}$

 Let $\{x^{(1)}, \ldots, x^{(m)}\}$ be a set of 𝑚 i.i.d. data points. A point estimator or statistic is any function of the data: $\hat{\boldsymbol{\theta}}_{m} = g\left( x^{(1)}, \ldots, x^{(m)} \right)$

 $\{x^{(1)}, \ldots, x^{(m)}\}$ is generated randomly, hence $\hat{\boldsymbol{\theta}}_{m}$ is a random variable, even though 𝜽 is not
 We are interested in analyzing the relationship between $\hat{\boldsymbol{\theta}}_{m}$ and 𝜽, both for finite 𝑚 and as 𝑚 → ∞
Estimators, Bias and Variance: Bias

 The bias of an estimator is defined as $\mathrm{Bias}(\hat{\boldsymbol{\theta}}_{m}) = \mathbb{E}[\hat{\boldsymbol{\theta}}_{m}] - \boldsymbol{\theta}$

 The expectation is taken over the data


 An estimator is unbiased if $\mathrm{Bias}(\hat{\boldsymbol{\theta}}_{m}) = 0$ and asymptotically unbiased if $\lim_{m \to \infty} \mathrm{Bias}(\hat{\boldsymbol{\theta}}_{m}) = 0$

 Let $x^{(i)}$ be generated from a Bernoulli distribution with mean $\theta$. Then the estimator $\hat{\theta}_{m} = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$ is unbiased
 The same estimator is also unbiased when the $x^{(i)}$ are generated from a Gaussian distribution with mean $\mu$
 However, the variance estimator $\hat{\sigma}^{2}_{m} = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_{m} \right)^{2}$ is only asymptotically unbiased. We can make it unbiased by setting $\hat{\sigma}^{2}_{m} = \frac{1}{m-1} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_{m} \right)^{2}$ (see the simulation below)

 It might seem like unbiased estimators are always preferable, but this is not necessarily true; we will soon see that biased estimators can possess some desirable properties
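
A small simulation, added for illustration, comparing the 1/𝑚 and 1/(𝑚−1) variance estimators on Gaussian samples (sample size, number of trials, and the true variance are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    m, trials, true_var = 5, 100_000, 4.0

    biased, unbiased = [], []
    for _ in range(trials):
        x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=m)
        mu_hat = x.mean()
        biased.append(np.sum((x - mu_hat) ** 2) / m)          # 1/m estimator
        unbiased.append(np.sum((x - mu_hat) ** 2) / (m - 1))  # 1/(m-1) estimator

    # The 1/m estimator averages to roughly (m-1)/m * true_var = 3.2,
    # while the 1/(m-1) estimator averages to roughly 4.0
    print(np.mean(biased), np.mean(unbiased))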
Estimators, Bias and Variance: Variance and Standard Error

 Another interesting property of an estimator is how much it varies as a function of the data
sample. This is called variance

 Low variance estimators give similar outputs regardless of the sampled data. Variance is an excellent measure for quantifying the uncertainty in our estimations.
 For instance, for Gaussian distributions we can say that the true mean falls into the interval
   $\left( \hat{\mu}_{m} - 1.96\,\mathrm{SE}(\hat{\mu}_{m}),\ \hat{\mu}_{m} + 1.96\,\mathrm{SE}(\hat{\mu}_{m}) \right)$
with 95% probability, where $\mathrm{SE}(\hat{\mu}_{m})$ is the standard error of the estimate (see the computation below).

 Hence ideally, we want both low bias and low variance.


 For the Bernoulli distribution, the variance of the estimator is $\mathrm{Var}(\hat{\theta}_{m}) = \frac{1}{m}\, \theta (1 - \theta)$
 The variance decreases as 𝑚 increases; this is a common property of popular estimators
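
A short sketch of the standard error and the 95% interval above, assuming i.i.d. Gaussian samples (the sample size and distribution parameters are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(loc=10.0, scale=2.0, size=400)     # m = 400 samples

    mu_hat = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))              # standard error of the mean estimate
    low, high = mu_hat - 1.96 * se, mu_hat + 1.96 * se
    print(f"mean estimate {mu_hat:.3f}, 95% interval ({low:.3f}, {high:.3f})")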
Estimators, Bias and Variance: Bias-Variance Decomposition

 Bias and variance are two different sources of error in an estimator


 Bias measures the expected deviation from the true value.
 Variance measures deviation from the expected estimator value among different subsets of data.

 Should we choose a model with low bias or low variance?


 The answer is given by the key idea that the mean squared error of an estimator can be decomposed as $\mathrm{MSE} = \mathbb{E}\left[ (\hat{\theta}_{m} - \theta)^{2} \right] = \mathrm{Bias}(\hat{\theta}_{m})^{2} + \mathrm{Var}(\hat{\theta}_{m})$ (see the derivation below)

 Hence for a fixed MSE, there is a trade-off between variance and bias
 In general, increasing capacity increases the variance and decreases the bias.
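
For completeness, the decomposition follows by adding and subtracting $\mathbb{E}[\hat{\theta}_{m}]$ inside the square (a standard derivation, not shown on the slide):

    $\mathbb{E}\left[ (\hat{\theta}_{m} - \theta)^{2} \right]
      = \mathbb{E}\left[ \left( \hat{\theta}_{m} - \mathbb{E}[\hat{\theta}_{m}] \right)^{2} \right]
        + \left( \mathbb{E}[\hat{\theta}_{m}] - \theta \right)^{2}
      = \mathrm{Var}(\hat{\theta}_{m}) + \mathrm{Bias}(\hat{\theta}_{m})^{2}$

where the cross term $2\, \mathbb{E}\left[ \hat{\theta}_{m} - \mathbb{E}[\hat{\theta}_{m}] \right] \left( \mathbb{E}[\hat{\theta}_{m}] - \theta \right)$ vanishes because $\mathbb{E}\left[ \hat{\theta}_{m} - \mathbb{E}[\hat{\theta}_{m}] \right] = 0$.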
Estimators, Bias and Variance: Bias-Variance Trade-off
Supervised Learning Algorithms: Probabilistic Supervised Learning

 Supervised learning is mainly about estimating the probability distribution 𝑝(𝑦 ∣ 𝒙)
 To use MLE for this job, we first define a family of probability distributions 𝑝(𝑦 ∣ 𝒙; 𝜽)

 The linear regression problem can be expressed by the family $p(y \mid \mathbf{x}; \boldsymbol{\theta}) = \mathcal{N}\!\left( y;\ \boldsymbol{\theta}^{T}\mathbf{x},\ \sigma^{2} \right)$, with a fixed noise variance $\sigma^{2}$

 It can be shown that in this case 𝜃𝑀𝐿 has a closed form expression, which is the same as least
squares solution!

 For binary classification, we can use the logistic sigmoid function: $p(y = 1 \mid \mathbf{x}; \boldsymbol{\theta}) = \sigma(\boldsymbol{\theta}^{T}\mathbf{x})$, where $\sigma(z) = 1 / (1 + e^{-z})$

 No closed form solution exists in this case; we need to use numerical optimization methods (see the sketch below).
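
A minimal sketch of such a numerical approach, fitting the logistic sigmoid model by gradient ascent on the log-likelihood (the data, learning rate, and step count are illustrative choices, not from the lecture):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic(X, y, lr=0.1, steps=2000):
        # Maximum likelihood by gradient ascent on the log-likelihood
        theta = np.zeros(X.shape[1])
        for _ in range(steps):
            grad = X.T @ (y - sigmoid(X @ theta))   # gradient of the log-likelihood
            theta += lr * grad / len(y)
        return theta

    # Toy binary classification data (illustrative only)
    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 2))
    y = (X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)
    print(fit_logistic(X, y))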
Supervised Learning Algorithms: Parametric Supervised Learning

 Not all supervised learning algorithms are probabilistic. We can simply define a parametric
family of deterministic models 𝑦 = 𝑓(𝒙; 𝒘) and use some loss function to find an optimal 𝒘∗
 One of the most famous approaches to this problem is the Support Vector Machines (SVMs)
 For binary classification, we let $f(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} + b$
 If 𝑓(𝒙) > 0 we predict that the sample belongs to class 1, and vice versa

 SVM is useful for two reasons


 Solving for the optimal 𝒘 is a convex optimization problem
 It can be shown that the decision function can be rewritten as $f(\mathbf{x}) = b + \sum_{i} \alpha_{i}\, \mathbf{x}^{T} \mathbf{x}^{(i)}$, where the $\alpha_{i}$ are coefficients associated with the training examples $\mathbf{x}^{(i)}$

 The second part is the key result: we can replace $\mathbf{x}^{T} \mathbf{x}^{(i)}$ by any inner product and we can still use the same algorithm!
 This is how we generalize to nonlinear classifiers: $f(\mathbf{x}) = b + \sum_{i} \alpha_{i}\, k(\mathbf{x}, \mathbf{x}^{(i)})$
 The function 𝑘 is called a kernel
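
A sketch of the kernelized decision function, assuming the coefficients α and the offset b have already been obtained from an SVM solver (the support vectors and coefficient values below are placeholders, not real solver output):

    import numpy as np

    def rbf_kernel(x, z, gamma=1.0):
        # Gaussian (RBF) kernel: an inner product in an implicit feature space
        return np.exp(-gamma * np.sum((x - z) ** 2))

    def decision_function(x, support_vectors, alphas, b, kernel=rbf_kernel):
        # f(x) = b + sum_i alpha_i * k(x, x_i); predict class 1 if f(x) > 0
        return b + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))

    # Placeholder support vectors / coefficients, as if returned by a solver
    support_vectors = np.array([[0.0, 1.0], [1.0, -1.0]])
    alphas = np.array([0.7, -0.4])
    b = 0.1
    print(decision_function(np.array([0.5, 0.5]), support_vectors, alphas, b))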
Supervised Learning Algorithms: Nonparametric Supervised Learning

 For most supervised learning algorithms we need to fix the form and number of parameters in
the model beforehand.
 Can we also learn those from the data?
 Models that allow this are called nonparametric; in some sense, they let the data speak for itself

 One of the simplest nonparametric classifiers is the k-nearest neighbors method


 Fix a positive integer k (yes even nonparametric models have parameters)
 For a new point, find the nearest k samples from the dataset, compute the average of their classes/outputs and simply predict this value for the new point (see the sketch below)

 There are other popular nonparametric classifiers as well: decision trees, random forests, etc.
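
A minimal k-nearest neighbors sketch using Euclidean distance (the toy data and the choice of k are illustrative):

    import numpy as np

    def knn_predict(X_train, y_train, x_new, k=3):
        # Distances from the new point to every training sample
        dists = np.linalg.norm(X_train - x_new, axis=1)
        nearest = np.argsort(dists)[:k]          # indices of the k closest samples
        return y_train[nearest].mean()           # average of their outputs/labels

    # Toy data: two clusters labelled 0 and 1 (illustrative only)
    rng = np.random.default_rng(5)
    X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
    y_train = np.concatenate([np.zeros(50), np.ones(50)])
    print(knn_predict(X_train, y_train, np.array([3.5, 3.5]), k=5))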
Unsupervised Learning

 There are unsupervised learning algorithms for dimensionality reduction, e.g. PCA
 PCA takes unlabeled and difficult to interpret data and transforms it into a much more manageable
form

 Another popular form of unsupervised learning is clustering


 How can we segment the data in a meaningful way?
 k-means clustering is a very popular algorithm for this job (see the sketch below).

 Generating association rules is another task that can be solved via unsupervised learning
 Association rule mining finds relationships between variables in large databases, e.g. items that are frequently purchased together
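
A minimal k-means sketch (the initialization scheme, toy data, and number of clusters are illustrative assumptions, not from the lecture):

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]   # initialize with random data points
        for _ in range(iters):
            # Assignment step: each point goes to its nearest center
            labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
            # Update step: each center moves to the mean of its assigned points
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels

    # Toy data with two well-separated clusters (illustrative only)
    rng = np.random.default_rng(6)
    X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
    centers, labels = kmeans(X, k=2)
    print(centers)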
References

 I. Goodfellow, Y. Bengio and A. Courville, “Deep Learning”, 2016.
