
UCK358E – INTR. TO ARTIFICIAL INTELLIGENCE
SPRING ‘24

LECTURE 2
MACHINE LEARNING BASICS

Instructor: Barış Başpınar


Learning Algorithms: Tasks

 A learning algorithm is an algorithm that learns to perform a task from data.


 A learning problem usually has three components: Task, Performance and Experience
 Here are some common learning tasks
 Classification: Find a function that maps input data to one of k categories
 Example: Object recognition
 Classification with missing inputs: Find a function or a collection of functions that maps input data
to one of k categories, where the input vector might have missing values
 Example: Medical diagnosis
 Regression: Find a function that predicts a continuous output
 Example: Predict the expected claim amount an insured person will make
 Transcription: Observe some unstructured data and transcribe it into discrete, textual form
 Example: Optical character recognition, speech recognition
Learning Algorithms: Tasks

 Here are more learning tasks


 Machine Translation: Convert from one language to another
 Example: Translation for natural languages
 Structured Output: Any task where the output is a vector (or other structure) with important relationships among its elements
 Example: Parsing a sentence into nouns, verbs, etc.; image captioning
 Anomaly Detection: Flag unusual or atypical objects
 Example: Spam filtering, credit card fraud detection
 Data Completion: Estimate missing values in data
 Example: Product recommendation
 Denoising: Recover a clean example from a corrupt one
 Example: Image and video denoising
 Density Estimation: Estimate the probability distribution 𝑝(𝒙) that generated the data
 Example: Estimate the structure of a PGM
Learning Algorithms: Performance

 We need a quantitative measure 𝑃 to measure how well our algorithm is performing on task 𝑇
 For classification and translation tasks, we measure the performance by accuracy (see the small sketch below)
 Simply divide the number of correctly classified examples by the total number of examples

 Performance is evaluated on test data, data that has never been seen by the algorithm before
 This set is usually different from the training data

 Sometimes selecting the correct performance criteria can be a complicated task itself
 For translation, should we measure the accuracy on word by word basis or a sentence by sentence
basis?
 For regression, should we penalize frequent medium-sized mistakes more heavily, or rare large mistakes?
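
As a tiny illustration of the accuracy measure, a minimal Python sketch; the prediction and label arrays below are made up for illustration, not from the lecture:

    import numpy as np

    y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])   # ground-truth labels of the test set
    y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])   # model predictions on the test set

    accuracy = np.mean(y_pred == y_true)           # correctly classified / all examples
    print(accuracy)                                # prints 0.75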
Learning Algorithms: Experience

 Many learning algorithms are categorized as supervised or unsupervised based on the kind of data they experience during training
 Roughly speaking, unsupervised algorithms try to determine 𝑝(𝒙) from 𝒙
 Whereas supervised algorithms try to determine 𝑝( 𝒚 ∣ 𝒙 ) from (𝒙, 𝒚)

 However, sometimes the lines between the two categories get blurry
 In semi-supervised learning, for instance, the data is only partially labelled
 Unsupervised algorithms can also be used for supervised learning
 Simply estimate 𝑝( 𝒙, 𝒚 ) and then condition on 𝒙 to obtain 𝑝( 𝒚 ∣ 𝒙 )

 And vice versa:


 Use the chain rule $p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$ to turn an unsupervised problem into a collection of 𝑛 supervised problems
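
As a small worked instance of this factorization (added here for illustration), take 𝑛 = 3:

    $p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)$

so density estimation over three variables becomes three supervised problems: model $x_1$ on its own, predict $x_2$ from $x_1$, and predict $x_3$ from $(x_1, x_2)$.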
Learning Algorithms: Linear Regression Example

 Let’s use a linear regression example to clarify these subjects


 In linear regression, our model is $\hat{y} = \mathbf{w}^{T} \mathbf{x}$

 where 𝒘 ∈ ℝⁿ is a weight vector. Hence the task is to find a 𝒘 that does well according to some performance measure.

 A popular performance measure for these types of tasks is the mean squared error on the test set:

   $\mathrm{MSE}_{\mathrm{test}} = \frac{1}{m} \sum_{i} \left( \hat{y}^{(\mathrm{test})}_{i} - y^{(\mathrm{test})}_{i} \right)^{2}$

 The experience is gained by observing the training dataset $(\mathbf{X}^{(\mathrm{train})}, \mathbf{y}^{(\mathrm{train})})$

 It can be shown that the optimal solution is given by the normal equations:

   $\mathbf{w} = \left( \mathbf{X}^{(\mathrm{train})\,T} \mathbf{X}^{(\mathrm{train})} \right)^{-1} \mathbf{X}^{(\mathrm{train})\,T} \mathbf{y}^{(\mathrm{train})}$
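
A minimal NumPy sketch of this example (the synthetic data, noise level, and variable names are illustrative, not from the lecture):

    import numpy as np

    # Synthetic training data: y = 3*x1 - 2*x2 + noise (illustrative only)
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 2))
    y_train = X_train @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

    # Normal equations: w = (X^T X)^{-1} X^T y
    w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

    # Evaluate the mean squared error on held-out test data
    X_test = rng.normal(size=(50, 2))
    y_test = X_test @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=50)
    mse_test = np.mean((X_test @ w - y_test) ** 2)
    print(w, mse_test)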
Capacity, Overfitting and Underfitting: Generalization

 A central challenge in machine learning is to perform well on new, previously unseen inputs.
 This property is known as generalization. The error on the test data is referred to as generalization
error

 But how can we say something about the test data, if all we see is training data?
 If these datasets were generated arbitrarily, we can’t...
 However, if they come from the same distribution 𝑝(𝒙, 𝑦), then we can say something! For instance, their means should be equal

 In general, we want to find a 𝒘 that:

I. Makes the training error small
II. Makes the gap between training and test error small

 Unfortunately, we rarely do both well at the same time


 Failing on I is known as underfitting: our model’s capacity is not sufficient to capture 𝑝(𝒙, 𝑦)
 Succeeding on I but failing on II is known as overfitting: our model’s capacity is so high that it becomes overtuned to work only on the training data
Capacity, Overfitting and Underfitting: Capacity
Capacity, Overfitting and Underfitting: Regularization

 Regularization is any modification we make to the learning algorithm to reduce its generalization error.
 The most common form of regularization is to modify the loss function so that large parameter values are penalized (see the weight decay sketch below)

 There are other forms of regularization, for instance ‘dropout‘ (randomly dropping some of the
weights in your model) is very popular in deep learning
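
A concrete and standard instance of penalizing large parameters is weight decay, which adds the squared norm of the weights to the training objective (the formula below is the usual textbook form; λ is a hyperparameter chosen by the user):

    $J(\mathbf{w}) = \mathrm{MSE}_{\mathrm{train}} + \lambda\, \mathbf{w}^{T} \mathbf{w}$

Larger λ pushes the weights toward zero and lowers the effective capacity; λ = 0 recovers plain linear regression.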
Hyperparameters and Validation Sets

 Hyperparameters are the parameters of our model that control its capacity, as well as the behaviour of the learning algorithm
 Examples: degree of a polynomial, number of layers in a neural net, structure of a PGM...
 These are usually set by domain experts

 Can we ’learn’ those parameters as well?


 We can try optimizing them, this is called hyperparameter optimization

 However, optimizing hyperparameters on the whole training data does not make sense
 The optimization procedure would simply select the hyperparameters that result in the highest capacity (i.e.
overfitting)

 Similar to how we separated the whole data into test and training, we can further divide the
training data into two disjoint subsets
 The smaller part that is used for learning the hyperparameters is called the validation set
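
A minimal sketch of this workflow for a 1-D regression problem, where the hyperparameter is the polynomial degree (the data, split sizes, and candidate degrees are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, size=200)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=200)

    # Split the training data into a training subset and a validation subset
    x_tr, y_tr = x[:150], y[:150]
    x_val, y_val = x[150:], y[150:]

    best_degree, best_mse = None, np.inf
    for degree in [1, 2, 3, 5, 9, 15]:           # candidate hyperparameter values
        coeffs = np.polyfit(x_tr, y_tr, degree)  # fit on the training subset
        mse_val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
        if mse_val < best_mse:                   # keep the degree with lowest validation error
            best_degree, best_mse = degree, mse_val

    print("selected degree:", best_degree)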
Estimators, Bias and Variance: Estimation

 How can we quantify the concept of overfitting/underfitting?


 Classical statistics and estimation theory can help with that
 We will take the frequentist view for the rest of this section

 Point estimation is an attempt to provide a single best estimate for a quantity of interest. We denote the point estimate of 𝜽 as $\hat{\boldsymbol{\theta}}$

 Let $\{x^{(1)}, \ldots, x^{(m)}\}$ be a set of 𝑚 i.i.d. data points. A point estimator or statistic is any function of the data: $\hat{\boldsymbol{\theta}}_{m} = g\left( x^{(1)}, \ldots, x^{(m)} \right)$

 $\{x^{(1)}, \ldots, x^{(m)}\}$ is generated randomly, hence $\hat{\boldsymbol{\theta}}_{m}$ is a random variable, even though 𝜽 is not
 We are interested in analyzing the relationship between $\hat{\boldsymbol{\theta}}_{m}$ and 𝜽, both for finite 𝑚 and as 𝑚 → ∞
Estimators, Bias and Variance: Bias

 The bias of an estimator is defined as $\mathrm{Bias}(\hat{\boldsymbol{\theta}}_{m}) = \mathbb{E}[\hat{\boldsymbol{\theta}}_{m}] - \boldsymbol{\theta}$

 The expectation is taken over the data


 An estimator is unbiased if $\mathrm{Bias}(\hat{\boldsymbol{\theta}}_{m}) = 0$ and asymptotically unbiased if $\lim_{m \to \infty} \mathrm{Bias}(\hat{\boldsymbol{\theta}}_{m}) = 0$

 Let $x^{(i)}$ be generated from a Bernoulli distribution with mean $\theta$. Then the estimator $\hat{\theta}_{m} = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$ is unbiased
 The same estimator is also unbiased when the $x^{(i)}$ are generated from a Gaussian distribution with mean $\mu$
 However, the variance estimator $\hat{\sigma}^{2}_{m} = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_{m} \right)^{2}$ is only asymptotically unbiased. We can make it unbiased by setting $\hat{\sigma}^{2}_{m} = \frac{1}{m-1} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_{m} \right)^{2}$ (see the simulation below)

 It might seem like unbiased estimators are always preferable, but this is not necessarily true; we will soon see that biased estimators can possess some desirable properties
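
A small simulation, added for illustration, comparing the 1/𝑚 and 1/(𝑚−1) variance estimators on Gaussian samples (sample size, number of trials, and the true variance are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    m, trials, true_var = 5, 100_000, 4.0

    biased, unbiased = [], []
    for _ in range(trials):
        x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=m)
        mu_hat = x.mean()
        biased.append(np.sum((x - mu_hat) ** 2) / m)          # 1/m estimator
        unbiased.append(np.sum((x - mu_hat) ** 2) / (m - 1))  # 1/(m-1) estimator

    # The 1/m estimator averages to roughly (m-1)/m * true_var = 3.2,
    # while the 1/(m-1) estimator averages to roughly 4.0
    print(np.mean(biased), np.mean(unbiased))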
Estimators, Bias and Variance: Variance and Standard Error

 Another interesting property of an estimator is how much it varies as a function of the data
sample. This is called variance

 Low variance estimators give similar outputs regardless of the sampled data. Variance is an excellent measure for quantifying the uncertainty in our estimations.
 For instance, for Gaussian distributions we can say that the true mean falls into the interval
   $\left( \hat{\mu}_{m} - 1.96\,\mathrm{SE}(\hat{\mu}_{m}),\ \hat{\mu}_{m} + 1.96\,\mathrm{SE}(\hat{\mu}_{m}) \right)$
with 95% probability, where $\mathrm{SE}(\hat{\mu}_{m})$ is the standard error of the estimate (see the computation below).

 Hence ideally, we want both low bias and low variance.


 For the Bernoulli distribution, the variance of the estimator is $\mathrm{Var}(\hat{\theta}_{m}) = \frac{1}{m}\, \theta (1 - \theta)$
 The variance decreases as 𝑚 increases; this is a common property of popular estimators
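
A short sketch of the standard error and the 95% interval above, assuming i.i.d. Gaussian samples (the sample size and distribution parameters are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(loc=10.0, scale=2.0, size=400)     # m = 400 samples

    mu_hat = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))              # standard error of the mean estimate
    low, high = mu_hat - 1.96 * se, mu_hat + 1.96 * se
    print(f"mean estimate {mu_hat:.3f}, 95% interval ({low:.3f}, {high:.3f})")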
Estimators, Bias and Variance: Bias-Variance Decomposition

 Bias and variance are two different sources of error in an estimator


 Bias measures the expected deviation from the true value.
 Variance measures deviation from the expected estimator value among different subsets of data.

 Should we choose a model with low bias or low variance?


 The answer is given by the key idea that the mean squared error of an estimator can be decomposed as $\mathrm{MSE} = \mathbb{E}\left[ (\hat{\theta}_{m} - \theta)^{2} \right] = \mathrm{Bias}(\hat{\theta}_{m})^{2} + \mathrm{Var}(\hat{\theta}_{m})$ (see the derivation below)

 Hence for a fixed MSE, there is a trade-off between variance and bias
 In general, increasing capacity increases the variance and decreases the bias.
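
For completeness, the decomposition follows by adding and subtracting $\mathbb{E}[\hat{\theta}_{m}]$ inside the square (a standard derivation, not shown on the slide):

    $\mathbb{E}\left[ (\hat{\theta}_{m} - \theta)^{2} \right]
      = \mathbb{E}\left[ \left( \hat{\theta}_{m} - \mathbb{E}[\hat{\theta}_{m}] \right)^{2} \right]
        + \left( \mathbb{E}[\hat{\theta}_{m}] - \theta \right)^{2}
      = \mathrm{Var}(\hat{\theta}_{m}) + \mathrm{Bias}(\hat{\theta}_{m})^{2}$

where the cross term $2\, \mathbb{E}\left[ \hat{\theta}_{m} - \mathbb{E}[\hat{\theta}_{m}] \right] \left( \mathbb{E}[\hat{\theta}_{m}] - \theta \right)$ vanishes because $\mathbb{E}\left[ \hat{\theta}_{m} - \mathbb{E}[\hat{\theta}_{m}] \right] = 0$.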
Estimators, Bias and Variance: Bias-Variance Trade-off
Supervised Learning Algorithms: Probabilistic Supervised Learning

 Supervised learning is mainly about estimating the probability distribution 𝑝(𝑦 ∣ 𝒙)
 To use MLE for this job, we first define a family of probability distributions 𝑝(𝑦 ∣ 𝒙; 𝜽)

 The linear regression problem can be expressed by the family $p(y \mid \mathbf{x}; \boldsymbol{\theta}) = \mathcal{N}\!\left( y;\ \boldsymbol{\theta}^{T}\mathbf{x},\ \sigma^{2} \right)$, with a fixed noise variance $\sigma^{2}$

 It can be shown that in this case 𝜃𝑀𝐿 has a closed form expression, which is the same as least
squares solution!

 For binary classification, we can use the logistic sigmoid function: $p(y = 1 \mid \mathbf{x}; \boldsymbol{\theta}) = \sigma(\boldsymbol{\theta}^{T}\mathbf{x})$, where $\sigma(z) = 1 / (1 + e^{-z})$

 No closed form solution exists in this case; we need to use numerical optimization methods (see the sketch below).
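
A minimal sketch of such a numerical approach, fitting the logistic sigmoid model by gradient ascent on the log-likelihood (the data, learning rate, and step count are illustrative choices, not from the lecture):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic(X, y, lr=0.1, steps=2000):
        # Maximum likelihood by gradient ascent on the log-likelihood
        theta = np.zeros(X.shape[1])
        for _ in range(steps):
            grad = X.T @ (y - sigmoid(X @ theta))   # gradient of the log-likelihood
            theta += lr * grad / len(y)
        return theta

    # Toy binary classification data (illustrative only)
    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 2))
    y = (X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)
    print(fit_logistic(X, y))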
Supervised Learning Algorithms: Parametric Supervised Learning

 Not all supervised learning algorithms are probabilistic. We can simply define a parametric
family of deterministic models 𝑦 = 𝑓(𝒙; 𝒘) and use some loss function to find an optimal 𝒘∗
 One of the most famous approaches to this problem is the Support Vector Machines (SVMs)
 For binary classification, we let $f(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} + b$
 If 𝑓(𝒙) > 0 we predict that the sample belongs to class 1, and vice versa

 SVM is useful for two reasons


 Solving for the optimal 𝒘 is a convex optimization problem
 It can be shown that the decision function can be rewritten as $f(\mathbf{x}) = b + \sum_{i} \alpha_{i}\, \mathbf{x}^{T} \mathbf{x}^{(i)}$, where the $\alpha_{i}$ are coefficients associated with the training examples $\mathbf{x}^{(i)}$

 The second part is the key result: we can replace $\mathbf{x}^{T} \mathbf{x}^{(i)}$ by any inner product and we can still use the same algorithm!
 This is how we generalize to nonlinear classifiers: $f(\mathbf{x}) = b + \sum_{i} \alpha_{i}\, k(\mathbf{x}, \mathbf{x}^{(i)})$
 The function 𝑘 is called a kernel
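
A sketch of the kernelized decision function, assuming the coefficients α and the offset b have already been obtained from an SVM solver (the support vectors and coefficient values below are placeholders, not real solver output):

    import numpy as np

    def rbf_kernel(x, z, gamma=1.0):
        # Gaussian (RBF) kernel: an inner product in an implicit feature space
        return np.exp(-gamma * np.sum((x - z) ** 2))

    def decision_function(x, support_vectors, alphas, b, kernel=rbf_kernel):
        # f(x) = b + sum_i alpha_i * k(x, x_i); predict class 1 if f(x) > 0
        return b + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))

    # Placeholder support vectors / coefficients, as if returned by a solver
    support_vectors = np.array([[0.0, 1.0], [1.0, -1.0]])
    alphas = np.array([0.7, -0.4])
    b = 0.1
    print(decision_function(np.array([0.5, 0.5]), support_vectors, alphas, b))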
Supervised Learning Algorithms: Nonparametric Supervised Learning

 For most supervised learning algorithms we need to fix the form and number of parameters in
the model beforehand.
 Can we also learn those from the data?
 Models that allow this are called nonparametric; in some sense, they let the data speak for itself

 One of the simplest nonparametric classifiers is the k-nearest neighbors method


 Fix a positive integer k (yes even nonparametric models have parameters)
 For a new point, find the nearest k samples from the dataset, compute the average of their classes/outputs and simply predict this value for the new point (see the sketch below)

 There are other popular nonparametric classifiers as well: decision trees, random forests, etc.
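
A minimal k-nearest neighbors sketch using Euclidean distance (the toy data and the choice of k are illustrative):

    import numpy as np

    def knn_predict(X_train, y_train, x_new, k=3):
        # Distances from the new point to every training sample
        dists = np.linalg.norm(X_train - x_new, axis=1)
        nearest = np.argsort(dists)[:k]          # indices of the k closest samples
        return y_train[nearest].mean()           # average of their outputs/labels

    # Toy data: two clusters labelled 0 and 1 (illustrative only)
    rng = np.random.default_rng(5)
    X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
    y_train = np.concatenate([np.zeros(50), np.ones(50)])
    print(knn_predict(X_train, y_train, np.array([3.5, 3.5]), k=5))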
Unsupervised Learning

 There are unsupervised learning algorithms for dimensionality reduction, e.g. PCA
 PCA takes unlabeled and difficult to interpret data and transforms it into a much more manageable
form

 Another popular form of unsupervised learning is clustering


 How can we segment the data in a meaningful way?
 k-means clustering is a very popular algorithm for this job (see the sketch below).

 Generating association rules is another task that can be solved via unsupervised learning
 Association rule mining finds relationships between variables in large databases, e.g. items that are frequently purchased together
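
A minimal k-means sketch (the initialization scheme, toy data, and number of clusters are illustrative assumptions, not from the lecture):

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]   # initialize with random data points
        for _ in range(iters):
            # Assignment step: each point goes to its nearest center
            labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
            # Update step: each center moves to the mean of its assigned points
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels

    # Toy data with two well-separated clusters (illustrative only)
    rng = np.random.default_rng(6)
    X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
    centers, labels = kmeans(X, k=2)
    print(centers)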
References

 I. Goodfellow, Y. Bengio and A. Courville, “Deep Learning”, 2016.
