Lecture 2 Ai
We need a quantitative measure 𝑃 of how well our algorithm performs on task 𝑇
For classification and translation tasks, we often measure performance by accuracy:
divide the number of correctly classified examples by the number of all examples
Performance is evaluated on test data, i.e. data that the algorithm has never seen before
This set is kept separate from the training data
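As a minimal sketch of the accuracy measure described above, the function below counts matching predictions on a held-out test set. The labels and predictions are made-up values for illustration only.

```python
def accuracy(predictions, labels):
    """Fraction of examples whose predicted label equals the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Hypothetical held-out test labels and model outputs (made-up values).
test_labels = ["cat", "dog", "dog", "cat", "bird"]
test_predictions = ["cat", "dog", "cat", "cat", "bird"]

# 4 of the 5 predictions match the true labels.
print(accuracy(test_predictions, test_labels))
```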
Sometimes selecting the correct performance criteria can be a complicated task itself
For translation, should we measure accuracy on a word-by-word or a sentence-by-sentence basis?
For regression, should we penalize frequent small or medium mistakes more, or rare large mistakes?
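To make the regression question concrete, here is a small sketch comparing two common error measures. The choice of mean absolute error and mean squared error is an assumption made for illustration; the data are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |target - prediction|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (target - prediction)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0.0] * 10
many_small = [1.0] * 10          # every prediction is off by 1
one_large = [0.0] * 9 + [10.0]   # a single prediction is off by 10

for name, y_pred in (("many small", many_small), ("one large", one_large)):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    print(f"{name}: MAE={mae} MSE={mse}")
```

Both prediction sets have the same mean absolute error, but the squared error is ten times larger for the single large mistake, so the choice of measure decides which kind of mistake is penalized more.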
Learning Algorithms: Experience
Many learning algorithms are categorized as supervised or unsupervised based on the data
that they experience
Roughly speaking, unsupervised algorithms try to determine 𝑝(𝒙) from 𝒙
Whereas supervised algorithms try to determine 𝑝(𝒚 ∣ 𝒙) from (𝒙, 𝒚)
However, sometimes the lines between the two categories get blurry
For example, in semi-supervised learning the data is only partially labelled
Unsupervised algorithms can also be used for supervised learning
Simply estimate 𝑝(𝒙, 𝒚), marginalize over 𝒚 to get 𝑝(𝒙), and divide: 𝑝(𝒚 ∣ 𝒙) = 𝑝(𝒙, 𝒚) / 𝑝(𝒙)
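The idea of turning a joint estimate into a conditional one can be sketched for discrete data as follows. The counting-based estimator and the toy weather data are assumptions for illustration; any estimator of the joint distribution could be used instead.

```python
from collections import Counter


def estimate_joint(pairs):
    """Estimate p(x, y) by relative frequency of each (x, y) pair."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {xy: c / total for xy, c in counts.items()}


def conditional_from_joint(joint, x):
    """Compute p(y | x) = p(x, y) / p(x), with p(x) obtained by summing over y."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    if p_x == 0:
        raise ValueError(f"x={x!r} was never observed")
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}


# Made-up observations of (weather, behaviour).
data = [
    ("rainy", "umbrella"),
    ("rainy", "umbrella"),
    ("rainy", "no umbrella"),
    ("sunny", "no umbrella"),
]
joint = estimate_joint(data)
print(conditional_from_joint(joint, "rainy"))
```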
where 𝒘 ∈ ℝⁿ is a weight vector. Hence the task is to find a 𝒘 that does well according to some
performance measure.
A popular performance measure for this type of task is the mean squared error
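The sketch below computes the mean squared error of a weight vector, assuming the model is the linear predictor ŷ = 𝒘ᵀ𝒙 (the model equation itself is not shown in the text above, so this form is an assumption). The inputs and targets are made up.

```python
def predict(w, x):
    """Linear prediction: the inner product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, inputs, targets):
    """Mean squared error of the linear model w on (inputs, targets)."""
    errors = [(predict(w, x) - y) ** 2 for x, y in zip(inputs, targets)]
    return sum(errors) / len(errors)


# The first feature is fixed to 1.0 so that w[0] acts as an intercept.
inputs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
targets = [1.0, 3.0, 5.0]

print(mse([1.0, 2.0], inputs, targets))  # this w reproduces every target
print(mse([0.0, 2.0], inputs, targets))  # this w is off by 1 on every example
```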
A central challenge in machine learning is to perform well on new, previously unseen inputs.
This property is known as generalization. The error on the test data is referred to as the generalization error
But how can we say something about the test data, if all we see is training data?
If these datasets were generated arbitrarily, we can’t...
However, if they come from the same distribution 𝑝(𝒙, 𝑦), then we can say something! For
instance, for a fixed model the expected training error and the expected test error should be equal
There are other forms of regularization; for instance, 'dropout' (randomly dropping some of the
units in your model during training) is very popular in deep learning
Hyperparameters and Validation Sets
Hyperparameters are the parameters of our model that control the capacity, as well as the
behaviour of the algorithm
Examples: degree of a polynomial, number of layers in a neural net, structure of a PGM...
These are usually set by domain experts, but they can also be learned from data
However, using the whole training data for this does not make sense
The optimization procedure would simply select the parameters that result in highest capacity (i.e.
Similar to how we separated the whole data into test and training, we can further divide the
training data into two disjoint subsets
The smaller part that is used for learning the hyperparameters is called the validation set
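A minimal sketch of the split described above. The 20% validation fraction, the fixed seed, and the function name are assumptions chosen for illustration.

```python
import random


def train_validation_split(examples, validation_fraction=0.2, seed=0):
    """Shuffle the training data and split it into two disjoint subsets."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    # The smaller part is the validation set, the rest is used for fitting.
    return shuffled[n_val:], shuffled[:n_val]


training_data = list(range(10))  # stand-in for ten training examples
train_part, validation_part = train_validation_split(training_data)
print(len(train_part), len(validation_part))
```

Each candidate hyperparameter setting would then be fitted on `train_part` and compared on `validation_part`, with the test data left untouched until the very end.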
Estimators, Bias and Variance: Estimation
Point estimation is an attempt to provide a single best estimate for a quantity of interest. We
denote the point estimate of 𝜽 as 𝜽̂
Let $\{x^{(1)}, \dots, x^{(m)}\}$ be a set of 𝑚 i.i.d. data points. A point estimator or statistic is any function
of the data: $\hat{\boldsymbol{\theta}}_m = g(x^{(1)}, \dots, x^{(m)})$
Let $x^{(i)}$ be generated from a Bernoulli distribution with mean 𝜃. Then the estimator
$\hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$ is unbiased
The same estimator is also unbiased when 𝑥 (𝑖) are generated from a Gaussian distribution with
mean 𝜇
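However, the estimator $\hat{\sigma}^2_m = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \hat{\mu}_m)^2$ is only asymptotically unbiased. We can make it
unbiased by setting $\hat{\sigma}^2_m = \frac{1}{m-1} \sum_{i=1}^{m} (x^{(i)} - \hat{\mu}_m)^2$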
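The two variance estimators differ only in the normalizing factor, which the following sketch makes explicit. The data values are made up; the results are cross-checked against Python's `statistics` module, where `pvariance` divides by 𝑚 and `variance` divides by 𝑚 − 1.

```python
import statistics


def sample_mean(xs):
    """The estimator of the mean: the average of the data points."""
    return sum(xs) / len(xs)


def variance_biased(xs):
    """Divide by m: biased for finite m, asymptotically unbiased."""
    mu = sample_mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def variance_unbiased(xs):
    """Divide by m - 1: unbiased for i.i.d. data."""
    mu = sample_mean(xs)
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)


xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance_biased(xs), statistics.pvariance(xs))
print(variance_unbiased(xs), statistics.variance(xs))
```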
It might seem like unbiased estimators are always preferable, but this is not necessarily true: we will
soon see that biased estimators possess some interesting properties
Estimators, Bias and Variance: Variance and Standard Error
Another interesting property of an estimator is how much it varies as a function of the data
sample. This is called variance
Low variance estimators give similar outputs regardless of the sampled data. Variance is an
excellent measure for quantifying the uncertainty in our estimations.
For instance, for Gaussian distributions we can say that our estimate of the mean 𝜇̂ falls into the
interval of
Hence for a fixed MSE, there is a trade-off between variance and bias
In general, increasing capacity increases the variance and decreases the bias.
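The trade-off rests on the standard decomposition MSE = Bias² + Variance. The simulation below is a sketch that checks this identity numerically for the two variance estimators (dividing by 𝑚 and by 𝑚 − 1). The Gaussian data, the sample size, the number of trials, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics


def biased_variance(xs):
    """Sample variance with a 1/m factor (biased for finite m)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def decompose(estimates, true_value):
    """Return (bias, variance, mse) of a list of estimates of true_value."""
    mean_estimate = sum(estimates) / len(estimates)
    bias = mean_estimate - true_value
    variance = statistics.pvariance(estimates)
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return bias, variance, mse


rng = random.Random(0)
true_variance = 4.0  # data are drawn from a Gaussian with standard deviation 2
m = 5
datasets = [[rng.gauss(0.0, 2.0) for _ in range(m)] for _ in range(2000)]

biased = [biased_variance(d) for d in datasets]
unbiased = [v * m / (m - 1) for v in biased]  # rescale 1/m to 1/(m-1)

for name, estimates in (("1/m", biased), ("1/(m-1)", unbiased)):
    bias, variance, mse = decompose(estimates, true_variance)
    print(f"{name}: bias={bias:.3f} variance={variance:.3f} mse={mse:.3f}")
```

In this setup the 1/𝑚 estimator has the larger bias and the smaller variance, while the 1/(𝑚 − 1) estimator trades bias for variance.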
Estimators, Bias and Variance: Bias-Variance Trade-off
Supervised Learning Algorithms: Probabilistic Supervised Learning
Supervised learning is mainly about estimating the probability distribution 𝑝(𝑦 ∣ 𝒙)
To use MLE for this job, we first define a family of probability distributions 𝑝(𝑦 ∣ 𝒙; 𝜽)
It can be shown that in this case 𝜽_ML has a closed-form expression, which is the same as the least
squares solution!
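As a sketch of what such a closed-form solution looks like, the code below fits a one-dimensional line by least squares. It assumes the case referred to is a linear model with Gaussian noise (the defining equation is not shown in the text above), and the data are made up.

```python
def least_squares_line(xs, ys):
    """Closed-form least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1 without noise
print(least_squares_line(xs, ys))
```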
There is no closed-form solution in this case; we need to use numerical optimization methods.
Supervised Learning Algorithms: Parametric Supervised Learning
Not all supervised learning algorithms are probabilistic. We can simply define a parametric
family of deterministic models 𝑦 = 𝑓(𝒙; 𝒘) and use some loss function to find an optimal 𝒘∗
One of the most famous approaches to this problem is the Support Vector Machine (SVM)
For binary classification, we let 𝑓(𝒙) = 𝒘ᵀ𝒙 + 𝑏
If 𝑓(𝒙) > 0 we predict that the sample belongs to class 1; otherwise we predict the other class
The second part is the key result. We can replace 𝒙ᵀ𝒙⁽ⁱ⁾ by any inner product and we can
still use the same algorithm!
This is how we generalize to nonlinear classifiers:
The function 𝑘 is called a kernel
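The sketch below shows how swapping the inner product for a kernel changes the classifier without changing the algorithm. It assumes the decision function has the form 𝑓(𝒙) = 𝑏 + Σᵢ αᵢ 𝑘(𝒙, 𝒙⁽ⁱ⁾); the coefficients `alphas`, the Gaussian (RBF) kernel, the label −1 for the other class, and all numbers are hypothetical choices for illustration, not values from a trained SVM.

```python
import math


def linear_kernel(u, v):
    """Plain inner product; recovers the linear classifier."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel; gives a nonlinear decision boundary."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_function(x, points, alphas, b, kernel):
    """f(x) = b + sum_i alpha_i * k(x, x_i), assumed form of the classifier."""
    return b + sum(alpha * kernel(x, xi) for alpha, xi in zip(alphas, points))


def predict_class(x, points, alphas, b, kernel):
    """Class 1 if f(x) > 0, otherwise the other class (labelled -1 here)."""
    return 1 if decision_function(x, points, alphas, b, kernel) > 0 else -1


# Hypothetical training points and coefficients.
points = [(0.0, 0.0), (2.0, 2.0)]
alphas = [-1.0, 1.0]
b = 0.0

print(predict_class((1.9, 2.1), points, alphas, b, rbf_kernel))
print(predict_class((0.1, 0.0), points, alphas, b, rbf_kernel))
```

With `linear_kernel`, the same function reduces to 𝒘ᵀ𝒙 + 𝑏 with 𝒘 = Σᵢ αᵢ 𝒙⁽ⁱ⁾, which is why only the kernel argument needs to change.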
Supervised Learning Algorithms: Nonparametric Supervised Learning
For most supervised learning algorithms we need to fix the form and number of parameters in
the model beforehand.
Can we also learn those from the data?
Models that allow this are called nonparametric; in some sense, they let the data speak for itself
There are other popular nonparametric classifiers as well, such as decision trees and random forests.
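As one illustration of a nonparametric classifier, here is a minimal k-nearest-neighbours sketch. This particular method is not named in the text above and is used here only as a standard example; the points, labels, and the choice 𝑘 = 3 are made up. The model keeps the whole training set, so its effective size grows with the data.

```python
import math
from collections import Counter


def knn_predict(query, points, labels, k=3):
    """Majority vote among the k training points closest to the query."""
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict((0.5, 0.5), points, labels))
print(knn_predict((5.5, 5.5), points, labels))
```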
Unsupervised Learning
There are unsupervised learning algorithms for dimensionality reduction, e.g. PCA
PCA takes unlabeled, difficult-to-interpret data and transforms it into a much more manageable form
Generating association rules is another task that can be solved via unsupervised learning
Association rule learning is an unsupervised method for finding relationships
between variables in large databases.
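A small sketch of how the strength of an association rule can be quantified. The measures "support" and "confidence" are the usual ones for this task but are not defined in the text above, so treat them as assumptions; the shopping-basket data are made up.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    support_antecedent = support(transactions, antecedent)
    if support_antecedent == 0:
        raise ValueError("the antecedent never occurs in the transactions")
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support_antecedent


transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

# Rule under test: customers who buy bread also buy butter.
print(support(transactions, {"bread", "butter"}))
print(confidence(transactions, {"bread"}, {"butter"}))
```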