2021 Machine Learning Intro
Introduction to Machine Learning
Benjamin Rosman
Benjamin.Rosman1@wits.ac.za / benjros@gmail.com
Example: Hand-Written Digits
• Write a program to automatically classify these images of hand-written digits
Example: Faces
• Write a program to automatically find faces
What is machine learning?
• “the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories” [Bishop, 2007]
• Two key ideas: finding patterns in data, and using them for prediction
Example problems
• Text recognition, understanding and translation
• Face/object detection
• Spam filtering
• Identifying topics in documents
• Spoken language understanding
• Medical diagnosis
• Customer segmentation / product recommendation
• Fraud detection
• Weather prediction
• Computer game AI
Hand-Written Digits
We need a model:
• Modelling assumptions (or knowledge about the problem) go here!
• y = f(x; θ)
• θ = parameters of model
[Figure: the data plotted as points in a 2D feature space; horizontal axis x₁]
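To make this concrete, here is a minimal sketch of a parametric model y = f(x; θ). The linear form, the parameter values, and all names here are illustrative assumptions, not something from the slides:

```python
import numpy as np

def f(x, theta):
    """Predict a value y from a 2D feature vector x using parameters theta.
    This (hypothetical) model is linear: theta[0] + theta[1]*x1 + theta[2]*x2."""
    return theta[0] + theta[1] * x[0] + theta[2] * x[1]

theta = np.array([0.5, -1.0, 2.0])  # parameters: these are what training must find
x_new = np.array([0.3, 0.7])        # a new point in the 2D feature space
print(f(x_new, theta))              # the model's prediction y = f(x; theta)
```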
An example: Training
[Figure: labelled training points in the 2D feature space]
An example: Querying
We want the model to generalise to any point in this space that we haven’t seen yet.
[Figure: a new, unlabelled query point marked “?” in the same feature space]
Categories of ML
• Supervised learning (Make predictions!)
  • Predict output y when given input x
  • Learn from labelled data: {(xᵢ, yᵢ)} (see the sketch after this list)
  • Classification: y is categorical
  • Regression: y is real-valued
• Unsupervised learning (Understand data!)
  • Learn from unlabelled data: {xᵢ}
  • Clustering
  • Learning some structure in the data
• Semi-supervised learning (Combine the above!)
  • Only some labels are provided
• Reinforcement learning
  • Learn from rewards (typically delayed)
  • Generate own data (experience) through interacting with an environment
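To make the first two categories concrete, here is a minimal sketch contrasting supervised and unsupervised learning, assuming scikit-learn and NumPy are available (the data is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two synthetic clusters of 2D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)  # labels: only used in the supervised case

# Supervised: learn from labelled data {(x_i, y_i)}, then predict y for a new x
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.0, 2.0]]))

# Unsupervised: learn structure (clusters) from unlabelled data {x_i} alone
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_[:10])
```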
Examples (reinforcement learning):
• Learn to fly a helicopter
• Learn to make coffee
• Learn to play chess
Generalising
During training, only a small fraction of the possible data will be provided.
Trade-off between:
• Expressive: accurately capture distinctions in the data
• Sparse: not needing prohibitive amounts of data
Two ways to manage this trade-off:
• Feature selection
  • Autonomously identify the important dimensions (see the sketch after this list)
• Feature learning
  • Combine simpler features into more complex ones
  • E.g. deep learning (covered when we talk about neural networks)
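As a concrete example of feature selection, here is a minimal sketch assuming scikit-learn; the data, the choice of SelectKBest, and k=2 are all illustrative assumptions:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 5 candidate feature dimensions, but only feature 2 matters
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 2] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Score each dimension against the labels and keep the k most informative
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.get_support())  # boolean mask over the 5 dimensions
```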
Data
For any ML algorithm to work, we need data, and more is always better. In ML, we “let the data do the talking”.
Much work goes into collecting data sets. For large models (many parameters), we may need many millions of examples to learn a good model.
But how do we know how well the model will generalise?
Protip: never trust people who mess this up!
Splitting the Data
Typically divide the full data set into three (a sketch of the split follows this list):
• Training data: learn the model parameters
  • This is the core learning part, and so it needs the most data
  • ± 60% of the data
• Validation data: learn the model hyperparameters
  • Hyperparameters are values set before training begins, e.g. the degree of the polynomial, or the complexity of the neural network
  • ± 20% of the data
• Testing data: report the quality of the model
  • This is used to report an unbiased evaluation of the final model
  • ± 20% of the data
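A minimal sketch of this 60/20/20 split, assuming scikit-learn (the data here is a synthetic stand-in):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).normal(size=(100, 2))  # stand-in features
y = np.arange(100) % 2                               # stand-in labels

# First hold out 40%, then split that 40% in half: 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```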
Why split the data?
This red model fits the blue training points perfectly, so they will not give a reliable estimate of how well the model will generalise.
[Figure: a wiggly red curve passing exactly through every blue training point]
Instead, we want to test it on new data points that it has never seen during training. This gives a better idea of its performance.
Similarly, we may be learning the hyperparameter M (the degree of the model) by training a straight-line model (M=1), a quadratic model (M=2), and so on up to M=9, and then seeing which is best. We can train them all on the same training data, but we need to use separate validation data to choose the best one. Again, we can’t just report its performance on that data, as it is already biased. So we then need a different testing set to report final scores.
The test data must not be touched until the very end! It is the “blind/surprise test”.
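A minimal sketch of this workflow, fitting polynomials of degree M = 1, …, 9 on training data, choosing M on validation data, and only then touching the test set (numpy’s polyfit stands in for the model; the synthetic data is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)  # noisy sin(2*pi*x)
    return x, t

x_train, t_train = make_data(30)
x_val, t_val = make_data(10)
x_test, t_test = make_data(10)

def val_error(M):
    w = np.polyfit(x_train, t_train, M)              # fit on training data only
    return np.mean((np.polyval(w, x_val) - t_val) ** 2)

best_M = min(range(1, 10), key=val_error)            # hyperparameter chosen on validation
w = np.polyfit(x_train, t_train, best_M)
test_mse = np.mean((np.polyval(w, x_test) - t_test) ** 2)
print(best_M, test_mse)                              # test data used once, at the very end
```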
Example: Polynomial Curve Fitting
A simple regression (supervised learning) problem.
[Figure: training points; horizontal axis: the feature x (1D); vertical axis: the target label t (the “y” from before); curve: the true unknown function sin(2πx)]
Goal: given a new x, predict t (the target)
A Polynomial Function
Assume the function is polynomial:
y(x, w) = w₀ + w₁x + w₂x² + … + w_M x^M
Learning:
• Find the weight vector w to minimise the error E(w) = ½ Σₙ (y(xₙ, w) − tₙ)²
  • y(xₙ, w) is the predicted value at xₙ, and tₙ is the true value
  • The error is squared so it is symmetrical, and summed over every data point n = 1, …, N
  • The factor of ½ makes the maths simpler after differentiating
• E(w) is quadratic in w, so E′(w) is linear in w: there is a unique solution w*
More on this example in the linear regression lecture.
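A minimal sketch of this fit: because E(w) is quadratic, minimising it is a linear least-squares problem with a unique solution w* (the synthetic data and the choice M = 3 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 10)  # noisy targets

M = 3
Phi = np.vander(x, M + 1, increasing=True)  # columns: x^0, x^1, ..., x^M

# Solve min_w ||Phi w - t||^2; the same minimiser as E(w), up to the factor 1/2
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)

E = 0.5 * np.sum((Phi @ w_star - t) ** 2)   # E(w*) at the unique optimum
print(w_star, E)
```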
Model Selection
Choosing M (the polynomial order):
For M = 9 the training error E(w*) = 0! But the goal is to generalise!
Training vs Testing Error
To compare errors across data sets of different size N, define the root-mean-square error E_RMS = √(2E(w*)/N).
Overfitting: high error on test data, low error on training data.
With more data:
• Over-fitting is less severe
• We can fit a more complex model
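A minimal sketch of these curves, computing E_RMS on training and test data for each M (synthetic data; with only 10 training points, the M = 9 fit drives the training error to zero while the test error blows up):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_train, t_train = sample(10)   # small training set
x_test, t_test = sample(100)    # large held-out set

def e_rms(w, x, t):
    # Root-mean-square error: equivalent to sqrt(2 E(w) / N)
    return np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))

for M in range(10):
    w = np.polyfit(x_train, t_train, M)  # numpy may warn for the near-singular M=9 fit
    print(M, e_rms(w, x_train, t_train), e_rms(w, x_test, t_test))
```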