Unit 1


INTRODUCTION TO MACHINE LEARNING AND REGRESSION

 Basic Concepts: Probability, Linear algebra, Convex Optimization
 Introduction to Machine Learning and applications
 Types of learning techniques
    Supervised
    Unsupervised and semi-supervised
    Reinforcement learning
 Regression
 Linear Regression Models and Least Squares
 Subset Selection
 Shrinkage Methods
 Methods Using Derived Input Directions
 Multiple Outcome Shrinkage and Selection
 More on the Lasso and Related Path Algorithms

What is Machine Learning?
• Machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn"
• The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods
• Machine learning is closely related to statistics and data mining, but also to theoretical computer science

Related Areas of Study
• Stochastic signal processing
– denoising, source separation, scene analysis, morphing
• Data mining
– (relatively) simple machine learning on huge datasets
• Data compression and coding
– state-of-the-art methods for image compression and error-correcting codes use learning methods
• Decision making, planning
– use both utility and uncertainty optimally
• Adaptive software agents, auctions or preferences
– action choice under limited resources and reward signals

Task: Handwritten Digit Recognition
 Difficult to characterize what is specific to each digit
 Digit recognition, AT&T labs: http://yann.lecun.com/exdb/lenet/

Optical character recognition (OCR)
 Technology to convert scanned images/docs to text
 Automatic license plate reading

Visual Recognition of People in Images
Challenges (figure annotations):
 General poses, high-dimensional (30-100 dof)
 Self-occlusions
 Difficult to segment the individual limbs
 Different body sizes
 Loss of information in the perspective projection
 Partial views
 Several people, occlusions
 Reduced observability of body parts due to loose-fitting clothing
 Accidental alignments
 Motion blur

Machine Learning Approach
• It is difficult to explicitly design programs that can recognize people or digits in images
– Modeling the object structure, the physics, the variability, and the image formation process can be very difficult
• Instead of designing the programs by hand, we collect many examples that specify the correct outputs for different inputs
• A machine learning algorithm takes these examples and produces a program trained to give the answers we want
– If designed properly, the program may work in new situations, not just the ones it was trained on

When to use Machine Learning?
• A machine learning approach is most effective when the structure of the task is not well understood - or too difficult to model explicitly - but can be characterized by a dataset with strong statistical regularity
• Machine learning is also useful in dynamic situations, when the task is constantly changing
– e.g. a robot operating in an unknown environment

A spectrum of machine learning tasks: Statistics --------------------- Artificial Intelligence

Statistics end:
• Low-dimensional data (e.g. less than 100 dimensions)
• Lots of noise in the data
• There is not much structure in the data, and what structure there is can be represented by a fairly simple model
• The main problem is distinguishing true structure from noise

Artificial Intelligence end:
• High-dimensional data (e.g. more than 100 dimensions)
• The noise is not sufficient to obscure the structure in the data if we process it right
• There is a huge amount of structure in the data, but the structure is too complicated to be represented by a simple model
• The main problem is figuring out a way to represent the complicated structure so that it can be learned

Sample Applications of Machine Learning

Machine Learning Applications
• natural language processing
• speech and handwriting recognition
• object recognition in computer vision
• image retrieval
• robotics
• medical diagnosis
• bioinformatics
• classifying DNA sequences
• detecting credit card fraud
• stock market analysis
• recommender systems
• analyzing social patterns and networks
• visualization, game playing and many more…

Online search engines
Query: STREET

Displaying the structure of a set of documents using a deep NN

Organizing photo collections

IBM’s Watson (not) in Jeopardy (2011)
 90 IBM servers, 2880 cores (@3.55 GHz), 15 TB RAM
 Cities question: "Its largest airport was named for a World War II hero; its second largest, for a World War II battle"
 Champions answer 2/3 of questions with 85-95% accuracy

Face detection
 Many digital cameras now detect faces

Smile Detection

Monocular 3D Human Pose Reconstruction
Sminchisescu, Kanaujia, Metaxas, PAMI 2007

3D Human Pose – Microsoft’s Kinect (2011)

Is Machine Learning Solved?
Lots of success already but…
• Many existing systems lag behind human performance
– comparatively, see how fast children learn
• Handling large, unorganized data repositories, under uncertain and indirect supervision, is an open problem
• Designing complex systems (e.g. a robot with sophisticated function and behavior) is an open problem
• Fundamental advances are necessary at all levels
– computational, algorithmic, representational, implementation and integration

Standard Learning Tasks
• Supervised Learning: given examples of inputs and desired outputs (labeled data), predict outputs on future inputs
– Ex: classification, regression, time series prediction
– Who provides the correct outputs? Can these always be measured? Can the process scale?
• Unsupervised Learning: given only inputs (unlabeled data), automatically discover hidden representations, features, structure, etc.
– Ex: clustering, outlier detection, compression
– How can we know that a representation is good?
• Semi-supervised Learning: given both labeled and unlabeled data, leverage information to improve both tasks
– Somewhat practical compromise for data acquisition and modeling
• Reinforcement Learning: given sequences of inputs, actions from a fixed set, and scalar rewards and punishments, learn to select action sequences in a way that maximizes expected reward
– Not much information in the reward, which is often delayed
– e.g. robot operating in a structured environment
Open, challenging research problem, not covered in this course

Modeling with Random Variables
• We use random variables to encode the learning task
– variables can take a set of possible values, each with an associated probability
– inputs, outputs, internal states of the learning machine
• Random variables may be discrete or continuous
– discrete quantities take one of a fixed set of values, e.g. {0,1}, {smile, not-smile}, {email, spam}
– continuous quantities take real values, e.g. elbow joint = 30; temp = 12.2; income = 38,231

Hypothesis Spaces and Learning Machines
• We can view learning as the process of exploring a hypothesis space
• We select functions from a specified set, the hypothesis space
• The hypothesis space is indexed by a set of parameters w, which are variables we can adjust (search) to create different machines (a minimal sketch follows)
• In supervised learning, each hypothesis is a function that maps inputs to outputs
• A challenge of machine learning is deciding how to represent inputs and outputs, and how to select the hypothesis space that would be appropriate for a given task
• The central trade-off is in selecting a hypothesis space that is powerful enough to represent the relationships between inputs and outputs, yet simple enough to be searched efficiently
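To make this concrete, here is a minimal Python sketch of a hypothesis space of affine functions indexed by a parameter vector w; the toy data and the handful of candidate parameter settings are illustrative assumptions, not part of the slides.

import numpy as np

# Hypothesis space: affine functions h(x; w) = w[0] + w[1] * x,
# indexed by the adjustable parameter vector w.
def h(x, w):
    return w[0] + w[1] * x

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.9, 4.2, 5.8])           # toy targets, roughly y = 2x

# "Searching" the space here just means scoring a few candidate
# parameter settings by average squared loss; a learning algorithm
# automates this search instead of enumerating candidates.
for w in (np.array([0.0, 1.0]), np.array([0.0, 2.0]), np.array([1.0, 2.0])):
    print(w, np.mean((h(x, w) - y) ** 2))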

4
Training and Testing Expectations
• Training data: examples we are provided with
• Testing data: data we will see in the future
• Training error: the average value of the loss function on the training data
• Test error: the average value of the loss function on the test data
• Our goal is, primarily, not to do well on the training data; we already have the answers (outputs) for that data
• We want to perform well on future unseen data; we wish to minimize the test error
• How can we guarantee this if we do not have the test data? We will rely on probabilistic assumptions on data variability

Training and Testing Process
• In training, based only on the training data, construct a machine that generates outputs given inputs
– One option is to build machines with small training loss
– Ideally we wish the machine to model the main regularities in the data and ignore the noise. However, if the machine has as many degrees of freedom as the data, it can fit perfectly. We saw the spline case study
– Avoiding this usually requires model complexity control (regularization)
• In testing, a new sample is drawn i.i.d. from the same distribution as the training data (a sketch of the protocol follows)
– This assumption makes it unlikely that important regularities in the test data were missed in the training data
– We run the machine on the new sample and evaluate the loss: this is the test error
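A minimal Python sketch of this protocol, assuming an illustrative data-generating process and a straight-line model (both are assumptions made for the example):

import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Assumed ground truth: y = 2x + 0.5 plus Gaussian noise
    x = rng.uniform(0, 1, n)
    return x, 2.0 * x + 0.5 + 0.1 * rng.standard_normal(n)

x_train, y_train = sample(20)     # examples we are provided with
x_test, y_test = sample(1000)     # stand-in for future unseen data

# Construct the machine from the training data only
w1, w0 = np.polyfit(x_train, y_train, deg=1)

# Average squared loss on training data vs. held-out data
train_error = np.mean((w1 * x_train + w0 - y_train) ** 2)
test_error = np.mean((w1 * x_test + w0 - y_test) ** 2)
print(f"training error {train_error:.4f}, test error {test_error:.4f}")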

Overfitting and Generalization
• The central problem with any learning setup is a machine that does well on the training data but poorly on the test data. This is called overfitting (see the sketch below)
• Whether it occurs depends on the set of assumptions about the target function made by the learning algorithm, known as its inductive bias
– complexity of the hypothesis space, regularization, etc.
• The ability of a learning machine to achieve a small loss on test data is called generalization
• Generalization is possible, in principle, given our i.i.d. assumption: if a machine performs well on most of a sufficiently large training set, and is not too complex, it is likely to do well on similar test data
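The following sketch fits polynomials of increasing degree to a few noisy points; the sinusoidal ground truth and noise level are assumptions chosen for illustration. With as many degrees of freedom as data points (degree 9 on 10 points), the training error collapses while the test error grows:

import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(200)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train {train_err:.4f}, test {test_err:.4f}")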
Probability and Decision Theory
• Uncertainty is a key concept in pattern recognition and machine learning
• It arises both from measurement noise and from finite-size datasets
• Probability theory provides a consistent framework for the quantification and manipulation of uncertainty
• When combined with decision theory, it allows us to make optimal predictions given all the information available, even when that information is incomplete or ambiguous

Example of World with Probabilities
We have two boxes of apples and oranges (distributions shown in the figure). We randomly pick a box, then a fruit, with replacement.
Random variables:
– Box B = {r, b}
– Fruit F = {a, o}

Modeling with Probabilities
• Quantities of interest in our problem are modeled as random variables
• To start with, we will define the probability of an event to be the fraction of times that event occurs, in the limit that the total number of trials goes to infinity
• Using the elementary sum and product rules of probability, we can ask fairly sophisticated questions in our problem domain (worked sketch below)
– Given that we chose an orange, what is the probability that the box we chose was the blue one?
– What is the overall probability that the selection procedure will pick an apple?

Some slides adapted from the textbook (Bishop)
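A worked Python sketch of the two questions. The box contents and picking probabilities below are taken from the textbook version of this example (Bishop, Sec. 1.2) and should be treated as assumptions here, since the slide's figure carries the actual distributions: the red box holds 2 apples and 6 oranges, the blue box holds 3 apples and 1 orange, and p(B=r) = 0.4, p(B=b) = 0.6.

p_box = {"r": 0.4, "b": 0.6}
p_fruit_given_box = {                  # p(F | B) from the box contents
    "r": {"a": 2 / 8, "o": 6 / 8},
    "b": {"a": 3 / 4, "o": 1 / 4},
}

def p_fruit(f):
    # Sum rule: p(F=f) = sum over boxes of p(F=f | B) p(B)
    return sum(p_fruit_given_box[b][f] * p_box[b] for b in p_box)

def p_box_given_fruit(b, f):
    # Bayes' theorem: p(B | F) = p(F | B) p(B) / p(F)
    return p_fruit_given_box[b][f] * p_box[b] / p_fruit(f)

print("p(apple) =", p_fruit("a"))                             # 0.55
print("p(blue box | orange) =", p_box_given_fruit("b", "o"))  # 1/3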

Probability Theory
• Joint probability: p(X = x_i, Y = y_j)
• Marginal probability: p(X = x_i), obtained from the joint via the sum rule
• Conditional probability: p(Y = y_j | X = x_i), related to the joint and the marginal via the product rule

The Rules of Probability
 Sum rule: p(X) = Σ_Y p(X, Y)
 Product rule: p(X, Y) = p(Y | X) p(X)

Bayes’ Theorem
 p(Y | X) = p(X | Y) p(Y) / p(X), where p(X) = Σ_Y p(X | Y) p(Y)
 posterior ∝ likelihood × prior

Linear Models
• It is mathematically easy to fit linear models to data, and we can get insight by analyzing this relatively simple case. We already studied the polynomial curve fitting problem in the previous lectures
• There are many ways to make linear models more powerful while retaining their attractive mathematical properties
• By using non-linear, non-adaptive basis functions, we obtain generalized linear models. These offer non-linear mappings from inputs to outputs but are linear in their parameters. Typically, only the linear part of the model is learnt (see the sketch after the next slide)
• By using kernel methods we can handle expansions of the raw data that use a huge number of non-linear, non-adaptive basis functions. By using large-margin loss functions, we can avoid overfitting even when we use huge numbers of basis functions

The Loss Function
• Once a model class is chosen, fitting the model to data is typically done by finding the parameter values that minimize a loss function
• There are many possible loss functions. What criterion should we use in selecting one?
– a loss that makes analytic calculations easy (squared error)
– a loss that makes the fitting correspond to maximizing the likelihood of the training data, given some noise model for the observed outputs
– a loss that makes it easy to interpret the learned coefficients (easy if mostly zeros)
– a loss that corresponds to the real loss in a practical application (losses are often asymmetric)
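As a concrete sketch of a generalized linear model, the snippet below builds a design matrix of fixed Gaussian basis functions and learns only the linear coefficients by minimizing the squared-error loss. The basis centers, width, and toy data are illustrative assumptions.

import numpy as np

def design_matrix(x, centers, width=0.2):
    # Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 width^2)), plus a bias column
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 50)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

centers = np.linspace(0, 1, 9)             # fixed, non-adaptive basis
Phi = design_matrix(x, centers)

# Only the linear part is learnt: least-squares fit of the weights
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
pred = design_matrix(np.array([0.25]), centers) @ w
print("prediction at x = 0.25:", pred[0])  # near sin(pi/2) = 1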

6
Linear Basis Function Models (1)
Linear Basis Function Models (2)
Linear Basis Function Models (3)
Linear Basis Function Models (4)

Linear Basis Function Models (5)

Other Basis Function Models (6)
• In a Fourier representation, each basis function represents a given frequency and has infinite spatial extent
• Wavelets are localized in both space and frequency, and by definition are mutually orthogonal

Maximum Likelihood and Least Squares (1)
Maximum Likelihood and Least Squares (2)
Maximum Likelihood and Least Squares (3)
Maximum Likelihood and Least Squares (4)
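A minimal sketch of the standard result behind these slides (following Bishop): under the assumed model t = wᵀφ(x) + ε with Gaussian noise ε, maximizing the likelihood of the training targets is equivalent to minimizing the sum-of-squares error, and the maximum-likelihood weights satisfy the normal equations w_ML = (ΦᵀΦ)⁻¹Φᵀt, where Φ is the design matrix with rows φ(x_n)ᵀ. The toy data and polynomial basis below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 30)
t = 1.0 - 2.0 * x + 0.1 * rng.standard_normal(30)

Phi = np.vander(x, 3, increasing=True)      # basis [1, x, x^2]

# Normal equations: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Equivalent, numerically safer pseudo-inverse route
w_pinv = np.linalg.pinv(Phi) @ t
print(w_ml, np.allclose(w_ml, w_pinv))      # close to [1, -2, 0]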

Geometry of Least Squares

Sequential Learning
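Geometrically, the least-squares prediction Φw_ML is the orthogonal projection of the target vector t onto the column space of Φ. For sequential learning, the textbook treatment uses the stochastic-gradient (LMS) rule, updating the weights one example at a time; a minimal sketch follows, in which the learning rate, epoch count, and data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 200)
t = 1.0 - 2.0 * x + 0.1 * rng.standard_normal(200)
Phi = np.column_stack([np.ones_like(x), x])    # basis [1, x]

w = np.zeros(2)
eta = 0.1                                      # learning rate (assumed)
for epoch in range(20):
    for phi_n, t_n in zip(Phi, t):
        # LMS update: w <- w + eta * (t_n - w . phi_n) * phi_n
        w += eta * (t_n - w @ phi_n) * phi_n
print(w)                                       # approaches [1, -2]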

Regularized Least Squares (1)
Regularized Least Squares (2)
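A minimal sketch of quadratic (ridge) regularized least squares, whose standard closed-form solution is w = (λI + ΦᵀΦ)⁻¹Φᵀt; the toy data, basis, and λ values are illustrative assumptions. Increasing λ shrinks the weights, which is one way to control model complexity:

import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)
Phi = np.vander(x, 6, increasing=True)     # degree-5 polynomial basis

for lam in (0.0, 0.1, 10.0):
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    w = np.linalg.solve(A, Phi.T @ t)      # ridge solution
    print(f"lambda = {lam:g}: ||w|| = {np.linalg.norm(w):.3f}")  # shrinks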

Effect of a Quadratic Regularizer

Penalty over square weights vs. Lasso

Multiple Outputs
• If there are multiple outputs, we can often treat the learning problem as a set of independent problems, one per output (see the sketch below)
• This approach is suboptimal if the output noise is correlated and changes across datapoints
• Even though they are independent problems, we can save work by multiplying the input vectors by the inverse covariance of the input components only once
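A minimal sketch of that shared computation: with the targets stacked as columns of a matrix T, a single pseudo-inverse of the design matrix solves every per-output least-squares problem at once. The toy data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(6)
Phi = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
W_true = np.array([[1.0, 0.5],
                   [-2.0, 3.0]])     # one weight column per output
T = Phi @ W_true + 0.05 * rng.standard_normal((100, 2))

Phi_pinv = np.linalg.pinv(Phi)       # computed once, shared by all outputs
W = Phi_pinv @ T                     # each column solves one output
print(W)                             # close to W_true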

Multiple Outputs (1)
Multiple Outputs (2)

Thank You


