Unit 1


INTRODUCTION TO MACHINE LEARNING AND REGRESSION

 Basic Concepts: Probability, Linear algebra, Convex Optimization
 Introduction to Machine Learning and applications
 Types of learning techniques
    Supervised
    Unsupervised and semi-supervised
    Reinforcement learning
 Regression
 Linear Regression Models and Least Squares
 Subset Selection
 Shrinkage Methods
 Methods Using Derived Input Directions
 Multiple Outcome Shrinkage and Selection
 More on the Lasso and Related Path Algorithms

What is Machine Learning?
• Machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn"
• The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods
• Machine learning is closely related to statistics and data mining, but also to theoretical computer science

Related Areas of Study
• Stochastic signal processing
– denoising, source separation, scene analysis, morphing
• Data mining
– (relatively) simple machine learning on huge datasets
• Data compression and coding
– state-of-the-art methods for image compression and error-correcting codes use learning methods
• Decision making, planning
– use both utility and uncertainty optimally
• Adaptive software agents, auctions or preferences
– action choice under limited resources and reward signals

Task: Handwritten Digit Recognition
 Difficult to characterize what is specific to each digit
 Digit recognition, AT&T labs: http://yann.lecun.com/exdb/lenet/

Optical character recognition (OCR)
 Technology to convert scanned images/docs to text
 Automatic license plate reading

Visual Recognition of People in Images
Challenges (figure annotations):
 General poses, high-dimensional (30-100 dof)
 Self-occlusions
 Difficult to segment the individual limbs
 Different body sizes
 Loss of information in the perspective projection
 Partial views
 Several people, occlusions
 Reduced observability of body parts due to loose-fitting clothing
 Accidental alignments
 Motion blur

Machine Learning Approach
• It is difficult to explicitly design programs that can recognize people or digits in images
– Modeling the object structure, the physics, the variability, and the image formation process can be very difficult
• Instead of designing the programs by hand, we collect many examples that specify the correct outputs for different inputs
• A machine learning algorithm takes these examples and produces a program trained to give the answers we want
– If designed properly, the program may work in new situations, not just the ones it was trained on

When to use Machine Learning?
• A machine learning approach is most effective when the structure of the task is not well understood - or too difficult to model explicitly - but can be characterized by a dataset with strong statistical regularity
• Machine learning is also useful in dynamic situations, when the task is constantly changing
– e.g. a robot operating in an unknown environment

A spectrum of machine learning tasks: Statistics --------------------- Artificial Intelligence

Statistics end:
• Low-dimensional data (e.g. less than 100 dimensions)
• Lots of noise in the data
• There is not much structure in the data, and what structure there is can be represented by a fairly simple model
• The main problem is distinguishing true structure from noise

Artificial Intelligence end:
• High-dimensional data (e.g. more than 100 dimensions)
• The noise is not sufficient to obscure the structure in the data if we process it right
• There is a huge amount of structure in the data, but the structure is too complicated to be represented by a simple model
• The main problem is figuring out a way to represent the complicated structure so that it can be learned

Sample Applications of Machine Learning

Machine Learning Applications
• natural language processing
• speech and handwriting recognition
• object recognition in computer vision
• image retrieval
• robotics
• medical diagnosis
• bioinformatics
• classifying DNA sequences
• detecting credit card fraud
• stock market analysis
• recommender systems
• analyzing social patterns and networks
• visualization, game playing and many more…

Online search engines
Query: STREET

Displaying the structure of a set of documents using a deep NN

Organizing photo collections

IBM’s Watson (not) in Jeopardy (2011)
 90 IBM servers, 2880 cores (@3.55 GHz), 15 TB RAM
 Cities question: "Its largest airport was named for a World War II hero; its second largest, for a World War II battle"
 Champions answer 2/3 of questions with 85-95% accuracy

Face detection
 Many digital cameras now detect faces

Smile Detection

Monocular 3D Human Pose Reconstruction
Sminchisescu, Kanaujia, Metaxas, PAMI 2007

3D Human Pose – Microsoft’s Kinect (2011)

Is Machine Learning Solved?
Lots of success already but…
• Many existing systems lag behind human performance
– comparatively, see how fast children learn
• Handling large, unorganized data repositories, under uncertain and indirect supervision, is an open problem
• Designing complex systems (e.g. a robot with sophisticated function and behavior) is an open problem
• Fundamental advances are necessary at all levels
– computational, algorithmic, representational, implementation and integration

Standard Learning Tasks
• Supervised Learning: given examples of inputs and desired outputs (labeled data), predict outputs on future inputs
– Ex: classification, regression, time series prediction
– Who provides the correct outputs? Can these always be measured? Can the process scale?
• Unsupervised Learning: given only inputs (unlabeled data), automatically discover hidden representations, features, structure, etc.
– Ex: clustering, outlier detection, compression
– How can we know that a representation is good?
• Semi-supervised Learning: given both labeled and unlabeled data, leverage information to improve both tasks
– Somewhat practical compromise for data acquisition and modeling
• Reinforcement Learning: given sequences of inputs, actions from a fixed set, and scalar rewards and punishments, learn to select action sequences in a way that maximizes expected reward
– Not much information in the reward, which is often delayed
– e.g. robot operating in a structured environment
Open, challenging research problem, not covered in this course

Modeling with Random Variables
• We use random variables to encode the learning task
– variables can take a set of possible values, each with an associated probability
– inputs, outputs, internal states of the learning machine
• Random variables may be discrete or continuous
– discrete quantities take one of a fixed set of values, e.g. {0,1}, {smile, not-smile}, {email, spam}
– continuous quantities take real values, e.g. elbow joint = 30; temp = 12.2; income = 38,231

Hypothesis Spaces and Learning Machines
• We can view learning as the process of exploring a hypothesis space
• We select functions from a specified set, the hypothesis space
• The hypothesis space is indexed by a set of parameters w, which are variables we can adjust (search) to create different machines (a minimal sketch follows)
• In supervised learning, each hypothesis is a function that maps inputs to outputs
• A challenge of machine learning is deciding how to represent inputs and outputs, and how to select the hypothesis space that would be appropriate for a given task
• The central trade-off is in selecting a hypothesis space that is powerful enough to represent the relationships between inputs and outputs, yet simple enough to be searched efficiently
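To make this concrete, here is a minimal Python sketch of a hypothesis space of affine functions indexed by a parameter vector w; the toy data and the handful of candidate parameter settings are illustrative assumptions, not part of the slides.

import numpy as np

# Hypothesis space: affine functions h(x; w) = w[0] + w[1] * x,
# indexed by the adjustable parameter vector w.
def h(x, w):
    return w[0] + w[1] * x

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.9, 4.2, 5.8])           # toy targets, roughly y = 2x

# "Searching" the space here just means scoring a few candidate
# parameter settings by average squared loss; a learning algorithm
# automates this search instead of enumerating candidates.
for w in (np.array([0.0, 1.0]), np.array([0.0, 2.0]), np.array([1.0, 2.0])):
    print(w, np.mean((h(x, w) - y) ** 2))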

4
Training and Testing Expectations
• Training data: examples we are provided with
• Testing data: data we will see in the future
• Training error: the average value of the loss function on the training data
• Test error: the average value of the loss function on the test data
• Our goal is, primarily, not to do well on the training data; we already have the answers (outputs) for that data
• We want to perform well on future unseen data; we wish to minimize the test error
• How can we guarantee this if we do not have the test data? We will rely on probabilistic assumptions on data variability

Training and Testing Process
• In training, based only on the training data, construct a machine that generates outputs given inputs
– One option is to build machines with small training loss
– Ideally we wish the machine to model the main regularities in the data and ignore the noise. However, if the machine has as many degrees of freedom as the data, it can fit perfectly. We saw the spline case study
– Avoiding this usually requires model complexity control (regularization)
• In testing, a new sample is drawn i.i.d. from the same distribution as the training data (a sketch of the protocol follows)
– This assumption makes it unlikely that important regularities in the test data were missed in the training data
– We run the machine on the new sample and evaluate the loss: this is the test error
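A minimal Python sketch of this protocol, assuming an illustrative data-generating process and a straight-line model (both are assumptions made for the example):

import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Assumed ground truth: y = 2x + 0.5 plus Gaussian noise
    x = rng.uniform(0, 1, n)
    return x, 2.0 * x + 0.5 + 0.1 * rng.standard_normal(n)

x_train, y_train = sample(20)     # examples we are provided with
x_test, y_test = sample(1000)     # stand-in for future unseen data

# Construct the machine from the training data only
w1, w0 = np.polyfit(x_train, y_train, deg=1)

# Average squared loss on training data vs. held-out data
train_error = np.mean((w1 * x_train + w0 - y_train) ** 2)
test_error = np.mean((w1 * x_test + w0 - y_test) ** 2)
print(f"training error {train_error:.4f}, test error {test_error:.4f}")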

Overfitting and Generalization
• The central problem with any learning setup is a machine that does well on the training data but poorly on the test data. This is called overfitting (see the sketch below)
• Whether it occurs depends on the set of assumptions about the target function made by the learning algorithm, known as its inductive bias
– complexity of the hypothesis space, regularization, etc.
• The ability of a learning machine to achieve a small loss on test data is called generalization
• Generalization is possible, in principle, given our i.i.d. assumption: if a machine performs well on most of a sufficiently large training set, and is not too complex, it is likely to do well on similar test data
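The following sketch fits polynomials of increasing degree to a few noisy points; the sinusoidal ground truth and noise level are assumptions chosen for illustration. With as many degrees of freedom as data points (degree 9 on 10 points), the training error collapses while the test error grows:

import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(200)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train {train_err:.4f}, test {test_err:.4f}")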
Probability and Decision Theory
• Uncertainty is a key concept in pattern recognition and machine learning
• It arises both from measurement noise and from finite-size datasets
• Probability theory provides a consistent framework for the quantification and manipulation of uncertainty
• When combined with decision theory, it allows us to make optimal predictions given all the information available, even when that information is incomplete or ambiguous

Example of World with Probabilities
We have two boxes of apples and oranges (distributions shown in the figure). We randomly pick a box, then a fruit, with replacement.
Random variables:
– Box B = {r, b}
– Fruit F = {a, o}

Modeling with Probabilities
• Quantities of interest in our problem are modeled as random variables
• To start with, we will define the probability of an event to be the fraction of times that event occurs, in the limit that the total number of trials goes to infinity
• Using the elementary sum and product rules of probability, we can ask fairly sophisticated questions in our problem domain (worked sketch below)
– Given that we chose an orange, what is the probability that the box we chose was the blue one?
– What is the overall probability that the selection procedure will pick an apple?

Some slides adapted from the textbook (Bishop)
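A worked Python sketch of the two questions. The box contents and picking probabilities below are taken from the textbook version of this example (Bishop, Sec. 1.2) and should be treated as assumptions here, since the slide's figure carries the actual distributions: the red box holds 2 apples and 6 oranges, the blue box holds 3 apples and 1 orange, and p(B=r) = 0.4, p(B=b) = 0.6.

p_box = {"r": 0.4, "b": 0.6}
p_fruit_given_box = {                  # p(F | B) from the box contents
    "r": {"a": 2 / 8, "o": 6 / 8},
    "b": {"a": 3 / 4, "o": 1 / 4},
}

def p_fruit(f):
    # Sum rule: p(F=f) = sum over boxes of p(F=f | B) p(B)
    return sum(p_fruit_given_box[b][f] * p_box[b] for b in p_box)

def p_box_given_fruit(b, f):
    # Bayes' theorem: p(B | F) = p(F | B) p(B) / p(F)
    return p_fruit_given_box[b][f] * p_box[b] / p_fruit(f)

print("p(apple) =", p_fruit("a"))                             # 0.55
print("p(blue box | orange) =", p_box_given_fruit("b", "o"))  # 1/3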

Probability Theory
• Joint probability: p(X = x_i, Y = y_j)
• Marginal probability: p(X = x_i), obtained from the joint via the sum rule
• Conditional probability: p(Y = y_j | X = x_i), related to the joint and the marginal via the product rule

The Rules of Probability
 Sum rule: p(X) = Σ_Y p(X, Y)
 Product rule: p(X, Y) = p(Y | X) p(X)

Bayes’ Theorem
 p(Y | X) = p(X | Y) p(Y) / p(X), where p(X) = Σ_Y p(X | Y) p(Y)
 posterior ∝ likelihood × prior

Linear Models
• It is mathematically easy to fit linear models to data, and we can get insight by analyzing this relatively simple case. We already studied the polynomial curve fitting problem in the previous lectures
• There are many ways to make linear models more powerful while retaining their attractive mathematical properties
• By using non-linear, non-adaptive basis functions, we obtain generalized linear models. These offer non-linear mappings from inputs to outputs but are linear in their parameters. Typically, only the linear part of the model is learnt (see the sketch after the next slide)
• By using kernel methods we can handle expansions of the raw data that use a huge number of non-linear, non-adaptive basis functions. By using large-margin loss functions, we can avoid overfitting even when we use huge numbers of basis functions

The Loss Function
• Once a model class is chosen, fitting the model to data is typically done by finding the parameter values that minimize a loss function
• There are many possible loss functions. What criterion should we use in selecting one?
– a loss that makes analytic calculations easy (squared error)
– a loss that makes the fitting correspond to maximizing the likelihood of the training data, given some noise model for the observed outputs
– a loss that makes it easy to interpret the learned coefficients (easy if mostly zeros)
– a loss that corresponds to the real loss in a practical application (losses are often asymmetric)
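As a concrete sketch of a generalized linear model, the snippet below builds a design matrix of fixed Gaussian basis functions and learns only the linear coefficients by minimizing the squared-error loss. The basis centers, width, and toy data are illustrative assumptions.

import numpy as np

def design_matrix(x, centers, width=0.2):
    # Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 width^2)), plus a bias column
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 50)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

centers = np.linspace(0, 1, 9)             # fixed, non-adaptive basis
Phi = design_matrix(x, centers)

# Only the linear part is learnt: least-squares fit of the weights
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
pred = design_matrix(np.array([0.25]), centers) @ w
print("prediction at x = 0.25:", pred[0])  # near sin(pi/2) = 1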

6
Linear Basis Function Models (1)
Linear Basis Function Models (2)
Linear Basis Function Models (3)
Linear Basis Function Models (4)

Linear Basis Function Models (5)

Other Basis Function Models (6)
• In a Fourier representation, each basis function represents a given frequency and has infinite spatial extent
• Wavelets are localized in both space and frequency, and by definition are mutually orthogonal

Maximum Likelihood and Least Squares (1)
Maximum Likelihood and Least Squares (2)
Maximum Likelihood and Least Squares (3)
Maximum Likelihood and Least Squares (4)
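A minimal sketch of the standard result behind these slides (following Bishop): under the assumed model t = wᵀφ(x) + ε with Gaussian noise ε, maximizing the likelihood of the training targets is equivalent to minimizing the sum-of-squares error, and the maximum-likelihood weights satisfy the normal equations w_ML = (ΦᵀΦ)⁻¹Φᵀt, where Φ is the design matrix with rows φ(x_n)ᵀ. The toy data and polynomial basis below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 30)
t = 1.0 - 2.0 * x + 0.1 * rng.standard_normal(30)

Phi = np.vander(x, 3, increasing=True)      # basis [1, x, x^2]

# Normal equations: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Equivalent, numerically safer pseudo-inverse route
w_pinv = np.linalg.pinv(Phi) @ t
print(w_ml, np.allclose(w_ml, w_pinv))      # close to [1, -2, 0]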

Geometry of Least Squares

Sequential Learning
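Geometrically, the least-squares prediction Φw_ML is the orthogonal projection of the target vector t onto the column space of Φ. For sequential learning, the textbook treatment uses the stochastic-gradient (LMS) rule, updating the weights one example at a time; a minimal sketch follows, in which the learning rate, epoch count, and data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 200)
t = 1.0 - 2.0 * x + 0.1 * rng.standard_normal(200)
Phi = np.column_stack([np.ones_like(x), x])    # basis [1, x]

w = np.zeros(2)
eta = 0.1                                      # learning rate (assumed)
for epoch in range(20):
    for phi_n, t_n in zip(Phi, t):
        # LMS update: w <- w + eta * (t_n - w . phi_n) * phi_n
        w += eta * (t_n - w @ phi_n) * phi_n
print(w)                                       # approaches [1, -2]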

Regularized Least Squares (1)
Regularized Least Squares (2)
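A minimal sketch of quadratic (ridge) regularized least squares, whose standard closed-form solution is w = (λI + ΦᵀΦ)⁻¹Φᵀt; the toy data, basis, and λ values are illustrative assumptions. Increasing λ shrinks the weights, which is one way to control model complexity:

import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)
Phi = np.vander(x, 6, increasing=True)     # degree-5 polynomial basis

for lam in (0.0, 0.1, 10.0):
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    w = np.linalg.solve(A, Phi.T @ t)      # ridge solution
    print(f"lambda = {lam:g}: ||w|| = {np.linalg.norm(w):.3f}")  # shrinks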

Effect of a Quadratic Regularizer

Penalty over square weights vs. Lasso

Multiple Outputs
• If there are multiple outputs, we can often treat the learning problem as a set of independent problems, one per output (see the sketch below)
• This approach is suboptimal if the output noise is correlated and changes across datapoints
• Even though they are independent problems, we can save work by multiplying the input vectors by the inverse covariance of the input components only once
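A minimal sketch of that shared computation: with the targets stacked as columns of a matrix T, a single pseudo-inverse of the design matrix solves every per-output least-squares problem at once. The toy data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(6)
Phi = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
W_true = np.array([[1.0, 0.5],
                   [-2.0, 3.0]])     # one weight column per output
T = Phi @ W_true + 0.05 * rng.standard_normal((100, 2))

Phi_pinv = np.linalg.pinv(Phi)       # computed once, shared by all outputs
W = Phi_pinv @ T                     # each column solves one output
print(W)                             # close to W_true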

Multiple Outputs (1)
Multiple Outputs (2)

Thank You


