Statistical Machine Learning-The Basic Approach and Current Research Challenges
CS497
February, 2007
A High Level Agenda
Herbert Simon
Representative learning tasks
Medical research.
Detection of fraudulent activity (credit card transactions, intrusion detection, stock market manipulation).
Analysis of genome functionality.
Email spam detection.
Spatial prediction of landslide hazards.
Common to all such tasks
We wish to develop algorithms that detect meaningful
regularities in large complex data sets.
For example:
Training a spam filter
Medical Diagnosis (Patient info → High/Low risk).
Stock market prediction (predict tomorrow's market trend from companies' performance data).
Other Learning Tasks
Clustering: the grouping of data into representative collections, a fundamental tool for data analysis.
Examples :
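As a concrete illustration of clustering, here is a minimal sketch of k-means (Lloyd's algorithm). The slides do not name a specific clustering method, so k-means is a swapped-in example, and the toy points, the `k_means` helper and its fixed initial centers are all assumptions made for illustration.

```python
import numpy as np

def k_means(points, init_centers, n_iter=20):
    # Lloyd's algorithm: alternate nearest-center assignment and mean updates.
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, n_centers).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return labels, centers

# Hypothetical toy data: two well-separated groups in the plane.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = k_means(points, init_centers=[points[0], points[3]])
print(labels)
print(centers)
```

The initial centers are passed in explicitly so the run is deterministic; in practice they are usually chosen at random or by a seeding heuristic.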
Good models should enable prediction of new data.
A Fundamental Dilemma of Science:
Model Complexity vs Prediction Accuracy
[Figure labels: limited data; accuracy; possible models/representations; complexity]
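The tension between model complexity and prediction accuracy under limited data can be sketched numerically. The example below is only an illustration under assumed conditions: the noisy sine data source, the sample sizes and the polynomial degrees are hypothetical choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Hypothetical data source: a noisy sine curve on [0, 1].
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = sample(15)    # limited data
x_test, y_test = sample(1000)    # stands in for "new data"

def errors(degree):
    # Least-squares polynomial fit; a higher degree means a more complex model.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

results = {d: errors(d) for d in (1, 3, 9)}
for d, (tr, te) in results.items():
    print(f"degree {d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

The training error can only shrink as the degree grows, because each polynomial class contains the previous one. The error on fresh data need not follow it down.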
Problem Outline
We are interested in
(automated) Hypothesis Generation,
rather than traditional Hypothesis Testing
First solution:
Consider only a limited set of candidate hypotheses.
Empirical Risk Minimization Paradigm
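A minimal sketch of empirical risk minimization over a small, fixed hypothesis set: pick the candidate that makes the fewest mistakes on the training sample. The threshold classifiers, the helper names (`empirical_risk`, `erm`, `make_threshold`) and the toy sample are assumptions made for illustration.

```python
import numpy as np

def empirical_risk(h, xs, ys):
    # Fraction of sample points that h mislabels.
    return float(np.mean(h(xs) != ys))

def erm(hypotheses, xs, ys):
    # Return the hypothesis with the smallest empirical risk (first one on ties).
    risks = [empirical_risk(h, xs, ys) for h in hypotheses]
    best = int(np.argmin(risks))
    return hypotheses[best], risks[best]

def make_threshold(t):
    # Hypothetical hypothesis: predict 1 when x >= t, else 0.
    return lambda x: (x >= t).astype(int)

# A limited set of candidate hypotheses.
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
H = [make_threshold(t) for t in thresholds]

# Hypothetical labeled sample S.
xs = np.array([0.1, 0.2, 0.4, 0.6, 0.7, 0.9])
ys = np.array([0, 0, 0, 1, 1, 1])

h_best, risk = erm(H, xs, ys)
print(risk)
```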
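With probability at least 1 − δ over the choice of the sample S, for every h ∈ H (c is a constant):
\[
\Pr_{(x,y)\sim D}\big(h(x) \neq y\big) \;\le\; \frac{|\{(x,y) \in S : h(x) \neq y\}|}{|S|} \;+\; c\,\sqrt{\frac{\mathrm{VCdim}(H) + \ln\frac{1}{\delta}}{|S|}}
\]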
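To get a feel for how such a bound behaves, the sketch below evaluates the standard form of the VC generalization bound, empirical error plus c·sqrt((VCdim(H) + ln(1/δ)) / |S|), for a few sample sizes. The function name `vc_bound`, the default `c = 1.0` and the example numbers are assumptions; the constant is not specified in the slides.

```python
import math

def vc_bound(empirical_error, vc_dim, sample_size, delta, c=1.0):
    # Upper bound on the true error:
    #   empirical error + c * sqrt((VCdim(H) + ln(1/delta)) / |S|)
    # c = 1.0 is a placeholder value; the actual constant is left unspecified.
    slack = c * math.sqrt((vc_dim + math.log(1.0 / delta)) / sample_size)
    return empirical_error + slack

# Hypothetical setting: 5% training error, VCdim 10, confidence 95%.
for m in (100, 1_000, 10_000):
    print(m, round(vc_bound(0.05, vc_dim=10, sample_size=m, delta=0.05), 3))
```

The slack term shrinks like 1/sqrt(|S|) and grows with VCdim(H), which is why a richer class needs more data for the same guarantee.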
Expanding H
will lower the approximation error
BUT
it will increase the estimation error
(lower statistical soundness)
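The trade-off can be written as a decomposition of the error of the learned hypothesis. The notation here is an assumption introduced for illustration: Er_D(h) is the true error of h under the distribution D, and h_S is the hypothesis the learner outputs on sample S.

```latex
\[
\mathrm{Er}_D(h_S)
  \;=\; \underbrace{\min_{h \in H} \mathrm{Er}_D(h)}_{\text{approximation error}}
  \;+\; \underbrace{\mathrm{Er}_D(h_S) - \min_{h \in H} \mathrm{Er}_D(h)}_{\text{estimation error}}
\]
```

Enlarging H can only decrease the first term, while the second term typically grows because more hypotheses must be told apart from the same limited sample.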
Yet another problem –
Computational Complexity
Bartlett- BD
The Types of Errors to be Considered
x ↦ (x, x²)
The SVM Idea: an Example
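A small sketch of why an embedding such as x ↦ (x, x²) helps: one-dimensional data that no single threshold can separate becomes linearly separable in the plane. The toy data and the particular separating line are assumptions chosen for illustration.

```python
import numpy as np

def feature_map(x):
    # Map each scalar x to the point (x, x^2) in the plane.
    return np.stack([x, x ** 2], axis=1)

# Hypothetical 1-D data: label 1 near the origin, label 0 further out.
# No single threshold on x separates the two classes.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

phi = feature_map(x)

# After the embedding, the horizontal line "second coordinate = 1.5" separates them:
# with w = (0, -1) and b = 1.5, w . phi(x) + b > 0 exactly for the label-1 points.
w, b = np.array([0.0, -1.0]), 1.5
pred = (phi @ w + b > 0).astype(int)
print(pred)
```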
Controlling Computational Complexity
\[
\big[\, K(x_i, x_j) \,\big]_{i,j}
\]
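The point of a kernel K(x_i, x_j) is that inner products in the embedded space can be computed directly from the original points. The sketch below checks this for the embedding x ↦ (x, x²), whose inner product is ab + (ab)². Pairing this specific kernel with the slide is an assumption made for illustration.

```python
import numpy as np

def feature_map(x):
    # Explicit embedding of each scalar x as (x, x^2).
    return np.stack([x, x ** 2], axis=1)

def kernel(a, b):
    # K(a, b) = ab + (ab)^2 equals <phi(a), phi(b)> for phi(x) = (x, x^2).
    p = a * b
    return p + p ** 2

x = np.array([-2.0, -0.5, 1.0, 3.0])  # hypothetical sample points

gram_explicit = feature_map(x) @ feature_map(x).T  # inner products after embedding
gram_kernel = kernel(x[:, None], x[None, :])       # same matrix, no embedding needed
print(np.allclose(gram_explicit, gram_kernel))
```

For high-dimensional embeddings the second route is the cheaper one, since the embedded vectors never have to be built.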
\[
\max_{\text{separating } h} \;\; \min_{x_i} \;\; w_n \cdot x_i
\]
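The max-min objective can be read as: among separating hyperplanes, prefer the one whose closest sample point is as far away as possible. The sketch below evaluates that margin for a few candidates. It assumes hyperplanes through the origin and labels in {-1, +1}, and the candidate directions and data are hypothetical. A real SVM solves this as an optimization problem rather than by enumeration.

```python
import numpy as np

def margin(w, X, y):
    # Smallest signed distance from a sample point to the hyperplane w . x = 0.
    # Labels are in {-1, +1}; a negative value means w does not separate the data.
    w = np.asarray(w, dtype=float)
    return float(np.min(y * (X @ w)) / np.linalg.norm(w))

# Hypothetical linearly separable sample.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Hypothetical candidate normal vectors, all of which separate the sample.
candidates = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])]
margins = [margin(w, X, y) for w in candidates]
best = candidates[int(np.argmax(margins))]
print(margins, best)
```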