2806 Neural Computation Committee Machines: 2005 Ari Visa
Committee Machines
Lecture 7
2005 Ari Visa
Agenda
Some historical notes
Some theory
Committee Machines
Conclusions
Some Theory
Ensemble Averaging: a number of differently trained neural networks, which share a common input and whose individual outputs are somehow combined to produce an overall output.
Some Theory
The bias of the ensemble-averaged function F_I(x), pertaining to the committee machine, is exactly the same as that of the function F(x) pertaining to a single neural network.
The variance of the ensemble-averaged function F_I(x) is less than that of the function F(x).
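As a rough illustration of this bias/variance claim, the following sketch (not from the lecture) fits several polynomial "experts" and averages their outputs; the averaged committee output F_I(x) shows a smaller variance across repeated experiments than a single expert F(x). For simplicity each expert here sees a different noise realization of the training data, standing in for the differently trained networks of the lecture; all names are illustrative.

```python
# Minimal numpy sketch of ensemble averaging and its variance-reduction effect.
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)

def fit_expert(x, d, degree=5):
    """One 'expert': a polynomial least-squares fit, standing in for a trained network."""
    return np.polyfit(x, d, degree)

x_train = np.linspace(0.0, 1.0, 30)
x_test = np.linspace(0.0, 1.0, 200)
K = 10                                      # committee size

single_runs, committee_runs = [], []
for trial in range(200):                    # repeat the experiment to estimate variance
    experts = [fit_expert(x_train,
                          target(x_train) + 0.3 * rng.standard_normal(x_train.size))
               for _ in range(K)]
    outputs = np.array([np.polyval(w, x_test) for w in experts])
    single_runs.append(outputs[0])          # F(x): a single expert
    committee_runs.append(outputs.mean(0))  # F_I(x): ensemble average

print("mean variance, single expert   :", np.var(np.array(single_runs), axis=0).mean())
print("mean variance, ensemble average:", np.var(np.array(committee_runs), axis=0).mean())
```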
Some Theory
Classification: If the
first and the second
experts in the
committee agree in
their respective
decisions, that class
label is used.
Otherwise, the class
label discovered by the
third expert is used.
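A hedged sketch of this three-expert rule (boosting by filtering); the expert objects and their predict() method are assumptions for the example, not part of the lecture.

```python
# Decision rule for a committee of three experts: use the label agreed on by
# the first two experts, otherwise fall back on the third expert's label.
def committee_label(x, expert1, expert2, expert3):
    c1, c2 = expert1.predict(x), expert2.predict(x)
    return c1 if c1 == c2 else expert3.predict(x)
```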
Some Theory
AdaBoost (boosting by
resampling -> batch
learning)
AdaBoost adjusts
adaptively to the error of
the weak hypothesis
returned by the weak
learning model.
When the number of
possible classes (labels) is
M>2, the boosting
problem becomes more
intricate.
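To make the adaptive behaviour concrete, here is a compact, illustrative numpy implementation of binary AdaBoost with decision stumps as weak learners (labels assumed to be in {-1, +1}); it is a sketch of the standard algorithm, not the lecture's own code.

```python
import numpy as np

def train_stump(X, y, w):
    """Weighted decision stump: threshold one feature; returns (feature, threshold, sign, error)."""
    best = (0, 0.0, 1, np.inf)
    for feat in range(X.shape[1]):
        for thr in np.unique(X[:, feat]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, feat] <= thr, 1, -1)
                err = np.sum(w[pred != y])          # weighted error of the weak hypothesis
                if err < best[3]:
                    best = (feat, thr, sign, err)
    return best

def adaboost(X, y, T=20):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                         # boosting distribution over examples
    ensemble = []
    for t in range(T):
        feat, thr, sign, err = train_stump(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)       # vote adapted to the weak error
        pred = sign * np.where(X[:, feat] <= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)              # emphasize misclassified examples
        w /= w.sum()
        ensemble.append((alpha, feat, thr, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, f] <= t, 1, -1) for a, f, t, s in ensemble)
    return np.sign(score)                           # weighted majority vote
```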
Some Theory
Mixture of Experts (ME) Model:
$y_k = \mathbf{w}_k^T \mathbf{x}$, $k = 1, 2, \ldots, K$
The gating network consists
of a single layer of K
neurons, with each neuron
assigned to a specific
expert.
Some Theory
The neurons of the gating
network are nonlinear.
$g_k = \dfrac{\exp(u_k)}{\sum_{j=1}^{K} \exp(u_j)}$, $k = 1, 2, \ldots, K$, with $u_k = \mathbf{a}_k^T \mathbf{x}$ ->
softmax
The gating network is a
classifier that maps the
input x into multinomial
probabilities -> The
different experts will be
able to match the desired
response.
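A minimal sketch of such a gating network, assuming a weight matrix A whose rows are the vectors a_k (an illustrative layout, not from the lecture):

```python
# Softmax gating network: K linear units u_k = a_k^T x normalized into
# multinomial probabilities g_k over the K experts.
import numpy as np

def gating(A, x):
    """A: (K, dim) matrix whose rows are the gate weight vectors a_k; x: (dim,) input."""
    u = A @ x                          # u_k = a_k^T x
    u = u - u.max()                    # subtract max for numerical stability
    g = np.exp(u)
    return g / g.sum()                 # g_k >= 0 and sum_k g_k = 1
```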
Some Theory
$f_D(d \mid \mathbf{x}, \theta) = \sum_{k=1}^{K} g_k\, f_D(d \mid \mathbf{x}, k, \theta) = \dfrac{1}{\sqrt{2\pi}} \sum_{k=1}^{K} g_k \exp\!\left(-\tfrac{1}{2}(d - y_k)^2\right)$
associative Gaussian mixture model
Given the training sample $\{(\mathbf{x}_i, d_i)\}_{i=1}^{N}$, the problem is to learn the conditional means $\mu_k = y_k$ and the mixing parameters $g_k$, $k = 1, 2, \ldots, K$, in an optimum manner, so that $f_D(d \mid \mathbf{x}, \theta)$ provides a good estimate of the underlying probability density function of the environment responsible for generating the training data.
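A short illustrative sketch of this associative Gaussian mixture density for one input, reusing the gating() helper above; W is assumed to hold the expert weight vectors w_k as rows.

```python
# Conditional density f_D(d | x, theta) of the mixture-of-experts model with
# linear experts y_k = w_k^T x and unit-variance Gaussian components.
import numpy as np

def moe_density(W, A, x, d):
    """W: (K, dim) expert weights, A: (K, dim) gate weights, x: (dim,), d: scalar target."""
    y = W @ x                                        # expert outputs y_k
    g = gating(A, x)                                 # mixing coefficients g_k
    comps = np.exp(-0.5 * (d - y) ** 2) / np.sqrt(2 * np.pi)
    return float(np.dot(g, comps))                   # f_D(d | x, theta)
```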
Some Theory
Hierarchical Mixture of
Experts Model
HME is a natural extension of the
ME model.
The HME model differs from the
ME model in that the input
space is divided into a nested
set of subspaces, with the
information combined and
redistributed among the experts
under the control of several
gating networks arranged in a
hierarchical manner.
Some Theory
Classification and regression tree (CART, Breiman et al. 1984)
1) Selection of splits: let a node t denote a subset of the current tree T. Let $\bar{d}(t)$ denote the average of $d_i$ for all cases $(\mathbf{x}_i, d_i)$ falling into t, that is,
$\bar{d}(t) = \dfrac{1}{N(t)} \sum_{\mathbf{x}_i \in t} d_i$,
where the sum is over all $d_i$ such that $\mathbf{x}_i \in t$ and N(t) is the total number of cases in t.
2) Define $E(t) = \dfrac{1}{N} \sum_{\mathbf{x}_i \in t} \left(d_i - \bar{d}(t)\right)^2$ and $E(T) = \sum_{t \in T} E(t)$.
The best split $s^*$ is then taken to be the particular split for which we have
$\Delta E(s^*, t) = \max_{s \in S} \Delta E(s, t)$.
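An illustrative numpy sketch of this split-selection rule for one node: every axis-aligned split is scored by the decrease in squared error, ΔE(s,t) = E(t) - E(t_L) - E(t_R), and the split with the largest decrease is returned. The constant 1/N factor is dropped since it does not affect the maximization; names are not from the lecture.

```python
import numpy as np

def sse(d):
    """Sum of squared deviations of the targets from the node mean d_bar(t)."""
    return float(np.sum((d - d.mean()) ** 2)) if d.size else 0.0

def best_split(X, d):
    """X: (N, p) inputs in node t, d: (N,) targets; returns (feature, threshold, Delta E)."""
    parent = sse(d)
    best = (None, None, -np.inf)
    for feat in range(X.shape[1]):
        for thr in np.unique(X[:, feat])[:-1]:       # keep both children non-empty
            left = X[:, feat] <= thr
            gain = parent - sse(d[left]) - sse(d[~left])
            if gain > best[2]:
                best = (feat, thr, gain)
    return best
```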
Some Theory
Determination of a terminal node: a node t is declared a terminal node if this condition is satisfied:
$\max_{s \in S} \Delta E(s, t) < \mathrm{th}$, where th is a prescribed threshold.
Least-squares estimation of a terminal node's parameters:
Let t denote a terminal node in the final binary tree T, and let $\mathbf{X}(t)$ denote the matrix composed of the $\mathbf{x}_i \in t$. Let $\mathbf{d}(t)$ denote the corresponding vector composed of all the $d_i$ in t. Define $\mathbf{w}(t) = \mathbf{X}^+(t)\,\mathbf{d}(t)$, where $\mathbf{X}^+(t)$ is the pseudoinverse of matrix $\mathbf{X}(t)$.
Using the weights calculated above, the split selection problem is solved by looking for the least sum of squared residuals with respect to the regression surface.
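The terminal-node estimate w(t) = X⁺(t) d(t) maps directly onto numpy's pseudoinverse; a minimal sketch, with X_t and d_t assumed to hold the cases falling into node t:

```python
import numpy as np

def terminal_node_weights(X_t, d_t):
    """Least-squares parameter vector w(t) = X^+(t) d(t) for a terminal node."""
    return np.linalg.pinv(X_t) @ d_t
```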
Some Theory
$g = \dfrac{1}{1 + \exp(-(\mathbf{a}^T \mathbf{x} + b))}$
Using CART to initialize the HME model:
1) Apply CART to the training data.
2) Set the synaptic weight vectors of the experts in the HME model equal to the least-squares estimates of the parameter vectors at the corresponding terminal nodes of the binary tree resulting from the application of CART.
3) For the gating networks:
a) set the synaptic weight vectors to point in directions that are orthogonal to the corresponding splits in the binary tree obtained from CART, and
b) set the lengths of the synaptic weight vectors to small random values.
Some Theory
A posteriori probabilities at the nonterminal nodes of the tree:
$h_k = \dfrac{g_k \sum_{j=1}^{2} g_{j|k} \exp\!\left(-\tfrac{1}{2}(d - y_{jk})^2\right)}{\sum_{k=1}^{2} g_k \sum_{j=1}^{2} g_{j|k} \exp\!\left(-\tfrac{1}{2}(d - y_{jk})^2\right)}$
$h_{j|k} = \dfrac{g_{j|k} \exp\!\left(-\tfrac{1}{2}(d - y_{jk})^2\right)}{\sum_{j=1}^{2} g_{j|k} \exp\!\left(-\tfrac{1}{2}(d - y_{jk})^2\right)}$
The joint a posteriori probability that expert (j,k) produces an output $y_{jk}$:
$h_{jk} = h_k h_{j|k} = \dfrac{g_k g_{j|k} \exp\!\left(-\tfrac{1}{2}(d - y_{jk})^2\right)}{\sum_{k=1}^{2} g_k \sum_{j=1}^{2} g_{j|k} \exp\!\left(-\tfrac{1}{2}(d - y_{jk})^2\right)}$
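A numpy sketch of these posterior computations for the two-level, two-expert-per-gate HME; the array layout for g, g_cond, and Y is an assumption made for the example.

```python
import numpy as np

def hme_posteriors(g, g_cond, Y, d):
    """g: (2,) top-gate outputs g_k; g_cond: (2, 2) with g_cond[k, j] = g_{j|k};
    Y: (2, 2) with Y[k, j] = y_{jk}; d: scalar desired response."""
    lik = np.exp(-0.5 * (d - Y) ** 2)              # Gaussian factor per expert (j, k)
    inner = np.sum(g_cond * lik, axis=1)           # sum_j g_{j|k} exp(-(d - y_jk)^2 / 2)
    h_k = g * inner / np.sum(g * inner)            # posterior at the top nonterminal node
    h_j_given_k = (g_cond * lik) / inner[:, None]  # posterior at nonterminal node k
    h_jk = h_k[:, None] * h_j_given_k              # joint posterior for expert (j, k)
    return h_k, h_j_given_k, h_jk
```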
Some Theory
2. Expectation-maximization approach
Expectation step (E-step), which uses the observed
data set of an incomplete data problem and the
current value of the parameter vector to
manufacture data so as to postulate an augmented
or so-called complete data set.
Maximization step (M-step), which consists of
deriving a new estimate of the parameter vector by
maximizing the log-likelihood function of the
complete data manufactured in the E-step.
Some Theory
$f_D(d_i \mid \mathbf{x}_i, \theta) = \dfrac{1}{\sqrt{2\pi}} \sum_{k=1}^{2} g_k^{(i)} \sum_{j=1}^{2} g_{j|k}^{(i)} \exp\!\left(-\tfrac{1}{2}\left(d_i - y_{jk}^{(i)}\right)^2\right)$
$L(\theta) = \log \prod_{i=1}^{N} f_D(d_i \mid \mathbf{x}_i, \theta)$
$L_c(\theta) = \log \prod_{i=1}^{N} f_c\!\left(d_i, z_{jk}^{(i)} \mid \mathbf{x}_i, \theta\right)$
$Q(\theta, \hat{\theta}(n)) = E[L_c(\theta)] = \sum_{i=1}^{N} \sum_{j=1}^{2} \sum_{k=1}^{2} h_{jk}^{(i)} \left(\log g_k^{(i)} + \log g_{j|k}^{(i)} - \tfrac{1}{2}\left(d_i - y_{jk}^{(i)}\right)^2\right)$
$\mathbf{w}_{jk}(n+1) = \arg\min_{\mathbf{w}_{jk}} \sum_{i=1}^{N} h_{jk}^{(i)} \left(d_i - y_{jk}^{(i)}\right)^2$
$\mathbf{a}_k(n+1) = \arg\max_{\mathbf{a}_k} \sum_{i=1}^{N} \sum_{k=1}^{2} h_k^{(i)} \log g_k^{(i)}$
$\mathbf{a}_{jk}(n+1) = \arg\max_{\mathbf{a}_{jk}} \sum_{i=1}^{N} \sum_{l=1}^{2} h_l^{(i)} \sum_{m=1}^{2} h_{m|l}^{(i)} \log g_{m|l}^{(i)}$
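A hedged sketch of the expert part of the M-step above: given the posteriors h_jk^(i) from the E-step (e.g. computed per example with hme_posteriors()), each expert's weight vector is the solution of a weighted least-squares problem. The gate updates for a_k and a_jk maximize weighted log-likelihoods and would typically need an inner iterative solver (e.g. IRLS), so they are omitted; all names are illustrative, not the lecture's code.

```python
import numpy as np

def m_step_experts(X, D, H):
    """Weighted least-squares update of the expert weights w_jk.
    X: (N, dim) inputs, D: (N,) targets, H: (N, 2, 2) posteriors with H[i, k, j] = h_jk^(i)."""
    W = np.zeros((2, 2, X.shape[1]))
    for k in range(2):
        for j in range(2):
            h = H[:, k, j]                                  # per-example weights
            Xw = X * h[:, None]
            # Solves arg min_w sum_i h_jk^(i) (d_i - w^T x_i)^2 via the normal equations.
            W[k, j] = np.linalg.pinv(Xw.T @ X) @ (Xw.T @ D)
    return W
```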
Summary
Ensemble averaging improves error performance by combining two effects:
a) purposely overfitting the individual experts
b) using different initial conditions in the training of the individual experts
Boosting improves error performance by filtering and resampling.