2806 Neural Computation Committee Machines: 2005 Ari Visa
Committee Machines
Lecture 7
2005 Ari Visa
Agenda
Some historical notes
Some theory
Committee Machines
Conclusions
Some Theory
Ensemble Averaging: a number of differently trained neural networks, which share a common input and whose individual outputs are somehow combined to produce an overall output.
Some Theory
The bias of the ensemble-averaged function F_I(x), pertaining to the committee machine, is exactly the same as that of the function F(x) pertaining to a single neural network.
The variance of the ensemble-averaged function F_I(x) is less than that of the function F(x).
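As a rough illustration of this bias/variance claim, the following sketch (not from the lecture) fits several polynomial "experts" and averages their outputs; the averaged committee output F_I(x) shows a smaller variance across repeated experiments than a single expert F(x). For simplicity each expert here sees a different noise realization of the training data, standing in for the differently trained networks of the lecture; all names are illustrative.

```python
# Minimal numpy sketch of ensemble averaging and its variance-reduction effect.
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)

def fit_expert(x, d, degree=5):
    """One 'expert': a polynomial least-squares fit, standing in for a trained network."""
    return np.polyfit(x, d, degree)

x_train = np.linspace(0.0, 1.0, 30)
x_test = np.linspace(0.0, 1.0, 200)
K = 10                                      # committee size

single_runs, committee_runs = [], []
for trial in range(200):                    # repeat the experiment to estimate variance
    experts = [fit_expert(x_train,
                          target(x_train) + 0.3 * rng.standard_normal(x_train.size))
               for _ in range(K)]
    outputs = np.array([np.polyval(w, x_test) for w in experts])
    single_runs.append(outputs[0])          # F(x): a single expert
    committee_runs.append(outputs.mean(0))  # F_I(x): ensemble average

print("mean variance, single expert   :", np.var(np.array(single_runs), axis=0).mean())
print("mean variance, ensemble average:", np.var(np.array(committee_runs), axis=0).mean())
```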
Some Theory
Classification: If the
first and the second
experts in the
committee agree in
their respective
decisions, that class
label is used.
Otherwise, the class
label discovered by the
third expert is used.
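A hedged sketch of this three-expert rule (boosting by filtering); the expert objects and their predict() method are assumptions for the example, not part of the lecture.

```python
# Decision rule for a committee of three experts: use the label agreed on by
# the first two experts, otherwise fall back on the third expert's label.
def committee_label(x, expert1, expert2, expert3):
    c1, c2 = expert1.predict(x), expert2.predict(x)
    return c1 if c1 == c2 else expert3.predict(x)
```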
Some Theory
AdaBoost (boosting by
resampling -> batch
learning)
AdaBoost adjusts
adaptively to the error of
the weak hypothesis
returned by the weak
learning model.
When the number of
possible classes (labels) is
M>2, the boosting
problem becomes more
intricate.
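To make the adaptive behaviour concrete, here is a compact, illustrative numpy implementation of binary AdaBoost with decision stumps as weak learners (labels assumed to be in {-1, +1}); it is a sketch of the standard algorithm, not the lecture's own code.

```python
import numpy as np

def train_stump(X, y, w):
    """Weighted decision stump: threshold one feature; returns (feature, threshold, sign, error)."""
    best = (0, 0.0, 1, np.inf)
    for feat in range(X.shape[1]):
        for thr in np.unique(X[:, feat]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, feat] <= thr, 1, -1)
                err = np.sum(w[pred != y])          # weighted error of the weak hypothesis
                if err < best[3]:
                    best = (feat, thr, sign, err)
    return best

def adaboost(X, y, T=20):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                         # boosting distribution over examples
    ensemble = []
    for t in range(T):
        feat, thr, sign, err = train_stump(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)       # vote adapted to the weak error
        pred = sign * np.where(X[:, feat] <= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)              # emphasize misclassified examples
        w /= w.sum()
        ensemble.append((alpha, feat, thr, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, f] <= t, 1, -1) for a, f, t, s in ensemble)
    return np.sign(score)                           # weighted majority vote
```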
Some Theory
Mixture of Experts (ME) Model:
$y_k = \mathbf{w}_k^T \mathbf{x}$, $k = 1, 2, \ldots, K$
The gating network consists
of a single layer of K
neurons, with each neuron
assigned to a specific
expert.
Some Theory
The neurons of the gating
network are nonlinear.
$g_k = \dfrac{\exp(u_k)}{\sum_{j=1}^{K} \exp(u_j)}$, $k = 1, 2, \ldots, K$, with $u_k = \mathbf{a}_k^T \mathbf{x}$ ->
softmax
The gating network is a
classifier that maps the
input x into multinomial
probabilities -> The
different experts will be
able to match the desired
response.
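A minimal sketch of such a gating network, assuming a weight matrix A whose rows are the vectors a_k (an illustrative layout, not from the lecture):

```python
# Softmax gating network: K linear units u_k = a_k^T x normalized into
# multinomial probabilities g_k over the K experts.
import numpy as np

def gating(A, x):
    """A: (K, dim) matrix whose rows are the gate weight vectors a_k; x: (dim,) input."""
    u = A @ x                          # u_k = a_k^T x
    u = u - u.max()                    # subtract max for numerical stability
    g = np.exp(u)
    return g / g.sum()                 # g_k >= 0 and sum_k g_k = 1
```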
Some Theory
$f_D(d \mid \mathbf{x}, \theta) = \sum_{k=1}^{K} g_k\, f_D(d \mid \mathbf{x}, k, \theta) = \dfrac{1}{\sqrt{2\pi}} \sum_{k=1}^{K} g_k \exp\!\left(-\tfrac{1}{2}(d - y_k)^2\right)$
associative Gaussian mixture model
Given the training sample $\{(\mathbf{x}_i, d_i)\}_{i=1}^{N}$, the problem is to learn the conditional means $\mu_k = y_k$ and the mixing parameters $g_k$, $k = 1, 2, \ldots, K$, in an optimum manner, so that $f_D(d \mid \mathbf{x}, \theta)$ provides a good estimate of the underlying probability density function of the environment responsible for generating the training data.
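A short illustrative sketch of this associative Gaussian mixture density for one input, reusing the gating() helper above; W is assumed to hold the expert weight vectors w_k as rows.

```python
# Conditional density f_D(d | x, theta) of the mixture-of-experts model with
# linear experts y_k = w_k^T x and unit-variance Gaussian components.
import numpy as np

def moe_density(W, A, x, d):
    """W: (K, dim) expert weights, A: (K, dim) gate weights, x: (dim,), d: scalar target."""
    y = W @ x                                        # expert outputs y_k
    g = gating(A, x)                                 # mixing coefficients g_k
    comps = np.exp(-0.5 * (d - y) ** 2) / np.sqrt(2 * np.pi)
    return float(np.dot(g, comps))                   # f_D(d | x, theta)
```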
Some Theory
Hierarchical Mixture of
Experts Model
HME is a natural extension of the
ME model.
The HME model differs from the
ME model in that the input
space is divided into a nested
set of subspaces, with the
information combined and
redistributed among the experts
under the control of several
gating networks arranged in a
hierarchical manner.
Some Theory
Classification and regression tree (CART, Breiman et al. 1984)
1) Selection of splits: let a node t denote a subset of the current tree T. Let $\bar{d}(t)$ denote the average of $d_i$ for all cases $(\mathbf{x}_i, d_i)$ falling into t, that is,
$\bar{d}(t) = \dfrac{1}{N(t)} \sum_{\mathbf{x}_i \in t} d_i$,
where the sum is over all $d_i$ such that $\mathbf{x}_i \in t$ and N(t) is the total number of cases in t.
2) Define $E(t) = \dfrac{1}{N} \sum_{\mathbf{x}_i \in t} \left(d_i - \bar{d}(t)\right)^2$ and $E(T) = \sum_{t \in T} E(t)$.
The best split $s^*$ is then taken to be the particular split for which we have
$\Delta E(s^*, t) = \max_{s \in S} \Delta E(s, t)$.
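An illustrative numpy sketch of this split-selection rule for one node: every axis-aligned split is scored by the decrease in squared error, ΔE(s,t) = E(t) - E(t_L) - E(t_R), and the split with the largest decrease is returned. The constant 1/N factor is dropped since it does not affect the maximization; names are not from the lecture.

```python
import numpy as np

def sse(d):
    """Sum of squared deviations of the targets from the node mean d_bar(t)."""
    return float(np.sum((d - d.mean()) ** 2)) if d.size else 0.0

def best_split(X, d):
    """X: (N, p) inputs in node t, d: (N,) targets; returns (feature, threshold, Delta E)."""
    parent = sse(d)
    best = (None, None, -np.inf)
    for feat in range(X.shape[1]):
        for thr in np.unique(X[:, feat])[:-1]:       # keep both children non-empty
            left = X[:, feat] <= thr
            gain = parent - sse(d[left]) - sse(d[~left])
            if gain > best[2]:
                best = (feat, thr, gain)
    return best
```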
Some Theory
Determination of a terminal node: a node t is declared a terminal node if this condition is satisfied:
$\max_{s \in S} \Delta E(s, t) < \mathrm{th}$, where th is a prescribed threshold.
Least-squares estimation of a terminal node's parameters:
Let t denote a terminal node in the final binary tree T, and let $\mathbf{X}(t)$ denote the matrix composed of the $\mathbf{x}_i \in t$. Let $\mathbf{d}(t)$ denote the corresponding vector composed of all the $d_i$ in t. Define $\mathbf{w}(t) = \mathbf{X}^+(t)\,\mathbf{d}(t)$, where $\mathbf{X}^+(t)$ is the pseudoinverse of matrix $\mathbf{X}(t)$.
Using the weights calculated above, the split selection problem is solved by looking for the least sum of squared residuals with respect to the regression surface.
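The terminal-node estimate w(t) = X⁺(t) d(t) maps directly onto numpy's pseudoinverse; a minimal sketch, with X_t and d_t assumed to hold the cases falling into node t:

```python
import numpy as np

def terminal_node_weights(X_t, d_t):
    """Least-squares parameter vector w(t) = X^+(t) d(t) for a terminal node."""
    return np.linalg.pinv(X_t) @ d_t
```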
Some Theory
$g = \dfrac{1}{1 + \exp(-(\mathbf{a}^T \mathbf{x} + b))}$
Using CART to initialize the HME model:
1) Apply CART to the training data.
2) Set the synaptic weight vectors of the experts in the HME model equal to the least-squares estimates of the parameter vectors at the corresponding terminal nodes of the binary tree resulting from the application of CART.
3) For the gating networks:
a) set the synaptic weight vectors to point in directions that are orthogonal to the corresponding splits in the binary tree obtained from CART, and
b) set the lengths of the synaptic weight vectors to small random values.
Some Theory
A posteriori probabilities at the nonterminal nodes of the tree:
$h_k = \dfrac{g_k \sum_{j=1}^{2} g_{j|k} \exp\!\left(-\tfrac{1}{2}(d - y_{jk})^2\right)}{\sum_{k=1}^{2} g_k \sum_{j=1}^{2} g_{j|k} \exp\!\left(-\tfrac{1}{2}(d - y_{jk})^2\right)}$
$h_{j|k} = \dfrac{g_{j|k} \exp\!\left(-\tfrac{1}{2}(d - y_{jk})^2\right)}{\sum_{j=1}^{2} g_{j|k} \exp\!\left(-\tfrac{1}{2}(d - y_{jk})^2\right)}$
The joint a posteriori probability that expert (j,k) produces an output $y_{jk}$:
$h_{jk} = h_k h_{j|k} = \dfrac{g_k g_{j|k} \exp\!\left(-\tfrac{1}{2}(d - y_{jk})^2\right)}{\sum_{k=1}^{2} g_k \sum_{j=1}^{2} g_{j|k} \exp\!\left(-\tfrac{1}{2}(d - y_{jk})^2\right)}$
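A numpy sketch of these posterior computations for the two-level, two-expert-per-gate HME; the array layout for g, g_cond, and Y is an assumption made for the example.

```python
import numpy as np

def hme_posteriors(g, g_cond, Y, d):
    """g: (2,) top-gate outputs g_k; g_cond: (2, 2) with g_cond[k, j] = g_{j|k};
    Y: (2, 2) with Y[k, j] = y_{jk}; d: scalar desired response."""
    lik = np.exp(-0.5 * (d - Y) ** 2)              # Gaussian factor per expert (j, k)
    inner = np.sum(g_cond * lik, axis=1)           # sum_j g_{j|k} exp(-(d - y_jk)^2 / 2)
    h_k = g * inner / np.sum(g * inner)            # posterior at the top nonterminal node
    h_j_given_k = (g_cond * lik) / inner[:, None]  # posterior at nonterminal node k
    h_jk = h_k[:, None] * h_j_given_k              # joint posterior for expert (j, k)
    return h_k, h_j_given_k, h_jk
```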
Some Theory
2. Expectation-maximization approach
Expectation step (E-step), which uses the observed
data set of an incomplete data problem and the
current value of the parameter vector to
manufacture data so as to postulate an augmented
or so-called complete data set.
Maximization step (M-step), which consists of
deriving a new estimate of the parameter vector by
maximizing the log-likelihood function of the
complete data manufactured in the E-step.
Some Theory
$f_D(d_i \mid \mathbf{x}_i, \theta) = \dfrac{1}{\sqrt{2\pi}} \sum_{k=1}^{2} g_k^{(i)} \sum_{j=1}^{2} g_{j|k}^{(i)} \exp\!\left(-\tfrac{1}{2}\left(d_i - y_{jk}^{(i)}\right)^2\right)$
$L(\theta) = \log \prod_{i=1}^{N} f_D(d_i \mid \mathbf{x}_i, \theta)$
$L_c(\theta) = \log \prod_{i=1}^{N} f_c\!\left(d_i, z_{jk}^{(i)} \mid \mathbf{x}_i, \theta\right)$
$Q(\theta, \hat{\theta}(n)) = E[L_c(\theta)] = \sum_{i=1}^{N} \sum_{j=1}^{2} \sum_{k=1}^{2} h_{jk}^{(i)} \left(\log g_k^{(i)} + \log g_{j|k}^{(i)} - \tfrac{1}{2}\left(d_i - y_{jk}^{(i)}\right)^2\right)$
$\mathbf{w}_{jk}(n+1) = \arg\min_{\mathbf{w}_{jk}} \sum_{i=1}^{N} h_{jk}^{(i)} \left(d_i - y_{jk}^{(i)}\right)^2$
$\mathbf{a}_k(n+1) = \arg\max_{\mathbf{a}_k} \sum_{i=1}^{N} \sum_{k=1}^{2} h_k^{(i)} \log g_k^{(i)}$
$\mathbf{a}_{jk}(n+1) = \arg\max_{\mathbf{a}_{jk}} \sum_{i=1}^{N} \sum_{l=1}^{2} h_l^{(i)} \sum_{m=1}^{2} h_{m|l}^{(i)} \log g_{m|l}^{(i)}$
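A hedged sketch of the expert part of the M-step above: given the posteriors h_jk^(i) from the E-step (e.g. computed per example with hme_posteriors()), each expert's weight vector is the solution of a weighted least-squares problem. The gate updates for a_k and a_jk maximize weighted log-likelihoods and would typically need an inner iterative solver (e.g. IRLS), so they are omitted; all names are illustrative, not the lecture's code.

```python
import numpy as np

def m_step_experts(X, D, H):
    """Weighted least-squares update of the expert weights w_jk.
    X: (N, dim) inputs, D: (N,) targets, H: (N, 2, 2) posteriors with H[i, k, j] = h_jk^(i)."""
    W = np.zeros((2, 2, X.shape[1]))
    for k in range(2):
        for j in range(2):
            h = H[:, k, j]                                  # per-example weights
            Xw = X * h[:, None]
            # Solves arg min_w sum_i h_jk^(i) (d_i - w^T x_i)^2 via the normal equations.
            W[k, j] = np.linalg.pinv(Xw.T @ X) @ (Xw.T @ D)
    return W
```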
Summary
Ensemble averaging improves error performance by combining two effects:
a) purposely overfitting the individual experts
b) using different initial conditions in the training of the individual experts
Boosting improves error performance by filtering and resampling.