Bioinformatics Paper Presentation
From Birdsong to Human Speech Recognition: Bayesian Inference on a Hierarchy of Nonlinear Dynamical Systems
Main
Prerequisites
Structure of the ear
Importance of the cochlea:
1. It is a spiral-shaped peripheral organ in the inner ear.
2. It is the part of the auditory system that converts acoustic sound waves into neural signals.
3. Sound arriving through the ear canal sets the cochlea into vibration, which is converted into frequency-specific neural signals.
4. Frequency specificity comes from the graded stiffness of the basilar membrane, which runs along the cochlea.
5. The base of the basilar membrane is stiff and responds to higher frequencies, while the apex is more compliant and responds to lower frequencies.
Prerequisites
Cochleogram
Cochleogram representing the firing rate of auditory nerve fibres at each time point and frequency channel (Lyon's passive cochlear model with 86 channels).
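The paper's cochleograms come from Lyon's passive model; as a rough stand-in (not the paper's implementation), the sketch below computes a firing-rate-like time-frequency representation with a Butterworth band-pass filterbank followed by half-wave rectification and envelope smoothing. The channel spacing, bandwidths, and cut-off frequencies are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def cochleogram(signal, fs, n_channels=86, fmin=100.0, fmax=6000.0):
    """Crude cochleogram: band-pass filterbank + half-wave rectification
    + low-pass smoothing. A simplified stand-in for Lyon's passive model."""
    # Log-spaced centre frequencies, mimicking the cochlea's tonotopic map.
    centres = np.geomspace(fmin, fmax, n_channels)
    b_lp, a_lp = butter(2, 50.0 / (fs / 2))          # 50 Hz envelope smoothing
    channels = []
    for fc in centres:
        lo, hi = fc / 2 ** 0.25, fc * 2 ** 0.25       # half-octave band
        b, a = butter(2, [lo / (fs / 2), min(hi / (fs / 2), 0.99)], btype="band")
        band = lfilter(b, a, signal)
        rate = lfilter(b_lp, a_lp, np.maximum(band, 0.0))  # rectify + smooth
        channels.append(rate)
    return np.array(channels)                         # shape: (n_channels, n_samples)

# Example: a 440 Hz tone should activate the channels near 440 Hz.
fs = 16000
t = np.arange(0, 0.5, 1 / fs)
z = cochleogram(np.sin(2 * np.pi * 440 * t), fs)
```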
Model
Conceptual overview
1. The Bayesian approach builds a generative model, which is then inverted into a recognition and learning model.
2. Compared to other models it is hierarchically structured (two levels in this model), nonlinear, and dynamic, and can be tailored to one's specific needs.
3. It is more flexible than alternatives such as Markov models, deep belief networks, liquid state machines, TRACE, and Shortlist.
4. For this model, the firing patterns in the premotor area are considered.
Keyterms
Modules: a module is a mechanism based on Bayesian inference which can learn and recognize a single word. It is like a sophisticated template matcher, where the template is learned and stored in a hierarchically structured recurrent neural network and compared against a stimulus in an online fashion. Each module contains the two-level model described shortly.
Agent: a group of individual modules which together achieve a common classification task, such as word recognition. Here we show how the precision settings in agents are crucial for learning new stimuli and for recognizing sounds in noisy environments.
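One plausible way an agent could combine its modules for word recognition (a hedged sketch, not the paper's procedure: the template-difference score below is a crude stand-in for a module's online Bayesian recognition, and all names are made up): run every module on the stimulus and report the word whose module explains the cochleogram best, i.e. with the smallest accumulated prediction error.

```python
import numpy as np

def recognize(agent, cochleogram):
    """Hypothetical agent: each module is a callable that returns its
    accumulated prediction error for the stimulus; the agent reports the
    word whose module explains the input best (lowest error)."""
    scores = {word: module(cochleogram) for word, module in agent.items()}
    return min(scores, key=scores.get)

# Toy usage with stand-in "modules" that just compare against a stored template.
template_a = np.random.rand(86, 100)
template_b = np.random.rand(86, 100)
agent = {
    "zero": lambda z: np.sum((z - template_a) ** 2),
    "one":  lambda z: np.sum((z - template_b) ** 2),
}
print(recognize(agent, template_a))   # -> "zero"
```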
V is the set of causal states v(i) used to transmit output from level 2 to level 1; x, y, and v all include normally distributed noise terms to keep the model realistic.
The connectivity matrix ρ, with entries ρij giving the strength of inhibition exerted by neuron j on neuron i, is chosen such that the currently active neuron inhibits the previously active neuron strongly (switching it off) but inhibits the next neuron in the sequence only weakly (allowing it to become active).
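A minimal sketch of this asymmetric inhibition, assuming generalized Lotka-Volterra dynamics dx_i/dt = x_i (1 − Σ_j ρij x_j) as in stable heteroclinic channel models; the particular values (self-inhibition 1, weak 0.5, strong 1.5), the noise level, and the Euler integration are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def shc_connectivity(n, weak=0.5, strong=1.5):
    """Asymmetric inhibition matrix rho: rho[i, j] is the inhibition exerted
    by unit j on unit i. Weak inhibition onto the successor lets it grow;
    strong inhibition elsewhere suppresses the rest (illustrative values)."""
    rho = np.full((n, n), strong)
    np.fill_diagonal(rho, 1.0)
    for j in range(n):
        rho[(j + 1) % n, j] = weak        # unit j barely inhibits its successor
    return rho

def simulate(rho, T=4000, dt=0.01, noise=1e-3, seed=0):
    """Euler integration of dx_i/dt = x_i * (1 - (rho @ x)_i) + noise."""
    rng = np.random.default_rng(seed)
    n = rho.shape[0]
    x = np.full(n, 0.05)
    x[0] = 1.0                            # start near the first state
    traj = np.empty((T, n))
    for t in range(T):
        x = x + dt * x * (1.0 - rho @ x) + noise * rng.standard_normal(n)
        x = np.clip(x, 1e-6, None)        # keep activities positive
        traj[t] = x
    return traj

traj = simulate(shc_connectivity(6))
print(traj.argmax(axis=1)[::500])         # units become active in sequence
```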
Each second-level unit, called an ensemble, sends a signal Ik to the first level; the total signal to the first level is therefore the sum of these contributions, Σk Ik.
It can be seen that maximizing F(q, z) minimizes D(q||p), so that Q(u) approximates the posterior P(u|z, m). Here Q(u) is taken under the Laplace approximation, which assumes that Q(u) is Gaussian and fully described by its mode and covariance.
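For reference, the standard variational decomposition behind this statement, written in the notation used above (a textbook identity, not quoted from the paper):

```latex
\ln P(z \mid m) \;=\; F(q, z) \;+\; D_{\mathrm{KL}}\big(Q(u)\,\|\,P(u \mid z, m)\big),
\qquad
F(q, z) \;=\; \big\langle \ln P(z, u \mid m) \big\rangle_{Q(u)} \;-\; \big\langle \ln Q(u) \big\rangle_{Q(u)},
\qquad
Q(u) \;=\; \mathcal{N}(u;\, \mu, \Sigma)\ \text{(Laplace assumption)} .
```

Since the log evidence ln P(z|m) does not depend on Q, increasing F necessarily shrinks the KL term, which is why maximizing F yields the approximate posterior.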
Here we use the concept of precision: high precision means a prediction error has greater influence and only small deviations from expectations are tolerated, whereas low precision allows larger deviations. The above maximization of F can be written for the hierarchical setting as follows.
A message passing scheme can be used to find the optimal mode and variance of the states, where the optimization problem turns into a gradient descent on precision-weighted prediction errors, governed by the following equations.
High precision for a variable means its prediction error is amplified, so only small errors are tolerated; low precision allows more approximation and a greater tolerance for errors.
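A toy, one-dimensional illustration of this trade-off (the identity generative mapping, the precision values, and the learning rate are assumptions for illustration only): the state estimate follows a gradient descent on precision-weighted squared prediction errors, so high sensory precision pulls the estimate toward the data, while high prior precision keeps it near the prior expectation.

```python
def infer_state(z, mu_prior, pi_z, pi_w, lr=0.05, n_steps=200):
    """Toy precision-weighted inference on a single hidden state mu,
    assuming an identity generative mapping:
        E(mu) = 0.5*pi_z*(z - mu)**2 + 0.5*pi_w*(mu - mu_prior)**2
    Gradient descent on E gives the precision-weighted compromise
    between the sensory data and the prior expectation."""
    mu = mu_prior
    for _ in range(n_steps):
        eps_z = z - mu            # sensory prediction error
        eps_w = mu - mu_prior     # prior prediction error
        mu -= lr * (-pi_z * eps_z + pi_w * eps_w)
    return mu

# High sensory precision -> the estimate tracks the data closely.
print(infer_state(z=1.0, mu_prior=0.0, pi_z=10.0, pi_w=1.0))   # ~0.91
# High prior precision -> larger sensory errors are tolerated.
print(infer_state(z=1.0, mu_prior=0.0, pi_z=1.0, pi_w=10.0))   # ~0.09
```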
Implementation
The above Bayesian inference can be implemented neurobiologically using two types of neuronal ensembles.
The modes of the expected causal and hidden states are represented by the activity of state ensembles, while the prediction errors are encoded by the activity of error ensembles, with a one-to-one correspondence to the state ensembles. These messages are passed via forward and lateral connections, and error units can be identified with superficial pyramidal cells. This message passing scheme efficiently minimizes prediction errors and optimizes predictions at all levels, and the implementation uses academic freeware as its software backbone.
Results
Bayesian model for learning and online recognition of human speech
The main objectives of this model are:
Learning: the backward (feedback) parameters from the second level to the first level are allowed to change (a slower process).
Recognition: the parameters are fixed and the model only reconstructs the hidden dynamics (online).
Both levels of a module contain neuronal populations which encode expectations about the sensory (cochlear) input. These expectations predict the neuronal activity at the level below.
Error minimization
z(t) from the cochlear model is compared with the prediction of the first level, and the resulting prediction errors are computed and propagated to the second level. Both levels adjust their internal predictions accordingly, up to their assigned precisions. Similarly, the second level forms predictions which are sent down to the first level (only possible if the backward connections are appropriate).
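A schematic two-level version of this loop, with made-up linear mappings W1 and W2 standing in for the model's nonlinear dynamics and arbitrary precisions: level 1 compares the input z with its prediction, the precision-weighted errors are passed upward, and the second level's predictions constrain level 1 from above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear generative mappings (illustrative stand-ins only).
W1 = rng.standard_normal((8, 4)) * 0.3   # level 1 -> sensory prediction
W2 = rng.standard_normal((4, 2)) * 0.3   # level 2 -> level 1 prediction
pi_z, pi_1, pi_2 = 8.0, 2.0, 1.0         # precisions (illustrative values)

z = rng.standard_normal(8)               # stand-in for one cochleogram frame
mu1, mu2 = np.zeros(4), np.zeros(2)
lr = 0.02

for _ in range(500):
    eps_z = z - W1 @ mu1                 # sensory prediction error (level 1)
    eps_1 = mu1 - W2 @ mu2               # level-1 state error, sent upward
    eps_2 = mu2                          # deviation from a zero prior at level 2
    # Each level adjusts its expectations using the errors below and above it.
    mu1 += lr * (pi_z * W1.T @ eps_z - pi_1 * eps_1)
    mu2 += lr * (pi_1 * W2.T @ eps_1 - pi_2 * eps_2)

print(np.round(eps_z, 2))                # residual sensory prediction error
```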
Learning
Compared to recognition, learning is not online: it does not happen over the course of the stimulus. Rather, for learning, the prediction errors are summed over the whole stimulus duration and used after the stimulus presentation to update the parameters.
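A hedged sketch of this batch-style update (the linear mapping W stands in for the backward connections, and all names and values are illustrative): the gradient of the precision-weighted prediction error is accumulated frame by frame during the stimulus and applied only once, after the whole presentation.

```python
import numpy as np

def present_stimulus(W, stimulus, states, pi_z=1.0, lr=0.01):
    """One learning pass: accumulate precision-weighted prediction-error
    gradients over the whole stimulus, then update the parameters once
    after the presentation (toy linear stand-in for the backward connections)."""
    grad = np.zeros_like(W)
    for z_t, x_t in zip(stimulus, states):       # iterate over time frames
        eps = z_t - W @ x_t                      # prediction error at this frame
        grad += pi_z * np.outer(eps, x_t)        # accumulated, not applied yet
    return W + lr * grad                         # single update after the stimulus

# Toy usage: learn W over repeated presentations of the same stimulus.
rng = np.random.default_rng(1)
true_W = rng.standard_normal((6, 3))
states = rng.standard_normal((50, 3))            # assumed first-level trajectories
stimulus = states @ true_W.T                     # noise-free cochleogram frames
W = np.zeros((6, 3))
for repetition in range(6):                      # cf. five to six repetitions per word
    W = present_stimulus(W, stimulus, states, lr=0.01)
print(np.round(np.abs(W - true_W).mean(), 3))    # error shrinks with repetitions
```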
Testing
Learning speech
The relatively high precision forces each module to closely match the external stimulus, i.e., to minimize the prediction error about the sensory input, while allowing more prediction error on the internal dynamics. To reduce these prediction errors, each module is forced to adapt the backward connections from the second level to the first level, which are the free parameters of the model. This automatic optimization process iterates until the prediction error cannot be reduced any further and is typically completed after five to six repetitions of a word.
[Table residue: recognition results at signal-to-noise ratios of 30 dB, 20 dB, and 10 dB; only the values 3.6 (at 30 dB) and 11.2 (at 10 dB) are recoverable.]
The target sentence "She argues with her sister." was presented to a module without a background speaker, with one background speaker, and with three background speakers.
Accent adaptation
By adaptation we mean that the learning of the parameters in a module proceeds from a previously learned parameter set (the base accent), as opposed to learning from scratch as in the Learning speech simulation. Adaptation can therefore be understood as small changes to the backward connections instead of learning a completely new word.
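In code terms, adaptation differs from learning only in its starting point (a toy sketch using the same linear stand-in as the learning sketch above; all names and values are illustrative): the parameters are warm-started from the previously learned base-accent set, so only a few small corrections to the backward connections are needed.

```python
import numpy as np

def adapt(W_base, stimulus, states, n_repetitions=3, pi_z=1.0, lr=0.005):
    """Adaptation = learning warm-started from an already learned parameter
    set (the base accent) rather than from scratch, so only small corrections
    to the backward connections are needed (toy linear stand-in)."""
    W = W_base.copy()
    for _ in range(n_repetitions):
        grad = np.zeros_like(W)
        for z_t, x_t in zip(stimulus, states):
            grad += pi_z * np.outer(z_t - W @ x_t, x_t)
        W += lr * grad                    # same batch update as in learning
    return W
```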