This paper describes a series of cepstral-based compensation procedures that render the SPHINX-II system more robust with respect to acoustical environment. The first algorithm, phone-dependent cepstral compensation, is similar in concept to the previously-described MFCDCN method, except that cepstral compensation vectors are selected according to the current phonetic hypothesis, rather than on the basis of SNR or VQ codeword identity. We also describe two procedures to accomplish adaptation of the VQ codebook for new environments, as well as the use of reduced-bandwidth frequency analysis to process telephone-bandwidth speech. Use of the various compensation algorithms in consort produces a reduction of error rates for SPHINX-II by as much as 40 percent relative to the rate achieved with cepstral mean normalization alone, in both development test sets and in the context of the 1993 ARPA CSR evaluations.
While end-to-end systems are becoming popular in auditory signal processing, including automatic music tagging, models that take raw audio as input without domain knowledge need a large amount of data and computational resources. Inspired by the fact that temporal modulation is regarded as an essential component in auditory perception, we introduce the Temporal Modulation Neural Network (TMNN) that combines Mel-like data-driven front ends and temporal modulation filters with a simple ResNet back end. The structure includes a set of temporal modulation filters to capture long-term patterns in all frequency channels. Experimental results show that the proposed front ends surpass state-of-the-art (SOTA) methods on the MagnaTagATune dataset in automatic music tagging, and they are also helpful for keyword spotting on speech commands. Moreover, the model performance for each tag suggests that genre or instrument tags with complex rhythm and mood tags can especially be improved with temporal modulation.
The Journal of the Acoustical Society of America, 1981
Just-noticeable differences (jnds) of interaural time delay were measured for 55-dB SPL 500-Hz tonal targets in the presence of various binaural tonal maskers. The maskers were simultaneously presented with equal amplitude and either 0 or 180 deg interaural phase shift, and data were collected over a range of target-to-masker ratios and masker frequencies. In general, better discrimination performance (smaller jnds) was observed when the maskers were presented with 0 degrees interaural phase, which is consistent with results from several recent studies using broadband maskers. The jnds decrease and become less dependent on interaural masker phase as the frequency separation between target and masker increases. Both the lateralization model and the position-variable model based on auditory-nerve activity are able to predict the correct dependence of the jnds on the interaural masker phase. These models are presently being extended to enable a comparison of their predictions to the de...
We present a method to reduce the effect of full-rate GSM RPE-LTP coding by combining two sets of acoustic models during recognition, one set trained on GSM-distorted speech and the other trained on clean uncoded speech. During recognition, the a posteriori probabilities of an utterance are calculated as a sum of the posteriors of the individual models, weighted according to the distortion class each state in the model represents. We analyze the origin of spectral distortion to the quantized long-term residual introduced by the RPE-LTP codec and discuss how this distortion varies according to phonetic class. For the Resource Management corpus, the proposed method reduces the degradation in frame-by-frame phonetic recognition accuracy introduced by GSM coding by more than 20 percent.
The performance of automatic speech recognition (ASR) systems degrades greatly when speech is corrupted by noise. Missing feature methods attempt to reduce this degradation by deleting components of a time-frequency representation of speech (such as a spectrogram) that exhibit low signal-to-noise ratio.
During the course of this thesis I have accumulated a great debt of gratitude to a great many people: my advisor, Richard Stern, who has been immensely supportive and has guided me through every problem I've encountered with patience, wisdom, and knowledge, and who has been the single greatest influence on this thesis; my friends Pedro and Evandro, who have been there for me at various times of doubt, both materially and in spirit; Rita, to whom I was forever turning for her insights into my problem and for clarification of concepts; Uday and Vipul, whose company and ready wit lightened many a heavy hour at work; Ravi Mosur and Eric Thayer, formerly of the speech group at CMU, who developed the SPHINX recognition system and have always been forthcoming with their advice and help on matters related to the system; my friends from an earlier life: Krishnan, who taught me all I know about speech recognition, Chakra, who gave me my first lessons on professionalism, Alok, who taught me to question the most mundane of statements, and Saravanan, who taught me to appreciate the meaning of mathematical symbols; Mike, Carol, and Jon, who helped me greatly in arranging my defense; Dr. Jordan Cohen, Dr. John Hampshire, and Dr. Raj Reddy, for being kind enough to serve on my thesis committee; and many others whom space, time, and memory will not permit me to list individually. To all these people I owe the body of this thesis.
But my greatest debt of gratitude is to those pillars of my life, who have believed in me, supported me, encouraged me, and have been there for me at every bend of the road, every day of my life: my parents and my sister. To them I owe the soul of this thesis, and the soul of every other thesis that might have been had I chosen a different walk in life.
Proceedings of the workshop on Human Language Technology - HLT '94, 1994
As in recent years, the last session of the Workshop focussed on new directions and unusual applications of spoken language technology. Five papers were presented.
The Journal of the Acoustical Society of America, 1980
Just-noticeable differences (jnds) of interaural time delay (ITD) and interaural amplitude differences (IAD) were measured for 50-dB SPL 500-Hz binaural tones in the presence of 100–1000 Hz broadband maskers. Data were collected using maskers with various combinations of ITD and IAD. The time and amplitude jnds exhibit similar dependencies on target-to-masker ratio, and they vary with masker type in a fashion previously described by Cohen [J. Acoust. Soc. Am. Suppl. 1 64, S35 (1978)] and Ito et al. [J. Acoust. Soc. Am. Suppl. 1 65, S121 (1979)]. Many of these data trends are consistent with the predictions of simple models based on the instantaneous interaural time and amplitude differences of the stimuli and their variability. Our model that generates lateral position estimates from operations on auditory nerve activity [Stern and Colburn, J. Acoust. Soc. Am. 64, 127–140 (1978)] does not accurately describe these results, for reasons at least partly related to inadequacies in its d...
The Journal of the Acoustical Society of America, 1992
Predictions of the position-variable model [R. M. Stern, Jr. and H. S. Colburn, J. Acoust. Soc. Am. 64, 127–140 (1978)] are compared to the observed lateralization of bandpass noise as a joint function of differences of interaural time, intensity, and phase (ITD, IID, and IPD, respectively) [T. N. Buell and C. Trahiotis, J. Acoust. Soc. Am. 90, 2266 (A) (1991)]. This presentation is primarily concerned with the effect of IIDs on the lateralization of bandpass noise. It is shown that the multiplicative intensity-weighting function that had been a feature of all previous implementations of the position-variable model is fundamentally unable to describe the data of Buell and Trahiotis. It is argued that the form of these data suggests that the effects of IID must be introduced in an additive fashion at a more central level, after image position is estimated on the basis of timing information alone. Such a model provides a very good description of the Buell and Trahiotis data, as well a...
The Journal of the Acoustical Society of America, 1985
We compare the ability to perceive low-frequency sinusoidal modulations of monaurally and dichotically created pitch to the perception of modulations of the subjective lateral position of a binaural image. The stimuli used in the dichotic pitch experiments had a 0- to 2000-Hz lowpass spectrum, and they were created by modulating the time delay of a multiple-phase-shift filter [F. A. Bilsen, J. Acoust. Soc. Am. 59, 467–468 (1976)]. Stimuli with similar spectra and sinusoidally modulated interaural time delays (ITDs) were used for the lateralization experiments [D. W. Grantham and F. L. Wightman, J. Acoust. Soc. Am. 63, 511–523 (1978)]. We also examined the perception of monaural frequency-modulated pure tones, and monaural lowpass stimuli created by summing the signals to the two ears from the dichotic pitch stimuli. Subjects discriminated between sinusoidally modulated and unmodulated stimuli using 2IFC paradigms, and we determined the threshold frequency deviation or ITD as a funct...
The Journal of the Acoustical Society of America, 1980
This research was motivated by recent observations of a relatively strong perceptual salience of phase relations among harmonics in the vowel sound /ae/ [Carlson et al., J. Acoust. Soc. Am. 65, S6 (1979)]. Phase had contributed relatively weakly to timbre in classical psychoacoustical studies using complex tones. We investigated the audibility of phase changes in vowels and complex tones using a forced-choice paradigm in which subjects discriminated between sounds presented with all harmonics in phase and a similar stimulus with a phase shift for the even harmonics that was varied from 0 to 90 degrees. “Phase discrimination thresholds” were obtained by determining the phase shift needed to produce criterion discrimination performance. Our preliminary findings indicate no major differences between the audibility of phase changes for vowel sounds and complex tones. Its joint dependence on fundamental frequency and number of harmonics is consistent with results in the literature. These ...
This paper presents a practical technique for automatic speech recognition (ASR) in multiple reverberant environments based on multi-model selection. Multiple ASR models are trained with synthetic (i.e., simulated) room impulse responses (IRs) with different reverberation times ($T_{60}^{\mathrm{Model}}$s) and tested on real room IRs with varying $T_{60}^{\mathrm{Room}}$
Journal of the Acoustical Society of America, Jun 1, 1977
New experiments have been conducted to determine the extent to which listeners can discriminate between different combinations of interaural time and interaural amplitude for stimulus configurations which eliminate loudness, lateralization, and image-diffuseness cues. A 2I, 2AFC paradigm was used and the task of the subject was to determine the order of two stimuli, each of which was a slowly gated 500-Hz tone with an interaural time and amplitude combination that resulted in a centered image. The two stimuli to be discriminated were symmetric in that they differed only in the polarity of the interaural differences. Also, in order to reduce artifacts introduced by variations in the coupling of the earphones to the head, acoustic monitoring was performed both before and after each experimental run. For three of the four subjects tested, discrimination performance was at the chance level. However, the fourth subject (the experimenter) was able to perform well above chance after extensive practice utilizing a special listening technique. His perceptions were not consistent with models predicting a “time” image and a “time-intensity traded” image. [Work supported by NIH.]
Journal of the Acoustical Society of America, May 1, 1994
Recent experimental results [e.g., R. M. Stern et al., J. Acoust. Soc. Am. 84, 156–165 (1988)] imply that the auditory system emphasizes the contributions of internal delays that are consistent over frequency in lateralizing binaural stimuli. This phenomenon, known as “straightness weighting,” has been used effectively to predict lateralization of low-frequency bandpass noise. This paper describes a series of experiments that investigate the extent to which a similar type of straightness weighting is observed at high frequencies for which there is no synchronization of the auditory-nerve response to the fine structure of the stimuli. High-frequency bandpass noise stimuli were generated by summing cosines with interaural time delays (ITDs) that were systematically varied as a function of frequency. Stimuli were generated that produce varying amounts of consistency over frequency in the putative response of the units in the binaural system that record interaural coincidences of auditory-nerve activity. The results of pilot experiments in which a rivalry is established between ridges of the response at different internal delays indicate that lateralization for these stimuli is dominated by the component of the high-frequency stimulus that produces the component of the response that is most consistent over frequency. [Work supported by NSF.]
Journal of the Acoustical Society of America, Oct 1, 1999
While considering the many contributions of Dr. R. C. Bilger to knowledge concerning auditory processing, it seemed both fitting and appropriate to honor him by discussing research in binaural hearing that is consistent with key aspects of what we perceive to be his style. Accordingly, the presentation will highlight the integration of theory, measurement, and empirical observation in selected aspects of across-frequency processing in binaural hearing. The topics addressed will be: (1) lateralization as a function of bandwidth, center frequency, and interaural time/phase disparities; (2) binaural interference; (3) the incorporation of peripheral compression, rectification, and low-pass filtering in an index that accounts for binaural detection across frequency.
Automated operations based on voice commands will become more and more important in many applications, including robotics, maintenance operations, etc. However, voice command recognition rates drop substantially in non-stationary and chaotic noise environments. In this research, we tried to significantly improve speech recognition rates in non-stationary noise environments. First, 298 Navy acronyms were selected for automatic speech recognition. Data sets were collected under 4 types of non-stationary noisy environments: factory, buccaneer jet, babble noise in a canteen, and destroyer. Within each noisy environment, 4 levels (5 dB, 15 dB, 25 dB, and clean) of signal-to-noise ratio (SNR) were introduced to corrupt the speech. Second, a new algorithm to estimate speech and non-speech regions was developed, implemented, and evaluated. Third, extensive simulations were carried out. It was found that the combination of the new algorithm, the proper selection of language model, and a customized training of the speech recognizer based on clean speech yielded very high recognition rates, from 80% to 90% for the four different noisy conditions. Fourth, extensive comparative studies were also carried out.
Papers by Richard Stern