This document proposes a novel i-vector based convolutive non-negative matrix factorization (CNMF) approach for noise-robust automatic speech recognition (ASR). It aims to learn robust speech representations that are invariant to speaker environment and recording conditions. The approach uses CNMF to represent spectrograms as a product of a basis speech dictionary and time activations. It then learns the total variability matrix to model noise and channel variations in the speech dictionary, leading to features invariant to noisy environments. The method is evaluated on the Aurora 4 noisy speech database, and results show the i-vector CNMF approach outperforms traditional filterbank features for noise-robust ASR. Future work includes hyperparameter optimization and integrating the features into a GMM-DNN ASR system.
Kunal Dhawan¹, Colin Vaz², Ruchir Travadi² and Shrikanth Narayanan²
¹ Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, Assam, India
² Signal Analysis and Interpretation Lab, University of Southern California, Los Angeles, CA 90089
➢ Problem: Commonly used acoustic features for Automatic Speech Recognition (ASR) are not robust to noise
    Degrades ASR performance in noisy conditions
➢ Aim: Learn a robust representation of speech which is invariant to speaker environment and recording conditions

Background

1) Non-Negative Matrix Factorization (NMF)

An algorithm to approximate a non-negative matrix $V \in \mathbb{R}_{\ge 0}^{M \times N}$ as a product of two non-negative matrices $W \in \mathbb{R}_{\ge 0}^{M \times R}$ and $H \in \mathbb{R}_{\ge 0}^{R \times N}$, where $R \le M$. To minimize the error of reconstruction, decomposition is done according to an adapted Kullback-Leibler divergence cost metric:

$$D(V \,\|\, WH) = \sum_{i,j} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right)$$

This idea can be used in ASR to represent a spectrogram as a product of a basis dictionary and corresponding time activations:

[Figure: Spectrogram (V) = Speech Dict. (W) × Time Activations (H)]
    Spectrogram (V): spectro-temporal representation of a given speech sample
    Speech Dict. (W): learns a set of basis elements having different frequency characteristics
    Time Activations (H): chooses the appropriate basis elements for each time frame in a given utterance

2) Total Variability approach

Model the variations in a dataset as a low-dimensional vector.

[Figure: Complete Data distribution (UBM) and Utterance Specific Data Distribution (means slightly shifted from UBM)]

$$s = m + T x$$

    s: Speaker & Channel dependent dictionary supervector
    m: UBM supervector (Speaker & Channel independent dictionary supervector)
    T: Total Variability Matrix (rectangular matrix of low rank which models the variation space)
    x: i-Vector (a random vector having a standard normal distribution)

Step 1: Learning UBM Speech Dictionary

Step 2: Calculating the 0th and 1st order sufficient statistics, estimating the mean and covariance supervectors

Step 3: Learning the Total Variability Matrix
    The T matrix captures the variations in the training set about the UBM and hence defines the space where the speech dictionaries will adapt

Step 4: Calculating the adapted dictionaries for the training set and hence find corresponding time activations

Feature Extraction Step:
[Figure: feature corresponding to given utterance]

Step 5: Extract features for the test set using the same method as in step 4

Used the Aurora 4 database, which has the following characteristics:
    7 noise types: airport, babble, car, clean, restaurant, street, train
    2 microphone position conditions: near field, far field
    Training set of 7138 utterances, each approximately 8 seconds long (noise is not labeled)
    Test set of 330 utterances for each noise+channel type (∴ 7*2=14 different conditions)

Result and Discussion

The i-vector approach presented is very effective in modeling the noise type: [Figure]

Learning the noise and channel variations in the speech dictionary itself leads to features which are invariant to the noisy environment: [Figure]

[Figure labels: Clean Speech, Car Noise, Airport Noise, Babble Noise]

Conclusion and Future Work

⭐ Current system outperforms the classical filterbank features.
⭐ Currently performing a grid search over the hyperparameters (sparsity of the dictionary, number of basis elements to choose) with the final aim of building GMM-DNN ASR system.
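
The NMF decomposition described in the Background (approximating a non-negative matrix V by a product of a dictionary W and activations H under a Kullback-Leibler divergence cost) can be illustrated with a short sketch. This is not the authors' implementation: it uses plain (non-convolutive) NMF with the standard multiplicative update rules, a random matrix stands in for a spectrogram, and the names `nmf_kl` and `kl_divergence`, along with all parameter values, are assumptions chosen for illustration.

```python
import numpy as np

def kl_divergence(V, WH, eps=1e-12):
    """Generalized KL divergence D(V || WH), summed over all entries."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))

def nmf_kl(V, R, n_iter=200, seed=0, eps=1e-12):
    """Factor non-negative V (M x N) into W (M x R) and H (R x N).

    Uses multiplicative updates for the KL cost; this is plain NMF,
    not the convolutive variant (CNMF) used on the poster.
    """
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, R)) + eps   # hypothetical random initialization
    H = rng.random((R, N)) + eps
    history = []
    for _ in range(n_iter):
        # Update activations H with the dictionary W held fixed.
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        # Update dictionary W with the activations H held fixed.
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
        history.append(kl_divergence(V, W @ H))
    return W, H, history

if __name__ == "__main__":
    # Synthetic non-negative matrix standing in for a magnitude spectrogram.
    V = np.random.default_rng(1).random((20, 30))
    W, H, history = nmf_kl(V, R=5, n_iter=100)
    print(W.shape, H.shape, history[0], history[-1])
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, and the recorded divergence should decrease over the iterations. Sparsity constraints on the dictionary and the number of basis elements R, the hyperparameters mentioned in the conclusion, are not modeled here beyond the choice of R.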