
A NOVEL I-VECTOR BASED CNMF APPROACH TOWARDS NOISE ROBUST ASR

Kunal Dhawan¹, Colin Vaz², Ruchir Travadi² and Shrikanth Narayanan²

¹ Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, Assam, India
² Signal Analysis and Interpretation Lab, University of Southern California, Los Angeles, CA 90089

Motivation

➢ Problem: Commonly used acoustic features for Automatic Speech Recognition are not robust to noise, which degrades ASR performance in noisy conditions.
➢ Aim: Learn a robust representation of speech which is invariant to the speaker environment and recording conditions.

Proposed Feature Generation Algorithm

Step 1: Learning the UBM speech dictionary.

Dataset Description

Used the Aurora 4 database, which has the following characteristics:
➢ 7 noise types: airport, babble, car, clean, restaurant, street, train
➢ 2 microphone position conditions: near field, far field
➢ Training set of 7138 utterances, each approximately 8 seconds long (noise is not labeled)
➢ Test set of 330 utterances for each noise+channel type (∴ 7×2 = 14 different conditions)
Background

1) Non-Negative Matrix Factorization (NMF)

An algorithm to approximate a non-negative matrix V (∈ ℝ≥0, M×N) as a product of two non-negative matrices W (∈ ℝ≥0, M×R) and H (∈ ℝ≥0, R×N), where R ≤ M. To minimize the error of reconstruction, the decomposition is done according to an adapted Kullback–Leibler divergence cost metric:

D(V ‖ WH) = Σ_{m,n} [ V_{mn} log( V_{mn} / (WH)_{mn} ) − V_{mn} + (WH)_{mn} ]

Step 2: Calculating the 0th- and 1st-order sufficient statistics, and estimating the mean and covariance supervectors.

Result and Discussion

➢ The i-vector approach presented is very effective in modeling the noise type.
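Step 2 computes 0th- and 1st-order sufficient statistics against the UBM. As a point of reference, the classical Baum–Welch statistics used in GMM-based i-vector systems can be sketched as below; the poster adapts this idea to NMF speech dictionaries, so this is the standard GMM recipe rather than the authors' exact computation, and all names are illustrative:

```python
import numpy as np

def sufficient_stats(frames, weights, means, covs):
    """Classical 0th/1st-order Baum-Welch statistics against a GMM UBM.
    frames: (T, D); weights: (C,); means: (C, D); covs: (C, D) diagonal.
    Returns N (C,) zeroth-order and F (C, D) first-order statistics."""
    # log N(x | mu_c, diag(sigma_c)) for every component c and frame t
    diff = frames[None, :, :] - means[:, None, :]                # (C, T, D)
    log_gauss = -0.5 * (np.sum(diff**2 / covs[:, None, :], axis=2)
                        + np.sum(np.log(2 * np.pi * covs), axis=1)[:, None])
    log_post = np.log(weights)[:, None] + log_gauss              # (C, T)
    log_post -= np.logaddexp.reduce(log_post, axis=0, keepdims=True)
    gamma = np.exp(log_post)                                     # responsibilities
    N = gamma.sum(axis=1)                                        # 0th order
    F = gamma @ frames                                           # 1st order (C, D)
    return N, F
```

The zeroth-order statistics sum the posteriors per component; the first-order statistics are posterior-weighted sums of the frames, which is exactly what the supervector estimation consumes.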
The NMF idea can be used in ASR to represent a spectrogram as a product of a basis dictionary and the corresponding time activations:

Spectrogram (V) ≈ Speech Dict. (W) × Time Activations (H)

- Spectrogram (V): spectro-temporal representation of a given speech sample
- Speech Dict. (W): learns a set of basis elements having different frequency characteristics
- Time Activations (H): chooses the appropriate basis elements for each time frame in a given utterance

Step 3: Learning the Total Variability Matrix. The T matrix captures the variations in the training set about the UBM and hence defines the space where the speech dictionaries will adapt.

Step 4: Calculating the adapted dictionaries for the training set, and hence finding the corresponding time activations.

➢ Learning the noise and channel variations in the speech dictionary itself leads to features which are invariant to the noisy environment.
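The factorization V ≈ WH under the KL divergence cost can be sketched with the standard Lee–Seung multiplicative updates; this is a generic NMF sketch (variable names illustrative), not the authors' exact CNMF implementation:

```python
import numpy as np

def nmf_kl(V, R, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix V (M x N) as W @ H with
    W (M x R) and H (R x N) by minimizing the generalized KL
    divergence D(V || WH), using multiplicative updates."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, R)) + eps
    H = rng.random((R, N)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]   # update activations
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / H.sum(axis=1)[None, :]   # update dictionary
    return W, H

def kl_cost(V, W, H, eps=1e-9):
    """The KL divergence cost metric D(V || WH) from the Background section."""
    WH = W @ H + eps
    return float(np.sum(V * np.log((V + eps) / WH) - V + WH))
```

These updates keep W and H non-negative by construction and monotonically decrease the KL cost, which is why they are the usual choice for this cost metric.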

2) Total Variability approach

Model the variations in a dataset as a low-dimensional vector. The UBM captures the complete data distribution, while each utterance-specific data distribution has means slightly shifted from the UBM.

Feature Extraction Step:

M = m + T × w

where:
- M: speaker- and channel-dependent dictionary supervector
- m: UBM supervector (speaker- and channel-independent dictionary supervector)
- T: Total Variability Matrix (a rectangular matrix of low rank which models the variation space)
- w: i-vector (a random vector having a standard normal distribution); this is the feature corresponding to a given utterance

Step 5: Extract features for the test set using the same method as in Step 4.

[Figure: i-vector visualizations of utterances under clean speech, car noise, airport noise, and babble noise.]

Conclusion and Future Work

⭐ The current system outperforms the classical filterbank features.
⭐ Currently performing a grid search over the hyperparameters (sparsity of the dictionary, number of basis elements to choose), with the final aim of building a GMM-DNN ASR system.
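A minimal sketch of recovering the i-vector w from the model M = m + T × w above, simplified by assuming identity covariance so that the standard-normal prior on w yields a closed-form ridge-style estimate (names and dimensions are illustrative, not from the poster):

```python
import numpy as np

def extract_ivector(M_sv, m_sv, T):
    """Point estimate of w in M = m + T w under a standard-normal prior
    on w, assuming identity covariance for simplicity:
        w* = (I + T^T T)^{-1} T^T (M - m)
    M_sv, m_sv: supervectors of length D; T: (D x R) total variability matrix."""
    R = T.shape[1]
    return np.linalg.solve(np.eye(R) + T.T @ T, T.T @ (M_sv - m_sv))

# Example: recover a known w from a synthetic supervector
rng = np.random.default_rng(0)
D, R = 200, 5
T = rng.standard_normal((D, R))
m = rng.standard_normal(D)
w_true = rng.standard_normal(R)
w_est = extract_ivector(m + T @ w_true, m, T)
```

The identity term from the prior shrinks the estimate slightly toward zero; full i-vector extractors additionally weight T by the per-component covariances and zeroth-order statistics from Step 2.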
