
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Quan Wang*1, Hannah Muckenhirn*2,3, Kevin Wilson1, Prashant Sridhar1, Zelin Wu1, John Hershey1,
Rif A. Saurous1, Ron J. Weiss1, Ye Jia1, Ignacio Lopez Moreno1

1 Google Inc., USA    2 Idiap Research Institute, Switzerland    3 EPFL, Switzerland
quanw@google.com, hannah.muckenhirn@idiap.ch

* Equal contribution. Hannah performed this work as an intern at Google.
Abstract

In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) a speaker recognition network that produces speaker-discriminative embeddings; and (2) a spectrogram masking network that takes both a noisy spectrogram and a speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.

Index Terms: source separation, speaker recognition, spectrogram masking, speech recognition

1. Introduction

Recent advances in speech recognition have led to performance improvements in challenging scenarios such as noisy and far-field conditions. However, speech recognition systems still perform poorly when the speaker of interest is recorded in crowded environments, i.e., with interfering speakers in the foreground or background.

One way to deal with this issue is to first apply a speech separation system to the noisy audio in order to separate the voices of the different speakers. If the noisy signal contains N speakers, this approach yields N outputs, with a potential additional output for the ambient noise. A classical speech separation task like this must cope with two main challenges. First, the number of speakers N in the recording must be identified, and in realistic scenarios it is unknown. Second, the optimization of a speech separation system may need to be invariant to the permutation of speaker labels, as the order of the speakers should not have an impact during training [1]. Leveraging the advances in deep neural networks, several successful works have been introduced to address these problems, such as deep clustering [1], the deep attractor network [2], and permutation invariant training [3].

This work addresses the task of isolating the voices of a subset of speakers of interest from the combination of all the other speakers and noises. For example, such a subset can be formed by a single target speaker issuing a spoken query to a personal mobile device, or by the members of a household talking to a shared home device. We also assume that the speaker(s) of interest can be individually characterized by previous reference recordings, e.g. through an enrollment stage. This task is closely related to classical speech separation, but is speaker-dependent. In this paper, we refer to the task of speaker-dependent speech separation as voice filtering (some literature calls it speaker extraction). We argue that for voice filtering, speaker-independent techniques such as those presented in [1, 2, 3] may not be a good fit. In addition to the challenges described previously, these techniques require an extra step to determine which output, out of the N possible outputs of the speech separation system, corresponds to the target speaker(s), e.g. by choosing the loudest speaker, running a speaker verification system on the outputs, or matching a specific keyword.

A more end-to-end approach to the voice filtering task is to treat the problem as a binary classification problem, where the positive class is the speech of the speaker of interest, and the negative class is formed by the combination of all foreground and background interfering speakers and noises. By conditioning the system on the speaker, this approach sidesteps the three aforementioned challenges: the unknown number of speakers, the permutation problem, and the selection from multiple outputs.

In this work, we condition the system on the speaker embedding vector of a reference recording. The proposed approach is the following. We first train an LSTM-based speaker encoder to compute robust speaker embedding vectors. We then separately train a time-frequency mask-based system that takes two inputs: (1) the embedding vector of the target speaker, previously computed with the speaker encoder; and (2) the noisy multi-speaker audio. This system is trained to remove the interfering speakers and output only the voice of the target speaker (samples of output audio are available at https://google.github.io/speaker-id/publications/VoiceFilter). This approach can easily be extended to more than one speaker of interest by repeating the process in turn for the reference recording of each target speaker.

Similar literature exists for the voice filtering task. For example, in [4, 5], the authors achieved impressive results by indirectly conditioning the system on the speaker through visual information (lip movements). However, such a solution requires using speech and visual information simultaneously, which may not be available in certain types of applications, where a reference speech signal may be more practical. The systems proposed in [6, 7, 8, 9] are also very similar to ours, with a few major differences: (1) Instead of using one-hot vectors, i-vectors, or speaker posteriors derived from a cross-entropy classification network, our speaker encoder network is specifically designed for large-scale end-to-end speaker recognition [10], which has been shown to perform much better in speaker recognition tasks, especially for unseen speakers. (2) Instead of using a GEV beamformer [6, 8], our system directly optimizes the power-law compressed reconstruction error between the clean and enhanced signals [11]. (3) In addition to the source-to-distortion ratio [6, 7], we focus on Word Error Rate improvements for ASR systems. (4) We use dilated convolutional layers to capture low-level acoustic features more effectively. (5) We prefer a separately trained speaker encoder network over joint training as in [8, 9], due to the very different data requirements of the speaker recognition and source separation tasks.
The rest of this paper is organized as follows. In Section 2, we describe our approach to the problem and provide the details of how we train the neural networks. In Section 3, we describe our experimental setup, including the datasets we use and the evaluation metrics. The experimental results are presented in Section 4. We draw our conclusions in Section 5, with a discussion of future work directions.

2. Approach

The system architecture is shown in Fig. 1. The system consists of two separately trained components: the speaker encoder (in red) and the VoiceFilter system (in blue), which uses the output of the speaker encoder as an additional input. In this section, we describe these two components.

[Figure 1: System architecture. The reference audio is fed to the speaker encoder (an LSTM) to produce a d-vector. The trainable VoiceFilter network (CNN, LSTM, and fully connected layers) takes the noisy input spectrogram together with the d-vector and predicts a soft mask, which is multiplied element-wise with the input spectrogram to produce the masked spectrogram. During training, the loss is computed against the clean spectrogram; at inference time, the enhanced audio is obtained via an inverse STFT.]

2.1. Speaker encoder

The purpose of the speaker encoder is to produce a speaker embedding from an audio sample of the target speaker. This system is based on recent work by Wan et al. [10], which achieves strong performance on both text-dependent and text-independent speaker verification tasks, as well as on speaker diarization [12, 13], multispeaker TTS [14], and speech-to-speech translation [15].

The speaker encoder is a 3-layer LSTM network trained with the generalized end-to-end loss [10]. It takes as input log-mel filterbank energies extracted from windows of 1600 ms, and outputs speaker embeddings, called d-vectors, which have a fixed dimension of 256. To compute a d-vector for one utterance, we extract sliding windows with 50% overlap and average the L2-normalized d-vectors obtained on each window, as sketched below.
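To make the windowing and averaging concrete, the following sketch shows how an utterance-level d-vector could be aggregated from window-level embeddings. The function name `speaker_encoder` and the exact frame bookkeeping are illustrative assumptions, not the implementation used by the authors.

```python
import numpy as np

WINDOW = 1.6   # seconds, as described in Section 2.1
OVERLAP = 0.5  # 50% overlap between consecutive windows

def utterance_dvector(waveform, sample_rate, speaker_encoder):
    """Aggregate an utterance-level d-vector from sliding windows.

    `speaker_encoder` is assumed to map a window of samples to a
    256-dimensional embedding (e.g. the 3-layer LSTM of [10]).
    """
    win = int(WINDOW * sample_rate)
    hop = int(win * (1.0 - OVERLAP))
    embeddings = []
    # Slide over the utterance; very short utterances yield a single window.
    for start in range(0, max(len(waveform) - win, 0) + 1, hop):
        window = waveform[start:start + win]
        emb = speaker_encoder(window)        # shape: (256,)
        emb = emb / np.linalg.norm(emb)      # L2-normalize each window embedding
        embeddings.append(emb)
    # Average the normalized window-level embeddings to get the d-vector.
    return np.mean(embeddings, axis=0)
```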
2.2. VoiceFilter system

The VoiceFilter system is based on the recent work of Wilson et al. [11], developed for speech enhancement. As shown in Fig. 1, the neural network takes two inputs: a d-vector of the target speaker, and a magnitude spectrogram computed from the noisy audio. The network predicts a soft mask, which is element-wise multiplied with the input (noisy) magnitude spectrogram to produce an enhanced magnitude spectrogram. To obtain the enhanced waveform, we combine the phase of the noisy audio with the enhanced magnitude spectrogram and apply an inverse STFT to the result. The network is trained to minimize the difference between the masked magnitude spectrogram and the target magnitude spectrogram computed from the clean audio.
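As a rough illustration of the masking and reconstruction steps, and of the power-law compressed reconstruction error mentioned in Section 1, the sketch below uses librosa's STFT utilities. The STFT parameters, the use of librosa itself, and the compression exponent of 0.3 are assumptions borrowed from common practice around [11]; they are not values confirmed by this paper.

```python
import numpy as np
import librosa

N_FFT = 512        # assumed STFT parameters; not specified in the paper
HOP_LENGTH = 128

def enhance(noisy_waveform, soft_mask):
    """Apply a predicted soft mask and resynthesize with the noisy phase.

    `soft_mask` is assumed to have the same (freq, frames) shape as the
    magnitude spectrogram of the noisy input.
    """
    noisy_stft = librosa.stft(noisy_waveform, n_fft=N_FFT, hop_length=HOP_LENGTH)
    magnitude, phase = np.abs(noisy_stft), np.angle(noisy_stft)
    enhanced_magnitude = soft_mask * magnitude            # element-wise masking
    # Merge the noisy phase with the enhanced magnitude, then invert.
    enhanced_stft = enhanced_magnitude * np.exp(1j * phase)
    return librosa.istft(enhanced_stft, hop_length=HOP_LENGTH,
                         length=len(noisy_waveform))

def power_law_compressed_loss(clean_magnitude, masked_magnitude, power=0.3):
    """Squared error between power-law compressed magnitude spectrograms,
    in the spirit of [11]; the exponent 0.3 is an assumed, commonly used value."""
    return np.mean((clean_magnitude ** power - masked_magnitude ** power) ** 2)
```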
The VoiceFilter network is composed of 8 convolutional layers, 1 LSTM layer, and 2 fully connected layers, each with ReLU activations except the last layer, which has a sigmoid activation. The parameter values are provided in Table 1. The d-vector is concatenated to the output of the last convolutional layer at every time frame, and the resulting concatenated vector is fed as input to the following LSTM layer. We inject the d-vector between the convolutional layers and the LSTM layer, and not before the convolutional layers, for two reasons. First, the d-vector is already a compact and robust representation of the target speaker, so we do not need to modify it by applying convolutional layers on top of it. Second, convolutional layers assume time and frequency homogeneity, and thus cannot be applied to an input composed of two completely different signals: a magnitude spectrogram and a speaker embedding.

Table 1: Parameters of the VoiceFilter network.

Layer   Width (time x freq)   Dilation (time x freq)   Filters / Nodes
CNN 1          1 x 7                  1 x 1                  64
CNN 2          7 x 1                  1 x 1                  64
CNN 3          5 x 5                  1 x 1                  64
CNN 4          5 x 5                  2 x 1                  64
CNN 5          5 x 5                  4 x 1                  64
CNN 6          5 x 5                  8 x 1                  64
CNN 7          5 x 5                 16 x 1                  64
CNN 8          1 x 1                  1 x 1                   8
LSTM             -                      -                   400
FC 1             -                      -                   600
FC 2             -                      -                   600
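The following PyTorch sketch mirrors the layer layout of Table 1 and the way the d-vector is tiled across time frames and concatenated before the LSTM. The padding choices, the number of frequency bins, and the uni-directional LSTM are assumptions made for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class VoiceFilterSketch(nn.Module):
    """Layer layout following Table 1; padding and num_freq are assumptions."""

    # (kernel_time, kernel_freq, dilation_time, dilation_freq, out_channels)
    CNN_SPECS = [
        (1, 7, 1, 1, 64), (7, 1, 1, 1, 64), (5, 5, 1, 1, 64),
        (5, 5, 2, 1, 64), (5, 5, 4, 1, 64), (5, 5, 8, 1, 64),
        (5, 5, 16, 1, 64), (1, 1, 1, 1, 8),
    ]

    def __init__(self, num_freq=600, dvector_dim=256):
        super().__init__()
        layers, in_ch = [], 1
        for kt, kf, dt, df, out_ch in self.CNN_SPECS:
            # "Same" padding so the time and frequency dimensions are preserved.
            pad = (dt * (kt - 1) // 2, df * (kf - 1) // 2)
            layers += [nn.Conv2d(in_ch, out_ch, (kt, kf),
                                 dilation=(dt, df), padding=pad),
                       nn.ReLU()]
            in_ch = out_ch
        self.cnn = nn.Sequential(*layers)
        # The d-vector is concatenated to every time frame before the LSTM.
        self.lstm = nn.LSTM(8 * num_freq + dvector_dim, 400, batch_first=True)
        self.fc1 = nn.Linear(400, 600)
        self.fc2 = nn.Linear(600, num_freq)  # Table 1 lists 600 nodes for FC 2,
                                             # consistent with ~600 frequency bins

    def forward(self, noisy_mag, dvec):
        # noisy_mag: (batch, time, num_freq); dvec: (batch, dvector_dim)
        x = self.cnn(noisy_mag.unsqueeze(1))             # (batch, 8, time, freq)
        x = x.permute(0, 2, 1, 3).flatten(start_dim=2)   # (batch, time, 8*freq)
        dvec = dvec.unsqueeze(1).expand(-1, x.size(1), -1)
        x, _ = self.lstm(torch.cat([x, dvec], dim=2))
        x = torch.relu(self.fc1(x))
        mask = torch.sigmoid(self.fc2(x))                # soft mask in [0, 1]
        return mask * noisy_mag                          # enhanced magnitude
```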
While training the VoiceFilter system, the input audio is divided into segments of 3 seconds each and is converted, if necessary, to single-channel audio with a sampling rate of 16 kHz.

3. Experimental setup

In this section, we describe our experimental setup: the data used to train the two components of the system separately, as well as the metrics used to assess the systems.

3.1. Data

3.1.1. Datasets

Speaker encoder: Although our speaker encoder network has exactly the same topology as the text-independent model described in [10], we use more training data in this system. Our speaker encoder is trained on two datasets combined with the MultiReader technique introduced in [10]. The first dataset consists of anonymized voice query logs in English from mobile and far-field devices; it contains about 34 million utterances from about 138 thousand speakers. The second dataset consists of LibriSpeech [16], VoxCeleb [17], and VoxCeleb2 [18]. This model (referred to as "d-vector V2" in [13]) has a 3.06% equal error rate (EER) on our internal en-US phone audio test dataset, compared to the 3.55% EER of the model reported in [10].

VoiceFilter: We cannot use a "standard" benchmark corpus for speech separation, such as one of the CHiME challenges [19], because we need a clean reference utterance from each target speaker in order to compute speaker embeddings. Instead, we train and evaluate the VoiceFilter system on our own generated data, derived either from the VCTK dataset [20] or from the LibriSpeech dataset [16]. For VCTK, we randomly take 99 speakers for training and 10 speakers for testing. For LibriSpeech, we use the training and development sets defined in the protocol of the dataset: the training set contains 2338 speakers, and the development set contains 73 speakers. These two datasets contain read speech, and each utterance contains the voice of one speaker. We explain in the next section how we generate the data used to train the VoiceFilter system.

3.1.2. Data generation
From the system diagram in Fig. 1, we see that one training step involves three inputs: (1) the clean audio from the target speaker, which is the ground truth; (2) the noisy audio containing multiple speakers; and (3) a reference audio from the target speaker (different from the clean audio), from which the d-vector is computed.

This training triplet can be obtained from three utterances of a clean dataset, as shown in Fig. 2. The reference audio is picked randomly among all the utterances of the target speaker and is different from the clean audio. The noisy audio is generated by mixing the clean audio with an interfering audio randomly selected from a different speaker; more specifically, it is obtained by directly summing the clean audio and the interfering audio, then trimming the result to the length of the clean audio.

[Figure 2: Input data processing workflow. A reference audio and a clean audio are extracted from the utterances of speaker A, and an interference audio is extracted from the utterances of speaker B; the noisy audio is obtained by directly summing the clean and interference signals.]

We also tried multiplying the interfering audio by a random weight drawn uniformly from either [0, 1] or [0, 2]. However, this did not affect the performance of the VoiceFilter system in our experiments.
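A minimal sketch of the triplet generation described above, assuming in-memory waveforms at a common sampling rate; the helper names, the silence-padding of short interference signals, and the (disabled by default) random weight are illustrative assumptions rather than the authors' pipeline.

```python
import random
import numpy as np

def make_training_triplet(target_utterances, interference_utterances,
                          interference_weight=1.0):
    """Build (clean, noisy, reference) from one target and one interfering speaker.

    `target_utterances` and `interference_utterances` are lists of 1-D numpy
    waveforms for the target speaker and for a different, randomly chosen speaker.
    """
    # Pick two different utterances of the target speaker:
    # one serves as the clean ground truth, the other as the reference audio.
    clean, reference = random.sample(target_utterances, 2)
    interference = random.choice(interference_utterances)
    # Directly sum the signals, then trim to the length of the clean audio
    # (padding a shorter interference with silence is an assumption).
    length = len(clean)
    if len(interference) < length:
        interference = np.pad(interference, (0, length - len(interference)))
    noisy = clean + interference_weight * interference[:length]
    return clean, noisy, reference
```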
3.2. Evaluation

To evaluate the performance of different VoiceFilter models, we use two metrics: the speech recognition Word Error Rate (WER) and the Source to Distortion Ratio (SDR).

3.2.1. Word error rate

As mentioned in Sec. 1, the main goal of our system is to improve speech recognition. Specifically, we want to reduce the WER in multi-speaker scenarios while preserving the same WER in single-speaker scenarios. The speech recognizer we use for WER evaluation is a version of the conventional phone models discussed in [21], which is trained on a YouTube dataset.

For each VoiceFilter model, we care about four WER numbers:

• Clean WER: without VoiceFilter, the WER on the clean audio.
• Noisy WER: without VoiceFilter, the WER on the noisy audio.
• Clean-enhanced WER: the WER on the clean audio processed by the VoiceFilter system.
• Noisy-enhanced WER: the WER on the noisy audio processed by the VoiceFilter system.

A good VoiceFilter model should have these two properties:

1. The noisy-enhanced WER is significantly lower than the noisy WER, meaning that the VoiceFilter improves speech recognition in multi-speaker scenarios.
2. The clean-enhanced WER is very close to the clean WER, meaning that the VoiceFilter has minimal negative impact on single-speaker scenarios.

3.2.2. Source to distortion ratio

The SDR is a very common metric for evaluating source separation systems [22]; it requires knowing both the clean signal and the enhanced signal. It is an energy ratio, expressed in dB, between the energy of the target signal contained in the enhanced signal and the energy of the errors (coming from the interfering speakers and artifacts). Thus, the higher it is, the better.
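For reference, BSS Eval [22] decomposes the enhanced signal as the sum of a target component and interference, noise, and artifact error terms, and defines the SDR as the following energy ratio (this restates the standard definition from [22] rather than anything specific to this paper):

```latex
\mathrm{SDR} = 10 \log_{10}
  \frac{\|s_{\mathrm{target}}\|^2}
       {\|e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}}\|^2}
```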
4. Results

4.1. Word error rate

In Table 2, we present the results of VoiceFilter models trained and evaluated on the LibriSpeech dataset. The architecture of the VoiceFilter system is shown in Table 1, with a few different variations of the LSTM layer: (1) no LSTM layer, i.e., only convolutional layers directly followed by fully connected layers; (2) a uni-directional LSTM layer; (3) a bi-directional LSTM layer.
Table 2: Speech recognition WER on LibriSpeech. VoiceFilter is trained on LibriSpeech.

VoiceFilter Model        Clean WER (%)   Noisy WER (%)
No VoiceFilter               10.9            55.9
VoiceFilter: no LSTM         12.2            35.3
VoiceFilter: LSTM            12.2            28.2
VoiceFilter: bi-LSTM         11.1            23.4

Table 3: Speech recognition WER on VCTK. The LSTM layer is uni-directional. The model architecture is shown in Table 1.

VoiceFilter Model        Clean WER (%)   Noisy WER (%)
No VoiceFilter                6.1            60.6
Trained on VCTK              21.1            37.0
Trained on LibriSpeech        5.9            34.3

In general, after applying VoiceFilter, the WER on the noisy data is significantly lower than before, while the WER on the clean data remains close to before. There is a significant gap between the first and second model, meaning that processing the data sequentially with an LSTM is an important component of the system. Moreover, using a bi-directional LSTM layer, we achieve the best WER on the noisy data. With this model, applying the VoiceFilter system to the noisy data reduces the speech recognition WER by a relative 58.1%. In the clean scenario, the performance degradation caused by the VoiceFilter system is very small: the WER is 11.1% instead of 10.9%.
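The relative reduction quoted here follows directly from the entries of Table 2:

```latex
\frac{55.9 - 23.4}{55.9} \approx 58.1\%
```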
In Table 3, we present the WER results of VoiceFilter models evaluated on the VCTK dataset. With a VoiceFilter model also trained on VCTK, the WER on the noisy data after applying VoiceFilter is significantly lower than before, reduced relatively by 38.9%. However, the WER on the clean data after applying VoiceFilter is significantly higher. This is mostly because the VCTK training set is too small, containing only 99 speakers. If we use a VoiceFilter model trained on LibriSpeech instead, the WER on the noisy dataset decreases further, while the WER on the clean data drops to 5.9%, which is even lower than before applying VoiceFilter. This means that: (1) the VoiceFilter model is able to generalize from one dataset to another; and (2) we are improving the acoustic quality of the original clean audio, even though we did not explicitly train the model for this.

Note that the LibriSpeech training set contains about 20 times more speakers than VCTK (2338 speakers instead of 99), which is the major difference between the two models shown in Table 3. Thus, the results also imply that we could further improve our VoiceFilter model by training it with even more speakers.

4.2. Source to distortion ratio

Table 4: Source to distortion ratio on LibriSpeech, in dB. PermInv stands for the permutation invariant loss [3]. The mean SDR for "No VoiceFilter" is high because some clean signals are mixed with silent parts of the interference signals.

VoiceFilter Model        Mean SDR   Median SDR
No VoiceFilter              10.1        2.5
VoiceFilter: no LSTM        11.9        9.7
VoiceFilter: LSTM           15.6       11.3
VoiceFilter: bi-LSTM        17.9       12.6
PermInv: bi-LSTM            17.2       11.9

We present the SDR numbers in Table 4. The results follow the same trend as the WER in Table 2. The bi-directional LSTM approach in the VoiceFilter achieves the highest SDR.

We also compare the VoiceFilter results to a speech separation model that uses the permutation invariant loss [3]. This model has the same architecture as the VoiceFilter system (with a bi-directional LSTM), presented in Table 1, but is not fed with speaker embeddings. Instead, it separates the noisy input into two components, corresponding to the clean and the interfering audio, and chooses the output that is the closest to the ground truth, i.e., the one with the highest SDR. This system can be seen as an "oracle" system, as it knows both the number of sources contained in the noisy signal and the ground truth clean signal. As explained in Section 1, using such a system in practice would require: 1) estimating how many speakers are in the noisy input, and 2) choosing which output to select, e.g. by running a speaker verification system on each output (which might not be efficient if there are many interfering speakers).

We observe that the VoiceFilter system outperforms the permutation invariant loss based system. This shows that our system not only solves the two aforementioned issues, but that using a speaker embedding also improves the capability of the system to extract the source of interest (with a higher SDR).

4.3. Discussions

In Table 2, we tried a few variants of the VoiceFilter model on LibriSpeech, and the best WER performance was achieved with a bi-directional LSTM. However, it is likely that similar performance could be achieved by adding more layers or nodes to a uni-directional LSTM. Future work includes exploring more variants and fine-tuning the hyper-parameters to achieve better performance at lower computational cost, but that is beyond the focus of this paper.

5. Conclusions and future work

In this paper, we have demonstrated the effectiveness of using a discriminatively trained speaker encoder to condition the speech separation task. Such a system is more applicable to real scenarios because it does not require prior knowledge about the number of speakers and avoids the permutation problem. We have shown that a VoiceFilter model trained on the LibriSpeech dataset reduces the speech recognition WER from 55.9% to 23.4% in two-speaker scenarios, while the WER stays approximately the same in single-speaker scenarios.

This system could be improved in a few ways: (1) training on larger and more challenging datasets such as VoxCeleb 1 and 2 [18]; (2) adding more interfering speakers; and (3) computing the d-vectors over several utterances instead of only one, to obtain more robust speaker embeddings. Another interesting direction would be to train the VoiceFilter system to perform joint voice separation and speech enhancement, i.e., to remove both the interfering speakers and the ambient noise. To do so, we could add different noises when mixing the clean audio with the interfering utterances. This approach will be part of future investigations. Finally, the VoiceFilter system could also be trained jointly with the speech recognition system to further increase the WER improvement.

6. Acknowledgements

The authors would like to thank Seungwon Park for open-sourcing a third-party implementation of this system (https://github.com/mindslab-ai/voicefilter). We would also like to thank Yiteng (Arden) Huang, Jason Pelecanos, and Fadi Biadsy for helpful discussions.
7. References

[1] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31–35.
[2] Z. Chen, Y. Luo, and N. Mesgarani, "Deep attractor network for single-microphone speaker separation," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
[3] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 241–245.
[4] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," in SIGGRAPH, 2018.
[5] T. Afouras, J. S. Chung, and A. Zisserman, "The conversation: Deep audio-visual speech enhancement," in Interspeech, 2018.
[6] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, "Speaker-aware neural network based beamformer for speaker extraction in speech mixtures," in Interspeech, 2017.
[7] J. Wang, J. Chen, D. Su, L. Chen, M. Yu, Y. Qian, and D. Yu, "Deep extractor network for target speaker recovery from single channel speech mixtures," arXiv preprint arXiv:1807.08974, 2018.
[8] K. Žmolíková, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, "Learning speaker representation for neural network based multichannel speaker extraction," in Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 8–15.
[9] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, "Single channel target speaker extraction and recognition with speaker beam," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5554–5558.
[10] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879–4883.
[11] K. Wilson, M. Chinen, J. Thorpe, B. Patton, J. Hershey, R. A. Saurous, J. Skoglund, and R. F. Lyon, "Exploring tradeoffs in models for low-latency speech enhancement," in International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2018, pp. 366–370.
[12] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, "Speaker diarization with LSTM," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5239–5243.
[13] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, "Fully supervised speaker diarization," arXiv preprint arXiv:1810.04719, 2018.
[14] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno et al., "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," in Conference on Neural Information Processing Systems (NIPS), 2018.
[15] Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y. Wu, "Direct speech-to-speech translation with a sequence-to-sequence model," arXiv preprint arXiv:1904.06037, 2019.
[16] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[17] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Interspeech, 2017.
[18] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Interspeech, 2018.
[19] J. P. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The CHiME challenges: Robust speech recognition in everyday environments," in New Era for Robust Speech Recognition. Springer, 2017, pp. 327–344.
[20] C. Veaux, J. Yamagishi, K. MacDonald et al., "Superseded - CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2016.
[21] H. Soltau, H. Liao, and H. Sak, "Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition," in Interspeech, 2017.
[22] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
