- Feature: 80-dim fbank, mean normalization, speed perturb (see the feature-extraction sketch after this list)
- Training: lr [0.00005, 0.2], batch_size 256, 4 GPUs (Tesla V100), additive angular margin, speaker embedding dim=192
- Metrics: EER(%), MinDCF
- Train set: CNCeleb-dev + CNCeleb2, 2973 speakers
- Test set: CNCeleb-eval
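For reference, the front-end above (speed perturbation, 80-dim fbank, per-utterance mean normalization) can be approximated as in the following sketch. It assumes torchaudio is available; the frame settings and the 1.1 speed factor are illustrative assumptions, not values taken from this recipe.

```python
import torch
import torchaudio

def extract_fbank(wav_path: str, speed: float = 1.0) -> torch.Tensor:
    """Load a wav, optionally speed-perturb it, return mean-normalized 80-dim fbank."""
    waveform, sr = torchaudio.load(wav_path)
    if speed != 1.0:
        # Speed perturbation via sox effects; 'rate' resamples back to the original sr.
        waveform, sr = torchaudio.sox_effects.apply_effects_tensor(
            waveform, sr, [["speed", str(speed)], ["rate", str(sr)]]
        )
    # 80-dimensional log Mel filterbank features (default frame settings assumed).
    feats = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, sample_frequency=sr, dither=0.0
    )
    # Per-utterance mean normalization along the time axis.
    return feats - feats.mean(dim=0, keepdim=True)  # shape: (num_frames, 80)

feats = extract_fbank("example.wav", speed=1.1)  # 0.9/1.0/1.1 are common perturbation factors
```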
| Model | Params | EER(%) | MinDCF |
|---|---|---|---|
| CAM++ | 7.18M | 6.78 | 0.393 |
| ERes2Net-base | 6.61M | 6.69 | 0.388 |
| ERes2Net-large | 22.46M | 6.17 | 0.372 |
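The EER and MinDCF numbers above are computed from per-trial scores and target/non-target labels. A minimal sketch is given below; the DCF cost parameters (p_target=0.01, c_miss=c_fa=1) are assumptions and may differ from the operating point used for this table.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer_mindcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Compute EER and normalized MinDCF from trial scores and 0/1 target labels."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # EER: the point where false-accept and false-reject rates are (nearly) equal.
    idx = np.nanargmin(np.abs(fnr - fpr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    # Detection cost, normalized by the cost of the best trivial system.
    dcf = c_miss * p_target * fnr + c_fa * (1.0 - p_target) * fpr
    min_dcf = dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))
    return eer, min_dcf

eer, min_dcf = compute_eer_mindcf(np.array([0.8, 0.3, 0.6, 0.2]), np.array([1, 0, 1, 0]))
print(f"EER: {eer * 100:.2f}%  MinDCF: {min_dcf:.3f}")
```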
Pretrained models are accessible on ModelScope (a pipeline usage sketch follows the list below).
- ERes2Net-base: speech_eres2net_base_sv_zh-cn_cnceleb_16k
- ERes2Net-large: speech_eres2net_large_sv_zh-cn_cnceleb_16k
- 200k labeled speakers: speech_eres2net_sv_zh-cn_16k-common
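If you only need pairwise verification scores, the pretrained models can also be called through the ModelScope pipeline interface. This is a minimal sketch, assuming the 'speaker-verification' pipeline task accepts a pair of wav paths; the file names are placeholders.

```python
from modelscope.pipelines import pipeline

# Build a speaker-verification pipeline from one of the model ids listed above.
sv_pipeline = pipeline(
    task='speaker-verification',
    model='damo/speech_eres2net_base_sv_zh-cn_cnceleb_16k'
)

# Compare two utterances; the pipeline returns a similarity score and a decision.
result = sv_pipeline(['speaker1_utt1.wav', 'speaker1_utt2.wav'])
print(result)
```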
Here is a simple example of extracting embeddings directly: it downloads the pretrained model from ModelScope and extracts embeddings for the given wav files.
# Install modelscope
pip install modelscope
# ERes2Net trained on CNCeleb
model_id=damo/speech_eres2net_base_sv_zh-cn_cnceleb_16k
# ERes2Net trained on 200k labeled speakers
model_id=damo/speech_eres2net_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
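Once embeddings have been extracted, verification trials are typically scored with cosine similarity between the 192-dim vectors. The sketch below assumes the embeddings have been saved as .npy files; the paths are placeholders rather than the actual output layout of infer_sv.py.

```python
import numpy as np

def cosine_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

# Placeholder paths: point these at wherever your embeddings were written.
emb_a = np.load("embeddings/spk1_utt1.npy")
emb_b = np.load("embeddings/spk1_utt2.npy")
print(f"cosine score: {cosine_score(emb_a, emb_b):.4f}")
```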
If you are using the ERes2Net model in your research, please cite:
@inproceedings{eres2net,
  title={An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification},
  author={Yafeng Chen and Siqi Zheng and Hui Wang and Luyao Cheng and Qian Chen and Jiajun Qi},
  booktitle={Interspeech 2023},
  year={2023},
  organization={ISCA}
}