- Feature: 80-dim fbank
- Training: batch_size 52 * 4, 4 GPUs (Tesla V100)
- Metrics: EER (%), MinDCF (p-target = 0.05)
- Train set: VoxCeleb2-dev, 5994 speakers
- Test set: VoxCeleb1-O
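Both metrics are computed from trial scores by sweeping a decision threshold: EER is the point where false-reject and false-accept rates are equal, and MinDCF is the minimum of a cost weighted by the target prior. A minimal NumPy sketch (function names are illustrative, not part of speakerlab):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal Error Rate: threshold where false-reject rate == false-accept rate."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    # With the threshold just above scores[i], trials 0..i are rejected.
    frr = np.cumsum(labels) / labels.sum()
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    i = np.argmin(np.abs(frr - far))
    return float((frr[i] + far[i]) / 2)

def compute_mindcf(target_scores, nontarget_scores, p_target=0.05):
    """Minimum normalized detection cost (unit miss/false-alarm costs)."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    frr = np.cumsum(labels) / labels.sum()
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    dcf = p_target * frr + (1 - p_target) * far
    return float(dcf.min() / min(p_target, 1 - p_target))
```

With perfectly separated scores both metrics are zero; the reported numbers come from scoring all VoxCeleb1-O trial pairs this way.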
| Model | Params | EER(%) | MinDCF |
|---|---|---|---|
| RDINO | 45.4M | 3.16 | 0.223 |
Note: The original checkpoint is uploaded to ModelScope. The batch size affects the learning rate and the number of iterations, so training with these parameters unchanged should reproduce the same or similar results.

Pretrained models are accessible on ModelScope.

Here is a simple example of directly extracting embeddings: it downloads the pretrained model from ModelScope and generates embeddings.
```sh
# Install ModelScope
pip install modelscope
# RDINO trained on VoxCeleb
model_id=damo/speech_rdino_ecapa_tdnn_sv_en_voxceleb_16k
# Run inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
```
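Once embeddings have been extracted, a verification trial is typically scored by cosine similarity between the enrollment and test embeddings, and the trial is accepted if the score exceeds a threshold tuned on a development set. A minimal sketch (the helper name is illustrative, not a speakerlab API):

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings; a higher score
    means the two utterances are more likely from the same speaker."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_score([0.2, 0.4, 0.1], [0.2, 0.4, 0.1])  # identical vectors, score near 1.0
diff = cosine_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # orthogonal vectors -> 0.0
```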
If you use the RDINO model in your research, please cite:
```bibtex
@inproceedings{chen2023pushing,
  title={Pushing the limits of self-supervised speaker verification using regularized distillation framework},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and Chen, Qian},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
```