- Feature: 80-dim fbank, mean normalization, speed perturbation (a feature-extraction sketch follows this list)
- Training: lr [0.00005, 0.2], batch_size 256, 4 GPUs (Tesla V100), additive angular margin loss, speaker embedding size 192
- Metrics: EER(%), MinDCF(p-target=0.01)
- Train set: 3D-Speaker-train
- Test set: 3D-Speaker-test
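The feature setup above can be reproduced with standard Kaldi-compatible fbank extraction. The following is a minimal sketch using torchaudio, assuming 16 kHz input and per-utterance mean normalization over time; the repository's own extraction code may differ in details such as frame length and dithering.

```python
import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi

def extract_fbank(wav_path: str) -> torch.Tensor:
    """80-dim fbank features with per-utterance mean normalization."""
    waveform, sample_rate = torchaudio.load(wav_path)  # (channels, samples)
    assert sample_rate == 16000, "models are trained on 16 kHz audio"
    feats = kaldi.fbank(
        waveform,
        num_mel_bins=80,              # 80-dim fbank
        sample_frequency=sample_rate,
        frame_length=25,              # ms (assumed)
        frame_shift=10,               # ms (assumed)
    )                                 # (num_frames, 80)
    feats = feats - feats.mean(dim=0, keepdim=True)  # mean normalization
    return feats
```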
| Model | Params | Cross-Device EER | Cross-Distance EER | Cross-Dialect EER |
|---|---|---|---|---|
| ECAPA-TDNN | 20.8 M | 8.87% | 12.26% | 14.53% |
| ResNet34 | 6.34 M | 7.29% | 8.98% | 12.81% |
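For reference, the EER and MinDCF numbers reported here can be computed from trial scores and labels roughly as follows. This is a generic sketch rather than the repository's evaluation script, assuming higher scores mean "same speaker" and unit miss/false-alarm costs for the detection cost function (only p-target=0.01 is stated above).

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer_and_mindcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Return EER (%) and MinDCF given per-trial scores and 0/1 labels."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # EER: operating point where false-alarm rate equals miss rate
    idx = np.nanargmin(np.abs(fnr - fpr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    # Detection cost, normalized by the best trivial system
    dcf = c_miss * p_target * fnr + c_fa * (1.0 - p_target) * fpr
    min_dcf = dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))
    return eer * 100.0, min_dcf
```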
Pretrained models are available on ModelScope:
- 3D-Speaker-Dataset: iic/speech_resnet34_sv_zh-cn_3dspeaker_16k
Here is a simple example of extracting embeddings: it downloads the pretrained model from ModelScope and runs inference on the given wav files.
``` sh
# Install modelscope
pip install modelscope
# ResNet34 trained on 3D-Speaker-Dataset
model_id=iic/speech_resnet34_sv_zh-cn_3dspeaker_16k
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```
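Verification between two utterances is then typically done by comparing their embeddings with cosine similarity. Below is a minimal sketch, assuming the 192-dim embeddings have already been loaded as numpy arrays; how and where infer_sv.py stores its output is not shown above, so the .npy paths are placeholders.

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (higher = more likely same speaker)."""
    emb_a = emb_a / np.linalg.norm(emb_a)
    emb_b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(emb_a, emb_b))

# Placeholder paths for two previously extracted 192-dim embeddings
emb_a = np.load("embedding_a.npy")
emb_b = np.load("embedding_b.npy")
print(f"cosine score: {cosine_score(emb_a, emb_b):.4f}")
```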