This repo aims to build a web app backed by a speech recognition system 😃 It's simple to use and understand 😄

manhph2211/ViSR

Vietnamese-Speech-Recognition

Introduction

In this repo, I focus on building an end-to-end speech recognition pipeline using QuartzNet, wav2vec 2.0, and a CTC decoder supported by beam search and a language model.
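
To make the CTC decoding step concrete, here is a minimal sketch of greedy CTC decoding, the simplest variant: take the most likely label per frame, collapse consecutive repeats, then drop the blank symbol. This is not the beam search decoder with a language model used in this repo. The function name, the vocabulary, and the choice of index 0 as blank are assumptions made for illustration only.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, vocab, blank_id=0):
    """Collapse repeated frame labels, then remove CTC blanks.

    frame_ids: per-frame argmax label indices from an acoustic model.
    vocab:     list mapping label index -> character (hypothetical here).
    blank_id:  index of the CTC blank symbol (assumed to be 0).
    """
    # Step 1: merge runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(frame_ids)]
    # Step 2: drop blanks; a blank between two equal labels keeps both.
    return "".join(vocab[i] for i in collapsed if i != blank_id)


if __name__ == "__main__":
    # Hypothetical vocabulary: index 0 is the blank.
    vocab = ["_", "a", "b", "c"]
    frames = [0, 1, 1, 0, 1, 2, 2, 0, 0, 3]
    print(ctc_greedy_decode(frames, vocab))
```

Beam search differs in that it keeps several candidate transcripts per frame and can rescore them with a language model, which usually gives lower error rates than this greedy pass.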

Setup

Datasets

Here I used the public 100-hour speech dataset from VinBigData, which is a small clean subset of the VLSP2020 ASR competition data. Some information about this dataset can be found in data/Data_Workspace.ipynb. The data format I use for training and evaluation follows LJSpeech, so I created data/custom.py to convert the given dataset.

mkdir -p data/LJSpeech-1.1
python data/custom.py  # create the data format for training quartznet & w2v2.0

Below is the folder structure I used. Note that metadata.csv has two columns: file name and transcript.

├───data
│   ├───LJSpeech-1.1
│   │   ├───wavs
│   │   └───metadata.csv
│   └───vlsp2020_train_set_02
├───datasets
├───demo
├───models
│   └───quartznet
│       └───base
├───tools
└───utils
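
As a sanity check on the converted data, the two-column metadata.csv can be read with a few lines of Python. This is only a sketch: it assumes the file uses the pipe delimiter of the original LJSpeech format, which may not match what data/custom.py writes, and the function name and sample rows are hypothetical.

```python
import csv
import io


def read_metadata(handle, delimiter="|"):
    """Return (file_name, transcript) pairs from an LJSpeech-style file.

    The "|" delimiter is an assumption taken from the LJSpeech format;
    change it if data/custom.py writes a different separator.
    """
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


if __name__ == "__main__":
    # Hypothetical rows standing in for data/LJSpeech-1.1/metadata.csv.
    sample = io.StringIO("utt_0001|xin chào\nutt_0002|cảm ơn\n")
    for name, text in read_metadata(sample):
        print(name, text)
```

For the real file, pass `open("data/LJSpeech-1.1/metadata.csv", encoding="utf-8", newline="")` instead of the in-memory sample.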

Environment

Create an environment and install the packages listed in the requirements file. Note that torch should be installed to match your CUDA version. With conda:

cd Vietnamese-Speech-Recognition
conda create -n asr
conda activate asr
conda install --file requirements.txt

Also, you need to install ctcdecode:

git clone --recursive https://github.com/parlance/ctcdecode.git
cd ctcdecode && pip install . && cd ..

Tools

Training & Evaluation

To train the QuartzNet model, run:

python3 tools/train.py --config configs/config.yaml

To evaluate QuartzNet:

python3 tools/evaluate.py --config configs/config.yaml
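
Speech recognition models are commonly scored with word error rate (WER): the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. Whether tools/evaluate.py reports exactly this metric is an assumption; the sketch below only illustrates the calculation, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("tôi đi học", "tôi đi"))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words.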

Or, to fine-tune a wav2vec 2.0 model starting from a Vietnamese pretrained checkpoint:

python3 tools/fintune_w2v.py

Demo

I also provide a small Streamlit app as an ASR demo. You can run it with:

streamlit run demo/app.py

Results

I used wandb and TensorBoard to log results and artifacts during training. Here are some visualizations after several epochs:

(Figures: training visualizations for Quartznet and W2v 2.0.)
