This repo contains the recipe to assemble a corpus of Foreign-Accented English from the crowdsourced Common Voice corpus, which contains (optional) accent labels.

schaltung/FAE-CV

FAE Common Voice 2022

Foreign-Accented English from the crowdsourced Common Voice corpus.

Design Criteria:

  • Use only recordings listed in validated.tsv of cv-corpus-10.0-2022-07-04.
  • ~20 sec per speaker.
  • ~100-500 speakers per accent class.
  • Favor speaker diversity and gender balance.
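
The criteria above can be sketched in a few lines of Python. This is only an illustration, not the recipe used in this repo: the function name `select_clips`, its parameters, and the `durations` mapping are hypothetical, and the sketch assumes validated.tsv has `client_id`, `path`, and `accents` columns. Clip durations are assumed to come from elsewhere (validated.tsv does not list them), and gender balancing and the lower bound on speakers per class are left out.

```python
import csv
import io
from collections import defaultdict


def select_clips(rows, durations, target_sec=20.0, max_speakers=500):
    """Collect about target_sec of audio per speaker, capped per accent class.

    rows: dicts with 'client_id', 'path', 'accents' (assumed column names).
    durations: hypothetical mapping from clip path to length in seconds.
    """
    per_speaker = defaultdict(list)
    for row in rows:
        if row.get("accents"):  # accent labels are optional, skip unlabeled clips
            per_speaker[(row["accents"], row["client_id"])].append(row["path"])

    selected = defaultdict(dict)  # accent -> {client_id: [clip paths]}
    for (accent, speaker), paths in per_speaker.items():
        if len(selected[accent]) >= max_speakers:
            continue  # this accent class already has enough speakers
        picked, total = [], 0.0
        for path in paths:
            if total >= target_sec:
                break
            picked.append(path)
            total += durations.get(path, 0.0)
        selected[accent][speaker] = picked
    return selected


# Tiny in-memory stand-in for validated.tsv (made-up rows).
demo_tsv = (
    "client_id\tpath\taccents\n"
    "s1\ta.mp3\tGerman\n"
    "s1\tb.mp3\tGerman\n"
    "s1\tc.mp3\tGerman\n"
    "s2\td.mp3\t\n"
)
rows = list(csv.DictReader(io.StringIO(demo_tsv), delimiter="\t"))
durations = {"a.mp3": 12.0, "b.mp3": 9.0, "c.mp3": 5.0, "d.mp3": 4.0}
result = select_clips(rows, durations)
```

With the demo data, speaker `s1` keeps clips until the 20 second target is passed, and speaker `s2` is dropped because the clip has no accent label.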

Requirements.

  • Install docker-compose version > 1.29 or, preferably, Compose V2.
  • Create a .env file and define these variables:
    • PATH_CORPORA=/local/directory/where/cv/was/downloaded/
    • PORT_JUPY=8888 # local port of choice

For example .env:

PATH_CORPORA=/data/corpora/CommonVoice
PORT_JUPY=0  # 0 picks random port
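
Before building, it can help to sanity-check the .env file. The sketch below is one possible way to do that in Python; `parse_env` and `check_env` are hypothetical helpers, not part of this repo or of Docker Compose. The parser is deliberately simplified: it treats everything after a `#` as a comment, so it would mishandle values that contain `#`.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # simplification: drop inline comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def check_env(env):
    """Return a list of problems; an empty list means both variables look usable."""
    problems = []
    if not env.get("PATH_CORPORA"):
        problems.append("PATH_CORPORA is missing")
    port = env.get("PORT_JUPY", "")
    if not port.isdigit() or not 0 <= int(port) <= 65535:
        problems.append("PORT_JUPY must be an integer between 0 and 65535")
    return problems


sample = """PATH_CORPORA=/data/corpora/CommonVoice
PORT_JUPY=0  # 0 picks random port
"""
env = parse_env(sample)
problems = check_env(env)
```

In practice the text would be read from the .env file on disk; the check does not verify that the corpus directory actually exists.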

Build docker image.

Note: If using Compose V2, simply replace docker-compose with docker compose.

docker-compose build

Run docker container.

docker-compose -p fae$USER up -d --force-recreate

Running Jupyter notebook

Find the assigned local port by inspecting the output of:

docker-compose -p fae$USER ps
       Name                    Command                 State                         Ports                   
-------------------------------------------------------------------------------------------------------------
faejsmith_FAECV_1   tini -g -- start-notebook.sh   Up (healthy)   0.0.0.0:49153->8888/tcp,:::49153->8888/tcp

The notebook server inside the container (port 8888) is then reachable on that host port, for example: http://server:49153/
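
If the port needs to be picked up by a script, the mapping can be pulled out of the `ps` output with a regular expression. This is a minimal sketch: `host_port` is a hypothetical helper, and it assumes the `host:port->8888/tcp` notation shown in the sample output above, which may be laid out differently under Compose V2.

```python
import re


def host_port(ps_output, container_port=8888):
    """Return the first host port mapped to container_port, or None if absent."""
    match = re.search(rf"(\d+)->{container_port}/tcp", ps_output)
    return int(match.group(1)) if match else None


# The sample line from the docker-compose ps output above.
sample_ps = (
    "faejsmith_FAECV_1   tini -g -- start-notebook.sh   Up (healthy)   "
    "0.0.0.0:49153->8888/tcp,:::49153->8888/tcp"
)
port = host_port(sample_ps)
```

The output text itself would typically be captured with `subprocess.run(..., capture_output=True, text=True)` around the `docker-compose -p fae$USER ps` command.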

Find the corresponding access token in the container logs:

docker-compose -p fae$USER logs
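
Jupyter prints its access URL with a `token=` query parameter at startup, so the token can also be extracted programmatically. The sketch below assumes a hexadecimal token; `find_token` is a hypothetical helper, and the log line is a made-up example, not real output from this container.

```python
import re


def find_token(log_text):
    """Return the first Jupyter access token found in the logs, or None."""
    match = re.search(r"token=([0-9a-f]+)", log_text)
    return match.group(1) if match else None


# Made-up log line in the style Jupyter prints at startup.
sample_log = "FAECV_1  |     http://127.0.0.1:8888/lab?token=0123456789abcdef"
token = find_token(sample_log)
```

The token is then appended to the notebook URL as `?token=...` or pasted into the login page.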

Sample visualization of the Accents-Embedding space.

Follow the FAE/CV22 recipe to reproduce the visualization.

References

[1] Ardila R, Branson M, Davis K, Henretty M, Kohler M, Meyer J, Morais R, Saunders L, Tyers FM, Weber G. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019.
