
SoloAudio

Paper | HuggingFace Models | Colab | Demo page

Official PyTorch implementation of the ICASSP 2025 paper: SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer.

Try our Hugging Face Space!

TODO

  • Release model weights
  • Release data
  • HuggingFace Spaces demo
  • VAE training code
  • arXiv paper

Environment setup

conda env create -f env.yml
conda activate soloaudio

Pretrained Models

Download our pretrained models from huggingface.

After downloading, place the files under this repo as follows:

SoloAudio/
    config/
    demo/
    pretrained_models/
    ...

Inference examples

For audio-oriented TSE, please run:

python tse_audioTSE.py --output_dir './output-audioTSE/' --mixture './demo/1_mix.wav' --enrollment './demo/1_enrollment.wav'

For language-oriented TSE, please run:

python tse_languageTSE.py --output_dir './output-languageTSE/' --mixture './demo/1_mix.wav' --enrollment 'Acoustic guitar'
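Each script processes one mixture per call. To process several mixtures in a row, a small driver can build one invocation per file. A minimal sketch (the second demo file `2_mix.wav` is hypothetical; only the flags shown above are assumed):

```python
import shlex

def build_commands(mix_paths, out_dir, enrollment):
    """Build one tse_languageTSE.py invocation per mixture file."""
    return [
        ["python", "tse_languageTSE.py",
         "--output_dir", out_dir,
         "--mixture", path,
         "--enrollment", enrollment]
        for path in mix_paths
    ]

cmds = build_commands(["./demo/1_mix.wav", "./demo/2_mix.wav"],
                      "./output-languageTSE/", "Acoustic guitar")
for cmd in cmds:
    print(shlex.join(cmd))  # pass cmd to subprocess.run(...) to actually extract
```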

Data Preparation

To train a SoloAudio model, you need to prepare the following parts:

  1. To prepare the FSD-Mix dataset, run:
cd data_preparating/
python create_filenames.py
python create_fsdmix.py

You can also use our simulated data for training, validation, and testing.
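The simulation combines isolated events into mixtures at chosen signal-to-noise ratios. Conceptually, creating one mixture looks like the following numpy sketch (illustrative only, not the repo's actual simulation code):

```python
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Scale the interferer so the target/interferer power ratio is snr_db, then sum."""
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2)
    # gain g such that 10*log10(p_target / (g**2 * p_interf)) == snr_db
    g = np.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    return target + g * interferer

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)      # 1 s of "target" audio at 16 kHz
interferer = rng.standard_normal(16000)  # 1 s of "interferer" audio
mixture = mix_at_snr(target, interferer, snr_db=5.0)
```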

  2. To prepare the TangoSyn dataset, run:
cd tango/
sh gen.sh

  3. Prepare the TangoSyn-Mix dataset as in step 1.

  4. To extract the VAE features, run:

python extract_vae.py --data_dir "YOUR_DATA_DIR" --output_dir "YOUR_OUTPUT_DIR"
  5. To extract the CLAP features, run:
python extract_clap_audio.py --input_base_dir "YOUR_DATA_DIR" --output_base_dir "YOUR_OUTPUT_DIR"
python extract_clap_text.py --input_base_dir "YOUR_DATA_DIR" --output_base_dir "YOUR_OUTPUT_DIR" --split 1
python extract_clap_text.py --input_base_dir "YOUR_DATA_DIR" --output_base_dir "YOUR_OUTPUT_DIR" --split 2
python extract_clap_text.py --input_base_dir "YOUR_DATA_DIR" --output_base_dir "YOUR_OUTPUT_DIR" --split 3
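The `--split 1/2/3` invocations suggest the text-side extraction is sharded across three runs. One way such sharding could work, as a sketch (this is an assumption about the flag, not the script's actual logic):

```python
def shard(items, split, n_splits=3):
    """Return the 1-indexed `split`-th of `n_splits` round-robin shards."""
    return [item for i, item in enumerate(items) if i % n_splits == split - 1]

captions = [f"caption_{i}" for i in range(10)]  # hypothetical caption list
first = shard(captions, split=1)  # indices 0, 3, 6, 9
```

Run with `--split 1`, `2`, and `3` in parallel, and every item is processed exactly once.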

Training

Now, you are good to start training!

  1. To train with a single GPU, run:
python train.py

  2. To train with multiple GPUs, run:
accelerate launch train.py

Test

To test a folder of audio files, please run:

python test_audioTSE.py --output_dir './test-audioTSE/' --test_dir '/YOUR_PATH_TO_TEST/'

OR

python test_languageTSE.py --output_dir './test-languageTSE/' --test_dir '/YOUR_PATH_TO_TEST/'

To calculate the metrics used in the paper, please run:

cd metircs/
python main.py
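For reference, SI-SDR is a standard metric for target sound extraction; a generic numpy sketch is below (the repo's metrics code may use additional metrics or a different implementation):

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the reference."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference           # part of the estimate aligned with the reference
    noise = estimate - target            # everything else counts as error
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
est = ref + 0.1 * rng.standard_normal(16000)
score = si_sdr(ref, est)  # roughly 20 dB for 10% additive noise
```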

VAE Training

We provide code to train an audio waveform VAE model, based on stable-audio-tools.

  1. Change the data path in stable_audio_vae/configs/vae_data.txt (any folder containing audio files).

  2. Change the model config in stable_audio_vae/configs/vae_16k_mono_v2.config.

We provide a config for training on audio files with a 16 kHz sampling rate; change the settings if you want other sampling rates.

  3. Change the batch size and training settings in stable_audio_vae/defaults.ini.

  4. Run:

cd stable_audio_vae/
bash train_bash.sh
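The VAE compresses waveform chunks into a compact latent that the diffusion model operates on. The core reparameterization and KL terms that such training optimizes can be sketched as follows (generic VAE math in numpy, not stable-audio-tools' code; the 4x64 latent shape is hypothetical):

```python
import numpy as np

def reparameterize(mean, logvar, rng):
    """Sample z = mean + sigma * eps, keeping sampling differentiable in mean/logvar."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mean, logvar):
    """Mean per-element KL(q(z|x) || N(0, I)), the usual VAE regularizer."""
    return float(np.mean(0.5 * (np.exp(logvar) + mean ** 2 - 1.0 - logvar)))

rng = np.random.default_rng(0)
mean = np.zeros((4, 64))    # hypothetical latent: 4 frames x 64 channels
logvar = np.zeros((4, 64))
z = reparameterize(mean, logvar, rng)
```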

License

The codebase is under the MIT license.

Citations

@article{helin2024soloaudio,
  author    = {Wang, Helin and Hai, Jiarui and Lu, Yen-Ju and Thakkar, Karan and Elhilali, Mounya and Dehak, Najim},
  title     = {SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer},
  journal   = {arXiv},
  year      = {2024},
}

@inproceedings{jiarui2024dpmtse,
  author    = {Hai, Jiarui and Wang, Helin and Yang, Dongchao and Thakkar, Karan and Dehak, Najim and Elhilali, Mounya},
  booktitle = {ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title     = {DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction},
  year      = {2024},
  pages     = {1196-1200},
}
