You can create a virtual environment and install the required packages with the following commands:

```bash
conda create -n livevqa python=3.9.0 -y
conda activate livevqa
pip install -r requirements.txt
```
Please refer to `liveVQA_benchmarks/README.md` for detailed information.
This module collects news from BBC, CNN, Forbes, AP, and Variety. Before collecting news, configure the settings in `collectors/config.py`.
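The exact option names in `collectors/config.py` depend on the repository; the sketch below is only illustrative of the kind of settings involved, and every name in it is an assumption rather than the actual key:

```python
# Illustrative sketch of collectors/config.py.
# All names here are placeholders -- the real config keys may differ.

# News sources to crawl (the module supports BBC, CNN, Forbes, AP, and Variety).
SOURCES = ["bbc", "cnn", "forbes", "ap", "variety"]

# Directory where hot_topics_{timestamp}.json files are written.
OUTPUT_DIR = "./data"

# Cap on how many articles to collect per source per run.
MAX_ARTICLES_PER_SOURCE = 50
```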
Once configured, run the following commands to collect news articles:

```bash
cd LIVEVQA
python run.py
```

Each run collects news articles and saves them to `hot_topics_{timestamp}.json`.
This module ranks images in the collected news articles and filters out irrelevant ones. Set your API key and base path in `ranking/config.py`, then run the following commands:

```bash
cd ranking
python Model_ranking.py
```

Each run reads the latest `hot_topics_{timestamp}.json`, filters its images, and saves the result to `modified_topics_{timestamp}.json`.
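Since each stage reads the most recent timestamped file produced by the previous one, it helps to see how "latest" can be resolved. A minimal sketch, assuming the timestamp component sorts lexicographically (e.g. `20250101_120000`), so the maximum filename is the newest:

```python
import glob
import os

def pick_latest(filenames):
    """Pick the newest '{prefix}_{timestamp}.json' name from a list.

    Assumes the timestamp part sorts lexicographically, so the
    lexicographic maximum is the most recently written file.
    """
    if not filenames:
        raise FileNotFoundError("no candidate files found")
    return max(filenames)

def latest_file(directory, prefix):
    """Return the most recent '{prefix}_*.json' file under directory."""
    return pick_latest(glob.glob(os.path.join(directory, f"{prefix}_*.json")))
```

For example, `latest_file(".", "hot_topics")` would pick the newest collection output in the current directory.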
This module generates and filters Level 1 QAs from the filtered news articles. Set your API key and base path in `qa_makers/config.py` and `qa_Filter/config.py`, then run the following commands.

Generate Level 1 QAs:

```bash
cd qa_makers
python main.py
```

Each run reads the latest `modified_topics_{timestamp}.json` and generates QAs; the output is saved to `l1_topics_{timestamp}.json`.
Filter Level 1 QAs:

```bash
cd qa_Filter
python main.py
```

Each run reads the latest `l1_topics_{timestamp}.json` and filters the QAs; the result is saved to `l1_filtered_topics_{timestamp}.json`.
This module generates Level 2 QAs from the filtered Level 1 QAs. Set your API key and base path in `qa_makers_mh/config.py`, then run the following commands:

```bash
cd qa_makers_mh
python main.py
```

Each run reads the latest `l1_filtered_topics_{timestamp}.json` and generates Level 2 QAs; the output is saved to `l23_topics_{timestamp}.json`.
To run the whole pipeline automatically, set your base path in `start.py` and run:

```bash
python start.py
```

This will automatically:
- Collect news
- Filter images
- Generate Level 1 QAs
- Filter Level 1 QAs
- Generate Level 2 QAs
The final output will be saved to `l23_topics_{timestamp}.json`.
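Conceptually, `start.py` chains the five stages listed above. The sketch below mirrors the manual commands from the earlier sections; the actual invocation details inside `start.py` may differ:

```python
import subprocess

# Ordered pipeline stages, mirroring the manual commands above.
# (directory, script) pairs are taken from this README; the real
# start.py may invoke the stages differently.
STAGES = [
    (".", "run.py"),                  # collect news
    ("ranking", "Model_ranking.py"),  # filter images
    ("qa_makers", "main.py"),         # generate Level 1 QAs
    ("qa_Filter", "main.py"),         # filter Level 1 QAs
    ("qa_makers_mh", "main.py"),      # generate Level 2 QAs
]

def run_pipeline():
    for cwd, script in STAGES:
        # check=True stops the pipeline as soon as any stage fails.
        subprocess.run(["python", script], cwd=cwd, check=True)

if __name__ == "__main__":
    run_pipeline()
```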
This module helps you collect videos from YouTube. Before collecting videos, you need to:
- Configure the settings in `video_code/video_pipeline.sh`
- Download and configure the following repositories according to their instructions:
- Modify the `demo.py` files in both folders based on the implementations in `uvd.py` and `doclayout.py`
The installed Torch build may conflict with your CUDA version. We recommend checking your CUDA version first:

```bash
nvcc --version
nvidia-smi
```

Then install the matching Torch build.

For CUDA 12.4:

```bash
pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```

For CUDA 11.8:

```bash
pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

For CPU only:

```bash
pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```
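The mapping from CUDA version to PyTorch wheel index is mechanical, so a small helper can derive the `--index-url` for you. This covers only the three cases shown above:

```python
from typing import Optional

def torch_index_url(cuda_version: Optional[str]) -> str:
    """Map a CUDA version string (e.g. '12.4') to a PyTorch wheel index URL.

    Handles only the cases shown above; pass None for a CPU-only install.
    """
    base = "https://download.pytorch.org/whl/"
    if cuda_version is None:
        return base + "cpu"
    # '12.4' -> 'cu124', '11.8' -> 'cu118'
    return base + "cu" + cuda_version.replace(".", "")
```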
After configuration, run the following commands to collect YouTube videos:

```bash
cd LIVEVQA/video_code
bash video_pipeline.sh
```

💡 Tip: Make sure both `ffprobe` and `ffmpeg` are installed, otherwise the pipeline will fail with errors.
This module includes:
- Downloading videos
- Splitting videos by text
- Extracting keyframes
- Deduplication
- Selecting final pictures
Finally, the pipeline outputs a JSON file named `modified_{timestamp}.json`, and QA generation follows the same process as for news.
📝 Note: We made a small modification to `qa_makers/main.py`: before generating QAs, the module now evaluates whether the associated text is meaningful enough for QA generation. Therefore, to generate QAs from videos, use the QA generation code provided in the `video_code` directory. The other components remain unchanged.
This section helps you collect ArXiv data.

```bash
cd arxiv
```

First, configure the settings in `arxiv/config.py`. Specifically, set `BASE_DIR` to the directory where the downloaded papers should be saved. Then run:

```bash
python direct_download.py --yearmonth 2504 --start-id 1 --end-id 100 --concurrent 5 --processes 4
```

The crawled data is written to `data/raw`.
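The `--yearmonth`, `--start-id`, and `--end-id` flags suggest the downloader enumerates modern ArXiv identifiers of the form `YYMM.NNNNN`. A sketch of that enumeration (an assumption about how `direct_download.py` builds its ID list, not a copy of its code):

```python
def build_arxiv_ids(yearmonth, start_id, end_id):
    """Enumerate ArXiv IDs like '2504.00001' for a month and inclusive ID range.

    Modern ArXiv IDs use a five-digit, zero-padded sequence number after
    the YYMM prefix, which is what the format spec below produces.
    """
    return [f"{yearmonth}.{i:05d}" for i in range(start_id, end_id + 1)]
```

With the command-line values above, `build_arxiv_ids("2504", 1, 100)` would enumerate the 100 candidate papers for April 2025.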
Process the downloaded papers to extract images and their associations:

```bash
python get_article.py --dir /path/to/html/files --workers 4
```

The processed data is written to `data/processed`.
Set the environment variable `OPENAI_API_KEY` to your OpenAI API key, then run the following command to select the best images from the processed papers:

```bash
python select_best_images.py --input_dir /path/to/processed/jsons --workers 4 --start_index 0 --end_index 100
```

When synthesizing QAs about the authors, all authors from all papers are pooled in `authors.json`.
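Building that author pool amounts to collecting and de-duplicating the author lists from each processed paper record. A minimal sketch, assuming each record carries an `authors` field (the actual schema of the processed JSONs may use a different name):

```python
import json

def collect_authors(paper_records):
    """Gather a sorted, de-duplicated author list from paper records.

    Each record is assumed to have an 'authors' list; records without
    one are skipped. The field name is an assumption about the schema.
    """
    pool = set()
    for record in paper_records:
        pool.update(record.get("authors", []))
    return sorted(pool)

def save_author_pool(paper_records, path="authors.json"):
    """Write the pooled authors to a JSON file such as authors.json."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(collect_authors(paper_records), f, ensure_ascii=False, indent=2)
```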
Generate Level 1 QAs:

```bash
python construct_level1.py -i /path/to/processed/jsons -o /path/to/output/level1.jsonl --workers 4
```

Generate Level 2 QAs:

```bash
python construct_level2.py -i /path/to/output/level1.jsonl -o /path/to/output/level2.jsonl --processes 4
```