DQVis Dataset: Natural Language to Biomedical Visualization

This repository contains the code for generating the DQVis dataset available on Hugging Face.

The code generates a collection of natural language Queries on biomedical research Data and pairs each query with a visualization specification written in the Universal Discovery Interface grammar.

📂 Dataset on Hugging Face: HIDIVE/DQVis
🔍 Dataset Review Software: hms-dbmi/DQVis-review

🚀 Overview

The full pipeline runs from main.py and consists of five high-level steps:

Template Generation creates abstract questions and specifications with placeholders for entities and fields, along with constraints on those entities and fields.
Schema Generation creates dataset schemas from the provided datasets.
Template Expansion reifies the template questions and specifications against the provided schemas, producing every combination that satisfies the constraints.
Paraphraser uses an LLM framework to paraphrase the input questions so that they cover different levels of expertise and formality.
Export / Upload exports the data in various formats and uploads it to Hugging Face.
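
To make the Template Expansion step more concrete, here is a minimal sketch of how a template with placeholders and constraints could be expanded against a schema. Everything in it is an assumption made for illustration: the `<E>` and `<F>` placeholder syntax, the `template` and `schema` dictionaries, and the `expand` function are hypothetical, and the `spec` dictionary is a simplified stand-in rather than the actual Universal Discovery Interface grammar or the code in template_expansion.py.

```python
# Hypothetical template: <E> stands for an entity, <F> for a field.
# The constraint says which field type <F> is allowed to bind to.
template = {
    "question": "What is the distribution of <F> across <E>?",
    "spec": {"entity": "<E>", "field": "<F>", "chart": "histogram"},
    "constraints": {"F": "quantitative"},
}

# Hypothetical schema: entity name -> {field name -> field type}.
schema = {
    "donors": {"age": "quantitative", "sex": "nominal"},
    "samples": {"weight": "quantitative"},
}


def fill(text, entity, field):
    """Substitute both placeholders in a single string."""
    return text.replace("<E>", entity).replace("<F>", field)


def expand(template, schema):
    """Return one question/spec pair per entity/field that satisfies the constraint."""
    required_type = template["constraints"]["F"]
    results = []
    for entity, fields in schema.items():
        for field, field_type in fields.items():
            if field_type != required_type:
                continue  # this field violates the constraint, skip it
            results.append({
                "question": fill(template["question"], entity, field),
                "spec": {k: fill(v, entity, field) for k, v in template["spec"].items()},
            })
    return results


expanded = expand(template, schema)
for pair in expanded:
    print(pair["question"], pair["spec"])
```

In this sketch the nominal field `sex` is skipped, so the single template yields two question/specification pairs, one for `donors.age` and one for `samples.weight`.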

🛠️ Usage

To generate data, run main.py with the appropriate flags:

python main.py [options]

Available Options

| Flag | Description |
| --- | --- |
| `--schema` | Update the data package schema based on files in the `./datasets` folder |
| `--upload` | Upload the generated training data to Hugging Face |
| `--hf_local` | Save training data locally in Hugging Face-compatible format |
| `--paraphrase` | Perform paraphrasing of training data |
| `--only_cached` | Use only locally cached data for paraphrasing (no new paraphrase generation) |
| `--sqlite` | Export the generated data to an SQLite database |
| `--sample` | Export a sampled subset of the data to SQLite |
| `--json` | Export the data to JSON format |
| `--parquet` | Export the data to Parquet format |
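
The options above read as independent on/off switches. As a rough illustration, the sketch below declares them with Python's standard `argparse` module. The flag names and descriptions come from the table; treating each one as a `store_true` switch and the `build_parser` helper are assumptions, and the real main.py may define its arguments differently.

```python
import argparse

# Flag names and help texts taken from the options table.
FLAGS = {
    "--schema": "Update the data package schema based on files in the ./datasets folder",
    "--upload": "Upload the generated training data to Hugging Face",
    "--hf_local": "Save training data locally in Hugging Face-compatible format",
    "--paraphrase": "Perform paraphrasing of training data",
    "--only_cached": "Use only locally cached data for paraphrasing",
    "--sqlite": "Export the generated data to an SQLite database",
    "--sample": "Export a sampled subset of the data to SQLite",
    "--json": "Export the data to JSON format",
    "--parquet": "Export the data to Parquet format",
}


def build_parser():
    """Hypothetical parser: every flag is an optional boolean switch (an assumption)."""
    parser = argparse.ArgumentParser(prog="main.py")
    for flag, help_text in FLAGS.items():
        parser.add_argument(flag, action="store_true", help=help_text)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--paraphrase", "--sqlite"])
print(args.paraphrase, args.sqlite, args.upload)  # True True False
```

With `store_true`, any flag that is not passed defaults to `False`, which is why flags can be freely combined on one command line.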

You can combine multiple flags. For example, to paraphrase and export to SQLite:

python main.py --paraphrase --sqlite

To also upload the full dataset after generation:

python main.py --paraphrase --sqlite --upload
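
The `--only_cached` flag implies that paraphrases are cached locally so that later runs can reuse them without calling the LLM again. How the repository implements this is not described here. The sketch below shows one way such a cache could behave, using a JSON file. The `paraphrase_with_cache` function, the cache file name, and the `fake_llm` stand-in are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def paraphrase_with_cache(questions, cache_path, generate=None, only_cached=False):
    """Return {question: paraphrases}, reusing cached entries where available.

    When only_cached is True (or no generator is given), questions missing
    from the cache are skipped instead of being sent to the generator.
    """
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    results = {}
    for question in questions:
        if question in cache:
            results[question] = cache[question]
        elif not only_cached and generate is not None:
            paraphrases = generate(question)
            cache[question] = paraphrases
            results[question] = paraphrases
    cache_file.write_text(json.dumps(cache, indent=2))
    return results


def fake_llm(question):
    """Stand-in for an LLM call; it only changes the letter case."""
    return [question.lower(), question.upper()]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "paraphrase_cache.json"
    # First run: nothing is cached, so the generator is called.
    first = paraphrase_with_cache(["How many donors?"], cache_path, generate=fake_llm)
    # Second run with only_cached=True: the uncached question is skipped.
    second = paraphrase_with_cache(
        ["How many donors?", "New question?"], cache_path, only_cached=True
    )

print(sorted(second))
```

Skipping uncached questions, rather than raising an error, is a design choice of this sketch and keeps a cached-only run from failing partway through.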

🗂️ Folder Structure

.
├── datasets/        # Source structured data files
├── main.py          # Entry point for dataset generation
├── out/             # Generated datasets (optional exports)
└── README.md        # This file
