code for generating training data used for fine-tuning the LLM

hms-dbmi/DQVis-Generation

DQVis Dataset: Natural Language to Biomedical Visualization

This repository contains the code for generating the DQVis dataset available on Hugging Face.

The code generates a collection of natural language Queries about biomedical research Data, each paired with a visualization specification written in the Universal Discovery Interface grammar.

📂 Dataset on Hugging Face: HIDIVE/DQVis
🔍 Dataset Review Software: hms-dbmi/DQVis-review


🚀 Overview

Figure: Overview of the data generation pipeline.

The overall pipeline can be run from main.py and consists of a few high-level steps.

  1. Template Generation will create abstract questions and specifications with placeholders for entities and fields, along with constraints on those entities and fields.
  2. Schema Generation will create dataset schemas from the provided datasets.
  3. Template Expansion will reify the template questions and specifications against the provided schemas, producing every combination that satisfies the constraints.
  4. Paraphraser will use an LLM framework to paraphrase the input questions so that they cover different levels of expertise and formality.
  5. Export / Upload will export the generated data in various formats and optionally upload it to Hugging Face.
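To make the Template Expansion step more concrete, here is a minimal sketch of reifying one template against one schema. Everything in it is hypothetical and for illustration only: the `<E>` and `<F>` placeholder syntax, the `template` and `schema` layouts, the field types, and the `expand` helper are assumptions, not the repository's actual data structures, and the `spec` dictionary is not valid Universal Discovery Interface grammar.

```python
# Hypothetical template: <E> stands for an entity, <F> for a field.
# The constraint says that <F> must be a nominal (categorical) field.
template = {
    "question": "How many <E> are there, grouped by <F>?",
    "spec": {"source": "<E>", "groupby": "<F>"},  # placeholder spec, not real UDI grammar
    "constraints": {"F": lambda field: field["type"] == "nominal"},
}

# Hypothetical schema: each entity maps to a list of typed fields.
schema = {
    "donors": [
        {"name": "sex", "type": "nominal"},
        {"name": "age", "type": "quantitative"},
    ],
    "samples": [
        {"name": "organ", "type": "nominal"},
    ],
}


def fill(text, entity, field_name):
    """Substitute the entity and field placeholders in a string."""
    return text.replace("<E>", entity).replace("<F>", field_name)


def expand(template, schema):
    """Return one (question, spec) pair per entity/field combination that satisfies the constraint."""
    field_ok = template["constraints"]["F"]
    pairs = []
    for entity, fields in schema.items():
        for field in fields:
            if not field_ok(field):
                continue  # skip fields that violate the constraint
            question = fill(template["question"], entity, field["name"])
            spec = {key: fill(value, entity, field["name"])
                    for key, value in template["spec"].items()}
            pairs.append((question, spec))
    return pairs


expanded = expand(template, schema)
for question, spec in expanded:
    print(question, spec)
```

In this sketch the quantitative `age` field is filtered out by the constraint, so only two of the three entity/field combinations are produced.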

🛠️ Usage

To generate data, run main.py with the appropriate flags:

python main.py [options]

Available Options

Flag            Description
--schema        Update the data package schema based on files in the ./datasets folder
--upload        Upload the generated training data to Hugging Face
--hf_local      Save training data locally in Hugging Face-compatible format
--paraphrase    Perform paraphrasing of training data
--only_cached   Use only locally cached data for paraphrasing (no new paraphrase generation)
--sqlite        Export the generated data to an SQLite database
--sample        Export a sampled subset of the data to SQLite
--json          Export the data to JSON format
--parquet       Export the data to Parquet format
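As a rough illustration of how flags like these can be declared, the sketch below uses Python's standard `argparse` module. The flag names and descriptions come from the list above; treating every flag as a boolean switch, and the `FLAGS` and `build_parser` names, are assumptions, since the actual argument handling in main.py is not shown here.

```python
import argparse

# Flag names and help text are copied from the options list above.
# Assumption: every flag is a simple on/off switch.
FLAGS = {
    "--schema": "Update the data package schema based on files in the ./datasets folder",
    "--upload": "Upload the generated training data to Hugging Face",
    "--hf_local": "Save training data locally in Hugging Face-compatible format",
    "--paraphrase": "Perform paraphrasing of training data",
    "--only_cached": "Use only locally cached data for paraphrasing",
    "--sqlite": "Export the generated data to an SQLite database",
    "--sample": "Export a sampled subset of the data to SQLite",
    "--json": "Export the data to JSON format",
    "--parquet": "Export the data to Parquet format",
}


def build_parser():
    """Build a parser in which each flag defaults to False and becomes True when passed."""
    parser = argparse.ArgumentParser(prog="main.py")
    for flag, help_text in FLAGS.items():
        parser.add_argument(flag, action="store_true", help=help_text)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--paraphrase", "--sqlite"])
print(args.paraphrase, args.sqlite, args.upload)
```

With `store_true`, any flag that is omitted is simply `False`, which is one straightforward way to let several flags be combined freely.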

You can combine multiple flags. For example, to paraphrase and export to SQLite:

python main.py --paraphrase --sqlite

To also upload the full dataset after generation:

python main.py --paraphrase --sqlite --upload
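The --only_cached option listed above suggests a cache-first lookup: paraphrases that already exist locally are reused, and missing ones are skipped rather than generated. The sketch below shows one way such behavior could work. It is an assumption about the general pattern, not the repository's implementation; `paraphrase_all`, `fake_llm`, and the in-memory dictionary cache are all hypothetical.

```python
def paraphrase_all(questions, cache, only_cached, generate=None):
    """Return {question: paraphrases}, consulting the cache before generating anything."""
    results = {}
    for question in questions:
        if question in cache:
            # Cache hit: reuse the stored paraphrases.
            results[question] = cache[question]
        elif not only_cached and generate is not None:
            # Cache miss: generate new paraphrases and remember them.
            cache[question] = generate(question)
            results[question] = cache[question]
        # With only_cached=True, uncached questions are skipped.
    return results


def fake_llm(question):
    """Hypothetical stand-in for an LLM call; it only lowercases the question."""
    return [question.lower()]


cache = {"How many donors are there?": ["donor count?"]}
questions = ["How many donors are there?", "How many samples are there?"]

cached_only = paraphrase_all(questions, cache, only_cached=True)
with_generation = paraphrase_all(questions, cache, only_cached=False, generate=fake_llm)
print(cached_only)
print(with_generation)
```

The first call returns only the question that was already cached; the second also fills in the missing question and stores the result for later runs.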

🗂️ Folder Structure

.
├── datasets/        # Source structured data files
├── main.py          # Entry point for dataset generation
├── out/             # Generated datasets (optional exports)
└── README.md        # This file
