About • Getting Started • Models • Datasets • Citation
Note
[03-17-2024] 🔥 We updated our code to support Claude-3 models for grading. CODAL-Bench now includes claude-3-sonnet-20240229 responses.
[03-13-2024] We are preparing a leaderboard for CODAL-Bench, stay tuned!
[03-13-2024] 🔥 We release the first version of CodeUltraFeedback and CODAL-Bench.
Contact: If you have any inquiries or want to raise an issue, please feel free to contact:
Overview of CodeUltraFeedback dataset construction (see Section II of our paper for more details).
Given the increasing coding capabilities of large language models (LLMs), the following question emerges:
How well do these capabilities align with the expectations of developers, particularly concerning non-functional requirements such as code readability, efficiency, and adherence to best practices?
We believe existing benchmarks relying on automated metrics and static analysis tools are insufficient and too rigid for evaluating the broader capabilities of LLMs. Instead, we believe LLM-as-a-judge offers a more nuanced strategy (or proxy to human evaluation) to evaluate LLMs while effectively considering the intricacies of natural and programming languages.
Our work features two main contributions: CodeUltraFeedback and CODAL-Bench, a dataset and benchmark for aligning LLMs to coding preferences and evaluating their alignment using LLM-as-a-judge.
CodeUltraFeedback is a preference dataset of complex coding instructions to align LLMs to coding preferences.
It has an analogous construction procedure to UltraFeedback, featuring:
- ✨ Complex instructions: CodeUltraFeedback is based on a 10k subset of Magicoder Evol-Instruct comprising open-domain complex coding instructions.
- ✨ Coding preferences: CodeUltraFeedback includes 5 coding preferences, which are crucial to evaluate the broader capabilities of LLMs: instruction-following, code explanation, code complexity and efficiency, code readability, and coding style.
- ✨ Large pool of LLMs: We use a large pool of 14 LLMs from 8 model families to generate responses to the 10k instructions to consider diverse writing and coding styles.
- ✨ LLM-as-a-judge and AI feedback: We use GPT-3.5 as a judge for evaluating LLM responses, which annotates each response with both numerical and textual feedback. The AI feedback data can be leveraged for various applications, including model alignment through RLAIF, tuning a critic LLM, and more.
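To illustrate how the judge's numerical ratings could feed preference tuning such as DPO, here is a minimal sketch that turns rated responses into a chosen/rejected pair. The `RatedResponse` fields and the pairing rule (top-rated response versus a randomly picked lower-rated one) are assumptions made for illustration; they are not the actual dataset schema or necessarily the procedure used to build the binarized dataset.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class RatedResponse:
    """Hypothetical record: one LLM response annotated by the judge."""
    model: str       # name of the LLM that produced the response
    response: str    # generated answer to the instruction
    rating: float    # numerical score assigned by the judge
    rationale: str   # textual feedback from the judge


def binarize(instruction: str,
             responses: list[RatedResponse],
             rng: Optional[random.Random] = None) -> Optional[dict]:
    """Build one (prompt, chosen, rejected) example, or None if impossible.

    Assumed rule: the highest-rated response is "chosen" and a random
    strictly lower-rated response is "rejected".
    """
    rng = rng or random.Random(0)
    if len(responses) < 2:
        return None
    best = max(responses, key=lambda r: r.rating)
    lower = [r for r in responses if r.rating < best.rating]
    if not lower:
        # All responses are tied, so no preference signal exists.
        return None
    rejected = rng.choice(lower)
    return {
        "prompt": instruction,
        "chosen": best.response,
        "rejected": rejected.response,
    }
```

Skipping instructions where all ratings are tied avoids creating pairs with no real preference signal; other pairing rules (for example, best versus worst) are equally plausible.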
CODAL-Bench is a benchmark of 500 coding problems (100 per coding preference). We use LLM-as-a-judge with reference-guided single-answer grading using GPT-3.5 or GPT-4 to evaluate LLM alignment.
The approach enables the judge LLM to provide consistent ratings and evaluate each LLM individually (similar to MT-Bench).
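As a rough illustration of reference-guided single-answer grading, the sketch below builds a judge prompt that includes a reference answer, parses a score from the judge's reply, and averages scores per coding preference. The prompt wording, the `Rating: <n>` output convention, and the 1 to 10 scale are assumptions for this example, not the prompts or scale used in the repository; the call to the judge model itself is left out.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Optional

# Hypothetical prompt template; the real judge prompts may differ.
JUDGE_TEMPLATE = """You are grading a response with respect to: {preference}.

[Instruction]
{instruction}

[Reference answer]
{reference}

[Response to grade]
{response}

Explain your assessment, then finish with a line "Rating: <n>",
where <n> is an integer from 1 to 10."""


def build_judge_prompt(preference: str, instruction: str,
                       reference: str, response: str) -> str:
    """Fill the template for a single response (single-answer grading)."""
    return JUDGE_TEMPLATE.format(preference=preference,
                                 instruction=instruction,
                                 reference=reference,
                                 response=response)


def parse_rating(judge_output: str) -> Optional[int]:
    """Extract the score, or None if it is missing or out of range."""
    match = re.search(r"Rating:\s*(\d+)", judge_output)
    if match is None:
        return None
    value = int(match.group(1))
    return value if 1 <= value <= 10 else None


def aggregate(ratings: list[tuple[str, int]]) -> dict[str, float]:
    """Average ratings per preference, plus an overall mean."""
    by_preference = defaultdict(list)
    for preference, rating in ratings:
        by_preference[preference].append(rating)
    scores = {p: mean(v) for p, v in by_preference.items()}
    scores["overall"] = mean(r for _, r in ratings)
    return scores
```

Because each response is graded on its own against the reference, scores for different models can be computed independently and compared afterwards.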
We provide all the source code implemented to build CodeUltraFeedback and evaluate LLMs on CODAL-Bench.
Important
We are currently working on instructions to:
- Build CodeUltraFeedback or extend the dataset
- Tune your own SFT and DPO LLMs
- Evaluate LLMs on CODAL-Bench
Model | Checkpoint | Size | CODAL-Bench GPT-3.5 (G-3.5, G-4) | CODAL-Bench GPT-4 (G-4) | HumanEval+ (k=1, k=10) | License |
---|---|---|---|---|---|---|
CodeLlama-7B-Instruct | 🤗 HF Link | 7B | 6.00 / 5.46 | 4.72 | 37.9 / 60.4 | Llama2 |
CodeLlama-7B-Instruct-SFT | 🤗 HF Link | 7B | 6.51 / 5.83 | 5.84 | 51.2 / 82.9 | Llama2 |
CodeLlama-7B-Instruct-DPO | 🤗 HF Link | 7B | 7.15 / 6.79 | 5.08 | 42.3 / 80.5 | Llama2 |
CodeLlama-7B-Instruct-SFT+DPO | 🤗 HF Link | 7B | 7.36 / 7.08 | 5.85 | 43.1 / 75.6 | Llama2 |
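The HumanEval+ columns report results for k=1 and k=10. This README does not state how those numbers were computed; the sketch below shows the widely used unbiased pass@k estimator as one plausible way to obtain such figures. The names `n` (samples generated per problem) and `c` (samples passing all tests) are introduced here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs, as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```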
- 🤗 CodeUltraFeedback: https://huggingface.co/datasets/coseal/CodeUltraFeedback
- 🤗 CodeUltraFeedback binarized: https://huggingface.co/datasets/coseal/CodeUltraFeedback_binarized
- 🤗 CODAL-Bench: https://huggingface.co/datasets/coseal/codal-bench
- 🤗 Magicoder-Evol-Instruct-110K-sft: https://huggingface.co/datasets/coseal/Magicoder-Evol-Instruct-110K-sft
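The datasets listed above can likely be loaded with the Hugging Face `datasets` library. The following is a minimal sketch, assuming `datasets` is installed and that each repository loads with its default configuration; the helper names are hypothetical, and the available splits and column names should be checked on each dataset page.

```python
from urllib.parse import urlparse

# Dataset pages listed in this README.
DATASET_URLS = {
    "CodeUltraFeedback": "https://huggingface.co/datasets/coseal/CodeUltraFeedback",
    "CodeUltraFeedback_binarized": "https://huggingface.co/datasets/coseal/CodeUltraFeedback_binarized",
    "CODAL-Bench": "https://huggingface.co/datasets/coseal/codal-bench",
    "Magicoder-Evol-Instruct-110K-sft": "https://huggingface.co/datasets/coseal/Magicoder-Evol-Instruct-110K-sft",
}


def repo_id_from_url(url: str) -> str:
    """Turn a dataset page URL into the "<owner>/<name>" repository id."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return "/".join(parts[1:])


def load(name: str):
    """Download one of the listed datasets (requires network access)."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset
    return load_dataset(repo_id_from_url(DATASET_URLS[name]))
```

For example, `load("CodeUltraFeedback")` would download the dataset and return an object whose splits can then be inspected.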
@misc{weyssow2024codeultrafeedback,
title={CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences},
author={Martin Weyssow and Aton Kamanda and Houari Sahraoui},
year={2024},
eprint={2403.09032},
archivePrefix={arXiv},
primaryClass={cs.SE}
}