[🌐 Website] •
[📜 Paper] •
[🤗 Dataset] •
[🐱 GitHub]
Repo for "CriticBench: Benchmarking LLMs for Critique-Correct Reasoning"
The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess LLMs' abilities to critique and rectify their reasoning across a variety of tasks. CriticBench encompasses five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic. It compiles 15 datasets and incorporates responses from three LLM families. Utilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, i.e., GQC reasoning, and analyze the key factors affecting LLM critical reasoning.
Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in critique and correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing pattern, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research in LLM critique and self-improvement.
Figure 1: An overview of the CriticBench construction.
Cloning the repository
git clone git@github.com:CriticBench/CriticBench.git
cd CriticBench/src
Preparing conda env
conda create -n criticbench python=3.10
conda activate criticbench
Install a torch version compatible with your device, then install the required dependencies as follows:
pip install -r requirements.txt
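If you have not installed torch yet, a plain default install looks like the line below; this is only one option, so check pytorch.org for the command matching your CUDA version and platform:
pip install torch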
You can evaluate a model's generation (G), critique (Q), and correction (C) with the following command.
Some models require access permissions, which can be set with the following commands:
export HUGGING_FACE_HUB_TOKEN=<Your Huggingface token>
export OPENAI_API_KEY=<Your OpenAI API key>
python evaluate.py \
--available_gpus <GPU_IDs> \
--tasks GQC \
--prompt_type fs \
--hf_model <model-name> \
--enable_code_execution
We provide support for Auto-J and UltraCM. You can evaluate these models with the following command.
python evaluate.py \
--available_gpus <GPU_IDs> \
--tasks Q \
--hf_critic_model <model-name> \
--enable_code_execution
OpenAI models
python evaluate.py \
--tasks GQC \
--prompt_type fs \
--openai_model <model-name> \
--enable_code_execution
- --tasks specifies which tasks to evaluate. The available options are:
  - GQC for a combination of generation, critique, and correction;
  - QC for critique and correction;
  - G, Q, or C for generation, critique, or correction individually.
  - Note that correction tasks ("C") should be executed after critique tasks ("Q") or require a specified critique result file; see the two-step example further below.
- --prompt_type allows you to further specify the prompts for critique and correction used during evaluation:
  - fs: few-shot prompt for both critique and correction;
  - zs-crit-cot: zero-shot chain-of-thought prompt for critique;
  - zs-crit-ao-1, zs-crit-ao-2, and zs-crit-ao-3: three distinct types of zero-shot answer-only prompts for critique.
  - In correction, zero-shot prompts are all set to chain of thought (cot).
- The --enable_code_execution argument enables execution of code for generation and correction tasks.
- The --available_gpus argument specifies which GPUs to use, identified by their IDs (e.g., 0,1).
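For example, to run only the critique task with a zero-shot chain-of-thought prompt (the GPU IDs and model name below are placeholders to replace with your own):
python evaluate.py \
--available_gpus 0,1 \
--tasks Q \
--prompt_type zs-crit-cot \
--hf_model <model-name> \
--enable_code_execution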
You can specify paths to existing result files using the --existed_gen_file, --existed_crit_file, and --existed_corr_file arguments. For accurate answer extraction, ensure --prompt_type aligns with those results. Here is an example:
python evaluate.py \
--tasks GQC \
--prompt_type fs \
--enable_code_execution \
--existed_gen_file <path to generation result file> \
--existed_crit_file <path to critique result file> \
--existed_corr_file <path to correction result file>
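To make the dependency of correction on critique concrete, one possible two-step workflow is sketched below; the GPU IDs, model name, and critique file path are placeholders, and the critique result file written in step 1 is the one you pass to step 2:
# Step 1: run critique only
python evaluate.py \
--available_gpus 0,1 \
--tasks Q \
--prompt_type fs \
--hf_model <model-name> \
--enable_code_execution
# Step 2: run correction, reusing the critique results from step 1
python evaluate.py \
--available_gpus 0,1 \
--tasks C \
--prompt_type fs \
--hf_model <model-name> \
--enable_code_execution \
--existed_crit_file <path to critique result file>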
Here's an example of what a JSON line in a generation result file might look like:
{
"id": 0,
"final_prompt": "The final prompt for LLMs",
"generation_result": "LLM's result for the generation task"
}
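To quickly inspect a result file, you can pretty-print its first line; this assumes each line of the file is a standalone JSON object, and the path below is a placeholder:
head -n 1 <path to generation result file> | python -m json.tool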
If you find this repository helpful, please consider citing our paper:
@misc{lin2024criticbench,
title={CriticBench: Benchmarking LLMs for Critique-Correct Reasoning},
author={Zicheng Lin and Zhibin Gou and Tian Liang and Ruilin Luo and Haowei Liu and Yujiu Yang},
year={2024},
eprint={2402.14809},
archivePrefix={arXiv},
primaryClass={cs.CL}
}