CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities

📰 News • 🚀 Quick Start • 📋 Evaluation • 📌 Citation

📌 About

CodeMMLU is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) in coding and software knowledge. It builds upon the structure of multiple-choice question answering (MCQA) to cover a wide range of programming tasks and domains, including code generation, defect detection, software engineering principles, and more. The accompanying paper was accepted at ICLR 2025.
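
For a first look at the data, the question sets can presumably be loaded straight from the Hugging Face Hub. A minimal sketch, assuming the dataset is published under the Fsoft-AI4Code/CodeMMLU ID and that records carry typical MCQA fields (the subset name and the field names here are assumptions, not a documented schema):

from datasets import load_dataset

# Minimal exploration sketch; the dataset ID, subset name, and field
# names below are assumptions rather than a documented schema.
ds = load_dataset("Fsoft-AI4Code/CodeMMLU", "programming_syntax", split="test")

sample = ds[0]
print(sample["question"])  # assumed field: the question stem
for label, choice in zip("ABCD", sample["choices"]):  # assumed field: options
    print(f"  ({label}) {choice}")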

Why CodeMMLU?

  • CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources. It covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair across various domains and more than 10 programming languages.

  • Precise and comprehensive: Checkout our LEADERBOARD for latest LLM rankings.

🚀 Quick Start

Install CodeMMLU and set up its dependencies via pip:

pip install codemmlu

Generate responses for the CodeMMLU MCQA benchmark:

codemmlu --model_name <your_model_name_or_path> \
  --subset <subset> \
  --backend <backend> \
  --output_dir <your_output_dir>
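
For example, a quick run against an open model might look like this (the model name and output path are placeholders, not recommendations):

codemmlu --model_name deepseek-ai/deepseek-coder-6.7b-instruct \
  --subset all \
  --backend vllm \
  --output_dir ./codemmlu_outputs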

📋 Evaluation

Build codemmlu from source:

git clone https://github.com/Fsoft-AI4Code/CodeMMLU.git
cd CodeMMLU
pip install -e .

Note

If you prefer the vLLM backend, we highly recommend installing vLLM from the official project before installing codemmlu.
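
For instance, via PyPI (see the official vLLM documentation for hardware-specific builds):

pip install vllm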

Generate responses for the CodeMMLU questions:

codemmlu --model_name <your_model_name_or_path> \
  --peft_model <your_peft_model_name_or_path> \
  --subset all \
  --batch_size 16 \
  --backend [vllm|hf] \
  --max_new_tokens 1024 \
  --temperature 0.0 \
  --output_dir <your_output_dir> \
  --instruction_prefix <special_prefix> \
  --assistant_prefix <special_prefix> \
  --cache_dir <your_cache_dir>
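
A few notes on these flags: --temperature 0.0 gives deterministic (greedy) decoding, the usual choice for reproducible MCQA scoring; judging by their names, --instruction_prefix and --assistant_prefix wrap each question in a model-specific prompt template; and --peft_model loads a LoRA adapter on top of the base model.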

API usage:
codemmlu [-h] [-V] [--subset SUBSET] [--batch_size BATCH_SIZE] [--instruction_prefix INSTRUCTION_PREFIX]
                [--assistant_prefix ASSISTANT_PREFIX] [--output_dir OUTPUT_DIR] [--model_name MODEL_NAME]
                [--peft_model PEFT_MODEL] [--backend BACKEND] [--max_new_tokens MAX_NEW_TOKENS]
                [--temperature TEMPERATURE] [--prompt_mode PROMPT_MODE] [--cache_dir CACHE_DIR] [--trust_remote_code]

==================== CodeMMLU ====================

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         Get version
  --subset SUBSET       Select evaluate subset
  --batch_size BATCH_SIZE
  --instruction_prefix INSTRUCTION_PREFIX
  --assistant_prefix ASSISTANT_PREFIX
  --output_dir OUTPUT_DIR
                        Save generation and result path
  --model_name MODEL_NAME
                        Local path or Huggingface Hub link to load model
  --peft_model PEFT_MODEL
                        Lora config
  --backend BACKEND     LLM generation backend (default: hf)
  --max_new_tokens MAX_NEW_TOKENS
                        Number of max new tokens
  --temperature TEMPERATURE
  --prompt_mode PROMPT_MODE
                        Prompt available: zeroshot, fewshot, cot_zs, cot_fs
  --cache_dir CACHE_DIR
                        Cache for save model download checkpoint and dataset
  --trust_remote_code

List of supported backends:

Backend            Decoder model   LoRA
Transformers (hf)  ✅              ✅
vLLM (vllm)        ✅              ✅

Leaderboard

To evaluate your model and submit your results to the leaderboard, please follow the instructions in data/README.md.
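
As a quick sanity check before submitting, you can compute accuracy over your own generations. A hypothetical sketch, assuming the run produced a JSON Lines file in which each record carries the model's chosen option and the gold answer (the filename and both field names are invented for illustration; check the actual files in your output directory):

import json

# Hypothetical scoring sketch: "generations.jsonl", "prediction", and
# "answer" are invented names, not the documented codemmlu output format.
correct = total = 0
with open("generations.jsonl") as f:
    for line in f:
        record = json.loads(line)
        total += 1
        correct += record["prediction"].strip() == record["answer"].strip()

print(f"accuracy: {correct / total:.2%}")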

📌 Citation

If you find this repository useful, please consider citing our paper:

@article{nguyen2024codemmlu,
  title={CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities},
  author={Nguyen, Dung Manh and Phan, Thang Chau and Le, Nam Hai and Doan, Thong T. and Nguyen, Nam V. and Pham, Quang and Bui, Nghi D. Q.},
  journal={arXiv preprint},
  year={2024}
}
