
CodeEval-Pro
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

[🌐 Website][🏆 Leaderboard][📜 Paper][🤗 HF Datasets][🐦 Twitter]

Repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task" (ACL 2025 Findings)


Figure 1: Statistics of model performance.

🔥 News

  • [2024/12/31] Paper, Code, Benchmarks all released.

💡 Introduction

We present HumanEval Pro and MBPP Pro, two expanded versions of the traditional HumanEval and MBPP benchmarks that evaluate LLMs on the self-invoking code generation task. Self-invoking code generation is a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs: models are presented with a base problem and a related, more complex problem, and they must solve the base problem and then use its solution to address the more complex one.


Figure 2: Evaluation pipeline of HumanEval Pro and MBPP Pro.


Figure 3: An example of HumanEval Pro and MBPP Pro.
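
For intuition, below is a small, hypothetical problem pair in the spirit of Figure 3; the function names and tests are illustrative only and are not taken from the benchmark.

# Base problem: sort a list of numbers in ascending order.
def sort_numbers(nums):
    return sorted(nums)

# Self-invoking problem: sort each inner list, then order the inner
# lists by their sums. A natural solution reuses the base function.
def sort_nested(list_of_lists):
    sorted_inner = [sort_numbers(inner) for inner in list_of_lists]
    return sorted(sorted_inner, key=sum)

# Test cases of the kind used to check both problems.
assert sort_numbers([3, 1, 2]) == [1, 2, 3]
assert sort_nested([[3, 1], [2, 2, 2], [0]]) == [[0], [1, 3], [2, 2, 2]]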

🚀 Quick Start

⚙️ Setup

We recommend using Conda to manage your environment. Run the following commands to set up the environment:

conda create -n evalpro python=3.10
conda activate evalpro
pip install -e .

⚖️ Evaluation

To evaluate your own models on HumanEval Pro and MBPP Pro, we recommend using vLLM to generate solutions with the following command.

set -ex
OUTPUT_DIR=result
MODEL=QwQ-32B-preview
MODEL_PATH=Qwen/QwQ-32B-Preview
TASK_TYPE=humaneval_pro # or mbpp_pro
mkdir -p ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE}/outputs/

python -m eval.inference \
  --model_name_or_path $MODEL_PATH \
  --save_path ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE}/outputs/results.jsonl \
  --dataset $TASK_TYPE \
  --is_use_vllm true \
  --do_sample false \
  --temperature 0.0 \
  --top_p 1.0 \
  --max_new_tokens 4096 \
  --n_problems_per_batch 28 \
  --n_samples_per_problem 1 \
  --n_batches 1 

The choices of TASK_TYPE include:

["humaneval", "mbpp", "humaneval_pro", "mbpp_pro", "humaneval_pro_cot", "mbpp_pro_cot", "humaneval_pro_1shot", "mbpp_pro_1shot"]

To run API models, use

set -ex
WORK_DIR=evalpro/result
MODEL=GPT-4o 

TASK_TYPE=humaneval_pro      
mkdir -p ${WORK_DIR}/${MODEL}/${TASK_TYPE}/outputs/

python -m run_api \
  --model_name gpt-4o-2024-08-06 \
  --dataset $TASK_TYPE \
  --save_path ${WORK_DIR}/${MODEL}/${TASK_TYPE}/outputs/results.jsonl \
  --api_key  apikey \
  --base_url url 

Then you will get a results.jsonl file under the --save_path.
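
As a quick sanity check, you can inspect the generated file. The snippet below only assumes a standard JSON Lines format; the exact field names in each record are determined by eval/inference.py, and the path is illustrative.

import json

# Print the keys of the first few generated records (path is illustrative).
with open("result/QwQ-32B-preview/humaneval_pro/outputs/results.jsonl") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(sorted(record.keys()))
        if i >= 2:
            break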

To obtain your pass@k score, you can run eval/harness.py with the following command:

set -ex
OUTPUT_DIR=result
MODEL=Qwen2.5Coder-32B-base
DATASET=humaneval_pro
TASK_TYPE=humaneval_pro

python -m eval.santize \
    --model_name $MODEL \
    --source_path ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE}/outputs/

python -m eval.harness \
    --model_name $MODEL \
    --dataset_path dataset/refined_${DATASET}.json \
    --source_path ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE}/outputs/ \
    --save_path ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE} \
    --run_code

You will get a result_of_pass_k.json file in your --save_path. First check that the pass@k of the ground truth equals 1.0. You will then obtain two results: pass_k_of_output and pass_k_of_output_santized, where pass_k_of_output_santized is computed after sanitizing the raw model output. We use the higher score as the final result.
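
For reference, pass@k is conventionally computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021), where n is the number of samples per problem and c the number that pass all tests. A minimal sketch is shown below; it is not necessarily the exact code in eval/harness.py.

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k)
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 10 samples per problem, 3 of them correct -> pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))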

If you use --run_code, you will get the execution error statistics in ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE}/log.

The choices of DATASET include:

["humaneval_pro", "mbpp_pro"]

To evaluate your model on BigCodeBench-Lite Pro, run the following command:

export CUDA_VISIBLE_DEVICES=0
set -ex
WORK_DIR=result
MODEL=QwQ-32B-Preview
MODEL_PATH=Qwen/QwQ-32B-Preview
TASK_TYPE=bigcodebench_lite_pro
mkdir -p ${WORK_DIR}/${MODEL}/${TASK_TYPE}/outputs/

python -m eval.inference \
  --model_name_or_path $MODEL_PATH \
  --save_path ${WORK_DIR}/${MODEL}/${TASK_TYPE}/outputs/results.jsonl \
  --dataset $TASK_TYPE \
  --is_use_vllm true \
  --do_sample false \
  --temperature 0.0 \
  --top_p 1.0 \
  --max_new_tokens 4096 \
  --n_problems_per_batch 28 \
  --n_samples_per_problem 1 \
  --n_batches 1 

rm -rf ${WORK_DIR}/${MODEL}/${TASK_TYPE}/log
python -m eval.santize \
    --model_name $MODEL \
    --source_path ${WORK_DIR}/${MODEL}/${TASK_TYPE}/outputs/

python -m eval.harness \
    --model_name $MODEL \
    --task $TASK_TYPE \
    --dataset_path evalpro/dataset/refined_${TASK_TYPE}.json \
    --source_path ${WORK_DIR}/${MODEL}/${TASK_TYPE}/outputs/ \
    --save_path ${WORK_DIR}/${MODEL}/${TASK_TYPE} 
    # --run_code

To obtain results on the original HumanEval and MBPP, we recommend using the evalplus library with the following command.

OUTPUT_DIR=result
MODEL=QwQ-32B-preview
TASK_TYPE=humaneval
evalplus.evaluate --dataset $TASK_TYPE --samples ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE}/outputs/results.jsonl

📖 License

This code repository is licensed under the MIT License.

☕️ Citation

If you find this repository helpful, please consider citing our paper:

@article{yu2024humaneval,
  title={HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation},
  author={Yu, Zhaojian and Zhao, Yilun and Cohan, Arman and Zhang, Xiao-Ping},
  journal={arXiv preprint arXiv:2412.21199},
  year={2024}
}

Acknowledgement

Our evaluation code is inspired by Magicoder and WaveCoder. We thank EvalPlus for providing the evaluation of the original HumanEval and MBPP.
