repobench

โœจ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024

CC-BY-4.0 License

Stars
124

๐Ÿ”ฅ News

  • Feb 5th, 2024: RepoBench v1.1 (with newest code data) is now available on the ๐Ÿค— HuggingFace Hub. You can access the datasets for Python and Java using the following links:

    For more details of RepoBench v1.1, please refer to the data directory.

  • Jan 16th, 2024: RepoBench is accepted to ICLR 2024! ๐ŸŽ‰

๐Ÿ› ๏ธ Installation

git clone https://github.com/Leolty/repobench.git
cd repobench

[!NOTE] There is a requirements.txt file, which contains dependencies for reproducing the results in the paper. If you are only interested in the data, you can skip the installation of dependencies.

โš™๏ธ Description of Settings

As discussed in the paper, we have three settings for each task:

  • cross_file_first: Masks the line where a module from a different file is used for the first time.
  • cross_file_random: Masks a random line where a module from a different file is used (not the first usage).
  • in_file: Masks a random line that has no cross-file dependency.

๐Ÿ“ฅ Load Data

from datasets import load_dataset

dataset = load_dataset("tianyang/repobench_python_v1.1", ignore_verifications=True)

For more details, visit the Hugging Face dataset pages:

๐Ÿš€ Running Experiments

To run experiments on the RepoBench v1.1 dataset, we provide a very basic run.py script using the ๐Ÿค— Transformers library.

Example usage:

CUDA_VISIBLE_DEVICES=0 python run.py --model_name "deepseek-ai/deepseek-coder-1.3b-base" \
               --dataset_name "tianyang/repobench_python_v1.1" \
               --start_date "2023-12-01" \
               --end_date "2023-12-31" \
               --language "python" \
               --max_token_nums 15800 \
               --levels "2k" "4k" "8k" "12k" "16k" \
               --temperature 0.2 \
               --top_p 0.95 \
               --max_new_tokens 128 \
               --batch_size 1

For a full list of available parameters, please refer to the run.py file. And it should be super easy to customize the script for your own needs.

๐Ÿ“Š Evaluation

After generating completions, you can evaluate the results using the eval.py script. This script calculates various metrics including Exact Match (EM), Edit Similarity (ES), and CodeBLEU (CB) scores for each setting.

To run the evaluation:

python eval.py --path "results/deepseek-coder-1.3b-base-python" --language "python"

The script will output scores for each level (cross_file_first, cross_file_random, in_file) as well as weighted averages across all levels.

๐Ÿ“ Note

This branch of the repository is specifically for RepoBench v1.1. For the results presented in our ICLR 2024 paper, which used the initial version of RepoBench, please refer to the archive/v0 branch of this repository.

๐Ÿ“ Citation

If you use RepoBench in your research, please consider citing us:

@misc{liu2023repobench,
      title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems}, 
      author={Tianyang Liu and Canwen Xu and Julian McAuley},
      year={2024},
      url={https://arxiv.org/abs/2306.03091},
      booktitle={International Conference on Learning Representations}
}
Related Projects