
🔥 News

Feb 5th, 2024: RepoBench v1.1 (with newer code data) is now available on the 🤗 Hugging Face Hub. The datasets for Python and Java are available here:
- For Python: 🤗 Repobench Python V1.1
- For Java: 🤗 Repobench Java V1.1
For more details on RepoBench v1.1, please refer to the data directory.
Jan 16th, 2024: RepoBench was accepted to ICLR 2024! 🎉

🛠️ Installation

git clone https://github.com/Leolty/repobench.git
cd repobench

[!NOTE] The requirements.txt file lists the dependencies needed to reproduce the results in the paper. If you are only interested in the data, you can skip installing them.

⚙️ Description of Settings

As discussed in the paper, we have three settings for each task:

cross_file_first: Masks the line where a module from a different file is used for the first time.
cross_file_random: Masks a random line where a module from a different file is used (not the first usage).
in_file: Masks a random line that has no cross-file dependency.
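To make the three settings concrete, the following is a minimal sketch of how lines in a file could be labeled by setting. It is an illustration under stated assumptions, not the authors' data construction pipeline: the helper `classify_lines` is hypothetical, cross-file usage is approximated by whole-word matching on imported names, and "first time" is interpreted per imported name.

```python
import re


def classify_lines(lines, cross_file_names):
    """Label each line with the setting it would be a candidate for.

    lines: source lines of the file body (import statements excluded).
    cross_file_names: names imported from other files in the same repository.
    Both the function and its matching rule are illustrative assumptions.
    """
    seen = set()  # cross-file names that have already been used
    labels = []
    for line in lines:
        # Approximate "uses a cross-file module" with a whole-word name match.
        used = {
            name
            for name in cross_file_names
            if re.search(rf"\b{re.escape(name)}\b", line)
        }
        if not used:
            labels.append("in_file")            # no cross-file dependency
        elif used - seen:
            labels.append("cross_file_first")   # first use of some imported name
        else:
            labels.append("cross_file_random")  # only repeated uses
        seen |= used
    return labels


example = [
    "x = 1",
    "cfg = load_config(path)",
    "model = Model(cfg)",
    "cfg2 = load_config(other)",
    "print(x)",
]
labels = classify_lines(example, {"load_config", "Model"})
```

In this example the second `load_config` call is a `cross_file_random` candidate because the name was already used on an earlier line.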

📥 Load Data

from datasets import load_dataset

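# Skip checksum and size verification while loading.
# Note: in recent releases of the datasets library, ignore_verifications is
# deprecated in favor of verification_mode="no_checks".
dataset = load_dataset("tianyang/repobench_python_v1.1", ignore_verifications=True)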

For more details, visit the Hugging Face dataset pages:

Python: 🤗 Repobench Python V1.1
Java: 🤗 Repobench Java V1.1

🚀 Running Experiments

To run experiments on the RepoBench v1.1 dataset, we provide a very basic run.py script using the 🤗 Transformers library.

Example usage:

CUDA_VISIBLE_DEVICES=0 python run.py --model_name "deepseek-ai/deepseek-coder-1.3b-base" \
               --dataset_name "tianyang/repobench_python_v1.1" \
               --start_date "2023-12-01" \
               --end_date "2023-12-31" \
               --language "python" \
               --max_token_nums 15800 \
               --levels "2k" "4k" "8k" "12k" "16k" \
               --temperature 0.2 \
               --top_p 0.95 \
               --max_new_tokens 128 \
               --batch_size 1

For the full list of available parameters, please refer to the run.py file. The script is intended to be easy to customize for your own needs.
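As a rough illustration of what the `--start_date`, `--end_date`, and `--levels` options select, here is a minimal sketch of a date and level filter over samples. It is not taken from run.py: the helper `select_samples` is hypothetical, the field names `created_at` and `level` are assumptions about the sample schema, and treating both date bounds as inclusive is also an assumption.

```python
from datetime import date


def select_samples(samples, start_date, end_date, levels):
    """Keep samples created within [start_date, end_date] at one of the given levels.

    samples: iterable of dicts with assumed keys "created_at" (ISO date string)
             and "level" (e.g. "2k", "4k").
    """
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    wanted = set(levels)
    kept = []
    for sample in samples:
        # Only the YYYY-MM-DD prefix is parsed, so timestamps also work.
        created = date.fromisoformat(sample["created_at"][:10])
        if start <= created <= end and sample["level"] in wanted:
            kept.append(sample)
    return kept


toy_samples = [
    {"created_at": "2023-12-05", "level": "2k"},
    {"created_at": "2023-11-30", "level": "2k"},   # before the window
    {"created_at": "2023-12-31", "level": "24k"},  # level not requested
    {"created_at": "2023-12-31T08:00:00", "level": "8k"},
]
selected = select_samples(toy_samples, "2023-12-01", "2023-12-31", ["2k", "8k"])
```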

📊 Evaluation

After generating completions, you can evaluate the results using the eval.py script. This script calculates various metrics including Exact Match (EM), Edit Similarity (ES), and CodeBLEU (CB) scores for each setting.
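For intuition about the first two metrics, here is a minimal sketch of Exact Match and Edit Similarity on a single predicted line. It is not the implementation in eval.py: the function names are hypothetical, whitespace stripping before comparison is an assumption, and `difflib.SequenceMatcher` from the standard library is swapped in as a stand-in similarity measure, so its values may differ from a Levenshtein-based score. CodeBLEU is omitted because it needs language-specific parsing.

```python
from difflib import SequenceMatcher


def exact_match(prediction, reference):
    """True when the predicted line equals the reference, ignoring outer whitespace."""
    return prediction.strip() == reference.strip()


def edit_similarity(prediction, reference):
    """Character-level similarity in [0, 1]; 1.0 means identical strings.

    Uses difflib's matching-blocks ratio as a stand-in for an edit-distance score.
    """
    matcher = SequenceMatcher(None, prediction.strip(), reference.strip(), autojunk=False)
    return matcher.ratio()


em = exact_match("    return x\n", "return x")
es_same = edit_similarity("return x", "return x")
es_close = edit_similarity("return x", "return y")
```

Averaging these per-sample values over a setting and scaling to a percentage gives a score comparable in spirit to the reported EM and ES numbers.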

To run the evaluation:

python eval.py --path "results/deepseek-coder-1.3b-base-python" --language "python"

The script will output scores for each setting (cross_file_first, cross_file_random, in_file) as well as weighted averages across all settings.
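A weighted average over the three settings (cross_file_first, cross_file_random, in_file) can be sketched as follows. The weighting scheme is an assumption: this sketch weights each score by the number of samples behind it, which may not be what eval.py does, and `weighted_average` is a hypothetical helper. The numbers are made up for illustration and are not results.

```python
def weighted_average(scores, counts):
    """Average per-setting scores, weighting each by its sample count (assumed)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples to average over")
    return sum(scores[name] * counts[name] for name in counts) / total


# Illustrative values only.
toy_scores = {"cross_file_first": 20.0, "cross_file_random": 30.0, "in_file": 40.0}
toy_counts = {"cross_file_first": 1, "cross_file_random": 1, "in_file": 2}
overall = weighted_average(toy_scores, toy_counts)
```

With count-based weights, a setting with more samples pulls the overall score toward its own value, which is why the result here sits above the unweighted mean of 30.0.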

📝 Note

This branch of the repository is specifically for RepoBench v1.1. For the results presented in our ICLR 2024 paper, which used the initial version of RepoBench, please refer to the archive/v0 branch of this repository.

📝 Citation

If you use RepoBench in your research, please consider citing us:

@inproceedings{liu2023repobench,
      title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems},
      author={Tianyang Liu and Canwen Xu and Julian McAuley},
      year={2024},
      url={https://arxiv.org/abs/2306.03091},
      booktitle={International Conference on Learning Representations}
}
