Tip: For more recent evaluation approaches, for example for evaluating LLMs, we recommend our newer and more actively maintained library LightEval.

🤗 Evaluate is a library that makes evaluating and comparing models and reporting their performance easier and more standardized.

It currently contains:

implementations of dozens of popular metrics: the existing metrics cover a variety of tasks spanning from NLP to Computer Vision, and include dataset-specific metrics. With a simple command like accuracy = load("accuracy"), you can get any of these metrics ready to use for evaluating an ML model in any framework (NumPy/Pandas/PyTorch/TensorFlow/JAX).
comparisons and measurements: comparisons are used to measure the difference between models and measurements are tools to evaluate datasets.
an easy way of adding new evaluation modules to the 🤗 Hub: you can create new evaluation modules and push them to a dedicated Space in the 🤗 Hub with evaluate-cli create [metric name], which allows you to easily compare different metrics and their outputs for the same sets of references and predictions.

🎓 Documentation

🔎 Find a metric, comparison, measurement on the Hub

🌟 Add a new evaluation module

🤗 Evaluate also has lots of useful features like:

Type checking: the input types are checked to make sure that you are using the right input formats for each metric
Metric cards: each metric comes with a card that describes its values, their ranges, and its limitations, and provides examples of its usage and usefulness.
Community metrics: Metrics live on the Hugging Face Hub and you can easily add your own metrics for your project or to collaborate with others.

Installation

With pip

🤗 Evaluate can be installed from PyPI and should be installed in a virtual environment (venv or conda, for instance):

pip install evaluate

Usage

🤗 Evaluate's main methods are:

evaluate.list_evaluation_modules() to list the available metrics, comparisons and measurements
evaluate.load(module_name, **kwargs) to instantiate an evaluation module
results = module.compute(**kwargs) to compute the result of an evaluation module

Adding a new evaluation module

First install the necessary dependencies to create a new metric with the following command:

pip install evaluate[template]

Then you can get started with the following command which will create a new folder for your metric and display the necessary steps:

evaluate-cli create "Awesome Metric"

See this step-by-step guide in the documentation for detailed instructions.

Credits

Thanks to @marella for letting us use the evaluate namespace on PyPI, previously used by his library.
