A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
License: Apache-2.0
| Benchmark | #prompts | Modality | Capability | Logs | Pipeline Config |
|---|---|---|---|---|---|
| GeoMeter | 1086 | Image -> Text | Geometric Reasoning | GeoMeter.zip | geometer.py |
| MMMU | 900 | Image -> Text | Multimodal QA | MMMU.zip | mmmu.py |
| Image Understanding | 10249 | Image -> Text | Object Recognition, Object Detection, Visual Prompting, Spatial Reasoning | IMAGE_UNDERSTANDING.zip | object_recognition.py, object_detection.py, visual_prompting.py, spatial_reasoning.py |
| Vision Language | 13500 | Image -> Text | Spatial Understanding, Navigation, Counting | VISION_LANGUAGE.zip | spatial_map.py, maze.py, spatial_grid.py |
| IFEval | 541 | Text -> Text | Instruction Following | IFEval.zip | ifeval.py |
| FlenQA | 12000 | Text -> Text | Long Context Multi-hop QA | FlenQA.zip | flenQA.py |
| Kitab | 34217 | Text -> Text | Information Retrieval | Kitab.zip | kitab.py |
| ToxiGen | 10500 | Text -> Text | Toxicity Detection, Safe Language Generation | ToxiGen.zip | toxigen.py |
Note: The benchmarks on Image Understanding and Vision Language Understanding will be available soon on HuggingFace. Please stay tuned.
For non-determinism evaluations using the above benchmarks, we provide pipelines in `nondeterminism.py`.
To get started, clone this repository to your local machine and navigate to the project directory.

To install the package in a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

Alternatively, to build and install a wheel:

```bash
pip install wheel
python setup.py bdist_wheel
pip install dist/eureka_ml_insights*.whl
```

To create a conda environment from `environment.yml` instead:

```bash
conda env create --name myenv --file environment.yml
conda activate myenv
```

For GPU support, update the environment with:

```bash
conda env update --file environment_gpu.yml
```
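As an optional sanity check (not part of the official setup steps), you can confirm that the package imports cleanly:

```bash
python -c "import eureka_ml_insights"
```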
To reproduce the results of a pre-defined experiment pipeline, run:

```bash
python main.py --exp_config exp_config_name --model_config model_config_name --exp_logdir your_log_dir
```

For example, to run the FlenQA_Experiment_Pipeline experiment pipeline defined in `eureka_ml_insights/configs/flenqa.py` using the OpenAI GPT4 1106 Preview model:

```bash
python main.py --exp_config FlenQA_Experiment_Pipeline --model_config OAI_GPT4_1106_PREVIEW_CONFIG --exp_logdir gpt4_1106_preview
```

The results of the experiment will be saved in a directory under `logs/FlenQA_Experiment_Pipeline/gpt4_1106_preview`. For each experiment you run with these configurations, a new subdirectory will be created using the date and time of the experiment run.
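The resulting layout would look roughly like this (the timestamp format and inner folder names are illustrative; as described later in this README, one output folder is created per pipeline component):

```
logs/
└── FlenQA_Experiment_Pipeline/
    └── gpt4_1106_preview/
        └── <date_and_time_of_run>/
            ├── <component_1_output_dir>/
            ├── <component_2_output_dir>/
            └── ...
```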
For other available experiment pipelines and model configurations, see the `eureka_ml_insights/configs` directory. In `model_configs.py` you can configure the model classes to use your API keys, Key Vault URLs, endpoints, and other model-specific configurations.
Experiment pipelines define the sequence of components that are run to process data, run inference, and evaluate the model outputs. You can find examples of experiment pipeline configurations in the `configs` directory. To create a new experiment configuration, you need to define a class that inherits from `ExperimentConfig` and implements the `configure_pipeline` method. In the `configure_pipeline` method you define the pipeline config (arrangement of components) for your experiment. Once your class is ready, add it to the `configs/__init__.py` import list.
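A minimal sketch of what such a class could look like, assuming the class and method names described above; the `PipelineConfig` signature and the `log_dir` attribute are illustrative assumptions, not verified against `configs/config.py`:

```python
from eureka_ml_insights.configs import ExperimentConfig, PipelineConfig


class MyExperimentPipeline(ExperimentConfig):
    def configure_pipeline(self, **kwargs):
        # Arrange the pipeline's components; each placeholder below stands
        # for a fully specified component config (see the component
        # descriptions later in this README).
        self.processing_comp = ...     # e.g. a PromptProcessing config
        self.inference_comp = ...      # an Inference config
        self.evalreporting_comp = ...  # an EvalReporting config
        return PipelineConfig(
            [self.processing_comp, self.inference_comp, self.evalreporting_comp],
            self.log_dir,  # assumed attribute; adapt to the actual base class
        )
```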
Your pipeline can use any of the available components, which can be found under the `core` directory:

- `PromptProcessing`: use this component to prepare your data for inference, apply transformations, or apply a Jinja prompt template.
- `DataProcessing`: use this component to post-process the model outputs.
- `Inference`: use this component to run your model on any processed data, for example to run inference on the model under evaluation, or on another model that takes part in the evaluation pipeline as an evaluator or judge.
- `EvalReporting`: use this component to evaluate the model outputs using various metrics, aggregators, and visualizers, and to generate a report.
- `DataJoin`: use this component to join two sources of data, for example to join the model outputs with the ground truth data for evaluation.
Note that utility classes include Models, Metrics, DataLoaders, DataReaders, etc. The components in your pipeline need to use the correct utility classes for your scenario. For example, to evaluate an OpenAI model on a dataset that is available on HuggingFace, you need to use the `HFDataReader` data reader and the `OpenAIModelsOAI` model class. In standard scenarios you do not need to implement new components for your pipeline, but you do need to configure the existing components to work with the correct utility classes. If you need functionality that is not provided by the existing utility classes, you can implement a new utility class and use it in your pipeline.
In general, to find out what utility classes and other attributes need to be configured for a component, look at the component's corresponding Config dataclass in `configs/config.py`. For example, if you are configuring the `DataProcessing` component, look at the `DataProcessingConfig` dataclass in `configs/config.py`.
Utility classes are also configurable by providing the name of the class and the initialization arguments. For example, see `ModelConfig` in `configs/config.py`, which can be initialized with the model class name and the model initialization arguments.
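As an illustrative sketch of this pattern (the import paths and keyword arguments below are assumptions for the sake of the example, not the verified signature of `OpenAIModelsOAI`):

```python
from eureka_ml_insights.configs.config import ModelConfig  # assumed import path
from eureka_ml_insights.models import OpenAIModelsOAI      # assumed import path

# A config pairs a utility class with its init arguments; the argument
# names here are illustrative placeholders only.
my_model_config = ModelConfig(
    OpenAIModelsOAI,
    {"model_name": "gpt-4-1106-preview", "api_key": "<YOUR_API_KEY>"},
)
```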
Our current components use the following utility classes: `DataReader`, `DataLoader`, `Model`, `Metric`, `Aggregator`. You can use the existing utility classes or implement new ones as needed to configure your components.
This component is used for general data processing tasks. It has the following attributes:

- `data_reader_config`: Configuration for the DataReader that is used to load the data into a pandas dataframe, apply any necessary processing on it (optional), and return the processed data. We currently support local and Azure Blob Storage data sources. The available transform classes can be found in `data_utils/transforms.py`; if you need to implement new transform classes, add them to this file.
- `output_dir`: The folder name where the processed data will be saved. This folder will automatically be created under the experiment log directory, and the processed data will be saved in a file called `processed_data.jsonl`.
- `output_data_columns` (OPTIONAL): The list of columns to save in `transformed_data.jsonl`. By default, all columns are saved.

This component inherits from the DataProcessing component and is used specifically for prompt processing tasks, such as applying a Jinja prompt template. If a prompt template is provided, the processed data will have a 'prompt' column that is expected by the inference component. Otherwise the input data is expected to already have a 'prompt' column. This component also reserves the "model_output" column for the model outputs, so if it already exists in the input data, it will be removed.
In addition to the attributes of the DataProcessing component, the PromptProcessing component has the following attributes:

- `prompt_template_path` (OPTIONAL): This template is used to format your data for model inference in case you need prompt templating or system prompts. Provide your Jinja prompt template path to this component; see for example `prompt_templates/basic.jinja`. The prompt template processing step adds a 'prompt' column to the processed data, which is expected by the inference component. If you do not need prompt templating, make sure your data already has a 'prompt' column.
- `ignore_failure` (OPTIONAL): Whether to ignore the failure of prompt processing on a row and move on to the next, or to raise an exception. Default is False.
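To illustrate what the templating step does (a rough behavioral sketch, not the framework's implementation; the template text and variable name are invented rather than taken from `prompt_templates/basic.jinja`):

```python
from jinja2 import Template

# Hypothetical template: fields from each data row fill the placeholders,
# and the rendered string becomes that row's 'prompt' column.
template = Template("Answer the following question concisely.\n{{ question }}")
row = {"question": "What is the capital of France?"}
print(template.render(**row))
# Answer the following question concisely.
# What is the capital of France?
```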
The Inference component has the following attributes:

- `model_config`: Configuration of the model class to use for inference. You can find the available models in `models/`.
- `data_loader_config`: Configuration of the data loader class to use for inference. You can find the available data loader classes in `data_utils/data.py`.
- `output_dir`: The folder name where the model outputs will be saved. This folder will automatically be created under the experiment log directory, and the model outputs will be saved in a file called `inference_result.jsonl`.
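For orientation, each line of `inference_result.jsonl` is a JSON record that carries at least the 'prompt' and 'model_output' columns mentioned above; the exact record contents depend on your pipeline, so the example below is purely illustrative:

```json
{"prompt": "Answer the following question concisely.\nWhat is the capital of France?", "model_output": "Paris."}
```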
The EvalReporting component has the following attributes:

- `data_reader_config`: Configuration object for the DataReader that is used to load the data into a pandas dataframe. This is the same type of utility class used in the DataProcessing component.
- `metric_config`: A MetricConfig object to specify the metric class to use for evaluation. You can find the available metrics in `metrics/`. If you need to implement new metric classes, add them to this directory.
- `aggregator_configs`/`visualizer_configs`: Lists of configs for aggregators/visualizers to apply to the metric results. These classes take metric results, aggregate/analyze/visualize them, and save the output. You can find the available aggregators and visualizers in `metrics/reports.py`.
- `output_dir`: The folder name where the evaluation results will be saved.

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
To contribute to the framework:

- Add your experiment and pipeline configurations to `configs`, as well as any utility classes that your pipeline requires.
- Add tests for your pipeline and any new utility classes in the `tests` directory.
- Run `make format-inplace` to format the files you have changed. This will only work on files that git is tracking, so make sure to `git add` any newly created files before running this command.
- Run `make linters` to check any remaining style or format issues and fix them manually.
- Run `make test` to run the tests and make sure they all pass.

If you use this framework in your research, please cite the following paper:
```bibtex
@article{eureka2024,
  title={Eureka: Evaluating and Understanding Large Foundation Models},
  author={Balachandran, Vidhisha and Chen, Jingya and Joshi, Neel and Nushi, Besmira and Palangi, Hamid and Salinas, Eduardo and Vineet, Vibhav and Woffinden-Luey, James and Yousefi, Safoora},
  journal={Microsoft Research. MSR-TR-2024-33},
  year={2024}
}
```
A cross-cutting dimension for all capability evaluations is the evaluation of several aspects of model behavior important for the responsible fielding of AI systems. These considerations include the fairness, reliability, safety, privacy, and security of models. While evaluations through the ToxiGen dataset (included in Eureka-Bench) capture notions of representational fairness for different demographic groups and, to some extent, the ability of the model to generate safe language despite unsafe input triggers in the prompt, other aspects or nuances of fairness and safety require further evaluation and additional clarity, which we hope to integrate in future versions and welcome contributions for. We are also interested in expanding Eureka-Bench with tasks where fairness and bias can be studied in more benign settings that simulate how risks may appear when humans use AI to assist them in everyday tasks (e.g., creative writing, information search) and where subtle language or visual biases encoded in training data might be reflected in the AI's assistance.
A rising general concern about responsible AI evaluations is the quick turnaround between the release of new benchmarks and their inclusion in content safety filters or post-training datasets. Because of this, scores on benchmarks focused on responsible and safe deployment may appear unusually high for the most capable models. While the quick reaction is a positive development, from an evaluation and understanding perspective, the high scores indicate that the benchmarks are not sensitive enough to capture differences in the alignment and safety processes followed for different models. At the same time, fielding thresholds for responsible AI measurements can be inherently higher, and as such these evaluations require a different interpretation lens. For example, a 5 percent error rate in instruction following for content length should not be weighed in the same way as a 5 percent error rate in detecting toxic content, or a 5 percent success rate in jailbreak attacks. Therefore, successful and timely evaluations to this end depend on collaborative efforts that integrate red teaming, quantified evaluations, and human studies in the context of real-world applications.
Finally, Eureka and the set of associated benchmarks are only the initial snapshot of an effort that aims at reliably measuring progress in AI. Our team is excited about further collaborations with the open-source and research communities, with the goal of sharing and extending current measurements for new capabilities and models. Our current roadmap involves enriching Eureka with more measurements around planning, reasoning, fairness, reliability and safety, and advanced multimodal capabilities for video and audio.