Open-LLM-Benchmark

Evaluate open-source language models on agent, formatted output, instruction following, long-context, multilingual, coding, and custom task capabilities.


🚀Open-source LLMs Benchmark

Evaluate the capabilities of open-source LLMs in agent behavior, tool calling, formatted output, long-context retrieval, multilingual support, coding, mathematics, and custom tasks.

Example

🤖Agent Task

The ReAct agent can access 5 functions. There are 10 questions to solve: 4 are simple questions that can be solved with a single function call, and 6 are complicated questions that require the agent to take multiple steps.

The score ranges from 1 to 5, with 5 representing complete correctness. Here is a screenshot taken while running the evaluation.
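Concretely, such a run can be scored with a loop like the following minimal sketch. The tools, prompt template, `llm` callable, and crude 1-to-5 rubric here are illustrative assumptions, not this package's actual API:

# Minimal sketch of a ReAct-style agent evaluation loop.
# The tools, the `llm` callable, and the grading below are
# illustrative assumptions, not this package's actual API.
import re

TOOLS = {
    "add": lambda a, b: float(a) + float(b),
    "multiply": lambda a, b: float(a) * float(b),
}

PROMPT = """Answer the question using Thought/Action/Observation steps.
Available tools: add(a, b), multiply(a, b).
Finish with: Final Answer: <answer>

Question: {question}
{scratchpad}"""

def run_react(llm, question, max_steps=5):
    scratchpad = ""
    for _ in range(max_steps):
        reply = llm(PROMPT.format(question=question, scratchpad=scratchpad))
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        # Parse an action like: Action: add(2, 3)
        m = re.search(r"Action:\s*(\w+)\((.*?)\)", reply)
        if not m or m.group(1) not in TOOLS:
            return None  # model broke the ReAct format
        args = [x.strip() for x in m.group(2).split(",")]
        observation = TOOLS[m.group(1)](*args)
        scratchpad += f"{reply}\nObservation: {observation}\n"
    return None

def score(prediction, reference):
    # Crude rubric for illustration: 5 for an exact match,
    # 1 for a miss or a broken ReAct format.
    return 5 if prediction == reference else 1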

🧐Retrieval Task

Insert a needle (the answer) into a haystack (a long context) and ask the model to retrieve it by answering a question about the long context.
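A haystack prompt can be assembled roughly as follows; the filler sentence, needle wording, and insertion depth are assumptions made for illustration:

# Sketch of a needle-in-a-haystack prompt builder (illustrative only).

def build_haystack_prompt(needle, filler_sentence, depth=0.5, n_sentences=2000):
    """Bury `needle` at a relative `depth` (0.0 = start, 1.0 = end)
    inside a long context made of repeated filler sentences."""
    haystack = [filler_sentence] * n_sentences
    haystack.insert(int(depth * n_sentences), needle)
    context = " ".join(haystack)
    return (
        f"{context}\n\n"
        "Based only on the context above, what is the secret passphrase?"
    )

prompt = build_haystack_prompt(
    needle="The secret passphrase is 'blue-tomato-42'.",
    filler_sentence="The quick brown fox jumps over the lazy dog.",
    depth=0.35,
)
# Score by checking whether the model's reply contains 'blue-tomato-42'.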

🗣️Formatted Output Task

Evaluate the model's ability to respond in a specified format, such as JSON, a single number, Python code, etc.
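Grading format compliance can be done with simple checkers like the sketch below; the three checks shown are assumptions about how such a task might be scored, not the package's implementation:

# Sketch of format-compliance checks for model replies (illustrative).
import json
import re

def is_valid_json(reply: str) -> bool:
    try:
        json.loads(reply)
        return True
    except json.JSONDecodeError:
        return False

def is_single_number(reply: str) -> bool:
    return re.fullmatch(r"-?\d+(\.\d+)?", reply.strip()) is not None

def extract_python_block(reply: str):
    # Expect the code wrapped in a ```python ... ``` fence.
    m = re.search(r"```python\n(.*?)```", reply, re.DOTALL)
    return m.group(1) if m else None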

Benchmark Evaluation

Supported:

  • 🤖Agent: evaluate whether the model can accurately select tools or functions to invoke and follow the ReAct pattern to solve problems.
  • 🗣️Formatted output: evaluate whether the model can output content in required formats such as JSON, a single number, a code block, etc.
  • 🧐Long-context retrieval: the ability to retrieve correct facts from a long context.

Plan:

  • 🇺🇸🇨🇳Multilingual: the ability to understand and respond in different languages.
  • ⌨️Coding: the ability to solve complicated problems with code.
  • Mathematics: the ability to solve mathematical problems with or without a code interpreter.
  • 😀Custom Task: easily define and evaluate any specific task you care about.

Install

Install from PyPI:

pip install open_llm_benchmark

Install from the GitHub repo:

git clone [email protected]:EvilPsyCHo/Open-LLM-Benchmark.git
cd Open-LLM-Benchmark
python setup.py install

Supported Backends

  • Huggingface transformers
  • llama-cpp-python
  • vLLM
  • OpenAI
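All of these backends can sit behind one generate(prompt) -> str interface. The sketch below shows two of them; the wrapper classes are assumptions, while the underlying transformers and openai calls are those libraries' real APIs:

# Sketch of a common interface over two of the listed backends.
# The wrapper classes are assumptions; the library calls are real.

class TransformersBackend:
    def __init__(self, model_name: str):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Decode only the newly generated tokens, not the prompt.
        return self.tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )

class OpenAIBackend:
    def __init__(self, model_name: str = "gpt-4o-mini"):
        from openai import OpenAI
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model_name = model_name

    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        resp = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_new_tokens,
        )
        return resp.choices[0].message.content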

Contribute

Feel free to contribute to this project!

  • more backends, such as Anthropic, Ollama, etc.
  • more tasks.
  • more evaluation data.
  • visualization of evaluation results.
  • etc.