Evaluate the capabilities of open-source LLMs in agent tasks, tool calling, instruction following, formatted output, long-context retrieval, multilingual support, coding, mathematics, and custom tasks.
The ReAct agent has access to 5 functions. There are 10 questions to solve: 4 are simple questions that can be answered with a single function call, and 6 are complicated questions that require the agent to take multiple steps. A minimal sketch of such a loop is shown below.
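To make this concrete, here is a minimal sketch of a ReAct-style tool-calling loop. The tool registry, the `Action:`/`Final:` reply format, and the `call_llm` parameter are illustrative assumptions, not this package's actual API.

```python
# A minimal sketch of a ReAct-style tool-calling loop, for illustration only.
# The tool registry, the "Action:/Final:" reply format, and the `call_llm`
# parameter are assumptions; they are not this package's actual API.
import json

# Stand-ins for the 5 functions the benchmark exposes to the agent.
TOOLS = {
    "add": lambda a, b: a + b,
    "lookup": lambda key: {"pi": 3.14159}.get(key),
}

def react_solve(question, call_llm, max_steps=6):
    """Run the agent loop; `call_llm` maps a prompt string to a reply string."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = call_llm(transcript)  # expected: 'Action: {...}' or 'Final: ...'
        transcript += reply + "\n"
        if reply.startswith("Final:"):
            return reply[len("Final:"):].strip()
        if reply.startswith("Action:"):
            call = json.loads(reply[len("Action:"):])
            observation = TOOLS[call["name"]](*call.get("args", []))
            transcript += f"Observation: {observation}\n"
    return None  # the agent failed to answer within the step budget
```

Simple questions should finish in one `Action`/`Observation` round, while complicated ones chain several before the final answer.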
Scores range from 1 to 5, with 5 representing complete correctness. Here is a screenshot taken while running the evaluation.
Insert the needle (the answer) into a haystack (a long context) and ask the model to retrieve it by answering a question about the long context.
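For illustration, here is a minimal sketch of how such a test case can be constructed. The helper name, insertion strategy, and example needle are assumptions, not this package's implementation.

```python
# A minimal sketch of building a needle-in-a-haystack test case.
# The helper name, insertion strategy, and example needle are illustrative
# assumptions, not this package's implementation.
def build_haystack(filler: str, needle: str, depth: float = 0.5) -> str:
    """Insert `needle` into `filler` text at a relative depth in [0, 1]."""
    cut = int(len(filler) * depth)
    return filler[:cut] + "\n" + needle + "\n" + filler[cut:]

needle = "The secret passphrase is 'blue-harbor-42'."
context = build_haystack("Lorem ipsum dolor sit amet. " * 2000, needle, depth=0.3)
prompt = context + "\n\nBased only on the text above, what is the secret passphrase?"
# The model's reply is then checked against the needle.
```

Varying `depth` and the filler length probes retrieval at different positions and context sizes.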
Evaluate the model's ability to respond in a specified format, such as JSON, Number, Python, etc.
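As an illustration, here is a minimal sketch of validating a reply against a required format. The format names mirror the ones listed above; the checking logic itself is an assumption, not this package's code.

```python
# A minimal sketch of checking that a model reply matches a required format.
# The checking logic is an illustrative assumption, not this package's code.
import ast
import json

def matches_format(reply: str, fmt: str) -> bool:
    try:
        if fmt == "JSON":
            json.loads(reply)             # must parse as valid JSON
        elif fmt == "Number":
            float(reply.strip())          # must parse as a number
        elif fmt == "Python":
            ast.parse(reply)              # must at least be valid Python syntax
        else:
            return False
        return True
    except (ValueError, SyntaxError):
        return False

print(matches_format('{"answer": 42}', "JSON"))   # True
print(matches_format("not a number", "Number"))   # False
```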
Supported:
Plan:
Install from PyPI:
pip install open_llm_benchmark
Install from the GitHub repo:
git clone [email protected]:EvilPsyCHo/Open-LLM-Benchmark.git
cd Open-LLM-Benchmark
python setup.py install
Feel free to contribute to this project!