Retrieval and Retrieval-augmented LLMs
FlagEmbedding focuses on retrieval-augmented LLMs and currently consists of the following projects:
BAAI/bge-reranker-base and BAAI/bge-reranker-large: cross-encoder models that are more powerful than embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
bge-*-v1.5: embedding models that alleviate the issue of the similarity distribution and enhance retrieval ability without instruction.
bge-large-* (short for BAAI General Embedding) models: rank 1st on the MTEB and C-MTEB benchmarks! 🎉 🎉
Install with pip:
pip install -U FlagEmbedding
Clone the repository and install
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install .
For development in editable mode:
pip install -e .
First, load one of the BGE embedding models:
from FlagEmbedding import FlagModel
# use_fp16=True speeds up computation with a slight performance degradation
model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)
Then, feed some sentences to the model and get their embeddings:
sentences_1 = ["I love NLP", "I love machine learning"]
sentences_2 = ["I love BGE", "I love text retrieval"]
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
Once we get the embeddings, we can compute similarity by inner product:
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
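The inner product matches cosine similarity only when the embeddings have unit length. The sketch below makes that explicit with plain Python lists standing in for model output; the helper names and the toy 3-dimensional vectors are illustrative assumptions, not part of FlagEmbedding.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_by_similarity(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar, plus the scores."""
    q = l2_normalize(query_vec)
    scores = [inner_product(q, l2_normalize(p)) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy vectors for illustration only (not real embeddings).
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
order, scores = rank_by_similarity(query, passages)
print(order)
```

Because each vector is normalized first, the passage pointing in the same direction as the query scores highest regardless of its original length.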
We are actively maintaining the community of BGE and FlagEmbedding. Let us know if you have any suggestions or ideas!
We are currently updating the tutorials, aiming to create a comprehensive and detailed tutorial for beginners on text retrieval and RAG. Stay tuned!
The following content will be released in the upcoming weeks:
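In this project, we introduce BGE-M3, the first embedding model which supports multi-functionality (dense retrieval, sparse retrieval, and multi-vector (ColBERT) retrieval), multi-linguality, and multi-granularity (up to 8192 tokens).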
The training code and fine-tuning data will be open-sourced in the near future.
In this project, we introduce Visualized-BGE, which integrates image token embedding into the BGE text embedding framework. Visualized-BGE can be used for various hybrid modal retrieval tasks, such as Multi-Modal Knowledge Retrieval, Composed Image Retrieval, and Knowledge Retrieval with Multi-Modal Queries.
Our model delivers outstanding zero-shot performance across multiple hybrid modal retrieval tasks. It can also serve as a base model for downstream fine-tuning for hybrid modal retrieval tasks.
We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is efficient, taking 8 hours on one 8xA800 (80G) GPU machine (the context length can go far beyond 80K with more computing resources). The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS (needle-in-a-haystack), topic retrieval, and long-context language understanding, while also preserving the original capability over short contexts.
Using long contexts is a major challenge for large language models because of their limited context window. Activation Beacon condenses the LLM's raw activations into more compact forms so that the model can perceive a much longer context within a limited context window. It is an effective, efficient, compatible, and low-cost (in training) method for extending the context length of LLMs. For more details, please refer to our paper and code.
LM-Cocktail automatically merges fine-tuned models and the base model, using a simple function to compute the merging weights. It can be used to improve performance on a target domain without decreasing general capabilities beyond that domain, and to generate a model for new tasks without fine-tuning. You can use it to merge LLMs (e.g., Llama) or embedding models. For more details, please refer to our report, LM-Cocktail, and code.
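To make the idea of weighted model merging concrete, here is a minimal sketch. It is not LM-Cocktail's API: the softmax over negative losses, the `temperature` parameter, and both function names are assumptions chosen for illustration, and plain Python lists stand in for parameter tensors.

```python
import math

def merging_weights(losses, temperature=1.0):
    """Assumed weighting rule: softmax over negative losses, so a model
    with lower loss on the target examples receives a larger weight."""
    scaled = [-loss / temperature for loss in losses]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def merge_parameters(models, weights):
    """Weighted average of parameter dicts that share keys and shapes."""
    merged = {}
    for name in models[0]:
        size = len(models[0][name])
        merged[name] = [
            sum(w * model[name][i] for model, w in zip(models, weights))
            for i in range(size)
        ]
    return merged

# Toy "models": one parameter vector each (illustrative values only).
base = {"w": [0.0, 0.0]}
fine_tuned = {"w": [2.0, 4.0]}
weights = merging_weights([0.5, 0.5])  # equal losses give equal weights
merged = merge_parameters([base, fine_tuned], weights)
print(weights, merged)
```

With equal losses the merged parameters sit halfway between the two models; a lower loss for one model pulls the result toward it.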
LLM Embedder is fine-tuned based on the feedback from LLMs. It supports the retrieval augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval. It is fine-tuned over 6 tasks: Question Answering, Conversational Search, Long Conversation, Long-Range Language Modeling, In-Context Learning, and Tool Learning. For more details, please refer to the report and ./FlagEmbedding/llm_embedder/README.md.
A cross-encoder performs full attention over the input pair, which is more accurate than an embedding model (i.e., a bi-encoder) but also more time-consuming. It can therefore be used to re-rank the top-k documents returned by an embedding model. We train the cross-encoder on multilingual pair data. The data format is the same as for the embedding model, so you can fine-tune it easily by following our example. For more details, please refer to ./FlagEmbedding/reranker/README.md.
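The retrieve-then-rerank pattern described above can be sketched as follows. Both scoring functions are toy stand-ins based on word overlap, not real models, and every name here is hypothetical; in practice the first scorer would be an embedding model and the second a cross-encoder.

```python
def retrieve_then_rerank(query, docs, embed_score, cross_score, top_k=2):
    """Stage 1 scores every document cheaply and keeps a shortlist;
    stage 2 applies the costlier scorer to the shortlist only."""
    shortlist = sorted(docs, key=lambda d: embed_score(query, d), reverse=True)[:top_k]
    return sorted(shortlist, key=lambda d: cross_score(query, d), reverse=True)

def toy_embed_score(query, doc):
    """Stand-in for a bi-encoder: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def toy_cross_score(query, doc):
    """Stand-in for a cross-encoder: shared words normalized by document length."""
    return toy_embed_score(query, doc) / len(doc.split())

query = "giant panda habitat"
docs = [
    "the giant panda lives in bamboo forests",
    "panda habitat",
    "paris is the capital of france",
    "a panda is a bear",
]
ranked = retrieve_then_rerank(query, docs, toy_embed_score, toy_cross_score)
print(ranked)
```

The expensive scorer runs on only `top_k` documents, which is what makes an accurate but slow model affordable at query time.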
We provide a new version of the cross-encoder that supports more languages and longer inputs. The data format is similar to that of our embedding models, but now includes prompt data for fine-tuning and inference. You can perform inference using specific layers or all layers, and you can fine-tune it easily by following our example. For more details, please refer to ./FlagEmbedding/llm_reranker/README.md.
BGE embedding is a general embedding model. We pre-train the models using RetroMAE and train them on large-scale pair data using contrastive learning. You can fine-tune the embedding model on your data by following our examples. We also provide a pre-training example. Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned first. Refer to our report, C-Pack, and code for more details.
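As a rough illustration of contrastive learning on pair data, the sketch below computes an InfoNCE-style loss in which the other passages in the batch serve as negatives. The use of in-batch negatives, the `temperature` value, and the function names are assumptions for illustration; this is not the BGE training code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean cross-entropy where passage i is the positive for query i
    and all other passages in the batch act as negatives."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [dot(q, p) / temperature for p in passage_vecs]
        peak = max(logits)  # log-sum-exp with max subtraction for stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy unit vectors (illustrative only).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]     # each passage matches its query
mismatched = [[0.0, 1.0], [1.0, 0.0]]  # passages swapped

aligned_loss = info_nce_loss(queries, aligned)
mismatched_loss = info_nce_loss(queries, mismatched)
print(aligned_loss, mismatched_loss)
```

The loss is close to zero when each query is most similar to its own passage and grows when a negative scores higher, which is the signal that pulls matching pairs together during training.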
BGE uses the last hidden state of `[cls]` as the sentence embedding: `sentence_embeddings = model_output[0][:, 0]`. If you use mean pooling, there will be a significant decrease in performance.
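The difference between the two pooling strategies can be seen in a small sketch. Nested Python lists stand in for the `[batch, tokens, dim]` tensor, and the function names and toy values are assumptions for illustration.

```python
def cls_pooling(last_hidden_state):
    """Take the vector at position 0 ([CLS]) of every sequence,
    the list equivalent of model_output[0][:, 0]."""
    return [tokens[0] for tokens in last_hidden_state]

def mean_pooling(last_hidden_state, attention_mask):
    """Average the token vectors, skipping padded positions (mask == 0)."""
    pooled = []
    for tokens, mask in zip(last_hidden_state, attention_mask):
        kept = [vec for vec, m in zip(tokens, mask) if m == 1]
        dim = len(tokens[0])
        pooled.append([sum(v[d] for v in kept) / len(kept) for d in range(dim)])
    return pooled

# Toy batch: one sequence, three positions (the last is padding), two dims.
hidden = [[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]]
mask = [[1, 1, 0]]

cls_vec = cls_pooling(hidden)
mean_vec = mean_pooling(hidden, mask)
print(cls_vec)   # [[1.0, 2.0]]
print(mean_vec)  # [[2.0, 3.0]]
```

The two strategies generally produce different vectors for the same input, so a model trained with one should be used with the same one at inference time.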
A benchmark for Chinese text embedding. This benchmark has been merged into MTEB. Refer to our report, C-Pack, and code for more details.
`bge` is short for `BAAI general embedding`.
Model | Language | Usage | Description | Query instruction for retrieval |
---|---|---|---|---|
BAAI/bge-en-icl | English | - | An LLM-based embedding model with in-context learning capabilities, which can fully leverage the model's potential based on few-shot examples | Provide instructions and few-shot examples freely based on the given task. |
BAAI/bge-multilingual-gemma2 | Multilingual | - | An LLM-based multilingual embedding model, trained on a diverse range of languages and tasks. | Provide instructions based on the given task. |
BAAI/bge-m3 | Multilingual | Inference Fine-tune | Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) | |
LM-Cocktail | English | - | fine-tuned models (Llama and BGE) which can be used to reproduce the results of LM-Cocktail | |
BAAI/llm-embedder | English | Inference Fine-tune | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See README |
BAAI/bge-reranker-v2-m3 | Multilingual | Inference Fine-tune | a lightweight cross-encoder model with strong multilingual capabilities, easy deployment, and fast inference. | |
BAAI/bge-reranker-v2-gemma | Multilingual | Inference Fine-tune | a cross-encoder model suited to multilingual contexts, performing well in both English and multilingual settings. | |
BAAI/bge-reranker-v2-minicpm-layerwise | Multilingual | Inference Fine-tune | a cross-encoder model suited to multilingual contexts, performing well in both English and Chinese; lets you select which layers to use for output, enabling accelerated inference. | |
BAAI/bge-reranker-v2.5-gemma2-lightweight | Multilingual | Inference | a cross-encoder model suited to multilingual contexts, performing well in both English and Chinese; lets you select the layers, compress ratio, and compress layers for output, enabling accelerated inference. | |
BAAI/bge-reranker-large | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient | |
BAAI/bge-reranker-base | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient | |
BAAI/bge-large-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | Represent this sentence for searching relevant passages: |
BAAI/bge-base-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | Represent this sentence for searching relevant passages: |
BAAI/bge-small-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | Represent this sentence for searching relevant passages: |
BAAI/bge-large-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `` |
BAAI/bge-base-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `` |
BAAI/bge-small-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `` |
BAAI/bge-large-en | English | Inference Fine-tune | Embedding model which maps text into vectors | Represent this sentence for searching relevant passages: |
BAAI/bge-base-en | English | Inference Fine-tune | a base-scale model but with similar ability to bge-large-en | Represent this sentence for searching relevant passages: |
BAAI/bge-small-en | English | Inference Fine-tune | a small-scale model but with competitive performance | Represent this sentence for searching relevant passages: |
BAAI/bge-large-zh | Chinese | Inference Fine-tune | Embedding model which maps text into vectors | `` |
BAAI/bge-base-zh | Chinese | Inference Fine-tune | a base-scale model but with similar ability to bge-large-zh | `` |
BAAI/bge-small-zh | Chinese | Inference Fine-tune | a small-scale model but with competitive performance | `` |
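The query instruction listed in the table is a text prefix for retrieval queries. The sketch below shows the idea under two assumptions that the table does not confirm: that the instruction is joined to the query by plain string concatenation with a trailing space, and that passages are left unchanged (as the column name suggests). The helper names are hypothetical.

```python
# Assumption: a trailing space separates the instruction from the query text.
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def prepare_queries(queries, instruction=QUERY_INSTRUCTION):
    """Prefix each short query with the retrieval instruction."""
    return [instruction + q for q in queries]

def prepare_passages(passages):
    """Passages are passed through unchanged in this sketch."""
    return list(passages)

queries = prepare_queries(["what is BGE?"])
passages = prepare_passages(["BGE is short for BAAI general embedding."])
print(queries[0])
print(passages[0])
```

Models whose instruction cell is empty would simply be used with `instruction=""`.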
Thanks to all our contributors for their efforts. We warmly welcome new members to join!
If you find this repository useful, please consider giving it a star ⭐ and citing it:
@misc{bge_m3,
title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
author={Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and Luo, Kun and Lian, Defu and Liu, Zheng},
year={2024},
eprint={2402.03216},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{cocktail,
title={LM-Cocktail: Resilient Tuning of Language Models via Model Merging},
author={Shitao Xiao and Zheng Liu and Peitian Zhang and Xingrun Xing},
year={2023},
eprint={2311.13534},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{llm_embedder,
title={Retrieve Anything To Augment Large Language Models},
author={Peitian Zhang and Shitao Xiao and Zheng Liu and Zhicheng Dou and Jian-Yun Nie},
year={2023},
eprint={2310.07554},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
@misc{bge_embedding,
title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
year={2023},
eprint={2309.07597},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
FlagEmbedding is licensed under the MIT License.