# BAAI/bge-reranker-large


For an s2p (short query to long passage) retrieval task, we suggest using encode_queries(), which automatically adds the instruction to each query.

The corpus in a retrieval task can still be encoded with encode() or encode_corpus(), since passages do not need the instruction.

```python
from FlagEmbedding import FlagModel

queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
model = FlagModel('BAAI/bge-large-zh-v1.5',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
```

For the value of the argument `query_instruction_for_retrieval`, see [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list). 

By default, FlagModel uses all available GPUs when encoding. Set `os.environ["CUDA_VISIBLE_DEVICES"]` to select specific GPUs, or set `os.environ["CUDA_VISIBLE_DEVICES"]=""` to make all GPUs unavailable.
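
For example, a minimal sketch (assuming the FlagEmbedding setup shown above) that restricts encoding to a single GPU:

```python
import os
# Must be set before the model (and CUDA) is initialized
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # use only GPU 0; set to "" to disable all GPUs

from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-large-zh-v1.5',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
embeddings = model.encode(["样例数据-1", "样例数据-2"])
```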


#### Using Sentence-Transformers

You can also use the `bge` models with [sentence-transformers](https://www.SBERT.net):

```
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer
sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```

For an s2p (short query to long passage) retrieval task, each short query should be prefixed with an instruction (see the Model List for instructions). The instruction is not needed for passages.

```python
from sentence_transformers import SentenceTransformer

queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"

model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
```

#### Using LangChain

You can use `bge` in LangChain like this:

```python
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': True}  # set True to compute cosine similarity
model = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="为这个句子生成表示以用于检索相关文章:"
)
model.query_instruction = "为这个句子生成表示以用于检索相关文章:"
```
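
A short usage sketch (added here for illustration): `embed_query` prepends the query instruction, while `embed_documents` embeds passages as-is.

```python
query_embedding = model.embed_query("what is panda?")
doc_embeddings = model.embed_documents([
    "The giant panda is a bear species endemic to China.",
    "Paris is in France.",
])
print(len(query_embedding), len(doc_embeddings))
```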

#### Using HuggingFace Transformers

With the `transformers` package, you can use the model like this: first pass your input through the transformer model, then take the last hidden state of the first token (i.e., `[CLS]`) as the sentence embedding.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
model.eval()

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For an s2p (short query to long passage) retrieval task, add the instruction to the queries (but not to the passages):
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
# Normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```

### Usage for Reranker

Unlike an embedding model, a reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. You can get a relevance score by feeding a query and a passage to the reranker. The reranker is optimized with a cross-entropy loss, so the relevance score is not bounded to a specific range.

#### Using FlagEmbedding

```
pip install -U FlagEmbedding
```

Get relevance scores (higher scores indicate more relevance):

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

score = reranker.compute_score(['query', 'passage'])
print(score)

scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)
```
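
Because the raw scores are unbounded logits, you can map them to (0, 1) with a sigmoid if a bounded relevance score is more convenient. A minimal sketch (the helper below is illustrative, not part of the API shown above):

```python
import math

def to_unit_interval(score: float) -> float:
    """Map an unbounded reranker logit to (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-score))

print([to_unit_interval(s) for s in scores])
```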

#### Using Huggingface transformers

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
model.eval()

pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
```
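
To actually re-rank candidates for a single query, you can score every (query, passage) pair and sort by score. A short sketch reusing the model and tokenizer above (the extra candidate texts are illustrative):

```python
query = 'what is panda?'
candidates = [
    'hi',
    'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.',
    'Paris is in France.',
]

with torch.no_grad():
    inputs = tokenizer([[query, c] for c in candidates], padding=True, truncation=True,
                       return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()

# Sort candidates from most to least relevant
ranked = sorted(zip(candidates, scores.tolist()), key=lambda item: item[1], reverse=True)
for passage, score in ranked:
    print(f"{score:.2f}\t{passage}")
```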

#### Usage of the reranker with ONNX files

```python
from optimum.onnxruntime import ORTModelForSequenceClassification  # type: ignore

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
model_ort = ORTModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large', file_name="onnx/model.onnx")

# Query-passage pairs we want to score
pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]

# Tokenize pairs
encoded_input = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')

# Compute scores with the ONNX Runtime model
scores_ort = model_ort(**encoded_input, return_dict=True).logits.view(-1, ).float()
# Compute scores with the PyTorch model
with torch.inference_mode():
    scores = model(**encoded_input, return_dict=True).logits.view(-1, ).float()

# scores and scores_ort should match up to small numerical differences
```
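
As a quick sanity check (a small sketch added here), the two backends can be compared numerically:

```python
# PyTorch and ONNX scores should agree up to small floating-point differences
print("max abs diff:", torch.max(torch.abs(scores - scores_ort)).item())
print("allclose:", torch.allclose(scores, scores_ort, atol=1e-4))
```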

#### Usage of the reranker with infinity

It's also possible to deploy the ONNX/torch files with the `infinity_emb` pip package.

```python
import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs

query = 'what is a panda?'
docs = ['The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear', "Paris is in France."]

engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="BAAI/bge-reranker-large", device="cpu", engine="torch")  # or engine="optimum" for onnx
)

async def main():
    async with engine:
        ranking, usage = await engine.rerank(query=query, docs=docs)
        print(list(zip(ranking, docs)))

asyncio.run(main())
```

## Evaluation

`baai-general-embedding` models achieve state-of-the-art performance on both the MTEB and C-MTEB leaderboards! For more details and evaluation tools, see our scripts.
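
For instance, an embedding model can be evaluated on individual MTEB tasks with the `mteb` package; a minimal sketch (the task name and output folder are just examples, not the official evaluation scripts):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
evaluation = MTEB(tasks=["Banking77Classification"])  # any MTEB task name
results = evaluation.run(model, output_folder="results/bge-large-en-v1.5")
print(results)
```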

- **MTEB**:

| Model Name | Dimension | Sequence Length | Average (56) | Retrieval (15) | Clustering (11) | Pair Classification (3) | Reranking (4) | STS (10) | Summarization (1) | Classification (12) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| BAAI/bge-large-en-v1.5 | 1024 | 512 | 64.23 | 54.29 | 46.08 | 87.12 | 60.03 | 83.11 | 31.61 | 75.97 |
| BAAI/bge-base-en-v1.5 | 768 | 512 | 63.55 | 53.25 | 45.77 | 86.55 | 58.86 | 82.4 | 31.07 | 75.53 |
| BAAI/bge-small-en-v1.5 | 384 | 512 | 62.17 | 51.68 | 43.82 | 84.92 | 58.36 | 81.59 | 30.12 | 74.14 |
| bge-large-en | 1024 | 512 | 63.98 | 53.9 | 46.98 | 85.8 | 59.48 | 81.56 | 32.06 | 76.21 |
| bge-base-en | 768 | 512 | 63.36 | 53.0 | 46.32 | 85.86 | 58.7 | 81.84 | 29.27 | 75.27 |
| gte-large | 1024 | 512 | 63.13 | 52.22 | 46.84 | 85.00 | 59.13 | 83.35 | 31.66 | 73.33 |
| gte-base | 768 | 512 | 62.39 | 51.14 | 46.2 | 84.57 | 58.61 | 82.3 | 31.17 | 73.01 |
| e5-large-v2 | 1024 | 512 | 62.25 | 50.56 | 44.49 | 86.03 | 56.61 | 82.05 | 30.19 | 75.24 |
| bge-small-en | 384 | 512 | 62.11 | 51.82 | 44.31 | 83.78 | 57.97 | 80.72 | 30.53 | 74.37 |
| instructor-xl | 768 | 512 | 61.79 | 49.26 | 44.74 | 86.62 | 57.29 | 83.06 | 32.32 | 61.79 |
| e5-base-v2 | 768 | 512 | 61.5 | 50.29 | 43.80 | 85.73 | 55.91 | 81.05 | 30.28 | 73.84 |
| gte-small | 384 | 512 | 61.36 | 49.46 | 44.89 | 83.54 | 57.7 | 82.07 | 30.42 | 72.31 |
| text-embedding-ada-002 | 1536 | 8192 | 60.99 | 49.25 | 45.9 | 84.89 | 56.32 | 80.97 | 30.8 | 70.93 |
| e5-small-v2 | 384 | 512 | 59.93 | 49.04 | 39.92 | 84.67 | 54.32 | 80.39 | 31.16 | 72.94 |
| sentence-t5-xxl | 768 | 512 | 59.51 | 42.24 | 43.72 | 85.06 | 56.42 | 82.63 | 30.08 | 73.42 |
| all-mpnet-base-v2 | 768 | 514 | 57.78 | 43.81 | 43.69 | 83.04 | 59.36 | 80.28 | 27.49 | 65.07 |
| sgpt-bloom-7b1-msmarco | 4096 | 2048 | 57.59 | 48.22 | 38.93 | 81.9 | 55.65 | 77.74 | 33.6 | 66.19 |
- **C-MTEB**:
  We create the benchmark C-MTEB for Chinese text embedding, which consists of 31 datasets from 6 tasks. Please refer to C_MTEB for a detailed introduction.

| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| BAAI/bge-large-zh-v1.5 | 1024 | 64.53 | 70.46 | 56.25 | 81.6 | 69.13 | 65.84 | 48.99 |
| BAAI/bge-base-zh-v1.5 | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 |
| BAAI/bge-small-zh-v1.5 | 512 | 57.82 | 61.77 | 49.11 | 70.41 | 63.96 | 60.92 | 44.18 |
| BAAI/bge-large-zh | 1024 | 64.20 | 71.53 | 54.98 | 78.94 | 68.32 | 65.11 | 48.39 |
| bge-large-zh-noinstruct | 1024 | 63.53 | 70.55 | 53 | 76.77 | 68.58 | 64.91 | 50.01 |
| BAAI/bge-base-zh | 768 | 62.96 | 69.53 | 54.12 | 77.5 | 67.07 | 64.91 | 47.63 |
| multilingual-e5-large | 1024 | 58.79 | 63.66 | 48.44 | 69.89 | 67.34 | 56.00 | 48.23 |
| BAAI/bge-small-zh | 512 | 58.27 | 63.07 | 49.45 | 70.35 | 63.64 | 61.48 | 45.09 |
| m3e-base | 768 | 57.10 | 56.91 | 50.47 | 63.99 | 67.52 | 59.34 | 47.68 |
| m3e-large | 1024 | 57.05 | 54.75 | 50.42 | 64.3 | 68.2 | 59.66 | 48.88 |
| multilingual-e5-base | 768 | 55.48 | 61.63 | 46.49 | 67.07 | 65.35 | 54.35 | 40.68 |
| multilingual-e5-small | 384 | 55.38 | 59.95 | 45.27 | 66.45 | 65.85 | 53.86 | 45.26 |
| text-embedding-ada-002(OpenAI) | 1536 | 53.02 | 52.0 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 |
| luotuo | 1024 | 49.37 | 44.4 | 42.78 | 66.62 | 61 | 49.25 | 44.39 |
| text2vec-base | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 |
| text2vec-large | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 |
- **Reranking**:
  See C_MTEB for the evaluation script.

| Model | T2Reranking | T2RerankingZh2En\* | T2RerankingEn2Zh\* | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
| multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
| multilingual-e5-large | 64.55 | 61.61 | 54.28 | 28.6 | 67.42 | 67.92 | 57.4 |
| multilingual-e5-base | 64.21 | 62.13 | 54.68 | 29.5 | 66.23 | 66.98 | 57.29 |
| m3e-base | 66.03 | 62.74 | 56.07 | 17.51 | 77.05 | 76.76 | 59.36 |
| m3e-large | 66.13 | 62.72 | 56.1 | 16.46 | 77.76 | 78.27 | 59.57 |
| bge-base-zh-v1.5 | 66.49 | 63.25 | 57.02 | 29.74 | 80.47 | 84.88 | 63.64 |
| bge-large-zh-v1.5 | 65.74 | 63.39 | 57.03 | 28.74 | 83.45 | 85.44 | 63.97 |
| BAAI/bge-reranker-base | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
| BAAI/bge-reranker-large | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |

\*: T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks.

## Train

### BAAI Embedding

We pre-train the models using RetroMAE and train them on large-scale pair data using contrastive learning. You can fine-tune the embedding model on your own data following our examples. We also provide a pre-training example. Note that the goal of pre-training is to reconstruct the text; the pre-trained model cannot be used for similarity calculation directly and needs to be fine-tuned. For more training details of bge, see baai_general_embedding.
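
For reference, the fine-tuning examples expect training data as JSON lines, each containing a query, its positive passages, and hard-negative passages; a minimal sketch of preparing such a file (the file name and texts are illustrative):

```python
import json

# Each line: {"query": str, "pos": [str, ...], "neg": [str, ...]}
examples = [
    {
        "query": "what is panda?",
        "pos": ["The giant panda is a bear species endemic to China."],
        "neg": ["Paris is in France."],
    },
]

with open("toy_finetune_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```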

### BGE Reranker

A cross-encoder performs full attention over the input pair, which is more accurate than an embedding model (i.e., a bi-encoder) but more time-consuming. Therefore, it can be used to re-rank the top-k documents returned by an embedding model. We train the cross-encoder on multilingual pair data; the data format is the same as for the embedding model, so you can fine-tune it easily following our example. For more details, please refer to ./FlagEmbedding/reranker/README.md.

## Citation

If you find this repository useful, please consider giving it a star :star: and a citation:

```
@misc{bge_embedding,
      title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## License

FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.
