shibing624/text2vec-base-multilingual

This is a CoSENT (Cosine Sentence) model: shibing624/text2vec-base-multilingual.

It maps sentences to a 384-dimensional dense vector space and can be used for tasks such as sentence embedding, text matching, or semantic search.
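
Text matching with dense vectors usually comes down to comparing two embeddings by cosine similarity. The sketch below is a minimal pure-Python illustration, not code from text2vec: the function name `cosine_similarity` and the 4-dimensional toy vectors are assumptions standing in for the model's real 384-dimensional output.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; real embeddings come from model.encode(...).
query = [0.2, 0.1, 0.9, 0.3]
paraphrase = [0.25, 0.05, 0.85, 0.35]
unrelated = [0.9, 0.8, 0.05, 0.1]

print(cosine_similarity(query, paraphrase))  # close to 1 for similar vectors
print(cosine_similarity(query, unrelated))   # noticeably lower
```

With real embeddings, the same comparison would be applied to the rows returned by the encoder.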

Evaluation

For an automated evaluation of this model, see the Evaluation Benchmark: text2vec

Languages

Available languages are: de, en, es, fr, it, nl, pl, pt, ru, zh

Release Models

  • Chinese text-matching evaluation results for the models released by this project:
| Arch | BaseModel | Model | ATEC | BQ | LCQMC | PAWSX | STS-B | SOHU-dd | SOHU-dc | Avg | QPS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 55.04 | 20.70 | 35.03 | 23769 |
| SBERT | xlm-roberta-base | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 63.01 | 52.28 | 46.46 | 3138 |
| Instructor | hfl/chinese-roberta-wwm-ext | moka-ai/m3e-base | 41.27 | 63.81 | 74.87 | 12.20 | 76.96 | 75.83 | 60.55 | 57.93 | 2980 |
| CoSENT | hfl/chinese-macbert-base | shibing624/text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 70.27 | 50.42 | 51.61 | 3008 |
| CoSENT | hfl/chinese-lert-large | GanymedeNil/text2vec-large-chinese | 32.61 | 44.59 | 69.30 | 14.51 | 79.44 | 73.01 | 59.04 | 53.12 | 2092 |
| CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-sentence | 43.37 | 61.43 | 73.48 | 38.90 | 78.25 | 70.60 | 53.08 | 59.87 | 3089 |
| CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-paraphrase | 44.89 | 63.58 | 74.24 | 40.90 | 78.93 | 76.70 | 63.30 | 63.08 | 3066 |
| CoSENT | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | shibing624/text2vec-base-multilingual | 32.39 | 50.33 | 65.64 | 32.56 | 74.45 | 68.88 | 51.17 | 53.67 | 4004 |

Notes:

Model training experiment report: Experiment Report

Usage (text2vec)

Using this model becomes easy when you have text2vec installed:

pip install -U text2vec

Then you can use the model like this:

from text2vec import SentenceModel
sentences = ['如何更换花呗绑定银行卡', 'How to replace the Huabei bundled bank card']

model = SentenceModel('shibing624/text2vec-base-multilingual')
embeddings = model.encode(sentences)
print(embeddings)

Usage (HuggingFace Transformers)

Without text2vec, you can use the model like this:

First, you pass your input through the transformer model; then you apply the right pooling operation on top of the contextualized word embeddings.

Install transformers:

pip install transformers

Then load model and predict:

from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shibing624/text2vec-base-multilingual')
model = AutoModel.from_pretrained('shibing624/text2vec-base-multilingual')
sentences = ['如何更换花呗绑定银行卡', 'How to replace the Huabei bundled bank card']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
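
To see why the attention mask matters in mean pooling, the toy sketch below repeats the same computation in pure Python for a single sentence. It is an illustration only: `masked_mean_pool` is a hypothetical helper, and the 2-dimensional token vectors are made up, with the last one playing the role of a padding token.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors where the mask is 1; padded positions are skipped."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vector)]
            count += 1
    # Mirrors the clamp in mean_pooling: avoids dividing by zero
    # when every position is padding.
    return [t / max(count, 1e-9) for t in total]

# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean_pool(tokens, mask))  # [2.0, 3.0]
```

A plain average over all three rows would be dominated by the padding vector, which is why the mask is applied before summing.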

Usage (sentence-transformers)

sentence-transformers is a popular library to compute dense vector representations for sentences.

Install sentence-transformers:

pip install -U sentence-transformers

Then load model and predict:

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-base-multilingual")
sentences = ['如何更换花呗绑定银行卡', 'How to replace the Huabei bundled bank card']

sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)

Full Model Architecture

CoSENT(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
)

Intended uses

Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.
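
As a rough illustration of the information retrieval use case, the sketch below ranks a small corpus of vectors against a query vector by cosine similarity. It is pure Python with made-up 3-dimensional vectors; `normalize` and `top_k` are hypothetical helper names, and a real system would use embeddings produced by the model and likely a vector index.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def top_k(query_vec, corpus_vecs, k=2):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        # Dot product of unit vectors equals cosine similarity.
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((idx, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical stand-ins for sentence embeddings.
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]

for idx, score in top_k(query, corpus, k=2):
    print(idx, round(score, 4))
```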

By default, input text longer than 256 word pieces is truncated.

Training procedure

Pre-training

We use the pretrained sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model. Please refer to the model card for more detailed information about the pre-training procedure.

Fine-tuning

We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each sentence pair in the batch. We then apply a ranking loss that compares the similarities of true pairs with those of false pairs.
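
The ranking objective can be sketched as follows. This is a minimal pure-Python version of the CoSENT loss as it is commonly formulated, log(1 + Σ exp(λ·(s_low − s_high))), summed over every ordering where one pair has a lower label than another. The function name `cosent_loss` and the scale λ = 20 are assumptions for illustration; the training code in text2vec may differ in detail.

```python
import math

def cosent_loss(similarities, labels, scale=20.0):
    """Ranking loss over cosine similarities of sentence pairs.

    similarities[i] is the cosine similarity of sentence pair i,
    labels[i] is its gold label (higher means more similar).
    """
    terms = [0.0]  # the leading 0 contributes exp(0) = 1 inside the log
    for s_low, y_low in zip(similarities, labels):
        for s_high, y_high in zip(similarities, labels):
            if y_low < y_high:
                # Penalize a lower-labeled pair scoring above a higher-labeled one.
                terms.append(scale * (s_low - s_high))
    # Numerically stable log-sum-exp.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# True pair scored above false pair: loss is near zero.
print(cosent_loss([0.9, 0.1], [1, 0]))
# True pair scored below false pair: loss is large.
print(cosent_loss([0.1, 0.9], [1, 0]))
```

The loss depends only on the relative order of the similarities, which matches the description of comparing true pairs against false pairs rather than regressing to a target score.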

Citing & Authors

This model was trained by text2vec.

If you find this model helpful, feel free to cite:

@software{text2vec,
  author = {Ming Xu},
  title = {text2vec: A Tool for Text to Vector},
  year = {2023},
  url = {https://github.com/shibing624/text2vec},
}