dangvantuan/vietnamese-document-embedding


Loading the dataset for evaluation

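```python
from datasets import load_dataset

# The benchmark is published as a single "train" split; its 'split' column marks dev/test rows.
vi_sts = load_dataset("doanhieung/vi-stsbenchmark")["train"]
df_dev = vi_sts.filter(lambda example: example['split'] == 'dev')
df_test = vi_sts.filter(lambda example: example['split'] == 'test')
```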

Convert the dataset for evaluation

For Dev set:

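```python
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# `convert_dataset` and `model` are assumed to be defined before this point.
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")
```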

For Test set:

```python
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
```

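To make the reported number concrete, the following is a minimal standard-library sketch of the kind of score an STS evaluation produces: the Spearman rank correlation between the cosine similarity of each sentence-pair embedding and its gold similarity score. The function names and the toy 2-dimensional vectors are illustrative assumptions, not part of `sentence-transformers`, and the library's own implementation may differ in detail.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def sts_spearman(embeddings1, embeddings2, gold_scores):
    """Correlate per-pair cosine similarity with gold similarity scores."""
    predicted = [cosine_similarity(u, v) for u, v in zip(embeddings1, embeddings2)]
    return spearman(predicted, gold_scores)

# Toy data (hypothetical): three sentence pairs with decreasing similarity.
emb1 = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
emb2 = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
gold = [1.0, 0.6, 0.1]
score = sts_spearman(emb1, emb2, gold)  # close to 1.0: the rankings agree
```

Because Spearman compares rankings only, the gold scores do not need to be on the same scale as the cosine similarities.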




### Metrics for all datasets of [Semantic Textual Similarity on STS Benchmark](https://huggingface.co/datasets/anti-ai/ViSTS)

**Spearman score**

| Model | STSB | STS12 | STS13 | STS14 | STS15 | STS16 | SICK | Mean |
|-------|------|-------|-------|-------|-------|-------|------|------|
| [dangvantuan/vietnamese-embedding](https://huggingface.co/dangvantuan/vietnamese-embedding) | 84.84 | 79.04 | 85.30 | 81.38 | 87.06 | 79.95 | 79.58 | 82.45 |
| [dangvantuan/vietnamese-embedding-LongContext](https://huggingface.co/dangvantuan/vietnamese-embedding-LongContext) | 85.25 | 75.77 | 83.82 | 81.69 | 88.48 | 81.5 | 78.2 | 82.10 |

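The Mean column appears to be the unweighted average of the seven per-dataset Spearman scores. As a quick check under that assumption, the sketch below recomputes it from the table values; `scores` and `mean_score` are illustrative names.

```python
# Per-dataset Spearman scores copied from the table (STSB, STS12-16, SICK).
scores = {
    "dangvantuan/vietnamese-embedding":
        [84.84, 79.04, 85.30, 81.38, 87.06, 79.95, 79.58],
    "dangvantuan/vietnamese-embedding-LongContext":
        [85.25, 75.77, 83.82, 81.69, 88.48, 81.5, 78.2],
}

def mean_score(values):
    """Unweighted mean, rounded to two decimals as in the table."""
    return round(sum(values) / len(values), 2)

means = {name: mean_score(values) for name, values in scores.items()}
```

Both recomputed values match the Mean column after rounding to two decimals.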
## Citation


	
```bibtex
@article{reimers2019sentence,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  journal={arXiv preprint arXiv:1908.10084},
  year={2019}
}

@article{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  journal={arXiv preprint arXiv:2407.19669},
  year={2024}
}

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}

@article{li20242d,
  title={2D Matryoshka Sentence Embeddings},
  author={Li, Xianming and Li, Zongxi and Li, Jing and Xie, Haoran and Li, Qing},
  journal={arXiv preprint arXiv:2402.14776},
  year={2024}
}
```
