cl-nagoya/ruri-v3-310m


Ruri: Japanese General Text Embeddings

Ruri v3 is a general-purpose Japanese text embedding model built on top of ModernBERT-Ja. Ruri v3 offers several key technical advantages:

  • State-of-the-art performance on Japanese text embedding tasks.
  • Supports sequence lengths of up to 8192 tokens.
    • Previous versions of Ruri (v1, v2) were limited to 512.
  • Expanded vocabulary of 100K tokens, compared to 32K in v1 and v2.
    • The larger vocabulary makes input sequences shorter, improving efficiency.
  • Integrated FlashAttention, following ModernBERT's architecture.
    • Enables faster inference and fine-tuning.
  • Tokenizer based solely on SentencePiece.
    • Unlike previous versions, which relied on Japanese-specific BERT tokenizers and required pre-tokenized input, Ruri v3 performs tokenization with SentencePiece alone; no external word segmentation tool is required (see the tokenizer sketch after this list).
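
As a quick check of the SentencePiece-only tokenization (a minimal sketch, not part of the original card), raw Japanese text can be passed straight to the Hugging Face tokenizer without any prior word segmentation:

from transformers import AutoTokenizer

# Sketch: no MeCab-style pre-segmentation step is needed.
tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/ruri-v3-310m")

encoded = tokenizer("瑠璃色はどんな色?")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Prints SentencePiece subword pieces plus special tokens.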

Model Series

We provide Ruri-v3 in several model sizes. Below is a summary of each model.

| ID | #Param. | #Param. w/o Emb. | Dim. | #Layers | Avg. JMTEB |
|---|---|---|---|---|---|
| cl-nagoya/ruri-v3-30m | 37M | 10M | 256 | 10 | 74.51 |
| cl-nagoya/ruri-v3-70m | 70M | 31M | 384 | 13 | 75.48 |
| cl-nagoya/ruri-v3-130m | 132M | 80M | 512 | 19 | 76.55 |
| cl-nagoya/ruri-v3-310m | 315M | 236M | 768 | 25 | 77.24 |

Usage

You can use our models with the transformers library (v4.48.0 or higher) and sentence-transformers:

pip install -U "transformers>=4.48.0" sentence-transformers

Additionally, if your GPU supports Flash Attention 2, we recommend installing it and using it with our models:

pip install flash-attn --no-build-isolation
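
With Flash Attention 2 installed, one way to enable it when loading the model (a sketch; the bfloat16 setting is an assumption, not a requirement stated by the card) is to pass model_kwargs to SentenceTransformer:

import torch
from sentence_transformers import SentenceTransformer

# Sketch: request Flash Attention 2 and bfloat16 weights.
# Both settings require a compatible GPU and the flash-attn package.
model = SentenceTransformer(
    "cl-nagoya/ruri-v3-310m",
    device="cuda",
    model_kwargs={
        "attn_implementation": "flash_attention_2",
        "torch_dtype": torch.bfloat16,
    },
)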

Then you can load this model and run inference.

import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("cl-nagoya/ruri-v3-310m", device=device)

# Ruri v3 employs a 1+3 prefix scheme to distinguish between different types of text inputs:
# "" (empty string) is used for encoding semantic meaning.
# "トピック: " is used for classification, clustering, and encoding topical information.
# "検索クエリ: " is used for queries in retrieval tasks.
# "検索文書: " is used for documents to be retrieved.
sentences = [
    "川べりでサーフボードを持った人たちがいます",
    "サーファーたちが川べりに立っています",
    "トピック: 瑠璃色のサーファー",
    "検索クエリ: 瑠璃色はどんな色?",
    "検索文書: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。名は、半貴石の瑠璃(ラピスラズリ、英: lapis lazuli)による。JIS慣用色名では「こい紫みの青」(略号 dp-pB)と定義している[1][2]。",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# [5, 768]

similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.9603, 0.8157, 0.7074, 0.6916],
#  [0.9603, 1.0000, 0.8192, 0.7014, 0.6819],
#  [0.8157, 0.8192, 1.0000, 0.8701, 0.8470],
#  [0.7074, 0.7014, 0.8701, 1.0000, 0.9746],
#  [0.6916, 0.6819, 0.8470, 0.9746, 1.0000]]
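
Continuing with the model and F imported above, here is a minimal retrieval-style sketch (the documents below are illustrative, not from the original card) using the query/document prefixes:

# Rank documents for a query via the retrieval prefixes.
query = "検索クエリ: 瑠璃色はどんな色?"
documents = [
    "検索文書: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。",
    "検索文書: サーファーたちが川べりに立っています。",
]

query_emb = model.encode([query], convert_to_tensor=True)   # [1, 768]
doc_embs = model.encode(documents, convert_to_tensor=True)  # [2, 768]

scores = F.cosine_similarity(query_emb, doc_embs)           # [2]
print(documents[int(scores.argmax())])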

Benchmarks

JMTEB

Evaluated with JMTEB.

| Model | #Param. | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|
| Ruri-v3-30m | 37M | 74.51 | 78.08 | 82.48 | 74.80 | 93.00 | 52.12 | 62.40 |
| Ruri-v3-70m | 70M | 75.48 | 79.96 | 79.82 | 76.97 | 93.27 | 52.70 | 61.75 |
| Ruri-v3-130m | 132M | 76.55 | 81.89 | 79.25 | 77.16 | 93.31 | 55.36 | 62.26 |
| Ruri-v3-310m (this model) | 315M | 77.24 | 81.89 | 81.22 | 78.66 | 93.43 | 55.69 | 62.60 |
| sbintuitions/sarashina-embedding-v1-1b | 1.22B | 75.50 | 77.61 | 82.71 | 78.37 | 93.74 | 53.86 | 62.00 |
| PLaMo-Embedding-1B | 1.05B | 76.10 | 79.94 | 83.14 | 77.20 | 93.57 | 53.47 | 62.37 |
| OpenAI/text-embedding-ada-002 | - | 69.48 | 64.38 | 79.02 | 69.75 | 93.04 | 48.30 | 62.40 |
| OpenAI/text-embedding-3-small | - | 70.86 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| OpenAI/text-embedding-3-large | - | 73.97 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| pkshatech/GLuCoSE-base-ja | 133M | 70.44 | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 |
| pkshatech/GLuCoSE-base-ja-v2 | 133M | 72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | 62.37 |
| retrieva-jp/amber-base | 130M | 72.12 | 73.40 | 77.81 | 76.14 | 93.27 | 48.05 | 64.03 |
| retrieva-jp/amber-large | 315M | 73.22 | 75.40 | 79.32 | 77.14 | 93.54 | 48.73 | 60.97 |
| sentence-transformers/LaBSE | 472M | 64.70 | 40.12 | 76.56 | 72.66 | 91.63 | 44.88 | 62.33 |
| intfloat/multilingual-e5-small | 118M | 69.52 | 67.27 | 80.07 | 67.62 | 93.03 | 46.91 | 62.19 |
| intfloat/multilingual-e5-base | 278M | 70.12 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
| intfloat/multilingual-e5-large | 560M | 71.65 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| Ruri-Small | 68M | 71.53 | 69.41 | 82.79 | 76.22 | 93.00 | 51.19 | 62.11 |
| Ruri-Small v2 | 68M | 73.30 | 73.94 | 82.91 | 76.17 | 93.20 | 51.58 | 62.32 |
| Ruri-Base | 111M | 71.91 | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 |
| Ruri-Base v2 | 111M | 72.48 | 72.33 | 83.03 | 75.34 | 93.17 | 51.38 | 62.35 |
| Ruri-Large | 337M | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
| Ruri-Large v2 | 337M | 74.55 | 76.34 | 83.17 | 77.18 | 93.21 | 52.14 | 62.27 |

Model Details

Model Description

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
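
The Pooling module above applies mean pooling over token embeddings. As a rough equivalence sketch (an assumption about how the pipeline composes, not taken from the original card), a similar embedding can be produced with plain transformers:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/ruri-v3-310m")
model = AutoModel.from_pretrained("cl-nagoya/ruri-v3-310m")

batch = tokenizer(["瑠璃色はどんな色?"], padding=True, return_tensors="pt")
with torch.no_grad():
    last_hidden = model(**batch).last_hidden_state  # [batch, seq_len, 768]

# Mean-pool over non-padding tokens, mirroring pooling_mode_mean_tokens above.
mask = batch["attention_mask"].unsqueeze(-1).to(last_hidden.dtype)
embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([1, 768])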

Citation

@misc{Ruri,
  title={{Ruri: Japanese General Text Embeddings}},
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737},
}

License

This model is published under the Apache License, Version 2.0.
