M-CLIP/XLM-Roberta-Large-Vit-B-32


Load Model & Tokenizer

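```python
import transformers
from multilingual_clip import pt_multilingual_clip

model_name = 'M-CLIP/XLM-Roberta-Large-Vit-B-32'
texts = ['Three blind horses listening to Mozart.']  # illustrative input; any list of strings

model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

embeddings = model.forward(texts, tokenizer)
print("Text features shape:", embeddings.shape)
```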


Extracting embeddings from the corresponding image encoder:

```python
import torch
import clip
import requests
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)

print("Image features shape:", image_features.shape)
```

Evaluation results

None of the M-CLIP models have been extensively evaluated, but on text-to-image (Txt2Img) retrieval over the human-translated MS-COCO dataset, they achieve the following R@10 results:

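| Name | En | De | Es | Fr | Zh | It | Pl | Ko | Ru | Tr | Jp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI CLIP Vit-B/32 | 90.3 | - | - | - | - | - | - | - | - | - | - |
| OpenAI CLIP Vit-L/14 | 91.8 | - | - | - | - | - | - | - | - | - | - |
| OpenCLIP ViT-B-16+- | 94.3 | - | - | - | - | - | - | - | - | - | - |
| LABSE Vit-L/14 | 91.6 | 89.6 | 89.5 | 89.9 | 88.9 | 90.1 | 89.8 | 80.8 | 85.5 | 89.8 | 73.9 |
| XLM-R Large Vit-B/32 | 91.8 | 88.7 | 89.1 | 89.4 | 89.3 | 89.8 | 91.4 | 82.1 | 86.1 | 88.8 | 81.0 |
| XLM-R Vit-L/14 | 92.4 | 90.6 | 91.0 | 90.0 | 89.7 | 91.1 | 91.3 | 85.2 | 85.8 | 90.3 | 81.9 |
| XLM-R Large Vit-B/16+ | 95.0 | 93.0 | 93.6 | 93.1 | 94.0 | 93.1 | 94.4 | 89.0 | 90.0 | 93.0 | 84.2 |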
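
For readers unfamiliar with the metric, R@10 (recall at 10) for text-to-image retrieval is commonly read as the share of captions whose matching image appears among the 10 most similar images. The sketch below illustrates that idea under stated assumptions: it is not the evaluation script behind the table, the function names `cosine_similarity` and `recall_at_k` are hypothetical, the toy vectors are made up, and it assumes caption `i` is paired with image `i`. It uses only the standard library to stay dependency-free; with real embeddings the same ranking would more likely be done with tensor operations.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(text_embeddings, image_embeddings, k):
    """Fraction of captions whose paired image (same index) ranks in the top k."""
    hits = 0
    for i, text_vec in enumerate(text_embeddings):
        scores = [cosine_similarity(text_vec, img_vec) for img_vec in image_embeddings]
        # Image indices sorted by descending similarity to this caption.
        ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranking[:k]:
            hits += 1
    return hits / len(text_embeddings)


# Toy 2-D embeddings, invented for illustration: caption i pairs with image i.
toy_text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_image = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]

print("R@1:", recall_at_k(toy_text, toy_image, k=1))
print("R@2:", recall_at_k(toy_text, toy_image, k=2))
```

In this toy example the second caption ranks its paired image second, so it counts as a miss at k=1 and a hit at k=2. The table reports the same kind of fraction as a percentage.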

Training/Model details

Further details about the model training and data can be found in the model card.
