apple/DFN2B-CLIP-ViT-B-16

A CLIP (Contrastive Language-Image Pre-training) model trained on DFN-2B. Data Filtering Networks (DFNs) are small networks used to automatically filter large pools of uncurated data. This model was trained on 2B images that were filtered from a pool of 12.8B uncurated image-text pairs (CommonPool-12.8B).

These weights are directly usable in OpenCLIP (image + text).
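
As a rough illustration of the filtering idea (a sketch, not the authors' actual DFN pipeline), a filtering network embeds each candidate image-text pair and keeps only pairs whose similarity score clears a threshold; the feature dimensions and threshold below are arbitrary assumptions:

import torch
import torch.nn.functional as F

def dfn_filter(image_feats, text_feats, keep_threshold=0.3):
    # image_feats / text_feats: [N, D] features of candidate pairs produced by a
    # small CLIP-style filtering network. The threshold here is illustrative.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    scores = (image_feats * text_feats).sum(dim=-1)  # cosine similarity per pair
    return scores > keep_threshold                   # boolean mask of pairs to keep

# Random features stand in for embeddings of a real uncurated pool.
keep_mask = dfn_filter(torch.randn(8, 512), torch.randn(8, 512))
print(f"kept {keep_mask.sum().item()} of {keep_mask.numel()} pairs")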

Model Details

  • Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
  • Dataset: DFN-2B
  • Papers: Data Filtering Networks (https://arxiv.org/abs/2309.17425)
  • Examples Seen: 12.8B

Model Metrics

Dataset                 | Metric
------------------------|----------
ImageNet 1k             | 0.76236
Caltech-101             | 0.942894
CIFAR-10                | 0.9672
CIFAR-100               | 0.8347
CLEVR Counts            | 0.232333
CLEVR Distance          | 0.245267
Country211              | 0.19545
Describable Textures    | 0.575532
EuroSAT                 | 0.54
FGVC Aircraft           | 0.248503
Food-101                | 0.91303
GTSRB                   | 0.469913
ImageNet Sketch         | 0.620684
ImageNet v2             | 0.682
ImageNet-A              | 0.482133
ImageNet-O              | 0.493
ImageNet-R              | 0.830967
KITTI Vehicle Distance  | 0.192686
MNIST                   | 0.782
ObjectNet               | 0.631851
Oxford Flowers-102      | 0.819895
Oxford-IIIT Pet         | 0.936907
Pascal VOC 2007         | 0.788528
PatchCamelyon           | 0.521545
Rendered SST2           | 0.486546
RESISC45                | 0.61381
Stanford Cars           | 0.90735
STL-10                  | 0.97525
SUN397                  | 0.714162
SVHN                    | 0.598955
Flickr                  | 0.7728
MSCOCO                  | 0.518773
WinoGAViL               | 0.541748
iWildCam                | 0.155574
Camelyon17              | 0.499283
FMoW                    | 0.141149
Dollar Street           | 0.625
GeoDE                   | 0.891023
Average                 | 0.609232
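
Most of the numbers above are zero-shot classification accuracies in the DataComp evaluation style. A minimal sketch of how such an accuracy can be computed with this model, assuming you supply your own dataloader of preprocessed (image, label) batches and a class_names list; the single prompt template is a common default, not necessarily the one used for the results above:

import torch
import torch.nn.functional as F
from open_clip import create_model_from_pretrained, get_tokenizer

model, preprocess = create_model_from_pretrained('hf-hub:apple/DFN2B-CLIP-ViT-B-16')
tokenizer = get_tokenizer('ViT-B-16')
model.eval()

def zero_shot_accuracy(dataloader, class_names):
    # One text embedding per class, built from a simple prompt template.
    prompts = [f"a photo of a {name}" for name in class_names]
    with torch.no_grad():
        text_features = F.normalize(model.encode_text(tokenizer(prompts)), dim=-1)
        correct = total = 0
        for images, labels in dataloader:  # images already transformed with `preprocess`
            image_features = F.normalize(model.encode_image(images), dim=-1)
            pred = (image_features @ text_features.T).argmax(dim=-1)
            correct += (pred == labels).sum().item()
            total += labels.numel()
    return correct / total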

Model Usage

With OpenCLIP

import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer 

# Load the model and its preprocessing transform directly from the Hugging Face Hub.
model, preprocess = create_model_from_pretrained('hf-hub:apple/DFN2B-CLIP-ViT-B-16')
tokenizer = get_tokenizer('ViT-B-16')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Standard CLIP zero-shot scoring: temperature-scaled cosine similarities,
    # softmax over the candidate labels (this is a CLIP model, so it has a
    # logit_scale but no logit_bias).
    text_probs = (image_features @ text_features.T * model.logit_scale.exp()).softmax(dim=-1)

zipped_list = list(zip(labels_list, [round(p.item(), 3) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)

Citation

@article{fang2023data,
  title={Data Filtering Networks},
  author={Fang, Alex and Jose, Albin Madappally and Jain, Amit and Schmidt, Ludwig and Toshev, Alexander and Shankar, Vaishaal},
  journal={arXiv preprint arXiv:2309.17425},
  year={2023}
}
