wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M

Tags: pytorch · safetensors · transformers · clip · zero-shot-image-classification · tinyclip · License: MIT

TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

[ICCV 2023]

TinyCLIP is a cross-modal distillation method for large-scale language-image pre-trained models. It introduces two core techniques: affinity mimicking and weight inheritance. This work unlocks the capacity of small CLIP models by fully leveraging large-scale teacher models and their pre-training data, striking a strong trade-off between speed and accuracy.
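To make the first technique concrete: affinity mimicking trains the student so that its image-to-text similarity distribution over a batch matches the teacher's, rather than matching embeddings directly. The sketch below is an illustration of this idea, not the official training code; the function name, the temperature `tau`, and the symmetric KL formulation are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def affinity_mimicking_loss(student_img, student_txt, teacher_img, teacher_txt, tau=1.0):
    """Sketch of cross-modal affinity distillation.

    Both models embed the same batch of image-text pairs; the student is
    pushed to reproduce the teacher's batch-level affinity distributions
    (image-to-text and text-to-image). Embedding dims may differ, since
    only the similarity matrices are compared.
    """
    # normalize so dot products are cosine similarities
    s_i = F.normalize(student_img, dim=-1)
    s_t = F.normalize(student_txt, dim=-1)
    t_i = F.normalize(teacher_img, dim=-1)
    t_t = F.normalize(teacher_txt, dim=-1)

    # batch affinity matrices (rows: images, columns: texts)
    s_aff = s_i @ s_t.T / tau
    t_aff = t_i @ t_t.T / tau

    # KL divergence between teacher and student affinities, both directions
    loss_i2t = F.kl_div(F.log_softmax(s_aff, dim=-1),
                        F.softmax(t_aff, dim=-1), reduction="batchmean")
    loss_t2i = F.kl_div(F.log_softmax(s_aff.T, dim=-1),
                        F.softmax(t_aff.T, dim=-1), reduction="batchmean")
    return (loss_i2t + loss_t2i) / 2

# toy batch: 4 pairs; student dim 256, teacher dim 512
loss = affinity_mimicking_loss(torch.randn(4, 256), torch.randn(4, 256),
                               torch.randn(4, 512), torch.randn(4, 512))
```

Because only similarity distributions are matched, the student's embedding width is free to shrink, which is what allows the much smaller text and vision towers listed below.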

Use with Transformers

from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M")
processor = CLIPProcessor.from_pretrained("wkcn/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
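Continuing the example, the most likely caption can be read off with `argmax`. The placeholder logits below stand in for `outputs.logits_per_image` so the snippet runs without downloading the model; the values are illustrative, not real model output.

```python
import torch

labels = ["a photo of a cat", "a photo of a dog"]
# placeholder standing in for outputs.logits_per_image from the model call above
logits_per_image = torch.tensor([[24.5, 19.2]])

probs = logits_per_image.softmax(dim=1)   # per-image probabilities over captions
best = probs.argmax(dim=1).item()         # index of the highest-probability caption
print(labels[best])                       # -> "a photo of a cat"
```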

Highlights

  • TinyCLIP ViT-45M/32 uses only half the parameters of ViT-B/32 while achieving comparable zero-shot performance.
  • TinyCLIP ResNet-19M reduces the parameter count by 50%, achieves a 2x inference speedup, and reaches 56.4% top-1 accuracy on ImageNet.

Model Zoo

| Model | Weight inheritance | Pretrain | IN-1K Acc@1 (%) | MACs (G) | Throughput (pairs/s) | Link |
|---|---|---|---|---|---|---|
| TinyCLIP ViT-39M/16 Text-19M | manual | YFCC-15M | 63.5 | 9.5 | 1,469 | Model |
| TinyCLIP ViT-8M/16 Text-3M | manual | YFCC-15M | 41.1 | 2.0 | 4,150 | Model |
| TinyCLIP ResNet-30M Text-29M | manual | LAION-400M | 59.1 | 6.9 | 1,811 | Model |
| TinyCLIP ResNet-19M Text-19M | manual | LAION-400M | 56.4 | 4.4 | 3,024 | Model |
| TinyCLIP ViT-61M/32 Text-29M | manual | LAION-400M | 62.4 | 5.3 | 3,191 | Model |
| TinyCLIP ViT-40M/32 Text-19M | manual | LAION-400M | 59.8 | 3.5 | 4,641 | Model |
| TinyCLIP ViT-63M/32 Text-31M | auto | LAION-400M | 63.9 | 5.6 | 2,905 | Model |
| TinyCLIP ViT-45M/32 Text-18M | auto | LAION-400M | 61.4 | 3.7 | 3,682 | Model |
| TinyCLIP ViT-22M/32 Text-10M | auto | LAION-400M | 53.7 | 1.9 | 5,504 | Model |
| TinyCLIP ViT-63M/32 Text-31M | auto | LAION+YFCC-400M | 64.5 | 5.6 | 2,909 | Model |
| TinyCLIP ViT-45M/32 Text-18M | auto | LAION+YFCC-400M | 62.7 | 1.9 | 3,685 | Model |

Note: The configurations of the models marked auto were generated automatically by the weight-inheritance procedure rather than specified by hand.
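To illustrate the second technique, weight inheritance seeds the small student from the large teacher's weights instead of training from scratch. The snippet below sketches one simple manual strategy, copying evenly spaced teacher blocks into a shallower student; the function, the spacing rule, and the use of `nn.Linear` as a stand-in for transformer blocks are all assumptions for illustration, not the released configs.

```python
import torch.nn as nn

def inherit_layers(teacher_layers, student_depth):
    """Manual weight-inheritance sketch: build a shallower student by
    copying evenly spaced teacher blocks (student_depth must be >= 2)."""
    teacher_depth = len(teacher_layers)
    # choose student_depth evenly spaced teacher block indices,
    # always keeping the first and last block
    idx = [round(i * (teacher_depth - 1) / (student_depth - 1))
           for i in range(student_depth)]
    return nn.ModuleList([teacher_layers[i] for i in idx])

# stand-in teacher: 12 "blocks"; inherit a 6-block student from it
teacher = nn.ModuleList([nn.Linear(8, 8) for _ in range(12)])
student = inherit_layers(teacher, 6)
```

The auto variants in the table replace a fixed rule like this with an automatic selection of which weights to inherit, which is why their configs are described as generated automatically.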

Official PyTorch Implementation

https://github.com/microsoft/Cream/tree/main/TinyCLIP

Citation

If this repo is helpful for you, please consider citing it. Thank you!

@InProceedings{tinyclip,
    title     = {TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance},
    author    = {Wu, Kan and Peng, Houwen and Zhou, Zhenghong and Xiao, Bin and Liu, Mengchen and Yuan, Lu and Xuan, Hong and Valenzuela, Michael and Chen, Xi (Stephen) and Wang, Xinggang and Chao, Hongyang and Hu, Han},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {21970-21980}
}

Acknowledgements

Our code is based on CLIP, OpenCLIP, CoFi, and PyTorch. We thank the contributors for their awesome work!

License

MIT