Phikon-v2 is a Vision Transformer Large pre-trained with the DINOv2 self-supervised method on PANCAN-XL, a dataset of 450M histology images at 20x magnification sampled from 60K whole slide images (WSI). PANCAN-XL only incorporates publicly available datasets: CPTAC (6,193 WSI) and TCGA (29,502 WSI) for malignant tissue, and GTEx (13,302 WSI) for normal tissue.
Phikon-v2 improves upon Phikon, our previous foundation model pre-trained with iBOT on 40M histology images from TCGA (6K WSI), on a large variety of weakly-supervised tasks tailored for biomarker discovery. Phikon-v2 is evaluated on external cohorts to avoid any data contamination with the PANCAN-XL pre-training dataset, and is benchmarked against an exhaustive panel of representation learning and foundation models.
The following code snippet extracts features from histology images using Phikon-v2 (CLS token). These features can then be used for downstream applications such as ROI classification (via linear or k-NN probing), slide classification (via multiple instance learning), segmentation (via ViT-Adapter, for instance), etc.
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Load an example image
image = Image.open(
    requests.get(
        "https://github.com/owkin/HistoSSLscaling/blob/main/assets/example.tif?raw=true",
        stream=True,
    ).raw
)

# Load Phikon-v2 and switch to evaluation mode
processor = AutoImageProcessor.from_pretrained("owkin/phikon-v2")
model = AutoModel.from_pretrained("owkin/phikon-v2")
model.eval()

# Preprocess the image into model-ready tensors
inputs = processor(image, return_tensors="pt")

# Extract the CLS token embedding
with torch.inference_mode():
    outputs = model(**inputs)
    features = outputs.last_hidden_state[:, 0, :]  # (1, 1024) shape

assert features.shape == (1, 1024)
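Once CLS features have been extracted for a set of labelled tiles, k-NN probing needs no training at all. The sketch below is one possible way to do it, not the evaluation protocol used for Phikon-v2: the `knn_probe` helper, the choice of cosine similarity, and `k=5` are illustrative assumptions, and random vectors of dimension 1024 stand in for real features so that the example runs without downloading the model.

```python
import numpy as np


def knn_probe(train_feats, train_labels, test_feats, k=5):
    """Label each test feature by majority vote among its k nearest
    training features under cosine similarity (hypothetical helper)."""
    # L2-normalise so that a dot product equals cosine similarity.
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test @ train.T                        # (n_test, n_train)
    nearest = np.argsort(-sims, axis=1)[:, :k]   # k most similar training tiles
    votes = train_labels[nearest]                # (n_test, k)
    n_classes = int(train_labels.max()) + 1
    counts = np.stack(
        [(votes == c).sum(axis=1) for c in range(n_classes)], axis=1
    )
    return counts.argmax(axis=1)


# Synthetic stand-in for Phikon-v2 CLS features: two well-separated classes.
rng = np.random.default_rng(0)
centers = rng.normal(size=(2, 1024))
train_labels = np.repeat([0, 1], 50)
test_labels = np.repeat([0, 1], 10)
train_feats = centers[train_labels] + 0.5 * rng.normal(size=(100, 1024))
test_feats = centers[test_labels] + 0.5 * rng.normal(size=(20, 1024))

preds = knn_probe(train_feats, train_labels, test_feats, k=5)
accuracy = float((preds == test_labels).mean())
print(f"k-NN probe accuracy on synthetic features: {accuracy:.2f}")
```

With real data, `train_feats` and `test_feats` would be the stacked `features` tensors from the snippet above, converted to NumPy arrays.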
Phikon-v2 can be used with or without fine-tuning on different downstream applications, most notably slide classification using multiple instance learning algorithms (such as ABMIL).
You can fine-tune the model on tile-level downstream tasks. This Colab notebook allows you to fine-tune Phikon and Phikon-v2 using LoRA through the Hugging Face API.
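To make the slide-classification use case concrete, here is a minimal sketch of an attention-based pooling head in the spirit of ABMIL, applied to frozen tile features. The class name `ABMIL`, the hidden size of 128, the two-class output, and the bag of 200 random tiles are all assumptions chosen for illustration; this is not the implementation used to benchmark Phikon-v2.

```python
import torch
from torch import nn


class ABMIL(nn.Module):
    """Attention-based MIL head over the tile features of a single slide."""

    def __init__(self, in_dim=1024, hidden_dim=128, n_classes=2):
        super().__init__()
        # Small MLP that scores how much each tile should contribute.
        self.attention = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, tiles):
        # tiles: (n_tiles, in_dim) features of one slide
        scores = self.attention(tiles)                   # (n_tiles, 1)
        weights = torch.softmax(scores, dim=0)           # sums to 1 over tiles
        slide_embedding = (weights * tiles).sum(dim=0)   # (in_dim,)
        logits = self.classifier(slide_embedding)        # (n_classes,)
        return logits, weights.squeeze(-1)


torch.manual_seed(0)
mil_head = ABMIL()
bag = torch.randn(200, 1024)  # stand-in for 200 tile features from one slide
with torch.inference_mode():
    logits, attn = mil_head(bag)
print(logits.shape, attn.shape)
```

The attention weights give a per-tile contribution to the slide-level prediction, which can be useful for inspecting which regions the head relies on.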
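For readers unfamiliar with LoRA, the sketch below illustrates the underlying idea on a single linear layer: the pre-trained weights are frozen and only a low-rank update is trained. It is not the notebook's code and does not use the Hugging Face API; the `LoRALinear` class, the rank of 8, and the scaling factor are illustrative assumptions.

```python
import torch
from torch import nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update B @ A."""

    def __init__(self, base, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen
        # A starts small and random, B starts at zero, so the wrapped
        # layer initially behaves exactly like the base layer.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        update = (x @ self.lora_a.T) @ self.lora_b.T
        return self.base(x) + self.scaling * update


torch.manual_seed(0)
frozen = nn.Linear(1024, 1024)  # stand-in for one projection inside the ViT
adapted = LoRALinear(frozen, rank=8)
x = torch.randn(4, 1024)

same_at_init = torch.allclose(adapted(x), frozen(x))
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
total = sum(p.numel() for p in adapted.parameters())
print(f"trainable parameters: {trainable} of {total}")
```

Only the two low-rank matrices receive gradients, which is why this style of fine-tuning is much cheaper in memory than updating every weight of the backbone.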
For any additional questions or comments, contact Alexandre Filiot (alexandre.filiot@owkin.com).
@misc{filiot2024phikonv2largepublicfeature,
      title={Phikon-v2, A large and public feature extractor for biomarker prediction},
      author={Alexandre Filiot and Paul Jacob and Alice Mac Kain and Charlie Saillard},
      year={2024},
      eprint={2409.09173},
      archivePrefix={arXiv},
      primaryClass={eess.IV},
      url={https://arxiv.org/abs/2409.09173},
}
We thank the DINOv2 authors for their amazing contribution [1].
This work was granted access to the HPC resources of IDRIS under the allocation 2023-A0141012519 made by GENCI.
The results published here are partly based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The data used for the analyses described in this manuscript were obtained from the GTEx Portal on 07/01/2023.
Vision Transformer architectures were derived from facebookresearch/dino (Apache License 2.0) and huggingface/pytorch-image-models (Apache License 2.0). This code is built upon the DINOv2 repository (Apache License 2.0).
The following table provides the license associated with each dataset used for pre-training Phikon-v2.
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., & Bojanowski, P. (2024). DINOv2: Learning robust visual features without supervision. arXiv.
Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository. Journal of Digital Imaging, 26(6), 1045–1057. Springer Science and Business Media LLC. https://doi.org/10.1007/s10278-013-9622-7
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2019). The Clinical Proteomic Tumor Analysis Consortium Acute Myeloid Leukemia Collection (CPTAC-AML) (Version 4) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.2019.B6FOE619
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). The Clinical Proteomic Tumor Analysis Consortium Glioblastoma Multiforme Collection (CPTAC-GBM) (Version 15) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.3RJE41Q1
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2020). The Clinical Proteomic Tumor Analysis Consortium Breast Invasive Carcinoma Collection (CPTAC-BRCA) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.CAEM-YS80
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2020). The Clinical Proteomic Tumor Analysis Consortium Colon Adenocarcinoma Collection (CPTAC-COAD) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.YZWQ-ZZ63
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). The Clinical Proteomic Tumor Analysis Consortium Head and Neck Squamous Cell Carcinoma Collection (CPTAC-HNSCC) (Version 16) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.UW45NH81
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). The Clinical Proteomic Tumor Analysis Consortium Clear Cell Renal Cell Carcinoma Collection (CPTAC-CCRCC) (Version 13) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.OBLAMN27
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). The Clinical Proteomic Tumor Analysis Consortium Lung Squamous Cell Carcinoma Collection (CPTAC-LSCC) (Version 15) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.6EMUB5L2
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2019). The Clinical Proteomic Tumor Analysis Consortium Sarcomas Collection (CPTAC-SAR) (Version 10) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.2019.9BT23R95
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2020). The Clinical Proteomic Tumor Analysis Consortium Ovarian Serous Cystadenocarcinoma Collection (CPTAC-OV) (Version 3) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.ZS4A-JD58
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). The Clinical Proteomic Tumor Analysis Consortium Pancreatic Ductal Adenocarcinoma Collection (CPTAC-PDA) (Version 14) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.SC20FO18
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). The Clinical Proteomic Tumor Analysis Consortium Cutaneous Melanoma Collection (CPTAC-CM) (Version 11) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.ODU24GZE
National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2019). The Clinical Proteomic Tumor Analysis Consortium Uterine Corpus Endometrial Carcinoma Collection (CPTAC-UCEC) (Version 12) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.3R3JUISW
Cancer Moonshot Biobank. (2022). Cancer Moonshot Biobank – Colorectal Cancer Collection (CMB-CRC) (Version 5) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/DJG7-GZ87
Cancer Moonshot Biobank. (2022). Cancer Moonshot Biobank – Melanoma Collection (CMB-MEL) (Version 5) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/GWSP-WH72
Cancer Moonshot Biobank. (2022). Cancer Moonshot Biobank – Gastroesophageal Cancer Collection (CMB-GEC) (Version 2) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/E7KH-R486
Cancer Moonshot Biobank. (2022). Cancer Moonshot Biobank – Lung Cancer Collection (CMB-LCA) (Version 5) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/3CX3-S132
Cancer Moonshot Biobank. (2022). Cancer Moonshot Biobank – Multiple Myeloma Collection (CMB-MML) (Version 4) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/SZKB-SW39
Bakas, S., Sako, C., Akbari, H., Bilello, M., Sotiras, A., Shukla, G., Rudie, J. D., Flores Santamaria, N., Fathi Kazerooni, A., Pati, S., Rathore, S., Mamourian, E., Ha, S. M., Parker, W., Doshi, J., Baid, U., Bergman, M., Binder, Z. A., Verma, R., … Davatzikos, C. (2021). Multi-parametric magnetic resonance imaging (mpMRI) scans for de novo Glioblastoma (GBM) patients from the University of Pennsylvania Health System (UPENN-GBM) (Version 2) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.709X-DN49
Martel, A. L., Nofech-Mozes, S., Salama, S., Akbar, S., & Peikari, M. (2019). Assessment of residual breast cancer cellularity after neoadjuvant chemotherapy using digital pathology [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.2019.4YIBTJNO
Campanella, G., Hanna, M. G., Brogi, E., & Fuchs, T. J. (2019). Breast metastases to axillary lymph nodes [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.2019.3XBN2JCC
Farahmand, S., Fernandez, A. I., Ahmed, F. S., Rimm, D. L., Chuang, J. H., Reisenbichler, E., & Zarringhalam, K. (2022). HER2 and trastuzumab treatment response H&E slides with tumor ROI annotations (Version 3) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/E65C-AM96
Pataki, B. A., Olar, A., Ribli, D., Pesti, A., Kontsek, E., Gyongyosi, B., Bilecz, A., Kovács, T., Kovács, K. A., Kiss, Z., Szócska, M., Pollner, P., & Csabai, I. (2021). Digital pathological slides from Hungarian (Europe) colorectal cancer screening (Version 2) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.9CJF-0127
Pennycuick, A., Teixeira, V. H., AbdulJabbar, K., Raza, S. E. A., Lund, T., Akarca, A. U., Rosenthal, R., Kalinke, L., Chandrasekharan, D. P., Pipinikas, C. P., Lee-Six, H., Hynds, R. E., Gowers, K. H. C., Henry, J. Y., Millar, F. R., Hagos, Y. B., Denais, C., Falzon, M., Moore, D. A., Antoniou, S., Durrenberger, P. F., Furness, A. J., Carroll, B., Marceaux, C., Asselin-Labat, M. L., Larson, W., Betts, C., Coussens, L. M., Thakrar, R. M., George, J., Swanton, C., Thirlwell, C., Campbell, P. J., Marafioti, T., Yuan, Y., Quezada, S. A., McGranahan, N., & Janes, S. M. (2020). Immune surveillance in clinical regression of preinvasive squamous cell lung cancer. Cancer Discovery, 10(10), 1489-1499. https://doi.org/10.1158/2159-8290.CD-19-1366
National Lung Screening Trial Research Team. (2013). Data from the National Lung Screening Trial (NLST) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.HMQ8-J677
Wang, C.-W., Chang, C.-C., Lo, S.-C., Lin, Y.-J., Liou, Y.-A., Hsu, P.-C., Lee, Y.-C., & Chao, T.-K. (2021). A dataset of histopathological whole slide images for classification of treatment effectiveness to ovarian cancer (Ovarian Bevacizumab Response) (Version 2) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.985G-EY35
Chowdhury, S., Kennedy, J. J., Ivey, R. G., Murillo, O., Hosseini, N., Song, X., Petralia, F., Calinawan, A., Voytovich, U. J., Savage, S. R., Berry, A., Reva, B., Ozbek, U., Krek, A., Ma, W., da Veiga Leprevost, F., Ji, J., Yoo, S., Lin, C., … Paulovich, A. G. (2023). Proteogenomic analysis of chemo-refractory high grade serous ovarian cancer (PTRC-HGSOC) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/6RDA-P940
Hodis, E., Torlai Triglia, E., Kwon, J. Y. H., Biancalani, T., Zakka, L. R., Parkar, S., Hütter, J. C., Buffoni, L., Delorey, T. M., Phillips, D., Dionne, D., Nguyen, L. T., Schapiro, D., Maliga, Z., Jacobson, C. A., Hendel, A., Rozenblatt-Rosen, O., Mihm, M. C. Jr., Garraway, L. A., & Regev, A. (2022). Stepwise-edited, human melanoma models reveal mutations' effect on tumor and microenvironment. Science, 376(6592), eabi8175. https://doi.org/10.1126/science.abi8175