Self-supervised models
This repository collects a set of self-supervised learning (SSL) model checkpoints for ResNet and Vision Transformer (ViT) architectures. These models have been pretrained on large image datasets using various SSL methods and can be fine-tuned or evaluated on downstream tasks.
You can use vitookit for evaluation and fine-tuning of these checkpoints.
Example vitookit k-NN evaluation (iBOT ViT-B on Oxford Pets):
vitrun eval_knn.py --data_location ~/data/ --data_set Pets --model vit_base_patch16_224.ibot -w ~/models/SSL/vit/ibot_vitb_in1k.pth
Model cards — ResNet and ViT
Below are concise model cards for the self-supervised ResNet and ViT checkpoints in this repository. Record the concrete checkpoint metadata (pretraining recipe, epochs, augmentations) next to each entry when available.
ViT (Vision Transformer)
- Description: Transformer-based image encoder using non-overlapping patches (e.g., ViT-B/16, ViT-Tiny). Commonly used with self‑supervised pretraining (iBOT, DINO, etc.).
- Typical architecture names: vit_base_patch16_224, vit_tiny_patch16_224, etc.
- Typical pretraining: ImageNet-1k (IN1k) or larger; method-dependent (iBOT/DINO/MAE). Check checkpoint metadata for exact recipe.
- Input size: 224×224 (unless model variant specifies otherwise).
- Preprocessing: ImageNet mean/std — mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]; resize + center-crop to model input.
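A minimal preprocessing sketch using torchvision (an illustrative assumption, not necessarily vitookit's exact pipeline); the same ImageNet statistics and resize + center-crop apply to the ResNet checkpoints below:

```python
# Standard ImageNet-style evaluation preprocessing (illustrative sketch,
# assuming a 224x224 model variant).
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

eval_transform = transforms.Compose([
    transforms.Resize(256),                  # resize shorter side to 256
    transforms.CenterCrop(224),              # crop to the model input size
    transforms.ToTensor(),                   # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```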
ResNet (Residual Networks)
- Description: Convolutional residual encoder family (ResNet-18/34/50/101). Widely used as backbones for self‑supervised methods (MoCo, BYOL, SimCLR).
- Typical architecture names: resnet18, resnet50, resnet101, etc.
- Typical pretraining: ImageNet-1k (IN1k) with self‑supervised recipes — verify checkpoint metadata for exact details.
- Input size: usually 224×224.
- Preprocessing: ImageNet mean/std — mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]; resize + center-crop to 224.
- Recommended evaluation protocols:
  - Evaluate with k-NN, linear probe, or full fine-tuning depending on downstream task complexity (a simplified k-NN sketch follows this list).
  - Check whether the checkpoint contains a backbone-only state dict or a wrapped model (use `--prefix` or `--checkpoint_key` to extract).
  - Example vitookit k-NN evaluation: vitrun eval_knn.py --data_location ~/data/ --data_set Pets --model resnet50.moco -w ~/models/SSL/resnet/resnet50_moco_in1k.pth
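Below is a simplified sketch of the k-NN protocol, assuming a backbone that maps images to feature vectors (e.g., a timm model created with `num_classes=0`); vitookit's eval_knn.py uses its own, more elaborate weighted-voting implementation:

```python
# Simplified k-NN evaluation sketch: extract frozen features, then classify
# test images by cosine-distance nearest neighbours. This illustrates the
# protocol only; it is not a reimplementation of vitookit's eval_knn.py.
import torch
from sklearn.neighbors import KNeighborsClassifier


@torch.no_grad()
def extract_features(backbone, loader, device="cuda"):
    # Assumes `backbone` returns (B, D) feature vectors for a batch of images.
    backbone.eval().to(device)
    feats, labels = [], []
    for images, targets in loader:
        out = backbone(images.to(device))
        feats.append(torch.nn.functional.normalize(out, dim=-1).cpu())
        labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()


def knn_top1(backbone, train_loader, test_loader, k=20):
    train_x, train_y = extract_features(backbone, train_loader)
    test_x, test_y = extract_features(backbone, test_loader)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_x, train_y)
    return knn.score(test_x, test_y)  # top-1 accuracy on the test split
```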
Common notes & metadata to record with each checkpoint
- Checkpoint name and full path (local or URL)
- Self-supervised method and hyperparameters (crop sizes, augmentations, epochs, batch size)
- Epoch or step number and top-line metrics (if available)
- Checkpoint layout hints: whether weights are under keys like `model`, `state_dict`, or prefixed (use `--checkpoint_key` and `--prefix` with vitookit); see the extraction sketch after this list
- Intended use, known limitations, and license/citation (add the original paper citation and checkpoint license)
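A minimal sketch of unwrapping a checkpoint before loading it into a backbone; the wrapper key ("state_dict"/"model") and prefix ("module.") are illustrative examples, so inspect your checkpoint's actual keys first:

```python
# Sketch: unwrap a checkpoint and strip a wrapper prefix before loading.
# The wrapper key and prefix below are assumptions for illustration;
# print(ckpt.keys()) to discover your checkpoint's actual layout.
import torch
import torchvision

ckpt = torch.load("resnet50_moco_in1k.pth", map_location="cpu")

# Checkpoints are often wrapped, e.g. {"state_dict": {...}} or {"model": {...}}.
state_dict = ckpt.get("state_dict", ckpt.get("model", ckpt))

# Strip a wrapper prefix such as "module." (DataParallel) if present.
prefix = "module."
state_dict = {(k[len(prefix):] if k.startswith(prefix) else k): v
              for k, v in state_dict.items()}

backbone = torchvision.models.resnet50()
# strict=False tolerates missing classifier heads / leftover projector weights.
missing, unexpected = backbone.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```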
Limitations & safety
- Self‑supervised checkpoints inherit biases from pretraining data (e.g., ImageNet). Validate on your downstream data and consider fairness/privacy impacts before deployment.
- Always check and respect the original checkpoint license and citation requirements.
Put per-checkpoint details (method, epoch, training recipe, license, citation) immediately under each model entry for clarity.