Japanese CLIP Model with Knowledge Distillation (Distillation Only)
This is a Japanese CLIP model trained via knowledge distillation from line-corporation/clip-japanese-base.
Model Architecture
- Image Encoder: ResNet50 (fine-tuned from ImageNet pretrained weights)
- Text Encoder: Frozen line-corporation/clip-japanese-base text encoder
- Training Method: Knowledge Distillation Only (no contrastive learning)
- Output Dimension: 512 (compatible with line-corporation/clip-japanese-base)
Training Details
- Dataset: STAIR Captions
- Training Method: Knowledge Distillation Only
- Distillation Temperature: 4.0
- Total Epochs: 15
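The distillation objective can be sketched as below. This is an illustrative assumption, not the repository's actual training code: one common formulation softens the within-batch similarity distributions of student and teacher features with the temperature (T = 4.0 here) and matches them via KL divergence.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feats, teacher_feats, T=4.0):
    """Illustrative KD loss (assumed, not the actual training code):
    KL divergence between temperature-softened within-batch similarity
    distributions of student and teacher features."""
    # Normalize features to unit length, as CLIP-style models do
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    # Pairwise similarity logits within the batch, softened by T
    s_logits = s @ s.T / T
    t_logits = t @ t.T / T
    # KL(teacher || student), scaled by T^2 as in Hinton et al. (2015)
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean") * (T ** 2)

# Dummy 512-d features for a batch of 8 (matches the output dimension above)
loss = distillation_loss(torch.randn(8, 512), torch.randn(8, 512))
```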
Usage
```python
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer
from PIL import Image
import torch

# Load model, tokenizer, and image processor
model = AutoModel.from_pretrained("AoiNoGeso/japanese-resclip-v1", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AoiNoGeso/japanese-resclip-v1", trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained("AoiNoGeso/japanese-resclip-v1", trust_remote_code=True)

# Preprocess image and text
image = Image.open("path/to/image.jpg")
image_inputs = processor(images=image, return_tensors="pt")
text = ["犬", "猫", "鳥"]  # "dog", "cat", "bird"
text_inputs = tokenizer(text, padding=True, return_tensors="pt")

# Get features
with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)
    text_features = model.get_text_features(**text_inputs)

# Compute cosine similarity between the image and each label
similarity = torch.cosine_similarity(image_features, text_features)
```
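To turn such similarities into zero-shot label probabilities, a softmax over the labels is the usual step. The sketch below uses random placeholder features in place of real model outputs so it runs without downloading the weights:

```python
import torch
import torch.nn.functional as F

# Placeholder features standing in for model outputs (1 image, 3 labels, 512-d)
image_features = torch.randn(1, 512)
text_features = torch.randn(3, 512)

# Normalize and take cosine similarities, CLIP-style
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)
logits = image_features @ text_features.T  # shape (1, 3)

# Softmax over the labels yields per-label probabilities
probs = logits.softmax(dim=-1)
```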
Model Performance
This model was trained using only knowledge distillation to mimic the image feature output of line-corporation/clip-japanese-base while using a ResNet50-based image encoder.