# Turkish CLIP (ViT + BERT) – turkish-clip-vit-bert

## Description
This model is a Turkish CLIP-style (Contrastive Language–Image Pretraining) multimodal model.
It maps images and text into a shared embedding space, enabling image–text similarity scoring and text-to-image retrieval.
- Image encoder: ViT-base-patch16-224-in21k
- Text encoder: BERT-base-Turkish-uncased
- Training dataset: ITU Perceptron Turkish Image Captioning Dataset
- Maximum sequence length: 512 tokens
It was trained using mean pooling over tokens, so it is recommended to use mean pooling during inference as well.
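If you work with the text encoder's raw token outputs instead of the model's own `encode_text`, a masked mean pool over the last hidden state would look roughly like the sketch below. This is an illustrative helper under the assumption that padding tokens are excluded via the attention mask; it is not part of the model's API.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                  # sum of real-token embeddings
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # number of real tokens per sample
    return summed / counts                                          # (batch, hidden)
```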
## Image Preprocessing

Images should be preprocessed in the same way as during training:

```python
from torchvision import transforms

image_transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])
```
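For example, applying the transform to a PIL image and adding a batch dimension yields a tensor of shape `(1, 3, 224, 224)`, which is the input the image encoder expects (the file path below is a placeholder):

```python
from PIL import Image

image = Image.open("example.jpg").convert("RGB")    # placeholder path
pixel_values = image_transform(image).unsqueeze(0)  # shape: (1, 3, 224, 224)
```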
## Text Preprocessing

Texts are tokenized with a maximum length of 512 tokens, which matches the training setup:

```python
inputs = tokenizer(
    text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
)
```
## Usage Example
```python
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
from torchvision import transforms
import matplotlib.pyplot as plt
import numpy as np

model_name = "erythropygia/turkish-clip-vit-bert"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

image_transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

def compute_similarity(text, image: Image.Image):
    # Preprocess the image and add a batch dimension.
    pixel_values = image_transform(image).unsqueeze(0).to(device)
    with torch.no_grad():
        # Encode the image.
        image_embeds = model.encode_image(pixel_values)

        # Tokenize and encode the text.
        inputs = tokenizer(text, return_tensors="pt", padding=True,
                           truncation=True, max_length=512).to(device)
        text_embeds = model.encode_text(inputs["input_ids"], inputs["attention_mask"])

        # Image-text similarity, scaled by the learned temperature.
        similarity = torch.matmul(image_embeds, text_embeds.T) / model.temperature
    return similarity.item()

test_image = Image.open("test_image/cat.png").convert("RGB")

candidate_texts = [
    "köpek parkta koşuyor",
    "hayvan resmi içeren bir fotoğraf",
    "kedi",
    "kedi bir ayağını önüne uzatmış sol tarafa bakıyor",
    "vesikalık fotoğraf",
    "hafif bulanık insan suratı fotoğrafı",
    "iki insanın karşılıklı bir bilgisayarın yanında oturduğu fotoğraf ve altında bir youtube yorumu",
    "Admin yazılı bir çerçeve içerisinde insan suratı imajda tarih ve zaman bilgisi var",
    "araba"
]

# Score each candidate caption against the image.
raw_scores = []
for text in candidate_texts:
    score = compute_similarity(text, test_image)
    raw_scores.append(score)

# Normalize the scores over the candidate set with softmax.
scores_tensor = torch.tensor(raw_scores)
softmax_scores = torch.softmax(scores_tensor, dim=0).numpy()

print("\nImage - Text similarity scores (softmax normalized)")
results = list(zip(candidate_texts, softmax_scores))
for text, score in results:
    print(f"'{text}': {score:.4f}")

best_match = max(results, key=lambda x: x[1])
print(f"\nBest Match: '{best_match[0]}' (score: {best_match[1]:.4f})")

plt.figure(figsize=(6, 6))
plt.imshow(test_image)
plt.title(f"Test Image\nBest Match: '{best_match[0]}'")
plt.axis('off')
plt.show()
```
## Example Results
## Notes
- The model was trained using mean pooling. It is recommended to use mean pooling during inference for consistent results.
- `model.temperature` is used for scaling the logits. Do not divide raw cosine similarity values by the temperature.
- The model can be used for batch-based retrieval or ranking tasks by computing embeddings for multiple images or texts; see the sketch after these notes.
- Make sure to preprocess your images with the same normalization and resizing as above to get accurate similarity scores.
- Texts are tokenized with a maximum of 512 tokens, consistent with training.
- Trained on the ITU Perceptron Turkish Image Captioning Dataset.
- Training run: 10 epochs (~70 hours) on a single NVIDIA RTX 3050 GPU.
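As a sketch of the batch-based retrieval mentioned above, the snippet below ranks several candidate images against a single Turkish query. It reuses `model`, `tokenizer`, `image_transform`, and `device` from the usage example; the image paths and the query are placeholders. Raw dot-product scores are enough for ranking, since the temperature scaling used in the usage example is monotonic and does not change the order.

```python
# Hypothetical candidate images; reuses model, tokenizer, image_transform, device from the usage example.
image_paths = ["images/img1.jpg", "images/img2.jpg", "images/img3.jpg"]
query = "kedi"

with torch.no_grad():
    # Encode all candidate images as a single batch.
    pixel_values = torch.stack(
        [image_transform(Image.open(p).convert("RGB")) for p in image_paths]
    ).to(device)
    image_embeds = model.encode_image(pixel_values)   # (num_images, dim)

    # Encode the text query.
    inputs = tokenizer(query, return_tensors="pt", padding=True,
                       truncation=True, max_length=512).to(device)
    text_embeds = model.encode_text(inputs["input_ids"], inputs["attention_mask"])  # (1, dim)

    # Score every image against the query; higher means more similar.
    scores = (image_embeds @ text_embeds.T).squeeze(-1)  # (num_images,)

# Print candidates from best to worst match.
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.4f}")
```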