Turkish CLIP (ViT + BERT) – turkish-clip-vit-bert

Description

This model is a Turkish CLIP-style (Contrastive Language–Image Pre-training) multimodal model.
It maps images and text into a shared embedding space, enabling image–text similarity scoring and text-to-image retrieval.

The model was trained using mean pooling over token embeddings, so the same pooling should be used during inference.
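
For reference, a masked mean over token embeddings looks like the sketch below. This is only an illustration (the helper name mean_pool is not part of the model's API); the model's own encode_text, used in the usage example further down, is expected to handle the pooling internally.

import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average the remaining token embeddings.
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts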


Image Preprocessing

Images should be preprocessed in the same way as during training:

from torchvision import transforms

image_transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406], 
        std=[0.229, 0.224, 0.225]
    ),
])
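
Applied to a single PIL image, this transform yields a (3, 224, 224) tensor; add a batch dimension before passing it to the model. The path below reuses the test image from the usage example further down:

from PIL import Image

image = Image.open("test_image/cat.png").convert("RGB")
pixel_values = image_transform(image).unsqueeze(0)  # shape: (1, 3, 224, 224)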

Text Preprocessing

Texts are tokenized with a maximum length of 512 tokens, which matches the training setup:

inputs = tokenizer(
    text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
)
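
The tokenizer also accepts a list of strings, which is convenient for batched text encoding; with padding=True, all texts in the batch are padded to a common length (the example texts below are taken from the candidate captions in the usage example):

texts = ["kedi", "köpek parkta koşuyor"]
batch_inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
)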

Usage Example

  import torch
  from transformers import AutoTokenizer, AutoModel
  from PIL import Image
  from torchvision import transforms
  import matplotlib.pyplot as plt
  import numpy as np

  # Load the model (custom architecture, hence trust_remote_code=True) and its tokenizer
  model_name = "erythropygia/turkish-clip-vit-bert"
  model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  
  device = "cuda" if torch.cuda.is_available() else "cpu"
  model.to(device)
  model.eval()

  image_transform = transforms.Compose([
      transforms.Resize(224),
      transforms.CenterCrop(224),
      transforms.ToTensor(),
      transforms.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225])
  ])
  
  # Encode one image and one text, then return their temperature-scaled similarity score.
  def compute_similarity(text, image: Image.Image):
      pixel_values = image_transform(image).unsqueeze(0).to(device)
      with torch.no_grad():
          image_embeds = model.encode_image(pixel_values)
          inputs = tokenizer(text, return_tensors="pt", padding=True,
                             truncation=True, max_length=512).to(device)
          text_embeds = model.encode_text(inputs["input_ids"], inputs["attention_mask"])
          similarity = torch.matmul(image_embeds, text_embeds.T) / model.temperature
          return similarity.item()
  
  test_image = Image.open("test_image/cat.png").convert("RGB")
  
  candidate_texts = [
      "köpek parkta koşuyor", 
      "hayvan resmi içeren bir fotoğraf",
      "kedi",
      "kedi bir ayağını önüne uzatmış sol tarafa bakıyor",
      "vesikalık fotoğraf",
      "hafif bulanık insan suratı fotoğrafı",
      "iki insanın karşılıklı bir bilgisayarın yanında oturduğu fotoğraf ve altında bir youtube yorumu",
      "Admin yazılı bir çerçeve içerisinde insan suratı imajda tarih ve zaman bilgisi var",
      "araba"
  ]
  
  # Score every candidate caption against the test image.
  raw_scores = []
  for text in candidate_texts:
      score = compute_similarity(text, test_image)
      raw_scores.append(score)
  
  # Softmax over the candidates turns the raw scores into a comparable distribution.
  scores_tensor = torch.tensor(raw_scores)
  softmax_scores = torch.softmax(scores_tensor, dim=0).numpy()
  
  print("\nImage - Text similarity scores (softmax normalized)")
  results = list(zip(candidate_texts, softmax_scores))
  for text, score in results:
      print(f"'{text}': {score:.4f}")
  
  best_match = max(results, key=lambda x: x[1])
  print(f"\nBest Match: '{best_match[0]}' (score: {best_match[1]:.4f})")
  
  # Display the test image together with its best-matching caption.
  plt.figure(figsize=(6, 6))
  plt.imshow(test_image)
  plt.title(f"Test Image\nBest Match: '{best_match[0]}'")
  plt.axis('off')
  plt.show()

Example Results

Example-1

Example-2

Notes

  1. The model was trained using mean pooling. It is recommended to use mean pooling during inference for consistent results.
  2. model.temperature is used for scaling the logits, as in compute_similarity above; if all you need are raw cosine similarities, skip the division by the temperature.
  3. The model can be used for batch-based retrieval or ranking tasks by computing embeddings for multiple images or texts at once (see the sketch after this list).
  4. Make sure to preprocess your images with the same normalization and resizing as above to get accurate similarity scores.
  5. Texts are tokenized with a maximum of 512 tokens, consistent with training.
  6. Trained on the ITU Perceptron Turkish Image Captioning Dataset.
  7. Training run: 10 epochs (~70 hours) on a single NVIDIA RTX 3050 GPU.
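
As a concrete illustration of note 3, here is a minimal text-to-image retrieval sketch. It reuses model, tokenizer, image_transform, and device from the usage example above; the image paths are placeholders, and encode_image / encode_text are assumed to behave exactly as they do there.

import torch
from PIL import Image

query = "kedi"
image_paths = ["img_0.png", "img_1.png", "img_2.png"]  # placeholder file names

with torch.no_grad():
    # Encode the single text query.
    inputs = tokenizer(query, return_tensors="pt", padding=True,
                       truncation=True, max_length=512).to(device)
    text_embeds = model.encode_text(inputs["input_ids"], inputs["attention_mask"])

    # Encode all candidate images in one batch.
    pixel_values = torch.stack(
        [image_transform(Image.open(p).convert("RGB")) for p in image_paths]
    ).to(device)
    image_embeds = model.encode_image(pixel_values)

    # Higher temperature-scaled score means a better match, as in compute_similarity.
    scores = (image_embeds @ text_embeds.T).squeeze(1) / model.temperature
    ranking = scores.argsort(descending=True)

for rank, idx in enumerate(ranking.tolist(), start=1):
    print(f"{rank}. {image_paths[idx]} (score: {scores[idx].item():.4f})")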
Model Details

Parameters: ~0.2B (Safetensors, F32)