Turkish Multimodal Embedding Model

This repository contains a contrastively trained Turkish multimodal embedding model, combining a text encoder and a vision encoder with projection heads.
The model is trained entirely on Turkish datasets (image–caption and paraphrase), making it specifically tailored for Turkish multimodal applications.

Model Summary

  • Text encoder: newmindai/modernbert-base-tr-uncased-allnli-stsb
  • Vision encoder: facebook/dinov2-base
  • Dimensions: text_dim=768, image_dim=768, embed_dim=768
  • Projection dropout: fixed at 0.4 (inside ProjectionHead)
  • Pooling: mean pooling over tokens (use_mean_pooling_for_text=True)
  • Normalize outputs: controlled by the `normalize` flag in config.json
  • Encoders frozen during training: No (this release was trained with the encoders not frozen)
  • Language focus: Turkish (both text and image–caption pairs are fully in Turkish)
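
For intuition, here is a minimal sketch of the mean-pooling and projection-head pattern described above. The actual classes live in model.py and may differ in detail; the names and layer order here are illustrative only.

import torch
import torch.nn as nn

def mean_pool(last_hidden_state, attention_mask):
    # Average token embeddings, masking out padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

class ProjectionHead(nn.Module):
    # Maps a 768-dim encoder output into the shared 768-dim embedding space,
    # applying the fixed 0.4 dropout mentioned above.
    def __init__(self, in_dim=768, out_dim=768, dropout=0.4):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.proj(self.dropout(x))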

Training Strategy (inspired by JINA-CLIP-v2)

  • The model was trained jointly on image–text and text–text pairs using a bidirectional contrastive loss (InfoNCE/CLIP-style); see the loss sketch after this list.
  • For image–text, standard CLIP-style training with in-batch negatives was applied.
  • For text–text, only positive paraphrase pairs (label=1) were used, with in-batch negatives coming from other samples.
  • This follows the general training philosophy often seen in Jina’s multimodal work, but in a simplified single-stage setup (without the 3-stage curriculum).
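
A minimal sketch of this bidirectional InfoNCE objective, assuming L2-normalized embeddings and a learned logit scale (this is the standard CLIP formulation; the repository's exact implementation may differ):

import torch
import torch.nn.functional as F

def clip_style_loss(emb_a, emb_b, logit_scale):
    # Works for both image–text and text–text batches: row i of emb_a and
    # emb_b form a positive pair; all other rows serve as in-batch negatives.
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = logit_scale * a @ b.T                       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy covers both retrieval directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2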

Datasets

Both training datasets (image–caption pairs and paraphrase pairs) are in Turkish, anchoring the model's embedding space in Turkish multimodal signals.
Please check each dataset’s license and terms before downstream use.

Files

  • pytorch_model.bin — PyTorch state_dict
  • config.json — metadata (encoder IDs, dimensions, flags)
  • model.py — custom model classes (required to load)
  • (This README is the model card.)

Evaluation Results

Dataset: Test split created from ituperceptron/image-captioning-turkish

Image-Text

Average cosine similarity: 0.7934

Recall@K

Direction       R@1      R@5      R@10
Text → Image    0.9365   0.9913   0.9971
Image → Text    0.9356   0.9927   0.9958
Raw metrics (JSON)
{
    "avg_cosine_sim": 0.7934404611587524,
    "recall_text_to_image": {
        "R@1": 0.936458564763386,
        "R@5": 0.9913352588313709,
        "R@10": 0.9971117529437903
    },
    "recall_image_to_text": {
        "R@1": 0.9355698733614752,
        "R@5": 0.9926682959342369,
        "R@10": 0.9957787158409243
    }
}
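
For reference, Recall@K can be computed directly from the two embedding matrices, as in this sketch (the exact evaluation script is not part of this repository):

import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, ks=(1, 5, 10)):
    # query_emb[i] and gallery_emb[i] are assumed to be a matched pair.
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = q @ g.T                                    # (N, N) cosine similarities
    ranks = sims.argsort(dim=-1, descending=True)     # best match first
    correct = torch.arange(sims.size(0)).unsqueeze(1)
    return {f"R@{k}": (ranks[:, :k] == correct).any(dim=1).float().mean().item() for k in ks}

# Text → Image uses text embeddings as queries against image embeddings;
# Image → Text swaps the roles. The same function applies to the
# Text → Text evaluation below.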

Text-Text

Average cosine similarity: 0.7599

Recall@K

Direction       R@1      R@5      R@10
Text → Text     0.7198   0.9453   0.9824
Raw metrics (JSON)
{
    "avg_cosine_sim": 0.7599335312843323,
    "recall_text_to_text": {
        "R@1": 0.719875500222321,
        "R@5": 0.9453090262338817,
        "R@10": 0.9824366385060027
    }
}

Loading & Usage

import os, json, torch, importlib.util
from huggingface_hub import snapshot_download
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import torch.nn.functional as F

# --- Settings
repo_id = "utkubascakir/turkish-multimodal-embedding"
local_dir = snapshot_download(repo_id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# --- 1) Load config
with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as f:
    cfg = json.load(f)

# --- 2) Load base encoders & processor
tok = AutoTokenizer.from_pretrained(cfg["text_encoder_id"])
txt_enc = AutoModel.from_pretrained(cfg["text_encoder_id"])
img_proc = AutoImageProcessor.from_pretrained(cfg["vision_encoder_id"])
vis_enc = AutoModel.from_pretrained(cfg["vision_encoder_id"])

# --- 3) Import the custom model class
spec = importlib.util.spec_from_file_location("model", os.path.join(local_dir, "model.py"))
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)  # exposes mod.MultiModalEmbedder

# --- 4) Build the model and load weights
model = mod.MultiModalEmbedder(
    text_encoder=txt_enc,
    vision_encoder=vis_enc,
    text_dim=cfg.get("text_dim", 768),
    image_dim=cfg.get("image_dim", 768),
    embed_dim=cfg.get("embed_dim", 768),      # must match training
    temperature_init=cfg.get("temperature_init", 1/0.07),
    use_mean_pooling_for_text=cfg.get("use_mean_pooling_for_text", True),
    freeze_encoders=cfg.get("freeze_encoders", False),
).to(device)

state = torch.load(os.path.join(local_dir, "pytorch_model.bin"), map_location=device)
# If you accidentally uploaded a checkpoint dict with a "model" key:
# if isinstance(state, dict) and "model" in state:
#     state = state["model"]
missing, unexpected = model.load_state_dict(state, strict=False)
print("load_state_dict -> missing:", missing, " unexpected:", unexpected)

model.eval()

# --- 5) INFERENCE (recommended): encode_* methods (@no_grad inside)
texts = ["kedi"]  # Turkish for "cat"; the model expects Turkish input
text_inputs = tok(texts, padding=True, truncation=True, return_tensors="pt").to(device)
t_emb = model.encode_text(text_inputs)  # (B, embed_dim)

img = Image.open("cat.jpeg").convert("RGB")
img_inputs = img_proc(img, return_tensors="pt").to(device)
v_emb = model.encode_image(img_inputs)  # (1, embed_dim)

print("Text embeddings:", t_emb.shape)
print("Image embeddings:", v_emb.shape)

# Cosine similarity
sim = F.cosine_similarity(t_emb, v_emb).item()
print(f"Cosine similarity: {sim:.4f}")

# --- 6) (Optional) TRAINING example: forward_* (grad-enabled usage)
# DO NOT use torch.no_grad() here during training
# t_train = model.forward_text(text_inputs["input_ids"], text_inputs["attention_mask"])
# v_train = model.forward_image(img_inputs["pixel_values"])
# loss calculations go here...
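
As a quick retrieval-style usage example, several candidate captions can be ranked against the image encoded above (the Turkish captions here are illustrative):

# Rank candidate Turkish captions for the image loaded above.
captions = ["bir kedi fotoğrafı", "bir köpek fotoğrafı", "bir araba fotoğrafı"]
cap_inputs = tok(captions, padding=True, truncation=True, return_tensors="pt").to(device)
cap_emb = F.normalize(model.encode_text(cap_inputs), dim=-1)
img_emb = F.normalize(v_emb, dim=-1)

scores = (img_emb @ cap_emb.T).squeeze(0)   # one cosine score per caption
for caption, score in zip(captions, scores.tolist()):
    print(f"{score:.4f}  {caption}")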

Limitations & Intended Use

This release provides a Turkish multimodal embedding model, trained to produce aligned vector representations for text and images.
Beyond the retrieval metrics reported above, it has not been validated on specific downstream tasks (e.g., classification, clustering).
It has not been audited for bias or toxicity; please evaluate it on your own target domain before deployment.

Citation

If you use this model, please cite this repository.
