# Turkish Multimodal Embedding Model
This repository contains a contrastively trained Turkish multimodal embedding model, combining a text encoder and a vision encoder with projection heads.
The model is trained entirely on Turkish datasets (image–caption and paraphrase), making it specifically tailored for Turkish multimodal applications.
## Model Summary

- Text encoder: `newmindai/modernbert-base-tr-uncased-allnli-stsb`
- Vision encoder: `facebook/dinov2-base`
- Dimensions: `text_dim=768`, `image_dim=768`, `embed_dim=768`
- Projection dropout: fixed at `0.4` (inside `ProjectionHead`)
- Pooling: mean pooling over tokens (`use_mean_pooling_for_text=True`)
- Normalize outputs: `{normalize}`
- Encoders frozen during training: `{frozen}` (this release was trained with encoders NOT frozen)
- Language focus: Turkish (both text and image–caption pairs are fully in Turkish)
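For reference, a minimal sketch of the projection head and mean-pooling step described above. The released implementation lives in `model.py` and may differ in detail (layer count, activation, etc. are assumptions here):

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Sketch: map a 768-d encoder output into the shared embedding space with dropout 0.4."""
    def __init__(self, in_dim: int = 768, embed_dim: int = 768, dropout: float = 0.4):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dropout(self.proj(x))

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean pooling over token embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).float()        # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)     # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)           # (B, 1)
    return summed / counts
```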
## Training Strategy (inspired by JINA-CLIP-v2 style)
- The model was trained jointly with image–text and text–text pairs using a bidirectional contrastive loss (InfoNCE/CLIP-style).
- For image–text, standard CLIP-style training with in-batch negatives was applied.
- For text–text, only positive paraphrase pairs (label=1) were used, with in-batch negatives coming from other samples.
- This follows the general training philosophy often seen in Jina’s multimodal work, but in a simplified single-stage setup (without the 3-stage curriculum).
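For illustration only (this is not the training code shipped with the repo), a bidirectional InfoNCE/CLIP-style loss with in-batch negatives can be sketched as follows; the temperature value here is an assumption:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss: matching pairs sit on the diagonal of the
    similarity matrix; every other in-batch sample acts as a negative."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    loss_a2b = F.cross_entropy(logits, targets)            # e.g. text -> image
    loss_b2a = F.cross_entropy(logits.t(), targets)        # e.g. image -> text
    return (loss_a2b + loss_b2a) / 2
```

The same objective applies to the text–text pairs: each positive paraphrase pair forms the diagonal, and the remaining texts in the batch serve as negatives.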
## Datasets

- Image–Text: `ituperceptron/image-captioning-turkish`
- Text–Text (Paraphrase): `dogukanvzr/ml-paraphrase-tr`
Both datasets are in Turkish, aligning the model’s embedding space around Turkish multimodal signals.
Please check each dataset’s license and terms before downstream use.
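A minimal sketch for pulling both datasets with the Hugging Face `datasets` library; the split and column names are assumptions, so check each dataset card first:

```python
from datasets import load_dataset

# Turkish image–caption pairs (column names such as "image"/"caption" may differ).
captions = load_dataset("ituperceptron/image-captioning-turkish", split="train")

# Turkish paraphrase pairs (positive pairs are assumed to carry label == 1).
paraphrases = load_dataset("dogukanvzr/ml-paraphrase-tr", split="train")

print(captions[0])
print(paraphrases[0])
```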
## Files

- `pytorch_model.bin` — PyTorch `state_dict`
- `config.json` — metadata (encoder IDs, dimensions, flags)
- `model.py` — custom model classes (required to load)
- (This README is the model card.)
## Evaluation Results

Dataset: test split created from `ituperceptron/image-captioning-turkish`
### Image–Text
Average cosine similarity: 0.7934
Recall@K
| Direction | R@1 | R@5 | R@10 |
|---|---|---|---|
| Text → Image | 0.9365 | 0.9913 | 0.9971 |
| Image → Text | 0.9356 | 0.9927 | 0.9958 |
Raw metrics (JSON):

```json
{
"avg_cosine_sim": 0.7934404611587524,
"recall_text_to_image": {
"R@1": 0.936458564763386,
"R@5": 0.9913352588313709,
"R@10": 0.9971117529437903
},
"recall_image_to_text": {
"R@1": 0.9355698733614752,
"R@5": 0.9926682959342369,
"R@10": 0.9957787158409243
}
}
```
### Text–Text
Average cosine similarity: 0.7599
Recall@K
| Direction | R@1 | R@5 | R@10 |
|---|---|---|---|
| Text → Text | 0.7198 | 0.9453 | 0.9824 |
Raw metrics (JSON):

```json
{
"avg_cosine_sim": 0.7599335312843323,
"recall_text_to_text": {
"R@1": 0.719875500222321,
"R@5": 0.9453090262338817,
"R@10": 0.9824366385060027
}
}
```
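The evaluation script itself is not part of this release; the Recall@K figures above can be computed along the following lines, given paired, L2-normalized text and image embedding matrices (a sketch, not the authors' exact code):

```python
import torch

def recall_at_k(query_emb: torch.Tensor, gallery_emb: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """Recall@K for paired embeddings: row i of query_emb matches row i of gallery_emb.
    Both inputs are assumed to be L2-normalized, shape (N, embed_dim)."""
    sims = query_emb @ gallery_emb.t()                                 # (N, N) cosine similarities
    ranks = sims.argsort(dim=1, descending=True)                       # gallery indices, best first
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)  # correct index per query
    hits = ranks == targets                                            # (N, N) boolean hit matrix
    return {f"R@{k}": hits[:, :k].any(dim=1).float().mean().item() for k in ks}

# recall_at_k(text_embeddings, image_embeddings)  -> Text → Image row
# recall_at_k(image_embeddings, text_embeddings)  -> Image → Text row
```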
Loading & Usage
import os, json, torch, importlib.util
from huggingface_hub import snapshot_download
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import torch.nn.functional as F
# --- Settings
repo_id = "utkubascakir/turkish-multimodal-embedding"
local_dir = snapshot_download(repo_id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# --- 1) Load config
with open(os.path.join(local_dir, "config.json"), "r", encoding="utf-8") as f:
cfg = json.load(f)
# --- 2) Load base encoders & processor
tok = AutoTokenizer.from_pretrained(cfg["text_encoder_id"])
txt_enc = AutoModel.from_pretrained(cfg["text_encoder_id"])
img_proc = AutoImageProcessor.from_pretrained(cfg["vision_encoder_id"])
vis_enc = AutoModel.from_pretrained(cfg["vision_encoder_id"])
# --- 3) Import the custom model class
spec = importlib.util.spec_from_file_location("model", os.path.join(local_dir, "model.py"))
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod) # exposes mod.MultiModalEmbedder
# --- 4) Build the model and load weights
model = mod.MultiModalEmbedder(
text_encoder=txt_enc,
vision_encoder=vis_enc,
text_dim=cfg.get("text_dim", 768),
image_dim=cfg.get("image_dim", 768),
embed_dim=cfg.get("embed_dim", 768), # must match training
temperature_init=cfg.get("temperature_init", 1/0.07),
use_mean_pooling_for_text=cfg.get("use_mean_pooling_for_text", True),
freeze_encoders=cfg.get("freeze_encoders", False),
).to(device)
state = torch.load(os.path.join(local_dir, "pytorch_model.bin"), map_location=device)
# If you accidentally uploaded a checkpoint dict with a "model" key:
# if isinstance(state, dict) and "model" in state:
# state = state["model"]
missing, unexpected = model.load_state_dict(state, strict=False)
print("load_state_dict -> missing:", missing, " unexpected:", unexpected)
model.eval()
# --- 5) INFERENCE (recommended): encode_* methods (@no_grad inside)
texts = ["cat"]
text_inputs = tok(texts, padding=True, truncation=True, return_tensors="pt").to(device)
t_emb = model.encode_text(text_inputs) # (B, embed_dim)
img = Image.open("cat.jpeg").convert("RGB")
img_inputs = img_proc(img, return_tensors="pt").to(device)
v_emb = model.encode_image(img_inputs) # (1, embed_dim)
print("Text embeddings:", t_emb.shape)
print("Image embeddings:", v_emb.shape)
# Cosine similarity
sim = F.cosine_similarity(t_emb, v_emb).item()
print(f"Cosine similarity: {sim:.4f}")
# --- 6) (Optional) TRAINING example: forward_* (grad-enabled usage)
# DO NOT use torch.no_grad() here during training
# t_train = model.forward_text(text_inputs["input_ids"], text_inputs["attention_mask"])
# v_train = model.forward_image(img_inputs["pixel_values"])
# loss calculations go here...
```
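Building on the `encode_*` calls above, a small text→image ranking example (the candidate captions are placeholders):

```python
# Continuing from the snippet above: model, tok, device, v_emb and F are already defined.
candidates = [
    "bir kedi koltukta uyuyor",    # "a cat sleeping on the couch"
    "sahilde gün batımı",          # "sunset at the beach"
    "bir grup insan koşuyor",      # "a group of people running"
]
cand_inputs = tok(candidates, padding=True, truncation=True, return_tensors="pt").to(device)
cand_emb = model.encode_text(cand_inputs)                          # (3, embed_dim)

scores = F.cosine_similarity(cand_emb, v_emb.expand_as(cand_emb), dim=-1)
best = scores.argmax().item()
print("Best caption:", candidates[best], f"(score: {scores[best].item():.4f})")
```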
Limitations & Intended Use
This release provides a Turkish multimodal embedding model, trained to produce aligned vector representations for text and images.
It has not been tested for specific downstream tasks (e.g., retrieval, classification).
No guarantees are made regarding bias or toxicity; please evaluate the model on your own target domain.
## Citation
If you use this model, please cite this repository.