Model Card for fashion-clip-vit-large-patch14

Quick Summary

This is the model card for fashion-clip-vit-large-patch14, an openai/clip-vit-large-patch14 model fine-tuned for high-performance, domain-specific image-text retrieval in the fashion industry. This model adapts the broad, general-purpose knowledge of CLIP to the specific vocabulary and visual nuances of e-commerce fashion products.

Model Details

Model Description

This model is a fine-tuned version of openai/clip-vit-large-patch14. The base model, developed by OpenAI, is a powerful zero-shot vision–language model trained on 400M general (image, text) pairs from the web. While effective for general-purpose tasks, it lacks the specialized vocabulary and fine-grained visual understanding required for specific domains like fashion.

This project addresses that gap through domain adaptation. The model was trained on the paramaggarwal/fashion-product-images-dataset, a high-quality dataset of 44,439 professionally shot e-commerce images with structured attributes.

The fine-tuning process optimized the model's embedding space for fashion-specific concepts. It learned to differentiate closely related terms (e.g., Topwear vs. T-shirt) and visual styles (e.g., Sandal vs. Flip Flop) that the base model may treat as near-interchangeable. The result is a powerful retrieval model for fashion-centric semantic and visual search.

  • Developed by: Md Mohsin
  • Model type: Vision & Language, Contrastive Learning Model
  • Architecture: ViT-L/14 image encoder + Transformer text encoder
  • Language(s): English
  • Finetuned from: openai/clip-vit-large-patch14

This model is a derivative work of two MIT-licensed components:

  • The base openai/clip model
  • The paramaggarwal/fashion-product-images-dataset

Model Sources

  • Repository: https://huggingface.co/mohsin416/fashion-clip-vit-large
  • Original CLIP Paper: Learning Transferable Visual Models From Natural Language Supervision (https://arxiv.org/abs/2103.00020)

Uses

Direct Use

This model is optimized for fashion-domain retrieval tasks; a minimal retrieval sketch follows the list below.

  • Text-to-Image Retrieval: Given a prompt like "men's blue shoes" or "women’s red saree", retrieve the most relevant product images.
  • Image-to-Image Retrieval: Find visually similar or stylistically similar products.
  • Zero-Shot Classification: Classify images against fashion-specific prompts (e.g., "a photo of a sandal").
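The sketch below illustrates the text-to-image retrieval use case with 🤗 Transformers. It is a minimal example, not a production pipeline: the checkpoint id is assumed from the citation at the end of this card, and the image paths are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "mohsin416/fashion-clip-vit-large"  # assumed from the citation URL below
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

# 1) Embed a small catalogue of product images once (placeholder paths).
catalogue = [Image.open(p) for p in ["shoe1.jpg", "shoe2.jpg", "saree1.jpg"]]
with torch.no_grad():
    image_embeds = model.get_image_features(**processor(images=catalogue, return_tensors="pt"))
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

# 2) Embed the text query and rank the catalogue by cosine similarity.
with torch.no_grad():
    text_embeds = model.get_text_features(**processor(text=["men's blue shoes"], return_tensors="pt", padding=True))
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

scores = (text_embeds @ image_embeds.T).squeeze(0)   # cosine similarity per image
print(scores.argsort(descending=True))               # catalogue indices, best match first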

Downstream Use

The model's encoders can serve as domain-specialized feature extractors (see the sketch after this list) for:

  • Fashion recommendation systems
  • Lightweight attribute classifiers (pattern, material, neckline, etc.)
  • Visual Question Answering (VQA) for e-commerce
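As an illustration of the feature-extractor use case above, the sketch below fits a small scikit-learn classifier on frozen image embeddings. The labels, file paths, and choice of classifier are placeholders, not part of this model's training.

import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model_id = "mohsin416/fashion-clip-vit-large"  # assumed checkpoint id
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

def embed_images(paths):
    # Frozen CLIP image embeddings used as plain feature vectors.
    images = [Image.open(p) for p in paths]
    with torch.no_grad():
        feats = model.get_image_features(**processor(images=images, return_tensors="pt"))
    return feats.numpy()

# Placeholder data: image paths with an example "usage" attribute label.
train_paths, train_labels = ["casual_1.jpg", "formal_1.jpg"], ["Casual", "Formal"]
classifier = LogisticRegression(max_iter=1000).fit(embed_images(train_paths), train_labels)
print(classifier.predict(embed_images(["new_product.jpg"])))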

Out-of-Scope Use

  • General-purpose retrieval: outside the fashion domain, the fine-tuned model is expected to perform worse than the base CLIP
  • In-the-wild images: the model was trained only on studio product photography, not casual or user-generated photos
  • Fashion trends that emerged after ~2017, the approximate cutoff of the training data

Bias, Risks, and Limitations

Inherited Bias

The model inherits the biases of the base CLIP model, which was trained on large-scale, largely unfiltered web data.

Dataset-Specific Biases

  • Geographic & Cultural Bias: The dataset is sourced primarily from Myntra, an Indian e-commerce platform, so the model is strongly biased toward Indian fashion styles and vocabulary.
  • Temporal Bias: The data dates from around 2017 and does not reflect more recent styles.
  • Presentation Bias: Images are professional studio shots only; performance on real-world photos is likely to degrade.
  • Demographic Bias: The dataset's gender attributes largely reinforce binary gender categories.

Recommendations

  • Do not use the model as a sole decision-maker.
  • Test for biases across demographics, cultures, and categories before production use.

How to Get Started with the Model

The model can be loaded with 🤗 Transformers to compute similarity scores for fashion-specific prompts.
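A minimal quick-start sketch is shown below; it scores one product image against a few fashion-specific prompts. The checkpoint id is assumed from the citation at the end of this card, and the image path is a placeholder.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "mohsin416/fashion-clip-vit-large"  # assumed from the citation URL below
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("product.jpg")  # placeholder path to a studio product shot
prompts = ["a photo of a sandal", "a photo of a flip flop", "a photo of a sports shoe"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over image-text similarity logits gives zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for prompt, p in zip(prompts, probs.tolist()):
    print(f"{prompt}: {p:.3f}")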

Training Details

Training Data

  • Dataset: paramaggarwal/fashion-product-images-dataset (small version)
  • Size: 44,439 studio product images
  • Attributes: gender, masterCategory, subCategory, articleType, baseColour, season, year, usage

Training Procedure

Preprocessing

Since the dataset lacks natural captions, descriptive text was synthesized from structured attributes using templates such as:
"A photo of {gender} {masterCategory} {subCategory}"
These generated captions served as the positive text pairs during contrastive training.
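The sketch below shows how such captions can be generated from the dataset's attribute metadata; the file name styles.csv and the exact template wording are assumptions and may differ from the script used for training.

import pandas as pd

# Attribute metadata of the dataset (file name assumed; malformed rows skipped).
styles = pd.read_csv("styles.csv", on_bad_lines="skip")

def make_caption(row) -> str:
    # Same template shape as above: "A photo of {gender} {masterCategory} {subCategory}"
    return f"A photo of {row['gender']} {row['masterCategory']} {row['subCategory']}"

styles["caption"] = styles.apply(make_caption, axis=1)
print(styles["caption"].head())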

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Model size: 1.71 GB (ViT-L)
  • Learning Rate: 1e-4
  • Batch Size: 32
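For illustration only, a single fine-tuning step under these settings could look like the sketch below. This is not the exact training script; the optimizer and data pipeline are assumptions, and the dataloader is assumed to yield batches already tokenized and preprocessed by CLIPProcessor.

import torch
from transformers import CLIPModel

device = "cuda"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # learning rate from above
scaler = torch.cuda.amp.GradScaler()                         # fp16 mixed precision

def train_step(batch):
    # batch: dict with input_ids, attention_mask, pixel_values (batch size 32).
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        # return_loss=True makes CLIPModel compute the symmetric contrastive loss.
        outputs = model(**{k: v.to(device) for k, v in batch.items()}, return_loss=True)
    scaler.scale(outputs.loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return outputs.loss.item()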

Speeds, Sizes, Times

  • Compute: 1× NVIDIA P100
  • Training Time: ~7 hours

Evaluation

The model was evaluated on a held-out validation split using text-image retrieval.

Testing Data, Factors & Metrics

  • Testing Data: Validation split of the same dataset
  • Metrics:
    • Recall@k: The fraction of queries for which the correct item appears among the top-k retrieved results
    • NDCG@k: Measures ranking quality; the score is higher when the correct item is ranked closer to the top (illustrated in the sketch below)
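As a concrete illustration of the two metrics in the setting where each text query has exactly one relevant image, the sketch below computes both from a text-to-image similarity matrix; the evaluation code used for the reported numbers may differ in its details.

import numpy as np

def retrieval_metrics(sim: np.ndarray, k: int):
    # sim[i, j] = similarity of text query i to image j; query i matches image i.
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                       # best match first
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    recall_at_k = float(np.mean(ranks < k))
    # With one relevant item the ideal DCG is 1, so NDCG@k = 1 / log2(rank + 2)
    # when the item appears in the top k, and 0 otherwise.
    ndcg_at_k = float(np.mean(np.where(ranks < k, 1.0 / np.log2(ranks + 2), 0.0)))
    return recall_at_k, ndcg_at_k

sim = np.random.rand(100, 100)  # placeholder similarity matrix
print(retrieval_metrics(sim, k=5))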

Results Summary

  • High Retrieval Effectiveness:
    Recall@50 = 0.9410
  • Strong Top-1 Accuracy:
    Recall@1 / NDCG@1 = 0.2602
  • Ranking Quality:
    The gap between the high Recall@k values and the more moderate NDCG@k values indicates that the model reliably retrieves the correct item into a small candidate set and ranks it reasonably well, even when it is not the very top result.

Table 1: Validation Recall Metrics

Metric Score
Recall@1 0.2602
Recall@5 0.5644
Recall@10 0.7008
Recall@15 0.7770
Recall@20 0.8270
Recall@25 0.8626
Recall@50 0.9410
Recall@100 0.9780
Recall@200 0.9922

Table 2: Validation NDCG Metrics

Metric Score
NDCG@1 0.2602
NDCG@5 0.4191
NDCG@10 0.4632
NDCG@15 0.4834
NDCG@20 0.4952
NDCG@25 0.5030
NDCG@50 0.5183
NDCG@100 0.5243
NDCG@200 0.5264

Environmental Impact

  • Hardware: 1× NVIDIA P100
  • Hours Used: ~7
  • Cloud Provider: Kaggle

Citation

Model

If you use this model, please cite:

@misc{mohsin2025fashionclip,
  title        = {fashion-clip-vit-large-patch14: A Fine-Tuned CLIP Model for Fashion Image-Text Retrieval},
  author       = {Md Mohsin},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/mohsin416/fashion-clip-vit-large}}
}

Model Card Contact

Md Mohsin
