Model Card for fashion-clip-vit-large-patch14

Quick Summary

This is the model card for fashion-clip-vit-large-patch14, an openai/clip-vit-large-patch14 model fine-tuned for high-performance, domain-specific image-text retrieval in the fashion industry. This model adapts the broad, general-purpose knowledge of CLIP to the specific vocabulary and visual nuances of e-commerce fashion products.

Model Details

Model Description

This model is a fine-tuned version of openai/clip-vit-large-patch14. The base model, developed by OpenAI, is a powerful zero-shot vision–language model trained on 400M general (image, text) pairs from the web. While effective for general-purpose tasks, it lacks the specialized vocabulary and fine-grained visual understanding required for specific domains like fashion.

This project addresses that gap through domain adaptation. The model was trained on the paramaggarwal/fashion-product-images-dataset, a high-quality dataset of 44,439 professionally shot e-commerce images with structured attributes.

The fine-tuning process optimized the model's embedding space for fashion-specific concepts. It learned to differentiate closely related terms (e.g., Topwear vs. T-shirt) and visual styles (e.g., Sandal vs. Flip Flop) that the base model may treat as near-interchangeable. The result is a powerful retrieval model for fashion-centric semantic and visual search.

  • Developed by: Md Mohsin
  • Model type: Vision & Language, Contrastive Learning Model
  • Architecture: ViT-L/14 image encoder + Transformer text encoder
  • Language(s): English
  • Finetuned from: openai/clip-vit-large-patch14

This model is a derivative work of two MIT-licensed components:

  • The base openai/clip model
  • The paramaggarwal/fashion-product-images-dataset

Model Sources

  • Repository: https://huggingface.co/mohsin416/fashion-clip-vit-large
  • Original CLIP Paper: Learning Transferable Visual Models From Natural Language Supervision (https://arxiv.org/abs/2103.00020)

Uses

Direct Use

This model is optimized for fashion-domain retrieval tasks; a minimal retrieval sketch follows the list below.

  • Text-to-Image Retrieval: Given a prompt like "men's blue shoes" or "women’s red saree", retrieve the most relevant product images.
  • Image-to-Image Retrieval: Find visually similar or stylistically similar products.
  • Zero-Shot Classification: Classify images against fashion-specific prompts (e.g., "a photo of a sandal").
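The sketch below illustrates the text-to-image retrieval use case with 🤗 Transformers. It is a minimal example, not a production pipeline: the checkpoint id is assumed from the citation at the end of this card, and the image paths are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "mohsin416/fashion-clip-vit-large"  # assumed from the citation URL below
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

# 1) Embed a small catalogue of product images once (placeholder paths).
catalogue = [Image.open(p) for p in ["shoe1.jpg", "shoe2.jpg", "saree1.jpg"]]
with torch.no_grad():
    image_embeds = model.get_image_features(**processor(images=catalogue, return_tensors="pt"))
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

# 2) Embed the text query and rank the catalogue by cosine similarity.
with torch.no_grad():
    text_embeds = model.get_text_features(**processor(text=["men's blue shoes"], return_tensors="pt", padding=True))
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

scores = (text_embeds @ image_embeds.T).squeeze(0)   # cosine similarity per image
print(scores.argsort(descending=True))               # catalogue indices, best match first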

Downstream Use

The model's encoders can serve as domain-specialized feature extractors (see the sketch after this list) for:

  • Fashion recommendation systems
  • Lightweight attribute classifiers (pattern, material, neckline, etc.)
  • Visual Question Answering (VQA) for e-commerce
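As an illustration of the feature-extractor use case above, the sketch below fits a small scikit-learn classifier on frozen image embeddings. The labels, file paths, and choice of classifier are placeholders, not part of this model's training.

import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model_id = "mohsin416/fashion-clip-vit-large"  # assumed checkpoint id
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

def embed_images(paths):
    # Frozen CLIP image embeddings used as plain feature vectors.
    images = [Image.open(p) for p in paths]
    with torch.no_grad():
        feats = model.get_image_features(**processor(images=images, return_tensors="pt"))
    return feats.numpy()

# Placeholder data: image paths with an example "usage" attribute label.
train_paths, train_labels = ["casual_1.jpg", "formal_1.jpg"], ["Casual", "Formal"]
classifier = LogisticRegression(max_iter=1000).fit(embed_images(train_paths), train_labels)
print(classifier.predict(embed_images(["new_product.jpg"])))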

Out-of-Scope Use

  • General-purpose retrieval: outside the fashion domain, the fine-tuned model is expected to perform worse than the base CLIP
  • In-the-wild images: the model was trained only on studio product photography, not casual or user-generated photos
  • Fashion trends that emerged after ~2017, the approximate cutoff of the training data

Bias, Risks, and Limitations

Inherited Bias

The model inherits the biases of the base CLIP model, which was trained on large-scale, largely unfiltered web data.

Dataset-Specific Biases

  • Geographic & Cultural Bias: The dataset is sourced primarily from Myntra, an Indian e-commerce platform, so the model is strongly biased toward Indian fashion styles and vocabulary.
  • Temporal Bias: The data dates from around 2017 and does not reflect more recent styles.
  • Presentation Bias: Images are professional studio shots only; performance on real-world photos is likely to degrade.
  • Demographic Bias: The dataset's gender attributes largely reinforce binary gender categories.

Recommendations

  • Do not use the model as a sole decision-maker.
  • Test for biases across demographics, cultures, and categories before production use.

How to Get Started with the Model

The model can be loaded with 🤗 Transformers to compute similarity scores for fashion-specific prompts.
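A minimal quick-start sketch is shown below; it scores one product image against a few fashion-specific prompts. The checkpoint id is assumed from the citation at the end of this card, and the image path is a placeholder.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "mohsin416/fashion-clip-vit-large"  # assumed from the citation URL below
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("product.jpg")  # placeholder path to a studio product shot
prompts = ["a photo of a sandal", "a photo of a flip flop", "a photo of a sports shoe"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over image-text similarity logits gives zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for prompt, p in zip(prompts, probs.tolist()):
    print(f"{prompt}: {p:.3f}")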

Training Details

Training Data

  • Dataset: paramaggarwal/fashion-product-images-dataset (small version)
  • Size: 44,439 studio product images
  • Attributes: gender, masterCategory, subCategory, articleType, baseColour, season, year, usage

Training Procedure

Preprocessing

Since the dataset lacks natural captions, descriptive text was synthesized from structured attributes using templates such as:
"A photo of {gender} {masterCategory} {subCategory}"
These generated captions served as the positive text pairs during contrastive training.
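The sketch below shows how such captions can be generated from the dataset's attribute metadata; the file name styles.csv and the exact template wording are assumptions and may differ from the script used for training.

import pandas as pd

# Attribute metadata of the dataset (file name assumed; malformed rows skipped).
styles = pd.read_csv("styles.csv", on_bad_lines="skip")

def make_caption(row) -> str:
    # Same template shape as above: "A photo of {gender} {masterCategory} {subCategory}"
    return f"A photo of {row['gender']} {row['masterCategory']} {row['subCategory']}"

styles["caption"] = styles.apply(make_caption, axis=1)
print(styles["caption"].head())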

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Model size: 1.71 GB (ViT-L)
  • Learning Rate: 1e-4
  • Batch Size: 32
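For illustration only, a single fine-tuning step under these settings could look like the sketch below. This is not the exact training script; the optimizer and data pipeline are assumptions, and the dataloader is assumed to yield batches already tokenized and preprocessed by CLIPProcessor.

import torch
from transformers import CLIPModel

device = "cuda"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # learning rate from above
scaler = torch.cuda.amp.GradScaler()                         # fp16 mixed precision

def train_step(batch):
    # batch: dict with input_ids, attention_mask, pixel_values (batch size 32).
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        # return_loss=True makes CLIPModel compute the symmetric contrastive loss.
        outputs = model(**{k: v.to(device) for k, v in batch.items()}, return_loss=True)
    scaler.scale(outputs.loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return outputs.loss.item()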

Speeds, Sizes, Times

  • Compute: 1× NVIDIA P100
  • Training Time: ~7 hours

Evaluation

The model was evaluated on a held-out validation split using text-image retrieval.

Testing Data, Factors & Metrics

  • Testing Data: Validation split of the same dataset
  • Metrics:
    • Recall@k: The fraction of queries for which the correct item appears among the top-k retrieved results
    • NDCG@k: Measures ranking quality; the score is higher when the correct item is ranked closer to the top (illustrated in the sketch below)
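As a concrete illustration of the two metrics in the setting where each text query has exactly one relevant image, the sketch below computes both from a text-to-image similarity matrix; the evaluation code used for the reported numbers may differ in its details.

import numpy as np

def retrieval_metrics(sim: np.ndarray, k: int):
    # sim[i, j] = similarity of text query i to image j; query i matches image i.
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)                       # best match first
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    recall_at_k = float(np.mean(ranks < k))
    # With one relevant item the ideal DCG is 1, so NDCG@k = 1 / log2(rank + 2)
    # when the item appears in the top k, and 0 otherwise.
    ndcg_at_k = float(np.mean(np.where(ranks < k, 1.0 / np.log2(ranks + 2), 0.0)))
    return recall_at_k, ndcg_at_k

sim = np.random.rand(100, 100)  # placeholder similarity matrix
print(retrieval_metrics(sim, k=5))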

Results Summary

  • High Retrieval Effectiveness:
    Recall@50 = 0.9410
  • Strong Top-1 Accuracy:
    Recall@1 / NDCG@1 = 0.2602
  • Ranking Quality:
    The gap between the high Recall@k values and the more moderate NDCG@k values indicates that the model reliably retrieves the correct item into a small candidate set and ranks it reasonably well, even when it is not the very top result.

Table 1: Validation Recall Metrics

Metric Score
Recall@1 0.2602
Recall@5 0.5644
Recall@10 0.7008
Recall@15 0.7770
Recall@20 0.8270
Recall@25 0.8626
Recall@50 0.9410
Recall@100 0.9780
Recall@200 0.9922

Table 2: Validation NDCG Metrics

Metric Score
NDCG@1 0.2602
NDCG@5 0.4191
NDCG@10 0.4632
NDCG@15 0.4834
NDCG@20 0.4952
NDCG@25 0.5030
NDCG@50 0.5183
NDCG@100 0.5243
NDCG@200 0.5264

Environmental Impact

  • Hardware: 1× NVIDIA P100
  • Hours Used: ~7
  • Cloud Provider: Kaggle

Citation

Model

If you use this model, please cite:

@misc{mohsin2025fashionclip,
  title        = {fashion-clip-vit-large-patch14: A Fine-Tuned CLIP Model for Fashion Image-Text Retrieval},
  author       = {Md Mohsin},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/mohsin416/fashion-clip-vit-large}}
}

Model Card Contact

Md Mohsin
