Model Card for fashion-clip-vit-large-patch14
Quick Summary
This is the model card for fashion-clip-vit-large-patch14, an openai/clip-vit-large-patch14 model fine-tuned for high-performance, domain-specific image-text retrieval in the fashion industry. This model adapts the broad, general-purpose knowledge of CLIP to the specific vocabulary and visual nuances of e-commerce fashion products.
Model Details
Model Description
This model is a fine-tuned version of openai/clip-vit-large-patch14. The base model, developed by OpenAI, is a powerful zero-shot vision–language model trained on 400M general (image, text) pairs from the web. While effective for general-purpose tasks, it lacks the specialized vocabulary and fine-grained visual understanding required for specific domains like fashion.
This project addresses that gap through domain adaptation. The model was trained on the paramaggarwal/fashion-product-images-dataset, a high-quality dataset of 44,439 professionally shot e-commerce images with structured attributes.
The fine-tuning process optimized the model's embedding space for fashion-specific concepts. It learned to differentiate subtle terms (e.g., Topwear vs. T-shirt) and visual styles (Sandal vs. Flip Flop) that the base model may treat similarly. The result is a powerful retrieval model for fashion-centric semantic and visual search.
- Developed by: Md Mohsin
- Model type: Vision & Language, Contrastive Learning Model
- Architecture: ViT-L/14 image encoder + Transformer text encoder
- Language(s): English
- Finetuned from: openai/clip-vit-large-patch14
This model is a derivative work of two MIT-licensed components:
- The base openai/clip model
- The paramaggarwal/fashion-product-images-dataset
Model Sources
- Repository: https://huggingface.co/mohsin416/fashion-clip-vit-large
- Original CLIP Paper: https://arxiv.org/abs/2103.00020 (Learning Transferable Visual Models From Natural Language Supervision)
Uses
Direct Use
This model is optimized for fashion-domain retrieval tasks.
- Text-to-Image Retrieval: Given a prompt like "men's blue shoes" or "women’s red saree", retrieve the most relevant product images (see the retrieval sketch after this list).
- Image-to-Image Retrieval: Find visually similar or stylistically similar products.
- Zero-Shot Classification: Classify images against fashion-specific prompts (e.g., "a photo of a sandal").
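As an illustration of the text-to-image use case, the sketch below embeds a handful of catalogue images with this model and ranks them against a text query by cosine similarity. It is a minimal sketch, not the project's reference code: the repository id is assumed from the citation section and the image paths are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "mohsin416/fashion-clip-vit-large"  # assumed repo id, taken from the citation below
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# 1) Embed a small catalogue of product images (paths are placeholders).
image_paths = ["shoe_1.jpg", "saree_1.jpg", "tshirt_1.jpg"]
images = [Image.open(p) for p in image_paths]
with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

# 2) Embed the text query and rank the catalogue by cosine similarity.
query = "men's blue shoes"
with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores = (text_emb @ image_emb.T).squeeze(0)
ranked = scores.argsort(descending=True)
print([image_paths[i] for i in ranked])
```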
Downstream Use
The model's encoders can serve as domain-specialized feature extractors for:
- Fashion recommendation systems
- Lightweight attribute classifiers (pattern, material, neckline, etc.), as sketched below
- Visual Question Answering (VQA) for e-commerce
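For the attribute-classifier use case, one hedged approach is to treat the image encoder as a frozen feature extractor and fit a simple classifier on top of its embeddings. The scikit-learn classifier, the file paths, and the "pattern" labels below are illustrative assumptions, not artifacts of this repository.

```python
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model_id = "mohsin416/fashion-clip-vit-large"  # assumed repo id
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

def embed_images(paths):
    """Return L2-normalised CLIP image embeddings for a list of image file paths."""
    images = [Image.open(p) for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()

# Hypothetical labelled examples for a "pattern" attribute.
train_paths = ["striped_tee.jpg", "solid_tee.jpg", "checked_shirt.jpg"]
train_labels = ["striped", "solid", "checked"]

clf = LogisticRegression(max_iter=1000).fit(embed_images(train_paths), train_labels)
print(clf.predict(embed_images(["new_product.jpg"])))
```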
Out-of-Scope Use
- General-purpose (non-fashion) image-text retrieval, where the base CLIP model is likely to perform better
- In-the-wild photos (user-generated, non-studio images)
- Fashion trends that emerged after ~2017, the dataset's collection period
Bias, Risks, and Limitations
Inherited Bias
The model inherits the biases of the base CLIP model, which was trained on large-scale, largely unfiltered web data.
Dataset-Specific Biases
- Geographic & Cultural Bias: The dataset is sourced primarily from Myntra, an Indian e-commerce platform, so the model is skewed toward Indian fashion styles.
- Temporal Bias: The dataset dates from around 2017 and does not reflect more recent styles.
- Presentation Bias: All images are professional studio photography.
- Demographic Bias: Gender attributes follow binary categories, which the model reinforces.
Recommendations
- Do not use the model as a sole decision-maker.
- Test for biases across demographics, cultures, and categories before production use.
How to Get Started with the Model
The model can be loaded with 🤗 Transformers to compute similarity scores for fashion-specific prompts.
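A minimal getting-started sketch is shown below. It assumes the Hugging Face repository id from the citation section and a local product image, and scores that image against a few fashion prompts.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "mohsin416/fashion-clip-vit-large"  # assumed repo id from the citation below
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("product.jpg")  # placeholder path to a product image
prompts = ["a photo of a sandal", "a photo of a flip flop", "a photo of a sports shoe"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{prompt}: {p:.3f}")
```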
Training Details
Training Data
- Dataset: paramaggarwal/fashion-product-images-dataset (small version)
- Size: 44,439 studio product images
- Attributes: gender, masterCategory, subCategory, articleType, baseColour, season, year, usage
Training Procedure
Preprocessing
Since the dataset lacks natural captions, descriptive text was synthesized from structured attributes using templates such as:
"A photo of {gender} {masterCategory} {subCategory}"
These generated text prompts serve as positive pairs during contrastive learning.
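For concreteness, a small sketch of this caption synthesis is shown below, assuming the dataset's styles.csv metadata file and the attribute columns listed under Training Data; the exact script and template wording used for training may differ.

```python
import pandas as pd

# styles.csv ships with the fashion-product-images dataset (path and parsing options are assumptions).
df = pd.read_csv("styles.csv", on_bad_lines="skip")

def make_caption(row) -> str:
    # Mirrors the template described above; the training script's exact wording may differ.
    return f"A photo of {row['gender']} {row['masterCategory']} {row['subCategory']}"

df["caption"] = df.apply(make_caption, axis=1)
print(df["caption"].head())
```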
Training Hyperparameters
- Training regime: fp16 mixed precision (see the training-step sketch after this list)
- Model size: 1.71 GB (ViT-L)
- Learning Rate: 1e-4
- Batch Size: 32
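To make the regime concrete, below is a minimal sketch of the kind of contrastive fine-tuning step these settings imply (fp16 autocast, learning rate 1e-4, batch size 32). The optimizer choice, data pairing, and loop structure are assumptions; this is not the exact training script.

```python
import torch
from PIL import Image
from torch.utils.data import DataLoader
from transformers import CLIPModel, CLIPProcessor

device = "cuda"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed optimizer
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed precision

# Placeholder (image path, synthesized caption) pairs; in practice these come from the dataset.
train_pairs = [("10001.jpg", "A photo of Men Apparel Topwear"),
               ("10002.jpg", "A photo of Women Footwear Sandal")]

def collate(batch):
    paths, captions = zip(*batch)
    images = [Image.open(p) for p in paths]
    return processor(text=list(captions), images=images, return_tensors="pt", padding=True)

loader = DataLoader(train_pairs, batch_size=32, shuffle=True, collate_fn=collate)

model.train()
for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        # return_loss=True makes CLIPModel compute the symmetric image-text contrastive loss.
        loss = model(**batch, return_loss=True).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```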
Speeds, Sizes, Times
- Compute: 1× NVIDIA P100
- Training Time: ~7 hours
Evaluation
The model was evaluated on a held-out validation split using text-image retrieval.
Testing Data, Factors & Metrics
- Testing Data: Validation split of the same dataset
- Metrics:
- Recall@k: Measures whether the correct item appears among the top-k retrieved results
- NDCG@k: Measures ranking quality, rewarding results that place the correct item closer to the top (see the sketch below)
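For reference, the sketch below shows one common way to compute these metrics when each text query has exactly one relevant image; the evaluation protocol here is inferred from the metric names, not confirmed by the training code.

```python
import math

def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the single relevant item appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def ndcg_at_k(ranked_ids, relevant_id, k):
    """With one relevant item, NDCG@k reduces to 1/log2(rank + 1) when it is found in the top k."""
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Example: the correct product is ranked third for this query.
ranking = ["p7", "p2", "p5", "p9"]
print(recall_at_k(ranking, "p5", 5))  # 1.0
print(ndcg_at_k(ranking, "p5", 5))    # 0.5 (= 1 / log2(4))
```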
Results Summary
- High Retrieval Effectiveness: Recall@50 = 0.9410
- Strong Top-1 Accuracy: Recall@1 / NDCG@1 = 0.2602
- Ranking Quality: The gap between Recall@k and NDCG@k indicates the model retrieves the right neighborhood and ranks items within it reasonably well.
Table 1: Validation Recall Metrics
| Metric | Score |
|---|---|
| Recall@1 | 0.2602 |
| Recall@5 | 0.5644 |
| Recall@10 | 0.7008 |
| Recall@15 | 0.7770 |
| Recall@20 | 0.8270 |
| Recall@25 | 0.8626 |
| Recall@50 | 0.9410 |
| Recall@100 | 0.9780 |
| Recall@200 | 0.9922 |
Table 2: Validation NDCG Metrics
| Metric | Score |
|---|---|
| NDCG@1 | 0.2602 |
| NDCG@5 | 0.4191 |
| NDCG@10 | 0.4632 |
| NDCG@15 | 0.4834 |
| NDCG@20 | 0.4952 |
| NDCG@25 | 0.5030 |
| NDCG@50 | 0.5183 |
| NDCG@100 | 0.5243 |
| NDCG@200 | 0.5264 |
Environmental Impact
- Hardware: 1× NVIDIA P100
- Hours Used: ~7
- Cloud Provider: Kaggle
Citation
Model
If you use this model, please cite:
@misc{mohsin2025fashionclip,
title = {fashion-clip-vit-large-patch14: A Fine-Tuned CLIP Model for Fashion Image-Text Retrieval},
author = {Md Mohsin},
year = {2025},
howpublished = {\url{https://huggingface.co/mohsin416/fashion-clip-vit-large}}
}
Model Card Contact
Md Mohsin
- Hugging Face: https://huggingface.co/mohsin416
- GitHub: https://github.com/mdmohsin212/
- LinkedIn: https://www.linkedin.com/in/mohsin416/