---
language: en
tags:
- clip
- fashion
- image-retrieval
- text-image
- contrastive-learning
model_name: fashion-clip-vit-large
finetuned_from: openai/clip-vit-large-patch14
datasets:
- paramaggarwal/fashion-product-images-dataset
---

# Model Card for fashion-clip-vit-large-patch14

## Quick Summary

This is the model card for **fashion-clip-vit-large-patch14**, an **openai/clip-vit-large-patch14** model fine-tuned for high-performance, domain-specific image-text retrieval in the fashion industry. This model adapts the broad, general-purpose knowledge of CLIP to the specific vocabulary and visual nuances of e-commerce fashion products.

## Model Details

### Model Description

This model is a fine-tuned version of **openai/clip-vit-large-patch14**. The base model, developed by OpenAI, is a powerful zero-shot vision–language model trained on 400M general (image, text) pairs from the web. While effective for general-purpose tasks, it lacks the specialized vocabulary and fine-grained visual understanding required for specific domains like fashion. This project addresses that gap through domain adaptation.

The model was trained on the **paramaggarwal/fashion-product-images-dataset**, a high-quality dataset of 44,439 professionally shot e-commerce images with structured attributes. The fine-tuning process optimized the model's embedding space for fashion-specific concepts: it learned to differentiate subtle terms (e.g., *Topwear* vs. *T-shirt*) and visual styles (*Sandal* vs. *Flip Flop*) that the base model may treat similarly. The result is a retrieval model well suited to fashion-centric semantic and visual search.

- **Developed by:** Md Mohsin
- **Model type:** Vision & Language, Contrastive Learning Model
- **Architecture:** ViT-L/14 image encoder + Transformer text encoder
- **Language(s):** English
- **Finetuned from:** openai/clip-vit-large-patch14

This model is a derivative work of two MIT-licensed components:

- The base **openai/clip** model
- The **paramaggarwal/fashion-product-images-dataset**

## Model Sources

- **Repository:** https://huggingface.co/mohsin416/fashion-clip-vit-large
- **Original CLIP Paper:** [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)

## Uses

### Direct Use

This model is optimized for fashion-domain retrieval tasks; code sketches for these workflows appear under "How to Get Started with the Model" below.

- **Text-to-Image Retrieval:** Given a prompt like *"men's blue shoes"* or *"women's red saree"*, retrieve the most relevant product images.
- **Image-to-Image Retrieval:** Find visually or stylistically similar products.
- **Zero-Shot Classification:** Classify images against fashion-specific prompts (e.g., *"a photo of a sandal"*).

### Downstream Use

The model's encoders can serve as domain-specialized feature extractors for:

- Fashion recommendation systems
- Lightweight attribute classifiers (pattern, material, neckline, etc.)
- Visual Question Answering (VQA) for e-commerce

### Out-of-Scope Use

- **General-purpose retrieval:** the fine-tuned model performs worse than base CLIP outside the fashion domain
- **In-the-wild images:** non-studio photos fall outside the training distribution
- **Modern fashion trends:** the training data dates from ~2017, so later styles are out of scope

## Bias, Risks, and Limitations

### Inherited Bias

The model inherits all biases of the base CLIP model, which was trained on large-scale, unfiltered web data.

### Dataset-Specific Biases

- **Geographic & Cultural Bias:** The dataset is sourced largely from Myntra (India), giving the model a strong Indian fashion bias.
- **Temporal Bias:** The dataset dates from ~2017 and lacks modern styles.
- **Presentation Bias:** Studio photography only.
- **Demographic Bias:** Gender attributes reinforce binary categories.

### Recommendations

- Do not use the model as a sole decision-maker.
- Test for biases across demographics, cultures, and categories before production use.

## How to Get Started with the Model

The model can be loaded with 🤗 Transformers to compute similarity scores for fashion-specific prompts, as in the sketches below.
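The following is a minimal sketch of that workflow: it scores one product image against a handful of fashion prompts, which doubles as zero-shot classification. The repository id is taken from the citation section below, and `product.jpg` is a placeholder path for any e-commerce product image.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Repository id taken from the citation section of this card.
MODEL_ID = "mohsin416/fashion-clip-vit-large"

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)
model.eval()

# "product.jpg" is a placeholder for any e-commerce product image.
image = Image.open("product.jpg")
prompts = [
    "a photo of men's blue shoes",
    "a photo of a women's red saree",
    "a photo of a sandal",
]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax over the
# prompts turns them into a zero-shot classification distribution.
probs = outputs.logits_per_image.softmax(dim=-1)
for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{p:.3f}  {prompt}")
```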
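For image-to-image retrieval, the image encoder can be used on its own. The sketch below reuses `model` and `processor` from the snippet above; `catalog_images` and `query_image` are placeholder PIL images standing in for your own catalog.

```python
import torch
import torch.nn.functional as F

def embed_images(images):
    """Return L2-normalized ViT-L/14 image embeddings of shape (N, 768)."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return F.normalize(feats, dim=-1)

# catalog_images / query_image are placeholders for PIL images.
catalog_embeds = embed_images(catalog_images)
query_embed = embed_images([query_image])

# On normalized embeddings, cosine similarity reduces to a dot product.
scores = (query_embed @ catalog_embeds.T).squeeze(0)
top5 = scores.topk(k=5).indices.tolist()
print("Most similar catalog items:", top5)
```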
## Training Details

### Training Data

- **Dataset:** paramaggarwal/fashion-product-images-dataset (small version)
- **Size:** 44,439 studio product images
- **Attributes:** gender, masterCategory, subCategory, articleType, baseColour, season, year, usage

### Training Procedure

#### Preprocessing

Since the dataset lacks natural captions, descriptive text was synthesized from structured attributes using templates such as:

**"A photo of {gender} {masterCategory} {subCategory}"**

For example, a product with `gender=Men`, `masterCategory=Apparel`, and `subCategory=Topwear` yields the caption *"A photo of Men Apparel Topwear"*. These generated text prompts serve as positive pairs during contrastive learning.

#### Training Hyperparameters

- Training regime: **fp16 mixed precision**
- Learning rate: **1e-4**
- Batch size: **32**

#### Speeds, Sizes, Times

- Model size: **1.71 GB (ViT-L)**
- Compute: **1× NVIDIA P100**
- Training time: **~7 hours**

## Evaluation

The model was evaluated on a held-out validation split using text-image retrieval.

### Testing Data, Factors & Metrics

- **Testing Data:** Validation split of the same dataset
- **Metrics:**
  - **Recall@k:** The fraction of queries for which the correct item appears among the top-k retrieved results.
  - **NDCG@k:** Ranking quality within the top-k list; the closer the correct item is to rank 1, the higher the score.

### Results Summary

- **High Retrieval Effectiveness:** Recall@50 = **0.9410**
- **Top-1 Retrieval:** Recall@1 = NDCG@1 = **0.2602**
- **Ranking Quality:** The gap between Recall@k and NDCG@k shows the model reliably retrieves the correct item into a small neighborhood (top 50), even when it does not rank it first.

### Table 1: Validation Recall Metrics

| Metric     | Score  |
|------------|--------|
| Recall@1   | 0.2602 |
| Recall@5   | 0.5644 |
| Recall@10  | 0.7008 |
| Recall@15  | 0.7770 |
| Recall@20  | 0.8270 |
| Recall@25  | 0.8626 |
| Recall@50  | 0.9410 |
| Recall@100 | 0.9780 |
| Recall@200 | 0.9922 |

### Table 2: Validation NDCG Metrics

| Metric    | Score  |
|-----------|--------|
| NDCG@1    | 0.2602 |
| NDCG@5    | 0.4191 |
| NDCG@10   | 0.4632 |
| NDCG@15   | 0.4834 |
| NDCG@20   | 0.4952 |
| NDCG@25   | 0.5030 |
| NDCG@50   | 0.5183 |
| NDCG@100  | 0.5243 |
| NDCG@200  | 0.5264 |

## Environmental Impact

- **Hardware:** 1× NVIDIA P100
- **Hours Used:** ~7
- **Cloud Provider:** Kaggle

## Citation

### Model

If you use this model, please cite:

```bibtex
@misc{mohsin2025fashionclip,
  title        = {fashion-clip-vit-large-patch14: A Fine-Tuned CLIP Model for Fashion Image-Text Retrieval},
  author       = {Md Mohsin},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/mohsin416/fashion-clip-vit-large}}
}
```

## Model Card Contact

**Md Mohsin**

- **Hugging Face:** https://huggingface.co/mohsin416
- **GitHub:** https://github.com/mdmohsin212/
- **LinkedIn:** https://www.linkedin.com/in/mohsin416/