---
language: en
tags:
- clip
- fashion
- image-retrieval
- text-image
- contrastive-learning
model_name: fashion-clip-vit-large
finetuned_from: openai/clip-vit-large-patch14
datasets:
- paramaggarwal/fashion-product-images-dataset
---

# Model Card for fashion-clip-vit-large-patch14

## Quick Summary

This is the model card for **fashion-clip-vit-large-patch14**, an **openai/clip-vit-large-patch14** model fine-tuned for high-performance, domain-specific image-text retrieval in the fashion industry. This model adapts the broad, general-purpose knowledge of CLIP to the specific vocabulary and visual nuances of e-commerce fashion products.

## Model Details

### Model Description

This model is a fine-tuned version of **openai/clip-vit-large-patch14**. The base model, developed by OpenAI, is a powerful zero-shot vision–language model trained on 400M general (image, text) pairs from the web. While effective for general-purpose tasks, it lacks the specialized vocabulary and fine-grained visual understanding required for specific domains like fashion. This project addresses that gap through domain adaptation.

The model was trained on the **paramaggarwal/fashion-product-images-dataset**, a high-quality dataset of 44,439 professionally shot e-commerce images with structured attributes. The fine-tuning process optimized the model's embedding space for fashion-specific concepts: it learned to differentiate subtle terms (e.g., *Topwear* vs. *T-shirt*) and visual styles (*Sandal* vs. *Flip Flop*) that the base model may treat similarly. The result is a retrieval model well suited to fashion-centric semantic and visual search.

- **Developed by:** Md Mohsin
- **Model type:** Vision & Language, Contrastive Learning Model
- **Architecture:** ViT-L/14 image encoder + Transformer text encoder
- **Language(s):** English
- **Finetuned from:** openai/clip-vit-large-patch14

This model is a derivative work of two MIT-licensed components:

- The base **openai/clip** model
- The **paramaggarwal/fashion-product-images-dataset**

## Model Sources

- **Repository:** https://huggingface.co/mohsin416/fashion-clip-vit-large
- **Original CLIP Paper:** [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)

## Uses

### Direct Use

This model is optimized for fashion-domain retrieval tasks; code sketches for these workflows appear under "How to Get Started with the Model" below.

- **Text-to-Image Retrieval:** Given a prompt like *"men's blue shoes"* or *"women's red saree"*, retrieve the most relevant product images.
- **Image-to-Image Retrieval:** Find visually or stylistically similar products.
- **Zero-Shot Classification:** Classify images against fashion-specific prompts (e.g., *"a photo of a sandal"*).

### Downstream Use

The model's encoders can serve as domain-specialized feature extractors for:

- Fashion recommendation systems
- Lightweight attribute classifiers (pattern, material, neckline, etc.)
- Visual Question Answering (VQA) for e-commerce

### Out-of-Scope Use

- **General-purpose retrieval:** the fine-tuned model performs worse than base CLIP outside the fashion domain
- **In-the-wild images:** non-studio photos fall outside the training distribution
- **Modern fashion trends:** the training data dates from ~2017, so later styles are out of scope

## Bias, Risks, and Limitations

### Inherited Bias

The model inherits all biases of the base CLIP model, which was trained on large-scale, unfiltered web data.

### Dataset-Specific Biases

- **Geographic & Cultural Bias:** The dataset is sourced largely from Myntra (India), giving the model a strong Indian fashion bias.
- **Temporal Bias:** The dataset dates from ~2017 and lacks modern styles.
- **Presentation Bias:** Studio photography only.
- **Demographic Bias:** Gender attributes reinforce binary categories.

### Recommendations

- Do not use the model as a sole decision-maker.
- Test for biases across demographics, cultures, and categories before production use.

## How to Get Started with the Model

The model can be loaded with 🤗 Transformers to compute similarity scores for fashion-specific prompts, as in the sketches below.
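The following is a minimal sketch of that workflow: it scores one product image against a handful of fashion prompts, which doubles as zero-shot classification. The repository id is taken from the citation section below, and `product.jpg` is a placeholder path for any e-commerce product image.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Repository id taken from the citation section of this card.
MODEL_ID = "mohsin416/fashion-clip-vit-large"

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)
model.eval()

# "product.jpg" is a placeholder for any e-commerce product image.
image = Image.open("product.jpg")
prompts = [
    "a photo of men's blue shoes",
    "a photo of a women's red saree",
    "a photo of a sandal",
]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax over the
# prompts turns them into a zero-shot classification distribution.
probs = outputs.logits_per_image.softmax(dim=-1)
for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{p:.3f}  {prompt}")
```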
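For image-to-image retrieval, the image encoder can be used on its own. The sketch below reuses `model` and `processor` from the snippet above; `catalog_images` and `query_image` are placeholder PIL images standing in for your own catalog.

```python
import torch
import torch.nn.functional as F

def embed_images(images):
    """Return L2-normalized ViT-L/14 image embeddings of shape (N, 768)."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return F.normalize(feats, dim=-1)

# catalog_images / query_image are placeholders for PIL images.
catalog_embeds = embed_images(catalog_images)
query_embed = embed_images([query_image])

# On normalized embeddings, cosine similarity reduces to a dot product.
scores = (query_embed @ catalog_embeds.T).squeeze(0)
top5 = scores.topk(k=5).indices.tolist()
print("Most similar catalog items:", top5)
```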
## Training Details

### Training Data

- **Dataset:** paramaggarwal/fashion-product-images-dataset (small version)
- **Size:** 44,439 studio product images
- **Attributes:** gender, masterCategory, subCategory, articleType, baseColour, season, year, usage

### Training Procedure

#### Preprocessing

Since the dataset lacks natural captions, descriptive text was synthesized from structured attributes using templates such as:

**"A photo of {gender} {masterCategory} {subCategory}"**

For example, a product with `gender=Men`, `masterCategory=Apparel`, and `subCategory=Topwear` yields the caption *"A photo of Men Apparel Topwear"*. These generated text prompts serve as positive pairs during contrastive learning.

#### Training Hyperparameters

- Training regime: **fp16 mixed precision**
- Learning rate: **1e-4**
- Batch size: **32**

#### Speeds, Sizes, Times

- Model size: **1.71 GB (ViT-L)**
- Compute: **1× NVIDIA P100**
- Training time: **~7 hours**

## Evaluation

The model was evaluated on a held-out validation split using text-image retrieval.

### Testing Data, Factors & Metrics

- **Testing Data:** Validation split of the same dataset
- **Metrics:**
  - **Recall@k:** The fraction of queries for which the correct item appears among the top-k retrieved results.
  - **NDCG@k:** Ranking quality within the top-k list; the closer the correct item is to rank 1, the higher the score.

### Results Summary

- **High Retrieval Effectiveness:** Recall@50 = **0.9410**
- **Top-1 Retrieval:** Recall@1 = NDCG@1 = **0.2602**
- **Ranking Quality:** The gap between Recall@k and NDCG@k shows the model reliably retrieves the correct item into a small neighborhood (top 50), even when it does not rank it first.

### Table 1: Validation Recall Metrics

| Metric     | Score  |
|------------|--------|
| Recall@1   | 0.2602 |
| Recall@5   | 0.5644 |
| Recall@10  | 0.7008 |
| Recall@15  | 0.7770 |
| Recall@20  | 0.8270 |
| Recall@25  | 0.8626 |
| Recall@50  | 0.9410 |
| Recall@100 | 0.9780 |
| Recall@200 | 0.9922 |

### Table 2: Validation NDCG Metrics

| Metric    | Score  |
|-----------|--------|
| NDCG@1    | 0.2602 |
| NDCG@5    | 0.4191 |
| NDCG@10   | 0.4632 |
| NDCG@15   | 0.4834 |
| NDCG@20   | 0.4952 |
| NDCG@25   | 0.5030 |
| NDCG@50   | 0.5183 |
| NDCG@100  | 0.5243 |
| NDCG@200  | 0.5264 |

## Environmental Impact

- **Hardware:** 1× NVIDIA P100
- **Hours Used:** ~7
- **Cloud Provider:** Kaggle

## Citation

### Model

If you use this model, please cite:

```bibtex
@misc{mohsin2025fashionclip,
  title        = {fashion-clip-vit-large-patch14: A Fine-Tuned CLIP Model for Fashion Image-Text Retrieval},
  author       = {Md Mohsin},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/mohsin416/fashion-clip-vit-large}}
}
```

## Model Card Contact

**Md Mohsin**

- **Hugging Face:** https://huggingface.co/mohsin416
- **GitHub:** https://github.com/mdmohsin212/
- **LinkedIn:** https://www.linkedin.com/in/mohsin416/