# SigLIP Base Patch16-256 Multilingual ONNX
This directory contains ONNX exports of the `google/siglip-base-patch16-256-multilingual` model.
## Model Description
SigLIP (Sigmoid Loss for Language Image Pre-training) is a multimodal model similar to CLIP but with key improvements:
- **Sigmoid Loss**: Uses a pairwise sigmoid loss instead of the softmax contrastive loss, allowing for better scaling and performance (see the sketch after this list)
- **No Global Normalization**: Operates on image-text pairs independently without requiring global batch statistics
- **Better Multilingual Support**: Enhanced multilingual capabilities across 28+ languages
- **Resolution**: 256x256 pixels with 16x16 patches
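The sigmoid loss scores every image-text pair in a batch independently, so no softmax over the whole batch is required. A minimal NumPy sketch of the idea, assuming L2-normalized embeddings; the temperature `t` and bias `b` are learned in the real model and the values below are only illustrative:
```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: row i of img_emb and txt_emb form a matching pair."""
    logits = t * img_emb @ txt_emb.T + b          # (batch, batch) pair scores
    labels = 2 * np.eye(len(img_emb)) - 1         # +1 on the diagonal (matches), -1 elsewhere
    log_sigmoid = -np.logaddexp(0.0, -labels * logits)  # numerically stable log(sigmoid(x))
    return -log_sigmoid.sum() / len(img_emb)      # average over the batch
```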
## Directory Structure
```
siglip-base-patch16-256-multilingual-onnx/
├── vision/
│   ├── model.onnx               # Vision encoder
│   ├── config.json              # Model configuration
│   └── preprocessor_config.json
├── text/
│   ├── model.onnx               # Text encoder
│   ├── config.json              # Model configuration
│   ├── tokenizer.json           # Fast tokenizer
│   ├── special_tokens_map.json
│   └── spiece.model             # SentencePiece model
└── README.md
```
## Installation
```bash
pip install onnxruntime pillow transformers
```
For GPU support:
```bash
pip install onnxruntime-gpu
```
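With `onnxruntime-gpu` installed, you can ask ONNX Runtime to place the sessions on the GPU and fall back to the CPU when CUDA is unavailable. A minimal sketch, assuming the directory layout above:
```python
import onnxruntime as ort

# Preferred providers, in order; ONNX Runtime falls back to the next one if CUDA is missing.
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
vision_session = ort.InferenceSession("vision/model.onnx", providers=providers)
text_session = ort.InferenceSession("text/model.onnx", providers=providers)

print(vision_session.get_providers())  # shows which providers are actually active
```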
## Usage
### Python Example
```python
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import AutoProcessor

# Load the processor (image preprocessing + tokenizer)
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")
# Load ONNX sessions
vision_session = ort.InferenceSession("vision/model.onnx")
text_session = ort.InferenceSession("text/model.onnx")
# Process image
image = Image.open("your_image.jpg").convert("RGB")
image_inputs = processor(images=image, return_tensors="np")
image_embeddings = vision_session.run(None, {"pixel_values": image_inputs["pixel_values"]})[0]
# Process text
# SigLIP was trained with padding="max_length"; use it so results match the original model
texts = ["a photo of a cat", "une photo d'un chat", "una foto de un gato"]
text_inputs = processor(text=texts, padding="max_length", return_tensors="np")
text_embeddings = text_session.run(None, {
    "input_ids": text_inputs["input_ids"],
    "attention_mask": text_inputs["attention_mask"]
})[0]
# Compute similarity using sigmoid (not softmax as in CLIP)
# Note: the original model also applies a learned logit scale and bias before the sigmoid,
# so the raw dot-product probabilities below are indicative rather than calibrated.
logits = np.dot(image_embeddings, text_embeddings.T)
probs = 1 / (1 + np.exp(-logits))  # sigmoid activation
print("Probabilities:")
for i, text in enumerate(texts):
    print(f"  {text}: {probs[0][i]:.2%}")
```
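The same two sessions can be wrapped into small helpers for batch retrieval, for example ranking a set of images against a single query text. A sketch that reuses the `processor`, `vision_session`, and `text_session` objects from the example above; `image_paths` is a hypothetical list of file paths:
```python
def encode_images(image_paths):
    """Return stacked image embeddings for a list of image file paths."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    pixel_values = processor(images=images, return_tensors="np")["pixel_values"]
    return vision_session.run(None, {"pixel_values": pixel_values})[0]

def encode_texts(texts):
    """Return stacked text embeddings for a list of strings."""
    inputs = processor(text=texts, padding="max_length", return_tensors="np")
    return text_session.run(None, {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
    })[0]

# Rank images by their sigmoid score against one query text
image_emb = encode_images(image_paths)
query_emb = encode_texts(["a photo of a cat"])
scores = 1 / (1 + np.exp(-(image_emb @ query_emb.T).ravel()))  # one score per image
ranking = np.argsort(-scores)  # best match first
```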
## Key Differences from CLIP
1. **Activation Function**: SigLIP applies a sigmoid to each image-text score instead of a softmax over candidates (contrasted in the sketch below)
2. **Loss Function**: Pairwise sigmoid loss instead of the softmax contrastive loss
3. **No L2 Normalization**: Embeddings are not L2-normalized
4. **Independent Pairs**: Each image-text pair is scored independently
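The practical effect of points 1 and 4 shows up at inference time: CLIP's softmax forces candidate texts to compete for probability mass, while SigLIP scores each text on its own. A small NumPy sketch with made-up logits:
```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])  # made-up image-text scores for three candidate texts

# CLIP-style: softmax over the candidates -- probabilities are relative and sum to 1
softmax_probs = np.exp(logits) / np.exp(logits).sum()

# SigLIP-style: independent sigmoid per pair -- each text can match (or not) on its own
sigmoid_probs = 1 / (1 + np.exp(-logits))

print(softmax_probs.sum())  # 1.0
print(sigmoid_probs)        # three independent probabilities, need not sum to 1
```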
## Performance
- **Inference Speed**: ~2-3x faster than the PyTorch model
- **Memory Usage**: ~50% reduction compared to PyTorch
- **Accuracy**: Outputs match the original model within floating-point tolerance
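Speed-ups of this kind depend on hardware and batch size, so it is worth measuring on your own machine. A minimal latency check for the vision encoder; it only times the ONNX session, and the PyTorch side can be timed the same way for a comparison:
```python
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("vision/model.onnx")
pixel_values = np.random.rand(1, 3, 256, 256).astype(np.float32)  # dummy 256x256 input

# Warm up, then time repeated runs
for _ in range(3):
    session.run(None, {"pixel_values": pixel_values})

n_runs = 20
start = time.perf_counter()
for _ in range(n_runs):
    session.run(None, {"pixel_values": pixel_values})
print(f"mean latency: {(time.perf_counter() - start) / n_runs * 1000:.1f} ms")
```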
## Supported Languages
Arabic (ar), Bengali (bn), German (de), Greek (el), English (en), Spanish (es),
Finnish (fi), French (fr), Hebrew (he), Hindi (hi), Indonesian (id), Italian (it),
Japanese (ja), Korean (ko), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt),
Romanian (ro), Russian (ru), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th),
Turkish (tr), Ukrainian (uk), Vietnamese (vi), Chinese (zh)
## Model Details
- **Vision Encoder**: ViT-Base with 256x256 input, 16x16 patches
- **Text Encoder**: Transformer-based with SentencePiece tokenizer
- **Embedding Dimension**: 768
- **ONNX Opset**: 14
- **Precision**: FP32
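These details can be confirmed directly from the exported files. A short sketch that prints input/output shapes and the opset version; it additionally uses the `onnx` package, which is not listed in the installation step above:
```python
import onnx
import onnxruntime as ort

for path in ("vision/model.onnx", "text/model.onnx"):
    session = ort.InferenceSession(path)
    print(path)
    for inp in session.get_inputs():
        print("  input :", inp.name, inp.shape, inp.type)
    for out in session.get_outputs():
        print("  output:", out.name, out.shape, out.type)  # embedding dim should be 768
    print("  opset :", onnx.load(path).opset_import[0].version)
```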
## Citation
```bibtex
@article{zhai2023sigmoid,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}
```
## License
Please refer to the original model's license at: https://huggingface.co/google/siglip-base-patch16-256-multilingual