Initial commit.

Files changed (12)

README.md +124 -0
text/config.json +39 -0
text/model.onnx +3 -0
text/special_tokens_map.json +23 -0
text/spiece.model +3 -0
text/tokenizer_config.json +34 -0
vision/config.json +9 -0
vision/model.onnx +3 -0
vision/preprocessor_config.json +24 -0
vision/special_tokens_map.json +23 -0
vision/spiece.model +3 -0
vision/tokenizer_config.json +34 -0

README.md ADDED

	@@ -0,0 +1,124 @@

+# SigLIP Base Patch16-256 Multilingual ONNX
+This directory contains ONNX exports of the `google/siglip-base-patch16-256-multilingual` model.
+## Model Description
+SigLIP (Sigmoid Loss for Language Image Pre-training) is a multimodal model similar to CLIP but with key improvements:
+- **Sigmoid Loss**: Uses a pairwise sigmoid loss instead of CLIP's softmax-based contrastive loss, so the loss does not depend on normalizing over the whole batch
+- **No Global Normalization**: Operates on image-text pairs independently without requiring global batch statistics
+- **Better Multilingual Support**: Enhanced multilingual capabilities across 28+ languages
+- **Resolution**: 256x256 pixels with 16x16 patches
+## Directory Structure
+```
+siglip-base-patch16-256-multilingual-onnx/
+├── vision/
+│   ├── model.onnx          # Vision encoder
+│   ├── config.json         # Model configuration
+│   ├── preprocessor_config.json
+│   ├── special_tokens_map.json
+│   ├── spiece.model        # SentencePiece model
+│   └── tokenizer_config.json
+├── text/
+│   ├── model.onnx          # Text encoder
+│   ├── config.json         # Model configuration
+│   ├── special_tokens_map.json
+│   ├── spiece.model        # SentencePiece model
+│   └── tokenizer_config.json
+└── README.md
+```
+## Installation
+```bash
+pip install onnxruntime numpy pillow transformers sentencepiece
+```
+For GPU support:
+```bash
+pip install onnxruntime-gpu
+```
+## Usage
+### Python Example
+```python
+import numpy as np
+import onnxruntime as ort
+from PIL import Image
+from transformers import AutoProcessor
+# Load processors
+processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")
+# Load ONNX sessions
+vision_session = ort.InferenceSession("vision/model.onnx")
+text_session = ort.InferenceSession("text/model.onnx")
+# Process image
+image = Image.open("your_image.jpg").convert("RGB")  # the vision encoder expects 3 channels
+image_inputs = processor(images=image, return_tensors="np")
+image_embeddings = vision_session.run(None, {"pixel_values": image_inputs["pixel_values"]})[0]
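+# Process text (SigLIP was trained with padding="max_length")
+texts = ["a photo of a cat", "une photo d'un chat", "una foto de un gato"]
+text_inputs = processor(text=texts, padding="max_length", return_tensors="np")
+text_feed = {"input_ids": text_inputs["input_ids"]}
+# SiglipTokenizer returns only input_ids by default. If the exported graph also
+# declares an attention_mask input, pass an all-ones mask (an assumption that
+# mirrors calling the original model without a mask).
+if any(i.name == "attention_mask" for i in text_session.get_inputs()):
+    text_feed["attention_mask"] = np.ones_like(text_inputs["input_ids"])
+text_embeddings = text_session.run(None, text_feed)[0]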
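+# Compute similarity with a sigmoid (not a softmax as in CLIP).
+# The original model L2-normalizes both embeddings and applies a learned logit
+# scale and bias before the sigmoid. Unless the exported graphs already apply
+# them, the values below are uncalibrated scores, useful mainly for ranking.
+image_embeddings = image_embeddings / np.linalg.norm(image_embeddings, axis=-1, keepdims=True)
+text_embeddings = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
+logits = np.dot(image_embeddings, text_embeddings.T)
+probs = 1 / (1 + np.exp(-logits))  # sigmoid activation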
+print("Probabilities:")
+for i, text in enumerate(texts):
+    print(f"  {text}: {probs[0][i]:.2%}")
+```
+## Key Differences from CLIP
+1. **Activation Function**: SigLIP uses sigmoid instead of softmax
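+2. **Loss Function**: Pairwise sigmoid loss instead of softmax-based contrastive loss
+3. **Learned Scale and Bias**: Embeddings are L2-normalized as in CLIP, then a learned logit scale and a learned bias are applied before the sigmoid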
+4. **Independent Pairs**: Each image-text pair is processed independently
+## Performance
+- **Inference Speed**: ~2-3x faster than PyTorch model
+- **Memory Usage**: ~50% reduction compared to PyTorch
+- **Accuracy**: Identical outputs to original model (within floating-point precision)
+## Supported Languages
+Arabic (ar), Bengali (bn), German (de), Greek (el), English (en), Spanish (es),
+Finnish (fi), French (fr), Hebrew (he), Hindi (hi), Indonesian (id), Italian (it),
+Japanese (ja), Korean (ko), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt),
+Romanian (ro), Russian (ru), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th),
+Turkish (tr), Ukrainian (uk), Vietnamese (vi), Chinese (zh)
+## Model Details
+- **Vision Encoder**: ViT-Base with 256x256 input, 16x16 patches
+- **Text Encoder**: Transformer-based with SentencePiece tokenizer
+- **Embedding Dimension**: 768
+- **ONNX Opset**: 14
+- **Precision**: FP32
+## Citation
+```bibtex
+@article{zhai2023sigmoid,
+  title={Sigmoid Loss for Language Image Pre-Training},
+  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
+  journal={arXiv preprint arXiv:2303.15343},
+  year={2023}
+}
+```
+## License
+Please refer to the original model's license at: https://huggingface.co/google/siglip-base-patch16-256-multilingual
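
The usage example in the README takes the sigmoid of a dot product. In the SigLIP formulation the embeddings are first L2-normalized, then the logits are multiplied by a learned scale and shifted by a learned bias. A minimal sketch of that scoring step follows. The function name `siglip_probs` is hypothetical, `logit_scale` is assumed to be stored in log space (as in the Transformers implementation), and the numeric values in the usage lines are placeholders, not the values from this checkpoint.

```python
import numpy as np


def siglip_probs(image_emb, text_emb, logit_scale, logit_bias):
    """Pairwise sigmoid scores; rows index images, columns index texts.

    logit_scale is assumed to be in log space, so it is exponentiated here.
    """
    # L2-normalize each embedding so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    # Scale and shift the similarities, then score each pair independently.
    logits = image_emb @ text_emb.T * np.exp(logit_scale) + logit_bias
    return 1.0 / (1.0 + np.exp(-logits))


# Placeholder scale and bias for illustration only; the real values are
# learned parameters of the original checkpoint.
rng = np.random.default_rng(0)
scores = siglip_probs(rng.normal(size=(1, 768)), rng.normal(size=(3, 768)),
                      logit_scale=np.log(10.0), logit_bias=-10.0)
print(scores.shape)
```

Because each pair is scored independently, the values in a row do not sum to 1, unlike a softmax over candidate texts.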

text/config.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "model_type": "siglip_text",
+  "hidden_size": 768,
+  "max_seq_length": 64,
+  "vocab_size": 250000,
+  "model_name": "google/siglip-base-patch16-256-multilingual",
+  "onnx_export_version": "1.0",
+  "tokenizer_type": "sentencepiece",
+  "languages": [
+    "ar",
+    "bn",
+    "de",
+    "el",
+    "en",
+    "es",
+    "fi",
+    "fr",
+    "he",
+    "hi",
+    "id",
+    "it",
+    "ja",
+    "ko",
+    "nl",
+    "no",
+    "pl",
+    "pt",
+    "ro",
+    "ru",
+    "sv",
+    "sw",
+    "ta",
+    "th",
+    "tr",
+    "uk",
+    "vi",
+    "zh"
+  ]
+}

text/model.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:56c210978f9dd3fa158d0400c148d0893c53a8ebd9a3ba25f9b01a87f5ab177b
+size 1111037737
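
The three lines above are a Git LFS pointer, not the model binary: they record the hash and byte size of the real file. A minimal sketch, assuming a local copy of the file has been downloaded, of parsing such a pointer and checking a file against it. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at path has the size and hash in the pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        # Hash in chunks so a file of about 1 GB is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer("""version https://git-lfs.github.com/spec/v1
oid sha256:56c210978f9dd3fa158d0400c148d0893c53a8ebd9a3ba25f9b01a87f5ab177b
size 1111037737
""")
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("text/model.onnx", pointer)` on a downloaded copy would then confirm that the file is the one this commit refers to.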

text/special_tokens_map.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": true,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "</s>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": true,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": true,
+    "single_word": false
+  }
+}

text/spiece.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ef78f86560d809067d12bac6c09f19a462cb3af3f54d2b8acbba26e1433125d6
+size 4309802

text/tokenizer_config.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "added_tokens_decoder": {
+    "1": {
+      "content": "</s>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<unk>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [],
+  "clean_up_tokenization_spaces": true,
+  "do_lower_case": true,
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "model_input_names": [
+    "input_ids"
+  ],
+  "model_max_length": 64,
+  "pad_token": "</s>",
+  "processor_class": "SiglipProcessor",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "SiglipTokenizer",
+  "unk_token": "<unk>"
+}

vision/config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "model_type": "siglip_vision",
+  "hidden_size": 768,
+  "image_size": 256,
+  "patch_size": 16,
+  "num_channels": 3,
+  "model_name": "google/siglip-base-patch16-256-multilingual",
+  "onnx_export_version": "1.0"
+}
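
The `image_size` and `patch_size` fields determine how many patches the vision encoder sees. A short sketch of that arithmetic, using only the values from the config above; the helper name `patch_grid` is hypothetical.

```python
import json

# Fields copied from vision/config.json above.
vision_config = json.loads(
    '{"image_size": 256, "patch_size": 16, "num_channels": 3, "hidden_size": 768}'
)


def patch_grid(cfg):
    """Return (patches per side, total patches, raw values per patch)."""
    side, remainder = divmod(cfg["image_size"], cfg["patch_size"])
    if remainder:
        raise ValueError("image_size must be a multiple of patch_size")
    values_per_patch = cfg["patch_size"] ** 2 * cfg["num_channels"]
    return side, side * side, values_per_patch


print(patch_grid(vision_config))  # (16, 256, 768)
```

A 256x256 image therefore becomes a 16x16 grid of 256 patches, each holding 16 * 16 * 3 = 768 raw pixel values before projection.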

vision/model.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dad19503fe2ad3a8a4fb752d8431003b15961ceffe7e13c193fa4a5f1915e88d
+size 372014748

vision/preprocessor_config.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "do_convert_rgb": null,
+  "do_normalize": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "image_processor_type": "SiglipImageProcessor",
+  "image_std": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "processor_class": "SiglipProcessor",
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098,
+  "size": {
+    "height": 256,
+    "width": 256
+  }
+}
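
This config describes a resize to 256x256 with `resample` 3 (bicubic in Pillow), a rescale by 1/255, and a normalization with mean 0.5 and standard deviation 0.5, which maps pixels to the range [-1, 1]. A minimal sketch of the same steps with Pillow and NumPy, for cases where `transformers` is not available. The function name `preprocess` is hypothetical, and the RGB conversion and channels-first output layout are assumptions, since the config leaves `do_convert_rgb` unset.

```python
import numpy as np
from PIL import Image


def preprocess(image, height=256, width=256, mean=0.5, std=0.5):
    """Turn a PIL image into a (1, 3, height, width) float32 array."""
    # Assumption: force three channels, since do_convert_rgb is null.
    image = image.convert("RGB")
    # Pillow's resize takes (width, height); resample=3 is bicubic.
    image = image.resize((width, height), resample=Image.BICUBIC)
    # rescale_factor 0.00392156862745098 is 1/255.
    pixels = np.asarray(image, dtype=np.float32) * (1.0 / 255.0)
    pixels = (pixels - mean) / std
    # Channels first, plus a batch dimension.
    return pixels.transpose(2, 0, 1)[None]


example = preprocess(Image.new("RGB", (640, 480), (255, 255, 255)))
print(example.shape, example.dtype)
```

The result can be passed as `pixel_values` to the vision session, assuming the exported graph uses that input name as in the README example.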

vision/special_tokens_map.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": true,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "</s>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": true,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": true,
+    "single_word": false
+  }
+}

vision/spiece.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ef78f86560d809067d12bac6c09f19a462cb3af3f54d2b8acbba26e1433125d6
+size 4309802

vision/tokenizer_config.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "added_tokens_decoder": {
+    "1": {
+      "content": "</s>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<unk>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": true,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [],
+  "clean_up_tokenization_spaces": true,
+  "do_lower_case": true,
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "model_input_names": [
+    "input_ids"
+  ],
+  "model_max_length": 64,
+  "pad_token": "</s>",
+  "processor_class": "SiglipProcessor",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "SiglipTokenizer",
+  "unk_token": "<unk>"
+}