Initial commit.
- README.md +124 -0
- text/config.json +39 -0
- text/model.onnx +3 -0
- text/special_tokens_map.json +23 -0
- text/spiece.model +3 -0
- text/tokenizer_config.json +34 -0
- vision/config.json +9 -0
- vision/model.onnx +3 -0
- vision/preprocessor_config.json +24 -0
- vision/special_tokens_map.json +23 -0
- vision/spiece.model +3 -0
- vision/tokenizer_config.json +34 -0
README.md
ADDED
@@ -0,0 +1,124 @@
# SigLIP Base Patch16-256 Multilingual ONNX

This directory contains ONNX exports of the `google/siglip-base-patch16-256-multilingual` model.

## Model Description

SigLIP (Sigmoid Loss for Language Image Pre-training) is a multimodal image-text model similar to CLIP, with these key differences:

- **Sigmoid Loss**: Trained with a pairwise sigmoid loss instead of CLIP's softmax-based contrastive loss, which makes it easier to scale the batch size
- **No Global Normalization**: Scores each image-text pair independently, so the loss needs no softmax normalization over the whole batch
- **Better Multilingual Support**: Multilingual text encoder covering the 28 languages listed under Supported Languages
- **Resolution**: 256x256 pixels with 16x16 patches
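
The pairwise sigmoid loss named in the list above can be sketched in a few lines of plain Python. This is an illustrative reading of the loss from the SigLIP paper, not code from this repository: the function names, the toy similarity matrix, and the `scale` and `bias` values are all assumptions chosen for the example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(sim, scale, bias):
    """Sigmoid loss over a square similarity matrix.

    sim[i][j] is the similarity between image i and text j. Matching pairs
    sit on the diagonal (label +1); every other pair is a negative (label -1).
    Each pair contributes its own binary term, so no softmax over the batch
    is needed.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (scale * sim[i][j] + bias))
    return -total / n


# Toy 2x2 batch; scale and bias are hypothetical values for illustration.
sim = [[0.9, 0.1], [0.2, 0.8]]
loss = pairwise_sigmoid_loss(sim, scale=10.0, bias=-5.0)
```

The loss is only relevant for training; inference with the exported ONNX encoders does not compute it.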

## Directory Structure

```
siglip-base-patch16-256-multilingual-onnx/
├── vision/
│   ├── model.onnx                 # Vision encoder
│   ├── config.json                # Model configuration
│   ├── preprocessor_config.json   # Image preprocessing settings
│   ├── special_tokens_map.json
│   ├── spiece.model               # SentencePiece model
│   └── tokenizer_config.json
├── text/
│   ├── model.onnx                 # Text encoder
│   ├── config.json                # Model configuration
│   ├── special_tokens_map.json
│   ├── spiece.model               # SentencePiece model
│   └── tokenizer_config.json      # Tokenizer settings
└── README.md
```

## Installation

```bash
pip install onnxruntime pillow transformers sentencepiece
```

For GPU support, install `onnxruntime-gpu` instead of `onnxruntime`:
```bash
pip install onnxruntime-gpu
```
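Installing the GPU package does not by itself select the GPU at run time. The sketch below shows one way to choose execution providers. `pick_providers` is a hypothetical helper written for this example, while `ort.get_available_providers()` and the `providers=` argument of `ort.InferenceSession` are part of the ONNX Runtime Python API. The fallback branch only exists so the snippet runs where onnxruntime is not installed.

```python
def pick_providers(available):
    """Return the preferred execution providers that are actually available."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [name for name in preferred if name in available]


try:
    import onnxruntime as ort

    available = ort.get_available_providers()
except ImportError:
    # onnxruntime is not installed; assume CPU only for illustration.
    available = ["CPUExecutionProvider"]

providers = pick_providers(available)

# With onnxruntime installed, pass the list when creating a session:
# session = ort.InferenceSession("vision/model.onnx", providers=providers)
```

Listing `CPUExecutionProvider` last keeps a working fallback when CUDA is unavailable.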

## Usage

### Python Example

```python
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import AutoProcessor

# Load the processor (image preprocessing + tokenizer)
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")

# Load ONNX sessions
vision_session = ort.InferenceSession("vision/model.onnx")
text_session = ort.InferenceSession("text/model.onnx")

# Process image
image = Image.open("your_image.jpg").convert("RGB")
image_inputs = processor(images=image, return_tensors="np")
image_embeddings = vision_session.run(None, {"pixel_values": image_inputs["pixel_values"]})[0]

# Process text (SigLIP was trained with padding="max_length")
texts = ["a photo of a cat", "une photo d'un chat", "una foto de un gato"]
text_inputs = processor(text=texts, padding="max_length", return_tensors="np")

# Feed only the inputs the exported graph declares. The SigLIP tokenizer
# lists only "input_ids" as a model input, so an attention mask may be absent.
text_input_names = {node.name for node in text_session.get_inputs()}
text_feed = {k: v for k, v in text_inputs.items() if k in text_input_names}
if "attention_mask" in text_input_names and "attention_mask" not in text_feed:
    text_feed["attention_mask"] = np.ones_like(text_feed["input_ids"])
text_embeddings = text_session.run(None, text_feed)[0]

# Compute similarity using sigmoid (not softmax like CLIP!)
# SigLIP scores each pair independently with a sigmoid over the dot product.
# The original model also L2-normalizes the embeddings and applies a learned
# scale and bias before the sigmoid; apply them here if the export does not.
logits = np.dot(image_embeddings, text_embeddings.T)
probs = 1 / (1 + np.exp(-logits))  # sigmoid activation

print("Probabilities:")
for i, text in enumerate(texts):
    print(f"  {text}: {probs[0][i]:.2%}")
```
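The example above applies a sigmoid directly to the raw dot product. In the original SigLIP model, both embeddings are L2-normalized and the logit is multiplied by a learned scale and shifted by a learned bias before the sigmoid. Whether these ONNX exports already include those steps is not stated here, so the following is a sketch under the assumption that they do not. `siglip_probability` is a hypothetical helper, and the `logit_scale` and `logit_bias` values are placeholders, not the parameters of this checkpoint.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def siglip_probability(image_emb, text_emb, logit_scale, logit_bias):
    """sigmoid(exp(logit_scale) * cosine_similarity + logit_bias)."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    cosine = sum(a * b for a, b in zip(img, txt))
    logit = math.exp(logit_scale) * cosine + logit_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Placeholder parameters; read the real ones from the original checkpoint.
logit_scale = math.log(10.0)
logit_bias = -10.0

p_same = siglip_probability([1.0, 0.0], [2.0, 0.0], logit_scale, logit_bias)
p_orthogonal = siglip_probability([1.0, 0.0], [0.0, 3.0], logit_scale, logit_bias)
```

With these placeholder values, identical directions give a logit near zero and orthogonal directions give a strongly negative logit.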

## Key Differences from CLIP

1. **Activation Function**: SigLIP turns logits into probabilities with a sigmoid instead of a softmax, so the scores for several candidate texts need not sum to 1
2. **Loss Function**: Pairwise sigmoid loss instead of a softmax-based contrastive loss
3. **Learned Bias**: Like CLIP, SigLIP L2-normalizes embeddings and applies a learned temperature; it additionally adds a learned bias to the logits
4. **Independent Pairs**: Each image-text pair is processed independently
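
The practical effect of the first and last points can be seen by running the same logits through both activations. The logits below are made-up numbers for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Softmax with the usual max-subtraction for numerical stability."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one image against three captions,
# two of which describe the image equally well.
logits = [2.0, 2.0, -3.0]

sigmoid_probs = [sigmoid(v) for v in logits]   # each pair scored on its own
softmax_probs = softmax(logits)                # captions compete for mass
```

Under softmax the two good captions split the probability mass and each falls below 0.5, while under sigmoid both keep a high score because each pair is judged independently.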

## Performance

- **Inference Speed**: ~2-3x faster than the PyTorch model; the actual speedup depends on hardware, execution provider, and batch size
- **Memory Usage**: ~50% reduction compared to PyTorch; this also varies with the runtime configuration
- **Accuracy**: Identical outputs to original model (within floating-point precision)
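
The figures above are approximate and depend on the setup, so it is worth measuring on the target machine. Below is a minimal timing sketch using only the standard library. `benchmark` is a hypothetical helper, and the lambda is a stand-in workload; in practice it would wrap a call such as `vision_session.run(...)` from the usage example.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload so the snippet runs without a model.
median_seconds = benchmark(lambda: sum(range(10_000)))
```

Comparing the median for the ONNX session and for the PyTorch model on identical inputs gives a speedup figure specific to that hardware.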

## Supported Languages

Arabic (ar), Bengali (bn), German (de), Greek (el), English (en), Spanish (es),
Finnish (fi), French (fr), Hebrew (he), Hindi (hi), Indonesian (id), Italian (it),
Japanese (ja), Korean (ko), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt),
Romanian (ro), Russian (ru), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th),
Turkish (tr), Ukrainian (uk), Vietnamese (vi), Chinese (zh)

## Model Details

- **Vision Encoder**: ViT-Base with 256x256 input, 16x16 patches
- **Text Encoder**: Transformer-based with SentencePiece tokenizer, maximum sequence length 64
- **Embedding Dimension**: 768
- **ONNX Opset**: 14
- **Precision**: FP32
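
Two quantities follow from the details above by simple arithmetic: the number of image patches, and a rough upper bound on the parameter count of each encoder. The file sizes are taken from the Git LFS pointers in this commit. The parameter estimate assumes that nearly all bytes in each `.onnx` file are FP32 weights, which ignores graph metadata and is only an approximation.

```python
# Patch grid from vision/config.json
image_size = 256
patch_size = 16
patches_per_side = image_size // patch_size   # 16
num_patches = patches_per_side ** 2           # 256 patches per image

# Rough parameter estimate: FP32 stores 4 bytes per weight.
BYTES_PER_FP32 = 4
vision_file_bytes = 372_014_748     # vision/model.onnx (LFS pointer size)
text_file_bytes = 1_111_037_737     # text/model.onnx (LFS pointer size)

vision_params_millions = vision_file_bytes / BYTES_PER_FP32 / 1e6
text_params_millions = text_file_bytes / BYTES_PER_FP32 / 1e6
```

The text encoder file is about three times the size of the vision encoder file, which is consistent with the large multilingual vocabulary (`vocab_size` of 250000 in `text/config.json`).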

## Citation

```bibtex
@article{zhai2023sigmoid,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Zhai, Xiaohua and others},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}
```

## License

Please refer to the original model's license at: https://huggingface.co/google/siglip-base-patch16-256-multilingual
text/config.json
ADDED
@@ -0,0 +1,39 @@
{
  "model_type": "siglip_text",
  "hidden_size": 768,
  "max_seq_length": 64,
  "vocab_size": 250000,
  "model_name": "google/siglip-base-patch16-256-multilingual",
  "onnx_export_version": "1.0",
  "tokenizer_type": "sentencepiece",
  "languages": [
    "ar",
    "bn",
    "de",
    "el",
    "en",
    "es",
    "fi",
    "fr",
    "he",
    "hi",
    "id",
    "it",
    "ja",
    "ko",
    "nl",
    "no",
    "pl",
    "pt",
    "ro",
    "ru",
    "sv",
    "sw",
    "ta",
    "th",
    "tr",
    "uk",
    "vi",
    "zh"
  ]
}
text/model.onnx
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:56c210978f9dd3fa158d0400c148d0893c53a8ebd9a3ba25f9b01a87f5ab177b
size 1111037737
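
The three lines above are a Git LFS pointer, not the model itself: they record the spec version, the SHA-256 of the real file, and its size in bytes. A downloaded file can be checked against a pointer as sketched below. `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for this example, not part of any Git LFS tooling.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer) -> bool:
    """Check size and SHA-256 of in-memory data against a parsed pointer."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:56c210978f9dd3fa158d0400c148d0893c53a8ebd9a3ba25f9b01a87f5ab177b\n"
    "size 1111037737\n"
)
```

For a file of this size, hashing in chunks with `hashlib.sha256().update(...)` while reading from disk is preferable to loading everything into memory.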
text/special_tokens_map.json
ADDED
@@ -0,0 +1,23 @@
{
  "eos_token": {
    "content": "</s>",
    "lstrip": true,
    "normalized": false,
    "rstrip": true,
    "single_word": false
  },
  "pad_token": {
    "content": "</s>",
    "lstrip": true,
    "normalized": false,
    "rstrip": true,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": true,
    "normalized": false,
    "rstrip": true,
    "single_word": false
  }
}
text/spiece.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ef78f86560d809067d12bac6c09f19a462cb3af3f54d2b8acbba26e1433125d6
size 4309802
text/tokenizer_config.json
ADDED
@@ -0,0 +1,34 @@
{
  "added_tokens_decoder": {
    "1": {
      "content": "</s>",
      "lstrip": true,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<unk>",
      "lstrip": true,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [],
  "clean_up_tokenization_spaces": true,
  "do_lower_case": true,
  "eos_token": "</s>",
  "extra_special_tokens": {},
  "model_input_names": [
    "input_ids"
  ],
  "model_max_length": 64,
  "pad_token": "</s>",
  "processor_class": "SiglipProcessor",
  "sp_model_kwargs": {},
  "tokenizer_class": "SiglipTokenizer",
  "unk_token": "<unk>"
}
}
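This configuration implies fixed-length inputs: `model_max_length` is 64, the pad token is `</s>`, and `added_tokens_decoder` maps `</s>` to id 1. The sketch below shows what padding to the maximum length looks like at the id level. `pad_to_max_length` is a hypothetical helper, the token ids in the example are made up, and the plain truncation is a simplification of what the real tokenizer does.

```python
def pad_to_max_length(token_ids, max_length=64, pad_id=1):
    """Truncate to max_length, then right-pad with pad_id."""
    ids = list(token_ids)[:max_length]
    return ids + [pad_id] * (max_length - len(ids))


# Made-up token ids ending in the EOS id 1.
example_ids = [262, 1411, 1]
padded = pad_to_max_length(example_ids)
```

Because `model_input_names` lists only `input_ids`, the tokenizer is not expected to return an attention mask by default, so padded positions are passed to the text encoder as ordinary `</s>` tokens.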
vision/config.json
ADDED
@@ -0,0 +1,9 @@
{
  "model_type": "siglip_vision",
  "hidden_size": 768,
  "image_size": 256,
  "patch_size": 16,
  "num_channels": 3,
  "model_name": "google/siglip-base-patch16-256-multilingual",
  "onnx_export_version": "1.0"
}
vision/model.onnx
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dad19503fe2ad3a8a4fb752d8431003b15961ceffe7e13c193fa4a5f1915e88d
size 372014748
vision/preprocessor_config.json
ADDED
@@ -0,0 +1,24 @@
{
  "do_convert_rgb": null,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "SiglipImageProcessor",
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "processor_class": "SiglipProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "height": 256,
    "width": 256
  }
}
}
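Taken together, these settings describe a short pipeline: resize to 256x256 (`resample` 3 is Pillow's bicubic filter), rescale pixel values by 1/255, then normalize each channel with mean 0.5 and standard deviation 0.5. The per-pixel arithmetic is sketched below. `preprocess_pixel` is a hypothetical helper for illustration, and the resize step is left out.

```python
RESCALE_FACTOR = 0.00392156862745098  # 1 / 255, from "rescale_factor"
IMAGE_MEAN = 0.5                      # same value for all three channels
IMAGE_STD = 0.5


def preprocess_pixel(value: int) -> float:
    """Map an 8-bit channel value (0..255) to the model's input range."""
    rescaled = value * RESCALE_FACTOR           # now in [0, 1]
    return (rescaled - IMAGE_MEAN) / IMAGE_STD  # now in [-1, 1]


darkest = preprocess_pixel(0)
brightest = preprocess_pixel(255)
```

The result is that `pixel_values` fed to the vision encoder lie in the range [-1, 1].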
vision/special_tokens_map.json
ADDED
@@ -0,0 +1,23 @@
{
  "eos_token": {
    "content": "</s>",
    "lstrip": true,
    "normalized": false,
    "rstrip": true,
    "single_word": false
  },
  "pad_token": {
    "content": "</s>",
    "lstrip": true,
    "normalized": false,
    "rstrip": true,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": true,
    "normalized": false,
    "rstrip": true,
    "single_word": false
  }
}
vision/spiece.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ef78f86560d809067d12bac6c09f19a462cb3af3f54d2b8acbba26e1433125d6
size 4309802
vision/tokenizer_config.json
ADDED
@@ -0,0 +1,34 @@
{
  "added_tokens_decoder": {
    "1": {
      "content": "</s>",
      "lstrip": true,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<unk>",
      "lstrip": true,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [],
  "clean_up_tokenization_spaces": true,
  "do_lower_case": true,
  "eos_token": "</s>",
  "extra_special_tokens": {},
  "model_input_names": [
    "input_ids"
  ],
  "model_max_length": 64,
  "pad_token": "</s>",
  "processor_class": "SiglipProcessor",
  "sp_model_kwargs": {},
  "tokenizer_class": "SiglipTokenizer",
  "unk_token": "<unk>"
}