ajaleksa committed on
Commit 14c3f06 · verified · 1 Parent(s): 05e0bb1

Initial commit.
README.md ADDED
@@ -0,0 +1,124 @@
+ # SigLIP Base Patch16-256 Multilingual ONNX
+
+ This directory contains ONNX exports of the `google/siglip-base-patch16-256-multilingual` model.
+
+ ## Model Description
+
+ SigLIP (Sigmoid Loss for Language Image Pre-training) is a multimodal model similar to CLIP, with several key improvements:
+
+ - **Sigmoid Loss**: Uses a pairwise sigmoid loss instead of a softmax contrastive loss, which scales better with batch size
+ - **No Global Normalization**: Operates on image-text pairs independently, without requiring global batch statistics
+ - **Better Multilingual Support**: Enhanced multilingual capabilities across 28 languages (listed below)
+ - **Resolution**: 256x256 pixel input with 16x16 patches
+
+ ## Directory Structure
+
+ ```
+ siglip-base-patch16-256-multilingual-onnx/
+ ├── vision/
+ │   ├── model.onnx                # Vision encoder
+ │   ├── config.json               # Model configuration
+ │   └── preprocessor_config.json  # Image preprocessing settings
+ ├── text/
+ │   ├── model.onnx                # Text encoder
+ │   ├── config.json               # Model configuration
+ │   ├── tokenizer_config.json     # Tokenizer configuration
+ │   ├── special_tokens_map.json
+ │   └── spiece.model              # SentencePiece model
+ └── README.md
+ ```
+
+ ## Installation
+
+ ```bash
+ pip install onnxruntime pillow transformers
+ ```
+
+ For GPU support:
+
+ ```bash
+ pip install onnxruntime-gpu
+ ```
+
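+ If `onnxruntime-gpu` is installed, you can request the CUDA execution provider when creating the sessions. A minimal sketch (provider availability depends on your local CUDA setup):
+
+ ```python
+ import onnxruntime as ort
+
+ # Prefer CUDA when available and fall back to CPU otherwise.
+ providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
+ vision_session = ort.InferenceSession("vision/model.onnx", providers=providers)
+ text_session = ort.InferenceSession("text/model.onnx", providers=providers)
+ print(vision_session.get_providers())  # shows which providers were actually loaded
+ ```
+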
+ ## Usage
+
+ ### Python Example
+
+ ```python
+ import numpy as np
+ import onnxruntime as ort
+ from PIL import Image
+ from transformers import AutoProcessor
+
+ # Load the processor (image preprocessing + tokenizer)
+ processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")
+
+ # Load ONNX sessions
+ vision_session = ort.InferenceSession("vision/model.onnx")
+ text_session = ort.InferenceSession("text/model.onnx")
+
+ # Process image (the preprocessor does not convert to RGB, so do it explicitly)
+ image = Image.open("your_image.jpg").convert("RGB")
+ image_inputs = processor(images=image, return_tensors="np")
+ image_embeddings = vision_session.run(None, {"pixel_values": image_inputs["pixel_values"]})[0]
+
+ # Process text (SigLIP models are normally used with max-length padding, 64 tokens)
+ texts = ["a photo of a cat", "une photo d'un chat", "una foto de un gato"]
+ text_inputs = processor(text=texts, padding="max_length", max_length=64, return_tensors="np")
+
+ # The SigLIP tokenizer returns only input_ids by default (see text/tokenizer_config.json),
+ # so pass an attention mask only if the exported graph declares that input.
+ text_feeds = {"input_ids": text_inputs["input_ids"]}
+ if "attention_mask" in {i.name for i in text_session.get_inputs()}:
+     text_feeds["attention_mask"] = text_inputs.get(
+         "attention_mask", np.ones_like(text_inputs["input_ids"])
+     )
+ text_embeddings = text_session.run(None, text_feeds)[0]
+
+ # Compute similarity using a sigmoid (not softmax as in CLIP)
+ # Note: the full SigLIP model also applies a learned logit scale and bias before the
+ # sigmoid; if those are not baked into the exported graphs, the values below are only
+ # an approximation of the model's image-text matching probabilities.
+ logits = np.dot(image_embeddings, text_embeddings.T)
+ probs = 1 / (1 + np.exp(-logits))  # sigmoid activation
+
+ print("Probabilities:")
+ for i, text in enumerate(texts):
+     print(f"  {text}: {probs[0][i]:.2%}")
+ ```
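+
+ If a `run()` call fails because of an unexpected input name, you can list the names and shapes each exported graph actually declares (a small helper sketch using the standard ONNX Runtime API):
+
+ ```python
+ import onnxruntime as ort
+
+ for path in ("vision/model.onnx", "text/model.onnx"):
+     session = ort.InferenceSession(path)
+     print(path)
+     for tensor in session.get_inputs():
+         print("  input :", tensor.name, tensor.shape, tensor.type)
+     for tensor in session.get_outputs():
+         print("  output:", tensor.name, tensor.shape, tensor.type)
+ ```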
+
+ ## Key Differences from CLIP
+
+ 1. **Activation Function**: SigLIP uses a sigmoid instead of a softmax
+ 2. **Loss Function**: Sigmoid loss instead of a contrastive softmax loss (see the sketch below)
+ 3. **No L2 Normalization**: Embeddings are not L2-normalized
+ 4. **Independent Pairs**: Each image-text pair is scored independently
+
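+ For reference, the training objective from the SigLIP paper is the pairwise sigmoid loss below, where x_i and y_j are the image and text embeddings (L2-normalized in the paper) and t and b are a learned temperature and bias. This is a sketch of the objective only; it is not needed at inference time.
+
+ ```math
+ \mathcal{L} = -\frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{B}|}
+ \log \frac{1}{1 + e^{\, z_{ij} \left( -t \, \mathbf{x}_i \cdot \mathbf{y}_j - b \right)}},
+ \qquad z_{ij} = \begin{cases} \phantom{-}1 & \text{if } i = j \\ -1 & \text{otherwise} \end{cases}
+ ```
+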
+ ## Performance
+
+ - **Inference Speed**: roughly 2-3x faster than the PyTorch model
+ - **Memory Usage**: roughly 50% lower than the PyTorch model
+ - **Accuracy**: matches the original model's outputs to within floating-point tolerance
+
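+ A quick way to sanity-check the accuracy claim on your own hardware is to compare the ONNX vision encoder against the original `transformers` model on the same input. A hedged sketch (it assumes the export returns the pooled output as its first output, and it additionally requires `torch`):
+
+ ```python
+ import numpy as np
+ import onnxruntime as ort
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, SiglipVisionModel
+
+ processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")
+ reference = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-256-multilingual")
+ session = ort.InferenceSession("vision/model.onnx")
+
+ image = Image.open("your_image.jpg").convert("RGB")
+ inputs = processor(images=image, return_tensors="np")
+
+ onnx_out = session.run(None, {"pixel_values": inputs["pixel_values"]})[0]
+ with torch.no_grad():
+     ref_out = reference(pixel_values=torch.from_numpy(inputs["pixel_values"])).pooler_output.numpy()
+
+ # The difference should be small (on the order of 1e-5) if the export matches the reference.
+ print("max abs diff:", np.abs(onnx_out - ref_out).max())
+ ```
+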
+ ## Supported Languages
+
+ Arabic (ar), Bengali (bn), German (de), Greek (el), English (en), Spanish (es),
+ Finnish (fi), French (fr), Hebrew (he), Hindi (hi), Indonesian (id), Italian (it),
+ Japanese (ja), Korean (ko), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt),
+ Romanian (ro), Russian (ru), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th),
+ Turkish (tr), Ukrainian (uk), Vietnamese (vi), Chinese (zh)
+
+ ## Model Details
+
+ - **Vision Encoder**: ViT-Base with 256x256 input and 16x16 patches
+ - **Text Encoder**: Transformer encoder with a SentencePiece tokenizer
+ - **Embedding Dimension**: 768
+ - **ONNX Opset**: 14 (see the check below)
+ - **Precision**: FP32
+
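+ To verify the opset and structural validity of the exported graphs, the `onnx` package (installed separately with `pip install onnx`) can read this metadata. A minimal sketch:
+
+ ```python
+ import onnx
+
+ for path in ("vision/model.onnx", "text/model.onnx"):
+     model = onnx.load(path)
+     opsets = [(imp.domain or "ai.onnx", imp.version) for imp in model.opset_import]
+     print(path, "opsets:", opsets)  # the default-domain entry should report version 14
+     onnx.checker.check_model(model)  # raises if the graph is structurally invalid
+ ```
+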
+ ## Citation
+
+ ```bibtex
+ @article{zhai2023sigmoid,
+   title={Sigmoid Loss for Language Image Pre-Training},
+   author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
+   journal={arXiv preprint arXiv:2303.15343},
+   year={2023}
+ }
+ ```
+
+ ## License
+
+ Please refer to the original model's license at: https://huggingface.co/google/siglip-base-patch16-256-multilingual
text/config.json ADDED
@@ -0,0 +1,39 @@
+ {
+   "model_type": "siglip_text",
+   "hidden_size": 768,
+   "max_seq_length": 64,
+   "vocab_size": 250000,
+   "model_name": "google/siglip-base-patch16-256-multilingual",
+   "onnx_export_version": "1.0",
+   "tokenizer_type": "sentencepiece",
+   "languages": [
+     "ar",
+     "bn",
+     "de",
+     "el",
+     "en",
+     "es",
+     "fi",
+     "fr",
+     "he",
+     "hi",
+     "id",
+     "it",
+     "ja",
+     "ko",
+     "nl",
+     "no",
+     "pl",
+     "pt",
+     "ro",
+     "ru",
+     "sv",
+     "sw",
+     "ta",
+     "th",
+     "tr",
+     "uk",
+     "vi",
+     "zh"
+   ]
+ }
text/model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:56c210978f9dd3fa158d0400c148d0893c53a8ebd9a3ba25f9b01a87f5ab177b
+ size 1111037737
text/special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": true,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "</s>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": true,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": true,
+     "single_word": false
+   }
+ }
text/spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef78f86560d809067d12bac6c09f19a462cb3af3f54d2b8acbba26e1433125d6
+ size 4309802
text/tokenizer_config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "added_tokens_decoder": {
+     "1": {
+       "content": "</s>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "<unk>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "clean_up_tokenization_spaces": true,
+   "do_lower_case": true,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "model_input_names": [
+     "input_ids"
+   ],
+   "model_max_length": 64,
+   "pad_token": "</s>",
+   "processor_class": "SiglipProcessor",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "SiglipTokenizer",
+   "unk_token": "<unk>"
+ }
vision/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "model_type": "siglip_vision",
+   "hidden_size": 768,
+   "image_size": 256,
+   "patch_size": 16,
+   "num_channels": 3,
+   "model_name": "google/siglip-base-patch16-256-multilingual",
+   "onnx_export_version": "1.0"
+ }
vision/model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dad19503fe2ad3a8a4fb752d8431003b15961ceffe7e13c193fa4a5f1915e88d
+ size 372014748
vision/preprocessor_config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "do_convert_rgb": null,
+   "do_normalize": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "image_mean": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "image_processor_type": "SiglipImageProcessor",
+   "image_std": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "processor_class": "SiglipProcessor",
+   "resample": 3,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "height": 256,
+     "width": 256
+   }
+ }
vision/special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": true,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "</s>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": true,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": true,
+     "single_word": false
+   }
+ }
vision/spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef78f86560d809067d12bac6c09f19a462cb3af3f54d2b8acbba26e1433125d6
+ size 4309802
vision/tokenizer_config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "added_tokens_decoder": {
+     "1": {
+       "content": "</s>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "<unk>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "clean_up_tokenization_spaces": true,
+   "do_lower_case": true,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "model_input_names": [
+     "input_ids"
+   ],
+   "model_max_length": 64,
+   "pad_token": "</s>",
+   "processor_class": "SiglipProcessor",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "SiglipTokenizer",
+   "unk_token": "<unk>"
+ }