# SigLIP Base Patch16-256 Multilingual ONNX

This directory contains ONNX exports of the `google/siglip-base-patch16-256-multilingual` model.

## Model Description

SigLIP (Sigmoid Loss for Language Image Pre-training) is a multimodal model similar to CLIP but with key improvements:

- **Sigmoid Loss**: Uses a pairwise sigmoid loss instead of the softmax-based contrastive loss, which scales better with batch size (see the sketch after this list)
- **No Global Normalization**: Operates on image-text pairs independently without requiring global batch statistics
- **Better Multilingual Support**: Enhanced multilingual capabilities across 28+ languages
- **Resolution**: 256x256 pixels with 16x16 patches
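
For intuition, here is a minimal NumPy sketch of how SigLIP turns a single image-text pair into a probability: a sigmoid over the pair's similarity, independent of every other pair in the batch. The scale and bias below are illustrative placeholders, not the checkpoint's learned values.

```python
import numpy as np

# A 256x256 input with 16x16 patches yields (256 // 16) ** 2 = 256 patch tokens
num_patches = (256 // 16) ** 2

def pair_probability(img_emb, txt_emb, logit_scale=1.0, logit_bias=0.0):
    """Sigmoid score for one image-text pair (placeholder scale/bias)."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_emb / np.linalg.norm(txt_emb)
    logit = logit_scale * np.dot(img, txt) + logit_bias
    return 1.0 / (1.0 + np.exp(-logit))  # does not depend on any other pair

rng = np.random.default_rng(0)
print(num_patches, pair_probability(rng.normal(size=768), rng.normal(size=768)))
```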

## Directory Structure

```
siglip-base-patch16-256-multilingual-onnx/
β”œβ”€β”€ vision/
β”‚   β”œβ”€β”€ model.onnx          # Vision encoder
β”‚   β”œβ”€β”€ config.json         # Model configuration
β”‚   └── preprocessor_config.json
β”œβ”€β”€ text/
β”‚   β”œβ”€β”€ model.onnx          # Text encoder
β”‚   β”œβ”€β”€ config.json         # Model configuration
β”‚   β”œβ”€β”€ tokenizer.json      # Fast tokenizer
β”‚   β”œβ”€β”€ special_tokens_map.json
β”‚   └── spiece.model        # SentencePiece model
└── README.md
```
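
Input and output tensor names can differ between exports, so it is worth confirming what each ONNX file expects before wiring it up. A quick check with `onnxruntime`, run from this directory:

```python
import onnxruntime as ort

for path in ("vision/model.onnx", "text/model.onnx"):
    session = ort.InferenceSession(path)
    print(path)
    print("  inputs: ", [(i.name, i.shape) for i in session.get_inputs()])
    print("  outputs:", [(o.name, o.shape) for o in session.get_outputs()])
```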

## Installation

```bash
pip install onnxruntime pillow transformers
```

For GPU support:
```bash
pip install onnxruntime-gpu
```
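
With `onnxruntime-gpu` installed, pass the CUDA execution provider (with a CPU fallback) when creating the sessions used in the example below:

```python
import onnxruntime as ort

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
vision_session = ort.InferenceSession("vision/model.onnx", providers=providers)
text_session = ort.InferenceSession("text/model.onnx", providers=providers)
print(vision_session.get_providers())  # shows which providers are actually active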

## Usage

### Python Example

```python
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import AutoProcessor

# Load processors
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-256-multilingual")

# Load ONNX sessions
vision_session = ort.InferenceSession("vision/model.onnx")
text_session = ort.InferenceSession("text/model.onnx")

# Process image
image = Image.open("your_image.jpg")
image_inputs = processor(images=image, return_tensors="np")
image_embeddings = vision_session.run(None, {"pixel_values": image_inputs["pixel_values"]})[0]

# Process text
texts = ["a photo of a cat", "une photo d'un chat", "una foto de un gato"]
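# Note: SigLIP checkpoints were trained with fixed-length padding; if the scores
# look off, try padding="max_length", max_length=64 in the call below.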
text_inputs = processor(text=texts, padding=True, return_tensors="np")
text_embeddings = text_session.run(None, {
    "input_ids": text_inputs["input_ids"],
    "attention_mask": text_inputs["attention_mask"]
})[0]

# Compute pairwise similarity and map it to per-pair scores with a sigmoid
# (SigLIP), rather than a softmax over all texts (CLIP). This is a simplified
# score: the full SiglipModel also L2-normalizes the embeddings and applies a
# learned logit scale and bias before the sigmoid (see the note after the example).
logits = np.dot(image_embeddings, text_embeddings.T)
probs = 1 / (1 + np.exp(-logits))  # elementwise sigmoid

print("Sigmoid scores:")
for i, text in enumerate(texts):
    print(f"  {text}: {probs[0][i]:.2%}")
```
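
The scores above come straight from the raw ONNX embeddings. To reproduce the probabilities of the full `SiglipModel`, the embeddings are L2-normalized and a learned logit scale and bias are applied before the sigmoid. Those two scalars live in the original PyTorch checkpoint; the sketch below assumes `torch` and `transformers` are installed and that the `logit_scale` / `logit_bias` attribute names match your `transformers` version.

```python
import numpy as np
from transformers import SiglipModel

# Read the learned scale and bias from the original checkpoint (assumed attribute names)
model = SiglipModel.from_pretrained("google/siglip-base-patch16-256-multilingual")
logit_scale = model.logit_scale.exp().item()
logit_bias = model.logit_bias.item()

# L2-normalize the ONNX embeddings, then scale, shift, and squash
img = image_embeddings / np.linalg.norm(image_embeddings, axis=-1, keepdims=True)
txt = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
full_probs = 1 / (1 + np.exp(-(np.dot(img, txt.T) * logit_scale + logit_bias)))
```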

## Key Differences from CLIP

1. **Activation Function**: SigLIP uses sigmoid instead of softmax
2. **Loss Function**: Pairwise sigmoid loss instead of the softmax-based contrastive (InfoNCE) loss (see the sketch below)
3. **Learned Logit Bias**: Besides a learned scale, as in CLIP, the similarity logits include a learned bias term
4. **Independent Pairs**: Each image-text pair is processed independently
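
To make the loss-function difference concrete, here is an illustrative NumPy version of the batch-level sigmoid loss: positives sit on the diagonal of the similarity matrix (label +1), everything else is a negative (label -1), and each pair contributes independently, with no row-wise softmax. The scale and bias are placeholders, not learned values.

```python
import numpy as np

def sigmoid_loss(image_embeds, text_embeds, logit_scale=1.0, logit_bias=0.0):
    """Illustrative pairwise sigmoid loss over a batch (placeholder scale/bias)."""
    img = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = logit_scale * img @ txt.T + logit_bias      # (batch, batch)
    labels = 2.0 * np.eye(len(img)) - 1.0                # +1 on the diagonal, -1 elsewhere
    return np.mean(np.log1p(np.exp(-labels * logits)))   # -log sigmoid(label * logit)

rng = np.random.default_rng(0)
print(sigmoid_loss(rng.normal(size=(4, 768)), rng.normal(size=(4, 768))))
```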

## Performance

- **Inference Speed**: ~2-3x faster than the PyTorch model (hardware- and batch-size-dependent)
- **Memory Usage**: ~50% lower than the PyTorch model
- **Accuracy**: Matches the original model's outputs to within floating-point tolerance

## Supported Languages

Arabic (ar), Bengali (bn), German (de), Greek (el), English (en), Spanish (es), 
Finnish (fi), French (fr), Hebrew (he), Hindi (hi), Indonesian (id), Italian (it), 
Japanese (ja), Korean (ko), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt), 
Romanian (ro), Russian (ru), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), 
Turkish (tr), Ukrainian (uk), Vietnamese (vi), Chinese (zh)

## Model Details

- **Vision Encoder**: ViT-Base with 256x256 input, 16x16 patches
- **Text Encoder**: Transformer-based with SentencePiece tokenizer
- **Embedding Dimension**: 768
- **ONNX Opset**: 14
- **Precision**: FP32
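
These figures can be verified directly from the exported graphs, for example with the `onnx` package (`pip install onnx`):

```python
import onnx

for path in ("vision/model.onnx", "text/model.onnx"):
    model = onnx.load(path)
    opset = max(op.version for op in model.opset_import if op.domain in ("", "ai.onnx"))
    out = model.graph.output[0]
    dims = [d.dim_value or d.dim_param for d in out.type.tensor_type.shape.dim]
    print(f"{path}: opset {opset}, first output '{out.name}' with shape {dims}")
```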

## Citation

```bibtex
@article{zhai2023sigmoid,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}
```

## License

Please refer to the original model's license at: https://huggingface.co/google/siglip-base-patch16-256-multilingual