# DeepVK-USER-BGE-M3 – Quantized ONNX (INT8)
This repository contains a quantized INT8 ONNX version of deepvk/USER-bge-m3.
It is designed for fast CPU inference with ONNX Runtime, making it a good fit for semantic search, embedding generation, and text-similarity tasks in Russian and English.
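A minimal quick-start sketch with onnxruntime and transformers (assumptions: the first graph output is the last hidden state, and dense embeddings use CLS pooling with L2 normalization as in BGE-M3; the scripts in the examples folder show the exact pipeline):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Tokenizer files ship with the repo; the ONNX graph is model_quantized.onnx
tokenizer = AutoTokenizer.from_pretrained("skatzR/USER-BGE-M3-ONNX-INT8")
session = ort.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])
input_names = {i.name for i in session.get_inputs()}

def encode(texts):
    # Tokenize to NumPy int64 tensors and keep only the inputs the graph expects
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    feeds = {k: v for k, v in enc.items() if k in input_names}
    last_hidden = session.run(None, feeds)[0]   # assumption: (batch, seq_len, 1024)
    cls = last_hidden[:, 0]                     # CLS pooling (BGE-M3 dense embedding)
    return cls / np.linalg.norm(cls, axis=1, keepdims=True)  # L2-normalize

print(encode(["Привет, мир!", "Hello, world!"]).shape)  # -> (2, 1024)
```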
## Model Card
| Property | Value |
|---|---|
| Base model | deepvk/USER-bge-m3 (itself based on BAAI/bge-m3) |
| Quantization | INT8 (Dynamic) |
| Format | ONNX |
| Libraries | transformers, onnxruntime, optimum, sentence-transformers |
| Embedding dim | 1024 |
| Supported HW | CPU (optimized for Intel AVX512-VNNI, fallback to AVX2) |
| License | Apache-2.0 |
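ONNX Runtime picks the INT8 CPU kernels (AVX512-VNNI or AVX2) automatically at load time; the main knobs left to the user are the session options. A small sketch of typical CPU settings (the thread count is an illustrative value):

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 4  # threads used inside a single inference call
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession(
    "model_quantized.onnx",
    sess_options=sess_options,
    providers=["CPUExecutionProvider"],
)
print(session.get_providers())  # ['CPUExecutionProvider']
```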
## Features
- Fast CPU inference – ONNX + INT8 roughly halves inference time (see Benchmark below).
- Lightweight – reduced model size and lower memory footprint.
- Drop-in replacement – embeddings stay close to the FP32 version (~0.988 average cosine similarity).
- Multilingual – supports Russian and English.
## Intended Use
Recommended for:
- Semantic search & retrieval systems (see the sketch after this section)
- Recommendation pipelines
- Text similarity & clustering
- Low-latency CPU deployments
Not ideal for:
- Scenarios that require maximum accuracy (INT8 introduces a small quality loss)
- GPU-optimized pipelines (prefer FP16/FP32 models instead)
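As an example of the semantic-search use case, the sketch below ranks a few documents against a Russian query with cosine similarity; it reuses the `encode` helper from the quick-start sketch, and the texts are purely illustrative.

```python
import numpy as np

documents = [
    "Как настроить ONNX Runtime на CPU",   # "How to set up ONNX Runtime on CPU"
    "Рецепт борща с говядиной",            # "Beef borscht recipe"
    "Quantization reduces model size",
]
query = "ускорение инференса на процессоре"  # "speeding up inference on the CPU"

doc_emb = encode(documents)      # (3, 1024), already L2-normalized
query_emb = encode([query])[0]   # (1024,)

# Cosine similarity reduces to a dot product for normalized vectors
scores = doc_emb @ query_emb
for rank in np.argsort(-scores):
    print(f"{scores[rank]:.3f}  {documents[rank]}")
```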
## Pros & Cons of Quantized ONNX
Pros:
- Easy to use (no calibration dataset required).
- Smaller & faster than FP32.
- Works out of the box with ONNX Runtime.
Cons:
- Slight accuracy drop compared to the FP32 model.
- AVX512-VNNI optimizations are only available on modern Intel CPUs; older CPUs fall back to AVX2.
- No GPU acceleration in this export.
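The exact export settings are not published here; dynamic INT8 quantization of an ONNX export generally looks like the following sketch using onnxruntime's quantization tools (the input file name is a placeholder for an FP32 ONNX export of the base model):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization: weights are stored as INT8, activations are quantized
# on the fly at runtime, so no calibration dataset is needed.
quantize_dynamic(
    model_input="model.onnx",             # placeholder: FP32 ONNX export of deepvk/USER-bge-m3
    model_output="model_quantized.onnx",  # INT8 result, as shipped in this repo
    weight_type=QuantType.QInt8,
)
```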
## Benchmark
| Metric | Value |
|---|---|
| Avg cosine similarity (vs FP32) | ~0.988 |
| Median cosine similarity (vs FP32) | ~0.988 |
| FP32 model inference time (s) | 0.7504 |
| INT8 model inference time (s) | 0.3539 |
| Inference speed-up | ~2× faster |
| Model size (MB) | 347.5 |
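A hedged sketch of how numbers like these can be reproduced, assuming the FP32 checkpoint loads with sentence-transformers (listed under Libraries above) and reusing the `encode` helper from the quick-start sketch; absolute timings depend on hardware and batch size:

```python
import time
import numpy as np
from sentence_transformers import SentenceTransformer

texts = ["пример текста"] * 32 + ["an example sentence"] * 32

fp32_model = SentenceTransformer("deepvk/USER-bge-m3")

t0 = time.perf_counter()
fp32_emb = fp32_model.encode(texts, normalize_embeddings=True)
t_fp32 = time.perf_counter() - t0

t0 = time.perf_counter()
int8_emb = encode(texts)  # quantized ONNX path defined earlier
t_int8 = time.perf_counter() - t0

# Both embedding sets are L2-normalized, so cosine similarity is a row-wise dot product
cos = np.sum(fp32_emb * int8_emb, axis=1)
print(f"FP32: {t_fp32:.3f}s  INT8: {t_int8:.3f}s  mean cos: {cos.mean():.4f}")
```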
## Files
- `model_quantized.onnx` – quantized model
- `tokenizer.json`, `vocab.txt`, `special_tokens_map.json` – tokenizer
- `config.json` – model config
## Examples
You can try the model directly in Google Colab. The notebook demonstrates:
- Loading the original FP32 model `deepvk/USER-bge-m3`
- Loading the quantized INT8 ONNX model `skatzR/USER-BGE-M3-ONNX-INT8`
- Comparing quality (cosine similarity) and inference speed side by side
You can also try the model with ready-to-use scripts in the examples folder:
- `quantmodel.py` – universal Python module for loading and encoding texts with the quantized ONNX model.
- `app-console.py` – console script to compare FP32 vs INT8 embeddings (cosine similarity + inference time).
- `app-streamlit.py` – interactive demo with Streamlit.