🧩 DeepVK-USER-BGE-M3: Quantized ONNX (INT8)

✨ This repository contains a quantized INT8 ONNX version of deepvk/USER-bge-m3.
It is designed for fast CPU inference with ONNX Runtime, making it a good choice for semantic search, embedding generation, and text-similarity tasks in Russian 🇷🇺 and English 🇬🇧.


πŸ” Model Card

Property Value
Base model deepvk/USER-bge-m3, BAAI/bge-m3
Quantization INT8 (Dynamic)
Format ONNX
Libraries transformers, onnxruntime, optimum, sentence-transformers
Embedding dim 1024
Supported HW CPU (optimized for Intel AVX512-VNNI, fallback to AVX2)
License Apache-2.0
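
The export pipeline itself is not part of this repository. For context, below is a minimal sketch of how a dynamic INT8 quantization like this one can be produced with onnxruntime.quantization; the file paths are illustrative and the actual export settings may differ.

```python
# Minimal sketch (not necessarily the exact pipeline used for this repo):
# dynamic INT8 quantization of an existing FP32 ONNX export with ONNX Runtime.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",             # FP32 ONNX export of deepvk/USER-bge-m3 (illustrative path)
    model_output="model_quantized.onnx",  # INT8 result
    weight_type=QuantType.QInt8,          # quantize weights to signed 8-bit integers
)
```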

🚀 Features

  • ⚡ Fast CPU inference: ONNX + INT8 gives roughly a 2× speed-up on CPU (see the benchmark below).
  • 📦 Lightweight: reduced model size and lower memory footprint.
  • 🔄 Drop-in replacement: embeddings are compatible with the FP32 version (see the usage sketch below).
  • 🌍 Multilingual: supports Russian 🇷🇺 and English 🇬🇧.
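
Below is a minimal usage sketch (not one of the ready-made example scripts): it loads model_quantized.onnx with ONNX Runtime and takes the L2-normalized [CLS] vector, which is the usual BGE-M3 dense-embedding convention. That the first output is the last hidden state is an assumption; check the actual input/output names of the export.

```python
# Rough usage sketch; verify the export's input/output names before relying on it.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("skatzR/USER-BGE-M3-ONNX-INT8")
session = ort.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])
input_names = {i.name for i in session.get_inputs()}

def encode(texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    feeds = {k: v.astype(np.int64) for k, v in enc.items() if k in input_names}
    hidden = session.run(None, feeds)[0]   # assumed: first output is last_hidden_state
    cls = hidden[:, 0]                     # BGE-M3 dense embedding = [CLS] token vector
    return cls / np.linalg.norm(cls, axis=1, keepdims=True)  # L2-normalize

emb = encode(["Привет, мир!", "Hello, world!"])
print(emb.shape)  # (2, 1024)
```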

🧠 Intended Use

✅ Recommended for:

  • Semantic search & retrieval systems (a toy example is sketched after this section)
  • Recommendation pipelines
  • Text similarity & clustering
  • Low-latency CPU deployments

❌ Not ideal for:

  • Scenarios that require absolute maximum accuracy (INT8 introduces a minor loss)
  • GPU-optimized pipelines (prefer the FP16/FP32 models instead)
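
As a toy illustration of the semantic-search use case, the snippet below ranks documents against a query using the illustrative encode() helper sketched in the Features section (so it is not self-contained); because the embeddings are L2-normalized, cosine similarity reduces to a dot product.

```python
# Toy semantic search, reusing the illustrative encode() helper from the sketch above.
import numpy as np

docs = [
    "Кошка спит на диване.",
    "The stock market fell sharply today.",
    "A cat is sleeping on the sofa.",
]
query = "Где спит кот?"

doc_emb = encode(docs)            # (3, 1024), already L2-normalized
query_emb = encode([query])[0]    # (1024,)

scores = doc_emb @ query_emb      # dot product == cosine similarity for normalized vectors
for idx in np.argsort(-scores):   # highest similarity first
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```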

βš–οΈ Pros & Cons of Quantized ONNX

Pros βœ…

  • Easy to use (no calibration dataset required).
  • Smaller & faster than FP32.
  • Works out of the box with ONNX Runtime.

Cons ❌

  • Slight accuracy drop compared to static quantization.
  • AVX512 optimizations only on modern Intel CPUs.
  • No GPU acceleration in this export.

📊 Benchmark

  • Avg cosine similarity vs FP32: ~0.988
  • Median cosine similarity vs FP32: ~0.988
  • Original (FP32) model time: 0.7504 s
  • Quantized (INT8) model time: 0.3539 s
  • Inference speed: ~2× faster
  • Quantized model size: 347.5 MB
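
The numbers above are the author's measurements. A rough sketch of how such a comparison could be reproduced is shown below; it assumes the FP32 reference loads via sentence-transformers (listed in the libraries above) and reuses the illustrative encode() helper from earlier.

```python
# Rough reproduction sketch of the FP32 vs INT8 comparison (illustrative, not the repo's benchmark script).
import time
import numpy as np
from sentence_transformers import SentenceTransformer

texts = ["Пример предложения для замера.", "An example sentence for benchmarking."] * 16

fp32 = SentenceTransformer("deepvk/USER-bge-m3")      # FP32 reference
t0 = time.perf_counter()
ref = fp32.encode(texts, normalize_embeddings=True)
fp32_time = time.perf_counter() - t0

t0 = time.perf_counter()
quant = encode(texts)                                 # INT8 ONNX embeddings (helper sketched earlier)
int8_time = time.perf_counter() - t0

cos = np.sum(ref * quant, axis=1)                     # both sets are L2-normalized
print(f"FP32 {fp32_time:.3f}s | INT8 {int8_time:.3f}s | "
      f"mean cos {cos.mean():.4f} | median cos {np.median(cos):.4f}")
```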

📂 Files

  • model_quantized.onnx: the quantized INT8 model
  • tokenizer.json, vocab.txt, special_tokens_map.json: tokenizer files
  • config.json: model config
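
If you don't want to clone the whole repository, the files can be fetched with huggingface_hub (standard Hub utilities, nothing specific to this model):

```python
# Download the quantized model, or the whole repo, from the Hugging Face Hub.
from huggingface_hub import hf_hub_download, snapshot_download

onnx_path = hf_hub_download(repo_id="skatzR/USER-BGE-M3-ONNX-INT8", filename="model_quantized.onnx")
local_dir = snapshot_download(repo_id="skatzR/USER-BGE-M3-ONNX-INT8")  # model + tokenizer + config
print(onnx_path, local_dir)
```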


🧩 Examples

You can try the model directly in Google Colab via the "Open In Colab" notebook linked in this repository.

You can also try it with the ready-to-use scripts in the examples folder:

  • quantmodel.py: universal Python module for loading the quantized ONNX model and encoding texts.
  • app-console.py: console script that compares FP32 vs INT8 embeddings (cosine similarity + inference time).
  • app-streamlit.py: interactive Streamlit demo.