🧩 DeepVK-USER-BGE-M3: Quantized ONNX (INT8)

✨ This repository contains a quantized INT8 ONNX version of deepvk/USER-bge-m3.
It is designed for fast CPU inference with ONNX Runtime, making it a good choice for semantic search, embedding generation, and text-similarity tasks in Russian 🇷🇺 and English 🇬🇧.


πŸ” Model Card

Property Value
Base model deepvk/USER-bge-m3, BAAI/bge-m3
Quantization INT8 (Dynamic)
Format ONNX
Libraries transformers, onnxruntime, optimum, sentence-transformers
Embedding dim 1024
Supported HW CPU (optimized for Intel AVX512-VNNI, fallback to AVX2)
License Apache-2.0
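
The export pipeline itself is not part of this repository. For context, below is a minimal sketch of how a dynamic INT8 quantization like this one can be produced with onnxruntime.quantization; the file paths are illustrative and the actual export settings may differ.

```python
# Minimal sketch (not necessarily the exact pipeline used for this repo):
# dynamic INT8 quantization of an existing FP32 ONNX export with ONNX Runtime.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",             # FP32 ONNX export of deepvk/USER-bge-m3 (illustrative path)
    model_output="model_quantized.onnx",  # INT8 result
    weight_type=QuantType.QInt8,          # quantize weights to signed 8-bit integers
)
```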

🚀 Features

  • ⚡ Fast CPU inference: ONNX + INT8 gives roughly a 2× speed-up on CPU (see the benchmark below).
  • 📦 Lightweight: reduced model size and lower memory footprint.
  • 🔄 Drop-in replacement: embeddings are compatible with the FP32 version (see the usage sketch below).
  • 🌍 Multilingual: supports Russian 🇷🇺 and English 🇬🇧.
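
Below is a minimal usage sketch (not one of the ready-made example scripts): it loads model_quantized.onnx with ONNX Runtime and takes the L2-normalized [CLS] vector, which is the usual BGE-M3 dense-embedding convention. That the first output is the last hidden state is an assumption; check the actual input/output names of the export.

```python
# Rough usage sketch; verify the export's input/output names before relying on it.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("skatzR/USER-BGE-M3-ONNX-INT8")
session = ort.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])
input_names = {i.name for i in session.get_inputs()}

def encode(texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    feeds = {k: v.astype(np.int64) for k, v in enc.items() if k in input_names}
    hidden = session.run(None, feeds)[0]   # assumed: first output is last_hidden_state
    cls = hidden[:, 0]                     # BGE-M3 dense embedding = [CLS] token vector
    return cls / np.linalg.norm(cls, axis=1, keepdims=True)  # L2-normalize

emb = encode(["Привет, мир!", "Hello, world!"])
print(emb.shape)  # (2, 1024)
```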

🧠 Intended Use

✅ Recommended for:

  • Semantic search & retrieval systems (a toy example is sketched after this section)
  • Recommendation pipelines
  • Text similarity & clustering
  • Low-latency CPU deployments

❌ Not ideal for:

  • Scenarios that require absolute maximum accuracy (INT8 introduces a minor loss)
  • GPU-optimized pipelines (prefer the FP16/FP32 models instead)
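
As a toy illustration of the semantic-search use case, the snippet below ranks documents against a query using the illustrative encode() helper sketched in the Features section (so it is not self-contained); because the embeddings are L2-normalized, cosine similarity reduces to a dot product.

```python
# Toy semantic search, reusing the illustrative encode() helper from the sketch above.
import numpy as np

docs = [
    "Кошка спит на диване.",
    "The stock market fell sharply today.",
    "A cat is sleeping on the sofa.",
]
query = "Где спит кот?"

doc_emb = encode(docs)            # (3, 1024), already L2-normalized
query_emb = encode([query])[0]    # (1024,)

scores = doc_emb @ query_emb      # dot product == cosine similarity for normalized vectors
for idx in np.argsort(-scores):   # highest similarity first
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```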

βš–οΈ Pros & Cons of Quantized ONNX

Pros βœ…

  • Easy to use (no calibration dataset required).
  • Smaller & faster than FP32.
  • Works out of the box with ONNX Runtime.

Cons ❌

  • Slight accuracy drop compared to static quantization.
  • AVX512 optimizations only on modern Intel CPUs.
  • No GPU acceleration in this export.

📊 Benchmark

  • Avg cosine similarity vs FP32: ~0.988
  • Median cosine similarity vs FP32: ~0.988
  • Original (FP32) model time: 0.7504 s
  • Quantized (INT8) model time: 0.3539 s
  • Inference speed: ~2× faster
  • Quantized model size: 347.5 MB
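
The numbers above are the author's measurements. A rough sketch of how such a comparison could be reproduced is shown below; it assumes the FP32 reference loads via sentence-transformers (listed in the libraries above) and reuses the illustrative encode() helper from earlier.

```python
# Rough reproduction sketch of the FP32 vs INT8 comparison (illustrative, not the repo's benchmark script).
import time
import numpy as np
from sentence_transformers import SentenceTransformer

texts = ["Пример предложения для замера.", "An example sentence for benchmarking."] * 16

fp32 = SentenceTransformer("deepvk/USER-bge-m3")      # FP32 reference
t0 = time.perf_counter()
ref = fp32.encode(texts, normalize_embeddings=True)
fp32_time = time.perf_counter() - t0

t0 = time.perf_counter()
quant = encode(texts)                                 # INT8 ONNX embeddings (helper sketched earlier)
int8_time = time.perf_counter() - t0

cos = np.sum(ref * quant, axis=1)                     # both sets are L2-normalized
print(f"FP32 {fp32_time:.3f}s | INT8 {int8_time:.3f}s | "
      f"mean cos {cos.mean():.4f} | median cos {np.median(cos):.4f}")
```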

📂 Files

  • model_quantized.onnx: the quantized INT8 model
  • tokenizer.json, vocab.txt, special_tokens_map.json: tokenizer files
  • config.json: model config
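
If you don't want to clone the whole repository, the files can be fetched with huggingface_hub (standard Hub utilities, nothing specific to this model):

```python
# Download the quantized model, or the whole repo, from the Hugging Face Hub.
from huggingface_hub import hf_hub_download, snapshot_download

onnx_path = hf_hub_download(repo_id="skatzR/USER-BGE-M3-ONNX-INT8", filename="model_quantized.onnx")
local_dir = snapshot_download(repo_id="skatzR/USER-BGE-M3-ONNX-INT8")  # model + tokenizer + config
print(onnx_path, local_dir)
```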


🧩 Examples

You can try the model directly in Google Colab via the "Open In Colab" notebook linked in this repository.

You can also try it with the ready-to-use scripts in the examples folder:

  • quantmodel.py: universal Python module for loading the quantized ONNX model and encoding texts.
  • app-console.py: console script that compares FP32 vs INT8 embeddings (cosine similarity + inference time).
  • app-streamlit.py: interactive Streamlit demo.