SmolVLM-Instruct-AWQ-4bit
This is a 4-bit AWQ quantized version of HuggingFaceTB/SmolVLM-Instruct, a 2.2B parameter vision-language model.
Model Details
- Base Model: HuggingFaceTB/SmolVLM-Instruct
- Quantization Method: AWQ W4A16 (4-bit weights, 16-bit activations)
- Quantization Tool: llm-compressor
- Model Size: 1.97 GB (55.1% reduction from 4.4 GB)
- Architecture: Idefics3 (SigLIP vision encoder + SmolLM2-1.7B text decoder, Llama-style layers)
- Layers Quantized: 168 linear layers compressed
What's Quantized
✅ Quantized to 4-bit:
- Text decoder (24 LlamaDecoderLayer blocks)
- All attention projections (q_proj, k_proj, v_proj, o_proj)
- All MLP layers (gate_proj, up_proj, down_proj)
- Total: 168 linear layers
❌ Preserved at full precision:
- Vision encoder/tower (SigLIP)
- Vision-text connector
- Language model head
- All layer norms and biases
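If you want to verify this split on the published checkpoint, one option is to read the quantization metadata that llm-compressor writes into config.json. The sketch below assumes a compressed-tensors style quantization_config entry with ignore and config_groups keys; exact key names can vary between versions.

# Sketch: inspect the quantization metadata shipped with the checkpoint.
# Assumes a compressed-tensors style "quantization_config" entry in config.json;
# key names can differ between llm-compressor / compressed-tensors versions.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download(
    repo_id="ronantakizawa/SmolVLM-Instruct-AWQ-4bit",
    filename="config.json",
)
with open(cfg_path) as f:
    quant_cfg = json.load(f).get("quantization_config", {})

print(quant_cfg.get("ignore", []))         # expected: lm_head, vision/connector patterns
print(quant_cfg.get("config_groups", {}))  # expected: a W4A16 group targeting Linear layers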
Usage
Requirements
pip install transformers torch pillow accelerate
Basic Usage
from transformers import Idefics3ForConditionalGeneration, AutoProcessor
from PIL import Image
import requests
# Load model and processor
model = Idefics3ForConditionalGeneration.from_pretrained(
    "ronantakizawa/SmolVLM-Instruct-AWQ-4bit",
    device_map="auto",
    torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained("ronantakizawa/SmolVLM-Instruct-AWQ-4bit")
# Load an image
url = "https://huggingface.co/spaces/merve/chatml-llava/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Create prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]
# Generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
Using with vLLM (Production Deployment)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model ronantakizawa/SmolVLM-Instruct-AWQ-4bit \
    --quantization awq \
    --dtype auto
Then use the OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)
response = client.chat.completions.create(
    model="ronantakizawa/SmolVLM-Instruct-AWQ-4bit",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What's in this image?"}
            ]
        }
    ]
)
print(response.choices[0].message.content)
Quantization Details
Calibration Data
- Calibration Dataset: lmms-lab/flickr30k
- Calibration Samples: 256 images
- Sequence Length: 2048 tokens
Quantization Parameters
AWQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "re:.*lm_head",
        "re:.*vision_model.*",
        "re:.*connector.*",
        "re:.*vision_tower.*"
    ]
)
Sequential Targets
- Target layers: LlamaDecoderLayer
- Pipeline: Sequential (layer-by-layer calibration)
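For reference, the one-shot run that produced this checkpoint likely looked roughly like the sketch below. This is a reconstruction from the recipe and calibration settings listed above, using llm-compressor's oneshot API; the preprocessing/collation of the Flickr30k image-text pairs into model inputs is elided, and argument names (e.g. sequential_targets, max_seq_length) may differ between llm-compressor versions.

# Rough reconstruction of the quantization run (not the exact original script).
# Assumes llm-compressor's oneshot API; Flickr30k preprocessing/collation into
# model inputs is omitted here and is version-dependent.
from transformers import Idefics3ForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

model_id = "HuggingFaceTB/SmolVLM-Instruct"
model = Idefics3ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)

recipe = AWQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["re:.*lm_head", "re:.*vision_model.*", "re:.*connector.*", "re:.*vision_tower.*"],
)

oneshot(
    model=model,
    # In practice the Flickr30k samples must first be preprocessed into
    # processor inputs (and a data collator supplied); elided for brevity.
    dataset="lmms-lab/flickr30k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
    sequential_targets=["LlamaDecoderLayer"],  # layer-by-layer calibration
)

model.save_pretrained("SmolVLM-Instruct-AWQ-4bit", save_compressed=True)
processor.save_pretrained("SmolVLM-Instruct-AWQ-4bit")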
Performance
| Metric | Value |
|---|---|
| Original Size | 4.4 GB |
| Quantized Size | 1.97 GB |
| Compression Ratio | 2.23x (55.1% reduction) |
| Layers Compressed | 168 linear layers |
| GPU Memory (inference) | ~2-3 GB |
| Vision Quality | Preserved (vision tower kept at full precision) |
| Text Quality | Under 1% quality degradation in DocVQA |
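As a quick sanity check against the memory figures above, the short sketch below reuses the model loaded in the Basic Usage snippet; note that real inference memory also includes activations, image embeddings, and the KV cache.

# Sketch: weight-only footprint of the loaded model. get_memory_footprint()
# is a standard transformers method; actual inference memory is higher
# because of activations, image embeddings, and the KV cache.
footprint_gb = model.get_memory_footprint() / 1e9
print(f"Weights occupy ~{footprint_gb:.2f} GB")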
Inference Speed
- Calibration is faster than with GPTQ thanks to activation-aware scaling
- Inference throughput similar to or slightly faster than FP16, since 4-bit weights reduce memory-bandwidth pressure (see the rough timing sketch below)
- Well suited to consumer GPUs (RTX 3090, 4090, etc.)
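A rough, single-request way to measure decode throughput on your own hardware, reusing model and inputs from the Basic Usage snippet; numbers vary with GPU, prompt length, and batch size.

# Sketch: crude tokens/second measurement for a single request.
import time

start = time.perf_counter()
output_ids = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = output_ids.shape[-1] - inputs["input_ids"].shape[-1]
print(f"~{new_tokens / elapsed:.1f} generated tokens/s")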
Limitations
- Slight quality degradation: 4-bit quantization introduces minor quality loss in text generation
- AWQ-specific: Requires AWQ-compatible inference engines (vLLM, transformers)
- Vision tower not quantized: Vision encoder remains at full precision to preserve image understanding
Technical Notes
This model was quantized using custom patches to llm-compressor to support the idefics3 architecture:
- Fixed meta tensor materialization issues in sequential pipeline
- Enabled AWQ quantization for vision-language models
- Patches available at: ronantakizawa/llm-compressor
Citation
If you use this model, please cite the original SmolVLM work:
@misc{smolvlm2024,
  title={SmolVLM: Small Vision-Language Model},
  author={HuggingFace Team},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct}
}
And the quantization tool:
@software{llmcompressor2024,
  title={LLM Compressor},
  author={Neural Magic},
  year={2024},
  url={https://github.com/vllm-project/llm-compressor}
}
License
This model inherits the Apache 2.0 license from the base model.
Acknowledgments
- Base model: HuggingFaceTB/SmolVLM-Instruct
- Quantization: llm-compressor
- Calibration data: lmms-lab/flickr30k