SmolVLM-Instruct-AWQ-4bit

This is a 4-bit AWQ quantized version of HuggingFaceTB/SmolVLM-Instruct, a 2.2B parameter vision-language model.

Model Details

Base Model: HuggingFaceTB/SmolVLM-Instruct
Quantization Method: AWQ W4A16 (4-bit weights, 16-bit activations)
Quantization Tool: llm-compressor
Model Size: 1.97 GB (55.1% reduction from 4.4 GB)
Architecture: Idefics3 (vision encoder + Llama-3.2 text decoder)
Layers Quantized: 168 linear layers compressed

What's Quantized

✅ Quantized to 4-bit:

Text decoder (24 LlamaDecoderLayer blocks)
All attention projections (q_proj, k_proj, v_proj, o_proj)
All MLP layers (gate_proj, up_proj, down_proj)
Total: 168 linear layers

❌ Preserved at full precision:

Vision encoder/tower (SigLIP)
Vision-text connector
Language model head
All layer norms and biases

Usage

Requirements

pip install transformers torch pillow

Basic Usage

from transformers import Idefics3ForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

# Load model and processor
model = Idefics3ForConditionalGeneration.from_pretrained(
    "ronantakizawa/SmolVLM-Instruct-AWQ-4bit",
    device_map="auto",
    torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained("ronantakizawa/SmolVLM-Instruct-AWQ-4bit")

# Load an image
url = "https://huggingface.co/spaces/merve/chatml-llava/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Create prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

Using with vLLM (Production Deployment)

pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model ronantakizawa/SmolVLM-Instruct-AWQ-4bit \
    --quantization awq \
    --dtype auto

Then use the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.chat.completions.create(
    model="ronantakizawa/SmolVLM-Instruct-AWQ-4bit",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What's in this image?"}
            ]
        }
    ]
)

print(response.choices[0].message.content)

Quantization Details

Training Data

Calibration Dataset: lmms-lab/flickr30k
Calibration Samples: 256 images
Sequence Length: 2048 tokens

Quantization Parameters

AWQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "re:.*lm_head",
        "re:.*vision_model.*",
        "re:.*connector.*",
        "re:.*vision_tower.*"
    ]
)

Sequential Targets

Target layers: LlamaDecoderLayer
Pipeline: Sequential (layer-by-layer calibration)

Performance

Metric	Value
Original Size	4.4 GB
Quantized Size	1.97 GB
Compression Ratio	2.23x (55.1% reduction)
Layers Compressed	168 linear layers
GPU Memory (inference)	~2-3 GB
Vision Quality	Preserved (no degradation)
Text Quality	Under 1% quality degradation in DocVQA

Inference Speed

Faster calibration than GPTQ due to activation-aware scaling
Similar or slightly faster than fp16 due to reduced memory bandwidth
Ideal for deployment on consumer GPUs (RTX 3090, 4090, etc.)

Limitations

Slight quality degradation: 4-bit quantization introduces minor quality loss in text generation
AWQ-specific: Requires AWQ-compatible inference engines (vLLM, transformers)
Vision tower not quantized: Vision encoder remains at full precision to preserve image understanding

Technical Notes

This model was quantized using custom patches to llm-compressor to support the idefics3 architecture:

Fixed meta tensor materialization issues in sequential pipeline
Enabled AWQ quantization for vision-language models
Patches available at: ronantakizawa/llm-compressor

Citation

If you use this model, please cite the original SmolVLM work:

@misc{smolvlm2024,
  title={SmolVLM: Small Vision-Language Model},
  author={HuggingFace Team},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct}
}

And the quantization tool:

@software{llmcompressor2024,
  title={LLM Compressor},
  author={Neural Magic},
  year={2024},
  url={https://github.com/vllm-project/llm-compressor}
}