SmolVLM-Instruct-AWQ-4bit

This is a 4-bit AWQ quantized version of HuggingFaceTB/SmolVLM-Instruct, a 2.2B parameter vision-language model.

Model Details

  • Base Model: HuggingFaceTB/SmolVLM-Instruct
  • Quantization Method: AWQ W4A16 (4-bit weights, 16-bit activations)
  • Quantization Tool: llm-compressor
  • Model Size: 1.97 GB (55.1% reduction from 4.4 GB)
  • Architecture: Idefics3 (vision encoder + Llama-3.2 text decoder)
  • Layers Quantized: 168 linear layers compressed

What's Quantized

Quantized to 4-bit:

  • Text decoder (24 LlamaDecoderLayer blocks)
  • All attention projections (q_proj, k_proj, v_proj, o_proj)
  • All MLP layers (gate_proj, up_proj, down_proj)
  • Total: 168 linear layers

Preserved at full precision:

  • Vision encoder/tower (SigLIP)
  • Vision-text connector
  • Language model head
  • All layer norms and biases

Usage

Requirements

pip install transformers torch pillow

Basic Usage

from transformers import Idefics3ForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

# Load model and processor
model = Idefics3ForConditionalGeneration.from_pretrained(
    "ronantakizawa/SmolVLM-Instruct-AWQ-4bit",
    device_map="auto",
    torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained("ronantakizawa/SmolVLM-Instruct-AWQ-4bit")

# Load an image
url = "https://huggingface.co/spaces/merve/chatml-llava/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Create prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

Using with vLLM (Production Deployment)

pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model ronantakizawa/SmolVLM-Instruct-AWQ-4bit \
    --quantization awq \
    --dtype auto

Then use the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.chat.completions.create(
    model="ronantakizawa/SmolVLM-Instruct-AWQ-4bit",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What's in this image?"}
            ]
        }
    ]
)

print(response.choices[0].message.content)

Quantization Details

Training Data

  • Calibration Dataset: lmms-lab/flickr30k
  • Calibration Samples: 256 images
  • Sequence Length: 2048 tokens

Quantization Parameters

AWQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "re:.*lm_head",
        "re:.*vision_model.*",
        "re:.*connector.*",
        "re:.*vision_tower.*"
    ]
)

Sequential Targets

  • Target layers: LlamaDecoderLayer
  • Pipeline: Sequential (layer-by-layer calibration)

Performance

Metric Value
Original Size 4.4 GB
Quantized Size 1.97 GB
Compression Ratio 2.23x (55.1% reduction)
Layers Compressed 168 linear layers
GPU Memory (inference) ~2-3 GB
Vision Quality Preserved (no degradation)
Text Quality Under 1% quality degradation in DocVQA

Inference Speed

  • Faster calibration than GPTQ due to activation-aware scaling
  • Similar or slightly faster than fp16 due to reduced memory bandwidth
  • Ideal for deployment on consumer GPUs (RTX 3090, 4090, etc.)

Limitations

  1. Slight quality degradation: 4-bit quantization introduces minor quality loss in text generation
  2. AWQ-specific: Requires AWQ-compatible inference engines (vLLM, transformers)
  3. Vision tower not quantized: Vision encoder remains at full precision to preserve image understanding

Technical Notes

This model was quantized using custom patches to llm-compressor to support the idefics3 architecture:

  • Fixed meta tensor materialization issues in sequential pipeline
  • Enabled AWQ quantization for vision-language models
  • Patches available at: ronantakizawa/llm-compressor

Citation

If you use this model, please cite the original SmolVLM work:

@misc{smolvlm2024,
  title={SmolVLM: Small Vision-Language Model},
  author={HuggingFace Team},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct}
}

And the quantization tool:

@software{llmcompressor2024,
  title={LLM Compressor},
  author={Neural Magic},
  year={2024},
  url={https://github.com/vllm-project/llm-compressor}
}

License

This model inherits the Apache 2.0 license from the base model.

Acknowledgments

Downloads last month
68
Safetensors
Model size
0.8B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ronantakizawa/SmolVLM-Instruct-awq