---
license: apache-2.0
base_model: HuggingFaceTB/SmolVLM-Instruct
tags:
- vision
- image-text-to-text
- multimodal
- quantized
- awq
- 4-bit
- llm-compressor
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
---

# SmolVLM-Instruct-AWQ-4bit

This is a 4-bit AWQ-quantized version of [HuggingFaceTB/SmolVLM-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct), a 2.2B-parameter vision-language model.

## Model Details

- **Base Model**: HuggingFaceTB/SmolVLM-Instruct
- **Quantization Method**: AWQ W4A16 (4-bit weights, 16-bit activations)
- **Quantization Tool**: [llm-compressor](https://github.com/vllm-project/llm-compressor)
- **Model Size**: 1.97 GB (55.1% reduction from 4.4 GB)
- **Architecture**: Idefics3 (vision encoder + Llama-3.2 text decoder)
- **Layers Quantized**: 168 linear layers

### What's Quantized

✅ **Quantized to 4-bit**:
- Text decoder (24 LlamaDecoderLayer blocks)
- All attention projections (q_proj, k_proj, v_proj, o_proj)
- All MLP layers (gate_proj, up_proj, down_proj)
- Total: 168 linear layers

❌ **Preserved at full precision**:
- Vision encoder/tower (SigLIP)
- Vision-text connector
- Language model head
- All layer norms and biases

## Usage

### Requirements

```bash
pip install transformers torch pillow
```

### Basic Usage

```python
from transformers import Idefics3ForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

# Load model and processor
model = Idefics3ForConditionalGeneration.from_pretrained(
    "ronantakizawa/SmolVLM-Instruct-AWQ-4bit",
    device_map="auto",
    torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained("ronantakizawa/SmolVLM-Instruct-AWQ-4bit")

# Load an image
url = "https://huggingface.co/spaces/merve/chatml-llava/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Create prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
```

### Using with vLLM (Production Deployment)

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model ronantakizawa/SmolVLM-Instruct-AWQ-4bit \
  --quantization awq \
  --dtype auto
```

Then use the OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.chat.completions.create(
    model="ronantakizawa/SmolVLM-Instruct-AWQ-4bit",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What's in this image?"}
            ]
        }
    ]
)
print(response.choices[0].message.content)
```

## Quantization Details

### Calibration Data

- **Calibration Dataset**: lmms-lab/flickr30k
- **Calibration Samples**: 256 images
- **Sequence Length**: 2048 tokens

### Quantization Parameters

```python
AWQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "re:.*lm_head",
        "re:.*vision_model.*",
        "re:.*connector.*",
        "re:.*vision_tower.*"
    ]
)
```

### Sequential Targets

- Target layers: `LlamaDecoderLayer`
- Pipeline: Sequential (layer-by-layer calibration)
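For reference, the quantization run roughly follows llm-compressor's standard one-shot flow. The sketch below is a simplified illustration rather than the exact script used for this checkpoint: the import paths, the flickr30k split and column names, the multimodal preprocessing, and the sequential-calibration wiring are assumptions that vary across llm-compressor versions, and the idefics3-specific patches described under Technical Notes are not shown.

```python
# Illustrative sketch only: not the exact script used to produce this checkpoint.
from datasets import load_dataset
from transformers import AutoProcessor, Idefics3ForConditionalGeneration

# Import paths vary across llm-compressor releases (older: llmcompressor.transformers.oneshot).
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"
SAVE_DIR = "SmolVLM-Instruct-AWQ-4bit"

model = Idefics3ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# 256 image-caption pairs from flickr30k as calibration data
# (split name and column layout depend on the dataset revision).
ds = load_dataset("lmms-lab/flickr30k", split="test").shuffle(seed=42).select(range(256))

def preprocess(example):
    # Wrap each calibration image in the chat template the model expects.
    messages = [{
        "role": "user",
        "content": [{"type": "image"},
                    {"type": "text", "text": "Describe this image."}],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    return processor(text=prompt, images=[example["image"]],
                     truncation=True, max_length=2048)

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Same recipe as shown above: 4-bit weights, 16-bit activations,
# with the vision tower, connector, and lm_head left untouched.
recipe = AWQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "re:.*lm_head",
        "re:.*vision_model.*",
        "re:.*connector.*",
        "re:.*vision_tower.*",
    ],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
    # The actual run calibrated sequentially over LlamaDecoderLayer blocks;
    # how sequential targets are configured depends on the llm-compressor version.
)

# llm-compressor extends save_pretrained with compressed-tensors serialization.
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```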
## Performance

| Metric | Value |
|--------|-------|
| **Original Size** | 4.4 GB |
| **Quantized Size** | 1.97 GB |
| **Compression Ratio** | 2.23x (55.1% reduction) |
| **Layers Compressed** | 168 linear layers |
| **GPU Memory (inference)** | ~2-3 GB |
| **Vision Quality** | Preserved (vision encoder not quantized) |
| **Text Quality** | Under 1% degradation on DocVQA |

### Inference Speed

- Calibration is faster than GPTQ thanks to activation-aware scaling
- Inference is similar to or slightly faster than fp16 due to reduced memory bandwidth
- Well suited to deployment on consumer GPUs (RTX 3090, 4090, etc.)

## Limitations

1. **Slight quality degradation**: 4-bit quantization introduces minor quality loss in text generation
2. **AWQ-specific**: Requires AWQ-compatible inference engines (vLLM, transformers)
3. **Vision tower not quantized**: The vision encoder remains at full precision to preserve image understanding

## Technical Notes

This model was quantized using custom patches to llm-compressor to support the Idefics3 architecture:

- Fixed meta-tensor materialization issues in the sequential pipeline
- Enabled AWQ quantization for vision-language models
- Patches available at [ronantakizawa/llm-compressor](https://github.com/ronantakizawa/llm-compressor)

## Citation

If you use this model, please cite the original SmolVLM work:

```bibtex
@misc{smolvlm2024,
  title={SmolVLM: Small Vision-Language Model},
  author={HuggingFace Team},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct}
}
```

And the quantization tool:

```bibtex
@software{llmcompressor2024,
  title={LLM Compressor},
  author={Neural Magic},
  year={2024},
  url={https://github.com/vllm-project/llm-compressor}
}
```

## License

This model inherits the Apache 2.0 license from the base model.

## Acknowledgments

- Base model: [HuggingFaceTB/SmolVLM-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct)
- Quantization: [llm-compressor](https://github.com/vllm-project/llm-compressor)
- Calibration data: [lmms-lab/flickr30k](https://huggingface.co/datasets/lmms-lab/flickr30k)