NV-Reason-CXR-3B GGUF (Quantized for Edge)
Quantized GGUF versions of NVIDIA's NV-Reason-CXR-3B vision-language model, optimized for edge deployment with Cactus Compute and llama.cpp.
Model Description
This repository contains quantized versions of NV-Reason-CXR-3B, a 3B parameter vision-language model specialized in chest X-ray analysis. The model has been converted to GGUF format and quantized for efficient deployment on edge devices (mobile, desktop, embedded systems).
- Original Model: nvidia/NV-Reason-CXR-3B
- Base Architecture: Qwen2.5-VL 3B Instruct
- Conversion: llama.cpp
- Quantization: llama-cpp-python
Available Models
| Filename | Format | Size | Use Case | Quality | Speed |
|---|---|---|---|---|---|
| nv-reason-cxr-3b-fp16.gguf | FP16 | 6.3 GB | Desktop with GPU (quality reference) | 100% | Baseline |
| nv-reason-cxr-3b-Q4_K_M.gguf | Q4_K_M | 1.96 GB | Recommended for edge devices | 90-95% | Fast |
| mmproj-nv-reason-cxr-3b-f16.gguf | FP16 mmproj | 1.25 GB | Vision encoder (required for image analysis) | 100% | - |
Model Details
Q4_K_M (Recommended):
- Size: 1.96 GB (69% reduction from FP16)
- Compression: 3.23x from original
- Quality: 90-95% retention
- Speed: 8-20 tokens/sec on mobile (device-dependent)
- RAM Required: 3-4 GB
- Best for: Mid-range to high-end mobile devices
FP16 (Reference):
- Size: 6.3 GB
- Quality: Original precision
- Speed: Slower than quantized
- RAM Required: 8+ GB
- Best for: Desktop inference, quality comparison
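Before loading either file on a desktop machine, a rough pre-flight check can confirm that enough memory is free for the 3-4 GB working set quoted above. A minimal sketch in Python, assuming the third-party psutil package is installed and using a placeholder model path:

import os
import psutil  # third-party: pip install psutil

MODEL_PATH = "nv-reason-cxr-3b-Q4_K_M.gguf"   # placeholder: adjust to your local path
REQUIRED_RAM_GB = 4                            # upper end of the 3-4 GB estimate above

size_gb = os.path.getsize(MODEL_PATH) / 1024**3
avail_gb = psutil.virtual_memory().available / 1024**3
print(f"model file: {size_gb:.2f} GB, available RAM: {avail_gb:.2f} GB")
if avail_gb < REQUIRED_RAM_GB:
    print("Warning: less free RAM than recommended; expect heavy swapping or load failure.")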
Performance Benchmarks
Desktop (Apple M3 Mac)
Q4_K_M Performance:
| Configuration | Load Time | Inference Speed | Memory Usage |
|---|---|---|---|
| CPU-only | 1.87s | 29.61 tok/s | ~2 GB RAM |
| M3 GPU (Metal) | 0.34s | 33.24 tok/s | ~2 GB RAM |
| Speedup | 5.46x faster | 1.12x faster | Same |
Key Insights:
- GPU provides 5.46x faster model loading: a huge benefit for app cold starts
- Modest 1.12x inference speedup: Q4_K_M is already highly CPU-optimized
- Excellent CPU performance: GPU acceleration is optional, not required
- Mobile devices will run well even without a dedicated GPU
Test Hardware: Apple M3 MacBook Pro (Metal GPU support)
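The desktop numbers above can be reproduced with a short llama-cpp-python script along the following lines. This is a minimal sketch rather than the exact benchmark harness: the model path, prompt, and thread count are placeholders, and throughput will vary by machine.

import time
from llama_cpp import Llama

MODEL = "models/nv-reason-cxr-3b-Q4_K_M.gguf"   # placeholder path

for gpu_layers in (0, -1):   # 0 = CPU-only, -1 = offload all layers to GPU
    t0 = time.time()
    llm = Llama(model_path=MODEL, n_ctx=4096, n_threads=4,
                n_gpu_layers=gpu_layers, verbose=False)
    load_s = time.time() - t0

    t0 = time.time()
    out = llm("Describe common findings visible on a chest X-ray.",
              max_tokens=256, temperature=0.3)
    gen_s = time.time() - t0
    tokens = out["usage"]["completion_tokens"]

    print(f"n_gpu_layers={gpu_layers}: load {load_s:.2f}s, {tokens / gen_s:.2f} tok/s")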
Mobile Projections
| Device | RAM | Expected Speed | Load Time | Rating |
|---|---|---|---|---|
| Budget Android | 3GB | 3-5 tok/s | 30-45s | Poor |
| Mid-range Android | 4GB | 8-12 tok/s | 20-30s | Good |
| High-end Android | 6GB | 15-20 tok/s | 15-25s | Excellent |
| iPhone 12+ | 4-6GB | 12-18 tok/s | 15-20s | Excellent |
| iPhone 14+ | 6GB+ | 18-25 tok/s | 10-15s | Optimal |
Minimum Requirements:
- 4GB RAM
- 3GB free storage
- iOS 14+ or Android 8+
Usage
With llama.cpp
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Download the model and the vision encoder (mmproj)
huggingface-cli download samwell/NV-Reason-CXR-3B-GGUF \
  nv-reason-cxr-3b-Q4_K_M.gguf \
  mmproj-nv-reason-cxr-3b-f16.gguf \
  --local-dir ./models
# Run inference (the multimodal CLI needs both the model and the mmproj file)
./build/bin/llama-mtmd-cli \
  -m models/nv-reason-cxr-3b-Q4_K_M.gguf \
  --mmproj models/mmproj-nv-reason-cxr-3b-f16.gguf \
  --image xray.jpg \
  -p "Analyze this chest X-ray image." \
  -n 512 \
  --temp 0.3
With llama-cpp-python
from llama_cpp import Llama
# Option 1: CPU-only (works great, 29.61 tok/s on M3)
llm = Llama(
model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
n_ctx=4096,
n_threads=4,
n_gpu_layers=0, # CPU-only
)
# Option 2: GPU acceleration (5.46x faster loading!)
llm = Llama(
model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
n_ctx=4096,
n_threads=4,
n_gpu_layers=-1, # Use GPU (Metal on Mac, CUDA on Linux/Windows)
)
# Run a text completion (image input additionally requires the mmproj vision encoder loaded via a multimodal chat handler)
response = llm(
"Analyze this chest X-ray image and identify key findings.",
max_tokens=512,
temperature=0.3, # Lower for medical = more deterministic
top_p=0.9,
)
print(response['choices'][0]['text'])
With Cactus Compute (Flutter/Mobile)
Note: You need BOTH the model file AND the mmproj file for image analysis.
import 'package:cactus/cactus.dart';
// Initialize VLM with both model and mmproj files
final vlm = CactusVLM();
await vlm.init(
modelFilename: 'nv-reason-cxr-3b-Q4_K_M.gguf', // Model file
mmprojFilename: 'mmproj-nv-reason-cxr-3b-f16.gguf', // Vision encoder
contextSize: 2048, // Context window (2K-4K for mobile)
threads: 4, // CPU threads
gpuLayers: 0, // CPU-only (GPU may cause issues on some devices)
);
// Create prompt
final messages = [
ChatMessage(
role: 'system',
content: 'You are a helpful radiologist assistant.',
),
ChatMessage(
role: 'user',
content: 'Describe what you see in this chest X-ray image.',
),
];
// Analyze X-ray
final response = await vlm.completion(
messages,
imagePaths: ['path/to/xray.jpg'],
maxTokens: 150,
temperature: 0.1, // Lower for medical analysis (0.1-0.5)
);
print(response.text);
Mobile GPU Benefits:
- 5.46x faster model loading (critical for app startup)
- Better user experience on iOS (Metal) and Android (Vulkan/OpenCL)
- Minimal battery impact during loading phase
- Falls back gracefully to CPU if GPU unavailable
Inference Parameters
Recommended settings for medical analysis:
{
"temperature": 0.3, # Lower = more deterministic (range: 0.1-0.5)
"top_p": 0.9, # Nucleus sampling
"top_k": 40, # Top-k sampling
"repeat_penalty": 1.1, # Avoid repetition
"max_tokens": 512, # Response length
"n_ctx": 4096, # Context window (2048-4096 for mobile)
}
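When using llama-cpp-python, note that n_ctx is fixed at load time (Llama constructor), while the remaining settings are per-request sampling parameters. A short sketch mapping the values above onto that API (the path and prompt are placeholders):

from llama_cpp import Llama

llm = Llama(
    model_path="nv-reason-cxr-3b-Q4_K_M.gguf",   # placeholder path
    n_ctx=4096,            # context window: load-time setting (2048-4096 for mobile)
)

response = llm(
    "List common findings on a frontal chest X-ray.",
    max_tokens=512,        # response length
    temperature=0.3,       # lower = more deterministic (0.1-0.5)
    top_p=0.9,             # nucleus sampling
    top_k=40,              # top-k sampling
    repeat_penalty=1.1,    # avoid repetition
)
print(response["choices"][0]["text"])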
Files Included
.
├── README.md                          # Model card and usage guide
├── LICENSE                            # NSCLV1 license
├── CONVERSION_PROCESS.md              # Technical conversion details
├── nv-reason-cxr-3b-Q4_K_M.gguf       # Q4_K_M quantized (1.96 GB) - Recommended
├── nv-reason-cxr-3b-fp16.gguf         # FP16 reference (6.3 GB)
└── mmproj-nv-reason-cxr-3b-f16.gguf   # Vision encoder (1.25 GB) - Required
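Both files needed for image analysis can also be fetched programmatically with the huggingface_hub client; a short sketch (the destination directory is a placeholder):

from huggingface_hub import hf_hub_download

REPO_ID = "samwell/NV-Reason-CXR-3B-GGUF"
for filename in (
    "nv-reason-cxr-3b-Q4_K_M.gguf",        # quantized language model
    "mmproj-nv-reason-cxr-3b-f16.gguf",    # vision encoder, required for images
):
    path = hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir="./models")
    print("downloaded:", path)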
Model Card
Model Details
- Developed by: NVIDIA (original), quantized by samwell
- Model type: Vision-Language Model (VLM)
- Architecture: Qwen2.5-VL
- Parameters: 3 billion
- Language: English
- License: NSCLV1 (see LICENSE)
- Fine-tuned from: Qwen2.5-VL-3B-Instruct
- Specialty: Chest X-ray analysis
Intended Use
Primary Use Cases:
- Research in medical image analysis
- Educational purposes for radiology students
- Prototyping mobile medical AI applications
- Edge deployment of medical VLMs
Out-of-Scope:
- Clinical diagnosis or treatment decisions
- Production medical applications without proper validation
- Replacing trained radiologists
- Any FDA-regulated medical use
Limitations
- Not for Clinical Use: This model is for research and educational purposes only
- Quality Trade-off: Quantization reduces model size but may affect accuracy
- Domain Specific: Trained primarily on chest X-rays, may not generalize to other imaging
- Requires Validation: All outputs should be verified by medical professionals
- Mobile Performance: Speed varies significantly by device capabilities
Ethical Considerations
- Model outputs should not be used for medical diagnosis
- Always consult qualified healthcare professionals
- Be aware of potential biases in training data
- Ensure patient privacy when using with real medical images
- Comply with local healthcare regulations (HIPAA, GDPR, etc.)
Bias and Fairness
The original model may have inherited biases from training data. Users should:
- Test across diverse patient populations
- Validate performance on their specific use cases
- Monitor for unexpected outputs or biases
- Not rely solely on model outputs
Technical Details
Quantization Method
Q4_K_M uses llama.cpp's 4-bit "K-quant" scheme:
- Weights are stored in 4 bits instead of 16 (FP16)
- Per-block scales and minimums are stored alongside the 4-bit codes to preserve accuracy
- The "medium" (M) variant keeps selected tensors at higher precision, balancing size and quality
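The core idea of block-wise quantization with per-block scales can be illustrated with a toy NumPy sketch. This is illustrative only: the real Q4_K_M format uses super-blocks, quantized scales/minimums, and mixed precision that this simplification omits.

import numpy as np

def quantize_block(x, bits=4):
    # One scale and one minimum per block; codes fit in `bits` bits (packed on disk in practice).
    levels = 2**bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / levels or 1.0
    q = np.round((x - x_min) / scale).astype(np.uint8)
    return q, scale, x_min

def dequantize_block(q, scale, x_min):
    return q * scale + x_min

block = np.random.randn(32).astype(np.float32)   # one block of 32 weights
q, scale, x_min = quantize_block(block)
err = np.abs(dequantize_block(q, scale, x_min) - block).max()
print(f"max reconstruction error in this block: {err:.4f}")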
Vision Encoder
The vision encoder has been extracted into a separate mmproj file for compatibility:
- File: mmproj-nv-reason-cxr-3b-f16.gguf (1.25 GB)
- Required: Both the model file AND the mmproj file are needed for image analysis
- Format: FP16 (full precision vision encoder)
- Extracted from: NVIDIA's NV-Reason-CXR-3B original model
- Contains: Vision transformer blocks and multimodal projection layers (519 tensors)
Why separate mmproj?
- Mobile frameworks (Cactus Compute) require separate mmproj architecture
- Allows independent caching and loading strategies
- Enables mixing different model quantizations with same vision encoder
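Either GGUF file can be inspected with the gguf Python package (maintained alongside llama.cpp); a small sketch, assuming the package is installed and the mmproj file is local:

from gguf import GGUFReader

reader = GGUFReader("mmproj-nv-reason-cxr-3b-f16.gguf")
print("tensor count:", len(reader.tensors))    # expected: 519 for this mmproj file
for t in reader.tensors[:5]:                   # peek at the first few tensor names/shapes
    print(t.name, list(t.shape))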
Context Window
- Training: 128,000 tokens
- Recommended for mobile: 2,048-4,096 tokens
- Desktop: Up to 128,000 tokens (RAM-dependent)
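The practical ceiling on context length is KV-cache memory, which grows linearly with the context size. A back-of-envelope estimate follows, assuming the Qwen2.5-3B backbone dimensions (36 layers, 2 KV heads, head dim 128, FP16 cache); treat these figures as assumptions and verify them against the GGUF metadata:

# Hypothetical dimensions for a rough KV-cache estimate; check your model's metadata.
n_layers, n_kv_heads, head_dim, bytes_per_value = 36, 2, 128, 2
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value   # K and V per token
for n_ctx in (2048, 4096, 131072):
    print(f"n_ctx={n_ctx:>6}: ~{per_token * n_ctx / 1024**2:,.0f} MiB KV cache")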
Citation
If you use this model, please cite the original work:
@misc{nvidia2024nvreasoncxr,
title={NV-Reason-CXR-3B: A Specialized Vision-Language Model for Chest X-ray Analysis},
author={NVIDIA},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/nvidia/NV-Reason-CXR-3B}
}
And optionally cite the quantization:
@misc{nvreasoncxr3b-gguf,
title={NV-Reason-CXR-3B GGUF: Quantized for Edge Deployment},
author={samwell},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/samwell/NV-Reason-CXR-3B-GGUF}
}
Acknowledgments
- NVIDIA for the original NV-Reason-CXR-3B model
- Qwen Team for the Qwen2.5-VL architecture
- llama.cpp contributors for the GGUF format and conversion tools
- Cactus Compute for mobile VLM deployment framework
License
This model inherits the NSCLV1 license from the original NV-Reason-CXR-3B model. See LICENSE for details.
Key points:
- Research and educational use permitted
- Commercial use may require additional permissions
- Not for clinical/diagnostic use
- See original model card for complete license terms
Disclaimer
IMPORTANT MEDICAL DISCLAIMER
This model is provided for RESEARCH AND EDUCATIONAL PURPOSES ONLY. It is:
- NOT intended for clinical diagnosis or treatment
- NOT FDA approved or clinically validated
- NOT a substitute for professional medical advice
- NOT validated for production medical use
Always consult qualified healthcare professionals for medical decisions. The creators and distributors of this model assume no liability for any use of this software.
Contact & Support
- Issues: Report issues on the repository's issue tracker
- Questions: See documentation in this repository
- Original Model: nvidia/NV-Reason-CXR-3B
- Cactus Compute: GitHub
Version History
- v1.0 (2025-11-05): Initial release
- FP16 GGUF conversion
- Q4_K_M quantization
- Tested on macOS and mobile projections
- Complete documentation and scripts
For research and educational purposes only. Not for clinical use.