NV-Reason-CXR-3B GGUF (Quantized for Edge)

Quantized GGUF versions of NVIDIA's NV-Reason-CXR-3B vision-language model, optimized for edge deployment with Cactus Compute and llama.cpp.

Model Description

This repository contains quantized versions of NV-Reason-CXR-3B, a 3B parameter vision-language model specialized in chest X-ray analysis. The model has been converted to GGUF format and quantized for efficient deployment on edge devices (mobile, desktop, embedded systems).

Original Model: nvidia/NV-Reason-CXR-3B
Base Architecture: Qwen2.5-VL 3B Instruct
Conversion: llama.cpp
Quantization: llama-cpp-python

Available Models

| Filename | Format | Size | Use Case | Quality | Speed |
|---|---|---|---|---|---|
| nv-reason-cxr-3b-fp16.gguf | FP16 | 6.3 GB | Desktop with GPU (quality reference) | 100% | Baseline |
| nv-reason-cxr-3b-Q4_K_M.gguf | Q4_K_M | 1.96 GB | Recommended for edge devices | 90-95% | Fast |
| mmproj-nv-reason-cxr-3b-f16.gguf | FP16 mmproj | 1.25 GB | Vision encoder (required for image analysis) | 100% | - |
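
To fetch files programmatically instead of via huggingface-cli, here is a minimal sketch using the huggingface_hub Python package (repo and file names taken from the table above):

from huggingface_hub import hf_hub_download

# Both files are needed for image analysis: the quantized model and the vision encoder
model_path = hf_hub_download(
    repo_id="samwell/NV-Reason-CXR-3B-GGUF",
    filename="nv-reason-cxr-3b-Q4_K_M.gguf",
)
mmproj_path = hf_hub_download(
    repo_id="samwell/NV-Reason-CXR-3B-GGUF",
    filename="mmproj-nv-reason-cxr-3b-f16.gguf",
)
print(model_path, mmproj_path)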

Model Details

Q4_K_M (Recommended):

  • Size: 1.96 GB (69% reduction from FP16)
  • Compression: 3.23x from original
  • Quality: 90-95% retention
  • Speed: 8-20 tokens/sec on mobile (device-dependent)
  • RAM Required: 3-4 GB
  • Best for: Mid-range to high-end mobile devices

FP16 (Reference):

  • Size: 6.3 GB
  • Quality: Original precision
  • Speed: Slower than quantized
  • RAM Required: 8+ GB
  • Best for: Desktop inference, quality comparison

Performance Benchmarks

Desktop (Apple M3 Mac)

Q4_K_M Performance:

| Configuration | Load Time | Inference Speed | Memory Usage |
|---|---|---|---|
| CPU-only | 1.87s | 29.61 tok/s | ~2 GB RAM |
| M3 GPU (Metal) | 0.34s | 33.24 tok/s | ~2 GB RAM |
| Speedup | 5.46x faster ⚡ | 1.12x faster | Same |

Key Insights:

  • 🚀 GPU provides 5.46x faster model loading - a huge benefit for app cold starts
  • ⚡ Modest 1.12x inference speedup - Q4_K_M is already highly CPU-optimized
  • ✅ Excellent CPU performance - GPU acceleration is optional, not required
  • 💪 Mobile devices will run well even without a dedicated GPU

Test Hardware: Apple M3 MacBook Pro (Metal GPU support)
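
The table above can be reproduced with a small timing harness. The sketch below uses llama-cpp-python; the prompt and token budget are arbitrary, and tokens per second is derived from the usage counts in the response:

import time
from llama_cpp import Llama

# Time model loading (set n_gpu_layers=0 for the CPU-only row)
t0 = time.perf_counter()
llm = Llama(
    model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=False,
)
load_s = time.perf_counter() - t0

# Time a generation and compute tokens per second
t0 = time.perf_counter()
out = llm("Describe the typical findings of pneumonia on a chest X-ray.", max_tokens=256)
gen_s = time.perf_counter() - t0

tokens = out["usage"]["completion_tokens"]
print(f"load: {load_s:.2f}s  speed: {tokens / gen_s:.2f} tok/s")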

Mobile Projections

| Device | RAM | Expected Speed | Load Time | Rating |
|---|---|---|---|---|
| Budget Android | 3GB | 3-5 tok/s | 30-45s | Poor |
| Mid-range Android | 4GB | 8-12 tok/s | 20-30s | Good |
| High-end Android | 6GB | 15-20 tok/s | 15-25s | Excellent |
| iPhone 12+ | 4-6GB | 12-18 tok/s | 15-20s | Excellent |
| iPhone 14+ | 6GB+ | 18-25 tok/s | 10-15s | Optimal |

Minimum Requirements:

  • 4GB RAM
  • 3GB free storage
  • iOS 14+ or Android 8+
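
These tiers are projections rather than measurements. If you want to derive runtime settings from device RAM, a hypothetical helper whose thresholds simply mirror the table above might look like:

def mobile_settings(ram_gb: float) -> dict:
    """Map device RAM (GB) to conservative runtime settings (illustrative heuristic only)."""
    if ram_gb < 4:
        # Below the stated minimum; expect poor performance
        return {"context_size": 1024, "threads": 2}
    if ram_gb < 6:
        # Mid-range devices: keep the context window small
        return {"context_size": 2048, "threads": 4}
    # High-end devices can afford a larger context window
    return {"context_size": 4096, "threads": 4}

print(mobile_settings(4))   # mid-range Android
print(mobile_settings(6))   # high-end Android / recent iPhone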

Usage

With llama.cpp

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Download the model and the vision encoder (mmproj)
huggingface-cli download samwell/NV-Reason-CXR-3B-GGUF \
  nv-reason-cxr-3b-Q4_K_M.gguf \
  mmproj-nv-reason-cxr-3b-f16.gguf \
  --local-dir ./models

# Run inference (image input goes through the multimodal CLI;
# older llama.cpp builds name this binary llama-llava-cli or llama-qwen2vl-cli)
./build/bin/llama-mtmd-cli \
  -m models/nv-reason-cxr-3b-Q4_K_M.gguf \
  --mmproj models/mmproj-nv-reason-cxr-3b-f16.gguf \
  --image xray.jpg \
  -p "Analyze this chest X-ray image." \
  -n 512 \
  --temp 0.3

With llama-cpp-python

from llama_cpp import Llama

# Option 1: CPU-only (works great, 29.61 tok/s on M3)
llm = Llama(
    model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
    n_gpu_layers=0,  # CPU-only
)

# Option 2: GPU acceleration (5.46x faster loading!)
llm = Llama(
    model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
    n_gpu_layers=-1,  # Use GPU (Metal on Mac, CUDA on Linux/Windows)
)

# Text-only query (passing an actual image requires the mmproj file via a
# multimodal chat handler -- see the sketch after this example)
response = llm(
    "Analyze this chest X-ray image and identify key findings.",
    max_tokens=512,
    temperature=0.3,  # Lower for medical = more deterministic
    top_p=0.9,
)

print(response['choices'][0]['text'])
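
The calls above are text-only. In llama-cpp-python, image input goes through a multimodal chat handler that loads the mmproj file. The sketch below shows the generic pattern with Llava15ChatHandler; whether that handler's image preprocessing is appropriate for this Qwen2.5-VL-based model depends on your llama-cpp-python version (newer versions may ship a Qwen2.5-VL-specific handler), so treat the handler choice as an assumption to verify:

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler  # generic handler; may not match Qwen2.5-VL exactly

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-nv-reason-cxr-3b-f16.gguf")
llm = Llama(
    model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful radiologist assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/xray.jpg"}},
                {"type": "text", "text": "Describe what you see in this chest X-ray image."},
            ],
        },
    ],
    max_tokens=512,
    temperature=0.3,
)
print(response["choices"][0]["message"]["content"])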

With Cactus Compute (Flutter/Mobile)

Note: You need BOTH the model file AND the mmproj file for image analysis.

import 'package:cactus/cactus.dart';

// Initialize VLM with both model and mmproj files
final vlm = CactusVLM();
await vlm.init(
  modelFilename: 'nv-reason-cxr-3b-Q4_K_M.gguf',      // Model file
  mmprojFilename: 'mmproj-nv-reason-cxr-3b-f16.gguf', // Vision encoder
  contextSize: 2048,    // Context window (2K-4K for mobile)
  threads: 4,           // CPU threads
  gpuLayers: 0,         // CPU-only (GPU may cause issues on some devices)
);

// Create prompt
final messages = [
  ChatMessage(
    role: 'system',
    content: 'You are a helpful radiologist assistant.',
  ),
  ChatMessage(
    role: 'user',
    content: 'Describe what you see in this chest X-ray image.',
  ),
];

// Analyze X-ray
final response = await vlm.completion(
  messages,
  imagePaths: ['path/to/xray.jpg'],
  maxTokens: 150,
  temperature: 0.1,     // Lower for medical analysis (0.1-0.5)
);

print(response.text);

Mobile GPU Benefits:

  • 🚀 5.46x faster model loading (critical for app startup)
  • 📱 Better user experience on iOS (Metal) and Android (Vulkan/OpenCL)
  • 🔋 Minimal battery impact during loading phase
  • ✅ Falls back gracefully to CPU if GPU unavailable

Inference Parameters

Recommended settings for medical analysis:

{
    "temperature": 0.3,      # Lower = more deterministic (range: 0.1-0.5)
    "top_p": 0.9,            # Nucleus sampling
    "top_k": 40,             # Top-k sampling
    "repeat_penalty": 1.1,   # Avoid repetition
    "max_tokens": 512,       # Response length
    "n_ctx": 4096,           # Context window, set at model load (2048-4096 for mobile)
}
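
Mapped onto llama-cpp-python, the same settings look like the sketch below (n_ctx belongs on the Llama constructor, while the rest are per-request generation parameters):

from llama_cpp import Llama

# Context window is fixed at load time
llm = Llama(model_path="nv-reason-cxr-3b-Q4_K_M.gguf", n_ctx=4096)

# Sampling parameters are passed per request
response = llm(
    "Analyze this chest X-ray image and identify key findings.",
    max_tokens=512,
    temperature=0.3,
    top_p=0.9,
    top_k=40,
    repeat_penalty=1.1,
)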

Files Included

.
├── README.md                               # Model card and usage guide
├── LICENSE                                 # NSCLV1 license
├── CONVERSION_PROCESS.md                   # Technical conversion details
├── nv-reason-cxr-3b-Q4_K_M.gguf            # Q4_K_M quantized (1.96 GB) - Recommended
├── nv-reason-cxr-3b-fp16.gguf              # FP16 reference (6.3 GB)
└── mmproj-nv-reason-cxr-3b-f16.gguf        # Vision encoder (1.25 GB) - Required

Model Card

Model Details

  • Developed by: NVIDIA (original), quantized by samwell
  • Model type: Vision-Language Model (VLM)
  • Architecture: Qwen2.5-VL
  • Parameters: 3 billion
  • Language: English
  • License: NSCLV1 (see LICENSE)
  • Fine-tuned from: Qwen2.5-VL-3B-Instruct
  • Specialty: Chest X-ray analysis

Intended Use

Primary Use Cases:

  • Research in medical image analysis
  • Educational purposes for radiology students
  • Prototyping mobile medical AI applications
  • Edge deployment of medical VLMs

Out-of-Scope:

  • ❌ Clinical diagnosis or treatment decisions
  • ❌ Production medical applications without proper validation
  • ❌ Replacing trained radiologists
  • ❌ Any FDA-regulated medical use

Limitations

  1. Not for Clinical Use: This model is for research and educational purposes only
  2. Quality Trade-off: Quantization reduces model size but may affect accuracy
  3. Domain Specific: Trained primarily on chest X-rays, may not generalize to other imaging
  4. Requires Validation: All outputs should be verified by medical professionals
  5. Mobile Performance: Speed varies significantly by device capabilities

Ethical Considerations

  • Model outputs should not be used for medical diagnosis
  • Always consult qualified healthcare professionals
  • Be aware of potential biases in training data
  • Ensure patient privacy when using with real medical images
  • Comply with local healthcare regulations (HIPAA, GDPR, etc.)

Bias and Fairness

The original model may have inherited biases from training data. Users should:

  • Test across diverse patient populations
  • Validate performance on their specific use cases
  • Monitor for unexpected outputs or biases
  • Not rely solely on model outputs

Technical Details

Quantization Method

Q4_K_M uses llama.cpp's 4-bit "k-quant" scheme:

  • Most weights are stored in 4 bits instead of 16 (FP16)
  • Weights are grouped into blocks, each with its own quantization scale and minimum
  • The "M" (medium) variant keeps selected tensors at higher precision to protect quality
  • Per-block scales preserve accuracy better than a single global scale
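
As a rough sanity check on the published numbers (treating the parameter count as approximately 3 billion):

fp16_gb = 6.3
q4_gb = 1.96
params = 3e9  # approximate parameter count

print(f"compression: {fp16_gb / q4_gb:.2f}x")                    # ~3.2x, consistent with the stated 3.23x
print(f"size reduction: {1 - q4_gb / fp16_gb:.0%}")              # ~69%
print(f"effective bits per weight: {q4_gb * 8e9 / params:.1f}")  # ~5.2, above 4 because some tensors stay at higher precision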

Vision Encoder

The vision encoder has been extracted into a separate mmproj file for compatibility:

  • File: mmproj-nv-reason-cxr-3b-f16.gguf (1.25 GB)
  • Required: Both the model file AND mmproj file are needed for image analysis
  • Format: FP16 (full precision vision encoder)
  • Extracted from: NVIDIA's NV-Reason-CXR-3B original model
  • Contains: Vision transformer blocks and multimodal projection layers (519 tensors)

Why separate mmproj?

  • Mobile frameworks such as Cactus Compute expect the vision encoder as a separate mmproj file
  • Allows independent caching and loading strategies
  • Enables mixing different model quantizations with same vision encoder

Context Window

  • Training: 128,000 tokens
  • Recommended for mobile: 2,048-4,096 tokens
  • Desktop: Up to 128,000 tokens (RAM-dependent)
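
The gap between the training context and the mobile recommendation is largely KV-cache memory, which grows linearly with context length. Below is a minimal sketch of that estimate; the layer count, KV-head count, and head dimension are illustrative placeholders, not values read from this model's GGUF metadata:

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elt=2):
    # One K vector and one V vector cached per layer per token (FP16 = 2 bytes/element)
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt

# Placeholder dimensions for illustration only
print(kv_cache_bytes(4096, 36, 2, 128) / 1e6, "MB at a 4K context")
print(kv_cache_bytes(131072, 36, 2, 128) / 1e9, "GB at a 128K context")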

Citation

If you use this model, please cite the original work:

@misc{nvidia2024nvreasoncxr,
  title={NV-Reason-CXR-3B: A Specialized Vision-Language Model for Chest X-ray Analysis},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/NV-Reason-CXR-3B}
}

And optionally cite the quantization:

@misc{nvreasoncxr3b-gguf,
  title={NV-Reason-CXR-3B GGUF: Quantized for Edge Deployment},
  author={samwell},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/samwell/NV-Reason-CXR-3B-GGUF}
}

Acknowledgments

  • NVIDIA for the original NV-Reason-CXR-3B model
  • Qwen Team for the Qwen2.5-VL architecture
  • llama.cpp contributors for the GGUF format and conversion tools
  • Cactus Compute for mobile VLM deployment framework

License

This model inherits the NSCLV1 license from the original NV-Reason-CXR-3B model. See LICENSE for details.

Key points:

  • Research and educational use permitted
  • Commercial use may require additional permissions
  • Not for clinical/diagnostic use
  • See original model card for complete license terms

Disclaimer

⚠️ IMPORTANT MEDICAL DISCLAIMER

This model is provided for RESEARCH AND EDUCATIONAL PURPOSES ONLY. It is:

  • NOT intended for clinical diagnosis or treatment
  • NOT FDA approved or clinically validated
  • NOT a substitute for professional medical advice
  • NOT validated for production medical use

Always consult qualified healthcare professionals for medical decisions. The creators and distributors of this model assume no liability for any use of this software.

Contact & Support

  • Issues: Report issues on GitHub (link to your repo)
  • Questions: See documentation in this repository
  • Original Model: nvidia/NV-Reason-CXR-3B
  • Cactus Compute: GitHub

Version History

  • v1.0 (2025-11-05): Initial release
    • FP16 GGUF conversion
    • Q4_K_M quantization
    • Tested on macOS and mobile projections
    • Complete documentation and scripts

For research and educational purposes only. Not for clinical use.
