NVIDIA Nemotron Nano 12B v2 - GGUF Q4_K_M (7GB)

This repository provides a 4-bit quantized GGUF build of NVIDIA Nemotron Nano 12B v2 using the Q4_K_M scheme. Quantization reduces the on-disk size from roughly 23GB for the original full-precision weights to approximately 7GB while preserving core capabilities.

Upstream base model: nvidia/NVIDIA-Nemotron-Nano-12B-v2

SHA256: 82ea4805d2f9f37e3c67b06768141ff58e43fb0dcd3983a82e9c2f481eb7fea8
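
To verify the download, compare the checksum locally. A minimal sketch in Python, assuming the published SHA256 refers to model-q4.gguf:

import hashlib

EXPECTED = "82ea4805d2f9f37e3c67b06768141ff58e43fb0dcd3983a82e9c2f481eb7fea8"

# Hash the file in 1MB chunks to avoid loading 7GB into memory
h = hashlib.sha256()
with open("model-q4.gguf", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

print("OK" if h.hexdigest() == EXPECTED else "MISMATCH")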

What's included

  • model-q4.gguf (7.0GB)
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • config.json
  • generation_config.json
  • configuration_nemotron_h.py
  • modeling_nemotron_h.py
  • nemotron_toolcall_parser_no_streaming.py
  • bias.md, explainability.md, privacy.md, safety.md
  • acc-vs-budget.png
  • README.md
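
The tokenizer and generation metadata ship as plain JSON, so they can be inspected directly. A small stdlib-only sketch that prints the special token mapping and the default generation settings:

import json

# Special tokens preserved from the upstream tokenizer
with open("special_tokens_map.json") as f:
    print(json.dumps(json.load(f), indent=2))

# Default sampling settings shipped with the model
with open("generation_config.json") as f:
    print(json.dumps(json.load(f), indent=2))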

Capabilities

  • ✓ Tool calling via preserved special tokens and the included helper parser script
  • ✓ Thinking-mode tokens for structured reasoning
  • ✓ Long-context support up to a 128k-token window
  • ✓ Multilingual, general-purpose LLM behavior

Note: GGUF inference backends may vary in their native support for tool-calling integrations; use the included parser or your own orchestration as needed.
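
As an illustration, here is a minimal post-processing sketch that pulls tool calls out of raw model output. The <TOOLCALL> / </TOOLCALL> markers are placeholders for whatever tags the chat template actually emits; see the included nemotron_toolcall_parser_no_streaming.py for the authoritative parsing logic.

import json
import re

# Placeholder markers; confirm the real tags against the model's chat template
TOOLCALL_RE = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)

def extract_tool_calls(text: str):
    """Return any JSON tool-call payloads found between the markers."""
    calls = []
    for payload in TOOLCALL_RE.findall(text):
        try:
            parsed = json.loads(payload)
        except json.JSONDecodeError:
            continue  # malformed payload; let the caller decide how to handle it
        calls.extend(parsed if isinstance(parsed, list) else [parsed])
    return calls

print(extract_tool_calls('<TOOLCALL>[{"name": "get_weather", "arguments": {"city": "Paris"}}]</TOOLCALL>'))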

Hardware notes

  • Disk space: 8GB free recommended for the quantized file and metadata
  • CPU inference: 16GB of system RAM is workable for 4k contexts, with 32GB recommended for comfortable operation. At 128k contexts memory usage grows significantly, and 64 to 128GB of system RAM may be required (see the pre-flight check sketch after this list)
  • GPU offload: 8 to 16GB VRAM can accelerate decoding with llama.cpp -ngl offloading; very long contexts may require 24 to 48GB VRAM or hybrid CPU plus GPU offload
  • Throughput: Depends on backend, threads, and offload settings
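
A quick pre-flight check can catch an undersized machine before a failed load. A rough sketch using the standard library plus psutil (a third-party package, assumed installed) to report free disk and total RAM against the figures above:

import shutil
import psutil  # third-party: pip install psutil

free_disk_gb = shutil.disk_usage(".").free / 1e9
total_ram_gb = psutil.virtual_memory().total / 1e9

# Thresholds mirror the recommendations above: ~8GB free disk, 16GB RAM for 4k contexts
print(f"Free disk: {free_disk_gb:.1f} GB ({'ok' if free_disk_gb >= 8 else 'low'})")
print(f"System RAM: {total_ram_gb:.1f} GB ({'ok' if total_ram_gb >= 16 else 'low'})")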

Usage

llama.cpp

Build llama.cpp, then run:

Generate:

./llama-cli -m model-q4.gguf -p "Hello, Nemotron." -n 128 -t 8 -c 4096 -ngl 35

Server:

./llama-server -m model-q4.gguf -c 4096 -ngl 35

For very long contexts, increase -c accordingly and ensure sufficient RAM or VRAM for KV cache.
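
llama-server exposes an OpenAI-compatible HTTP API, so it can be queried from any client. A minimal stdlib-only sketch, assuming the server is running locally on its default port 8080:

import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Hello, Nemotron."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["choices"][0]["message"]["content"])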

Python via llama-cpp-python

Install:

pip install llama-cpp-python

Then:

from llama_cpp import Llama

# CPU-only by default; pass n_gpu_layers to offload layers to a GPU
llm = Llama(model_path="model-q4.gguf", n_ctx=4096, n_threads=8)
out = llm("Write a short greeting.", max_tokens=128)
print(out["choices"][0]["text"])
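
For chat-style prompting with GPU offload, llama-cpp-python also offers create_chat_completion. A short sketch; n_gpu_layers=-1 offloads all layers and assumes a GPU-enabled build of the wheel:

from llama_cpp import Llama

# Offload every layer to the GPU if the wheel was built with GPU support
llm = Llama(model_path="model-q4.gguf", n_ctx=4096, n_gpu_layers=-1)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what Q4_K_M quantization trades off."}],
    max_tokens=128,
)
print(resp["choices"][0]["message"]["content"])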

Ollama

Create a Modelfile referencing this repo, then create and run:

Modelfile:

FROM hf.co/Avarok/nvidia-nemotron-nano-12b-v2-q4_k_m
PARAMETER num_ctx 4096

Commands:

ollama create nemotron-nano-12b-q4km -f Modelfile
ollama run nemotron-nano-12b-q4km

Note: Ollama versions and syntax may evolve; consult Ollama docs if the Modelfile interface changes.
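
Once created, the model can also be driven through Ollama's local REST API. A minimal stdlib-only sketch against the default port 11434:

import json
import urllib.request

payload = {
    "model": "nemotron-nano-12b-q4km",
    "prompt": "Hello, Nemotron.",
    "stream": False,
}
req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])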

License and attribution

  • Base model: NVIDIA Nemotron Nano 12B v2
  • License: This GGUF quantized derivative is subject to the original model's license and terms. See the upstream model card and license. By using this repository you agree to comply with NVIDIA's licensing for Nemotron models
  • Attribution: If you use this model, please attribute both NVIDIA for the base model and this repository for the quantized packaging

Reproducibility

This artifact was produced by converting the upstream weights to GGUF and quantizing with Q4_K_M. An equivalent quantization command with llama.cpp tools is:

./llama-quantize input.gguf model-q4.gguf Q4_K_M

Exact commands may differ based on the conversion workflow for the upstream model version.
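
To confirm what a given file contains, the gguf Python package (published from the llama.cpp repository) can read GGUF metadata. A small sketch, assuming pip install gguf; it only lists header field names and the tensor count, since decoding field values takes a bit more work:

from gguf import GGUFReader

reader = GGUFReader("model-q4.gguf")

# Header fields include the architecture, file type (quantization), and context length
for name in sorted(reader.fields):
    print(name)

print(f"{len(reader.tensors)} tensors")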

Safety

Review the included bias, explainability, privacy, and safety documents. As with all LLMs, outputs may be inaccurate or unsafe without proper safeguards and human oversight.

Accuracy vs Budget

See the included acc-vs-budget.png for the accuracy versus budget plot.
