---
license: llama3.3
base_model:
- Sao10K/70B-L3.3-Cirrus-x1
pipeline_tag: text-generation
tags:
- text adventure
- roleplay
- rpg
- creative writing
- nvfp4
- vllm
- conversational
- nvfp4a16
---

# 70B-L3.3-Cirrus-x1 (NVFP4A16 quant)

This repo contains 70B-L3.3-Cirrus-x1 quantized with NVFP4A16, a 4-bit weight compression aimed at maximum performance on a wide range of hardware with 8-bit-like accuracy.

> ℹ️ Unlike the NVFP4 format (4-bit weights + 4-bit activations), NVFP4A16 is not limited to Blackwell GPUs and will be supported efficiently in vLLM on RTX 3000- and RTX 4000-series GPUs.

Original Model:
- [Sao10K/70B-L3.3-Cirrus-x1](https://huggingface.co/Sao10K/70B-L3.3-Cirrus-x1)

Hopper- and Blackwell-optimized model:
- [mratsim/70B-L3.3-Cirrus-x1-NVFP4](https://huggingface.co/mratsim/70B-L3.3-Cirrus-x1-NVFP4)

This model requires ~39.8 GiB of VRAM for the weights alone. Make sure to set an appropriate context size with `--max-model-len` in vLLM, and/or quantize the KV cache, and/or spread the model across multiple GPUs, for example with tensor parallelism.

NVFP4 writeups:
- https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
- https://arxiv.org/pdf/2509.25149

## 📥 Usage & Running Instructions

The model was tested with vLLM on 1x or 2x RTX Pro 6000; the script below suits such a configuration with a 131072-token context length.

### Recommendations

It is however recommended to use only 65K of context to avoid significant quality degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).

This model is recommended with "min-p" sampling. Min-p is available through both the older Text Completions API and the Chat Completions API (and the newer Responses API), but most LLM frontends only let you adjust min-p when using Text Completions.

You can however pass `--override-generation-config "${SAMPLER_OVERRIDE}"` to override the server-side sampler defaults (a merge of `generation_config.json` and vLLM defaults), as done in the script below. A client-side example is given at the end of this card.

### Running script

```bash
# Model configuration (Mandatory)
MODEL="mratsim/70B-L3.3-Cirrus-x1-NVFP4A16"
MODELNAME="70B-L3.3-Cirrus-x1"
GPU_UTIL=0.45
NUM_GPUS=2

# Sampling configuration (Optional, if departing from `generation_config.json`)
SAMPLER_OVERRIDE='{"temperature": 1.1, "min_p": 0.02}'

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --tensor-parallel-size "${NUM_GPUS}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --override-generation-config "${SAMPLER_OVERRIDE}"
```

> ℹ️ The FlashInfer backend may fail with an error similar to
> `Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator`.
>
> A workaround is to run a sed replacement command within the vLLM install to increase the buffer space:
> ```bash
> sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
> ```
> This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344.

## 🔬 Quantization method

The llmcompressor library was used with the following recipe:

```yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4A16
```

NVFP4A16 doesn't require any calibration dataset.
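
For reference, the snippet below is a minimal sketch of how such a recipe can be applied with llmcompressor's `oneshot` entry point. It is not the exact script used for this repo: the import paths and the availability of the `NVFP4A16` preset scheme depend on your llmcompressor version, and the output directory name is only illustrative.

```python
# Minimal sketch (assumption: recent llmcompressor with the NVFP4A16 preset scheme).
# Weight-only quantization, so no calibration dataset is passed to oneshot.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Sao10K/70B-L3.3-Cirrus-x1"
SAVE_DIR = "70B-L3.3-Cirrus-x1-NVFP4A16"  # illustrative output path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Same intent as the YAML recipe above: quantize all Linear layers except lm_head
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```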
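
## 💡 Example client-side request

As mentioned in the recommendations, min-p can also be set per request through vLLM's OpenAI-compatible server. The sketch below is an illustrative example using the `openai` Python client against a locally served instance started with the running script above; the URL, prompt and sampling values are assumptions, not part of this repo's configuration.

```python
# Illustrative example: querying the vLLM OpenAI-compatible server
# and passing min-p through the Text Completions API.
from openai import OpenAI

# Default vLLM serve address; the API key is unused for a local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="70B-L3.3-Cirrus-x1",  # matches --served-model-name
    prompt="The airship drifted over the ruined citadel as",
    max_tokens=256,
    temperature=1.1,
    # min_p is a vLLM sampling extension to the OpenAI API, sent via extra_body
    extra_body={"min_p": 0.02},
)
print(completion.choices[0].text)
```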