Good Job man!

#1
by Narutoouz - opened

I don't have an NVIDIA graphics card, but it's great to see models made available in NVFP4 format. I wish MLX supported this format too, so I could run MLX NVFP4 quants on my M4 Max.

Thanks! I actually don't have any hardware that supports NVFP4 either... I just think it's cool and have been making quants and testing it out with cloud instances. As far as I know nothing is stopping Apple or someone working on MLX from supporting NVFP4. Maybe if enough people get interested in it and keep making quants available others will start to pick it up. Kinda like how companies started supporting FP8.

Can I ask a favor? Could you give a brief description of how you are making them? Yours work perfectly for me with the latest or nightly vLLM, but every time I try to quantize using either LLM-Compressor 0.8.0 or 0.8.1, the result prints gibberish or fails validation.
Thanks!

This is generally what I'm doing for each quant. Some of them require tweaks (such as explicitly passing processor=tok), but this has worked well for most of the models I've tried.

import os
from datasets import load_dataset, get_dataset_split_names
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor import oneshot

MODEL_ID      = os.environ.get("MODEL_ID")
OUTPUT_DIR    = os.environ.get("OUTPUT_DIR")
CAL_DATASET   = os.environ.get("CAL_DATASET")
CAL_SPLIT_ENV = os.environ.get("CAL_SPLIT")
NUM_SAMPLES   = int(os.environ.get("NUM_CAL_SAMPLES", "256"))
MAX_SEQ_LEN   = int(os.environ.get("MAX_SEQ_LEN", "4096"))

print(f"[INFO] Loading base model: {MODEL_ID}")
tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

if CAL_SPLIT_ENV:
    split = CAL_SPLIT_ENV
else:
    try:
        splits = get_dataset_split_names(CAL_DATASET)
    except Exception as e:
        raise RuntimeError(f"Could not list splits for {CAL_DATASET}: {e}")
    # preference order: 'train', anything starting with 'train', otherwise first
    if "train" in splits:
        split = "train"
    else:
        train_like = [s for s in splits if s.startswith("train")]
        split = train_like[0] if train_like else splits[0]
print(f"[INFO] Using dataset {CAL_DATASET} split '{split}'")

print(f"[INFO] Preparing {NUM_SAMPLES} calibration samples @ max_len={MAX_SEQ_LEN}")
# Take the first NUM_SAMPLES rows of the split, then shuffle their order
raw = load_dataset(CAL_DATASET, split=f"{split}[:{NUM_SAMPLES}]").shuffle(seed=42)

def to_text(ex):
    if "messages" in ex:
        # chat-style sample
        return {"text": tok.apply_chat_template(ex["messages"], tokenize=False)}
    for key in ("text", "content", "raw"):
        if key in ex:
            return {"text": ex[key]}
    return {"text": str(ex)}

ds_text = raw.map(to_text)

def tok_fn(sample):
    return tok(sample["text"], padding=False, truncation=True,
               max_length=MAX_SEQ_LEN, add_special_tokens=False)

ds_tok = ds_text.map(tok_fn, remove_columns=ds_text.column_names)

print("[INFO] Quantizing to NVFP4 W4A4 (ignoring lm_head)")
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",          # W4A4 recipe per ZeroShot guide
    ignore=["lm_head"],
)

oneshot(
    model=model,
    dataset=ds_tok,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)
print(f"[INFO] Saving NVFP4 checkpoint to: {OUTPUT_DIR}")

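# Mark the generation config as sampling-based before saving (assumption: this
# avoids save-time validation errors when the base config sets sampling parameters)
model.generation_config.do_sample = True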

model.save_pretrained(OUTPUT_DIR, safe_serialization=True)
tok.save_pretrained(OUTPUT_DIR)
print("[OK] Quantization complete.")
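If you want to check the split selection and the row-to-text step without downloading a model or dataset, here is a minimal sketch of those two pieces as pure functions. The logic mirrors the script above; `pick_split`, the `render_chat` parameter, and `fake_chat_template` are names I made up for illustration (the real script calls `tok.apply_chat_template` directly).

```python
def pick_split(splits):
    """Prefer 'train', then any split starting with 'train', else the first one."""
    if not splits:
        raise ValueError("dataset reports no splits")
    if "train" in splits:
        return "train"
    train_like = [s for s in splits if s.startswith("train")]
    return train_like[0] if train_like else splits[0]


def to_text(example, render_chat):
    """Reduce one dataset row to {'text': ...}.

    render_chat stands in for tok.apply_chat_template(messages, tokenize=False).
    """
    if "messages" in example:
        # chat-style sample
        return {"text": render_chat(example["messages"])}
    for key in ("text", "content", "raw"):
        if key in example:
            return {"text": example[key]}
    # Fallback: stringify the whole row
    return {"text": str(example)}


def fake_chat_template(messages):
    # Hypothetical renderer for illustration only; real templates differ per model.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
```

With split names like `train_sft` and `test_sft` (no plain `train`), `pick_split` falls through to the first split that starts with `train`.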

I usually calibrate with one of these three datasets, depending on whether the model is an instruct, reasoning, or coding model.

MODEL_ID="TheDrummer/Precog-24B-v1"   # Source model on HF
OUTPUT_DIR="./Precog-24B-v1-nvfp4"          # Where to save the NVFP4 checkpoint
#CAL_DATASET="Rombo-Org/Optimized_Reasoning"    # Calibration dataset
CAL_DATASET="HuggingFaceH4/ultrachat_200k"    # Calibration dataset
#CAL_DATASET="nvidia/OpenCodeInstruct"    # Calibration dataset
NUM_CAL_SAMPLES=256                           # Fewer long sequences > many short sequences
MAX_SEQ_LEN=4096                              # Long sequences help calibrate attention correctly
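Since the script reads everything from environment variables, one way to drive it is a small launcher that builds the environment in Python. This is only a sketch: `build_env` is a hypothetical helper, and the mapping from model type to dataset is my guess from the dataset names, not something stated above.

```python
import os

# Assumed mapping, inferred from the dataset names; adjust to taste.
CAL_DATASETS = {
    "instruct": "HuggingFaceH4/ultrachat_200k",
    "reasoning": "Rombo-Org/Optimized_Reasoning",
    "coding": "nvidia/OpenCodeInstruct",
}


def build_env(model_id, output_dir, model_kind, num_samples=256, max_seq_len=4096):
    """Return a copy of os.environ plus the variables the quantization script reads."""
    if model_kind not in CAL_DATASETS:
        raise ValueError(f"unknown model kind: {model_kind!r}")
    env = dict(os.environ)
    env.update({
        "MODEL_ID": model_id,
        "OUTPUT_DIR": output_dir,
        "CAL_DATASET": CAL_DATASETS[model_kind],
        # Environment values must be strings; the script converts them with int()
        "NUM_CAL_SAMPLES": str(num_samples),
        "MAX_SEQ_LEN": str(max_seq_len),
    })
    return env


# Example launch (quantize.py is a hypothetical filename for the script above):
#   import subprocess, sys
#   env = build_env("TheDrummer/Precog-24B-v1", "./Precog-24B-v1-nvfp4", "instruct")
#   subprocess.run([sys.executable, "quantize.py"], env=env, check=True)
```

If you set the variables in a shell instead, remember they need to be exported so that `os.environ.get` can see them.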

I normally don't pin particular versions of my dependencies. I only needed very specific versions of transformers and LLM Compressor for Olmo 3.
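Because nothing is pinned, it can help to record which versions were actually installed when a quant works or fails. A minimal sketch using only the standard library; the package names in `PACKAGES` are my assumption of the relevant PyPI distribution names, and `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names; edit to match your environment.
PACKAGES = ["llmcompressor", "compressed-tensors", "transformers", "datasets", "torch", "vllm"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```

Pasting that output alongside a bug report makes it easier to compare a working setup with a broken one.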

Thanks! I will give it a try! Do you have to do anything differently for MoE models?

I don't believe I've had to do anything special for MoE models. There have been a few models that caused problems, but I think those were just bizarre quirks of their custom Python code rather than anything to do with being MoE.
