Qwen3-30B-A3B AT-GRPO Production (K8)

Model Description

This is a LoRA adapter for the Qwen3-30B-A3B sparse mixture-of-experts (MoE) model, trained with Agent- and Turn-wise Group Relative Policy Optimization (AT-GRPO).

Training Details

  • Base Model: unsloth/Qwen3-30B-A3B
  • Method: AT-GRPO (arXiv:2510.11062)
  • Training Steps: 400
  • Agent Role: Vision agent specialization
  • K value: 8 (production configuration)
  • Hardware: AMD Ryzen AI Max+ 395 (Strix Halo) - 128GB unified memory
  • Quantization: 4-bit NF4 for training
  • LoRA Config (see the sketch after this list):
    • r=16
    • alpha=16
    • target_modules: ["o_proj", "v_proj", "gate_proj", "k_proj", "down_proj", "q_proj", "up_proj"]
    • dropout=0.05
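
For readers reproducing the setup, the list above maps onto a peft LoraConfig roughly as follows. This is a minimal sketch; the task_type and how the config is handed to the trainer are assumptions not stated in this card.

from peft import LoraConfig

# LoRA hyperparameters as listed above. task_type is an assumption
# (causal LM adapter) and is not stated explicitly in this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["o_proj", "v_proj", "gate_proj", "k_proj", "down_proj", "q_proj", "up_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)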

Reward Function

  • Team Reward: Global collaboration score
  • Local Reward: Individual agent performance
  • Weight (α): 1.0 (balanced team/local optimization; see the sketch below)
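
How the team and local rewards are combined is defined by the AT-GRPO paper; the snippet below is only a hypothetical illustration of an α-weighted additive blend, under which α = 1.0 weights both terms equally.

# Hypothetical illustration of the α-weighted reward blend.
# The additive form is an assumption for clarity; see arXiv:2510.11062
# for the exact definition used during training.
def combined_reward(local_reward: float, team_reward: float, alpha: float = 1.0) -> float:
    """Blend an agent's individual reward with the global team score."""
    return local_reward + alpha * team_reward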

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit quantization for efficient inference
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3-30B-A3B",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True
)

# Load adapter
model = PeftModel.from_pretrained(model, "wheattoast11/qwen3-30b-atgrpo-production-k8")
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-30B-A3B", trust_remote_code=True)

# Generate
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum entanglement in simple terms."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.075,
    do_sample=True
)

# Decode only the newly generated tokens (skip the prompt echo)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)

Performance

  • Memory: ~13GB with 4-bit quantization (vs 52GB FP16)
  • Context: 262K tokens native, extensible to 1M with RoPE scaling (see the sketch below)
  • Inference: Flash Attention 2 enabled (15-30% faster)
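
The Usage snippet above loads the model with its default context window. A long-context load might look like the sketch below, assuming YaRN-style rope_scaling as documented for Qwen3 models; the factor and original_max_position_embeddings values are illustrative assumptions and should be checked against the base model's config.json.

from transformers import AutoModelForCausalLM

# Hypothetical long-context load. The rope_scaling values are assumptions,
# not settings taken from this card; verify them against the base model.
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3-30B-A3B",
    quantization_config=bnb_config,  # reuse the 4-bit config from the Usage section
    device_map="auto",
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,  # illustrative scaling factor
        "original_max_position_embeddings": 262144,
    },
    trust_remote_code=True,
)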

Citation

If you use this model, please cite:

@article{chen2025atgrpo,
  title={Agent- and Turn-wise Group Relative Policy Optimization},
  author={Chen et al.},
  journal={arXiv preprint arXiv:2510.11062},
  year={2025}
}

License

Apache 2.0 (same as base model)

Training Infrastructure

  • AMD Ryzen AI Max+ 395 (Strix Halo)
  • 128GB LPDDR5X unified memory
  • ROCm 7.0+ with HipBLASLt optimization
  • PyTorch 2.8.0 with AOTriton (environment check sketched below)
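
Before training on this stack, it can help to confirm that the ROCm build of PyTorch actually sees the GPU; a minimal, card-agnostic check is:

import torch

# torch.version.hip is set only on ROCm builds of PyTorch; it is None on CUDA builds.
print("PyTorch:", torch.__version__)
print("HIP runtime:", torch.version.hip)
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))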

Generated on AMD Strix Halo platform.
