Qwen3-30B-A3B AT-GRPO Production (K8)
Model Description
This is a LoRA adapter trained with Agent- and Turn-wise Group Relative Policy Optimization (AT-GRPO) on the Qwen3-30B-A3B sparse Mixture-of-Experts (MoE) model.
Training Details
- Base Model: unsloth/Qwen3-30B-A3B
- Method: AT-GRPO (arXiv:2510.11062)
- Training Steps: 400
- Agent Role: Vision agent specialization
- K value: 8 (production configuration)
- Hardware: AMD Ryzen AI Max+ 395 (Strix Halo) - 128GB unified memory
- Quantization: 4-bit NF4 for training
- LoRA Config (a minimal PEFT sketch follows this list):
  - r=16
  - alpha=16
  - target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
  - dropout=0.05
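The exact training script is not included here; as a minimal sketch (assuming the standard PEFT API, with task_type and bias as assumptions), the configuration above corresponds roughly to:

from peft import LoraConfig, get_peft_model

# LoRA settings mirroring the card above
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # base_model loaded as in the Usage section below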
Reward Function
- Team Reward: Global collaboration score
- Local Reward: Individual agent performance
- Weight (α): 1.0 (team and local rewards weighted equally; see the sketch after this list)
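The exact reward combination used in training is not reproduced here. As an illustrative sketch only, assuming the total reward is team_reward + α·local_reward and that advantages are normalized within each group of K=8 rollouts in the usual GRPO fashion (all names below are hypothetical):

import statistics

ALPHA = 1.0  # weight on the local (per-agent) reward, per the card above
K = 8        # rollouts per prompt/turn (group size)

def group_relative_advantages(team_rewards, local_rewards, alpha=ALPHA):
    """Combine team and local rewards, then normalize within the group (GRPO-style)."""
    combined = [t + alpha * l for t, l in zip(team_rewards, local_rewards)]
    mean = statistics.mean(combined)
    std = statistics.pstdev(combined) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in combined]

# Example: one group of K=8 rollouts for a single turn
advantages = group_relative_advantages(
    team_rewards=[0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.6, 0.3],
    local_rewards=[1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
)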
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 quantization for efficient inference
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

# Load the quantized base model
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3-30B-A3B",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

# Load the AT-GRPO LoRA adapter
model = PeftModel.from_pretrained(model, "wheattoast11/qwen3-30b-atgrpo-production-k8")
model.eval()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-30B-A3B", trust_remote_code=True)

# Build a chat prompt and generate
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum entanglement in simple terms."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.075,
    do_sample=True,
)

# Decode only the newly generated tokens (skip the prompt)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
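For interactive use, transformers' TextStreamer can print tokens to stdout as they are generated. A minimal sketch, reusing the model, tokenizer, and inputs from above:

from transformers import TextStreamer

# Streams decoded tokens as they are produced, skipping the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.8,
    do_sample=True,
    streamer=streamer,
)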
Performance
- Memory: ~13GB with 4-bit quantization (vs 52GB FP16)
- Context: 262K tokens native, extensible to 1M with RoPE scaling (see the YaRN sketch after this list)
- Inference: Flash Attention 2 enabled (roughly 15-30% faster than the default attention path)
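Long-context use is not covered by the Usage snippet above. One way to enable YaRN RoPE scaling, as an illustrative sketch only (it assumes transformers' config-override behavior in from_pretrained and YaRN support for Qwen3; the factor and original_max_position_embeddings values are assumptions, not settings from this card), reusing bnb_config from the Usage section:

# Illustrative only: extend the context window via YaRN RoPE scaling
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3-30B-A3B",
    quantization_config=bnb_config,
    device_map="auto",
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,  # ~4x the native window (assumed value)
        "original_max_position_embeddings": 262144,  # assumed native window
    },
    trust_remote_code=True,
)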
Citation
If you use this model, please cite:
@article{chen2025atgrpo,
  title={Agent- and Turn-wise Group Relative Policy Optimization},
  author={Chen et al.},
  journal={arXiv preprint arXiv:2510.11062},
  year={2025}
}
License
Apache 2.0 (same as base model)
Training Infrastructure
- AMD Ryzen AI Max+ 395 (Strix Halo)
- 128GB LPDDR5X unified memory
- ROCm 7.0+ with hipBLASLt optimization
- PyTorch 2.8.0 with AOTriton (a quick environment check follows below)
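A quick way to confirm the ROCm stack that PyTorch sees (a minimal sketch; on ROCm builds the torch.cuda API is routed through HIP):

import torch

print(torch.__version__)             # PyTorch build, e.g. a 2.8.0+rocm* tag
print(torch.version.hip)             # HIP/ROCm runtime version (None on CUDA builds)
print(torch.cuda.is_available())     # True if the GPU/APU is visible
print(torch.cuda.get_device_name(0)) # reported device name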
Generated on the AMD Strix Halo platform.