Qwen3-30B-A3B AT-GRPO Production (K8)

Model Description

This is a LoRA adapter for the Qwen3-30B-A3B sparse mixture-of-experts (MoE) model, trained with Agent- and Turn-wise Group Relative Policy Optimization (AT-GRPO).

Training Details

  • Base Model: unsloth/Qwen3-30B-A3B
  • Method: AT-GRPO (arXiv:2510.11062)
  • Training Steps: 400
  • Agent Role: Vision agent specialization
  • K value: 8 (production configuration)
  • Hardware: AMD Ryzen AI Max+ 395 (Strix Halo) - 128GB unified memory
  • Quantization: 4-bit NF4 for training
  • LoRA Config (see the sketch after this list):
    • r=16
    • alpha=16
    • target_modules: ["o_proj", "v_proj", "gate_proj", "k_proj", "down_proj", "q_proj", "up_proj"]
    • dropout=0.05
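
For readers reproducing the setup, the list above maps onto a peft LoraConfig roughly as follows. This is a minimal sketch; the task_type and how the config is handed to the trainer are assumptions not stated in this card.

from peft import LoraConfig

# LoRA hyperparameters as listed above. task_type is an assumption
# (causal LM adapter) and is not stated explicitly in this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["o_proj", "v_proj", "gate_proj", "k_proj", "down_proj", "q_proj", "up_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)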

Reward Function

  • Team Reward: Global collaboration score
  • Local Reward: Individual agent performance
  • Weight (α): 1.0 (balanced team/local optimization; see the sketch below)
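
How the team and local rewards are combined is defined by the AT-GRPO paper; the snippet below is only a hypothetical illustration of an α-weighted additive blend, under which α = 1.0 weights both terms equally.

# Hypothetical illustration of the α-weighted reward blend.
# The additive form is an assumption for clarity; see arXiv:2510.11062
# for the exact definition used during training.
def combined_reward(local_reward: float, team_reward: float, alpha: float = 1.0) -> float:
    """Blend an agent's individual reward with the global team score."""
    return local_reward + alpha * team_reward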

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit quantization for efficient inference
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3-30B-A3B",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True
)

# Load adapter
model = PeftModel.from_pretrained(model, "wheattoast11/qwen3-30b-atgrpo-production-k8")
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-30B-A3B", trust_remote_code=True)

# Generate
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum entanglement in simple terms."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.075,
    do_sample=True
)

# Decode only the newly generated tokens (skip the prompt echo)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)

Performance

  • Memory: ~13GB with 4-bit quantization (vs 52GB FP16)
  • Context: 262K tokens native, extensible to 1M with RoPE scaling (see the sketch below)
  • Inference: Flash Attention 2 enabled (15-30% faster)
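
The Usage snippet above loads the model with its default context window. A long-context load might look like the sketch below, assuming YaRN-style rope_scaling as documented for Qwen3 models; the factor and original_max_position_embeddings values are illustrative assumptions and should be checked against the base model's config.json.

from transformers import AutoModelForCausalLM

# Hypothetical long-context load. The rope_scaling values are assumptions,
# not settings taken from this card; verify them against the base model.
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3-30B-A3B",
    quantization_config=bnb_config,  # reuse the 4-bit config from the Usage section
    device_map="auto",
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,  # illustrative scaling factor
        "original_max_position_embeddings": 262144,
    },
    trust_remote_code=True,
)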

Citation

If you use this model, please cite:

@article{chen2025atgrpo,
  title={Agent- and Turn-wise Group Relative Policy Optimization},
  author={Chen et al.},
  journal={arXiv preprint arXiv:2510.11062},
  year={2025}
}

License

Apache 2.0 (same as base model)

Training Infrastructure

  • AMD Ryzen AI Max+ 395 (Strix Halo)
  • 128GB LPDDR5X unified memory
  • ROCm 7.0+ with HipBLASLt optimization
  • PyTorch 2.8.0 with AOTriton (environment check sketched below)
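
Before training on this stack, it can help to confirm that the ROCm build of PyTorch actually sees the GPU; a minimal, card-agnostic check is:

import torch

# torch.version.hip is set only on ROCm builds of PyTorch; it is None on CUDA builds.
print("PyTorch:", torch.__version__)
print("HIP runtime:", torch.version.hip)
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))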

Generated on AMD Strix Halo platform.
