
Model Card for dynastai-grpo-lora

This model is a LoRA adapter for the Qwen/Qwen3-1.7B base model, fine-tuned with Group Relative Policy Optimization (GRPO) to act as a royal advisor in the DynastAI kingdom simulation game. The model learns to select options that keep the four kingdom metrics (Church, People, Military, Treasury) close to 50.


Model Details

Model Description

This model is a LoRA adapter trained on synthetic decision-making data generated from the Reigns-style kingdom simulation. The model receives a prompt describing the current state of the kingdom and two possible choices, and it must select the option that best balances the four key metrics. The training process uses GRPO with a custom reward function to encourage balanced decisions and concise outputs.
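For illustration, a prompt of the kind described above could be assembled as follows. The scenario text, metric values, and system instruction here are hypothetical; the exact card wording used during training is not reproduced in this card.

messages = [
    {"role": "system", "content": "You are a royal advisor. Keep Church, People, Military and Treasury close to 50."},
    {"role": "user", "content": (
        "Current Metrics: Church: 62, People: 45, Military: 38, Treasury: 55\n"
        "The general requests funds to expand the army.\n"
        "Options:\n"
        "1. Approve the funding.\n"
        "2. Refuse the request.\n"
        "Answer with 1 or 2."
    )},
]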

  • Developed by: Earl Potters
  • Funded by [optional]: [More Information Needed]
  • Shared by [optional]: Slyracoon23
  • Model type: LoRA adapter for Causal Language Model (Qwen3-1.7B)
  • Language(s) (NLP): English
  • License: base model under the Qwen3-1.7B license (see Qwen3-1.7B); LoRA adapter weights under MIT
  • Finetuned from model [optional]: Qwen/Qwen3-1.7B

Model Sources

  • Repository: https://huggingface.co/Slyracoon23/dynastai-grpo-lora

Uses

Direct Use

This LoRA adapter is intended for use in the DynastAI kingdom simulation game as a decision-making agent. It can be used to generate balanced choices in scenarios where multiple kingdom metrics must be managed.

Downstream Use

The adapter can be plugged into any Qwen3-1.7B model instance using PEFT/Unsloth for similar decision-making or reinforcement learning tasks.
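For example, the adapter could be attached to the base model with plain Transformers + PEFT, without Unsloth. This is a minimal sketch; the repo id Slyracoon23/dynastai-grpo-lora is taken from this card, and the prompt and generation settings are illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the LoRA adapter from the Hub
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", device_map="auto")
model = PeftModel.from_pretrained(base, "Slyracoon23/dynastai-grpo-lora")

# Build a prompt with the chat template and generate a short answer
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Current Metrics: ...\nOptions:\n1. ...\n2. ..."}],
    add_generation_prompt=True, tokenize=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))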

Out-of-Scope Use

  • Not suitable for open-ended text generation or tasks outside the decision-making context for which it was trained.
  • Not intended for real-world policy or governance advice.

Bias, Risks, and Limitations

  • The model is trained on synthetic data and may not generalize to real-world scenarios.
  • It may reflect biases present in the prompt generation logic or reward function.
  • The model is only as good as the reward function and data used for training.

Recommendations

Users should be aware that the model is designed for a game simulation and not for real-world decision-making. Outputs should be reviewed for appropriateness in any new context.

How to Get Started with the Model

from unsloth import FastLanguageModel
from vllm import SamplingParams

# Load base model and tokenizer (fast_inference=True enables the vLLM backend)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-1.7B",
    max_seq_length=4096,
    load_in_4bit=False,
    fast_inference=True,
    max_lora_rank=16,
    gpu_memory_utilization=0.7,
)

# Load the LoRA adapter; with the vLLM backend, load_lora returns a request
# object that is passed to fast_generate below
lora_request = model.load_lora("dynast_ai_grpo_lora")  # or use the Hugging Face repo path

# Prepare prompt as in the training script
messages = [
    {"role": "system", "content": "You are a royal advisor..."},
    {"role": "user", "content": "Current Metrics: ...\nOptions:\n1. ...\n2. ..."},
]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Generate output (sampling parameters are illustrative)
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)
output = model.fast_generate(text, sampling_params=sampling_params, lora_request=lora_request)
print(output[0].outputs[0].text)
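Because training penalizes verbose completions, the model is expected to answer with little more than the option number. A simple, hypothetical way to turn the completion into a game action:

import re

completion = output[0].outputs[0].text
match = re.search(r"[12]", completion)
choice = int(match.group()) if match else None  # 1 or 2; None if the model answered unexpectedly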

Training Details

Training Data

  • Synthetic data generated from a set of Reigns-style cards, with random kingdom metrics and programmatically determined optimal choices.
  • Each example includes the current metrics, a scenario prompt, two options, and the optimal choice (a generation sketch follows this list).
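A minimal sketch of how such an example could be generated. The card structure, metric names, and helper names here are illustrative assumptions; the actual generation script is not reproduced in this card.

import random

METRICS = ["Church", "People", "Military", "Treasury"]

def balance_score(metrics):
    # Lower is better: total distance of all four metrics from the 50 midpoint
    return sum(abs(metrics[m] - 50) for m in METRICS)

def make_example(card):
    # card: {"prompt": str, "options": [{"text": str, "effects": {metric: delta}}, ...]}
    current = {m: random.randint(10, 90) for m in METRICS}
    scores = []
    for option in card["options"]:
        after = {m: current[m] + option["effects"].get(m, 0) for m in METRICS}
        scores.append(balance_score(after))
    optimal = scores.index(min(scores)) + 1  # 1-based index of the better option
    return {
        "metrics": current,
        "prompt": card["prompt"],
        "options": [o["text"] for o in card["options"]],
        "optimal_choice": optimal,
    }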

Training Procedure

  • Framework: Unsloth, PEFT, TRL (GRPOTrainer)
  • Reward Function: negative sum of each metric's distance from 50 after the model's choice is applied (sketched after this list).
  • Secondary Reward: penalty for verbose responses, encouraging single-token answers.
  • Batch size: 4 (per device)
  • Gradient accumulation: 4
  • Epochs: 1
  • Learning rate: 2e-4
  • Optimizer: AdamW (8-bit)
  • Max prompt length: 4096
  • Max completion length: 2048
  • LoRA rank: 16
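A minimal sketch of reward functions in the shape expected by TRL's GRPOTrainer (one score per completion; the extra dataset columns metrics and option_effects are hypothetical names passed through as keyword arguments):

# Primary reward: the closer all four metrics stay to 50 after the chosen option, the higher the reward
def balance_reward(completions, metrics, option_effects, **kwargs):
    rewards = []
    for completion, current, effects in zip(completions, metrics, option_effects):
        text = completion[0]["content"]       # conversational format: list of messages
        choice = "1" if "1" in text else "2"  # crude parse; single-token answers are expected
        after = {m: current[m] + effects[choice].get(m, 0) for m in current}
        rewards.append(-sum(abs(v - 50) for v in after.values()))
    return rewards

# Secondary reward: penalize verbose answers to encourage a single-token choice
def brevity_reward(completions, **kwargs):
    return [-len(completion[0]["content"].split()) for completion in completions]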

Training Hyperparameters

  • Training regime: fp16 mixed precision (default for Unsloth/PEFT)
  • Seed: 3407
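The hyperparameters above map roughly onto the following TRL configuration. This is a sketch, not the exact training script; model and dataset refer to the Unsloth-prepared model and the synthetic dataset described above, and balance_reward / brevity_reward are the reward sketches from the previous section.

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    optim="adamw_8bit",
    max_prompt_length=4096,
    max_completion_length=2048,
    fp16=True,
    seed=3407,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,                                    # LoRA-wrapped Qwen3-1.7B
    reward_funcs=[balance_reward, brevity_reward],  # see the reward sketch above
    args=training_args,
    train_dataset=dataset,                          # synthetic Reigns-style examples
)
trainer.train()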

Speeds, Sizes, Times

  • [More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Synthetic test set generated in the same way as the training data.

Factors

  • Randomly sampled kingdom metrics and scenario cards.

Metrics

  • Accuracy: Percentage of times the model's choice matches the programmatically determined optimal choice.
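A minimal sketch of how this accuracy could be computed, assuming the hypothetical example format from the Training Data section:

def accuracy(model_choices, test_examples):
    # model_choices: list of 1/2 picks produced by the model
    # test_examples: list of dicts with an "optimal_choice" field
    correct = sum(choice == ex["optimal_choice"]
                  for choice, ex in zip(model_choices, test_examples))
    return correct / len(test_examples)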

Results

  • On a sample of 10 test prompts, the model achieved an accuracy of X/10 (update with your actual result).

Summary

The model reliably selects the optimal choice in the majority of test cases, demonstrating its ability to balance multiple metrics in a simulated environment.

Model Examination

  • [More Information Needed]

Environmental Impact

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications

Model Architecture and Objective

  • Qwen3-1.7B base model with LoRA adapters (rank 16) for efficient fine-tuning.
  • Objective: Minimize the sum of distances from 50 for all four kingdom metrics after each decision.

Compute Infrastructure

  • [More Information Needed]

Hardware

  • [More Information Needed]

Software

  • Python, Unsloth, PEFT 0.15.2, TRL, Hugging Face Transformers, vLLM

Citation

BibTeX:

@misc{dynastai-grpo-lora,
  author = {Earl Potters},
  title = {DynastAI GRPO LoRA Adapter},
  year = {2024},
  howpublished = {\url{https://huggingface.co/Slyracoon23/dynastai-grpo-lora}},
}

Glossary

  • LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning method.
  • GRPO: Group Relative Policy Optimization, the reinforcement-learning algorithm implemented by TRL's GRPOTrainer.

More Information

Model Card Authors

  • Earl Potters

Model Card Contact

Framework versions

  • PEFT 0.15.2
  • Unsloth (latest as of June 2024)
  • TRL (latest as of June 2024)