
Model Card for dynastai-grpo-lora

This model is a LoRA adapter for the Qwen/Qwen3-1.7B base model, fine-tuned with Group Relative Policy Optimization (GRPO) to act as a royal advisor in the DynastAI kingdom simulation game. The model learns to select options that keep the four kingdom metrics (Church, People, Military, Treasury) close to 50.


Model Details

Model Description

This model is a LoRA adapter trained on synthetic decision-making data generated from the Reigns-style kingdom simulation. The model receives a prompt describing the current state of the kingdom and two possible choices, and it must select the option that best balances the four key metrics. The training process uses GRPO with a custom reward function to encourage balanced decisions and concise outputs.
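For illustration, a prompt of the kind described above could be assembled as follows. The scenario text, metric values, and system instruction here are hypothetical; the exact card wording used during training is not reproduced in this card.

messages = [
    {"role": "system", "content": "You are a royal advisor. Keep Church, People, Military and Treasury close to 50."},
    {"role": "user", "content": (
        "Current Metrics: Church: 62, People: 45, Military: 38, Treasury: 55\n"
        "The general requests funds to expand the army.\n"
        "Options:\n"
        "1. Approve the funding.\n"
        "2. Refuse the request.\n"
        "Answer with 1 or 2."
    )},
]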

  • Developed by: Earl Potters
  • Funded by [optional]: [More Information Needed]
  • Shared by [optional]: Slyracoon23
  • Model type: LoRA adapter for Causal Language Model (Qwen3-1.7B)
  • Language(s) (NLP): English
  • License: base model under the Qwen3-1.7B license (see Qwen3-1.7B); LoRA adapter weights under MIT
  • Finetuned from model [optional]: Qwen/Qwen3-1.7B

Model Sources

  • Repository: https://huggingface.co/Slyracoon23/dynastai-grpo-lora

Uses

Direct Use

This LoRA adapter is intended for use in the DynastAI kingdom simulation game as a decision-making agent. It can be used to generate balanced choices in scenarios where multiple kingdom metrics must be managed.

Downstream Use

The adapter can be plugged into any Qwen3-1.7B model instance using PEFT/Unsloth for similar decision-making or reinforcement learning tasks.
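For example, the adapter could be attached to the base model with plain Transformers + PEFT, without Unsloth. This is a minimal sketch; the repo id Slyracoon23/dynastai-grpo-lora is taken from this card, and the prompt and generation settings are illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the LoRA adapter from the Hub
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", device_map="auto")
model = PeftModel.from_pretrained(base, "Slyracoon23/dynastai-grpo-lora")

# Build a prompt with the chat template and generate a short answer
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Current Metrics: ...\nOptions:\n1. ...\n2. ..."}],
    add_generation_prompt=True, tokenize=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))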

Out-of-Scope Use

  • Not suitable for open-ended text generation or tasks outside the decision-making context for which it was trained.
  • Not intended for real-world policy or governance advice.

Bias, Risks, and Limitations

  • The model is trained on synthetic data and may not generalize to real-world scenarios.
  • It may reflect biases present in the prompt generation logic or reward function.
  • The model is only as good as the reward function and data used for training.

Recommendations

Users should be aware that the model is designed for a game simulation and not for real-world decision-making. Outputs should be reviewed for appropriateness in any new context.

How to Get Started with the Model

from unsloth import FastLanguageModel
from vllm import SamplingParams

# Load base model and tokenizer (fast_inference=True enables the vLLM backend)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-1.7B",
    max_seq_length=4096,
    load_in_4bit=False,
    fast_inference=True,
    max_lora_rank=16,
    gpu_memory_utilization=0.7,
)

# Load the LoRA adapter; with the vLLM backend, load_lora returns a request
# object that is passed to fast_generate below
lora_request = model.load_lora("dynast_ai_grpo_lora")  # or use the Hugging Face repo path

# Prepare prompt as in the training script
messages = [
    {"role": "system", "content": "You are a royal advisor..."},
    {"role": "user", "content": "Current Metrics: ...\nOptions:\n1. ...\n2. ..."},
]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Generate output (sampling parameters are illustrative)
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)
output = model.fast_generate(text, sampling_params=sampling_params, lora_request=lora_request)
print(output[0].outputs[0].text)
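Because training penalizes verbose completions, the model is expected to answer with little more than the option number. A simple, hypothetical way to turn the completion into a game action:

import re

completion = output[0].outputs[0].text
match = re.search(r"[12]", completion)
choice = int(match.group()) if match else None  # 1 or 2; None if the model answered unexpectedly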

Training Details

Training Data

  • Synthetic data generated from a set of Reigns-style cards, with random kingdom metrics and programmatically determined optimal choices.
  • Each example includes the current metrics, a scenario prompt, two options, and the optimal choice (a generation sketch follows this list).
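A minimal sketch of how such an example could be generated. The card structure, metric names, and helper names here are illustrative assumptions; the actual generation script is not reproduced in this card.

import random

METRICS = ["Church", "People", "Military", "Treasury"]

def balance_score(metrics):
    # Lower is better: total distance of all four metrics from the 50 midpoint
    return sum(abs(metrics[m] - 50) for m in METRICS)

def make_example(card):
    # card: {"prompt": str, "options": [{"text": str, "effects": {metric: delta}}, ...]}
    current = {m: random.randint(10, 90) for m in METRICS}
    scores = []
    for option in card["options"]:
        after = {m: current[m] + option["effects"].get(m, 0) for m in METRICS}
        scores.append(balance_score(after))
    optimal = scores.index(min(scores)) + 1  # 1-based index of the better option
    return {
        "metrics": current,
        "prompt": card["prompt"],
        "options": [o["text"] for o in card["options"]],
        "optimal_choice": optimal,
    }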

Training Procedure

  • Framework: Unsloth, PEFT, TRL (GRPOTrainer)
  • Reward Function: negative sum of each metric's distance from 50 after the model's choice is applied (sketched after this list).
  • Secondary Reward: penalty for verbose responses, encouraging single-token answers.
  • Batch size: 4 (per device)
  • Gradient accumulation: 4
  • Epochs: 1
  • Learning rate: 2e-4
  • Optimizer: AdamW (8-bit)
  • Max prompt length: 4096
  • Max completion length: 2048
  • LoRA rank: 16
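A minimal sketch of reward functions in the shape expected by TRL's GRPOTrainer (one score per completion; the extra dataset columns metrics and option_effects are hypothetical names passed through as keyword arguments):

# Primary reward: the closer all four metrics stay to 50 after the chosen option, the higher the reward
def balance_reward(completions, metrics, option_effects, **kwargs):
    rewards = []
    for completion, current, effects in zip(completions, metrics, option_effects):
        text = completion[0]["content"]       # conversational format: list of messages
        choice = "1" if "1" in text else "2"  # crude parse; single-token answers are expected
        after = {m: current[m] + effects[choice].get(m, 0) for m in current}
        rewards.append(-sum(abs(v - 50) for v in after.values()))
    return rewards

# Secondary reward: penalize verbose answers to encourage a single-token choice
def brevity_reward(completions, **kwargs):
    return [-len(completion[0]["content"].split()) for completion in completions]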

Training Hyperparameters

  • Training regime: fp16 mixed precision (default for Unsloth/PEFT)
  • Seed: 3407
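The hyperparameters above map roughly onto the following TRL configuration. This is a sketch, not the exact training script; model and dataset refer to the Unsloth-prepared model and the synthetic dataset described above, and balance_reward / brevity_reward are the reward sketches from the previous section.

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    optim="adamw_8bit",
    max_prompt_length=4096,
    max_completion_length=2048,
    fp16=True,
    seed=3407,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,                                    # LoRA-wrapped Qwen3-1.7B
    reward_funcs=[balance_reward, brevity_reward],  # see the reward sketch above
    args=training_args,
    train_dataset=dataset,                          # synthetic Reigns-style examples
)
trainer.train()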

Speeds, Sizes, Times

  • [More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Synthetic test set generated in the same way as the training data.

Factors

  • Randomly sampled kingdom metrics and scenario cards.

Metrics

  • Accuracy: Percentage of times the model's choice matches the programmatically determined optimal choice.
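A minimal sketch of how this accuracy could be computed, assuming the hypothetical example format from the Training Data section:

def accuracy(model_choices, test_examples):
    # model_choices: list of 1/2 picks produced by the model
    # test_examples: list of dicts with an "optimal_choice" field
    correct = sum(choice == ex["optimal_choice"]
                  for choice, ex in zip(model_choices, test_examples))
    return correct / len(test_examples)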

Results

  • On a sample of 10 test prompts, the model achieved an accuracy of X/10 (update with your actual result).

Summary

The model reliably selects the optimal choice in the majority of test cases, demonstrating its ability to balance multiple metrics in a simulated environment.

Model Examination

  • [More Information Needed]

Environmental Impact

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications

Model Architecture and Objective

  • Qwen3-1.7B base model with LoRA adapters (rank 16) for efficient fine-tuning.
  • Objective: Minimize the sum of distances from 50 for all four kingdom metrics after each decision.

Compute Infrastructure

  • [More Information Needed]

Hardware

  • [More Information Needed]

Software

  • Python, Unsloth, PEFT 0.15.2, TRL, Hugging Face Transformers, vLLM

Citation

BibTeX:

@misc{dynastai-grpo-lora,
  author = {Earl Potters},
  title = {DynastAI GRPO LoRA Adapter},
  year = {2024},
  howpublished = {\url{https://huggingface.co/Slyracoon23/dynastai-grpo-lora}},
}

Glossary

  • LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning method.
  • GRPO: Group Relative Policy Optimization, the reinforcement-learning algorithm implemented by TRL's GRPOTrainer.

More Information

Model Card Authors

  • Earl Potters

Model Card Contact

Framework versions

  • PEFT 0.15.2
  • Unsloth (latest as of June 2024)
  • TRL (latest as of June 2024)