RL-Struct: Bridging the Structure Gap

δΈ­ζ–‡η‰ˆζœ¬

We introduce RL-Struct, a lightweight Reinforcement Learning framework designed to solve the "Structure Gap"β€”the tension between probabilistic token generation and deterministic structured formats (e.g., JSON). By leveraging GRPO (Gradient Regularized Policy Optimization) and a Multi-dimensional Reward Function, our model achieves superior structural reliability without the high inference latency of constrained decoding.

πŸš€ Key Features

  • Multi-dimensional Reward Function: Decomposes the objective into Structure, Format, Validity, Correctness, and Length.
  • Efficient Training: Uses GRPO to eliminate the critic network, reducing VRAM usage by ~40% compared to PPO.
  • Emergent Curriculum: The model spontaneously learns syntax (how to speak) before semantics (what to say).
  • High Performance: Achieves 89.7% Structural Accuracy and 92.1% JSON Validity on complex recipe generation, outperforming LLaMA-3-8B and GPT-3.5.

πŸ“Š Model Details

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Training Method: GRPO (Reinforcement Learning) + LoRA
  • Task: Structured Output Generation (JSON Recipes, GSM8K-JSON, ToolUse)
  • License: Apache-2.0

πŸ› οΈ Usage

The following is the system prompt:

You are a precise recipe assistant. Always respond in the following JSON format:
{
  "reasoning": "Your step-by-step reasoning here...",
  "answer": "{\"name\": \"Recipe Name\", \"nutrition\": \"Calories: ..., Protein: ..., Fat: ...\"}"
}
Do not include any other text, explanations, or markdown. Only output valid JSON.

πŸ“ˆ Performance

Method Structural Acc. JSON Validity Content Acc.
GPT-3.5 (Zero-shot) 45.5% 82.1% 88.0%
LLaMA-3-8B (SFT) 78.2% 85.4% 86.0%
RL-Struct (Ours) 89.7% 92.1% 84.5%
Downloads last month
120
GGUF
Model size
4B params
Architecture
qwen3
Hardware compatibility
Log In to view the estimation

4-bit

Video Preview
loading

Model tree for Freakz3z/Qwen-JSON

Quantized
(143)
this model