# RL-Struct: Bridging the Structure Gap
We introduce RL-Struct, a lightweight reinforcement learning framework designed to close the "Structure Gap": the tension between probabilistic token generation and deterministic structured formats (e.g., JSON). By leveraging GRPO (Group Relative Policy Optimization) and a multi-dimensional reward function, our model achieves superior structural reliability without the inference-time overhead of constrained decoding.
## Key Features
- Multi-dimensional Reward Function: Decomposes the objective into Structure, Format, Validity, Correctness, and Length (a sketch follows this list).
- Efficient Training: Uses GRPO to eliminate the critic network, reducing VRAM usage by ~40% compared to PPO.
- Emergent Curriculum: The model spontaneously learns syntax (how to speak) before semantics (what to say).
- High Performance: Achieves 89.7% Structural Accuracy and 92.1% JSON Validity on complex recipe generation, outperforming LLaMA-3-8B and GPT-3.5.
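This card does not publish the exact reward weights or scoring rules, so the following is only a minimal sketch of how a five-dimensional reward could be composed for the recipe task; the weights, key names, and length budget here are all hypothetical.

```python
import json

# Hypothetical weights; the values actually used in training are not published.
WEIGHTS = {"structure": 0.3, "format": 0.2, "validity": 0.2,
           "correctness": 0.2, "length": 0.1}

def multi_dimensional_reward(completion: str, reference: dict) -> float:
    """Score one completion along the five dimensions named above."""
    scores = {k: 0.0 for k in WEIGHTS}

    # Validity: does the completion parse as JSON at all?
    try:
        obj = json.loads(completion)
        scores["validity"] = 1.0
    except json.JSONDecodeError:
        return 0.0  # nothing else can be scored without a parse

    # Structure: are the required top-level keys present?
    required = {"reasoning", "answer"}
    scores["structure"] = len(required & obj.keys()) / len(required)

    # Format: the "answer" field should itself be a JSON-encoded recipe.
    try:
        json.loads(obj.get("answer", ""))
        scores["format"] = 1.0
    except (json.JSONDecodeError, TypeError):
        scores["format"] = 0.0

    # Correctness: naive exact match against a reference answer.
    scores["correctness"] = float(obj.get("answer") == reference.get("answer"))

    # Length: penalize completions that ramble past a budget.
    scores["length"] = 1.0 if len(completion) < 2048 else 0.5

    return sum(WEIGHTS[k] * v for k, v in scores.items())
```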
## Model Details
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Training Method: GRPO (Reinforcement Learning) + LoRA (see the advantage sketch after this list)
- Task: Structured Output Generation (JSON Recipes, GSM8K-JSON, ToolUse)
- License: Apache-2.0
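For intuition on why GRPO needs no critic network, here is a minimal sketch of the group-relative advantage at its core; the group size and reward values are illustrative only.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO replaces the PPO critic with a group baseline: each completion's
    advantage is its reward standardized against the other completions
    sampled for the same prompt."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Example: 8 completions sampled for one prompt, scored by the reward function.
rewards = np.array([0.9, 0.4, 0.7, 0.1, 0.8, 0.3, 0.6, 0.5])
advantages = grpo_advantages(rewards)
# Completions above the group mean get a positive advantage, those below a
# negative one; no value network is trained, which is where the VRAM savings
# relative to PPO come from.
```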
## Usage
Use the following system prompt:

```text
You are a precise recipe assistant. Always respond in the following JSON format:
{
  "reasoning": "Your step-by-step reasoning here...",
  "answer": "{\"name\": \"Recipe Name\", \"nutrition\": \"Calories: ..., Protein: ..., Fat: ...\"}"
}
Do not include any other text, explanations, or markdown. Only output valid JSON.
```
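A minimal `transformers` example using this prompt; the user query and generation settings are illustrative assumptions, not published defaults.

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Freakz3z/Qwen-JSON"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

system_prompt = """You are a precise recipe assistant. Always respond in the following JSON format:
{
  "reasoning": "Your step-by-step reasoning here...",
  "answer": "{\\"name\\": \\"Recipe Name\\", \\"nutrition\\": \\"Calories: ..., Protein: ..., Fat: ...\\"}"
}
Do not include any other text, explanations, or markdown. Only output valid JSON."""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Give me a high-protein breakfast recipe."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:],
                            skip_special_tokens=True)
print(json.loads(response))  # raises an error if the output is not valid JSON
```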
## Performance
| Method | Structural Acc. | JSON Validity | Content Acc. |
|---|---|---|---|
| GPT-3.5 (Zero-shot) | 45.5% | 82.1% | 88.0% |
| LLaMA-3-8B (SFT) | 78.2% | 85.4% | 86.0% |
| RL-Struct (Ours) | 89.7% | 92.1% | 84.5% |
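The card does not define the two structural metrics precisely; one plausible reading, sketched below, is that JSON Validity only requires the raw output to parse, while Structural Accuracy additionally requires the expected nested schema.

```python
import json

def json_validity(output: str) -> bool:
    """Counts as valid if the raw output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def structural_accuracy(output: str) -> bool:
    """Stricter: must parse AND contain the expected nested recipe schema."""
    try:
        obj = json.loads(output)
        answer = json.loads(obj["answer"])
        return ({"reasoning", "answer"} <= obj.keys()
                and {"name", "nutrition"} <= answer.keys())
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
```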