---
base_model:
- unsloth/Qwen2.5-7B-Instruct-bnb-4bit
tags:
- transformers
- unsloth
- trl
- qwen2.5
- lora
license: apache-2.0
language:
- en
- zh
datasets:
- openai/gsm8k
pipeline_tag: text-generation
library_name: peft
---
This model was trained with reinforcement learning on the GSM8K dataset, learning to generate reasoning chains and formatted outputs despite the dataset lacking explicit intermediate reasoning steps. A reward function guides training, prioritizing answer correctness and adherence to an XML output format.
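The reward function itself is not published in this card. Below is a minimal sketch of the idea, written in the list-in, list-out style that TRL's GRPO reward functions use. The `<reasoning>`/`<answer>` tag schema and the reward magnitudes (2.0 for a correct answer, 0.5 for correct formatting) are illustrative assumptions, not details confirmed by this card.

```python
import re

# Assumed XML schema: the exact tags used during training are not
# documented in this card.
XML_PATTERN = re.compile(
    r"<reasoning>.*?</reasoning>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)

def extract_answer(text: str) -> str | None:
    """Pull the final answer out of the assumed XML format."""
    match = XML_PATTERN.search(text)
    return match.group(1).strip() if match else None

def correctness_reward(completions, answer, **kwargs):
    """Large reward when the extracted answer matches the GSM8K gold answer."""
    return [
        2.0 if extract_answer(c) == a else 0.0
        for c, a in zip(completions, answer)
    ]

def format_reward(completions, **kwargs):
    """Smaller reward for following the XML format, independent of correctness."""
    return [0.5 if XML_PATTERN.search(c) else 0.0 for c in completions]
```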
Training Details:
- Dataset: GSM8K
- Algorithm: GRPO (Group Relative Policy Optimization)
- Hardware: Single NVIDIA GeForce RTX 3090 Ti
- Training Duration: 250 epochs (~48 minutes)
The output length limit (200 tokens) restricts the model's ability to generate complex reasoning chains, which makes it hard to observe the growth in output length that would otherwise emerge during training.
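For reference, here is a minimal training sketch using Unsloth with TRL's GRPOTrainer, reusing the reward functions sketched above. Only the 200-token completion limit and the count of 250 come from this card; the LoRA rank, learning rate, sequence length, and number of generations per prompt are assumptions, and the steps-vs-epochs mapping is unverified.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from unsloth import FastLanguageModel

# Load the 4-bit base model and attach a LoRA adapter via Unsloth.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=1024,  # assumption; not stated in this card
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=16)  # rank/alpha assumed

def to_prompt(example):
    # GSM8K gold answers end with "#### <number>"; keep only the number.
    return {
        "prompt": example["question"],
        "answer": example["answer"].split("####")[-1].strip(),
    }

train_dataset = load_dataset("openai/gsm8k", "main", split="train").map(to_prompt)

config = GRPOConfig(
    output_dir="outputs",
    max_completion_length=200,  # the output length limit discussed above
    num_generations=8,          # completions sampled per prompt; assumption
    max_steps=250,              # "250" from this card; steps vs. epochs unverified
    learning_rate=5e-6,         # assumption
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward, format_reward],  # sketched above
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```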
Example prompt:
Which one is bigger? 9.11 or 9.8?
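No sample completion is recorded in this card. The snippet below sketches how one might query the adapter with transformers and peft; the adapter ID is a placeholder, the system prompt is an assumption (the exact prompt used to elicit the XML format is not stated), and generation is capped at the 200-token training limit.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
adapter_id = "your-username/this-adapter"  # hypothetical; substitute this repo's ID

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)

messages = [
    # The system prompt is an assumption: this card does not state the exact
    # prompt used during training to elicit the XML format.
    {"role": "system", "content": "Answer in <reasoning>...</reasoning><answer>...</answer> format."},
    {"role": "user", "content": "Which one is bigger? 9.11 or 9.8?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Cap generation at the 200-token limit used during training.
output = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```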

This Qwen2.5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.

