---
base_model:
- unsloth/Qwen2.5-7B-Instruct-bnb-4bit
tags:
- transformers
- unsloth
- trl
- qwen2.5
- lora
license: apache-2.0
language:
- en
- zh
datasets:
- openai/gsm8k
pipeline_tag: text-generation
library_name: peft
---

This model was trained with reinforcement learning on the GSM8K dataset, learning to generate reasoning chains and XML-formatted outputs even though the dataset provides no intermediate steps for supervision. A reward function guides training, prioritizing answer correctness and adherence to the XML output format.
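
As a rough illustration, the reward can be split into a correctness term and a format term along the following lines. This is a minimal sketch: the function signatures, tag names, and reward weights shown here are assumptions, not the exact ones used in training.

```python
import re

def correctness_reward(prompts, completions, answers, **kwargs):
    """+2.0 when the extracted answer matches the GSM8K ground truth, else 0."""
    rewards = []
    for completion, answer in zip(completions, answers):
        match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
        predicted = match.group(1).strip() if match else ""
        rewards.append(2.0 if predicted == answer else 0.0)
    return rewards

def format_reward(prompts, completions, **kwargs):
    """+0.5 when the completion follows the <reasoning>/<answer> XML layout."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]
```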

**Training Details:**

* Dataset: GSM8K
* Algorithm: GRPO
* Hardware: Single NVIDIA GeForce RTX 3090 Ti
* Training Duration: 250 epochs, ~48 minutes
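
A GRPO run along these lines can be set up with Unsloth and TRL's `GRPOTrainer` roughly as sketched below. The hyperparameters, LoRA settings, and prompt construction are illustrative assumptions, not the exact configuration used for this checkpoint.

```python
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the 4-bit base model and attach LoRA adapters (illustrative settings).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Build prompts and gold answers from GSM8K; the final answer follows "####".
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {
    "prompt": x["question"],
    "answers": x["answer"].split("####")[-1].strip(),
})

training_args = GRPOConfig(
    learning_rate=5e-6,
    num_generations=8,          # completions sampled per prompt for GRPO
    max_prompt_length=256,
    max_completion_length=200,  # the output length limit noted under Limitations
    logging_steps=1,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward, format_reward],  # from the sketch above
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```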

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b36c0a26893eb6a6e63da3/r8Fz5cQtx38wcoZLDKQ_0.png)

**Limitations:**

The output length limit (200) restricts the model's ability to generate complex reasoning chains, which makes it difficult to observe growth in output length over the course of training.

**Example:**

Which one is bigger? 9.11 or 9.8?
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b36c0a26893eb6a6e63da3/gbfcQXMLOn-n_CsbSVpy7.png)
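
To try this prompt yourself, the LoRA adapter can be loaded on top of the base model roughly as follows. This is a sketch: the adapter repo id is a placeholder, and the generation settings are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
adapter_id = "<this-repo-id>"  # placeholder: replace with this model's repo id

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)

messages = [{"role": "user", "content": "Which one is bigger? 9.11 or 9.8?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```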


This Qwen2.5 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)