---
license: other
library_name: transformers
tags:
- generated_from_trainer
base_model: Qwen/Qwen2.5-72B
datasets:
- anthracite-org/kalo-opus-instruct-22k-no-refusal
- Nopm/Opus_WritingStruct
- Gryphe/Sonnet3.5-SlimOrcaDedupCleaned
- Gryphe/Sonnet3.5-Charcard-Roleplay
- Gryphe/ChatGPT-4o-Writing-Prompts
- Epiculous/Synthstruct-Gens-v1.1-Filtered-n-Cleaned
- Epiculous/SynthRP-Gens-v1.1-Filtered-n-Cleaned
- nothingiisreal/Reddit-Dirty-And-WritingPrompts
- allura-org/Celeste-1.x-data-mixture
- cognitivecomputations/dolphin-2.9.3
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
model-index:
- name: EVA-Qwen2.5-72B-SFFT-v0.2
  results: []
---

# EVA Qwen2.5-72B v0.2

<p>
An RP/storywriting specialist model, a full-parameter finetune of Qwen2.5-72B on a mixture of synthetic and natural data.<br>
It uses the Celeste 70B 0.1 data mixture, greatly expanding it to improve the versatility, creativity and "flavor" of the resulting model.<br>
</p>

<p>Dedicated to Nev.</p>

<p><b>NOTE: LLM-Compressor quants don't seem to work correctly; quality appears to be much worse than normal. This wasn't the case with previous versions. GGUF and GPTQ seem to be unaffected.</b></p>
<br>
<p><b>Version notes for 0.2</b>: Optimized training hyperparameters and increased sequence length. Better instruction following deeper into context and less repetition.</p>

<p>Prompt format is ChatML.</p>

<h3>Recommended sampler values:</h3>
<ul>
<li>Temperature: 0.8</li>
<li>Min-P: 0.05</li>
<li>Top-A: 0.3</li>
<li>Repetition Penalty: 1.03</li>
</ul>

<h3>Recommended SillyTavern preset (via CalamitousFelicitousness):</h3>
<ul><li><a href="https://huggingface.co/EVA-UNIT-01/EVA-Qwen2.5-72B-v0.2/blob/main/EV01.json">Master import</a></li></ul>
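
<p>Outside SillyTavern, the same settings map directly onto a standard Transformers generation call. The snippet below is a minimal sketch rather than an official example: it assumes this repository's model id (<code>EVA-UNIT-01/EVA-Qwen2.5-72B-v0.2</code>), that the repository's tokenizer ships the ChatML chat template used via <code>apply_chat_template</code>, and a recent <code>transformers</code> release with <code>min_p</code> support. Top-A is not a built-in Transformers sampler, so it is left to backends that implement it.</p>

```python
# Minimal sketch (not an official example): ChatML prompting with the
# recommended sampler values, using the standard transformers API.
# Assumes a recent transformers release (min_p support) and enough
# GPU memory; device_map="auto" spreads the 72B model across devices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EVA-UNIT-01/EVA-Qwen2.5-72B-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# ChatML conversation; apply_chat_template renders the
# <|im_start|>role ... <|im_end|> turns for us.
messages = [
    {"role": "system", "content": "You are a creative writing partner."},
    {"role": "user", "content": "Write the opening scene of a rainy noir story."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Recommended sampler values from this card. Top-A (0.3) is not available
# in transformers' generate(), so it is omitted here.
output_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.8,
    min_p=0.05,
    repetition_penalty=1.03,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```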
68
+
69
+ <p>
70
+ <br>
71
+ <h3>
72
+ Training data:
73
+ </h3>
74
+ <ul>
75
+ <li>Celeste 70B 0.1 data mixture minus Opus Instruct subset. See that model's <a href=https://huggingface.co/nothingiisreal/L3.1-70B-Celeste-V0.1-BF16>card</a> for details.</li>
76
+ <li>Kalomaze's Opus_Instruct_25k dataset, filtered for refusals.</li>
77
+ <li>A subset (1k rows) of ChatGPT-4o-WritingPrompts by Gryphe</li>
78
+ <li>A subset (2k rows) of Sonnet3.5-Charcards-Roleplay by Gryphe</li>
79
+ <li>Synthstruct and SynthRP datasets by Epiculous</li>
80
+ <li>A subset from Dolphin-2.9.3, including filtered version of not_samantha and a small subset of systemchat.</li>
81
+ </ul>
82
+ <h3>
83
+ Training time and hardware:
84
+ </h3>
85
+ <ul><li>17 hours on 8xH100 SXM</a></li></ul><br>
86
+ </p>
87
+ <p>Model was created by Kearm, Auri and Cahvay.</p>
88
+ <h4>Special thanks:</h4><ul>
89
+ <li>to Featherless for sponsoring this run</li>
90
+ <li>to Cahvay for his work on investigating and reprocessing the corrupted dataset, removing the single biggest source of data poisoning.</li>
91
+ <li>to Gryphe, Lemmy, Kalomaze, Nopm, Epiculous and CognitiveComputations for the data</li>
92
+ <li>and to Allura-org for support, feedback, beta-testing and doing quality control of EVA models.</li></ul>
93
+
94
+ <h3>Statement about change in licensing for the future models.</h3>
95
+ <p>For all future EVA-Unit-01 models, there will be a provision in the license stating that Infermatic and any of its employees or paid associates cannot utilize, distribute, download, or otherwise make use of EVA models.
96
+ While this cannot retroactively apply to our licensing, we officially request Infermatic immediately cease use of our models for unwarranted profit, although we acknowledge at this point it will not likely be followed.
97
+ EVA models will still be available in the future on Featherless, ArliAI (in the future), and other providers who want to host them, as well as for local and cloud usage.</p>
98
+
99
+
[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.4.1`
```yaml
base_model: Qwen/Qwen2.5-72B

load_in_8bit: false
load_in_4bit: false
strict: false

plugins:
- axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

# plugins:
# - axolotl.integrations.spectrum.SpectrumPlugin

# spectrum_top_fraction: 0.5
# # Optional if using a pre-scanned model as your base_model. Useful if using a model mirror
# spectrum_model_name: Qwen/Qwen2.5-32B

datasets:
- path: datasets/Celeste_Filtered_utf8fix.jsonl
  type: sharegpt
- path: datasets/deduped_not_samantha_norefusals.jsonl
  type: sharegpt
- path: datasets/deduped_SynthRP-Gens_processed_ShareGPT_converted_cleaned.jsonl
  type: sharegpt
- path: datasets/deduped_Synthstruct-Gens_processed_sharegpt_converted_cleaned.jsonl
  type: sharegpt
- path: datasets/Gryphe-4o-WP-filtered-sharegpt_utf8fix.jsonl
  type: sharegpt
- path: datasets/opus-instruct-22k-no_refusals-filtered_utf8fix.jsonl
  type: sharegpt
- path: datasets/Sonnet3-5-charcard-names-filtered-sharegpt_utf8fix.jsonl
  type: sharegpt
- path: datasets/SystemChat_subset_filtered_sharegpt_utf8fix.jsonl
  type: sharegpt

chat_template: chatml
shuffle_merged_datasets: true
val_set_size: 0.001
output_dir: EVA-Qwen2.5-72B-SFFT-v0.2

sequence_len: 10240
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: false

# adapter: qlora
# lora_model_dir:
# lora_r: 64
# lora_alpha: 128
# lora_dropout: 0.05
# lora_target_linear: true
# peft_use_dora: true

unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# mlp.down_proj layers
- model.layers.62.mlp.down_proj
- model.layers.64.mlp.down_proj
- model.layers.63.mlp.down_proj
- model.layers.66.mlp.down_proj
- model.layers.65.mlp.down_proj
- model.layers.67.mlp.down_proj
- model.layers.68.mlp.down_proj
- model.layers.31.mlp.down_proj
- model.layers.60.mlp.down_proj
- model.layers.69.mlp.down_proj
- model.layers.61.mlp.down_proj
- model.layers.59.mlp.down_proj
- model.layers.30.mlp.down_proj
- model.layers.70.mlp.down_proj
- model.layers.32.mlp.down_proj
- model.layers.34.mlp.down_proj
- model.layers.33.mlp.down_proj
- model.layers.76.mlp.down_proj
- model.layers.72.mlp.down_proj
- model.layers.71.mlp.down_proj
- model.layers.58.mlp.down_proj
- model.layers.75.mlp.down_proj
- model.layers.29.mlp.down_proj
- model.layers.56.mlp.down_proj
- model.layers.26.mlp.down_proj
- model.layers.35.mlp.down_proj
- model.layers.28.mlp.down_proj
- model.layers.57.mlp.down_proj
- model.layers.77.mlp.down_proj
- model.layers.36.mlp.down_proj
- model.layers.27.mlp.down_proj
- model.layers.25.mlp.down_proj
- model.layers.78.mlp.down_proj
- model.layers.37.mlp.down_proj
- model.layers.73.mlp.down_proj
- model.layers.55.mlp.down_proj
- model.layers.54.mlp.down_proj
- model.layers.74.mlp.down_proj
- model.layers.24.mlp.down_proj
- model.layers.53.mlp.down_proj
# mlp.gate_proj layers
- model.layers.78.mlp.gate_proj
- model.layers.77.mlp.gate_proj
- model.layers.76.mlp.gate_proj
- model.layers.79.mlp.gate_proj
- model.layers.75.mlp.gate_proj
- model.layers.74.mlp.gate_proj
- model.layers.73.mlp.gate_proj
- model.layers.72.mlp.gate_proj
- model.layers.71.mlp.gate_proj
- model.layers.70.mlp.gate_proj
- model.layers.69.mlp.gate_proj
- model.layers.57.mlp.gate_proj
- model.layers.54.mlp.gate_proj
- model.layers.55.mlp.gate_proj
- model.layers.68.mlp.gate_proj
- model.layers.63.mlp.gate_proj
- model.layers.53.mlp.gate_proj
- model.layers.44.mlp.gate_proj
- model.layers.45.mlp.gate_proj
- model.layers.49.mlp.gate_proj
- model.layers.58.mlp.gate_proj
- model.layers.46.mlp.gate_proj
- model.layers.56.mlp.gate_proj
- model.layers.67.mlp.gate_proj
- model.layers.62.mlp.gate_proj
- model.layers.50.mlp.gate_proj
- model.layers.64.mlp.gate_proj
- model.layers.52.mlp.gate_proj
- model.layers.40.mlp.gate_proj
- model.layers.43.mlp.gate_proj
- model.layers.48.mlp.gate_proj
- model.layers.66.mlp.gate_proj
- model.layers.47.mlp.gate_proj
- model.layers.59.mlp.gate_proj
- model.layers.65.mlp.gate_proj
- model.layers.61.mlp.gate_proj
- model.layers.60.mlp.gate_proj
- model.layers.42.mlp.gate_proj
- model.layers.51.mlp.gate_proj
- model.layers.41.mlp.gate_proj
# mlp.up_proj layers
- model.layers.70.mlp.up_proj
- model.layers.69.mlp.up_proj
- model.layers.71.mlp.up_proj
- model.layers.68.mlp.up_proj
- model.layers.72.mlp.up_proj
- model.layers.67.mlp.up_proj
- model.layers.66.mlp.up_proj
- model.layers.73.mlp.up_proj
- model.layers.46.mlp.up_proj
- model.layers.63.mlp.up_proj
- model.layers.75.mlp.up_proj
- model.layers.76.mlp.up_proj
- model.layers.74.mlp.up_proj
- model.layers.45.mlp.up_proj
- model.layers.62.mlp.up_proj
- model.layers.64.mlp.up_proj
- model.layers.65.mlp.up_proj
- model.layers.44.mlp.up_proj
- model.layers.53.mlp.up_proj
- model.layers.47.mlp.up_proj
- model.layers.49.mlp.up_proj
- model.layers.48.mlp.up_proj
- model.layers.57.mlp.up_proj
- model.layers.43.mlp.up_proj
- model.layers.42.mlp.up_proj
- model.layers.56.mlp.up_proj
- model.layers.61.mlp.up_proj
- model.layers.54.mlp.up_proj
- model.layers.40.mlp.up_proj
- model.layers.55.mlp.up_proj
- model.layers.77.mlp.up_proj
- model.layers.60.mlp.up_proj
- model.layers.41.mlp.up_proj
- model.layers.35.mlp.up_proj
- model.layers.37.mlp.up_proj
- model.layers.58.mlp.up_proj
- model.layers.34.mlp.up_proj
- model.layers.38.mlp.up_proj
- model.layers.33.mlp.up_proj
- model.layers.39.mlp.up_proj
# self_attn.k_proj layers
- model.layers.36.self_attn.k_proj
- model.layers.79.self_attn.k_proj
- model.layers.35.self_attn.k_proj
- model.layers.34.self_attn.k_proj
- model.layers.37.self_attn.k_proj
- model.layers.33.self_attn.k_proj
- model.layers.38.self_attn.k_proj
- model.layers.39.self_attn.k_proj
- model.layers.74.self_attn.k_proj
- model.layers.77.self_attn.k_proj
- model.layers.41.self_attn.k_proj
- model.layers.69.self_attn.k_proj
- model.layers.32.self_attn.k_proj
- model.layers.78.self_attn.k_proj
- model.layers.30.self_attn.k_proj
- model.layers.70.self_attn.k_proj
- model.layers.25.self_attn.k_proj
- model.layers.42.self_attn.k_proj
- model.layers.29.self_attn.k_proj
- model.layers.31.self_attn.k_proj
- model.layers.68.self_attn.k_proj
- model.layers.66.self_attn.k_proj
- model.layers.22.self_attn.k_proj
- model.layers.65.self_attn.k_proj
- model.layers.44.self_attn.k_proj
- model.layers.40.self_attn.k_proj
- model.layers.63.self_attn.k_proj
- model.layers.23.self_attn.k_proj
- model.layers.28.self_attn.k_proj
- model.layers.24.self_attn.k_proj
- model.layers.26.self_attn.k_proj
- model.layers.67.self_attn.k_proj
- model.layers.75.self_attn.k_proj
- model.layers.27.self_attn.k_proj
- model.layers.57.self_attn.k_proj
- model.layers.64.self_attn.k_proj
- model.layers.71.self_attn.k_proj
- model.layers.61.self_attn.k_proj
- model.layers.72.self_attn.k_proj
- model.layers.73.self_attn.k_proj
# self_attn.o_proj layers
- model.layers.69.self_attn.o_proj
- model.layers.39.self_attn.o_proj
- model.layers.16.self_attn.o_proj
- model.layers.14.self_attn.o_proj
- model.layers.19.self_attn.o_proj
- model.layers.42.self_attn.o_proj
- model.layers.12.self_attn.o_proj
- model.layers.15.self_attn.o_proj
- model.layers.17.self_attn.o_proj
- model.layers.38.self_attn.o_proj
- model.layers.23.self_attn.o_proj
- model.layers.22.self_attn.o_proj
- model.layers.13.self_attn.o_proj
- model.layers.29.self_attn.o_proj
- model.layers.41.self_attn.o_proj
- model.layers.44.self_attn.o_proj
- model.layers.46.self_attn.o_proj
- model.layers.45.self_attn.o_proj
- model.layers.43.self_attn.o_proj
- model.layers.49.self_attn.o_proj
- model.layers.30.self_attn.o_proj
- model.layers.26.self_attn.o_proj
- model.layers.25.self_attn.o_proj
- model.layers.37.self_attn.o_proj
- model.layers.47.self_attn.o_proj
- model.layers.11.self_attn.o_proj
- model.layers.18.self_attn.o_proj
- model.layers.28.self_attn.o_proj
- model.layers.20.self_attn.o_proj
- model.layers.27.self_attn.o_proj
- model.layers.53.self_attn.o_proj
- model.layers.52.self_attn.o_proj
- model.layers.35.self_attn.o_proj
- model.layers.71.self_attn.o_proj
- model.layers.10.self_attn.o_proj
- model.layers.3.self_attn.o_proj
- model.layers.21.self_attn.o_proj
- model.layers.24.self_attn.o_proj
- model.layers.68.self_attn.o_proj
- model.layers.48.self_attn.o_proj
# self_attn.q_proj layers
- model.layers.1.self_attn.q_proj
- model.layers.2.self_attn.q_proj
- model.layers.3.self_attn.q_proj
- model.layers.0.self_attn.q_proj
- model.layers.5.self_attn.q_proj
- model.layers.4.self_attn.q_proj
- model.layers.6.self_attn.q_proj
- model.layers.8.self_attn.q_proj
- model.layers.7.self_attn.q_proj
- model.layers.9.self_attn.q_proj
- model.layers.10.self_attn.q_proj
- model.layers.68.self_attn.q_proj
- model.layers.25.self_attn.q_proj
- model.layers.12.self_attn.q_proj
- model.layers.54.self_attn.q_proj
- model.layers.55.self_attn.q_proj
- model.layers.61.self_attn.q_proj
- model.layers.18.self_attn.q_proj
- model.layers.49.self_attn.q_proj
- model.layers.66.self_attn.q_proj
- model.layers.72.self_attn.q_proj
- model.layers.11.self_attn.q_proj
- model.layers.52.self_attn.q_proj
- model.layers.64.self_attn.q_proj
- model.layers.15.self_attn.q_proj
- model.layers.60.self_attn.q_proj
- model.layers.50.self_attn.q_proj
- model.layers.59.self_attn.q_proj
- model.layers.53.self_attn.q_proj
- model.layers.48.self_attn.q_proj
- model.layers.57.self_attn.q_proj
- model.layers.70.self_attn.q_proj
- model.layers.17.self_attn.q_proj
- model.layers.67.self_attn.q_proj
- model.layers.71.self_attn.q_proj
- model.layers.62.self_attn.q_proj
- model.layers.51.self_attn.q_proj
- model.layers.19.self_attn.q_proj
- model.layers.58.self_attn.q_proj
- model.layers.13.self_attn.q_proj
# self_attn.v_proj layers
- model.layers.23.self_attn.v_proj
- model.layers.25.self_attn.v_proj
- model.layers.26.self_attn.v_proj
- model.layers.27.self_attn.v_proj
- model.layers.28.self_attn.v_proj
- model.layers.29.self_attn.v_proj
- model.layers.30.self_attn.v_proj
- model.layers.31.self_attn.v_proj
- model.layers.34.self_attn.v_proj
- model.layers.35.self_attn.v_proj
- model.layers.36.self_attn.v_proj
- model.layers.37.self_attn.v_proj
- model.layers.38.self_attn.v_proj
- model.layers.42.self_attn.v_proj
- model.layers.48.self_attn.v_proj
- model.layers.57.self_attn.v_proj
- model.layers.58.self_attn.v_proj
- model.layers.61.self_attn.v_proj
- model.layers.63.self_attn.v_proj
- model.layers.64.self_attn.v_proj
- model.layers.65.self_attn.v_proj
- model.layers.66.self_attn.v_proj
- model.layers.69.self_attn.v_proj
- model.layers.70.self_attn.v_proj
- model.layers.74.self_attn.v_proj
- model.layers.75.self_attn.v_proj
- model.layers.72.self_attn.v_proj
- model.layers.39.self_attn.v_proj
- model.layers.41.self_attn.v_proj
- model.layers.40.self_attn.v_proj
- model.layers.33.self_attn.v_proj
- model.layers.59.self_attn.v_proj
- model.layers.16.self_attn.v_proj
- model.layers.15.self_attn.v_proj
- model.layers.76.self_attn.v_proj
- model.layers.24.self_attn.v_proj
- model.layers.68.self_attn.v_proj
- model.layers.67.self_attn.v_proj
- model.layers.55.self_attn.v_proj
- model.layers.44.self_attn.v_proj


wandb_project: EVA-Qwen2.5-72B-SFFT-v0.2
wandb_entity:
wandb_watch:
wandb_name: Unit-02
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 3
optimizer: paged_ademamix_8bit
lr_scheduler: cosine
learning_rate: 0.00003
max_grad_norm: 1.5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: "unsloth"
# gradient_checkpointing_kwargs:
#   use_reentrant: true
early_stopping_patience:
resume_from_checkpoint: EVA-Qwen2.5-72B-SFFT-v0.2/checkpoint-128
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 20
evals_per_epoch: 4
saves_per_epoch: 4
save_safetensors: true
save_total_limit: 1
hub_model_id:
hub_strategy:
debug:
deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_params.json
weight_decay: 0.12
# fsdp:
# - full_shard
# - auto_wrap
# fsdp_config:
#   fsdp_limit_all_gathers: true
#   fsdp_sync_module_states: false
#   fsdp_offload_params: true
#   fsdp_cpu_ram_efficient_loading: true
#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
#   fsdp_transformer_layer_cls_to_wrap: Qwen2DecoderLayer
#   fsdp_activation_checkpointing: true
#   fsdp_state_dict_type: SHARDED_STATE_DICT # Changed from FULL_STATE_DICT
#   fsdp_sharding_strategy: FULL_SHARD
#   fsdp_forward_prefetch: false # Added
#   fsdp_backward_prefetch: "BACKWARD_PRE" # Added
#   fsdp_backward_prefetch_limit: 1 # Added
#   fsdp_mixed_precision: BF16 # Added
```

</details><br>
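
<p>The <code>unfrozen_parameters</code> list in the config above is what makes this an "SFFT" (selective full-parameter finetune) run: only the listed attention and MLP projections, plus <code>lm_head</code> and the embeddings, receive gradient updates, while the rest of the model stays frozen. As a rough illustration of the idea (a sketch only, under the assumption that the entries are treated as regular expressions matched against <code>named_parameters()</code> names), the equivalent freezing logic in plain PyTorch looks like this:</p>

```python
# Rough sketch of the selective-unfreezing idea behind `unfrozen_parameters`
# (assumption: entries are regex patterns matched against parameter names).
# Any parameter that matches no pattern stays frozen during training.
import re
from transformers import AutoModelForCausalLM

UNFROZEN_PATTERNS = [
    r"^lm_head.weight$",
    r"^model.embed_tokens.weight$",
    r"model.layers.62.mlp.down_proj",  # ...plus the rest of the list above
]

def apply_unfrozen_parameters(model, patterns):
    """Freeze every parameter, then unfreeze those matching any pattern."""
    compiled = [re.compile(p) for p in patterns]
    for name, param in model.named_parameters():
        param.requires_grad = any(rx.search(name) for rx in compiled)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable:,} / {total:,}")

# Illustrative usage (loading a 72B checkpoint just to inspect the freezing):
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-72B")
# apply_unfrozen_parameters(model, UNFROZEN_PATTERNS)
```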

<h3><a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard">Open LLM Leaderboard Evaluation Results</a></h3>

| Metric              | Value |
|---------------------|------:|
| Avg.                | 43.54 |
| IFEval (0-shot)     | 68.79 |
| BBH (3-shot)        | 59.07 |
| MATH Lvl 5 (4-shot) | 39.05 |
| GPQA (0-shot)       | 21.14 |
| MuSR (0-shot)       | 19.73 |
| MMLU-PRO (5-shot)   | 53.48 |