---
library_name: transformers
base_model: giux78/zagreus-test-202000
tags:
- generated_from_trainer
model-index:
- name: ale_outputs/opendata-sft-chatml-final
  results: []
---

[Built with Axolotl](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.9.2`
```yaml
# ============= SFT DEBUG (~1M conv) =============
base_model: giux78/zagreus-test-202000 #/leonardo_work/EUHPC_A04_045/training/ale_outputs/opendata-sft-chatml-phase1 #giux78/zagreus-test-202000
strict: false

output_dir: ./ale_outputs/opendata-sft-chatml-final
seed: 42

chat_template_jinja: "{%- for message in messages -%}\n {{- \"<|im_start|>\" + message.role + \"\\n\" + message.content + \"<|im_end|>\" + \"\\n\" -}}\n{%- endfor -%}\n{%- if add_generation_prompt -%}\n\t{{- \"<|im_start|>assistant\\n\" -}}\n{%- endif -%}"

datasets:
  - path: /leonardo_work/EUHPC_A04_045/training/opendata-1000000
    type: chat_template
    field_messages: conversation
    roles_to_train: ["assistant"]
    train_on_eos: turn

#dataset_prepared_path: ./ale_outputs/dataset_cache/chatml-opendata-sft
#default_system_message: "Sei un assistente utile."
# chat_template: llama3

sequence_len: 4096
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true

# --- Cosine knobs (Axolotl) ---
# 1) Keep the LR constant at its maximum for 80% of the steps
cosine_constant_lr_ratio: 0.8
# 2) Floor at 30% of the max LR (min_lr = 4.5e-6)
cosine_min_lr_ratio: 0.3
# Do not set lr_div_factor when using cosine_min_lr_ratio

optimizer: adamw_torch_fused
lr_scheduler: constant   # <-- to isolate the behaviour
learning_rate: 1.0e-03 #2.0e-6
#warmup_ratio: 0.05      # a bit longer for debugging
#weight_decay: 0.01
max_grad_norm: 1.0

micro_batch_size: 1
gradient_accumulation_steps: 8

# Use max_steps for "more steps" regardless of the actual dataset length
max_steps: 1500          # ≈ 4x the current number of steps
#num_epochs: null        # epochs are ignored when max_steps is set

bf16: auto
flash_attention: true
gradient_checkpointing: true

logging_steps: 10
eval_strategy: steps
eval_steps: 100
save_strategy: steps
save_steps: 500
save_total_limit: 3
val_set_size: 10000

fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_state_dict_type: FULL_STATE_DICT

special_tokens:
  pad_token: <|im_end|>
  eos_token: <|im_end|>
tokens:
  - <|im_start|>
  - <|im_end|>
  -
  -
  -
  -
  -
  -
```

</details><br>
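The `chat_template_jinja` above is a standard ChatML layout. As a quick illustration, here is a minimal sketch (rendering the template directly with `jinja2`, using a made-up two-turn conversation) of the format the training conversations are serialized into:

```python
# Minimal sketch: render the chat_template_jinja string from the config above with jinja2
# to inspect the exact ChatML-style training format. The example messages are made up.
from jinja2 import Template

chat_template = (
    '{%- for message in messages -%}\n'
    ' {{- "<|im_start|>" + message.role + "\\n" + message.content + "<|im_end|>" + "\\n" -}}\n'
    '{%- endfor -%}\n'
    '{%- if add_generation_prompt -%}\n'
    '\t{{- "<|im_start|>assistant\\n" -}}\n'
    '{%- endif -%}'
)

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help you today?"},
]

print(Template(chat_template).render(messages=messages, add_generation_prompt=False))
# <|im_start|>user
# Hello!<|im_end|>
# <|im_start|>assistant
# Hi! How can I help you today?<|im_end|>
```

With `add_generation_prompt=True`, the template additionally appends `<|im_start|>assistant\n`, which is the prompt shape to use at inference time.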

# ale_outputs/opendata-sft-chatml-final

This model is a fine-tuned version of [giux78/zagreus-test-202000](https://huggingface.co/giux78/zagreus-test-202000) on the `/leonardo_work/EUHPC_A04_045/training/opendata-1000000` dataset (roughly 1M conversations, per the Axolotl config above), formatted with the ChatML template defined in that config; a minimal inference sketch is included at the end of this card.
It achieves the following results on the evaluation set:
- Loss: 0.9688

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 32
- gradient_accumulation_steps: 8
- total_train_batch_size: 256 (see the sanity-check sketch below)
- total_eval_batch_size: 32
- optimizer: ADAMW_TORCH_FUSED with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: constant
- lr_scheduler_warmup_steps: 14
- training_steps: 1500

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| No log        | 0.0020 | 1    | 3.7159          |
| 1.3414        | 0.2015 | 100  | 1.2867          |
| 1.2277        | 0.4030 | 200  | 1.1868          |
| 1.1801        | 0.6045 | 300  | 1.1344          |
| 1.1356        | 0.8060 | 400  | 1.0987          |
| 1.1131        | 1.0060 | 500  | 1.0737          |
| 1.0815        | 1.2076 | 600  | 1.0527          |
| 1.0572        | 1.4091 | 700  | 1.0366          |
| 1.0497        | 1.6106 | 800  | 1.0234          |
| 1.0291        | 1.8121 | 900  | 1.0116          |
| 1.0116        | 2.0121 | 1000 | 1.0026          |
| 1.0064        | 2.2136 | 1100 | 0.9925          |
| 0.9918        | 2.4151 | 1200 | 0.9853          |
| 0.9863        | 2.6166 | 1300 | 0.9783          |
| 0.9766        | 2.8181 | 1400 | 0.9721          |
| 0.96          | 3.0181 | 1500 | 0.9688          |

### Framework versions

- Transformers 4.56.2
- Pytorch 2.5.1+cu121
- Datasets 3.5.1
- Tokenizers 0.22.1
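As a sanity check of the effective batch size listed under the training hyperparameters, here is a minimal sketch of the arithmetic; the per-step token estimate assumes sample packing keeps each 4096-token window full, as configured above:

```python
# Reproduce total_train_batch_size from the hyperparameters above and estimate the
# token budget per optimizer step (assumes packed sequences fill the 4096-token window).
micro_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 32
sequence_len = 4096

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
tokens_per_step = total_train_batch_size * sequence_len

print(total_train_batch_size)  # 256
print(tokens_per_step)         # 1048576 -> roughly 1M tokens per optimizer step
```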
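For completeness, a minimal, non-authoritative inference sketch. It assumes the trained checkpoint is available at the `output_dir` from the config (`./ale_outputs/opendata-sft-chatml-final`, a local path rather than a published Hub repo) and that the saved tokenizer carries the ChatML chat template and `<|im_end|>` EOS token set during training:

```python
# Minimal inference sketch; the checkpoint path below is taken from output_dir in the
# config and is assumed to exist locally -- adjust it to wherever the weights live.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./ale_outputs/opendata-sft-chatml-final"  # assumed local path, not a Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a short poem about the sea."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# prompt ends with "<|im_start|>assistant\n", matching the template in the config

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,  # <|im_end|>, per the special_tokens above
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```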