See axolotl config
axolotl version: 0.9.2
# ============= SFT DEBUG (~1M conv) =============
base_model: giux78/zagreus-test-202000 #/leonardo_work/EUHPC_A04_045/training/ale_outputs/opendata-sft-chatml-phase1 #giux78/zagreus-test-202000
strict: false
output_dir: ./ale_outputs/opendata-zagreus-sft-final
seed: 42
chat_template_jinja: "{%- for message in messages -%}\n {{- \"<|im_start|>\" + message.role + \"\\n\" + message.content + \"<|im_end|>\" + \"\\n\" -}}\n{%- endfor -%}\n{%- if add_generation_prompt -%}\n\t{{- \"<|im_start|>assistant\\n\" -}}\n{%- endif -%}"
datasets:
  - path: /leonardo_work/EUHPC_A04_045/training/openitaliandata #/leonardo_work/EUHPC_A04_045/training/opendata-1000000
    type: chat_template
    field_messages: conversation
    roles_to_train: ["assistant"]
    train_on_eos: turn
dataset_prepared_path: ./ale_outputs/dataset_cache/opendata-zagreus-sft
#default_system_message: "Sei un assistente utile."
# chat_template: llama3
sequence_len: 4096
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true
# --- Cosine knobs (Axolotl) ---
# 1) Keep the LR constant at its maximum for 80% of the steps
cosine_constant_lr_ratio: 0.8
# 2) Floor at 30% of the max LR (min_lr = 4.5e-6)
cosine_min_lr_ratio: 0.3
# Do not set lr_div_factor when using cosine_min_lr_ratio
optimizer: adamw_torch_fused
lr_scheduler: constant # <-- to isolate the scheduler behavior
learning_rate: 1.0e-03 #2.0e-6
#warmup_ratio: 0.05 # a bit longer while debugging
#weight_decay: 0.01
max_grad_norm: 1.0
micro_batch_size: 1
gradient_accumulation_steps: 8
# Use max_steps to run "more steps" regardless of the actual dataset length
#max_steps: 1500 # ≈ 4x the current number of steps
num_epochs: 3 # epochs are ignored when max_steps is set
bf16: auto
flash_attention: true
gradient_checkpointing: true
logging_steps: 10
eval_strategy: steps
eval_steps: 300
save_strategy: steps
save_steps: 500
save_total_limit: 3
val_set_size: 10000
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_state_dict_type: FULL_STATE_DICT
special_tokens:
  pad_token: <|im_end|>
  eos_token: <|im_end|>
tokens:
  - <|im_start|>
  - <|im_end|>
  - <tool_response>
  - </tool_response>
  - <tool_call>
  - </tool_call>
  - <code>
  - </code>
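A quick way to see what the `chat_template_jinja` above actually produces is to render it directly with `jinja2`. This is a minimal sketch with an illustrative two-turn conversation, not part of the training pipeline; only the template string comes from the config.

```python
from jinja2 import Template

# ChatML template reproduced from the chat_template_jinja field in the config above
chat_template = (
    "{%- for message in messages -%}\n"
    "    {{- \"<|im_start|>\" + message.role + \"\\n\" + message.content + \"<|im_end|>\" + \"\\n\" -}}\n"
    "{%- endfor -%}\n"
    "{%- if add_generation_prompt -%}\n"
    "    {{- \"<|im_start|>assistant\\n\" -}}\n"
    "{%- endif -%}"
)

# Purely illustrative conversation (not taken from the training data)
messages = [
    {"role": "user", "content": "Ciao!"},
    {"role": "assistant", "content": "Ciao! Come posso aiutarti?"},
]

print(Template(chat_template).render(messages=messages, add_generation_prompt=True))
# <|im_start|>user
# Ciao!<|im_end|>
# <|im_start|>assistant
# Ciao! Come posso aiutarti?<|im_end|>
# <|im_start|>assistant
```

Per `roles_to_train: ["assistant"]` and `train_on_eos: turn`, only the assistant turns, including their closing `<|im_end|>`, contribute to the loss.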
ale_outputs/opendata-zagreus-sft-final
This model is a fine-tuned version of giux78/zagreus-test-202000 on the openitaliandata dataset configured above. It achieves the following results on the evaluation set:
- Loss: 1.3634
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 32
- gradient_accumulation_steps: 8
- total_train_batch_size: 256
- total_eval_batch_size: 32
- optimizer: adamw_torch_fused with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: constant
- lr_scheduler_warmup_steps: 100
- num_epochs: 3.0
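The reported total_train_batch_size is just the product of the per-device micro-batch size, the gradient accumulation steps, and the number of devices. A quick check (the token count assumes fully packed sequences, which is an approximation):

```python
# Values taken from the config and the hyperparameter list above
micro_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 32
sequence_len = 4096

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)                 # 256, as reported above

# With sample_packing enabled, each packed sequence is close to sequence_len tokens,
# so one optimizer step sees roughly this many tokens:
print(total_train_batch_size * sequence_len)  # 1048576 (~1M tokens per step)
```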
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| No log | 0.0008 | 1 | 3.6573 |
| 1.2502 | 0.2543 | 300 | 1.6034 |
| 1.1656 | 0.5086 | 600 | 1.5197 |
| 1.1173 | 0.7630 | 900 | 1.4731 |
| 1.0834 | 1.0170 | 1200 | 1.4429 |
| 1.0612 | 1.2713 | 1500 | 1.4229 |
| 1.0396 | 1.5256 | 1800 | 1.4073 |
| 1.0271 | 1.7799 | 2100 | 1.3946 |
| 1.0185 | 2.0339 | 2400 | 1.3840 |
| 1.0129 | 2.2882 | 2700 | 1.3761 |
| 0.9896 | 2.5425 | 3000 | 1.3693 |
| 0.9921 | 2.7969 | 3300 | 1.3634 |
Framework versions
- Transformers 4.56.2
- Pytorch 2.5.1+cu121
- Datasets 3.5.1
- Tokenizers 0.22.1
Model tree for giux78/open-zagreus-350M-sft
- Base model: giux78/zagreus-test-202000
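A minimal inference sketch for the published checkpoint, assuming the saved tokenizer carries the ChatML template from the config above; the prompt is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "giux78/open-zagreus-350M-sft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The tokenizer's chat template renders the conversation into ChatML
messages = [{"role": "user", "content": "Qual è la capitale d'Italia?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    eos_token_id=tokenizer.eos_token_id,  # <|im_end|>, per the special_tokens section above
)
# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```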