See axolotl config
axolotl version: 0.9.2
# ============= SFT DEBUG (~1M conv) =============
base_model: giux78/zagreus-test-202000 #/leonardo_work/EUHPC_A04_045/training/ale_outputs/opendata-sft-chatml-phase1 #giux78/zagreus-test-202000
strict: false
output_dir: ./ale_outputs/opendata-zagreus-sft-final
seed: 42
chat_template_jinja: "{%- for message in messages -%}\n {{- \"<|im_start|>\" + message.role + \"\\n\" + message.content + \"<|im_end|>\" + \"\\n\" -}}\n{%- endfor -%}\n{%- if add_generation_prompt -%}\n\t{{- \"<|im_start|>assistant\\n\" -}}\n{%- endif -%}"
datasets:
  - path: /leonardo_work/EUHPC_A04_045/training/openitaliandata #/leonardo_work/EUHPC_A04_045/training/opendata-1000000
    type: chat_template
    field_messages: conversation
    roles_to_train: ["assistant"]
    train_on_eos: turn
dataset_prepared_path: ./ale_outputs/dataset_cache/opendata-zagreus-sft
#default_system_message: "Sei un assistente utile."
# chat_template: llama3
sequence_len: 4096
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true
# --- Cosine knobs (Axolotl) ---
# 1) Keep the LR constant at its maximum for 80% of the steps
cosine_constant_lr_ratio: 0.8
# 2) Floor at 30% of the max LR (min_lr = 4.5e-6)
cosine_min_lr_ratio: 0.3
# Do not set lr_div_factor when using cosine_min_lr_ratio
optimizer: adamw_torch_fused
lr_scheduler: constant # <-- to isolate the scheduler behavior
learning_rate: 1.0e-03 #2.0e-6
#warmup_ratio: 0.05 # a bit longer while debugging
#weight_decay: 0.01
max_grad_norm: 1.0
micro_batch_size: 1
gradient_accumulation_steps: 8
# Use max_steps to run "more steps" regardless of the actual dataset length
#max_steps: 1500 # ≈ 4x the current number of steps
num_epochs: 3 # epochs are ignored when max_steps is set
bf16: auto
flash_attention: true
gradient_checkpointing: true
logging_steps: 10
eval_strategy: steps
eval_steps: 300
save_strategy: steps
save_steps: 500
save_total_limit: 3
val_set_size: 10000
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_state_dict_type: FULL_STATE_DICT
special_tokens:
  pad_token: <|im_end|>
  eos_token: <|im_end|>
tokens:
  - <|im_start|>
  - <|im_end|>
  - <tool_response>
  - </tool_response>
  - <tool_call>
  - </tool_call>
  - <code>
  - </code>
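A quick way to see what the `chat_template_jinja` above actually produces is to render it directly with `jinja2`. This is a minimal sketch with an illustrative two-turn conversation, not part of the training pipeline; only the template string comes from the config.

```python
from jinja2 import Template

# ChatML template reproduced from the chat_template_jinja field in the config above
chat_template = (
    "{%- for message in messages -%}\n"
    "    {{- \"<|im_start|>\" + message.role + \"\\n\" + message.content + \"<|im_end|>\" + \"\\n\" -}}\n"
    "{%- endfor -%}\n"
    "{%- if add_generation_prompt -%}\n"
    "    {{- \"<|im_start|>assistant\\n\" -}}\n"
    "{%- endif -%}"
)

# Purely illustrative conversation (not taken from the training data)
messages = [
    {"role": "user", "content": "Ciao!"},
    {"role": "assistant", "content": "Ciao! Come posso aiutarti?"},
]

print(Template(chat_template).render(messages=messages, add_generation_prompt=True))
# <|im_start|>user
# Ciao!<|im_end|>
# <|im_start|>assistant
# Ciao! Come posso aiutarti?<|im_end|>
# <|im_start|>assistant
```

Per `roles_to_train: ["assistant"]` and `train_on_eos: turn`, only the assistant turns, including their closing `<|im_end|>`, contribute to the loss.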
ale_outputs/opendata-zagreus-sft-final
This model is a fine-tuned version of giux78/zagreus-test-202000 on the openitaliandata dataset configured above. It achieves the following results on the evaluation set:
- Loss: 1.3634
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 32
- gradient_accumulation_steps: 8
- total_train_batch_size: 256
- total_eval_batch_size: 32
- optimizer: adamw_torch_fused with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: constant
- lr_scheduler_warmup_steps: 100
- num_epochs: 3.0
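The reported total_train_batch_size is just the product of the per-device micro-batch size, the gradient accumulation steps, and the number of devices. A quick check (the token count assumes fully packed sequences, which is an approximation):

```python
# Values taken from the config and the hyperparameter list above
micro_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 32
sequence_len = 4096

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)                 # 256, as reported above

# With sample_packing enabled, each packed sequence is close to sequence_len tokens,
# so one optimizer step sees roughly this many tokens:
print(total_train_batch_size * sequence_len)  # 1048576 (~1M tokens per step)
```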
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| No log | 0.0008 | 1 | 3.6573 |
| 1.2502 | 0.2543 | 300 | 1.6034 |
| 1.1656 | 0.5086 | 600 | 1.5197 |
| 1.1173 | 0.7630 | 900 | 1.4731 |
| 1.0834 | 1.0170 | 1200 | 1.4429 |
| 1.0612 | 1.2713 | 1500 | 1.4229 |
| 1.0396 | 1.5256 | 1800 | 1.4073 |
| 1.0271 | 1.7799 | 2100 | 1.3946 |
| 1.0185 | 2.0339 | 2400 | 1.3840 |
| 1.0129 | 2.2882 | 2700 | 1.3761 |
| 0.9896 | 2.5425 | 3000 | 1.3693 |
| 0.9921 | 2.7969 | 3300 | 1.3634 |
Framework versions
- Transformers 4.56.2
- Pytorch 2.5.1+cu121
- Datasets 3.5.1
- Tokenizers 0.22.1
Model tree for giux78/open-zagreus-350M-sft
- Base model: giux78/zagreus-test-202000
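A minimal inference sketch for the published checkpoint, assuming the saved tokenizer carries the ChatML template from the config above; the prompt is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "giux78/open-zagreus-350M-sft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The tokenizer's chat template renders the conversation into ChatML
messages = [{"role": "user", "content": "Qual è la capitale d'Italia?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    eos_token_id=tokenizer.eos_token_id,  # <|im_end|>, per the special_tokens section above
)
# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```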