---
library_name: transformers
base_model: giux78/zagreus-test-202000
tags:
- generated_from_trainer
model-index:
- name: ale_outputs/sft-prod-l3p32
  results: []
---

[Built with Axolotl](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.9.2`
```yaml
# ============= SFT PRODUCTION (4M ShareGPT) =============
base_model: giux78/zagreus-test-202000
strict: false
output_dir: ./ale_outputs/sft-prod-l3p32
seed: 42

# ---- Dataset ----
datasets:
  - path: /leonardo_work/EUHPC_A04_045/.data
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
    roles:
      user: ["human", "user"]
      assistant: ["gpt", "assistant"]
      system: ["system"]
      tool: ["tool"]
    roles_to_train: ["assistant"]   # loss only on assistant turns
    train_on_eos: turn              # predict <|eot_id|> at the end of each assistant reply

# (optional but recommended: reuse the pre-tokenized cache across runs)
dataset_prepared_path: ./ale_outputs/dataset_cache/sharegpt_4m_llama32_4096
default_system_message: "Sei un assistente utile."   # Italian for "You are a helpful assistant."

# ---- Chat template (Llama-3.2 style) ----
chat_template: jinja
chat_template_jinja: |
  {{- bos_token -}}
  {%- set has_system = messages and messages[0]['role'] == 'system' -%}
  {%- if has_system -%}
  {{- '<|start_header_id|>system<|end_header_id|>\n\n' + (messages[0]['content']|default('')) + '<|eot_id|>' -}}
  {%- set loop_messages = messages[1:] -%}
  {%- else -%}
  {%- set loop_messages = messages -%}
  {%- endif -%}
  {%- for m in loop_messages -%}
  {%- set role = m['role']|default('') -%}
  {%- set content = m['content']|default('', true) -%}
  {%- if content is string -%}
  {%- set text = content -%}
  {%- else -%}
  {%- set text = (content | map(attribute='text') | join('')) -%}
  {%- endif -%}
  {%- if text|trim|length > 0 -%}
  {%- if role == 'user' -%}
  {{- '<|start_header_id|>user<|end_header_id|>\n\n' + text + '<|eot_id|>' -}}
  {%- elif role == 'assistant' -%}
  {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' + text + '<|eot_id|>' -}}
  {%- elif role == 'system' -%}
  {{- '<|start_header_id|>system<|end_header_id|>\n\n' + text + '<|eot_id|>' -}}
  {%- endif -%}
  {%- endif -%}
  {%- endfor -%}
  {%- if add_generation_prompt and (loop_messages|length == 0 or (loop_messages|last)['role'] != 'assistant') -%}
  {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
  {%- else -%}
  {{- eos_token -}}
  {%- endif -%}

# ---- Training ----
sequence_len: 4096
sample_packing: true        # ON for efficiency
eval_sample_packing: true
pad_to_sequence_len: false

optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 1.5e-5
warmup_ratio: 0.03          # ~3% of total steps
weight_decay: 0.01
max_grad_norm: 1.0

# 32 GPUs total -> effective batch = 1 * 8 * 32 = 256
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 1.0             # 1 full epoch over the 4M conversations
# (alternative: use max_steps to stop earlier)

# ---- Precision & memory ----
bf16: auto
flash_attention: true
gradient_checkpointing: true

# ---- Log/Eval/Save ----
logging_steps: 20
evaluation_strategy: steps
eval_steps: 2000            # ~7-8 evals per epoch
save_strategy: steps
save_steps: 5000            # ~3 checkpoints per epoch
save_total_limit: 4

# (optional)
val_set_size: 10000         # if you want an automatic split from the dataset

# ---- Multi-node FSDP ----
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_state_dict_type: FULL_STATE_DICT

# ---- Special tokens (consistent with the base_model's tokenizer) ----
special_tokens:
  bos_token: <|begin_of_text|>
  pad_token: <|pad|>
  eos_token: <|end_of_text|>
  unk_token: <|unk|>
```

</details>
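As a quick sanity check on the parallelism comments in the config, the snippet below reproduces the effective batch size and the warmup-step count from the values above. It is illustrative arithmetic only, assuming the 32-GPU layout stated in the config comments; it is not part of the training code.

```python
# Illustrative arithmetic -- reproduces the batch/warmup numbers stated in the
# config comments and in the hyperparameter list further down this card.
micro_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 32  # 32 GPUs across nodes (from the config comment)

effective_batch = micro_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch)  # 256 -> matches total_train_batch_size

# The trainer reports 102 warmup steps for warmup_ratio = 0.03, which implies
# roughly 102 / 0.03 = 3400 optimizer steps for the single epoch.
warmup_ratio = 0.03
reported_warmup_steps = 102
print(round(reported_warmup_steps / warmup_ratio))  # ~3400
```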

# ale_outputs/sft-prod-l3p32

This model is a fine-tuned version of [giux78/zagreus-test-202000](https://huggingface.co/giux78/zagreus-test-202000) on a ShareGPT-format SFT dataset of roughly 4M conversations (see the Axolotl config above).

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

Supervised fine-tuning data: ~4M ShareGPT-format conversations, rendered with the Llama-3.2-style chat template above and packed into 4096-token sequences. The loss is computed only on assistant turns.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1.5e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 32
- gradient_accumulation_steps: 8
- total_train_batch_size: 256
- total_eval_batch_size: 32
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 102
- num_epochs: 1.0

### Training results

### Framework versions

- Transformers 4.56.2
- PyTorch 2.5.1+cu121
- Datasets 3.5.1
- Tokenizers 0.22.1
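For reference, here is a minimal prompting sketch. It assumes the chat template defined in the Axolotl config was saved into the exported tokenizer (if it was not, the Jinja string can be passed explicitly via the `chat_template=` argument of `apply_chat_template`) and that the checkpoint is available at the training `output_dir`; adjust the path or repo id to wherever the weights actually live.

```python
# Minimal prompting sketch (assumptions: the chat template from the config
# above is stored in the exported tokenizer, and the checkpoint sits at the
# training output_dir -- adjust the path as needed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./ale_outputs/sft-prod-l3p32")

messages = [
    # Default system prompt from the config ("You are a helpful assistant.")
    {"role": "system", "content": "Sei un assistente utile."},
    {"role": "user", "content": "Summarize what supervised fine-tuning is in one sentence."},
]

# Render the Llama-3.2-style prompt; add_generation_prompt=True appends an
# open assistant header so the model continues with its reply.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
# The rendered string can then be tokenized and passed to model.generate(),
# using <|eot_id|> as the stop token for the assistant turn.
```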