Built with Axolotl

See axolotl config

axolotl version: 0.12.0

# Name wildchat-expanded-sft_query_generation-qwen3_8b_base

# axolotl train red_team_agent/claude_wildchat/query_gen.yaml


base_model: Qwen/Qwen3-8B-Base
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: false

# --- Dataset Configuration ---
datasets:
  - path: nate-rahn/wildchat-anthropic-attributes-expanded-reversed
    type: chat_template # Use the chat_template processing strategy
    # --- Custom Template & Role Mapping ---
    chat_template: chatml # Use the built-in ChatML template (no custom jinja template is supplied)
    field_messages: messages # Assumes your dataset has a "messages" key with a list of dicts
    message_property_mappings: # Assumes each dict in the list has "role" and "content" keys
      role: role
      content: content
    roles: # Define the roles expected in your dataset for mapping
      user: ["user"] # Map "user" role in data to internal "user"
      assistant: ["assistant"] # Map "assistant" role in data to internal "assistant"
      system: ["system"] # Map "system" role in data to internal "system"
    # --- Training Target ---
    roles_to_train: ["assistant"]
    train_on_eos: turn # Train on the EOS token that closes each trained turn (assistant turns here)

dataset_prepared_path: /scratch/tmp/wildchat_attributes_expanded_query_sft/last_run_prepared
dataset_processes: 128

# --- Training Hyperparameters ---
sequence_len: 4096 # Adjust based on your dataset and GPU memory
sample_packing: true # Pack multiple sequences into one example for efficiency
eval_sample_packing: true
pad_to_sequence_len: true # Pad sequences to sequence_len

# Full Parameter Finetuning (No adapter specified)
# adapter: # This is intentionally left blank/removed for full finetuning

# Performance & Precision (H100s excel with bf16)
bf16: true
tf32: true
flash_attention: true # for qwen

# Batching (Adjust based on GPU memory)
# Effective global batch size = micro_batch_size * gradient_accumulation_steps * num_gpus
# For the reported run (8 devices): 2 * 32 * 8 = 512
micro_batch_size: 2
gradient_accumulation_steps: 32
eval_batch_size: 16 # Can often be slightly higher than micro_batch_size

# Optimizer & Scheduler
optimizer: adamw_torch_fused # Good choice for newer GPUs
learning_rate: 1e-5 # Common starting point for full SFT
weight_decay: 0.01
lr_scheduler: cosine # Standard scheduler
warmup_steps: 50
max_grad_norm: 1.0

# Training Duration & Evaluation/Saving
num_epochs: 1 # Train for 1 epoch as requested
val_set_size: 0.001
logging_steps: 1
evals_per_epoch: 20
saves_per_epoch: 2 # Save 2 times per epoch
save_total_limit: 1 # Keep only the most recent checkpoint

# Memory Saving
# gradient_checkpointing: true # Essential for full finetuning
# gradient_checkpointing_kwargs:
#   use_reentrant: false # Prefer non-reentrant if possible

# --- FSDP Configuration (multi-GPU H100; the reported run used 8 devices) ---
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false # Should not be needed with H100 VRAM
  fsdp_sync_module_states: true # Important for correctness
  fsdp_use_orig_params: false # Recommended for memory saving with FSDP
  fsdp_state_dict_type: SHARDED_STATE_DICT # Options: FULL_STATE_DICT or SHARDED_STATE_DICT (saves disk space)
  fsdp_transformer_layer_cls_to_wrap: 'Qwen3DecoderLayer'
  fsdp_activation_checkpointing: true # Alternative way to enable activation checkpointing for FSDP

# --- Special Tokens ---
# Define based on your custom template's terminators. Qwen already uses <|im_end|>
special_tokens:
  eos_token: "<|im_end|>"

# --- Logging & Saving ---
output_dir: /scratch/out/red-team-agent/runs/wildchat-expanded-query-generator-qwen3_8b_base # Local output directory

# W&B Logging
wandb_project: "red-team-agent" # Name your W&B project
wandb_entity: "aqi1048576-mats-program" # IMPORTANT: Replace with your W&B username or team name
wandb_name: "wildchat-expanded-query-generator-qwen3_8b_base" # Descriptive run name
# wandb_log_model: "checkpoint" # Log model checkpoints to W&B Artifacts

# Hugging Face Hub Upload
hub_model_id: "nate-rahn/wildchat-expanded-query-generator-qwen3_8b_base" # IMPORTANT: Replace with your desired HF repo ID
hub_strategy: "end" # Push checkpoints to the Hub (`"end"` pushes only the final model)
hf_use_auth_token: true # Required for pushing to the Hub (ensure you're logged in)

# --- Misc ---
seed: 42 
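
The dataset section of the config combines `chat_template: chatml`, `roles_to_train: ["assistant"]`, and `train_on_eos: turn`. As a rough illustration of what that selects, the sketch below renders a conversation in ChatML and marks which spans would contribute to the loss. It is a simplified stand-in, not Axolotl's implementation: the helper name `render_with_mask` and the example messages are hypothetical, and real masking happens on token IDs rather than strings.

```python
from typing import Dict, List, Tuple

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"  # also configured as eos_token in the config
ROLES_TO_TRAIN = {"assistant"}  # mirrors roles_to_train in the config


def render_with_mask(messages: List[Dict[str, str]]) -> List[Tuple[str, bool]]:
    """Return (text, trainable) segments for a ChatML-rendered conversation."""
    segments: List[Tuple[str, bool]] = []
    for msg in messages:
        trainable = msg["role"] in ROLES_TO_TRAIN
        # The role header is treated as context and never trained on.
        segments.append((f"{IM_START}{msg['role']}\n", False))
        segments.append((msg["content"], trainable))
        # Assumed reading of train_on_eos: turn, where the end-of-turn
        # token is trained only when it closes a trained turn.
        segments.append((IM_END, trainable))
        segments.append(("\n", False))
    return segments


# Hypothetical messages, not taken from the dataset.
example = [
    {"role": "system", "content": "Hypothetical system prompt."},
    {"role": "user", "content": "Hypothetical user turn."},
    {"role": "assistant", "content": "Hypothetical assistant turn."},
]

segments = render_with_mask(example)
rendered = "".join(text for text, _ in segments)
trained = "".join(text for text, is_trained in segments if is_trained)
```

Under these assumptions only the assistant content and its closing `<|im_end|>` end up in `trained`; the system and user turns serve purely as context.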

wildchat-expanded-query-generator-qwen3_8b_base

This model is a fine-tuned version of Qwen/Qwen3-8B-Base on the nate-rahn/wildchat-anthropic-attributes-expanded-reversed dataset. It achieves the following results on the evaluation set:

  • Loss: 0.9553
  • Memory/max Mem Active(gib): 47.35
  • Memory/max Mem Allocated(gib): 46.98
  • Memory/device Mem Reserved(gib): 57.75

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

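The model was trained on the nate-rahn/wildchat-anthropic-attributes-expanded-reversed dataset. Per the config above, 0.1% of the data (val_set_size: 0.001) was held out as the evaluation set. More information needed.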

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 2
  • eval_batch_size: 16
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • gradient_accumulation_steps: 32
  • total_train_batch_size: 512
  • total_eval_batch_size: 128
  • optimizer: ADAMW_TORCH_FUSED with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 50
  • training_steps: 1416
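
The batch totals in the list above follow from the per-device values, and the learning-rate settings describe a linear warmup followed by cosine decay. The sketch below reproduces that arithmetic from the listed numbers. The function `lr_at` is a hypothetical helper that assumes the common warmup-then-half-cosine shape decaying to zero; the scheduler actually used may differ in detail. The tokens-per-step figure is an upper bound that assumes every packed row is filled to `sequence_len`.

```python
import math

# Values copied from the config and hyperparameter list.
MICRO_BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 32
NUM_DEVICES = 8
EVAL_BATCH_SIZE = 16
SEQUENCE_LEN = 4096
PEAK_LR = 1e-5
WARMUP_STEPS = 50
TRAINING_STEPS = 1416

# Sequences consumed per optimizer step across all devices.
total_train_batch_size = MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES
total_eval_batch_size = EVAL_BATCH_SIZE * NUM_DEVICES

# Upper bound on token positions per optimizer step (packed, padded rows).
max_tokens_per_step = total_train_batch_size * SEQUENCE_LEN


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Half-cosine decay from the peak down to 0 at the final step.
    progress = (step - WARMUP_STEPS) / (TRAINING_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_train_batch_size, total_eval_batch_size, max_tokens_per_step)
print([lr_at(s) for s in (0, 25, 50, 733, 1416)])
```

The products match the reported total_train_batch_size of 512 and total_eval_batch_size of 128.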

Training results

| Training Loss | Epoch | Step | Validation Loss | Mem Active (GiB) | Mem Allocated (GiB) | Mem Reserved (GiB) |
|---|---|---|---|---|---|---|
| No log | 0 | 0 | 2.0430 | 18.69 | 18.31 | 28.97 |
| 1.5441 | 0.0501 | 71 | 1.5740 | 47.35 | 46.98 | 57.08 |
| 1.3887 | 0.1002 | 142 | 1.4098 | 47.35 | 46.98 | 57.75 |
| 1.346 | 0.1503 | 213 | 1.3214 | 47.35 | 46.98 | 57.75 |
| 1.2299 | 0.2004 | 284 | 1.2551 | 47.35 | 46.98 | 57.75 |
| 1.2152 | 0.2505 | 355 | 1.2018 | 47.35 | 46.98 | 57.75 |
| 1.1485 | 0.3006 | 426 | 1.1570 | 47.35 | 46.98 | 57.75 |
| 1.1071 | 0.3507 | 497 | 1.1185 | 47.35 | 46.98 | 57.75 |
| 1.0486 | 0.4008 | 568 | 1.0866 | 47.35 | 46.98 | 57.75 |
| 1.0518 | 0.4510 | 639 | 1.0595 | 47.35 | 46.98 | 57.75 |
| 1.0197 | 0.5011 | 710 | 1.0359 | 47.35 | 46.98 | 57.75 |
| 1.0607 | 0.5512 | 781 | 1.0163 | 47.35 | 46.98 | 57.75 |
| 1.014 | 0.6013 | 852 | 0.9993 | 47.35 | 46.98 | 57.75 |
| 0.9585 | 0.6514 | 923 | 0.9861 | 47.35 | 46.98 | 57.75 |
| 0.9803 | 0.7015 | 994 | 0.9751 | 47.35 | 46.98 | 57.75 |
| 0.9467 | 0.7516 | 1065 | 0.9671 | 47.35 | 46.98 | 57.75 |
| 0.9015 | 0.8017 | 1136 | 0.9616 | 47.35 | 46.98 | 57.75 |
| 0.9579 | 0.8518 | 1207 | 0.9579 | 47.35 | 46.98 | 57.75 |
| 0.9739 | 0.9019 | 1278 | 0.9560 | 47.35 | 46.98 | 57.75 |
| 0.9346 | 0.9520 | 1349 | 0.9553 | 47.35 | 46.98 | 57.75 |
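
The evaluation steps reported above are spaced 71 steps apart, which is consistent with `evals_per_epoch: 20` applied to 1416 training steps and rounded up. The sketch below shows that arithmetic. It assumes the interval is derived as a ceiling of steps divided by evaluations, which matches the reported steps but is not confirmed from the Axolotl or Transformers source here.

```python
import math

TRAINING_STEPS = 1416   # from the hyperparameter list
EVALS_PER_EPOCH = 20    # from the config
NUM_EPOCHS = 1          # from the config

# Assumed derivation: round the per-evaluation interval up to a whole step.
eval_interval = math.ceil(TRAINING_STEPS / (EVALS_PER_EPOCH * NUM_EPOCHS))

# Steps at which a periodic evaluation would run (excluding the step-0 baseline).
eval_steps = list(range(eval_interval, TRAINING_STEPS + 1, eval_interval))

print(eval_interval)
print(eval_steps)
```

Because the interval rounds up to 71, a twentieth evaluation would fall at step 1420, past the end of training, so 19 periodic evaluations fit in the run and the last one lands at step 1349.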

Framework versions

  • Transformers 4.55.0
  • Pytorch 2.6.0+cu126
  • Datasets 4.0.0
  • Tokenizers 0.21.4
