See axolotl config

axolotl version: 0.5.2

base_model: Open-Orca/Mistral-7B-OpenOrca
model_type: AutoModelForCausalLM
tokenizer_config: Open-Orca/Mistral-7B-OpenOrca
tokenizer_type: AutoTokenizer
tokenizer_use_fast: false
resize_token_embeddings_to_32x: true

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: chatml
datasets:
  - path: skymizer/open-orca-conversations
    type: chat_template
    field_messages: messages

hf_use_auth_token: true
dataset_prepared_path: pretokenized/open-orca
output_dir: ./outputs/out

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

val_set_size: 0.005
eval_sample_packing: false
# eval_causal_lm_metrics: ["perplexity"]

wandb_project: "axolotl_mistral_sft"
wandb_entity:
wandb_watch:
wandb_name: "mistral-7B-v0.1-csft-open-orca-on-open-orca"
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 16
max_steps: 3000
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.000005 
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.95
adam_eps: 0.000001
max_grad_norm: 1.0

train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false

hub_model_id: "skymizer/mistral-7b-v0.1-csft-open-orca-on-open-orca"
save_strategy: "steps"
save_steps: 1000

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_ratio: 0.03
eval_steps: 500
# evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
debug:
deepspeed: deepspeed_configs/zero3_bf16.json
fsdp:
fsdp_config:

seed: 42

mistral-7b-v0.1-csft-open-orca-on-open-orca

This model is a fine-tuned version of Open-Orca/Mistral-7B-OpenOrca on the None dataset. It achieves the following results on the evaluation set:

Loss: 2.1946

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 16
eval_batch_size: 16
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 2
total_train_batch_size: 128
total_eval_batch_size: 64
optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.95) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 90
training_steps: 3000

Training results

Training Loss	Epoch	Step	Validation Loss
1.1311	0.0002	1	4.4372
0.5277	0.0831	500	2.2236
0.463	0.1663	1000	2.2066
0.4855	0.2494	1500	2.2146
0.4662	0.3325	2000	2.1989
0.4494	0.4157	2500	2.1966
0.4268	0.4988	3000	2.1946

Framework versions

Transformers 4.46.3
Pytorch 2.5.1+cu124
Datasets 3.1.0
Tokenizers 0.20.3

Downloads last month: -

Safetensors

Model size

7B params

Tensor type

BF16

Model tree for skymizer/mistral-7b-v0.1-csft-open-orca-on-open-orca

Base model

Open-Orca/Mistral-7B-OpenOrca

Finetuned

(11)

this model

Evaluation results

Metadata error: specify a dataset to view leaderboard