dpo_40k_abla_all_eight_lora_8

This model is a fine-tuned version of /p/scratch/taco-vlm/xiao4/models/Qwen2.5-VL-7B-Instruct on the dpo_ablation_all_eight dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 2
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 8
total_train_batch_size: 64
total_eval_batch_size: 4
optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1.0

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/chosen	Logps/rejected	Logits/chosen	Logits/rejected
0.6934	0.0806	50	0.6949	-0.0076	-0.0044	0.4900	-0.0032	-32.0068	-36.6781	0.4688	0.4601
0.6714	0.1612	100	0.6716	-0.0773	-0.1260	0.6700	0.0487	-32.7034	-37.8937	0.4662	0.4538
0.6336	0.2418	150	0.6334	-0.2162	-0.3615	0.6750	0.1452	-34.0927	-40.2488	0.4570	0.4424
0.5815	0.3225	200	0.6023	-0.3208	-0.5647	0.6950	0.2438	-35.1384	-42.2806	0.4307	0.4118
0.5156	0.4031	250	0.5798	-0.3931	-0.7416	0.7050	0.3485	-35.8613	-44.0503	0.3851	0.3729
0.526	0.4837	300	0.5636	-0.3932	-0.8386	0.7150	0.4453	-35.8627	-45.0197	0.3565	0.3343
0.4516	0.5643	350	0.5514	-0.4842	-1.0116	0.7100	0.5275	-36.7721	-46.7503	0.3182	0.3018
0.4109	0.6449	400	0.5427	-0.4802	-1.0621	0.7050	0.5818	-36.7327	-47.2548	0.2981	0.2778
0.5726	0.7255	450	0.5362	-0.5329	-1.1560	0.7150	0.6231	-37.2598	-48.1941	0.2839	0.2644
0.4475	0.8061	500	0.5306	-0.5696	-1.2106	0.7200	0.6410	-37.6265	-48.7398	0.2658	0.2456
0.5105	0.8867	550	0.5270	-0.5710	-1.2235	0.7350	0.6525	-37.6404	-48.8687	0.2626	0.2447
0.4304	0.9674	600	0.5276	-0.5738	-1.2276	0.7300	0.6539	-37.6680	-48.9104	0.2642	0.2512

Base model

Adapter

(128)

this model