dpo_40k_abla_one_cat_neg_only

This model is a fine-tuned version of /p/scratch/taco-vlm/xiao4/models/Qwen2.5-VL-7B-Instruct on the dpo_ablation_one_cat_neg_only dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 2
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 8
total_train_batch_size: 64
total_eval_batch_size: 4
optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1.0

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/chosen	Logps/rejected	Logits/chosen	Logits/rejected
0.6461	0.0804	50	0.6440	0.0667	-0.0352	0.9200	0.1019	-35.5412	-26.3565	0.5580	0.5903
0.3417	0.1608	100	0.3474	0.6275	-0.3350	0.9700	0.9626	-29.9331	-29.3550	0.5090	0.5769
0.1545	0.2412	150	0.1499	1.0768	-1.2928	0.9950	2.3696	-25.4406	-38.9325	0.4661	0.5881
0.0612	0.3216	200	0.0870	1.2276	-2.1346	1.0	3.3622	-23.9325	-47.3507	0.4834	0.6188
0.0653	0.4020	250	0.0584	1.3513	-2.6611	1.0	4.0125	-22.6949	-52.6160	0.4899	0.6294
0.034	0.4824	300	0.0426	1.4271	-3.1388	1.0	4.5659	-21.9378	-57.3929	0.5077	0.6533
0.0276	0.5628	350	0.0338	1.4728	-3.4698	1.0	4.9426	-21.4803	-60.7023	0.4949	0.6430
0.016	0.6432	400	0.0286	1.5195	-3.7150	1.0	5.2345	-21.0130	-63.1547	0.5002	0.6343
0.023	0.7236	450	0.0256	1.5404	-3.8924	1.0	5.4328	-20.8042	-64.9289	0.5012	0.6427
0.0203	0.8040	500	0.0242	1.5588	-3.9827	1.0	5.5414	-20.6205	-65.8313	0.4993	0.6421
0.0244	0.8844	550	0.0235	1.5606	-4.0356	1.0	5.5962	-20.6023	-66.3604	0.5056	0.6446
0.0175	0.9648	600	0.0235	1.5615	-4.0445	1.0	5.6061	-20.5929	-66.4498	0.4924	0.6398

Base model

Adapter

(145)

this model