# dpo_40k_abla_one_cat_one
This model is a LoRA fine-tuned version of /p/scratch/taco-vlm/xiao4/models/Qwen2.5-VL-7B-Instruct (a local copy of Qwen/Qwen2.5-VL-7B-Instruct), trained with DPO on the dpo_ablation_one_cat_one dataset. It achieves the following results on the evaluation set (a sketch of how these DPO metrics are computed follows the list):
- Loss: 0.5072
- Rewards/chosen: -0.4765
- Rewards/rejected: -1.1458
- Rewards/accuracies: 0.7700
- Rewards/margins: 0.6692
- Logps/chosen: -35.2662
- Logps/rejected: -46.7822
- Logits/chosen: 0.2949
- Logits/rejected: 0.2973
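
For reference, the reward columns above are DPO's implicit rewards: the β-scaled log-probability ratio between the policy and the frozen reference model on the chosen and rejected responses. Below is a minimal sketch of these definitions, assuming the standard sigmoid DPO loss; β = 0.1 is TRL's default and only an assumption here, since the value actually used for this run is not stated in the card.

```python
import torch
import torch.nn.functional as F

beta = 0.1  # assumption: DPO temperature; not stated in this card

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps):
    # Implicit rewards: beta-scaled log-ratio of policy to reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards           # Rewards/margins
    accuracy = (margins > 0).float().mean()               # Rewards/accuracies
    # Sigmoid DPO loss: -log sigmoid of the reward margin.
    loss = -F.logsigmoid(margins).mean()                  # Loss
    return loss, chosen_rewards.mean(), rejected_rewards.mean(), accuracy
```

Note that Rewards/margins is simply Rewards/chosen minus Rewards/rejected: −0.4765 − (−1.1458) ≈ 0.6692, matching the numbers above.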
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (a sketch mapping them onto a trainer config follows the list):
- learning_rate: 5e-06
- train_batch_size: 2
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- total_eval_batch_size: 4
- optimizer: adamw_torch (AdamW) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1.0
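
As a rough guide, here is how these values would map onto TRL's `DPOConfig` (a subclass of `transformers.TrainingArguments`). This is a hedged sketch, not the actual training script: the training framework, output directory, and precision settings are not stated in the card.

```python
from trl import DPOConfig

config = DPOConfig(
    output_dir="dpo_40k_abla_one_cat_one",  # hypothetical output directory
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1.0,
    seed=42,
    optim="adamw_torch",  # AdamW with betas=(0.9, 0.999), eps=1e-8
    bf16=True,            # assumption: mixed precision is not stated in the card
)
# Effective train batch size: 2 per device x 4 GPUs x 8 accumulation steps = 64.
```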
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6882 | 0.0804 | 50 | 0.6898 | -0.0044 | -0.0115 | 0.5350 | 0.0071 | -30.5444 | -35.4393 | 0.5502 | 0.5581 |
| 0.6603 | 0.1608 | 100 | 0.6593 | -0.0806 | -0.1570 | 0.6850 | 0.0763 | -31.3070 | -36.8941 | 0.5384 | 0.5446 |
| 0.6387 | 0.2412 | 150 | 0.6298 | -0.1917 | -0.3455 | 0.7250 | 0.1538 | -32.4175 | -38.7793 | 0.5058 | 0.5080 |
| 0.5986 | 0.3216 | 200 | 0.5988 | -0.2330 | -0.4814 | 0.7050 | 0.2485 | -32.8302 | -40.1388 | 0.4847 | 0.4844 |
| 0.5368 | 0.4020 | 250 | 0.5667 | -0.2959 | -0.6688 | 0.7200 | 0.3728 | -33.4601 | -42.0120 | 0.4258 | 0.4350 |
| 0.5416 | 0.4824 | 300 | 0.5450 | -0.3299 | -0.8038 | 0.7450 | 0.4739 | -33.8000 | -43.3626 | 0.3828 | 0.3894 |
| 0.5141 | 0.5628 | 350 | 0.5301 | -0.3794 | -0.9226 | 0.7450 | 0.5432 | -34.2943 | -44.5501 | 0.3541 | 0.3622 |
| 0.5122 | 0.6432 | 400 | 0.5206 | -0.4136 | -1.0123 | 0.7550 | 0.5987 | -34.6362 | -45.4474 | 0.3284 | 0.3337 |
| 0.4817 | 0.7236 | 450 | 0.5165 | -0.4476 | -1.0766 | 0.7750 | 0.6290 | -34.9764 | -46.0903 | 0.3096 | 0.3177 |
| 0.4709 | 0.8040 | 500 | 0.5102 | -0.4623 | -1.1173 | 0.7800 | 0.6550 | -35.1233 | -46.4975 | 0.3006 | 0.3063 |
| 0.4759 | 0.8844 | 550 | 0.5098 | -0.4751 | -1.1359 | 0.7800 | 0.6609 | -35.2515 | -46.6838 | 0.2987 | 0.3002 |
| 0.4342 | 0.9648 | 600 | 0.5086 | -0.4804 | -1.1453 | 0.7800 | 0.6649 | -35.3051 | -46.7775 | 0.2947 | 0.2991 |
### Framework versions
- PEFT 0.17.1
- Transformers 4.49.0
- Pytorch 2.5.1+cu124
- Datasets 4.0.0
- Tokenizers 0.21.0
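
Since this repository is a PEFT LoRA adapter (note the PEFT version above and the `-lora` suffix in the repo id), it must be loaded on top of the base model. A minimal inference-loading sketch, assuming the public Qwen/Qwen2.5-VL-7B-Instruct checkpoint and the adapter id from this page:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

# Load the public base model and its processor.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Attach the DPO LoRA adapter from this repository.
model = PeftModel.from_pretrained(
    base, "xiaorui638/qwen2_5vl7b-dpo_40k_abla_one_cat_one-lora"
)
model = model.merge_and_unload()  # optional: fold the LoRA weights into the base
```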