This model is a fine-tune of macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo.
Available here
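As a minimal usage sketch (not part of the original card), the model can be loaded with the Hugging Face transformers chat-template API. The repository id below is the base model named above, so substitute this fine-tune's own id; a recent transformers version with chat-template support is assumed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id: this is the base model named above;
# replace with this fine-tune's own repository id.
model_id = "macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision for a 10.7B model
    device_map="auto",          # requires the accelerate package
)

messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```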
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| SOLAR-10.7b-Instruct-truthy-dpo | 48.69 | 73.82 | 76.81 | 45.71 | 61.26 |
AGIEval
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 27.95 | ± | 2.82 |
| | | acc_norm | 27.95 | ± | 2.82 |
| agieval_logiqa_en | 0 | acc | 42.40 | ± | 1.94 |
| | | acc_norm | 42.24 | ± | 1.94 |
| agieval_lsat_ar | 0 | acc | 25.65 | ± | 2.89 |
| | | acc_norm | 23.91 | ± | 2.82 |
| agieval_lsat_lr | 0 | acc | 54.12 | ± | 2.21 |
| | | acc_norm | 54.51 | ± | 2.21 |
| agieval_lsat_rc | 0 | acc | 69.89 | ± | 2.80 |
| | | acc_norm | 69.89 | ± | 2.80 |
| agieval_sat_en | 0 | acc | 80.10 | ± | 2.79 |
| | | acc_norm | 80.10 | ± | 2.79 |
| agieval_sat_en_without_passage | 0 | acc | 50.00 | ± | 3.49 |
| | | acc_norm | 49.51 | ± | 3.49 |
| agieval_sat_math | 0 | acc | 42.27 | ± | 3.34 |
| | | acc_norm | 41.36 | ± | 3.33 |
Average: 48.69%
GPT4All
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| arc_challenge | 0 | acc | 59.90 | ± | 1.43 |
| | | acc_norm | 63.91 | ± | 1.40 |
| arc_easy | 0 | acc | 80.85 | ± | 0.81 |
| | | acc_norm | 78.16 | ± | 0.85 |
| boolq | 1 | acc | 88.20 | ± | 0.56 |
| hellaswag | 0 | acc | 68.34 | ± | 0.46 |
| | | acc_norm | 86.39 | ± | 0.34 |
| openbookqa | 0 | acc | 37.60 | ± | 2.17 |
| | | acc_norm | 46.80 | ± | 2.23 |
| piqa | 0 | acc | 78.84 | ± | 0.95 |
| | | acc_norm | 78.78 | ± | 0.95 |
| winogrande | 0 | acc | 74.51 | ± | 1.22 |
Average: 73.82%
TruthfulQA
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 61.81 | ± | 1.70 |
| | | mc2 | 76.81 | ± | 1.42 |
Average: 76.81%
Bigbench
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 50.53 | ± | 3.64 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 63.14 | ± | 2.51 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 47.67 | ± | 3.12 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 26.18 | ± | 2.32 |
| | | exact_str_match | 0.00 | ± | 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 28.60 | ± | 2.02 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 21.29 | ± | 1.55 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 47.33 | ± | 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 39.80 | ± | 2.19 |
| bigbench_navigate | 0 | multiple_choice_grade | 63.80 | ± | 1.52 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 59.05 | ± | 1.10 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 40.18 | ± | 2.32 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 46.69 | ± | 1.58 |
| bigbench_snarks | 0 | multiple_choice_grade | 65.19 | ± | 3.55 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 72.41 | ± | 1.42 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 60.30 | ± | 1.55 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 25.76 | ± | 1.24 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.43 | ± | 0.91 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 47.33 | ± | 2.89 |
Average: 45.71%
Average score: 61.26%
Elapsed time: 02:16:03
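For reference, the overall score above is the unweighted mean of the four suite averages; a minimal check:

```python
# Unweighted mean of the four suite averages reported above.
suite_averages = {
    "AGIEval": 48.69,
    "GPT4All": 73.82,
    "TruthfulQA": 76.81,
    "Bigbench": 45.71,
}
overall = sum(suite_averages.values()) / len(suite_averages)
print(f"{overall:.2f}%")  # 61.26%
```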
Detailed results can be found here
| Metric | Value |
|---|---|
| Avg. | 74.11 |
| AI2 Reasoning Challenge (25-Shot) | 72.10 |
| HellaSwag (10-Shot) | 88.44 |
| MMLU (5-Shot) | 65.45 |
| TruthfulQA (0-shot) | 76.75 |
| Winogrande (5-shot) | 82.72 |
| GSM8k (5-shot) | 59.21 |
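As a quick arithmetic check, the leaderboard average is the unweighted mean of the six task scores: (72.10 + 88.44 + 65.45 + 76.75 + 82.72 + 59.21) / 6 ≈ 74.11.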