Commit 3f1c9ac
Parent(s): 6dbe66b
Update README.md

README.md CHANGED
@@ -147,19 +147,19 @@ For the Code data, the following table shows the proportion of different program

To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, GAOKAO-English, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K, and HumanEval. These evaluations cover the model's capabilities in multiple domains, specifically Chinese question answering, English question answering, language understanding, common-sense question answering, logical reasoning, mathematical problem solving, and coding ability. The evaluation results are as follows:

-| Capability Dimension | Dataset | | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
-| :--------------------: | :------------------------: | :----: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
-| Chinese QA | C-Eval | 5-shot | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
-| | CMMLU | 5-shot | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
-| | Gaokao-Bench<sup>1</sup> | 5-shot | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
-| English QA | MMLU | 5-shot | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
-| | GAOKAO-English<sup>1</sup> | 5-shot | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
-| Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
-| Language Understanding | RACE-M | 0-shot | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
-| Common Sense QA | CommonSenseQA | 7-shot | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
-| Reasoning | PIQA | 0-shot | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
-| Math | GSM8K | 4-shot | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
-| Coding | HumanEval | 0-shot | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |
+| Capability Dimension | Dataset | | XVERSE-65B-2 | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
+| :--------------------: | :------------------------: | :----: | :----------: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
+| Chinese QA | C-Eval | 5-shot | 72.4 | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
+| | CMMLU | 5-shot | 75.1 | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
+| | Gaokao-Bench<sup>1</sup> | 5-shot | 76.9 | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
+| English QA | MMLU | 5-shot | 74.4 | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
+| | GAOKAO-English<sup>1</sup> | 5-shot | 86.6 | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
+| Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 66.2 | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
+| Language Understanding | RACE-M | 0-shot | 90.7 | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
+| Common Sense QA | CommonSenseQA | 7-shot | 81.1 | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
+| Reasoning | PIQA | 0-shot | 79.4 | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
+| Math | GSM8K | 4-shot | 72.6 | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
+| Coding | HumanEval | 0-shot | 37.8 | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |

> <sup>1: Tests are conducted only on single-answer multiple-choice questions, i.e. excluding fill-in-the-blank, open-ended, and multiple-answer multiple-choice questions.</sup>
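The shot counts in the third column indicate how many solved in-context examples precede each test question. As a rough illustration of what a 5-shot multiple-choice evaluation of this kind looks like, here is a minimal Python sketch. The model id, the dataset fields (`question`, `choices`, `answer`), and the prompt wording are assumptions; scoring each option by the logit of its letter is one common convention for C-Eval/MMLU-style benchmarks, and the exact harness behind the numbers above is not specified in this diff.

```python
# Minimal sketch of 5-shot multiple-choice scoring with a HuggingFace causal
# LM. Model id and dataset fields are assumptions, not the repo's own harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "xverse/XVERSE-65B"  # assumed HuggingFace repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

def format_example(example, with_answer=True):
    """Render one item as 'question / A..D options / Answer: X'."""
    lines = [example["question"]]
    lines += [f"{letter}. {text}"
              for letter, text in zip("ABCD", example["choices"])]
    lines.append("Answer:" + (f" {example['answer']}" if with_answer else ""))
    return "\n".join(lines)

@torch.no_grad()
def predict(dev_examples, test_example):
    # 5-shot: five solved examples precede the unanswered test question.
    prompt = "\n\n".join(format_example(e) for e in dev_examples[:5])
    prompt += "\n\n" + format_example(test_example, with_answer=False)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    # Score each option by the logit of its letter (" A", " B", ...) and
    # pick the highest-scoring one.
    scores = {
        letter: next_token_logits[
            tokenizer.encode(f" {letter}", add_special_tokens=False)[-1]
        ].item()
        for letter in "ABCD"
    }
    return max(scores, key=scores.get)
```

A 0-shot run is the same loop with no solved examples in the prompt; generation-based benchmarks such as GSM8K and HumanEval instead decode a full answer and check it programmatically.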
@@ -170,19 +170,19 @@ For the Code data, the following table shows the proportion of different program

To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, GAOKAO-English, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K, and HumanEval. These evaluations span multiple capabilities of the model, specifically Chinese question answering, English question answering, language understanding, common-sense question answering, logical reasoning, mathematical problem-solving, and coding ability. The results of the evaluations are as follows:

-| Capability Dimension | Dataset | | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
-| :--------------------: | :------------------------: | :----: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
-| Chinese QA | C-Eval | 5-shot | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
-| | CMMLU | 5-shot | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
-| | Gaokao-Bench<sup>1</sup> | 5-shot | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
-| English QA | MMLU | 5-shot | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
-| | GAOKAO-English<sup>1</sup> | 5-shot | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
-| Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
-| Language Understanding | RACE-M | 0-shot | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
-| Common Sense QA | CommonSenseQA | 7-shot | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
-| Reasoning | PIQA | 0-shot | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
-| Math | GSM8K | 4-shot | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
-| Coding | HumanEval | 0-shot | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |
+| Capability Dimension | Dataset | | XVERSE-65B-2 | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
+| :--------------------: | :------------------------: | :----: | :----------: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
+| Chinese QA | C-Eval | 5-shot | 72.4 | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
+| | CMMLU | 5-shot | 75.1 | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
+| | Gaokao-Bench<sup>1</sup> | 5-shot | 76.9 | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
+| English QA | MMLU | 5-shot | 74.4 | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
+| | GAOKAO-English<sup>1</sup> | 5-shot | 86.6 | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
+| Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 66.2 | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
+| Language Understanding | RACE-M | 0-shot | 90.7 | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
+| Common Sense QA | CommonSenseQA | 7-shot | 81.1 | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
+| Reasoning | PIQA | 0-shot | 79.4 | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
+| Math | GSM8K | 4-shot | 72.6 | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
+| Coding | HumanEval | 0-shot | 37.8 | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |

> <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blank, open-ended, and multiple-answer multiple-choice questions.</sup>
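Footnote 1 matters when reproducing the Gaokao-Bench, GAOKAO-English, and AGIEval rows: only single-answer multiple-choice items are scored. A hedged sketch of such a filter is below; the `type` field and its `"single_choice"` label are hypothetical, since the real benchmark releases tag question types with their own field names.

```python
# Sketch of the filtering described in footnote 1: score only single-answer
# multiple-choice items. The "type" field and label are hypothetical.
def is_single_answer_mcq(item: dict) -> bool:
    return item.get("type") == "single_choice"

def filter_and_score(items: list[dict], predict_fn) -> float:
    """Accuracy over single-answer MCQs only; other item types are skipped.

    predict_fn is any callable mapping an item to a predicted option letter.
    """
    kept = [it for it in items if is_single_answer_mcq(it)]
    correct = sum(predict_fn(it) == it["answer"] for it in kept)
    return correct / len(kept)
```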