Commit 3f1c9ac
Parent(s): 6dbe66b
Update README.md

README.md CHANGED
@@ -147,19 +147,19 @@ For the Code data, the following table shows the proportion of different program

To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, GAOKAO-English, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K, and HumanEval. These evaluations cover the model's capabilities in multiple domains, specifically Chinese question answering, English question answering, language understanding, common-sense question answering, logical reasoning, mathematical problem solving, and coding ability. The evaluation results are as follows:

-| Capability Dimension | Dataset | | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
-| :--------------------: | :------------------------: | :----: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
-| Chinese QA | C-Eval | 5-shot | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
-| | CMMLU | 5-shot | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
-| | Gaokao-Bench<sup>1</sup> | 5-shot | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
-| English QA | MMLU | 5-shot | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
-| | GAOKAO-English<sup>1</sup> | 5-shot | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
-| Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
-| Language Understanding | RACE-M | 0-shot | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
-| Common Sense QA | CommonSenseQA | 7-shot | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
-| Reasoning | PIQA | 0-shot | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
-| Math | GSM8K | 4-shot | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
-| Coding | HumanEval | 0-shot | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |
+| Capability Dimension | Dataset | | XVERSE-65B-2 | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
+| :--------------------: | :------------------------: | :----: | :----------: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
+| Chinese QA | C-Eval | 5-shot | 72.4 | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
+| | CMMLU | 5-shot | 75.1 | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
+| | Gaokao-Bench<sup>1</sup> | 5-shot | 76.9 | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
+| English QA | MMLU | 5-shot | 74.4 | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
+| | GAOKAO-English<sup>1</sup> | 5-shot | 86.6 | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
+| Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 66.2 | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
+| Language Understanding | RACE-M | 0-shot | 90.7 | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
+| Common Sense QA | CommonSenseQA | 7-shot | 81.1 | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
+| Reasoning | PIQA | 0-shot | 79.4 | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
+| Math | GSM8K | 4-shot | 72.6 | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
+| Coding | HumanEval | 0-shot | 37.8 | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |

> <sup>1: Tests are conducted only on single-answer multiple-choice questions, i.e. excluding fill-in-the-blank, open-ended, and multiple-answer multiple-choice questions.</sup>
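The shot counts in the third column indicate how many solved in-context examples precede each test question. As a rough illustration of what a 5-shot multiple-choice evaluation of this kind looks like, here is a minimal Python sketch. The model id, the dataset fields (`question`, `choices`, `answer`), and the prompt wording are assumptions; scoring each option by the logit of its letter is one common convention for C-Eval/MMLU-style benchmarks, and the exact harness behind the numbers above is not specified in this diff.

```python
# Minimal sketch of 5-shot multiple-choice scoring with a HuggingFace causal
# LM. Model id and dataset fields are assumptions, not the repo's own harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "xverse/XVERSE-65B"  # assumed HuggingFace repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

def format_example(example, with_answer=True):
    """Render one item as 'question / A..D options / Answer: X'."""
    lines = [example["question"]]
    lines += [f"{letter}. {text}"
              for letter, text in zip("ABCD", example["choices"])]
    lines.append("Answer:" + (f" {example['answer']}" if with_answer else ""))
    return "\n".join(lines)

@torch.no_grad()
def predict(dev_examples, test_example):
    # 5-shot: five solved examples precede the unanswered test question.
    prompt = "\n\n".join(format_example(e) for e in dev_examples[:5])
    prompt += "\n\n" + format_example(test_example, with_answer=False)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    # Score each option by the logit of its letter (" A", " B", ...) and
    # pick the highest-scoring one.
    scores = {
        letter: next_token_logits[
            tokenizer.encode(f" {letter}", add_special_tokens=False)[-1]
        ].item()
        for letter in "ABCD"
    }
    return max(scores, key=scores.get)
```

A 0-shot run is the same loop with no solved examples in the prompt; generation-based benchmarks such as GSM8K and HumanEval instead decode a full answer and check it programmatically.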
@@ -170,19 +170,19 @@ For the Code data, the following table shows the proportion of different program

To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, GAOKAO-English, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K, and HumanEval. These evaluations span multiple capabilities of the model, specifically Chinese question answering, English question answering, language understanding, common-sense question answering, logical reasoning, mathematical problem-solving, and coding ability. The results of the evaluations are as follows:

-| Capability Dimension | Dataset | | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
-| :--------------------: | :------------------------: | :----: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
-| Chinese QA | C-Eval | 5-shot | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
-| | CMMLU | 5-shot | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
-| | Gaokao-Bench<sup>1</sup> | 5-shot | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
-| English QA | MMLU | 5-shot | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
-| | GAOKAO-English<sup>1</sup> | 5-shot | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
-| Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
-| Language Understanding | RACE-M | 0-shot | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
-| Common Sense QA | CommonSenseQA | 7-shot | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
-| Reasoning | PIQA | 0-shot | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
-| Math | GSM8K | 4-shot | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
-| Coding | HumanEval | 0-shot | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |
+| Capability Dimension | Dataset | | XVERSE-65B-2 | XVERSE-65B | Llama1-65B | Llama2-70B | Falcon-180B | GPT-3.5 | GPT-4 |
+| :--------------------: | :------------------------: | :----: | :----------: | :--------: | :--------: | :--------: | :---------: | :-----: | :---: |
+| Chinese QA | C-Eval | 5-shot | 72.4 | 68.6 | 38.8 | 49.9 | 54.2 | 54.4 | 68.7 |
+| | CMMLU | 5-shot | 75.1 | 72.6 | 40.6 | 53.6 | 57.2 | 53.9 | 71.0 |
+| | Gaokao-Bench<sup>1</sup> | 5-shot | 76.9 | 73.9 | 38.9 | 51.4 | 50.5 | - | - |
+| English QA | MMLU | 5-shot | 74.4 | 70.8 | 63.4 | 68.9 | 70.5 | 70.0 | 86.4 |
+| | GAOKAO-English<sup>1</sup> | 5-shot | 86.6 | 85.3 | 67.0 | 76.6 | 63.3 | - | - |
+| Chinese & English QA | AGIEval<sup>1</sup> | 5-shot | 66.2 | 61.8 | 42.4 | 51.4 | 51.3 | - | - |
+| Language Understanding | RACE-M | 0-shot | 90.7 | 90.6 | 67.9 | 81.5 | 87.6 | 85.6 | 93.7 |
+| Common Sense QA | CommonSenseQA | 7-shot | 81.1 | 79.8 | 74.0 | 78.5 | 82.4 | 80.2 | 88.3 |
+| Reasoning | PIQA | 0-shot | 79.4 | 80.4 | 82.8 | 82.8 | 85.3 | 81.7 | 89.2 |
+| Math | GSM8K | 4-shot | 72.6 | 60.3 | 50.9 | 56.8 | 62.6 | 57.1 | 92.0 |
+| Coding | HumanEval | 0-shot | 37.8 | 26.8 | 23.7 | 29.9 | - | 48.1 | 67.0 |

> <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blank, open-ended, and multiple-answer multiple-choice questions.</sup>
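Footnote 1 matters when reproducing the Gaokao-Bench, GAOKAO-English, and AGIEval rows: only single-answer multiple-choice items are scored. A hedged sketch of such a filter is below; the `type` field and its `"single_choice"` label are hypothetical, since the real benchmark releases tag question types with their own field names.

```python
# Sketch of the filtering described in footnote 1: score only single-answer
# multiple-choice items. The "type" field and label are hypothetical.
def is_single_answer_mcq(item: dict) -> bool:
    return item.get("type") == "single_choice"

def filter_and_score(items: list[dict], predict_fn) -> float:
    """Accuracy over single-answer MCQs only; other item types are skipped.

    predict_fn is any callable mapping an item to a predicted option letter.
    """
    kept = [it for it in items if is_single_answer_mcq(it)]
    correct = sum(predict_fn(it) == it["answer"] for it in kept)
    return correct / len(kept)
```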