Add evaluation results for RepoBench, SAFIM, HumanEval (#1)
Add evaluation results for RepoBench, SAFIM, HumanEval (e3e5887b059b90936b33ca6e03e1d092240782fa)
Co-authored-by: Ivan Bondyrev <[email protected]>
README.md
CHANGED
@@ -10,6 +10,241 @@ tags:
 - code
 base_model:
 - JetBrains/Mellum-4b-base
+model-index:
+- name: Mellum-4b-sft-all
+  results:
+  # --------------------------- RepoBench 1.1 – Python ---------------------------
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python)
+    metrics:
+    - name: EM
+      type: exact_match
+      value: 0.2823
+      verified: false
+    - name: EM ≤ 8k
+      type: exact_match
+      value: 0.2870
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 2k)
+    metrics:
+    - name: EM
+      type: exact_match
+      value: 0.2638
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 4k)
+    metrics:
+    - name: EM
+      type: exact_match
+      value: 0.2930
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 8k)
+    metrics:
+    - name: EM
+      type: exact_match
+      value: 0.3042
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 12k)
+    metrics:
+    - name: EM
+      type: exact_match
+      value: 0.2685
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_python_v1.1
+      name: RepoBench 1.1 (Python, 16k)
+    metrics:
+    - name: EM
+      type: exact_match
+      value: 0.2818
+      verified: false
+
+  # --------------------------- RepoBench 1.1 – Java ----------------------------
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_java_v1.1
+      name: RepoBench 1.1 (Java)
+    metrics:
+    - name: EM
+      type: exact_match
+      value: 0.2867
+      verified: false
+    - name: EM ≤ 8k
+      type: exact_match
+      value: 0.3023
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_java_v1.1
+      name: RepoBench 1.1 (Java, 2k)
+    metrics:
+    - name: EM
+      type: exact_match
+      value: 0.2883
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_java_v1.1
+      name: RepoBench 1.1 (Java, 4k)
+    metrics:
+    - name: EM
+      type: exact_match
+      value: 0.3228
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_java_v1.1
+      name: RepoBench 1.1 (Java, 8k)
+    metrics:
+    - name: EM
+      type: exact_match
+      value: 0.2958
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_java_v1.1
+      name: RepoBench 1.1 (Java, 12k)
+    metrics:
+    - name: EM
+      type: exact_match
+      value: 0.2447
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: tianyang/repobench_java_v1.1
+      name: RepoBench 1.1 (Java, 16k)
+    metrics:
+    - name: EM
+      type: exact_match
+      value: 0.2821
+      verified: false
+
+  # --------------------------- SAFIM ------------------------------------------
+  - task:
+      type: text-generation
+    dataset:
+      type: gonglinyuan/safim
+      name: SAFIM
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.5285
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: gonglinyuan/safim
+      name: SAFIM (API)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.6548
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: gonglinyuan/safim
+      name: SAFIM (Block)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.4005
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: gonglinyuan/safim
+      name: SAFIM (Control)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.5303
+      verified: false
+
+  # --------------------------- HumanEval Infilling ----------------------------
+  - task:
+      type: text-generation
+    dataset:
+      type: loubnabnl/humaneval_infilling
+      name: HumanEval Infilling (Single-Line)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.8083
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: loubnabnl/humaneval_infilling
+      name: HumanEval Infilling (Multi-Line)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.4819
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: loubnabnl/humaneval_infilling
+      name: HumanEval Infilling (Random Span)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.3720
+      verified: false
+
+  - task:
+      type: text-generation
+    dataset:
+      type: loubnabnl/humaneval_infilling
+      name: HumanEval Infilling (Random Span Light)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.4024
+      verified: false
+
 ---

 # Model Description
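The model-index block added above is plain YAML front matter, so the reported metrics can be read back out programmatically. A minimal sketch, assuming a locally downloaded copy of README.md and PyYAML; the file path, loop, and output format are illustrative only, not part of the card:

```python
# Sketch: extract the model-index metrics from the card's YAML front matter.
# Assumes a local README.md copy; requires PyYAML (pip install pyyaml).
import yaml

with open("README.md", encoding="utf-8") as f:
    text = f.read()

# The front matter sits between the first two '---' markers.
_, front_matter, _ = text.split("---", 2)
card = yaml.safe_load(front_matter)

for entry in card.get("model-index", []):
    print(entry["name"])
    for result in entry["results"]:
        dataset = result["dataset"]["name"]
        for metric in result["metrics"]:
            print(f"  {dataset}: {metric['name']} = {metric['value']}")
```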
@@ -37,6 +272,45 @@ llama-cli -m mellum-4b-sft-all.Q8_0.gguf --temp 0 -p $'<filename>Utils.kt\npacka

 ```

+## Benchmarks
+We provide scores for **Mellum-4b-sft-all** to give users an estimate of the model's capabilities.
+
+### RepoBench 1.1
+*Type:* single-line *Languages:* Python and Java *Metric:* Exact Match (EM), %
+
+Since Mellum has a maximum context window of 8k, we report both the average over **all** evaluated context lengths (2k, 4k, 8k, 12k, and 16k) and the average over the lengths within its supported range (≤ 8k).
+
+#### Python subset
+
+| Model             | 2k     | 4k     | 8k     | 12k    | 16k    | Avg    | Avg ≤ 8k |
+|-------------------|--------|--------|--------|--------|--------|--------|----------|
+| Mellum-4b-sft-all | 26.38% | 29.30% | 30.42% | 26.85% | 28.18% | 28.23% | 28.70%   |
+
+#### Java subset
+
+| Model             | 2k     | 4k     | 8k     | 12k    | 16k    | Avg    | Avg ≤ 8k |
+|-------------------|--------|--------|--------|--------|--------|--------|----------|
+| Mellum-4b-sft-all | 28.83% | 32.28% | 29.58% | 24.47% | 28.21% | 28.67% | 30.23%   |
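The Avg and Avg ≤ 8k columns are plain means of the per-context EM scores. A minimal sketch that reproduces them from the numbers in the tables above; the values are hard-coded here purely for illustration:

```python
# Reproduce the RepoBench averages from the per-context EM scores (in %).
em = {
    "python": {2: 26.38, 4: 29.30, 8: 30.42, 12: 26.85, 16: 28.18},
    "java":   {2: 28.83, 4: 32.28, 8: 29.58, 12: 24.47, 16: 28.21},
}

for lang, scores in em.items():
    avg_all = sum(scores.values()) / len(scores)
    # Only the context lengths inside the 8k window count toward Avg <= 8k.
    within = [v for k, v in scores.items() if k <= 8]
    avg_8k = sum(within) / len(within)
    print(f"{lang}: Avg = {avg_all:.2f}%, Avg <= 8k = {avg_8k:.2f}%")

# Expected output:
# python: Avg = 28.23%, Avg <= 8k = 28.70%
# java: Avg = 28.67%, Avg <= 8k = 30.23%
```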
+
+### Syntax-Aware Fill-in-the-Middle (SAFIM)
+*Type:* mix of multi-line and single-line *Languages:* multi-language *Metric:* pass@1, %
+
+| Model             | Algorithmic | Control | API    | Average |
+|-------------------|-------------|---------|--------|---------|
+| Mellum-4b-sft-all | 40.05%      | 53.03%  | 65.48% | 52.85%  |
+
+### HumanEval Infilling
+- Type: single-line and multi-line
+- Languages: Python
+- Metric: pass@1, %
+
+| Model             | Single-Line | Multi-Line | Random Span |
+|-------------------|-------------|------------|-------------|
+| Mellum-4b-sft-all | 80.83%      | 48.19%     | 37.20%      |
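pass@1 is the usual functional-correctness metric: the fraction of problems for which a generated completion passes the reference tests. For reference, a minimal sketch of the standard unbiased pass@k estimator from Chen et al. (2021), which reduces to c/n at k = 1; the sample counts below are illustrative and are not the ones used for the scores above:

```python
# Standard unbiased pass@k estimator (Chen et al., 2021):
#   pass@k = 1 - C(n - c, k) / C(n, k)
# where n completions were sampled per problem and c of them passed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only:
print(pass_at_k(n=1, c=1, k=1))   # 1.0 -> a single passing greedy sample
print(pass_at_k(n=20, c=8, k=1))  # 0.4 -> equals c / n for k = 1
```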
+
+
+
+
 # Citation
 If you use this model, please cite:
