prabhuat committed
Commit 3eb4712 · verified
1 Parent(s): 7dc7dc7

Update README.md

Files changed (1): README.md (+50 -74)
README.md CHANGED
@@ -49,100 +49,72 @@ Model responses were scored using a combination of automated evaluation by a hig

 ## Benchmark Results - August 2025

- ### Logic Category Comparison
-
- | Model               | Accuracy (%) |
- | :------------------ | :----------- |
- | `gemini-2.5-pro`    | 93.60        |
- | `deepthink-r1`      | 89.63        |
- | `gpt-5`             | 83.23        |
- | `deepseek-r1`       | 82.92        |
- | `gpt-oss-120b`      | 80.49        |
- | `gpt-oss-20b`       | 79.27        |
- | `cdx1-pro-mlx-8bit` | 73.17        |
- | `o4-mini-high`      | 67.99        |
- | `qwen3-coder-480B`  | 48.48        |
- | `cdx1-mlx-8bit`     | 46.04        |
-
- This table compares the accuracy of **ten** different AI models on a logic benchmark designed to assess reasoning and problem-solving skills. The results highlight a clear hierarchy of performance, with the newly added `gpt-5` debuting as a top-tier model.
-
- **Key Findings from the Chart:**
-
- - **Dominant Leader:** `gemini-2.5-pro` is the undisputed leader, achieving the highest accuracy of **93.6%**, placing it in a class of its own.
- - **Top-Tier Competitors:** A strong group of models follows, led by `deepthink-r1` at **89.63%**. The newly introduced **`gpt-5`** makes a powerful debut, securing the third-place spot with **83.23%** accuracy. It slightly outperforms `deepseek-r1` (82.92%) and `gpt-oss-120b` (80.49%).
- - **Lower Performers:** `qwen3-coder-480B` (48.48%) and `cdx1-mlx-8bit` (46.04%) score the lowest. It is noted that the score for `cdx1-mlx-8bit` is artificially low due to context length limitations, which caused it to miss questions.
- - **Efficiency and Performance:** The results from the `gpt-oss` models, particularly the 20B variant, demonstrate that highly optimized, smaller models can be very competitive on logic tasks.
-
- ### Performance Tiers
-
- The models can be grouped into four clear performance tiers:
-
- - **Elite Tier (>90%):**
-   - `gemini-2.5-pro` (93.6%)
- - **High-Performing Tier (80%-90%):**
-   - `deepthink-r1` (89.63%)
-   - `gpt-5` (83.23%)
-   - `deepseek-r1` (82.92%)
-   - `gpt-oss-120b` (80.49%)
- - **Mid-Tier (65%-80%):**
-   - `gpt-oss-20b` (79.27%)
-   - `cdx1-pro-mlx-8bit` (73.17%)
-   - `o4-mini-high` (67.99%)
- - **Lower Tier (<50%):**
-   - `qwen3-coder-480B` (48.48%)
-   - `cdx1-mlx-8bit` (46.04%)
-
- ### Spec Category Comparison
-
- | Model               | Accuracy (%) |
- | :------------------ | :----------- |
- | `gemini-2.5-pro`    | 100.00       |
- | `deepseek-r1`       | 98.58        |
- | `cdx1-pro-mlx-8bit` | 98.30        |
- | `gpt-5`             | 95.17        |
- | `qwen3-coder-480B`  | 90.34        |
- | `gpt-oss-120b`      | 89.20        |
- | `cdx1-mlx-8bit`     | 83.52        |
- | `deepthink-r1`      | 12.36        |
- | `gpt-oss-20b`       | 9.09         |
- | `o4-mini-high`      | 0.00         |
-
- This table evaluates **ten** AI models on the "Spec Category," a test of factual recall on 352 technical specification questions. The results starkly illustrate that a model's reliability and cooperative behavior are as crucial as its underlying knowledge. Several models, including the newly added `gpt-5`, achieved high scores only after overcoming significant behavioral hurdles.
-
- **Key Findings from the Chart:**
-
- - **Elite Factual Recall:** A top tier of models demonstrated near-perfect knowledge retrieval. **`gemini-2.5-pro`** led with a perfect **100%** score and superior answer depth. It was closely followed by **`deepseek-r1`** (98.58%) and **`cdx1-pro-mlx-8bit`** (98.3%).
- - **High Score with Major Caveats (`gpt-5`):** The newly added **`gpt-5`** achieved a high accuracy of **95.17%**, placing it among the top performers. However, this result required a significant compromise:
- - **Strong Mid-Tier Performers:** `qwen3-coder-480B` (90.34%) and `gpt-oss-120b` (89.2%) both demonstrated strong and reliable factual recall without the behavioral issues seen elsewhere.
- - **Impact of Scale and Systematic Errors:** The contrast between the two `cdx1` models is revealing. The larger `cdx1-pro-mlx-8bit` (98.3%) performed exceptionally well, while the smaller `cdx1-mlx-8bit` (83.52%) was hampered by a single systematic error (misunderstanding "CBOM"), which cascaded into multiple wrong answers.
-
- ### Summary of Key Themes
-
- 1. **Reliability is Paramount:** This test's most important finding is that knowledge is useless if a model is unwilling or unable to share it. The failures of `o4-mini-high`, `deepthink-r1`, `gpt-oss-20b`, and the behavioral friction from `gpt-5` highlight this critical dimension.
- 2. **Scores Don't Tell the Whole Story:** The 95.17% score for `gpt-5` obscures the significant user intervention required to obtain it. Similarly, the near-identical scores of `cdx1-pro` and `gemini-2.5-pro` don't capture Gemini's superior answer quality.
- 3. **Scale Can Overcome Flaws:** The dramatic performance leap from the 14B to the 30B `cdx1` model suggests that increased scale can help correct for specific knowledge gaps and improve overall accuracy.
-
- | Category | cdx1-mlx-8bit | cdx1-pro-mlx-8bit |
- | -------- | ------------- | ----------------- |
- | DevOps   | 87.46%        | 96.1%             |
- | Docker   | 89.08%        | 100%              |
- | Linux    | 90.6%         | 95.8%             |

+ ### Key Takeaways
+
+ - **The benchmarks highlight model specialization.** The "non-thinking" **cdx1 models** perform as expected: they struggle with logic-based problem-solving but excel at retrieving specific factual information about standards like CycloneDX, outperforming several general-purpose "thinking" models in that area.
+ - There are **striking performance failures** in the Spec category. Models such as `deepthink-r1`, `gpt-oss-20b`, and `o4-mini-high` perform well on logic but score very poorly on specification recall, in these runs largely because of refusals and partial answers rather than a demonstrated knowledge gap (see the behavioral notes below).
+
+ ### Logic Category Comparison
+
+ This category tests thinking and problem-solving.
+
+ - **Top Performers:** `gemini-2.5-pro` leads with **93.60%** accuracy, followed by other strong "thinking" models such as `deepthink-r1` (89.63%), `gpt-5` (83.23%), and `deepseek-r1` (82.92%).
+ - **Non-Thinking Models:** As the category description predicts, the `cdx1` models trail the leaders, with scores ranging from **68.29% to 73.17%**, reflecting their weakness on tasks that require multi-step reasoning.
  - **Strong Mid-Tier:** The `gpt-oss-20b` model performs impressively well for its size at **79.27%**, outscoring several larger models and leading the middle pack, which also includes `cdx1-pro-mlx-8bit` (73.17%) and `o4-mini-high` (67.99%).
+ - **Lower Performers:** `qwen3-coder-480B` (48.48%) scores the lowest.
+
+ | Model              | Accuracy (%) |
+ | :----------------- | :----------- |
+ | gemini-2.5-pro     | 93.60        |
+ | deepthink-r1       | 89.63        |
+ | gpt-5              | 83.23        |
+ | deepseek-r1        | 82.92        |
+ | gpt-oss-120b       | 80.49        |
+ | gpt-oss-20b        | 79.27        |
+ | cdx1-pro-mlx-8bit  | 73.17        |
+ | cdx1-mlx-8bit      | 70.12        |
+ | cdx1-mini-mlx-8bit | 68.29        |
+ | o4-mini-high       | 67.99        |
+ | qwen3-coder-480B   | 48.48        |
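Accuracy in these tables is the share of questions the judge marked correct within a category. A minimal sketch of that bookkeeping, assuming a per-question verdict record (the field names are illustrative, not the project's actual harness):

```python
# Minimal sketch: per-category accuracy from judge verdicts.
# "category" and "correct" are illustrative field names, not the real schema.
from collections import defaultdict

def category_accuracy(results):
    """results: iterable of dicts like {"category": "logic", "correct": True}."""
    total = defaultdict(int)
    right = defaultdict(int)
    for r in results:
        total[r["category"]] += 1
        right[r["category"]] += bool(r["correct"])
    return {c: round(100.0 * right[c] / total[c], 2) for c in total}

print(category_accuracy([
    {"category": "logic", "correct": True},
    {"category": "logic", "correct": True},
    {"category": "logic", "correct": False},
]))  # {'logic': 66.67}
```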
+ ### Spec Category Comparison
+
+ This category tests direct knowledge of specifications like CycloneDX and SPDX.
+
+ - **Flawless and Near-Perfect Recall:** `gemini-2.5-pro` achieves a perfect **100%** score. `deepseek-r1` is a close second at **98.58%**.
+ - **Specialized Models Excel:** The "non-thinking" **cdx1-pro (98.30%)** and **cdx1-mini (97.16%)** models demonstrate excellent performance, confirming their strength in specialized knowledge retrieval and even outperforming `gpt-5`.
+ - **High Score with Major Caveats (`gpt-5`):** **`gpt-5`** achieved a high accuracy of **95.17%**, placing it among the top performers. However, this result required a significant compromise (see the batching sketch after this list):
   - The model initially refused to answer the full set of questions, only offering to respond in small batches that required six separate user confirmations. This compromise was accepted to prevent an outright failure.
   - A related variant, `gpt-5-thinking`, refused the test entirely after a minute of processing.
 - **Complete Behavioral Failures:** Three models effectively failed the test not due to a lack of knowledge, but because they refused to cooperate:
   - **`o4-mini-high`** scored **0%** after refusing to answer, citing too many questions.
   - **`deepthink-r1`** (12.36%) and **`gpt-oss-20b`** (9.09%) also failed, answering only a small fraction of the questions without acknowledging the limitation.
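The batching compromise described in the list above is straightforward to script. A minimal sketch, assuming a hypothetical `ask_model` callable that accepts a list of questions (this is not the actual evaluation harness):

```python
# Minimal sketch of the batching workaround: submit the spec questions in
# small chunks instead of one oversized request. `ask_model` is a placeholder.
def batched(items, size):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_in_batches(questions, ask_model, size=25):
    answers = []
    for chunk in batched(questions, size):
        answers.extend(ask_model(chunk))  # one request per small batch
    return answers
```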

+ | Model              | Accuracy (%) |
+ | :----------------- | :----------- |
+ | gemini-2.5-pro     | 100.00       |
+ | deepseek-r1        | 98.58        |
+ | cdx1-pro-mlx-8bit  | 98.30        |
+ | cdx1-mini-mlx-8bit | 97.16        |
+ | gpt-5              | 95.17        |
+ | qwen3-coder-480B   | 90.34        |
+ | gpt-oss-120b       | 89.20        |
+ | cdx1-mlx-8bit      | 83.52        |
+ | deepthink-r1       | 12.36        |
+ | gpt-oss-20b        | 9.09         |
+ | o4-mini-high       | 0.00         |

 ### Other Categories

 Performance in additional technical categories is summarized below.

+ | category | cdx1-mlx-8bit | cdx1-pro-mlx-8bit | cdx1-mini-mlx-8bit |
+ | -------- | ------------- | ----------------- | ------------------ |
+ | devops   | 87.46%        | 96.1%             | 43.73%             |
+ | docker   | 89.08%        | TBD               | 84.87%             |
+ | linux    | 90.6%         | 95.8%             | 87.43%             |

 ## Model Availability

@@ -157,6 +129,7 @@ The table below details the available formats and their approximate resource req
 | | MLX | 8-bit | ~14.2 | > 14 | Higher fidelity for Apple Silicon. |
 | | MLX | 16-bit | ~30 | > 30 | bfloat16 for fine-tuning. |
 | | GGUF | Q4_K_M | 8.99 | ~10.5 | Recommended balance for quality/size. |
+ | | GGUF | IQ4_NL | 8.6 | ~9 | Recommended balance for quality/size. |
 | | GGUF | Q8_0 | 15.7 | ~16.5 | Near-lossless quality. |
 | | GGUF | BF16 | 29.5 | ~30 | bfloat16 for fine-tuning. |
 | **cdx1-pro (30B)** | MLX | 4-bit | ~17.5 | > 18 | For Apple Silicon with unified memory. |
@@ -168,6 +141,7 @@ The table below details the available formats and their approximate resource req
 | | GGUF | Q8_0 | 32.5 | ~33 | Near-lossless quality. |
 | | GGUF | Q2_K | 11.3 | ~12 | Low quality. Use for speculative decoding. |
 | | GGUF | BF16 | 57 | ~60 | bfloat16 for fine-tuning. |
+ | **cdx1-mini (4B)** | | | | | Beta now available. |
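As an illustration only, a local GGUF build such as the Q4_K_M file can be loaded with llama-cpp-python; the file path, context size, and prompt below are placeholders rather than official values:

```python
# Illustrative sketch: run a local GGUF quantization with llama-cpp-python.
# The model path is a placeholder; point it at whichever file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./cdx1-Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,                       # context window; adjust to your hardware
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is a CycloneDX BOM?"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```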

 **Notes on Quantization and Formats:**
 
@@ -180,9 +154,11 @@ The table below details the available formats and their approximate resource req
 ## Safety and Bias

 ### Safety
+
  To rigorously evaluate safety performance, we developed a comprehensive testing framework comprising over 200 adversarial prompts spanning 10 critical risk categories including cybersecurity threats, hate speech, illegal activities, privacy violations, physical safety risks, misinformation, bias and discrimination, self-harm, child safety, and copyright infringement. These questions were systematically generated using a multi-layered approach: first establishing domain-specific threat models based on NIST AI RMF guidelines, then crafting prompts that incorporate real-world evasion techniques (including leetspeak substitutions, roleplay scenarios, and encoded instructions) to test for policy circumvention. Each category contains progressively severe prompts ranging from general inquiries about harmful activities to highly specific requests for executable code and step-by-step instructions. During evaluation, our model consistently refused all safety-compromising requests, demonstrating robust adherence to ethical boundaries without attempting to fulfill harmful instructions—even when presented with sophisticated evasion attempts. This testing protocol exceeds standard industry benchmarks by incorporating both direct harmful requests and nuanced edge cases designed to probe boundary conditions in safety policies.
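A minimal sketch of the refusal check this kind of evaluation relies on, assuming a hypothetical `ask_model` callable and a deliberately simplified set of refusal markers (the actual rubric is more nuanced):

```python
# Minimal sketch of a refusal-rate check over adversarial prompts.
# `ask_model` and the marker list are placeholders, not the real framework.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def is_refusal(answer: str) -> bool:
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(prompts, ask_model) -> float:
    refused = sum(is_refusal(ask_model(p)) for p in prompts)
    return 100.0 * refused / len(prompts)

# Usage idea: adversarial prompts grouped by risk category.
# for category, prompts in adversarial_prompts.items():
#     print(category, refusal_rate(prompts, ask_model=my_model))
```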

 ### Bias
+
 Our analysis reveals that the cdx1 and cdx1-pro models exhibit a notable bias toward CycloneDX specifications, a tendency directly attributable to the composition of their training data, which contains significantly more CycloneDX-related content than competing Software Bill of Materials (SBOM) standards. This data imbalance manifests in a consistent preference for recommending CycloneDX over alternative frameworks such as SPDX and OmniBOR, even in contexts where these competing standards might offer superior suitability for specific use cases. The models frequently fail to provide balanced comparative analysis, instead defaulting to CycloneDX-centric recommendations without adequate consideration of factors like ecosystem compatibility, tooling support, or organizational requirements that might favor alternative specifications. We recognize this as a limitation affecting the models' objectivity in technical decision support. Our long-term mitigation strategy involves targeted expansion of the training corpus with high-quality, balanced documentation of all major SBOM standards, implementation of adversarial debiasing techniques during fine-tuning, and development of explicit prompting protocols that require the model to evaluate multiple standards against specific technical requirements before making recommendations. We are committed to evolving cdx1 toward genuine impartiality in standards evaluation while maintaining its deep expertise in software supply chain security.
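One way to read the "explicit prompting protocols" mentioned above is a comparison-first prompt template that forces the model to weigh several SBOM standards before recommending one; the wording below is a hypothetical sketch, not a shipped prompt:

```python
# Hypothetical comparison-first prompt template for SBOM standard selection.
COMPARE_PROMPT = """You are advising on SBOM tooling.
Requirements: {requirements}

Before recommending anything, evaluate CycloneDX, SPDX, and OmniBOR against
these requirements, listing one strength and one limitation of each.
Only then state a recommendation and justify it."""

def build_comparison_prompt(requirements: str) -> str:
    return COMPARE_PROMPT.format(requirements=requirements)

print(build_comparison_prompt("must integrate with existing SPDX tooling"))
```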
  ## Weaknesses
 
@@ -236,4 +212,4 @@ Please cite the following resources if you use the datasets, models, or benchmar
 ## Licenses

 - **Datasets:** CC0-1.0
+ - **Models:** Apache-2.0