Motit committed on
Commit 8f9ce0b · verified · 1 Parent(s): 26a048e

Updated AI21-Jamba-Reasoning-3B-GGUF model card content


This PR updates the AI21-Jamba-Reasoning-3B-GGUF model card with the latest images for the graphs.

Note: I’ve left placeholders for the following items:
- Graphs (waiting for approval of the images added here)
- Code snippet
- Link to the related blog

Please review and approve the images before we fill in the remaining placeholders.
Once approved, I will update the card with the final code snippet and blog link.

Files changed (1)
  1. README.md +237 -0
README.md CHANGED
@@ -5,3 +5,240 @@ license_link: https://www.ai21.com/jamba-open-model-license/
pipeline_tag: text-generation
library_name: transformers
---

## Introduction

AI21’s Jamba Reasoning 3B is a top-performing reasoning model that packs leading scores on intelligence benchmarks and highly efficient processing into a compact 3B build.

### Key Advantages

**Fast: Optimized for efficient sequence processing**

The hybrid design combines Transformer attention with Mamba (a state-space model). Mamba layers are more efficient for sequence processing, while attention layers capture complex dependencies. This mix reduces memory overhead, improves throughput, and lets the model run smoothly on laptops, GPUs, and even mobile devices while maintaining impressive quality.

*Placeholder graph: Intelligence vs Speed*

**Smart: Leading intelligence scores**

The model outperforms competitors such as Gemma 3 4B, Llama 3.2 3B, and Granite 4.0 Micro on a combined intelligence score that averages six standard benchmarks.

*Placeholder graph: MMLU + HLE + IFBench*

**Scalable: Handles very long contexts**

Unlike most compact models, Jamba Reasoning 3B supports extremely long contexts. Mamba layers allow the model to process inputs without storing massive attention caches, so it scales to **256K tokens** while keeping inference practical. This makes it suitable for edge deployment as well as datacenter workloads.

*Placeholder graph: On-Device Speed as Context Scales*

## Model Details

- Number of Parameters: 3B
- Number of Layers: 28 (26 Mamba, 2 Attention)
- Number of Attention Heads: 20 MQA (20 for Q, 1 for KV)
- Vocabulary Size: 64K
- Context Length: **256K**
- Architecture: Hybrid Transformer–Mamba with efficient attention and long-context support
- **Developed by:** [**AI21**](https://www.ai21.com/)
- **Supported languages:** English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew
- Intelligence benchmark results:

| | **MMLU-Pro** | **Humanity’s Last Exam** | **IFBench** |
| --- | --- | --- | --- |
| DeepSeek R1 Distill Qwen 1.5B | 27.0% | 3.3% | 13.0% |
| Phi-4 mini | 47.0% | 4.2% | 21.0% |
| Granite 4.0 Micro | 44.7% | 5.1% | 24.8% |
| Llama 3.2 3B | 35.0% | 5.2% | 26.0% |
| Gemma 3 4B | 42.0% | 5.2% | 28.0% |
| Qwen 3 1.7B | 57.0% | 4.8% | 27.0% |
| Qwen 3 4B | 70.0% | 5.1% | 33.0% |
| **Jamba Reasoning 3B** | **61.0%** | **6.0%** | **52.0%** |
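
If you want to verify the hybrid layer layout yourself, a minimal sketch (an illustrative addition, not part of the official card) is to load and print the published config; attribute names vary by `transformers` release, so the snippet simply prints whatever the config exposes:

```python
# Minimal sketch: inspect the published config to see the hybrid
# Mamba/attention layout, context length, and vocabulary size.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ai21labs/AI21-Jamba-Reasoning-3B")
print(config)  # layer counts, attention/Mamba settings, max context, vocab size
```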

## Quickstart

**Extended version** – reasoning mode example with `<think>` block and recommended sampling parameters.

Code Snippet Placeholder

### **Run the model with vLLM**

For best results, we recommend using vLLM version 0.10.2 or higher and enabling `--mamba-ssm-cache-dtype=float32`.

```bash
pip install "vllm>=0.10.2"
```

Using vLLM in online server mode:

```bash
vllm serve "ai21labs/AI21-Jamba-Reasoning-3B" \
  --mamba-ssm-cache-dtype float32 \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```
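
The server exposes an OpenAI-compatible API. As a minimal sketch (assuming the default host and port, `http://localhost:8000`, and the `requests` package), you can query it like this:

```python
# Minimal sketch: call the OpenAI-compatible chat endpoint started by `vllm serve`,
# assuming it is listening on the default http://localhost:8000.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "ai21labs/AI21-Jamba-Reasoning-3B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.6,   # sampling temperature used elsewhere in this card
        "max_tokens": 4096,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```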

Using vLLM in offline mode:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model = "ai21labs/AI21-Jamba-Reasoning-3B"
number_gpus = 1

llm = LLM(model=model,
          tensor_parallel_size=number_gpus,
          mamba_ssm_cache_dtype="float32")

tokenizer = AutoTokenizer.from_pretrained(model)

messages = [
    {"role": "user", "content": "Hello!"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

sampling_params = SamplingParams(temperature=0.6, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

### **Run the model with Transformers**

```bash
pip install "transformers>=4.54.0"
pip install flash-attn --no-build-isolation
pip install "causal-conv1d>=1.2.0"
pip install mamba-ssm
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-Reasoning-3B",
                                             dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Reasoning-3B")

messages = [
    {"role": "user", "content": "Hello!"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

outputs = model.generate(**tokenizer(prompts, return_tensors="pt").to(model.device),
                         do_sample=True, temperature=0.6, max_new_tokens=4096)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
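
The generated text includes the model’s reasoning trace. As a minimal post-processing sketch (an assumption based on the `<think>` block convention and the `deepseek_r1` reasoning parser referenced above, not an official utility), you can split the trace from the final answer:

```python
# Minimal sketch: split a generation of the form "<think> ... </think> answer"
# into the reasoning trace and the final answer. The tag convention is assumed
# from the deepseek_r1 reasoning parser used elsewhere in this card.
def split_reasoning(text: str) -> tuple[str, str]:
    reasoning, sep, answer = text.partition("</think>")
    if not sep:  # no closing tag found: treat the whole output as the answer
        return "", text.strip()
    return reasoning.replace("<think>", "").strip(), answer.strip()

reasoning, answer = split_reasoning(generated_text)  # generated_text from the snippet above
print("Reasoning:", reasoning)
print("Answer:", answer)
```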

## **How to Run This Model Locally**

You can run Jamba Reasoning 3B on your own machine using popular lightweight runtimes. This makes it possible to experiment with long-context reasoning without relying on cloud infrastructure.

- **Supported runtimes**: [llama.cpp](https://github.com/ggml-org/llama.cpp), [LM Studio](https://lmstudio.ai/), and [Ollama](https://ollama.com/).
- **Quantizations**: Multiple quantization levels are provided to shrink the model size (see the download sketch below).
  - Full-precision FP16 GGUF - **5.96** GB
  - 4-bit quantization using Q4-K-M GGUF - **1.80** GB
  - More GGUF quantizations - TBD
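
As an illustrative sketch (not part of the official card), you can fetch a GGUF file directly from the Hub with `huggingface_hub`; the filename below is hypothetical, so check the repository's file list for the exact quantization you want:

```python
# Minimal sketch: download one GGUF quantization from the Hub.
# The filename is illustrative; pick the actual file listed in the repo.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="ai21labs/AI21-Jamba-Reasoning-3B-GGUF",
    filename="AI21-Jamba-Reasoning-3B-Q4_K_M.gguf",  # hypothetical filename
)
print(gguf_path)  # pass this local path to llama.cpp, LM Studio, or Ollama
```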

## Deployment

- Support for **Ollama**, **LM Studio**, and **llama.cpp** (for local use)

1. Run example using the llama.cpp Python SDK (`llama-cpp-python`):

```bash
pip install --upgrade llama-cpp-python
```

```python
from llama_cpp import Llama

llm = Llama(
    model_path="path-to-Jamba-Reasoning-3B-gguf",
    n_ctx=128000,
    n_threads=10,      # CPU threads
    n_gpu_layers=-1,   # -1 = all layers on GPU (Metal/CUDA if available)
    flash_attn=True,
)

prompt = """<think>
You are analyzing a stream of customer support tickets to decide which ones require escalation.

Ticket 1: "The new update caused our app to crash whenever users upload a file larger than 50MB."
Ticket 2: "I can't log in because I forgot my password."
Ticket 3: "The billing page is missing the new enterprise pricing option."

Classify each ticket as 'Critical', 'Medium', or 'Low' priority and explain your reasoning.
</think>"""

res = llm(
    prompt,
    max_tokens=128,
    temperature=0.6,
)

print(res["choices"][0]["text"])
```

2. Run with the llama.cpp server. Build llama.cpp from source (the `GGML_METAL` flags enable Apple's Metal backend; omit or adjust them on other platforms):

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -S . -B build \
  -DGGML_METAL=ON \
  -DGGML_METAL_EMBED_LIBRARY=ON
cmake --build build --config Release -j
```

Start the llama.cpp server with a Jamba Reasoning 3B GGUF file:

```bash
./build/bin/llama-server \
  -m path-to-Jamba-Reasoning-3B-gguf \
  -c 8192 \
  -ngl 99 \
  --host 127.0.0.1 \
  --port 8000
```

Quick sanity test using curl:

```bash
curl -s http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Jamba-Reasoning-3B",
    "prompt": "<think>\nYou are analyzing customer support tickets to decide which need escalation.\nTicket 1: App crashes when uploading files >50MB.\nTicket 2: Forgot password, cannot log in.\nTicket 3: Billing page missing enterprise pricing.\nClassify each ticket as Critical, Medium, or Low and explain your reasoning.\n</think>",
    "max_tokens": 64,
    "temperature": 0.6
  }' | jq -r '.choices[0].text'
```

## Training Details

We trained the model in multiple stages, each designed to strengthen reasoning and long-context performance. The process began with large-scale pre-training on a diverse corpus of natural documents. We then mid-trained on ~0.5T tokens of math and code, while extending the context length to 32K tokens. During this stage we also applied a [Mamba-specific long-context method](https://arxiv.org/abs/2507.02782), which we found to significantly improve long-context abilities.

To improve reasoning, tool use, and instruction following, we applied cold-start distillation: supervised fine-tuning with a 32K window and direct preference optimization with a 64K window. Finally, we enhanced reasoning performance further through online reinforcement learning with RLVR, targeting tasks such as code generation, mathematical problem solving, structured output, and information extraction.

## Reinforcement “Fine-Tuning”

Full support for training Jamba through VeRL will be available soon. AI21 has introduced several improvements to the [VeRL framework](https://github.com/volcengine/verl), including new capabilities for training hybrid models and stability improvements for GRPO training. These improvements will soon be available to the open-source community.

---

## License

- `Apache 2.0`

---

## Citation

- Blog post: Placeholder