Paper-Summarizer-Nemotron-12B
A fine-tuned Nemotron-12B model specialized for generating structured summaries of scientific research papers in a standardized JSON format, with substantially higher throughput than the Qwen3-14B variant.
Model Description
This model is part of Project AELLA, developed in collaboration with LAION and Wynd Labs to democratize access to scientific knowledge by creating structured summaries of research papers at scale.
- Base Model: NVIDIA Nemotron 12B (nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base, a hybrid Mamba-Transformer)
- Training Data: 110,000 curated research papers
- Performance: 71.3% accuracy on QA evaluation
- Throughput: 2.25× faster than the Qwen3-14B variant
The model generates comprehensive structured summaries in JSON format. Each paper is first classified as SCIENTIFIC_TEXT, PARTIAL_SCIENTIFIC_TEXT, or NON_SCIENTIFIC_TEXT; key research elements such as methodology, results, claims, and limitations are then extracted into dedicated fields.
The model supports inputs up to 131K tokens and is optimized for large-scale, high-throughput batch processing (0.97 requests/sec on an 8×H200 node).
Usage
Serving the Model
Note: This model requires a custom chat template for proper reasoning token handling.
vllm serve inference-net/Paper-Summarizer-Nemotron-12B \
--port 8000 \
--host 0.0.0.0 \
--trust-remote-code \
--data-parallel-size 1 \
--tensor-parallel-size 1 \
--max-num-seqs 32 \
--max-model-len 131072 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--enable-chunked-prefill \
--chat-template "{%- set ns = namespace(enable_thinking=true) %}{%- for message in messages -%}{%- set content = message['content'] -%}{%- if message['role'] == 'user' or message['role'] == 'system' -%}{%- if '/think' in content -%}{%- set ns.enable_thinking = true -%}{%- elif '/no_think' in content -%}{%- set ns.enable_thinking = false -%}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if messages[0]['role'] != 'system' -%}{%- set ns.non_tool_system_content = '' -%}{{- '<SPECIAL_10>System\n' -}}{%- else -%}{%- set ns.non_tool_system_content = messages[0]['content'].replace('/think', '').replace('/no_think', '').strip() -%}{{- '<SPECIAL_10>System\n' + ns.non_tool_system_content }}{%- endif -%}{%- if tools -%}{%- if ns.non_tool_system_content is defined and ns.non_tool_system_content != '' -%}{{- '\n\n' -}}{%- endif -%}{{- 'You can use the following tools to assist the user if required:' -}}{{- '\n<AVAILABLE_TOOLS>[' -}}{%- for tool in tools -%}{{- (tool.function if tool.function is defined else tool) | tojson -}}{{- ', ' if not loop.last else '' -}}{%- endfor -%}{{- ']</AVAILABLE_TOOLS>\n\n' -}}{{- 'If you decide to call any tool(s), use the following format:\n' -}}{{- '<TOOLCALL>[{{\"name\": \"tool_name1\", \"arguments\": \"tool_args1\"}}, ' -}}{{- '{{\"name\": \"tool_name2\", \"arguments\": \"tool_args2\"}}]</TOOLCALL>\n\n' -}}{{- 'The user will execute tool-calls and return responses from tool(s) in this format:\n' -}}{{- '<TOOL_RESPONSE>[{{\"tool_response1\"}}, {{\"tool_response2\"}}]</TOOL_RESPONSE>\n\n' -}}{{- 'Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}{%- endif -%}{{- '\n' -}}{%- set messages = messages[1:] if messages[0]['role'] == 'system' else messages -%}{%- if messages[-1]['role'] == 'assistant' -%}{%- set ns.last_turn_assistant_content = messages[-1]['content'].strip() -%}{%- set messages = messages[:-1] -%}{%- endif -%}{%- for message in messages %}{%- set content = message['content'] %}{%- if message['role'] == 'user' -%}{{- '<SPECIAL_11>User\n' + content.replace('/think', '').replace('/no_think', '').strip() + '\n' }}{%- elif message['role'] == 'tool' -%}{%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}{{- '<SPECIAL_11>User\n' + '<TOOL_RESPONSE>[' }}{%- endif -%}{{- message['content'] -}}{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}{%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}{{- ']</TOOL_RESPONSE>\n' -}}{%- endif -%}{%- elif message['role'] == 'assistant' -%}{%- if '</think>' in content -%}{%- set content = content.split('</think>')[1].strip() %}{%- endif -%}{{- '<SPECIAL_11>Assistant\n' + content.strip() }}{%- if message.tool_calls -%}{%- if content.strip() != '' -%}{{- '\n\n' -}}{%- endif -%}{{- '<TOOLCALL>[' -}}{%- for call in message.tool_calls -%}{%- set fn = call.function if call.function is defined else call -%}{{- '{\"name\": \"' + fn.name + '\", \"arguments\": ' -}}{%- if fn.arguments is string -%}{{- fn.arguments -}}{%- else -%}{{- fn.arguments | tojson -}}{%- endif -%}{{- '}' + (', ' if not loop.last else '') -}}{%- endfor -%}{{- ']</TOOLCALL>' -}}{%- endif -%}{{- '\n<SPECIAL_12>\n' -}}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '<SPECIAL_11>Assistant\n' -}}{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}{{- '<think></think>' -}}{%- else -%}{{- '<think>\n' -}}{%- endif -%}{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}{{- ns.last_turn_assistant_content -}}{%- endif -%}{%- else -%}{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}{{- '<SPECIAL_11>Assistant\n' -}}{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}{{- '<think></think>' -}}{%- else -%}{{- '<think>\n' -}}{%- endif -%}{{- ns.last_turn_assistant_content -}}{%- if continue_final_message is defined -%}{%- if continue_final_message is false -%}{{- '\n<SPECIAL_12>\n' -}}{%- endif -%}{%- else -%}{{- '\n<SPECIAL_12>\n' -}}{%- endif -%}{%- endif -%}{%- endif -%}"
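The inline template above is unwieldy to maintain in a shell command. vLLM's --chat-template flag also accepts a file path, so you can save the same template to a file (e.g., nemotron_summarizer.jinja, filename assumed) and pass that path instead.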
Making Requests
import requests

# System prompt (required)
system_prompt = """[Insert the full system prompt from the prompt.txt file -
see the full prompt in the model repository]"""

# User prompt: the paper text to summarize
paper_text = """
Title: Your Paper Title
Authors: Author 1, Author 2
Abstract: ...
[Full paper content]
"""

# API request
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "inference-net/Paper-Summarizer-Nemotron-12B",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": paper_text},
        ],
        "temperature": 0.2,
    },
    timeout=600,
)

result = response.json()

# Note: Response may include reasoning tokens wrapped in <think></think>
# These are automatically stripped by the chat template
summary = result["choices"][0]["message"]["content"]
print(summary)
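Because the content field should hold a single JSON object, it is worth parsing it immediately. A minimal sketch; the defensive <think> stripping is a precaution in case a raw reasoning block slips through, not a documented requirement:

import json
import re

def parse_summary(content: str) -> dict:
    # Defensively drop a leading reasoning block if one slips through.
    content = re.sub(r"^\s*<think>.*?</think>\s*", "", content, count=1, flags=re.DOTALL)
    # The model emits a single JSON object per the output format below.
    return json.loads(content)

summary_obj = parse_summary(summary)
print(summary_obj["article_classification"])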
System Prompt
The model requires the same system prompt as the Qwen3-14B variant. The prompt instructs the model to:
- Classify the text as SCIENTIFIC_TEXT, PARTIAL_SCIENTIFIC_TEXT, or NON_SCIENTIFIC_TEXT
- Extract structured information including:
  - Title, authors, publication year
  - Research context and hypotheses
  - Methodological details
  - Key results with quantitative data
  - Claims with supporting evidence
  - Limitations and ethical considerations
The full system prompt is available in the model repository's prompt.txt file.
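With a local checkout of the repository, the prompt can be read straight from that file (path assumed relative to your working directory):

# Load the required system prompt from the repository's prompt.txt
with open("prompt.txt", "r", encoding="utf-8") as f:
    system_prompt = f.read()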
Output Format
The model outputs a single valid JSON object with this structure:
{
  "article_classification": "SCIENTIFIC_TEXT",
  "reason": null,
  "summary": {
    "title": "",
    "authors": "",
    "publication_year": null,
    "field_subfield": "",
    "executive_summary": "",
    "research_context": "",
    "methodological_details": "",
    "key_results": "",
    "claims": [...],
    "contradictions_and_limitations": "",
    ...
  }
}
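A lightweight structural check against this schema can catch malformed generations before downstream use. A minimal sketch based only on the fields shown above; the assumption that a summary object accompanies SCIENTIFIC_TEXT outputs is inferred from the example, not from documentation:

VALID_CLASSES = {"SCIENTIFIC_TEXT", "PARTIAL_SCIENTIFIC_TEXT", "NON_SCIENTIFIC_TEXT"}

def check_summary(obj: dict) -> dict:
    # Minimal structural check against the schema shown above.
    if obj.get("article_classification") not in VALID_CLASSES:
        raise ValueError("unexpected article_classification")
    # Assumption: a full summary object accompanies SCIENTIFIC_TEXT outputs.
    if obj["article_classification"] == "SCIENTIFIC_TEXT" and not isinstance(obj.get("summary"), dict):
        raise ValueError("missing summary object")
    return obj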
Performance
LLM-as-a-Judge Evaluation
- Score: 4.095/5.0
- Comparison: Slightly behind Qwen3-14B (4.207) but still high quality
QA Dataset Evaluation
- Accuracy: 71.3%
- Comparison: Strong performance, suitable for batch processing
Throughput (8×H200 node)
- Requests/sec: 0.97 (2.25× faster than Qwen3-14B)
- Input Tokens/sec: 16,943.69
- Output Tokens/sec: 4,880.76
- Single Request Tokens/sec: 76.17
Cost Efficiency
- Processing 100M papers: ~$45,000 (vs $100,000 for Qwen3-14B, $5M+ for GPT-5)
- Ideal for: Large-scale batch processing where throughput matters
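As a rough sanity check on the scale involved, the throughput figure above implies the following single-node processing time; actual cost depends on node pricing and the degree of parallelism, so treat this as illustrative only:

# Back-of-envelope: wall-clock work for 100M papers at 0.97 requests/sec per node
papers = 100_000_000
node_seconds = papers / 0.97      # ~1.03e8 seconds of single-node work
node_hours = node_seconds / 3600  # ~28,600 node-hours (8xH200 nodes)
print(f"{node_hours:,.0f} node-hours; divide by N nodes for wall-clock hours")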
Training Details
- Training Set: 100,000 papers (same as Qwen3-14B)
- Validation Set: 10,000 papers
- Average Paper Length: 81,334 characters
- Architecture: Hybrid Mamba-Transformer for high throughput
- Training Approach: Post-training on summaries generated by frontier models
When to Use This Model
Choose Nemotron-12B if:
- Processing large batches (100K+ papers)
- Throughput and cost are primary concerns
- Accuracy in the 70-75% range is acceptable
- Running on GPU infrastructure with parallel processing
Choose Qwen3-14B if:
- You need the highest possible accuracy (73.9% vs. 71.3%)
- You are processing smaller batches or single papers
- Quality matters more than speed
Limitations
- May generate subtle factual errors (hallucinations) for fine-grained details
- Context limit (131K tokens) may truncate extremely long documents; a simple pre-truncation sketch follows this list
- Unified schema may not capture all domain-specific nuances
- Summaries are research aids, not replacements for primary sources in high-stakes scenarios
- Slightly lower accuracy than Qwen3-14B variant
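For documents that may exceed the context window, a pre-truncation step avoids silent tail loss. A minimal sketch, assuming the repository ships a tokenizer loadable via transformers and reusing paper_text from the request example above; the 120K budget is an arbitrary headroom choice, not a documented limit:

from transformers import AutoTokenizer

# Assumption: the model repository includes a tokenizer usable with transformers.
tok = AutoTokenizer.from_pretrained(
    "inference-net/Paper-Summarizer-Nemotron-12B", trust_remote_code=True
)

MAX_INPUT_TOKENS = 120_000  # leave headroom under the 131K context for prompt + output

ids = tok(paper_text)["input_ids"]
if len(ids) > MAX_INPUT_TOKENS:
    # Keep the head of the paper; title, abstract, and methods usually come first.
    paper_text = tok.decode(ids[:MAX_INPUT_TOKENS], skip_special_tokens=True)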
Related Resources
- Paper Visualization Website: https://laion.inference.net
- Visualization Repository: https://github.com/context-labs/laion-data-explorer
- Alexandria Paper: https://arxiv.org/abs/2502.19413
- Qwen3-14B Variant: inference-net/Paper-Summarizer-Qwen3-14B
License
[License information to be added]
Acknowledgments
This work was made possible through collaboration with:
- LAION
- Wynd Labs
- Inference.net
- NVIDIA (base Nemotron architecture)
- Contributors to bethgelab, PeS2o, Common Pile, and OpenAlex