Paper-Summarizer-Nemotron-12B

A fine-tuned Nemotron-12B model specialized for generating structured summaries of scientific research papers in a standardized JSON format, running at 2.25× the throughput of the Qwen3-14B variant.

Model Description

This model is part of Project AELLA, developed in collaboration with LAION and Wynd Labs to democratize access to scientific knowledge by creating structured summaries of research papers at scale.

  • Base Model: NVIDIA Nemotron 12B (Hybrid Mamba-Transformer)
  • Training Data: 110,000 curated research papers
  • Performance: 71.3% accuracy on QA evaluation
  • Throughput: 2.25× faster than the Qwen3-14B variant

The model generates comprehensive structured summaries in a fixed JSON format. Each paper is classified as SCIENTIFIC_TEXT, PARTIAL_SCIENTIFIC_TEXT, or NON_SCIENTIFIC_TEXT, and key research elements such as methodology, results, claims, and limitations are extracted into dedicated fields.

The model supports papers up to 131K tokens and is optimized for large-scale batch processing, sustaining 0.97 requests/sec on an 8×H200 node.

Usage

Serving the Model

Note: This model requires a custom chat template for proper reasoning token handling.

vllm serve inference-net/Paper-Summarizer-Nemotron-12B \
  --port 8000 \
  --host 0.0.0.0 \
  --trust-remote-code \
  --data-parallel-size 1 \
  --tensor-parallel-size 1 \
  --max-num-seqs 32 \
  --max-model-len 131072 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --chat-template "{%- set ns = namespace(enable_thinking=true) %}{%- for message in messages -%}{%- set content = message['content'] -%}{%- if message['role'] == 'user' or message['role'] == 'system' -%}{%- if '/think' in content -%}{%- set ns.enable_thinking = true -%}{%- elif '/no_think' in content -%}{%- set ns.enable_thinking = false -%}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if messages[0]['role'] != 'system' -%}{%- set ns.non_tool_system_content = '' -%}{{- '<SPECIAL_10>System\n' -}}{%- else -%}{%- set ns.non_tool_system_content = messages[0]['content'].replace('/think', '').replace('/no_think', '').strip() -%}{{- '<SPECIAL_10>System\n' + ns.non_tool_system_content }}{%- endif -%}{%- if tools -%}{%- if ns.non_tool_system_content is defined and ns.non_tool_system_content != '' -%}{{- '\n\n' -}}{%- endif -%}{{- 'You can use the following tools to assist the user if required:' -}}{{- '\n<AVAILABLE_TOOLS>[' -}}{%- for tool in tools -%}{{- (tool.function if tool.function is defined else tool) | tojson -}}{{- ', ' if not loop.last else '' -}}{%- endfor -%}{{- ']</AVAILABLE_TOOLS>\n\n' -}}{{- 'If you decide to call any tool(s), use the following format:\n' -}}{{- '<TOOLCALL>[{{\"name\": \"tool_name1\", \"arguments\": \"tool_args1\"}}, ' -}}{{- '{{\"name\": \"tool_name2\", \"arguments\": \"tool_args2\"}}]</TOOLCALL>\n\n' -}}{{- 'The user will execute tool-calls and return responses from tool(s) in this format:\n' -}}{{- '<TOOL_RESPONSE>[{{\"tool_response1\"}}, {{\"tool_response2\"}}]</TOOL_RESPONSE>\n\n' -}}{{- 'Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}{%- endif -%}{{- '\n' -}}{%- set messages = messages[1:] if messages[0]['role'] == 'system' else messages -%}{%- if messages[-1]['role'] == 'assistant' -%}{%- set ns.last_turn_assistant_content = messages[-1]['content'].strip() -%}{%- set messages = messages[:-1] -%}{%- endif -%}{%- for message in messages %}{%- set content = message['content'] %}{%- if message['role'] == 'user' -%}{{- '<SPECIAL_11>User\n' + content.replace('/think', '').replace('/no_think', '').strip() + '\n' }}{%- elif message['role'] == 'tool' -%}{%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}{{- '<SPECIAL_11>User\n' + '<TOOL_RESPONSE>[' }}{%- endif -%}{{- message['content'] -}}{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}{%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}{{- ']</TOOL_RESPONSE>\n' -}}{%- endif -%}{%- elif message['role'] == 'assistant' -%}{%- if '</think>' in content -%}{%- set content = content.split('</think>')[1].strip() %}{%- endif -%}{{- '<SPECIAL_11>Assistant\n' + content.strip() }}{%- if message.tool_calls -%}{%- if content.strip() != '' -%}{{- '\n\n' -}}{%- endif -%}{{- '<TOOLCALL>[' -}}{%- for call in message.tool_calls -%}{%- set fn = call.function if call.function is defined else call -%}{{- '{\"name\": \"' + fn.name + '\", \"arguments\": ' -}}{%- if fn.arguments is string -%}{{- fn.arguments -}}{%- else -%}{{- fn.arguments | tojson -}}{%- endif -%}{{- '}' + (', ' if not loop.last else '') -}}{%- endfor -%}{{- ']</TOOLCALL>' -}}{%- endif -%}{{- '\n<SPECIAL_12>\n' -}}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '<SPECIAL_11>Assistant\n' -}}{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}{{- '<think></think>' -}}{%- else -%}{{- '<think>\n' -}}{%- endif -%}{%- if ns.last_turn_assistant_content is defined and 
ns.last_turn_assistant_content != '' -%}{{- ns.last_turn_assistant_content -}}{%- endif -%}{%- else -%}{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}{{- '<SPECIAL_11>Assistant\n' -}}{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}{{- '<think></think>' -}}{%- else -%}{{- '<think>\n' -}}{%- endif -%}{{- ns.last_turn_assistant_content -}}{%- if continue_final_message is defined -%}{%- if continue_final_message is false -%}{{- '\n<SPECIAL_12>\n' -}}{%- endif -%}{%- else -%}{{- '\n<SPECIAL_12>\n' -}}{%- endif -%}{%- endif -%}{%- endif -%}"
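
Once the server is running, you can confirm the model is registered via the OpenAI-compatible /v1/models endpoint. A quick sanity check (assuming the host and port from the command above):

import requests

# List the models served by the vLLM OpenAI-compatible server.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
print(resp.json()["data"][0]["id"])  # expect: inference-net/Paper-Summarizer-Nemotron-12B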

Making Requests

import requests

# System prompt (required)
system_prompt = """[Insert the full system prompt from the prompt.txt file -
see the full prompt in the model repository]"""

# User prompt: the paper text to summarize
paper_text = """
Title: Your Paper Title
Authors: Author 1, Author 2
Abstract: ...
[Full paper content]
"""

# API request
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "inference-net/Paper-Summarizer-Nemotron-12B",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": paper_text}
        ],
        "temperature": 0.2
    },
    timeout=600
)

result = response.json()
# Note: the raw completion may include reasoning tokens wrapped in
# <think></think>. The chat template strips these from prior turns,
# but the returned generation can still contain them (see the sketch below).
summary = result["choices"][0]["message"]["content"]
print(summary)
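
If the completion does include a reasoning block, a minimal post-processing sketch (assuming the <think></think> wrapping noted above) is:

import json
import re

raw = result["choices"][0]["message"]["content"]
# Drop any <think>...</think> reasoning block before parsing.
cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()

try:
    summary = json.loads(cleaned)
    print(summary["article_classification"])
except json.JSONDecodeError:
    # Fall back to the raw text if the model emitted malformed JSON.
    print(cleaned)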

System Prompt

The model requires the same system prompt as the Qwen3-14B variant. The prompt instructs the model to:

  1. Classify the text as SCIENTIFIC_TEXT, PARTIAL_SCIENTIFIC_TEXT, or NON_SCIENTIFIC_TEXT
  2. Extract structured information including:
    • Title, authors, publication year
    • Research context and hypotheses
    • Methodological details
    • Key results with quantitative data
    • Claims with supporting evidence
    • Limitations and ethical considerations

The full system prompt is available in the model repository's prompt.txt file.
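
One way to fetch it programmatically is via huggingface_hub (a sketch; assumes the file is published as prompt.txt at the repository root):

from huggingface_hub import hf_hub_download

# Download prompt.txt from the model repository and use it as the system prompt.
prompt_path = hf_hub_download(
    repo_id="inference-net/Paper-Summarizer-Nemotron-12B",
    filename="prompt.txt",
)
with open(prompt_path, encoding="utf-8") as f:
    system_prompt = f.read()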

Output Format

The model outputs a single valid JSON object with this structure:

{
  "article_classification": "SCIENTIFIC_TEXT",
  "reason": null,
  "summary": {
    "title": "",
    "authors": "",
    "publication_year": null,
    "field_subfield": "",
    "executive_summary": "",
    "research_context": "",
    "methodological_details": "",
    "key_results": "",
    "claims": [...],
    "contradictions_and_limitations": "",
    ...
  }
}
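
Downstream pipelines usually want to fail fast on malformed outputs. A minimal validation sketch (field names taken from the structure above; the checks themselves are illustrative, not part of the model contract):

import json

ALLOWED_CLASSES = {"SCIENTIFIC_TEXT", "PARTIAL_SCIENTIFIC_TEXT", "NON_SCIENTIFIC_TEXT"}

def validate_summary(payload: str) -> dict:
    """Parse a model response and sanity-check its top-level structure."""
    obj = json.loads(payload)
    if obj.get("article_classification") not in ALLOWED_CLASSES:
        raise ValueError("unexpected article_classification")
    if obj["article_classification"] == "SCIENTIFIC_TEXT" and not obj.get("summary"):
        raise ValueError("missing summary for a scientific text")
    return obj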

Performance

LLM-as-a-Judge Evaluation

  • Score: 4.095/5.0
  • Comparison: Slightly behind Qwen3-14B (4.207) but still high quality

QA Dataset Evaluation

  • Accuracy: 71.3%
  • Comparison: 2.6 points below Qwen3-14B (73.9%); strong enough for large-scale batch processing

Throughput (8×H200 node)

  • Requests/sec: 0.97 (2.25× faster than Qwen3-14B)
  • Input Tokens/sec: 16,943.69
  • Output Tokens/sec: 4,880.76
  • Single Request Tokens/sec: 76.17
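
As a back-of-envelope figure, 0.97 requests/sec sustained works out to roughly 84,000 papers per day per 8×H200 node, so a 100M-paper corpus is on the order of 1,200 node-days.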

Cost Efficiency

  • Processing 100M papers: ~$45,000 (vs $100,000 for Qwen3-14B, $5M+ for GPT-5)
  • Ideal for: Large-scale batch processing where throughput matters
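
At the quoted figures, the per-paper cost is roughly $0.00045, versus about $0.001 per paper for Qwen3-14B.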

Training Details

  • Training Set: 100,000 papers (same as Qwen3-14B)
  • Validation Set: 10,000 papers
  • Average Paper Length: 81,334 characters
  • Architecture: Hybrid Mamba-Transformer for high throughput
  • Training Approach: Post-training on summaries generated by frontier models

When to Use This Model

Choose Nemotron-12B if:

  • Processing large batches (100K+ papers)
  • Throughput and cost are primary concerns
  • Accuracy in the 70-75% range is acceptable
  • Running on GPU infrastructure with parallel processing

Choose Qwen3-14B if:

  • Need highest possible accuracy (73.9% vs 71.3%)
  • Processing smaller batches or single papers
  • Quality is more important than speed

Limitations

  • May generate subtle factual errors (hallucinations) for fine-grained details
  • Context limit (131K tokens) may truncate extremely long documents
  • Unified schema may not capture all domain-specific nuances
  • Summaries are research aids, not replacements for primary sources in high-stakes scenarios
  • Slightly lower accuracy than Qwen3-14B variant

License

[License information to be added]

Acknowledgments

This work was made possible through collaboration with:

  • LAION
  • Wynd Labs
  • Inference.net
  • NVIDIA (base Nemotron architecture)
  • Contributors to bethgelab, PeS2o, Common Pile, and OpenAlex