Paper-Summarizer-Nemotron-12B

A fine-tuned Nemotron-12B model specialized for generating structured summaries of scientific research papers in a standardized JSON format, running at 2.25× the throughput of the Qwen3-14B variant.

Model Description

This model is part of Project AELLA, developed in collaboration with LAION and Wynd Labs to democratize access to scientific knowledge by creating structured summaries of research papers at scale.

  • Base Model: NVIDIA Nemotron 12B (Hybrid Mamba-Transformer)
  • Training Data: 110,000 curated research papers
  • Performance: 71.3% accuracy on QA evaluation
  • Throughput: 2.25× faster than the Qwen3-14B variant

The model generates comprehensive structured summaries in a fixed JSON format. Each paper is classified as SCIENTIFIC_TEXT, PARTIAL_SCIENTIFIC_TEXT, or NON_SCIENTIFIC_TEXT, and key research elements such as methodology, results, claims, and limitations are extracted into dedicated fields.

The model supports papers up to 131K tokens and is optimized for large-scale batch processing, sustaining 0.97 requests/sec on an 8×H200 node.

Usage

Serving the Model

Note: This model requires a custom chat template for proper reasoning token handling.

vllm serve inference-net/Paper-Summarizer-Nemotron-12B \
  --port 8000 \
  --host 0.0.0.0 \
  --trust-remote-code \
  --data-parallel-size 1 \
  --tensor-parallel-size 1 \
  --max-num-seqs 32 \
  --max-model-len 131072 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --chat-template "{%- set ns = namespace(enable_thinking=true) %}{%- for message in messages -%}{%- set content = message['content'] -%}{%- if message['role'] == 'user' or message['role'] == 'system' -%}{%- if '/think' in content -%}{%- set ns.enable_thinking = true -%}{%- elif '/no_think' in content -%}{%- set ns.enable_thinking = false -%}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if messages[0]['role'] != 'system' -%}{%- set ns.non_tool_system_content = '' -%}{{- '<SPECIAL_10>System\n' -}}{%- else -%}{%- set ns.non_tool_system_content = messages[0]['content'].replace('/think', '').replace('/no_think', '').strip() -%}{{- '<SPECIAL_10>System\n' + ns.non_tool_system_content }}{%- endif -%}{%- if tools -%}{%- if ns.non_tool_system_content is defined and ns.non_tool_system_content != '' -%}{{- '\n\n' -}}{%- endif -%}{{- 'You can use the following tools to assist the user if required:' -}}{{- '\n<AVAILABLE_TOOLS>[' -}}{%- for tool in tools -%}{{- (tool.function if tool.function is defined else tool) | tojson -}}{{- ', ' if not loop.last else '' -}}{%- endfor -%}{{- ']</AVAILABLE_TOOLS>\n\n' -}}{{- 'If you decide to call any tool(s), use the following format:\n' -}}{{- '<TOOLCALL>[{{\"name\": \"tool_name1\", \"arguments\": \"tool_args1\"}}, ' -}}{{- '{{\"name\": \"tool_name2\", \"arguments\": \"tool_args2\"}}]</TOOLCALL>\n\n' -}}{{- 'The user will execute tool-calls and return responses from tool(s) in this format:\n' -}}{{- '<TOOL_RESPONSE>[{{\"tool_response1\"}}, {{\"tool_response2\"}}]</TOOL_RESPONSE>\n\n' -}}{{- 'Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}{%- endif -%}{{- '\n' -}}{%- set messages = messages[1:] if messages[0]['role'] == 'system' else messages -%}{%- if messages[-1]['role'] == 'assistant' -%}{%- set ns.last_turn_assistant_content = messages[-1]['content'].strip() -%}{%- set messages = messages[:-1] -%}{%- endif -%}{%- for message in messages %}{%- set content = message['content'] %}{%- if message['role'] == 'user' -%}{{- '<SPECIAL_11>User\n' + content.replace('/think', '').replace('/no_think', '').strip() + '\n' }}{%- elif message['role'] == 'tool' -%}{%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}{{- '<SPECIAL_11>User\n' + '<TOOL_RESPONSE>[' }}{%- endif -%}{{- message['content'] -}}{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}{%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}{{- ']</TOOL_RESPONSE>\n' -}}{%- endif -%}{%- elif message['role'] == 'assistant' -%}{%- if '</think>' in content -%}{%- set content = content.split('</think>')[1].strip() %}{%- endif -%}{{- '<SPECIAL_11>Assistant\n' + content.strip() }}{%- if message.tool_calls -%}{%- if content.strip() != '' -%}{{- '\n\n' -}}{%- endif -%}{{- '<TOOLCALL>[' -}}{%- for call in message.tool_calls -%}{%- set fn = call.function if call.function is defined else call -%}{{- '{\"name\": \"' + fn.name + '\", \"arguments\": ' -}}{%- if fn.arguments is string -%}{{- fn.arguments -}}{%- else -%}{{- fn.arguments | tojson -}}{%- endif -%}{{- '}' + (', ' if not loop.last else '') -}}{%- endfor -%}{{- ']</TOOLCALL>' -}}{%- endif -%}{{- '\n<SPECIAL_12>\n' -}}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '<SPECIAL_11>Assistant\n' -}}{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}{{- '<think></think>' -}}{%- else -%}{{- '<think>\n' -}}{%- endif -%}{%- if ns.last_turn_assistant_content is defined and 
ns.last_turn_assistant_content != '' -%}{{- ns.last_turn_assistant_content -}}{%- endif -%}{%- else -%}{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}{{- '<SPECIAL_11>Assistant\n' -}}{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}{{- '<think></think>' -}}{%- else -%}{{- '<think>\n' -}}{%- endif -%}{{- ns.last_turn_assistant_content -}}{%- if continue_final_message is defined -%}{%- if continue_final_message is false -%}{{- '\n<SPECIAL_12>\n' -}}{%- endif -%}{%- else -%}{{- '\n<SPECIAL_12>\n' -}}{%- endif -%}{%- endif -%}{%- endif -%}"
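
Once the server is running, you can confirm the model is registered via the OpenAI-compatible /v1/models endpoint. A quick sanity check (assuming the host and port from the command above):

import requests

# List the models served by the vLLM OpenAI-compatible server.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
print(resp.json()["data"][0]["id"])  # expect: inference-net/Paper-Summarizer-Nemotron-12B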

Making Requests

import requests

# System prompt (required)
system_prompt = """[Insert the full system prompt from the prompt.txt file -
see the full prompt in the model repository]"""

# User prompt: the paper text to summarize
paper_text = """
Title: Your Paper Title
Authors: Author 1, Author 2
Abstract: ...
[Full paper content]
"""

# API request
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "inference-net/Paper-Summarizer-Nemotron-12B",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": paper_text}
        ],
        "temperature": 0.2
    },
    timeout=600
)

result = response.json()
# Note: the raw completion may include reasoning tokens wrapped in
# <think></think>. The chat template strips these from prior turns,
# but the returned generation can still contain them (see the sketch below).
summary = result["choices"][0]["message"]["content"]
print(summary)
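
If the completion does include a reasoning block, a minimal post-processing sketch (assuming the <think></think> wrapping noted above) is:

import json
import re

raw = result["choices"][0]["message"]["content"]
# Drop any <think>...</think> reasoning block before parsing.
cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()

try:
    summary = json.loads(cleaned)
    print(summary["article_classification"])
except json.JSONDecodeError:
    # Fall back to the raw text if the model emitted malformed JSON.
    print(cleaned)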

System Prompt

The model requires the same system prompt as the Qwen3-14B variant. The prompt instructs the model to:

  1. Classify the text as SCIENTIFIC_TEXT, PARTIAL_SCIENTIFIC_TEXT, or NON_SCIENTIFIC_TEXT
  2. Extract structured information including:
    • Title, authors, publication year
    • Research context and hypotheses
    • Methodological details
    • Key results with quantitative data
    • Claims with supporting evidence
    • Limitations and ethical considerations

The full system prompt is available in the model repository's prompt.txt file.
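
One way to fetch it programmatically is via huggingface_hub (a sketch; assumes the file is published as prompt.txt at the repository root):

from huggingface_hub import hf_hub_download

# Download prompt.txt from the model repository and use it as the system prompt.
prompt_path = hf_hub_download(
    repo_id="inference-net/Paper-Summarizer-Nemotron-12B",
    filename="prompt.txt",
)
with open(prompt_path, encoding="utf-8") as f:
    system_prompt = f.read()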

Output Format

The model outputs a single valid JSON object with this structure:

{
  "article_classification": "SCIENTIFIC_TEXT",
  "reason": null,
  "summary": {
    "title": "",
    "authors": "",
    "publication_year": null,
    "field_subfield": "",
    "executive_summary": "",
    "research_context": "",
    "methodological_details": "",
    "key_results": "",
    "claims": [...],
    "contradictions_and_limitations": "",
    ...
  }
}
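
Downstream pipelines usually want to fail fast on malformed outputs. A minimal validation sketch (field names taken from the structure above; the checks themselves are illustrative, not part of the model contract):

import json

ALLOWED_CLASSES = {"SCIENTIFIC_TEXT", "PARTIAL_SCIENTIFIC_TEXT", "NON_SCIENTIFIC_TEXT"}

def validate_summary(payload: str) -> dict:
    """Parse a model response and sanity-check its top-level structure."""
    obj = json.loads(payload)
    if obj.get("article_classification") not in ALLOWED_CLASSES:
        raise ValueError("unexpected article_classification")
    if obj["article_classification"] == "SCIENTIFIC_TEXT" and not obj.get("summary"):
        raise ValueError("missing summary for a scientific text")
    return obj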

Performance

LLM-as-a-Judge Evaluation

  • Score: 4.095/5.0
  • Comparison: Slightly behind Qwen3-14B (4.207) but still high quality

QA Dataset Evaluation

  • Accuracy: 71.3%
  • Comparison: 2.6 points below Qwen3-14B (73.9%); strong enough for large-scale batch processing

Throughput (8×H200 node)

  • Requests/sec: 0.97 (2.25× faster than Qwen3-14B)
  • Input Tokens/sec: 16,943.69
  • Output Tokens/sec: 4,880.76
  • Single Request Tokens/sec: 76.17
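
As a back-of-envelope figure, 0.97 requests/sec sustained works out to roughly 84,000 papers per day per 8×H200 node, so a 100M-paper corpus is on the order of 1,200 node-days.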

Cost Efficiency

  • Processing 100M papers: ~$45,000 (vs $100,000 for Qwen3-14B, $5M+ for GPT-5)
  • Ideal for: Large-scale batch processing where throughput matters
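
At the quoted figures, the per-paper cost is roughly $0.00045, versus about $0.001 per paper for Qwen3-14B.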

Training Details

  • Training Set: 100,000 papers (same as Qwen3-14B)
  • Validation Set: 10,000 papers
  • Average Paper Length: 81,334 characters
  • Architecture: Hybrid Mamba-Transformer for high throughput
  • Training Approach: Post-training on summaries generated by frontier models

When to Use This Model

Choose Nemotron-12B if:

  • Processing large batches (100K+ papers)
  • Throughput and cost are primary concerns
  • Accuracy in the 70-75% range is acceptable
  • Running on GPU infrastructure with parallel processing

Choose Qwen3-14B if:

  • Need highest possible accuracy (73.9% vs 71.3%)
  • Processing smaller batches or single papers
  • Quality is more important than speed

Limitations

  • May generate subtle factual errors (hallucinations) for fine-grained details
  • Context limit (131K tokens) may truncate extremely long documents
  • Unified schema may not capture all domain-specific nuances
  • Summaries are research aids, not replacements for primary sources in high-stakes scenarios
  • Slightly lower accuracy than Qwen3-14B variant

License

[License information to be added]

Acknowledgments

This work was made possible through collaboration with:

  • LAION
  • Wynd Labs
  • Inference.net
  • NVIDIA (base Nemotron architecture)
  • Contributors to bethgelab, PeS2o, Common Pile, and OpenAlex