---
library_name: transformers
tags: []
---
|
|
|
|
|
# Hymba2-2.7B-Instruct |
|
|
|
|
|
Hymba2 is a new family of hybrid small language models (SLMs) that outperforms Qwen models in accuracy (math, coding, and commonsense reasoning), batch-size-1 latency, and throughput. More details are in our NeurIPS 2025 [paper](https://drive.google.com/drive/folders/17vOGktwUfUpRAJPGJUV6oX8XwLSczZtv?usp=sharing).
|
|
|
|
|
Docker path: `/lustre/fsw/portfolios/nvr/users/yongganf/docker/megatron_py25_fla.sqsh` on ORD/NRT or `/lustre/fsw/nvr_lpr_llm/yongganf/docker/megatron_py25_fla.sqsh` on EOS. |
|
|
|
|
|
|
|
|
## Chat with Hymba2-2.7B-Instruct |
|
|
|
|
|
We wrap the model's generation loop in a CUDA graph for fast generation:
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_name = "nvidia/Hymba2-2.7B-Instruct"

# Load the tokenizer and model (trust_remote_code is required for the custom Hymba2 classes).
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

max_new_tokens = 256

# Set up the CUDA-graph generation state once; it is reused for every prompt below.
print("Initializing generation state...")
generation_state = model.init_cuda_graph_generation(
    max_new_tokens=max_new_tokens,
    batch_size=1,
    device="cuda",
)

while True:
    prompt = input("User:")
    if prompt.lower() == "exit":
        break

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    print("Generating with CUDA graph acceleration...")
    outputs = model.generate_with_cuda_graph(
        input_ids=inputs["input_ids"],
        generation_state=generation_state,
        max_new_tokens=max_new_tokens,
        temperature=0,
        top_k=50,
        eos_token_id=tokenizer.eos_token_id,
        profiling=False,
    )

    # Strip the prompt tokens and decode only the newly generated text.
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"Response: {response}")
```
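
If you only need occasional generations, or the CUDA-graph path is not available in your environment, the checkpoint can in principle be driven through the regular Hugging Face API instead. The sketch below is a minimal, hedged example: it assumes the remote-code model class supports the standard `generate()` method and that the tokenizer ships a chat template; adjust accordingly if either assumption does not hold for this checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_name = "nvidia/Hymba2-2.7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

# Build a chat-formatted prompt; assumes the tokenizer provides a chat template.
messages = [{"role": "user", "content": "Explain CUDA graphs in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

# Standard (non-CUDA-graph) greedy decoding through the regular generate() API.
outputs = model.generate(input_ids, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

This path skips the one-time CUDA-graph capture, so per-token latency will be higher than with `generate_with_cuda_graph`, but it avoids any dependency on the custom generation state.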