---
library_name: transformers
tags: []
---

# Hymba2-2.7B-Instruct

Hymba2 is a new family of hybrid small language models (SLMs) that outperforms Qwen models in accuracy (math, coding, and commonsense reasoning), batch-size-1 latency, and throughput. More details are in our NeurIPS 2025 [paper](https://drive.google.com/drive/folders/17vOGktwUfUpRAJPGJUV6oX8XwLSczZtv?usp=sharing).

Docker path: `/lustre/fsw/portfolios/nvr/users/yongganf/docker/megatron_py25_fla.sqsh` on ORD/NRT or `/lustre/fsw/nvr_lpr_llm/yongganf/docker/megatron_py25_fla.sqsh` on EOS.


## Chat with Hymba2-2.7B-Instruct

We wrap the model's decoding loop in a CUDA Graph for fast generation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_name = "nvidia/Hymba2-2.7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

max_new_tokens = 256

# Capture the decoding step in a CUDA Graph once; the state is reused across prompts.
print('Initializing generation state...')
generation_state = model.init_cuda_graph_generation(
    max_new_tokens=max_new_tokens,
    batch_size=1,
    device='cuda',
)

while True:
    prompt = input("User: ")
    if prompt.lower() == "exit":
        break

    inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

    print("Generating with CUDA graph acceleration...")
    outputs = model.generate_with_cuda_graph(
        input_ids=inputs["input_ids"],
        generation_state=generation_state,
        max_new_tokens=max_new_tokens,
        temperature=0,  # greedy decoding
        top_k=50,
        eos_token_id=tokenizer.eos_token_id,
        profiling=False,
    )

    # Strip the prompt tokens and decode only the newly generated ones.
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

    print(f"Response: {response}")
```
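
Since this is an instruction-tuned checkpoint, you may get better results by formatting prompts with the tokenizer's chat template before generation. The sketch below is a minimal example, assuming the Hymba2 tokenizer ships a chat template usable through the standard `apply_chat_template` API in `transformers`; the generation call reuses the same `generation_state` and arguments as above.

```python
# Hedged sketch: assumes the tokenizer provides a chat template.
messages = [
    {"role": "user", "content": "Explain CUDA Graphs in one sentence."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model starts its reply
    return_tensors="pt",
).to("cuda")

outputs = model.generate_with_cuda_graph(
    input_ids=input_ids,
    generation_state=generation_state,
    max_new_tokens=max_new_tokens,
    temperature=0,
    top_k=50,
    eos_token_id=tokenizer.eos_token_id,
    profiling=False,
)
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```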