---
library_name: transformers
tags: []
---

# Hymba2-2.7B-Instruct

Hymba2 is a new family of hybrid small language models (SLMs) that outperforms Qwen models in accuracy (math, coding, and commonsense reasoning), batch-size-1 latency, and throughput. More details are in our NeurIPS 2025 [paper](https://drive.google.com/drive/folders/17vOGktwUfUpRAJPGJUV6oX8XwLSczZtv?usp=sharing).

Docker path: `/lustre/fsw/portfolios/nvr/users/yongganf/docker/megatron_py25_fla.sqsh` on ORD/NRT, or `/lustre/fsw/nvr_lpr_llm/yongganf/docker/megatron_py25_fla.sqsh` on EOS.

## Chat with Hymba2-2.7B-Instruct

We wrap generation in a CUDA graph for fast decoding:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_name = "nvidia/Hymba2-2.7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

max_new_tokens = 256

# Build the CUDA-graph generation state once; it is reused across prompts.
print("Initializing generation state...")
generation_state = model.init_cuda_graph_generation(
    max_new_tokens=max_new_tokens,
    batch_size=1,
    device="cuda",
)

while True:
    prompt = input("User: ")
    if prompt.lower() == "exit":
        break

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    print("Generating with CUDA graph acceleration...")
    outputs = model.generate_with_cuda_graph(
        input_ids=inputs["input_ids"],
        generation_state=generation_state,
        max_new_tokens=max_new_tokens,
        temperature=0,
        top_k=50,
        eos_token_id=tokenizer.eos_token_id,
        profiling=False,
    )

    # Decode only the newly generated tokens, skipping the prompt.
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"Response: {response}")
```
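Since this is an instruction-tuned model, prompts are typically formatted with the tokenizer's chat template rather than passed as raw text. The sketch below reuses the `tokenizer`, `model`, and `generation_state` objects from the loop above and assumes the tokenizer ships a standard Hugging Face chat template (not verified against this repository):

```python
# Minimal sketch, assuming the tokenizer provides a chat template.
messages = [
    {"role": "user", "content": "Explain CUDA graphs in one paragraph."},
]

# apply_chat_template inserts the model's special chat tokens and, with
# add_generation_prompt=True, appends the assistant-turn prefix.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate_with_cuda_graph(
    input_ids=input_ids,
    generation_state=generation_state,
    max_new_tokens=max_new_tokens,
    temperature=0,
    top_k=50,
    eos_token_id=tokenizer.eos_token_id,
    profiling=False,
)

# Decode only the assistant's newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```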
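If you do not need CUDA-graph acceleration, the standard `generate` API may also work, since the model is loaded through `AutoModelForCausalLM` with `trust_remote_code=True`. This is an assumption about the remote code and is not confirmed by this card; the sketch below shows the fallback path:

```python
# Minimal sketch, assuming the remote code also supports the standard
# Hugging Face generate() API (unverified for this repository).
inputs = tokenizer("What is 17 * 24?", return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,  # greedy decoding, matching temperature=0 above
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```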