---
library_name: transformers
tags: []
---
|
|
|
|
|
# Hymba2-2.7B-Instruct |
|
|
|
|
|
Hymba2 is a new family of hybrid small language models (SLMs) that outperforms Qwen models in accuracy (math, coding, and commonsense reasoning), batch-size-1 latency, and throughput. More details are in our NeurIPS 2025 [paper](https://drive.google.com/drive/folders/17vOGktwUfUpRAJPGJUV6oX8XwLSczZtv?usp=sharing).
|
|
|
|
|
Docker path: `/lustre/fsw/portfolios/nvr/users/yongganf/docker/megatron_py25_fla.sqsh` on ORD/NRT or `/lustre/fsw/nvr_lpr_llm/yongganf/docker/megatron_py25_fla.sqsh` on EOS. |
|
|
|
|
|
|
|
|
## Chat with Hymba2-2.7B-Instruct |
|
|
|
|
|
We wrap the model's generation loop in a CUDA graph for fast generation:
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_name = "nvidia/Hymba2-2.7B-Instruct"

# Load the tokenizer and model (trust_remote_code is required for the custom Hymba2 classes).
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

max_new_tokens = 256

# Set up the CUDA-graph generation state once; it is reused for every prompt below.
print("Initializing generation state...")
generation_state = model.init_cuda_graph_generation(
    max_new_tokens=max_new_tokens,
    batch_size=1,
    device="cuda",
)

while True:
    prompt = input("User:")
    if prompt.lower() == "exit":
        break

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    print("Generating with CUDA graph acceleration...")
    outputs = model.generate_with_cuda_graph(
        input_ids=inputs["input_ids"],
        generation_state=generation_state,
        max_new_tokens=max_new_tokens,
        temperature=0,
        top_k=50,
        eos_token_id=tokenizer.eos_token_id,
        profiling=False,
    )

    # Strip the prompt tokens and decode only the newly generated text.
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"Response: {response}")
```
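
If you only need occasional generations, or the CUDA-graph path is not available in your environment, the checkpoint can in principle be driven through the regular Hugging Face API instead. The sketch below is a minimal, hedged example: it assumes the remote-code model class supports the standard `generate()` method and that the tokenizer ships a chat template; adjust accordingly if either assumption does not hold for this checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_name = "nvidia/Hymba2-2.7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

# Build a chat-formatted prompt; assumes the tokenizer provides a chat template.
messages = [{"role": "user", "content": "Explain CUDA graphs in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

# Standard (non-CUDA-graph) greedy decoding through the regular generate() API.
outputs = model.generate(input_ids, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

This path skips the one-time CUDA-graph capture, so per-token latency will be higher than with `generate_with_cuda_graph`, but it avoids any dependency on the custom generation state.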