vLLM serving Tulu 3 Llama 405B

#3
by JettLam - opened

Hi,

I have tested the same command shown on the Hugging Face model page for hosting the model through vLLM. I have two nodes with 8 GPUs each set up, with sufficient VRAM.

In the CLI, I ran the following line:
vllm serve /path/to/model --tensor-parallel-size 8 --pipeline-parallel-size 2

The model hosts successfully, but the response is only exclamation marks regardless of the input. I would love to hear how your team manages to serve it through vLLM.
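For reference, here is a minimal sketch of a client-side check that reproduces the symptom, assuming the default OpenAI-compatible endpoint that vllm serve exposes on port 8000; the model path, prompt, and API key value are just placeholders:

# Sketch only: assumes the server launched with the command above is reachable locally.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="/path/to/model",      # must match the path/name the server was started with
    prompt="What is the capital of France?",
    max_tokens=32,
    temperature=0,               # greedy decoding, so the output is deterministic
)
print(repr(resp.choices[0].text))  # comes back as a string of "!" regardless of the prompt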

Thanks!

Hey @JettLam, what dtype are you using? If your vocab size is > 2**16, make sure you're using uint32 for token indices.
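To make that concrete, here is a small NumPy illustration (not vLLM internals) of what happens when token indices above 2**16 - 1 land in a 16-bit buffer; 128,256 is Llama 3.1's vocabulary size, and the specific token IDs are arbitrary valid examples:

import numpy as np

VOCAB_SIZE = 128_256                        # Llama 3.1 vocab, well above 2**16 = 65_536
token_ids = np.array([5, 70_000, 128_000])  # all valid IDs for this vocab

wrapped = token_ids.astype(np.uint16)       # silently reduced modulo 2**16
kept = token_ids.astype(np.uint32)          # values preserved

print(wrapped)  # [    5  4464 62464] -> the last two now point at the wrong tokens
print(kept)     # [     5  70000 128000]

Roughly half of a 128k vocabulary gets remapped this way, which is one way a 16-bit token-index buffer can surface as degenerate, repetitive output.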

Ai2 org

Hi, thanks again for the inquiry! We're currently working on closing out old tickets, so we're closing this one for now, but if you need a follow-up response, please re-open this ticket or open a new one and we will get back to you!

baileyk changed discussion status to closed
