vLLM serving Tulu 3 Llama 405B

#3
by JettLam - opened

Hi,

I have tested the same command shown on the Hugging Face model page for hosting the model through vLLM. I have two nodes with 8 GPUs each set up, with sufficient VRAM.

In the CLI, I ran the following line:
vllm serve /path/to/model --tensor-parallel-size 8 --pipeline-parallel-size 2

The model hosts successfully, but the response is only exclamation marks regardless of the input. I would love to hear how your team manages to serve it through vLLM.
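For reference, here is a minimal sketch of a client-side check that reproduces the symptom, assuming the default OpenAI-compatible endpoint that vllm serve exposes on port 8000; the model path, prompt, and API key value are just placeholders:

# Sketch only: assumes the server launched with the command above is reachable locally.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="/path/to/model",      # must match the path/name the server was started with
    prompt="What is the capital of France?",
    max_tokens=32,
    temperature=0,               # greedy decoding, so the output is deterministic
)
print(repr(resp.choices[0].text))  # comes back as a string of "!" regardless of the prompt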

Thanks!

Hey @JettLam, what dtype are you using? If your vocab size is > 2**16, make sure you're using uint32 for token indices.
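To make that concrete, here is a small NumPy illustration (not vLLM internals) of what happens when token indices above 2**16 - 1 land in a 16-bit buffer; 128,256 is Llama 3.1's vocabulary size, and the specific token IDs are arbitrary valid examples:

import numpy as np

VOCAB_SIZE = 128_256                        # Llama 3.1 vocab, well above 2**16 = 65_536
token_ids = np.array([5, 70_000, 128_000])  # all valid IDs for this vocab

wrapped = token_ids.astype(np.uint16)       # silently reduced modulo 2**16
kept = token_ids.astype(np.uint32)          # values preserved

print(wrapped)  # [    5  4464 62464] -> the last two now point at the wrong tokens
print(kept)     # [     5  70000 128000]

Roughly half of a 128k vocabulary gets remapped this way, which is one way a 16-bit token-index buffer can surface as degenerate, repetitive output.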

Ai2 org

Hi, thanks again for the inquiry! We're currently working on closing out old tickets, so we're closing this one for now, but if you need a follow-up response, please re-open this ticket or open a new one and we will get back to you!

baileyk changed discussion status to closed
