Update README.md
README.md
CHANGED
@@ -66,8 +66,7 @@ Weight quantization also reduces disk size requirements by approximately 50%. Th

This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Openshift AI, as shown in the example below.

-<details>
-<summary>Deploy on <strong>vLLM</strong></summary>
+Deploy on <strong>vLLM</strong>

```python
from vllm import LLM, SamplingParams

@@ -90,7 +89,24 @@ generated_text = outputs[0].outputs[0].text
print(generated_text)
```

-
+vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
+
+<details>
+<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+```bash
+$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+  --ipc=host \
+  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+  --name=vllm \
+  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+  vllm serve \
+  --tensor-parallel-size 8 \
+  --max-model-len 32768 \
+  --enforce-eager --model RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
+```
</details>

<details>
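The diff elides the body of the Python example (README lines 74-89 fall between the two hunks). As a rough guide to the pattern the visible fragments imply, here is a minimal vLLM offline-inference sketch; the prompt, sampling values, and `tensor_parallel_size` are illustrative assumptions, not the README's actual values:

```python
# A minimal sketch of the offline-inference pattern the README snippet follows.
# Prompt, sampling values, and parallelism are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    tensor_parallel_size=8,  # assumed to match the serving command above
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Generate from a plain-text prompt and read back the completion.
outputs = llm.generate(["Give a one-sentence summary of FP8 quantization."], sampling_params)
generated_text = outputs[0].outputs[0].text  # matches the fragment shown in the hunk
print(generated_text)
```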
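The added note about OpenAI-compatible serving applies to the podman command as well: the container runs `vllm serve`, which exposes an OpenAI-compatible API on port 8000. A minimal client sketch, assuming the server is reachable at localhost:8000 and no API key is configured (vLLM's default):

```python
# Query the OpenAI-compatible endpoint exposed by `vllm serve` (as started by
# the container above). Host, port, and the placeholder API key are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "What is FP8 dynamic quantization?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```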