Update README.md
README.md
CHANGED
@@ -66,8 +66,7 @@ Weight quantization also reduces disk size requirements by approximately 50%. Th

This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Openshift AI, as shown in the example below.

-<details>
-<summary>Deploy on <strong>vLLM</strong></summary>
+Deploy on <strong>vLLM</strong>

```python
from vllm import LLM, SamplingParams

@@ -90,7 +89,24 @@ generated_text = outputs[0].outputs[0].text
print(generated_text)
```

-
+vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
+
+<details>
+<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+```bash
+$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+  --ipc=host \
+  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+  --name=vllm \
+  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+  vllm serve \
+  --tensor-parallel-size 8 \
+  --max-model-len 32768 \
+  --enforce-eager --model RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
+```
</details>

<details>
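The diff elides the body of the Python example (README lines 74-89 fall between the two hunks). As a rough guide to the pattern the visible fragments imply, here is a minimal vLLM offline-inference sketch; the prompt, sampling values, and `tensor_parallel_size` are illustrative assumptions, not the README's actual values:

```python
# A minimal sketch of the offline-inference pattern the README snippet follows.
# Prompt, sampling values, and parallelism are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    tensor_parallel_size=8,  # assumed to match the serving command above
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Generate from a plain-text prompt and read back the completion.
outputs = llm.generate(["Give a one-sentence summary of FP8 quantization."], sampling_params)
generated_text = outputs[0].outputs[0].text  # matches the fragment shown in the hunk
print(generated_text)
```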
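The added note about OpenAI-compatible serving applies to the podman command as well: the container runs `vllm serve`, which exposes an OpenAI-compatible API on port 8000. A minimal client sketch, assuming the server is reachable at localhost:8000 and no API key is configured (vLLM's default):

```python
# Query the OpenAI-compatible endpoint exposed by `vllm serve` (as started by
# the container above). Host, port, and the placeholder API key are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "What is FP8 dynamic quantization?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```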