jennyyyi committed
Commit 7c53838 · verified · 1 Parent(s): f1d45e1

Update README.md

Files changed (1)
  1. README.md +19 -3
README.md CHANGED
@@ -66,8 +66,7 @@ Weight quantization also reduces disk size requirements by approximately 50%. Th
 
  This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Openshift AI, as shown in the example below.
 
- <details>
- <summary>Deploy on <strong>vLLM</strong></summary>
+ Deploy on <strong>vLLM</strong>
 
  ```python
  from vllm import LLM, SamplingParams
@@ -90,7 +89,24 @@ generated_text = outputs[0].outputs[0].text
  print(generated_text)
  ```
 
- vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
+
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+ --ipc=host \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+ --name=vllm \
+ registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+ vllm serve \
+ --tensor-parallel-size 8 \
+ --max-model-len 32768 \
+ --enforce-eager --model RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
+ ```
  </details>
 
  <details>
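For context on the OpenAI-compatible serving mentioned in the diff: once a server like the Red Hat AI Inference Server container above is running, it can be queried over HTTP. The snippet below is a minimal sketch, not part of the commit, assuming the server is reachable at http://localhost:8000/v1 (the port published by `-p 8000:8000` above) and that no API key is enforced, so the placeholder key is arbitrary.

```python
# Minimal sketch: query the OpenAI-compatible endpoint exposed by `vllm serve`.
# Assumes the server is listening on localhost:8000 and accepts any API key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```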