Update README.md for Validation

#5
Files changed (1)
  1. README.md +169 -5
README.md CHANGED
@@ -31,8 +31,15 @@ license: other
  license_name: llama4
  ---

- # Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
- **Built with Llama**
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
+
  ## Model Overview
  - **Model Architecture:** Llama4ForConditionalGeneration
  - **Input:** Text / Image
@@ -51,10 +58,11 @@ This model was obtained by quantizing activations and weights of [Llama-4-Scout-
  This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
  Weight quantization also reduces disk size requirements by approximately 50%. The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.

-
  ## Deployment

- This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+ This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.
+
+ Deploy on <strong>vLLM</strong>

  ```python
  from vllm import LLM, SamplingParams
@@ -77,7 +85,163 @@ generated_text = outputs[0].outputs[0].text
  print(generated_text)
  ```

- vLLM aslo supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
+
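A minimal sketch of that OpenAI-compatible mode, assuming the model is served locally with `vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic` and that the `openai` Python client is installed; the default port 8000, the placeholder API key, and the prompt are assumptions:

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible vLLM server, e.g.:
#   vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic --tensor-parallel-size 8
# vLLM listens on port 8000 by default and ignores the API key unless --api-key is set,
# so any placeholder string works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```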
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+   --ipc=host \
+   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+   --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+   --name=vllm \
+   registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+   vllm serve \
+   --tensor-parallel-size 8 \
+   --max-model-len 32768 \
+   --enforce-eager --model RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
+ ```
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/llama-4-scout-17b-16e-instruct-fp8-dynamic:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct-fp8-dynamic
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct-fp8-dynamic
+ ```
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Llama-4-Scout-17B-16E-Instruct-FP8-dynamic # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: Llama-4-Scout-17B-16E-Instruct-FP8-dynamic # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-4-scout-17b-16e-instruct-fp8-dynamic:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure you are in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
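For reference, a Python equivalent of the curl call above, as a minimal sketch using the third-party `requests` library against the same placeholder route; the timeout value and the raw printing of streamed lines are assumptions:

```python
import requests

# Same placeholder URL as the curl example above; replace <inference-service-name>
# and <cluster-ingress-domain> with the values from `oc get inferenceservice`.
url = "https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions"

payload = {
    "model": "Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    "stream": True,
    "stream_options": {"include_usage": True},
    "max_tokens": 1,
    "messages": [
        {"role": "user", "content": "How can a bee fly when its wings are so small?"}
    ],
}

# Stream the response and print each server-sent-event line as it arrives.
with requests.post(url, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))
```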

  ## Creation