Update README.md for Validation

#5
Files changed (1)
  1. README.md +169 -5
README.md CHANGED
@@ -31,8 +31,15 @@ license: other
  license_name: llama4
  ---

- # Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
- **Built with Llama**
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
+
  ## Model Overview
  - **Model Architecture:** Llama4ForConditionalGeneration
  - **Input:** Text / Image
@@ -51,10 +58,11 @@ This model was obtained by quantizing activations and weights of [Llama-4-Scout-
  This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
  Weight quantization also reduces disk size requirements by approximately 50%. The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.

-
  ## Deployment

- This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+ This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.
+
+ Deploy on <strong>vLLM</strong>

  ```python
  from vllm import LLM, SamplingParams
@@ -77,7 +85,163 @@ generated_text = outputs[0].outputs[0].text
  print(generated_text)
  ```

- vLLM aslo supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
+
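A minimal sketch of that OpenAI-compatible mode, assuming the model is served locally with `vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic` and that the `openai` Python client is installed; the default port 8000, the placeholder API key, and the prompt are assumptions:

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible vLLM server, e.g.:
#   vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic --tensor-parallel-size 8
# vLLM listens on port 8000 by default and ignores the API key unless --api-key is set,
# so any placeholder string works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```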
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+   --ipc=host \
+   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+   --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+   --name=vllm \
+   registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+   vllm serve \
+   --tensor-parallel-size 8 \
+   --max-model-len 32768 \
+   --enforce-eager --model RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
+ ```
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/llama-4-scout-17b-16e-instruct-fp8-dynamic:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct-fp8-dynamic
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/llama-4-scout-17b-16e-instruct-fp8-dynamic
+ ```
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up vllm server with ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach model to vllm server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: Llama-4-Scout-17B-16E-Instruct-FP8-dynamic # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: Llama-4-Scout-17B-16E-Instruct-FP8-dynamic # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2' # this is model specific
+           memory: 8Gi # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests: # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-llama-4-scout-17b-16e-instruct-fp8-dynamic:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure you are in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
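For reference, a Python equivalent of the curl call above, as a minimal sketch using the third-party `requests` library against the same placeholder route; the timeout value and the raw printing of streamed lines are assumptions:

```python
import requests

# Same placeholder URL as the curl example above; replace <inference-service-name>
# and <cluster-ingress-domain> with the values from `oc get inferenceservice`.
url = "https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions"

payload = {
    "model": "Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    "stream": True,
    "stream_options": {"include_usage": True},
    "max_tokens": 1,
    "messages": [
        {"role": "user", "content": "How can a bee fly when its wings are so small?"}
    ],
}

# Stream the response and print each server-sent-event line as it arrives.
with requests.post(url, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))
```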

  ## Creation