Update README.md
README.md (changed)
@@ -103,6 +103,7 @@ output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
 print(tokenizer.decode(output_ids, skip_special_tokens=True))
 
 ```
+> You can optionally compile the model's forward pass using torch.compile for a significant speed boost. Note that the first run takes longer while PyTorch compiles optimized kernels; subsequent runs are much faster.
 
 <details>
 <summary><span style="font-size:1.1em; font-weight:bold;">🧩 Quantization Process</span></summary>
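
For context, a minimal sketch of the pattern the new note describes, assuming the `model`, `tokenizer`, and `model_inputs` objects already built in the README's generation example (the `max_new_tokens` value is a placeholder):

```python
import torch

# Sketch only: `model`, `tokenizer`, and `model_inputs` are assumed to come
# from the generation example earlier in the README.
# torch.compile traces and optimizes the forward pass: the first generate()
# call is slower while kernels are compiled, subsequent calls reuse them.
model.forward = torch.compile(model.forward)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```

Compiling `model.forward` rather than the whole model keeps the Python-side control flow of `generate()` untouched while still accelerating the per-token forward computation.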