ybian-umd committed
Commit 6488771 · verified · 1 Parent(s): e024d6e

Update README.md

Files changed (1): README.md +51 -1
README.md CHANGED
@@ -32,7 +32,57 @@ evaluation settings:
  **Note**: The 4B, 8B, and 30B models are coming soon. Performance results for these models will be released in the near future.

  ## Inference
- The inference code will come soon
+
+ ### Using the tailored inference engine [JetEngine](https://github.com/Labman42/JetEngine)
+
+ JetEngine enables more efficient inference than the built-in implementation.
+
+ ```bash
+ git clone https://github.com/Labman42/JetEngine.git
+ cd JetEngine
+ pip install .
+ ```
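+
+ As a quick sanity check that the build succeeded, the imports used in the example below should resolve:
+
+ ```bash
+ python -c "from jetengine import LLM, SamplingParams"
+ ```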
+
+ The following example shows how to quickly load a model with JetEngine and run a prompt end-to-end.
+
+ ```python
+ import os
+ from jetengine import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_path = os.path.expanduser("/path/to/your/sdar-model")
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+ # Initialize the LLM
+ llm = LLM(
+     model_path,
+     enforce_eager=True,
+     tensor_parallel_size=1,
+     mask_token_id=151669,  # Optional: only needed for masked/diffusion models
+     block_length=4
+ )
+
+ # Set sampling/generation parameters
+ sampling_params = SamplingParams(
+     temperature=1.0,
+     topk=0,
+     topp=1.0,
+     max_tokens=256,
+     remasking_strategy="low_confidence_dynamic",
+     block_length=4,
+     denoising_steps=4,
+     dynamic_threshold=0.9
+ )
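+
+ # Note (an assumption read off the parameter names, not from JetEngine docs):
+ # with block_length=4 and denoising_steps=4, each 4-token block appears to be
+ # filled in over 4 denoising iterations, and "low_confidence_dynamic" remasking
+ # re-masks low-confidence tokens at each step, accepting tokens whose
+ # confidence exceeds dynamic_threshold=0.9.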
+
+ # Prepare a simple chat-style prompt
+ prompt = tokenizer.apply_chat_template(
+     [{"role": "user", "content": "Explain what reinforcement learning is in simple terms."}],
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # Generate text
+ outputs = llm.generate_streaming([prompt], sampling_params)
+ ```
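+
+ The exact structure of `outputs` is not documented in this README; as a minimal sketch, assuming `generate_streaming` returns one completion per input prompt (check the JetEngine repository for the actual return type):
+
+ ```python
+ # Assumption: outputs is aligned one-to-one with the input prompts.
+ print(outputs[0])
+ ```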

  ## Highlights
  - **Performance**: SDAR-1.7B-Chat achieves state-of-the-art results.