Update README.md
README.md CHANGED
@@ -25,11 +25,17 @@ It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com
**This is an experimental new GPTQ which offers up to 8K context size**

-The increased context is

Please read carefully below to see how to use it.

-**NOTE**: Using the full 8K context will exceed 24GB VRAM.

GGML versions are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.

@@ -40,7 +46,7 @@ GGML versions are not yet provided, as there is not yet support for SuperHOT in
GGML quants are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.

-## How to easily download and use this model in text-generation-webui

Please make sure you're using the latest version of text-generation-webui

@@ -56,9 +62,76 @@ Please make sure you're using the latest version of text-generation-webui
10. The model will automatically load, and is now ready for use!
11. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!

-## How to use this GPTQ model from Python code

-Using this model with increased context from Python code is currently untested, so this section is removed for now.

**This is an experimental new GPTQ which offers up to 8K context size**

The increased context is tested to work with [ExLlama](https://github.com/turboderp/exllama), via the latest release of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

It has also been tested from Python code using AutoGPTQ, and `trust_remote_code=True`.
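
A quick way to confirm from Python that the extended context is being picked up, before loading the full model, is to inspect the config with `trust_remote_code=True` (a minimal sketch, not from the original README):

```python
from transformers import AutoConfig

# trust_remote_code=True pulls in the updated Llama modelling code shipped with the repo.
config = AutoConfig.from_pretrained(
    "TheBloke/WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ",
    trust_remote_code=True,
)

# Should report 8192, the sequence length hardcoded in config.json.
print(config.max_position_embeddings)
```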
Code credits:
- Original concept and code for increasing context length: [kaiokendev](https://huggingface.co/kaiokendev)
- Updated Llama modelling code that includes this automatically via trust_remote_code: [emozilla](https://huggingface.co/emozilla).

Please read carefully below to see how to use it.

**NOTE**: Using the full 8K context on a 30B model will exceed 24GB VRAM.

GGML versions are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.

GGML quants are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.

## How to easily download and use this model in text-generation-webui with ExLlama

Please make sure you're using the latest version of text-generation-webui
10. The model will automatically load, and is now ready for use!
11. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
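
As an aside, if you prefer to fetch the files outside the UI (for example to drop them straight into text-generation-webui's `models` folder), a sketch using `huggingface_hub` (assumed installed; the target folder name is just a suggestion) is:

```python
from huggingface_hub import snapshot_download

# Download the full model repo (quantised weights, config, tokenizer files)
# into a local folder that text-generation-webui can see.
snapshot_download(
    repo_id="TheBloke/WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ",
    local_dir="models/TheBloke_WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ",
)
```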

## How to use this GPTQ model from Python code with AutoGPTQ

First make sure you have AutoGPTQ and Einops installed:

```
pip3 install einops auto-gptq
```

Then run the following code. Note that in order to get this to work, `config.json` has been hardcoded to a sequence length of 8192.

If you want to try 4096 instead to reduce VRAM usage, please manually edit `config.json` to set `max_position_embeddings` to the value you want.
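
For example, one way to make that edit programmatically (a sketch, not part of the original README; the local path is a placeholder for wherever you downloaded the repo):

```python
import json
from pathlib import Path

# Placeholder path: point this at your local copy of the model repo.
config_path = Path("models/TheBloke_WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ/config.json")

config = json.loads(config_path.read_text())
config["max_position_embeddings"] = 4096  # down from the hardcoded 8192, to reduce VRAM usage

config_path.write_text(json.dumps(config, indent=2))
```

If you do this, you will presumably also want to set `model.seqlen` to the same value in the code below. With `config.json` set the way you want, the main example is unchanged: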

```python
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import argparse

model_name_or_path = "TheBloke/WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ"
model_basename = "wizardlm-33b-v1.0-uncensored-superhot-8k-GPTQ-4bit--1g.act.order"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device_map='auto',
        use_triton=use_triton,
        quantize_config=None)

model.seqlen = 8192

# Note: check the prompt template is correct for this model.
prompt = "Tell me about AI"
prompt_template = f'''USER: {prompt}
ASSISTANT:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
```
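
Since the whole point of this GPTQ is the longer context, a quick way to check you are staying inside it is to compare the tokenised prompt length against `model.seqlen` before generating. This is a small sketch reusing the `tokenizer` and `model` objects from the example above; the document string is a placeholder:

```python
# Placeholder: substitute the long text you actually want to process.
long_document = "..."

long_prompt = f'''USER: Summarise the following document.

{long_document}
ASSISTANT:'''

input_ids = tokenizer(long_prompt, return_tensors='pt').input_ids.cuda()
print(f"Prompt is {input_ids.shape[1]} tokens; model.seqlen is {model.seqlen}")

# Leave headroom for the tokens you intend to generate.
if input_ids.shape[1] + 512 > model.seqlen:
    print("Prompt too long for the configured context; shorten the document.")
else:
    output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
    print(tokenizer.decode(output[0]))
```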

## Using other UIs: monkey patch

Provided in the repo is `llama_rope_scaled_monkey_patch.py`, written by @kaiokendev.

It can theoretically be added to any Python UI or custom code to enable the same result as `trust_remote_code=True`. I have not tested this, and it should be superseded by using `trust_remote_code=True`, but I include it for completeness and for interest.
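
A rough sketch of how it could be wired in (untested, in line with the caveat above; the exact mechanism depends on how the patch file is written, so check the file before relying on it):

```python
# Apply the RoPE-scaling patch BEFORE the Llama model is instantiated.
# Depending on how llama_rope_scaled_monkey_patch.py is written, importing it may be
# enough, or it may expose a setup function you need to call first; check the file.
import llama_rope_scaled_monkey_patch  # noqa: F401

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# No trust_remote_code here: the monkey patch stands in for the custom modelling code.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename="wizardlm-33b-v1.0-uncensored-superhot-8k-GPTQ-4bit--1g.act.order",
        use_safetensors=True,
        device_map='auto',
        quantize_config=None)
model.seqlen = 8192
```
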
## Provided files