AdapterHub/llama2-13b-qlora-openassistant

@@ -1,16 +1,19 @@
 ---
 tags:
-- adapter-transformers
 - llama
 datasets:
 - timdettmers/openassistant-guanaco
 ---
-# Adapter `AdapterHub/llama2-13b-qlora-openassistant` for meta-llama/Llama-2-13b-hf
-An [adapter](https://adapterhub.ml) for the `meta-llama/Llama-2-13b-hf` model that was trained on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.
-This adapter was created for usage with the **[Adapters](https://github.com/Adapter-Hub/adapters)** library.
 ## Usage
@@ -20,23 +23,85 @@ First, install `adapters`:
 pip install -U adapters
 ```
-Now, the adapter can be loaded and activated like this:
 ```python
-from adapters import AutoAdapterModel
-model = AutoAdapterModel.from_pretrained("meta-llama/Llama-2-13b-hf")
-adapter_name = model.load_adapter("AdapterHub/llama2-13b-qlora-openassistant", source="hf", set_active=True)
 ```
-## Architecture & Training
-<!-- Add some description here -->
-## Evaluation results
-<!-- Add some description here -->
-## Citation
-<!-- Add some description here -->

 ---
 tags:
 - llama
+- adapter-transformers
+- llama-2
 datasets:
 - timdettmers/openassistant-guanaco
+license: apache-2.0
+pipeline_tag: text-generation
 ---
+# OpenAssistant QLoRA Adapter for Llama-2 13B
+QLoRA adapter for the Llama-2 13B (`meta-llama/Llama-2-13b-hf`) model trained for instruction tuning on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.
+**This adapter was created for usage with the [Adapters](https://github.com/Adapter-Hub/adapters) library.**
 ## Usage
 First, install `adapters`:
 ```
 pip install -U adapters
 ```
+Now, the model and adapter can be loaded and activated like this:
 ```python
+import adapters
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+model_id = "meta-llama/Llama-2-13b-hf"
+adapter_id = "AdapterHub/llama2-13b-qlora-openassistant"
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="auto",
+    quantization_config=BitsAndBytesConfig(
+        load_in_4bit=True,
+        bnb_4bit_quant_type="nf4",
+        bnb_4bit_use_double_quant=True,
+        bnb_4bit_compute_dtype=torch.bfloat16,
+    ),
+    torch_dtype=torch.bfloat16,
+)
+adapters.init(model)
+adapter_name = model.load_adapter(adapter_id, set_active=True)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
 ```
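As a rough illustration of why the 4-bit settings above matter, the sketch below estimates weight memory for a model of about 13 billion parameters. The parameter count is rounded, and the overhead figures for quantization constants (block size 64, second-level block size 256) are assumptions taken from the QLoRA paper's description. Activations, the KV cache, and the adapter weights are ignored, so treat the results as back-of-the-envelope estimates.

```python
# Back-of-the-envelope weight memory estimate (assumed, rounded numbers).
N_PARAMS = 13e9  # assumption: roughly 13 billion base-model parameters


def weight_gib(bits_per_param: float, n_params: float = N_PARAMS) -> float:
    """Return approximate weight storage in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Overhead of quantization constants, in bits per parameter (assumed values):
# one 32-bit constant per block of 64 weights without double quantization;
# 8-bit constants plus one 32-bit constant per 256 blocks with it.
overhead_plain = 32 / 64
overhead_double = 8 / 64 + 32 / (64 * 256)

bf16 = weight_gib(16)
nf4 = weight_gib(4 + overhead_plain)
nf4_dq = weight_gib(4 + overhead_double)

print(f"bf16:               {bf16:.1f} GiB")
print(f"nf4:                {nf4:.1f} GiB")
print(f"nf4 + double quant: {nf4_dq:.1f} GiB")
```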
+### Inference
+Inference uses the standard generation methods built into the Transformers library.
+First, we add some helper code to prompt the model in the expected format:
+```python
+from transformers import StoppingCriteria
+# stop if model starts to generate "### Human:"
+class EosListStoppingCriteria(StoppingCriteria):
+    def __init__(self, eos_sequence=(12968, 29901)):
+        # immutable default; stored as a list so it compares equal to rows of `tolist()` output
+        self.eos_sequence = list(eos_sequence)
+    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
+        last_ids = input_ids[:,-len(self.eos_sequence):].tolist()
+        return self.eos_sequence in last_ids
+def prompt_model(model, text: str):
+    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
+    batch = batch.to(model.device)
+    with torch.autocast(device_type="cuda"):
+        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])
+    # skip prompt when decoding
+    return tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
+```
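The stopping check above relies on Python list membership: `last_ids` holds one entry per batch row, each containing that row's final tokens, so `self.eos_sequence in last_ids` is true as soon as any row ends with the stop sequence. A minimal torch-free sketch of the same logic follows. `ends_with_stop_sequence` is a hypothetical helper, and all token ids other than the two defaults from the helper code are made up for illustration.

```python
from typing import List, Sequence


def ends_with_stop_sequence(batch_ids: List[List[int]], stop_ids: Sequence[int]) -> bool:
    """Return True if any row of batch_ids ends with stop_ids."""
    n = len(stop_ids)
    # Take the last n tokens of every row, mirroring input_ids[:, -n:].tolist()
    tails = [row[-n:] for row in batch_ids]
    return list(stop_ids) in tails


stop = [12968, 29901]  # default stop ids from the helper code above

print(ends_with_stop_sequence([[1, 835, 12968, 29901]], stop))       # True
print(ends_with_stop_sequence([[1, 835, 12968, 29901, 306]], stop))  # False
```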
+Now, to prompt the model:
+```python
+prompt_model(model, "Please explain NLP in simple terms.")
+```
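The template used by `prompt_model` follows the `### Human:` / `### Assistant:` convention. Because generation stops only once the model has started a new human turn, the decoded text may end with part of that marker. The sketch below builds the prompt and trims such a trailing marker. `build_prompt` and `trim_response` are hypothetical helpers, not part of the Adapters or Transformers APIs, and the sample response is invented.

```python
HUMAN_TAG = "### Human:"
ASSISTANT_TAG = "### Assistant:"


def build_prompt(text: str) -> str:
    """Wrap a user message in the template used by prompt_model."""
    return f"{HUMAN_TAG} {text} {ASSISTANT_TAG}"


def trim_response(response: str) -> str:
    """Cut the response at the first follow-up human turn, if there is one."""
    head, _, _ = response.partition(HUMAN_TAG)
    return head.strip()


print(build_prompt("Please explain NLP in simple terms."))
print(trim_response(" NLP lets computers work with human language. ### Human:"))
```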
+### Weight merging
+To decrease inference latency, the LoRA weights can be merged with the base model:
+```python
+model.merge_adapter(adapter_name)
+```
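Conceptually, merging folds the low-rank update into the frozen weight, W' = W + (alpha / r) · B A, so the forward pass no longer needs the extra adapter matrix multiplications. The toy sketch below checks this identity with plain Python lists and tiny made-up matrices. It is not how the library implements `merge_adapter`, and merging into 4-bit quantized weights involves details not covered here.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


r, alpha = 2, 4            # toy values, not the adapter's r=64, alpha=16
scale = alpha / r

W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen weight, 3x3
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                 # d_out x r
A = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]                   # r x d_in
x = [1.0, 2.0, 3.0]

delta = matmul(B, A)

# Unmerged: base output plus the scaled adapter output
unmerged = [w + scale * d for w, d in zip(matvec(W, x), matvec(delta, x))]

# Merged: add the scaled update to the weight once, then do a single matvec
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
merged = matvec(W_merged, x)

print(unmerged)
print(merged)
```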
+## Architecture & Training
+**Training was run with the code in [this notebook](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama2_Finetuning.ipynb)**.
+The LoRA architecture closely follows the configuration described in the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf):
+- `r=64`, `alpha=16`
+- LoRA modules added to the output, intermediate, and all self-attention (Q, K, V) linear layers
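To give a sense of scale for `r=64`: a LoRA module on a linear layer with input size d_in and output size d_out adds r · (d_in + d_out) trainable parameters, and its update is scaled by alpha / r. The sketch below works this out for one square attention projection. The hidden size of 5120 is an assumption based on the published Llama-2 13B configuration, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of one LoRA module: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r


r, alpha = 64, 16
hidden_size = 5120  # assumption: hidden size of Llama-2 13B

scaling = alpha / r
per_projection = lora_param_count(hidden_size, hidden_size, r)
full_matrix = hidden_size * hidden_size

print(scaling)         # 0.25
print(per_projection)  # 655360
print(f"{per_projection / full_matrix:.1%} of the full projection matrix")
```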
+The adapter is trained similarly to the Guanaco models proposed in the paper:
+- Dataset: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
+- Quantization: 4-bit QLoRA
+- Batch size: 16, LR: 2e-4, max steps: 1875
+- Sequence length: 512
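The hyperparameters above imply an upper bound on how much data the run processed. A quick calculation, assuming every step uses a full batch of 16 sequences and every sequence fills the full 512 tokens (shorter sequences would lower the token count):

```python
batch_size = 16
max_steps = 1875
seq_len = 512

sequences_seen = batch_size * max_steps
max_tokens_seen = sequences_seen * seq_len

print(sequences_seen)   # 30000
print(max_tokens_seen)  # 15360000
```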