---
license: apache-2.0
base_model:
- meta-llama/Llama-3.3-70B-Instruct
- unsloth/Meta-Llama-3.1-8B-Instruct
tags:
- biology
- chemistry
---

# Pro-1-preview

[GitHub](https://github.com/michaelhla/pro-1)
[Twitter](https://twitter.com/hla_michael)
[Hugging Face](https://huggingface.co/mhla/pro-1)
[Blog](https://michaelhla.com/blog/pro1)

Pro-1 is a reasoning model trained with GRPO against a physics-based reward function for protein stability.

It takes a protein sequence, a text description of the protein, and the effects of previous engineering attempts as input, reasons over this information, and proposes modifications to improve the stability of the sequence.

## LoRA checkpoints

| Model | Checkpoint |
|------------|-------------|
| 8b base GRPO | [best-checkpoint](https://huggingface.co/mhla/pro-1/tree/main/best-checkpoint) |
| 8b creative reward | [creativity-lm-grpo-mega-run-full](https://huggingface.co/mhla/pro-1/tree/main/creativity-lm-grpo-mega-run-full) |
| 8b creative + specificity reward (default) | [all-lm-grpo-mega-run](https://huggingface.co/mhla/pro-1/tree/main/all-lm-grpo-mega-run-full) |
| 70b SFT only | [llama_70b_4bit_sft_lora_model](https://huggingface.co/mhla/pro-1/tree/main/llama_70b_4bit_sft_lora_model) |

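Each checkpoint lives in its own folder of the `mhla/pro-1` repository, so it can be fetched on its own rather than cloning everything. The sketch below shows one possible way to do that with `huggingface_hub.snapshot_download`. The short keys in `CHECKPOINTS`, the helper names, and the `pro-1-checkpoints` directory are made up for illustration; only the repository ID and folder names come from the table's links.

```python
from pathlib import Path

# Repository and folder names are taken from the checkpoint table's URLs.
# The dictionary keys are hypothetical labels chosen for this sketch.
REPO_ID = "mhla/pro-1"
CHECKPOINTS = {
    "8b-base-grpo": "best-checkpoint",
    "8b-creative": "creativity-lm-grpo-mega-run-full",
    "8b-creative-specificity": "all-lm-grpo-mega-run-full",
    "70b-sft": "llama_70b_4bit_sft_lora_model",
}


def checkpoint_patterns(key: str) -> list[str]:
    """Return the glob patterns that select a single checkpoint folder."""
    return [f"{CHECKPOINTS[key]}/*"]


def download_checkpoint(key: str, local_dir: str = "pro-1-checkpoints") -> Path:
    """Download one checkpoint folder and return its local path."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=checkpoint_patterns(key),
        local_dir=local_dir,
    )
    return Path(root) / CHECKPOINTS[key]
```

Calling `download_checkpoint("8b-creative-specificity")` would fetch only the default checkpoint's folder. Whether the returned path can be passed directly to an adapter loader depends on the folder containing a standard PEFT adapter layout, which is an assumption here.
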
## Example Usage

```python
from unsloth import FastLanguageModel
from transformers import TextIteratorStreamer
import threading


def run_protein_engineering_example():
    # Load the base model and tokenizer
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
        max_seq_length=32768,
        load_in_4bit=True,
        fast_inference=True,
        max_lora_rank=32,
        gpu_memory_utilization=0.6,
    )

    # Load the protein engineering adapter weights
    # (placeholder ID; point this at the LoRA checkpoint you want to use)
    model.load_adapter("your-username/protein-engineering-llama-3.1")
    FastLanguageModel.for_inference(model)

    # Sequence to stabilize; it belongs inside the structured prompt below
    protein_sequence = "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKPLSVSYDQATSLRILNNGHAFNVEFDDSQDKAVLKGGPLDGTY"

    # Placeholder prompt: a plain string so the example parses as written.
    # Replace it with the structured prompt described in the GitHub repo.
    prompt = """

...{STRUCTURED PROMPT SEE https://github.com/michaelhla/pro-1 FOR CORRECT USAGE}...

"""

    # Initialize the streamer for text generation
    streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)

    # Set up generation parameters
    generation_kwargs = dict(
        input_ids=tokenizer(prompt, return_tensors="pt").input_ids.to(model.device),
        streamer=streamer,
        max_new_tokens=4096,
        temperature=0.9,
        top_p=0.95,
        do_sample=True,
    )

    # Run generation in a background thread so the stream can be consumed here
    thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    # Print the response as it streams
    print("Model response (streaming):")
    for new_text in streamer:
        print(new_text, end="", flush=True)

    thread.join()  # Ensure generation is complete


if __name__ == "__main__":
    run_protein_engineering_example()
```

Note: Although the model was trained specifically on enzymes, it should work for any protein sequence. Curious to hear if this is true!

Disclaimer: This is a preview version, so the model can be very dumb. Always double-check that your modified sequences have the correct mutations applied. Assume all references from the model are hallucinated.

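The check recommended in the disclaimer can be automated. The sketch below is a minimal example, assuming mutations are written in the common point-mutation notation of original residue, 1-based position, and new residue (for example `H4A`). That notation is an assumption for illustration and is not confirmed as the model's output format. The helper names are hypothetical, and only substitutions are handled, not insertions or deletions.

```python
import re

# Assumed notation: <original residue><1-based position><new residue>, e.g. "H4A".
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutations(sequence: str, mutations: list[str]) -> str:
    """Apply point mutations to a sequence, checking each original residue."""
    residues = list(sequence)
    for mutation in mutations:
        match = MUTATION_RE.match(mutation)
        if match is None:
            raise ValueError(f"Unrecognized mutation format: {mutation!r}")
        original, new = match.group(1), match.group(3)
        position = int(match.group(2))
        index = position - 1  # convert the 1-based position to a 0-based index
        if not 0 <= index < len(residues):
            raise ValueError(f"{mutation}: position {position} is outside the sequence")
        if sequence[index] != original:
            raise ValueError(
                f"{mutation}: expected {original} at position {position}, "
                f"found {sequence[index]}"
            )
        residues[index] = new
    return "".join(residues)


def verify_mutations(original: str, modified: str, mutations: list[str]) -> bool:
    """Return True if `modified` is `original` with exactly `mutations` applied."""
    return apply_mutations(original, mutations) == modified
```

For instance, `verify_mutations("MSHHWGYGKH", "MSHAWGYGRH", ["H4A", "K9R"])` returns `True`, while a mutation that names the wrong original residue raises a `ValueError`, which flags proposals that do not match the input sequence.
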