---
license: apache-2.0
---

## Model Overview

PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. Built on the [Caduceus](https://caduceus-dna.github.io/) and [Mamba](https://arxiv.org/abs/2312.00752) architectures and trained with a masked language modeling objective, PlantCaduceus learns evolutionary conservation and DNA sequence grammar from 16 species spanning 160 million years of evolutionary history. We have trained a series of PlantCaduceus models with varying parameter sizes:

- **[PlantCaduceus_l20](https://huggingface.co/kuleshov-group/PlantCaduceus_l20)**: 20 layers, 384 hidden size, 20M parameters
- **[PlantCaduceus_l24](https://huggingface.co/kuleshov-group/PlantCaduceus_l24)**: 24 layers, 512 hidden size, 40M parameters
- **[PlantCaduceus_l28](https://huggingface.co/kuleshov-group/PlantCaduceus_l28)**: 28 layers, 768 hidden size, 112M parameters
- **[PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32)**: 32 layers, 1024 hidden size, 225M parameters

**We highly recommend using the largest model ([PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32)) for zero-shot score estimation.**


## How to use
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model_path = 'kuleshov-group/PlantCaduceus_l28'
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# trust_remote_code is required because Caduceus is a custom architecture
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

sequence = "ATGCGTACGATCGTAG"
encoding = tokenizer(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False,
)
input_ids = encoding["input_ids"].to(device)
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)
# outputs.hidden_states[-1] holds the final-layer embeddings, one vector per nucleotide
```
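Since the card recommends PlantCaduceus_l32 for zero-shot score estimation, here is a minimal sketch of one common way to score a variant with a masked language model: mask the variant position, run a forward pass, and compare the log-probabilities of the reference and alternate alleles. The helper and the masking workflow below (including the use of `tokenizer.mask_token_id` and single-nucleotide token ids) are illustrative assumptions, not an official API of this repository.

```python
import torch

def zero_shot_score(logits: torch.Tensor, pos: int, ref_id: int, alt_id: int) -> float:
    """Log-likelihood ratio of the alternate vs. reference allele at position `pos`.

    `logits` is the model output of shape (batch, seq_len, vocab_size), produced
    after replacing the token at `pos` with the mask token. A positive score
    means the model prefers the alternate allele over the reference.
    """
    log_probs = torch.log_softmax(logits[0, pos], dim=-1)
    return (log_probs[alt_id] - log_probs[ref_id]).item()

# Hypothetical usage with the model and tokenizer loaded above:
#   input_ids[0, pos] = tokenizer.mask_token_id
#   with torch.inference_mode():
#       logits = model(input_ids=input_ids).logits
#   score = zero_shot_score(logits, pos,
#                           ref_id=tokenizer.convert_tokens_to_ids("A"),
#                           alt_id=tokenizer.convert_tokens_to_ids("G"))
```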

## Citation
```bibtex
@article{Zhai2025CrossSpecies,
  author       = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yoni and Berthel, Alexander and Liu, Z. Y. and Lai, W. L. and Miller, Z. R. and Scheben, Armin and Stitzer, Michelle C. and Romay, Maria C. and Buckler, Edward S. and Kuleshov, Volodymyr},
  title        = {Cross-species modeling of plant genomes at single nucleotide resolution using a pretrained DNA language model},
  journal      = {Proceedings of the National Academy of Sciences},
  year         = {2025},
  volume       = {122},
  number       = {24},
  pages        = {e2421738122},
  doi          = {10.1073/pnas.2421738122},
  url          = {https://doi.org/10.1073/pnas.2421738122}
}
```

## Contact
Jingjing Zhai ([email protected])