🏂 ProteinSkier
ProteinSkier is a GPT-2–based language model that “carves fresh lines” through chemical space, producing drug-like SMILES strings with an explicit bias toward ADMET quality, novelty, and synthesizability.
1 · Why another generative model?
Traditional generative models often rediscover known scaffolds or output molecules that fail late-stage ADMET filters.
ProteinSkier addresses this by coupling large-scale pre-training on ~2 M curated molecules with a second-stage reinforcement fine-tuning (RFT) step whose reward combines the following components:
| Component | Weight (λ) | Source |
|---|---|---|
| Validity | hard filter | RDKit sanitisation |
| QED ↑ | 0.35 | RDKit |
| Novelty ↑ | 0.25 | training-set hash table |
| Lipinski pass ↑ | 0.20 | RDKit |
| logP in [−1, 4] | 0.10 | RDKit |
| Predicted tox ↓ | 0.10 | internal classifier |
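The table gives the weights but not how the terms are combined. The sketch below is one plausible reading, not the released implementation: it assumes a linear weighted sum, that every term is scaled to [0, 1], that binary criteria contribute 0 or 1, and that invalid molecules receive zero reward. The `Scores` container and `composite_reward` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Weights copied from the table above; they sum to 1.0.
WEIGHTS = {"qed": 0.35, "novelty": 0.25, "lipinski": 0.20, "logp": 0.10, "tox": 0.10}


@dataclass
class Scores:
    """Pre-computed property values for one generated molecule (hypothetical container)."""
    valid: bool          # passed RDKit sanitisation
    qed: float           # QED score in [0, 1]
    novel: bool          # not found in the training-set hash table
    lipinski_pass: bool  # satisfies Lipinski's rules
    logp: float          # estimated logP
    tox_prob: float      # predicted toxicity probability in [0, 1]


def composite_reward(s: Scores) -> float:
    """Weighted sum of the reward terms; assumed form, see the caveats above."""
    if not s.valid:
        # Validity is a hard filter, so invalid molecules earn nothing.
        return 0.0
    logp_ok = 1.0 if -1.0 <= s.logp <= 4.0 else 0.0
    return (
        WEIGHTS["qed"] * s.qed
        + WEIGHTS["novelty"] * float(s.novel)
        + WEIGHTS["lipinski"] * float(s.lipinski_pass)
        + WEIGHTS["logp"] * logp_ok
        + WEIGHTS["tox"] * (1.0 - s.tox_prob)  # lower predicted toxicity is better
    )
```

Under these assumptions, a valid, novel, Lipinski-compliant molecule with QED 1.0, in-range logP, and zero predicted toxicity scores the maximum reward of 1.0.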
The policy is updated with the REINFORCE policy-gradient algorithm; trajectories whose reward falls below an adaptive threshold are rejected before the update (see FullDatasetRFTTrainer in the code).
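To make the update rule concrete, here is a minimal, dependency-free sketch of REINFORCE with threshold-based rejection. It is not the logic of FullDatasetRFTTrainer: the exponential-moving-average threshold, the use of the threshold as the baseline, and the names `AdaptiveThreshold` and `reinforce_loss` are all assumptions made for illustration. Plain floats stand in for tensors; a real trainer would compute the same quantity on differentiable log-probabilities.

```python
from statistics import mean
from typing import Sequence


class AdaptiveThreshold:
    """Tracks an exponential moving average of the batch-mean reward (assumed rule)."""

    def __init__(self, init: float = 0.0, momentum: float = 0.9) -> None:
        self.value = init
        self.momentum = momentum

    def update(self, rewards: Sequence[float]) -> float:
        # Move the threshold part of the way toward the current batch mean.
        self.value = self.momentum * self.value + (1.0 - self.momentum) * mean(rewards)
        return self.value


def reinforce_loss(
    log_probs: Sequence[float],  # summed token log-probability of each sampled sequence
    rewards: Sequence[float],    # scalar reward of each sampled sequence
    threshold: float,
) -> float:
    """REINFORCE loss over the trajectories that meet the threshold."""
    kept = [(lp, r) for lp, r in zip(log_probs, rewards) if r >= threshold]
    if not kept:
        return 0.0  # every trajectory was rejected, so there is nothing to learn from
    # The threshold doubles as the baseline, so the advantage is (reward - threshold).
    return -mean((r - threshold) * lp for lp, r in kept)
```

Minimising this loss raises the log-probability of sequences whose reward exceeds the threshold, in proportion to the margin by which they exceed it.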
2 · Intended uses & scope
| Stage | Example use-case | Not a good fit |
|---|---|---|
| Hit finding | Rapidly scaffold-hop around a weak binder identified by docking. | Predicting absolute IC₅₀/Kᵢ values. |
| Lead optimisation | Generating analogues that respect Lipinski & BBB guidelines. | Ensuring synthetic accessibility without chemist review. |
| Ideation / teaching | Demonstrating language-model chemistry in the classroom. | Production-scale enumeration without downstream filtering. |
3 · Quick start
Requires transformers ≥ 4.42, torch ≥ 2.2, rdkit, and accelerate.
```python
from transformers import AutoTokenizer, GPT2LMHeadModel

model_id = "ProteinDance/ProteinSkier"
tok = AutoTokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

# Sample 5 molecules, each seeded with the beginning-of-sequence token
prompt = tok("<bos>", return_tensors="pt").input_ids
gen = model.generate(
    prompt.repeat(5, 1),  # one copy of the prompt per sample
    max_length=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
)

# Decode token ids back to SMILES strings, dropping special tokens
smiles = tok.batch_decode(gen, skip_special_tokens=True)
print("\n".join(smiles))
```
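Sampled strings are not guaranteed to be valid, unique, or novel, so a post-generation filter is a sensible next step. The sketch below is an illustration, not part of the released code: `filter_generated` and `rdkit_canonical` are hypothetical helpers, and `known` stands for whatever reference set of canonical SMILES you want to exclude. The canonicaliser is passed in as a parameter so that the filter can be exercised without RDKit installed.

```python
from typing import Callable, Iterable, List, Optional, Set


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string does not parse."""
    from rdkit import Chem  # imported lazily so the filter has no hard dependency

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def filter_generated(
    smiles: Iterable[str],
    known: Set[str],
    canonicalise: Callable[[str], Optional[str]] = rdkit_canonical,
) -> List[str]:
    """Keep valid, previously unseen molecules, one entry per canonical form."""
    seen: Set[str] = set()
    kept: List[str] = []
    for raw in smiles:
        text = raw.strip()
        if not text:
            continue  # skip empty generations
        canon = canonicalise(text)
        if canon is None or canon in known or canon in seen:
            continue  # invalid, already known, or a duplicate within this batch
        seen.add(canon)
        kept.append(canon)
    return kept
```

For example, `filter_generated(smiles, known=set())` would drop unparsable strings and duplicates from the batch generated above. Entries in `known` must be canonicalised with the same function, otherwise matching molecules will not be recognised.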
4 · Limitations & caveats
- No guaranteed synthesizability – always perform retrosynthetic analysis.
- The property estimators used in RFT are fast approximations; wet-lab assay results may differ from the predicted values.
- Output may include patented molecules – run IP checks.
- ADMET focus biases chemistry toward oral drugs; unsuitable for agrochemicals or materials.
Base model: openai-community/gpt2