🏂 ProteinSkier

ProteinSkier is a GPT-2–based language model that “carves fresh lines” through chemical space, producing drug-like SMILES strings with an explicit bias toward ADMET quality, novelty, and synthesizability.

1 · Why another generative model?

Traditional generative models often rediscover known scaffolds or output molecules that fail late-stage ADMET filters.
ProteinSkier addresses this by coupling large-scale pre-training on ~2 M curated molecules with a second-stage Reinforcement Fine-Tuning (RFT) that rewards:

| Component | Reward signal (λ) | Source |
| --- | --- | --- |
| Validity | hard filter | RDKit sanitisation |
| QED ↑ | 0.35 | RDKit |
| Novelty ↑ | 0.25 | training-set hash table |
| Lipinski pass ↑ | 0.20 | RDKit |
| logP in [–1, 4] | 0.10 | RDKit |
| Predicted tox ↓ | 0.10 | internal classifier |
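
One way to read the reward table is as a weighted sum gated by the validity filter. The sketch below is only an illustration under stated assumptions: the λ weights come from the table, but the function name `composite_reward`, the 0/1 scoring of the novelty, Lipinski, and logP terms, and the `1 - tox_prob` form of the toxicity term are hypothetical and may differ from the actual trainer. Property values are passed in precomputed, so the block runs without RDKit; in practice they could come from calls such as `rdkit.Chem.QED.qed` and `rdkit.Chem.Crippen.MolLogP`.

```python
# Weights copied from the reward table; how each term is scored is an assumption.
WEIGHTS = {"qed": 0.35, "novelty": 0.25, "lipinski": 0.20, "logp": 0.10, "tox": 0.10}


def composite_reward(valid, qed, is_novel, lipinski_pass, logp, tox_prob):
    """Hypothetical scalar reward in [0, 1] for one generated molecule."""
    # Validity acts as a hard filter: invalid SMILES earn nothing.
    if not valid:
        return 0.0
    terms = {
        "qed": qed,                                   # already in [0, 1]
        "novelty": 1.0 if is_novel else 0.0,          # not in the training set
        "lipinski": 1.0 if lipinski_pass else 0.0,    # passes the rule of five
        "logp": 1.0 if -1.0 <= logp <= 4.0 else 0.0,  # inside the target window
        "tox": 1.0 - tox_prob,                        # assumed: lower toxicity is better
    }
    return sum(WEIGHTS[name] * terms[name] for name in WEIGHTS)


if __name__ == "__main__":
    # A valid, novel, Lipinski-compliant molecule with moderate QED.
    print(composite_reward(True, 0.6, True, True, 2.3, 0.2))
```

Because the weights sum to 1.0, a molecule that satisfies every criterion scores 1.0 under these assumptions, which keeps the reward scale easy to interpret.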

The policy is updated with policy-gradient REINFORCE; low-quality trajectories are rejected via an adaptive threshold (see FullDatasetRFTTrainer in the code).
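
The update described above can be sketched in plain Python. This is not the logic of FullDatasetRFTTrainer, which is not shown here; the names `AdaptiveThreshold` and `reinforce_loss`, the moving-average threshold, and the mean-reward baseline are all assumptions chosen for illustration. In real training the log-probabilities would be tensors (summed token log-probs of each sampled SMILES) so that gradients can flow through the loss.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveThreshold:
    """Hypothetical threshold that tracks an exponential moving average of rewards."""
    value: float = 0.0
    momentum: float = 0.9

    def update(self, rewards):
        batch_mean = sum(rewards) / len(rewards)
        self.value = self.momentum * self.value + (1.0 - self.momentum) * batch_mean
        return self.value


def reinforce_loss(log_probs, rewards, threshold):
    """REINFORCE loss over trajectories whose reward reaches the threshold."""
    kept = [(lp, r) for lp, r in zip(log_probs, rewards) if r >= threshold]
    if not kept:
        return 0.0  # every trajectory was rejected; nothing to learn from
    # Assumed variance reduction: subtract the mean reward of the kept samples.
    baseline = sum(r for _, r in kept) / len(kept)
    # Minimising this value raises the likelihood of above-baseline samples.
    return -sum((r - baseline) * lp for lp, r in kept) / len(kept)


if __name__ == "__main__":
    threshold = AdaptiveThreshold()
    rewards = [0.9, 0.5, 0.1]
    log_probs = [-1.0, -2.0, -3.0]
    print(reinforce_loss(log_probs, rewards, threshold.update(rewards)))
```

A threshold that follows the running reward level rises as the policy improves, so the bar for keeping a trajectory moves with the model rather than being fixed in advance.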

2 · Intended uses & scope

| Stage | Example use-case | Not a good fit |
| --- | --- | --- |
| Hit finding | Rapidly scaffold-hop around a weak binder identified by docking. | Predicting absolute IC₅₀/Kᵢ values. |
| Lead optimisation | Generating analogues that respect Lipinski & BBB guidelines. | Ensuring synthetic accessibility without chemist review. |
| Ideation / teaching | Demonstrating language-model chemistry in the classroom. | Production-scale enumeration without downstream filtering. |

3 · Quick start

Requires transformers ≥ 4.42, torch ≥ 2.2, rdkit, accelerate.

from transformers import AutoTokenizer, GPT2LMHeadModel

model_id = "ProteinDance/ProteinSkier"
tok = AutoTokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

# Sample 5 molecules from the <bos> prompt (validity and novelty are not guaranteed)
prompt = tok("<bos>", return_tensors="pt").input_ids
gen = model.generate(
    prompt.repeat(5, 1),
    max_length=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
)
smiles = tok.batch_decode(gen, skip_special_tokens=True)
print("\n".join(smiles))
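
Sampled strings are not guaranteed to be valid or unique, so a small post-filter is a sensible next step. The sketch below is an assumption-laden example rather than part of the released code: `filter_unique_valid` and `rdkit_canonical` are hypothetical helper names, and the `known` set stands in for whatever training-set lookup is available. RDKit is imported lazily, and the canonicaliser can be swapped out, which also makes the function easy to test without RDKit.

```python
def rdkit_canonical(smiles):
    """Return the RDKit canonical SMILES, or None if parsing fails."""
    from rdkit import Chem  # lazy import: only needed for the default path

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def filter_unique_valid(smiles_list, canonicalize=rdkit_canonical, known=frozenset()):
    """Keep valid, de-duplicated molecules that are absent from `known`."""
    seen = set()
    kept = []
    for raw in smiles_list:
        candidate = raw.strip()
        if not candidate:
            continue  # skip empty generations
        canonical = canonicalize(candidate)
        # Drop unparsable strings, repeats, and molecules already in `known`.
        if canonical is None or canonical in seen or canonical in known:
            continue
        seen.add(canonical)
        kept.append(canonical)
    return kept


if __name__ == "__main__":
    samples = ["CCO", "OCC", "not-a-smiles", ""]
    print(filter_unique_valid(samples))
```

Comparing canonical forms rather than raw strings matters because the same molecule can be written as several different SMILES.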

4 · Limitations & caveats

  • No guaranteed synthesizability – always perform retrosynthetic analysis.
  • Property estimators used in RFT are fast approximations; wet-lab assay results may differ from the predicted values.
  • Output may include patented molecules – run IP checks.
  • ADMET focus biases chemistry toward oral drugs; unsuitable for agrochemicals or materials.

Model size: 25.3M params (Safetensors, tensor type F32)