🏂 ProteinSkier

ProteinSkier is a GPT-2–based language model that “carves fresh lines” through chemical space, producing drug-like SMILES strings with an explicit bias toward ADMET (absorption, distribution, metabolism, excretion, toxicity) quality, novelty, and synthesizability.

1 · Why another generative model?

Many molecular generative models rediscover known scaffolds or output molecules that fail late-stage ADMET filters.
ProteinSkier addresses this in two stages: large-scale pre-training on ~2 M curated molecules, followed by Reinforcement Fine-Tuning (RFT) with the following reward components:

Component	Weight (λ)	Source
Validity	hard filter (no weight)	RDKit sanitisation
QED ↑	0.35	RDKit
Novelty ↑	0.25	training-set hash table
Lipinski pass ↑	0.20	RDKit
logP in [-1, 4]	0.10	RDKit
Predicted tox ↓	0.10	internal classifier
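
As a rough illustration, the weights in the table can be folded into a single scalar reward. The sketch below is not taken from the ProteinSkier code: the MoleculeScores container, the composite_reward function, the binary treatment of the novelty, Lipinski, and logP terms, and the choice to score invalid molecules as zero are all assumptions. Descriptor values are passed in precomputed, so the example runs without RDKit.

```python
from dataclasses import dataclass

# Weights (λ) copied from the table above; they sum to 1.0.
WEIGHTS = {"qed": 0.35, "novelty": 0.25, "lipinski": 0.20, "logp": 0.10, "tox": 0.10}


@dataclass
class MoleculeScores:
    """Precomputed per-molecule signals (hypothetical container, not from the repo)."""
    valid: bool          # passed RDKit sanitisation
    qed: float           # QED score in [0, 1]
    novel: bool          # SMILES absent from the training-set hash table
    lipinski_pass: bool  # satisfies Lipinski's rule of five
    logp: float          # estimated logP
    tox_prob: float      # predicted toxicity probability in [0, 1]


def composite_reward(s: MoleculeScores) -> float:
    """Weighted sum of the reward components (assumed scaling of each term)."""
    if not s.valid:
        # Validity is a hard filter, not a weighted term; scoring it 0 is an assumption.
        return 0.0
    return (
        WEIGHTS["qed"] * s.qed
        + WEIGHTS["novelty"] * float(s.novel)
        + WEIGHTS["lipinski"] * float(s.lipinski_pass)
        + WEIGHTS["logp"] * float(-1.0 <= s.logp <= 4.0)
        + WEIGHTS["tox"] * (1.0 - s.tox_prob)  # lower predicted toxicity is better
    )


example = MoleculeScores(valid=True, qed=0.5, novel=False,
                         lipinski_pass=True, logp=5.0, tox_prob=0.5)
print(round(composite_reward(example), 3))
```

Under these assumptions a molecule that maximises every component receives a reward of 1.0, which makes the individual weights easy to read as fractions of the total.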

The policy is updated with the REINFORCE policy-gradient algorithm; low-quality trajectories are rejected via an adaptive threshold (see FullDatasetRFTTrainer in the code).
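
To make the update rule concrete, here is a minimal sketch of a REINFORCE surrogate loss with threshold-based rejection. It uses plain floats to show the arithmetic; in actual training the log-probabilities would be tensors carrying gradients. The function names, the batch-quantile rule for the threshold, and the mean-reward baseline are assumptions for illustration and may differ from what FullDatasetRFTTrainer does.

```python
from statistics import mean


def adaptive_threshold(rewards, quantile=0.5):
    """Hypothetical rule: the reward at the given quantile of the current batch."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    ranked = sorted(rewards)
    idx = min(int(quantile * len(ranked)), len(ranked) - 1)
    return ranked[idx]


def reinforce_loss(log_probs, rewards, quantile=0.5):
    """REINFORCE surrogate loss computed over accepted trajectories only.

    log_probs: summed token log-probability of each sampled SMILES string.
    rewards:   scalar reward of each sampled SMILES string.
    """
    threshold = adaptive_threshold(rewards, quantile)
    # Reject trajectories whose reward falls below the batch threshold.
    kept = [(lp, r) for lp, r in zip(log_probs, rewards) if r >= threshold]
    # Subtracting a mean-reward baseline reduces gradient variance (assumed choice).
    baseline = mean(r for _, r in kept)
    # Minimising this loss raises the log-probability of above-baseline samples.
    return -mean((r - baseline) * lp for lp, r in kept)


loss = reinforce_loss(log_probs=[-10.0, -12.0, -8.0, -15.0],
                      rewards=[0.2, 0.8, 0.6, 0.4])
print(round(loss, 3))
```

Because the threshold is recomputed from each batch, the bar for acceptance rises as the policy improves, which is one plausible reading of "adaptive" here.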

2 · Intended uses & scope

Stage	Example use-case	Not a good fit
Hit finding	Rapidly scaffold-hop around a weak binder identified by docking.	Predicting absolute IC₅₀/Kᵢ values.
Lead optimisation	Generating analogues that respect Lipinski and blood-brain barrier (BBB) guidelines.	Ensuring synthetic accessibility without chemist review.
Ideation / teaching	Demonstrating language-model chemistry in the classroom.	Production-scale enumeration without downstream filtering.

3 · Quick start

Requires transformers ≥ 4.42, torch ≥ 2.2, rdkit, accelerate.

from transformers import AutoTokenizer, GPT2LMHeadModel

model_id = "ProteinDance/ProteinSkier"
tok = AutoTokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

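# Sample 5 molecules, each starting from the <bos> token
prompt = tok("<bos>", return_tensors="pt").input_ids
gen = model.generate(
    prompt.repeat(5, 1),  # one prompt row per molecule to sample
    max_length=128,       # cap on total tokens per sequence
    do_sample=True,       # stochastic sampling instead of greedy decoding
    top_p=0.95,           # nucleus sampling cutoff
    temperature=0.7,      # values below 1 sharpen the token distribution
)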
smiles = tok.batch_decode(gen, skip_special_tokens=True)
print("\n".join(smiles))
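
Sampled output usually benefits from a light clean-up pass before any downstream scoring. The sketch below drops empty strings, duplicates, and molecules already present in a reference set. The filter_generated function and the known_smiles set are hypothetical names, and the comparison is on raw strings; a real novelty check would first canonicalise each SMILES with RDKit so that different spellings of the same molecule match.

```python
def filter_generated(smiles_list, known_smiles):
    """Drop empty strings, duplicates, and molecules already in `known_smiles`."""
    seen = set()
    kept = []
    for smi in smiles_list:
        # Some tokenizers decode with spaces between tokens; strip them (assumption).
        smi = smi.replace(" ", "").strip()
        if not smi or smi in seen or smi in known_smiles:
            continue
        seen.add(smi)
        kept.append(smi)
    return kept


generated = ["CCO", "C C O", "", "c1ccccc1", "CCN"]
known = {"CCN"}  # stand-in for the training-set hash table
print(filter_generated(generated, known))
```

The function preserves the order in which molecules were sampled, which keeps the result reproducible for a fixed random seed.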

4 · Limitations & caveats

No guaranteed synthesizability – always perform retrosynthetic analysis.
Property estimators used in RFT are fast approximations; wet-lab assay results may differ.
Output may include patented molecules – run IP checks.
ADMET focus biases chemistry toward oral drugs; unsuitable for agrochemicals or materials.

Model size: 25.3M params (Safetensors, F32)

Base model: openai-community/gpt2