---
license: apache-2.0
---
|
|
|
|
|
 |
|
|
|
|
|
# PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction 🧬
|
|
|
|
|
This is the repository for [PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction](https://www.biorxiv.org/content/10.64898/2025.12.31.697180), a collection of machine learning predictors for canonical and non-canonical peptide properties that operate on amino acid sequence and SMILES representations. PeptiVerse evaluates key biophysical and therapeutic properties of peptides to support property-optimized generation.
|
|
|
|
|
## Table of Contents

- [Quick Start](#quick-start)
- [Installation](#installation)
- [Repository Structure](#repository-structure)
- [Training Data Collection](#training-data-collection)
- [Best Model List](#best-model-list)
  - [Full model set (cuML-enabled)](#full-model-set-cuml-enabled)
  - [Minimal deployable model set (no cuML)](#minimal-deployable-model-set-no-cuml)
- [Usage](#usage)
  - [Local Application Hosting](#local-application-hosting)
  - [Dataset integration](#dataset-integration)
  - [Quick Inference By Property Per Model](#quick-inference-by-property-per-model)
- [Interpretation](#interpretation)
- [Model Architecture](#model-architecture)
- [Troubleshooting](#troubleshooting)
- [Citation](#citation)
|
|
|
|
|
## Quick Start

```bash
# Clone the repository and enter it
git clone https://huggingface.co/ChatterjeeLab/PeptiVerse
cd PeptiVerse

# Install dependencies
pip install -r requirements.txt

# Run inference
python inference.py
```
|
|
## Installation

### Minimal Setup

- Easy start-up environment (using transformers and XGBoost models):

```bash
pip install -r requirements.txt
```

### Full Setup

- Access to the trained SVM and Elastic Net models additionally requires `RAPIDS cuML`; installation instructions are on the official [GitHub page](https://github.com/rapidsai/cuml) (**CUDA-capable GPU required**).
- Optional: a pre-compiled Singularity/Apptainer environment (7.52 GB) with all dependencies is available on [Google Drive](https://drive.google.com/file/d/1RJQ9HK0_gsPOhRo5H5ZmH_MYcpJqQD7e/view?usp=sharing). A CUDA-capable GPU is still required to load the cuML models.

```bash
# Test the container
apptainer exec peptiverse.sif python -c "import sys; print(sys.executable)"

# Run inference (see below)
apptainer exec peptiverse.sif python inference.py
```
|
|
## Repository Structure

This repository hosts the large files (datasets and model weights) behind [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction. [Paper link.](https://www.biorxiv.org/content/10.64898/2025.12.31.697180v1)

```
PeptiVerse/
├── training_data_cleaned/     # Processed datasets with embeddings
│   └── <property>/            # Property-specific data
│       ├── train/val splits
│       └── precomputed embeddings
├── training_classifiers/      # Trained model weights
│   └── <property>/
│       ├── cnn_wt/            # CNN architectures
│       ├── mlp_wt/            # MLP architectures
│       └── xgb_wt/            # XGBoost models
├── tokenizer/                 # PeptideCLM tokenizer
├── training_data/             # Raw training data
├── inference.py               # Main prediction interface
├── best_models.txt            # Model selection manifest
└── requirements.txt           # Python dependencies
```
|
|
|
|
|
## Training Data Collection

<table>
<caption><strong>Data distribution.</strong> Classification tasks report counts for class 0/1; regression tasks report total sample size (N).</caption>
<thead>
<tr>
<th rowspan="2"><strong>Properties</strong></th>
<th colspan="2"><strong>Amino Acid Sequences</strong></th>
<th colspan="2"><strong>SMILES Sequences</strong></th>
</tr>
<tr>
<th><strong>0</strong></th>
<th><strong>1</strong></th>
<th><strong>0</strong></th>
<th><strong>1</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="5"><strong>Classification</strong></td>
</tr>
<tr>
<td>Hemolysis</td>
<td>4765</td>
<td>1311</td>
<td>4765</td>
<td>1311</td>
</tr>
<tr>
<td>Non-Fouling</td>
<td>13580</td>
<td>3600</td>
<td>13580</td>
<td>3600</td>
</tr>
<tr>
<td>Solubility</td>
<td>9668</td>
<td>8785</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Permeability (Penetrance)</td>
<td>1162</td>
<td>1162</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Toxicity</td>
<td>-</td>
<td>-</td>
<td>5518</td>
<td>5518</td>
</tr>
<tr>
<td colspan="5"><strong>Regression (N)</strong></td>
</tr>
<tr>
<td>Permeability (PAMPA)</td>
<td colspan="2" align="center">-</td>
<td colspan="2" align="center">6869</td>
</tr>
<tr>
<td>Permeability (Caco-2)</td>
<td colspan="2" align="center">-</td>
<td colspan="2" align="center">606</td>
</tr>
<tr>
<td>Half-Life</td>
<td colspan="2" align="center">130</td>
<td colspan="2" align="center">245</td>
</tr>
<tr>
<td>Binding Affinity</td>
<td colspan="2" align="center">1436</td>
<td colspan="2" align="center">1597</td>
</tr>
</tbody>
</table>
|
|
|
|
|
|
|
|
## Best Model List

### Full model set (cuML-enabled)

| Property | Best Model (Sequence) | Best Model (SMILES) | Task Type | Threshold (Sequence) | Threshold (SMILES) |
|----------------------------|-----------------|---------------------|-------------|----------------|--------------------|
| Hemolysis | SVM | Transformer | Classifier | 0.2521 | 0.4343 |
| Non-Fouling | MLP | ENET | Classifier | 0.57 | 0.6969 |
| Solubility | CNN | - | Classifier | 0.377 | - |
| Permeability (Penetrance) | SVM | - | Classifier | 0.5493 | - |
| Toxicity | - | Transformer | Classifier | - | 0.3401 |
| Binding Affinity | unpooled | unpooled | Regression | - | - |
| Permeability (PAMPA) | - | CNN | Regression | - | - |
| Permeability (Caco-2) | - | SVR | Regression | - | - |
| Half-life | Transformer | XGB | Regression | - | - |

> Note: *unpooled* indicates models that operate on token-level embeddings with cross-attention rather than on mean-pooled representations.
|
|
|
|
|
### Minimal deployable model set (no cuML)

| Property | Best Model (Sequence) | Best Model (SMILES) | Task Type | Threshold (Sequence) | Threshold (SMILES) |
|----------------------------|-----------------|---------------------|-------------|----------------|--------------------|
| Hemolysis | XGB | Transformer | Classifier | 0.2801 | 0.4343 |
| Non-Fouling | MLP | XGB | Classifier | 0.57 | 0.3982 |
| Solubility | CNN | - | Classifier | 0.377 | - |
| Permeability (Penetrance) | XGB | - | Classifier | 0.4301 | - |
| Toxicity | - | Transformer | Classifier | - | 0.3401 |
| Binding Affinity | unpooled | unpooled | Regression | - | - |
| Permeability (PAMPA) | - | CNN | Regression | - | - |
| Permeability (Caco-2) | - | SVR | Regression | - | - |
| Half-life | xgb_wt_log | xgb_smiles | Regression | - | - |

> Note: Models listed as SVM or ENET in the full set are replaced with XGB here, because those models cannot be loaded in a deployment environment without cuML. *xgb_wt_log* indicates that time was log-transformed during training.
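
The thresholds in the tables above turn a classifier score into a 0/1 label. Here is a minimal sketch of that step. It assumes that a score at or above the threshold maps to label 1; this README does not confirm the comparison direction. The `THRESHOLDS` dictionary and the `to_label` helper are hypothetical names, with values copied from the minimal deployable table.

```python
# Thresholds copied from the minimal deployable model table.
# Keys are (property, mode); the key layout is illustrative only.
THRESHOLDS = {
    ("hemolysis", "wt"): 0.2801,
    ("hemolysis", "smiles"): 0.4343,
    ("nf", "wt"): 0.57,
    ("nf", "smiles"): 0.3982,
    ("solubility", "wt"): 0.377,
    ("permeability_penetrance", "wt"): 0.4301,
    ("toxicity", "smiles"): 0.3401,
}


def to_label(prop: str, mode: str, score: float) -> int:
    """Binarize a classifier score (assumption: score >= threshold -> 1)."""
    threshold = THRESHOLDS[(prop, mode)]
    return int(score >= threshold)


print(to_label("hemolysis", "wt", 0.31))   # 0.31 >= 0.2801 -> 1
print(to_label("solubility", "wt", 0.20))  # 0.20 <  0.377  -> 0
```

Because the thresholds are tuned per property and per model rather than fixed at 0.5, the same raw score can give different labels under the two model sets.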
|
|
|
|
|
|
|
|
## Usage

### Local Application Hosting

- Host the [PeptiVerse UI](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse) locally with your own resources.

```bash
git clone https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse
cd PeptiVerse

# Configure models in best_models.txt, then launch the app
python app.py
```
|
|
### Dataset integration

- All properties are provided as raw data, split-ready CSVs, and [Hugging Face datasets](https://huggingface.co/docs/datasets/en/index).
- Selectively download the data you need with `huggingface-cli`:

```bash
# --include: fetch only this folder
# --exclude: skip weights/artifacts
# --local-dir-use-symlinks False: make real copies
huggingface-cli download ChatterjeeLab/PeptiVerse \
  --include "training_data_cleaned/**" \
  --exclude "**/*.pt" "**/*.joblib" \
  --local-dir PeptiVerse_partial \
  --local-dir-use-symlinks False
```
|
|
- Or in Python:

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ChatterjeeLab/PeptiVerse",
    allow_patterns=["training_data_cleaned/**"],  # only this folder
    ignore_patterns=["**/*.pt", "**/*.joblib"],   # skip weights/artifacts
    local_dir="PeptiVerse_partial",
    local_dir_use_symlinks=False,                 # make real copies
)
print("Downloaded to:", local_dir)
```
|
|
- Using the Hugging Face datasets (with pre-computed embeddings and splits)
  - All embedding datasets are saved via `DatasetDict.save_to_disk` and can be loaded with:

```python
from datasets import load_from_disk

# PATH is a placeholder: point it at one saved dataset directory
# inside the downloaded training_data_cleaned/ folder.
PATH = "path/to/saved_dataset"

ds = load_from_disk(PATH)
train_ds = ds["train"]
val_ds = ds["val"]
```
|
|
- A) Sequence-based ([ESM-2](https://huggingface.co/facebook/esm2_t33_650M_UR50D) embeddings)
  - Pooled (fixed-length vector per sequence)
    - Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
    - Each item:
      - sequence: `str`
      - label: `int` (classification) or `float` (regression)
      - embedding: `float32[H]` (H=1280 for ESM-2 650M)
  - Unpooled (variable-length token matrix)
    - Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
    - Each item:
      - sequence: `str`
      - label: `int` (classification) or `float` (regression)
      - embedding: `float16[L, H]` (nested lists)
      - attention_mask: `int8[L]`
      - length: `int` (=L)
- B) SMILES-based ([PeptideCLM](https://github.com/AaronFeller/PeptideCLM) embeddings)
  - Pooled (fixed-length vector per sequence)
    - Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
    - Each item:
      - sequence: `str` (SMILES)
      - label: `int` (classification) or `float` (regression)
      - embedding: `float32[H]`
  - Unpooled (variable-length token matrix)
    - Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
    - Each item:
      - sequence: `str` (SMILES)
      - label: `int` (classification) or `float` (regression)
      - embedding: `float16[L, H]` (nested lists)
      - attention_mask: `int8[L]`
      - length: `int` (=L)
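
The relationship between the unpooled and pooled formats can be sketched in plain Python. This assumes that `attention_mask` marks valid tokens with 1 and that special tokens have already been removed, as described above. `mean_pool` is a hypothetical helper, not a function from this repository.

```python
from typing import List


def mean_pool(embedding: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors of shape [L, H] over positions where the mask is 1."""
    valid = [vec for vec, keep in zip(embedding, attention_mask) if keep == 1]
    if not valid:
        raise ValueError("no valid tokens to pool")
    hidden = len(valid[0])
    return [sum(vec[h] for vec in valid) / len(valid) for h in range(hidden)]


# Toy item with 3 token positions and H=2 (real ESM-2 650M items have H=1280).
item = {
    "embedding": [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    "attention_mask": [1, 1, 0],  # the last position is masked out
}
pooled = mean_pool(item["embedding"], item["attention_mask"])
print(pooled)  # [2.0, 3.0]
```

The pooled vector has length H whatever the sequence length, which is why pooled items suit fixed-input models while unpooled items keep per-token detail.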
|
|
|
|
|
|
|
|
### Quick Inference By Property Per Model

```python
from inference import PeptiVersePredictor

pred = PeptiVersePredictor(
    manifest_path="best_models.txt",  # best model list
    classifier_weight_root=".",       # repo root (where training_classifiers/ lives)
    device="cuda",                    # or "cpu"
)

# mode: smiles (SMILES-based models) / wt (Sequence-based models)
# property keys (with some level of name normalization)
#   hemolysis
#   nf (Non-Fouling)
#   solubility
#   permeability_penetrance
#   toxicity
#   permeability_pampa
#   permeability_caco2
#   halflife
#   binding_affinity

seq = "GIVEQCCTSICSLYQLENYCN"
smiles = "CC(C)C[C@@H]1NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@@H](C)N(C)C(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CC(C)C)N(C)C(=O)[C@H]2CCCN2C1=O"

# Hemolysis
out = pred.predict_property("hemolysis", mode="wt", input_str=seq)
print(out)
# {"property":"hemolysis","mode":"wt","score":prob,"label":0/1,"threshold":...}

out = pred.predict_property("hemolysis", mode="smiles", input_str=smiles)
print(out)

# Non-fouling (key is nf)
out = pred.predict_property("nf", mode="wt", input_str=seq)
print(out)

out = pred.predict_property("nf", mode="smiles", input_str=smiles)
print(out)

# Solubility (Sequence-only)
out = pred.predict_property("solubility", mode="wt", input_str=seq)
print(out)

# Permeability (Penetrance) (Sequence-only)
out = pred.predict_property("permeability_penetrance", mode="wt", input_str=seq)
print(out)

# Toxicity (SMILES-only)
out = pred.predict_property("toxicity", mode="smiles", input_str=smiles)
print(out)

# Permeability (PAMPA) (SMILES regression)
out = pred.predict_property("permeability_pampa", mode="smiles", input_str=smiles)
print(out)
# {"property":"permeability_pampa","mode":"smiles","score":value}

# Permeability (Caco-2) (SMILES regression)
out = pred.predict_property("permeability_caco2", mode="smiles", input_str=smiles)
print(out)

# Half-life (sequence-based + SMILES regression)
out = pred.predict_property("halflife", mode="wt", input_str=seq)
print(out)

out = pred.predict_property("halflife", mode="smiles", input_str=smiles)
print(out)

# Binding Affinity
protein = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQV..."  # target protein
peptide_seq = "GIVEQCCTSICSLYQLENYCN"

out = pred.predict_binding_affinity(
    mode="wt",
    target_seq=protein,
    binder_str=peptide_seq,
)
print(out)
# {
#   "property":"binding_affinity",
#   "mode":"wt",
#   "affinity": float,
#   "class_by_threshold": "High (≥9)" / "Moderate (7-9)" / "Low (<7)",
#   "class_by_logits": same buckets,
#   "binding_model": "pooled" or "unpooled",
# }
```
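
To score several peptides across several properties, the calls above can be wrapped in a loop. This is a minimal sketch that relies only on the `predict_property(property, mode=..., input_str=...)` call and the `score` key shown above. `screen_peptides` and `_StubPredictor` are hypothetical names; the stub exists only so the snippet runs without model weights.

```python
from typing import Any, Dict, List


def screen_peptides(pred: Any, inputs: List[str], properties: List[str],
                    mode: str = "wt") -> List[Dict[str, Any]]:
    """Run each property predictor on each input and collect one row per input."""
    rows = []
    for input_str in inputs:
        row: Dict[str, Any] = {"input": input_str}
        for prop in properties:
            out = pred.predict_property(prop, mode=mode, input_str=input_str)
            row[prop] = out["score"]
        rows.append(row)
    return rows


class _StubPredictor:
    """Stand-in for PeptiVersePredictor (hypothetical; returns a dummy score)."""

    def predict_property(self, prop: str, mode: str, input_str: str) -> Dict[str, Any]:
        return {"property": prop, "mode": mode, "score": len(input_str) / 100.0}


rows = screen_peptides(_StubPredictor(), ["GIVEQCCTSICSLYQLENYCN", "ACDK"],
                       ["hemolysis", "solubility"])
for row in rows:
    print(row)
```

With the real predictor, pass the `pred` object created above in place of the stub, and request only properties that support the chosen mode (for example, toxicity is SMILES-only).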
|
|
|
|
|
## Interpretation

The same descriptions are available in the paper and in the `Documentation` tab of the PeptiVerse app.

---

#### 🩸 Hemolysis Prediction

HC50 is the peptide concentration at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are labeled hemolytic and all others non-hemolytic, giving a binary 0/1 dataset. The predicted probability should therefore be read as a risk indicator, not as an exact concentration estimate.<br>

**Output interpretation:**<br>

- Score close to 1.0 = high probability of red blood cell membrane disruption<br>
- Score close to 0.0 = non-hemolytic

---
|
|
|
|
|
#### 💧 Solubility Prediction

Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions.<br>

**Output interpretation:**<br>

- Score close to 1.0 = highly soluble<br>
- Score close to 0.0 = poorly soluble<br>

---
|
|
|
|
|
#### 🎯 Non-Fouling Prediction

Higher scores indicate stronger non-fouling behavior, which is desirable for circulation and surface-exposed applications.<br>

**Output interpretation:**<br>

- Score close to 1.0 = non-fouling<br>
- Score close to 0.0 = fouling<br>

---
|
|
|
|
|
#### 🪣 Permeability Prediction

Predicts membrane permeability on a log P scale.<br>

**Output interpretation:**<br>

- Higher values = more permeable (> -6.0)<br>
- Penetrance is a classification task: the output lies in the [0, 1] range, and values closer to 1 indicate a more permeable peptide.<br>

---
|
|
|
|
|
#### ⏱️ Half-Life Prediction

**Interpretation:** Predicted values are in hours and reflect relative peptide stability. Higher values indicate longer persistence in serum, while lower values suggest faster degradation.

---
|
|
|
|
|
#### ⚠️ Toxicity Prediction

**Interpretation:** Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.

---
|
|
|
|
|
#### Binding Affinity Prediction

Predicts peptide-protein binding affinity. Requires both the peptide and the target protein sequence.<br>

**Interpretation:**<br>

- Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)<br>
- Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)<br>
- Scores < 7 correspond to weak binders (K > 10⁻⁷ M, sub-micromolar and weaker)<br>
- A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.<br>
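
The cutoffs above are consistent with the score being a negative base-10 logarithm of a binding constant K in molar units. The README does not state the exact definition, so treat that as an assumption. Under it, a score can be converted back to a concentration and bucketed; `score_to_molar` and `affinity_class` are hypothetical helpers.

```python
def score_to_molar(score: float) -> float:
    """Convert an affinity score to molar units, assuming score = -log10(K)."""
    return 10.0 ** (-score)


def affinity_class(score: float) -> str:
    """Bucket a score using the cutoffs listed above (>= 9, 7 to 9, < 7)."""
    if score >= 9:
        return "High"
    if score >= 7:
        return "Moderate"
    return "Low"


for s in (9.5, 8.0, 6.2):
    print(f"score={s:.1f}  K~{score_to_molar(s):.1e} M  class={affinity_class(s)}")
```

Under the same assumption, two peptides whose scores differ by 1.0 differ by a factor of ten in K, which matches the tenfold rule of thumb above.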
|
|
|
|
|
|
|
|
## Model Architecture

- **Sequence Embeddings:** [ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D) for amino acid sequences and [PeptideCLM model](https://huggingface.co/aaronfeller/PeptideCLM-23M-all) for SMILES. The foundation-model embeddings are frozen.
- **XGBoost Model:** Gradient boosting on pooled embedding features for efficient, high-performance prediction.
- **CNN/Transformer Model:** One-dimensional convolutional or self-attention Transformer networks operating on unpooled embeddings to capture local sequence patterns.
- **Binding Model:** Transformer-based architecture with cross-attention between protein and peptide representations.
- **SVR Model:** Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
- **Others:** SVM and Elastic Net models were trained with [RAPIDS cuML](https://github.com/rapidsai/cuml), which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
|
|
|
|
|
## Troubleshooting

### LFS Download Issues

If files appear as Git LFS pointer stubs (small text files containing a SHA) instead of their real content, download them directly:

```bash
huggingface-cli download ChatterjeeLab/PeptiVerse \
  training_data_cleaned/hemolysis/hemo_smiles_meta_with_split.csv \
  --local-dir . \
  --local-dir-use-symlinks False
```

### Trouble installing cuML

For errors related to CUDA libraries, reinstall `torch` after installing `cuML`.
|
|
|
|
|
## Citation

If you find this repository helpful for your publications, please consider citing our paper:

```
@article{zhang2025peptiverse,
  author  = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
  title   = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
  year    = {2026},
  doi     = {10.64898/2025.12.31.697180},
  URL     = {https://doi.org/10.64898/2025.12.31.697180},
  journal = {bioRxiv}
}
```

By using this repository, you agree to abide by the Apache 2.0 license.