Update README.md

tags:
- biology
- alphafold
- bio-compute
---
# Biosaic Tokenizer
## Overview

Biosaic (Bio-Mosaic) is a tokenizer library built for [Enigma2](https://github.com/shivendrra/enigma2). It contains a tokenizer and an embedder for DNA and amino-acid (protein) sequences, plus encoders based on VQ-VAE and Evoformer architectures that can convert sequences into embeddings and back again for model-training use cases.
## Features
- **Tokenization:** converts sequences into k-mers *(DNA only)*; see the sketch after this list.
- **Encoding:** converts sequences into embeddings for classification and training purposes.
- **Easy to use:** a very basic, easy-to-use library.
- **SoTA encoders:** the Evoformer and VQ-VAE models are inspired by [AlphaFold-2](https://www.biorxiv.org/content/10.1101/2024.12.02.626366v1.full).
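For example, a DNA sequence can be split into overlapping k-mers like this (a plain-Python sketch of the idea, not Biosaic's internal implementation; `kmerize` and `k=4` are illustrative choices):

```python
def kmerize(seq: str, k: int = 4, stride: int = 1) -> list[str]:
  """Split a DNA sequence into overlapping k-mers."""
  seq = seq.upper()
  return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

print(kmerize("ATGCGTAC", k=4))
# ['ATGC', 'TGCG', 'GCGT', 'CGTA', 'GTAC']
```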
## Models

It has two different models:

- for DNA tokenization & encoding: **VQ-VAE**
- for protein encoding: **EvoFormer**

**VQ-VAE** is around 160M parameters (for now it is only about 40M, just for test runs).
**EvoFormer** is around 136M parameters (still in training).
### Config:

For the **VQ-VAE** (DNA) model:

```python
class ModelConfig:
  d_model: int = 768    # embedding / hidden dimension
  in_dim: int = 4       # input alphabet size (A, C, G, T)
  beta: float = 0.15    # commitment-loss weight for the VQ codebook
  dropout: float = 0.25
  n_heads: int = 16
  n_layers: int = 12
```
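Here, `beta` is the usual VQ-VAE commitment-loss weight. As a rough illustration of what the vector-quantization bottleneck does with these dimensions (a generic sketch, not Biosaic's actual code; the codebook size of 512 and the batch/sequence sizes are assumptions):

```python
import torch
import torch.nn.functional as F

d_model, codebook_size, beta = 768, 512, 0.15      # codebook_size is an assumed value
codebook = torch.randn(codebook_size, d_model)     # learnable embedding table in a real model
z_e = torch.randn(6, 256, d_model)                 # encoder output: (batch, seq_len, d_model)

# Nearest-neighbour lookup: distance from every position's embedding to every code.
dists = torch.cdist(z_e.reshape(-1, d_model), codebook)   # (batch*seq_len, codebook_size)
idx = dists.argmin(dim=-1)                                 # discrete token ids
z_q = codebook[idx].reshape_as(z_e)                        # quantized embeddings

# VQ-VAE losses: pull codes toward encoder outputs, and (scaled by beta) commit the
# encoder to its chosen codes; the straight-through trick keeps gradients flowing.
codebook_loss = F.mse_loss(z_q, z_e.detach())
commitment_loss = beta * F.mse_loss(z_e, z_q.detach())
z_q_straight_through = z_e + (z_q - z_e).detach()
```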
For the **EvoFormer** (protein) model:

```python
import torch

class ModelConfig:
  DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  A = 4         # DNA alphabet size
  C = 21        # amino-acid alphabet size (21 letters; 4 for DNA)
  d_msa = 768   # width of the MSA representation
  d_pair = 256  # width of the pair representation
  n_heads = 32
  n_blocks = 28
```
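For orientation, `d_msa` and `d_pair` suggest the two Evoformer-style representations (an MSA representation and a pair representation). A shape-only sketch under assumed batch and sequence sizes, not the library's actual forward pass:

```python
import torch
import torch.nn.functional as F

batch, n_seqs, seq_len = 2, 32, 256      # assumed sizes (cf. MSA_SEQ and L_SEQ below)
C, d_msa, d_pair = 21, 768, 256

# MSA input as token ids, one-hot encoded and projected to the MSA representation.
msa_ids = torch.randint(0, C, (batch, n_seqs, seq_len))
msa_repr = torch.nn.Linear(C, d_msa)(F.one_hot(msa_ids, num_classes=C).float())
print(msa_repr.shape)                    # (batch, n_seqs, seq_len, d_msa)

# Pair representation: one feature vector per residue pair of each sequence.
pair_repr = torch.zeros(batch, seq_len, seq_len, d_pair)
print(pair_repr.shape)                   # (batch, seq_len, seq_len, d_pair)
```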
## Training:

For training the ``VQ-VAE`` & ``EvoFormer`` models, batch training is preferred. Each model has its own separate ``Dataset`` class that takes raw sequence strings as input, one-hot encodes the DNA sequences first, and then fills them into batches according to the ``train`` & ``val`` splits (the validation split is around 20% of the full dataset).
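A minimal sketch of what such a dataset might look like (an illustration under assumptions, not Biosaic's actual ``Dataset`` class; the 80/20 split, the padding scheme, and the name `DNADataset` are made up for the example):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset

DNA = {"A": 0, "C": 1, "G": 2, "T": 3}

class DNADataset(Dataset):
  """One-hot encodes fixed-length DNA strings for batched training."""
  def __init__(self, sequences, block_size=256):
    # Crop/pad every sequence to block_size (padding with 'A' is just a simplification).
    self.sequences = [s.upper()[:block_size].ljust(block_size, "A") for s in sequences]

  def __len__(self):
    return len(self.sequences)

  def __getitem__(self, i):
    ids = torch.tensor([DNA[ch] for ch in self.sequences[i]])
    return F.one_hot(ids, num_classes=4).float()   # (block_size, 4)

sequences = ["ATGCGTAC" * 32, "GGCATTAC" * 32]      # toy data
split = int(0.8 * len(sequences))                   # keep ~20% for validation
train_ds, val_ds = DNADataset(sequences[:split]), DNADataset(sequences[split:])
```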
#### For VQ-VAE:
```python
import torch

class TrainConfig:
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  learning_rate = 1e-4   # bumped from 1e-5
  weight_decay = 1e-4
  amsgrad = True
  warmup_epochs = 50     # linear warm-up
  epochs = 2000
  eval_interval = 100
  eval_iters = 30
  batch_size = 6
  block_size = 256       # sequence length per sample
```
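A rough skeleton of how these hyperparameters could be wired together (a generic sketch that reuses the `TrainConfig` above; the stand-in model, dummy batch, and reconstruction loss are placeholders, not Biosaic's training script):

```python
import torch
import torch.nn.functional as F

cfg = TrainConfig()                                   # the class defined above
model = torch.nn.Linear(4, 4).to(cfg.device)          # stand-in for the real VQ-VAE
optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate,
                              weight_decay=cfg.weight_decay, amsgrad=cfg.amsgrad)

for epoch in range(cfg.epochs):
  # Linear warm-up: ramp the learning rate over the first warmup_epochs.
  scale = min(1.0, (epoch + 1) / cfg.warmup_epochs)
  for group in optimizer.param_groups:
    group["lr"] = cfg.learning_rate * scale

  x = torch.randn(cfg.batch_size, cfg.block_size, 4, device=cfg.device)  # dummy batch
  loss = F.mse_loss(model(x), x)                      # placeholder reconstruction loss
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

  if (epoch + 1) % cfg.eval_interval == 0:
    print(f"epoch {epoch + 1}: loss {loss.item():.4f}")
```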
#### For EvoFormer:
```python
import torch

class TrainConfig:
  DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  LR = 1e-4        # learning rate
  WD = 1e-4        # weight decay
  AMS = True       # use the AMSGrad variant
  WARMUP = 50      # linear warm-up epochs
  EPOCHS = 500
  BATCH = 8
  MSA_SEQ = 32     # number of sequences in each MSA
  L_SEQ = 256      # length of each sequence
  EVAL_ITERS = 5
  EVAL_INTV = 50   # evaluation interval (epochs)
```
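The same linear warm-up can also be expressed with a scheduler. A small sketch using PyTorch's `LambdaLR` together with the `TrainConfig` above (the stand-in model is a placeholder, not the actual EvoFormer):

```python
import torch

cfg = TrainConfig()                                            # the class defined above
model = torch.nn.Linear(cfg.L_SEQ, cfg.L_SEQ).to(cfg.DEVICE)   # stand-in for the real EvoFormer
optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.LR,
                              weight_decay=cfg.WD, amsgrad=cfg.AMS)

# Linear warm-up for the first WARMUP epochs, then a constant learning rate.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / cfg.WARMUP))

for epoch in range(cfg.EPOCHS):
  # ... forward/backward over MSA batches of shape (BATCH, MSA_SEQ, L_SEQ) goes here ...
  optimizer.step()
  scheduler.step()
  if (epoch + 1) % cfg.EVAL_INTV == 0:
    print(f"epoch {epoch + 1}: lr {scheduler.get_last_lr()[0]:.2e}")
```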