shivendrra committed · Commit 31c2a86 · verified · 1 Parent(s): a5f2b4a

Update README.md

Files changed (1): README.md +80 -1
README.md CHANGED
@@ -10,4 +10,83 @@ tags:
- biology
- alphafold
- bio-compute
---

# Biosaic Tokenizer

## Overview
Biosaic (Bio-Mosaic) is a tokenizer library built for [Enigma2](https://github.com/shivendrra/enigma2). It provides a tokenizer and an embedder for DNA & amino-acid (protein) sequences, along with VQ-VAE & Evoformer based encoders that convert sequences into embeddings and back, for use in model training.

## Features
- **Tokenization:** converts sequences into k-mers *(DNA only)*; see the short sketch after this list.
- **Encoding:** converts sequences into embeddings for classification and training purposes.
- **Easy to use:** the library is small and simple to use.
- **SoTA encoders:** the Evoformer & VQ-VAE models are inspired by [AlphaFold-2](https://www.biorxiv.org/content/10.1101/2024.12.02.626366v1.full).
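
As a rough illustration of what k-mer tokenization means (this is not Biosaic's actual API; the `kmerize` name and `k = 4` below are made up for the example), a DNA string is split into overlapping fixed-length substrings:

```python
def kmerize(seq: str, k: int = 4) -> list[str]:
  """Split a DNA sequence into overlapping k-mers (illustrative only)."""
  seq = seq.upper()
  return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmerize("ATGCGTAC"))
# ['ATGC', 'TGCG', 'GCGT', 'CGTA', 'GTAC']
```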

## Models

It has two different models:
- **VQ-VAE** for DNA tokenization & encoding
- **EvoFormer** for protein encodings

**VQ-VAE** is around 160M parameters (for now it is only around 40M, just for test runs).
**EvoFormer** is around 136M parameters (still in training).

### Config:

For the **VQ-VAE**:

```python
class ModelConfig:
  d_model: int = 768
  in_dim: int = 4
  beta: float = 0.15
  dropout: float = 0.25
  n_heads: int = 16
  n_layers: int = 12
```
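
For context on `beta`: in the standard VQ-VAE objective it weights the commitment term of the quantization loss. A minimal sketch of that loss, assuming the usual formulation (tensor names here are illustrative, not taken from this repo):

```python
import torch.nn.functional as F

def vq_loss(z_e, z_q, beta: float = 0.15):
  """Codebook term + beta-weighted commitment term (standard VQ-VAE quantization loss)."""
  codebook_loss = F.mse_loss(z_q, z_e.detach())    # pull codebook vectors toward encoder outputs
  commitment_loss = F.mse_loss(z_e, z_q.detach())  # keep encoder outputs close to their chosen codes
  return codebook_loss + beta * commitment_loss
```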

For the **EvoFormer**:

```python
import torch

class ModelConfig:
  DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  A = 4        # DNA alphabet size
  C = 21       # 21 letters for amino acids (vs. 4 for DNA)
  d_msa = 768
  d_pair = 256
  n_heads = 32
  n_blocks = 28
```
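
For reference, `d_msa` and `d_pair` are the widths of the two representations an Evoformer-style stack operates on: a per-sequence, per-residue MSA representation and a residue-pair representation. A shape sketch (example sizes, no batch dimension):

```python
import torch

N_SEQ, L = 32, 256                     # sequences per MSA, residues per sequence (example values)
msa_repr = torch.zeros(N_SEQ, L, 768)  # one d_msa-dim feature vector per (sequence, residue)
pair_repr = torch.zeros(L, L, 256)     # one d_pair-dim feature vector per residue pair
print(msa_repr.shape, pair_repr.shape)
```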

## Training:

For training the ``VQ-VAE`` & ``EvoFormer`` models, batch training is preferred, using a separate ``Dataset`` class that takes the raw strings as input, one-hot encodes the DNA sequences first, and then fills them into batches according to the ``train`` & ``val`` splits, where the validation split is around 20% of the full dataset.
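
A minimal sketch of such a dataset, assuming a plain PyTorch ``Dataset`` over fixed-length DNA strings (the class name, base-to-index mapping, and 80/20 split below are illustrative, not the repo's actual code):

```python
import torch
from torch.utils.data import Dataset

DNA = {"A": 0, "C": 1, "G": 2, "T": 3}

class DNADataset(Dataset):
  """One-hot encodes DNA strings (illustrative sketch, not Biosaic's Dataset class)."""
  def __init__(self, sequences):
    self.sequences = sequences

  def __len__(self):
    return len(self.sequences)

  def __getitem__(self, idx):
    idxs = torch.tensor([DNA[ch] for ch in self.sequences[idx].upper()])
    return torch.nn.functional.one_hot(idxs, num_classes=4).float()  # (seq_len, 4)

# ~80/20 train/val split over the raw strings
seqs = ["ATGC" * 64, "GCTA" * 64, "TTAA" * 64, "CGCG" * 64, "ATAT" * 64]
split = int(0.8 * len(seqs))
train_ds, val_ds = DNADataset(seqs[:split]), DNADataset(seqs[split:])
```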

#### For VQ-VAE:
```python
import torch

class TrainConfig:
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  learning_rate = 1e-4   # bumped from 1e-5
  weight_decay = 1e-4
  amsgrad = True
  warmup_epochs = 50     # linear warm-up
  epochs = 2000
  eval_interval = 100
  eval_iters = 30
  batch_size = 6
  block_size = 256
```
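
A rough sketch of how these fields might drive the optimizer and linear warm-up, assuming the ``TrainConfig`` above (the placeholder model and loop body are not the repo's trainer):

```python
import torch

cfg = TrainConfig  # class above; attributes are read directly
model = torch.nn.Linear(4, 4).to(cfg.device)  # placeholder model, not the actual VQ-VAE
optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate,
                              weight_decay=cfg.weight_decay, amsgrad=cfg.amsgrad)

for epoch in range(cfg.epochs):
  # linear warm-up: ramp the learning rate over the first warmup_epochs
  scale = min(1.0, (epoch + 1) / cfg.warmup_epochs)
  for group in optimizer.param_groups:
    group["lr"] = cfg.learning_rate * scale
  # ... run one epoch of training batches here ...
  if (epoch + 1) % cfg.eval_interval == 0:
    pass  # ... run eval_iters validation batches here ...
```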

#### For EvoFormer:
```python
import torch

class TrainConfig:
  DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  LR = 1e-4
  WD = 1e-4
  AMS = True
  WARMUP = 50
  EPOCHS = 500
  BATCH = 8
  MSA_SEQ = 32   # number of sequences in each MSA
  L_SEQ = 256    # length of each sequence
  EVAL_ITERS = 5
  EVAL_INTV = 50
```
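
With these settings, each EvoFormer training batch corresponds to ``BATCH`` MSAs of ``MSA_SEQ`` sequences of length ``L_SEQ``, one-hot encoded over the ``C = 21`` classes from the ModelConfig above. A shape-only sketch (the actual featurization in the repo may differ):

```python
import torch

tokens = torch.randint(0, 21, (8, 32, 256))  # (BATCH, MSA_SEQ, L_SEQ) amino-acid indices
one_hot = torch.nn.functional.one_hot(tokens, num_classes=21).float()
print(one_hot.shape)  # torch.Size([8, 32, 256, 21]); projected to d_msa inside the model
```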