vshirasuna committed (verified)
Commit bac76b0 · 1 Parent(s): 2f0e472

Updated README.md

Files changed (1):
  1. README.md (+235 -235)

README.md (updated):

---
license: apache-2.0
tags:
- chemistry
- foundation models
- AI4Science
- materials
- molecules
- smiles
- selfies
- molecular formula
- iupac name
- inchi
- polymer smiles
- formulation
- pytorch
- bamba
- transformers
- mamba2
---

# Molecular String-based Bamba Encoder-Decoder (STR-Bamba)

This repository provides PyTorch source code associated with our publication, "STR-Bamba: Multimodal Molecular Textual Representation Encoder-Decoder Foundation Model".

**Paper:** [OpenReview Link](https://openreview.net/pdf?id=0uWNuJ1xtz)

**GitHub:** [GitHub Link](https://github.com/IBM/materials/tree/main/models/str_bamba)

For more information contact: [email protected] or [email protected].

![str_bamba](images/str-bamba.png)

## Introduction

We present a large encoder-decoder chemical foundation model based on the IBM Bamba architecture, a hybrid of Transformer and Mamba-2 layers, designed to support multi-representational molecular string inputs. The model is pre-trained BERT-style on 588 million samples, a corpus of approximately 29 billion molecular tokens. It serves as a foundation for chemical language research, supporting complex tasks such as molecular property prediction, classification, and molecular translation. **Additionally, the STR-Bamba architecture allows multiple representations to be aggregated in a single text input, since it imposes no token-length limit beyond hardware constraints** (see the illustrative sketch after the list below). Our experiments across multiple benchmark datasets demonstrate state-of-the-art performance on various tasks. Code details are available at: [GitHub Link](https://github.com/IBM/materials/tree/main/models/str_bamba).

The STR-Bamba model supports the following **molecular representations**:
- SMILES
- SELFIES
- Molecular Formula
- InChI
- IUPAC Name
- Polymer SMILES in [SPG notation](https://openreview.net/pdf?id=L47GThI95d)
- Formulations

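For illustration, several representations of the same molecule can be combined into one input string. The sketch below is an assumption, not the repository's documented format: only the `<smiles>` tag appears in the Feature Extraction section of this README, and the other tags should be checked against the released tokenizer.

```python
# Hypothetical multi-representation input for ethanol (tags other than
# <smiles> are assumptions; check the tokenizer vocabulary for the actual tags).
multi_input = (
    '<smiles>CCO'
    '<formula>C2H6O'
    '<inchi>InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3'
)

# With no fixed token-length limit, the combined string can be encoded the
# same way as a single-representation input (see Feature Extraction below).
print(multi_input)
```
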
## Table of Contents

1. [Getting Started](#getting-started)
   1. [Pretrained Models and Training Logs](#pretrained-models-and-training-logs)
   2. [Replicating Conda Environment](#replicating-conda-environment)
2. [Pretraining](#pretraining)
3. [Finetuning](#finetuning)
4. [Feature Extraction](#feature-extraction)
5. [Citations](#citations)

## Getting Started

**This code and environment have been tested on Nvidia V100s and Nvidia A100s.**

### Pretrained Models and Training Logs

We provide checkpoints of the STR-Bamba model pre-trained on a dataset of ~118M small molecules, ~2M polymer structures, and 258 formulations. The pre-trained model shows competitive performance on classification and regression benchmarks across small molecules, polymers, and electrolyte formulations. For code details: [GitHub Link](https://github.com/IBM/materials/tree/main/models/str_bamba)

Add the pre-trained STR-Bamba weights (e.g., `STR-Bamba_8.pt`) to the `inference/` or `finetune/` directory, according to your needs. The directory structure should look like the following:

```
inference/
└── str_bamba/
    ├── config/
    ├── checkpoints/
    │   └── STR-Bamba_8.pt
    └── tokenizer/
```
and/or:

```
finetune/
└── str_bamba/
    ├── config/
    ├── checkpoints/
    │   └── STR-Bamba_8.pt
    └── tokenizer/
```

### Replicating Conda Environment

Follow these steps to replicate our Conda environment and install the necessary libraries:

#### Create and Activate Conda Environment
```shell
mamba create -n strbamba python=3.10.13
mamba activate strbamba
```

#### PyTorch 2.4.0 and CUDA 12.4
```shell
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
```

#### Mamba2 Dependencies

Install the following packages in this order and on a machine with a **GPU**, because `mamba` requires `causal-conv1d` to be installed first.

```shell
# causal-conv1d
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d && git checkout v1.5.0.post8 && pip install . && cd .. && rm -rf causal-conv1d
```

```shell
# mamba
git clone https://github.com/state-spaces/mamba.git
cd mamba && git checkout v2.2.4 && pip install --no-build-isolation . && cd .. && rm -rf mamba
```

```shell
# flash-attn
pip install flash-attn==2.6.1 --no-build-isolation
```

#### Install Packages with Pip
```shell
pip install -r requirements.txt
```

#### Troubleshooting

If the builds above fail (for example, due to memory pressure while compiling `flash-attn`), try:

```shell
pip install mamba-ssm==2.2.4
MAX_JOBS=2 pip install flash-attn==2.6.1 --no-build-isolation --verbose
```

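After installation, a quick import check can confirm that the GPU-dependent extensions built correctly. This is a minimal sketch, not part of the repository; the import names are the standard ones for these packages.

```python
# Environment sanity check (illustrative only).
import torch
from importlib.metadata import version

# These imports fail if the CUDA extensions did not build correctly.
import causal_conv1d  # noqa: F401
import mamba_ssm      # noqa: F401
import flash_attn     # noqa: F401

print('CUDA available:', torch.cuda.is_available())
for dist in ('causal-conv1d', 'mamba-ssm', 'flash-attn'):
    print(dist, version(dist))
```
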
## Pretraining

For pretraining, we use two strategies: masked language modeling to train the encoder, and next-token prediction to train the decoder, refining molecular representation reconstruction and generation conditioned on the encoder. A minimal sketch of the masking step is shown after the run commands below.

The pretraining code provides examples of data processing and model training on a smaller dataset, requiring an A100 GPU.

To pre-train the two stages of the STR-Bamba model, run:

```shell
bash training/run_model_encoder_training.sh
```
or
```shell
bash training/run_model_decoder_training.sh
```

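For intuition, the encoder objective masks a fraction of the input tokens and trains the model to recover them. The following is an illustrative sketch only, with a simplified corruption scheme; it is not the repository's training code, and the 15% mask probability is an assumption.

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, mlm_probability: float = 0.15):
    """BERT-style masking sketch: corrupt sampled positions with the mask token
    and compute the loss only on those positions (labels elsewhere are -100)."""
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked] = -100                # ignored by the cross-entropy loss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id     # simplified: always insert the mask token
    return corrupted, labels
```
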
## Finetuning

The finetuning datasets and environment can be found in the [finetune](https://github.com/IBM/materials/tree/main/models/str_bamba/finetune) directory. After setting up the environment, you can run a finetuning task with:

```shell
bash finetune/runs/esol/run_finetune_esol.sh
```

Finetuning logs and checkpoints will be written to directories named `checkpoint_<measure_name>`.

## Feature Extraction

To load STR-Bamba, you can simply use:

```python
model = load_strbamba('STR-Bamba_8.pt')
```
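The decoder example further below moves tensors to a `device` that is otherwise not defined in this README; a minimal setup sketch, assuming the loaded model is a standard `torch.nn.Module` and a CUDA GPU is available:

```python
import torch

# Use the GPU when available and switch to inference mode.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device).eval()
```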

To encode SMILES, SELFIES, InChI, or other supported molecular representations into embeddings, you can use:

```python
with torch.no_grad():
    encoded_embeddings = model.encode(df['SMILES'], return_torch=True)
```
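The extracted embeddings can then be used as features for a conventional regressor or classifier. A minimal sketch, assuming `df` is a pandas DataFrame with a `SMILES` column and a hypothetical numeric property column `y`, and that `encode(..., return_torch=True)` returns a 2-D tensor:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X = encoded_embeddings.cpu().numpy()   # (n_molecules, embedding_dim)
y = df['y'].values                     # hypothetical property column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)
print('Held-out R^2:', regressor.score(X_test, y_test))
```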
For the decoder, you can use the following code to generate new molecular representations conditioned on the encoder output:

```python
with torch.no_grad():
    # encoder and decoder inputs
    encoder_input = '<smiles>CCO'
    decoder_input = '<smiles>'
    decoder_target = '<smiles>CCO'

    # tokenization
    encoder_input_ids = model.tokenizer(encoder_input,
                                        padding=True,
                                        truncation=True,
                                        return_tensors='pt')['input_ids'].to(device)
    decoder_input_ids = model.tokenizer(decoder_input,
                                        padding=True,
                                        truncation=True,
                                        return_tensors='pt')['input_ids'][:, :-1].to(device)
    decoder_target_ids = model.tokenizer(decoder_target,
                                         padding=True,
                                         truncation=True,
                                         return_tensors='pt')['input_ids'].to(device)

    # visualize input texts
    print('Encoder input:', model.tokenizer.batch_decode(encoder_input_ids))
    print('Decoder input:', model.tokenizer.batch_decode(decoder_input_ids))
    print('Decoder target:', model.tokenizer.batch_decode(decoder_target_ids))
    print('Target:', decoder_target_ids)

    # encoder forward
    encoder_hidden_states = model.encoder(encoder_input_ids).hidden_states

    # model generation
    output = model.decoder.generate(
        input_ids=decoder_input_ids,
        encoder_hidden_states=encoder_hidden_states,
        max_length=decoder_target_ids.shape[1],
        cg=True,
        return_dict_in_generate=True,
        output_scores=True,
        enable_timing=False,
        temperature=1,
        top_k=1,
        top_p=1.0,
        min_p=0.,
        repetition_penalty=1,
    )

    # visualize model output
    generated_text = ''.join(
        ''.join(
            model.tokenizer.batch_decode(
                output.sequences,
                clean_up_tokenization_spaces=True,
                skip_special_tokens=False
            )
        ).split(' ')
    )
    print(generated_text)
```

## Citations