The Tokenizer is Broken
#3
by
ribesstefano
- opened
As already mentioned in other DeepChem model repositories (see here), the model's tokenizer is broken: encoding a SMILES string and decoding it back does not round-trip to the original string.
Snippet to reproduce:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('DeepChem/ChemBERTa-77M-MLM')

sample_smiles = 'CN(CCCNc1nc(Nc2ccc([*:1])cc2)ncc1Br)C(=O)C1CCC1'
tokens = tokenizer(sample_smiles)
print(tokens)

decoded_smiles = tokenizer.decode(tokens['input_ids'], skip_special_tokens=True)
print(f"Original: {sample_smiles}")
print(f"Decoded: {decoded_smiles}")
assert sample_smiles == decoded_smiles
```
Output:

```
Original: CN(CCCNc1nc(Nc2ccc([*:1])cc2)ncc1Br)C(=O)C1CCC1
Decoded: CN(CCCNc1nc(Nc2ccc(*1)cc2)ncc1B)C(=O)C1CCC1
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[154], line 10
      8 print(f"Original: {sample_smiles}")
      9 print(f"Decoded: {decoded_smiles}")
---> 10 assert sample_smiles == decoded_smiles

AssertionError:
Can confirm, this is not working for me either.