The Tokenizer is Broken

#3
by ribesstefano - opened

As already mentioned in other DeepChem model repositories (see here), this model's tokenizer is broken: encoding and then decoding a SMILES string does not reproduce the input when it contains bracketed atoms (e.g. the attachment point [*:1]) or two-character element symbols (e.g. Br).

Snippet to reproduce:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('DeepChem/ChemBERTa-77M-MLM')

# SMILES with a numbered attachment point ([*:1]) and a two-character atom (Br)
sample_smiles = 'CN(CCCNc1nc(Nc2ccc([*:1])cc2)ncc1Br)C(=O)C1CCC1'
tokens = tokenizer(sample_smiles)
print(tokens)

# Round-trip check: decoding the encoded IDs should reproduce the input
decoded_smiles = tokenizer.decode(tokens['input_ids'], skip_special_tokens=True)
print(f"Original: {sample_smiles}")
print(f"Decoded:  {decoded_smiles}")
assert sample_smiles == decoded_smiles

Output:

Original: CN(CCCNc1nc(Nc2ccc([*:1])cc2)ncc1Br)C(=O)C1CCC1
Decoded:  CN(CCCNc1nc(Nc2ccc(*1)cc2)ncc1B)C(=O)C1CCC1

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[154], line 10
      8 print(f"Original: {sample_smiles}")
      9 print(f"Decoded:  {decoded_smiles}")
---> 10 assert sample_smiles == decoded_smiles

AssertionError:
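For comparison, a lossless SMILES tokenizer should split the string so that joining the tokens reproduces the input exactly. Below is a minimal sketch (not the ChemBERTa tokenizer) using a regex pattern commonly used in the SMILES language-modeling literature; it keeps bracket atoms like [*:1] and two-letter elements like Br as single tokens, so the round trip above would succeed:

```python
import re

# Regex-based SMILES tokenizer: bracket atoms ([...]) and two-letter
# elements (Br, Cl) are matched before single characters, so they are
# never split apart or truncated.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    tokens = SMILES_PATTERN.findall(smiles)
    # Losslessness check: joining the tokens must reproduce the input
    assert ''.join(tokens) == smiles, "tokenizer dropped characters"
    return tokens

sample_smiles = 'CN(CCCNc1nc(Nc2ccc([*:1])cc2)ncc1Br)C(=O)C1CCC1'
tokens = tokenize_smiles(sample_smiles)
print(tokens)  # '[*:1]' and 'Br' appear as single, intact tokens
```

This is only a diagnostic sketch; the actual fix would be for the model's tokenizer vocabulary/pre-tokenization to preserve these tokens.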

Yes, I can confirm it's not working for me either.
