Perovskite Chemical Formula Tokenizer

A BPE tokenizer specialized for perovskite chemical formulas with chemical-aware preprocessing that preserves element boundaries and handles fractional compositions.

Features

  • Chemical-aware tokenization with element boundary preservation
  • Support for 44 symbols: 41 chemical elements plus the organic cations MA, FA, and DMA
  • Handles fractional compositions and complex bracket notation
  • Brackets and decimal points are converted to the special tokens [LB], [RB], and [DOT]

Usage

from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Simple formula
encoding = tokenizer.encode("(MA0.1FA0.9)PbCl3")
print(encoding.tokens)
# ['[LB]', 'MA', '0', '[DOT]', '1', 'FA', '0', '[DOT]', '9', '[RB]', 'Pb', 'Cl', '3']
# With special tokens (if enabled)
# ['[CLS]', '[LB]', 'MA', '0', '[DOT]', '1', 'FA', '0', '[DOT]', '9', '[RB]', 'Pb', 'Cl', '3', '[SEP]']

# Complex formula with fractional composition
encoding = tokenizer.encode("(DMA0.1FA0.9)Pb(Cl0.1Br0.9)3")
print(encoding.tokens)
# ['[LB]', 'DMA', '0', '[DOT]', '1', 'FA', '0', '[DOT]', '9', '[RB]', 'Pb', '[LB]', 'Cl', '0', '[DOT]', '1', 'Br', '0', '[DOT]', '9', '[RB]', '3']
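
To turn a token list like the ones above back into a formula string, the special tokens can be mapped back to their characters. The following is a minimal sketch, not part of the tokenizers library: tokens_to_formula is a hypothetical helper, and it assumes every token is a whole symbol or number, as in the example outputs, with no BPE continuation markers.

```python
# Hypothetical helper (not provided by the tokenizer itself).
# Structural special tokens are mapped back to the characters they replaced.
STRUCTURAL = {"[LB]": "(", "[RB]": ")", "[DOT]": "."}
# Framing and padding tokens carry no formula content and are skipped.
DROPPED = {"[PAD]", "[CLS]", "[SEP]"}


def tokens_to_formula(tokens):
    """Join a token list back into a formula string."""
    parts = []
    for tok in tokens:
        if tok in DROPPED:
            continue
        # Element symbols and digits pass through unchanged.
        parts.append(STRUCTURAL.get(tok, tok))
    return "".join(parts)


tokens = ['[CLS]', '[LB]', 'MA', '0', '[DOT]', '1', 'FA', '0', '[DOT]', '9',
          '[RB]', 'Pb', 'Cl', '3', '[SEP]']
print(tokens_to_formula(tokens))  # (MA0.1FA0.9)PbCl3
```

[UNK] and [MASK] are left in place on purpose, so unrecognized or masked positions stay visible in the reconstructed string.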

Model Specifications

  • Model Type: BPE (Byte-Pair Encoding)
  • Special Tokens: [PAD], [UNK], [CLS], [SEP], [MASK], [DOT], [LB], [RB]
  • Pre-tokenizer: Whitespace
  • Normalizer: NFD

Supported Elements (44)

Organic cations: MA (methylammonium), FA (formamidinium), DMA (dimethylammonium)
Alkali: Li, Na, K, Rb, Cs
Alkaline Earth: Mg, Sr, Ba
Transition / Lanthanide: Ti, Mn, Fe, Co, Ni, Cu, Zn, Y, Nb, Pd, Ag, Cd, La, Yb, Tb, Au, Hg
Post-transition: Ga, In, Sn, Tl, Pb, Bi
Metalloid: Ge, Sb, Te
Halogen: F, Cl, Br, I
Other: P, S, Se

Preprocessing Pipeline

  1. Element identification and separation
  2. Decimal splitting: 0.5 → 0 [DOT] 5
  3. Bracket conversion: ( ... ) → [LB] ... [RB]
  4. BPE tokenization with special tokens
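
Steps 1 to 3 can be approximated in plain Python. This is a sketch under stated assumptions, not the tokenizer's actual preprocessing code: the preprocess function and the regular expression are illustrative, symbols are matched longest-first, and a run of digits is kept as one piece. Whether the published tokenizer.json applies these steps internally is not stated here.

```python
import re

# The 44 symbols listed under "Supported Elements".
SUPPORTED = (
    "MA FA DMA "
    "Li Na K Rb Cs "
    "Mg Sr Ba "
    "Ti Mn Fe Co Ni Cu Zn Y Nb Pd Ag Cd La Yb Tb Au Hg "
    "Ga In Sn Tl Pb Bi "
    "Ge Sb Te "
    "F Cl Br I "
    "P S Se"
).split()

# Assumption: longest symbols are tried first, so "Sn" wins over "S"
# and "FA" wins over "F".
_SYMBOLS = sorted(SUPPORTED, key=len, reverse=True)
_TOKEN_RE = re.compile(
    "(?P<element>" + "|".join(map(re.escape, _SYMBOLS)) + ")"
    r"|(?P<number>\d+)|(?P<dot>\.)|(?P<lb>\()|(?P<rb>\))"
    r"|(?P<space>\s+)|(?P<other>.)"
)
_SPECIAL = {"dot": "[DOT]", "lb": "[LB]", "rb": "[RB]"}


def preprocess(formula: str) -> str:
    """Return the formula as space-separated pieces (illustrative only)."""
    pieces = []
    for match in _TOKEN_RE.finditer(formula):
        kind = match.lastgroup
        if kind == "space":
            continue
        # Dots and brackets become special tokens; everything else,
        # including unrecognized characters, is passed through as-is.
        pieces.append(_SPECIAL.get(kind, match.group()))
    return " ".join(pieces)


print(preprocess("(MA0.1FA0.9)PbCl3"))
# [LB] MA 0 [DOT] 1 FA 0 [DOT] 9 [RB] Pb Cl 3
```

Greedy longest-first matching is a simplification: it reproduces the two examples in the Usage section, but it may split unusual symbol sequences differently from the real tokenizer.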

Known Limitations

  • Recognizes only the 44 predefined symbols listed under Supported Elements
  • No support for charge states or isotope notation
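
Because of these limits, it can help to screen a formula before encoding it. The following is a rough pre-check, assuming the symbol list above is complete; unsupported_fragments is a hypothetical helper, not part of the tokenizer, and it reports whatever text is left after supported symbols, digits, dots, and parentheses are removed.

```python
import re

# The 44 symbols listed under "Supported Elements".
SUPPORTED = set(
    "MA FA DMA Li Na K Rb Cs Mg Sr Ba "
    "Ti Mn Fe Co Ni Cu Zn Y Nb Pd Ag Cd La Yb Tb Au Hg "
    "Ga In Sn Tl Pb Bi Ge Sb Te F Cl Br I P S Se".split()
)

# Longest symbols first, then numbers, dots, parentheses, and whitespace.
_KNOWN_RE = re.compile(
    "|".join(map(re.escape, sorted(SUPPORTED, key=len, reverse=True)))
    + r"|\d+|[.()]|\s+"
)


def unsupported_fragments(formula: str) -> list:
    """Return the pieces of the formula that no supported symbol covers."""
    # Replace every recognized piece with a space, then collect leftovers.
    leftovers = _KNOWN_RE.sub(" ", formula)
    return leftovers.split()


print(unsupported_fragments("Cs2AgBiBr6"))  # []
print(unsupported_fragments("CaTiO3"))      # ['Ca', 'O']
print(unsupported_fragments("Pb2+"))        # ['+']
```

An empty list means every character was accounted for. The check is heuristic: a one-letter symbol can match the start of an unsupported two-letter symbol and leave only a stray lowercase letter behind.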

License

MIT
