Perovskite Chemical Formula Tokenizer
A BPE tokenizer specialized for perovskite chemical formulas with chemical-aware preprocessing that preserves element boundaries and handles fractional compositions.
Features
- Chemical-aware tokenization with element boundary preservation
- Support for 44 symbols: 41 chemical elements plus three organic cations (MA, FA, DMA)
- Handles fractional compositions and complex bracket notation
- Special token conversion for brackets and decimal points
Usage
from tokenizers import Tokenizer
# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
# Simple formula
encoding = tokenizer.encode("(MA0.1FA0.9)PbCl3")
print(encoding.tokens)
# ['[LB]', 'MA', '0', '[DOT]', '1', 'FA', '0', '[DOT]', '9', '[RB]', 'Pb', 'Cl', '3']
# With special tokens (if enabled)
# ['[CLS]', '[LB]', 'MA', '0', '[DOT]', '1', 'FA', '0', '[DOT]', '9', '[RB]', 'Pb', 'Cl', '3', '[SEP]']
# Complex formula with fractional composition
encoding = tokenizer.encode("(DMA0.1FA0.9)Pb(Cl0.1Br0.9)3")
print(encoding.tokens)
# ['[LB]', 'DMA', '0', '[DOT]', '1', 'FA', '0', '[DOT]', '9', '[RB]', 'Pb', '[LB]', 'Cl', '0', '[DOT]', '1', 'Br', '0', '[DOT]', '9', '[RB]', '3']
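To turn a token list like the ones above back into a formula string, the special tokens have to be mapped back to their original characters. The following is a minimal sketch of that inverse step; `tokens_to_formula` is a hypothetical helper written for illustration, not part of the tokenizer, and it assumes that tokens were not merged further by BPE.

```python
# Inverse of the bracket/decimal conversion described in this README.
INVERSE = {"[LB]": "(", "[RB]": ")", "[DOT]": "."}

# Framing tokens that carry no chemical content and are dropped.
SKIP = {"[PAD]", "[CLS]", "[SEP]", "[MASK]"}


def tokens_to_formula(tokens):
    """Join tokens into a formula string (hypothetical helper)."""
    return "".join(INVERSE.get(tok, tok) for tok in tokens if tok not in SKIP)


tokens = ["[CLS]", "[LB]", "MA", "0", "[DOT]", "1", "FA", "0", "[DOT]", "9",
          "[RB]", "Pb", "Cl", "3", "[SEP]"]
print(tokens_to_formula(tokens))
# (MA0.1FA0.9)PbCl3
```

`[UNK]` is deliberately left in the output so that unrecognized input stays visible after the round trip.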
Model Specifications
- Model Type: BPE (Byte-Pair Encoding)
- Special Tokens: [PAD], [UNK], [CLS], [SEP], [MASK], [DOT], [LB], [RB]
- Pre-tokenizer: Whitespace
- Normalizer: NFD
Supported Elements (44)
Organic: MA, FA, DMA
Alkali: Li, Na, K, Rb, Cs
Alkaline Earth: Mg, Sr, Ba
Transition and lanthanide: Ti, Mn, Fe, Co, Ni, Cu, Zn, Y, Nb, Pd, Ag, Cd, La, Yb, Tb, Au, Hg
Post-transition: Ga, In, Sn, Tl, Pb, Bi
Metalloid: Ge, Sb, Te
Halogen: F, Cl, Br, I
Other: P, S, Se
Preprocessing Pipeline
- Element identification and separation
- Decimal splitting: 0.5 → 0 [DOT] 5
- Bracket conversion: ( ) → [LB] ... [RB]
- BPE tokenization with special tokens
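The first three steps can be sketched in plain Python. This is an illustrative reconstruction under stated assumptions, not the tokenizer's actual preprocessing code: `preprocess` is a hypothetical name, the symbol list is copied from the Supported Elements section, and keeping multi-digit runs such as `25` as one piece is an assumption (the examples in this README only show single digits).

```python
import re

# Symbols copied from the "Supported Elements" section of this README.
SYMBOLS = [
    "MA", "FA", "DMA",
    "Li", "Na", "K", "Rb", "Cs",
    "Mg", "Sr", "Ba",
    "Ti", "Mn", "Fe", "Co", "Ni", "Cu", "Zn", "Y", "Nb", "Pd", "Ag", "Cd",
    "La", "Yb", "Tb", "Au", "Hg",
    "Ga", "In", "Sn", "Tl", "Pb", "Bi",
    "Ge", "Sb", "Te",
    "F", "Cl", "Br", "I",
    "P", "S", "Se",
]

# Longest symbols first, so "DMA" is tried before "MA" and "Pb" before "P".
_PATTERN = re.compile(
    "|".join(sorted(SYMBOLS, key=len, reverse=True)) + r"|\d+|\.|\(|\)"
)

# Characters that are rewritten as special tokens.
_SPECIAL = {"(": "[LB]", ")": "[RB]", ".": "[DOT]"}


def preprocess(formula):
    """Return a space-separated string ready for a Whitespace pre-tokenizer."""
    pieces, pos = [], 0
    while pos < len(formula):
        match = _PATTERN.match(formula, pos)
        if match is None:
            raise ValueError(f"unsupported symbol at position {pos}: {formula[pos:]!r}")
        pieces.append(_SPECIAL.get(match.group(), match.group()))
        pos = match.end()
    return " ".join(pieces)


print(preprocess("(MA0.1FA0.9)PbCl3"))
# [LB] MA 0 [DOT] 1 FA 0 [DOT] 9 [RB] Pb Cl 3
```

Raising on unknown input is a design choice of this sketch; the real pipeline may instead let such input fall through to `[UNK]`.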
Known Limitations
- Recognizes only the 44 predefined symbols listed under Supported Elements
- No support for charge states or isotope notation
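Because of the first limitation, it can help to check a formula before encoding it. The sketch below is a rough heuristic, not part of the tokenizer: `unsupported_symbols` is a hypothetical helper, and it assumes element symbols follow the usual pattern of one uppercase letter optionally followed by one lowercase letter.

```python
import re

# The 41 elements from the "Supported Elements" section (organic cations excluded).
SUPPORTED_ELEMENTS = {
    "Li", "Na", "K", "Rb", "Cs", "Mg", "Sr", "Ba",
    "Ti", "Mn", "Fe", "Co", "Ni", "Cu", "Zn", "Y", "Nb", "Pd", "Ag", "Cd",
    "La", "Yb", "Tb", "Au", "Hg",
    "Ga", "In", "Sn", "Tl", "Pb", "Bi", "Ge", "Sb", "Te",
    "F", "Cl", "Br", "I", "P", "S", "Se",
}

# Organic cation abbreviations; "DMA" comes first so it is masked before "MA".
ORGANIC_CATIONS = ("DMA", "MA", "FA")


def unsupported_symbols(formula):
    """List element-like symbols in `formula` that the tokenizer does not know."""
    for cation in ORGANIC_CATIONS:
        formula = formula.replace(cation, " ")
    unknown = []
    for symbol in re.findall(r"[A-Z][a-z]?", formula):
        if symbol not in SUPPORTED_ELEMENTS and symbol not in unknown:
            unknown.append(symbol)
    return unknown


print(unsupported_symbols("CaTiO3"))
# ['Ca', 'O']
```

An empty list means every symbol was recognized. Charge or isotope markers such as `+` or a leading mass number are not detected by this check.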
License
MIT