OGBERT Tokenizer (32K)

A 32,768-token BPE tokenizer for OpenGloss OGBERT embedding models.

Usage

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-32k")

# Encode text into a list of token IDs
tokens = tokenizer.encode("hello world")
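
The encoded IDs map back to text with the same tokenizer; a minimal round-trip sketch (the printed output is illustrative, not guaranteed):

# Decode token IDs back into text, dropping any special tokens
text = tokenizer.decode(tokens, skip_special_tokens=True)
print(text)  # expected: "hello world"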

Details

  • Vocab Size: 32,768 (power of 2)
  • Space Token: ID 32,767 (the last ID in the vocabulary)
  • Special Tokens: IDs 0-6 (<|start|>, <|end|>, <|pad|>, <|unk|>, <|cls|>, <|sep|>, <|mask|>)
  • Training Data: mjbommar/opengloss-v1.1-dictionary
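
These assignments can be checked directly; a minimal sketch, assuming the special tokens are registered under the exact string forms listed above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-32k")

# Total vocabulary size, including special tokens (expected: 32768)
print(len(tokenizer))

# Special tokens should print IDs 0 through 6, per the list above
for token in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))

# The token sitting at the last ID (32767) should be the space token
print(tokenizer.convert_ids_to_tokens(32767))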

Citation

@misc{bommarito2025opengloss,
    title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
    author={Michael J. Bommarito II},
    year={2025},
    eprint={2511.18622},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

License

Apache 2.0
