OGBERT Tokenizer (32K)

A 32,768-token BPE tokenizer for OpenGloss OGBERT embedding models.

Usage

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-32k")

# Encode text into a list of token IDs
tokens = tokenizer.encode("hello world")
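
The encoded IDs map back to text with the same tokenizer; a minimal round-trip sketch (the printed output is illustrative, not guaranteed):

# Decode token IDs back into text, dropping any special tokens
text = tokenizer.decode(tokens, skip_special_tokens=True)
print(text)  # expected: "hello world"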

Details

  • Vocab Size: 32,768 (power of 2)
  • Space Token: ID 32,767 (the last ID in the vocabulary)
  • Special Tokens: IDs 0-6 (<|start|>, <|end|>, <|pad|>, <|unk|>, <|cls|>, <|sep|>, <|mask|>)
  • Training Data: mjbommar/opengloss-v1.1-dictionary
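
These assignments can be checked directly; a minimal sketch, assuming the special tokens are registered under the exact string forms listed above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-32k")

# Total vocabulary size, including special tokens (expected: 32768)
print(len(tokenizer))

# Special tokens should print IDs 0 through 6, per the list above
for token in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))

# The token sitting at the last ID (32767) should be the space token
print(tokenizer.convert_ids_to_tokens(32767))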

Citation

@misc{bommarito2025opengloss,
    title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
    author={Michael J. Bommarito II},
    year={2025},
    eprint={2511.18622},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

License

Apache 2.0
