# OGBERT Tokenizer (8K)

An 8,192-token BPE tokenizer for OpenGloss OGBERT embedding models.
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-8k")
tokens = tokenizer.encode("hello world")
```
## Details

- Vocab Size: 8,192 (a power of 2)
- Space Token: ID 8191
- Special Tokens: IDs 0-6 (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
- Training Data: mjbommar/opengloss-v1.1-dictionary
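The ID layout above can be sketched as a plain mapping. This is an illustration of the IDs stated in this card, not loaded from the tokenizer files themselves (which remain the authoritative source):

```python
# Token-ID layout as described in the model card (illustrative sketch).
SPECIAL_TOKENS = {
    "<|start|>": 0,
    "<|end|>": 1,
    "<|pad|>": 2,
    "<|unk|>": 3,
    "<|cls|>": 4,
    "<|sep|>": 5,
    "<|mask|>": 6,
}
VOCAB_SIZE = 8192       # 2**13, keeping the embedding table power-of-2 aligned
SPACE_TOKEN_ID = 8191   # the last ID in the vocabulary

# Sanity checks on the layout stated above.
assert VOCAB_SIZE == 2 ** 13
assert SPACE_TOKEN_ID == VOCAB_SIZE - 1
assert sorted(SPECIAL_TOKENS.values()) == list(range(7))
```

Specials at the low end (0-6) and the space token pinned at the top (8191) leave IDs 7-8190 for ordinary BPE merges.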
## Citation

```bibtex
@misc{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Michael J. Bommarito II},
  year={2025},
  eprint={2511.18622},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
## License

Apache 2.0