WikiText-2 BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on the WikiText-2 dataset.

Model Details

  • Vocabulary Size: 30,000 tokens
  • Training Data: WikiText-2 (Salesforce/wikitext)
  • Special Tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
  • Compression Ratio: ~6.4 characters per token
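
The compression ratio above is the average number of characters per token. The snippet below is a minimal sketch of how such a figure could be computed; it is not necessarily how the reported value was measured. The helper name chars_per_token is hypothetical, whitespace is assumed to count as characters, and a whitespace splitter stands in for the real tokenizer so the example runs offline.

```python
from typing import Callable, Iterable, List


def chars_per_token(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of characters per token over a collection of texts."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)             # assumption: whitespace counts as characters
        total_tokens += len(tokenize(text))
    if total_tokens == 0:
        raise ValueError("no tokens produced; cannot compute a ratio")
    return total_chars / total_tokens


# Stand-in tokenizer for illustration only: split on whitespace.
sample = ["hello world", "foo bar baz"]
ratio = chars_per_token(sample, str.split)
```

To measure the trained tokenizer instead, pass its tokenize method as the second argument and use held-out WikiText-2 text as the input.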

Usage

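from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Rogarcia18/wikitext2-bpe-tokenizer")
# Split a sample sentence into BPE tokens
tokens = tokenizer.tokenize("A Byte Pair Encoding tokenizer trained on WikiText-2.")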

Training Details

  • Dataset: WikiText-2 (wikitext-2-v1)
  • Preprocessing: Deduplication, removal, whitespace normalization, dropping samples shorter than 10 characters
  • Architecture: BPE, trained with the Hugging Face tokenizers library
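
The preprocessing steps listed above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual training script: the function name preprocess, the order of the steps, and the sample strings are all assumptions, and the unspecified "removal" step is left out because the source does not say what was removed.

```python
import re
from typing import Iterable, List


def preprocess(samples: Iterable[str], min_chars: int = 10) -> List[str]:
    """Normalize whitespace, drop short samples, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for sample in samples:
        # Whitespace normalization: collapse runs of whitespace and trim the ends.
        text = re.sub(r"\s+", " ", sample).strip()
        # Drop samples with fewer than min_chars characters.
        if len(text) < min_chars:
            continue
        # Deduplication: keep only the first occurrence of each text.
        # (Assumption: duplicates are detected after normalization.)
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


# Made-up sample lines, for illustration only.
raw = [
    " = Example Heading Line = ",
    "The  tokenizer   was trained on text .",
    "The tokenizer was trained on text .",
    "",
    "short",
]
result = preprocess(raw)
```

The cleaned list could then be passed to a BPE trainer from the tokenizers library as an iterator of strings.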