|
|
--- |
|
|
language: en |
|
|
license: apache-2.0 |
|
|
--- |
|
|
|
|
|
# BERT Hash Nano Models |
|
|
|
|
|
This is a set of 3 Nano [BERT](https://arxiv.org/abs/1810.04805) models with a modified embeddings layer. The embeddings layer takes the same BERT vocabulary (30,522 tokens), projects it into a smaller dimensional space and then re-encodes it to the hidden size. This method is inspired by [MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings](https://arxiv.org/abs/2405.19504).
|
|
|
|
|
The projected vector acts like a hash. Setting the `projections` parameter to 5 is like generating a 160-bit hash (5 x float32 = 160 bits) for each token. That hash is then projected to the hidden size.
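
As a rough illustration, this style of embeddings layer can be sketched in PyTorch as a small per-token embedding (the hash) followed by a shared linear projection back to the hidden size. The sketch below is a minimal, hypothetical example for intuition; the class and attribute names are illustrative and not taken from the actual modeling code.

```python
import torch
from torch import nn

class HashTokenEmbeddings(nn.Module):
    """Minimal sketch of hash-style token embeddings (illustrative only)."""

    def __init__(self, vocab_size=30522, projections=5, hidden_size=768):
        super().__init__()

        # Each token id maps to a small "hash" vector with `projections` values
        self.hash = nn.Embedding(vocab_size, projections)

        # A shared projection re-encodes the hash to the model's hidden size
        self.projection = nn.Linear(projections, hidden_size, bias=False)

    def forward(self, input_ids):
        return self.projection(self.hash(input_ids))

embeddings = HashTokenEmbeddings()
tokens = torch.randint(0, 30522, (1, 8))
print(embeddings(tokens).shape)  # torch.Size([1, 8, 768])
```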
|
|
|
|
|
This significantly reduces the number of parameters necessary for token embeddings. |
|
|
|
|
|
For example: |
|
|
|
|
|
Standard token embeddings: |
|
|
- 30,522 (vocab size) x 768 (hidden size) = 23,440,896 parameters |
|
|
- 23,440,896 x 4 (float32) = 93,763,584 bytes |
|
|
|
|
|
Hash token embeddings: |
|
|
- 30,522 (vocab size) x 5 (hash buckets) + 5 x 768 (projection matrix) = 156,450 parameters
|
|
- 156,450 x 4 (float32) = 625,800 bytes |
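
The savings can be verified with a few lines of Python that mirror the arithmetic above:

```python
# Embedding parameter counts and sizes at float32 (4 bytes per parameter)
standard = 30522 * 768          # 23,440,896 parameters
hashed = 30522 * 5 + 5 * 768    # 156,450 parameters

print(standard * 4, "bytes")    # 93,763,584 bytes
print(hashed * 4, "bytes")      # 625,800 bytes
```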
|
|
|
|
|
These models are pre-trained on the same training corpus as BERT (with a copy of Wikipedia from 2025) as recommended in the paper [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). |
|
|
|
|
|
Below is a subset of GLUE scores on the dev set using the [script provided by Hugging Face Transformers](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py) with the following parameters. |
|
|
|
|
|
```bash
python run_glue.py \
  --model_name_or_path <model path> \
  --task_name <task name> \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 1e-4 \
  --num_train_epochs 4 \
  --output_dir outputs \
  --trust_remote_code True
```
|
|
|
|
|
| Model | Parameters | MNLI (acc m/mm) | MRPC (f1/acc) | SST-2 (acc) |
| ----- | ---------- | --------------- | ------------- | ----------- |
| [baseline (bert-tiny)](https://hf.co/google/bert_uncased_L-2_H-128_A-2) | 4.4M | 0.7114 / 0.7161 | 0.8318 / 0.7353 | 0.8222 |
| [bert-hash-femto](https://hf.co/neuml/bert-hash-femto) | 0.243M | 0.5697 / 0.5750 | 0.8122 / 0.6838 | 0.7821 |
| [bert-hash-pico](https://hf.co/neuml/bert-hash-pico) | 0.448M | 0.6228 / 0.6363 | 0.8205 / 0.7083 | 0.7878 |
| [**bert-hash-nano**](https://hf.co/neuml/bert-hash-nano) | **0.969M** | **0.6565 / 0.6670** | **0.8172 / 0.7083** | **0.8131** |
|
|
|
|
|
## Usage |
|
|
|
|
|
These models can be loaded with Hugging Face Transformers as follows. Note that since this is a custom architecture, `trust_remote_code` must be set.
|
|
|
|
|
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("neuml/bert-hash-nano", trust_remote_code=True)
```
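
The loaded model works like a standard BERT encoder. Below is a minimal usage sketch, assuming the `bert-base-uncased` tokenizer used during training (see the training section below) and standard BERT-style outputs with `last_hidden_state`; if the model repository ships its own tokenizer files, load the tokenizer from the model path instead.

```python
import torch

from transformers import AutoModel, AutoTokenizer

# Tokenizer used during pre-training (see the training example below)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("neuml/bert-hash-nano", trust_remote_code=True)

# Encode text and collect the final hidden states
inputs = tokenizer("BERT Hash Nano models pack a punch", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)
```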
|
|
|
|
|
## Training |
|
|
|
|
|
Training your own Nano model is simple. All you need is a Hugging Face dataset and the code below using [txtai](https://github.com/neuml/txtai). |
|
|
|
|
|
```python
from datasets import concatenate_datasets, load_dataset
from transformers import AutoTokenizer

from txtai.pipeline import HFTrainer

from configuration_bert_hash import *
from modeling_bert_hash import *

dataset = load_dataset("path to target HF dataset")

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

config = BertHashConfig(
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=512,
    projections=16
)
model = BertHashForMaskedLM(config)

print(config)
print("Total parameters:", sum(p.numel() for p in model.bert.parameters()))

train = HFTrainer()

# Train using MLM
train((model, tokenizer), dataset, task="language-modeling", output_dir="model",
      fp16=True, learning_rate=1e-3, per_device_train_batch_size=64, num_train_epochs=3,
      warmup_steps=2500, weight_decay=0.01, adam_epsilon=1e-6,
      tokenizers=True, dataloader_num_workers=20,
      save_strategy="steps", save_steps=5000, logging_steps=500,
)
```
|
|
|
|
|
## Future Work |
|
|
|
|
|
These models demonstrate that smaller models can still be productive.
|
|
|
|
|
The hope is that this work opens the door for many to build small encoder models that pack a punch. These models can be trained in a matter of hours on consumer GPUs.
|
|
|
|
|
Imagine more specialized models like this for medical, legal, scientific and other domains.
|
|
|
|
|
## More Information |
|
|
|
|
|
Read more about this model and how it was built in [this article](https://medium.com/neuml/training-tiny-language-models-with-token-hashing-b744aa7eb931). |
|
|
|