Model Card for starmpcc/NoPE_1.5B_FW_EDU_15T

This model is the official checkpoint accompanying the paper "Behind RoPE: How Does Causal Mask Encode Positional Information?".

The model is trained without any explicit positional encoding (also known as NoPE).

It is based on the Llama-3 architecture, has 1.5 billion parameters, and was trained on 15 trillion tokens from the FineWeb-Edu dataset.

Model Training

The model is based on the Llama-3 architecture with the positional encoding (RoPE) removed. It has 1.5 billion parameters and was trained on 15 trillion tokens from the deduplicated version of the FineWeb-Edu dataset, with a maximum sequence length of 1,024. Further training details are provided in the accompanying paper.
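
For intuition, here is a minimal sketch (not the training code) of causal self-attention with no positional encoding: queries and keys are used as-is, without RoPE or any learned positional embedding, so the causal mask is the only positional signal available to the model. The function name and tensor shapes are illustrative.

import torch
import torch.nn.functional as F

def nope_causal_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim). No rotary or learned
    # positional encoding is applied; the causal mask below is the only
    # source of positional information, as in a NoPE transformer.
    seq_len, head_dim = q.size(-2), q.size(-1)
    causal = torch.tril(torch.ones(seq_len, seq_len, device=q.device)).bool()
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    scores = scores.masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
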

Model Sources

Paper: https://arxiv.org/abs/2509.21042

Uses

The example below loads the checkpoint with Hugging Face Transformers and disables RoPE by monkey-patching apply_rotary_pos_emb, since the model was trained without positional encoding.

import torch
from transformers import AutoTokenizer, LlamaForCausalLM
import transformers.models.llama.modeling_llama as modeling_llama

# Replace the rotary-embedding application with a no-op so queries and keys
# pass through unrotated, i.e., the model runs without RoPE (NoPE).
def noop_apply_rotary_pos_emb(q, k, *args, **kwargs):
    return q, k

modeling_llama.apply_rotary_pos_emb = noop_apply_rotary_pos_emb

model = LlamaForCausalLM.from_pretrained(
    "starmpcc/NoPE_1.5B_FW_EDU_15T",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# AutoTokenizer loads whichever tokenizer class the checkpoint specifies.
tokenizer = AutoTokenizer.from_pretrained("starmpcc/NoPE_1.5B_FW_EDU_15T")
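
A minimal generation sketch with the loaded model and tokenizer; the prompt is arbitrary, and inputs should stay within the 1,024-token training context.

prompt = "The causal mask in a transformer"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))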

Citation

@misc{kim2025ropedoescausalmask,
      title={Behind RoPE: How Does Causal Mask Encode Positional Information?}, 
      author={Junu Kim and Xiao Liu and Zhenghao Lin and Lei Ji and Yeyun Gong and Edward Choi},
      year={2025},
      eprint={2509.21042},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.21042}, 
}