---
license: mit
datasets:
- opendatalab/SlimPajama-Meta-rater
language:
- en
---

# PRRC-Cleanliness Language Model (1.3B Parameters, 30B Tokens)

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This is a 1.3B-parameter, decoder-only transformer language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the **Cleanliness** dimension of the PRRC framework. The training data was curated by selecting texts with high cleanliness scores, focusing on well-formatted, complete, and noise-free content.

## Model Details

- **Architecture**: Transformer decoder-only
- **Parameters**: 1.345B (1,345,423,360 parameters)
- **Training Tokens**: 30B tokens
- **Context Window**: 1,024 tokens
- **Vocabulary Size**: 32,000 (LLaMA tokenizer)
- **Data Selection Method**: Top-k selection based on Cleanliness scores
- **Rating Model**: ModernBERT-base fine-tuned for Cleanliness assessment

## Architecture Specifications

- **Hidden Dimension**: 2,048
- **Number of Layers**: 24
- **Attention Heads**: 16
- **Key-Value Heads**: 16
- **MLP Ratio**: 8/3
- **Position Encoding**: RoPE (base=10,000)
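For reference, the specifications above roughly correspond to the configuration sketched below. This is an illustration only: the use of a LLaMA-style config class and the rounding of the MLP intermediate size (2,048 × 8/3 ≈ 5,461) are assumptions, not values read from the released checkpoint.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Illustrative, LLaMA-style configuration matching the listed specs.
# The actual repository may use a different architecture class and a
# different rounding of the intermediate (MLP) size.
config = LlamaConfig(
    vocab_size=32000,              # LLaMA tokenizer vocabulary
    hidden_size=2048,              # hidden dimension
    intermediate_size=5461,        # ~8/3 * hidden_size (rounding assumed)
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=16,        # full multi-head attention (no GQA)
    max_position_embeddings=1024,  # context window
    rope_theta=10000.0,            # RoPE base
)

model = LlamaForCausalLM(config)   # randomly initialized, for size-checking only
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```

With these settings the parameter count lands close to the reported 1,345,423,360; small differences come from the assumed intermediate-size rounding.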

## Data Selection Criteria

The training data was selected using the Cleanliness rating model, which evaluates the following criteria (a selection sketch follows the lists below):
- **Correct Formatting**: Human-edited appearance without corrupted characters
- **Appropriate Content**: No irrelevant links, advertisements, or spam
- **Content Completeness**: Complete sentences and coherent structure
- **Structural Integrity**: Proper organization and layout
- **Noise Reduction**: Minimal irrelevant or distracting elements

Selected texts typically include:
- Well-formatted articles and documents
- Clean editorial content
- Professional publications
- Quality web content without artifacts
- Properly structured educational materials
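The score-then-select step can be sketched as follows. This is a minimal illustration, not the released pipeline: `select_top_k`, `score_fn`, and `count_tokens` are hypothetical names, and the toy scorer stands in for the fine-tuned ModernBERT-base rater; only the "keep the highest-scoring documents until the token budget is met" logic reflects the method described here.

```python
# Minimal sketch of top-k data selection by cleanliness score.
def select_top_k(documents, score_fn, token_budget, count_tokens=len):
    """Keep the highest-scoring documents until the token budget is met."""
    selected, total = [], 0
    for doc in sorted(documents, key=score_fn, reverse=True):
        n = count_tokens(doc)
        if total + n > token_budget:
            break
        selected.append(doc)
        total += n
    return selected

# Toy usage with a dummy scorer; in practice score_fn would wrap the
# Cleanliness rating model and count_tokens a real tokenizer.
docs = ["A complete, well-formatted paragraph.", "click here!!! <div> ad ad ad"]
subset = select_top_k(docs, score_fn=lambda d: -d.count("!"), token_budget=40)
print(subset)  # keeps only the cleaner document
```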

## Training Details

- **Hardware**: 32x NVIDIA A800 GPUs
- **Global Batch Size**: 4,194,304 tokens
- **Learning Rate**: 5e-5
- **Optimizer**: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- **Training Time**: ~14 hours
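For intuition, the figures above imply the step counts below. This is simple arithmetic from the numbers in this card, assuming every sequence is packed to the full 1,024-token context.

```python
# Back-of-the-envelope training arithmetic from the figures above.
context_window = 1_024              # tokens per sequence
global_batch_tokens = 4_194_304     # tokens per optimizer step
total_tokens = 30_000_000_000       # training budget

sequences_per_step = global_batch_tokens // context_window   # 4,096
optimizer_steps = total_tokens / global_batch_tokens          # ~7,153

print(f"{sequences_per_step:,} sequences per step")
print(f"~{optimizer_steps:,.0f} optimizer steps")
```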

## Performance Results

### Downstream Task Performance (Average Accuracy)

- **General Knowledge**: 56.45% (+3.66% vs Random)
  - ARC-Easy: 56.89%
  - ARC-Challenge: 27.65%
  - SciQ: 84.80%

- **Commonsense Reasoning**: 44.88% (+0.94% vs Random)
  - HellaSwag: 40.34%
  - SIQA: 41.97%
  - WinoGrande: 52.33%

- **Reading Comprehension**: 30.72% (+0.70% vs Random)
  - RACE: 30.24%
  - OpenbookQA: 31.20%

- **Overall Average**: 45.68% (+1.90% vs Random)

## Key Findings

- **Strong General Knowledge**: Significant improvement in knowledge-based tasks
- **Formatting Benefits**: Clean, well-structured training data improves model output quality
- **Noise Reduction**: Elimination of web artifacts and spam improves learning efficiency
- **Structural Quality**: Better understanding of proper text organization and flow

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-1b-cleanliness"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)  # repo is tagged custom_code

# Generate text (particularly good for clean, well-formatted content)
prompt = "Here are the key points to consider:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
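Note that `max_length` counts prompt tokens as well as generated ones; pass `max_new_tokens=100` instead if you want a fixed number of newly generated tokens. Since this is a base (pre-trained only) checkpoint with a 1,024-token context window, plain text-continuation prompts work better than chat- or instruction-style prompts.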

## Applications

This model is particularly well-suited for:
- **Content generation** requiring clean formatting
- **Document creation** and professional writing
- **Web content** development without artifacts
- **Educational materials** with proper structure
- **Clean text processing** applications
- **Data preprocessing** and cleaning tasks
- **Quality content** creation for publications

## Strengths

- Generates well-formatted and clean text output
- Strong performance on knowledge-intensive tasks
- Reduced likelihood of producing noisy or corrupted text
- Better understanding of proper document structure
- Enhanced ability to maintain content organization
- Improved resistance to format-related errors

## Limitations

- May prioritize format over content depth in some cases
- Could be overly conservative in text generation
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- May avoid creative formatting that could be beneficial

## Data Quality Impact

This model demonstrates the importance of clean training data:
- **Artifact Removal**: Training on clean data reduces model exposure to web scraping artifacts
- **Structural Learning**: Well-formatted input leads to better-structured output
- **Noise Resistance**: Lower exposure to irrelevant content improves focus
- **Professional Standards**: Training on quality content improves output professionalism

## Comparison with Baselines

- **vs Random Baseline**: +1.90% overall, with strongest gains in General Knowledge (+3.66%)
- **vs Other PRRC Dimensions**: Competitive performance with focus on content quality
- **vs Meta-rater All (25)**: Demonstrates the individual contribution of data cleanliness

## Quality Characteristics

This model excels at producing:
- **Clean Formatting**: Proper structure and organization
- **Complete Content**: Full sentences and coherent paragraphs
- **Professional Appearance**: Business and academic writing standards
- **Artifact-Free Text**: No web scraping remnants or corrupted characters
- **Consistent Structure**: Logical flow and proper segmentation

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.