---
license: mit
datasets:
- opendatalab/SlimPajama-Meta-rater
language:
- en
---

# PRRC-Cleanliness Language Model (1.3B Parameters, 30B Tokens)

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This is a 1.3B-parameter, decoder-only transformer language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the **Cleanliness** dimension of the PRRC framework. The training data was curated by selecting texts with high cleanliness scores, focusing on well-formatted, complete, and noise-free content.

## Model Details

- **Architecture**: Transformer decoder-only
- **Parameters**: 1.345B (1,345,423,360 parameters)
- **Training Tokens**: 30B tokens
- **Context Window**: 1,024 tokens
- **Vocabulary Size**: 32,000 (LLaMA tokenizer)
- **Data Selection Method**: Top-k selection based on Cleanliness scores
- **Rating Model**: ModernBERT-base fine-tuned for Cleanliness assessment

## Architecture Specifications

- **Hidden Dimension**: 2,048
- **Number of Layers**: 24
- **Attention Heads**: 16
- **Key-Value Heads**: 16
- **MLP Ratio**: 8/3
- **Position Encoding**: RoPE (base=10,000)
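For reference, the specifications above roughly correspond to the configuration sketched below. This is an illustration only: the use of a LLaMA-style config class and the rounding of the MLP intermediate size (2,048 × 8/3 ≈ 5,461) are assumptions, not values read from the released checkpoint.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Illustrative, LLaMA-style configuration matching the listed specs.
# The actual repository may use a different architecture class and a
# different rounding of the intermediate (MLP) size.
config = LlamaConfig(
    vocab_size=32000,              # LLaMA tokenizer vocabulary
    hidden_size=2048,              # hidden dimension
    intermediate_size=5461,        # ~8/3 * hidden_size (rounding assumed)
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=16,        # full multi-head attention (no GQA)
    max_position_embeddings=1024,  # context window
    rope_theta=10000.0,            # RoPE base
)

model = LlamaForCausalLM(config)   # randomly initialized, for size-checking only
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```

With these settings the parameter count lands close to the reported 1,345,423,360; small differences come from the assumed intermediate-size rounding.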

## Data Selection Criteria

The training data was selected using the Cleanliness rating model, which evaluates the following criteria (a selection sketch follows the lists below):
- **Correct Formatting**: Human-edited appearance without corrupted characters
- **Appropriate Content**: No irrelevant links, advertisements, or spam
- **Content Completeness**: Complete sentences and coherent structure
- **Structural Integrity**: Proper organization and layout
- **Noise Reduction**: Minimal irrelevant or distracting elements

Selected texts typically include:
- Well-formatted articles and documents
- Clean editorial content
- Professional publications
- Quality web content without artifacts
- Properly structured educational materials
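The score-then-select step can be sketched as follows. This is a minimal illustration, not the released pipeline: `select_top_k`, `score_fn`, and `count_tokens` are hypothetical names, and the toy scorer stands in for the fine-tuned ModernBERT-base rater; only the "keep the highest-scoring documents until the token budget is met" logic reflects the method described here.

```python
# Minimal sketch of top-k data selection by cleanliness score.
def select_top_k(documents, score_fn, token_budget, count_tokens=len):
    """Keep the highest-scoring documents until the token budget is met."""
    selected, total = [], 0
    for doc in sorted(documents, key=score_fn, reverse=True):
        n = count_tokens(doc)
        if total + n > token_budget:
            break
        selected.append(doc)
        total += n
    return selected

# Toy usage with a dummy scorer; in practice score_fn would wrap the
# Cleanliness rating model and count_tokens a real tokenizer.
docs = ["A complete, well-formatted paragraph.", "click here!!! <div> ad ad ad"]
subset = select_top_k(docs, score_fn=lambda d: -d.count("!"), token_budget=40)
print(subset)  # keeps only the cleaner document
```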

## Training Details

- **Hardware**: 32x NVIDIA A800 GPUs
- **Global Batch Size**: 4,194,304 tokens
- **Learning Rate**: 5e-5
- **Optimizer**: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- **Training Time**: ~14 hours
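For intuition, the figures above imply the step counts below. This is simple arithmetic from the numbers in this card, assuming every sequence is packed to the full 1,024-token context.

```python
# Back-of-the-envelope training arithmetic from the figures above.
context_window = 1_024              # tokens per sequence
global_batch_tokens = 4_194_304     # tokens per optimizer step
total_tokens = 30_000_000_000       # training budget

sequences_per_step = global_batch_tokens // context_window   # 4,096
optimizer_steps = total_tokens / global_batch_tokens          # ~7,153

print(f"{sequences_per_step:,} sequences per step")
print(f"~{optimizer_steps:,.0f} optimizer steps")
```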

## Performance Results

### Downstream Task Performance (Average Accuracy)

- **General Knowledge**: 56.45% (+3.66% vs Random)
  - ARC-Easy: 56.89%
  - ARC-Challenge: 27.65%
  - SciQ: 84.80%

- **Commonsense Reasoning**: 44.88% (+0.94% vs Random)
  - HellaSwag: 40.34%
  - SIQA: 41.97%
  - WinoGrande: 52.33%

- **Reading Comprehension**: 30.72% (+0.70% vs Random)
  - RACE: 30.24%
  - OpenbookQA: 31.20%

- **Overall Average**: 45.68% (+1.90% vs Random)

## Key Findings

- **Strong General Knowledge**: Significant improvement in knowledge-based tasks
- **Formatting Benefits**: Clean, well-structured training data improves model output quality
- **Noise Reduction**: Elimination of web artifacts and spam improves learning efficiency
- **Structural Quality**: Better understanding of proper text organization and flow

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-1b-cleanliness"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)  # repo is tagged custom_code

# Generate text (particularly good for clean, well-formatted content)
prompt = "Here are the key points to consider:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
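Note that `max_length` counts prompt tokens as well as generated ones; pass `max_new_tokens=100` instead if you want a fixed number of newly generated tokens. Since this is a base (pre-trained only) checkpoint with a 1,024-token context window, plain text-continuation prompts work better than chat- or instruction-style prompts.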

## Applications

This model is particularly well-suited for:
- **Content generation** requiring clean formatting
- **Document creation** and professional writing
- **Web content** development without artifacts
- **Educational materials** with proper structure
- **Clean text processing** applications
- **Data preprocessing** and cleaning tasks
- **Quality content** creation for publications

## Strengths

- Generates well-formatted and clean text output
- Strong performance on knowledge-intensive tasks
- Reduced likelihood of producing noisy or corrupted text
- Better understanding of proper document structure
- Enhanced ability to maintain content organization
- Improved resistance to format-related errors

## Limitations

- May prioritize format over content depth in some cases
- Could be overly conservative in text generation
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- May avoid creative formatting that could be beneficial

## Data Quality Impact

This model demonstrates the importance of clean training data:
- **Artifact Removal**: Training on clean data reduces model exposure to web scraping artifacts
- **Structural Learning**: Well-formatted input leads to better-structured output
- **Noise Resistance**: Lower exposure to irrelevant content improves focus
- **Professional Standards**: Training on quality content improves output professionalism

## Comparison with Baselines

- **vs Random Baseline**: +1.90% overall, with strongest gains in General Knowledge (+3.66%)
- **vs Other PRRC Dimensions**: Competitive performance with focus on content quality
- **vs Meta-rater All (25)**: Demonstrates the individual contribution of data cleanliness

## Quality Characteristics

This model excels at producing:
- **Clean Formatting**: Proper structure and organization
- **Complete Content**: Full sentences and coherent paragraphs
- **Professional Appearance**: Business and academic writing standards
- **Artifact-Free Text**: No web scraping remnants or corrupted characters
- **Consistent Structure**: Logical flow and proper segmentation

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.