Models
Datasets
Spaces
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2506.05209

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 63
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 56

MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

Paper • 2405.07526 • Published May 13, 2024 • 21
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Paper • 2405.15613 • Published May 24, 2024 • 17
A Touch, Vision, and Language Dataset for Multimodal Alignment

Paper • 2402.13232 • Published Feb 20, 2024 • 16
How Do Large Language Models Acquire Factual Knowledge During Pretraining?

Paper • 2406.11813 • Published Jun 17, 2024 • 31

LLM Pruning and Distillation in Practice: The Minitron Approach

Paper • 2408.11796 • Published Aug 21, 2024 • 57
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Paper • 2408.09174 • Published Aug 17, 2024 • 52
To Code, or Not To Code? Exploring Impact of Code in Pre-training

Paper • 2408.10914 • Published Aug 20, 2024 • 44
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

Paper • 2408.11878 • Published Aug 20, 2024 • 63

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 63
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 56

LLM Pruning and Distillation in Practice: The Minitron Approach

Paper • 2408.11796 • Published Aug 21, 2024 • 57
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Paper • 2408.09174 • Published Aug 17, 2024 • 52
To Code, or Not To Code? Exploring Impact of Code in Pre-training

Paper • 2408.10914 • Published Aug 20, 2024 • 44
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

Paper • 2408.11878 • Published Aug 20, 2024 • 63

MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

Paper • 2405.07526 • Published May 13, 2024 • 21
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Paper • 2405.15613 • Published May 24, 2024 • 17
A Touch, Vision, and Language Dataset for Multimodal Alignment

Paper • 2402.13232 • Published Feb 20, 2024 • 16
How Do Large Language Models Acquire Factual Knowledge During Pretraining?

Paper • 2406.11813 • Published Jun 17, 2024 • 31

Previous
1
2
Next

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs