Collections
Discover the best community collections!
Collections including paper arxiv:2401.02385
-
Llemma: An Open Language Model For Mathematics
Paper • 2310.10631 • Published • 56
Mistral 7B
Paper • 2310.06825 • Published • 55
Qwen Technical Report
Paper • 2309.16609 • Published • 37
BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model
Paper • 2309.11568 • Published • 11
-
In-Context Pretraining: Language Modeling Beyond Document Boundaries
Paper • 2310.10638 • Published • 30
Magicoder: Source Code Is All You Need
Paper • 2312.02120 • Published • 82
Parameter Efficient Tuning Allows Scalable Personalization of LLMs for Text Entry: A Case Study on Abbreviation Expansion
Paper • 2312.14327 • Published • 8
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation
Paper • 2312.14187 • Published • 50
-
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Paper • 2309.04662 • Published • 24
Neurons in Large Language Models: Dead, N-gram, Positional
Paper • 2309.04827 • Published • 17
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
Paper • 2309.05516 • Published • 10
DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs
Paper • 2309.03907 • Published • 12
-
Attention Is All You Need
Paper • 1706.03762 • Published • 99
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Paper • 2307.08691 • Published • 9
Mixtral of Experts
Paper • 2401.04088 • Published • 161
Mistral 7B
Paper • 2310.06825 • Published • 55
-
NExT-GPT: Any-to-Any Multimodal LLM
Paper • 2309.05519 • Published • 78
Large Language Model for Science: A Study on P vs. NP
Paper • 2309.05689 • Published • 21
AstroLLaMA: Towards Specialized Foundation Models in Astronomy
Paper • 2309.06126 • Published • 18
Large Language Models for Compiler Optimization
Paper • 2309.07062 • Published • 24
-
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Paper • 2310.10837 • Published • 11
BitNet: Scaling 1-bit Transformers for Large Language Models
Paper • 2310.11453 • Published • 105
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Paper • 2310.16795 • Published • 27
LLM-FP4: 4-Bit Floating-Point Quantized Transformers
Paper • 2310.16836 • Published • 14
-
togethercomputer/StripedHyena-Hessian-7B
Text Generation • 8B • Updated • 20 • 66
Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
Paper • 2312.08618 • Published • 15
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Paper • 2312.07987 • Published • 41
LLM360: Towards Fully Transparent Open-Source LLMs
Paper • 2312.06550 • Published • 57