stereoplegic's Collections

Quantization
- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models (arXiv:2310.08659)
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models (arXiv:2309.14717)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models (arXiv:2309.02784)
- ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers (arXiv:2309.16119)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models (arXiv:2308.13137)
- FPTQ: Fine-grained Post-Training Quantization for Large Language Models (arXiv:2308.15987)
- QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models (arXiv:2310.16795)
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers (arXiv:2310.16836)
- Microscaling Data Formats for Deep Learning (arXiv:2310.10537)
- DeepliteRT: Computer Vision at the Edge (arXiv:2309.10878)
- Efficient Post-training Quantization with FP8 Formats (arXiv:2309.14592)
- NUPES: Non-Uniform Post-Training Quantization via Power Exponent Search (arXiv:2308.05600)
- BitNet: Scaling 1-bit Transformers for Large Language Models (arXiv:2310.11453)
- Understanding the Impact of Post-Training Quantization on Large Language Models (arXiv:2309.05210)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs (arXiv:2308.09723)
- Softmax Bias Correction for Quantized Generative Models (arXiv:2309.01729)
- Training and inference of large language models using 8-bit floating point (arXiv:2309.17224)
- TEQ: Trainable Equivalent Transformation for Quantization of LLMs (arXiv:2310.10944)
- QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models (arXiv:2310.08041)
- Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs (arXiv:2309.05516)
- PB-LLM: Partially Binarized Large Language Models (arXiv:2310.00034)
- Towards End-to-end 4-Bit Inference on Generative Large Language Models (arXiv:2310.09259)
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt (arXiv:2305.11186)
- MEMORY-VQ: Compression for Tractable Internet-Scale Memory (arXiv:2308.14903)
- FP8-LM: Training FP8 Large Language Models (arXiv:2310.18313)
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving (arXiv:2310.19102)
- QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314; see the loading sketch after this list)
- A Survey on Model Compression for Large Language Models (arXiv:2308.07633)
- REx: Data-Free Residual Quantization Error Expansion (arXiv:2203.14645)
- Data-Free Quantization with Accurate Activation Clipping and Adaptive Batch Normalization (arXiv:2204.04215)
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models (arXiv:2305.17888)
- Token-Scaled Logit Distillation for Ternary Weight Generative Language Models (arXiv:2308.06744)
- Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders (arXiv:2211.11014)
- Quantized Feature Distillation for Network Quantization (arXiv:2307.10638)
- Model compression via distillation and quantization (arXiv:1802.05668)
- Adaptive Precision Training (AdaPT): A dynamic fixed point quantized training approach for DNNs (arXiv:2107.13490)
- Feature Affinity Assisted Knowledge Distillation and Quantization of Deep Neural Networks on Label-Free Data (arXiv:2302.10899)
- Compressing LLMs: The Truth is Rarely Pure and Never Simple (arXiv:2310.01382)
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing (arXiv:2306.12929)
- Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling (arXiv:2304.09145)
- LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression (arXiv:2309.14021)
- Prune Once for All: Sparse Pre-Trained Language Models (arXiv:2111.05754)
- eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models (arXiv:2309.00964)
- Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (arXiv:2310.02410)
- SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics (arXiv:2305.18513)
- NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers (arXiv:2211.16056)
- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning (arXiv:2311.12023)
- Blockwise Compression of Transformer-based Models without Retraining (arXiv:2304.01483)
- Towards Fine-tuning Pre-trained Language Models with Integer Forward and Backward Propagation (arXiv:2209.09815)
- Learning Low-Rank Representations for Model Compression (arXiv:2211.11397)
- Ada-QPacknet -- adaptive pruning with bit width reduction as an efficient continual learning method without forgetting (arXiv:2308.07939)
- Efficient Storage of Fine-Tuned Models via Low-Rank Approximation of Weight Residuals (arXiv:2305.18425)
- BitDelta: Your Fine-Tune May Only Be Worth One Bit (arXiv:2402.10193)
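The quantized fine-tuning entries in this collection (QLoRA, LoftQ, QA-LoRA, LQ-LoRA) share a common recipe: load the frozen base weights in a low-bit format and train small low-rank adapters on top. Below is a minimal sketch of that recipe using the transformers, bitsandbytes, and peft libraries; the model id and LoRA hyperparameters are illustrative assumptions, not values taken from any of the papers above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Illustrative base model; swap in whichever checkpoint you actually want to tune.
model_id = "facebook/opt-350m"

# 4-bit NF4 weight quantization with double quantization, as popularized by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate package
)

# Standard k-bit training prep: upcasts layer norms and enables input gradients
# so backprop works with the frozen quantized base weights.
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters; rank/alpha/dropout here are arbitrary example values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From here the model can be passed to an ordinary training loop: the 4-bit base weights stay fixed while gradients flow only through the bf16 LoRA parameters, which is what keeps fine-tuning memory close to inference-time levels.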