CCMat's Collections

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Paper • 2405.08748 • Published • 24
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper • 2405.10300 • Published • 30
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper • 2405.09818 • Published • 131
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Paper • 2405.11143 • Published • 41
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
Paper • 2405.12130 • Published • 50
FIFO-Diffusion: Generating Infinite Videos from Text without Training
Paper • 2405.11473 • Published • 57
Your Transformer is Secretly Linear
Paper • 2405.12250 • Published • 158
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 34
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 90
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Paper • 2405.15738 • Published • 46
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Paper • 2403.03206 • Published • 70
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model
Paper • 2406.04333 • Published • 38
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Paper • 2406.04325 • Published • 75
Block Transformer: Global-to-Local Language Modeling for Fast Inference
Paper • 2406.02657 • Published • 41
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Paper • 2307.06304 • Published • 34
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
Paper • 2404.14619 • Published • 126
Multi-Head Mixture-of-Experts
Paper • 2404.15045 • Published • 60
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55
Paper • 2405.18407 • Published • 48
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Paper • 2405.21060 • Published • 67
CRAG -- Comprehensive RAG Benchmark
Paper • 2406.04744 • Published • 48
DiTFastAttn: Attention Compression for Diffusion Transformer Models
Paper • 2406.08552 • Published • 25
Paper • 2406.09414 • Published • 103
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 51
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing
Paper • 2406.10601 • Published • 70
Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation
Paper • 2406.12849 • Published • 50
Adam-mini: Use Fewer Learning Rates To Gain More
Paper • 2406.16793 • Published • 69
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Paper • 2406.16855 • Published • 57
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
Paper • 2407.01392 • Published • 45
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Paper • 2407.03320 • Published • 95
Video Diffusion Alignment via Reward Gradients
Paper • 2407.08737 • Published • 49
Paper • 2407.10671 • Published • 166
Theia: Distilling Diverse Vision Foundation Models for Robot Learning
Paper • 2407.20179 • Published • 47
Gemma 2: Improving Open Language Models at a Practical Size
Paper • 2408.00118 • Published • 79
The Llama 3 Herd of Models
Paper • 2407.21783 • Published • 116
SAM 2: Segment Anything in Images and Videos
Paper • 2408.00714 • Published • 116
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper • 2408.01800 • Published • 89
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Paper • 2408.02657 • Published • 35
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Paper • 2408.02718 • Published • 62
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
Paper • 2408.03361 • Published • 85
An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion
Paper • 2408.03178 • Published • 40
LLaVA-OneVision: Easy Visual Task Transfer
Paper • 2408.03326 • Published • 60
Transformer Explainer: Interactive Learning of Text-Generative Models
Paper • 2408.04619 • Published • 172
ControlNeXt: Powerful and Efficient Control for Image and Video Generation
Paper • 2408.06070 • Published • 55
Qwen2-Audio Technical Report
Paper • 2407.10759 • Published • 61
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression
Paper • 2407.12077 • Published • 57
Compact Language Models via Pruning and Knowledge Distillation
Paper • 2407.14679 • Published • 38
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Paper • 2407.15841 • Published • 40
KAN or MLP: A Fairer Comparison
Paper • 2407.16674 • Published • 43
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
Paper • 2407.16655 • Published • 30
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person
Paper • 2407.16224 • Published • 29
MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization
Paper • 2408.02555 • Published • 32
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Paper • 2407.19985 • Published • 37
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
Paper • 2407.16982 • Published • 42
VILA^2: VILA Augmented VILA
Paper • 2407.17453 • Published • 41
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Paper • 2408.06072 • Published • 39
Paper • 2408.07009 • Published • 62
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper • 2408.08872 • Published • 100
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model
Paper • 2408.10198 • Published • 35
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Paper • 2408.11039 • Published • 63
Sapiens: Foundation for Human Vision Models
Paper • 2408.12569 • Published • 94
DreamCinema: Cinematic Transfer with Free Camera and 3D Character
Paper • 2408.12601 • Published • 31
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 133
LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation
Paper • 2408.13252 • Published • 26
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
Paper • 2408.14176 • Published • 62
Foundation Models for Music: A Survey
Paper • 2408.14340 • Published • 44
Diffusion Models Are Real-Time Game Engines
Paper • 2408.14837 • Published • 126
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Paper • 2408.15998 • Published • 87
CogVLM2: Visual Language Models for Image and Video Understanding
Paper • 2408.16500 • Published • 57
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Paper • 2408.16532 • Published • 50
LinFusion: 1 GPU, 1 Minute, 16K Image
Paper • 2409.02097 • Published • 34
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
Paper • 2409.02634 • Published • 97
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
Paper • 2409.01322 • Published • 96
Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation
Paper • 2409.03718 • Published • 27
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Paper • 2404.12387 • Published • 39
Dynamic Typography: Bringing Words to Life
Paper • 2404.11614 • Published • 46
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Paper • 2404.14219 • Published • 258
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Paper • 2404.16710 • Published • 80
Iterative Reasoning Preference Optimization
Paper • 2404.19733 • Published • 49
KAN: Kolmogorov-Arnold Networks
Paper • 2404.19756 • Published • 115
OmniGen: Unified Image Generation
Paper • 2409.11340 • Published • 115
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models
Paper • 2409.07452 • Published • 21
Towards a Unified View of Preference Learning for Large Language Models: A Survey
Paper • 2409.02795 • Published • 72
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
Paper • 2409.11355 • Published • 31
Qwen2.5-Coder Technical Report
Paper • 2409.12186 • Published • 150
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 78
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Paper • 2312.14125 • Published • 47
Training Language Models to Self-Correct via Reinforcement Learning
Paper • 2409.12917 • Published • 140
Imagine yourself: Tuning-Free Personalized Image Generation
Paper • 2409.13346 • Published • 70
Colorful Diffuse Intrinsic Image Decomposition in the Wild
Paper • 2409.13690 • Published • 14
Emu3: Next-Token Prediction is All You Need
Paper • 2409.18869 • Published • 95
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Paper • 2409.20566 • Published • 56
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Paper • 2410.02740 • Published • 54
Loong: Generating Minute-level Long Videos with Autoregressive Language Models
Paper • 2410.02757 • Published • 36
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Paper • 2410.02073 • Published • 41
Baichuan-Omni Technical Report
Paper • 2410.08565 • Published • 87
Animate-X: Universal Character Image Animation with Enhanced Motion Representation
Paper • 2410.10306 • Published • 57
Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices
Paper • 2410.11795 • Published • 18
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
Paper • 2410.16268 • Published • 69
SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes
Paper • 2410.17249 • Published • 42
Movie Gen: A Cast of Media Foundation Models
Paper • 2410.13720 • Published • 98
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
Paper • 2410.13863 • Published • 38
FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors
Paper • 2410.16271 • Published • 84
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper • 2410.13861 • Published • 56
Unbounded: A Generative Infinite Game of Character Life Simulation
Paper • 2410.18975 • Published • 37
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
Paper • 2410.17243 • Published • 93
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Paper • 2410.06940 • Published • 10
Addition is All You Need for Energy-efficient Language Models
Paper • 2410.00907 • Published • 151
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Paper • 2410.13848 • Published • 34
Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations
Paper • 2410.10792 • Published • 31
CLEAR: Character Unlearning in Textual and Visual Modalities
Paper • 2410.18057 • Published • 209
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper • 2411.04997 • Published • 39
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
Paper • 2411.07232 • Published • 67
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
Paper • 2411.07199 • Published • 50
Large Language Models Can Self-Improve in Long-context Reasoning
Paper • 2411.08147 • Published • 66
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
Paper • 2411.08380 • Published • 26
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
Paper • 2411.09595 • Published • 77
MagicQuill: An Intelligent Interactive Image Editing System
Paper • 2411.09703 • Published • 78
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper • 2411.10440 • Published • 129
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
Paper • 2411.06558 • Published • 36
AnimateAnything: Consistent and Controllable Animation for Video Generation
Paper • 2411.10836 • Published • 24
RedPajama: an Open Dataset for Training Large Language Models
Paper • 2411.12372 • Published • 56
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
Paper • 2411.10958 • Published • 56
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
Paper • 2411.11922 • Published • 19
Stable Flow: Vital Layers for Training-Free Image Editing
Paper • 2411.14430 • Published • 21
Style-Friendly SNR Sampler for Style-Driven Generation
Paper • 2411.14793 • Published • 39
Star Attention: Efficient LLM Inference over Long Sequences
Paper • 2411.17116 • Published • 55
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Paper • 2411.15466 • Published • 39
Material Anything: Generating Materials for Any 3D Object via Diffusion
Paper • 2411.15138 • Published • 50
OminiControl: Minimal and Universal Control for Diffusion Transformer
Paper • 2411.15098 • Published • 61
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
Paper • 2411.17459 • Published • 12
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
Paper • 2412.02259 • Published • 60
Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability
Paper • 2411.19943 • Published • 63
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper • 2412.03555 • Published • 133
SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance
Paper • 2412.02687 • Published • 113
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
Paper • 2412.03069 • Published • 35
Imagine360: Immersive 360 Video Generation from Perspective Anchor
Paper • 2412.03552 • Published • 29
Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion
Paper • 2412.03515 • Published • 27
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
Paper • 2412.01064 • Published • 47
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
Paper • 2412.00927 • Published • 29
Open-Sora Plan: Open-Source Large Video Generation Model
Paper • 2412.00131 • Published • 33
SpotLight: Shadow-Guided Object Relighting via Diffusion
Paper • 2411.18665 • Published • 3
Video Depth without Video Models
Paper • 2411.19189 • Published • 39
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models
Paper • 2411.18350 • Published • 29
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
Paper • 2411.18613 • Published • 58
Pathways on the Image Manifold: Image Editing via Video Generation
Paper • 2411.16819 • Published • 37
Identity-Preserving Text-to-Video Generation by Frequency Decomposition
Paper • 2411.17440 • Published • 37
ROICtrl: Boosting Instance Control for Visual Generation
Paper • 2411.17949 • Published • 87
LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting
Paper • 2412.00177 • Published • 8
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper • 2412.04467 • Published • 118
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Paper • 2412.04424 • Published • 63
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Paper • 2412.04454 • Published • 71
Structured 3D Latents for Scalable and Versatile 3D Generation
Paper • 2412.01506 • Published • 83
A Noise is Worth Diffusion Guidance
Paper • 2412.03895 • Published • 30
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
Paper • 2412.04146 • Published • 23
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Paper • 2412.05271 • Published • 159
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
Paper • 2412.04301 • Published • 41
APOLLO: SGD-like Memory, AdamW-level Performance
Paper • 2412.05270 • Published • 38
STIV: Scalable Text and Image Conditioned Video Generation
Paper • 2412.07730 • Published • 74
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
Paper • 2412.07774 • Published • 30
Video Motion Transfer with Diffusion Transformers
Paper • 2412.07776 • Published • 17
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
Paper • 2412.07760 • Published • 55
StyleMaster: Stylize Your Video with Artistic Generation and Translation
Paper • 2412.07744 • Published • 20
Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation
Paper • 2412.06016 • Published • 20
Learning Flow Fields in Attention for Controllable Person Image Generation
Paper • 2412.08486 • Published • 36
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Paper • 2412.09596 • Published • 98
Paper • 2412.08905 • Published • 121
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
Paper • 2412.09593 • Published • 18
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Paper • 2412.15213 • Published • 28
Parallelized Autoregressive Visual Generation
Paper • 2412.15119 • Published • 53
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
Paper • 2412.13649 • Published • 20
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
Paper • 2412.17256 • Published • 47
Paper • 2412.15115 • Published • 376
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 147
GenEx: Generating an Explorable World
Paper • 2412.09624 • Published • 97
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
Paper • 2412.09604 • Published • 38
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
Paper • 2412.09626 • Published • 21
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
Paper • 2412.09283 • Published • 19
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper • 2412.09871 • Published • 108
BrushEdit: All-In-One Image Inpainting and Editing
Paper • 2412.10316 • Published • 35
ColorFlow: Retrieval-Augmented Image Sequence Colorization
Paper • 2412.11815 • Published • 26
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Paper • 2412.14171 • Published • 24
Diffusion360: Seamless 360 Degree Panoramic Image Generation based on Diffusion Models
Paper • 2311.13141 • Published • 16
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 107
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
Paper • 2501.01427 • Published • 54
LTX-Video: Realtime Video Latent Diffusion
Paper • 2501.00103 • Published • 47
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Paper • 2501.01423 • Published • 43
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Paper • 2412.19723 • Published • 87
Paper • 2412.18653 • Published • 84
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
Paper • 2412.18605 • Published • 22
DepthLab: From Partial to Complete
Paper • 2412.18153 • Published • 36
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
Paper • 2412.17739 • Published • 41
DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
Paper • 2412.11100 • Published • 7
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
Paper • 2412.09619 • Published • 27
PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations
Paper • 2412.05994 • Published • 19
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
Paper • 2412.08580 • Published • 45
StreamChat: Chatting with Streaming Video
Paper • 2412.08646 • Published • 18
Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
Paper • 2412.06234 • Published • 19
ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer
Paper • 2412.07720 • Published • 31
Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation
Paper • 2412.06781 • Published • 24
3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes
Paper • 2411.14974 • Published • 16
TEXGen: a Generative Diffusion Model for Mesh Textures
Paper • 2411.14740 • Published • 18
SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis
Paper • 2411.16443 • Published • 12
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Paper • 2411.04996 • Published • 51
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
Paper • 2411.04928 • Published • 57
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
Paper • 2411.05003 • Published • 71
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
Paper • 2411.02355 • Published • 51
How Far is Video Generation from World Model: A Physical Law Perspective
Paper • 2411.02385 • Published • 34
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
Paper • 2411.02265 • Published • 25
Adaptive Caching for Faster Video Generation with Diffusion Transformers
Paper • 2411.02397 • Published • 23
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
Paper • 2411.02336 • Published • 24
AutoVFX: Physically Realistic Video Editing from Natural Language Instructions
Paper • 2411.02394 • Published • 17
GenXD: Generating Any 3D and 4D Scenes
Paper • 2411.02319 • Published • 20
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
Paper • 2410.22366 • Published • 83
One Shot, One Talk: Whole-body Talking Avatar from a Single Image
Paper • 2412.01106 • Published • 24
Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling
Paper • 2411.18664 • Published • 24
FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion
Paper • 2411.18552 • Published • 18
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing
Paper • 2412.04280 • Published • 14
MV-Adapter: Multi-view Consistent Image Generation Made Easy
Paper • 2412.03632 • Published • 24
PanoDreamer: 3D Panorama Synthesis from a Single Image
Paper • 2412.04827 • Published • 11
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
Paper • 2412.04440 • Published • 22
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Paper • 2501.04001 • Published • 47
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Paper • 2501.02955 • Published • 44
Cosmos World Foundation Model Platform for Physical AI
Paper • 2501.03575 • Published • 81
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Paper • 2501.03218 • Published • 36
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
Paper • 2501.02976 • Published • 55
An Empirical Study of Autoregressive Pre-training from Videos
Paper • 2501.05453 • Published • 41
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
Paper • 2501.03841 • Published • 56
VideoRAG: Retrieval-Augmented Generation over Video Corpus
Paper • 2501.05874 • Published • 75
GameFactory: Creating New Games with Generative Interactive Videos
Paper • 2501.08325 • Published • 67
CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation
Paper • 2501.09433 • Published • 18
Do generative video models learn physical principles from watching videos?
Paper • 2501.09038 • Published • 34
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
Paper • 2501.09751 • Published • 48
Diffusion Adversarial Post-Training for One-Step Video Generation
Paper • 2501.08316 • Published • 35
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Paper • 2501.08326 • Published • 33
MangaNinja: Line Art Colorization with Precise Reference Following
Paper • 2501.08332 • Published • 60
VideoAuteur: Towards Long Narrative Video Generation
Paper • 2501.06173 • Published • 33
Tensor Product Attention Is All You Need
Paper • 2501.06425 • Published • 89
Evolving Deeper LLM Thinking
Paper • 2501.09891 • Published • 115
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
Paper • 2501.12909 • Published • 71
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Paper • 2501.13106 • Published • 90
The Lessons of Developing Process Reward Models in Mathematical Reasoning
Paper • 2501.07301 • Published • 99
FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
Paper • 2412.08629 • Published • 12
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
Paper • 2501.13200 • Published • 68
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Paper • 2311.06242 • Published • 95
Elucidating the Design Space of Diffusion-Based Generative Models
Paper • 2206.00364 • Published • 18
Improving Video Generation with Human Feedback
Paper • 2501.13918 • Published • 52
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
Paper • 2501.13926 • Published • 42
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Paper • 2501.17161 • Published • 123
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
Paper • 2501.16764 • Published • 22
MatAnyone: Stable Video Matting with Consistent Memory Propagation
Paper • 2501.14677 • Published • 34
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
Paper • 2411.04983 • Published • 13
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Paper • 2401.10774 • Published • 59
SAMPart3D: Segment Any Part in 3D Objects
Paper • 2411.07184 • Published • 28
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
Paper • 2502.01639 • Published • 26