LiveTradeBench: Seeking Real-World Alpha with Large Language Models Paper • 2511.03628 • Published 5 days ago • 9
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity Paper • 2511.03146 • Published 5 days ago • 7
LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation Paper • 2511.03001 • Published 6 days ago • 45
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation Paper • 2511.02778 • Published 6 days ago • 95
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization Paper • 2510.25616 • Published 12 days ago • 88
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought Paper • 2511.02779 • Published 6 days ago • 52
TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning Paper • 2511.01833 • Published 7 days ago • 15
Kimi Linear: An Expressive, Efficient Attention Architecture Paper • 2510.26692 • Published 11 days ago • 101
The End of Manual Decoding: Towards Truly End-to-End Language Models Paper • 2510.26697 • Published 11 days ago • 114
The Era of Agentic Organization: Learning to Organize with Language Models Paper • 2510.26658 • Published 11 days ago • 23
AMO-Bench: Large Language Models Still Struggle in High School Math Competitions Paper • 2510.26768 • Published 11 days ago • 33
Zep: A Temporal Knowledge Graph Architecture for Agent Memory Paper • 2501.13956 • Published Jan 20 • 5
A Survey of Data Agents: Emerging Paradigm or Overstated Hype? Paper • 2510.23587 • Published 14 days ago • 65
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing Paper • 2510.19808 • Published 19 days ago • 28
ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers? Paper • 2510.24591 • Published 13 days ago • 4
FunReason-MT Technical Report: Overcoming the Complexity Barrier in Multi-Turn Function Calling Paper • 2510.24645 • Published 13 days ago • 5
ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality Paper • 2510.22037 • Published 16 days ago • 18
Repurposing Synthetic Data for Fine-grained Search Agent Supervision Paper • 2510.24694 • Published 13 days ago • 23