Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
Abstract
Local inference using small language models can efficiently handle a significant portion of real-world queries, reducing demand on centralized cloud infrastructure, with intelligence per watt as a key metric for evaluation.
Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink this paradigm: small LMs (<=20B active parameters) now achieve performance competitive with frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) run these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? Answering this requires measuring whether local LMs can accurately answer real-world queries and whether they can do so efficiently enough to be practical on power-constrained devices (e.g., laptops). We propose intelligence per watt (IPW), task accuracy per unit of power consumed, as a metric for assessing the capability and efficiency of local inference across model-accelerator pairs. We conduct a large-scale empirical study across 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis reveals three findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries, with accuracy varying by domain. Second, from 2023 to 2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%. Third, local accelerators deliver at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure, with IPW serving as the critical metric for tracking this transition. We release our profiling harness for systematic IPW benchmarking.
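To make the metric concrete, here is a minimal sketch assuming IPW is task accuracy divided by average power draw over the profiled queries; the released profiling harness may define and measure this differently, and the `QueryMeasurement` fields below are illustrative, not the harness's actual schema.

```python
from dataclasses import dataclass

@dataclass
class QueryMeasurement:
    correct: bool     # did the local LM answer this query correctly?
    energy_j: float   # measured energy for the query, in joules
    latency_s: float  # wall-clock latency for the query, in seconds

def intelligence_per_watt(measurements: list[QueryMeasurement]) -> float:
    """Task accuracy divided by average power draw (watts = total joules / total seconds)."""
    accuracy = sum(m.correct for m in measurements) / len(measurements)
    avg_power_w = sum(m.energy_j for m in measurements) / sum(m.latency_s for m in measurements)
    return accuracy / avg_power_w
```

Computed this way, IPW rises when a model-accelerator pair answers more queries correctly, draws less power, or both, which is what lets the metric compare local and cloud deployments on a common axis.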
Community
We propose intelligence per watt (IPW), a metric quantifying intelligence delivered per unit of power consumed, to measure the viability of local AI systems within device constraints. We find that local LMs already handle 88.7% of single-turn chat and reasoning queries, with local IPW improving 5.3× over two years, driven by better models (3.2×) and better accelerators (1.7×).
As local IPW improves, a meaningful fraction of workloads can shift from centralized infrastructure to local compute, with IPW serving as the critical metric for tracking this transition.
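The two contributions roughly compose multiplicatively; this is an assumption consistent with the rounded figures above, not an equation quoted from the paper:

```latex
% Assumption: model and accelerator gains compose multiplicatively,
% consistent with the rounded figures quoted above.
\underbrace{3.2\times}_{\text{better models}} \;\cdot\; \underbrace{1.7\times}_{\text{better accelerators}}
\;\approx\; 5.4\times \;\approx\; 5.3\times \ \text{(overall IPW gain, 2023--2025)}
```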
I'm having a hard time projecting the conclusion drawn here "Assuming perfect query-to-model assignment, oracle routing reduces energy consumption by 80.4%, compute by 77.3%, and cost by 73.8% versus cloud-only deployment to the largest model" to any realistic scenario because the research was designed around "single-query inference (batch size = 1), to (1) isolate intrinsic model-accelerator efficiency from system-level serving optimizations and (2) follow standard local inference benchmarking practices."
Cloud inference NEVER runs at batch size 1, because providers batch aggressively to eke out as much efficiency as they can; further, cloud providers are incentivized to never let their accelerators go idle, given the cost of running them. A local endpoint, by contrast, is far more likely to sit idle (if I have a Ryzen Strix Halo machine running a local inference endpoint, it goes idle any time I'm not using it). So the "reduces energy consumption by 80%" figure might be accurate for a one-off query, but as a real-world statistic it is effectively meaningless.
I'd love to see a follow-up comparing best-case scenarios for energy-optimized inference; I think that would be much more meaningful for industry forecasts of compute and energy consumption.
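To make the concern concrete, here is a back-of-the-envelope sketch of how batching and idle power change per-query energy. Every parameter value is a made-up placeholder chosen only to illustrate the shape of the comparison; none of these numbers comes from the paper's measurements.

```python
# Back-of-the-envelope comparison of per-query energy under different serving
# assumptions. All parameter values are hypothetical placeholders for
# illustration only; they are NOT measurements from the paper.

def cloud_energy_per_query(power_w: float, batch_size: int, latency_s: float) -> float:
    """Cloud accelerator kept fully busy, amortizing its power over a batch of queries."""
    return power_w * latency_s / batch_size  # joules per query

def local_energy_per_query(active_power_w: float, latency_s: float,
                           idle_power_w: float, idle_s_per_query: float) -> float:
    """Local device at batch size 1, with idle-time power amortized onto each query."""
    return active_power_w * latency_s + idle_power_w * idle_s_per_query  # joules per query

if __name__ == "__main__":
    cloud = cloud_energy_per_query(power_w=700.0, batch_size=64, latency_s=2.0)
    local = local_energy_per_query(active_power_w=60.0, latency_s=4.0,
                                   idle_power_w=8.0, idle_s_per_query=300.0)
    print(f"cloud (batched, always busy): {cloud:.1f} J/query")
    print(f"local (batch 1, mostly idle): {local:.1f} J/query")
```

Under these placeholder assumptions the batched, always-busy cloud path comes out far ahead per query, which is the point of the comment above: a batch-size-1, active-time-only comparison does not capture how cloud and local endpoints are actually operated.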
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- EdgeReasoning: Characterizing Reasoning LLM Deployment on Edge GPUs (2025)
- lm-Meter: Unveiling Runtime Inference Latency for On-Device Language Models (2025)
- ML-EcoLyzer: Quantifying the Environmental Cost of Machine Learning Inference Across Frameworks and Hardware (2025)
- Rearchitecting Datacenter Lifecycle for AI: A TCO-Driven Framework (2025)
- From Prompts to Power: Measuring the Energy Footprint of LLM Inference (2025)
- Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute (2025)
- ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models (2025)