arXiv:2511.07885

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

Published on Nov 11 · Submitted by Narayan on Nov 12
Abstract

AI-generated summary

Local inference using small language models can efficiently handle a significant portion of real-world queries, reducing demand on centralized cloud infrastructure, with intelligence per watt as a key metric for evaluation.

Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink this paradigm: small LMs (<=20B active parameters) now achieve performance competitive with frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) run these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? Answering this requires measuring whether local LMs can accurately answer real-world queries and whether they can do so efficiently enough to be practical on power-constrained devices (i.e., laptops). We propose intelligence per watt (IPW), task accuracy per unit of power, as a metric for assessing capability and efficiency of local inference across model-accelerator pairs. We conduct a large-scale empirical study across 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis reveals 3 findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries with accuracy varying by domain. Second, from 2023-2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure, with IPW serving as the critical metric for tracking this transition. We release our IPW profiling harness for systematic intelligence-per-watt benchmarking.
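
As a concrete reading of the metric definition above, here is a minimal sketch of how IPW could be computed from per-query accuracy, energy, and latency measurements. This is an illustration only, not the released profiling harness; the record fields and the aggregation (accuracy divided by average power) are assumptions based on the abstract.

```python
from dataclasses import dataclass

@dataclass
class QueryMeasurement:
    correct: bool      # did the model answer this query correctly?
    energy_j: float    # energy consumed for this query, in joules
    latency_s: float   # wall-clock time for this query, in seconds

def intelligence_per_watt(measurements: list[QueryMeasurement]) -> float:
    """IPW ~ task accuracy divided by average power draw (watts).

    Follows the abstract's definition (accuracy per unit of power);
    the exact aggregation used by the paper's harness may differ.
    """
    accuracy = sum(m.correct for m in measurements) / len(measurements)
    avg_power_w = sum(m.energy_j for m in measurements) / sum(m.latency_s for m in measurements)
    return accuracy / avg_power_w

# Hypothetical example: three queries on some model-accelerator pair.
runs = [
    QueryMeasurement(correct=True,  energy_j=120.0, latency_s=4.0),
    QueryMeasurement(correct=True,  energy_j=150.0, latency_s=5.0),
    QueryMeasurement(correct=False, energy_j=90.0,  latency_s=3.0),
]
print(f"IPW = {intelligence_per_watt(runs):.4f} accuracy per watt")
```

Computing a number like this per model-accelerator pair is what allows local and cloud accelerators running identical models to be compared on the same scale.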

Community

Paper submitter

We propose intelligence per watt (IPW), a metric quantifying intelligence delivered per unit of power consumed, to measure the viability of local AI systems within device constraints. We find that local LMs already handle 88.7% of single-turn chat and reasoning queries, with local IPW improving 5.3× over two years, driven by better models (3.2×) and better accelerators (1.7×).

As local IPW improves, a meaningful fraction of workloads can shift from centralized infrastructure to local compute, with IPW serving as the critical metric for tracking this transition.

I'm having a hard time mapping the conclusion drawn here, "Assuming perfect query-to-model assignment, oracle routing reduces energy consumption by 80.4%, compute by 77.3%, and cost by 73.8% versus cloud-only deployment to the largest model," onto any realistic scenario, because the research was designed around "single-query inference (batch size = 1), to (1) isolate intrinsic model-accelerator efficiency from system-level serving optimizations and (2) follow standard local inference benchmarking practices."

Cloud inference rarely, if ever, runs at batch size 1, because providers want to eke out as much efficiency as they can; further, cloud providers are incentivized to never let their accelerators go idle, given the cost of running them. A local endpoint, by contrast, is very likely to go idle (if I have a Ryzen Strix Halo machine running a local inference endpoint, it sits idle any time I'm not using it). So the "reduces energy consumption by 80%" figure might be accurate for a one-off query, but as a real-world statistic it is effectively meaningless.
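
To make the idle-power point concrete, here is a rough back-of-envelope sketch. All numbers are made-up placeholders, not measurements from the paper: the idea is just that a lightly used local endpoint has to amortize its idle draw over few queries, while a batched, highly utilized cloud server spreads its power over many concurrent queries.

```python
def local_energy_per_query(active_j: float, idle_w: float,
                           queries_per_day: int, active_s_per_query: float) -> float:
    """Effective per-query energy for a local endpoint, charging its idle
    draw to the queries it serves. All inputs are hypothetical placeholders."""
    day_s = 24 * 3600
    idle_s = max(day_s - queries_per_day * active_s_per_query, 0)
    idle_j = idle_w * idle_s
    return active_j + idle_j / queries_per_day

def cloud_energy_per_query(server_w: float, batch_size: int,
                           latency_s: float, utilization: float) -> float:
    """Effective per-query energy for a batched cloud server: power is
    shared across the batch, and low utilization inflates the cost."""
    return (server_w * latency_s) / (batch_size * utilization)

# Made-up illustrative numbers, NOT from the paper:
local = local_energy_per_query(active_j=150, idle_w=10,
                               queries_per_day=50, active_s_per_query=5)
cloud = cloud_energy_per_query(server_w=700, batch_size=64,
                               latency_s=5, utilization=0.9)
print(f"local ~ {local:.0f} J/query, cloud ~ {cloud:.0f} J/query")
```

Whether idle draw actually dominates depends on how many queries the device serves and how aggressively it sleeps, which are exactly the serving-level factors the batch-size-1 methodology sets aside.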

I'd love to see a follow-up comparing best-case scenarios for energy-optimized inference; I think that would be much more meaningful for industry forecasting of compute and energy consumption.


