On-Device vs. Cloud Inference: A 2026 Economic Analysis
Consumer ML architecture in 2026 is no longer a binary choice. Per-inference cost, battery impact, privacy exposure, and tail latency all point to different answers. We build the decision framework and price it out.
On-device inference is roughly free per call in marginal cost, but carries a meaningful battery and thermal overhead; cloud inference on AWS Inferentia2 or GCP TPU v5e runs between roughly $0.08 and $2.80 per million inferences depending on model size. For small models (<500M params), on-device wins nearly every time. For LLM-class models (>7B params), cloud is still the only option at scale. The interesting region is the 500M-7B parameter band, where hybrid architectures dominate.
The "on-device versus cloud" decision used to be easy. Five years ago, anything larger than a ResNet-18 did not fit on a phone, and anything smaller than a ResNet-18 was not worth the latency of a cloud round-trip. The architecture decision made itself. That is no longer true. A 2024 iPhone can run a 3-billion-parameter quantised Mistral at real-time speeds; a 2026 Pixel can run a 7B at interactive-but-not-realtime. Meanwhile, cloud inference has become an order of magnitude cheaper in the same period, driven by Inferentia2, TPU v5e, and Google and Amazon's race to undercut each other on $/TFLOP.
What used to be a capability boundary is now an economic tradeoff. This piece is our attempt to frame that tradeoff in numbers that a product team can actually use. Per-inference cost, battery impact, privacy exposure, and tail latency are each reducible to a quantity; the decision is a weighted sum over those quantities for a specific workload. We build the framework and price it out on four deployment archetypes.
The four cost axes
Any rigorous comparison of on-device and cloud inference has to account for four costs that are usually hidden or counted inconsistently.
Marginal compute cost
Cloud compute has a clean price: $/hour for a GPU or accelerator instance, divided by the number of inferences it can serve per second, gives a per-inference cost. AWS Inferentia2 is the most favourable cloud inference target in 2026 for reasonable-size models, with a published $0.76/hour for an inf2.xlarge. GCP's TPU v5e is in a similar range. On these accelerators, a quantised 1B-parameter vision model can serve around 400 requests per second, implying a per-inference cost of about $0.00000053, or roughly half a microdollar: about $0.53 per million inferences.
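The arithmetic generalises to any instance type; a minimal sketch, using the hourly rate and throughput figures quoted above as inputs rather than guarantees:

```python
def cost_per_million(hourly_rate_usd: float, requests_per_sec: float) -> float:
    """Per-million-inference cost of a cloud accelerator at full utilisation."""
    per_inference = hourly_rate_usd / 3600.0 / requests_per_sec
    return per_inference * 1_000_000

# inf2.xlarge at $0.76/hr serving ~400 req/s for a quantised 1B vision model
print(round(cost_per_million(0.76, 400), 2))  # ~0.53 dollars per million
```

The same function reproduces the rest of the cloud column in the table below once you plug in each target's rate and throughput.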
On-device compute, by contrast, has a marginal cost of zero. The user paid for the phone; the phone's Apple Neural Engine or Qualcomm Hexagon NPU is already there. Every inference run on-device is a cost the app developer does not pay. This is the single largest economic argument for on-device inference, and it is often underweighted because it is invisible on cloud bills.
Battery and thermal cost
On-device inference is not actually free. It consumes battery and produces heat. A 2.5 GFLOP inference on the Apple Neural Engine (A17 Pro generation) draws approximately 180 mJ; on the Hexagon NPU in the Snapdragon 8 Gen 3, approximately 220 mJ. A phone battery is around 15 Wh, or 54 kJ. An app running 1000 inferences over a day consumes roughly 200 J — about 0.37% of the battery. An app running 10,000 inferences consumes 3.7%, which users notice. Thermal throttling kicks in earlier on the Qualcomm side; sustained NPU load above roughly 2 W causes the NPU to downclock within 30-60 seconds, at which point inference latency doubles.
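The battery arithmetic is worth having as a reusable estimate; a sketch assuming the ~15 Wh battery and ~200 mJ-per-inference energy figures quoted above:

```python
def battery_fraction(energy_per_inference_mj: float, inferences_per_day: int,
                     battery_wh: float = 15.0) -> float:
    """Fraction of a full charge consumed by a day's on-device inferences."""
    battery_j = battery_wh * 3600.0              # 15 Wh is about 54 kJ
    daily_j = energy_per_inference_mj / 1000.0 * inferences_per_day
    return daily_j / battery_j

light = battery_fraction(200, 1_000)    # ~0.0037, i.e. ~0.37% of the battery
heavy = battery_fraction(200, 10_000)   # ~0.037, i.e. ~3.7%: users notice
```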
Privacy and data-transfer cost
Cloud inference requires that input data leave the device. For some workloads this is benign (text embeddings of publicly shared content); for others it is a compliance problem (health data, images of people, voice). The engineering cost of routing sensitive data to a cloud endpoint — encryption in transit, consent flows, regional data-residency compliance, enterprise SSO — is real and often exceeds the compute cost for low-volume applications.
For high-volume applications, there is also the data-transfer cost itself: a 2 MB image uploaded per inference at 10M inferences per month is 20 TB of egress, which at AWS list pricing is roughly $1800. This is usually rounded into background cloud cost but it is not negligible for image-heavy workloads.
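The same back-of-envelope for transfer cost, assuming the roughly $0.09/GB list price implied by the figure above (a ballpark assumption, not a quote):

```python
def monthly_transfer_usd(mb_per_inference: float, inferences_per_month: int,
                         usd_per_gb: float = 0.09) -> float:
    """Data-transfer cost of shipping inference inputs to a cloud endpoint."""
    gb = mb_per_inference * inferences_per_month / 1024.0
    return gb * usd_per_gb

monthly_transfer_usd(2, 10_000_000)   # ~ $1758 for ~20 TB per month
```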
Latency cost
On-device inference has a latency floor set by the NPU: roughly 20-40 ms for a small vision model, 150-400 ms for a mid-size transformer. Cloud inference has a latency floor set by the network: roughly 80-200 ms for a round-trip to the nearest cloud region under nominal conditions, rising to 500-2000 ms on cellular with poor signal. For latency-sensitive interactive features — cursor, autocomplete, camera overlay — on-device is often the only option regardless of cost.
Per-million-inference cost table
The headline table below summarises cost across typical 2026 hardware. It assumes a batch size appropriate to each target (1 for on-device, batched for cloud), INT8 quantisation for NPU targets, and FP16 for cloud.
| Target | Small vision (50M) | Mid model (1B) | Large model (7B) | Notes |
|---|---|---|---|---|
| Apple Neural Engine (A17 Pro) | $0 + ~0.02% batt | $0 + ~0.2% batt | N/A (OOM) | Quantised via Core ML, INT8 |
| Qualcomm Hexagon NPU (Gen 3) | $0 + ~0.02% batt | $0 + ~0.3% batt | $0 + thermal throttling | Quantised via QNN, INT8 |
| Google Tensor G3 TPU | $0 + ~0.03% batt | $0 + ~0.4% batt | Barely (latency > 2 s) | Quantised via LiteRT |
| AWS Inferentia2 (inf2.xlarge) | $0.08 | $0.53 | $2.40 | ONNX/Neuron, FP16 |
| GCP TPU v5e | $0.11 | $0.64 | $2.80 | JAX/XLA, FP16 |
| NVIDIA L4 (GCP g2) | $0.19 | $1.20 | $4.90 | TensorRT, FP16 |
The cloud numbers assume steady-state load; cold-start and underutilisation can multiply them several-fold. The on-device numbers assume the NPU is actually available, which is not always true: iOS restricts ANE access to models loaded through Core ML with specific op constraints, and Android fragmentation means an app has to ship three or four execution providers to hit most devices.
A decision framework
The framework we recommend, and have seen variants of at several production teams, works as follows. Score the workload on four axes; the highest-scoring deployment pattern is usually right.
| Workload trait | On-device | Cloud | Hybrid |
|---|---|---|---|
| Model < 500M params | +3 | +1 | +2 |
| Model 500M-7B params | +1 | +2 | +3 |
| Model > 7B params | -2 | +3 | +2 |
| Latency budget < 100 ms | +3 | -1 | +2 |
| Per-user inferences/day: <10 | +1 | +3 | +1 |
| Per-user inferences/day: 10-1000 | +2 | +1 | +3 |
| Per-user inferences/day: >1000 | +3 | -2 | +2 |
| Privacy-sensitive input | +3 | -2 | +2 |
| Offline required | +3 | -3 | +1 |
Figure 1. Scoring framework for on-device / cloud / hybrid deployment. Add up the scores for a given workload; highest wins. In borderline cases, hybrid is the safe default because it preserves optionality.
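The scoring in Figure 1 is mechanical enough to encode directly; a sketch in Python, with trait names invented here for illustration (they are not an API):

```python
# Figure 1 transcribed as data: score per deployment target for each trait.
SCORES = {
    "model_lt_500m":     {"on_device":  3, "cloud":  1, "hybrid": 2},
    "model_500m_7b":     {"on_device":  1, "cloud":  2, "hybrid": 3},
    "model_gt_7b":       {"on_device": -2, "cloud":  3, "hybrid": 2},
    "latency_lt_100ms":  {"on_device":  3, "cloud": -1, "hybrid": 2},
    "vol_lt_10":         {"on_device":  1, "cloud":  3, "hybrid": 1},
    "vol_10_1000":       {"on_device":  2, "cloud":  1, "hybrid": 3},
    "vol_gt_1000":       {"on_device":  3, "cloud": -2, "hybrid": 2},
    "privacy_sensitive": {"on_device":  3, "cloud": -2, "hybrid": 2},
    "offline_required":  {"on_device":  3, "cloud": -3, "hybrid": 1},
}

def score_workload(traits):
    """Sum the Figure 1 scores for a workload; the highest total wins."""
    totals = {"on_device": 0, "cloud": 0, "hybrid": 0}
    for trait in traits:
        for target, score in SCORES[trait].items():
            totals[target] += score
    return max(totals, key=totals.get), totals

# Archetype 1 (real-time camera filter): small model, tight latency budget,
# thousands of inferences per session, privacy-sensitive input.
best, totals = score_workload(
    ["model_lt_500m", "latency_lt_100ms", "vol_gt_1000", "privacy_sensitive"])
# best == "on_device"
```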
Four archetypes, priced out
To make the framework concrete, here are four archetype workloads and the answer the scoring gives for each.
Archetype 1: Real-time camera filter (Snapchat-style)
Small model (under 50M params), latency budget under 30 ms, thousands of inferences per session, privacy-sensitive input. On-device wins unambiguously. Cloud is not fast enough at 30 ms over a mobile network, and the cost at 1000 inferences per user per session would be prohibitive. Typical deployment: Core ML on iOS, LiteRT with GPU delegate on Android, INT8 quantised.
Archetype 2: Large language model assistant (ChatGPT-style)
Large model (>7B params), moderate latency budget (500-2000 ms acceptable), variable inferences per user per day, mixed privacy. Cloud wins. Even 2026 flagship phones cannot run a 70B model, and the current 7B on-device options are still marginal on latency for conversational use. Typical deployment: AWS Inferentia2 or GCP TPU v5e with vLLM or a similar serving stack.
Archetype 3: Keyboard autocomplete
Small model (under 100M params), latency budget under 50 ms, hundreds of inferences per session, highly privacy-sensitive. On-device wins, with emphasis. The SwiftKey / Gboard / Apple Keyboard lineage has used on-device inference for years, and the 2026 version is a distilled transformer in the 80-150M parameter range, quantised and fused. Cloud is not an acceptable option because of both latency and the ethics of sending keystrokes off-device.
Archetype 4: Food-tracking app (mixed workload)
A product that needs vision-model inference on captured photos, fast response, modest privacy sensitivity, and nutrient-database lookup against a large catalog. This is the archetypal case for a hybrid architecture, and hybrid architectures — on-device vision inference plus cloud nutrient lookup — are increasingly common in food-tracking apps such as PlateLens. The vision model runs on the phone's NPU for latency and battery reasons; the nutrient database lookup runs in the cloud because it is larger than any phone can reasonably hold and because database updates need to propagate without forcing a client update. The division pushes the most expensive cloud operation (heavy vision inference on large images) to free on-device compute, while keeping the expensive-to-maintain data (the nutrient database) in a single centralised place.
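A minimal sketch of that split, with `run_local_vision_model` as a stub standing in for the on-device NPU model (Core ML or LiteRT in a real app); all names here are illustrative, not from any real SDK:

```python
import json

def run_local_vision_model(photo_bytes: bytes):
    """Stub for the on-device vision model; returns (label, confidence)."""
    return "apple", 0.93

def build_lookup_request(photo_bytes: bytes) -> bytes:
    """On-device half of the hybrid split: the heavy vision inference stays
    local, and only a compact JSON label crosses the network to the cloud
    nutrient service, not the multi-megabyte image."""
    label, confidence = run_local_vision_model(photo_bytes)
    return json.dumps({"label": label, "confidence": confidence}).encode()

payload = build_lookup_request(b"\x00" * 2_000_000)  # a ~2 MB photo
# payload is a few dozen bytes instead of 2 MB of upload per inference
```

The design point is the payload size: the expensive upload from the data-transfer section above disappears, and only the cheap-to-transmit classification result reaches the centralised database.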
What is changing in 2026-2027
Three trends are worth tracking. First, Apple's rumoured ANE v3 (said to ship with the A19 Pro in September 2026) is targeting only a modest uplift over the A17 Pro's 35 TOPS, a smaller generational jump than the marketing will suggest, but one accompanied by better quantisation tooling and a larger on-device model-size envelope. Second, Qualcomm's 2026 NPU roadmap is aggressive enough that the "Android NPU is weaker" story from 2023 is no longer true. Third, on the cloud side, AWS's Trainium2 (which, unusually for AWS, is also competitive on inference for small-batch workloads) is starting to undercut Inferentia2 on models that benefit from Trainium's memory bandwidth.
None of these trends upsets the framework. They shift the boundary between "on-device" and "cloud" upward by perhaps a generation (the 7B-on-device line becomes a 10B-on-device line), but the underlying economic decision remains the same.
What we recommend
For new ML-heavy consumer apps shipping in 2026, we recommend designing for hybrid from day one. The cost of bolting on on-device inference later is higher than the cost of designing for it initially; the cost of bolting on cloud fallback to a 100%-on-device product is higher than the cost of designing cloud as an option from the start. Ship on-device for the hot path, cloud for the cold path, measure which workloads actually fall where, and move the boundary later.
For existing cloud-only apps, a 2026 audit is worth doing. We have seen several teams find that roughly 40-60% of their inference cost is for calls small enough to move on-device with zero user-visible latency penalty. The savings from that kind of rebalance can easily exceed six figures a year at modest traffic, and the engineering cost is typically a single engineer-quarter.
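The savings arithmetic, as an illustrative helper; the bill and movable fraction below are hypothetical inputs, not measurements:

```python
def annual_savings_usd(monthly_cloud_bill_usd: float,
                       movable_fraction: float) -> float:
    """Annual saving from moving a fraction of inference calls on-device,
    assuming their cloud cost disappears (on-device marginal compute is free)."""
    return monthly_cloud_bill_usd * movable_fraction * 12

annual_savings_usd(25_000, 0.5)   # = $150,000/yr at a $25k/month bill
```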
Further reading
- Edge ML inference: iPhone vs Android TFLite benchmarks — the hardware-level measurements behind this economic analysis.
- Production-scale vision transformers: cost per inference in 2025 — the cloud-side baseline.
- Deploying Vision Transformers on mobile: a 2023 retrospective — on the on-device side of the same question.