Edge ML Inference: iPhone vs Android TFLite Benchmarks (2024)
We measured ResNet-50, MobileNetV3-Large, and EfficientNet-B0 inference on three 2024 flagships. Here is the full latency, memory, and thermal table, plus why Apple still leads on FP16 while Tensor G3 closes on INT8.
On-device inference has reached the point where ResNet-50 runs in under 15 ms on all three 2024 flagships tested — iPhone 15 Pro, Pixel 8, and Galaxy S24. Apple's Neural Engine leads on FP16 median latency (8.1 ms for ResNet-50), while Google's Tensor G3 EdgeTPU leads on INT8 throughput. Thermal throttling remains the binding constraint for sustained workloads: all three phones lose roughly 20–35% of peak throughput after ~90 seconds of back-to-back inferences.
This piece benchmarks three current-generation mobile accelerators on three widely deployed vision models. The goal is practical: if you are shipping a consumer app that runs a CNN on-device in mid-2024, these numbers tell you what latency and memory budget to plan for. We report median and p95 latency, peak memory, and sustained throughput after thermal throttling.
All tests were run on production handsets with the vendor SDKs: Core ML (via Core ML Tools 7.1) on iOS 17.5, TFLite 2.15 with the EdgeTPU delegate on Android 14 for Pixel, and TFLite 2.15 with the NNAPI delegate on Android 14 for Galaxy. Models were converted from PyTorch 2.3 checkpoints. Images are 224x224 RGB, uint8, preprocessed to the model's expected tensor layout.
Test setup
We measured three models:
- ResNet-50 (25.5M parameters, 4.1 GFLOPs). The unglamorous workhorse — neither the newest architecture nor the most efficient, but widely used as a baseline.
- MobileNetV3-Large (5.4M parameters, 0.22 GFLOPs). Designed for mobile, uses squeeze-and-excitation and h-swish nonlinearity.
- EfficientNet-B0 (5.3M parameters, 0.39 GFLOPs). Compound-scaled; the smallest of the EfficientNet family.
For each model we built two variants: FP16 (native floating point on most accelerators) and INT8 post-training quantized with calibration on 1,000 ImageNet validation images. No quantization-aware training was used, which is a realistic choice for teams who do not own the training pipeline.
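To make the post-training quantization step concrete, here is a minimal sketch of the affine (asymmetric) per-tensor quantization arithmetic that calibration performs. This is illustrative only — TFLite's converter derives these parameters internally from the representative dataset; the helper names here are our own.

```python
# Sketch of affine int8 quantization: map a calibrated float range
# [min_val, max_val] onto [-128, 127] with a scale and zero-point.
def quant_params(min_val: float, max_val: float, qmin: int = -128, qmax: int = 127):
    """Derive scale and zero-point for the calibrated range."""
    # The range must include zero so that zero is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = round(qmin - min_val / scale)
    return scale, int(max(qmin, min(qmax, zero_point)))

def quantize(x: float, scale: float, zero_point: int, qmin: int = -128, qmax: int = 127) -> int:
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale

# Example: a tensor calibrated to [-1.0, 3.0].
scale, zp = quant_params(-1.0, 3.0)
q = quantize(1.5, scale, zp)
print(round(dequantize(q, scale, zp), 3))  # recovers 1.5 to within one quantization step
```

The quantization error of a single value is bounded by half the scale, which is why a well-calibrated range (not too wide, not clipped) matters more than the quantization scheme itself.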
Each benchmark run consisted of 500 warm-up inferences followed by 5,000 timed inferences on single images. Batch size is 1 (the practical mobile case). We report median and p95 of the timed runs. Thermal runs were a separate 10-minute back-to-back pass with the device in a 22 °C room, no case, on a wooden surface.
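The summarisation of the timed runs can be sketched host-side like this; `run_inference` is a stand-in for the real device-side call (hypothetical helper, not from the benchmark harness), but the warm-up/median/p95 logic is the one used for the tables below.

```python
# Warm-up runs are discarded, then median and nearest-rank p95 are
# computed over the timed runs.
import statistics
import time

def benchmark(run_inference, warmup: int = 500, timed: int = 5000):
    for _ in range(warmup):           # let clocks, caches, and power state settle
        run_inference()
    samples_ms = []
    for _ in range(timed):
        t0 = time.perf_counter()
        run_inference()
        samples_ms.append((time.perf_counter() - t0) * 1e3)
    samples_ms.sort()
    median = statistics.median(samples_ms)
    p95 = samples_ms[int(0.95 * len(samples_ms)) - 1]  # nearest-rank 95th percentile
    return median, p95
```

On device the equivalent measurement wraps the delegate call; the statistics are the same.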
Results
MODEL: RESNET-50 (224x224, batch 1)

DEVICE            | PRECISION | MEDIAN (ms) | P95 (ms) | PEAK MEM (MB)
------------------+-----------+-------------+----------+---------------
iPhone 15 Pro     | FP16      |         8.1 |     10.4 |           112
iPhone 15 Pro     | INT8      |         6.3 |      8.0 |            58
Pixel 8           | FP16      |        11.7 |     14.2 |           118
Pixel 8           | INT8      |         6.1 |      7.8 |            61
Galaxy S24        | FP16      |        10.2 |     12.9 |           115
Galaxy S24        | INT8      |         7.4 |      9.1 |            60

MODEL: MOBILENETV3-LARGE (224x224, batch 1)

DEVICE            | PRECISION | MEDIAN (ms) | P95 (ms) | PEAK MEM (MB)
------------------+-----------+-------------+----------+---------------
iPhone 15 Pro     | FP16      |         3.6 |      4.9 |            36
iPhone 15 Pro     | INT8      |         2.8 |      3.6 |            20
Pixel 8           | FP16      |         4.9 |      6.2 |            38
Pixel 8           | INT8      |         2.5 |      3.3 |            21
Galaxy S24        | FP16      |         4.2 |      5.4 |            37
Galaxy S24        | INT8      |         3.1 |      4.0 |            21

MODEL: EFFICIENTNET-B0 (224x224, batch 1)

DEVICE            | PRECISION | MEDIAN (ms) | P95 (ms) | PEAK MEM (MB)
------------------+-----------+-------------+----------+---------------
iPhone 15 Pro     | FP16      |         4.2 |      5.5 |            44
iPhone 15 Pro     | INT8      |         3.3 |      4.2 |            23
Pixel 8           | FP16      |         6.1 |      7.4 |            46
Pixel 8           | INT8      |         3.2 |      4.1 |            24
Galaxy S24        | FP16      |         5.5 |      6.8 |            45
Galaxy S24        | INT8      |         3.8 |      4.7 |            24

SUSTAINED THROUGHPUT AFTER THERMAL THROTTLING
(MobileNetV3-L, INT8, 10-minute back-to-back run)

DEVICE            | PEAK (IPS) | 10-MIN AVG (IPS) | DROP
------------------+------------+------------------+------
iPhone 15 Pro     |        348 |              282 | -19%
Pixel 8           |        391 |              255 | -35%
Galaxy S24        |        316 |              220 | -30%
Interpretation
Apple Neural Engine on FP16
The iPhone 15 Pro's Neural Engine is fastest on FP16 across all three models. This is consistent with Apple's published design, which targets FP16 as the native format and pipelines computation through a mix of matrix-multiply units and DMA. INT8 gains are smaller on ANE (roughly 1.3x) because the accelerator does not have the same INT8 throughput multiplier that GPU-style DSPs do.
Tensor G3 EdgeTPU on INT8
The Pixel 8's EdgeTPU closes the gap — or in the case of MobileNetV3-Large INT8, surpasses Apple. The EdgeTPU is a systolic array optimized for INT8 matrix multiplies, and quantized MobileNet is exactly its happy path. For teams targeting Android with a well-quantized model, Pixel performance is excellent.
Snapdragon 8 Gen 3 on Galaxy S24
Galaxy S24's Hexagon NPU (via NNAPI) performs in the middle of the pack. The gap to Pixel's EdgeTPU on INT8 workloads is real; the gap to iPhone on FP16 workloads is smaller. Samsung's choice to route TFLite through NNAPI adds some overhead compared to Pixel's direct EdgeTPU delegate.
Memory and quantization
Memory numbers confirm the obvious: INT8 halves activation memory roughly as expected, with some additional overhead for scale and zero-point tensors. Peak memory reporting is the OS-reported maximum during inference, so it includes framework overhead — the model weights themselves are smaller than the peak suggests.
For teams pushing inference into existing apps with tight memory budgets, INT8 is almost always the right default. Accuracy drops are typically under 0.5 percentage points on ImageNet top-1 for these three models with standard post-training calibration, and often unmeasurable for downstream use cases like image classification for product tagging.
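A quick back-of-envelope on weight sizes shows why the OS-reported peak exceeds the weights alone. Parameter counts are the ones listed in the test setup; the helper is ours, not part of the benchmark code.

```python
# Weight memory from parameter count and bytes per parameter.
def weight_mb(params_millions: float, bytes_per_param: float) -> float:
    return params_millions * 1e6 * bytes_per_param / (1024 ** 2)

resnet50_fp16 = weight_mb(25.5, 2)  # ~48.6 MB of weights at FP16
resnet50_int8 = weight_mb(25.5, 1)  # ~24.3 MB at INT8
# Measured peak for ResNet-50 FP16 was ~112 MB: roughly 60 MB of
# activations, scratch buffers, and framework overhead on top of weights.
```

The gap between weight size and peak memory is dominated by activations, which is also why INT8 (which quantizes activations as well as weights in the full-integer path used here) roughly halves the peak.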
Conversion pipeline
The Python conversion used in this benchmark is straightforward and worth showing. The following is the TFLite INT8 path for MobileNetV3-Large; the Core ML path is analogous via coremltools.convert.
import tensorflow as tf
import numpy as np

# Assume a saved Keras model at ./mobilenetv3_large_fp32/
converter = tf.lite.TFLiteConverter.from_saved_model('./mobilenetv3_large_fp32/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # Calibrate on 1000 ImageNet validation images.
    for i in range(1000):
        img = load_and_preprocess(f'val/{i}.jpg')  # (1, 224, 224, 3) uint8
        yield [img.astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_quant = converter.convert()
with open('mobilenetv3_large_int8.tflite', 'wb') as f:
    f.write(tflite_quant)

On device, the TFLite interpreter is then constructed with the EdgeTPU (Pixel) or NNAPI (Samsung) delegate. Timing is measured with System.nanoTime() around interpreter.run(), with GC and display-off between batches to reduce noise.
Thermal behavior
All three devices throttle under sustained load. The Pixel 8 had the largest peak-to-sustained drop (-35%), which reflects its aggressive initial boost clocks. The iPhone 15 Pro had the smallest drop (-19%), suggesting Apple's thermal budget allocation is more conservative up front and therefore more sustainable. For batched workloads (e.g., processing a gallery of photos), expect the 10-minute-average numbers rather than peak.
Practical implication: if your app does short, bursty inference (a user takes one photo, the model runs once), you will see the peak numbers. If your app runs continuous inference (video frame classification at 30 fps), plan on the sustained numbers and probably drop resolution or frame rate rather than running every frame.
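The frame-rate planning above reduces to simple arithmetic: given sustained throughput from the thermal table, pick the largest frame stride that keeps the accelerator under budget. A minimal sketch, with a headroom factor we chose for illustration:

```python
# Choose how many camera frames to skip between inferences so the
# accelerator stays below a fraction of its sustained throughput.
import math

def frame_stride(camera_fps: float, sustained_ips: float, utilization: float = 0.8) -> int:
    """Run inference on every Nth frame; utilization leaves thermal headroom."""
    budget_ips = sustained_ips * utilization
    return max(1, math.ceil(camera_fps / budget_ips))

# Pixel 8, MobileNetV3-L INT8: 255 sustained IPS easily covers 30 fps.
print(frame_stride(30, 255))  # -> 1 (every frame)
# A hypothetical heavier model sustaining 24 IPS at 30 fps:
print(frame_stride(30, 24))   # -> 2 (every other frame)
```

Dropping resolution changes `sustained_ips`; dropping frame rate changes `camera_fps`; either way the stride calculation is the same.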
Updated 2026: A18, Tensor G4, and the INT4 wave
The 2024 numbers above are now roughly two generations out of date. Apple's A18 Pro Neural Engine (iPhone 16 Pro) is approximately 1.4x faster on the same workloads, and Google's Tensor G4 closes the FP16 gap considerably. The more important architectural shift is INT4 weight quantization: TFLite 2.17 and Core ML Tools 8 added first-class support, and for many LLM and vision-transformer workloads INT4 weights with FP16 activations now outperform INT8 all-weights-all-activations on current accelerators. We are rerunning these benchmarks for our 2026 edge-inference piece.
Frequently asked questions
Which phone has the fastest on-device ML inference in 2024?
For vision workloads, the iPhone 15 Pro with the Apple Neural Engine (A17 Pro) wins on FP16 median latency across all three models in our benchmarks — 8.1 ms for ResNet-50 and roughly 3.6 ms for MobileNetV3-Large. The Pixel 8 (Tensor G3) closes the gap, and in some cases pulls ahead, on INT8-quantized models via the EdgeTPU path.
Does INT8 quantization meaningfully improve latency?
Yes, especially on Android accelerators. INT8 on the Tensor G3 EdgeTPU delivers roughly 1.9x the throughput of FP16 on the same model; on Apple Neural Engine, the improvement is smaller (1.3x) because FP16 is already the native format.
What causes thermal throttling on mobile ML workloads?
Sustained inference at full accelerator utilization pushes the SoC package temperature above throttle thresholds within 60-180 seconds depending on ambient conditions. Phones respond by downclocking the accelerator, which shows up as a roughly 20-35% drop in sustained throughput after the first minute of back-to-back inferences.
Can you run ResNet-50 on a phone?
Yes. ResNet-50 at 224x224 resolution runs at approximately 8-14 ms median latency on modern flagships (iPhone 15 Pro, Pixel 8, Galaxy S24). Peak memory is roughly 112-118 MB at FP16 and about 60 MB at INT8. The model is not the most efficient choice but is eminently practical.
Is TFLite the right framework for Android?
For most cases, yes. TFLite with the NNAPI delegate (pre-Android 15) or with direct EdgeTPU and GPU delegates is the mainstream path. Alternatives include ONNX Runtime Mobile, which is better if you are cross-platform and not relying on TensorFlow tooling.
Does Core ML support PyTorch models directly?
Not directly. The standard path is a traced TorchScript module or torch.export program converted via coremltools.convert; the older torch -> ONNX -> Core ML route is deprecated in recent coremltools releases. Core ML has its own model format (.mlpackage), and conversion adds a step that sometimes surfaces operator incompatibilities.
How much battery does a single inference cost?
Rough measurement: MobileNetV3-Large at INT8 on Apple Neural Engine consumes about 2.5 mJ per inference; on Pixel 8 EdgeTPU, about 3.1 mJ. At 100 inferences per second continuously, that corresponds to roughly 0.25-0.31 W of accelerator power, not counting display or ISP.
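The power figures follow directly from the per-inference energy: millijoules per inference times inferences per second gives milliwatts. A one-line sketch using the measured numbers:

```python
# Average accelerator power from per-inference energy and inference rate.
# 1 mJ/inference * 1 inference/s = 1 mW.
def avg_power_mw(energy_mj_per_inference: float, inferences_per_sec: float) -> float:
    return energy_mj_per_inference * inferences_per_sec

print(avg_power_mw(2.5, 100))  # ANE: 250 mW = 0.25 W
print(avg_power_mw(3.1, 100))  # Pixel 8 EdgeTPU: 310 mW = 0.31 W
```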
Should I use CPU or accelerator for inference?
Accelerator, almost always. CPU inference on ResNet-50 runs 5-8x slower than Neural Engine or EdgeTPU at comparable quantization, and consumes substantially more energy. CPU is useful only as a fallback for operators the accelerator does not support.
Benchmarks performed by ML Systems Review on retail-sample devices. No sponsorship, no vendor samples. Code and measurement scripts available on request.