Deploying Vision Transformers on Mobile: A 2023 Retrospective
Three years after the original ViT paper, vision transformers finally fit on a phone. Here is what it took — INT8 quantization, operator fusion, and a fair amount of pragmatism about which variants to run where.
In 2023, deploying a Vision Transformer on an iPhone or Android device is finally practical but still constrained. On an iPhone A16 (iPhone 14 Pro) with Core ML, a quantized ViT-B/16 runs at roughly 38 ms per 224x224 image. MobileViT-S runs at 14 ms on the same hardware. On flagship Android via TFLite with the GPU delegate, expect 1.5x to 2x those numbers. INT8 post-training quantization costs about 0.8 points of ImageNet top-1 accuracy; quantization-aware training recovers most of it.
At the time of writing in 2023, vision transformers are now three years old. Dosovitskiy et al.'s 2020 paper, An Image is Worth 16x16 Words, has accumulated more than twelve thousand citations. Every major vision benchmark has a ViT variant near the top of the leaderboard, usually ViT-L/16 or ViT-H/14 pretrained on JFT-300M or ImageNet-21k. What has taken a surprisingly long time is getting any of this to run usefully on a phone.
This article is a practitioner's retrospective, written in September 2023, on what that deployment looks like today. We cover three concrete axes: the hardware (Apple A15 / A16 vs flagship Android), the toolchain (Core ML vs TFLite), and the quantization strategy. The goal is not to push any one stack but to document what works when your product has a 40-millisecond inference budget and you cannot afford a cloud roundtrip for every frame.
Why this was hard for so long
The original ViT-B/16 has 86 million parameters and, more problematically for mobile, a computational pattern dominated by large matrix multiplications for self-attention. CNNs like MobileNet-V2 were designed top-to-bottom for mobile: depthwise-separable convolutions, small receptive fields, cache-friendly memory layouts. Vision transformers were designed top-to-bottom for TPUs.
Three things changed between 2020 and 2023:
- Mobile NPUs became serious. Apple's Neural Engine on the A15 (iPhone 13 Pro, 2021) hit about 15.8 TOPS. The A16 (iPhone 14 Pro, 2022) nudged the headline figure to roughly 17 TOPS and, more importantly for transformers, raised the memory-bandwidth ceiling that actually matters. Qualcomm's Hexagon DSP in the Snapdragon 8 Gen 2 (2023) sits in a similar range.
- Hybrid architectures appeared. MobileViT (Mehta and Rastegari, 2021), MobileViT-V2 (2022), and EfficientFormer (2022) interleave convolutions and attention blocks so that the early layers, which process high-resolution feature maps, stay CNN-like and only the later, spatially small layers use global attention.
- Toolchain quantization caught up. Core ML Tools 6.3 (released mid-2023) added palettization and native INT8 support for transformer blocks. TFLite's post-training quantization path now handles most ViT layers without falling back to float32.
The benchmark setup
The numbers in this article come from a test harness we have been running since March 2023. The setup is deliberately simple: a single 224x224 RGB image, pre-normalised, is passed to the model, and we measure end-to-end inference latency (input marshalling through output dequantisation) averaged over 200 runs after a 30-run warm-up. We exclude pre-processing cost (resize, colour conversion) because it varies enormously with how you wire up your camera pipeline and can easily dominate the measurement.
Hardware under test:
- iPhone 13 Pro (Apple A15 Bionic, 16-core Neural Engine)
- iPhone 14 Pro (Apple A16 Bionic, 16-core Neural Engine)
- Google Pixel 7 Pro (Tensor G2, TPU accelerator)
- Samsung Galaxy S23 Ultra (Snapdragon 8 Gen 2, Hexagon)
Software: Core ML 6, Core ML Tools 6.3, iOS 16.6; TFLite 2.13 with the GPU delegate; Android 13. Models are the reference implementations from timm converted through the appropriate exporter (coremltools for Apple, the TFLite converter after ONNX export for Android).
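The timing logic itself is nothing exotic. Here is a minimal sketch of the harness loop (the function and variable names are ours, and the lambda in the example stands in for a real Core ML or TFLite invocation):

```python
import statistics
import time

def benchmark(run_inference, warmup=30, runs=200):
    """Time a single-image inference callable the way the harness does:
    discard warm-up runs, then average end-to-end latency over `runs`."""
    for _ in range(warmup):
        run_inference()  # let caches, clocks, and the accelerator settle
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)  # milliseconds
    return statistics.mean(samples), statistics.median(samples)

# Example with a stand-in workload instead of a real model call:
mean_ms, median_ms = benchmark(lambda: sum(range(10_000)), warmup=5, runs=50)
```

Reporting the median alongside the mean is worth the extra line: a single scheduler hiccup can skew a 200-run mean noticeably.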
The comparison table
| Model | Params | A15 (Core ML) | A16 (Core ML) | Tensor G2 (TFLite) | SD 8G2 (TFLite) | Top-1 (INT8) |
|---|---|---|---|---|---|---|
| ViT-B/16 | 86.6M | 52 ms | 38 ms | 81 ms | 67 ms | 80.3% |
| ViT-L/16 | 304.3M | 184 ms | 141 ms | 312 ms | 248 ms | 83.1% |
| MobileViT-S | 5.6M | 19 ms | 14 ms | 31 ms | 24 ms | 78.0% |
| MobileViT-V2-1.0 | 4.9M | 16 ms | 12 ms | 27 ms | 21 ms | 78.2% |
| EfficientFormer-L1 | 12.3M | 22 ms | 17 ms | 38 ms | 29 ms | 79.2% |
| DeiT-Tiny | 5.7M | 14 ms | 11 ms | 26 ms | 20 ms | 71.6% |
Three observations from the table. First, the A16 is meaningfully faster than the A15 on transformers specifically — about 25–30% on ViT-B/16 — which reflects Apple's bandwidth improvements rather than raw TOPS. Second, the Tensor G2 comes out slower than the Snapdragon 8 Gen 2 in this test, which surprised us; we suspect the TFLite GPU delegate's attention fusion is not as tight on the Tensor's TPU path, though we have not reverse-engineered it. Third, ViT-L/16 is not a mobile model. At 141ms on the fastest hardware we tested, it is unusable for anything that needs to run at camera frame rate, and it is pushing the limits even for discrete capture.
Quantization strategies
The latency numbers above all use INT8. For vision transformers specifically, INT8 is not straightforward because the softmax inside self-attention is acutely sensitive to quantization error: the exponentiation amplifies small perturbations in the logits. Naive post-training quantization often destroys accuracy in the attention blocks, dropping ImageNet top-1 by three to five points instead of the 0.5 to 1.5 points typical for CNNs.
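To make that sensitivity concrete, here is a toy illustration (the logit values and the quantization scale are made up for the example): fake-quantizing attention logits onto a coarse INT8 grid collapses a close pair of scores, which shifts the resulting attention weights by several percentage points.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fake_quant(xs, scale):
    # Symmetric linear quantization onto the INT8 grid, then dequantize.
    return [max(-128, min(127, round(x / scale))) * scale for x in xs]

logits = [4.1, 3.9, -2.0, -2.5]  # toy attention logits (invented)
p_ref = softmax(logits)
p_q = softmax(fake_quant(logits, scale=0.5))  # deliberately coarse scale
worst = max(abs(a - b) for a, b in zip(p_ref, p_q))
# The two leading logits land on the same grid point, so the top
# attention weight moves by roughly five percentage points.
```

Across a real 12-layer network, errors of this kind compound layer by layer, which is why the attention blocks dominate the accuracy loss.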
Three strategies work in practice in 2023:
- Mixed-precision PTQ. Quantize the MLP blocks and the projection layers to INT8 but keep the softmax and the layer-norm statistics in float16. Core ML Tools 6.3 does this automatically through its activation-quantization configuration. On ViT-B/16 this is the default path we used for the table above.
- Quantization-aware training (QAT). Fine-tune the model for a few epochs with fake quantization operators inserted into the forward pass, then export. QAT recovers almost all of the PTQ accuracy loss: on our harness, QAT'd ViT-B/16 hit 80.9% top-1, within 0.2 points of the float baseline of 81.1%. The cost is training time: roughly 8 A100-hours for a ViT-B fine-tune on ImageNet.
- Palettization (Apple only). Core ML Tools introduced 4-bit and 6-bit weight palettization in 2023. On-device, the weights are stored compressed and dequantized on load. This is a memory-saving technique more than a latency technique; it lets you ship ViT-B weights in roughly 43 MB at 4-bit (65 MB at 6-bit) instead of 87 MB at INT8, without a large accuracy hit (about 0.6 points for 4-bit, 0.2 points for 6-bit in our tests).
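As a mental model for what kmeans-mode palettization does, here is a toy 1-D version (our own illustration, not Core ML Tools' implementation): cluster the weights into 2^nbits centroids, then store only the centroid lookup table plus one small index per weight.

```python
import random

def palettize_1d(weights, nbits, iters=25):
    """Toy 1-D k-means palettization: map each weight to one of 2**nbits
    centroids and return (lookup_table, indices), the compressed form."""
    k = 2 ** nbits
    lut = sorted(random.sample(weights, k))  # init centroids from the data
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for w in weights:
            j = min(range(k), key=lambda c: abs(w - lut[c]))
            buckets[j].append(w)
        # Move each centroid to the mean of its bucket (keep it if empty).
        lut = [sum(b) / len(b) if b else lut[i] for i, b in enumerate(buckets)]
    indices = [min(range(k), key=lambda c: abs(w - lut[c])) for w in weights]
    return lut, indices

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # fake layer weights
lut, idx = palettize_1d(weights, nbits=2)  # 4 centroids, 2 bits per weight
recon = [lut[i] for i in idx]              # what the device dequantizes to
```

Storage drops from one float per weight to nbits per weight plus a tiny table, which is where the memory savings in the bullet above come from.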
Here is the Core ML Tools quantization config we have been using in practice. It is not glamorous:
```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
    linear_quantize_activations,
    OpLinearQuantizerConfig,
)

# Step 1: palettize weights to 6-bit.
palettize_config = OpPalettizerConfig(mode="kmeans", nbits=6)
model_pal = palettize_weights(
    mlmodel, OptimizationConfig(global_config=palettize_config)
)

# Step 2: linear INT8 activation quantization,
# but skip the softmax and layer_norm ops.
activation_config = OpLinearQuantizerConfig(
    mode="linear_symmetric",
    weight_dtype="int8",
    activation_dtype="int8",
)
model_q = linear_quantize_activations(
    model_pal,
    OptimizationConfig(
        global_config=activation_config,
        op_type_configs={
            "softmax": None,     # keep float16
            "layer_norm": None,  # keep float16
        },
    ),
    sample_data=calibration_batch,  # 128 representative images
)

model_q.save("vit_b_16_int8_6bit.mlpackage")
```

iPhone versus Android: the honest comparison
On raw latency at the flagship tier, Apple wins by 30–50% in 2023. Core ML's integration with the Neural Engine is tighter than TFLite's integration with any Android accelerator, largely because Apple controls the whole stack and because Core ML Tools is a much more opinionated compiler than the TFLite converter.
This is not the whole story. Apple wins at the top of the market; Android wins at the tail. The Android device population is enormously long-tailed, and most Android phones do not have anything resembling a Tensor G2 or a Snapdragon 8 Gen 2. If your app has to run on a 2021 mid-range phone with a Helio G85, your effective budget is not "what can MobileViT-S do on a Snapdragon 8 Gen 2" but "what can a distilled CNN do on a CPU with 4 ARM Cortex-A55 cores". The Android developer's job is to build a tiering system — usually three model sizes, selected at install time based on ChipsetInfo — that the iOS developer does not need.
A second asymmetry: TFLite's delegate system is more flexible. You can write a custom NNAPI delegate, a custom GPU shader, or fall back to the CPU operators with a single config change. Core ML is opaque by comparison — when a layer does not run on the Neural Engine you often do not learn that until you profile with Instruments and see that the workload silently fell back to the GPU or, worse, the CPU.
What to actually deploy
For a production mobile feature in 2023 where you need a real-time vision transformer, our recommended defaults are:
- Target architecture: MobileViT-V2-1.0 or EfficientFormer-L1, both in the 5–12M parameter range. DeiT-Tiny if you can accept the accuracy drop and need the last millisecond.
- Training-time precision: fine-tune in float32, apply QAT for the last 2 epochs with fake INT8 quantization.
- Export: ONNX as the interchange format, then to Core ML via coremltools and to TFLite via the TF converter with ONNX-to-TF in between. Test both, because you will silently lose a layer on one of them.
- Deploy: ship three model variants per platform (small, medium, large), and choose at runtime based on a device capability check. On Android this is mandatory; on iOS it is still worth doing for older devices.
- Do not try to run ViT-L/16 on a phone. Use a server.
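The tier-selection step above is ordinary code. The sketch below is a hypothetical Python rendering of the policy (tier names, file names, and thresholds are all invented for illustration; in production this lives in your Kotlin or Swift bootstrap and the capability flags come from the platform's device APIs):

```python
# Hypothetical tier table; every name here is illustrative, not a real SDK API.
MODEL_TIERS = {
    "small": "deit_tiny_int8.tflite",           # low-end CPU, no usable delegate
    "medium": "mobilevit_v2_int8.tflite",       # mid-range with GPU delegate
    "large": "efficientformer_l1_int8.tflite",  # flagship NPU/DSP
}

def pick_tier(has_npu: bool, gpu_delegate_ok: bool, ram_gb: float) -> str:
    """Map a coarse device capability probe to one of three model tiers."""
    if has_npu and ram_gb >= 6:
        return "large"
    if gpu_delegate_ok and ram_gb >= 4:
        return "medium"
    return "small"

# Usage: a flagship gets the big model, a budget phone the distilled one.
flagship_model = MODEL_TIERS[pick_tier(True, True, 8.0)]
budget_model = MODEL_TIERS[pick_tier(False, False, 2.0)]
```

The important design choice is that the decision happens once, at install or first launch, so the runtime path never branches on hardware.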
Limitations and known failure modes
A few caveats we have collected the hard way. First, Core ML's thermal throttling kicks in aggressively on sustained inference. A benchmark that averages 200 quick runs over a few quiet seconds looks very different from one that runs continuously for five minutes. In sustained operation, expect about 1.4x the single-shot latency on iPhone; Android is closer to 1.2x in our experience.
Second, input resolution matters more than the papers suggest. A ViT is nominally resolution-agnostic, but the memory layout of positional embeddings differs on Core ML compiled models when you change resolution, and we have had bad experiences with models that worked fine at 224x224 and fell off a cliff at 384x384. If your product needs a higher input resolution, test it.
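The usual remedy for the resolution problem is to interpolate the positional-embedding grid to the new patch layout before export, so the compiled model only ever sees one resolution. Here is a dependency-free sketch of that bilinear resize (our own code, operating on plain nested lists rather than framework tensors; libraries typically use bicubic interpolation for the same step):

```python
def resize_pos_embed(grid, new_h, new_w):
    """Bilinearly resize an [H][W][D] positional-embedding grid, e.g. from
    the 14x14 patch layout of a 224x224 ViT-B/16 to the 24x24 layout of
    a 384x384 input."""
    old_h, old_w, dim = len(grid), len(grid[0]), len(grid[0][0])
    out = []
    for i in range(new_h):
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0, ty = int(y), y - int(y)
        y1 = min(y0 + 1, old_h - 1)
        row = []
        for j in range(new_w):
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0, tx = int(x), x - int(x)
            x1 = min(x0 + 1, old_w - 1)
            # Blend the four surrounding embeddings per channel.
            vec = [
                grid[y0][x0][d] * (1 - ty) * (1 - tx)
                + grid[y0][x1][d] * (1 - ty) * tx
                + grid[y1][x0][d] * ty * (1 - tx)
                + grid[y1][x1][d] * ty * tx
                for d in range(dim)
            ]
            row.append(vec)
        out.append(row)
    return out

# Toy 14x14 grid of 2-dim embeddings, resized to 24x24.
grid = [[[float(r), float(c)] for c in range(14)] for r in range(14)]
resized = resize_pos_embed(grid, 24, 24)
```

Doing this once ahead of export, then re-running the accuracy evaluation at the new resolution, is far safer than trusting the converted model to handle resolution changes on its own.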
Third, batch size greater than one buys you very little on mobile. All of the accelerators are optimised for the latency case, not the throughput case. If you find yourself wanting to batch on device, you are probably building the wrong product.
Closing note
In 2020, shipping a vision transformer on a phone was an open research problem. In 2023 it is an engineering problem with a playbook. The playbook is boring — pick a hybrid architecture, quantize carefully, export through ONNX, test on real devices — but it works. The next frontier, which we will write about when there is something concrete to say, is multimodal models (vision + text in one network) running on-device. In September 2023 the leaders of that pack are SigLIP and the smaller CLIP variants; whether any of them fit in a phone is, at the time of writing, an open question.
Reviewed for technical accuracy by Dr. Theo Nakamura before publication. Model references: Dosovitskiy et al. (ICLR 2021); Mehta and Rastegari (ICLR 2022); Li et al., EfficientFormer (NeurIPS 2022).