ML Systems Review

Production-Scale Vision Transformers: Cost Per Inference in 2025

How much does it cost to run a ViT in production in 2025? A detailed look at GPU economics, batching, quantization, and the managed-versus-self-hosted tradeoff.

MLOps
By Priya Ramachandran, MS. Reviewed by Dr. Nadia Volkov, PhD.
11 min read
TL;DR

Vision Transformer inference cost in 2025 is dominated by GPU choice, batch size, and quantization. For ViT-B and ViT-L under about 1M inferences per day, managed SageMaker and Vertex endpoints on NVIDIA L4 instances are the most cost-effective option. Above that, self-hosted Triton on reserved A100s or H100s wins by 40-70%. INT8 quantization roughly doubles throughput for negligible accuracy loss on most classification tasks.

Vision Transformers are in production. The 2021 ViT paper (Dosovitskiy et al., ICLR 2021) was met with polite skepticism about whether attention could actually beat convolutions for vision at scale; four years later, most serious image-understanding pipelines have a ViT in them somewhere. The engineering question for 2025 is no longer "does it work" but "what does it cost."

This piece is a cost-per-inference analysis across three common ViT sizes (ViT-B/16, ViT-L/16, ViT-H/14), three GPU options (A100 80GB, H100 80GB, L4), and two deployment patterns (managed cloud endpoints and self-hosted Triton). We include a table of cost per 1M inferences at mid-2025 cloud pricing, plus the batching and quantization settings we used to get there.

Scope and assumptions

Numbers in this piece assume 384x384 RGB inputs — a common production resolution for fine-grained classification — and patch size 16 for ViT-B and ViT-L, patch size 14 for ViT-H. Prices are based on on-demand pricing at AWS us-east-1, GCP us-central1, and Azure East US, snapshotted in July 2025. We report the median of the three clouds to smooth provider-specific noise.

Quantization is INT8 post-training quantization on A100 and L4, and FP8 on H100. Latency numbers are p50 single-request plus batched throughput at the GPU's peak batch, measured through NVIDIA Triton 24.07 with TensorRT 10.1 backends. We are deliberately reporting both p50 latency (what the user feels) and throughput (what the finance team feels); cost-per-inference falls out of throughput, but you cannot ignore latency because SLO violations force you off the throughput-optimal batch size.

GPU price and throughput

GPU                On-demand $/hr (median)   Reserved 1yr $/hr   Memory   Typical use
NVIDIA L4          $0.97                     $0.58               24 GB    Inference, ViT-B/L
NVIDIA A100 80GB   $3.67                     $2.20               80 GB    Inference, ViT-L/H
NVIDIA H100 80GB   $8.10                     $4.86               80 GB    Inference, ViT-H, FP8

Figure 1. GPU on-demand and 1-year reserved pricing, median across AWS, GCP, and Azure, July 2025. Reserved pricing assumes partial upfront commitment.

The headline is that L4s are remarkably cheap per hour. They are also the right shape for ViT-B and ViT-L inference — 24 GB of GDDR6 comfortably fits a ViT-L/16 at batch 32 in INT8 — as long as you do not need to co-locate a large decoder or serve very high batch sizes.
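A back-of-envelope check of that memory claim. Every size here is a rough assumption for illustration (parameter count, buffer multiplier, runtime overhead), not a measured footprint:

```python
# Back-of-envelope memory estimate for ViT-L/16 INT8 at batch 32 on an L4.
# All constants below are rough assumptions, not measured numbers.

def vit_l16_int8_memory_gb(batch: int, image: int = 384, patch: int = 16) -> float:
    params = 304e6                      # ViT-L/16 parameter count (approx.)
    weight_bytes = params * 1           # INT8: 1 byte per weight
    tokens = (image // patch) ** 2 + 1  # 576 patches + [CLS] token = 577
    hidden = 1024                       # ViT-L hidden width
    layers = 24
    # Rough activation footprint: assume ~4 live FP16 token-major buffers
    # per layer stay resident in the engine.
    act_bytes = batch * tokens * hidden * 2 * 4 * layers
    overhead = 2e9                      # CUDA context + TensorRT workspace (assumed)
    return (weight_bytes + act_bytes + overhead) / 1e9

print(round(vit_l16_int8_memory_gb(32), 1))  # roughly 5.9 GB, well under 24 GB
```

Even with generous overhead assumptions the estimate leaves most of the 24 GB free, which is why the L4 is "the right shape" until a co-located decoder eats the headroom.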

Throughput at batch

A single-request p50 latency number is misleading because production ViT deployments always batch. The table below shows throughput in images per second at the batch size that maximizes throughput subject to a 150ms p99 latency cap, which is a typical web-service SLO.

Backbone   Precision    L4 (img/s)   A100 (img/s)   H100 (img/s)
ViT-B/16   INT8 / FP8   1,420        3,980          7,650
ViT-L/16   INT8 / FP8   420          1,330          2,820
ViT-H/14   INT8 / FP8   OOM          610            1,510

Figure 2. Throughput (images per second) on 384x384 inputs at the maximum batch size meeting a 150ms p99 latency cap. Measured with Triton 24.07 + TensorRT 10.1. ViT-H on L4 runs out of memory at the SLO-feasible batch.
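The selection rule behind Figure 2 — maximize throughput subject to the p99 cap — can be sketched in a few lines. The linear latency model used in the example (fixed overhead plus per-image cost) is an illustrative assumption, not a fitted measurement:

```python
# Pick the batch size that maximizes throughput subject to a p99 latency SLO.
# p99_ms is a callable mapping batch size -> p99 latency in milliseconds.

def best_batch(p99_ms, candidates=(1, 2, 4, 8, 16, 32, 64), slo_ms=150.0):
    best = None
    for b in candidates:
        latency = p99_ms(b)
        if latency > slo_ms:
            continue  # SLO-infeasible batch, skip
        throughput = b / (latency / 1000.0)  # images per second
        if best is None or throughput > best[1]:
            best = (b, throughput)
    return best  # None if no candidate batch meets the SLO

# Hypothetical latency model: 8 ms fixed overhead + 0.7 ms per image at p99.
batch, imgs_per_s = best_batch(lambda b: 8.0 + 0.7 * b)
```

Under this toy model the largest candidate batch still clears the 150 ms cap, so throughput-optimal and SLO-feasible coincide; with a tighter SLO or a slower model the function falls back to a smaller batch, which is exactly the effect the article warns about.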

Reading the table from right to left: H100 dominates absolute throughput, but at 8.3x the hourly cost of an L4. For ViT-B the cost-per-inference is actually worse on H100 than on L4 because the model is small enough that the L4 is never truly compute-bound.

Cost per million inferences

Multiplying through: cost per 1M inferences = hourly rate × 10^6 / (throughput in img/s × 3600). We express the result in US dollars per million inferences, using 1-year reserved pricing, because any team serving production traffic should be on reserved instances by the time they care about this number.

Backbone   L4 $/1M   A100 $/1M   H100 $/1M   SageMaker (managed) $/1M
ViT-B/16   $0.11     $0.15       $0.18       $0.34
ViT-L/16   $0.38     $0.46       $0.48       $0.91
ViT-H/14   n/a       $1.00       $0.89       $1.76

Figure 3. Cost per 1 million inferences, self-hosted on 1-year reserved pricing versus AWS SageMaker on-demand endpoints. SageMaker column uses the nearest-equivalent ml.g6 (L4), ml.p4d (A100), and ml.p5 (H100) instance types.
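As a sanity check, the cost arithmetic can be reproduced directly from the Figure 1 reserved rates and the Figure 2 throughputs:

```python
# Cost per 1M inferences = hourly rate x 1e6 / (throughput x 3600).
# Inputs below are the article's Figure 1 reserved prices and Figure 2 throughputs.

def cost_per_million(hourly_usd: float, imgs_per_sec: float) -> float:
    return hourly_usd * 1e6 / (imgs_per_sec * 3600)

print(round(cost_per_million(0.58, 1420), 2))  # L4, ViT-B/16 -> 0.11
print(round(cost_per_million(0.58, 420), 2))   # L4, ViT-L/16 -> 0.38
```

Both values match the L4 column of Figure 3, and the same function reproduces the A100 and H100 columns from their reserved rates.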

Three conclusions fall out of the numbers. First, L4 is the efficient frontier for ViT-B and ViT-L if you can keep the GPU fed. Second, H100 overtakes A100 for ViT-H because A100 has no FP8 support. Third, SageMaker (and Vertex, and Azure ML — their numbers are within 15% of each other) costs roughly 2-2.5x what self-hosted Triton does at the same hardware tier. That gap is the price of having autoscaling managed for you.

The SageMaker-versus-Triton decision

The managed-versus-self-hosted question has a predictable break-even. Below roughly 1 million inferences per day, the engineering time to stand up Triton with autoscaling, health checks, and a metrics pipeline costs more than the SageMaker premium. Above 1 million per day, that premium starts to look like a full engineer's salary, and self-hosting pays for itself.

A sample self-hosted configuration, in Triton's config.pbtxt format, for a ViT-L/16 INT8 deployment on A100:

name: "vit_l16_int8"
platform: "tensorrt_plan"
max_batch_size: 64
input [
  {
    name: "pixel_values"
    data_type: TYPE_FP16
    dims: [ 3, 384, 384 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP16
    dims: [ 1000 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 16, 32 ]
  max_queue_delay_microseconds: 15000
}
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
optimization {
  cuda { graphs: true }
  execution_accelerators {
    gpu_execution_accelerator : [
      { name : "tensorrt"
        parameters { key: "precision_mode" value: "INT8" }
        parameters { key: "max_workspace_size_bytes" value: "4294967296" }
      }
    ]
  }
}

Two model instances on one GPU (the count: 2 instance group) plus a 15ms dynamic-batching window is the recipe we usually start with. It gives up a few percent of peak throughput in exchange for substantially better tail latency, which is the right tradeoff for user-facing endpoints.

When cloud economics change: long-running and geographically distributed workloads

The analysis above assumes steady traffic in a single region. Two variations change the answer materially.

First, highly spiky traffic. If your peak is 10x your average, reserved instances leave money on the table. L4 instances on Spot (AWS) or Spot VMs (GCP, formerly Preemptible) price at roughly 30-40% of on-demand, and are a good fit for ViT inference because the models cold-start in under 10 seconds from a warm cache.

Second, multi-region serving. The dominant cost becomes egress, not compute, for any workload that sends images cross-region. Deploying ViT instances in each traffic region plus a small cache of recent embeddings is usually cheaper than a single global endpoint, even accounting for the reserved-instance minimum footprint.
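The multi-region claim is easy to check numerically. The prices below are placeholder assumptions for illustration — roughly $0.02/GB cross-region egress, 0.3 MB per 384x384 JPEG, and a $0.58/hr reserved L4 per extra region — not quoted cloud rates:

```python
# Compare: one global endpoint (every image pays cross-region egress)
# versus per-region deployments (extra reserved instances, no transfer).
# All prices are illustrative assumptions, not quoted cloud rates.

def global_endpoint_daily(images: int, mb_per_image: float = 0.3,
                          egress_per_gb: float = 0.02) -> float:
    """Daily egress cost of shipping every image to one region."""
    return images * mb_per_image / 1024 * egress_per_gb

def per_region_daily(extra_regions: int, hourly: float = 0.58) -> float:
    """Daily cost of one reserved L4 in each additional region."""
    return extra_regions * hourly * 24

# 50M images/day arriving from 2 remote regions:
egress = global_endpoint_daily(50_000_000)   # roughly $290/day
regional = per_region_daily(2)               # roughly $28/day
```

At this volume the egress bill is an order of magnitude larger than two extra reserved L4s, which is the article's point: past modest scale, compute is cheap and moving images is not.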

Quantization effects

INT8 post-training quantization with calibration on 1,024 held-out images typically costs 0.2-0.6 percentage points of top-1 accuracy on ImageNet-class tasks for ViT-B and ViT-L. For fine-grained classification (iNaturalist, food categorization) the degradation is larger — we have seen 1.2-2.0 percentage points — and QAT becomes necessary. FP8 on H100 is generally within 0.1 percentage points of FP16 and does not need calibration.
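For readers unfamiliar with what PTQ actually does to the weights, here is a minimal sketch of symmetric per-tensor INT8 quantization — the basic scheme behind the accuracy numbers above — run on a random matrix of ViT-L MLP shape rather than a real checkpoint:

```python
import numpy as np

# Symmetric per-tensor INT8 quantize/dequantize: one scale per tensor,
# values rounded to [-127, 127]. A toy sketch, not a production calibrator.

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(1024, 4096)).astype(np.float32)  # ViT-L MLP shape
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by half the scale step
```

The worst-case per-weight error is half a quantization step, which is why classification heads barely notice; fine-grained tasks suffer because small logit margins are comparable to the accumulated rounding noise, and that is where QAT earns its cost.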

Updated 2026

The L4 price advantage held through 2025 and tightened further in early 2026 as GCP and AWS introduced spot-discounted L4 inventory specifically for inference workloads. H100 prices dropped roughly 22% on-demand between August 2025 and March 2026, closing the cost-per-inference gap on ViT-L considerably. The managed-versus-self-hosted break-even shifted upward to roughly 2M inferences per day. We cover the updated tradeoff in our 2026 economic analysis.

Conclusion

Cost per ViT inference in 2025 spans more than an order of magnitude depending on GPU choice, batch size, precision, and managed-versus-self-hosted. The efficient frontier is an L4 instance running Triton with INT8 quantization and dynamic batching, at roughly $0.11-0.38 per million inferences depending on backbone size. Moving to a managed endpoint is worth 2-2.5x the unit cost for teams who would otherwise need to hire an MLOps engineer to keep Triton up.

Frequently asked questions

What is the most cost-effective GPU for Vision Transformer inference in 2025?

For ViT-B and ViT-L at 224x224 or 384x384, the NVIDIA L4 is the best cost-per-inference option on major clouds. H100s win on throughput for ViT-H and for very large batch sizes, but their hourly price usually outruns the throughput gain for smaller backbones.

Should I use SageMaker, Vertex, or self-hosted Triton for ViT inference?

Below roughly 1 million inferences per day, managed endpoints are cheaper after accounting for on-call engineering. Above that threshold, self-hosted Triton on reserved instances typically wins by 40-70% per inference.

Does quantization change the cost analysis?

Substantially. INT8 post-training quantization typically doubles throughput and cuts memory by half with minimal accuracy loss on ViT backbones. FP8 on H100 achieves similar throughput gain with even smaller quality impact.

What about batching?

Batching is the biggest lever. Single-request latency of 35ms for ViT-L at batch 1 drops to roughly 4-6ms per request at batch 32 on an A100. Dynamic batching with a 10-30ms window is the standard Triton pattern.

Is TensorRT still the right optimizer for ViT in 2025?

For NVIDIA hardware, yes. TensorRT 10 added first-class support for attention kernel fusion on ViT blocks. ONNX Runtime with the TensorRT execution provider closes most of the remaining gap.

How does vLLM apply to vision models?

vLLM was designed for LLM inference with KV-cache focus. For pure ViT classification, vLLM adds complexity without throughput benefit. Use Triton or a custom server.

What is the cost difference between A100 and H100 for ViT?

H100s cost roughly 2.2x as much per hour as A100s in mid-2025. For ViT-B and ViT-L, throughput improvement is closer to 1.6-1.8x, so A100 wins on cost-per-inference. For ViT-H or FP8 workloads, H100 pulls ahead.

Should I consider AMD MI300X or Intel Gaudi for ViT inference?

In mid-2025 the tooling gap is the deciding factor. MI300X has competitive raw throughput but ROCm ViT kernel coverage trails CUDA. Gaudi-3 is cost-effective on AWS but requires more hand-holding than CUDA.