GPT-4o's Multimodal Architecture: What We Can Infer
GPT-4o reduced voice-mode latency to 232 ms by collapsing a three-stage pipeline into a single end-to-end network. We go through what OpenAI has actually disclosed — and what the public record lets us infer — about the architecture.
GPT-4o, introduced by OpenAI in May 2024, is described as an end-to-end multimodal model with unified tokenization for text, vision, and audio. The architectural claim that matters: training happens jointly across modalities rather than in a staged pipeline of separately trained components (Whisper for ASR, GPT-4 for reasoning, a TTS model for output). The practical effect is a median audio response latency of 232 ms, down from 2.8 seconds in the previous staged pipeline. This piece walks through what is publicly disclosed, what is inferable, and what remains speculative.
GPT-4o matters less for benchmark scores — which are modest improvements over GPT-4 Turbo — than for what it implies architecturally. When OpenAI announced it on May 13, 2024, the standout number was not accuracy; it was latency. Voice Mode dropped from 2.8 seconds of median response time (on GPT-4 Turbo) to 232 milliseconds (on GPT-4o), a roughly 12x reduction. That kind of step change does not come from training more; it comes from changing the pipeline shape.
This article is an architectural analysis, not a reverse-engineering exercise. We rely entirely on OpenAI's public disclosures — the May 2024 blog post, the GPT-4o system card, API pricing, and statements from OpenAI engineers in interviews and at conferences. Where we speculate, we flag it.
The old pipeline: why 2.8 seconds
Prior to GPT-4o, ChatGPT's voice mode was a three-stage pipeline: audio in, text in the middle, audio out.
PRE-GPT-4O VOICE PIPELINE (STAGED)
microphone ─▶ Whisper (ASR) ─▶ GPT-4 Turbo ─▶ TTS ─▶ speaker
                 ~400 ms          ~1.8 s      ~600 ms
Total median: ~2.8 s
Issues:
- three network roundtrips
- lossy at every modality boundary (tone, laughter, emphasis gone after ASR)
- no ability to interrupt mid-generation in a coherent way
GPT-4O VOICE PIPELINE (END-TO-END)
microphone ─────────▶ GPT-4o (one model) ──────▶ speaker
~232 ms
Joint tokenization of audio + text + vision tokens
Autoregressive generation emits audio tokens directly
Each stage boundary in the old pipeline was a point where information was thrown away. Whisper emitted text; the TTS at the end re-synthesized prosody with no access to the original speaker's tone. The GPT-4 Turbo stage operated on pure text and had no representation of the caller's laughter or pauses. Even if latency had been zero, voice mode would have felt flat.
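The serial-latency arithmetic and the lossy boundaries can be made concrete with a toy model. This is purely illustrative: the function and stage names are hypothetical, and the latencies are the approximate medians from the diagram above, not measurements.

```python
# Toy model of the staged pipeline: latencies add serially, and each
# "->" is a modality boundary where non-text signal is discarded.
# Stage names and numbers are illustrative, not OpenAI's APIs.

STAGE_LATENCY_MS = {"whisper_asr": 400, "gpt4_turbo": 1800, "tts": 600}

def staged_voice_turn(audio_in: bytes) -> tuple[bytes, int]:
    text = f"<transcript of {len(audio_in)} bytes>"  # ASR: tone, laughter lost here
    reply_text = f"<reply to: {text}>"               # LLM sees text only
    reply_audio = reply_text.encode()                # TTS re-invents prosody
    total_ms = sum(STAGE_LATENCY_MS.values())        # stages run back to back
    return reply_audio, total_ms

_, latency = staged_voice_turn(b"\x00" * 16000)
print(latency)  # 2800, matching the ~2.8 s median
```

Because each stage must finish before the next begins, the total is the sum of the parts; no amount of per-stage optimization changes the shape of that sum.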
What OpenAI disclosed about GPT-4o
From the May 2024 announcement post and the subsequent system card, the following are directly stated:
- GPT-4o is a single neural network trained end-to-end across text, vision, and audio.
- All inputs and outputs are processed by the same model.
- Median audio response latency is 232 ms; average is 320 ms.
- The model performs comparably to GPT-4 Turbo on English text and code benchmarks, with improvements on non-English text and on vision benchmarks.
- Pricing at launch: $5 per million input tokens, $15 per million output tokens, later reduced to $2.50 / $10.
Not disclosed: parameter count, training data volume, training compute, the specific tokenization scheme for audio, the number of audio tokens per second, the context window split between modalities, or whether a KV cache can be reused across modalities.
Inferable architecture
Given the disclosures, the most defensible architecture sketch is:
- A unified token stream. Text tokens (BPE-style), image patch tokens (likely a convolutional or ViT-style patchifier producing a fixed number of tokens per image), and audio tokens (likely from a learned neural audio codec).
- A single transformer stack that consumes all tokens. Cross-attention across modalities happens through self-attention within the stream, not through separate cross-modal layers.
- Modality-specific embedding and unembedding layers. Input embedding tables are modality-specific (text, image patch, audio codec codebook); output heads produce the right modality's tokens depending on context.
- Training on interleaved multimodal data — transcribed conversations with audio, captioned images, videos, etc. — such that attention learns cross-modal relationships natively.
The audio tokenizer is the most interesting open question. Published neural audio codecs that produce tokens at rates compatible with real-time inference include Encodec (Meta, 2022), SoundStream (Google, 2021), and DAC (Descript, 2023). These produce discrete codes at 50-75 Hz (tokens per second of audio) with a modest codebook size (1024-16384 per residual). A 30-second voice exchange at 75 Hz is 2,250 audio tokens — tractable for a GPT-4-class context window.
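The codec rates above translate directly into a token budget. A back-of-envelope sketch, assuming an Encodec-style codec; the 75 Hz rate and residual-codebook structure come from the published codec literature, not from any OpenAI disclosure:

```python
# Back-of-envelope audio token budget for a neural-codec tokenizer.
# Rates and residual levels are assumptions from Encodec/SoundStream/DAC.
import math

def audio_tokens(seconds: float, rate_hz: int = 75, residual_levels: int = 1) -> int:
    """Discrete codes emitted for a clip: rate x duration x residual codebooks."""
    return math.ceil(seconds * rate_hz) * residual_levels

print(audio_tokens(30))                      # 2250: a 30 s exchange
print(audio_tokens(600))                     # 45000: a 10 min call, still tractable
print(audio_tokens(30, residual_levels=4))   # 9000 if four residual codebooks are kept
```

Even the multi-codebook case stays well inside a GPT-4-class context window, which is what makes a single autoregressive stream over audio plausible at all.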
A plausible sketch of the token stream:
# Pseudocode for a GPT-4o-style multimodal token sequence.
# Speculative — reconstructs the simplest design consistent with OpenAI's
# disclosures. The special-token ids and the encode_audio/patchify stubs
# are placeholders so the sketch runs.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TextToken: id: int

@dataclass
class ImageToken: patch_id: int  # from a ViT-style patchifier

@dataclass
class AudioToken: codec_id: int  # from a neural audio codec (e.g. Encodec-style)

Token = Union[TextToken, ImageToken, AudioToken]

SPECIAL_USER_BEGIN, SPECIAL_USER_END = 1, 2      # hypothetical ids
SPECIAL_ASSIST_BEGIN, SPECIAL_ASSIST_END = 3, 4

def encode_audio(audio) -> List[int]:
    """Stub: a real neural codec emits ~75 discrete codes per second of audio."""
    return [0] * 75  # placeholder: one second's worth

def patchify(image) -> List[int]:
    """Stub: a real patchifier emits a fixed number (~256) of patch tokens."""
    return [0] * 256

def build_stream(turn) -> List[Token]:
    """A single conversational turn with audio+vision input and audio+text output."""
    stream: List[Token] = [TextToken(SPECIAL_USER_BEGIN)]
    stream += [AudioToken(c) for c in encode_audio(turn.user_audio)]   # ~75 Hz codec
    if turn.user_image is not None:
        stream += [ImageToken(p) for p in patchify(turn.user_image)]   # ~256 patches
    stream += [TextToken(SPECIAL_USER_END), TextToken(SPECIAL_ASSIST_BEGIN)]
    # During training, assistant audio + text are both present as targets.
    stream += [AudioToken(c) for c in encode_audio(turn.assistant_audio)]
    stream += [TextToken(SPECIAL_ASSIST_END)]
    return stream

To be clear: this sketch is not OpenAI's code. It is the simplest design consistent with what OpenAI has said publicly. The real system may differ — for example, it may use continuous audio embeddings in places rather than quantized tokens, or may have a separate low-latency "decoder head" that emits audio while the main stack is still reasoning about text. The 232 ms median latency is tight enough that something about the output path must be optimized beyond vanilla autoregressive decoding.
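One way to see why the output path matters: with chunked streaming decode, time-to-first-sound depends only on the first chunk, not on the full reply. A rough model, in which every number is an assumption for illustration rather than a disclosed figure:

```python
# Rough model of time-to-first-sound under chunked streaming decode.
# All parameter values are assumptions, not OpenAI measurements.

def time_to_first_audio_ms(tokens_per_chunk: int = 15,    # ~200 ms of audio at 75 Hz
                           ms_per_token: float = 10.0,    # assumed decode throughput
                           codec_decode_ms: float = 30.0  # assumed synthesis cost
                           ) -> float:
    # The listener hears the first chunk as soon as it is generated and
    # decoded; later chunks are produced while earlier ones play back.
    return tokens_per_chunk * ms_per_token + codec_decode_ms

print(time_to_first_audio_ms())  # 180.0
```

Under these assumed numbers, first audio lands in the same ballpark as the reported 232 ms median, whereas generating a full 10-second reply (~750 codec tokens at the same assumed speed) before synthesizing any sound would take several seconds.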
Comparison to CLIP-style staged architectures
CLIP, the 2021 OpenAI paper, trained a text encoder and an image encoder jointly to produce aligned embeddings. Most "multimodal LLMs" that followed (LLaVA, BLIP-2, Flamingo in various forms) used a variant of this pattern: a frozen or lightly fine-tuned vision encoder produces embeddings, which are projected into the language model's embedding space and prepended to the text tokens. Training is staged — vision encoder first, then fine-tune on multimodal instruction data.
The staged approach has three structural weaknesses:
- Representation mismatch. The vision encoder is trained to maximize alignment with text captions, not to maximize utility for downstream reasoning. Fine-tuning on instruction data partially fixes this, but the embedding space is still optimized for a contrastive objective, not a next-token objective.
- Information bottleneck. Whatever the vision encoder chose not to encode is unrecoverable later. End-to-end training learns to preserve what downstream reasoning actually needs.
- Compute overhead. Separate encoders mean separate activations, separate caches, separate inference paths. The GPT-4o unified approach is, at inference time, simpler.
Against that, the CLIP-style approach has a clean advantage: you can swap the vision encoder independently. GPT-4o's unified model does not let you upgrade just the vision capabilities; you retrain the whole thing. For OpenAI at their scale, this is fine. For smaller teams, staged architectures remain the more practical engineering choice.
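The staged pattern can be sketched at the shape level. A minimal, dependency-free illustration of a LLaVA-style projector, where a frozen vision encoder's patch embeddings are linearly mapped into the language model's embedding space and prepended to the text embeddings; all dimensions here are toy values, not from any published model:

```python
# Shape-level sketch of a LLaVA-style staged pipeline. The projector W is
# the only trained piece in stage 2; the vision encoder stays frozen.
# Dimensions are illustrative (real models: e.g. 1024 -> 4096).

def project(patches: list[list[float]], W: list[list[float]]) -> list[list[float]]:
    """Linear projector: (n_patches x d_vis) @ (d_vis x d_lm)."""
    return [[sum(p[i] * W[i][j] for i in range(len(p))) for j in range(len(W[0]))]
            for p in patches]

d_vis, d_lm = 4, 6
patches = [[1.0] * d_vis for _ in range(3)]   # 3 fake ViT patch embeddings
W = [[0.5] * d_lm for _ in range(d_vis)]      # trained projection matrix
visual_tokens = project(patches, W)
prompt = visual_tokens + [[0.0] * d_lm]       # prepend to text embeddings
print(len(prompt), len(prompt[0]))            # 4 6
```

The swap-ability advantage is visible here: replacing the vision encoder only requires retraining W, not the language model, which is exactly what GPT-4o's unified design gives up.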
Inference cost
The pricing trajectory tells part of the story. GPT-4 Turbo launched at $10 / $30 per million input/output tokens. GPT-4o launched at $5 / $15 — halved. OpenAI subsequently reduced GPT-4o pricing to $2.50 / $10. At the same time, OpenAI claimed parity or better with GPT-4 Turbo on most benchmarks.
The simplest explanation is that GPT-4o is cheaper to serve per token — either it has fewer parameters than GPT-4 Turbo, or it uses a more efficient architecture (MoE, speculative decoding, better kernels), or both. Without published numbers, the cause is speculation. But the direction is consistent with a well-engineered end-to-end multimodal model displacing a more expensive staged stack.
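The disclosed price points make the trajectory easy to quantify per request. The dollar figures below are the published rates; the 10k-input / 1k-output request shape is an arbitrary example:

```python
# Per-request cost at the disclosed API price points ($ per 1M tokens).

PRICES = {  # (input $/1M tok, output $/1M tok)
    "gpt-4-turbo (launch)": (10.00, 30.00),
    "gpt-4o (launch)":      (5.00, 15.00),
    "gpt-4o (late 2024)":   (2.50, 10.00),
}

def request_cost(model: str, in_tok: int, out_tok: int) -> float:
    cin, cout = PRICES[model]
    return (in_tok * cin + out_tok * cout) / 1_000_000

# An example long request: 10k tokens in, 1k tokens out.
for model in PRICES:
    print(model, round(request_cost(model, 10_000, 1_000), 4))
```

For this request shape the cost falls from $0.13 to $0.065 to $0.035 across the three price points, a better than 3x reduction while benchmark performance held at parity or better.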
What this means for builders
- End-to-end is the direction of travel. The GPT-4o result is unlikely to be the last joint-multimodal model. Expect Anthropic, Google, and the open-source community to ship variants through 2025-2026.
- Audio tokenization matters. If you are building voice products, the tokenizer choice (Encodec, DAC, proprietary) has first-order effects on latency, bandwidth, and output quality. This was invisible in the Whisper+TTS era.
- Staged architectures are not obsolete. For teams building on open-source stacks, LLaVA-style patterns still make sense. You don't have OpenAI's training budget, so you can't train end-to-end from scratch.
- Latency budgets are a feature. 232 ms vs. 2.8 s is not a speedup; it is a UX category change. Voice products built on the old pipeline will feel stale; those built on the new one will feel conversational.
Updated 2026: what has become public since
OpenAI released a formal GPT-4o Realtime API in late 2024 exposing audio streaming at the token level, along with additional documentation confirming the joint-modality training approach. The 2025 release of GPT-4.5 and subsequent o-series models built on the same multimodal trunk. Competing models — Google's Gemini 2.0 multimodal family, Anthropic's Claude models with vision, and open-source efforts like LLaVA-Next and Qwen-VL — have converged on similar end-to-end training. Audio token rates of 50–100 Hz with neural codecs are now the de facto industry standard. The specific GPT-4o parameter count remains undisclosed.
Frequently asked questions
What does the "o" in GPT-4o stand for?
The "o" stands for "omni." OpenAI has described GPT-4o as a single model trained end-to-end across text, vision, and audio, rather than a text model with bolted-on adapters for other modalities.
Is GPT-4o really one unified model?
OpenAI has publicly described GPT-4o as trained end-to-end across text, vision, and audio with a single neural network. The stronger claim — that there is literally no modality-specific encoder or decoder — is not specifically confirmed in the disclosures; what is confirmed is that training is joint rather than staged.
How is audio tokenized in GPT-4o?
OpenAI has not published the tokenizer details publicly. The plausible implementations (consistent with the published demos and latency numbers) are a neural audio codec — similar in spirit to Encodec or SoundStream — producing discrete tokens at rates in the tens to low hundreds per second of audio.
How is GPT-4o different from a CLIP-based pipeline?
CLIP-style pipelines use a separate vision encoder (typically a ViT) whose output embeddings are fused with a text decoder. Training is typically staged. GPT-4o is trained end-to-end with all modalities present during pretraining, which (per OpenAI) gives better cross-modal reasoning and latency at inference.
What is the audio latency of GPT-4o?
OpenAI has reported end-to-end audio response latency as low as 232 ms median and 320 ms average for GPT-4o Voice Mode, compared to 2.8s for the previous staged ASR+LLM+TTS pipeline on GPT-4 Turbo.
How much does GPT-4o cost per token?
At the November 2024 pricing OpenAI disclosed, GPT-4o cost approximately $2.50 per million input tokens and $10.00 per million output tokens via the API — roughly half the cost of GPT-4 Turbo at launch.
Does GPT-4o support video?
GPT-4o accepts video in the API only as a sequence of sampled frames. True streaming video input was not exposed through the public API at launch; the demos OpenAI showed during the announcement used real-time video alongside audio, but that capability remained demo-only at the time.
Has OpenAI published a GPT-4o paper?
No full technical report has been published for GPT-4o as of the date of this analysis. OpenAI released a system card and the May 2024 announcement post with high-level architectural claims. Most specific parameter counts, training data volumes, and tokenizer details are not publicly disclosed.
Sources: OpenAI's May 13, 2024 announcement post ("Hello GPT-4o"), the GPT-4o system card, OpenAI API documentation, and published neural audio codec literature (Encodec, SoundStream, DAC). All speculation is flagged as such.