ML Systems Review

Discord's Architecture: Why They're Migrating From Elixir to Rust

How 26 million concurrent connections, BEAM garbage collection, and a single read-states service pushed Discord toward a selective Rust rewrite.

Distributed Systems
By Priya Ramachandran, MS. Reviewed by Dr. Nadia Volkov, PhD.
11 min read
TL;DR

Discord runs one of the largest Elixir deployments in production, sustaining more than 26 million concurrent users on BEAM. But at that scale, garbage-collection pauses on hot services produced p99 latency spikes above 100 ms. Discord's engineering team selectively rewrote cache-heavy services — starting with read_states — in Rust, cutting tail latency to single-digit milliseconds while keeping Elixir for supervision-tree workloads.

Discord is often cited as the archetype of a successful Elixir-at-scale system. The company's Gateway and presence services have absorbed a decade of growth on top of the BEAM virtual machine and OTP supervision primitives. But in 2020, Discord's infrastructure team published a now-famous engineering post describing a rewrite of one specific service — read_states — from Elixir to Rust. That post has since been cited as evidence of a broader migration. The truth is narrower and more interesting: Discord's architecture is now deliberately polyglot, with Rust and Elixir each assigned to the workloads they handle best.

This piece walks through the engineering pressures that forced the decision, the specific failure modes of BEAM at Discord's scale, and the architectural pattern that other engineering teams might adopt — or deliberately avoid.

Discord's historical Elixir infrastructure

Discord's core real-time layer was built on Elixir from the start. The Gateway service, which maintains WebSocket connections from every connected Discord client, is a textbook OTP application: one supervised GenServer per session, plus a Registry for routing, plus a clustered process group for fan-out. At peak, a single Gateway node has been reported to hold on the order of 1 million open WebSocket connections, coordinated across clusters using Erlang distribution and :pg (the BEAM process group module).

For routing presence updates, channel fan-out, and typing indicators, OTP's abstractions are close to ideal. Each connection is a process; each process is cheap (roughly 340 machine words of base overhead on BEAM, about 2.7 KB on a 64-bit system); failures are isolated; supervisors restart crashed processes with declarative policy. Essentially no other language runtime in widespread use gives you this primitive set for free.

Discord's infra team has spoken publicly about scaling this stack past conventional limits. Among the tricks: replacing the default Registry with a custom ETS-backed router to reduce lock contention, using manifold (Discord's open-source library for multi-node message passing) to bulk-route messages to many processes at once, and writing Rust NIFs for hot loops like sorted-set operations on member lists.
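The routing shape described above can be sketched outside of BEAM as well. Below is a minimal fan-out registry in Rust using standard-library channels; the Router type, its fields, and the delivery counting are illustrative inventions, not Discord's code:

```rust
// Illustrative fan-out registry: sessions receive events for the
// guilds they belong to, mirroring the Registry + process-group shape.
use std::collections::HashMap;
use std::sync::mpsc;

type SessionId = u64;
type GuildId = u64;

struct Router {
    // session_id -> channel feeding that session's connection handler
    sessions: HashMap<SessionId, mpsc::Sender<String>>,
    // guild_id -> sessions subscribed to that guild
    members: HashMap<GuildId, Vec<SessionId>>,
}

impl Router {
    // Deliver one event to every live session in a guild; returns the
    // number of successful deliveries.
    fn fan_out(&self, guild: GuildId, event: &str) -> usize {
        let mut delivered = 0;
        for sid in self.members.get(&guild).into_iter().flatten() {
            if let Some(tx) = self.sessions.get(sid) {
                if tx.send(event.to_string()).is_ok() {
                    delivered += 1;
                }
            }
        }
        delivered
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let mut r = Router { sessions: HashMap::new(), members: HashMap::new() };
    r.sessions.insert(1, tx);
    r.members.insert(100, vec![1]);
    assert_eq!(r.fan_out(100, "typing"), 1);
    assert_eq!(rx.recv().unwrap(), "typing");
}
```

In the Elixir original, each session's channel endpoint is a supervised process and the membership map lives in Registry/ETS; the sketch only captures the data flow, not the fault isolation.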

Where BEAM ran out of headroom

BEAM's generational garbage collector is per-process. For millions of tiny, short-lived processes, this is a feature — collections are fast because each heap is small and independent. For a small number of very large, long-lived processes holding megabytes of state, it becomes a liability. The read_states service was exactly that shape.

read_states tracks, for every user in every channel, the last message the user has seen. Discord's design kept this in memory for low-latency lookup: each user's read-state record lives in a GenServer, and the hot working set is cached aggressively. At Discord's scale — hundreds of millions of users, billions of channel-user pairs — a single read-states node held tens of millions of records in process memory.
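To see why that working set strains a per-process collector, a back-of-envelope heap calculation helps. The record-count range comes from the text above; the 64-byte per-record size is an assumption for illustration only:

```rust
// Rough resident-memory arithmetic for one read-states node.
// bytes_per_record is an assumed figure (ids plus bookkeeping).
fn resident_gib(records: u64, bytes_per_record: u64) -> u64 {
    records * bytes_per_record / (1024 * 1024 * 1024)
}

fn main() {
    // Tens of millions of records at an assumed 64 bytes each.
    let gib = resident_gib(50_000_000, 64);
    println!("~{gib} GiB live in a single process heap"); // prints "~2 GiB live in a single process heap"
}
```

Whatever the exact constants, the conclusion is the same: multiple gigabytes of live data inside one process heap, which is the worst case for a generational collector that must eventually sweep the whole thing.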

The consequence was long major GC cycles. Discord's engineers reported that p99 response times for read-state lookups regularly exceeded 100 milliseconds, almost entirely attributable to major (full-sweep) collections, which pause the owning BEAM process for as long as it takes to sweep an oversized heap. Setting fullsweep_after to 0 and tuning heap growth did not close the gap; the working set was simply too large for BEAM's collector to traverse without visible pauses.

The scheduler tax on NIFs

Discord's first instinct was to accelerate the hot paths with Rust NIFs. This works for short computations but breaks down when the NIF is itself doing cache-friendly work over a large structure. A normal NIF monopolizes a BEAM scheduler thread for its entire execution. If your scheduler budget per NIF call is ~1 ms and the NIF routinely takes 5 ms, you starve other processes on that scheduler. Dirty NIFs (introduced experimentally in OTP 17, stable since OTP 20) move long-running native code to a separate scheduler pool, but at the cost of a cross-thread handoff, which dominates for short calls.

In practice, the best NIFs are either under 1 ms or explicitly dirty. For Discord's read-state lookups, neither mode was a clean fit: the work was medium-duration (a few ms) and extremely frequent (hundreds of thousands of calls per second per node), which put the scheduler math in an awkward spot.
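That awkward scheduler math is easy to make concrete. The call rate and per-call duration below are the illustrative figures from the text, not measured numbers:

```rust
// Back-of-envelope NIF scheduler occupancy: CPU-seconds of native
// work demanded per wall-clock second. A value of N means N scheduler
// threads would be fully saturated by NIF execution alone.
fn schedulers_needed(calls_per_sec: f64, nif_ms: f64) -> f64 {
    calls_per_sec * nif_ms / 1000.0
}

fn main() {
    // Hundreds of thousands of calls per second at a few ms each.
    let busy = schedulers_needed(200_000.0, 5.0);
    println!("scheduler-seconds per second: {busy}"); // prints "scheduler-seconds per second: 1000"
}
```

No realistic machine has a thousand schedulers, so at those assumed rates the work cannot stay inside normal NIFs at all; it has to move either to the dirty pool or out of the VM entirely.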

The Rust case

Rewriting read_states in Rust let Discord change three things at once: memory layout, memory management, and concurrency model.

  • Memory layout. Rust gave Discord direct control over struct alignment and cache-line packing. The read-state record was redesigned to fit a predictable number of entries per cache line, reducing L3 misses on the hot lookup path.
  • Memory management. No GC. Allocations happen deterministically; the system has no 100 ms tail events that must be papered over. The p99 latency in the new service dropped into the single-digit milliseconds.
  • Concurrency. Discord's Rust service uses Tokio's async runtime for I/O and a sharded in-memory store for state. Sharding replaces OTP's per-process isolation with lock striping: less elegant, but the fine-grained locks deliver predictable latency under high read load.
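The cache-line point in the first bullet can be made concrete with a repr(C) sketch; this struct layout is an assumption for illustration, not Discord's actual record:

```rust
// An illustrative packed read-state entry: two 8-byte snowflake IDs,
// so a 64-byte cache line holds a predictable number of entries.
#[repr(C)]
struct ReadStateEntry {
    channel_id: u64,
    last_message_id: u64,
}

fn main() {
    let size = std::mem::size_of::<ReadStateEntry>();
    println!("{} entries per 64-byte cache line", 64 / size); // prints "4 entries per 64-byte cache line"
}
```

The point is that Rust lets you state and verify this layout at compile time; on BEAM, term representation is owned by the VM and there is no equivalent knob.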

A simplified sketch of the shape that replaced the Elixir GenServer looks like this:

// Sketch of a sharded in-memory read-states store (Rust; the real
// service fronts this with Tokio for I/O). Type aliases stand in for
// Discord's snowflake IDs so the sketch compiles.
use dashmap::DashMap;
use std::sync::Arc;

type UserId = u64;
type ChannelId = u64;
type MessageId = u64;

pub struct ReadStatesShard {
    // user_id -> channel_id -> last_message_id
    inner: DashMap<UserId, DashMap<ChannelId, MessageId>>,
}

impl ReadStatesShard {
    pub fn ack(&self, user: UserId, channel: ChannelId, mid: MessageId) {
        self.inner
            .entry(user)
            .or_insert_with(DashMap::new)
            .insert(channel, mid);
    }

    pub fn get(&self, user: UserId, channel: ChannelId) -> Option<MessageId> {
        self.inner.get(&user).and_then(|m| m.get(&channel).map(|v| *v))
    }
}

// Fan-out across N shards keyed by user_id.
pub struct ReadStates {
    shards: Arc<[ReadStatesShard]>,
}

impl ReadStates {
    pub fn shard(&self, user: UserId) -> &ReadStatesShard {
        // Simple modulo routing; any stable hash of user_id works.
        &self.shards[(user % self.shards.len() as u64) as usize]
    }
}

The Elixir version it replaced kept a single GenServer per user, serialized through the mailbox. That is textbook BEAM, and on a smaller deployment it is the better design. At Discord's scale, the mailbox became a throughput bottleneck and the GenServer's working set became a GC liability at the same time.
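For contrast, the mailbox-serialized shape of that GenServer-per-user design can be rendered in Rust as a channel-fed actor thread; everything here is an illustrative sketch, not Discord's Elixir code:

```rust
// One actor per user: all reads and writes flow through a single
// channel (the "mailbox"), so they are processed strictly one at a
// time, just as a GenServer serializes its message queue.
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

type ChannelId = u64;
type MessageId = u64;

enum Msg {
    Ack(ChannelId, MessageId),
    Get(ChannelId, mpsc::Sender<Option<MessageId>>),
}

fn spawn_user_actor() -> mpsc::Sender<Msg> {
    let (tx, rx) = mpsc::channel::<Msg>();
    thread::spawn(move || {
        let mut state: HashMap<ChannelId, MessageId> = HashMap::new();
        // Every operation for this user runs here, sequentially.
        for msg in rx {
            match msg {
                Msg::Ack(ch, mid) => {
                    state.insert(ch, mid);
                }
                Msg::Get(ch, reply) => {
                    let _ = reply.send(state.get(&ch).copied());
                }
            }
        }
    });
    tx
}

fn main() {
    let user = spawn_user_actor();
    user.send(Msg::Ack(7, 42)).unwrap();
    let (rtx, rrx) = mpsc::channel();
    user.send(Msg::Get(7, rtx)).unwrap();
    assert_eq!(rrx.recv().unwrap(), Some(42));
}
```

The serialization is the feature and the flaw at once: it makes per-user state trivially race-free, but every read of a hot user's state queues behind every write, which is exactly the bottleneck the sharded lock-striped store removes.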

Selective migration, not a rewrite

Discord's engineering team has been explicit that this is not a migration away from Elixir. The pattern is: Rust for services that are heap-heavy, latency-critical, and benefit from explicit memory layout; Elixir for services that are connection-heavy, supervision-tree-shaped, and benefit from OTP.

                  DISCORD SERVICE ALLOCATION (SIMPLIFIED)

  ┌─────────────────────────────┐     ┌──────────────────────────────┐
  │       ELIXIR / BEAM         │     │            RUST              │
  │                             │     │                              │
  │  Gateway (WebSockets)       │     │  read_states                 │
  │  Presence / typing          │     │  Some gRPC control plane     │
  │  Session routing            │     │  Video / voice paths (parts) │
  │  Guild (server) state       │     │  Hot-path caches             │
  │                             │     │                              │
  │  Strengths:                 │     │  Strengths:                  │
  │  - millions of processes    │     │  - no GC pauses              │
  │  - supervision trees        │     │  - explicit memory layout    │
  │  - hot-code upgrade         │     │  - deterministic tail lat.   │
  └─────────────────────────────┘     └──────────────────────────────┘
              │                                       │
              └───────── Rustler NIFs / gRPC ─────────┘
Figure 1. Polyglot service allocation inside Discord's backend. Elixir and Rust are deliberately matched to workload shape.

Why the critics were wrong

The 2020 Rust post triggered a wave of "Elixir doesn't scale" takes that were not supported by the original article. Discord's authors were careful to say that the BEAM scheduler was not the bottleneck — the garbage collector was, and only for services with that specific memory shape. The broader Elixir layer continues to absorb connection growth without trouble.

A useful mental model: if your service's state fits comfortably inside one BEAM process and fits comfortably inside one generational GC cycle, Elixir is fine at essentially any concurrency level. If your state is enormous and long-lived inside a single process, you have a layout problem that no GC setting will solve, and Rust (or manual memory management in any language) is the right answer.

Lessons for other teams

  • Profile before you rewrite. Discord's rewrite was justified by a very specific p99 metric on a very specific service. Without that metric, the rewrite would have been speculative.
  • Match language to workload shape. BEAM is unmatched for supervision trees. Rust is unmatched for cache-friendly hot paths. Use both.
  • Hybrid stacks have real cost. Discord now has two build systems, two runtime profiles, and two deploy pipelines. Smaller teams should think carefully before taking on that overhead.
  • NIFs are not a substitute for a rewrite. If your native code is long-running, dirty NIFs add overhead and short NIFs starve the scheduler. A standalone service is often cleaner.

Updated 2026: what has changed

Discord has continued to publish engineering content describing expanded Rust usage inside voice and video infrastructure, and has invested in Rustler-based interop for cases where Elixir orchestrates Rust workers. OTP 27 (released May 2024) introduced improvements to the JIT and GC heuristics that narrow the gap for medium-heap processes, but the core architectural decision — Rust for heap-heavy, latency-critical services; Elixir for supervision-tree concurrency — remains in place. No public engineering post has reversed the 2020 read_states decision, and the opposite pattern (services moving into Elixir from elsewhere) has not been reported.

If you are reading this because your team is weighing a similar migration: the honest answer is that you probably don't need to. Discord's scale is not your scale. Measure your own p99, identify the specific subsystem that is on the critical path, and rewrite only that. Everything else is engineering theater.

Frequently asked questions

Is Discord abandoning Elixir?

No. Discord continues to run the majority of its real-time messaging layer on Elixir and the BEAM. The Rust migration is selective, targeting a small number of services where garbage-collection pauses and per-process memory overhead were on the critical path.

What was the read_states service and why did it hit BEAM limits?

read_states tracks which messages each user has seen per channel. At Discord scale it held millions of cached user records in memory, and BEAM generational GC caused sporadic 100+ ms pauses that showed up as spikes in tail latency. Rust gave Discord deterministic memory management and eliminated the pauses.

How many concurrent users does Discord support?

Discord has publicly reported sustained concurrency in the tens of millions and peaks above 26 million connected users on a single Elixir cluster, with session fan-out handled by Gateway nodes written in Elixir and Rust.

What is a NIF and why are NIFs risky in Elixir?

A NIF (Native Implemented Function) is a C or Rust function called directly from BEAM code. A NIF runs on a scheduler thread, so a slow NIF blocks that scheduler and everything queued on it. Rustler and dirty NIFs mitigate this, but the risk is why Discord moved high-throughput work out of NIFs into standalone Rust services.

Why not rewrite everything in Rust?

Elixir and OTP remain an excellent fit for Discord's supervision-tree-shaped concurrency: millions of lightweight processes, transparent clustering, and mature fault-tolerance primitives. Rewriting all of that in Rust would cost engineer-years without improving behavior where it already works.

What Rust libraries does Discord use?

Discord engineers have cited Tokio for async runtime, Tonic for gRPC, and custom memory-layout work for cache-friendly structures. They also use Rustler when interop with Elixir is required.

Did the Rust rewrite reduce tail latency?

Yes. Discord's 2020 engineering post on the read_states migration reported that p99 response time dropped from ~100 ms to single-digit milliseconds after moving off BEAM, largely by eliminating GC pauses.

Should my startup copy this migration?

Probably not. Discord's scale justifies the engineering investment. For most systems below a few hundred thousand concurrent sessions, Elixir, Go, or Node.js will run the workload comfortably without the complexity of a hybrid Rust/BEAM stack.


This analysis draws on Discord's published engineering writing, including the 2020 post "Why Discord is switching from Go to Rust" and the 2017 post "How Discord scaled Elixir to 5,000,000 concurrent users," along with public conference talks by Discord infrastructure engineers. No Discord proprietary data was used.