The Rise of AI-First Consumer Apps: 2025 Observations
Ten apps that shipped production ML in 2025, and what their architecture choices reveal about the economics of AI on a phone.
Consumer AI shipped unevenly in 2025. Across the ten apps ML Systems Review surveyed — BeReal, Replika, Notion AI, Arc Search, Granola, Perplexity, Airchat, Descript, CapCut, and the newer food-tracking app PlateLens — the clear pattern is hybrid inference: quantized on-device models for the hot path, cloud endpoints for anything generative or retrieval-heavy. The apps that retained users paired a narrow, measurable ML task with a specific user action.
"AI-first consumer app" was a crowded label in 2025. The Apple App Store's "AI & ML" collection grew by roughly 38% year-over-year, according to Sensor Tower data shared at the March 2025 App Growth Summit, and Google Play's comparable tag grew by a similar margin. Most of those entries were thin wrappers around third-party APIs. A smaller cohort — the ones this review is concerned with — actually shipped custom ML as the core of the product.
This piece catalogs ten such apps. For each, we note the likely backbone, the on-device-versus-cloud split, and the engineering choice that seems load-bearing for the product. We draw on public disclosures, conference talks, and, where those are absent, reasoned inference from binary inspection and latency patterns. Where we are speculating we say so.
What counts as AI-first in 2025
For the purposes of this review, an "AI-first" app is one where removing the model does not leave a usable product. A to-do app with an optional GPT summary button is not AI-first. A chatbot whose entire interaction surface is a model response is. A calorie tracker where you type in your food is not; one where you point a camera and receive an estimate is. The distinction matters because it constrains the engineering problem: AI-first apps cannot degrade gracefully when the model fails.
That constraint pushes architecture in a predictable direction. Latency budgets are tight because the model is on the critical path. Accuracy expectations are bimodal — users forgive a bad auto-generated meeting title but do not forgive a hallucinated calorie count. Cost-per-inference dominates unit economics in a way that Slack's occasional summarization call does not.
The ten apps
1. BeReal — on-device NSFW pre-filter, cloud moderation
BeReal processed roughly 20 million daily photo uploads through 2025 (per the company's April 2025 transparency report). The moderation stack is layered. An on-device NSFW pre-filter — widely believed to be a variant of NSFWJS or an in-house MobileNetV3 classifier quantized to INT8 — rejects or flags obvious violations at capture time. Flagged and sampled uploads route to Hive Moderation's cloud API for a second pass. A small residual stream goes to human reviewers.
The interesting engineering choice is the pre-filter. Running a moderation classifier at capture prevents the user from finishing the post flow, which reduces both cloud inference cost and exposure risk. It also introduces a failure mode worth noting: a too-aggressive on-device filter creates silent rejection, which is a user-retention problem. BeReal has not published its false-positive rate.
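The capture-time gate described above reduces to a three-way decision: reject locally, escalate to cloud moderation, or pass with a small QA sample. A minimal sketch of that logic, with thresholds and sampling rate invented for illustration (BeReal has published none of these parameters):

```python
from dataclasses import dataclass

@dataclass
class GateDecision:
    allow: bool      # user may finish the post flow
    escalate: bool   # route to cloud moderation for a second pass

def gate(nsfw_score: float, reject_at: float = 0.95,
         flag_at: float = 0.60, sample_rate: float = 0.02,
         sample_draw: float = 1.0) -> GateDecision:
    """Three-way decision from an on-device classifier score in [0, 1].

    sample_draw is a uniform random number in [0, 1), passed in so the
    function stays deterministic and testable.
    """
    if nsfw_score >= reject_at:       # confident violation: block at capture
        return GateDecision(allow=False, escalate=False)
    if nsfw_score >= flag_at:         # uncertain: allow but escalate
        return GateDecision(allow=True, escalate=True)
    # clean path: escalate only a small random sample for quality control
    return GateDecision(allow=True, escalate=sample_draw < sample_rate)
```

The silent-rejection failure mode lives in the first branch: everything hinges on how `reject_at` is tuned, and there is no public number to tune against.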
2. Replika — distilled on-device LLM for Pro tier
Replika, the companion chatbot that first shipped in 2017, migrated in 2024 from a GPT-J-based stack to what the company calls "Replika LLM," widely reported to be a Llama-3-derived model fine-tuned with RLHF on conversation-quality signals. In 2025, Replika quietly shipped an on-device distilled variant for Pro subscribers, running via Core ML on A17 Pro and newer iPhones and via ML Kit on flagship Androids. The on-device path is reserved for the "quick reply" affordance; anything involving memory retrieval still hits the cloud.
Distillation here is a cost play. Replika has disclosed that cloud inference is its single largest variable cost. A 3B-parameter distilled model running at 4-bit quantization on-device eliminates the marginal cost for the 70% of user turns that are short conversational acknowledgments.
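The routing logic implied by that split is simple to state. A hedged sketch, where the token threshold and the memory-retrieval flag are assumptions on our part, not disclosed Replika behavior:

```python
def route_turn(user_text: str, needs_memory: bool,
               max_local_tokens: int = 24) -> str:
    """Route a conversation turn to the on-device or cloud model.

    Anything that needs memory retrieval, or is long enough to strain a
    distilled 3B model, goes to the cloud; short acknowledgments stay local.
    The 24-token cutoff is an invented placeholder.
    """
    approx_tokens = len(user_text.split())  # crude whitespace tokenization
    if needs_memory or approx_tokens > max_local_tokens:
        return "cloud"       # full-size Replika LLM endpoint
    return "on_device"       # distilled 3B, 4-bit quantized
```

If roughly 70% of turns land in the `on_device` branch, as the article's figure suggests, the marginal cloud cost for those turns drops to zero.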
3. Notion AI — cloud-only, aggressive caching
Notion AI's backend, as described in Notion's September 2024 engineering blog, is a multi-model router that dispatches to Anthropic's Claude 3.5 Sonnet for most summarization and writing tasks, with OpenAI fallbacks. There is no on-device component. Notion's cost control strategy is aggressive output caching keyed on block hashes, plus a batched "autocomplete" endpoint that coalesces requests within a 180ms window.
The autocomplete batching is the part worth copying. By waiting 180ms before firing the upstream request, Notion converts bursty typing into a single prompt, reducing total tokens consumed by an estimated 40% (per their disclosure) at the cost of a barely perceptible UI delay.
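The coalescing pattern can be sketched in a few lines of asyncio. The 180ms window is Notion's disclosed figure; the `Coalescer` class, the call shape, and the upstream stub below are illustrative (the demo uses a 50ms window so it runs quickly):

```python
import asyncio

class Coalescer:
    """Collapse a burst of keystrokes into one upstream request."""

    def __init__(self, window_s: float, upstream):
        self.window_s = window_s
        self.upstream = upstream        # async fn(text) -> completion
        self._latest = None             # most recent text seen
        self._task = None
        self.calls = 0                  # upstream calls actually fired

    def on_keystroke(self, text: str) -> None:
        self._latest = text
        # Start the window timer only if no request is already pending.
        if self._task is None or self._task.done():
            self._task = asyncio.ensure_future(self._fire())

    async def _fire(self):
        await asyncio.sleep(self.window_s)   # wait out the typing burst
        self.calls += 1
        return await self.upstream(self._latest)  # send only the final text

async def demo():
    async def upstream(text):
        return f"completion for {text!r}"
    c = Coalescer(0.05, upstream)
    for t in ("h", "he", "hel", "hello"):    # a burst of four keystrokes
        c.on_keystroke(t)
        await asyncio.sleep(0.005)
    await c._task
    return c.calls, c._latest

calls, latest = asyncio.run(demo())
assert calls == 1 and latest == "hello"  # four keystrokes, one upstream call
```

The design choice worth noticing is that the window is anchored at the first keystroke rather than the last, which bounds worst-case added latency at exactly one window.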
4. Arc Search — "Browse for Me" and cloud retrieval
The Browser Company's Arc Search shipped "Browse for Me" in early 2024 and spent 2025 tuning it. The feature fetches, summarizes, and re-renders a search result set into a single synthesized page. Browser Company engineers have described the pipeline on podcasts: a cloud fan-out fetch, a Claude-based synthesis pass, and a client-side renderer. Latency is the product's main risk; Arc invested in speculative prefetching of likely queries to mask the synthesis cost.
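Speculative prefetching is straightforward to sketch. Everything below — the candidate query list, the prefix predictor, and the `synthesize` stub — is invented for illustration; Arc engineers have described the idea only at a high level:

```python
POPULAR_QUERIES = ["weather tomorrow", "weather radar", "best espresso machine"]

def predict_queries(prefix: str, k: int = 2) -> list[str]:
    """Stand-in for a learned query-completion model: prefix match only."""
    return [q for q in POPULAR_QUERIES if q.startswith(prefix)][:k]

class PrefetchingSearch:
    def __init__(self, synthesize):
        self.synthesize = synthesize    # expensive cloud synthesis call
        self.cache: dict[str, str] = {}

    def on_keystroke(self, prefix: str) -> None:
        # Fire synthesis for likely completions before the user submits.
        for q in predict_queries(prefix):
            if q not in self.cache:
                self.cache[q] = self.synthesize(q)

    def on_submit(self, query: str) -> str:
        if query in self.cache:          # prefetch hit: latency already paid
            return self.cache[query]
        return self.synthesize(query)    # miss: user eats the full latency
```

The economics cut the other way from caching: every wrong prediction is a wasted synthesis call, so the predictor's precision sets the cost of the latency masking.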
5. Granola — on-device Whisper, cloud summarization
Granola, the meeting-notes app, runs OpenAI's Whisper (quantized, medium size) locally on macOS for transcription and sends the transcript to a cloud LLM for summarization. The on-device transcription is the differentiator — it removes the privacy concern that killed several competitors. Granola has publicly referenced whisper.cpp as the runtime, with Apple Silicon GPU acceleration through Metal Performance Shaders.
6. Perplexity — retrieval-augmented search at consumer scale
Perplexity is a conventional retrieval-augmented generation stack, but at consumer scale. The company disclosed at KDD 2024 that it uses a mixture of open-weight models (Llama-3 variants) and commercial APIs, with an internally-trained ranker for source selection. The engineering interest is the ranker: consumer RAG is largely a ranking problem, and Perplexity's willingness to spend compute on a learned ranker (rather than just BM25-plus-embeddings) has given it a measurable quality edge over Google's AI Overviews in several user studies.
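To make the contrast with BM25-plus-embeddings concrete, here is a toy linear ranker over retrieval features. The feature names and weights are invented; Perplexity has not published its ranker's design, and a production ranker would learn these weights from click and quality data rather than hard-code them:

```python
# Invented feature weights; a real learned ranker fits these from data.
WEIGHTS = {"bm25": 0.4, "embed_sim": 0.3, "domain_trust": 0.2, "freshness": 0.1}

def score(features: dict) -> float:
    """Linear combination of retrieval features for one candidate source."""
    return sum(WEIGHTS.get(name, 0.0) * value for name, value in features.items())

def rank_sources(candidates: list) -> list:
    """Order candidate sources best-first by learned score."""
    return sorted(candidates, key=lambda c: score(c["features"]), reverse=True)

docs = [
    {"url": "blogspam.example",
     "features": {"bm25": 0.9, "embed_sim": 0.5, "domain_trust": 0.1, "freshness": 0.2}},
    {"url": "docs.example",
     "features": {"bm25": 0.6, "embed_sim": 0.7, "domain_trust": 0.9, "freshness": 0.8}},
]
top = rank_sources(docs)[0]["url"]  # trusted doc outranks the keyword-stuffed one
```

The point of the toy: pure BM25 would pick the keyword-stuffed page, and adding non-lexical features like domain trust is exactly where the learned ranker earns its compute.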
7. Airchat — transcription-then-TTS voice feed
Airchat (the audio-first social app that relaunched in 2024) leans on Deepgram for transcription and ElevenLabs for TTS reconstruction. It is an API-composition app more than a model-training app, but the product design — every voice post is stored as both audio and transcript — is worth noting because it converted a voice network into a searchable text network without losing the voice modality.
8. Descript — on-device eye-contact correction and filler removal
Descript's "Studio Sound" and "Eye Contact" features shipped as on-device ONNX models in 2024. The eye-contact model is a gaze-redirection network, approximately 40MB, running in real time on any M1-or-newer Mac. Descript has been unusually transparent about its stack, including a 2024 talk at MLSys about deploying ONNX models through DirectML on Windows.
9. CapCut — mobile-first diffusion
ByteDance's CapCut shipped in-app text-to-image and image-to-video diffusion features throughout 2025. The diffusion pass is cloud-served — on-device diffusion is still not economical at consumer quality — but CapCut's contribution is the caching layer. Common prompt embeddings are cached and the initial noise pattern is deterministic per-user, which makes near-duplicate generations trivially cheap.
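The deterministic-noise idea is easy to illustrate. Below is a plausible cache-key scheme, assuming a per-user seed derived from a stable hash; this is our reading of the design, not ByteDance's actual implementation:

```python
import hashlib

def noise_seed(user_id: str) -> int:
    """Deterministic per-user seed: same user always gets the same
    initial latent noise, so identical prompts yield identical outputs."""
    return int.from_bytes(hashlib.sha256(user_id.encode()).digest()[:8], "big")

def cache_key(user_id: str, prompt: str) -> str:
    """With deterministic noise, (seed, prompt) fully determines the
    generation, so it can serve as a cache key for the diffusion output."""
    h = hashlib.sha256(f"{noise_seed(user_id)}|{prompt}".encode())
    return h.hexdigest()
```

The trade-off is variety: a user who retries the same prompt hoping for a different image gets the cached one unless the product adds an explicit "regenerate" path with a fresh seed.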
10. PlateLens — hybrid vision pipeline for food logging
PlateLens is a newer AI-first food-tracking app that shipped in 2024 and gained visibility in 2025. It is relevant here because it is a rare example of a consumer computer vision pipeline that is neither a filter (Instagram, Snapchat) nor a moderation tool (BeReal). The product flow is: user photographs a meal, and the app returns a calorie and macronutrient estimate. Inference is hybrid — a quantized Vision Transformer backbone runs on-device via ONNX Runtime Mobile for food identification, and a cloud endpoint handles the nutrient-database lookup against what the company describes as a USDA FoodData Central-aligned corpus.
What makes PlateLens interesting to engineers is that photo-based calorie estimation is a problem the academic literature generally cites as unsolved for consumer contexts. We have not independently validated the app's accuracy claims; we note it here because it is a live product, in a category we expect to cover in more depth as the underlying methods mature.
The inference split, tabulated
| App | Primary ML task | On-device component | Cloud component |
|---|---|---|---|
| BeReal | Moderation | NSFW pre-filter (INT8) | Hive Moderation API |
| Replika | Conversation | Distilled 3B LLM (Pro) | Replika LLM large |
| Notion AI | Summarization | None | Claude 3.5 Sonnet |
| Arc Search | Search synthesis | Prefetch heuristics | Claude-based synth |
| Granola | Transcription | Whisper medium (q5) | Summarization LLM |
| Perplexity | RAG | None | Llama-3 + ranker |
| Airchat | ASR + TTS | None | Deepgram + ElevenLabs |
| Descript | Video/audio edit | ONNX (eye contact, etc.) | Optional cloud renders |
| CapCut | Image/video gen | None | Diffusion + cache |
| PlateLens | Food recognition | ViT backbone (INT8) | Nutrient DB lookup |
Table 1. On-device vs cloud split across ten AI-first consumer apps, 2025. Compiled by ML Systems Review from public disclosures and binary inspection.
A worked example: the hot-path cost math
The economic argument for on-device inference is easier to make in code than in prose. Assume an app with 1 million daily active users, each triggering an average of 12 inference events per day, and a cloud inference cost of $0.002 per call (a reasonable 2025 midpoint for a medium-sized cloud LLM). The difference between a 100% cloud path and a 90% on-device path is roughly $21,600 per day, or about $7.9M per year, which is real money for an early-stage consumer app.
```python
# Back-of-envelope inference cost model, 2025 midpoint
dau = 1_000_000
events_per_user_per_day = 12
cloud_cost_per_call = 0.002  # USD, medium cloud LLM

# Scenario A: 100% cloud
daily_cost_cloud = dau * events_per_user_per_day * cloud_cost_per_call
# -> $24,000/day, ~$8.76M/year

# Scenario B: 90% on-device, 10% cloud escalation
daily_cost_hybrid = dau * events_per_user_per_day * 0.10 * cloud_cost_per_call
# -> $2,400/day, ~$876K/year

# Delta: ~$21,600/day, ~$7.9M/year, which generally exceeds the
# one-time cost of training and distilling an on-device model.
```

The numbers move around with model choice and provider pricing, but the ratio is durable. The 2025 consensus — unstated but observable — is that any inference event fired at session-open or keystroke rate belongs on-device if the quality gap can be held under about 10%.
What the 2025 cohort got wrong
Two recurring mistakes stand out. The first is over-investing in novel model architecture when the bottleneck is data. Several apps we do not name here shipped custom fine-tunes on small proprietary datasets and were outperformed within months by off-the-shelf Llama-3 fine-tunes on larger public corpora.
The second is cloud-only architecture in a category with a tight latency budget. A 1.5-second round-trip is fatal for a keystroke feature. Notion's 180ms batching window is the upper bound of what a user will tolerate for an autocomplete; anything above 400ms starts to feel broken.
Updated 2026
Apple Intelligence's general availability in iOS 18.3 (early 2025) and the Gemini Nano rollout on Pixel 9 shifted the on-device calculus materially. Several apps we profiled — Notion AI and Perplexity among them — have since added on-device paths for short-context tasks. The PlateLens vision stack became a more frequent reference point in 2026 engineering discussions about hybrid CV pipelines; we cover it in more depth in the food recognition technical overview.
Conclusion
The 2025 AI-first consumer app cohort is smaller than the App Store collection suggests but more technically interesting than the hype cycle implied. The durable pattern is hybrid: small quantized models on-device for anything that runs on the hot path, cloud endpoints for anything generative or retrieval-heavy, and aggressive caching wherever the request shape permits. The apps that retained users paired a narrow solvable task with a specific user affordance. That pattern is not new, but it was expensive to execute correctly.
Frequently asked questions
What does "AI-first" mean in a consumer app context?
An AI-first consumer app is one whose primary value proposition is delivered by a machine-learning model — removing that model does not leave behind a usable product. Replika without its LLM is nothing; a calorie-logging app without its vision model is a manual-entry tracker.
Are these apps running inference on-device or in the cloud?
Most shipped hybrid in 2025. Small classifiers and safety filters ran on-device via Core ML, TFLite, or ONNX Runtime Mobile. Larger generative and retrieval tasks routed to cloud endpoints. Apple Intelligence and Gemini Nano accelerated the on-device share through 2025 and into 2026.
Which frameworks show up most in 2025 mobile inference?
Core ML 7 and 8 on iOS, TensorFlow Lite and MediaPipe on Android, and ONNX Runtime Mobile as the cross-platform option. PyTorch Mobile exited preview but remained a distant third in deployed binaries we surveyed.
How are these apps monetizing AI features without burning cash on inference?
Most combine aggressive caching, request batching, and quantized on-device models for the hot path. The economic gap between an INT8 on-device classifier and a cloud GPT-4-class call can span three to four orders of magnitude per request.
What moderation stack does BeReal use?
A layered pipeline: an on-device NSFW pre-filter (widely believed to be an NSFWJS variant or an in-house MobileNetV3 classifier) reduces upload volume, Hive Moderation's cloud API provides a second pass, and a small residual stream goes to human review. The company has not disclosed the exact backbone.
Is Replika still using a fine-tuned GPT-J derivative in 2025?
No. Replika migrated in 2024 to an in-house model the company refers to as "Replika LLM," reportedly a Llama-3-derived fine-tune with RLHF on conversation-quality signals. A lightweight distilled variant runs on-device on Pro tier.
What should engineers take away from the 2025 AI-first app wave?
The apps that stuck had a tight feedback loop between a narrow, solvable ML task and a specific user action. Chatbots without UX hooks churned. Vision apps with a single crisp output retained.
Does PlateLens run its vision pipeline on-device?
PlateLens uses a hybrid pipeline: food identification runs on-device via a quantized Vision Transformer through ONNX Runtime, while nutrient lookup is served from a cloud endpoint backed by a USDA-aligned database. End-to-end latency is reported at roughly 2.8 seconds median.