Best low-latency voice platforms

Short answer. Cartesia leads on its sub-100ms speech claim, the tightest latency pitch here. Rime's Mist model is built for live phone agents, Deepgram claims sub-200ms voice, and OpenAI's speech-to-speech loop removes the transcription seam. Vapi places last because its speed depends on whichever parts you wire in. All figures are vendor claims, not our measurements.

By Voxrater.

Some links here are affiliate links, and we may earn a commission.

At a glance

Ranked by our editorial read of fit for this job, best first. The tables below are the sourced numbers behind that call.

What each one costs

| Platform | All-in /min | All-in /min, approx. conversions | Headline /min | Cheapest paid plan | Narration /1k chars |
|---|---|---|---|---|---|
| Cartesia | $0.08–0.15 | €0.07–0.13, £0.06–0.11, ₹7.66–14.36, R$0.40–0.75, A$0.11–0.21 | $0.06 | $4/mo | $0.04 |
| Rime | $0.04–0.10 | €0.03–0.09, £0.03–0.07, ₹3.83–9.57, R$0.20–0.50, A$0.06–0.14 | | Pay as you go | $0.03 |
| Deepgram | $0.08–0.18 | €0.07–0.15, £0.06–0.13, ₹7.66–17.23, R$0.40–0.90, A$0.11–0.25 | $0.08 | Pay as you go | $0.03 |
| OpenAI Realtime API | $0.07–0.48 | €0.06–0.41, £0.05–0.36, ₹6.70–45.94, R$0.35–2.41, A$0.10–0.67 | $0.18 | Pay as you go | |
| Vapi | $0.05–0.30 | €0.04–0.26, £0.04–0.22, ₹4.79–28.71, R$0.25–1.51, A$0.07–0.42 | $0.05 | Pay as you go | |

Our scores (editorial preview)

| Platform | Overall | Voice quality | Voice range | Ease of use | Value |
|---|---|---|---|---|---|
| Cartesia | 7.3 Strong | 9/10 | 8/10 | 5/10 | 8/10 |
| Rime | 6.6 Strong | 7/10 | 6/10 | 5/10 | 8/10 |
| Deepgram | 6.0 Capable | 7/10 | 4/10 | 4/10 | 8/10 |
| OpenAI Realtime API | 6.0 Capable | 8/10 | 5/10 | 4/10 | 7/10 |
| Vapi | 6.9 Strong | 8/10 | 9/10 | 5/10 | 7/10 |

Capabilities and compliance

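| Platform | Voices | Languages | SIP trunking | Warm transfer | Batch calling | HIPAA | SOC 2 | GDPR |
|---|---|---|---|---|---|---|---|---|
| Cartesia | 100+ | 40+ | Yes | No | No | No | No | No |
| Rime | 300+ | 7+ | No | No | No | Yes | No | No |
| Deepgram | 91+ | 7+ | No | No | No | Yes | Yes | Yes |
| OpenAI Realtime API | 10+ | 32+ | Yes | No | No | Yes | Yes | Yes |
| Vapi | | | Yes | Yes | Yes | Yes | Yes | Yes |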

Latency is the thing a voice agent lives or dies on, and it is the hardest thing to judge from a marketing page. You ring a number, you say something, and the half-second before the agent answers is the whole game. Too long and the person on the other end talks over it, gets confused, or hangs up. Get it tight enough and the call feels like a conversation instead of a walkie-talkie. So this page ranks the platforms whose pitch is built around speed, and it does so honestly, because the numbers you are about to read are vendor claims, not anything we have measured ourselves yet.

A quick plain-language note first, because “latency” gets thrown around to mean different things and the difference matters. When a vendor says “sub-100ms” they almost always mean time-to-first-audio, or its close cousin time-to-first-byte: how long after the text is ready before the first sound, or the first chunk of audio data, comes out of the voice model. That is one slice of the pipeline. What the caller actually feels is the whole round trip: the time to turn their speech into text, the time for the AI to work out a reply, the time to turn that reply back into a voice, and the network hops in between. A platform can win the voice-model slice and still feel slow if the rest of the stack drags. Keep that gap in mind for every number below.

Here is the trap this page exists to call out. The five platforms here all sit in the speed conversation, and on a spec sheet they all sound fast. But “fast” is measured at different points by each one, against different test conditions, on hardware you do not control, in regions that may not be yours. A sub-100ms text-to-speech claim and a sub-200ms time-to-first-byte claim are not the same measurement, and neither is the end-to-end latency a real caller hears. The whole point of ranking these is to be precise about what each claim covers and what it leaves out, because the single number on the homepage makes the question sound answered when it is not.

What actually decides perceived latency

Four things build up the delay a caller feels, and no single vendor controls all four.

  • Speech-to-text (STT). Turning the caller’s words into text the AI can read. This has to happen before the agent can even start thinking, so a slow transcriber adds delay at the very front of every turn. Some platforms own this; some make you bring it.
  • The language model. The AI working out what to say. This is often the biggest single chunk of the delay, and it depends on which model you run, how long your prompt is, and how much conversation history gets re-sent on each turn. A cheap fast model and an expensive slow one can differ by hundreds of milliseconds.
  • Text-to-speech (TTS). Turning the reply back into audio. This is the slice most “sub-100ms” claims are about. It is real and it matters, but it is the end of the pipeline, not the whole of it.
  • The network and the phone line. Where your servers sit relative to the caller, the carrier in the middle, the round trips over SIP or WebRTC. A platform with a brilliant engine in the wrong region loses to a duller one next door. This is the slice vendors talk about least because it is partly your problem, not theirs.

The speech-to-speech models change this picture, and we will come to them, but the rule holds: perceived latency is the sum of the pipeline, and the vendor only owns part of it. So read each entry below for which slice the claim covers, not just the headline.
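To make the "sum of the pipeline" point concrete, here is a minimal sketch of a per-turn latency budget. The class name, field names and every millisecond value are placeholders invented for illustration. None of them is a measurement or a vendor figure.

```python
from dataclasses import dataclass


@dataclass
class TurnBudget:
    """Per-turn delay in milliseconds for each slice of the pipeline."""
    stt_ms: float      # speech-to-text: finalising the caller's words
    llm_ms: float      # language model: time until the reply starts
    tts_ms: float      # text-to-speech: time-to-first-audio
    network_ms: float  # network and phone-line round trips

    def total_ms(self) -> float:
        # Perceived latency is the sum of every slice, not just the TTS one.
        return self.stt_ms + self.llm_ms + self.tts_ms + self.network_ms

    def tts_share(self) -> float:
        # Fraction of the round trip that the voice model accounts for.
        return self.tts_ms / self.total_ms()


# Illustrative placeholder numbers only, not measurements of any vendor.
fast_voice_slow_model = TurnBudget(stt_ms=200, llm_ms=700, tts_ms=90, network_ms=150)
slower_voice_fast_model = TurnBudget(stt_ms=200, llm_ms=250, tts_ms=190, network_ms=150)

print(fast_voice_slow_model.total_ms())    # 1140
print(slower_voice_fast_model.total_ms())  # 790
```

With these made-up inputs, the stack with the faster voice engine still answers more slowly overall, because the voice slice is under a tenth of its round trip.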

How I ranked these

The order below is my editorial read of how strong each platform’s low-latency story is, best first. It is not the raw average from our score columns, because “best for latency” is a narrow question about speed, not all-round polish. The platforms here are the ones that either build their whole pitch around speed or sit at the centre of the latency conversation. I have left off the platforms that compete on other things first, and I explain those omissions near the end.

One honesty note that sits over this entire page, louder here than on any other comparison we run. We have not yet placed our own timed test calls to any of these platforms. There are no Voxrater latency numbers anywhere below. Every speed figure you read is the vendor’s own claim, attributed as such, and the 1 to 10 scores on the individual vendor pages are an editorial preview, not a measured result. When our test rig ships we will publish measured end-to-end latency against the same scenarios, from a known region with a stated sample size, and if a measured number contradicts a vendor’s claim, the measured number wins. Until then, treat this ranking as a sourced read of who is most credibly built for speed, not a leaderboard of tested results.

One disclosure up front, the same one we put on every roundup. Some of these platforms run affiliate programmes we may earn from. The ranking is not for sale, no vendor saw this page before it went live, and if a platform ever pays to appear it will be labelled sponsored and kept out of the ranked positions, so a paid slot can never pose as an earned one.

1. Cartesia: the speed specialist, on paper the tightest claim

Cartesia tops this list because speed is not one feature among many for it; speed is the entire pitch. Its Sonic voice engine is built around a single claim, that speech starts in under 100 milliseconds, and on its own Sonic page that sub-100ms time-to-first-audio is the headline number. In third-party published tests Sonic is one of the few engines that consistently lands near that mark, which is why it carries our “fastest” badge as a provisional read. If you have ever heard a phone agent stumble over an awkward pause before it answers, that pause is exactly what Cartesia is built to remove.

Here is the honest framing on that number. Sub-100ms is the text-to-speech slice, the time from text-ready to first sound. It is genuinely fast and it is the right slice to optimise for a live agent, because TTS sits at the end of every turn and a slow voice model adds delay the caller hears directly. But it is not the whole round trip. Cartesia bundles its own Ink speech-to-text into the agent rate, which keeps the front of the pipeline under one roof, and you bring your own language model at about $0.01 a minute, which is the slice Cartesia does not control. So the end-to-end latency a caller feels still depends on the model you wire in and the region you run from. The sub-100ms claim is about Cartesia’s part, and Cartesia’s part is fast.

The pricing fits the speed story rather than fighting it. Agent calls run $0.06 a minute on the Pro tier and up, the phone line adds $0.014 a minute on a Cartesia number, and with your own model the realistic all-in sits around $0.08 to $0.15 a minute. That is among the cheaper live-agent rates going, so you are not paying a premium for the speed. One trade to know: Cartesia’s compliance posture is thin. Its profile marks HIPAA, SOC 2 and GDPR all as not yet verified from a primary certificate, so if you are in a regulated industry the speed will not save you, and you should get the paperwork in writing first.
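As a worked check on that all-in figure, the sketch below adds up the three per-minute components quoted above. The function name and the monthly call volume are assumptions for illustration. The upper end of the range presumably reflects a pricier language model, which this sketch does not model.

```python
def all_in_per_minute(agent_rate: float, phone_line: float, model: float) -> float:
    """Sum the per-minute components of a live call, in US dollars."""
    return agent_rate + phone_line + model


# Rates are the ones quoted in the text; the monthly volume is a made-up example.
cartesia_floor = all_in_per_minute(agent_rate=0.06, phone_line=0.014, model=0.01)
monthly_minutes = 10_000

print(f"${cartesia_floor:.3f} per minute")
print(f"${cartesia_floor * monthly_minutes:,.2f} per month at {monthly_minutes:,} minutes")
```

The sum comes to roughly $0.084 a minute, which is the low end of the quoted range.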

Pick Cartesia if time-to-first-audio is your deciding factor, you want a cheap fast engine for live calls, and you are comfortable bringing your own model and confirming compliance separately. Read the full Cartesia review for the tier detail.

2. Rime: a model built specifically for the live phone use case

Rime sits second because it is the one engine here whose lowest-latency model exists for exactly this job and nothing else. Rime is a voice engine, not a full agent platform, so you wire it into something else (Vapi, LiveKit, Pipecat) that runs the call. What it sells is the voice, and its production lineup has a model picked out for speed: Mist, the low-latency model for live phone agents, at $0.03 per 1,000 characters. The newer Mist v3 keeps the same speaker lineup as v2 with a lower time-to-first-audio, per Rime’s own docs, which is the kind of targeted speed work you want from a voice supplier rather than a one-size engine.

The reason Rime ranks below Cartesia is not the voice model, it is the framing of the claim. Cartesia puts a specific sub-100ms number on its homepage; Rime’s latency story is “Mist is the low-latency model, built for live agents, with lower time-to-first-audio than the previous version,” which is a relative claim rather than an absolute one we can quote. Both are vendor statements we have not measured, but a relative “faster than our last model” is harder to compare across vendors than a stated millisecond figure, so it sits a notch lower on a pure latency read. Rime’s other models trade speed for other things: Arcana is the more expressive multilingual one at $0.04 per 1,000 characters, and Coda is the newest flagship at $0.05. For a live phone agent you would run Mist.

There is a real strength that lifts Rime for a particular buyer. Its whole pitch is getting hard words right, names, medications, account numbers, addresses, the detail that quietly loses a customer when an agent fumbles it. And unlike Cartesia, Rime’s HIPAA path is sourced and ticked, with a BAA stated on its Enterprise tier. So if your low-latency calls are also high-stakes regulated ones, Rime answers two questions at once. The catch is the same as any voice-only engine: it brings no speech-to-text, no model and no phone line, so it is a building block, not a plug-in-and-go platform, and the language list is short at seven languages.

Pick Rime if you are building your own stack, you want a voice model tuned for live calls, and pronunciation accuracy on regulated calls matters as much as raw speed. The Rime review has the per-model breakdown and the compliance detail.

3. Deepgram: a unified runtime with a sub-200ms voice claim

Deepgram ranks third because it attacks latency from a different and genuinely useful angle: it folds the three pieces a phone agent needs, the speech-to-text, the language model and its Aura-2 voice, into one runtime instead of three separate services with three sets of network hops between them. Deepgram started as a speech-to-text company, so the front of the pipeline is its home turf, with Nova-3 streaming transcription and the newer Flux model. On the voice side, its GA announcement and model docs state a sub-200ms time-to-first-byte for Aura-2. That is the vendor’s claim, attributed as such, and it is a different measurement from Cartesia’s sub-100ms, so do not read 200 versus 100 as a head-to-head ranking, read it as two engines describing different slices under different conditions.

The genuine latency argument for Deepgram is the unification, not any single number. When the transcription, the model and the voice live in one runtime billed on one websocket connection, you cut out the network round trips between separate suppliers, and those hops are exactly the slice vendors talk about least. For a developer who would otherwise be stitching a transcriber, a model and a voice across three services, a single combined runtime is a real latency win on top of being cheap, about $0.075 a minute for the whole bundle before you add a phone line. That price is among the lowest of any serious platform, which is why Deepgram carries our budget-pick badge.

So why third rather than higher? Two reasons, stated plainly. First, the sub-200ms claim is the voice slice, and 200ms is a looser figure than Cartesia’s sub-100ms even allowing for the different measurement, so on the narrow “fastest voice model” question Deepgram does not lead. Second, Deepgram brings no phone line, so you bolt on Twilio at about $0.014 a minute, and that is another supplier and another hop in the path. The unified-runtime advantage is real, but it stops at the edge of Deepgram’s own stack. Where Deepgram does pull ahead of the two above is compliance: its trust page states SOC 2 Type 1 and Type 2, a HIPAA BAA on request, plus GDPR and PCI, which is the strongest paperwork in this group.

Pick Deepgram if you want one cheap unified runtime that cuts inter-service hops, you are happy to add your own phone line, and you value the compliance paperwork. The Deepgram review covers the bring-your-own options and the rate tiers.

4. OpenAI Realtime API: the speech-to-speech bet on a different kind of fast

The Realtime API earns fourth place because it changes the latency question rather than just answering it faster. Every platform above runs a pipeline: speech in, transcribe, think, speak out, four stages with handoffs between them. OpenAI’s Realtime API collapses that into one speech-to-speech model, meaning a single model that hears the caller, works out what to say, and speaks back, without a separate transcription step handing off to a separate language model handing off to a separate voice. That tight loop is why it sounds so good, and why interruptions feel natural instead of jarring. There is no robotic seam where one stage passes to the next, because there are no separate stages to seam.

In principle, removing the handoffs removes a source of delay, which is a strong latency argument. So why not rank it first? Because OpenAI does not publish a time-to-first-audio number the way Cartesia and Deepgram do, so there is no vendor latency figure to quote at all here, only the architectural claim that a single model avoids inter-stage handoffs. That is a credible reason to expect low latency, but it is an argument, not a stated measurement, and on a page that ranks latency claims it cannot outrank a platform that puts a specific fast number on the table. The flagship gpt-realtime-2 and its newer voices, marin and cedar, are about as natural as anything you can buy, and it speaks 32-plus languages with native prosody, so on voice quality it leads the whole group. Latency is an inference from the design, not a number.

Two honest catches keep it mid-table. First, you build the agent yourself: there is no dashboard or phone line, so you bring Twilio at about $0.014 a minute and write the code that connects a call to the model. Second, the billing is by audio tokens, not by the minute, and the text context (your system prompt and the conversation so far) gets re-sent every turn, which both costs money and adds processing on each turn, so a long prompt works against the very latency the architecture is meant to win. Build with prompt caching from day one. The raw audio is around $0.05 a minute, but a realistic working range runs about $0.18 to $0.46 a minute uncached, per an independent breakdown.
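The effect of re-sending the text context on every turn can be sketched with simple arithmetic. This is a rough model under stated assumptions, not OpenAI's billing formula: the function name, the token sizes and the fixed per-exchange growth are all hypothetical, and caching is deliberately left out.

```python
def input_tokens_per_turn(system_tokens: int, tokens_per_exchange: int, turns: int) -> list[int]:
    """Input tokens processed on each turn when the full context is re-sent.

    Turn n carries the system prompt plus the n - 1 exchanges before it.
    """
    return [system_tokens + tokens_per_exchange * (n - 1) for n in range(1, turns + 1)]


# Hypothetical sizes for illustration; real token counts depend on the model.
short_prompt = input_tokens_per_turn(system_tokens=300, tokens_per_exchange=120, turns=10)
long_prompt = input_tokens_per_turn(system_tokens=3000, tokens_per_exchange=120, turns=10)

print(sum(short_prompt))  # 8400
print(sum(long_prompt))   # 35400
```

Under these assumptions, a system prompt ten times longer means roughly four times the text processed over a ten-turn call, and the model pays that on every turn, which is why a tight prompt and caching matter here.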

Pick the Realtime API if voice quality is your top priority, you want the speech-to-speech architecture’s tight loop, and you have a developer to wire it up and tune the prompt for speed. The OpenAI Realtime review shows the full token-pricing workings.

5. Vapi: the orchestrator, where your latency is your own design

Vapi places last on a latency list, and the reason is structural rather than a knock on the product. Vapi is the layer that runs the call, and that is the only part it owns. Three of the four slices that build up latency, the speech-to-text, the model and the voice, are all suppliers you choose and plug in (Deepgram, OpenAI, ElevenLabs, Cartesia, Rime and the rest). So Vapi does not have a latency number of its own to claim, because the number depends entirely on what you wire into it. Put a fast transcriber, a fast model and Cartesia’s voice behind Vapi and it can be quick. Put a slow model and a chatty prompt behind it and it will drag, through no fault of the orchestration layer.

That is the honest framing, and it cuts both ways. The upside of owning none of the slices is that you can tune every one of them. If latency is your bottleneck, Vapi lets you swap a slow component out for a fast one without leaving the platform, run a cheaper quicker model on simple turns, and pick exactly the voice engine that wins the time-to-first-audio race. None of the platforms above gives you that much control over the whole pipeline. The downside is that the speed is your job to assemble and tune, not something Vapi hands you tuned. There is also a real orchestration cost: every extra hop between the suppliers Vapi coordinates is a slice of latency, the same inter-service hop problem Deepgram’s unified runtime avoids.

Where Vapi pulls ahead of everything above is the operational and trust story, which is why it belongs on the list at all rather than being left off. It carries warm transfer, batch calling, SIP trunking and MCP support, its Trust Center lists SOC 2 Type II, GDPR and PCI DSS, and it has the strongest customer roster in the category. For a team that wants control over the latency pipeline and a production-grade platform around it, that combination is worth the assembly. For a team that wants a fast agent handed to them, the do-it-yourself nature of the speed is exactly why it sits at number five here.

Pick Vapi if you want component-level control to tune latency yourself, you have a developer, and you value the operational features and trust posture around the call. The Vapi review covers the pass-through pricing model.

The latency reality check

Here is the part buyers skip until a call goes badly, and it deserves saying plainly. A vendor’s headline latency number does not tell you how your agent will feel on a real call. It tells you how one slice of the pipeline performed under the vendor’s conditions.

Three reasons it rarely survives contact with production. First, the number is usually time-to-first-audio or time-to-first-byte, the voice slice, not the end-to-end round trip the caller hears. A sub-100ms voice model sitting behind a slow model and a long prompt can still leave a full second of dead air after the caller stops talking. Second, the test conditions are the vendor’s, not yours: their region, their hardware, their happy-path script. Run from a different continent, through a different carrier, with your own awkward edge cases, and the number moves. Third, the model and the prompt you choose often add more to the latency budget than the voice engine does, and those are your decisions, not the platform’s.

So the practical rule is the same one we apply to pricing: the headline is the floor, your build is the real number. The platform gives you a fast building block. Choosing a fast model, keeping the prompt tight, caching what you can, and running your servers near your callers is your job, and it stays your job whichever engine you pick.

Who I left off, and why

Several capable platforms are not on this list, and the reasons are honest ones.

  • Retell, Synthflow and Bland are strong all-in calling platforms, and they appear on our outbound-sales and HIPAA roundups for good reason, but none of them builds its pitch around speed or sits at the centre of the latency conversation the way the five above do. They compete on dialers, no-code building, bundled pricing and turnkey setup first. Speed is part of their product, not the centre of their pitch, so they do not belong on a list that ranks latency claims specifically. When our test rig measures end-to-end latency, any of them could place well or badly on the real number, and we will report it then.
  • ElevenLabs and Murf lead on voice quality and narration rather than live-call speed. ElevenLabs has a low-latency mode, but its centre of gravity is voiceover, dubbing and the largest voice library going, not the time-to-first-audio race. Murf is a narration studio, not a live-call engine at all.
  • Quote-only enterprise platforms are off this list on principle. If a platform will not show a price and a latency claim without a sales call, it cannot be ranked fairly against ones that publish, so we do not slot it into a numbered position.

I have also kept our own site off this list, and I always will. A directory that ranks itself into its own “best” roundups has told you everything you need to know about trusting it. The only names here are platforms you could actually build on.

Before you commit, test this

Do not pick a platform off a homepage latency number, because that number is the vendor’s best slice under the vendor’s conditions. Spend an afternoon measuring the thing that actually matters: the round trip your callers will feel.

  • Measure end-to-end, not the voice slice. Time from the moment the caller stops speaking to the moment the agent starts. That full round trip is what feels human or robotic, and it is the only number that matters. The vendor’s time-to-first-audio is one ingredient in it, not the dish.
  • Test from your region, on your carrier. A fast engine in the wrong place loses to a slower one next door. Run the calls from where your callers actually are, through the phone line you will actually use, not the vendor’s demo environment.
  • Test with your real model and your real prompt. The model you choose and the length of your system prompt often move latency more than the voice engine does. Swap in a faster model or trim the prompt and re-measure. On the speech-to-speech option, switch prompt caching on and measure the difference.
  • Use your awkward edge cases. The caller who interrupts, the long pause, the question your script did not plan for. Latency is easy on the happy path and tells the truth on the messy one.

That afternoon will tell you more than any homepage number, this one included. When our test rig ships we will publish measured end-to-end latency against the same scenarios, from a stated region with a stated sample size, and if it contradicts what a vendor claimed, the measured number wins.
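The end-to-end measurement described above can be sketched in a few lines. This assumes you have logged, for each turn of a test call, when the caller stopped speaking and when the agent's audio started. The log format, the function name and the sample timestamps are all hypothetical, and how you capture the timestamps depends on your own stack.

```python
import math
import statistics


def summarise(latencies_ms: list[int]) -> dict[str, float]:
    """Median and nearest-rank 95th percentile of per-turn latencies."""
    if not latencies_ms:
        raise ValueError("need at least one timed turn")
    ordered = sorted(latencies_ms)
    # Nearest-rank p95: smallest sample with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * len(ordered))
    return {"median_ms": statistics.median(ordered), "p95_ms": ordered[rank - 1]}


# Hypothetical log of one test call: (caller stopped speaking, agent audio started),
# both in milliseconds since the call began. Not real measurements.
turns = [(1000, 1650), (5200, 5900), (9100, 10350), (14000, 14600), (18300, 19150)]

# End-to-end latency per turn is agent start minus caller stop.
latencies = [agent_started - caller_stopped for caller_stopped, agent_started in turns]

summary = summarise(latencies)
print(summary)  # {'median_ms': 700, 'p95_ms': 1250}
```

Reporting the tail alongside the median is a deliberate choice: one slow turn on an interruption or an off-script question is what a caller remembers. Five turns is far too few for a real comparison, so treat the sample size here as a placeholder and record the region, carrier, model and prompt with every run.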

Bottom line

There is no single winner here yet, because we have not measured these ourselves, and because “lowest latency” is really a question about which slice of the pipeline you are optimising and which slices you control.

  • The tightest published claim, a cheap engine built entirely around speed: Cartesia, sub-100ms time-to-first-audio, vendor-claimed.
  • A voice model built specifically for live phone agents, with regulated-call accuracy on top: Rime, Mist, with a real HIPAA path.
  • A unified runtime that cuts inter-service hops, cheap, with the best compliance paperwork: Deepgram, sub-200ms voice claim, vendor-stated.
  • The speech-to-speech bet that removes the transcription handoff and leads on voice quality: OpenAI Realtime, latency inferred from the architecture, not a published number.
  • The orchestrator where you tune the latency yourself, for teams that want control and a production platform: Vapi.

If you are still torn, let your build break the tie. A developer assembling a stack who wants the fastest single engine starts with Cartesia or Rime. A team that wants one cheap unified runtime leans Deepgram. A team that wants the most natural voice and will write code leans the Realtime API. A team that wants to tune every slice and own the platform leans Vapi. None of these is a wrong answer. They are answers to different versions of the same question, and the real answer waits on the test calls.

Start with the Cartesia and Rime reviews if speed is your single deciding factor, read the Deepgram and OpenAI Realtime profiles if you are building your own stack, and put your real call volume through the cost calculator before you commit, because the fastest engine is no bargain if the all-in rate does not fit your budget.

Common questions

Which voice AI has the lowest latency?
Cartesia leads on its sub-100ms speech claim, the tightest pitch here. Rime's Mist model is built for live phone agents, Deepgram claims sub-200ms voice, and OpenAI's speech-to-speech loop removes the transcription seam. These are vendor claims until our test calls land.

Why does latency matter for voice agents?
A long pause after the caller speaks feels broken and makes people talk over the agent. Lower round-trip latency keeps the conversation natural, which is why it is the headline metric for live phone use.

Are these latency numbers measured or claimed?
Today they are the vendors' own published claims, clearly labelled as such. Our blind test calls with timed latency are coming; the methodology page explains how we will measure them independently.

Where to go next

Every figure here is pulled live from each platform's sourced profile, so it stays in step with the dated numbers on those pages. When the test calls land, the timed latency will appear too.