OpenAI Realtime API
OpenAI's speech-to-speech model and API for building your own voice agent, billed by audio tokens, not by the minute.
Paid link, we may earn a commission. How this works.
Scored on the same voice-agent rubric as the full platforms, so a building block like this scores low on the axes it does not address. Read its value score against its job.
The raw engine, not a finished product. OpenAI's Realtime API is one model that hears speech, thinks and speaks back, with the most natural voices going. You build the agent yourself and add a phone line. Billing is by audio tokens, so the per-minute cost swings a lot.
About $0.07 to $0.48 for a minute of conversation, once the phone line and the AI are added in.
That's roughly $4.20 to $28.80 an hour. Plans: $0/mo (Pay as you go).
Pricing
| Cost component | Rate |
|---|---|
| What the platform charges to run the agent, before the phone line and the AI usage are added on. | — |
| The step that turns what the caller says out loud into text the AI can read. | — |
| The AI 'brain' that reads what the caller said and works out what to say back. | $0.18 /min |
| The step that turns the AI's written reply back into a spoken voice. | — |
| The phone line itself: the service that connects the call to a real phone number. Usually billed on top of the platform. | $0.01 /min |
| The total you actually pay for one minute of conversation once every piece is added up: the platform, the AI, the voice and the phone line. | $0.07–0.48 /min |
The Realtime API is token-priced, not per-minute, so the headline here is an estimate, not a published rate. OpenAI's own pricing page lists gpt-realtime-2 at $32.00 per 1M audio input tokens ($0.40 per 1M cached input) and $64.00 per 1M audio output tokens; text context is $4.00 per 1M input ($0.40 cached) and $24.00 per 1M output. The cheaper gpt-realtime-mini is $10.00 in / $0.30 cached / $20.00 out per 1M audio tokens.

Audio is duration-encoded: user (input) audio is 1 token per 100ms (600 tokens per spoken minute) and assistant (output) audio is 1 token per 50ms (1,200 tokens per spoken minute), per OpenAI's realtime-costs guide. Assumption for the per-minute figure: a minute of conversation split roughly half caller, half agent (about 300 input + 600 output audio tokens) is around $0.05/min in raw audio alone. In practice the system prompt and conversation context are re-sent as text tokens each turn, and that re-sent context dominates the bill. An independent 11-profile breakdown puts a realistic working range at about $0.18 to $0.46/min uncached and $0.05 to $0.10/min with prompt caching, hence the $0.18 headline and the wide all-in band.

Telephony is not included: you add Twilio (about $0.014/min, a third-party cost) and build the agent loop yourself. Speech-to-text and text-to-speech show no rate here because the model bundles speech-in, reasoning and speech-out into one combined rate that sits in the AI 'brain' line.

SOC 2 Type 2, HIPAA BAA on the API, GDPR with a Data Processing Addendum, plus ISO 27001/27701, are stated on OpenAI's enterprise-privacy and trust pages.
Every plan in one place: the monthly fee, what each one includes, and the features it unlocks. Anything beyond a plan's allowance, or on a pay-as-you-go tier, is billed at the per-minute rate above. A blank in the features means the vendor's plan page does not state it for that plan, not that it is unavailable.
| Pay as you go | |
|---|---|
| Price | Free |
| Included | Pay per use |
| Plan notes | No platform fee or subscription. You pay per audio token used, billed against your OpenAI API account. |
| What each plan unlocks | |
| API access | Yes |
| Priority support | Community forum + docs (Enterprise support is separate) |
Prices in USD as set by the vendor · last checked 2026-06-03
To see roughly what OpenAI Realtime API would cost, multiply your expected monthly minutes by the per-minute range above.
This is a rough estimate from OpenAI Realtime API's sourced rates, not a quote. Always confirm on the vendor's own pricing page before you commit.
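As a rough illustration of that estimate, the sketch below multiplies a monthly call volume by the all-in band quoted on this page ($0.07 to $0.48 a minute). The function name and the rounding to cents are our own choices for the example, not anything OpenAI publishes, and the band itself is this page's estimate rather than a vendor rate.

```python
# All-in per-minute band quoted on this page (an estimate, not a published rate).
ALL_IN_LOW_USD_PER_MIN = 0.07
ALL_IN_HIGH_USD_PER_MIN = 0.48


def monthly_cost_range(minutes_per_month,
                       low=ALL_IN_LOW_USD_PER_MIN,
                       high=ALL_IN_HIGH_USD_PER_MIN):
    """Return (low, high) estimated cost in USD for a given call volume.

    Hypothetical helper: a straight multiplication, rounded to cents.
    """
    if minutes_per_month < 0:
        raise ValueError("minutes_per_month must be non-negative")
    return (round(minutes_per_month * low, 2),
            round(minutes_per_month * high, 2))


if __name__ == "__main__":
    for minutes in (60, 1_000, 10_000):
        low, high = monthly_cost_range(minutes)
        print(f"{minutes:>6} min: ${low:,.2f} to ${high:,.2f}")
```

Sixty minutes reproduces the hourly figure near the top of the page. Where a real bill falls inside the band depends mostly on prompt length and whether caching is on.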
At a glance
- Speech-to-text: OpenAI Realtime (speech-to-speech, native)
- Text-to-speech: OpenAI Realtime (speech-to-speech, native)
- Languages: en, es, fr, de, ja, pt, hi, ar, zh
- Integrations: Twilio (SIP / Media Streams), SIP trunk providers, LiveKit, Pipecat, WebRTC / WebSocket native, Native SDKs (Python/JS), Function calling / tools
Compliance
Our full take
The Realtime API is not a voice-agent product you log into and configure. It is the engine underneath one. OpenAI gives you a single speech-to-speech model, meaning one model that hears the caller, works out what to say, and speaks back, without bolting together a separate speech-to-text step, a language model and a voice. That tight loop is why it sounds so good. The current flagship is gpt-realtime-2, and its voices, especially the two newer ones called marin and cedar, are about as natural as anything you can buy today.
So that is the upside, and it is a real one: speech quality and the OpenAI ecosystem. If you want an agent that handles interruptions gracefully and does not have that tell-tale robotic seam where transcription hands off to the model, this is the strongest starting point on the market. It also speaks 32-plus languages with native prosody (the rhythm and stress that make speech sound local rather than translated), and it does tool calling and MCP, so it can actually look things up and take actions mid-call.
Now the catches, and there are three worth knowing before you compare it to a Vapi or a Retell.
First, you build the agent yourself. There is no dashboard, no flow builder, no campaign manager. You get an API and SDKs (Python and JavaScript), and you write the code that connects a phone call to the model. For a developer that is a day or two of work. For a non-technical buyer it is a project you need to hire for.
Second, there is no phone line. The Realtime API connects over WebRTC, WebSocket or SIP, but it does not give you a number people can ring. You bring your own telephony, almost always Twilio, and pay Twilio separately, usually around $0.014 a minute on top. That is the same trade Deepgram and Cartesia ask of you, and it is fine once you know it is coming.
Third, and this is the one that trips people up, the billing is by audio tokens, not by the minute. This is genuinely harder to predict than a flat per-minute rate, so let me show the workings.
OpenAI’s pricing page lists gpt-realtime-2 at $32.00 per 1M audio input tokens ($0.40 per 1M if the input is cached) and $64.00 per 1M audio output tokens. Text context, your system prompt and the running conversation, is billed separately at $4.00 per 1M input ($0.40 cached) and $24.00 per 1M output. There is also a cheaper gpt-realtime-mini at $10.00 in, $0.30 cached, $20.00 out per 1M audio tokens, for when top quality is not the point.
To turn tokens into minutes you need OpenAI’s encoding rule: input (caller) audio is one token per 100 milliseconds, so 600 tokens a spoken minute, and output (agent) audio is one token per 50 milliseconds, so 1,200 tokens a spoken minute. Take a minute of conversation split roughly half caller, half agent. That is about 300 input audio tokens (about $0.01) plus 600 output audio tokens (about $0.04), so the raw audio is only around $0.05 a minute.
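The arithmetic above can be sketched in a few lines of Python. The rates and the token-per-minute figures are the ones quoted in this article; the function name and the `caller_share` parameter are ours, added only to make the half-and-half assumption explicit and adjustable.

```python
# Per-1M audio token rates and encoding quoted above for gpt-realtime-2.
AUDIO_IN_USD_PER_M = 32.00
AUDIO_OUT_USD_PER_M = 64.00
IN_TOKENS_PER_MIN = 600     # caller audio: 1 token per 100 ms
OUT_TOKENS_PER_MIN = 1200   # agent audio: 1 token per 50 ms


def raw_audio_cost_per_minute(caller_share=0.5,
                              in_rate=AUDIO_IN_USD_PER_M,
                              out_rate=AUDIO_OUT_USD_PER_M):
    """Return (input_cost, output_cost, total) in USD for one minute.

    caller_share is the assumed fraction of the minute the caller speaks;
    the agent is assumed to speak for the rest. Text context is ignored.
    """
    if not 0.0 <= caller_share <= 1.0:
        raise ValueError("caller_share must be between 0 and 1")
    in_tokens = IN_TOKENS_PER_MIN * caller_share
    out_tokens = OUT_TOKENS_PER_MIN * (1.0 - caller_share)
    in_cost = in_tokens * in_rate / 1_000_000
    out_cost = out_tokens * out_rate / 1_000_000
    return in_cost, out_cost, in_cost + out_cost


if __name__ == "__main__":
    in_cost, out_cost, total = raw_audio_cost_per_minute()
    print(f"input ${in_cost:.4f} + output ${out_cost:.4f} = ${total:.4f}/min")
    # Same split at the gpt-realtime-mini rates quoted above ($10 in, $20 out).
    print(f"mini: ${raw_audio_cost_per_minute(0.5, 10.00, 20.00)[2]:.4f}/min")
```

With the default split this comes to $0.0096 of input plus $0.0384 of output, which is where the rounded $0.05 a minute comes from.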
Here is the honest part. That $0.05 is the floor, not the bill. The expensive bit is the text context, because your system prompt and the whole conversation so far get re-sent on every turn, and a chatty agent with a long prompt racks those text tokens up fast. An independent breakdown across 11 call profiles landed at roughly $0.18 to $0.46 a minute with no caching, dropping to about $0.05 to $0.10 a minute once prompt caching is switched on. Caching matters a lot here, so if you build on this, build it in from day one. The headline on this page is an estimate of $0.18 a minute, not a published rate, and the all-in band is wide on purpose because your prompt design genuinely moves the number.
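To see why re-sent context grows so quickly, here is a deliberately simplified model. It assumes every turn re-sends the system prompt plus all earlier turns as text input, and that with caching on, anything already sent on a previous turn is billed at the cached rate. The text rates are the ones quoted above; the prompt size, tokens per turn and the caching behaviour are illustrative assumptions, not OpenAI's billing logic, and text output tokens are left out.

```python
# Text input rates per 1M tokens quoted above for gpt-realtime-2.
TEXT_IN_USD_PER_M = 4.00
TEXT_CACHED_USD_PER_M = 0.40


def text_context_cost(prompt_tokens, tokens_per_turn, turns, cached=False):
    """Estimate USD spent on text input across a call.

    Hypothetical model: each turn sends the prompt, the history so far and
    one new turn's worth of tokens. With cached=True, the prompt and history
    are billed at the cached rate from the second turn onward.
    """
    total = 0.0
    history = 0
    for turn in range(turns):
        resent = prompt_tokens + history   # context the model has seen before
        fresh = tokens_per_turn            # new text added this turn
        if cached and turn > 0:
            total += resent * TEXT_CACHED_USD_PER_M / 1_000_000
            total += fresh * TEXT_IN_USD_PER_M / 1_000_000
        else:
            total += (resent + fresh) * TEXT_IN_USD_PER_M / 1_000_000
        history += tokens_per_turn
    return total


if __name__ == "__main__":
    # Illustrative numbers only: a 2,000-token prompt, 100 tokens per turn.
    for turns in (10, 20, 40):
        plain = text_context_cost(2000, 100, turns)
        with_cache = text_context_cost(2000, 100, turns, cached=True)
        print(f"{turns} turns: ${plain:.4f} uncached, ${with_cache:.4f} cached")
```

Because the history is re-sent each turn, the uncached total grows faster than the number of turns, which is the effect that makes prompt length and caching matter more than the raw audio rate.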
On compliance the Realtime API is on solid ground, because it inherits the OpenAI API platform’s posture. OpenAI’s enterprise-privacy and trust pages state SOC 2 Type 2 for the API, the ability to sign a HIPAA Business Associate Agreement, GDPR support with a Data Processing Addendum, and ISO 27001 and 27701. That is enough paperwork for a regulated build, though as always you get the BAA signed before you put protected health data anywhere near it.
My read. Pick the Realtime API if voice quality is the thing you will not compromise on, you are happy to write code, and you can stomach a per-minute cost that you have to model rather than read off a page. Pick a packaged platform instead if you want the phone line, the dashboard and a predictable bill handed to you. The voice is the best reason to choose this. The token pricing and the do-it-yourself assembly are the reasons plenty of teams do not.
The 1 to 10 scores on this page are an editorial preview, our provisional read to get the framework in place, not a measured result. We have not run the Realtime API through our own call tests yet, so there is no Voxrater latency figure here. The pricing, voice and compliance detail is sourced from OpenAI’s pricing, realtime and enterprise-privacy pages plus one independent cost analysis, captured 2026-05-31.
Sources
- OpenAI Realtime API pricing page: re-captured for the quarterly re-verification; pricing reviewed against the live page (screenshot in evidence/) · captured 2026-06-02
- OpenAI API plan/pricing page: pay-as-you-go, no platform fee; gpt-realtime-2 audio $32/$0.40 cached in, $64 out per 1M; gpt-realtime-mini $10/$0.30/$20 · captured 2026-05-31
- OpenAI developer pricing reference: per-1M audio and text token rates for gpt-realtime-2, gpt-realtime-1.5, gpt-realtime-mini · captured 2026-05-31
- Realtime API guide: speech-to-speech (listen, reason, speak, call tools), gpt-realtime-2, SIP for telephony · captured 2026-05-31
- Realtime conversations doc: built-in voices (alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar), marin/cedar recommended · captured 2026-05-31
- Independent cost-per-minute analysis: audio is 1 token/100ms in, 1 token/50ms out; ~$0.18 to 0.46/min uncached, ~$0.05 to 0.10/min cached across 11 call profiles · captured 2026-05-31
- OpenAI enterprise privacy: SOC 2 Type 2 for the API platform, HIPAA BAA available, GDPR Data Processing Addendum, ISO 27001/27701 · captured 2026-05-31
- May 2026 announcement of gpt-realtime-2, gpt-realtime-translate, gpt-realtime-whisper; 32+ languages with native prosody · captured 2026-05-31