OpenAI Realtime API

Dev framework

OpenAI's speech-to-speech model and API for building your own voice agent, billed by audio tokens, not by the minute.

Best for: developers who want OpenAI speech quality and ecosystem in their own agent

Watch for: no phone line or dashboard handed to you, and a token bill you must model

Paid link: we may earn a commission.

Reviewed by Voxrater · Last reviewed 2026-06-02

Our scores (editorial preview)

6.2 / 10 overall (Capable)

Voice quality 8

Voice range 5

Ease of use 4

Value 7

All-in /min $0.07–0.48

headline /min $0.18

voices 10+

languages 32+

✓ HIPAA ✓ SOC 2 Type II ✓ GDPR

Scored on the same voice-agent rubric as the full platforms, so a building block like this scores low on the axes it does not address. Read its value score against its job.


The raw engine, not a finished product. OpenAI's Realtime API is one model that hears speech, thinks and speaks back, with the most natural voices going. You build the agent yourself and add a phone line. Billing is by audio tokens, so the per-minute cost swings a lot.

What you'll pay

About $0.07 to $0.48 for a minute of conversation, once the phone line and the AI are added in.

That's roughly $4.20–28.80 an hour. Plans: $0/mo (Pay as you go).

Pricing

$0.07–0.48 /min (≈ €0.06–0.41, ≈ £0.05–0.36, ≈ ₹6.70–45.94, ≈ R$0.35–2.41, ≈ A$0.10–0.67) · headline $0.18 /min

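Cost breakdown

Platform fee	none
Speech-to-text	bundled into the model rate
Model (speech in, reasoning, speech out)	$0.18 /min
Text-to-speech	bundled into the model rate
Telephony	$0.01 /min
All-in	$0.07–0.48 /min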

The Realtime API is token-priced, not per-minute, so the headline here is an estimate, not a published rate. OpenAI's own pricing page lists gpt-realtime-2 at $32.00 per 1M audio input tokens ($0.40 per 1M cached input) and $64.00 per 1M audio output tokens; text context is $4.00 per 1M input ($0.40 cached) and $24.00 per 1M output. The cheaper gpt-realtime-mini is $10.00 in / $0.30 cached / $20.00 out per 1M audio tokens.

Audio is duration-encoded: user (input) audio is 1 token per 100 ms (600 tokens per spoken minute) and assistant (output) audio is 1 token per 50 ms (1,200 tokens per spoken minute), per OpenAI's realtime-costs guide.

Assumption for the per-minute figure: a minute of conversation split roughly half caller, half agent (about 300 input + 600 output audio tokens) is around $0.05/min in raw audio alone, but in practice the system prompt and conversation context are re-sent as text tokens each turn, which dominates the bill. An independent 11-profile breakdown puts a realistic working range at about $0.18 to $0.46/min uncached and $0.05 to $0.10/min with prompt caching, hence the $0.18 headline and the wide all-in band.

Telephony is not included: you add Twilio (about $0.014/min, a third-party cost) and build the agent loop yourself. Speech-to-text and text-to-speech are 0 here because the model bundles speech-in, reasoning and speech-out into one combined rate that sits in the LLM line.

SOC 2 Type 2, HIPAA BAA on the API, GDPR with a Data Processing Addendum, plus ISO 27001/27701, are stated on OpenAI's enterprise-privacy and trust pages.

Plans & what you get

Every plan in one place: the monthly fee, what each one includes, and the features it unlocks. Anything beyond a plan's allowance, or on a pay-as-you-go tier, is billed at the per-minute rate above. A blank in the features means the vendor's plan page does not state it for that plan, not that it is unavailable.

	Pay as you go
Price	Free
Included	Pay per use
Plan notes	No platform fee or subscription. You pay per audio token used, billed against your OpenAI API account.
What each plan unlocks
API access	Yes
Priority support	Community forum + docs (Enterprise support is separate)

Prices in USD as set by the vendor · last checked 2026-06-03

Estimate your bill

At 2,000 call minutes a month (~33 hours), the all-in per-minute estimate puts OpenAI Realtime API at roughly $140–960 a month (≈ €120.54–826.59, ≈ £104.11–713.92, ≈ ₹13,399.40–91,881.60, ≈ R$702.58–4,817.66, ≈ A$195.50–1,340.54).

A rough estimate from OpenAI Realtime API's sourced rates, not a quote. Always confirm on the vendor's own pricing page before you commit.
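The arithmetic behind an estimate like this is a straight multiplication of call minutes by the all-in band. A minimal sketch in Python, assuming the $0.07–0.48 band on this page; the function name `monthly_estimate` is ours, and 2,000 minutes is the volume implied by the $140–960 figure:

```python
# All-in band from this page, in USD per minute (an estimate, not a published rate).
ALL_IN_LOW = 0.07
ALL_IN_HIGH = 0.48


def monthly_estimate(minutes: int) -> tuple[float, float]:
    """Return the (low, high) cost in USD for a given number of call minutes."""
    return (round(minutes * ALL_IN_LOW, 2), round(minutes * ALL_IN_HIGH, 2))


low, high = monthly_estimate(2000)
print(f"${low:,.0f} to ${high:,.0f} a month")  # $140 to $960 a month
```

Passing 60 minutes gives the hourly figure quoted earlier, $4.20 to $28.80.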

At a glance


Speech-to-text: OpenAI Realtime (speech-to-speech, native)
Text-to-speech: OpenAI Realtime (speech-to-speech, native)
Languages: en, es, fr, de, ja, pt, hi, ar, zh
Integrations: Twilio (SIP / Media Streams), SIP trunk providers, LiveKit, Pipecat, WebRTC / WebSocket native, Native SDKs (Python/JS), Function calling / tools

Compliance

✓ HIPAA ✓ SOC 2 Type II ✓ GDPR

Our full take

The Realtime API is not a voice-agent product you log into and configure. It is the engine underneath one. OpenAI gives you a single speech-to-speech model, meaning one model that hears the caller, works out what to say, and speaks back, without bolting together a separate speech-to-text step, a language model and a voice. That tight loop is why it sounds so good. The current flagship is gpt-realtime-2, and its voices, especially the two newer ones called marin and cedar, are about as natural as anything you can buy today.

So that is the upside, and it is a real one: speech quality and the OpenAI ecosystem. If you want an agent that handles interruptions gracefully and does not have that tell-tale robotic seam where transcription hands off to the model, this is the strongest starting point on the market. It also speaks 32-plus languages with native prosody (the rhythm and stress that make speech sound local rather than translated), and it does tool calling and MCP, so it can actually look things up and take actions mid-call.

Now the catches, and there are three worth knowing before you compare it to a Vapi or a Retell.

First, you build the agent yourself. There is no dashboard, no flow builder, no campaign manager. You get an API and SDKs (Python and JavaScript), and you write the code that connects a phone call to the model. For a developer that is a day or two of work. For a non-technical buyer it is a project you need to hire for.

Second, there is no phone line. The Realtime API connects over WebRTC, WebSocket or SIP, but it does not give you a number people can ring. You bring your own telephony, almost always Twilio, and pay Twilio separately, usually around $0.014 a minute on top. That is the same trade Deepgram and Cartesia ask of you, and it is fine once you know it is coming.

Third, and this is the one that trips people up, the billing is by audio tokens, not by the minute. This is genuinely harder to predict than a flat per-minute rate, so let me show the workings.

OpenAI’s pricing page lists gpt-realtime-2 at $32.00 per 1M audio input tokens ($0.40 per 1M if the input is cached) and $64.00 per 1M audio output tokens. Text context, your system prompt and the running conversation, is billed separately at $4.00 per 1M input ($0.40 cached) and $24.00 per 1M output. There is also a cheaper gpt-realtime-mini at $10.00 in, $0.30 cached, $20.00 out per 1M audio tokens, for when top quality is not the point.

To turn tokens into minutes you need OpenAI’s encoding rule: input (caller) audio is one token per 100 milliseconds, so 600 tokens a spoken minute, and output (agent) audio is one token per 50 milliseconds, so 1,200 tokens a spoken minute. Take a minute of conversation split roughly half caller, half agent. That is about 300 input audio tokens (about $0.01) plus 600 output audio tokens (about $0.04), so the raw audio is only around $0.05 a minute.
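The token-to-minute conversion above can be written as a short calculation. This is a sketch using only the rates and encoding rule quoted on this page; the names `RATES` and `raw_audio_cost` are ours, and it deliberately ignores text context and caching:

```python
# Audio rates quoted above, USD per 1M tokens: (input, output).
RATES = {
    "gpt-realtime-2": (32.00, 64.00),
    "gpt-realtime-mini": (10.00, 20.00),
}

# Encoding rule quoted above: 1 input token per 100 ms, 1 output token per 50 ms.
INPUT_TOKENS_PER_SEC = 10
OUTPUT_TOKENS_PER_SEC = 20


def raw_audio_cost(model: str, caller_seconds: float, agent_seconds: float) -> float:
    """Return raw audio cost in USD, excluding text context and caching."""
    input_rate, output_rate = RATES[model]
    input_tokens = caller_seconds * INPUT_TOKENS_PER_SEC
    output_tokens = agent_seconds * OUTPUT_TOKENS_PER_SEC
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# One minute split half caller, half agent: 300 input + 600 output tokens.
print(f"${raw_audio_cost('gpt-realtime-2', 30, 30):.3f}")     # $0.048
print(f"${raw_audio_cost('gpt-realtime-mini', 30, 30):.3f}")  # $0.015
```

The output side carries most of the raw audio cost: it is encoded at twice the token rate and billed at twice the price per token.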

Here is the honest part. That $0.05 is the floor, not the bill. The expensive bit is the text context, because your system prompt and the whole conversation so far get re-sent on every turn, and a chatty agent with a long prompt racks those text tokens up fast. An independent breakdown across 11 call profiles landed at roughly $0.18 to $0.46 a minute with no caching, dropping to about $0.05 to $0.10 a minute once prompt caching is switched on. Caching matters a lot here, so if you build on this, build it in from day one. The headline on this page is an estimate of $0.18 a minute, not a published rate, and the all-in band is wide on purpose because your prompt design genuinely moves the number.
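To see why re-sent context grows so quickly, here is a rough model in Python. It is a sketch under loud assumptions: the prompt size, tokens added per turn and turn count are hypothetical numbers chosen for illustration, and the caching rule (everything sent on an earlier turn is billed at the cached rate) is a simplification, not OpenAI's documented billing logic. It is not meant to reproduce the independent breakdown's figures.

```python
TEXT_INPUT_PER_M = 4.00   # USD per 1M text input tokens (quoted above)
TEXT_CACHED_PER_M = 0.40  # USD per 1M cached text input tokens (quoted above)


def context_cost(prompt_tokens: int, tokens_per_turn: int, turns: int, cached: bool) -> float:
    """Estimate text-input cost in USD when the whole context is re-sent each turn.

    Simplifying assumption: with caching on, tokens already sent on an earlier
    turn are billed at the cached rate and only new tokens at the full rate.
    """
    total = 0.0
    seen = 0  # tokens already sent on an earlier turn
    for turn in range(turns):
        context = prompt_tokens + turn * tokens_per_turn
        new = context - seen
        if cached:
            total += seen * TEXT_CACHED_PER_M + new * TEXT_INPUT_PER_M
        else:
            total += context * TEXT_INPUT_PER_M
        seen = context
    return total / 1_000_000


# Hypothetical call: 2,000-token system prompt, 150 new tokens per turn, 20 turns.
uncached = context_cost(2000, 150, 20, cached=False)
with_cache = context_cost(2000, 150, 20, cached=True)
print(f"uncached ${uncached:.3f}, cached ${with_cache:.3f}")
```

The point of the sketch is the shape, not the exact numbers: the uncached total grows with the square of the turn count, because every turn pays again for everything before it, and that is the part caching cuts down.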

On compliance the Realtime API is on solid ground, because it inherits the OpenAI API platform’s posture. OpenAI’s enterprise-privacy and trust pages state SOC 2 Type 2 for the API, the ability to sign a HIPAA Business Associate Agreement, GDPR support with a Data Processing Addendum, and ISO 27001 and 27701. That is enough paperwork for a regulated build, though as always you get the BAA signed before you put protected health data anywhere near it.

My read. Pick the Realtime API if voice quality is the thing you will not compromise on, you are happy to write code, and you can stomach a per-minute cost that you have to model rather than read off a page. Pick a packaged platform instead if you want the phone line, the dashboard and a predictable bill handed to you. The voice is the best reason to choose this. The token pricing and the do-it-yourself assembly are the reasons plenty of teams do not.

The 1 to 10 scores on this page are an editorial preview, our provisional read to get the framework in place, not a measured result. We have not run the Realtime API through our own call tests yet, so there is no Voxrater latency figure here. The pricing, voice and compliance detail is sourced from OpenAI’s pricing, realtime and enterprise-privacy pages plus one independent cost analysis, captured 2026-05-31.