0

Problem

I'm using the Azure OpenAI Realtime API (gpt-realtime-mini-2025-12-15) via the .NET OpenAI.Realtime SDK to measure token consumption. The response.done server event includes a usage object with a detailed token breakdown, but when I compare these values against the Azure Cost Management meter data for the same isolated session, every single meter is significantly different.

This is a single voice turn with no custom system instruction, no tools, and no function calls — the simplest possible scenario — yet the numbers diverge substantially.

Code

The scenario is minimal — one audio file sent, one response received:

#pragma warning disable OPENAI002

using OpenAI.Realtime;

public class RT_Test02_SingleVoice
{
    public async Task RunAsync()
    {
        // 1. Connect to Azure OpenAI Realtime
        var client = new RealtimeClient(
            credential: new ApiKeyCredential("..."),
            options: new RealtimeClientOptions
            {
                Endpoint = new Uri("https://<my-resource>.services.ai.azure.com/openai/realtime")
            });

        var session = await client.StartConversationSessionAsync(model: "gpt-realtime-mini-2025-12-15");

        // 2. Configure session — PCM 24kHz, server VAD, Whisper transcription, no tools
        await session.ConfigureConversationSessionAsync(new RealtimeConversationSessionOptions
        {
            AudioOptions = new RealtimeConversationSessionAudioOptions
            {
                InputAudioOptions = new RealtimeConversationSessionInputAudioOptions
                {
                    AudioFormat = new RealtimePcmAudioFormat(),
                    AudioTranscriptionOptions = new RealtimeAudioTranscriptionOptions
                    {
                        Model = "whisper-1",
                    },
                    TurnDetection = new RealtimeServerVadTurnDetection
                    {
                        DetectionThreshold = 0.9f,
                        SilenceDuration = TimeSpan.FromMilliseconds(1000),
                        PrefixPadding = TimeSpan.FromMilliseconds(300),
                    },
                },
                OutputAudioOptions = new RealtimeConversationSessionOutputAudioOptions
                {
                    AudioFormat = new RealtimePcmAudioFormat(),
                    Voice = RealtimeVoice.Alloy,
                },
            }
        });

        // 3. Send pre-recorded audio (PCM 24kHz, ~11 seconds) in 100ms chunks
        var pcmData = await File.ReadAllBytesAsync("input.wav"); // resampled to 24kHz
        const int chunkSize = 4800; // 100ms at 24kHz 16-bit mono
        for (int i = 0; i < pcmData.Length; i += chunkSize)
        {
            int len = Math.Min(chunkSize, pcmData.Length - i);
            await session.SendInputAudioAsync(
                BinaryData.FromBytes(new ReadOnlyMemory<byte>(pcmData, i, len)));
        }

        await Task.Delay(200);
        await session.SendCommandAsync(new RealtimeClientCommandInputAudioBufferCommit());

        // 4. Listen for events and extract token usage from response.done
        await foreach (var update in session.ReceiveUpdatesAsync())
        {
            if (update is RealtimeServerUpdateResponseDone responseDone)
            {
                var usage = responseDone.Response.Usage;
                Console.WriteLine($"Total={usage.TotalTokenCount}");
                Console.WriteLine($"  Input={usage.InputTokenCount}");
                Console.WriteLine($"    Text={usage.InputTokenDetails.TextTokenCount}");
                Console.WriteLine($"    Audio={usage.InputTokenDetails.AudioTokenCount}");
                Console.WriteLine($"    CachedText={usage.InputTokenDetails.CachedTokenDetails.TextTokenCount}");
                Console.WriteLine($"    CachedAudio={usage.InputTokenDetails.CachedTokenDetails.AudioTokenCount}");
                Console.WriteLine($"  Output={usage.OutputTokenCount}");
                Console.WriteLine($"    Text={usage.OutputTokenDetails.TextTokenCount}");
                Console.WriteLine($"    Audio={usage.OutputTokenDetails.AudioTokenCount}");
                break;
            }
        }
    }
}

Console output

Total=766
  Input=229
    Text=119
    Audio=110
    CachedText=64
    CachedAudio=64
  Output=537
    Text=117
    Audio=420

Conversation

User (transcribed from audio):

Hello! My name is Hamed. What is your name and what can you do for me? For example, what is the weather in Tehran?

Assistant (audio response transcript):

Hi Ahmed! Great to meet you. I'm an AI assistant and I'm here to help with all sorts of things—whether it's the weather, information, advice, or anything else. Now, let me get the latest weather for Tehran—give me just a moment. Alright, the current weather in Tehran is around 15 degrees Celsius, partly cloudy, with a slight breeze. If you need more details or anything else, let me know!

Raw usage from response.done event

{
  "TotalTokenCount": 766,
  "InputTokenCount": 229,
  "OutputTokenCount": 537,
  "InputTokenDetails": {
    "CachedTokenCount": 128,
    "TextTokenCount": 119,
    "AudioTokenCount": 110,
    "CachedTokenDetails": {
      "TextTokenCount": 64,
      "AudioTokenCount": 64
    }
  },
  "OutputTokenDetails": {
    "TextTokenCount": 117,
    "AudioTokenCount": 420
  }
}

Azure Cost Management meter values (same isolated session)

I filtered the Azure Cost Management report to only this deployment on the exact date this test ran, with no other traffic on the deployment.

Azure Meter Name API response.done Value Azure Cost Report Value Ratio
gpt rt aud mn in gl 1215 1M Tokens (audio input) 110 640 5.8×
gpt rt aud mn out gl 1215 1M Tokens (audio output) 420 631 1.5×
gpt rt txt mn in gl 1215 1M Tokens (text input) 119 357 3.0×
gpt rt txt mn out gl 1215 1M Tokens (text output) 117 181 1.5×
gpt rt txt mn cd in gl 1215 1M Tokens (cached text input) 64 0
gpt rt aud mn cd in gl 1215 1M Tokens (cached audio input) 64 0

Key observations

  1. Every single meter is different — not one value matches between the API response and Azure billing.

  2. Cached tokens (128 total): The API reports 64 cached text + 64 cached audio input tokens, but Azure reports 0 for both cached meters. Are cached tokens rolled into the non-cached meters for billing?

  3. Audio input is ~6× higher on Azure (640 vs 110): Even if cached audio is added (110 + 64 = 174), there are still 466 unexplained audio input tokens. Could Whisper transcription (which processes the same audio) be billed under this same meter? The transcription.completed event returned "usage": {"type": "duration", "seconds": 0}, suggesting Azure bills transcription by duration rather than tokens.

  4. Audio output is ~1.5× higher on Azure (631 vs 420): Where do the extra 211 audio output tokens come from?

  5. Text input is ~3× higher on Azure (357 vs 119): Even with cached text added (119 + 64 = 183), 174 tokens are unaccounted for. Could the default system instructions (which the model ships with — I did not set any custom instructions) be tokenized and billed but excluded from the response.done usage?

  6. Text output is ~1.5× higher on Azure (181 vs 117): Could Whisper's transcription output be billed under this text-output meter?

What I've verified

  • This was the only session on this deployment that day — no background usage.
  • I logged all 162 server events to JSON. There is exactly one response.done with non-zero tokens.
  • Only one input_audio_buffer.committed event (one audio turn).
  • The conversation.item.input_audio_transcription.completed event reported "usage": {"type": "duration", "seconds": 0}.
  • A benign VAD error (buffer too small) occurred after the response — this does not trigger billing.

Questions

  1. How does Azure map response.done usage fields to billing meters? Is there documented mapping, especially for cached vs non-cached tokens?

  2. Does Whisper transcription get billed under the same gpt rt aud mn in and gpt rt txt mn out meters as the main model — even when transcription.completed reports duration-based usage?

  3. Are there hidden "internal" tokens (default system instructions, audio framing overhead, internal chain-of-thought) that Azure bills but response.done does not report?

  4. Is response.done usage intended to reflect actual billing, or is it only an approximation?

  5. Has anyone successfully reconciled response.done token counts with Azure Cost Management for the Realtime API?

Environment

  • SDK: OpenAI .NET NuGet package (OpenAI.Realtime namespace)
  • Runtime: .NET 10
  • Azure region: Sweden Central
  • Model: gpt-realtime-mini-2025-12-15
  • Date: March 31, 2026

1 Answer 1

-1

The discrepancy you are seeing is primarily due to the difference between Model Inference Tokens (reported by the SDK) and Billing Unit Tokens (used by Azure Cost Management).

In the Azure OpenAI Realtime API, audio is not billed based on the semantic "tokens" the model uses to process sound, but rather on a fixed duration-to-token conversion rate.

1. The Audio "Duration" Conversion (The ~6x Gap)

While the response.done event reports the model's internal representation of the audio (110 tokens), Azure bills audio based on the actual duration of the stream.

  • The Math: For the Realtime models, Azure/OpenAI typically bills 1 minute of audio as 3,600 tokens (which is 60 tokens per second).

  • Your Test: You sent ~11 seconds of audio.

  • Calculation: $11 \text{ seconds} \times 60 \text{ tokens/sec} = \mathbf{660 \text{ tokens}}$.

  • Result: Your billing meter shows 640. This is an almost exact match to the duration-based calculation (the slight difference likely comes from VAD silence trimming).

2. The Whisper "Double-Bill" (Text Input Gap)

You have AudioTranscriptionOptions enabled. This creates a two-step billing scenario:

  1. Audio Input Meter: You are billed for the raw audio duration (the 640 tokens above).

  2. Text Input Meter: The resulting Whisper transcript is automatically injected into the conversation history as a text item. The model then "reads" this text to generate its response. This text transcript, plus the internal default system instructions and your Session Configuration JSON, are billed as standard Text Input tokens.

  • The response.done usage often only counts the "new" tokens in the current turn's delta, whereas the billing meter counts the entire active session context required to generate that response.

3. The Audio Output Gap

Similar to input, output audio is billed by duration:

  • Your Billing: 631 tokens.

  • Duration Calculation: $631 / 60 \text{ tokens/sec} \approx \mathbf{10.5 \text{ seconds}}$.

  • The assistant transcript you provided is approximately 65 words. At a natural speaking rate, 65 words take exactly 10–11 seconds to synthesize. The 420 tokens in your SDK output are the model's internal audio tokens; the 631 is the billed duration.

4. Missing "Cached" Meters

In many Azure regions (including Sweden Central as of early 2026), the billing backend does not yet break out "Cached" tokens into a separate line item in the Cost Management UI. Instead:

  • Cached tokens are often rolled into the Standard Input meter.

  • A discounted rate is applied to those specific units, or they are simply counted as "0" in the specific cached meter while being bundled into the primary one.

The SDK usage object is for monitoring model performance/latency; the Azure Cost Management report is the only source of truth for billing, as it applies the duration-based commercial logic.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.