Gemini Live API

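The Gemini Live API enables low-latency, bidirectional voice and video interactions with Gemini. With the Live API, you can give end users natural, human-like voice conversations, including the ability to interrupt the model's responses with voice commands. The Live API can process text, audio, and video input, and it can provide text and audio output.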

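Features

The Live API has the following technical specifications:

  • Inputs: text, audio, and video
  • Outputs: text and audio (synthesized speech)
  • Default session length: 10 minutes
    • Session length can be extended in 10-minute increments as needed
  • Context window: 32K tokens (32,768 by default)
  • Choice of 8 voices for responses
  • Support for responses in 31 languages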

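Use the Live API

The following sections provide examples of how to use the Live API's features.

For more information, see the Gemini Live API reference guide.

Send text and receive audio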

Gen AI SDK for Python

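# Notebook snippet: assumes IPython (top-level await, display) and a
# `client` and a `MODEL_ID` model name, created as in the text example
# that follows.
import numpy as np
from IPython.display import Audio, Markdown, display
from google.genai.types import (
    Content,
    LiveConnectConfig,
    Part,
    PrebuiltVoiceConfig,
    SpeechConfig,
    VoiceConfig,
)

voice_name = "Aoede"  # @param ["Aoede", "Puck", "Charon", "Kore", "Fenrir", "Leda", "Orus", "Zephyr"]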

config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            prebuilt_voice_config=PrebuiltVoiceConfig(
                voice_name=voice_name,
            )
        ),
    ),
)

async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    text_input = "Hello? Gemini are you there?"
    display(Markdown(f"**Input:** {text_input}"))

    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)]))

    audio_data = []
    async for message in session.receive():
        if (
            message.server_content.model_turn
            and message.server_content.model_turn.parts
        ):
            for part in message.server_content.model_turn.parts:
                if part.inline_data:
                    audio_data.append(
                        np.frombuffer(part.inline_data.data, dtype=np.int16)
                    )

    if audio_data:
        display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))
      

Send and receive text

Gen AI SDK for Python

Install

pip install --upgrade google-genai

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=us-central1
export GOOGLE_GENAI_USE_VERTEXAI=True

from google import genai
from google.genai.types import (
    Content,
    LiveConnectConfig,
    HttpOptions,
    Modality,
    Part,
)

client = genai.Client(http_options=HttpOptions(api_version="v1beta1"))
model_id = "gemini-2.0-flash-live-preview-04-09"

async with client.aio.live.connect(
    model=model_id,
    config=LiveConnectConfig(response_modalities=[Modality.TEXT]),
) as session:
    text_input = "Hello? Gemini, are you there?"
    print("> ", text_input, "\n")
    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    response = []

    async for message in session.receive():
        if message.text:
            response.append(message.text)

    print("".join(response))
# Example output:
# >  Hello? Gemini, are you there?
# Yes, I'm here. What would you like to talk about?

Send audio

Gen AI SDK for Python

import asyncio
import wave

from google import genai
from google.genai.types import Content, Part

client = genai.Client(api_key="GEMINI_API_KEY", http_options={'api_version': 'v1alpha'})
model = "gemini-2.0-flash-live-preview-04-09"

config = {"response_modalities": ["AUDIO"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        # Write the reply as a mono, 16-bit, 24 kHz WAV file.
        wf = wave.open("audio.wav", "wb")
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(24000)

        message = "Hello? Gemini are you there?"
        await session.send_client_content(
            turns=Content(role="user", parts=[Part(text=message)]))

        # Append each audio chunk to the file as it arrives.
        async for response in session.receive():
            if response.data is not None:
                wf.writeframes(response.data)

            # Un-comment this code to print audio data info
            # if response.server_content.model_turn is not None:
            #      print(response.server_content.model_turn.parts[0].inline_data.mime_type)

        wf.close()

if __name__ == "__main__":
    asyncio.run(main())
      

The Live API supports the following audio formats:

  • Input audio format: raw 16-bit PCM audio at 16 kHz, little-endian
  • Output audio format: raw 16-bit PCM audio at 24 kHz, little-endian
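
As a rough illustration of these formats, the sketch below packs float samples into raw 16-bit little-endian PCM and works out how long a buffer plays at a given sample rate. The helper names (`floats_to_pcm16`, `pcm_duration_seconds`) are made up for this example and are not part of any SDK.

```python
import struct

INPUT_RATE_HZ = 16_000   # sample rate listed above for input audio
OUTPUT_RATE_HZ = 24_000  # sample rate listed above for output audio
BYTES_PER_SAMPLE = 2     # 16-bit samples


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to raw 16-bit little-endian PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    # "<" selects little-endian, "h" is a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm_duration_seconds(pcm_bytes, sample_rate_hz):
    """Playback length of a mono 16-bit PCM buffer."""
    return len(pcm_bytes) / (BYTES_PER_SAMPLE * sample_rate_hz)


one_second_of_silence = floats_to_pcm16([0.0] * INPUT_RATE_HZ)
print(len(one_second_of_silence))  # 32000
print(pcm_duration_seconds(one_second_of_silence, INPUT_RATE_HZ))  # 1.0
```

The same arithmetic explains why the examples in this guide play or save responses at a rate of 24000: one second of output audio is 48,000 bytes.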

Audio transcription

The Live API can transcribe both input and output audio:

Gen AI SDK for Python

# Set model generation_config
CONFIG = {
    'response_modalities': ['AUDIO'],
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    'input_audio_transcription': {},
                    'output_audio_transcription': {}
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode("ascii"))

    # Send text message
    text_input = "Hello? Gemini are you there?"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }

    await ws.send(json.dumps(msg))

    responses = []
    input_transcriptions = []
    output_transcriptions = []

    # Receive chunks of server response
    async for raw_response in ws:
        response = json.loads(raw_response.decode())
        server_content = response.pop("serverContent", None)
        if server_content is None:
            break

        if (input_transcription := server_content.get("inputTranscription")) is not None:
            if (text := input_transcription.get("text")) is not None:
                input_transcriptions.append(text)
        if (output_transcription := server_content.get("outputTranscription")) is not None:
            if (text := output_transcription.get("text")) is not None:
                output_transcriptions.append(text)

        model_turn = server_content.pop("modelTurn", None)
        if model_turn is not None:
            parts = model_turn.pop("parts", None)
            if parts is not None:
                for part in parts:
                    pcm_data = base64.b64decode(part["inlineData"]["data"])
                    responses.append(np.frombuffer(pcm_data, dtype=np.int16))

        # End of turn
        turn_complete = server_content.pop("turnComplete", None)
        if turn_complete:
            break

    if input_transcriptions:
        display(Markdown(f"**Input transcription >** {''.join(input_transcriptions)}"))

    if responses:
        # Play the returned audio message
        display(Audio(np.concatenate(responses), rate=24000, autoplay=True))

    if output_transcriptions:
        display(Markdown(f"**Output transcription >** {''.join(output_transcriptions)}"))
      

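Change voice and language settings

The Live API uses Chirp 3 to support synthesized speech responses in 8 HD voices and 31 languages.

You can choose from the following voices:

  • Aoede (female)
  • Charon (male)
  • Fenrir (male)
  • Kore (female)
  • Leda (female)
  • Orus (male)
  • Puck (male)
  • Zephyr (female)

To hear demos of these voices and to see the full list of available languages, see Chirp 3: HD voices.

To set the response voice and language: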

Gen AI SDK for Python

config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            prebuilt_voice_config=PrebuiltVoiceConfig(
                voice_name=voice_name,
            )
        ),
        language_code="en-US",
    ),
)
      

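Console

  1. Go to Vertex AI Studio > Live API.
  2. In the Outputs expander, select a voice from the Voice drop-down.
  3. In the same expander, select a language from the Language drop-down.
  4. Click Start session to start the session.

For the best results when prompting and requiring the model to respond in a non-English language, include the following as part of your system instructions: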

RESPOND IN LANGUAGE. YOU MUST RESPOND UNMISTAKABLY IN LANGUAGE.
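
A minimal sketch of filling in that template and attaching it to a setup message follows. The helper names are hypothetical, upper-casing the language name is a stylistic choice that mirrors the template, and the `system_instruction` field shape is an assumption rather than something this guide spells out.

```python
TEMPLATE = "RESPOND IN {language}. YOU MUST RESPOND UNMISTAKABLY IN {language}."


def language_instruction(language):
    """Fill the LANGUAGE placeholder, keeping the template's upper case."""
    return TEMPLATE.format(language=language.upper())


def setup_with_language(model, language):
    """Build a setup message that carries the language instruction."""
    return {
        "setup": {
            "model": model,
            "generation_config": {"response_modalities": ["AUDIO"]},
            # Assumption: system instructions are passed as a content
            # object with text parts.
            "system_instruction": {
                "parts": [{"text": language_instruction(language)}]
            },
        }
    }


print(language_instruction("Spanish"))
# RESPOND IN SPANISH. YOU MUST RESPOND UNMISTAKABLY IN SPANISH.
```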

Have a streamed conversation

Gen AI SDK for Python

Set up a conversation with the API that lets you send text prompts and receive audio responses:

# Set model generation_config
CONFIG = {"response_modalities": ["AUDIO"]}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

async def main() -> None:
    # Connect to the server
    async with connect(SERVICE_URL, additional_headers=headers) as ws:

        # Setup the session
        async def setup() -> None:
            await ws.send(
                json.dumps(
                    {
                        "setup": {
                            "model": "gemini-2.0-flash-live-preview-04-09",
                            "generation_config": CONFIG,
                        }
                    }
                )
            )

            # Receive setup response
            raw_response = await ws.recv(decode=False)
            setup_response = json.loads(raw_response.decode("ascii"))
            print(f"Connected: {setup_response}")
            return

        # Send text message
        async def send() -> bool:
            text_input = input("Input > ")
            if text_input.lower() in ("q", "quit", "exit"):
                return False

            msg = {
                "client_content": {
                    "turns": [{"role": "user", "parts": [{"text": text_input}]}],
                    "turn_complete": True,
                }
            }

            await ws.send(json.dumps(msg))
            return True

        # Receive server response
        async def receive() -> None:
            responses = []

            # Receive chunks of server response
            async for raw_response in ws:
                response = json.loads(raw_response.decode())
                server_content = response.pop("serverContent", None)
                if server_content is None:
                    break

                model_turn = server_content.pop("modelTurn", None)
                if model_turn is not None:
                    parts = model_turn.pop("parts", None)
                    if parts is not None:
                        for part in parts:
                            pcm_data = base64.b64decode(part["inlineData"]["data"])
                            responses.append(np.frombuffer(pcm_data, dtype=np.int16))

                # End of turn
                turn_complete = server_content.pop("turnComplete", None)
                if turn_complete:
                    break

            # Play the returned audio message
            display(Markdown("**Response >**"))
            display(Audio(np.concatenate(responses), rate=24000, autoplay=True))
            return

        await setup()

        while True:
            if not await send():
                break
            await receive()
      

Start the conversation, enter your prompts, or type q, quit, or exit to exit.

await main()
      

Console

  1. Go to Vertex AI Studio > Live API.
  2. Click Start session to start the conversation session.

To end the session, click Stop session.

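Session length

The default maximum length of a conversation session is 10 minutes. A go_away notification (BidiGenerateContentServerMessage.go_away) is sent to the client 60 seconds before the session ends.

When using the API, you can extend the length of your session in 10-minute increments. There is no limit on how many times you can extend a session. For an example of how to extend your session length, see Enable and disable session resumption. This feature is currently available only in the API, not in Vertex AI Studio.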
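
A client can watch for the go_away notification and use the remaining time to wrap up or reconnect. The sketch below assumes the JSON form of the message uses a camelCase `goAway` key with a `timeLeft` duration string such as "60s", by analogy with the `serverContent` and `sessionResumptionUpdate` keys used elsewhere in this guide. Both names are assumptions, and the helper is hypothetical.

```python
import json


def seconds_left_before_close(raw_message):
    """Return the remaining seconds for a go_away notice, else None."""
    message = json.loads(raw_message)
    # Assumption: the notice arrives under a camelCase "goAway" key.
    go_away = message.get("goAway")
    if go_away is None:
        return None
    # Assumption: the remaining time is a duration string such as "60s".
    time_left = go_away.get("timeLeft", "0s")
    return float(time_left.rstrip("s"))


print(seconds_left_before_close('{"goAway": {"timeLeft": "60s"}}'))  # 60.0
print(seconds_left_before_close('{"serverContent": {}}'))  # None
```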

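Context window

By default, the maximum context length for a session in the Live API is 32,768 tokens. These tokens store the real-time data streamed in, at a rate of 25 tokens per second (TPS) for audio and 258 TPS for video, along with other content such as text-based inputs and model outputs.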
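
To get a feel for these rates, the following back-of-the-envelope sketch estimates how long a stream can run before it fills the default window. It assumes that audio and video streamed together simply add their rates, and it ignores text inputs and model outputs, so real sessions fill up sooner.

```python
CONTEXT_LIMIT_TOKENS = 32_768  # default maximum context length
AUDIO_TPS = 25                 # tokens per second of streamed audio
VIDEO_TPS = 258                # tokens per second of streamed video


def seconds_until_full(tokens_per_second, already_used=0):
    """Seconds of streaming that fit in the remaining context."""
    return (CONTEXT_LIMIT_TOKENS - already_used) / tokens_per_second


audio_only = seconds_until_full(AUDIO_TPS)
audio_video = seconds_until_full(AUDIO_TPS + VIDEO_TPS)
print(f"audio only: {audio_only / 60:.1f} min")  # audio only: 21.8 min
print(f"audio + video: {audio_video:.0f} s")     # audio + video: 116 s
```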

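If the context window exceeds the maximum context length, the contexts of the oldest turns are truncated from the context window so that the overall context window size stays below the limit.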
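
The effect of this truncation can be pictured with a small client-side model. This is only an illustration of dropping the oldest turns first. It is not the server's actual algorithm, and the turn IDs and token counts are invented.

```python
from collections import deque


def truncate_oldest_turns(turns, limit_tokens):
    """Drop the oldest turns until the total fits within limit_tokens.

    turns is a list of (turn_id, token_count) pairs, oldest first.
    """
    window = deque(turns)
    total = sum(tokens for _, tokens in window)
    while window and total > limit_tokens:
        _, dropped = window.popleft()  # remove the oldest turn
        total -= dropped
    return list(window), total


turns = [("t1", 12_000), ("t2", 15_000), ("t3", 9_000)]
kept, total = truncate_oldest_turns(turns, 32_768)
print(kept)   # [('t2', 15000), ('t3', 9000)]
print(total)  # 24000
```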

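You can configure the default context length of the session and the target context length after truncation using the context_window_compression.trigger_tokens and context_window_compression.sliding_window.target_tokens fields of the setup message, respectively.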
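
A setup message carrying these two fields might be built as follows. The field names come from the paragraph above. The token values are arbitrary, the helper is hypothetical, and the check that the target is below the trigger is a sanity check for this sketch rather than a documented API rule.

```python
import json


def build_setup_message(model, trigger_tokens, target_tokens):
    """Build a setup message with sliding-window context compression."""
    if target_tokens >= trigger_tokens:
        # Sanity check for this sketch only, not a documented constraint.
        raise ValueError("target_tokens should be smaller than trigger_tokens")
    return {
        "setup": {
            "model": model,
            "generation_config": {"response_modalities": ["TEXT"]},
            "context_window_compression": {
                "trigger_tokens": trigger_tokens,
                "sliding_window": {"target_tokens": target_tokens},
            },
        }
    }


setup_message = build_setup_message(
    "gemini-2.0-flash-live-preview-04-09",
    trigger_tokens=16_000,  # illustrative value
    target_tokens=8_000,    # illustrative value
)
payload = json.dumps(setup_message)  # ready for ws.send(payload)
```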

Concurrent sessions

By default, you can have up to 10 concurrent sessions per project.
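
One way to stay under this limit on the client side is to guard session creation with a semaphore. The sketch below uses a stand-in coroutine instead of a real Live API session, and `run_limited` is a hypothetical helper name.

```python
import asyncio

MAX_CONCURRENT_SESSIONS = 10  # default per-project limit stated above


async def run_limited(jobs, limit=MAX_CONCURRENT_SESSIONS):
    """Run coroutine factories with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


stats = {"active": 0, "peak": 0}


async def fake_session():
    """Stand-in for opening a session and holding a conversation."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    return "done"


results = asyncio.run(run_limited([fake_session] * 25))
print(len(results), stats["peak"])  # 25 10
```

This only limits sessions opened by one process. Several processes sharing a project would need shared coordination.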

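Update the system instructions mid-session

The Live API lets you update the system instructions during an active session. You can use this to adapt the model's responses mid-session, such as changing the language the model responds in or modifying the tone you want the model to respond with.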
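
This guide does not show the message used for such an update. The sketch below assumes that new instructions can be sent as a client_content turn with the role "system" and with turn_complete set to false so that the update alone does not trigger a reply. Treat both details as assumptions to verify against the API reference. The helper name is hypothetical.

```python
import json


def system_instruction_update(new_instruction):
    """Build a client_content message that carries new system instructions."""
    return {
        "client_content": {
            # Assumption: a turn with role "system" updates the instructions.
            "turns": [{"role": "system", "parts": [{"text": new_instruction}]}],
            # Assumption: False, so the update itself does not start a reply.
            "turn_complete": False,
        }
    }


payload = json.dumps(system_instruction_update("Respond in a formal tone."))
# With an open WebSocket `ws`, as in the other examples in this guide:
#     await ws.send(payload)
```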

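Change voice activity detection settings

By default, the model automatically performs voice activity detection (VAD) on a continuous audio input stream. You can configure VAD with the realtimeInputConfig.automaticActivityDetection field of the setup message.

When the audio stream is paused for more than a second (for example, because the user switched off the microphone), send an audioStreamEnd event to flush any cached audio. The client can resume sending audio data at any time.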
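
The one-second rule can be tracked on the client with a small helper like the one below. The class is hypothetical, and the message shape (a `realtimeInput` envelope with a boolean `audioStreamEnd`) is an assumption. Only the audioStreamEnd event name comes from this guide.

```python
import json

PAUSE_THRESHOLD_SECONDS = 1.0  # pause length after which to flush


class StreamPauseTracker:
    """Decides when an audioStreamEnd event is due after a pause."""

    def __init__(self):
        self.last_chunk_at = None
        self.end_sent = False

    def on_chunk(self, now):
        """Record that an audio chunk was sent at time `now` (seconds)."""
        self.last_chunk_at = now
        self.end_sent = False

    def should_send_stream_end(self, now):
        """True once per pause, after the threshold has passed."""
        if self.last_chunk_at is None or self.end_sent:
            return False
        if now - self.last_chunk_at > PAUSE_THRESHOLD_SECONDS:
            self.end_sent = True
            return True
        return False


def stream_end_message():
    # Assumption: the event is a boolean inside a realtimeInput message.
    return json.dumps({"realtimeInput": {"audioStreamEnd": True}})
```

In a real client, `now` would come from a monotonic clock such as `time.monotonic()`, and the check would run on a timer while the microphone is idle.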

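Alternatively, you can turn off automatic VAD by setting realtimeInputConfig.automaticActivityDetection.disabled to true in the setup message. In this configuration, the client is responsible for detecting user speech and sending activityStart and activityEnd messages at the appropriate times. An audioStreamEnd isn't sent in this configuration. Instead, any interruption of the stream is marked by an activityEnd message.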
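
With automatic VAD turned off, each utterance is bracketed by the client. The sketch below builds the setup message and the message sequence for one utterance. The `realtimeInput` envelope, the `audio` blob with `mimeType` and base64 `data`, and the helper names are assumptions for illustration. Only the field and event names mentioned above come from this guide.

```python
import base64
import json


def manual_vad_setup(model):
    """Setup message that turns off automatic voice activity detection."""
    return {
        "setup": {
            "model": model,
            "realtimeInputConfig": {
                "automaticActivityDetection": {"disabled": True}
            },
        }
    }


def utterance_messages(pcm_chunks):
    """Yield activityStart, the audio chunks, then activityEnd."""
    yield {"realtimeInput": {"activityStart": {}}}
    for chunk in pcm_chunks:
        yield {
            "realtimeInput": {
                # Assumption: audio is sent as a base64 blob with a MIME type.
                "audio": {
                    "mimeType": "audio/pcm;rate=16000",
                    "data": base64.b64encode(chunk).decode("ascii"),
                }
            }
        }
    yield {"realtimeInput": {"activityEnd": {}}}


messages = list(utterance_messages([b"\x00\x00", b"\x01\x00"]))
payloads = [json.dumps(m) for m in messages]  # one ws.send() per payload
```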

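Enable and disable session resumption

This feature is disabled by default. It must be enabled by the user every time they call the API by specifying the field in the API request, and project-level privacy is enforced for cached data. Enabling session resumption lets the user reconnect to a previous session within 24 hours by storing cached data, including text, video, and audio prompt data and model outputs, for up to 24 hours. To achieve zero data retention, don't enable this feature.

To enable session resumption, set the session_resumption field of the BidiGenerateContentSetup message. If enabled, the server periodically takes a snapshot of the current cached session context and stores it in internal storage. When a snapshot is taken successfully, a resumption_update is returned with a handle ID that you can record and use later to resume the session from the snapshot.

Here's an example of enabling session resumption and collecting the handle ID: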

Gen AI SDK for Python

# Set model generation_config
CONFIG = {"response_modalities": ["TEXT"]}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    # Enable session resumption.
                    "session_resumption": {},
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode("ascii"))

    # Send text message
    text_input = "Hello? Gemini are you there?"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }

    await ws.send(json.dumps(msg))

    responses = []
    handle_id = ""

    turn_completed = False
    resumption_received = False

    # Receive chunks of server response,
    # wait for turn completion and resumption handle.
    async for raw_response in ws:
        response = json.loads(raw_response.decode())

        server_content = response.pop("serverContent", None)
        resumption_update = response.pop("sessionResumptionUpdate", None)

        if server_content is not None:
          model_turn = server_content.pop("modelTurn", None)
          if model_turn is not None:
              parts = model_turn.pop("parts", None)
              if parts is not None:
                  responses.append(parts[0]["text"])

          # End of turn
          turn_complete = server_content.pop("turnComplete", None)
          if turn_complete:
            turn_completed = True

        elif resumption_update is not None:
          handle_id = resumption_update['newHandle']
          resumption_received = True
        else:
          continue

        if turn_completed and resumption_received:
          break

    # Print the server response
    display(Markdown(f"**Response >** {''.join(responses)}"))
    display(Markdown(f"**Session Handle ID >** {handle_id}"))
      

If you want to resume the previous session, set the handle field of the setup.session_resumption configuration to the previously recorded handle ID:

Gen AI SDK for Python

# Set model generation_config
CONFIG = {"response_modalities": ["TEXT"]}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    # Enable session resumption.
                    "session_resumption": {
                        "handle": handle_id,
                    },
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode("ascii"))

    # Send text message
    text_input = "What was the last question I asked?"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }

    await ws.send(json.dumps(msg))

    responses = []
    handle_id = ""

    turn_completed = False
    resumption_received = False

    # Receive chunks of server response,
    # wait for turn completion and resumption handle.
    async for raw_response in ws:
        response = json.loads(raw_response.decode())

        server_content = response.pop("serverContent", None)
        resumption_update = response.pop("sessionResumptionUpdate", None)

        if server_content is not None:
          model_turn = server_content.pop("modelTurn", None)
          if model_turn is not None:
              parts = model_turn.pop("parts", None)
              if parts is not None:
                  responses.append(parts[0]["text"])

          # End of turn
          turn_complete = server_content.pop("turnComplete", None)
          if turn_complete:
            turn_completed = True

        elif resumption_update is not None:
          handle_id = resumption_update['newHandle']
          resumption_received = True
        else:
          continue

        if turn_completed and resumption_received:
          break

    # Print the server response
    # Expected answer: "You just asked if I was there."
    display(Markdown(f"**Response >** {''.join(responses)}"))
    display(Markdown(f"**Session Handle >** {handle_id}"))
      

If you want to achieve seamless session resumption, you can enable transparent mode:

Gen AI SDK for Python

await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    # Enable session resumption.
                    "session_resumption": {
                        "transparent": True,
                    },
                }
            }
        )
    )
      

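After transparent mode is enabled, the index of the client message that corresponds with the context snapshot is explicitly returned. This helps identify which client messages you need to send again when you resume the session from the resumption handle.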
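
The bookkeeping this implies can be sketched as follows. The field name `lastConsumedClientMessageIndex`, the zero-based indexing, and the convention that the indexed message is already part of the snapshot are all assumptions. The helper names are hypothetical.

```python
def last_consumed_index(resumption_update):
    """Read the client message index from a resumption update, if present."""
    # Assumption: the index is reported under this camelCase key. 64-bit
    # integers may arrive as JSON strings, so convert with int().
    value = resumption_update.get("lastConsumedClientMessageIndex")
    return None if value is None else int(value)


def messages_to_resend(sent_messages, consumed_index):
    """Return the client messages sent after the snapshot was taken.

    sent_messages holds every client message in send order.
    """
    if consumed_index is None:
        return list(sent_messages)
    return sent_messages[consumed_index + 1:]


sent = ["msg-0", "msg-1", "msg-2", "msg-3"]
update = {"newHandle": "abc", "lastConsumedClientMessageIndex": "1"}
print(messages_to_resend(sent, last_consumed_index(update)))
# ['msg-2', 'msg-3']
```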

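Use function calling

You can use function calling to create a description of a function, then pass that description to the model in a request. The response from the model includes the name of a function that matches the description and the arguments to call it with.

All functions must be declared at the start of the session by sending tool definitions as part of the setup message.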

Gen AI SDK for Python

# Set model generation_config
CONFIG = {"response_modalities": ["TEXT"]}

# Define function declarations
TOOLS = {
    "function_declarations": {
        "name": "get_current_weather",
        "description": "Get the current weather in the given location",
        "parameters": {
            "type": "OBJECT",
            "properties": {"location": {"type": "STRING"}},
        },
    }
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    "tools": TOOLS,
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode())

    # Send text message
    text_input = "Get the current weather in Santa Clara, San Jose and Mountain View"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }

    await ws.send(json.dumps(msg))

    responses = []

    # Receive chunks of server response
    async for raw_response in ws:
        response = json.loads(raw_response.decode("UTF-8"))

        if (tool_call := response.get("toolCall")) is not None:
            for function_call in tool_call["functionCalls"]:
                responses.append(f"FunctionCall: {str(function_call)}\n")

        if (server_content := response.get("serverContent")) is not None:
            if server_content.get("turnComplete", True):
                break

    # Print the server response
    display(Markdown("**Response >** {}".format("\n".join(responses))))
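
The example above only collects the function calls that the model requests. A possible next step is to run each function locally and send the results back. The sketch below assumes a `tool_response` message with a `function_responses` list whose entries echo each call's `id` and `name`. Those field names, the `id` and `args` keys on the call, and the stand-in weather function are all assumptions.

```python
import json


def get_current_weather(location):
    """Stand-in implementation that returns canned data for illustration."""
    return {"location": location, "forecast": "sunny"}


LOCAL_FUNCTIONS = {"get_current_weather": get_current_weather}


def build_tool_response(tool_call):
    """Run every requested function and wrap the results in one message."""
    function_responses = []
    for call in tool_call["functionCalls"]:
        handler = LOCAL_FUNCTIONS[call["name"]]
        result = handler(**call.get("args", {}))
        function_responses.append(
            # Assumption: each response echoes the call's id and name.
            {"id": call.get("id"), "name": call["name"], "response": result}
        )
    return {"tool_response": {"function_responses": function_responses}}


example_call = {
    "functionCalls": [
        {"id": "call-1", "name": "get_current_weather",
         "args": {"location": "San Jose"}}
    ]
}
payload = json.dumps(build_tool_response(example_call))
# With an open WebSocket `ws`: await ws.send(payload)
```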
      

Use code execution

You can use code execution with the Live API to generate and execute Python code directly.

Gen AI SDK for Python

# Set model generation_config
CONFIG = {"response_modalities": ["TEXT"]}

# Set code execution
TOOLS = {"code_execution": {}}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    "tools": TOOLS,
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode())

    # Send text message
    text_input = "Write code to calculate the 15th fibonacci number then find the nearest palindrome to it"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }

    await ws.send(json.dumps(msg))

    responses = []

    # Receive chunks of server response
    async for raw_response in ws:
        response = json.loads(raw_response.decode("UTF-8"))

        if (server_content := response.get("serverContent")) is not None:
            if (model_turn:= server_content.get("modelTurn")) is not None:
              if (parts := model_turn.get("parts")) is not None:
                if parts[0].get("text"):
                    responses.append(parts[0]["text"])
                for part in parts:
                    if (executable_code := part.get("executableCode")) is not None:
                        display(
                            Markdown(
                                f"""**Executable code:**
```py
{executable_code.get("code")}
```
                            """
                            )
                        )
            if server_content.get("turnComplete", False):
                break

    # Print the server response
    display(Markdown(f"**Response >** {''.join(responses)}"))
      

Use Grounding with Google Search

You can use Grounding with Google Search with the Live API by using google_search:

Gen AI SDK for Python

# Set model generation_config
CONFIG = {"response_modalities": ["TEXT"]}

# Set google search
TOOLS = {"google_search": {}}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    "tools": TOOLS,
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode())

    # Send text message
    text_input = "What is the current weather in San Jose, CA?"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }

    await ws.send(json.dumps(msg))

    responses = []

    # Receive chunks of server response
    async for raw_response in ws:
        response = json.loads(raw_response.decode())
        server_content = response.pop("serverContent", None)
        if server_content is None:
            break

        model_turn = server_content.pop("modelTurn", None)
        if model_turn is not None:
            parts = model_turn.pop("parts", None)
            if parts is not None:
                responses.append(parts[0]["text"])

        # End of turn
        turn_complete = server_content.pop("turnComplete", None)
        if turn_complete:
            break

    # Print the server response
    display(Markdown("**Response >** {}".format("\n".join(responses))))
      

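Limitations

See the Gemini Live API limitations section of our reference documentation for the full list of current limitations for the Live API.

Pricing

See our Pricing page for details.

More information

For more information on the Live API, such as the WebSocket API reference, see the Gemini API documentation.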