The Gemini Live API enables low-latency, bidirectional voice and video interactions with Gemini. With the Live API, you can give end users a natural, human-like voice conversation experience, including the ability to interrupt the model's responses with voice commands. The Live API can process text, audio, and video input, and it provides text and audio output.
Features
The Live API has the following technical specifications:
- Input: text, audio, and video
- Output: text and audio (synthesized speech)
- Default session length: 10 minutes
- Session length can be extended in 10-minute increments as needed
- Context window: 32K tokens
- Choice of 8 voices for responses
- Support for responses in 31 languages
Use the Live API
The following sections provide examples of how to use the features of the Live API.
For more information, see the Gemini Live API reference guide.
Send text and receive audio
Gen AI SDK for Python
```python
voice_name = "Aoede"  # @param ["Aoede", "Puck", "Charon", "Kore", "Fenrir", "Leda", "Orus", "Zephyr"]

config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            prebuilt_voice_config=PrebuiltVoiceConfig(
                voice_name=voice_name,
            )
        ),
    ),
)

async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    text_input = "Hello? Gemini are you there?"
    display(Markdown(f"**Input:** {text_input}"))

    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    audio_data = []
    async for message in session.receive():
        if (
            message.server_content.model_turn
            and message.server_content.model_turn.parts
        ):
            for part in message.server_content.model_turn.parts:
                if part.inline_data:
                    audio_data.append(
                        np.frombuffer(part.inline_data.data, dtype=np.int16)
                    )

    if audio_data:
        display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))
```
Send and receive text
Gen AI SDK for Python
Install

```shell
pip install --upgrade google-genai
```

Set environment variables to use the Gen AI SDK with Vertex AI:

```shell
# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=us-central1
export GOOGLE_GENAI_USE_VERTEXAI=True
```
Send audio
Gen AI SDK for Python
```python
import asyncio
import wave

from google import genai
from google.genai.types import Content, Part

client = genai.Client(api_key="GEMINI_API_KEY", http_options={"api_version": "v1alpha"})
model = "gemini-2.0-flash-live-preview-04-09"

config = {"response_modalities": ["AUDIO"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        wf = wave.open("audio.wav", "wb")
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(24000)

        message = "Hello? Gemini are you there?"
        await session.send_client_content(
            turns=Content(role="user", parts=[Part(text=message)])
        )

        async for response in session.receive():
            if response.data is not None:
                wf.writeframes(response.data)

                # Un-comment this code to print audio data info
                # if response.server_content.model_turn is not None:
                #     print(response.server_content.model_turn.parts[0].inline_data.mime_type)

        wf.close()

if __name__ == "__main__":
    asyncio.run(main())
```
The Live API supports the following audio formats:
- Input audio format: raw 16-bit PCM audio at 16 kHz, little-endian
- Output audio format: raw 16-bit PCM audio at 24 kHz, little-endian
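As a concrete illustration of the input format, the stdlib-only sketch below generates one second of a sine tone as raw little-endian 16-bit PCM at 16 kHz; the 440 Hz frequency and 0.3 amplitude are arbitrary choices for the example:

```python
import math
import struct

# Generate 1 second of a 440 Hz sine tone as raw 16-bit little-endian PCM
# at 16 kHz -- the input format listed above. Output audio from the API
# uses the same sample format, but at 24 kHz.
SAMPLE_RATE = 16000      # Hz
N_SAMPLES = SAMPLE_RATE  # 1 second of mono audio

samples = [
    int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(N_SAMPLES)
]

# "<h" packs each sample as a little-endian signed 16-bit integer.
pcm_bytes = struct.pack(f"<{N_SAMPLES}h", *samples)

print(len(pcm_bytes))  # 16000 samples x 2 bytes = 32000 bytes
```

Bytes in this shape can be streamed to the API as realtime input; decoding output works the same way in reverse, as the `np.frombuffer(..., dtype=np.int16)` calls in the examples show.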
Audio transcription
The Live API can transcribe both input and output audio:
Gen AI SDK for Python
```python
# Set model generation_config
CONFIG = {
    "response_modalities": ["AUDIO"],
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    "input_audio_transcription": {},
                    "output_audio_transcription": {},
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode("ascii"))

    # Send text message
    text_input = "Hello? Gemini are you there?"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }
    await ws.send(json.dumps(msg))

    responses = []
    input_transcriptions = []
    output_transcriptions = []

    # Receive chunks of server response
    async for raw_response in ws:
        response = json.loads(raw_response.decode())
        server_content = response.pop("serverContent", None)
        if server_content is None:
            break

        if (input_transcription := server_content.get("inputTranscription")) is not None:
            if (text := input_transcription.get("text")) is not None:
                input_transcriptions.append(text)

        if (output_transcription := server_content.get("outputTranscription")) is not None:
            if (text := output_transcription.get("text")) is not None:
                output_transcriptions.append(text)

        model_turn = server_content.pop("modelTurn", None)
        if model_turn is not None:
            parts = model_turn.pop("parts", None)
            if parts is not None:
                for part in parts:
                    pcm_data = base64.b64decode(part["inlineData"]["data"])
                    responses.append(np.frombuffer(pcm_data, dtype=np.int16))

        # End of turn
        turn_complete = server_content.pop("turnComplete", None)
        if turn_complete:
            break

    if input_transcriptions:
        display(Markdown(f"**Input transcription >** {''.join(input_transcriptions)}"))

    if responses:
        # Play the returned audio message
        display(Audio(np.concatenate(responses), rate=24000, autoplay=True))

    if output_transcriptions:
        display(Markdown(f"**Output transcription >** {''.join(output_transcriptions)}"))
```
Change voice and language settings
The Live API uses Chirp 3 to support synthesized speech responses in 8 HD voices and 31 languages.
You can choose from the following voices:
- Aoede (female)
- Charon (male)
- Fenrir (male)
- Kore (female)
- Leda (female)
- Orus (male)
- Puck (male)
- Zephyr (female)
To hear what these voices sound like and to see the full list of available languages, see Chirp 3: HD voices.
To set the response voice and language:
Gen AI SDK for Python
```python
config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            prebuilt_voice_config=PrebuiltVoiceConfig(
                voice_name=voice_name,
            )
        ),
        language_code="en-US",
    ),
)
```
Console
- Open Vertex AI Studio > Live API.
- In the Outputs expander, select a voice from the Voice drop-down.
- In the same expander, select a language from the Language drop-down.
- Click Start session to start the session.
When prompting the model to respond in a language other than English, and requiring that it do so, include the following in your system instructions for best results:
RESPOND IN LANGUAGE. YOU MUST RESPOND UNMISTAKABLY IN LANGUAGE.
Have a streamed conversation
Gen AI SDK for Python
Set up a conversation with the API that lets you send text prompts and receive audio responses:
```python
# Set model generation_config
CONFIG = {"response_modalities": ["AUDIO"]}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

async def main() -> None:
    # Connect to the server
    async with connect(SERVICE_URL, additional_headers=headers) as ws:
        # Setup the session
        async def setup() -> None:
            await ws.send(
                json.dumps(
                    {
                        "setup": {
                            "model": "gemini-2.0-flash-live-preview-04-09",
                            "generation_config": CONFIG,
                        }
                    }
                )
            )

            # Receive setup response
            raw_response = await ws.recv(decode=False)
            setup_response = json.loads(raw_response.decode("ascii"))
            print(f"Connected: {setup_response}")
            return

        # Send text message
        async def send() -> bool:
            text_input = input("Input > ")
            if text_input.lower() in ("q", "quit", "exit"):
                return False

            msg = {
                "client_content": {
                    "turns": [{"role": "user", "parts": [{"text": text_input}]}],
                    "turn_complete": True,
                }
            }
            await ws.send(json.dumps(msg))
            return True

        # Receive server response
        async def receive() -> None:
            responses = []

            # Receive chunks of server response
            async for raw_response in ws:
                response = json.loads(raw_response.decode())
                server_content = response.pop("serverContent", None)
                if server_content is None:
                    break

                model_turn = server_content.pop("modelTurn", None)
                if model_turn is not None:
                    parts = model_turn.pop("parts", None)
                    if parts is not None:
                        for part in parts:
                            pcm_data = base64.b64decode(part["inlineData"]["data"])
                            responses.append(np.frombuffer(pcm_data, dtype=np.int16))

                # End of turn
                turn_complete = server_content.pop("turnComplete", None)
                if turn_complete:
                    break

            # Play the returned audio message
            display(Markdown("**Response >**"))
            display(Audio(np.concatenate(responses), rate=24000, autoplay=True))
            return

        await setup()

        while True:
            if not await send():
                break
            await receive()
```
Start the conversation, enter your prompts, or type q, quit, or exit to quit.

```python
await main()
```
Console
- Open Vertex AI Studio > Live API.
- Click Start session to begin a conversation session.

To end the session, click Stop session.
Session length
The default maximum length of a conversation session is 10 minutes. A go_away notification (BidiGenerateContentServerMessage.go_away) is sent to the client 60 seconds before the session ends.
With the API, you can extend the session length in 10-minute increments, with no limit on the number of extensions. For an example of how to extend the session length, see Enable and disable session resumption. This feature is currently available only through the API, not in Vertex AI Studio.
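A client can watch for this notification in its receive loop to save state or request an extension before the cutoff. The minimal sketch below assumes the notification arrives as a camelCase "goAway" field on the wire, matching other server messages in this guide such as "serverContent"; the payload shape shown in the example call is illustrative:

```python
import json

def session_ending_soon(raw_response: bytes) -> bool:
    """Return True if a server message carries a go_away notification.

    The camelCase wire name "goAway" is an assumption, inferred from the
    other server-message fields used in this guide.
    """
    response = json.loads(raw_response.decode())
    return response.get("goAway") is not None

# Synthetic server messages for illustration:
print(session_ending_soon(b'{"goAway": {"timeLeft": "60s"}}'))  # True
print(session_ending_soon(b'{"serverContent": {}}'))            # False
```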
Context window
By default, the maximum context length for a session in the Live API is 32,768 tokens. This allocation stores realtime data streamed in at a rate of 25 tokens per second (TPS) for audio and 258 TPS for video, along with other content such as text inputs and model outputs.
If the context window exceeds the maximum context length, the contexts of the oldest turns in the window are truncated so that the overall context window size stays below the limit.
You can configure the session's default context length and the target context length after truncation using the context_window_compression.trigger_tokens and context_window_compression.sliding_window.target_tokens fields of the setup message, respectively.
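For instance, a setup message carrying these two fields might look like the sketch below; the field names follow the description above, and the token counts are illustrative values rather than recommendations:

```python
import json

# Illustrative setup message configuring context-window compression.
# trigger_tokens: compression kicks in once the context nears this size.
# sliding_window.target_tokens: size the context is truncated down to.
setup_msg = {
    "setup": {
        "model": "gemini-2.0-flash-live-preview-04-09",
        "generation_config": {"response_modalities": ["AUDIO"]},
        "context_window_compression": {
            "trigger_tokens": 28000,
            "sliding_window": {"target_tokens": 16000},
        },
    }
}

payload = json.dumps(setup_msg)
```

A payload like this would be sent over the WebSocket connection in the same way as the other setup messages in this guide.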
Concurrent sessions
By default, you can have up to 10 concurrent sessions per project.
Update system instructions mid-session
The Live API lets you update the system instructions in the middle of an active session. Use this to adapt the model's responses mid-session, such as changing the language the model responds in or modifying the tone of its responses.
Change voice activity detection settings
By default, the model automatically performs voice activity detection (VAD) on a continuous audio input stream. You can configure VAD with the realtimeInputConfig.automaticActivityDetection field of the setup message.
When the audio stream is paused for more than a second (for example, because the user switched off the microphone), an audioStreamEnd event should be sent to flush any cached audio. The client can resume sending audio data at any time.
Alternatively, you can disable automatic VAD by setting realtimeInputConfig.automaticActivityDetection.disabled to true in the setup message. In this configuration, the client is responsible for detecting user speech and sending activityStart and activityEnd messages at the appropriate times. An audioStreamEnd isn't sent in this configuration. Instead, any interruption of the stream is marked by an activityEnd message.
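Put together, the manual-VAD flow can be sketched as the JSON messages below. The snake_case field names follow the other setup examples in this guide, and the exact shapes of the activity messages are assumptions for illustration:

```python
import json

# Sketch of the manual VAD flow: disable automatic detection in the setup
# message, then bracket each user utterance with activity markers.
# Snake_case field names mirror the other setup examples in this guide;
# the shapes of the activity messages are assumptions.
setup_msg = {
    "setup": {
        "model": "gemini-2.0-flash-live-preview-04-09",
        "realtime_input_config": {
            "automatic_activity_detection": {"disabled": True}
        },
    }
}

# With automatic VAD disabled, the client marks speech boundaries itself:
activity_start = {"realtime_input": {"activity_start": {}}}
activity_end = {"realtime_input": {"activity_end": {}}}

messages = [json.dumps(m) for m in (setup_msg, activity_start, activity_end)]
```

The audio chunks for an utterance would be streamed between the activity_start and activity_end messages.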
Enable and disable session resumption
This feature is disabled by default. It must be enabled by the user every time they call the API by specifying the field in the API request, and project-level privacy is enforced for cached data. Enabling session resumption allows the user to store cached data, including text, video, and audio prompt data and model outputs, for up to 24 hours, in order to reconnect to earlier sessions within that 24-hour window. To achieve zero data retention, don't enable this feature.
To enable session resumption, set the session_resumption field of the BidiGenerateContentSetup message. If enabled, the server periodically takes a snapshot of the currently cached session context and stores it in internal storage. When a snapshot is successfully taken, the server returns a resumption_update containing a handle ID that you can record and use later to resume the session from the snapshot.
The following example shows how to enable session resumption and collect the handle ID:
Gen AI SDK for Python
```python
# Set model generation_config
CONFIG = {"response_modalities": ["TEXT"]}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    # Enable session resumption.
                    "session_resumption": {},
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode("ascii"))

    # Send text message
    text_input = "Hello? Gemini are you there?"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }
    await ws.send(json.dumps(msg))

    responses = []
    handle_id = ""
    turn_completed = False
    resumption_received = False

    # Receive chunks of server response,
    # wait for turn completion and resumption handle.
    async for raw_response in ws:
        response = json.loads(raw_response.decode())
        server_content = response.pop("serverContent", None)
        resumption_update = response.pop("sessionResumptionUpdate", None)

        if server_content is not None:
            model_turn = server_content.pop("modelTurn", None)
            if model_turn is not None:
                parts = model_turn.pop("parts", None)
                if parts is not None:
                    responses.append(parts[0]["text"])

            # End of turn
            turn_complete = server_content.pop("turnComplete", None)
            if turn_complete:
                turn_completed = True
        elif resumption_update is not None:
            handle_id = resumption_update["newHandle"]
            resumption_received = True
        else:
            continue

        if turn_completed and resumption_received:
            break

    # Print the server response
    display(Markdown(f"**Response >** {''.join(responses)}"))
    display(Markdown(f"**Session Handle ID >** {handle_id}"))
```
If you want to resume the previous session, set the handle field of the setup.session_resumption configuration to the previously recorded handle ID:
Gen AI SDK for Python
```python
# Set model generation_config
CONFIG = {"response_modalities": ["TEXT"]}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    # Enable session resumption.
                    "session_resumption": {
                        "handle": handle_id,
                    },
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode("ascii"))

    # Send text message
    text_input = "What was the last question I asked?"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }
    await ws.send(json.dumps(msg))

    responses = []
    handle_id = ""
    turn_completed = False
    resumption_received = False

    # Receive chunks of server response,
    # wait for turn completion and resumption handle.
    async for raw_response in ws:
        response = json.loads(raw_response.decode())
        server_content = response.pop("serverContent", None)
        resumption_update = response.pop("sessionResumptionUpdate", None)

        if server_content is not None:
            model_turn = server_content.pop("modelTurn", None)
            if model_turn is not None:
                parts = model_turn.pop("parts", None)
                if parts is not None:
                    responses.append(parts[0]["text"])

            # End of turn
            turn_complete = server_content.pop("turnComplete", None)
            if turn_complete:
                turn_completed = True
        elif resumption_update is not None:
            handle_id = resumption_update["newHandle"]
            resumption_received = True
        else:
            continue

        if turn_completed and resumption_received:
            break

    # Print the server response
    # Expected answer: "You just asked if I was there."
    display(Markdown(f"**Response >** {''.join(responses)}"))
    display(Markdown(f"**Session Handle >** {resumption_update}"))
```
If you want seamless session resumption, you can enable transparent mode:
Gen AI SDK for Python
```python
await ws.send(
    json.dumps(
        {
            "setup": {
                "model": "gemini-2.0-flash-live-preview-04-09",
                "generation_config": CONFIG,
                # Enable session resumption.
                "session_resumption": {
                    "transparent": True,
                },
            }
        }
    )
)
```
After transparent mode is enabled, the index of the client message that corresponds to the context snapshot is explicitly returned. This helps you identify which client messages you need to send again when you resume the session from the resumption handle.
Use function calling
You can use function calling to create a description of a function, then pass that description to the model in a request. The response from the model includes the name of a function that matches the description and the arguments to call it with.
All functions must be declared at the start of the session by sending tool definitions as part of the setup message.
Gen AI SDK for Python
```python
# Set model generation_config
CONFIG = {"response_modalities": ["TEXT"]}

# Define function declarations
TOOLS = {
    "function_declarations": [
        {
            "name": "get_current_weather",
            "description": "Get the current weather in the given location",
            "parameters": {
                "type": "OBJECT",
                "properties": {"location": {"type": "STRING"}},
            },
        }
    ]
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    "tools": TOOLS,
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode())

    # Send text message
    text_input = "Get the current weather in Santa Clara, San Jose and Mountain View"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }
    await ws.send(json.dumps(msg))

    responses = []

    # Receive chunks of server response
    async for raw_response in ws:
        response = json.loads(raw_response.decode("UTF-8"))
        if (tool_call := response.get("toolCall")) is not None:
            for function_call in tool_call["functionCalls"]:
                responses.append(f"FunctionCall: {str(function_call)}\n")
        if (server_content := response.get("serverContent")) is not None:
            if server_content.get("turnComplete", True):
                break

    # Print the server response
    display(Markdown("**Response >** {}".format("\n".join(responses))))
```
Use code execution
You can use code execution with the Live API to directly generate and execute Python code.
Gen AI SDK for Python
````python
# Set model generation_config
CONFIG = {"response_modalities": ["TEXT"]}

# Set code execution
TOOLS = {"code_execution": {}}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    "tools": TOOLS,
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode())

    # Send text message
    text_input = "Write code to calculate the 15th fibonacci number then find the nearest palindrome to it"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }
    await ws.send(json.dumps(msg))

    responses = []

    # Receive chunks of server response
    async for raw_response in ws:
        response = json.loads(raw_response.decode("UTF-8"))
        if (server_content := response.get("serverContent")) is not None:
            if (model_turn := server_content.get("modelTurn")) is not None:
                if (parts := model_turn.get("parts")) is not None:
                    if parts[0].get("text"):
                        responses.append(parts[0]["text"])
                    for part in parts:
                        if (executable_code := part.get("executableCode")) is not None:
                            display(
                                Markdown(
                                    f"""**Executable code:**
```py
{executable_code.get("code")}
```
"""
                                )
                            )
            if server_content.get("turnComplete", False):
                break

    # Print the server response
    display(Markdown(f"**Response >** {''.join(responses)}"))
````
Use Grounding with Google Search
You can use Grounding with Google Search with the Live API using google_search:
Gen AI SDK for Python
```python
# Set model generation_config
CONFIG = {"response_modalities": ["TEXT"]}

# Set google search
TOOLS = {"google_search": {}}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    "tools": TOOLS,
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode())

    # Send text message
    text_input = "What is the current weather in San Jose, CA?"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }
    await ws.send(json.dumps(msg))

    responses = []

    # Receive chunks of server response
    async for raw_response in ws:
        response = json.loads(raw_response.decode())
        server_content = response.pop("serverContent", None)
        if server_content is None:
            break

        model_turn = server_content.pop("modelTurn", None)
        if model_turn is not None:
            parts = model_turn.pop("parts", None)
            if parts is not None:
                responses.append(parts[0]["text"])

        # End of turn
        turn_complete = server_content.pop("turnComplete", None)
        if turn_complete:
            break

    # Print the server response
    display(Markdown("**Response >** {}".format("\n".join(responses))))
```
Limitations
For the full list of the Live API's current limitations, see the Gemini Live API limitations section of the reference documentation.
Pricing
For details, see our pricing page.
More information
For more information about the Live API, such as the WebSocket API reference, see the Gemini API documentation.