RAG training becomes slow over time with frequent MCP retrieve timeouts #504
Open
Description
Describe the bug
When training a RAG task with AgentLightning, the training gradually becomes extremely slow and starts to produce frequent errors related to MCP retrieve timeouts and OTLP trace handling.
From the logs:
- The time per iteration keeps increasing, reaching thousands of seconds per iteration (s/it).
Training Progress: 0%| | 0/125000 [00:00<?, ?it/s]
Training Progress: 0%| | 1/125000 [01:05<2288:35:44, 65.91s/it]
Training Progress: 0%| | 2/125000 [02:13<2329:03:57, 67.08s/it]
Training Progress: 0%| | 3/125000 [04:13<3167:15:15, 91.22s/it]
……
Training Progress: 1%| | 698/125000 [103:28:02<47135:04:52, 1365.11s/it]
ERROR:2026-02-18 04:35:13,188:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
(TaskRunner pid=2492560)
Training Progress: 1%| | 699/125000 [103:46:37<44547:19:03, 1290.18s/it]
ERROR:2026-02-18 07:30:21,360:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
ERROR:2026-02-18 07:36:57,659:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
ERROR:2026-02-18 08:27:00,963:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
(TaskRunner pid=2492560)
Training Progress: 1%| | 700/125000 [108:31:43<208373:35:24, 6034.96s/it]
...
(TaskRunner pid=2492560)
Training Progress: 1%| | 719/125000 [118:23:48<42790:52:24, 1239.51s/it]
ERROR:2026-02-18 20:20:41,745:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
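To put numbers on the slowdown, the log excerpt above can be parsed for the step index, the reported s/it, and the number of retrieve timeouts. This is only a minimal sketch that assumes the exact line formats shown in this issue; `summarize_log` is a hypothetical helper name and not part of AgentLightning.

```python
import re

# Matches tqdm-style lines such as:
#   Training Progress: 1%| | 698/125000 [103:28:02<47135:04:52, 1365.11s/it]
# The initial "0/125000 [00:00<?, ?it/s]" line has no s/it value and is skipped.
PROGRESS_RE = re.compile(
    r"Training Progress:.*?\|\s*(\d+)/(\d+)\s*\[.*?,\s*([\d.]+)s/it\]"
)
# Matches the MCP timeout errors shown in the log.
TIMEOUT_RE = re.compile(r"Error invoking MCP tool retrieve: Timed out")


def summarize_log(lines):
    """Return ([(step, seconds_per_iteration), ...], timeout_count)."""
    steps = []
    timeouts = 0
    for line in lines:
        match = PROGRESS_RE.search(line)
        if match:
            steps.append((int(match.group(1)), float(match.group(3))))
        elif TIMEOUT_RE.search(line):
            timeouts += 1
    return steps, timeouts


if __name__ == "__main__":
    sample = [
        "Training Progress: 0%| | 1/125000 [01:05<2288:35:44, 65.91s/it]",
        "Training Progress: 1%| | 698/125000 [103:28:02<47135:04:52, 1365.11s/it]",
        "ERROR:2026-02-18 04:35:13,188:Error invoking MCP tool retrieve: "
        "Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.",
    ]
    steps, timeouts = summarize_log(sample)
    for step, sec_per_it in steps:
        print(f"step {step}: {sec_per_it:.2f} s/it")
    print(f"retrieve timeouts: {timeouts}")
```

Running this over the full training log would show whether the s/it increase lines up with the timeout errors or starts before them.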
Observed behavior:
- In the early phase training speed is acceptable.
- After several hundred global steps (~700), each step becomes slower and slower.
- Logs start to show many "Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds." messages.
- starlette.requests.ClientDisconnect is raised in the OTLP traces endpoint.
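One way to check whether the retrieve calls themselves get slower over time (rather than only timing out at the end) is to record the latency of every call. The sketch below is an assumption-laden illustration: `timed_call` and the stand-in `fake_retrieve` coroutine are hypothetical names, and the real MCP retrieve call would have to be passed in instead of the stand-in.

```python
import asyncio
import time


async def timed_call(coro_fn, *args, timeout_s=60.0, **kwargs):
    """Await coro_fn(*args, **kwargs) and return (result, elapsed_seconds, timed_out).

    coro_fn is any async callable; here it stands in for the retrieve tool call.
    """
    start = time.perf_counter()
    try:
        result = await asyncio.wait_for(coro_fn(*args, **kwargs), timeout=timeout_s)
        return result, time.perf_counter() - start, False
    except asyncio.TimeoutError:
        # The call exceeded timeout_s; report the elapsed time instead of raising.
        return None, time.perf_counter() - start, True


async def fake_retrieve(query, delay_s):
    """Hypothetical stand-in for the real retrieve call; sleeps, then echoes."""
    await asyncio.sleep(delay_s)
    return f"docs for {query}"


async def main():
    # A fast call completes; a slow call hits the (shortened) timeout.
    print(await timed_call(fake_retrieve, "q1", 0.01, timeout_s=0.5))
    print(await timed_call(fake_retrieve, "q2", 1.0, timeout_s=0.05))


if __name__ == "__main__":
    asyncio.run(main())
```

Logging the elapsed time together with the global step would show whether latency grows steadily before step ~700 or jumps suddenly.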