RAG training becomes slow over time with frequent MCP retrieve timeouts #504

@wolf-yang

Describe the bug

When training a RAG task with AgentLightning, training gradually becomes extremely slow and starts producing frequent errors related to MCP retrieve timeouts and OTLP trace handling.

From the logs:

  • The time per iteration (s/it) keeps increasing: from about 66 s/it at the start to over 1,000 s/it by step 698, with a peak of 6034.96 s/it at step 700.
Training Progress:   0%|          | 0/125000 [00:00<?, ?it/s]
Training Progress:   0%|          | 1/125000 [01:05<2288:35:44, 65.91s/it]
Training Progress:   0%|          | 2/125000 [02:13<2329:03:57, 67.08s/it]
Training Progress:   0%|          | 3/125000 [04:13<3167:15:15, 91.22s/it]
...
Training Progress:   1%|          | 698/125000 [103:28:02<47135:04:52, 1365.11s/it]
ERROR:2026-02-18 04:35:13,188:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
(TaskRunner pid=2492560)
Training Progress:   1%|          | 699/125000 [103:46:37<44547:19:03, 1290.18s/it]
ERROR:2026-02-18 07:30:21,360:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
ERROR:2026-02-18 07:36:57,659:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
ERROR:2026-02-18 08:27:00,963:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
(TaskRunner pid=2492560)
Training Progress:   1%|          | 700/125000 [108:31:43<208373:35:24, 6034.96s/it]
...
(TaskRunner pid=2492560)
Training Progress:   1%|          | 719/125000 [118:23:48<42790:52:24, 1239.51s/it]
ERROR:2026-02-18 20:20:41,745:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.

Observed behavior:

  • In the early phase, training speed is acceptable (roughly 65 to 90 s/it).
  • After several hundred global steps (~700), each step takes progressively longer.
  • The logs start to show many "Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds." messages.
  • starlette.requests.ClientDisconnect is raised in the OTLP traces endpoint.
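
To quantify the slowdown from a log like the excerpt above, here is a minimal sketch that extracts the step number and s/it value from the tqdm-style progress lines and counts the retrieve timeout errors. The helper names (`parse_progress`, `count_retrieve_timeouts`) are hypothetical and not part of AgentLightning, and the regex assumes the exact progress-line format shown in the excerpt.

```python
import re

# Assumption: progress lines look like the ones in the log excerpt, e.g.
# "Training Progress:   1%|          | 698/125000 [103:28:02<47135:04:52, 1365.11s/it]"
PROGRESS_RE = re.compile(r"\|\s*(\d+)/(\d+)\s*\[[^\]]*?,\s*([\d.]+)s/it\]")

# Assumption: timeout errors contain this exact substring, as in the excerpt.
TIMEOUT_MARKER = "Error invoking MCP tool retrieve: Timed out"


def parse_progress(log_text):
    """Return (step, seconds_per_iteration) pairs found in the log text.

    The s/it figure is the rate tqdm displays, which is a running estimate
    rather than the exact duration of a single step.
    """
    pairs = []
    for line in log_text.splitlines():
        match = PROGRESS_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(3))))
    return pairs


def count_retrieve_timeouts(log_text):
    """Count log lines that report an MCP retrieve timeout."""
    return sum(1 for line in log_text.splitlines() if TIMEOUT_MARKER in line)


if __name__ == "__main__":
    # Lines copied from the log excerpt in this issue.
    sample = "\n".join([
        "Training Progress:   0%|          | 1/125000 [01:05<2288:35:44, 65.91s/it]",
        "Training Progress:   1%|          | 698/125000 [103:28:02<47135:04:52, 1365.11s/it]",
        "ERROR:2026-02-18 04:35:13,188:Error invoking MCP tool retrieve: "
        "Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.",
    ])
    for step, rate in parse_progress(sample):
        print(f"step {step}: {rate} s/it")
    print("retrieve timeouts:", count_retrieve_timeouts(sample))
```

Running this over the full training log and plotting s/it against the step number should show whether the slowdown is gradual or coincides with the first timeout errors.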
