Steps to replicate the issue (include links if applicable):
- Run a deploy
What happens?:
Sometimes the internal API calls made during a deployment fail with a read timeout, e.g.:
Deployment ID: 20250828-130049-x5ucaa3ign
Created: 20250828-130049
Status: failed
Long status: Got exception: Failed run for component redis: HTTPSConnectionPool(host='api.svc.tools.eqiad1.wikimedia.cloud', port=30003): Read timed out. (read timeout=20)
Builds:
  add-dangling-edits-to-group(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  add-edits-to-queue(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  add-reported-edits(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  add-reviews-from-huggle(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  add-reviews-from-report(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  celery-flower(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  celery-worker(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  cleanup-user-records(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  cluebotng-reviewer(successful): id:cluebotng-review-buildpacks-pipelinerun-h2ppb You can see the logs with `toolforge build logs cluebotng-review-buildpacks-pipelinerun-h2ppb`
  export-statistics(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  grafana-alloy(skipped): id:cluebotng-review-buildpacks-pipelinerun-frx2d Reusing existing build
  grant-review-access-from-wikipedia-rights(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  import-training-data(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  mark-edits-as-deleted(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  mark-edits-as-having-data(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  redis(successful): id:cluebotng-review-buildpacks-pipelinerun-5n759 You can see the logs with `toolforge build logs cluebotng-review-buildpacks-pipelinerun-5n759`
  update-edit-classifications(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
Runs:
  add-dangling-edits-to-group(successful): [info] (Job add-dangling-edits-to-group is already up to date)
  add-edits-to-queue(successful): [info] (Job add-edits-to-queue is already up to date)
  add-reported-edits(successful): [info] (Job add-reported-edits is already up to date)
  add-reviews-from-huggle(successful): [info] (Job add-reviews-from-huggle is already up to date)
  add-reviews-from-report(successful): [info] (Job add-reviews-from-report is already up to date)
  celery-flower(successful): [info] (Job celery-flower is already up to date)
  celery-worker(successful): [info] (Job celery-worker is already up to date)
  cleanup-user-records(successful): [info] (Job cleanup-user-records is already up to date)
  cluebotng-reviewer(successful): [info] (Job cluebotng-reviewer created)
  export-statistics(successful): [info] (Job export-statistics is already up to date)
  grafana-alloy(successful): [info] (Job grafana-alloy is already up to date)
  grant-review-access-from-wikipedia-rights(successful): [info] (Job grant-review-access-from-wikipedia-rights is already up to date)
  import-training-data(successful): [info] (Job import-training-data is already up to date)
  mark-edits-as-deleted(successful): [info] (Job mark-edits-as-deleted is already up to date)
  mark-edits-as-having-data(successful): [info] (Job mark-edits-as-having-data is already up to date)
  redis(failed): HTTPSConnectionPool(host='api.svc.tools.eqiad1.wikimedia.cloud', port=30003): Read timed out. (read timeout=20)
  update-edit-classifications(skipped): Skipped due to previous failure
Deployment ID: 20250828-105932-qebopov37q
Created: 20250828-105932
Status: failed
Long status: Got exception: Failed run for component update-edit-classifications: HTTPSConnectionPool(host='api.svc.tools.eqiad1.wikimedia.cloud', port=30003): Read timed out. (read timeout=20)
Builds:
  add-dangling-edits-to-group(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  add-edits-to-queue(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  add-reported-edits(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  add-reviews-from-huggle(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  add-reviews-from-report(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  celery-flower(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  celery-worker(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  cleanup-user-records(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  cluebotng-reviewer(successful): id:cluebotng-review-buildpacks-pipelinerun-z9gk5 You can see the logs with `toolforge build logs cluebotng-review-buildpacks-pipelinerun-z9gk5`
  export-statistics(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  grafana-alloy(skipped): id:cluebotng-review-buildpacks-pipelinerun-vphm9 Reusing existing build
  grant-review-access-from-wikipedia-rights(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  import-training-data(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  irc-relay(skipped): id:cluebotng-review-buildpacks-pipelinerun-28bqc Reusing existing build
  mark-edits-as-deleted(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  mark-edits-as-having-data(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
  redis(successful): id:cluebotng-review-buildpacks-pipelinerun-hr5fw You can see the logs with `toolforge build logs cluebotng-review-buildpacks-pipelinerun-hr5fw`
  update-edit-classifications(skipped): id:no-build-needed Component re-uses build from cluebotng-reviewer
Runs:
  add-dangling-edits-to-group(successful): [info] (Job add-dangling-edits-to-group updated)
  add-edits-to-queue(successful): [info] (Job add-edits-to-queue updated)
  add-reported-edits(successful): [info] (Job add-reported-edits updated)
  add-reviews-from-huggle(successful): [info] (Job add-reviews-from-huggle updated)
  add-reviews-from-report(successful): [info] (Job add-reviews-from-report updated)
  celery-flower(successful): [info] (Job celery-flower updated)
  celery-worker(successful): [info] (Job celery-worker updated)
  cleanup-user-records(successful): [info] (Job cleanup-user-records updated)
  cluebotng-reviewer(successful): [info] (Job cluebotng-reviewer created)
  export-statistics(successful): [info] (Job export-statistics updated)
  grafana-alloy(successful): [info] (Job grafana-alloy updated)
  grant-review-access-from-wikipedia-rights(successful): [info] (Job grant-review-access-from-wikipedia-rights updated)
  import-training-data(successful): [info] (Job import-training-data updated)
  irc-relay(successful): [info] (Job irc-relay updated)
  mark-edits-as-deleted(successful): [info] (Job mark-edits-as-deleted updated)
  mark-edits-as-having-data(successful): [info] (Job mark-edits-as-having-data updated)
  redis(successful): [info] (Job redis created)
  update-edit-classifications(failed): HTTPSConnectionPool(host='api.svc.tools.eqiad1.wikimedia.cloud', port=30003): Read timed out. (read timeout=20)
As the end consumer, there is very little that can be done other than re-triggering the deployment, which consumes time and resources.
What should have happened instead?:
Internal API calls should be retried with a backoff, so that transient issues do not fail the whole deployment.
The timeout in ToolforgeClient was bumped under T376710, so it appears the jobs API is not finishing the response within 20 seconds.
It needs to be investigated whether this is a slow path or regression, or whether the read timeout needs adjusting.
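To illustrate the kind of retry behaviour being asked for, here is a minimal sketch of exponential backoff with jitter. It is not the ToolforgeClient implementation: `call_with_backoff` and all of its parameters are hypothetical names, and it only uses the standard library. With `requests`, the exception to retry on would presumably be `requests.exceptions.ReadTimeout` rather than the built-in `TimeoutError` used as the default here.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on the given exceptions with exponential backoff.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Upper bound doubles each attempt (1s, 2s, 4s, ...), capped at
            # max_delay; the actual sleep is drawn uniformly below that bound
            # ("full jitter") so concurrent callers do not retry in lockstep.
            bound = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, bound))
    raise RuntimeError("unreachable")
```

A caller would wrap the individual API request, for example `call_with_backoff(lambda: do_request())`, where `do_request` is a placeholder for the real call. Retrying blindly is only safe when the request is idempotent: a job-creation call that timed out on read may still have succeeded server-side, so the retry path would need to tolerate an "already exists" result.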