GitLab Runner Autoscaling - Feedback issue for the new runner autoscaling solution
# Summary of Feedback on GitLab Runner Autoscaling (GitLab Duo)

## General Feedback

- Users are generally positive about the new autoscaling solution, with one customer noting they "solved all the lingering issues we had with docker-machine by switching to Fleeting".
- The architecture allows ARM/Graviton hosts to be managed by an Intel x86 orchestrator, which was praised.
- Users appreciate the ability to run Windows fleeting nodes with a Linux orchestrator.

## Configuration Issues

- **AWS Region Configuration**: Several users reported issues with a missing AWS region configuration, requiring manual creation of config files.
- **Docker TLS Verification**: When `tls_verify = true`, users encountered connection errors.
- **Resource Allocation**: Questions about how to determine CPU, memory, and storage for containers to ensure consistent performance.
- **Docker Registry Mirrors**: Users needed guidance on configuring registry mirrors to avoid Docker Hub throttling.
- **Storage Options**: Clarification needed on configuring storage limits via `storage-opt` and `volume_driver_ops`.

## Performance Issues

- **SSH Key Generation**: Dynamic SSH key generation causes significant delays (30-35 seconds) compared to static keys (2-3 seconds).
- **Warm Pools**: Users reported significant differences in instance provisioning speed with warm pools vs.
without.
- **Job Queue Delays**: Some users experienced high job pending queues despite not reaching max concurrency.
- **Scaling Speed**: With warm pools, it took 7 minutes to scale from 3 to 32 instances, while without warm pools it took only 1 minute to spawn 36 instances.

## Stability Issues

- **Instance Termination**: Instances sometimes get terminated while still running jobs.
- **Connection Errors**: Users reported various connection errors, such as "EC2 Instance Connect is not supported on a terminated instance".
- **Heartbeat Checks**: The feature flag `FF_USE_FLEETING_ACQUIRE_HEARTBEATS` was introduced to help with instance connectivity issues.
- **ASG Rebalancing**: Disabling the `AZRebalance` process on ASGs resolved issues with jobs hanging.
- **API Rate Limiting**: Excessive AWS API calls can lead to rate limiting, causing service disruptions.

## Feature Requests

- **AWS Warm Pools Support**: Better integration with AWS warm pools to reduce startup times.
- **Multiple Runners per ASG**: Support for sharing the same Auto Scaling Group across different runners.
- **Disk Space Management**: Better handling of "No Space Left on Device" errors.
- **Graceful Shutdown**: Improved handling of SIGTERM for proper cleanup.
- **State Persistence**: Ability to persist state between processes for rolling deployments.
- **Throttling Control**: More control over API call frequency to avoid rate limiting.
- **Error Handling**: Better detection when scaling fails, to prevent continuous failed attempts.

## Cloud Provider Specific Feedback

- **AWS**: Most feedback was related to AWS, with specific issues around spot instances, ASG configuration, and API limits.
- **Azure**: Issues with WinRM configuration on Windows VMs, and VMSS scaling problems when hitting resource quotas.
- **GCP**: Some users reported that instance groups only scale down but not up.

## Documentation Needs

- Better documentation on ASG setup requirements (such as disabling `AZRebalance`).
- Clearer explanation of configuration
parameters and their interactions.
- More examples for different cloud providers.
- Documentation on the pros and cons of dynamic vs. static SSH key usage.

The feedback shows that while the new autoscaling solution offers significant improvements over docker-machine, there are still areas that need refinement, particularly around stability, performance, and documentation.

Sources: [GitLab Runner Autoscaling - Feedback issue for the new runner autoscaling solution](/gitlab-org/gitlab/-/issues/408131)

### Background

As of GitLab Runner 15.11, the new [Docker Autoscaler](https://docs.gitlab.com/runner/executors/docker_autoscaler.html), [Instance executor](https://docs.gitlab.com/runner/executors/instance.html), and fleeting plugin for AWS are available. The technical details of this [new autoscaling solution](https://docs.gitlab.com/runner/runner_autoscale/#gitlab-runner-autoscaling) are documented in the [Next Runner autoscaling architecture blueprint](https://docs.gitlab.com/ee/architecture/blueprints/runner_scaling/index.html).

**Please add comments below with your feedback or questions:**

- [General feedback](https://gitlab.com/gitlab-org/gitlab/-/issues/408131#note_1359439213)
- [Bugs](https://gitlab.com/gitlab-org/gitlab/-/issues/408131#note_1359440009)
- [Missing functionality](https://gitlab.com/gitlab-org/gitlab/-/issues/408131#note_1359440864)
- [Questions](https://gitlab.com/gitlab-org/gitlab/-/issues/408131#note_1359441337)
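For readers trying to map the configuration feedback above onto an actual setup, the pieces come together in the runner's `config.toml`. The following is a minimal sketch of a Docker Autoscaler executor wired to the AWS fleeting plugin, based on the documentation linked in the Background section. The runner URL and token, the ASG name `my-runner-asg`, the SSH username, and the capacity/idle numbers are all placeholders for illustration, not values taken from this issue.

```toml
concurrent = 10

[[runners]]
  name = "docker-autoscaler-example"
  url = "https://gitlab.com"
  token = "<runner-registration-token>"   # placeholder
  executor = "docker-autoscaler"

  [runners.docker]
    image = "busybox:latest"
    # Note: Docker Hub registry mirrors are typically configured in the
    # instance image's Docker daemon (daemon.json), not in this file.

  [runners.autoscaler]
    plugin = "fleeting-plugin-aws"

    capacity_per_instance = 1
    max_use_count = 1
    max_instances = 10

    [runners.autoscaler.plugin_config]
      # Placeholder Auto Scaling Group name; the plugin scales this ASG.
      name = "my-runner-asg"

    [runners.autoscaler.connector_config]
      username = "ec2-user"        # depends on the AMI
      use_external_addr = true

    # Keeping a small idle pool reduces job start latency, at the cost
    # of paying for idle instances (compare the warm-pool feedback above).
    [[runners.autoscaler.policy]]
      idle_count = 2
      idle_time = "20m0s"
```

The AWS region reported as missing in the feedback is resolved through the AWS SDK's usual configuration chain (for example an `AWS_REGION` environment variable for the runner process, or a shared AWS config file), since the plugin does not infer it from the ASG alone.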