GitLab Runner Autoscaling - Feedback issue for the new runner autoscaling solution
# Summary of Feedback on GitLab Runner Autoscaling (GitLab Duo)
## General Feedback
- Users are generally positive about the new autoscaling solution, with one customer noting they "solved all the lingering issues we had with docker-machine by switching to Fleeting"
- The architecture allows for running ARM/Graviton hosts with an Intel x86 orchestrator, which was praised
- Users appreciate the ability to run Windows fleeting nodes with a Linux orchestrator
## Configuration Issues
- **AWS Region Configuration**: Several users reported issues with missing AWS region configuration, requiring manual creation of config files
- **Docker TLS Verification**: When `tls_verify = true`, users encountered connection errors
- **Resource Allocation**: Questions about how to determine CPU, memory, and storage for containers to ensure consistent performance
- **Docker Registry Mirrors**: Users needed guidance on configuring registry mirrors to avoid Docker Hub throttling
- **Storage Options**: Clarification needed on configuring storage limits via `storage-opt` and `volume_driver_ops`
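Several of the configuration issues above come down to values in the runner's `config.toml`. The following is a minimal, hypothetical sketch of a `docker-autoscaler` runner using the AWS fleeting plugin; the runner name, ASG name, and region are placeholders, and the exact plugin options may vary by plugin version, so check the fleeting plugin documentation for your release:

```toml
concurrent = 10

[[runners]]
  name = "docker-autoscaler-example"   # hypothetical runner name
  url = "https://gitlab.com"
  executor = "docker-autoscaler"

  [runners.docker]
    image = "alpine:latest"

  [runners.autoscaler]
    plugin = "fleeting-plugin-aws"

    [runners.autoscaler.plugin_config]
      name   = "my-runner-asg"   # hypothetical Auto Scaling Group name
      region = "us-east-1"       # setting the region explicitly avoids relying on
                                 # manually created AWS config files on the host
```

Setting the region directly in `plugin_config` is one way to sidestep the missing-region errors reported above, rather than depending on the ambient AWS CLI configuration.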
## Performance Issues
- **SSH Key Generation**: Dynamic SSH key generation causes significant delays (30-35 seconds) compared to static keys (2-3 seconds)
- **Warm Pools**: Users reported significant differences in instance provisioning speed with warm pools vs. without
- **Job Queue Delays**: Some users experienced high job pending queues despite not reaching max concurrency
- **Scaling Speed**: With warm pools, it took 7 minutes to scale from 3 to 32 instances, while without warm pools it took only 1 minute to spawn 36 instances
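For the SSH key generation delay, one reported workaround is switching from dynamically generated keys to a static key pair via the autoscaler's connector settings. A hedged sketch, assuming the public half of the key is already present in the instance image's `authorized_keys` (the username and key path are placeholders):

```toml
[runners.autoscaler.connector_config]
  username               = "ec2-user"   # hypothetical; must match the user in the AMI
  use_static_credentials = true         # skip per-instance dynamic key provisioning
  key_path               = "/etc/gitlab-runner/ssh/runner_key"  # private key on the runner host
```

The trade-off noted in the feedback: static keys bring connection setup down from roughly 30-35 seconds to 2-3 seconds, at the cost of baking a shared credential into the instance image.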
## Stability Issues
- **Instance Termination**: Instances sometimes get terminated while still running jobs
- **Connection Errors**: Users reported various connection errors like "EC2 Instance Connect is not supported on a terminated instance"
- **Heartbeat Checks**: The feature flag `FF_USE_FLEETING_ACQUIRE_HEARTBEATS` was introduced to help with instance connectivity issues
- **ASG Rebalancing**: Disabling `AZRebalance` process on ASGs resolved issues with jobs hanging
- **API Rate Limiting**: Excessive AWS API calls can lead to rate limiting, causing service disruptions
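The heartbeat feature flag mentioned above is enabled per runner through the `environment` setting in `config.toml`; a minimal fragment:

```toml
[[runners]]
  # Opt in to fleeting acquire heartbeats so unhealthy or terminated
  # instances are detected before a job is assigned to them.
  environment = ["FF_USE_FLEETING_ACQUIRE_HEARTBEATS=true"]
```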
## Feature Requests
- **AWS Warm Pools Support**: Better integration with AWS warm pools to reduce startup times
- **Multiple Runners per ASG**: Support for sharing the same Auto Scaling Group across different runners
- **Disk Space Management**: Better handling of "No Space Left on Device" errors
- **Graceful Shutdown**: Improved handling of SIGTERM for proper cleanup
- **State Persistence**: Ability to persist state between processes for rolling deployments
- **Throttling Control**: More control over API call frequency to avoid rate limiting
- **Error Handling**: Better detection when scaling fails to prevent continuous failed attempts
## Cloud Provider Specific Feedback
- **AWS**: Most feedback was related to AWS, with specific issues around spot instances, ASG configuration, and API limits
- **Azure**: Issues with WinRM configuration on Windows VMs and VMSS scaling problems when hitting resource quotas
- **GCP**: Some users reported that instance groups only scale down but not up
## Documentation Needs
- Better documentation on ASG setup requirements (like disabling AZRebalance)
- Clearer explanation of configuration parameters and their interactions
- More examples for different cloud providers
- Documentation on pros/cons of dynamic vs. static SSH key usage
The feedback shows that while the new autoscaling solution offers significant improvements over docker-machine, there are still areas that need refinement, particularly around stability, performance, and documentation.
Sources: [GitLab Runner Autoscaling - Feedback issue for the new runner autoscaling solution](/gitlab-org/gitlab/-/issues/408131)
### Background
As of GitLab Runner 15.11, the new [Docker Autoscaler](https://docs.gitlab.com/runner/executors/docker_autoscaler.html), [Instance executor](https://docs.gitlab.com/runner/executors/instance.html), and fleeting plugin for AWS are available.
The technical details for this [new autoscaling solution](https://docs.gitlab.com/runner/runner_autoscale/#gitlab-runner-autoscaling) are documented in the [Next Runner autoscaling architecture blueprint](https://docs.gitlab.com/ee/architecture/blueprints/runner_scaling/index.html).
**Please add comments below with your feedback or questions:**
- [General feedback](https://gitlab.com/gitlab-org/gitlab/-/issues/408131#note_1359439213)
- [Bugs](https://gitlab.com/gitlab-org/gitlab/-/issues/408131#note_1359440009)
- [Missing functionality](https://gitlab.com/gitlab-org/gitlab/-/issues/408131#note_1359440864)
- [Questions](https://gitlab.com/gitlab-org/gitlab/-/issues/408131#note_1359441337)