Cloud-VPS
Component · Active · Public

Details

Description

Bugs related to Cloud VPS infrastructure and not a specific Cloud VPS project (see VPS-Projects for that).

Issues which are related to Toolforge should go in Toolforge instead.

Request new projects by filing a task in Cloud-VPS (Project-requests).

Request increased quota for existing projects by filing a task in Cloud-VPS (Quota-requests).

Recent Activity

Thu, Oct 2

Gehel updated subscribers of T405395: DPE SRE work to enable testing of Blazegraph alternatives.

From a quick chat with @taavi :

Thu, Oct 2, 1:38 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Wikidata-Platform, Wikidata
Gehel edited projects for T405395: DPE SRE work to enable testing of Blazegraph alternatives, added: Cloud-VPS; removed Cloud-Services.
Thu, Oct 2, 1:37 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Wikidata-Platform, Wikidata
JMeybohm added a comment to T405970: unable to "apt install helmfile" on CloudVPS debian 13 vm.

Let me try to shed some light:

Thu, Oct 2, 10:10 AM · cloud-services-team, Cloud-VPS

Wed, Oct 1

taavi triaged T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration as Medium priority.
Wed, Oct 1, 2:15 PM · Patch-For-Review, cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE, DC-Ops
taavi triaged T405374: cloudceph: automate nic names in profile::cloudceph::osd::hosts as Low priority.
Wed, Oct 1, 2:07 PM · cloud-services-team, Cloud-VPS, Ceph
Andrew closed Restricted Task, a subtask of T348634: ceph slow ops 2023-10-11, as Resolved.
Wed, Oct 1, 1:57 PM · cloud-services-team, Cloud-VPS
taavi closed T405970: unable to "apt install helmfile" on CloudVPS debian 13 vm as Resolved.
Wed, Oct 1, 1:51 PM · cloud-services-team, Cloud-VPS

Tue, Sep 30

Andrew updated subscribers of T405970: unable to "apt install helmfile" on CloudVPS debian 13 vm.

Well, that mystery aside, the actual issue is resolved with

Tue, Sep 30, 4:31 PM · cloud-services-team, Cloud-VPS
Andrew added a comment to T405970: unable to "apt install helmfile" on CloudVPS debian 13 vm.
root@apt1002:~# reprepro ls helm
helm | 2.17.0-1 |   buster-wikimedia | amd64, source
helm | 3.17.2-1 | bookworm-wikimedia | amd64
helm | 3.18.6-1 | bookworm-wikimedia | amd64
helm | 3.18.4-1 |   trixie-wikimedia | amd64
Tue, Sep 30, 4:26 PM · cloud-services-team, Cloud-VPS

Mon, Sep 29

SDunlap created T405970: unable to "apt install helmfile" on CloudVPS debian 13 vm.
Mon, Sep 29, 7:50 PM · cloud-services-team, Cloud-VPS
fnegri moved T399858: Cloud Ceph misbehaving on Debian Bookworm from In progress to Done on the cloud-services-team (FY2025/26-Q1) board.
Mon, Sep 29, 9:23 AM · SRE-OnFire, Sustainability (Incident Followup), cloud-services-team (FY2025/26-Q1), Cloud-VPS

Fri, Sep 26

taavi triaged T404747: Review RAM allocation for cloudceph OSDs as Medium priority.
Fri, Sep 26, 12:36 PM · Ceph, Cloud-VPS, cloud-services-team
taavi triaged T404052: Add fqdn input to instance-related wmcs cookbooks as Low priority.
Fri, Sep 26, 11:39 AM · Cloud-VPS, cloud-services-team

Thu, Sep 25

gerritbot added a comment to T347681: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right.

Change #1191327 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] wmcs: have additional IPs survive reboots

https://gerrit.wikimedia.org/r/1191327

Thu, Sep 25, 10:45 AM · Patch-For-Review, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro, SRE-OnFire, Sustainability (Incident Followup), cloud-services-team
gerritbot added a project to T347681: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right: Patch-For-Review.
Thu, Sep 25, 10:45 AM · Patch-For-Review, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro, SRE-OnFire, Sustainability (Incident Followup), cloud-services-team
gerritbot added a comment to T347681: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right.

Change #1191326 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] interface: new define for additional IPs

https://gerrit.wikimedia.org/r/1191326

Thu, Sep 25, 10:45 AM · Patch-For-Review, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro, SRE-OnFire, Sustainability (Incident Followup), cloud-services-team
fgiunchedi claimed T347681: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right.
Thu, Sep 25, 7:55 AM · Patch-For-Review, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro, SRE-OnFire, Sustainability (Incident Followup), cloud-services-team

Wed, Sep 24

Andrew added a comment to T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration.

Oh, also: if you want to reimage either of them, everything currently needs to be Bookworm, and the Debian installer sometimes gets stuck during the partman phase. I've had good luck connecting to the console, escaping out of the partition phase in the installer, and trying again; it seems to always take the second time.

Wed, Sep 24, 8:28 PM · Patch-For-Review, cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE, DC-Ops
Andrew added a comment to T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration.

@fgiunchedi, 1050 and 1051 should already be fully puppetized with Ceph 18.x packages and mostly ready to go. The next step to get them in service is the 'wmcs.ceph.osd.bootstrap_and_add' cookbook.

Wed, Sep 24, 8:13 PM · Patch-For-Review, cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE, DC-Ops
Andrew updated subscribers of T404747: Review RAM allocation for cloudceph OSDs.
Wed, Sep 24, 7:49 PM · Ceph, Cloud-VPS, cloud-services-team
Andrew added a comment to T404747: Review RAM allocation for cloudceph OSDs.

For now I'm leaving codfw1 with osd_memory_target_autotune=true and eqiad1 with osd_memory_target_autotune=false and osd_memory_target=6442450944 -- I'm not convinced there's any change here which would be worth the extra complexity, but I am interested in seeing what happens with autotune in the long run.

Wed, Sep 24, 7:48 PM · Ceph, Cloud-VPS, cloud-services-team
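
For reference, the eqiad1 value quoted above (osd_memory_target=6442450944) is a byte count and works out to exactly 6 GiB, which matches the "6" mentioned in the related comment on the same task. A minimal sketch of the conversion follows; the helper names are illustrative and not part of any Ceph tooling.

```python
# Convert between bytes and GiB for values such as osd_memory_target.
# Helper names are hypothetical, chosen only for this illustration.
GIB = 1024 ** 3  # bytes in one gibibyte


def bytes_to_gib(n_bytes: int) -> float:
    """Return the size in GiB for a byte count."""
    return n_bytes / GIB


def gib_to_bytes(gib: int) -> int:
    """Return the byte count for a whole number of GiB."""
    return gib * GIB


if __name__ == "__main__":
    print(bytes_to_gib(6442450944))  # 6.0
    print(gib_to_bytes(8))           # 8589934592
```

The second call shows the byte value that an 8 GiB target would correspond to, the alternative discussed on the task.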
Andrew added a comment to T404747: Review RAM allocation for cloudceph OSDs.

So... it doesn't look like ceph will make use of more RAM even if we offer it up. If it did, changing the cluster-wide setting from 6 to 8 would run the risk of OOMing most of the cluster.

Wed, Sep 24, 7:31 PM · Ceph, Cloud-VPS, cloud-services-team
Andrew added a comment to T404747: Review RAM allocation for cloudceph OSDs.

Here's another example: cloudcephosd1016 is using around 29G of RAM according to Grafana.

Wed, Sep 24, 7:07 PM · Ceph, Cloud-VPS, cloud-services-team
cmooney added a comment to T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration.

I can confirm the switch is already set to accept tagged traffic for the storage vlan on the ports connecting to both of these hosts (I think it was set that way from before).

Wed, Sep 24, 7:06 PM · Patch-For-Review, cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE, DC-Ops
gerritbot added a project to T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration: Patch-For-Review.
Wed, Sep 24, 4:21 PM · Patch-For-Review, cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE, DC-Ops
gerritbot added a comment to T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration.

Change #1191086 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Cloudcephosd1050: Configure ceph with a single nic

https://gerrit.wikimedia.org/r/1191086

Wed, Sep 24, 4:21 PM · Patch-For-Review, cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE, DC-Ops
fgiunchedi updated subscribers of T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration.
Wed, Sep 24, 3:12 PM · Patch-For-Review, cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE, DC-Ops
Andrew added a comment to T395910: cloudcephosd10[48-52] service implementation.

1050 and 1051 won't be pooled immediately, they're being reserved for T405478

Wed, Sep 24, 3:11 PM · cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE, DC-Ops
Andrew updated the task description for T395910: cloudcephosd10[48-52] service implementation.
Wed, Sep 24, 3:06 PM · cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE, DC-Ops
Andrew closed T404249: [ceph,eqiad1] upgrade from quincy->reef (and bookworm) as Resolved.
root@cloudcephmon1004:~# ceph versions
{
    "mon": {
        "ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)": 3
    },
    "mgr": {
        "ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)": 3
    },
    "osd": {
        "ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)": 276
    },
    "rgw": {
        "ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)": 3
    },
    "overall": {
        "ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)": 285
    }
}
Wed, Sep 24, 2:59 PM · cloud-services-team, Cloud-VPS, Ceph
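
The `ceph versions` output above is JSON, so confirming that every daemon reports the same release can be scripted rather than read by eye. A minimal sketch, assuming the output keeps the shape shown above (daemon type mapped to version string mapped to count); the function names are hypothetical and the sample data is copied from that output.

```python
import json

# Sample copied from the `ceph versions` output quoted above.
V = "ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)"
SAMPLE = json.dumps({
    "mon": {V: 3},
    "mgr": {V: 3},
    "osd": {V: 276},
    "rgw": {V: 3},
    "overall": {V: 285},
})


def distinct_versions(report: dict) -> set:
    """Collect every version string reported by any daemon type."""
    versions = set()
    for daemon_type, counts in report.items():
        if daemon_type == "overall":
            continue  # "overall" is a summary, not a daemon type
        versions.update(counts)
    return versions


def daemon_total(report: dict) -> int:
    """Sum daemon counts across all types, excluding the summary."""
    return sum(
        sum(counts.values())
        for daemon_type, counts in report.items()
        if daemon_type != "overall"
    )


if __name__ == "__main__":
    report = json.loads(SAMPLE)
    print(len(distinct_versions(report)) == 1)  # True when fully upgraded
    print(daemon_total(report))                 # 285
```

A single distinct version across mon, mgr, osd and rgw is what a completed upgrade should look like; more than one entry would point at daemons still running the old release.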
fgiunchedi created T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration.
Wed, Sep 24, 2:59 PM · Patch-For-Review, cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE, DC-Ops
Maintenance_bot removed a project from T404249: [ceph,eqiad1] upgrade from quincy->reef (and bookworm): Patch-For-Review.
Wed, Sep 24, 2:33 PM · cloud-services-team, Cloud-VPS, Ceph
taavi moved T330759: Modernize openstack rbac from Blocked to Inbox on the cloud-services-team board.
Wed, Sep 24, 2:17 PM · Patch-For-Review, cloud-services-team, Cloud-VPS
gerritbot added a comment to T404249: [ceph,eqiad1] upgrade from quincy->reef (and bookworm).

Change #1191027 merged by Andrew Bogott:

[operations/puppet@production] Update cloudceph specs to expect reef

https://gerrit.wikimedia.org/r/1191027

Wed, Sep 24, 1:56 PM · cloud-services-team, Cloud-VPS, Ceph
gerritbot added a project to T404249: [ceph,eqiad1] upgrade from quincy->reef (and bookworm): Patch-For-Review.
Wed, Sep 24, 1:53 PM · cloud-services-team, Cloud-VPS, Ceph
gerritbot added a comment to T404249: [ceph,eqiad1] upgrade from quincy->reef (and bookworm).

Change #1191027 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Update cloudceph specs to expect reef

https://gerrit.wikimedia.org/r/1191027

Wed, Sep 24, 1:53 PM · cloud-services-team, Cloud-VPS, Ceph
Maintenance_bot removed a project from T405462: Allow customizing which mounts to enable per VM: Patch-For-Review.
Wed, Sep 24, 1:32 PM · tools-infrastructure-team, Cloud-VPS
gerritbot added a comment to T405462: Allow customizing which mounts to enable per VM.

Change #1191005 abandoned by Majavah:

[operations/puppet@production] P:wmcs::nfsclient: Allow granular control of mounted volumes

https://gerrit.wikimedia.org/r/1191005

Wed, Sep 24, 12:34 PM · tools-infrastructure-team, Cloud-VPS
taavi closed T405462: Allow customizing which mounts to enable per VM as Declined.

Filippo convinced me that this is probably not that useful.

Wed, Sep 24, 12:34 PM · tools-infrastructure-team, Cloud-VPS
gerritbot added a project to T405462: Allow customizing which mounts to enable per VM: Patch-For-Review.
Wed, Sep 24, 11:50 AM · tools-infrastructure-team, Cloud-VPS
gerritbot added a comment to T405462: Allow customizing which mounts to enable per VM.

Change #1191005 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs::nfsclient: Allow granular control of mounted volumes

https://gerrit.wikimedia.org/r/1191005

Wed, Sep 24, 11:50 AM · tools-infrastructure-team, Cloud-VPS
taavi created T405462: Allow customizing which mounts to enable per VM.
Wed, Sep 24, 11:44 AM · tools-infrastructure-team, Cloud-VPS
fgiunchedi added a comment to T347681: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right.

OK, if we have systemd-networkd everywhere then we should be installing/removing drop-in files for networkd to pick up instead.

Wed, Sep 24, 11:34 AM · Patch-For-Review, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro, SRE-OnFire, Sustainability (Incident Followup), cloud-services-team
taavi closed T362956: nova-api can get the listen queue of socket full as Resolved.

This has not happened recently.

Wed, Sep 24, 11:28 AM · Cloud-VPS, cloud-services-team
taavi closed T391718: tf-infra-test misbehavior in codfw1dev as Resolved.

Seems fixed?

Wed, Sep 24, 11:17 AM · Patch-For-Review, Cloud-VPS, cloud-services-team
taavi added a comment to T347681: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right.

My understanding is that the reason for this is the puppetization expecting ifupdown (with /etc/network/interfaces) but the Debian cloud images switched to netplan/systemd-networkd some releases ago.

Wed, Sep 24, 10:01 AM · Patch-For-Review, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro, SRE-OnFire, Sustainability (Incident Followup), cloud-services-team
fgiunchedi added a comment to T347681: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right.

I can confirm this is still the case: Profile::Wmcs::Nfs::Standalone does a manual Exec[ip addr add].

Wed, Sep 24, 9:46 AM · Patch-For-Review, Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, User-dcaro, SRE-OnFire, Sustainability (Incident Followup), cloud-services-team
Jclark-ctr closed T405258: cloudcephosd1025 won't reimage as Resolved.

Server completed Reimage by andrew

Wed, Sep 24, 9:08 AM · cloud-services-team, Ceph, SRE, ops-eqiad, DC-Ops, Cloud-VPS

Tue, Sep 23

Andrew closed T399858: Cloud Ceph misbehaving on Debian Bookworm as Resolved.

I am now upgrading the cluster to Bookworm + Reef (18.x) and that seems to bypass this issue.

Tue, Sep 23, 8:39 PM · SRE-OnFire, Sustainability (Incident Followup), cloud-services-team (FY2025/26-Q1), Cloud-VPS
taavi closed T262350: bad failure cases for wmcs custom puppet enc as Resolved.
Tue, Sep 23, 5:58 PM · Cloud-VPS, cloud-services-team