Skip to content

[Celestica] Ladakh800bcls: Agent Test: Fix AgentTrunkLoadBalancerTests no used vlan create route interface fail in warmboot#1337

Open
cel-gl wants to merge 1 commit into
facebook:mainfrom
cel-gl:Fix_AgentTrunkLoadBalancerTests
Open

[Celestica] Ladakh800bcls: Agent Test: Fix AgentTrunkLoadBalancerTests no used vlan create route interface fail in warmboot#1337
cel-gl wants to merge 1 commit into
facebook:mainfrom
cel-gl:Fix_AgentTrunkLoadBalancerTests

Conversation

@cel-gl

@cel-gl cel-gl commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Pre-submission checklist

  • I've ran the linters locally and fixed lint errors related to the files I modified in this PR. You can install the linters by running pip install -r requirements-dev.txt && pre-commit install
  • pre-commit run
    clang-format.............................................................Passed
    shellcheck...........................................(no files to check)Skipped
    shfmt................................................(no files to check)Skipped
    trim trailing whitespace.................................................Passed
    fix end of files.........................................................Passed
    check yaml...........................................(no files to check)Skipped
    check json...........................................(no files to check)Skipped
    check for merge conflicts................................................Passed
    ruff check...........................................(no files to check)Skipped
    ruff format..........................................(no files to check)Skipped
    Prevent sai_impl in fboss manifest.......................................Passed

Summary

Issue Summary

On multi-NPU hardware platforms, the AgentTrunkLoadBalancerTest suite was experiencing a fatal warmboot failure, The failure manifested as a no sai vlan for VlanID 2001 error

The root cause was an inconsistency between the agent's software SwitchState and the actual hardware state programmed in the SAI layer after a cold boot, which then became fatal upon warmboot.

Root Cause Analysis

  1. Initial Config Creates Redundant VLANs: The test's setup phase begins by calling initialConfig(), which uses utility::onePortPerInterfaceConfig. This utility creates a clean L3 topology by assigning a unique VLAN and Router Interface to every single physical port on the switch (e.g., Port 1
    → VLAN 2000, Port 2 → VLAN 2001, etc.).

  2. Aggregate Port Creation Orphans VLANs: Immediately after, the configureAggregatePorts helper is called, which bundles a subset of these physical ports into Link Aggregation Groups (LAGs). For example, it might place Ports 1, 2, and 3 into agg1. As part of this, it correctly re-assigns all
    member ports to a single VLAN (e.g., VLAN 2000).

  3. State Mismatch: The critical issue was that this function only updated the port-to-VLAN mappings but did not remove the now-empty VLANs and Interfaces (e.g., VLAN 2001, VLAN 2002, and their corresponding L3 Interfaces) from the main configuration lists (vlans and interfaces).

  4. Warmboot Failure:

    • During the initial cold boot, applyNewConfig would see these orphaned but still present VLAN/Interface objects in the configuration and program them into the hardware.
    • However, because these VLANs had no member ports on npu0, some SAI implementations would garbage-collect or simply not retain the empty VLAN in hardware.
    • Upon warm boot, the agent would reload its software state (which still contained the "ghost" InterfaceID 2001) and attempt to re-initialize it. When SaiRouterInterfaceManager tried to create the router interface, it would query the SAI layer for the handle of its underlying VLAN (VlanID
      2001). Since the hardware for this empty VLAN no longer existed on npu0, the call would fail, leading to the fatal crash.

Solution Implemented

The solution was to make the configuration declaratively correct by explicitly pruning any orphaned VLANs and interfaces before applying the config.

This was achieved by modifying the configureAggregatePorts function in fboss/agent/test/agent_hw_tests/AgentTrunkLoadBalancerTests.cpp:

  1. Identify Active VLANs: After the aggregate ports are created, the code now iterates through all vlanPorts in the configuration to build a set of activeVlans—VLANs that still have at least one physical or aggregate port member.

  2. Prune Unused VLANs and Interfaces: The code uses std::remove_if to filter both the vlans and interfaces lists within the cfg::SwitchConfig object. Any VLAN or interface whose ID is not in the activeVlans set is removed.

This ensures that the config object passed to applyNewConfig is a precise representation of the desired final state, without any orphaned entries. As a result, the SwitchState delta calculation is now correct, and the agent properly removes the unused VLANs from the hardware during the initial
setup, preventing the warmboot failure.

Test Plan

test on npu0 and npu1 cold and warm boot
=== Cold boot: AgentTrunkLoadBalancerTest.ECMPFullTrunkHalfHash4X3WideTrunksCpuTraffic (switch_id=0) ===
Waiting for config files...
Config files ready after 1s
Starting hw_agent processes...
=== Warm boot: AgentTrunkLoadBalancerTest.ECMPFullTrunkHalfHash4X3WideTrunksCpuTraffic (switch_id=0) ===

=== Test Results ===
Cold boot: PASSED (exit code 0)
Warm boot: PASSED (exit code 0)

=== Cold boot: AgentTrunkLoadBalancerTest.ECMPFullTrunkHalfHash4X3WideTrunksCpuTraffic (switch_id=1) ===
Waiting for config files...
Config files ready after 1s
Starting hw_agent processes...
=== Warm boot: AgentTrunkLoadBalancerTest.ECMPFullTrunkHalfHash4X3WideTrunksCpuTraffic (switch_id=1) ===

=== Test Results ===
Cold boot: PASSED (exit code 0)
Warm boot: PASSED (exit code 0)

AI skill code check

image image
@cel-gl cel-gl requested a review from a team as a code owner June 26, 2026 06:40
@meta-cla meta-cla Bot added the CLA Signed label Jun 26, 2026
interfaces.begin(),
interfaces.end(),
[&](const cfg::Interface& intf) {
return activeVlans.find(VlanID(*intf.vlanID())) ==

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prunes interfaces by vlanID unconditionally, but on Chenab (InterfaceType::PORT) and VOQ (SYSTEM_PORT) the interfaces onePortPerInterfaceConfig builds carry a placeholder vlanID()==0 and are keyed by portID/intfID — none are in the VLAN-port-derived activeVlans, so this would delete every valid PORT/SYSTEM_PORT RIF and break the config on those platforms. Guard on interface type:

return *intf.type() == cfg::InterfaceType::VLAN &&
activeVlans.find(VlanID(*intf.vlanID())) == activeVlans.end();
(*intf.vlanID() itself is safe — the field is non-optional i32 — so this is a semantics bug, not a crash.)

@@ -207,6 +200,33 @@ class AgentTrunkLoadBalancerTest : public AgentHwTest {
cfg::SwitchConfig configureAggregatePorts(AggPortInfo aggInfo) {
auto config = initialConfig(*getAgentEnsemble());
addAggregatePorts(&config, aggInfo);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The orphaned VLANs/RIFs are created by the shared utility::addAggPort (TrunkUtils.cpp:53-76), which ~30 trunk/LAG/SRv6/MPLS/MAC/copp tests call — they all inherit the same orphan condition. Putting the prune inline here fixes only this test. Consider moving it into a shared, type-safe TrunkUtils helper (e.g. removeUnusedVlansAndInterfaces(config)) called from addAggPort, so every caller benefits — or note why only this test needs it.

@msomasundaran

Copy link
Copy Markdown
Contributor

As per the description, the problem is happening due to empty vlans left out after creating aggregate port. if that is the case, this problem should happen on every platform and should not ladakh specific. Any reason why this is happening only on ladakh/leh platforms?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

3 participants