[Celestica] Ladakh800bcls: Agent Test: Fix AgentTrunkLoadBalancerTests no used vlan create route interface fail in warmboot#1337
Conversation
| interfaces.begin(), | ||
| interfaces.end(), | ||
| [&](const cfg::Interface& intf) { | ||
| return activeVlans.find(VlanID(*intf.vlanID())) == |
There was a problem hiding this comment.
This prunes interfaces by vlanID unconditionally, but on Chenab (InterfaceType::PORT) and VOQ (SYSTEM_PORT) the interfaces onePortPerInterfaceConfig builds carry a placeholder vlanID()==0 and are keyed by portID/intfID — none are in the VLAN-port-derived activeVlans, so this would delete every valid PORT/SYSTEM_PORT RIF and break the config on those platforms. Guard on interface type:
return *intf.type() == cfg::InterfaceType::VLAN &&
activeVlans.find(VlanID(*intf.vlanID())) == activeVlans.end();
(*intf.vlanID() itself is safe — the field is non-optional i32 — so this is a semantics bug, not a crash.)
| @@ -207,6 +200,33 @@ class AgentTrunkLoadBalancerTest : public AgentHwTest { | |||
| cfg::SwitchConfig configureAggregatePorts(AggPortInfo aggInfo) { | |||
| auto config = initialConfig(*getAgentEnsemble()); | |||
| addAggregatePorts(&config, aggInfo); | |||
|
|
|||
There was a problem hiding this comment.
The orphaned VLANs/RIFs are created by the shared utility::addAggPort (TrunkUtils.cpp:53-76), which ~30 trunk/LAG/SRv6/MPLS/MAC/copp tests call — they all inherit the same orphan condition. Putting the prune inline here fixes only this test. Consider moving it into a shared, type-safe TrunkUtils helper (e.g. removeUnusedVlansAndInterfaces(config)) called from addAggPort, so every caller benefits — or note why only this test needs it.
|
As per the description, the problem is happening due to empty vlans left out after creating aggregate port. if that is the case, this problem should happen on every platform and should not ladakh specific. Any reason why this is happening only on ladakh/leh platforms? |
Pre-submission checklist
pip install -r requirements-dev.txt && pre-commit installpre-commit runclang-format.............................................................Passed
shellcheck...........................................(no files to check)Skipped
shfmt................................................(no files to check)Skipped
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check for merge conflicts................................................Passed
ruff check...........................................(no files to check)Skipped
ruff format..........................................(no files to check)Skipped
Prevent sai_impl in fboss manifest.......................................Passed
Summary
Issue Summary
On multi-NPU hardware platforms, the AgentTrunkLoadBalancerTest suite was experiencing a fatal warmboot failure, The failure manifested as a no sai vlan for VlanID 2001 error
The root cause was an inconsistency between the agent's software SwitchState and the actual hardware state programmed in the SAI layer after a cold boot, which then became fatal upon warmboot.
Root Cause Analysis
Initial Config Creates Redundant VLANs: The test's setup phase begins by calling initialConfig(), which uses utility::onePortPerInterfaceConfig. This utility creates a clean L3 topology by assigning a unique VLAN and Router Interface to every single physical port on the switch (e.g., Port 1
→ VLAN 2000, Port 2 → VLAN 2001, etc.).
Aggregate Port Creation Orphans VLANs: Immediately after, the configureAggregatePorts helper is called, which bundles a subset of these physical ports into Link Aggregation Groups (LAGs). For example, it might place Ports 1, 2, and 3 into agg1. As part of this, it correctly re-assigns all
member ports to a single VLAN (e.g., VLAN 2000).
State Mismatch: The critical issue was that this function only updated the port-to-VLAN mappings but did not remove the now-empty VLANs and Interfaces (e.g., VLAN 2001, VLAN 2002, and their corresponding L3 Interfaces) from the main configuration lists (vlans and interfaces).
Warmboot Failure:
2001). Since the hardware for this empty VLAN no longer existed on npu0, the call would fail, leading to the fatal crash.
Solution Implemented
The solution was to make the configuration declaratively correct by explicitly pruning any orphaned VLANs and interfaces before applying the config.
This was achieved by modifying the configureAggregatePorts function in fboss/agent/test/agent_hw_tests/AgentTrunkLoadBalancerTests.cpp:
Identify Active VLANs: After the aggregate ports are created, the code now iterates through all vlanPorts in the configuration to build a set of activeVlans—VLANs that still have at least one physical or aggregate port member.
Prune Unused VLANs and Interfaces: The code uses std::remove_if to filter both the vlans and interfaces lists within the cfg::SwitchConfig object. Any VLAN or interface whose ID is not in the activeVlans set is removed.
This ensures that the config object passed to applyNewConfig is a precise representation of the desired final state, without any orphaned entries. As a result, the SwitchState delta calculation is now correct, and the agent properly removes the unused VLANs from the hardware during the initial
setup, preventing the warmboot failure.
Test Plan
test on npu0 and npu1 cold and warm boot
=== Cold boot: AgentTrunkLoadBalancerTest.ECMPFullTrunkHalfHash4X3WideTrunksCpuTraffic (switch_id=0) ===
Waiting for config files...
Config files ready after 1s
Starting hw_agent processes...
=== Warm boot: AgentTrunkLoadBalancerTest.ECMPFullTrunkHalfHash4X3WideTrunksCpuTraffic (switch_id=0) ===
=== Test Results ===
Cold boot: PASSED (exit code 0)
Warm boot: PASSED (exit code 0)
=== Cold boot: AgentTrunkLoadBalancerTest.ECMPFullTrunkHalfHash4X3WideTrunksCpuTraffic (switch_id=1) ===
Waiting for config files...
Config files ready after 1s
Starting hw_agent processes...
=== Warm boot: AgentTrunkLoadBalancerTest.ECMPFullTrunkHalfHash4X3WideTrunksCpuTraffic (switch_id=1) ===
=== Test Results ===
Cold boot: PASSED (exit code 0)
Warm boot: PASSED (exit code 0)
AI skill code check