[Accton][wedge800cact] Fix VerifyCallbacksOnMacEntryChange warmboot timeout#1303
BrandonCheng0121 wants to merge 3 commits into
…imeout

After warmboot, the flood-prevention port may remain DOWN (preserved from coldboot state) or may be in a transient state due to link flap. Calling bringDownPort directly on an already-DOWN port causes LinkStateToggler to wait indefinitely, because the SDK won't generate a new DOWN notification when the state hasn't changed.

Add a bringDownPortIfUp() helper that polls for the port to come UP (10s timeout, 100ms interval) before calling bringDownPort. If the port stays DOWN (already in the desired state), skip the bringDown call and proceed with the test.
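The polling behaviour described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `OperState`, the `getState` and `bringDown` callbacks, and the return value are hypothetical stand-ins for the real FBOSS test utilities, and only the 10s timeout and 100ms interval come from the description above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for the port operational state used by the test.
enum class OperState { UP, DOWN };

// Polls getState() every `interval` until it reports UP or `timeout` elapses.
// Returns true if the port was seen UP, false if it stayed DOWN throughout.
bool waitForPortUp(
    const std::function<OperState()>& getState,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval) {
  assert(interval.count() > 0); // a zero interval would busy-spin
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    if (getState() == OperState::UP) {
      return true;
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;
    }
    std::this_thread::sleep_for(interval);
  }
}

// Calls bringDown() only if the port comes UP within the timeout.
// Returns true if bringDown() was invoked, false if the call was skipped
// because the port was already DOWN (the desired state).
bool bringDownPortIfUp(
    const std::function<OperState()>& getState,
    const std::function<void()>& bringDown,
    std::chrono::milliseconds timeout = std::chrono::seconds(10),
    std::chrono::milliseconds interval = std::chrono::milliseconds(100)) {
  if (!waitForPortUp(getState, timeout, interval)) {
    return false; // still DOWN: no DOWN event would ever arrive, so skip
  }
  bringDown();
  return true;
}
```

In this sketch the caller learns from the return value whether the bring-down was skipped, so a test can proceed either way without blocking on a notification that will not be generated.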
@Tianyu-Meta
I suspect this is a race condition: if the bringDownPort operation is triggered before the warm-boot initialization completely finishes, it leads to unexpected behavior.
…ipping wait for ports already in desired state

LinkStateToggler::portStateChangeImpl() unconditionally waits for link state change events via condition variable, even when ports are already in the desired OperState. During warmboot scenarios, ports may retain their state (e.g., already DOWN), causing SDK to skip event generation. This results in infinite wait and test timeout.

The Solution:
Implement waitForPortEventsOrSkipIfAlreadyInState() to check port OperState before waiting. If all ports are already in the desired state:
- Skip condition variable wait (avoid infinite hang)
- Skip redundant OperState update (performance optimization)
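The check-before-wait idea can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: `LinkEventWaiter`, `onLinkEvent`, the `WaitResult` enum, the integer port IDs, and the bounded `timeout` are hypothetical and do not reflect the actual LinkStateToggler code; only the function name and the "skip the wait if every port already has the desired OperState" behaviour come from the description above.

```cpp
#include <chrono>
#include <condition_variable>
#include <map>
#include <mutex>
#include <vector>

enum class OperState { UP, DOWN };
enum class WaitResult { Skipped, Reached, TimedOut };

// Hypothetical helper class; not the FBOSS LinkStateToggler.
class LinkEventWaiter {
 public:
  // Called from the link-event callback path when a port changes state.
  void onLinkEvent(int port, OperState state) {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      operState_[port] = state;
    }
    cv_.notify_all();
  }

  // Skips the condition-variable wait when every port is already in the
  // desired state; otherwise waits (bounded here, as an assumption) for
  // link events to bring all ports to that state.
  WaitResult waitForPortEventsOrSkipIfAlreadyInState(
      const std::vector<int>& ports,
      bool up,
      std::chrono::milliseconds timeout) {
    const OperState desired = up ? OperState::UP : OperState::DOWN;
    std::unique_lock<std::mutex> lock(mutex_);
    auto allInDesiredState = [&] {
      for (int port : ports) {
        auto it = operState_.find(port);
        if (it == operState_.end() || it->second != desired) {
          return false;
        }
      }
      return true;
    };
    if (allInDesiredState()) {
      return WaitResult::Skipped; // no event will arrive; do not wait
    }
    return cv_.wait_for(lock, timeout, allInDesiredState)
        ? WaitResult::Reached
        : WaitResult::TimedOut;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::map<int, OperState> operState_;
};
```

Performing the state check while holding the same mutex that guards the wait avoids a window in which an event could be missed between the check and the wait.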
Hi @Tianyu-Meta
@Tianyu-Meta has imported this pull request. If you are a Meta employee, you can view this in D110075280.
After warmboot, the flood-prevention port may remain DOWN (preserved from coldboot state) or may be in a transient state due to link flap. Calling bringDownPort directly on an already-DOWN port causes LinkStateToggler to wait indefinitely, because the SDK won't generate a new DOWN notification when the state hasn't changed.
The Solution:
Based on @Tianyu-Meta's suggestion on G-chat ('e.g. skip linkEventCV_.wait if all ports have oper == (up ? UP : DOWN)'), implement waitForPortEventsOrSkipIfAlreadyInState() to check port OperState before waiting. If all ports are already in the desired state:
- Skip the condition variable wait (avoid infinite hang)
- Skip the redundant OperState update (performance optimization)
Pre-submission checklist
pip install -r requirements-dev.txt && pre-commit install
pre-commit run --files fboss/agent/test/LinkStateToggler.cpp fboss/agent/test/LinkStateToggler.h fboss/agent/test/agent_hw_tests/AgentMacLearningTests.cpp
clang-format.............................................................Passed
shellcheck...........................................(no files to check)Skipped
shfmt................................................(no files to check)Skipped
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check for merge conflicts................................................Passed
ruff check...........................................(no files to check)Skipped
ruff format..........................................(no files to check)Skipped
Prevent sai_impl in fboss manifest.......................................Passed
Summary
Before this fix, the warm-boot test would typically hit a timeout at least once within 40 consecutive runs.
With this fix, ports already in the desired state after warmboot no longer wait for SDK events that will never arrive, and the test consistently passes for 50 consecutive runs.
Test Plan
Test command:
for i in {1..50}; do echo "=== Execute $i run ==="; time ./bin/run_test.py sai_agent --agent-run-mode mono --filter=AgentMacSwLearningModeTest.VerifyCallbacksOnMacEntryChange --skip-known-bad-tests "leaba/25.11.4210/25.11.4210/graphene202x" --enable-production-features g202x --config /opt/fboss/share/hw_test_configs/wedge800cact.agent.materialized_JSON --fruid-path /home/Go_FBOSS_Test/W800CA-Fix/./fboss-configs/fboss/oss/scripts/run_configs/fruid.json --mgmt-if eth0 --platform_mapping_override_path /home/Go_FBOSS_Test/W800CA-Fix/./fboss-configs/fboss/lib/platform_mapping_v2/generated_platform_mappings/wedge800cact_platform_mapping-2026-0418-v0.7-honglim_20260409-del_pie.json 2>&1 | tee 463e_W800CACT_VerifyCallbacksOnMacEntryChange_DUT35_DVT1_tiayu_v1_$(date '+%Y-%m-%d-%H:%M')_${i}.log; done
The test consistently passes for 50 consecutive runs.