Skip to content

[Accton][wedge800cact] Fix VerifyCallbacksOnMacEntryChange warmboot timeout#1303

Open
BrandonCheng0121 wants to merge 3 commits into
facebook:mainfrom
BrandonCheng0121:fix_L2EntryChange_timout
Open

[Accton][wedge800cact] Fix VerifyCallbacksOnMacEntryChange warmboot timeout#1303
BrandonCheng0121 wants to merge 3 commits into
facebook:mainfrom
BrandonCheng0121:fix_L2EntryChange_timout

Conversation

@BrandonCheng0121

@BrandonCheng0121 BrandonCheng0121 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

After warmboot, the flood-prevention port may remain DOWN (preserved from coldboot state) or may be in a transient state due to link flap. Calling bringDownPort directly on an already-DOWN port causes LinkStateToggler to wait indefinitely because SDK won't generate a new DOWN notification when state hasn't changed.

The Solution:
Based on @Tianyu-Meta suggestions
'e.g. skip linkEventCV_.wai if all ports have oper == (up ? UP : DOWN)'
on the G-chat,
implement waitForPortEventsOrSkipIfAlreadyInState() to check port OperState before waiting. If all ports are already in the desired state:

  • Skip condition variable wait (avoid infinite hang)
  • Skip redundant OperState update (performance optimization)

Pre-submission checklist

  • I've ran the linters locally and fixed lint errors related to the files I modified in this PR. You can install the linters by running pip install -r requirements-dev.txt && pre-commit install
  • pre-commit run --files fboss/agent/test/LinkStateToggler.cpp fboss/agent/test/LinkStateToggler.h fboss/agent/test/agent_hw_tests/AgentMacLearningTests.cpp clang-format.............................................................Passed shellcheck...........................................(no files to check)Skipped shfmt................................................(no files to check)Skipped trim trailing whitespace.................................................Passed fix end of files.........................................................Passed check yaml...........................................(no files to check)Skipped check json...........................................(no files to check)Skipped check for merge conflicts................................................Passed ruff check...........................................(no files to check)Skipped ruff format..........................................(no files to check)Skipped Prevent sai_impl in fboss manifest.......................................Passed

Summary

Before this fix, the warm-boot test would typically encounter a TIMEOUT issue within 40 consecutive runs.
With this fix, ports already in the desired state after warmboot no longer wait for SDK events that will never arrive and the test consistently passes for 50 consecutive runs.

Test Plan

Test command:
for i in {1..50}; do echo "=== Execute $i run ==="; time ./bin/run_test.py sai_agent --agent-run-mode mono --filter=AgentMacSwLearningModeTest.VerifyCallbacksOnMacEntryChange --skip-known-bad-tests "leaba/25.11.4210/25.11.4210/graphene202x" --enable-production-features g202x --config /opt/fboss/share/hw_test_configs/wedge800cact.agent.materialized_JSON --fruid-path /home/Go_FBOSS_Test/W800CA-Fix/./fboss-configs/fboss/oss/scripts/run_configs/fruid.json --mgmt-if eth0 --platform_mapping_override_path /home/Go_FBOSS_Test/W800CA-Fix/./fboss-configs/fboss/lib/platform_mapping_v2/generated_platform_mappings/wedge800cact_platform_mapping-2026-0418-v0.7-honglim_20260409-del_pie.json 2>&1 | tee 463e_W800CACT_VerifyCallbacksOnMacEntryChange_DUT35_DVT1_tiayu_v1_$(date '+%Y-%m-%d-%H:%M')_${i}.log; done

The test consistently passes for 50 consecutive runs.

…imeout

After warmboot, the flood-prevention port may remain DOWN (preserved
from coldboot state) or may be in a transient state due to link flap.
Calling bringDownPort directly on an already-DOWN port causes
LinkStateToggler to wait indefinitely because SDK won't generate a new
DOWN notification when state hasn't changed.

Add bringDownPortIfUp() helper that polls for the port to come UP
(10s timeout, 100ms interval) before calling bringDownPort. If the
port stays DOWN (already in desired state), skip the bringDown call
and proceed with the test.
@BrandonCheng0121 BrandonCheng0121 requested a review from a team as a code owner June 16, 2026 07:27
@meta-cla meta-cla Bot added the CLA Signed label Jun 16, 2026
@BrandonCheng0121 BrandonCheng0121 marked this pull request as draft June 16, 2026 07:32
@BrandonCheng0121 BrandonCheng0121 marked this pull request as ready for review June 16, 2026 08:00
@BrandonCheng0121

Copy link
Copy Markdown
Contributor Author

@Tianyu-Meta
Please take a look at this PR, thanks.

@BrandonCheng0121

BrandonCheng0121 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@Tianyu-Meta

I suspect this is a race condition: if the bringDownPort operation is triggered before the warm-boot initialization completely finishes, it leads to unexpected behavior.
To resolve this, I propose two potential solutions:

  1. Wait for the warm-boot port state initialization to complete before executing bringDownPort on the target port.(This is used by this PR.)

  2. Alternatively, we could provide a dedicated initialConfig for this test case. However, according to the FBOSS coding standards (fboss/skills/fboss-code-standards/references/testing-patterns.md), this solution does not seem to be allowed.

@BrandonCheng0121 BrandonCheng0121 marked this pull request as draft June 29, 2026 09:50
…ipping wait for ports already in desired state

LinkStateToggler::portStateChangeImpl() unconditionally waits for link state
change events via condition variable, even when ports are already in the
desired OperState. During warmboot scenarios, ports may retain their state
(e.g., already DOWN), causing SDK to skip event generation. This results in
infinite wait and test timeout.

The Solution:
Implement waitForPortEventsOrSkipIfAlreadyInState() to check port OperState
before waiting. If all ports are already in the desired state:
- Skip condition variable wait (avoid infinite hang)
- Skip redundant OperState update (performance optimization)
@BrandonCheng0121 BrandonCheng0121 marked this pull request as ready for review June 29, 2026 10:33
@BrandonCheng0121

Copy link
Copy Markdown
Contributor Author

Hi @Tianyu-Meta
Based on your suggestions
'e.g. skip linkEventCV_.wai if all ports have oper == (up ? UP : DOWN)'
on the G-chat,
I've modified it. Please help review it, thanks.

@meta-codesync

meta-codesync Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

@Tianyu-Meta has imported this pull request. If you are a Meta employee, you can view this in D110075280.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

2 participants