Skip to content

Ladakh800bcls: fix exception in case:AgentNeighborTest/*.LinkDown*OnR…#820

Closed
gang-tao wants to merge 2 commits into
facebook:mainfrom
gang-tao:ladakh800bcls_agent_fix2
Closed

Ladakh800bcls: fix exception in case:AgentNeighborTest/*.LinkDown*OnR…#820
gang-tao wants to merge 2 commits into
facebook:mainfrom
gang-tao:ladakh800bcls_agent_fix2

Conversation

@gang-tao

@gang-tao gang-tao commented Jan 19, 2026

Copy link
Copy Markdown
Contributor

Pre-submission checklist

  • [✓ ] I've ran the linters locally and fixed lint errors related to the files I modified in this PR. You can install the linters by running pip install -r requirements-dev.txt && pre-commit install
  • [✓ ] pre-commit run
clang-format.............................................................Passed
black................................................(no files to check)Skipped
shellcheck...........................................(no files to check)Skipped
shfmt................................................(no files to check)Skipped
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check for merge conflicts................................................Passed
ruff check...........................................(no files to check)Skipped

Summary

AgentNeighborTest/.LinkDownOnResolvedEntry failed with below exception:
/var/FBOSS/ladakh_test_1215/fboss/agent/test/LinkStateToggler.cpp:223: Failure
Expected equality of these values:
ensemble_->getProgrammedState() ->getPorts() ->getNodeIf(portId) ->getOperState()
Which is: 4-byte object <01-00 00-00>
(up ? PortFields::OperState::UP : PortFields::OperState::DOWN)
Which is: 4-byte object <00-00 00-00>

Motivation

When performing bringDownPorts for these cases, the loopback mode is set to 'none.' In this scenario, although the physical port briefly goes down, it quickly links back up because the port is still physically connected. This transition happens within just a few milliseconds. Consequently, in the original code, the port hadn't gone down yet during the initial check, and by the one-second mark, it had already recovered to an 'up' state. This cycle continues through all 60 check iterations.

In the original design, ZeroPreemphasis was intended to prevent the link peer's linking up. However, for the TH6, it is currently set to unsupported. We confirmed with Broadcom that this feature is supported. Therefore, this attribute should be changed to supported.

Test Plan

start agent 0

TS=(date +%Y%m%d_%H%M%S) LD_LIBRARY_PATH=/root/gangtao/opt/fboss/lib/ /root/gangtao/img/fboss_hw_agent-sai_impl \ --config /root/gangtao/cfg/ladakh800bc.agent.conf-1111-no-65-connection-2-npus \ --platform_mapping_override_path /root/gangtao/cfg/ladakh800bc_platform_mapping_is_multi_npu.json-no-65 \ --thrift_test_utils_thrift_handler \ --switchIndex 0 \ --multi_npu_platform_mapping \ --logging=DBG5 \ -enable_replayer \ -enable_elapsed_time_log \ -sai_replayer_sdk_log_level DEBUG \ -sai_log /root/gangtao/log/npu0-sai-{TS}.log
2>&1 | tee /root/gangtao/log/fboss_hw_agent0_${TS}.log

start agent 1

TS=(date +%Y%m%d_%H%M%S) LD_LIBRARY_PATH=/root/gangtao/opt/fboss/lib/ /root/gangtao/img/fboss_hw_agent-sai_impl \ --config /root/gangtao/cfg/ladakh800bc.agent.conf-1111-no-65-connection-2-npus \ --platform_mapping_override_path /root/gangtao/cfg/ladakh800bc_platform_mapping_is_multi_npu.json-no-65 \ --thrift_test_utils_thrift_handler \ --switchIndex 1 \ --multi_npu_platform_mapping \ --logging=DBG5 \ -enable_replayer \ -enable_elapsed_time_log \ -sai_replayer_sdk_log_level DEBUG \ -sai_log /root/gangtao/log/npu1-sai-{TS}.log
2>&1 | tee /root/gangtao/log/fboss_hw_agent1_${TS}.log

start multi_switch_agent_hw_test

LD_LIBRARY_PATH=/root/gangtao/opt/fboss/lib/ /root/gangtao/img/multi_switch_agent_hw_test \
--thrift_test_utils_thrift_handler \
--gtest_filter=AgentNeighborTest/0.LinkDownOnResolvedEntry    \
--config /root/gangtao/cfg/ladakh800bc.agent.conf-1111-no-65-connection-2-npus \
--platform_mapping_override_path /root/gangtao/cfg/ladakh800bc_platform_mapping_is_multi_npu.json-no-65 \
--multi_npu_platform_mapping \
--logging=DBG5 \
2>&1 | tee /root/gangtao/log/multi_switch_agent-${TS}.log

also for
--gtest_filter=AgentNeighborTest/2.LinkDownOnResolvedEntry 
--gtest_filter=AgentNeighborTest/4.LinkDownOnResolvedEntry
--gtest_filter=AgentNeighborTest/6.LinkDownOnResolvedEntry
--gtest_filter=AgentNeighborTest/0.LinkDownAndUpOnResolvedEntry
--gtest_filter=AgentNeighborTest/2.LinkDownAndUpOnResolvedEntry
--gtest_filter=AgentNeighborTest/4.LinkDownAndUpOnResolvedEntry
--gtest_filter=AgentNeighborTest/6.LinkDownAndUpOnResolvedEntry

result:

[root@localhost log]# ls -lrt | sed -n '/============test/,$p' | grep multi_switch_agent-202  | awk '{print $9}' | xargs -n 1 grep OK
[       OK ] AgentNeighborTest/0.LinkDownOnResolvedEntry (174276 ms)
[       OK ] AgentNeighborTest/2.LinkDownOnResolvedEntry (158831 ms)
[       OK ] AgentNeighborTest/4.LinkDownOnResolvedEntry (151803 ms)
[       OK ] AgentNeighborTest/6.LinkDownOnResolvedEntry (270947 ms)
[       OK ] AgentNeighborTest/0.LinkDownAndUpOnResolvedEntry (190111 ms)
[       OK ] AgentNeighborTest/2.LinkDownAndUpOnResolvedEntry (175714 ms)
[       OK ] AgentNeighborTest/4.LinkDownAndUpOnResolvedEntry (164928 ms)
[       OK ] AgentNeighborTest/6.LinkDownAndUpOnResolvedEntry (166013 ms)
@meta-cla meta-cla Bot added the CLA Signed label Jan 19, 2026
@meta-codesync

meta-codesync Bot commented Jan 22, 2026

Copy link
Copy Markdown
Contributor

@arunsarat has imported this pull request. If you are a Meta employee, you can view this in D91188322.

Comment thread fboss/agent/test/LinkStateToggler.cpp Outdated
for (const auto& portId : ports) {
WITH_RETRIES_N_TIMED(60, std::chrono::milliseconds(1000), {
bool fastPollSuccess = false;
// Fast poll for 120ms without throwing exceptions

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the port coming back up because it is physically connected with a cable? in that case, it is not guaranteed that a 120ms poll might work always. What if the port comes back up in 100ms? Would this problem go away if the port is set to loopback?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the code to do bring port up in test will do a loopback to get port up and code to do bring port down in test will remove the loopback. With cable connected, this won't work. The fix proposed here is catching the short port down when loopback mode setting changes. it is not a real port down. For connected ports, we need another mechanism to bring ports up and down.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for review.
Yes, it's cable connected via test fixture.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed this internally at meta. This is the feedback from Harshit.

All of our Ladakh blades are in a loopback test fixture, meaning the ports are externally looped back. These ports connect from one NPU to another, and some are likely looped back within the same NPU.

The policy for hardware tests is that they must be runnable on any device, regardless of external connections, as they were designed to work even in production setups.

Hardware tests initially bring the port down by setting pre-emphasis to 0. After this, the test controls the port's up/down state by enabling or disabling loopback mode. If the ports fail to come down after setting pre-emphasis, we should follow up with Broadcom, as this issue has occurred on other ASICs previously and required fixes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. but seems zero preemphasis is not supported for TH6: fboss/agent/hw/switch_asics/Tomahawk6Asic.cpp:

bool Tomahawk6Asic::isSupported(Feature feature) const {

    case HwAsic::Feature::PORT_SERDES_ZERO_PREEMPHASIS:
      return false;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. but seems zero preemphasis is not supported for TH6: fboss/agent/hw/switch_asics/Tomahawk6Asic.cpp:

bool Tomahawk6Asic::isSupported(Feature feature) const {

    case HwAsic::Feature::PORT_SERDES_ZERO_PREEMPHASIS:
      return false;

Can you try returning true?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I tried that, and it failed as before.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I tried that, and it failed as before.

If the port is not staying down, it is an sdk issue and needs to be triaged by broadcom.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I’ll try to check with Broadcom. If this feature is supported, I’ll change Feature::PORT_SERDES_ZERO_PREEMPHASIS to true in function isSupported. That would be much cleaner.

@gang-tao gang-tao Jan 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checked and verified with broadcom, this is supported. I updated the feature to be supported in code.

@gang-tao gang-tao force-pushed the ladakh800bcls_agent_fix2 branch from 3744dad to 1ceb56c Compare January 30, 2026 06:15
@facebook-github-bot

Copy link
Copy Markdown
Contributor

@gang-tao has updated the pull request. You must reimport the pull request before landing.

@gang-tao gang-tao force-pushed the ladakh800bcls_agent_fix2 branch from 1ceb56c to 01b84a1 Compare February 2, 2026 01:44
@facebook-github-bot

Copy link
Copy Markdown
Contributor

@gang-tao has updated the pull request. You must reimport the pull request before landing.

@msomasundaran

Copy link
Copy Markdown
Contributor

Can you please update the description and tests to reflect the new diff

@gang-tao

gang-tao commented Feb 3, 2026

Copy link
Copy Markdown
Contributor Author

Can you please update the description and tests to reflect the new diff

sure.

@facebook-github-bot

Copy link
Copy Markdown
Contributor

@gang-tao has updated the pull request. You must reimport the pull request before landing.

@gang-tao

gang-tao commented Feb 4, 2026

Copy link
Copy Markdown
Contributor Author

@msomasundaran
I have updated again to resolve the conflicts.

@meta-codesync

meta-codesync Bot commented Feb 7, 2026

Copy link
Copy Markdown
Contributor

@arunsarat merged this pull request in dccb840.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

3 participants