Ladakh800bcls: fix exception in case:AgentNeighborTest/*.LinkDown*OnR…#820
Ladakh800bcls: fix exception in case:AgentNeighborTest/*.LinkDown*OnR…#820gang-tao wants to merge 2 commits into
Conversation
|
@arunsarat has imported this pull request. If you are a Meta employee, you can view this in D91188322. |
| for (const auto& portId : ports) { | ||
| WITH_RETRIES_N_TIMED(60, std::chrono::milliseconds(1000), { | ||
| bool fastPollSuccess = false; | ||
| // Fast poll for 120ms without throwing exceptions |
There was a problem hiding this comment.
Is the port coming back up because it is physically connected with a cable? in that case, it is not guaranteed that a 120ms poll might work always. What if the port comes back up in 100ms? Would this problem go away if the port is set to loopback?
There was a problem hiding this comment.
It looks like the code to do bring port up in test will do a loopback to get port up and code to do bring port down in test will remove the loopback. With cable connected, this won't work. The fix proposed here is catching the short port down when loopback mode setting changes. it is not a real port down. For connected ports, we need another mechanism to bring ports up and down.
There was a problem hiding this comment.
Thank you for review.
Yes, it's cable connected via test fixture.
There was a problem hiding this comment.
I discussed this internally at meta. This is the feedback from Harshit.
All of our Ladakh blades are in a loopback test fixture, meaning the ports are externally looped back. These ports connect from one NPU to another, and some are likely looped back within the same NPU.
The policy for hardware tests is that they must be runnable on any device, regardless of external connections, as they were designed to work even in production setups.
Hardware tests initially bring the port down by setting pre-emphasis to 0. After this, the test controls the port's up/down state by enabling or disabling loopback mode. If the ports fail to come down after setting pre-emphasis, we should follow up with Broadcom, as this issue has occurred on other ASICs previously and required fixes.
There was a problem hiding this comment.
Thanks for the comments. but seems zero preemphasis is not supported for TH6: fboss/agent/hw/switch_asics/Tomahawk6Asic.cpp:
bool Tomahawk6Asic::isSupported(Feature feature) const {
case HwAsic::Feature::PORT_SERDES_ZERO_PREEMPHASIS:
return false;
There was a problem hiding this comment.
Thanks for the comments. but seems zero preemphasis is not supported for TH6: fboss/agent/hw/switch_asics/Tomahawk6Asic.cpp:
bool Tomahawk6Asic::isSupported(Feature feature) const { case HwAsic::Feature::PORT_SERDES_ZERO_PREEMPHASIS: return false;
Can you try returning true?
There was a problem hiding this comment.
Yes, I tried that, and it failed as before.
There was a problem hiding this comment.
Yes, I tried that, and it failed as before.
If the port is not staying down, it is an sdk issue and needs to be triaged by broadcom.
There was a problem hiding this comment.
OK, I’ll try to check with Broadcom. If this feature is supported, I’ll change Feature::PORT_SERDES_ZERO_PREEMPHASIS to true in function isSupported. That would be much cleaner.
There was a problem hiding this comment.
checked and verified with broadcom, this is supported. I updated the feature to be supported in code.
3744dad to
1ceb56c
Compare
|
@gang-tao has updated the pull request. You must reimport the pull request before landing. |
1ceb56c to
01b84a1
Compare
|
@gang-tao has updated the pull request. You must reimport the pull request before landing. |
|
Can you please update the description and tests to reflect the new diff |
sure. |
|
@gang-tao has updated the pull request. You must reimport the pull request before landing. |
|
@msomasundaran |
|
@arunsarat merged this pull request in dccb840. |
Pre-submission checklist
pip install -r requirements-dev.txt && pre-commit installpre-commit runSummary
AgentNeighborTest/.LinkDownOnResolvedEntry failed with below exception:
/var/FBOSS/ladakh_test_1215/fboss/agent/test/LinkStateToggler.cpp:223: Failure
Expected equality of these values:
ensemble_->getProgrammedState() ->getPorts() ->getNodeIf(portId) ->getOperState()
Which is: 4-byte object <01-00 00-00>
(up ? PortFields::OperState::UP : PortFields::OperState::DOWN)
Which is: 4-byte object <00-00 00-00>
Motivation
When performing bringDownPorts for these cases, the loopback mode is set to 'none.' In this scenario, although the physical port briefly goes down, it quickly links back up because the port is still physically connected. This transition happens within just a few milliseconds. Consequently, in the original code, the port hadn't gone down yet during the initial check, and by the one-second mark, it had already recovered to an 'up' state. This cycle continues through all 60 check iterations.
In the original design, ZeroPreemphasis was intended to prevent the link peer's linking up. However, for the TH6, it is currently set to unsupported. We confirmed with Broadcom that this feature is supported. Therefore, this attribute should be changed to supported.
Test Plan
start agent 0
start agent 1
start multi_switch_agent_hw_test
result: