[Celestica] Leh800bcls: Resolve multi-switch empty fb303 counters in IS_OSS Split Agent mode#1339
[Celestica] Leh800bcls: Resolve multi-switch empty fb303 counters in IS_OSS Split Agent mode#1339gang-tao wants to merge 1 commit into
Conversation
| #pragma once | ||
|
|
||
| #include <fb303/ServiceData.h> | ||
| #include <fb303/ThreadCachedServiceData.h> |
There was a problem hiding this comment.
no fb303 BUCK dep exists on the targets that compile this header — hw_test_thrift_handler_h (fboss/agent/hw/test/BUCK), the shared hw_test_thrift_handler lib, or the SAI/BCM agent_hw_test_thrift_handler libs. Per "don't rely on transitive deps," add
fb303:service_data
and
fb303:thread_cached_service_data
to the relevant target(s) and run arc lint --take AUTODEPS --apply-patches.
There was a problem hiding this comment.
Thank you for your constructive comments!
Since my local development and testing environment is strictly OSS-compliant and relies on CMake, I don't have the Buck toolchain set up locally to fully compile-verify BUCK dependencies.
However, to keep the header clean as you suggested, I have declared the methods in HwTestThriftHandler.h and moved their implementations into the platform-agnostic HwTestLogCaptureThriftHandler.cpp.
Accordingly, I have added:
- "//fb303:service_data"
- "//fb303:thread_cached_service_data"
to the "exported_deps" list of the compiled library target hw_test_thrift_handler insidefboss/agent/hw/test/BUCK, and removed them from the header target to keep the dependency strictly on the single includer.
Could you please help double-check if this BUCK modification is fully compliant and correctly resolves the transitive dependencies in your internal build?
I have built the changes and re-tested, all original 3 cases passed.
| std::vector<std::string>& out, | ||
| std::unique_ptr<std::string> substring) override; | ||
|
|
||
| void getFb303RegexCounters( |
There was a problem hiding this comment.
Every other AgentHwTestCtrlSvIf override here is declared in the header and defined in a platform .cpp; the platform-agnostic log-capture methods were deliberately moved into a single shared HwTestLogCaptureThriftHandler.cpp (see comment ~L253) to avoid duplicating bodies into every SAI+BCM TU. These two methods are also platform-agnostic — please declare them here and define them once in HwTestLogCaptureThriftHandler.cpp. That also keeps the fb303 dep on one lib instead of every includer.
There was a problem hiding this comment.
Thank you for this suggestion. It makes a lot of sense and provides much cleaner decoupling.
To follow this pattern, I have:
- Removed the inline implementations from
HwTestThriftHandler.hand declared both methods as clean virtual overrides. - Moved the implementations of
getFb303RegexCountersandgetFb303Counterinto the shared, platform-agnosticHwTestLogCaptureThriftHandler.cppalongside the log-capture methods. - This also allows us to remove the fb303 BUCK dependencies entirely from the header target (
hw_test_thrift_handler_h) and keep them strictly on the compiled library target (hw_test_thrift_handler), avoiding transmitting unnecessary dependencies to other includers of this header.
Please let me know if this looks correct to you!
| #else | ||
| // TODO: This needs to be updated to support multi-switch. | ||
| counters = facebook::fb303::fbData->getRegexCounters(regex); | ||
| if (!getSw()->isRunModeMultiSwitch()) { |
There was a problem hiding this comment.
if (!multiSwitch) local else { try remote; catch warn+local } block is duplicated nearly verbatim across all three functions (getFb303CountersByRegex, getFb303Counter, getFb303RegexCounters). Extract a helper (e.g. queryHwAgentFb303RegexCounters(switchID, regex) / queryHwAgentFb303Counter(switchID, key)) so routing + warning + fallback live in one place.
There was a problem hiding this comment.
Thank you for the excellent suggestion!
I have completely extracted the duplicated logic into two private helper methods in AgentEnsemble:
- `queryHwAgentFb303RegexCounters(switchID, regex)
queryHwAgentFb303Counter(switchID, key)
| auto switchID = scopeResolver().scope(portId).switchId(); | ||
| auto hwTestClient = getHwAgentTestClient(switchID); | ||
| hwTestClient->sync_getFb303RegexCounters(counters, regex); | ||
| } catch (const std::exception& ex) { |
There was a problem hiding this comment.
In multi-switch the fallback reads the local fb303::fbData, which is exactly the empty in-process singleton this fix exists to work around — so on a remote-RPC failure the test silently gets empty/0 counters and fails as "watermark == 0" instead of surfacing the RPC error. Consider letting the exception propagate (or throw) in the multi-switch branch so a real HwAgent/connectivity failure is debuggable, rather than degrading to a known-blank source.
There was a problem hiding this comment.
Good catch!
I have updated the catch blocks inside both queryHwAgentFb303RegexCounters and queryHwAgentFb303Counter under the multi-switch branch to log the error at XLOG(ERR) and explicitly re-throw the exception (throw;) up to the test execution context.
| // TODO: This needs to be updated to support multi-switch. | ||
| counter = facebook::fb303::fbData->getCounter(key); | ||
| if (!getSw()->isRunModeMultiSwitch()) { | ||
| counter = facebook::fb303::fbData->getCounterIfExists(key).value_or(0); |
There was a problem hiding this comment.
Behavior change: the monolithic path went from getCounter(key) (throws on a missing counter) to getCounterIfExists(key).value_or(0) (silently 0). Right fix for the watchdog crash on the remote path, but in mono mode it now turns a genuinely-missing counter into a silent 0 (a test could pass when the counter is absent). Intentional? If the goal is just crash-safety for the cross-process case, consider keeping the strict read in mono and only softening the remote branch, or add a comment justifying the relaxed contract.
There was a problem hiding this comment.
Good catch!
To preserve the strictness of the original contract:
I have reverted the monolithic path (!isRunModeMultiSwitch()) inside queryHwAgentFb303Counter back to the original facebook::fb303::fbData->getCounter(key). This ensures that on a monolithic platform, any missing counter will still throw and fail the test loudly as designed.
…IS_OSS Split Agent mode
dba20c8 to
a1bdcf1
Compare
|
@jagpalgill Thank you for the review! I have made the changes according to your comments and re-run the tests, and all the previous test cases have passed. Please let me know if I misunderstood anything or if any changes look incorrect, and I'll be happy to update them again. Thanks a lot! |
Pre-submission checklist
pip install -r requirements-dev.txt && pre-commit installpre-commit runclang-format.............................................................Passed
shellcheck...........................................(no files to check)Skipped
shfmt................................................(no files to check)Skipped
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check for merge conflicts................................................Passed
ruff check...........................................(no files to check)Skipped
ruff format..........................................(no files to check)Skipped
Prevent sai_impl in fboss manifest.......................................Passed
Summary
The agent test case verifyBufferPoolWatermarks, verifyIngressPriorityGroupWatermarks, and PfcWatchdogDetection tests all failed on both npu0 and npu1 on the leh800bcls.
Root Cause
In Split Agent mode, the test main thread runs in the software agent process (multi_switch_agent), while physical metrics are published strictly inside the local facebook::fb303::fbData singleton of the isolated hardware agent process (fboss_hw_agent-sai_impl).
In IS_OSS open-source builds, the Monitor client classes are completely excluded from compilation, forcing AgentEnsemble.cpp to execute the legacy fallback:
This causes the test process to query its own local blank fbData instead of the switch under test. The local fbData in the test process is empty, causing watermark regex matches to return empty maps, and exact-key queries (such as watchdog deadlock detection counters) to throw C++ std::invalid_argument exceptions, crashing the test body.
Solution
Added getFb303RegexCounters and getFb303Counter APIs to AgentHwTestCtrl.
Implemented these overrides inline. To eliminate microsecond timing gaps, we manually trigger an in-band cache flush before returning the counters, using safe getCounterIfExists to prevent exceptions:
Refactored #else // IS_OSS branches. If !isRunModeMultiSwitch() (Monolithic mode), we directly read local fbData to completely avoid false-positive warning logs and connection attempts on monolithic platforms. If in multi-switch mode, we remotely query the hardware agent via getHwAgentTestClient(switchID).
Test Plan
Verified test case verifyBufferPoolWatermarks, verifyIngressPriorityGroupWatermarks, and PfcWatchdogDetection passed on npu0 and npu1.