Skip to content

[Celestica] Leh800bcls: Resolve multi-switch empty fb303 counters in IS_OSS Split Agent mode#1339

Open
gang-tao wants to merge 1 commit into
facebook:mainfrom
gang-tao:leh800bcls_agent_fix12
Open

[Celestica] Leh800bcls: Resolve multi-switch empty fb303 counters in IS_OSS Split Agent mode#1339
gang-tao wants to merge 1 commit into
facebook:mainfrom
gang-tao:leh800bcls_agent_fix12

Conversation

@gang-tao

Copy link
Copy Markdown
Contributor

Pre-submission checklist

  • [ ✓ ] I've ran the linters locally and fixed lint errors related to the files I modified in this PR. You can install the linters by running pip install -r requirements-dev.txt && pre-commit install
  • [ ✓ ] pre-commit run
    clang-format.............................................................Passed
    shellcheck...........................................(no files to check)Skipped
    shfmt................................................(no files to check)Skipped
    trim trailing whitespace.................................................Passed
    fix end of files.........................................................Passed
    check yaml...........................................(no files to check)Skipped
    check json...........................................(no files to check)Skipped
    check for merge conflicts................................................Passed
    ruff check...........................................(no files to check)Skipped
    ruff format..........................................(no files to check)Skipped
    Prevent sai_impl in fboss manifest.......................................Passed

Summary

The agent test case verifyBufferPoolWatermarks, verifyIngressPriorityGroupWatermarks, and PfcWatchdogDetection tests all failed on both npu0 and npu1 on the leh800bcls.

  • AgentTrafficPfcGenTest.verifyBufferPoolWatermarks
/var/FBOSS/fboss/fboss/agent/test/agent_hw_tests/AgentTrafficPfcTests.cpp:168: Failure
Expected: (globalHeadroomWatermark) > (0), actual: 0 vs 0

/var/FBOSS/fboss/fboss/agent/test/agent_hw_tests/AgentTrafficPfcTests.cpp:170: Failure
Expected: (globalSharedWatermark) > (0), actual: 0 vs 0
  • AgentTrafficPfcGenTest.verifyIngressPriorityGroupWatermarks
/var/FBOSS/fboss/fboss/agent/test/agent_hw_tests/AgentTrafficPfcTests.cpp:200: Failure
Expected equality of these values:
  counters.size()
    Which is: 0
  numKeys
    Which is: 2

/var/FBOSS/fboss/fboss/agent/test/agent_hw_tests/AgentTrafficPfcTests.cpp:206: Failure
Expected: (totalWatermark) > (0), actual: 0 vs 0
  • AgentTrafficPfcWatchdogTest.PfcWatchdogDetection
unknown file: Failure
C++ exception with description "no such counter "switch.0.pfc_deadlock_detection_count.sum"" thrown in the test body.

Root Cause

In Split Agent mode, the test main thread runs in the software agent process (multi_switch_agent), while physical metrics are published strictly inside the local facebook::fb303::fbData singleton of the isolated hardware agent process (fboss_hw_agent-sai_impl).

In IS_OSS open-source builds, the Monitor client classes are completely excluded from compilation, forcing AgentEnsemble.cpp to execute the legacy fallback:

#ifndef IS_OSS
...  monitoringClient.sync_getRegexCounters ...
#else
  // TODO: This needs to be updated to support multi-switch.
  counters = facebook::fb303::fbData->getRegexCounters(regex);
#endif

This causes the test process to query its own local blank fbData instead of the switch under test. The local fbData in the test process is empty, causing watermark regex matches to return empty maps, and exact-key queries (such as watchdog deadlock detection counters) to throw C++ std::invalid_argument exceptions, crashing the test body.

Solution

  1. Protocol Extension (agent_hw_test_ctrl.thrift):
    Added getFb303RegexCounters and getFb303Counter APIs to AgentHwTestCtrl.
  2. Server Implementation (HwTestThriftHandler.h):
    Implemented these overrides inline. To eliminate microsecond timing gaps, we manually trigger an in-band cache flush before returning the counters, using safe getCounterIfExists to prevent exceptions:
void getFb303RegexCounters(counters, regex) override {
  facebook::fb303::ThreadCachedServiceData::get()->publishStats(); // Force flush
  counters = facebook::fb303::fbData->getRegexCounters(*regex);
}
  1. Ensemble Pull & Mode Duality (AgentEnsemble.cpp):
    Refactored #else // IS_OSS branches. If !isRunModeMultiSwitch() (Monolithic mode), we directly read local fbData to completely avoid false-positive warning logs and connection attempts on monolithic platforms. If in multi-switch mode, we remotely query the hardware agent via getHwAgentTestClient(switchID).

Test Plan

Verified test case verifyBufferPoolWatermarks, verifyIngressPriorityGroupWatermarks, and PfcWatchdogDetection passed on npu0 and npu1.

[root@localhost log]# grep " OK " AgentTxNpu[10]/mu*.log
AgentTxNpu0/multi_switch_agent-AgentTrafficPfcGenTest.verifyBufferPoolWatermarks-20260626_130335.log:[       OK ] AgentTrafficPfcGenTest.verifyBufferPoolWatermarks (65648 ms)
AgentTxNpu0/multi_switch_agent-AgentTrafficPfcGenTest.verifyIngressPriorityGroupWatermarks-20260626_130447.log:[       OK ] AgentTrafficPfcGenTest.verifyIngressPriorityGroupWatermarks (65813 ms)
AgentTxNpu0/multi_switch_agent-AgentTrafficPfcWatchdogTest.PfcWatchdogDetection-20260626_130600.log:[       OK ] AgentTrafficPfcWatchdogTest.PfcWatchdogDetection (70769 ms)
AgentTxNpu1/multi_switch_agent-AgentTrafficPfcGenTest.verifyBufferPoolWatermarks-20260626_125951.log:[       OK ] AgentTrafficPfcGenTest.verifyBufferPoolWatermarks (65464 ms)
AgentTxNpu1/multi_switch_agent-AgentTrafficPfcGenTest.verifyIngressPriorityGroupWatermarks-20260626_130103.log:[       OK ] AgentTrafficPfcGenTest.verifyIngressPriorityGroupWatermarks (66169 ms)
AgentTxNpu1/multi_switch_agent-AgentTrafficPfcWatchdogTest.PfcWatchdogDetection-20260626_130217.log:[       OK ] AgentTrafficPfcWatchdogTest.PfcWatchdogDetection (70933 ms)

@gang-tao gang-tao requested a review from a team as a code owner June 26, 2026 13:29
@meta-cla meta-cla Bot added the CLA Signed label Jun 26, 2026
#pragma once

#include <fb303/ServiceData.h>
#include <fb303/ThreadCachedServiceData.h>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no fb303 BUCK dep exists on the targets that compile this header — hw_test_thrift_handler_h (fboss/agent/hw/test/BUCK), the shared hw_test_thrift_handler lib, or the SAI/BCM agent_hw_test_thrift_handler libs. Per "don't rely on transitive deps," add
fb303:service_data
and
fb303:thread_cached_service_data
to the relevant target(s) and run arc lint --take AUTODEPS --apply-patches.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your constructive comments!

Since my local development and testing environment is strictly OSS-compliant and relies on CMake, I don't have the Buck toolchain set up locally to fully compile-verify BUCK dependencies.

However, to keep the header clean as you suggested, I have declared the methods in HwTestThriftHandler.h and moved their implementations into the platform-agnostic HwTestLogCaptureThriftHandler.cpp.

Accordingly, I have added:

  • "//fb303:service_data"
  • "//fb303:thread_cached_service_data"
    to the "exported_deps" list of the compiled library target hw_test_thrift_handler inside fboss/agent/hw/test/BUCK, and removed them from the header target to keep the dependency strictly on the single includer.

Could you please help double-check if this BUCK modification is fully compliant and correctly resolves the transitive dependencies in your internal build?

I have built the changes and re-tested, all original 3 cases passed.

std::vector<std::string>& out,
std::unique_ptr<std::string> substring) override;

void getFb303RegexCounters(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every other AgentHwTestCtrlSvIf override here is declared in the header and defined in a platform .cpp; the platform-agnostic log-capture methods were deliberately moved into a single shared HwTestLogCaptureThriftHandler.cpp (see comment ~L253) to avoid duplicating bodies into every SAI+BCM TU. These two methods are also platform-agnostic — please declare them here and define them once in HwTestLogCaptureThriftHandler.cpp. That also keeps the fb303 dep on one lib instead of every includer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this suggestion. It makes a lot of sense and provides much cleaner decoupling.

To follow this pattern, I have:

  1. Removed the inline implementations from HwTestThriftHandler.h and declared both methods as clean virtual overrides.
  2. Moved the implementations of getFb303RegexCounters and getFb303Counter into the shared, platform-agnostic HwTestLogCaptureThriftHandler.cpp alongside the log-capture methods.
  3. This also allows us to remove the fb303 BUCK dependencies entirely from the header target (hw_test_thrift_handler_h) and keep them strictly on the compiled library target (hw_test_thrift_handler), avoiding transmitting unnecessary dependencies to other includers of this header.

Please let me know if this looks correct to you!

Comment thread fboss/agent/test/AgentEnsemble.cpp Outdated
#else
// TODO: This needs to be updated to support multi-switch.
counters = facebook::fb303::fbData->getRegexCounters(regex);
if (!getSw()->isRunModeMultiSwitch()) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (!multiSwitch) local else { try remote; catch warn+local } block is duplicated nearly verbatim across all three functions (getFb303CountersByRegex, getFb303Counter, getFb303RegexCounters). Extract a helper (e.g. queryHwAgentFb303RegexCounters(switchID, regex) / queryHwAgentFb303Counter(switchID, key)) so routing + warning + fallback live in one place.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the excellent suggestion!
I have completely extracted the duplicated logic into two private helper methods in AgentEnsemble:

  1. `queryHwAgentFb303RegexCounters(switchID, regex)
  2. queryHwAgentFb303Counter(switchID, key)
Comment thread fboss/agent/test/AgentEnsemble.cpp Outdated
auto switchID = scopeResolver().scope(portId).switchId();
auto hwTestClient = getHwAgentTestClient(switchID);
hwTestClient->sync_getFb303RegexCounters(counters, regex);
} catch (const std::exception& ex) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In multi-switch the fallback reads the local fb303::fbData, which is exactly the empty in-process singleton this fix exists to work around — so on a remote-RPC failure the test silently gets empty/0 counters and fails as "watermark == 0" instead of surfacing the RPC error. Consider letting the exception propagate (or throw) in the multi-switch branch so a real HwAgent/connectivity failure is debuggable, rather than degrading to a known-blank source.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!
I have updated the catch blocks inside both queryHwAgentFb303RegexCounters and queryHwAgentFb303Counter under the multi-switch branch to log the error at XLOG(ERR) and explicitly re-throw the exception (throw;) up to the test execution context.

Comment thread fboss/agent/test/AgentEnsemble.cpp Outdated
// TODO: This needs to be updated to support multi-switch.
counter = facebook::fb303::fbData->getCounter(key);
if (!getSw()->isRunModeMultiSwitch()) {
counter = facebook::fb303::fbData->getCounterIfExists(key).value_or(0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavior change: the monolithic path went from getCounter(key) (throws on a missing counter) to getCounterIfExists(key).value_or(0) (silently 0). Right fix for the watchdog crash on the remote path, but in mono mode it now turns a genuinely-missing counter into a silent 0 (a test could pass when the counter is absent). Intentional? If the goal is just crash-safety for the cross-process case, consider keeping the strict read in mono and only softening the remote branch, or add a comment justifying the relaxed contract.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!
To preserve the strictness of the original contract:
I have reverted the monolithic path (!isRunModeMultiSwitch()) inside queryHwAgentFb303Counter back to the original facebook::fb303::fbData->getCounter(key). This ensures that on a monolithic platform, any missing counter will still throw and fail the test loudly as designed.

@gang-tao gang-tao force-pushed the leh800bcls_agent_fix12 branch from dba20c8 to a1bdcf1 Compare June 30, 2026 16:32
@gang-tao

Copy link
Copy Markdown
Contributor Author

@jagpalgill Thank you for the review! I have made the changes according to your comments and re-run the tests, and all the previous test cases have passed. Please let me know if I misunderstood anything or if any changes look incorrect, and I'll be happy to update them again. Thanks a lot!

[root@localhost log]# grep " OK " AgentTxNpu[01]/mu*.log
AgentTxNpu0/multi_switch_agent-AgentTrafficPfcGenTest.verifyBufferPoolWatermarks-20260630_163657.log:[       OK ] AgentTrafficPfcGenTest.verifyBufferPoolWatermarks (68390 ms)
AgentTxNpu0/multi_switch_agent-AgentTrafficPfcGenTest.verifyIngressPriorityGroupWatermarks-20260630_163814.log:[       OK ] AgentTrafficPfcGenTest.verifyIngressPriorityGroupWatermarks (68845 ms)
AgentTxNpu0/multi_switch_agent-AgentTrafficPfcWatchdogTest.PfcWatchdogDetection-20260630_163931.log:[       OK ] AgentTrafficPfcWatchdogTest.PfcWatchdogDetection (74475 ms)
AgentTxNpu1/multi_switch_agent-AgentTrafficPfcGenTest.verifyBufferPoolWatermarks-20260630_164053.log:[       OK ] AgentTrafficPfcGenTest.verifyBufferPoolWatermarks (71730 ms)
AgentTxNpu1/multi_switch_agent-AgentTrafficPfcGenTest.verifyIngressPriorityGroupWatermarks-20260630_164213.log:[       OK ] AgentTrafficPfcGenTest.verifyIngressPriorityGroupWatermarks (67900 ms)
AgentTxNpu1/multi_switch_agent-AgentTrafficPfcWatchdogTest.PfcWatchdogDetection-20260630_164329.log:[       OK ] AgentTrafficPfcWatchdogTest.PfcWatchdogDetection (73948 ms)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

2 participants