[Celestica] Icecube: Config: Implement Software Overtemp Protection (OTP) for TH6 ASIC #1208

Open
zhongedward wants to merge 1 commit into facebook:main from zhongedward:Implement_software_OTP_for_TH6_ASIC
@zhongedward zhongedward commented May 19, 2026

Contributor

Pre-submission checklist

  • I've run the linters locally and fixed lint errors related to the files I modified in this PR. You can install the linters by running pip install -r requirements-dev.txt && pre-commit install
  • pre-commit run
icecube_fan_otp

Summary

This PR adds a software-based Overtemp Protection (OTP) mechanism for the Icecube platform. It ties sensor_service temperature monitoring to the fan_service shutdown logic, so that the TH6 ASIC is powered off during a thermal anomaly before it reaches its hardware limits.

Currently, the Icecube platform lacks a software-driven emergency power-off sequence for the TH6 ASIC. Relying solely on hardware-level protection can be risky if the thermal ramp is too steep. This change establishes a "Soft OTP" layer to trigger an orderly shutdown when the TH6 temperature hits the critical threshold.

Key Changes

  • platform_manager.json: Exported the SMB_CPLD sysfs path to ensure sensor_service has consistent access to temperature registers.

  • sensor_service.json: Defined the TH6_TEMP sensor (mapped to SMB_CPLD) with a critical threshold (upperCriticalVal) of 101.0°C.

  • fan_service.json:

    • Implemented shutdownCondition triggered by TH6_TEMP.
    • Defined shutdownCmd to explicitly disable TH6 power via SMB_CPLD (echo 0 > /run/devmap/cplds/SMB_CPLD/th6_pwr_en).
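The interaction between the threshold and the shutdown command can be sketched as follows. This is only an illustration of the behavior described above, not the fan_service implementation: the function names and the `poll_once` step are hypothetical, and the inclusive comparison is an assumption. The threshold value and the sysfs path are the ones given in this PR.

```python
from pathlib import Path

# Values taken from the PR description.
TH6_UPPER_CRITICAL_C = 101.0
TH6_PWR_EN = Path("/run/devmap/cplds/SMB_CPLD/th6_pwr_en")


def shutdown_needed(temp_c: float,
                    upper_critical_c: float = TH6_UPPER_CRITICAL_C) -> bool:
    """Return True when a reading is at or above the critical threshold.

    Assumption: the comparison is inclusive; the real service may differ.
    """
    return temp_c >= upper_critical_c


def disable_th6_power(pwr_en_path: Path = TH6_PWR_EN) -> None:
    """Rough equivalent of `echo 0 > <path>`: write "0" and a newline."""
    pwr_en_path.write_text("0\n")


def poll_once(temp_c: float, pwr_en_path: Path = TH6_PWR_EN) -> bool:
    """One hypothetical polling step: check the reading and act if needed.

    Returns True if the power-enable file was written.
    """
    if shutdown_needed(temp_c):
        disable_th6_power(pwr_en_path)
        return True
    return False
```

Passing a scratch file as `pwr_en_path` lets the logic be exercised on a development machine without touching the CPLD.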

Test Plan

  1. Syntax Validation: Validated JSON syntax.

  2. Formatting: Pretty-printed configurations using the jq command for readability.

  3. Build & Config Tests: Compilation and configuration validation tests passed successfully.

  4. Service Verification: Confirmed that the following services start and run without errors:

    • platform_manager/platform_hw_test/platform_manager_hw_test
    • sensor_service/sensor_service_client/sensor_service_sw_test/sensor_service_hw_test
    • fan_service/fan_service_sw_test/fan_service_hw_test
  5. End-to-End Thermal Protection Verification (Soft OTP)
    To verify the effectiveness of the software shutdown logic, we performed a controlled thermal stress test:

  • Methodology:

    • Hardware Guardrail Adjustment: Temporarily raised the TMP432 sensor's hardware threshold (set at initialization via platform_manager) to 110°C. This keeps the hardware-level protection from tripping first during the test window, so the software logic acts as the first line of defense.
    • Controlled Thermal Ramp: Adjusted the CDU (Cooling Distribution Unit) to allow the TH6 ASIC temperature to rise naturally.
    • Observation: Monitored the fan_service polling cycle and system logs to capture the exact trigger point.
  • Test Result:

    • Trigger Point: Once TH6_TEMP crossed the software-defined threshold of 101°C (the logged trigger reading was 101.604°C), fan_service detected the shutdownCondition.
    • Action Executed: The shutdownCmd was immediately triggered, executing:
      echo 0 > /run/devmap/cplds/SMB_CPLD/th6_pwr_en
    • Conclusion: Confirmed that the TH6 power rail was successfully disabled by the software trigger, preventing the temperature from reaching the 110°C hardware limit.
    • Audit Logs: The attached .zip contains specific evidence:
      • temp_shutdown_monitor: Captures the thermal ramp and the fan_service trigger event at 101.604°C.
      • pcie_shutdown_monitor: Verifies the physical removal of TH6 from the PCIe bus post-shutdown, confirming successful power-off.
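For the syntax and formatting checks in steps 1 and 2 of the test plan, a rough Python counterpart to running the configs through jq might look like the sketch below. The helper name and the sample fragment are hypothetical (only `upperCriticalVal` and `TH6_TEMP` come from this PR), and the output is not guaranteed to match jq byte for byte.

```python
import json


def validate_and_format(text: str) -> str:
    """Parse JSON text and return it pretty-printed with 2-space indent.

    Raises json.JSONDecodeError if the text is not valid JSON.
    """
    return json.dumps(json.loads(text), indent=2) + "\n"


if __name__ == "__main__":
    # Hypothetical fragment; the real sensor_service.json schema may differ.
    sample = '{"name": "TH6_TEMP", "upperCriticalVal": 101.0}'
    print(validate_and_format(sample), end="")

    # A trailing comma is a syntax error and is rejected.
    try:
        validate_and_format('{"upperCriticalVal": 101.0,}')
    except json.JSONDecodeError as err:
        print(f"invalid JSON: {err}")
```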

Attachment:
icecube_sw_OTP_test_2026_04_24_log.zip

@meta-cla meta-cla Bot added the CLA Signed label May 19, 2026
@zhongedward zhongedward marked this pull request as ready for review May 19, 2026 08:11
@zhongedward zhongedward requested a review from a team as a code owner May 19, 2026 08:11
meta-codesync Bot commented May 22, 2026

Contributor

@mikechoifb has imported this pull request. If you are a Meta employee, you can view this in D106028063.
