[Celestica] Tahansb: fan_service: Fixing OTP issues by optimizing the OTP algorithm#1338

Open
QiuyunXie wants to merge 1 commit into
facebook:mainfrom
QiuyunXie:optimize_OTP

@QiuyunXie

Contributor

Pre-submission checklist

  • I've run the linters locally and fixed lint errors related to the files I modified in this PR. You can install the linters by running pip install -r requirements-dev.txt && pre-commit install
  • pre-commit run

Summary

Meta reported a TH6 PCIe link issue. Debugging showed that it was caused by a software OTP (over-temperature protection) event. During TH6 SDK/agent initialization, the TH6_TEMP sensor appears to report an abnormal, untrustworthy temperature jump. Such an anomalous reading should not trigger OTP.
This PR optimizes the OTP algorithm. The previous algorithm averaged windowsize TH6 temperature samples and compared the average with the overtemp threshold. That is not robust: a single abnormally high reading inside the window can keep the average above the threshold. The new algorithm triggers OTP only when the TH6 temperature exceeds overtempThreshold for windowsize consecutive samples.
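
To illustrate the difference between the two checks described above, here is a minimal Python sketch. It is not the fan_service code from this PR: the function names, the sample values, the threshold of 105, and the choice to wait for a full window before averaging are all assumptions made for illustration.

```python
from collections import deque


def average_trigger(samples, threshold, window_size):
    """Illustrative model of the old check: trip when the mean of the last
    window_size samples exceeds threshold.
    Assumption: no decision is made until the window is full."""
    window = deque(maxlen=window_size)
    for i, temp in enumerate(samples):
        window.append(temp)
        if len(window) == window_size and sum(window) / window_size > threshold:
            return i  # index of the sample at which OTP would trip
    return None


def consecutive_trigger(samples, threshold, window_size):
    """Illustrative model of the new check: trip only after window_size
    consecutive samples exceed threshold."""
    count = 0
    for i, temp in enumerate(samples):
        # Any sample at or below the threshold resets the streak.
        count = count + 1 if temp > threshold else 0
        if count >= window_size:
            return i
    return None


# Hypothetical readings: one bogus spike among otherwise normal values.
spike = [60, 60, 60, 60, 700, 60, 60]
print(average_trigger(spike, threshold=105, window_size=5))      # trips on the spike
print(consecutive_trigger(spike, threshold=105, window_size=5))  # never trips

# Hypothetical readings: a sustained over-temperature condition.
sustained = [110] * 5
print(consecutive_trigger(sustained, threshold=105, window_size=5))
```

In this sketch a single spike pulls the windowed mean above the threshold, while the consecutive-sample check ignores it and still trips on a sustained over-temperature condition.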

Test Plan

Run the agent initialization stress test continuously while platform_manager, sensor_service, and fan_service are running.
Our SVT team has run the stress test continuously for a week, and the issue has not been reproduced.

@QiuyunXie QiuyunXie requested a review from a team as a code owner June 26, 2026 13:05
@meta-cla meta-cla Bot added the CLA Signed label Jun 26, 2026