-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dynamic sensor logic for fixed and psu presence/state checking in thermalctld #401
Conversation
This PR is being introduced as was described in #387 |
@gechiang , Can u review this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
This definitely needs to go in 202305 and should also probably go into 202205 as well (and the other branches moving forward though I assume that is expected). Fixes bug in both where stale thermal sensors will be left in temp db and reported on until device power cycle as well as undefined behavior on fixed platforms where dependent on the sensors, a device that is not on but present can still report sensor information which is not behavior we would expect. |
@Junchao-Mellanox , please help review this PR. Thanks! |
/azp run |
Commenter does not have sufficient privileges for PR 401 in repo sonic-net/sonic-platform-daemons |
If my understanding is correct, this PR will change the output of the "show platform psustatus" in some cases(when some PSU is not present), some sonic-mgmt test cases is relying on analyzing the CLI output. I would like suggest providing a test result on those cases against this PR, to make sure no case is broken. If some test case is broken, as a full solution, new PR shall also be raised to fix those test cases according to the new logic. Besides the potential impaction to the test cases, we should also update the CLI command reference: https://github.com/sonic-net/sonic-utilities/blob/master/doc/Command-Reference.md |
Same concern as Kebo mentioned. |
@keboliu @Junchao-Mellanox I have confirmed that this DOES pass |
@prgeor , when you have a chance, please help review this PR. |
MSFT ADO: 25872447 |
@prgeor |
@gregoryboudreau CLI/DB part I understand. Which functionality is broken? ... like what is broken that mandates this to be ported to other stable branches 202205? |
Because of the stale sensors, if a device is removed and tests are run without a power cycle, it can lead to the failure of mgmt tests such as platform_tests/test_thermal_state_db.py. It is also incorrect functionality for these stale sensors to remain in the first place (though not sure how pertinent that is to the branch discussion). These changes were begun due to the above described test failure being reported to us by MSFT on 202205 devices. If this is unable to get into 202205, is it still up for 202305? |
@prgeor Pending the decision for potential inclusion in the past release branches, can this be merged into master? |
@yxieca , @StormLiangMS , Please help review/approve this for 202205 and 202305. |
@gregoryboudreau please raise a separate PR under 202205 to resolve the conflict as this POR cannot be cherry-picked directly... |
@gregoryboudreau @gechiang could you pls have a test on top of 202305 with this PR? to avoid regression at this stage. |
@gechiang Please find a double commit PR for 202205 for these changes here: #409 @gechiang @StormLiangMS On 202305 based systems, tests are passing, specifically reran |
… thermalctld (#401) * add modular sensor logic even for fixed devices and presence/status checking for PSUs * test changes * fixing accidental removal of name * logic correction * isolating key error * psu runtime change * fixing whitespace addition * remove powergood check from thermalctld logic
@gregoryboudreau this PR already included in master Branch. Please remove Request for master label. |
This PR looks like it is already in 202311. Also the auto-reply seems to be calling the wrong branch name. |
@gregoryboudreau This PR is not possible to be picked cleanly into 202311 directly. Please create a PR on top of the 202311 branch. Thanks! |
@gechiang I might be missing something, what part of this PR isn't in 202311? https://github.com/sonic-net/sonic-platform-daemons/blob/202311/sonic-thermalctld/scripts/thermalctld#L569, it looks like its there? |
You are absolutely correct. So it looks like this PR got merged at the border line when the 202311 got branched out which picked it up already. Nice! One less thing to worry about. |
Add dynamic sensor logic for fixed and psu presence/state checking in thermalctld
Description
Extends the existing dynamic sensor logic of modular chassis to fixed and adds a check for PSU presence/status in thermalctld
Motivation and Context
In current form, if a PSU or other device with sensors is removed from the device or disabled during runtime on a fixed chassis, they will become stale and remain in the temperature sensor DB which should not be correct. This change brings the existing logic checks that modular chassis' have to fixed platform as well as checks for PSU presence/status to account for scenario where PSU may be disabled during runtime.
How Has This Been Tested?
On both fixed and modular platforms (Cisco 8800RP and Cisco 8101_32FH), I confirmed presence of all thermal sensors. Then disable the PSU and confirmed its associated sensors were removed from the DB while it was either inactive and present or not present. Reenabled PSU and confirmed its associated sensors were updated back into the DB.
On test_thermalctld.py added change in
test_update_module_thermals
to simulate scenario where PSU which had thermals added was disabled (status set to false in this case) at runtime but still present.Additional Information (Optional)