-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[thermalctld] Add check for psu presence to thermalctld when updating PSU thermals #387
Conversation
@@ -592,6 +592,8 @@ class TemperatureUpdater(logger.Logger): | |||
self._refresh_temperature_status(CHASSIS_INFO_KEY, thermal, index) | |||
|
|||
for psu_index, psu in enumerate(self.chassis.get_all_psus()): | |||
if not psu.get_presence(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure that we should check presence here. "N/A" is good enough to tell user that the PSU thermal is not available at the moment, no matter it is because of PSU absence or hardware error. Maybe they should fix their test expectation. But if you insist the change, please consider following items:
- Do we need check psu.get_power_good_status here? If PSU is present but not powered on, thermal sensor should not be available.
- Do we need remove previous DB entry here?
- For
show platform fan
andshow platform psu
, non presence PSU will be displayed, should we align it with this design? - For existing sonic-mgmt test, can we guarantee no test is affected?
- For other vendor, who may have private test based on existing design might run into issue
- Unit test is needed to cover the new change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cytsao1 @Junchao-Mellanox how about get_chassis.all_psus() return only those PSUs which are present? In that case, "N/A" would indicate some issue with physical present PSU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only returning present PSUs from get_all_psus
would cause issues with inconsistent indexing throughout the code; when enumerate
is used, numbering will always be sequential from 0
, and inserting a new PSU may cause numbering of existing PSUs to change.
@Junchao-Mellanox , I still don't get why you need to keep sensors info for PSU that are not present in redis-db? what is the purpose of it? It has absolutely no useful info so why are we keeping this? I think not including it is the correct way of handling this. |
It tells user that there are 2 PSU sensors, and 1 is not available, I don't see an issue here. To my understanding, this change will cause sonic-mgmt |
Is N/A in redis-db indicate Not Available Or it is bad sensor data? how can one tell? Also, for your example, there is only 1 PSU sensor. there is really no "2 PSU sensors" since one PSU is not even present?? How do you differentiate the case where the PSU is really present but not providing proper sensor data which resulted to N/A? |
We don't have mechanism to check bad sensor data. So, with or without this change, we cannot tell it. For PSU presence, you can check |
There will be a new PR raised to hopefully address this more completely. One issue I haven't seen mentioned here is that without some sort of presence check, sensors from PSUs that were previously present will not be removed from redis when the PSU is removed, and stale data from those sensors will remain in redis (this also affects snmp). |
@spilkey-cisco , |
New PR will be created, this PR will not be merged and can be closed. |
closing @spilkey-cisco @cytsao1 |
Description
issue #384
For any nonpresent PSU, thermalctld will add its sensors to the redis database, with all values N/A.
thermalctld should check for PSU presence before adding their sensors to the db.
Motivation and Context
The empty non-present sensor entries are not necessary to be tracked, and causes test case failures as values are expected to be retrieved from the sensors.
How Has This Been Tested?
With the update to thermalctld, ran relevant CLI and checked the db.
show plat inv
show plat temp (no empty entries)
db entries (no non present PSUs)
Additional Information (Optional)