-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes for port reinitialization in case of syncd/swss/oa crash and NPU SI settings update for CMIS transceivers #403
base: master
Are you sure you want to change the base?
Conversation
…NPU SI settings update for CMIS transceivers
@prgeor @shyam77git @jaganbal-a - It will be great if you can help in reviewing this. |
sonic-xcvrd/xcvrd/xcvrd.py
Outdated
@@ -48,6 +48,8 @@ | |||
|
|||
TRANSCEIVER_STATUS_TABLE_SW_FIELDS = ["status", "error"] | |||
|
|||
CMIS_MODULE_TYPES = ['QSFP-DD', 'QSFP_DD', 'OSFP', 'QSFP+C'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a new API to detect whether the cable is CMIS? otherwise, we have to update this list when new cables are supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed this now and added a API to identify CMIS based transceivers using "isinstance" API.
sonic-xcvrd/xcvrd/xcvrd.py
Outdated
@@ -631,7 +633,7 @@ def get_media_settings_value(physical_port, key): | |||
if len(default_dict) != 0: | |||
return default_dict | |||
else: | |||
helper_logger.log_error("Error: No values for physical port '{}'".format(physical_port)) | |||
helper_logger.log_info("Error: No values for physical port '{}'".format(physical_port)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not a Error level log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On SKUs which have media_settings.json file present, but have ports with transceivers not requiring NPU SI settings, we see this error.
Hence, to handle such scenario, I am planning to change this to info.
@tshalvi - It will be great if you can review this PR |
sonic-xcvrd/xcvrd/xcvrd.py
Outdated
""" | ||
Subscribes APPL_DB PORT_TABLE table events | ||
""" | ||
def subscribe_appl_port_table(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is an existing function wait_for_port_config_done(self, namespace)
that has the same logic, suggest reducing the duplicated code by calling this new function in that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now removed the duplicate code and xcvrd is now subscribing only once for APPL_DB updates
sonic-xcvrd/xcvrd/xcvrd.py
Outdated
@@ -2421,6 +2654,7 @@ def wait_for_port_config_done(self, namespace): | |||
(key, op, fvp) = port_tbl.pop() | |||
if key in ["PortConfigDone", "PortInitDone"]: | |||
break | |||
sel.removeSelectable(port_tbl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the newly added code it also subscribes to the same table, is possible to keep this subscription and reuse it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have addressed this now.
|
||
while not self.stop_event.is_set() and not generate_sigabrt: | ||
(state, _) = sel.select(port_mapping.SELECT_TIMEOUT_MSECS) | ||
if state == swsscommon.Select.TIMEOUT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the next condition if state != swsscommon.Select.OBJECT:
is enough and this one is not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have kept this check to avoid flooding of logs when select times out
@mihirpat1 can you check the build failure and coverage |
I have resolved it now. |
@mihirpat1 could you resolve the conflicts? |
…d updated branch to the latest
if not key: | ||
break | ||
self.log_info("STATE_DB subscriber: Received {} op for {} key".format(op, key)) | ||
if op == swsscommon.DEL_COMMAND and "Ethernet" in key: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if interface gets deleted in DPB case, can this get sigabrt unexpectedly triggered ?
@@ -650,6 +650,117 @@ def check_port_in_range(range_str, physical_port): | |||
return True | |||
return False | |||
|
|||
""" | |||
Checks if the module is CMIS SM driven |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what does SM
stand for here? Would it be better to use non-abbreviation?
else: | ||
if cmis_reinit_rqd_val == "true": | ||
state_port_table = self.xcvr_table_helper.get_state_port_tbl(asic_id) | ||
state_port_table.set(lport, [("CMIS_REINIT_REQUIRED", "false")]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if CMIS_REINIT_REQUIRED
is the best flag/naming, it seems to be ambiguous since re-init can also be required in cases such as error/timeout/host_tx_ready_change/etc.
Moreover, force_cmis_reinit()
can get triggered at any time if the listened DB contents get touched/SET
, please refer to below code snippet (though not sure if the code is doing the correct thing by expanding the coverage of re-init to so many cases):
sonic-platform-daemons/sonic-xcvrd/xcvrd/xcvrd.py
Lines 912 to 930 in 70e875e
if port_change_event.event_type == port_change_event.PORT_SET: | |
if pport >= 0: | |
self.port_dict[lport]['index'] = pport | |
if 'speed' in port_change_event.port_dict and port_change_event.port_dict['speed'] != 'N/A': | |
self.port_dict[lport]['speed'] = port_change_event.port_dict['speed'] | |
if 'lanes' in port_change_event.port_dict: | |
self.port_dict[lport]['lanes'] = port_change_event.port_dict['lanes'] | |
if 'host_tx_ready' in port_change_event.port_dict: | |
self.port_dict[lport]['host_tx_ready'] = port_change_event.port_dict['host_tx_ready'] | |
if 'admin_status' in port_change_event.port_dict: | |
self.port_dict[lport]['admin_status'] = port_change_event.port_dict['admin_status'] | |
if 'laser_freq' in port_change_event.port_dict: | |
self.port_dict[lport]['laser_freq'] = int(port_change_event.port_dict['laser_freq']) | |
if 'tx_power' in port_change_event.port_dict: | |
self.port_dict[lport]['tx_power'] = float(port_change_event.port_dict['tx_power']) | |
if 'subport' in port_change_event.port_dict: | |
self.port_dict[lport]['subport'] = int(port_change_event.port_dict['subport']) | |
self.force_cmis_reinit(lport, 0) |
There might be a better flag/naming. (maybe CMIS_INIT_DONE
/CMIS_DONE
/XXX_DONE
/etc is less ambiguous? considering DONE
can only be True
when state machine goes to certain final state. Here Done
flag equals to not CMIS_REINIT_REQUIRED
.)
Description
Code changes for implementing the HLD OA crash handling to reinitialize port through xcvrd
This is needed to ensure that XCVRD re-initializes all ports if syncd/swss/orchagent crashes.
Motivation and Context
In addition to the implementation mentioned in the HLD, following are the changes being done through this PR
1. Remove wait for port config completion from CMIS thread
2. Add print for module and DP states
3. Change "media settings" to NPU SI settings to clearly differentiate optics and NPU SI settings
How Has This Been Tested?
Summary
Process and device restart and interface config command handling testplan
Transceiver OIR testing
Redis-db snippet
Testing in progress on multi-asic switch
Additional Information (Optional)