Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes for port reinitialization in case of syncd/swss/oa crash and NPU SI settings update for CMIS transceivers #403

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

mihirpat1
Copy link
Contributor

@mihirpat1 mihirpat1 commented Nov 2, 2023

Description

Code changes for implementing the HLD OA crash handling to reinitialize port through xcvrd
This is needed to ensure that XCVRD re-initializes all ports if syncd/swss/orchagent crashes.

Motivation and Context

In addition to the implementation mentioned in the HLD, following are the changes being done through this PR
1. Remove wait for port config completion from CMIS thread
2. Add print for module and DP states
3. Change "media settings" to NPU SI settings to clearly differentiate optics and NPU SI settings

How Has This Been Tested?

Summary
Process and device restart and interface config command handling testplan

Event STATE_DB_<asic_n> cleared Xcvrd restarted NPU SI settings renotify CMIS re-init triggered Link flap
Xcvrd restart N Y N N N
Pmon restart N Y N N N
orchagent restart Y Y Y Y N/A
Swss restart Y Y Y Y N/A
Syncd restart Y Y Y Y N/A
config reload Y Y Y Y N/A
Cold reboot Y Y Y Y N/A
config interface shutdown N N N N N/A
config interface startup N N N N N/A

Transceiver OIR testing

Event STATE_DB_<asic_n> cleared Xcvrd restarted NPU SI settings notified NPU_SI_SETTINGS_SYNC_STATUS value upon event completion CMIS init triggered
Transceiver Removal N N Y NPU_SI_SETTINGS_DEFAULT N/A
Transceiver Insertion N N Y NPU_SI_SETTINGS_DONE Y

Redis-db snippet

#redis-cli -n 6 hgetall "PORT_TABLE|Ethernet8"
 1) "CMIS_REINIT_REQUIRED"
 2) "false"
 3) "NPU_SI_SETTINGS_SYNC_STATUS"
 4) "NPU_SI_SETTINGS_DEFAULT"
 5) "state"
 6) "ok"
 7) "netdev_oper_status"
 8) "down"
 9) "admin_status"
10) "up"
11) "mtu"
12) "9100"
13) "supported_speeds"
14) "40000,100000,200000,400000"
15) "supported_fecs"
16) "rs"
17) "host_tx_ready"
18) "true"
19) "speed"
20) "400000"

Testing in progress on multi-asic switch

Additional Information (Optional)

…NPU SI settings update for CMIS transceivers
@mihirpat1
Copy link
Contributor Author

@prgeor @shyam77git @jaganbal-a - It will be great if you can help in reviewing this.

@@ -48,6 +48,8 @@

TRANSCEIVER_STATUS_TABLE_SW_FIELDS = ["status", "error"]

CMIS_MODULE_TYPES = ['QSFP-DD', 'QSFP_DD', 'OSFP', 'QSFP+C']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a new API to detect whether the cable is CMIS? otherwise, we have to update this list when new cables are supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed this now and added a API to identify CMIS based transceivers using "isinstance" API.

@@ -631,7 +633,7 @@ def get_media_settings_value(physical_port, key):
if len(default_dict) != 0:
return default_dict
else:
helper_logger.log_error("Error: No values for physical port '{}'".format(physical_port))
helper_logger.log_info("Error: No values for physical port '{}'".format(physical_port))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not a Error level log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On SKUs which have media_settings.json file present, but have ports with transceivers not requiring NPU SI settings, we see this error.
Hence, to handle such scenario, I am planning to change this to info.

@mihirpat1
Copy link
Contributor Author

@tshalvi - It will be great if you can review this PR

"""
Subscribes APPL_DB PORT_TABLE table events
"""
def subscribe_appl_port_table(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is an existing function wait_for_port_config_done(self, namespace) that has the same logic, suggest reducing the duplicated code by calling this new function in that one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now removed the duplicate code and xcvrd is now subscribing only once for APPL_DB updates

@@ -2421,6 +2654,7 @@ def wait_for_port_config_done(self, namespace):
(key, op, fvp) = port_tbl.pop()
if key in ["PortConfigDone", "PortInitDone"]:
break
sel.removeSelectable(port_tbl)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the newly added code it also subscribes to the same table, is possible to keep this subscription and reuse it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this now.


while not self.stop_event.is_set() and not generate_sigabrt:
(state, _) = sel.select(port_mapping.SELECT_TIMEOUT_MSECS)
if state == swsscommon.Select.TIMEOUT:
Copy link
Collaborator

@keboliu keboliu Nov 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the next condition if state != swsscommon.Select.OBJECT: is enough and this one is not needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have kept this check to avoid flooding of logs when select times out

@prgeor
Copy link
Collaborator

prgeor commented Nov 15, 2023

@mihirpat1 can you check the build failure and coverage

@mihirpat1
Copy link
Contributor Author

@mihirpat1 can you check the build failure and coverage

I have resolved it now.

@kevinskwang
Copy link

@mihirpat1 could you resolve the conflicts?

if not key:
break
self.log_info("STATE_DB subscriber: Received {} op for {} key".format(op, key))
if op == swsscommon.DEL_COMMAND and "Ethernet" in key:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if interface gets deleted in DPB case, can this get sigabrt unexpectedly triggered ?

@@ -650,6 +650,117 @@ def check_port_in_range(range_str, physical_port):
return True
return False

"""
Checks if the module is CMIS SM driven
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what does SM stand for here? Would it be better to use non-abbreviation?

else:
if cmis_reinit_rqd_val == "true":
state_port_table = self.xcvr_table_helper.get_state_port_tbl(asic_id)
state_port_table.set(lport, [("CMIS_REINIT_REQUIRED", "false")])
Copy link
Contributor

@longhuan-cisco longhuan-cisco Jul 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if CMIS_REINIT_REQUIRED is the best flag/naming, it seems to be ambiguous since re-init can also be required in cases such as error/timeout/host_tx_ready_change/etc.
Moreover, force_cmis_reinit() can get triggered at any time if the listened DB contents get touched/SET , please refer to below code snippet (though not sure if the code is doing the correct thing by expanding the coverage of re-init to so many cases):

if port_change_event.event_type == port_change_event.PORT_SET:
if pport >= 0:
self.port_dict[lport]['index'] = pport
if 'speed' in port_change_event.port_dict and port_change_event.port_dict['speed'] != 'N/A':
self.port_dict[lport]['speed'] = port_change_event.port_dict['speed']
if 'lanes' in port_change_event.port_dict:
self.port_dict[lport]['lanes'] = port_change_event.port_dict['lanes']
if 'host_tx_ready' in port_change_event.port_dict:
self.port_dict[lport]['host_tx_ready'] = port_change_event.port_dict['host_tx_ready']
if 'admin_status' in port_change_event.port_dict:
self.port_dict[lport]['admin_status'] = port_change_event.port_dict['admin_status']
if 'laser_freq' in port_change_event.port_dict:
self.port_dict[lport]['laser_freq'] = int(port_change_event.port_dict['laser_freq'])
if 'tx_power' in port_change_event.port_dict:
self.port_dict[lport]['tx_power'] = float(port_change_event.port_dict['tx_power'])
if 'subport' in port_change_event.port_dict:
self.port_dict[lport]['subport'] = int(port_change_event.port_dict['subport'])
self.force_cmis_reinit(lport, 0)

There might be a better flag/naming. (maybe CMIS_INIT_DONE/CMIS_DONE/XXX_DONE/etc is less ambiguous? considering DONE can only be True when state machine goes to certain final state. Here Done flag equals to not CMIS_REINIT_REQUIRED.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants