OA crash handling to reinitialize port through xcvrd #1432

Merged
merged 16 commits into from
Sep 27, 2024
Changes from 3 commits
247 changes: 238 additions & 9 deletions doc/sfp-cmis/Interface-Link-bring-up-sequence.md
@@ -17,6 +17,7 @@ Deterministic Approach for Interface Link bring-up sequence
* [Pre-requisite](#pre-requisite)
* [Breakout handling](#breakout-handling)
* [Proposed Work-Flows](#proposed-work-flows)
* [Port reinitialization during syncd/swss/orchagent crash](#port-reinitialization-during-syncdswssorchagent-crash)

# List of Tables
* [Table 1: Definitions](#table-1-definitions)
@@ -184,18 +185,246 @@ if transceiver is not present:
- All the workflows mentioned above will remain the same (or get exercised) until the host_tx_ready field update
- xcvrd will not perform any action on receiving host_tx_ready field update


# Port reinitialization during syncd/swss/orchagent crash
## Overview

When syncd/swss/orchagent crashes, all ports in the corresponding namespace will be reinitialized by xcvrd irrespective of the current state of the port.
If only xcvrd crashes and restarts, forced reinitialization of the port (CMIS reinit + media settings notify) will not be performed.
The following infrastructure ensures port reinitialization by xcvrd in case of a syncd/swss/orchagent crash:

1. XCVRD main thread init
- XCVRD main thread creates the key CMIS_REINIT_REQUIRED in PORT_TABLE:\<port\> (APPL_DB) with value as true for ports which do NOT have this key present
- XCVRD main thread creates the key MEDIA_SETTINGS_SYNC_STATUS in PORT_TABLE:\<port\> (APPL_DB) with value MEDIA_SETTINGS_DEFAULT for ports which do NOT have this key present.
- For transceivers which do not support media settings, MEDIA_SETTINGS_SYNC_STATUS will stay with value MEDIA_SETTINGS_DEFAULT

Suggestion: For transceivers which do not "require" media settings, since this not a module feature, it is a requirement on the NPU side.

Contributor Author

I have modified this now.


To know that, the XCVRD main thread might need to parse the "media_settings.json" file. Ports which have no match in this file don't require media settings. Is that the intention? I know that with recent changes, parsing/setting of media settings moved from the init code of the XCVRD main thread to the init of the SfpStateUpdate task. Maybe this setting should be done there, after parsing the .json file? Or can MEDIA_SETTINGS_DEFAULT be put for all ports regardless of whether media settings exist for them in the .json file? Then, if the SfpStateUpdate task finds data for a port in the .json file, it will set MEDIA_SETTINGS_NOTIFIED? What is the impact of having MEDIA_SETTINGS_DEFAULT for all ports?

Contributor Author

To know that, the XCVRD main thread might need to parse the "media_settings.json" file. Ports which have no match in this file don't require media settings. Is that the intention?
[MP] Yes - if the media settings of a transceiver are not part of "media_settings.json", we would still keep the port with the value MEDIA_SETTINGS_DEFAULT

I know that with recent changes, parsing/setting of media settings moved from the init code of the XCVRD main thread to the init of the SfpStateUpdate task. Maybe this setting should be done there, after parsing the .json file? Or can MEDIA_SETTINGS_DEFAULT be put for all ports regardless of whether media settings exist for them in the .json file?
[MP] I think it is better to initialize this key to MEDIA_SETTINGS_DEFAULT from the xcvrd main thread, since this key is used by both SfpStateUpdateTask and CmisManagerTask for handling media settings application for a port and these threads run in parallel.

Then, if the SfpStateUpdate task finds data for a port in the .json file, it will set MEDIA_SETTINGS_NOTIFIED? What is the impact of having MEDIA_SETTINGS_DEFAULT for all ports?
[MP] Yes, once SfpStateUpdateTask finds the media settings in the .json file, the value of MEDIA_SETTINGS_SYNC_STATUS changes to MEDIA_SETTINGS_NOTIFIED. I think there is no harm in setting MEDIA_SETTINGS_DEFAULT for all ports from the xcvrd main thread, irrespective of their media settings requirement. Also, the advantage of this approach is that we do not introduce any race conditions between the OA, SfpStateUpdateTask and CmisManagerTask threads related to the role or sequence in modifying the value of MEDIA_SETTINGS_SYNC_STATUS.

The following table describes the values of MEDIA_SETTINGS_SYNC_STATUS:

| Value | Modifier thread and event | Consumer thread and purpose |
|:-----------------------:|:------------------------------------------------------:|:--------------------------------------------------------------------------------------------:|
| MEDIA_SETTINGS_DEFAULT | XCVRD main thread during cold start of xcvrd | XCVRD main thread during boot-up for deciding to notify media settings |
Contributor

In this row, I believe consumer thread captures both uses cases - boot-up and transceiver insertion.
In that case, please add 'transceiver insertion' here too

Contributor Author

Should we add a check in OA as well to perform port toggle and set MEDIA_SETTINGS_DONE only if MEDIA_SETTINGS_SYNC_STATUS == MEDIA_SETTINGS_NOTIFIED?

Contributor Author

As we discussed, SfpStateUpdateTask will apply media settings upon transceiver insertion irrespective of the value of MEDIA_SETTINGS_SYNC_STATUS (unlike during boot-up, where SfpStateUpdateTask checks the value).

| | SfpStateUpdateTask during transceiver removal | |
Contributor

@shyam77git, Jul 28, 2023

Please correct indentation (unnecessary gaps between words)
Also bullet the two threads for readability, as:

  • XCVRD main thread....
  • SfpStateUpdateTask thread...

Contributor Author

I have addressed this now.

| MEDIA_SETTINGS_NOTIFIED | SfpStateUpdateTask while updating the media settings | Not being used currently |
Contributor

what is this state used for? no consumer?

Contributor Author

This state is currently unused. We can use this for debugging purpose for now to check if XCVRD has applied and notified the media settings to OA

| MEDIA_SETTINGS_DONE | Orchagent after applying the SI settings | CmisManagerTask for proceeding to CMIS_STATE_DP_DEINIT from CMIS_STATE_MEDIA_SETTINGS_WAIT |

2. SfpStateUpdateTask thread will notify the media settings to OA based on the value of PORT_TABLE:\<port\>.MEDIA_SETTINGS_SYNC_STATUS
If PORT_TABLE:\<port\>.MEDIA_SETTINGS_SYNC_STATUS != MEDIA_SETTINGS_DONE, media settings notification will be invoked and MEDIA_SETTINGS_SYNC_STATUS will be set to MEDIA_SETTINGS_NOTIFIED for a port supporting media settings.

Should we also check for MEDIA_SETTINGS_NOTIFIED? Could there be a scenario where SfpStateUpdateTask tries to renotify media settings before MEDIA_SETTINGS_DONE?

Contributor Author

SfpStateUpdateTask could try to renotify the media settings before MEDIA_SETTINGS_DONE if xcvrd crashes just after notifying the media settings (in this case, MEDIA_SETTINGS_SYNC_STATUS would be MEDIA_SETTINGS_NOTIFIED).
In such a case, OA would apply the media settings but CmisManagerTask would not perform CMIS initialization (along with handling the port toggle message from OA) since CmisManagerTask is already dead.
Eventually, once xcvrd respawns, SfpStateUpdateTask would renotify the media settings to OA, followed by CmisManagerTask proceeding with CMIS initialization.
Hence, to allow the port toggle message to be sent from OA to CmisManagerTask and to allow CMIS initialization in the above xcvrd crash scenario, I am currently not checking for MEDIA_SETTINGS_NOTIFIED. Do you think there could be an issue if we apply the media settings twice?

Contributor

In the xcvrd restart scenario, re-applying or re-notifying media settings might have an impact on the system, since OA is just a conduit: re-applying would lead to re-notifying media_settings through OA->syncd->SAI->SDK, which in turn toggles the port.
We should avoid port toggling here.

3. The OA upon receiving media settings will
- Disable port admin status
- Apply SI settings
- PORT_TABLE:\<port\>.MEDIA_SETTINGS_SYNC_STATUS = MEDIA_SETTINGS_DONE
Contributor

[1] Please elaborate "Apply SI settings" along the lines of "Ask SAI-SDK to apply SI settings"
[2] Add a next bullet - Based on the update/response being SUCCESS from the SAI call,
set PORT_TABLE:<port>.MEDIA_SETTINGS_SYNC_STATUS = MEDIA_SETTINGS_DONE
Enable port admin status

In case of a failure update/response from the SAI call, what would be the system flow? How would the failure be handled?
I believe the host_tx_ready setting is part of this workflow. Please add it to the workflow here

Contributor Author

I have addressed this now.

4. In the CMIS_STATE_INSERTED state, if 'admin_status' is up and 'host_tx_ready' is true, CmisManagerTask thread will check if
- the port supports media settings (will be checked using g_dict and finding valid SI values) and
- MEDIA_SETTINGS_SYNC_STATUS != MEDIA_SETTINGS_DONE
If all the above conditions are true, CMIS SM transitions to CMIS_STATE_MEDIA_SETTINGS_WAIT state.
If port doesn't require media settings to be applied, CMIS SM will proceed with normal code flow (transitions to CMIS_STATE_DP_DEINIT)
Overall, no functionality change related to CMIS SM transitions is intended for ports not supporting media settings
Contributor

"If port doesn't require media settings to be applied,..."
This should be checked as part of previous point/bullet (i.e. the port supports media settings...)
bail out from there itself (i.e. no further action) in case port doesn't require media settings

5. The CMIS_STATE_MEDIA_SETTINGS_WAIT state will wait for MEDIA_SETTINGS_DONE; once MEDIA_SETTINGS_SYNC_STATUS reaches MEDIA_SETTINGS_DONE, the CMIS SM will transition to CMIS_STATE_DP_DEINIT.
There will be a timeout of 5s for every retry
6. The CmisManagerTask thread will set "CMIS_REINIT_REQUIRED" to false after the CMIS SM reaches a steady state (CMIS_STATE_UNKNOWN, CMIS_STATE_FAILED, CMIS_STATE_READY or CMIS_STATE_REMOVED) for the corresponding port
7. XCVRD will subscribe to PORT_TABLE in APPL_DB and trigger self-restart if the PORT_TABLE is deleted for the namespace.
All threads will be gracefully terminated and xcvrd deinit will be performed, followed by issuing a SIGABRT to ensure XCVRD is restarted automatically by supervisord. After respawn, CMIS re-init and media settings notification are triggered for the ports belonging to the affected namespace
8. syncd/swss/orchagent restart clears the entire APPL_DB, including "MEDIA_SETTINGS_SYNC_STATUS" and "CMIS_REINIT_REQUIRED" in PORT_TABLE
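
Steps 6 and 7 above can be sketched as two small predicates. This is only an illustration under stated assumptions: `port_table` is a plain dict standing in for APPL_DB PORT_TABLE, and the helper names `clear_reinit_flag_if_steady` and `needs_self_restart` are hypothetical and not taken from xcvrd.

```python
CMIS_REINIT_REQUIRED = "CMIS_REINIT_REQUIRED"

# Steady states listed in step 6.
STEADY_STATES = frozenset({
    "CMIS_STATE_UNKNOWN",
    "CMIS_STATE_FAILED",
    "CMIS_STATE_READY",
    "CMIS_STATE_REMOVED",
})


def clear_reinit_flag_if_steady(port_table, lport, cmis_state):
    """Set CMIS_REINIT_REQUIRED to false once the CMIS SM is in a steady state.

    port_table is a dict standing in for APPL_DB PORT_TABLE (assumption).
    Returns True when the flag was cleared.
    """
    if cmis_state not in STEADY_STATES:
        return False
    port_table[lport][CMIS_REINIT_REQUIRED] = "false"
    return True


def needs_self_restart(table_name, op):
    """Step 7: a DEL on PORT_TABLE means APPL_DB was cleared under xcvrd."""
    return table_name == "PORT_TABLE" and op == "DEL"


table = {"Ethernet0": {CMIS_REINIT_REQUIRED: "true"}}
cleared_while_busy = clear_reinit_flag_if_steady(table, "Ethernet0", "CMIS_STATE_DP_INIT")
flag_while_busy = table["Ethernet0"][CMIS_REINIT_REQUIRED]
cleared_when_ready = clear_reinit_flag_if_steady(table, "Ethernet0", "CMIS_STATE_READY")
```

The flag is only ever moved from true to false here; recreating it with the value true is left to the init path after APPL_DB has been cleared.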

## XCVRD init sequence to support port reinitialization during syncd/swss/orchagent crash

```mermaid
sequenceDiagram
participant APPL_DB
participant XCVRDMT as XCVRD main thread
participant CmisManagerTask
participant SfpStateUpdateTask
participant DomInfoUpdateTask

Note over XCVRDMT: Load new platform specific api class,<br> sfputil class and load namespace details
XCVRDMT ->> XCVRDMT: Wait for port config completion
loop lport in logical_port_list
alt if CMIS_REINIT_REQUIRED not in PORT_TABLE:<lport>
XCVRDMT ->> APPL_DB: PORT_TABLE:<lport>.CMIS_REINIT_REQUIRED = true
end
alt if MEDIA_SETTINGS_SYNC_STATUS not in PORT_TABLE:<lport>
XCVRDMT ->> APPL_DB: PORT_TABLE:<lport>.MEDIA_SETTINGS_SYNC_STATUS = MEDIA_SETTINGS_DEFAULT
end
end
Note over APPL_DB: PORT_TABLE:<lport><br>CMIS_REINIT_REQUIRED : true/false<br>MEDIA_SETTINGS_SYNC_STATUS : MEDIA_SETTINGS_DEFAULT/NOTIFIED/DONE
XCVRDMT ->> CmisManagerTask: Spawns
XCVRDMT ->> DomInfoUpdateTask: Spawns
XCVRDMT ->> SfpStateUpdateTask: Spawns
par XCVRDMT, CmisManagerTask, SfpStateUpdateTask, DomInfoUpdateTask
loop Wait for stop_event else poll every 60s
DomInfoUpdateTask->>DomInfoUpdateTask: Update TRANSCEIVER_DOM_SENSOR,<br>TRANSCEIVER_STATUS (HW section)<br>TRANSCEIVER_PM tables
end
loop Wait for stop_event
XCVRDMT->>XCVRDMT: Check for changes in PORT_TABLE and act upon receiving DEL event
end
Note over CmisManagerTask: Subscribe to CONFIG_DB:PORT,<br>STATE_DB:TRANSCEIVER_INFO and STATE_DB:PORT_TABLE
loop Wait for stop_event
Note over CmisManagerTask: Start the CMIS SM and act based on subscribed DB related changes
end
Note over SfpStateUpdateTask: _post_port_sfp_info_and_dom_thr_to_db_once<br>_init_port_sfp_status_tbl<br>Subscribe to CONFIG_DB:PORT
loop Wait for stop_event
SfpStateUpdateTask ->> SfpStateUpdateTask: Handle config change event<br>retry_eeprom_reading()<br>_wrapper_get_transceiver_change_event
end
end
```
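
The "create only if absent" loop in the diagram above is what distinguishes an xcvrd-only restart from an APPL_DB clear. A minimal sketch, assuming a plain dict in place of APPL_DB PORT_TABLE and a hypothetical helper name `init_port_reinit_keys`:

```python
CMIS_REINIT_REQUIRED = "CMIS_REINIT_REQUIRED"
MEDIA_SETTINGS_SYNC_STATUS = "MEDIA_SETTINGS_SYNC_STATUS"
MEDIA_SETTINGS_DEFAULT = "MEDIA_SETTINGS_DEFAULT"


def init_port_reinit_keys(port_table, logical_ports):
    """Create the two keys only for ports that do not already have them.

    port_table is a dict {port: {field: value}} standing in for APPL_DB
    PORT_TABLE (assumption). Existing values are left untouched, so an
    xcvrd-only restart does not force a reinitialization.
    """
    for lport in logical_ports:
        fields = port_table.setdefault(lport, {})
        fields.setdefault(CMIS_REINIT_REQUIRED, "true")
        fields.setdefault(MEDIA_SETTINGS_SYNC_STATUS, MEDIA_SETTINGS_DEFAULT)
    return port_table


# xcvrd-only restart: the keys survived in APPL_DB and are not overwritten.
survived = {"Ethernet0": {CMIS_REINIT_REQUIRED: "false",
                          MEDIA_SETTINGS_SYNC_STATUS: "MEDIA_SETTINGS_DONE"}}
after_xcvrd_restart = init_port_reinit_keys(survived, ["Ethernet0"])

# swss/syncd restart: APPL_DB was cleared, so the defaults are recreated.
after_swss_restart = init_port_reinit_keys({}, ["Ethernet0"])
```

In the second case both keys come back with their initial values, which is what later drives the CMIS reinit and the media settings renotification.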

## SfpStateUpdateTask's role to notify media settings to OA

```mermaid
sequenceDiagram
participant OA
participant APPL_DB
participant SfpStateUpdateTask

Note over SfpStateUpdateTask: Subscribe to CONFIG_DB:PORT,<br>STATE_DB:TRANSCEIVER_INFO and STATE_DB:PORT_TABLE
Note over SfpStateUpdateTask: Following loop represents _post_port_sfp_info_and_dom_thr_to_db_once
loop lport in logical_port_list
alt post_port_sfp_info_to_db != SFP_EEPROM_NOT_READY
Note over SfpStateUpdateTask: post_port_dom_threshold_info_to_db
opt PORT_TABLE:<lport>.MEDIA_SETTINGS_SYNC_STATUS != MEDIA_SETTINGS_DONE
opt if lport supports media settings
SfpStateUpdateTask ->> APPL_DB: PORT_TABLE:<lport>.MEDIA_SETTINGS_SYNC_STATUS = MEDIA_SETTINGS_NOTIFIED
APPL_DB -->> OA: Notify media settings for ports
Note over OA: Disable admin status<br>setPortSerdesAttribute
OA ->> APPL_DB: PORT_TABLE:<lport>.MEDIA_SETTINGS_SYNC_STATUS = MEDIA_SETTINGS_DONE
Note over OA: initHostTxReadyState
end
end
else
Note over SfpStateUpdateTask: retry_eeprom_set.add(lport)
end
end
Note over SfpStateUpdateTask: _init_port_sfp_status_tbl<br>Subscribe to CONFIG_DB
loop Wait for stop_event
SfpStateUpdateTask ->> SfpStateUpdateTask: Handle config change event<br>retry_eeprom_reading()<br>_wrapper_get_transceiver_change_event
end
```
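
The two nested `opt` conditions in the diagram above reduce to a short boot-time decision. The sketch below assumes a dict in place of APPL_DB and passes "port supports media settings" in as a boolean; the function name `notify_media_settings_at_boot` is hypothetical.

```python
MEDIA_SETTINGS_SYNC_STATUS = "MEDIA_SETTINGS_SYNC_STATUS"
MEDIA_SETTINGS_DEFAULT = "MEDIA_SETTINGS_DEFAULT"
MEDIA_SETTINGS_NOTIFIED = "MEDIA_SETTINGS_NOTIFIED"
MEDIA_SETTINGS_DONE = "MEDIA_SETTINGS_DONE"


def notify_media_settings_at_boot(port_table, lport, supports_media_settings):
    """Return True if media settings were (re)notified for lport.

    port_table is a dict standing in for APPL_DB PORT_TABLE (assumption).
    """
    fields = port_table[lport]
    if fields.get(MEDIA_SETTINGS_SYNC_STATUS) == MEDIA_SETTINGS_DONE:
        return False  # OA already applied the settings; skip the renotify
    if not supports_media_settings:
        return False  # status stays at its current value
    fields[MEDIA_SETTINGS_SYNC_STATUS] = MEDIA_SETTINGS_NOTIFIED
    return True


table = {
    "Ethernet0": {MEDIA_SETTINGS_SYNC_STATUS: MEDIA_SETTINGS_DEFAULT},
    "Ethernet4": {MEDIA_SETTINGS_SYNC_STATUS: MEDIA_SETTINGS_DONE},
    "Ethernet8": {MEDIA_SETTINGS_SYNC_STATUS: MEDIA_SETTINGS_DEFAULT},
}
notified_0 = notify_media_settings_at_boot(table, "Ethernet0", True)
notified_4 = notify_media_settings_at_boot(table, "Ethernet4", True)
notified_8 = notify_media_settings_at_boot(table, "Ethernet8", False)
```

Note that a port left at MEDIA_SETTINGS_NOTIFIED (for example after an xcvrd crash right after notifying) is renotified by this check, which matches the behavior discussed in the review thread on step 2.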

## CMIS State machine with CMIS_STATE_MEDIA_SETTINGS_WAIT state

The below state machine is a high level flow and doesn't capture details for states other than CMIS_STATE_MEDIA_SETTINGS_WAIT

```mermaid
stateDiagram
[*] --> CMIS_STATE_INSERTED
state if_state <<choice>>
state if_state2 <<choice>>
CMIS_STATE_INSERTED --> if_state
if_state --> CMIS_STATE_READY : if host_tx_ready != True or<br>admin_status != up<br> Action - disable TX
if_state --> if_state2 : if host_tx_ready == True and<br>admin_status == up
if_state2 --> CMIS_STATE_DP_DEINIT : if PORT_TABLE.port.CMIS_REINIT_REQUIRED == true or<br>is_cmis_application_update_required
if_state2 --> CMIS_STATE_MEDIA_SETTINGS_WAIT : if is_media_settings_supported and<br>MEDIA_SETTINGS_SYNC_STATUS != MEDIA_SETTINGS_DONE
note left of CMIS_STATE_READY : PORT_TABLE.port.CMIS_REINIT_REQUIRED = false
if_state2 --> CMIS_STATE_FAILED : if appl < 1 or <br>host_lanes_mask <= 0 or <br>media_lanes_mask <= 0
note left of CMIS_STATE_FAILED : PORT_TABLE.port.CMIS_REINIT_REQUIRED = false

CMIS_STATE_MEDIA_SETTINGS_WAIT --> CMIS_STATE_DP_DEINIT : if PORT_TABLE&ltport&gt.MEDIA_SETTINGS_SYNC_STATUS == MEDIA_SETTINGS_DONE
CMIS_STATE_MEDIA_SETTINGS_WAIT --> CMIS_STATE_INSERTED : Through force_cmis_reinit upon reaching timeout
note right of CMIS_STATE_MEDIA_SETTINGS_WAIT
Checks if PORT_TABLE&ltport&gt.MEDIA_SETTINGS_SYNC_STATUS == MEDIA_SETTINGS_DONE
After 5s timeout, force_cmis_reinit will be called
end note

CMIS_STATE_DP_DEINIT --> CMIS_STATE_AP_CONF
CMIS_STATE_AP_CONF --> CMIS_STATE_DP_INIT
CMIS_STATE_DP_INIT --> CMIS_STATE_DP_TXON
CMIS_STATE_DP_TXON --> CMIS_STATE_DP_ACTIVATE
CMIS_STATE_DP_ACTIVATE --> CMIS_STATE_READY
```
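
The transitions around CMIS_STATE_MEDIA_SETTINGS_WAIT can be written as two pure functions. This is a simplified sketch that follows steps 4 and 5 of the overview: it deliberately omits the CMIS_STATE_FAILED checks and the application-update check shown in the diagram, and the function names and the `waited_s` parameter are assumptions for illustration.

```python
MEDIA_SETTINGS_DONE = "MEDIA_SETTINGS_DONE"

CMIS_STATE_INSERTED = "CMIS_STATE_INSERTED"
CMIS_STATE_READY = "CMIS_STATE_READY"
CMIS_STATE_DP_DEINIT = "CMIS_STATE_DP_DEINIT"
CMIS_STATE_MEDIA_SETTINGS_WAIT = "CMIS_STATE_MEDIA_SETTINGS_WAIT"


def next_state_from_inserted(admin_up, host_tx_ready,
                             media_settings_supported, sync_status):
    """Simplified step 4: decide where to go from CMIS_STATE_INSERTED."""
    if not (admin_up and host_tx_ready):
        return CMIS_STATE_READY  # diagram: TX is disabled in this branch
    if media_settings_supported and sync_status != MEDIA_SETTINGS_DONE:
        return CMIS_STATE_MEDIA_SETTINGS_WAIT
    return CMIS_STATE_DP_DEINIT


def next_state_from_media_wait(sync_status, waited_s, timeout_s=5.0):
    """Step 5: leave the wait state on MEDIA_SETTINGS_DONE or on timeout."""
    if sync_status == MEDIA_SETTINGS_DONE:
        return CMIS_STATE_DP_DEINIT
    if waited_s >= timeout_s:
        return CMIS_STATE_INSERTED  # force_cmis_reinit restarts the SM
    return CMIS_STATE_MEDIA_SETTINGS_WAIT


s1 = next_state_from_inserted(True, True, True, "MEDIA_SETTINGS_NOTIFIED")
s2 = next_state_from_inserted(True, True, False, "MEDIA_SETTINGS_DEFAULT")
s3 = next_state_from_inserted(False, True, True, "MEDIA_SETTINGS_DEFAULT")
s4 = next_state_from_media_wait("MEDIA_SETTINGS_NOTIFIED", waited_s=1.0)
s5 = next_state_from_media_wait("MEDIA_SETTINGS_NOTIFIED", waited_s=5.0)
s6 = next_state_from_media_wait(MEDIA_SETTINGS_DONE, waited_s=2.0)
```

Keeping the transition logic free of DB and clock access, as in this sketch, makes the 5s timeout path easy to exercise without waiting in a test.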

Probably not related to this specific design but what is the reason for setting CMIS_STATE_READY when either "host_tx_ready" is not yet TRUE or admin_status is not UP ? Why do we skip to the final state of the CMIS state machine in this use-case ?

Contributor Author

When this was initially implemented, the thought at that time was to move the CMIS SM to a fixed state rather than waiting for an event.
However, if you think we should create a new state to handle host_tx_ready and admin_status, rather than moving to CMIS_STATE_READY when either field has the value false/down respectively, I can plan to implement it accordingly. Let me know your opinion on this.

## Transceiver OIR handling

```mermaid
sequenceDiagram
participant STATE_DB
participant OA
participant APPL_DB
participant CmisManagerTask
participant SfpStateUpdateTask

SfpStateUpdateTask ->> SfpStateUpdateTask : event = SFP_STATUS_REMOVED
SfpStateUpdateTask -x STATE_DB : Delete TRANSCEIVER_INFO table for the port
par CmisManagerTask, SfpStateUpdateTask
CmisManagerTask ->> CmisManagerTask : Transition CMIS SM to CMIS_STATE_REMOVED

%% Review comment: Should we set CMIS_REINIT_REQUIRED to True upon module removal?
%% Reply (Contributor Author): I am not currently changing the value of CMIS_REINIT_REQUIRED after setting it to False, since the purpose of this flag is to drive CMIS re-initialization and not CMIS initialization.
%% The existing flow in the CMIS SM already ensures that CMIS initialization will be performed after module insertion.
%% Let me know your thoughts on this.
SfpStateUpdateTask ->> APPL_DB : PORT_TABLE:<lport>.MEDIA_SETTINGS_SYNC_STATUS = <br> MEDIA_SETTINGS_DEFAULT
end

SfpStateUpdateTask ->> SfpStateUpdateTask : event = SFP_STATUS_INSERTED
SfpStateUpdateTask ->> STATE_DB : Create TRANSCEIVER_INFO table for the port
par CmisManagerTask, SfpStateUpdateTask
CmisManagerTask ->> CmisManagerTask : Transition CMIS SM to CMIS_STATE_INSERTED
SfpStateUpdateTask ->> APPL_DB: PORT_TABLE:<lport>.MEDIA_SETTINGS_SYNC_STATUS = <br> MEDIA_SETTINGS_NOTIFIED
activate OA
APPL_DB -->> OA: Notify media settings for ports
Note over OA: Disable admin status<br>setPortSerdesAttribute
OA ->> APPL_DB: PORT_TABLE:<lport>.MEDIA_SETTINGS_SYNC_STATUS = MEDIA_SETTINGS_DONE
Note over OA: initHostTxReadyState
deactivate OA
end
```
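
The effect of removal and insertion on MEDIA_SETTINGS_SYNC_STATUS in the diagram above can be sketched as follows. The event constants reuse the names from the diagram with placeholder string values, `port_table` is a dict standing in for APPL_DB, and `handle_oir_event` is a hypothetical helper; the OA side (applying the settings and writing MEDIA_SETTINGS_DONE) is not modeled.

```python
MEDIA_SETTINGS_SYNC_STATUS = "MEDIA_SETTINGS_SYNC_STATUS"
MEDIA_SETTINGS_DEFAULT = "MEDIA_SETTINGS_DEFAULT"
MEDIA_SETTINGS_NOTIFIED = "MEDIA_SETTINGS_NOTIFIED"
MEDIA_SETTINGS_DONE = "MEDIA_SETTINGS_DONE"

# Placeholder values; only the names come from the diagram.
SFP_STATUS_INSERTED = "inserted"
SFP_STATUS_REMOVED = "removed"


def handle_oir_event(port_table, lport, event, supports_media_settings=True):
    """Update MEDIA_SETTINGS_SYNC_STATUS for a removal or insertion event.

    On insertion the current value is not consulted, unlike the boot-up path.
    Returns the resulting status (None if the key was never created).
    """
    fields = port_table.setdefault(lport, {})
    if event == SFP_STATUS_REMOVED:
        fields[MEDIA_SETTINGS_SYNC_STATUS] = MEDIA_SETTINGS_DEFAULT
    elif event == SFP_STATUS_INSERTED and supports_media_settings:
        fields[MEDIA_SETTINGS_SYNC_STATUS] = MEDIA_SETTINGS_NOTIFIED
    return fields.get(MEDIA_SETTINGS_SYNC_STATUS)


table = {"Ethernet0": {MEDIA_SETTINGS_SYNC_STATUS: MEDIA_SETTINGS_DONE}}
after_removal = handle_oir_event(table, "Ethernet0", SFP_STATUS_REMOVED)
after_insertion = handle_oir_event(table, "Ethernet0", SFP_STATUS_INSERTED)

# Insertion renotifies even when the stored status is already DONE.
stale = {"Ethernet4": {MEDIA_SETTINGS_SYNC_STATUS: MEDIA_SETTINGS_DONE}}
after_stale_insertion = handle_oir_event(stale, "Ethernet4", SFP_STATUS_INSERTED)
```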

## XCVRD termination during syncd/swss/orchagent crash
Contributor

What if an SFF-compliant transceiver requires a different SI value to be programmed to the NPU? E.g., a copper module with a different length without ANLT may need a different SI on the NPU serdes.
This approach is tied to the CMIS manager/CMIS-compliant modules only.

Contributor Author

The below will take care of handling SFF compliant transceivers requiring different NPU SI settings. Let me know if I am missing something here.

"For non-CMIS SM driven transceivers, SfpStateUpdateTask thread will update NPU SI settings in the PORT_TABLE (APPL_DB) and notify to OA based on the value of PORT_TABLE:.NPU_SI_SETTINGS_SYNC_STATUS"


The below sequence diagram captures the termination of XCVRD during syncd/swss/orchagent crash.
<br> supervisord will respawn XCVRD after termination as xcvrd is killed using SIGABRT signal

```mermaid
sequenceDiagram
participant OA
participant APPL_DB
participant XCVRDMT as XCVRD main thread
participant CmisManagerTask
participant DomInfoUpdateTask
participant SfpStateUpdateTask

activate OA
activate XCVRDMT
activate CmisManagerTask
activate DomInfoUpdateTask
activate SfpStateUpdateTask
OA -x OA: Crashes while handling a routine
deactivate OA
OA ->> APPL_DB : DEL PORT_TABLE

XCVRDMT -x APPL_DB : XCVRD main thread processes DEL event of APPL_DB PORT_TABLE
Note over XCVRDMT: generate_sigabrt = True
alt If threads > 0 are dead
XCVRDMT -x XCVRDMT : Kill XCVRD with SIGKILL
end
XCVRDMT -x CmisManagerTask : Stop CmisManagerTask
deactivate CmisManagerTask
XCVRDMT -x DomInfoUpdateTask : Stop DomInfoUpdateTask
deactivate DomInfoUpdateTask
XCVRDMT -x SfpStateUpdateTask : Stop SfpStateUpdateTask
deactivate SfpStateUpdateTask
Note over XCVRDMT : deinit()
alt self.sfp_error_event.is_set()
XCVRDMT -x XCVRDMT : sys.exit(SFP_SYSTEM_ERROR)
else if generate_sigabrt is True
XCVRDMT -x XCVRDMT : Kill XCVRD with SIGABRT

else
XCVRDMT -x XCVRDMT : Graceful exit
end
deactivate XCVRDMT
```
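
The exit decision at the end of the diagram above has a fixed priority order, which the sketch below makes explicit. It returns a label instead of actually terminating the process so that it can be checked in isolation; the function name `choose_exit_action` and the label strings are assumptions. In a real daemon the SIGABRT branch could be implemented with `os.kill(os.getpid(), signal.SIGABRT)`.

```python
def choose_exit_action(dead_thread_count, sfp_error_event_set, generate_sigabrt):
    """Mirror the branch order of the termination diagram.

    Returns a label describing the action instead of performing it.
    """
    if dead_thread_count > 0:
        return "SIGKILL"  # one or more worker threads already died
    if sfp_error_event_set:
        return "sys.exit(SFP_SYSTEM_ERROR)"
    if generate_sigabrt:
        return "SIGABRT"  # supervisord respawns xcvrd after this
    return "graceful exit"


# PORT_TABLE DEL was seen and all threads stopped cleanly.
on_appl_db_clear = choose_exit_action(0, False, True)
# Ordinary stop request with no DEL event.
on_normal_stop = choose_exit_action(0, False, False)
# A dead thread takes priority over everything else.
on_dead_thread = choose_exit_action(1, False, True)
```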

## Test plan and expectation
| Event | APPL_DB cleared | Xcvrd restarted | Media renotify | MEDIA_SETTINGS_SYNC_STATUS value on xcvrd boot-up for initialized transceiver | CMIS re-init triggered | Link flap |
Contributor

Change Media renotify to Media settings renotify

Contributor Author

I have addressed this now

|:----------------:|:---------------:|:---------------:|:--------------:|:-----------------------------------------------------------------------------:|:----------------------:|:---------:|
| Xcvrd restart | N | Y | N | MEDIA_SETTINGS_DONE | N | N |
| Pmon restart | N | Y | N | MEDIA_SETTINGS_DONE | N | N |
| Swss restart | Y | Y | Y | MEDIA_SETTINGS_DEFAULT | Y | Y |
| Syncd restart | Y | Y | Y | MEDIA_SETTINGS_DEFAULT | Y | Y |
| config reload | Y | Y | Y | MEDIA_SETTINGS_DEFAULT | Y | Y |
| Cold reboot | Y | Y | Y | MEDIA_SETTINGS_DEFAULT | Y | Y |
Contributor

The config reload and cold reboot triggers/use-cases don't have a link flap (a link flap requires the link to come operationally up and then go down after bring-up).
Since all SW components went down along with the HW ones (NPU, PHY, optics etc.), this case should be marked as N/A under the link flap column

Contributor Author

I have addressed this now.

| Config shut | N | N | N | MEDIA_SETTINGS_DONE | N | Y |
| Config no shut | N | N | N | MEDIA_SETTINGS_DONE | N | Y |
| Warm reboot | N | Y | N | MEDIA_SETTINGS_DONE | N | N |
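
The rows of the table above, as shown in this revision, follow one rule: media renotify, CMIS re-init and a boot-up status of MEDIA_SETTINGS_DEFAULT all occur exactly when APPL_DB was cleared. The sketch below encodes the rows as data and checks that rule; the `Expectation` type and helper name are hypothetical, and the Link flap column is left out because it is still under review.

```python
from collections import namedtuple

Expectation = namedtuple(
    "Expectation",
    "appl_db_cleared xcvrd_restarted media_renotify status_on_boot cmis_reinit",
)

DEFAULT = "MEDIA_SETTINGS_DEFAULT"
DONE = "MEDIA_SETTINGS_DONE"

# Rows copied from the test plan table (Link flap column omitted).
TEST_PLAN = {
    "Xcvrd restart": Expectation(False, True, False, DONE, False),
    "Pmon restart": Expectation(False, True, False, DONE, False),
    "Swss restart": Expectation(True, True, True, DEFAULT, True),
    "Syncd restart": Expectation(True, True, True, DEFAULT, True),
    "config reload": Expectation(True, True, True, DEFAULT, True),
    "Cold reboot": Expectation(True, True, True, DEFAULT, True),
    "Config shut": Expectation(False, False, False, DONE, False),
    "Config no shut": Expectation(False, False, False, DONE, False),
    "Warm reboot": Expectation(False, True, False, DONE, False),
}


def derive_from_appl_db(appl_db_cleared):
    """Derive (media_renotify, status_on_boot, cmis_reinit) from one input."""
    status = DEFAULT if appl_db_cleared else DONE
    return (appl_db_cleared, status, appl_db_cleared)


table_is_consistent = all(
    derive_from_appl_db(e.appl_db_cleared)
    == (e.media_renotify, e.status_on_boot, e.cmis_reinit)
    for e in TEST_PLAN.values()
)
```

A data-driven form like this could also serve as the parameter list for automated tests, one case per event.
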
# Out of Scope
Following items are not in the scope of this document. They would be taken up separately
1. xcvrd restart
- If the xcvrd goes for restart, then all the DB events will be replayed.
Here the Datapath init/activate for CMIS compliant optical modules and the tx-disable register set (for SFF compliant optical modules) will be a no-op if the optics is already in that state
2. syncd/gbsyncd/swss docker container restart
- Cleanup scenario - Check if the host_tx_ready field in STATE-DB needs to be updated to “False” for any use-case, either in the going-down or coming-up path
- Discuss further on the possible use-cases
3. CMIS API feature is not part of this design and the APIs will be used in this design. For CMIS HLD, Please refer to:
1. CMIS API feature is not part of this design and the APIs will be used in this design. For CMIS HLD, Please refer to:
https://github.com/sonic-net/SONiC/blob/9d480087243fd1158e785e3c2f4d35b73c6d1317/doc/sfp-cmis/cmis-init.md
4. Error handling of SAI attributes
2. Error handling of SAI attributes
a) At present, if there is a set attribute failure, orchagent will exit.
Refer to the error handling API: https://github.com/sonic-net/sonic-swss/blob/master/orchagent/orch.cpp#L885
b) Error handling for SET_ADMIN_STATUS attribute will be added in future.