SFU Refactoring #52

Merged
merged 62 commits into from
Dec 5, 2022
Conversation

daniel-abramov
Contributor

@daniel-abramov daniel-abramov commented Nov 23, 2022

This closes #39 and adds structured logging, but not the framework for #50.

The main idea of this refactoring was to provide a reasonably stable foundation on top of which we can build new features, with as much compile-time safety as possible and with some guarantees that reduce the risk of race conditions.

See #39 for a description of things that were improved and fixed.

Summary: [TODO, write some better explanation and highlights of this refactoring].

A rough diagram of how components are conceptually connected with each other:
[diagram: waterfall-refactored]

NB: This is still work-in-progress and must be tested before merging.

@daniel-abramov daniel-abramov requested a review from a team as a code owner November 23, 2022 17:07
Member

@dbkr dbkr left a comment


This is feeling like a nice refactor so far - using channels & message passing in favour of locks definitely feels like a step in the right direction and is good to get done while the project is still a fairly manageable size.
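The channels-and-message-passing style praised here can be sketched as a goroutine that exclusively owns its state and mutates it only in response to messages, so no mutex is needed. All names below (`join`, `startConference`) are illustrative, not taken from this codebase:

```go
package main

import "fmt"

// A command sent to the conference goroutine; illustrative only.
type join struct {
	participant string
	done        chan int // replies with the new participant count
}

// startConference owns its state exclusively; all mutation happens in
// this single goroutine, so no mutex is required.
func startConference() chan<- join {
	commands := make(chan join)
	go func() {
		participants := map[string]bool{}
		for cmd := range commands {
			participants[cmd.participant] = true
			cmd.done <- len(participants)
		}
	}()
	return commands
}

func main() {
	conf := startConference()
	done := make(chan int)
	conf <- join{participant: "@alice:example.org", done: done}
	fmt.Println(<-done) // 1
	conf <- join{participant: "@bob:example.org", done: done}
	fmt.Println(<-done) // 2
}
```

Because only the conference goroutine ever touches `participants`, there is no shared mutable state for two goroutines to race on.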

@daniel-abramov daniel-abramov force-pushed the refactoring branch 2 times, most recently from 4fd4a70 to 6b8eee8 Compare November 24, 2022 16:40
@daniel-abramov daniel-abramov changed the title [DRAFT] SFU Refactoring - Work In Progress SFU Refactoring - Work In Progress Nov 26, 2022
@daniel-abramov daniel-abramov force-pushed the refactoring branch 4 times, most recently from d845e7b to f358ee1 Compare November 28, 2022 21:08
pkg/conference/start.go (outdated)
pkg/conference/start.go (outdated)
pkg/signaling/matrix.go (outdated)
Contributor

@SimonBrandner SimonBrandner left a comment


This is largely looking good to me, though I sometimes struggle a bit not to get lost... Perhaps better file naming would help a bit; some of the file names aren't very descriptive from my POV.

I am also a little concerned about completely removing the subscriber/publisher model, which is very useful for simulcast and possibly other things; it's also very similar to what ion-sfu and livekit do, iirc

I would also like my PR to get merged before this and for us to somehow integrate it into this one

All of the TODOs are also a little concerning

pkg/conference/start.go (outdated)
pkg/conference/state.go (outdated)
pkg/conference/processor.go (outdated)
pkg/conference/processor.go (outdated)
pkg/conference/processor.go (outdated)
pkg/peer/webrtc.go
pkg/peer/webrtc.go
pkg/conference/matrix.go (outdated)
@daniel-abramov
Contributor Author

@SimonBrandner,

This is largely looking good to me, though I sometimes struggle a bit not to get lost... Perhaps better file naming would help a bit; some of the file names aren't very descriptive from my POV.

Is there something that could be improved? I want to understand what makes you feel lost. Or is it just because the PR is too large?

I'm definitely in favor of smaller PRs and I'm sorry that this PR is now that big: it's as big as it currently is since it's simply a complete rewrite of a previous implementation (I barely touched main and profiling things, but the rest was changed drastically).

Perhaps it would be easier if I published it as a new repository? (since 90% of things changed anyway) Then it would be easier to review it in an editor (or an IDE) as a new SFU rather than trying to comprehend how the new changes relate to the previous implementation (which I agree must be confusing).

What I normally do for large PRs is split them into smaller "atomic" commits, where each commit changes a single thing and explains why it was changed. That's the approach I'm in favor of and try to use in all projects (I've noticed that in Matrix and Element it's not always followed: many commits just alter the state of a previous one or combine unrelated things in one commit, something that I want to change!). I think that if I could present this not as one big change but as a series of well-separated commits, it would be much easier to understand. (I normally rebase commits before a push and make sure they are clear, but unfortunately this time I published earlier than I planned so that Dave could take a look at it earlier, as we had discussed.)

I am also a little concerned about completely removing the subscriber/publisher model which is very useful for simulcast and possibly other things; it's also very similar to what ion-sfu and livekit does, iirc

Good point, expected you to ask about it :)

So, the subscriber and publisher model does exist in the current implementation, but it does not have a separate module or file for it since for our current implementation it does not seem to be useful.

I.e. I can create a function in a Conference and name it getSubscribers() (it's easy to create one), but that would immediately pose questions like what exactly the subscriber structure should contain. Should it be a structure that contains the ID of a subscriber and the list of IDs of other participants to which the subscriber is subscribed? Or should it rather be the ID of a subscriber (the participant ID) and the list of tracks? Or should it be the list of tracks and the list of participants to which we're subscribed? Etc. Currently, these questions can't be answered simply because we don't really need an explicit entity such as Subscriber or Publisher.

We may indeed need it in the future (e.g. when implementing simulcast)! And once we need it, we'll definitely give it a good thought and introduce it! But adding these entities right now seemed like a "premature optimization" to me (since it would pose the questions listed above) that would complicate our current (pretty trivial and simple) use case.

As for the other SFUs, I'm only familiar with some internals of Janus and LiveKit. You mentioned LiveKit: yes, they have a concept of subscribing to and publishing a track (a concept we also have in our implementation 🙂), but like us, they don't have special files or packages such as subscriber.go or publisher.go, and there are no structures named Subscriber or Publisher in their implementation. They do have more concrete entities though, like SubscribedTrack or Downtrack.

So I would not say that we don't have a publisher-subscriber model. In fact, we do have it between peers and a conference (peers publish the messages to which the conference is subscribed). And we also know which tracks everyone is subscribed to and which tracks are being published by each participant (see participant.go). We just do it differently compared to the implementation in the main branch: I could not inherit the Publisher/Subscriber logic from main since it unfortunately suffers from race conditions in multiple places and the entities are not separated there, so it also has bugs like double-subscription (e.g. check Publisher::Subscribe() on main). So they had to be reworked anyway, unfortunately.

And I'm pretty sure that with time we'll also have more elaborate logic for this, with more structures that contain publisher- or subscriber-specific logic.

I would also like my PR to get merged before this and for us to somehow integrate it into this one

Would you like me to integrate the logic from the RTCP PR into this PR before merging it into main? Or would you prefer to add it to this PR?

I can do it, but I was wondering if you would like to do it yourself to get a feel of what the rewritten version looks like and how such things are implemented within the new architecture.

All of the TODOs are also a little concerning

Yep, but there are also TODOs and FIXMEs in main :) There is only one place (a FIXME) in this PR that mentions a feature that is implemented in main but not here. The rest existed in the previous implementation (it's just that some of them were not marked as concerns and did not have a dedicated TODO, like handling the data channel being busy or closing the SFU once Sync() ends). I decided to mark these places with TODOs (if not critical) and FIXMEs (if more critical and must be discussed) so that we don't forget about them.

@daniel-abramov daniel-abramov changed the title SFU Refactoring - Work In Progress SFU Refactoring Dec 1, 2022
@SimonBrandner
Contributor

@SimonBrandner,

This is largely looking good to me, though I sometimes struggle a bit not to get lost... Perhaps better file naming would help a bit; some of the file names aren't very descriptive from my POV.

Is there something that could be improved? I want to understand what makes you feel lost. Or is it just because the PR is too large?

The PR being large is definitely a part of it, but I feel like the names don't tell me much in some cases...

I'm definitely in favor of smaller PRs and I'm sorry that this PR is now that big: it's as big as it currently is since it's simply a complete rewrite of a previous implementation (I barely touched main and profiling things, but the rest was changed drastically).

Perhaps it would be easier if I published it as a new repository? (since 90% of things changed anyway) Then it would be easier to review it in an editor (or an IDE) as a new SFU rather than trying to comprehend how the new changes relate to the previous implementation (which I agree must be confusing).

What I normally do for large PRs is split them into smaller "atomic" commits, where each commit changes a single thing and explains why it was changed. That's the approach I'm in favor of and try to use in all projects (I've noticed that in Matrix and Element it's not always followed: many commits just alter the state of a previous one or combine unrelated things in one commit, something that I want to change!). I think that if I could present this not as one big change but as a series of well-separated commits, it would be much easier to understand. (I normally rebase commits before a push and make sure they are clear, but unfortunately this time I published earlier than I planned so that Dave could take a look at it earlier, as we had discussed.)

No worries about the PR being large, no reason to create a separate repo, I realize it's a large refactor and so there aren't many alternatives

I am also a little concerned about completely removing the subscriber/publisher model, which is very useful for simulcast and possibly other things; it's also very similar to what ion-sfu and livekit do, iirc

Good point, expected you to ask about it :)

So, the subscriber and publisher model does exist in the current implementation, but it does not have a separate module or file for it since for our current implementation it does not seem to be useful.

I.e. I can create a function in a Conference and name it getSubscribers() (it's easy to create one), but that would immediately pose questions like what exactly the subscriber structure should contain. Should it be a structure that contains the ID of a subscriber and the list of IDs of other participants to which the subscriber is subscribed? Or should it rather be the ID of a subscriber (the participant ID) and the list of tracks? Or should it be the list of tracks and the list of participants to which we're subscribed? Etc. Currently, these questions can't be answered simply because we don't really need an explicit entity such as Subscriber or Publisher.

We may indeed need it in the future (e.g. when implementing simulcast)! And once we need it, we'll definitely give it a good thought and introduce it! But adding these entities right now seemed like a "premature optimization" to me (since it would pose the questions listed above) that would complicate our current (pretty trivial and simple) use case.

As for the other SFUs, I'm only familiar with some internals of Janus and LiveKit. You mentioned LiveKit: yes, they have a concept of subscribing to and publishing a track (a concept we also have in our implementation 🙂), but like us, they don't have special files or packages such as subscriber.go or publisher.go, and there are no structures named Subscriber or Publisher in their implementation. They do have more concrete entities though, like SubscribedTrack or Downtrack.

So I would not say that we don't have a publisher-subscriber model. In fact, we do have it between peers and a conference (peers publish the messages to which the conference is subscribed). And we also know which tracks everyone is subscribed to and which tracks are being published by each participant (see participant.go). We just do it differently compared to the implementation in the main branch: I could not inherit the Publisher/Subscriber logic from main since it unfortunately suffers from race conditions in multiple places and the entities are not separated there, so it also has bugs like double-subscription (e.g. check Publisher::Subscribe() on main). So they had to be reworked anyway, unfortunately.

And I'm pretty sure that with time we'll also have more elaborate logic for this, with more structures that contain publisher- or subscriber-specific logic.

Alright, let's leave it out for now and re-add it with simulcast

I would also like my PR to get merged before this and for us to somehow integrate it into this one

Would you like me to integrate the logic from the RTCP PR into this PR before merging it into main? Or would you prefer to add it to this PR?

I can do it, but I was wondering if you would like to do it yourself to get a feel of what the rewritten version looks like and how such things are implemented within the new architecture.

Yeah, I actually wanted to have a go at adding it to this PR exactly for that reason :D Might take a bit of time, but I would like to have a look tomorrow

All of the TODOs are also a little concerning

Yep, but there are also TODOs and FIXMEs in main :) There is only one place (a FIXME) in this PR that mentions a feature that is implemented in main but not here. The rest existed in the previous implementation (it's just that some of them were not marked as concerns and did not have a dedicated TODO, like handling the data channel being busy or closing the SFU once Sync() ends). I decided to mark these places with TODOs (if not critical) and FIXMEs (if more critical and must be discussed) so that we don't forget about them.

Sure, can we just check we're not removing any functionality with this refactor (e.g. keep-alive)?

peer *peer.Peer[ParticipantID]
remoteSessionID id.SessionID
streamMetadata event.CallSDPStreamMetadata
publishedTracks map[event.SFUTrackDescription]*webrtc.TrackLocalStaticRTP
Contributor


Could we not use event.SFUTrackDescription here, please? It doesn't feel good that the Matrix event types and internal types are coupled when they aren't exactly the same thing. Also, event.SFUTrackDescription optionally includes resolution, so it feels wrong to use it as a key here since we'll never want resolution in the key.

Contributor Author

@daniel-abramov daniel-abramov Dec 5, 2022


In general, I totally agree with it. I think the only reason I used it like this is that it was not clear what exactly we are going to use as a track identifier, but I agree that having a separate identifier for tracks makes more sense (it would also simplify the adjacent functions that compare things).

Would you prefer me to change it now or after merging the PR? I was in favor of merging this ASAP as it would allow us to create PRs against main; currently we're creating PRs against this branch and branches on top of other branches :) I.e. me changing this now would conflict with the changes you made in the RTCP PR. But if you want, we can merge your PR and this PR today, and then I'll make a quick change on top of main.

Contributor


Yep, let's merge this ASAP - have made #56. I might actually have a look into that myself if I get a chance before you do
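The decoupling discussed in this thread could be sketched roughly as below. All names here (the `SFUTrackDescription` fields, `TrackID`, `KeyFromEvent`) are hypothetical stand-ins, not the actual types from this repository; the point is only that the map key no longer depends on optional fields like resolution:

```go
package main

import "fmt"

// Hypothetical shape of the Matrix event type: the resolution fields
// are optional and irrelevant for track identity.
type SFUTrackDescription struct {
	StreamID string
	TrackID  string
	Width    int // optional, identity-irrelevant
	Height   int // optional, identity-irrelevant
}

// TrackID is an internal identifier derived from the event type, so
// internal maps are keyed independently of the wire format.
type TrackID struct {
	StreamID string
	TrackID  string
}

func KeyFromEvent(desc SFUTrackDescription) TrackID {
	return TrackID{StreamID: desc.StreamID, TrackID: desc.TrackID}
}

func main() {
	published := map[TrackID]string{}
	lowRes := SFUTrackDescription{StreamID: "s1", TrackID: "t1", Width: 320, Height: 180}
	highRes := SFUTrackDescription{StreamID: "s1", TrackID: "t1", Width: 1280, Height: 720}
	published[KeyFromEvent(lowRes)] = "track"
	// Both descriptions map to the same key despite differing resolutions.
	_, ok := published[KeyFromEvent(highRes)]
	fmt.Println(ok) // true
}
```

With the event type as the key, the two descriptions above would have been treated as two different tracks.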

Pretty generic work-in-progress state.
This is a WIP state that does not work as expected!
This allows us to generalize the message sink and get rid of a lot of
copy-paste in the handling functions.

Also, this moves types to the right modules, so that `peer` is now a
completely matrix-unaware module that contains only plain WebRTC logic.
This seems to be more idiomatic in Go projects.
New participants are now properly handled depending on whether or not
they were in the session. Also, the conference and peer get documented.
Once tracks become obsolete, we unsubscribe from them.
This is a temporary solution.
daniel-abramov and others added 25 commits December 5, 2022 17:49
Instead of calling `LocalDescription()` to get it again. It does not
seem to make any difference in our particular case?
Don't include tracks in the metadata that are not yet published (for
which we don't have any remote streams available). Also, inform about
metadata changes once tracks get published and unpublished.
This is a temporary measure to fix the screen sharing.
Note that we start the keepalive routine once the peer is created. We do
it like this since the keepalive deadline is actually quite high and I
would say that if within that deadline no heartbeat messages were sent,
then we can consider the connection as stalled.

I.e. starting the keepalive timer only once the peer is connected is
like sparing a second that a peer normally needs to establish a
connection?
Signed-off-by: Šimon Brandner <[email protected]>
@daniel-abramov daniel-abramov merged commit 2b0597d into main Dec 5, 2022
@SimonBrandner SimonBrandner deleted the refactoring branch December 5, 2022 16:53
// The channel itself.
channel chan<- M
// Atomic variable that indicates whether the channel is closed.
receiverClosed *atomic.Bool
Member


My understanding is that this is considered a bit of a code smell in Go, since a sender should never be trying to write to a closed channel. I guess this means we have something where we can't know this and need to check each time?

Contributor Author

@daniel-abramov daniel-abramov Dec 5, 2022


Good that you've noticed it: tbh I did not like writing this particular structure, as it felt like it's not very elegant, but unfortunately I did not find a better way to write it.

The problem that I'm trying to solve is indeed to avoid writing to a closed channel; the catch is that from Go's standpoint the channel is not closed, since in Go channels can only be closed from the sender's side, not from the receiver's side.

A practical example of where it happens in our code:

  • There is a router in the code: a structure that simply has a map of all conferences running on the SFU. The router receives signaling messages (To-Device messages), checks their conf_id, and simply sends the message to the conference that is responsible for this message. So there is a channel between the router and a particular conference.
  • Conference is listening to the channel and reacting to the messages. But at some point, all participants leave the conference which means that the conference must be ended. So the conference goroutine and the main loop of the conference end, sending a message to the router telling "Ok, I'm done, you can remove it".

The problem occurs because at the moment the conference is considered ended, the channel remains open (the conference can't close the channel because it only holds the receiver side of it). And in Go there is no way to check if the channel is alive or if anyone is listening on it. So the router might have sent messages to the conference expecting that the conference would read them whereas, in reality, the conference had stopped listening on the channel by the time the router learned that the conference was dead.

To sum up:

  • There is a way in Go for the receiver to check whether the channel has been closed by the sender.
  • There is no way in Go for the sender to check whether anyone is still listening.
  • We need a way to know (on the sender side) if someone is listening to the channel (in order to know if it makes sense to post a message).

That's pretty much the problem that I tried to solve with this logic, i.e. create a wrapper around the channel where I could inform the sender that the receiver is no longer listening, so that if the sender tries to send a message to such a channel, it gets the message back and knows that there are no receivers anymore.
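The wrapper described above can be sketched as follows. The names (`MessageSink`, `receiverClosed`, `Seal`) are illustrative stand-ins for the actual types in this PR, and this simplified version ignores the race where a message is enqueued just before the receiver seals the sink:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// MessageSink wraps the sender side of a channel together with an
// atomic flag that the receiver flips once it stops listening.
type MessageSink[M any] struct {
	channel        chan M
	receiverClosed *atomic.Bool
}

func NewMessageSink[M any](capacity int) *MessageSink[M] {
	return &MessageSink[M]{
		channel:        make(chan M, capacity),
		receiverClosed: &atomic.Bool{},
	}
}

// Send returns an error instead of blocking (or panicking) when the
// receiver has already marked the sink as closed.
func (s *MessageSink[M]) Send(message M) error {
	if s.receiverClosed.Load() {
		return fmt.Errorf("receiver is no longer listening")
	}
	s.channel <- message
	return nil
}

// Seal is called by the receiver when it stops reading, e.g. when the
// conference ends. The channel itself stays open (only a sender may
// close a channel in Go), but senders now get an error back.
func (s *MessageSink[M]) Seal() {
	s.receiverClosed.Store(true)
}

func main() {
	sink := NewMessageSink[string](8)
	fmt.Println(sink.Send("hello") == nil) // true: receiver still listening
	sink.Seal()
	fmt.Println(sink.Send("world") == nil) // false: sink is sealed
}
```

In the router/conference example, the conference would call `Seal()` as its main loop exits, so the router's subsequent `Send` calls fail fast instead of piling messages into a channel nobody reads.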

Member


Thanks for the detailed explanation - I've also found Go is a bit light on actually explaining how you would write correct & non-racy code sticking to the given paradigms.

Contributor Author


Right! I've also spent some time trying to understand how to do things safely. In Rust, for instance, we always know if there is a sender and if there is a receiver, but in Go an elegant implementation of this does not work because of garbage collection: receivers are not deleted immediately after leaving a function's scope, but only when the GC kicks in, which means the lifetime of objects is not strictly defined; hence we don't know when the other end of a channel is dead.

messageSink chan<- Message[SenderType, MessageType]
// Atomic variable that indicates whether the message sink is sealed.
// This is used to prevent sending messages to a sealed message sink.
// The variable is atomic because it may be accessed from multiple goroutines.
Member


For clarity's sake, could we define what 'sealed' means here?

Contributor Author


Good point, will add a comment about that! Perhaps I need to rewrite the comment for this variable altogether.

func (p *Participant) sendDataChannelMessage(toSend event.SFUMessage) {
jsonToSend, err := json.Marshal(toSend)
if err != nil {
p.logger.Error("Failed to marshal data channel message")
Member


shouldn't this be returning here, otherwise we'll send blank messages?

Contributor Author


Good catch, thanks! Will be fixed right away.
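The fix agreed on above amounts to an early return after the failed marshal. The sketch below uses stubbed stand-in types (the real `Participant` and `event.SFUMessage` differ, and the real code writes to a WebRTC data channel rather than a slice); only the `return` is the point:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal stand-ins for the real types, just to make the sketch runnable.
type SFUMessage struct {
	Op string `json:"op"`
}

type Participant struct {
	sent []string // stand-in for the actual data channel
}

func (p *Participant) sendDataChannelMessage(toSend SFUMessage) {
	jsonToSend, err := json.Marshal(toSend)
	if err != nil {
		fmt.Println("Failed to marshal data channel message")
		return // the fix: bail out instead of sending a blank message
	}
	p.sent = append(p.sent, string(jsonToSend))
}

func main() {
	p := &Participant{}
	p.sendDataChannelMessage(SFUMessage{Op: "metadata"})
	fmt.Println(len(p.sent)) // 1
}
```

Without the `return`, a marshal failure would fall through and send an empty payload, which is exactly what the reviewer flagged.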

Successfully merging this pull request may close these issues.

Define a clear architecture for the SFU and refactor the current implementation