# MSC3401: Native Group VoIP Signalling #3401

Open · wants to merge 24 commits into base `main` (the diff below shows changes from the first 3 commits).

Commits:
* `05fd5af` MSC3401: Native Group VoIP Signalling (ara4n, Sep 19, 2021)
* `7f5ee49` comments & cosmetics (ara4n, Sep 20, 2021)
* `083fd9a` grammar (ara4n, Sep 20, 2021)
* `5ee96fb` incorporate review (ara4n, Sep 22, 2021)
* `b90b85e` more feedback (ara4n, Sep 23, 2021)
* `ed37a0d` add `purpose` from #3077 (ara4n, Sep 23, 2021)
* `33a64f2` Update proposals/3401-group-voip.md (ara4n, Sep 23, 2021)
* `7fd1ba6` converge better with #3077 and WebRTC norms (ara4n, Sep 25, 2021)
* `669d471` tracks have to be identified by stream + track tuple (ara4n, Sep 25, 2021)
* `48526ad` spell out that you should ignore `m.call.member` state events from pa… (ara4n, Oct 12, 2021)
* `dfd4ffe` Add basic call sequence diagram (robertlong, Mar 9, 2022)
* `3c306cc` Remove SFU datachannel ping/pong timeout section (robertlong, Mar 9, 2022)
* `4d43aae` Update m.call.member and call setup sections (robertlong, Mar 10, 2022)
* `856ddc7` spell out the unstable prefix (ara4n, May 28, 2022)
* `d109b54` add tracks back into m.call.member for SFUs to use (ara4n, May 30, 2022)
* `07f9547` add session IDs & labels (ara4n, Jun 3, 2022)
* `7a06ed7` Let call member events expire (#3831) (robintown, Jun 16, 2022)
* `32f566a` Rip out SFU bits out of MSC3401 (#3897) (SimonBrandner, Oct 21, 2022)
* `3fde32b` Move expiration timestamps to be per-device (#3941) (robintown, Nov 30, 2022)
* `05b5db2` Specify who calls who (#3942) (robintown, Nov 30, 2022)
* `43dc42f` Clarify `expires_ts` (SimonBrandner, Dec 3, 2022)
* `5635cee` Add `seq` (SimonBrandner, Dec 3, 2022)
* `b8ebe27` Use heading for Legend (SimonBrandner, Dec 3, 2022)
* `6b98d66` Fix-up some formatting (SimonBrandner, Dec 3, 2022)
Changes to `proposals/3401-group-voip.md` (267 additions, 0 deletions):
# MSC3401: Native Group VoIP signalling

## Problem

VoIP signalling in Matrix is currently conducted via timeline events in a 1:1 room.
This has some limitations, especially if you try to broaden the approach to multiparty VoIP calls:

* VoIP signalling can generate a lot of events as candidates are incrementally discovered, and for rapid call setup these need to be relayed as rapidly as possible.
* Putting these into the room timeline means that if the client has a gappy sync, for VoIP to be reliable it will need to go back and fill in the gap before it can process any VoIP events, slowing things down badly.
* Timeline events are (currently) subject to harsh rate limiting, as they are assumed to be a spam vector.
* VoIP signalling leaks IP addresses. There is no reason to keep these around for posterity, and they should only be exposed to the devices which care about them.
* Candidates are ephemeral data, and there is no reason to keep them around for posterity - they're just clogging up the DAG.

Meanwhile we have no native signalling for group calls at all, forcing you to instead embed a separate system such as Jitsi, which has its own dependencies and doesn't directly leverage any of Matrix's encryption, decentralisation, access control or data model.

## Proposal

This proposal provides a signalling framework using to-device messages which can be applied to native Matrix 1:1 calls, full-mesh calls, SFU calls, cascaded SFU calls and, in future, MCU calls and hybrid SFU/MCU approaches. It replaces the early flawed sketch at [MSC2359](https://github.com/matrix-org/matrix-doc/pull/2359).

This does not immediately replace the current 1:1 call signalling, but may in future provide a migration path to unified signalling for 1:1 and group calls.

Diagrammatically, this looks like:

1:1:
```
A -------- B
```

Full mesh between clients:
```
A -------- B
 \        /
  \      /
   \    /
    \  /
     C
```

SFU (aka Focus):
```
A __   __ B
    \ /
     F
     |
     |
     C

Where F is an SFU focus
```

> **robertlong** (Contributor, Sep 20, 2021): Bikeshedding warning: I'm relatively new to the WebRTC/VoIP industry, but I have never heard the term focus used in place of SFU. Is this a commonly known term? Should we be using SFU in this spec instead? Including renaming `m.foci` -> `m.sfus`?

> **ara4n** (author): the reason i originally went with foci is because the field originally described the (mxid, deviceid) tuples where a given mxid could be contacted - which might either be a local device (for full mesh) or an SFU. However, in the current simpler draft, the only time you include this field is if you are using a conferencing focus of some kind.
>
> But, this proposal is not meant to just be for SFUs - the device you use to focus together your view of the conference could (in future) equally be an MCU as much as an SFU. Hence using the correct, more generic term of 'focus' rather than making it specific to SFU technology. For instance, the server could advertise a stream which composites together a mosaic of different feeds for a non-E2EE call... at which point it's acting as a (hybrid) MCU.
>
> The term 'focus' comes from SIP (e.g. https://datatracker.ietf.org/doc/html/rfc3840#section-10.18) and is the standard term there for "an endpoint you connect to which mixes together other endpoints". I'm slightly inclined to keep it, to keep things flexible for future more sophisticated foci tech.

> **Contributor**: Can we call it `call_focus` or `stream_focus` or something a bit more descriptive than a not-well-known dictionary word?

> **ara4n** (author): focus is a pretty well-known word, and foci is its plural. i don't particularly want to call it 'focuses', given that's a different word (the third-person present form of 'to focus'). not sure this is a showstopper.

> **Contributor**: It definitely isn't a showstopper, but I would like to come up with a better name if we can. It is also a bit of a red flag that just about everything else in the MSC is calling it an SFU.

> **kyrias** (Contributor, Sep 25, 2021): While focus is a well-known word, outside of Britain its plural is 'focuses', so I would expect that a lot of people are going to be similarly confused over its meaning. Even the Cambridge Dictionary lists 'focuses' as the plural, while listing 'foci' as the formal plural in the UK. Might it be possible to at least mention in the spec that it's used in this sense?

Cascaded decentralised SFU:
```
A1 --.          .-- B1
A2 ---Fa ----- Fb--- B2
       \       /
        \     /
         \   /
          \ /
          Fc
          | |
         C1 C2

Where Fa, Fb and Fc are SFU foci, one per homeserver, each with two clients.
```

### m.conf state event

The user who wants to initiate a call sends a `m.conf` state event into the room to inform the room participants that a call is happening in the room. This effectively becomes the placeholder event in the timeline which clients would use to display the call in their scrollback (including duration and termination reason using `m.terminated`). Its body has the following fields:

* `m.intent` to describe the intended UX for handling the call. One of:
* `m.ring` if the call is meant to cause the room participants' devices to ring (e.g. 1:1 call or group call)
* `m.conference` if the call should be presented as a conference call which users in the room may connect to
  > **Contributor**: I am wondering if these two are basically the same thing with different push rules? Is this influenced by push rules?

  > **Contributor**: It seems like there is a sort of intersection. I can see a use case where in the same room we may have a "weekly sync" where we should buzz everyone and a "debugging session" where people may drop in. Of course there are some rooms where I may not care about `m.ring`. Maybe it is better to rephrase this as "priority"? Intent is very vague. Intent for what?

  > **ara4n** (author): i guess you could put this distinction into push rules, but it seems a bit simpler (especially given what a mess push rules are) to make it explicit here. After all, the difference between ringing and conferencing is not just the type of push notification you receive, but the whole UX (e.g. CallKit on iOS, or whether you display a dedicated ringing UX, etc).

  > **Contributor**: While that is true, I feel like we shouldn't use something that isn't push rules for influencing notifications.

  > **Contributor**: @ara4n do you see any resolution to this? I agree, it probably should be separate to the push rules. Implementations should use `m.ring` as the first clue as to whether or not to ring a device, and push rules should apply on top of it. The `m.ring` type also defines what UI to render in a client.

  > It would make sense to use intentional mentions for this. If none are included, it's like a conference call. Otherwise, in conjunction with the fact that it's a call event, the client would know to start ringing and not just pinging. To ring everyone in a room, you'd simply mention @room.

* `m.immersive` if the call should be presented as a voice/video channel in which the user is immediately immersed on selecting the room.
* `m.type` to say whether the type of call is voice only (`m.voice`) or video (`m.video`)
* `m.conf_id` as a unique identifier for the current ongoing call. (We can't use the event ID, given `m.type` is mutable. However, does this risk users causing problems with maliciously colliding IDs?).
* `m.terminated` if this event indicates that the call in question has finished, including the reason why. (do we need a duration, or can we figure that out from the previous state event?)
* `m.name` as an optional human-visible label for the call (e.g. "Conference call").
* `m.foci` as an optional list of recommended SFUs that the call initiator can recommend to users who do not want to use their own SFU (because they don't have one, or because they spot they would be the only person on their SFU for the call, and so choose to connect directly to save bandwidth).
* State key must be blank.

For instance:

```jsonc
{
    "type": "m.conf",
    "state_key": "",
    "content": {
        "m.intent": "m.immersive",
        "m.type": "m.voice",
        "m.conf_id": "cvsiu2893",
        "m.name": "Voice room",
        "m.foci": [
            "@sfu-lon:matrix.org",
            "@sfu-nyc:matrix.org",
        ],
    }
}
```
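
Once the call ends, the same state event would be updated to carry `m.terminated` with the reason the call finished. A minimal sketch, assuming a free-text reason string (the proposal does not define the value set, so `"call_ended"` is a hypothetical value):

```jsonc
{
    "type": "m.conf",
    "state_key": "",
    "content": {
        "m.intent": "m.immersive",
        "m.type": "m.voice",
        "m.conf_id": "cvsiu2893",
        "m.name": "Voice room",
        "m.terminated": "call_ended" // hypothetical reason; this MSC leaves the set of reasons open
    }
}
```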

We mandate at most one call per room at any given point to avoid UX nightmares - if you want the user to participate in multiple parallel calls, you should simply create multiple rooms, each with one call.
> **ShadowJonathan** (Contributor, Sep 20, 2021): I think that this is worth considering though, the UX nightmare might not be that bad (some clients might even work entirely with this possibility), and personally i think that putting the conf ID in a sub-field is just asking for problems (if the previous call information gets overridden by a person sending another state event for a "new" call while the last one is still in progress). Why not move `conf_id` into the `state_key`, currently declare multiple calls UB and unsupported, while noting that speccing it and properly seating it would be a case for a future MSC?

> **ara4n** (author): have done.

> **Member**: Re-opening this one because we've just had a glare-like bug on Element Call where multiple people entered the call at the same time (as you do) and multiple conferences got created in the same room. In general, we're going to want some way to handle glare of several people hitting the 'start conference call' button at the same time. Allowing multiple calls in a room means we need to handle this somehow. It's not impossible (e.g. we could define some common ID for 'the' call in a room, allowing you to use other IDs for other calls?) but I'd just like to check that we really want to deal with this complexity.

> **SimonBrandner** (Contributor, Mar 29, 2023): I am also very much in favour of having the `state_key` be just `""`, because having multiple group calls in one room often leads to more problems than benefits. With MSC3985 we now also have a separate method to create break-out rooms, so it feels like multiple calls in one room are no longer necessary. I also think we should be able to use `m.terminated` to calculate the call length.

> **Member**: I think there is still an issue with relying on `m.terminated` to determine the call length: if a client wants to display a timeline tile with the duration at the point where the call was ended, then it works; but if clients want to display the tile at the point that the call was started (like Element Web does), and we're reusing the same state key for all calls, it's difficult to get the duration from that event. In fact, if there's a call ongoing in the room, there's no way to tell whether a given call event is part of the current call or not, short of crawling the timeline, so clients won't know whether to label it with "call ended". With separate state keys, this is a lot easier, because it gives you a way to efficiently look up the current state of any call, current or historical.


### Call participation

Users who want to participate in the call declare this by adding an `m.conf` field to their `m.room.member` state event. Ideally, we'd use a dedicated state event type for this, making it easier to rapidly spot who is in a conference. But given that we don't want other people editing our state event, and Matrix doesn't yet provide that level of access control, we instead (ab)use the `m.room.member` event to declare our participation in the conference in the context of the room. Therefore any profile updates need to be careful to preserve the `m.conf` field.
> **Member**: Do we need to worry about clients setting a conf state event and then falling off a cliff, leaving the user stuck as if they're in a conference?

> **ara4n** (author): that line handles clients setting an m.call.member event and falling off a cliff. if a client sets an m.call event and falls off a cliff, however, anyone with permission to overwrite the state event in the room can go and remove the 'stuck' call.


The fields within the `m.conf` field are:

* `m.conf_id` - the ID of the conference the user is claiming to participate in. If this doesn't match the current `m.conf` event, it should be ignored.
* `m.foci` - Optionally, if the user wants to be contacted via an SFU rather than called directly (either 1:1 or full mesh), the user can also specify the SFUs their client(s) are connecting to.
* `m.sources` - Optionally, the user can list the various combinations of media streams they are able to send. This is important if connecting to an SFU, as it lets the SFU know what simulcast resolutions the sender can send. In theory the offered SDP should include this, but if we are multiplexing all streams into the same SDP it seems likely that this will get lost, hence publishing it here.

For instance:

```jsonc
{
    "type": "m.room.member",
    "state_key": "@matthew:matrix.org",
    "content": {
        "avatar_url": "mxc://matrix.org/oUxxDyzQOHdVDMxgwFzyCWEe",
        "displayname": "Matthew",
        "membership": "join",
        "m.conf": {
            "m.conf_id": "cvsiu2893",
            "m.foci": [
                "@sfu-lon:matrix.org",
                "@sfu-nyc:matrix.org",
            ],
            "m.sources": [
                {
                    "id": "qegwy64121wqw",
                    "name": "Webcam", // optional, just to help users understand what multiple streams from the same person mean.
                    "device_id": "ASDUHDGFYUW", // just in case people end up dialing this directly for full mesh or 1:1
                    "voice": [
                        { "id": "zbhsbdhwe", "format": { "channels": 2, "rate": 48000, "maxbr": 32000 } },
                    ],
                    "video": [
                        { "id": "zbhsbdhzs", "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 } },
                        { "id": "zbhsbdhzx", "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 } },
                    ],
                    "mosaic": {}, // for composited video streams?
                },
                {
                    "id": "suigv372y8378",
                    "name": "Screenshare", // optional
                    "device_id": "ASDUHDGFYUW",
                    "video": [
                        { "id": "xhsbdhzs", "format": { "res": { "width": 1280, "height": 720 }, "fps": 30, "maxbr": 512000 } },
                        { "id": "xbhsbdhzx", "format": { "res": { "width": 320, "height": 240 }, "fps": 15, "maxbr": 48000 } },
                    ]
                },
            ]
        }
    }
}
```

XXX: properly specify the formats here (webrtc constraints perhaps)?

It's acceptable to advertise rigid formats here rather than dynamically negotiating resolution, bitrate, etc., as in a group call we should just pick plausible desirable formats rather than try to please everyone.

If a device loses connectivity, it is not particularly problematic that the membership data will be stale: all that will happen is that calls to the disconnected device will fail due to media or data-channel keepalive timeouts, and then subsequent attempts to call that device will fail. Therefore (unlike the earlier demos) we don't need to spot timeouts by constantly re-posting the state event.
> **ara4n** (author): @robertlong pointed out earlier that this doesn't help us if you want to know who's in a call before you join it (e.g. for showing a facepile on the roomlist or whatever). One solution could be for each client interested in checking to ping the participants via to-device keepalives, but this could get very busy (e.g. if someone starts a call in a room with 20,000 users, every online user will promptly send to-device msgs to the participants in a full mesh call to check whether they're there or not; this is also a privacy problem).
>
> An alternative could be to not support this for full mesh calls, but instead, if there's an SFU keeping track of the users participating in a call, the SFU could publish this to all interested users via to-device message. This could still be very busy though (and leaks to the SFU who's online).
>
> Another alternative might be for the SFU to publish updates to the m.call event into the room timeline as users join and part (assuming the SFU has permission to write to the room). This same trick could then be used in full mesh by call participants to (roughly) track who else is in the call.
>
> Or we could have a dedicated per-room presence API to try to track who's online, and assume that if they're online and they claim to be participating in the call, then they're at least attempting to be present.

> **Contributor**: The biggest issue I see would be spam and processing times: a lot of users joining and leaving the call at once, propagated as dozens of state events in a room, could be a vector for abuse, by loading participant servers with unnecessary state resolution and such.
>
> I also have concerns wrt privacy, if users leave behind information after the call about when they joined or left.
>
> The best solution i'd see is to-device between SFUs, and to-device between SFUs and corresponding users, and have those share the load of "who's on this call right now". Have SFUs exchange information instantly, while, when sharing information with users, having a small delay of aggregation (a second or such) before pushing changes in call membership. This may degrade the UX if someone joins and starts talking within that second, or within the delay of to-device, but I think that at least call membership shouldn't be stored permanently.
>
> An SFU might even then, for example, opt to delay pushing membership information to its corresponding clients if the server is under heavy load (detected or inferred one way or another).
>
> All of this (using to-device) reduces the effect such an abuse vector might have (compared to state or normal events).


### Call setup

Call setup then uses the normal `m.call.*` events, except they are sent as to-device messages to the relevant devices (encrypted via Olm). This means:
> **bwindels** (Contributor, Feb 14, 2022): I guess the idea here is that clients should identify the sender (`user_id` and `device_id`) of the to-device message through the `sender_key` of the Olm-encrypted message, to know which peer the message is from in a full-mesh group call? Perhaps it could be valuable to spell that out.

> **Contributor**: I've updated the spec to include this now: we send the sender's `device_id` along in the `m.call.invite` event. But I think this is a better method? I haven't seen `sender_key` yet, could you point me to the docs on that? Also we're not using Olm encryption just yet, so it might still be better to include the `device_id` field in the content? Not sure.

> **Contributor**: You can read about `sender_key` here and here. The idea is that you query the keys for the given user (if not fetched already) and verify that the keys match the `sender_key` of the Olm message you received. The advantage of doing it this way is that the user ID and device ID can almost not be spoofed, assuming you have marked the other device/user as trusted, either manually or through cross-signing. Perhaps the current impl doesn't encrypt with Olm yet, but does it make sense to spec that? Is there a good reason to offer a non-encrypted version of the signalling?


* When initiating a 1:1 call, the `m.call.invite` is sent to `*` devices of the intended target user.
  * Once the user answers the call from a device, the sender should rescind the other pending to-device messages, ensuring that other devices don't get spammed about long-obsolete 1:1 calls. XXX: We will need a way to rescind pending to-device msgs.
  * Subsequent candidates and other events are sent only to the device which answered.
  * XXX: do we still need MSC2746's `party_id` and `m.call.select_answer`?
* We will need to include the `m.conf_id` so that peers can map the call to the right room (see the sketch after this list).
  * However, especially for 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.conf` event propagates, to minimise latency. Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
* When initiating a group call, we need to decide which devices to actually talk to.
  * If the client has no SFU configured, we try to use the `m.foci` in the `m.conf` event.
    * If there are multiple `m.foci`, we select the closest one based on latency, e.g. by trying to connect to all of them simultaneously and discarding all but the first call to answer.
    * If there are no `m.foci` in the `m.conf` event, then we look at which foci in the `m.room.member` events are already in use by existing participants, and select the most common one. (If that focus is overloaded it can reject us, and we should then try the next most populous one, etc.)
    * If there are no `m.foci` in the `m.room.member` events, then we connect full mesh.
    * If `m.foci` are subsequently introduced into the conference, then we should transfer the call to them (effectively doing a 1:1 -> group call upgrade).
  * If the client does have an SFU configured, then we decide whether to use it.
    * If other conf participants are already using it, then we use it.
    * If there are other users from our homeserver in the conference, then we use it (as presumably they should be using it too).
    * If there are no other `m.foci` (either in the `m.conf` event or in the participant state), then we use it.
    * Otherwise, we save bandwidth on our SFU by not cascading, and instead behave as if we had no SFU configured.
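
Putting the pieces above together, here is a sketch of what an `m.call.invite` sent as a to-device message might look like. The `m.conf_id`, `m.intent` and sender `device_id` fields are the additions this proposal calls for; the remaining fields follow the existing 1:1 `m.call.invite` shape from MSC2746, and their exact combination here is an assumption rather than normative:

```jsonc
// Sent as an Olm-encrypted to-device message rather than a room event.
{
    "type": "m.call.invite",
    "content": {
        "m.conf_id": "cvsiu2893",    // lets the callee map the call to the right room's m.conf event
        "m.intent": "m.ring",        // so the callee can ring even before the m.conf event propagates
        "device_id": "ASDUHDGFYUW",  // the sender's device, so the callee knows which peer is calling
        "call_id": "14236679850261", // hypothetical per-call ID, as in existing 1:1 signalling
        "lifetime": 60000,
        "offer": {
            "type": "offer",
            "sdp": "v=0\r\no=- 6584580628695956864 2 IN IP4 127.0.0.1[...]"
        },
        "version": "1"
        // party_id omitted: whether MSC2746's party_id survives is an open XXX above
    }
}
```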

TODO: spec how clients discover their homeserver's preferred SFU foci

Originally this proposal suggested that foci should be identified by their `(user_id, device_id)` rather than just their user_id, in order to ensure convergence on the same device. In practice, this is an unnecessary complication if we make it the SFU implementor's problem to ensure that either only one device is logged in per SFU user, or that the SFU devices for the same user are clustered together. It's important to note that when calling an SFU you should call `*` devices.

### SFU control

SFUs are Selective Forwarding Units: servers which forward WebRTC streams between peers (which could be clients or SFUs or both). To make use of them effectively, peers need to be able to tell the SFU which streams they want to receive, and the SFU must tell the peers which streams it wants to be sent. We also need a way of telling SFUs which other SFUs to connect ("cascade") to.

The client does this by establishing an optional datachannel connection to the SFU, in order to perform low-latency signalling to rapidly select streams.

To select a stream over this channel, the peer sends:

```jsonc
{
    "op": "select",
    "conf_id": "cvsiu2893",
    "start": [
        "zbhsbdhwe",
        "zbhsbdhzs",
    ],
    "stop": [
        "zbhsbdhxz",
    ]
}
```

Rather than sending arrays, one can send `"all"` as the value of either `start` or `stop`, to start or stop all streams respectively, as shown below.
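
For instance, a sketch of muting everything from the SFU in one message, following the `select` shape above:

```jsonc
{
    "op": "select",
    "conf_id": "cvsiu2893",
    "stop": "all"
}
```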
> **Contributor**: Managing these streams via start/stop events seems a little prone to failure. Would it be easier to send the entire list of streams you wish to receive? This should be sufficiently small that the payload would never get too big, and I think both the client and SFU logic would be simpler to manage.

> **ara4n** (author): My reason for not doing this is that switching streams can happen very rapidly (e.g. the client could request different streams as they receive different active speaker notifications), and the idea of sending the whole list of all streams you care about every time just feels like a big waste of bandwidth. If you're in a big cascading conference with thousands of users (which this architecture could support!) do you really want to list out all the stream IDs when you want to switch from one speaker to the next?


All streams are sent within a single media session (rather than us having multiple sessions or calls), and there is no difference between a peer sending simulcast streams from a webcam versus two SFUs trunking together.

If no DC is established, then 1:1 calls should send all streams without prompting, and SFUs should send no streams by default.

If you are using your SFU in a call, it will need to know how to connect to other SFUs present in order to participate in the full mesh of SFU traffic (if any). One option here is for SFUs to act as an AS and sniff the `m.room.member` traffic of their associated server, automatically calling any other `m.foci` which appear. (They don't need to make outbound calls to clients, as clients always dial in.) Otherwise, we could consider an `"op": "connect"` command sent by clients, but then you have the problem of deciding which client(s) are responsible for reminding the SFU to connect to other SFUs. Much better to trust the server.
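
For illustration only, here is a hypothetical sketch of the `"op": "connect"` command considered (and rejected) above, with its shape assumed by analogy with `select`; the AS-sniffing approach is preferred instead:

```jsonc
{
    "op": "connect",        // hypothetical: ask the SFU to cascade to another focus
    "conf_id": "cvsiu2893",
    "foci": [
        "@sfu-nyc:matrix.org"
    ]
}
```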

Also, in order to authenticate that only legitimate users are allowed to subscribe to a given conf_id on an SFU, it would make sense for the SFU to act as an AS and sniff the `m.conf` events on its associated server, and only act on to-device `m.call.*` events which come from a user who is confirmed to be in the room for that `m.conf`. (In practice, if the conf is E2EE then it's of limited use to connect to the SFU without having the keys to decrypt the traffic, but this feature is desirable for non-E2EE confs and to stop bandwidth DoS.)

Finally, the DC transport is also used to detect connectivity timeouts more rapidly than WebRTC's media timeout would allow, while avoiding clogging up the homeserver with keepalive traffic. This is done by each side sending an `"op": "ping"` packet every few seconds, and timing out the call if an `"op": "pong"` doesn't arrive within 5 seconds, as sketched below.
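
A minimal sketch of the keepalive exchange (including `conf_id` here mirrors the `select` op above and is an assumption):

```jsonc
// sent by each side every few seconds
{ "op": "ping", "conf_id": "cvsiu2893" }

// expected within 5 seconds, otherwise the call is timed out
{ "op": "pong", "conf_id": "cvsiu2893" }
```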

XXX: define how these DC messages mux with other traffic, and consider what message encoding to actually use.

TODO: spell out how this works with active speaker detection & associated signalling

## Encryption

We get E2EE for 1:1 and full mesh calls automatically in this model.
> **Contributor**: One thing I don't like about this proposal is that it uses quite a few unencrypted state events. If you join a conference, you are leaking metadata about:
>
> 1. the call existing.
> 2. who (tried to) participate in the call.
> 3. maybe some info about the physical devices of the user.
>
> Normal calls are not affected by that, because they don't use state events. State events of course make it easier to track that a room is a conference room or similar, but they currently can't be encrypted, and calls are imo somewhat more sensitive metadata. Verification gets around that by using relations instead of state events.
>
> I currently can't think of a good alternative to state events, and maybe one day we will get magic encrypted state events that no one has figured out so far. But maybe someone has an idea, or we could at least call out this issue in the encryption, potential issues or security sections?

> **ara4n** (author): fair point. i'm assuming we will have magic encrypted state events sooner or later, however.

> **ara4n** (author): . o O ( make m.call and m.call.member state events with no body, but a state_key which contains an event_id for a timeline E2EE event. clients then call GET /event on the event_id in the state_key of the state event in order to grab the encrypted contents of the event in question :P )

> **ara4n** (author): or actually, keep the same state_keys as before, but just have the contents be `{ "encrypted": "$event_id" }`. (shamelessly stolen from @turt2live)

> **ara4n** (author): ...which has now turned into #3414

> **Contributor**: Should we mark this resolved? Should we add MSC3414 to the spec?


However, when SFUs are on the media path, the SFU will necessarily terminate the SRTP traffic from the peer, breaking E2EE. To address this, we apply an additional end-to-end layer of encryption to the media using [WebRTC Encoded Transform](https://github.com/w3c/webrtc-encoded-transform/blob/main/explainer.md) (formerly Insertable Streams) via [SFrame](https://datatracker.ietf.org/doc/draft-omara-sframe/).

In order to provide PFS, the symmetric key used for these streams from a given participating device is a megolm key. Unlike a normal megolm key, this is shared via `m.room_key` over Olm to the devices participating in the conference, including an `m.conf_id` field on the key to correlate it to the conference traffic, rather than using the `session_id` event field to correlate (given the encrypted traffic is SRTP rather than events, and we don't want to have to send fake events from all senders every time the megolm session is replaced).

The megolm key is ratcheted forward for every SFrame, and shared with new participants at the current index via `m.room_key` over Olm as per above. When participants leave, a new megolm session is created and shared with all participants over Olm. The new session is only used once all participants have received it.
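
As a sketch, the key share for a participant's media might look like the following. The standard fields come from today's Megolm key-sharing over Olm; the `m.conf_id` field is the addition described above, and its exact placement within the content is an assumption:

```jsonc
// Sent over Olm as a to-device message to each device in the conference.
{
    "type": "m.room_key",
    "content": {
        "algorithm": "m.megolm.v1.aes-sha2",
        "room_id": "!roomid:matrix.org",     // hypothetical room hosting the m.conf event
        "session_id": "X3l5WpbzCmFnYW5h...", // megolm session protecting this sender's SFrame traffic
        "session_key": "AgAAAaXvbFgV...",    // truncated for illustration
        "m.conf_id": "cvsiu2893"             // correlates the key to the conference traffic
    }
}
```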

## Potential issues

To-device messages are point-to-point between servers, whereas today's `m.call.*` messages can transitively traverse servers via the room DAG, thus working around federation problems. In practice if you are relying on that behaviour, you're already in a bad place.

The SFUs participating in a conference end up in a full mesh. Rather than inventing our own spanning-tree system for SFUs, however, we should fix this for Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or similar to decide what better-than-full-mesh topology to use. In practice, full-mesh cascade between SFUs is probably not that bad (especially if SFUs only request the streams over the trunk that their clients care about), and on aggregate it will be less obnoxious than all the clients hitting a single SFU.

SFrame currently mandates its own ratchet, which is almost the same as megolm but not quite. Switching it out for megolm seems reasonable right now (at least until MLS comes along).

## Alternatives

There are many, many different ways to do this. The main alternative considered was not to use state events to track membership, but instead to gossip it via either to-device or DC messages between participants. This fell apart, however, due to trust: you effectively end up reinventing large parts of Matrix layered on top of to-device or DC messages. So you might as well publish and distribute the participation data in the DAG rather than reinvent the wheel.

Another option is to treat 1:1 (and full mesh) entirely differently to SFU based calling rather than trying to unify them. Also, it's debatable whether supporting full mesh is useful at all. In the end, it feels like unifying 1:1 and SFU calling is for the best though, as it then gives you the ability to trivially upgrade 1:1 calls to group calls and vice versa, and avoids maintaining two separate hunks of spec. It also forces 1:1 calls to take multi-stream calls seriously, which is useful for more exotic capture devices (stereo cameras; 3D cameras; surround sound; audio fields etc).

An alternative to to-device messages is to use DMs. You still risk gappy sync problems though due to lots of traffic, as well as the hassle of creating DMs and requiring canonical DMs to set up the calls. It does make debugging easier though, rather than having to track encrypted ephemeral to-device msgs.

## Security considerations

Malicious users could try to DoS SFUs by specifying them as their foci.

SFrame E2EE may go horribly wrong if we can't send the new megolm session fast enough to all the participants when a participant leaves (and meanwhile, if we keep using the old session, we're technically leaking call media to the departed participant until we manage to rotate).

Need to ensure there's no scope for media forwarding loops through SFUs.

Malicious users in a room could try to sabotage a conference by overwriting the `m.conf` state event.

Too many foci will chew bandwidth due to the full mesh between them. In the worst case, if every user is on their own HS and picks a different focus, it degenerates to a full-mesh call (just serverside rather than clientside). Hopefully this shouldn't happen, as you will converge on using the single SFU with the most clients, but we need to check how this works in practice.

## Unstable prefix

...