MSC3401: Native Group VoIP Signalling #3401

Commits
05fd5af
MSC3401: Native Group VoIP Signalling
ara4n Sep 19, 2021
7f5ee49
comments & cosmetics
ara4n Sep 20, 2021
083fd9a
grammar
ara4n Sep 20, 2021
5ee96fb
incorporate review
ara4n Sep 22, 2021
b90b85e
more feedback
ara4n Sep 23, 2021
ed37a0d
add `purpose` from #3077
ara4n Sep 23, 2021
33a64f2
Update proposals/3401-group-voip.md
ara4n Sep 23, 2021
7fd1ba6
converge better with #3077 and WebRTC norms
ara4n Sep 25, 2021
669d471
tracks have to be identified by stream + track tuple
ara4n Sep 25, 2021
48526ad
spell out that you should ignore `m.call.member` state events from pa…
ara4n Oct 12, 2021
dfd4ffe
Add basic call sequence diagram
robertlong Mar 9, 2022
3c306cc
Remove SFU datachannel ping/pong timeout section
robertlong Mar 9, 2022
4d43aae
Update m.call.member and call setup sections
robertlong Mar 10, 2022
856ddc7
spell out the unstable prefix
ara4n May 28, 2022
d109b54
add tracks back into m.call.member for SFUs to use
ara4n May 30, 2022
07f9547
add session IDs & labels
ara4n Jun 3, 2022
7a06ed7
Let call member events expire (#3831)
robintown Jun 16, 2022
32f566a
Rip out SFU bits out of MSC3401 (#3897)
SimonBrandner Oct 21, 2022
3fde32b
Move expiration timestamps to be per-device (#3941)
robintown Nov 30, 2022
05b5db2
Specify who calls who (#3942)
robintown Nov 30, 2022
43dc42f
Clarify `expires_ts`
SimonBrandner Dec 3, 2022
5635cee
Add `seq`
SimonBrandner Dec 3, 2022
b8ebe27
Use heading for Legend
SimonBrandner Dec 3, 2022
6b98d66
Fix-up some formatting
SimonBrandner Dec 3, 2022
263 changes: 263 additions & 0 deletions proposals/3401-group-voip.md
@@ -0,0 +1,263 @@
# MSC3401: Native Group VoIP signalling

Note: previously this MSC included SFU signalling which has now been moved to
[MSC3898](https://github.com/matrix-org/matrix-spec-proposals/pull/3898) to
avoid making this MSC too large.

## Problem

VoIP signalling in Matrix is currently conducted via timeline events in a 1:1 room.
This has some limitations, especially if you try to broaden the approach to multiparty VoIP calls:

* VoIP signalling can generate a lot of events as candidates are incrementally discovered, and for rapid call setup these need to be relayed as rapidly as possible.
* Putting these into the room timeline means that if the client has a gappy sync, for VoIP to be reliable it will need to go back and fill in the gap before it can process any VoIP events, slowing things down badly.
* Timeline events are (currently) subject to harsh rate limiting, as they are assumed to be a spam vector.
* VoIP signalling leaks IP addresses. There is no reason to keep these around for posterity, and they should only be exposed to the devices which care about them.
* Candidates are ephemeral data, and there is no reason to keep them around for posterity - they're just clogging up the DAG.

Meanwhile we have no native signalling for group calls at all, forcing you to instead embed a separate system such as Jitsi, which has its own dependencies and doesn't directly leverage any of Matrix's encryption, decentralisation, access control or data model.

## Proposal

This proposal provides a signalling framework using to-device messages which can
be applied to native Matrix 1:1 calls, full-mesh calls and in the future SFU
calls, cascaded SFU calls, MCU calls, and hybrid SFU/MCU approaches. It replaces
the early flawed sketch at
[MSC2359](https://github.com/matrix-org/matrix-doc/pull/2359).

This does not immediately replace the current 1:1 call signalling, but may in future provide a migration path to unified signalling for 1:1 and group calls.

Diagrammatically, this looks like:

1:1:
```
A -------- B
```

Full mesh between clients:
```
A -------- B
 \        /
  \      /
   \    /
    \  /
     C
```

SFU (aka Focus):
Contributor

@robertlong robertlong Sep 20, 2021

Bikeshedding warning: I'm relatively new to the WebRTC/VoIP industry, but I have never heard the term focus used in place of SFU. Is this a commonly known term? Should we be using SFU in this spec instead? Including renaming m.foci -> m.sfus?

Member Author

the reason i originally went with foci is because the field originally described the (mxid, deviceid) tuples where a given mxid could be contacted - which might either be a local device (for full mesh) or an SFU.

However, in the current simpler draft, the only time you include this field is if you are using a conferencing focus of some kind.

But, this proposal is not meant to just be for SFUs - the device you use to focus together your view of the conference could (in future) equally be an MCU as much as an SFU. Hence using the correct more generic term of 'focus' rather than making it specific to SFU technology. For instance, the server could advertise a stream which composites together a mosaic of different feeds for a non-E2EE call... at which point it's acting as a (hybrid) MCU.

The term 'focus' comes from SIP (e.g. https://datatracker.ietf.org/doc/html/rfc3840#section-10.18) and is the standard term there for "an endpoint you connect to which mixes together other endpoints". I'm slightly inclined to keep it, to keep things flexible for future more sophisticated foci tech.

Contributor

Can we call it call_focus or stream_focus or something a bit more descriptive than a not-well-known dictionary word?

Member Author

focus is a pretty well-known word, and foci is its plural. i don't particularly want to call it 'focuses', given that's a different word (the 3rd person present form of 'to focus'). not sure this is a showstopper.

Contributor

It definitely isn't a showstopper but I would like to come up with a better name if we can. It is also a bit of a red-flag that just about everything else in the MSC is calling it a SFU.

Contributor

@kyrias kyrias Sep 25, 2021

While focus is a well-known word, outside of Britain its plural is 'focuses', so I would expect that a lot of people are going to be similarly confused over its meaning. Even the Cambridge Dictionary lists 'focuses' as the plural, while listing 'foci' as the formal plural in the UK.

Might it be possible to at least mention in the spec that it's used in this sense?

```
A __    __ B
    \  /
     F
     |
     |
     C
Where F is an SFU focus
```

Cascaded decentralised SFU:
```
A1 --.           .-- B1
A2 ---Fa ----- Fb--- B2
       \       /
        \     /
         \   /
          \ /
          Fc
         |  |
        C1  C2

Where Fa, Fb and Fc are SFU foci, one per homeserver, each with two clients.
```

### m.call state event

The user who wants to initiate a call sends an `m.call` state event into the room to inform the room participants that a call is happening in the room. This effectively becomes the placeholder event in the timeline which clients would use to display the call in their scrollback (including duration and termination reason using `m.terminated`). Its body has the following fields:
Contributor

How should glare be handled at the group call level in the case where multiple parties actually didn't mean to set up separate group calls in a room but just meant to call each other? For example, we could dictate that calls that have the same purpose and name should be able to replace each other in case of glare?

Contributor

This is a very good question. Any idea @ara4n?

I think because the m.call.invite event includes the conf_id this is a non issue? But we've also only defined the m.call.invite for group calls under to-device messages. I guess for the m.ring intent you also need to be able to send the m.call.invite with a conf_id set as a regular message event?

In any case, I think glare is a non issue for the m.room or m.prompt intent types. You both created group calls and one of you needs to join the other in the UI. However, for m.ring that involves sending invite and if you both invite each other at the same time I think we should use the same glare resolution we have for regular calls in that we compare conf_id values.

Member Author

Glare can happen with any call type, though, if two clients decide to set m.call at the same time. I suspect we should a) add an index somewhere to futureproof for more than one call per room, b) for two calls with the same index, tiebreak between them by prioritising the m.call event with the lexicographically lowest call ID.
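
A minimal sketch of the tiebreak suggested in the comment above, assuming competing `m.call` events are available as plain dicts whose `state_key` is the call ID. The `index` content field is hypothetical (the comment only proposes adding an index "somewhere"), and `resolve_call_glare` is an illustrative name rather than anything defined by the MSC.

```python
def resolve_call_glare(call_events):
    """Pick one winning m.call event per index.

    `call_events` is a list of dicts shaped like Matrix state events.
    The `index` content field is hypothetical: the comment above only
    suggests adding an index "somewhere".
    """
    winners = {}
    for event in call_events:
        index = event.get("content", {}).get("index", 0)  # hypothetical field
        current = winners.get(index)
        # The state key doubles as the call ID; the lowest one wins.
        if current is None or event["state_key"] < current["state_key"]:
            winners[index] = event
    return winners


calls = [
    {"type": "m.call", "state_key": "cvsiu2893", "content": {}},
    {"type": "m.call", "state_key": "abx81jq02", "content": {}},
]
print(resolve_call_glare(calls)[0]["state_key"])  # abx81jq02
```

Because every client applies the same deterministic comparison to the same room state, they would all converge on the same call without exchanging any further messages.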

The m.room.power_levels state event specifies that posting state events requires a power level of 50 by default. From a user experience standpoint, I would think it is reasonable for normal users in a room to be able to start calls in that room by default, but with the current power_levels policy it would need the m.call power level set lower. It may be desirable for room creation UX in clients to present the option to set this level upfront.

Perhaps there should be a way to specify a different power level requirement for different intents as well. A Discord user would expect to be able to start a room's call freely without disturbing other members of the room ala m.room intent. On the other hand, an m.ring is a much more disruptive intent that would be reserved for smaller group chats and should not normally be allowed in other kinds of rooms.

Contributor

Imho outside of DMs (where both users have PL100 anyway usually) calls should not be allowed for normal users. It is still a vector of spam. Just imagine having calls being started in Matrix HQ. It would just cause issues imho.

Imho it is a sane default to restrict this and need active changes to allow it in a room.

> Just imagine having calls being started in Matrix HQ. It would just cause issues imho.

This is what I mean about the different call intents causing different levels of disruption. You're right, obviously m.ring has very different impact from m.prompt or m.room and the default should be to disallow that. But a room's administration may want users to be able to start calls with one intent and not the other.

Unless I'm misunderstanding the purpose of m.room? Is the idea for m.room intent that a room would always have a call "active", even if it has no participants, ala a "voice channel" in Discord, such that a level-0 user would typically not be able to end that call ergo not need to be able to publish state events for it other than m.call.member?


* `m.intent` to describe the intended UX for handling the call. One of:
  * `m.ring` if the call is meant to cause the room participants' devices to ring (e.g. 1:1 call or group call)
  * `m.prompt` if the call should be presented as a conference call which users in the room are prompted to connect to
  * `m.room` if the call should be presented as a voice/video channel in which the user is immediately immersed on selecting the room.
* `m.type` to say whether the initial type of call is voice only (`m.voice`) or video (`m.video`). This signals the intent of the user when placing the call to the participants (i.e. "i want to have a voice call with you" or "i want to have a video call with you") and warns the receiver whether they may be expected to view video or not, and provide suitable initial UX for displaying that type of call... even if it later gets upgraded to a video call.
Contributor

Hmm, would there be a benefit to other call types? Or doing this more flexibly? (Allowing audioless default state)

Contributor

Maybe? I think this was originally intended to be used to differentiate between the different UIs to display, but I ended up using it in matrix-js-sdk to help determine what user media constraints to use. So perhaps this needs to become an object containing the default media types to request (audio, video, datachannel)?

Contributor

For Third Room, I think we do want to allow for joining a room with datachannel only and then upgrading the call to use audio. So either we need another type to handle this or we split it up like I commented above.

Contributor

Maybe something like this would work?

{
    "type": "m.call",
    "state_key": "cvsiu2893",
    "content": {
        "m.intent": "m.room",
        "m.type": "m.voice",
        "m.name": "Voice room",
        "m.foci": [
            "@sfu-lon:matrix.org",
            "@sfu-nyc:matrix.org"
        ],
        "m.audio": true,
        "m.audio_muted": true,
        "m.video": true,
        "m.video_muted": false,
        "m.datachannels": [
          {
            "label": null,
            "id": null,
            "ordered": true,
            "maxPacketLifeTime": null,
            "maxRetransmits": null,
            "protocol": ""
          }
        ]
    }
}

Where m.type is used for displaying the correct room UI and m.audio, m.video, and m.datachannel are used for specifying what the client should request from the user by default.

m.audio_muted and m.video_muted specify whether your client should by default mute the microphone or video by default. Useful for large public rooms.

Here's a voice room with audio requested by default.

{
    "type": "m.call",
    "state_key": "cvsiu2893",
    "content": {
        "m.intent": "m.room",
        "m.type": "m.voice",
        "m.name": "Voice room",
        "m.audio": true
    }
}

Here's a voice room where audio isn't requested by default. Maybe you are listening to a presenter like in Twitter Spaces or Clubhouse style apps.

{
    "type": "m.call",
    "state_key": "cvsiu2893",
    "content": {
        "m.intent": "m.room",
        "m.type": "m.voice",
        "m.name": "Audio Presenter Room",
        "m.audio": false
    }
}

Or maybe you want everyone to be able to speak in the room so you request microphone permissions up front, you just want people to join muted.

{
    "type": "m.call",
    "state_key": "cvsiu2893",
    "content": {
        "m.intent": "m.room",
        "m.type": "m.voice",
        "m.name": "Audio Presenter Room",
        "m.audio": true,
        "m.audio_muted": true
    }
}

Video and Voice Room where only audio is requested by default. This is similar to Discord where you can turn on your webcam or share your screen after you've joined.

{
    "type": "m.call",
    "state_key": "cvsiu2893",
    "content": {
        "m.intent": "m.room",
        "m.type": "m.video",
        "m.name": "Audio Room",
        "m.audio": true
    }
}

Video Room where both audio and video are requested by default.

{
    "type": "m.call",
    "state_key": "cvsiu2893",
    "content": {
        "m.intent": "m.room",
        "m.type": "m.video",
        "m.name": "Video Room",
        "m.audio": true,
        "m.video": true
    }
}

Third Room would use something like the following config:

{
    "type": "m.call",
    "state_key": "cvsiu2893",
    "content": {
        "m.intent": "m.room",
        "m.type": "m.audio",
        "m.name": "Third Room World",
        "m.audio": true,
        "m.audio_muted": true,
        "m.datachannels": [
          {
            "label": "m.world.reliable"
          },
          {
            "label": "m.world.unreliable",
            "ordered": false,
            "maxRetransmits": 0
          }
        ]
    }
}

Member

Yeah, this seems quite nice. m.audio and m.video work okay for 1:1 calls but get a bit foggier for multi-party, so I think it's fair to upgrade. I imagine we probably would want to avoid renegotiating to add video etc on multi-party calls - renegotiating one connection is fine but we certainly want to avoid causing all parties to renegotiate at the same time.

Member Author

@ara4n ara4n Mar 10, 2022

I like the idea that the call should have a richer way to specify its expected UX (e.g. audio, video, datachannel, whether audio or video are muted by default etc). I'm a bit worried that the API shape starts to collide nastily with how each source advertises each stream (which in turn collides with simon's work for describing changes in each stream - i.e. signalling mute state). In other words, we have three similar related things here:

  • What media streams should new call members expect to need to receive (or send)?
  • What media streams does a call member actually broadcast?
  • What are these media streams doing? (e.g. adding/removing screenshares; changing voice/video mute)

...which are all at risk of ending up with non-matching API shapes. I wonder if there is a way to unify them. For instance, this MSC currently proposes that each device in a call advertises the feeds that it's sending as:

                        "feeds": [
                            {
                                "purpose": "m.usermedia"
                                // TODO: Add tracks
                                // TODO: Available bitrates etc. should be listed here
                            },
                            {
                                "purpose": "m.screenshare"
                                // TODO: Add tracks
                                // TODO: Available bitrates etc. should be listed here
                            }

(Which is somewhat similar to how #3077 advertises them in m.call.invite). Then #3291 adds on the ability to describe how they change over time:

{
    "type": "m.call.sdp_stream_metadata_changed",
    "room_id": "!roomId",
    "content": {
        "call_id": "1414213562373095",
        "party_id": "1732050807568877",
        "sdp_stream_metadata": {
            "2311546231": {
                "purpose": "m.usermedia",
                "audio_muted": false,
                "video_muted": true
            }
        },
        "version": "1"
    }
}

So, i'm wondering whether a better way of describing the expected streams for participating in a call would be a list of feeds in the m.call, each with a purpose and audio_muted, rather than yet another different shape.

This would then also pave the way for the call to specify format intents (i.e. "you're expected to join this call with stereo audio and send 4K video") as opposed to ("you're expected to join this call with 8kHz mono audio and that's it").

TL;DR: we should be publishing the recommended media constraints in the m.call.

FIXME: That said, do we want to support different send & receive constraints? Currently we assume calls are symmetric. Similarly, do we want to support proposing different constraints for different types of users? (e.g. Clubhouse presenters should start off unmuted, but Clubhouse listeners should start off muted)?

* `m.terminated` if this event indicates that the call in question has finished, including the reason why. (A voice/video room will never terminate.) (do we need a duration, or can we figure that out from the previous state event?).
* `m.name` as an optional human-visible label for the call (e.g. "Conference call").
* The state key is a unique ID for that call. (We can't use the event ID, given that `m.type` and `m.terminated` are mutable.) If there are multiple non-terminated conf ID state events in the room, the client should display the most recently edited event.

For instance:

```jsonc
{
    "type": "m.call",
    "state_key": "cvsiu2893",
    "content": {
        "m.intent": "m.room",
        "m.type": "m.voice",
        "m.name": "Voice room"
    }
}
```
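
A minimal sketch of how a client might choose which call to display when several unterminated `m.call` events exist, as described in the state key bullet above. It assumes state events are plain dicts, that a call counts as terminated when `m.terminated` is present in its content, and that `origin_server_ts` is an acceptable stand-in for "most recently edited"; `call_to_display` is an illustrative name, not part of this proposal.

```python
def call_to_display(state_events):
    """Return the m.call event a client would show, or None."""
    active = [
        event for event in state_events
        if event.get("type") == "m.call"
        and "m.terminated" not in event.get("content", {})
    ]
    if not active:
        return None
    # Assumption: origin_server_ts stands in for "most recently edited".
    return max(active, key=lambda event: event.get("origin_server_ts", 0))
```

Terminated calls are filtered out before the comparison, so a call that ended recently never hides an older call that is still running.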

We mandate at most one call per room at any given point to avoid UX nightmares - if you want the user to participate in multiple parallel calls, you should simply create multiple rooms, each with one call.
Contributor

@ShadowJonathan ShadowJonathan Sep 20, 2021

I think that this is worth considering though, the UX nightmare might not be that bad (some clients might even work entirely with this possibility), and personally i think that putting the conf ID in a sub-field is just asking for problems (if the previous call information gets overridden by a person sending another state event for a "new" call while the last one is still in-progress.)

Why not move conf_id into the state_key, currently declare multiple calls UB and unsupported, while noting that speccing it and properly seating it would be a case for a future MSC?

Member Author

have done.

Member

Re-opening this one because we've just had a glare-like bug on Element Call where multiple people entered the call at the same time (as you do) and multiple conferences got created in the same room. In general, we're going to want some way to handle glare of several people hitting the 'start conference call' button at the same time. Allowing multiple calls in a room means we need to handle this somehow. It's not impossible (eg. we could define some common ID for 'the' call in a room allowing you to use other IDs for other calls?) but I'd just like to check that we really want to deal with this complexity.

Contributor

@SimonBrandner SimonBrandner Mar 29, 2023

I am also very much in favour of having the state_key be just "" because having multiple group calls in one room often leads to more problems rather than benefits

With MSC3985 we now also have a separate method to create break-out rooms, so it feels like multiple calls in one room are no longer necessary

I also think we should be able to use the m.terminated to calculate the call length

Member

I think there is still an issue with relying on m.terminated to determine the call length: If a client wants to display a timeline tile with the duration at the point where the call was ended, then it works, but if clients want to display the tile at the point that the call was started (like Element Web does), and we're reusing the same state key for all calls, it's difficult to get the duration from that event. In fact, if there's a call ongoing in the room, there's no way to tell whether a given call event is part of the current call or not, short of crawling the timeline, so clients won't know whether to label it with "call ended".

With separate state keys, this is a lot easier, because it gives you a way to efficiently look up the current state of any call, current or historical.


### Call participation

Users who want to participate in the call declare this by publishing an `m.call.member` state event using their Matrix ID as the state key (thus ensuring other users cannot edit it). The event contains a timestamp named `m.expires_ts` describing when this data should be considered stale, and an array `m.calls` of objects describing which calls the user is participating in within that room. This array must contain one item (for now).

When sending an `m.call.member` event, clients should choose a reasonable value for `m.expires_ts` in case they go offline unexpectedly. If the user stays connected for longer than this time, the client must actively update the state event with a new expiration timestamp.

`m.call.member` state events must be ignored if the `m.expires_ts` field indicates they have expired, or if their user's `m.room.member` event's membership field is not `join`.
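
A minimal sketch of the two checks above, assuming `m.expires_ts` is a Unix timestamp in milliseconds (the usual Matrix convention, though not stated here) and that the caller has already looked up the sender's `m.room.member` membership. Treating a missing or malformed `m.expires_ts` as stale is an assumption of this sketch, and `is_call_member_valid` is an illustrative name.

```python
import time


def is_call_member_valid(member_event, membership, now_ms=None):
    """Return True if an m.call.member event should be honoured."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Users who are not joined to the room cannot be in the call.
    if membership != "join":
        return False
    expires_ts = member_event.get("content", {}).get("m.expires_ts")
    # Assumption: a missing or non-numeric expiry is treated as stale.
    if not isinstance(expires_ts, (int, float)):
        return False
    return expires_ts > now_ms
```

Passing `now_ms` explicitly keeps the check deterministic in tests; a client would normally leave it unset and re-evaluate as time passes, since an event can become stale without any new event arriving.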

The fields within the item in the `m.calls` contents are:

* `m.call_id` - the ID of the conference the user is claiming to participate in. If this doesn't match an unterminated `m.call` event, it should be ignored.
Member

This probably ought to be m.conf_id to differentiate it from IDs of 1:1 calls and match the conf_id field in m.call.* to-device events?


Note: currently the call_id and conf_id are not identical. This seems to be confusing if we're talking about the SFU calls (not sure how it's handled in a full-mesh).

When working on an SFU recently, I realized that conf_id was the ID of a conference (or a call if you will) which was quite logical and expected. However, what I did not expect is that in addition to the conf_id, each To-Device message has a call_id which does not match the conf_id and which seems to be uniquely generated by each participant.

The thing is: call_id field does not make any sense for the SFU at the moment (see the SFU MSC), since the SFU does not know what the call_id is (it looks like a randomly generated string that is different for each participant who tries to join a conference), but at the same time, the SFU is essentially obligated to store the call_id because the To-Device messages from the SFU to the participants are expected to have the call_id that matches the call_id value sent from participants to the SFU when they contact the SFU (I tried setting the call_id to match conf_id when sending a message from the SFU to the client, but the client discarded the message if the call_id did not match the call_id that the client sent to the SFU). So essentially, there is a conf_id the semantics of which is defined (it's the unique ID of a conference/call) and the call_id (which does not have any meaning for the SFU).

* `m.devices` - The list of the member's active devices in the call. A member may join from one or more devices at a time, but they may not have two active sessions from the same device. Each device contains the following properties:
  * `device_id` - The device id to use for to-device messages when establishing a call
  * `session_id` - A unique identifier used for resolving duplicate sessions from a given device. When the `session_id` field changes from an incoming `m.call.member` event, any existing calls from this device in this call should be terminated. `session_id` should be generated once per client session on application load.
Contributor

I have trouble understanding what this is about. Is it perhaps about protecting against for example a user opening their same element web session in multiple tabs? In any case, it might be good to spell out more explicitly what exactly this is supposed to guard against.

Member

It's mostly to deal with users hitting the refresh button I believe, so we can ignore anything from previous instances of the app and terminate calls with the old instance when we see a new one.

  * `feeds` - Contains an array of feeds the member is sharing and the opponent member may reference when setting up their WebRTC connection.
    * `purpose` - Either `m.usermedia` or `m.screenshare`; otherwise the feed should be ignored.
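
A minimal sketch of the `session_id` rule above: compare the `m.devices` list from the previously seen `m.call.member` event with the incoming one and report devices whose session changed, so that existing calls with them can be terminated. The function name and the plain-dict representation are assumptions made for illustration.

```python
def devices_with_new_session(old_devices, new_devices):
    """Return device_ids whose session_id differs between two m.devices lists."""
    old_sessions = {d["device_id"]: d.get("session_id") for d in old_devices}
    return [
        d["device_id"]
        for d in new_devices
        if d["device_id"] in old_sessions
        and old_sessions[d["device_id"]] != d.get("session_id")
    ]


previous = [{"device_id": "ASDUHDGFYUW", "session_id": "GHKJFKLJLJ"}]
incoming = [{"device_id": "ASDUHDGFYUW", "session_id": "QWERTYUIOP"}]
print(devices_with_new_session(previous, incoming))  # ['ASDUHDGFYUW']
```

Devices that appear only in the incoming list are new joiners rather than restarted sessions, so the sketch deliberately leaves them out.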

For instance:

```jsonc
{
    "type": "m.call.member",

Member Author

todo: actually track here whether the participant is joined to the call or not(!)

Contributor

Yeah we still have an issue with tracking participants for a given group call for displaying in the UI. How are we going to check who is in a call and scale it?

    "state_key": "@matthew:matrix.org",
    "content": {
        "m.calls": [
            {
                "m.call_id": "cvsiu2893",
                "m.devices": [
                    {
                        "device_id": "ASDUHDGFYUW", // Used to target to-device messages
                        "session_id": "GHKJFKLJLJ", // Used to resolve duplicate calls from a device
                        "feeds": [
                            {
                                "purpose": "m.usermedia",
                                "id": "qegwy64121wqw", // WebRTC MediaStream id
Member Author

These IDs are only accurate if announcing to an SFU - in full mesh, each separate call will have its own IDs.

Therefore we should probably scope these to a given 1:1 call ID?

"tracks": [
{
"kind": "audio",
"id": "zvhjiwqsx", // WebRTC MediaStreamTrack id
"label": "Sennheiser Mic",
Member Author

Actually, perhaps this is a bit of a privacy violation - why do other people in a conference need to know what my devices are called?

Yeah, right. Should we remove it? (I also can't find a case where we would want to let others know what our devices are called when publishing 🤔)

Member

Yeah, I can't really see a good reason we'd send this. The purpose ought to be enough information.

"settings": { // WebRTC MediaTrackSettings object
"channelCount": 2,
"sampleRate": 48000,
"m.maxbr": 32000, // Matrix-specific extension to advertise the max bitrate of this track
}
},
{
"kind": "video",
"id": "zbhsbdhzs",
"label": "Logitech Webcam",
"settings": {
"width": 1280,
"height": 720,
"facingMode": "user",
"frameRate": 30.0,
"m.maxbr": 512000,
Comment on lines +193 to +198

Are these settings required for some matrix-specific logic? (maybe bridges that may need this data about a stream or a track?)

I'm just wondering if they are useful in the general case (for instance, why would we need to know the facingMode of a camera?). When we're talking about the SFU, the SFU does not really need to know the facing mode of a user's camera, I guess, and I'm also not sure the other call participants would benefit from this information.

The only case where I assume the information about camera mode etc might be useful is when there is a specific app that runs over Matrix and needs to advertise the properties of the video/audio streams in order to implement a specific logic. But in this case, we're talking about application-specific data, i.e. something that must be the logic of the app rather than part of a [generic] Matrix protocol.

I think generally we only need the stream and track IDs, a purpose (for the use case of conference / using WebRTC for calls), and, perhaps basic information about certain tracks like the width and height of the video (theoretically it's not required, because we'll be able to access it when the track is received, but practically we would need it for the simulcast implementation on the SFU side, so such information would be useful for the conference use cases).

}
},
],
},
{
"purpose": "m.screenshare",
"id": "suigv372y8378",
"tracks": [
{
"kind": "video",
"id": "xbhsbdhzs",
"label": "My Screenshare",
"settings": {
"width": 3072,
"height": 1920,
"cursor": "moving",
"displaySurface": "monitor",
"frameRate": 30.0,
"m.maxbr": 768000,
}
},
]
}
]
}
]
}
],
"m.expires_ts": 1654616071686
}
}
```
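
As a non-authoritative sketch of the `purpose` rule above, a receiving client might filter a device's advertised feeds as follows. The `Feed` and `CallDevice` interfaces, the `usableFeeds` helper, and the `org.example.unknown` purpose are hypothetical names introduced for illustration; only the field names and the two known purposes come from the example event.

```typescript
// Hypothetical shapes mirroring fields in the example `m.call.member` event.
interface Feed {
  purpose: string;
  id: string; // WebRTC MediaStream id
}

interface CallDevice {
  device_id: string;
  session_id: string;
  feeds: Feed[];
}

// The two purposes named in the proposal text; anything else is ignored.
const KNOWN_PURPOSES: ReadonlySet<string> = new Set(["m.usermedia", "m.screenshare"]);

// Return only the feeds whose purpose the client understands.
function usableFeeds(device: CallDevice): Feed[] {
  return device.feeds.filter((feed) => KNOWN_PURPOSES.has(feed.purpose));
}

const exampleDevice: CallDevice = {
  device_id: "ASDUHDGFYUW",
  session_id: "GHKJFKLJLJ",
  feeds: [
    { purpose: "m.usermedia", id: "qegwy64121wqw" },
    { purpose: "m.screenshare", id: "suigv372y8378" },
    { purpose: "org.example.unknown", id: "abc" }, // hypothetical, dropped by the filter
  ],
};

console.log(usableFeeds(exampleDevice).map((feed) => feed.id));
```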

This builds on [MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077), which describes streams in `m.call.*` events via a `sdp_stream_metadata` field.

** TODO: Do we need all of this data? Why would we need it? **

IMHO we only need to submit the minimally required information about streams and tracks that would be enough for the calls to happen (with or without the SFU).

The details (label, displaySurface, facingMode, or even bitrate) are probably something that we generally speaking don't need. The apps that run on top of the Matrix could always exchange arbitrary metadata about their devices/tracks/streams if they need it: we anyway won't be able to describe all possible use cases for the matrixRTC in advance since we don't really know all of the use-cases - strictly speaking, video and audio tracks are not necessarily coming from a webcam or a microphone in a general [matrixRTC] case, they could be anything ranging from a mirrorless camera feed attached to the laptop to an audio output of digital instruments forwarded via a DAW).

** TODO: This doesn't follow the MSC3077 format very well - can we do something
about that? **
** TODO: Add tracks field **
** TODO: Add bitrate/format fields **

Clients should do their best to ensure that calls in `m.call.member` state are removed when the member leaves the call. However, there will be cases where the device loses network connectivity or power, the application is force-closed, or it crashes. If the `m.call.member` state contains stale device data, the call setup will fail. Clients should re-attempt invites up to 3 times before giving up on calling a member.
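
The retry rule could be sketched roughly as below. This is only an illustration: `SendInvite` and `inviteWithRetries` are hypothetical names, and the sketch assumes that "re-attempt up to 3 times" means one initial attempt followed by at most three retries, which the text does not state explicitly.

```typescript
// Hypothetical callback: resolves to true if call setup with the member succeeded.
type SendInvite = () => Promise<boolean>;

// Assumption: one initial attempt plus up to `maxRetries` re-attempts.
async function inviteWithRetries(sendInvite: SendInvite, maxRetries = 3): Promise<boolean> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      if (await sendInvite()) {
        return true;
      }
    } catch {
      // Treat a thrown error like a failed setup and move on to the next attempt.
    }
  }
  return false; // Give up on calling this member.
}
```

A real client would probably also wait between attempts and stop early if the member's `m.call.member` state changes; neither is shown here.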

### Call setup

Call setup then uses the normal `m.call.*` events, except they are sent over to-device messages to the relevant devices (encrypted via Olm). This means:
robertlong marked this conversation as resolved.
Contributor

@bwindels bwindels Feb 14, 2022

I guess the idea here is that clients should identify the sender (user_id and device_id) of the to_device message through the sender_key of the olm-encrypted message, to know which peer the message is from in a full-mesh group call? Perhaps it could be valuable to spell that out.

Contributor

I've updated the spec to include this now, we send the sender's device_id along in the m.call.invite event. But I think this is a better method? I haven't seen sender_key yet, could you point me to the docs on that? Also we're not using olm-encryption just yet so it might still be better to include the device_id field in the content? Not sure.

Contributor

You can read about sender_key here and here. The idea is that you query the keys for the given user (if not fetched already) and verify that the keys match the sender_key of the olm message you received. The advantage of doing it this way is that the user id and device id can almost not be spoofed, assuming you have marked the other device/user as trusted, either manually or through cross-signing.

Perhaps the current impl doesn't encrypt with olm yet, but does it make sense to spec that? Is there a good reason to offer a non-encrypted version of the signalling?


* When initiating a 1:1 call, the `m.call.invite` is sent to the devices listed in the `m.call.member` event's `m.devices` array using the `device_id` field.
Contributor

If we are using m.ring as the intent we don't have the recipient's device id. So how do we ring them? Do we send the m.call.invite as a regular message event?

Member Author

i was assuming that m.ring will ring every device in the room

Contributor

Ah, right. So send to-device messages to each of the users and use `*` to target all of the user's devices. I'll add a clarification for that.

Contributor

Doesn't the ringing happen by the mere fact that there is a non-terminated m.call state event with that intent? The to_device messages only start flowing once there are at least two m.call.members in the call, e.g. you've picked up already with a specific device.

Contributor

A "1:1 call" does not seem to be well defined within the context of this MSC. I assume what is meant here is a m.ring call (that could have 2 or more participants). Can we please use less ambiguous terminology?

SimonBrandner marked this conversation as resolved.
* `m.call.*` events sent via to-device messages should also include the following properties in their content:
Member Author

we seem to have completely missed seq, needed to make up for todevice events having no intrinsic ordering otherwise: matrix-org/matrix-js-sdk@7f21f56

Maybe a dumb question here: does it mean that the order of the To-Device messages is not guaranteed? I'm asking since I've processed the To-Device messages on the SFU under the assumption that they come at the same order in which they were sent by the client. If the order is not guaranteed, it may cause some interesting (undesired) effects, e.g. if the "Invite -> Hangup" sequence comes as "Hangup -> Invite" on a server, the end effect will not be what the user expects. Or Invite -> SelectAnswer as SelectAnswer -> Invite.

Contributor

Yes, it's not guaranteed and we should rely on the seq field

@daniel-abramov daniel-abramov Dec 5, 2022

> Yes, it's not guaranteed and we should rely on the seq field

Oh, that's interesting. Should we document how to deal with it and what the semantics are? I've noticed that a new commit has been pushed recently to add the seq to the request, and it says that seq starts with 0 and gets incremented with each message. But where is the counter stored? I can imagine that it's a per-device counter (meaning that 2 different devices may generate the same seq)? And what happens when the value overflows?

It also has some practical implications: how are we (as receivers) expected to handle it in a proper way? - I.e. imagine that we receive a "New ICE candidates message" on the SFU with a seq=15 while the previous message from the sender had seq=5. We probably don't want to handle the message with seq=15 right now if we have not yet received the previous 10 messages (since the current message with seq=15 may not even be useful by the moment we process the previous 10, or maybe it's related to the invite that has been sent in seq=10). This means that in order to handle a message with seq=15, we would need to buffer a couple more messages (messages that went before seq=15) before we take a decision on whether to handle it.

However, this poses certain questions, namely if we're communicating with 1000 devices (SFU use case), this means we would need to store the lastStoredSeq for 1000 users, the problem is that we don't really know when to release the counter for a particular user from memory (i.e. when to remove it from the map since we don't know in advance the pool of devices that would communicate with us and we may theoretically receive a message from any of them) which means that the memory usage would grow indefinitely and once the SFU is restarted, we'll have the counters lost.

Another issue is that the sender can attack the receiver by sending a message with seq=1 followed by a message with seq=99999999999 and then another with seq=99999999998 and another with seq=99999999997 and, knowing that the other side buffers them since it can't process them, it will send them until the other side gets killed due to the OOM.

* `conf_id` - The group call id listed in `m.call`
Contributor

Do we want to use a name other than conf_id? I think this is left over from the m.conf event which was renamed to m.call. So maybe group_call_id is better?

Member Author

yeah, much prefer group_call_id

Contributor

I wonder if call_id wouldn't be enough since we might want to transition to this MSC for all calls and thus call_id seems sufficient

Contributor

Currently the to_device events sent between members of the group calls also have a call_id property, inherited from pre-3401 calls. We should probably address the existence of the other property and either plan a phase out or just use something less ambiguous like group_call_id.

It would also be good to use the same property name in both the m.call.member state event (now m.call_id) and to_device events (now conf_id).

Contributor

Imo, on top of the improvements mentioned above, this proposal and the eventual resulting spec should stick to the wording "call" even in text, to clarify that "conference" is not another concept at work here.

* `dest_session_id` - The recipient's session id. Incoming messages with a `dest_session_id` that doesn't match your current session id should be discarded.
* In addition to the fields above, `m.call.invite` events sent via to-device messages should include the following properties:
* `device_id` - The message sender's device id. Used by the opponent member to send response to-device signalling messages even if the `m.call.member` event has not been received yet.
Member Author

EC also sends device_id, sender_session_id and dest_session_id for every toDevice event: https://github.com/matrix-org/matrix-js-sdk/blob/353d6bab47ab928aab089e897f5475942fcfa0ac/src/webrtc/call.ts#L2008-L2010

* `sender_session_id` - Like the `device_id`, the `sender_session_id` is used by the opponent member to filter out messages unrelated to the sender's session even if the `m.call.member` event has not been received yet.
Contributor

I think I need a better way of explaining both the device_id and sender_session_id here.

* For 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.call` event propagates, to minimise latency. Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
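
To make the discard rule for `dest_session_id` concrete, here is a minimal sketch of how a receiving client might decide whether to process an incoming to-device `m.call.*` event. The `GroupCallToDeviceContent` interface and `shouldHandle` function are hypothetical names. Dropping events whose `conf_id` does not match the call the client is currently in is an assumption of this sketch; the text above only requires discarding on a `dest_session_id` mismatch.

```typescript
// Hypothetical shape of the extra to-device fields listed above.
interface GroupCallToDeviceContent {
  conf_id: string;
  dest_session_id: string;
  device_id?: string; // included on m.call.invite
  sender_session_id?: string; // included on m.call.invite
}

function shouldHandle(
  content: GroupCallToDeviceContent,
  localSessionId: string,
  activeConfId: string,
): boolean {
  // Required by the text above: discard messages addressed to another session.
  if (content.dest_session_id !== localSessionId) {
    return false;
  }
  // Assumption of this sketch: ignore signalling for a different group call.
  return content.conf_id === activeConfId;
}

// A message addressed to a previous session (e.g. before a page refresh) is dropped.
console.log(shouldHandle({ conf_id: "cvsiu2893", dest_session_id: "OLDSESSION" }, "GHKJFKLJLJ", "cvsiu2893"));
```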

## Example Diagrams

**Legend**

| Arrow Style | Description |
|-------------|-------------|
| Solid | [State Event](https://spec.matrix.org/latest/client-server-api/#types-of-room-events) |
| Dashed | [Event (sent as to-device message)](https://spec.matrix.org/latest/client-server-api/#send-to-device-messaging) |

### Basic Call

```mermaid
sequenceDiagram
autonumber
participant Alice
participant Room
participant Bob
Alice->>Room: m.call
Alice->>Room: m.call.member
Bob->>Room: m.call.member
Alice-->>Bob: m.call.invite
Alice-->>Bob: m.call.candidates
Alice-->>Bob: m.call.candidates
Bob-->>Alice: m.call.answer
Bob-->>Alice: m.call.candidates
Alice-->>Bob: m.call.select_answer
bwindels marked this conversation as resolved.
Contributor

@bwindels bwindels Mar 30, 2022

I wonder if the m.call.select_answer event makes sense for MSC 3401 calls.

AIUI, the purpose of m.call.select_answer is to make your other devices stop ringing once you pick up on one of your devices. To_device messages only seem to be sent to devices already in the call (e.g. are present in the m.call.member state event for a given user), so they would not be sent to devices that haven't picked up yet.

Perhaps to stop ringing, the other devices should just monitor the m.call.member event for their own user and see if it exists and a device has been added already.

```

## Potential issues

To-device messages are point-to-point between servers, whereas today's `m.call.*` messages can transitively traverse servers via the room DAG, thus working around federation problems. In practice if you are relying on that behaviour, you're already in a bad place.

## Alternatives

There are many different ways to do this. The main alternative considered was not to use state events to track membership, but instead to gossip it via either to-device or DC messages between participants. However, this fell apart due to trust: you effectively end up reinventing large parts of Matrix layered on top of to-device or DC. So you might as well publish and distribute the participation data in the DAG rather than reinvent the wheel.

An alternative to to-device messages is to use DMs. You still risk gappy sync problems though due to lots of traffic, as well as the hassle of creating DMs and requiring canonical DMs to set up the calls. It does make debugging easier though, rather than having to track encrypted ephemeral to-device msgs.

## Security considerations

State events are not encrypted currently, and so this leaks that a call is happening, and who is participating in it, and from which devices.
Contributor

@ShadowJonathan ShadowJonathan Sep 23, 2021

plus that it happened in the past and who participated in it there and then, by correlating it with the corresponding m.call.member state events.


Malicious users in a room could try to sabotage a conference by overwriting the `m.call` state event of the current ongoing call.

## Unstable prefix

| stable event type | unstable event type |
|-------------------|---------------------|
| m.call | org.matrix.msc3401.call |
| m.call.member | org.matrix.msc3401.call.member |
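
A client that wants to accept both the stable and unstable names could keep the table above as a lookup. This is only a sketch; `UNSTABLE_EVENT_TYPES` and `matchesEventType` are hypothetical names, not part of any SDK.

```typescript
// Stable event type -> unstable event type, copied from the table above.
const UNSTABLE_EVENT_TYPES: Record<string, string> = {
  "m.call": "org.matrix.msc3401.call",
  "m.call.member": "org.matrix.msc3401.call.member",
};

// True if `actual` is either the stable type or its unstable equivalent.
function matchesEventType(stable: string, actual: string): boolean {
  return actual === stable || actual === UNSTABLE_EVENT_TYPES[stable];
}

console.log(matchesEventType("m.call.member", "org.matrix.msc3401.call.member"));
```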