Scalable Video Conferencing With MediaSoup
12 min read
- Introduction to WebRTC
- How WebRTC Works
- Why Peer-To-Peer Isn’t Enough
- A Quick Look at SFU and MCU
- Introduction to MediaSoup
- Building a Video Conferencing App with MediaSoup
- Scaling MediaSoup for Large Conferences
- Conclusion
- Other Resources
In this post, we’ll explore the core concepts behind building one-on-one and multi‑party video conferencing apps — from WebRTC basics to SFU, MCU, MediaSoup, and advanced scaling strategies.
Introduction to WebRTC
WebRTC (Web Real-Time Communication) is a set of open source standards, protocols, codecs, and APIs that enable real-time communication directly between web browsers and native applications.
By using WebRTC, developers can add low-latency, high-quality audio, video, and data-sharing capabilities to their applications, leveraging peer-to-peer connections and eliminating the need for intermediary servers to transmit media.
How WebRTC Works
1. Signalling
Before two peers can establish a WebRTC connection, they need to exchange information about each other, such as network addresses and session details.
This is done through a signalling server, which is responsible for facilitating the initial information exchange between the peers.
The signalling server can be implemented with any technology, though WebSockets are the most common choice. It does not handle media transmission itself; it only helps the peers discover each other and negotiate a connection.
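For illustration, a minimal signalling relay might look like this (a sketch using Socket.IO, which the later examples also assume; the room and event names are made up for this post):
import { Server } from "socket.io";

const io = new Server(3000);

io.on("connection", (socket) => {
  // Peers join a room; signalling messages are relayed to the other peers in it
  socket.on("join", (roomId) => socket.join(roomId));
  for (const event of ["offer", "answer", "ice-candidate"]) {
    socket.on(event, (roomId, payload) => {
      socket.to(roomId).emit(event, payload);
    });
  }
});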
2. Peer Connection
Once the peers have exchanged the necessary information through the signalling server, they can establish a direct peer-to-peer connection that enables them to send and receive audio, video, and data streams directly, without going through an intermediary server.
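As a rough sketch of the offer/answer exchange (assuming a signalling channel like the one above that relays these events to the remote peer; roomId comes from the join step):
const pc = new RTCPeerConnection();

// Caller side: create and send an offer
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
socket.emit("offer", roomId, offer);

// Callee side: receive the offer and reply with an answer
socket.on("offer", async (offer) => {
  await pc.setRemoteDescription(offer);
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  socket.emit("answer", roomId, answer);
});

// Caller side: apply the answer
socket.on("answer", (answer) => pc.setRemoteDescription(answer));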
3. Media Transmission
Now that the connection is established, the peers can start transmitting media streams to each other.
WebRTC provides APIs for capturing audio and video from the user’s device, encoding and decoding media streams, and sending and receiving data over the peer-to-peer connection.
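Continuing the sketch above, capturing and wiring up media might look like this (remoteVideo is an assumed video element):
// Capture local audio and video and send them over the connection
const stream = await navigator.mediaDevices.getUserMedia({
  audio: true,
  video: true,
});
stream.getTracks().forEach((track) => pc.addTrack(track, stream));

// Render incoming remote tracks
pc.ontrack = ({ streams }) => {
  remoteVideo.srcObject = streams[0];
};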
4. NAT Traversal
WebRTC relies on ICE (Interactive Connectivity Establishment) to handle NAT (Network Address Translation) traversal and establish a connection between peers that may be behind firewalls or NATs.
This process might involve using STUN (Session Traversal Utilities for NAT) and TURN (Traversal Using Relays around NAT) servers to facilitate the connection.
These protocols are essential for ensuring that peers can connect reliably, regardless of their network configurations.
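In practice, STUN and TURN servers are supplied when constructing the peer connection, and ICE candidates are exchanged over the signalling channel (the TURN entry below is a placeholder):
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:stun.l.google.com:19302" },
    {
      urls: "turn:turn.example.com:3478", // placeholder TURN server
      username: "user",
      credential: "secret",
    },
  ],
});

// Trickle ICE: exchange candidates through the signalling channel
pc.onicecandidate = ({ candidate }) => {
  if (candidate) socket.emit("ice-candidate", roomId, candidate);
};
socket.on("ice-candidate", (candidate) => pc.addIceCandidate(candidate));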

Why Peer-To-Peer Isn’t Enough
While WebRTC’s peer-to-peer mesh architecture works great for one-on-one communication, it has limitations when it comes to multi-party video conferencing.
As the number of participants grows, the bandwidth and processing load on each peer also increase significantly.
In a mesh of N participants, each peer must send its stream to, and receive a stream from, every other peer (N-1 streams each way), so the total number of streams grows as O(N²). For example, with 8 participants each sending 1.5 Mbps video, every peer uploads roughly 10.5 Mbps, which quickly leads to network congestion and degraded performance.
Once the number of participants exceeds a certain threshold (typically around 4-6 participants), the peer-to-peer model becomes highly inefficient and impractical.

At this point, it becomes necessary to introduce a media server to manage and distribute media streams more efficiently.
A Quick Look at SFU and MCU
When building multi-party video conferencing applications, two main architectures are used: Selective Forwarding Unit (SFU) and Multipoint Control Unit (MCU).
Selective Forwarding Unit (SFU)
An SFU is a media server that receives media streams from multiple participants and selectively forwards them to other participants, without any mixing or processing of the streams.
In an SFU architecture, each participant sends their media stream to the SFU, which then forwards the streams to other participants based on their subscriptions. This approach reduces the bandwidth requirements for each participant, as they only need to send their stream once to the SFU.
Additionally, SFUs can optimize bandwidth usage and video quality with techniques such as dominant speaker detection, simulcast, and scalable video coding (SVC), which we will cover later.

Multipoint Control Unit (MCU)
An MCU is a more complex media server that receives media streams from multiple participants, mixes them into a single composite stream, and then sends that composite stream back to each participant.
This approach simplifies the client-side implementation, as each participant only needs to receive one stream from the MCU.

However, MCUs can be resource-intensive and may introduce additional latency due to the mixing process.
They are also much less scalable than SFUs, especially for large conferences. While they can be useful for specific use cases, they are generally less common in modern video conferencing applications compared to SFUs.
SFUs are typically preferred for their scalability and efficiency, especially in scenarios with a large number of participants. They have become the standard architecture for most modern video conferencing applications, balancing performance, scalability, and complexity.
In the next section, we will explore MediaSoup, a popular open-source SFU that provides a robust foundation for building scalable video conferencing applications.
Introduction to MediaSoup
MediaSoup is not a standalone server, but rather just a Node.js module (or Rust crate) that you can integrate into your application.
It acts as an SFU, receiving media streams from participants and relaying them to others.
Because MediaSoup does not transcode or mix media, it is highly scalable and requires far fewer resources than an MCU.
Each participant can select which streams they want to receive, allowing for efficient bandwidth usage. And since participants get the streams separately, they can have a personalized layout, choosing which streams to display and how.
MediaSoup is signalling agnostic: it does not mandate any specific signalling protocol, so you can implement your signalling server using any technology of your choice.
It includes a client library (mediasoup-client), which simplifies the process of connecting to a MediaSoup server and handling media streams.
Building a Video Conferencing App with MediaSoup

1. Setting up workers and routers
The server application creates one or more Workers, each a separate C++ subprocess that runs on a single CPU core. Each Worker can host multiple Routers, which are responsible for managing media streams.
Note: In an actual production application, you would typically create multiple workers (one per CPU core) to fully utilize the server’s resources. Additionally, you would create multiple routers to separate different conferences or rooms. For simplicity, this example uses one single worker and router.
import { createWorker } from "mediasoup";
// Create Worker
const worker = await createWorker(config);
// Create Router
const router = await worker.createRouter({
  mediaCodecs: [
    {
      kind: "audio",
      mimeType: "audio/opus",
      clockRate: 48000,
      channels: 2,
    },
    {
      kind: "video",
      mimeType: "video/H264",
      clockRate: 90000,
      parameters: {},
    },
  ],
});
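As the note above says, production deployments usually spawn one Worker per core. A minimal sketch of that pattern (the round-robin helper is an illustration, not a MediaSoup API):
import os from "node:os";
import { createWorker } from "mediasoup";

// Spawn one Worker per CPU core
const workers = await Promise.all(os.cpus().map(() => createWorker(config)));

// Naive round-robin Router placement (hypothetical helper)
let next = 0;
async function createRouterOnNextWorker(mediaCodecs) {
  const worker = workers[next];
  next = (next + 1) % workers.length;
  return worker.createRouter({ mediaCodecs });
}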
2. Device loading
The client application loads its Device by providing it with the RTP capabilities of the server side Router.
import { Device } from "mediasoup-client";
// Create Device
const device = new Device();
// Ask server for RTP capabilities¹
const routerRtpCapabilities = await socket.emitWithAck("rtp-capabilities");
// Load Device with server RTP capabilities
await device.load({ routerRtpCapabilities });

// Send router RTP capabilities to client¹
socket.on("rtp-capabilities", (ack) => {
  ack(router.rtpCapabilities);
});
3. Creating transports
A WebRTC Transport must first be created in the Router, and then replicated in the client application.
// Create producer Transport¹
const producerTransportParams = await socket.emitWithAck(
  "create-producer-transport"
);
const producerTransport = device.createSendTransport(producerTransportParams);
// Create consumer Transport²
const consumerTransportParams = await socket.emitWithAck(
  "create-consumer-transport"
);
const consumerTransport = device.createRecvTransport(consumerTransportParams);

socket.on("create-producer-transport", async (ack) => {
  // Create server side producer Transport
  producerTransport = await router.createWebRtcTransport(config);
  // Send producer Transport parameters to client¹
  const producerTransportParams = {
    id: producerTransport.id,
    iceParameters: producerTransport.iceParameters,
    iceCandidates: producerTransport.iceCandidates,
    dtlsParameters: producerTransport.dtlsParameters,
  };
  ack(producerTransportParams);
});
socket.on("create-consumer-transport", async (ack) => {
  // Create server side consumer Transport
  consumerTransport = await router.createWebRtcTransport(config);
  // Send consumer Transport parameters to client²
  const consumerTransportParams = {
    id: consumerTransport.id,
    iceParameters: consumerTransport.iceParameters,
    iceCandidates: consumerTransport.iceCandidates,
    dtlsParameters: consumerTransport.dtlsParameters,
  };
  ack(consumerTransportParams);
});
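The config object passed to createWebRtcTransport is not shown above. A plausible shape, with placeholder addresses (recent MediaSoup versions favor listenInfos over listenIps):
// Hypothetical WebRTC transport options; announcedIp must be the server's
// public IP so that the ICE candidates it advertises are reachable by clients
const config = {
  listenIps: [{ ip: "0.0.0.0", announcedIp: "203.0.113.10" }], // placeholder IP
  enableUdp: true,
  enableTcp: true,
  preferUdp: true,
};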
4. Producing media
Once the send Transport is created, the client application can produce multiple audio and video tracks on it.
The client application obtains a track (e.g., via getUserMedia()) and calls produce() on the Transport instance.
- The transport will emit “connect” if this is the first call to produce(); the client application then signals the DTLS parameters to the server, which connects the server side Transport.
- The transport will emit “produce”; the client application then signals the event parameters to the server, which creates the server side Producer.
- Finally, produce() will resolve with a client side Producer instance.
// Get user media
const stream = await navigator.mediaDevices.getUserMedia({
  audio: true,
  video: true,
});
const track = stream.getVideoTracks()[0];
// Create Producer
const producer = await producerTransport.produce({ track });
// Render local video
localVideo.srcObject = new MediaStream([track]);
// ...
producerTransport.on("connect", async ({ dtlsParameters }, cb) => {
  // Signal DTLS parameters to server¹
  await socket.emitWithAck("connect-transport", dtlsParameters);
  cb();
});
producerTransport.on("produce", async (parameters, cb) => {
  // Signal Producer parameters to server²
  const id = await socket.emitWithAck("produce", parameters);
  // Return server side Producer id to client
  cb({ id });
});

socket.on("connect-transport", async (dtlsParameters, ack) => {
  // Connect server side Transport¹
  await producerTransport.connect({ dtlsParameters });
  ack();
});
socket.on("produce", async (parameters, ack) => {
  // Create server side Producer²
  const producer = await producerTransport.produce(parameters);
  // Send Producer id to client
  ack(producer.id);
});
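Note that the receive Transport also emits “connect” the first time consume() is called on it, so it needs the same wiring as the send Transport (a sketch, with a hypothetical event name to tell the two transports apart on the server):
consumerTransport.on("connect", async ({ dtlsParameters }, cb) => {
  // Signal DTLS parameters for the consumer Transport
  await socket.emitWithAck("connect-consumer-transport", dtlsParameters);
  cb();
});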
5. Consuming media
Once the receive Transport is created, the client application can consume multiple audio and video tracks on it.
However, the order is reversed: here the Consumer must be created on the server first.
The client application signals its RTP capabilities to the server, which checks whether the Device can consume the Producer.
The server application then creates a server side Consumer and transmits the event parameters to the client, which creates a matching client side Consumer instance.
Note: When creating a server side Consumer it is recommended to set paused to true, and once the client has created its client side Consumer, unpause the server side Consumer. This optimization avoids unnecessary media transmission while the client is setting up its Consumer.
// Signal RTP capabilities to server¹
const consumerParams = await socket.emitWithAck("consume", {
  rtpCapabilities: device.rtpCapabilities,
});
// Create Consumer
const consumer = await consumerTransport.consume(consumerParams);
// Render remote video
remoteVideo.srcObject = new MediaStream([consumer.track]);
// Resume the server side Consumer²
await socket.emit("resume", { consumerId: consumer.id });

socket.on("consume", async (data, ack) => {
  // Create Consumer
  consumer = await consumerTransport.consume({
    producerId: producer.id,
    rtpCapabilities: data.rtpCapabilities,
    paused: true, // Start paused (as explained above)
  });
  // Send Consumer parameters to client¹
  const consumerParams = {
    id: consumer.id,
    producerId: producer.id,
    kind: consumer.kind,
    rtpParameters: consumer.rtpParameters,
  };
  ack(consumerParams);
});
socket.on("resume", async () => {
  // Resume server side Consumer²
  await consumer.resume();
});

Scaling MediaSoup for Large Conferences
1. Distribute Routers, Workers and Hosts
Depending on the host's CPU capabilities, a MediaSoup C++ subprocess (a Worker) can typically handle around 500 consumers in total.
- The server side application using MediaSoup should launch as many Workers as required (no more than the number of CPU cores on the host), and distribute Routers across them.
- If higher capacity is required, the application can be scaled horizontally by deploying it across multiple hosts, distributing Routers across them.
- For very large conferences, the number of streams per core may become a limitation. In such cases, the pipeToRouter feature can be used to interconnect Routers running in separate Workers (on different CPU cores), or even on different hosts, as sketched below.
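A minimal sketch of piping a Producer between two Routers (routerA and routerB are assumed to exist, possibly in different Workers):
// Make a Producer living in routerA consumable from routerB
const { pipeConsumer, pipeProducer } = await routerA.pipeToRouter({
  producerId: producer.id,
  router: routerB,
});
// Clients attached to routerB can now consume pipeProducer.id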
There is no universal right way to scale. Since MediaSoup is very low-level, it does not prevent you from implementing your own scaling strategies based on your specific use case.
2. Dominant Speaker Detection
Displaying all video streams in large conferences is both impractical and inefficient. The user interface quickly becomes cluttered, making it difficult for participants to focus on the most relevant content.
In practice, even in conferences with many participants, only a few are actively speaking at any given time.
To address this, implementing dominant speaker detection is a common and effective strategy. This approach involves identifying the most prominent speaker in the conference, and prioritizing their video stream for display. It optimizes performance by reducing the number of video streams that need to be consumed and rendered by each participant.
For example, a video conferencing application might highlight the video stream of the current dominant speaker, along with a few recent speakers, rather than displaying all participants’ streams simultaneously.
MediaSoup provides a built-in ActiveSpeakerObserver that monitors the speech activity of the selected audio Producers.
// Create ActiveSpeakerObserver
const activeSpeakerObserver = await router.createActiveSpeakerObserver({
  interval: 300,
});
activeSpeakerObserver.on("dominantspeaker", ({ producer }) => {
  // Notify clients about the dominant speaker
  socket.emit("dominant-speaker", { producerId: producer.id });
});
// ...
// Add audio Producers to the ActiveSpeakerObserver
activeSpeakerObserver.addProducer({ producerId: audioProducer.id });
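On the client, that notification can drive the layout, for example (highlightSpeaker is a hypothetical UI helper):
// Client side: promote the dominant speaker's tile in the layout
socket.on("dominant-speaker", ({ producerId }) => {
  highlightSpeaker(producerId); // hypothetical UI helper
});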
3. Adaptive Streaming
Simulcast
Simulcast consists of sending multiple versions of the same video stream at different qualities.
Each version is sent as a separate RTP stream (each with its own SSRC or RID), allowing the SFU to forward the most appropriate version to each participant based on their network conditions and device capabilities.
If the Producer uses simulcast with 3 streams, MediaSoup still forwards a single, continuous stream to each Consumer, switching between the three as network conditions change.
// Create Producer with simulcast
const producer = await producerTransport.produce({
  track,
  encodings: [
    { ssrc: 111110, active: true, maxBitrate: 100000 },
    { ssrc: 111111, active: true, maxBitrate: 300000 },
    { ssrc: 111112, active: true, maxBitrate: 900000 },
  ],
});
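On the server, the application can also pin a Consumer to a given simulcast layer, for example to cap quality for thumbnail-sized tiles (a sketch; the layer index depends on the encodings above):
// Prefer the middle simulcast stream (spatial layer 1) for this Consumer
await consumer.setPreferredLayers({ spatialLayer: 1 });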
Scalable Video Coding (SVC)
Scalable Video Coding (SVC) consists of encoding a single video stream into multiple layers of quality that build upon each other.
Instead of sending multiple separate RTP streams like in simulcast, SVC sends a single RTP stream that contains all the layers.
The SFU can then selectively forward the appropriate layers to each participant based on their network conditions and device capabilities.
// Create Producer with SVC
const producer = await producerTransport.produce({
  track,
  encodings: [
    // "L1T2" = 1 spatial layer, 2 temporal layers
    { scalabilityMode: "L1T2", maxBitrate: 900000 },
  ],
});

Conclusion
In this post, we learned about the fundamentals of WebRTC, the limitations of peer-to-peer, media server architectures, MediaSoup basics, and scaling strategies.
If you want to dive deeper into building video conferencing applications with MediaSoup, consider exploring the official MediaSoup documentation, which provides comprehensive guides, API references, and examples to help you get started.