Building Presence at Scale: Who Is Online and How to Track It

Presence is one of those features that looks trivial until you actually build it. Showing a green dot next to a username seems simple. But the moment you have thousands of users spread across hundreds of rooms, connecting and disconnecting on flaky mobile networks, switching tabs, backgrounding the app, and coming back after hours offline, it stops being simple very quickly.

This post covers what presence actually means, why the naive approaches fall apart, and how to build something that holds up at scale.

What Presence Actually Means

Presence is the real-time state of a user or device as observed by others. The most basic form is binary: online or offline. Most production systems need more than that.

A richer presence model typically includes:

Status. Online, offline, away, busy, do not disturb.
Last seen. A timestamp of the most recent activity or heartbeat.
Custom metadata. The current room a user is in, what they are viewing, their display name, avatar URL, or anything your application needs to show other users.
Device context. Whether the user is on mobile or desktop, which can affect what status transitions mean.

Custom metadata is where presence gets genuinely useful. A collaborative document editor wants to show which paragraph each user is editing. A support tool wants to show which ticket an agent has open. A multiplayer game wants to show the player's current lobby and score. All of that is presence data, not just a green dot.
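To make the model above concrete, here is a minimal sketch of what a presence record could look like in memory. The field names, the status list, and the helpers createPresenceRecord and updateMetadata are illustrative assumptions, not a fixed schema.

```javascript
// Hypothetical presence record shape; adjust the fields to your application.
const STATUSES = ['online', 'offline', 'away', 'busy', 'dnd']

function createPresenceRecord({ userId, status = 'online', device = 'desktop', metadata = {} }, now = Date.now()) {
  if (!STATUSES.includes(status)) {
    throw new Error(`Unknown status: ${status}`)
  }
  // Copy metadata so later changes by the caller do not leak into the record.
  return { userId, status, device, lastSeen: now, metadata: { ...metadata } }
}

// Merge a partial metadata update and refresh lastSeen, returning a new record.
function updateMetadata(record, patch, now = Date.now()) {
  return { ...record, lastSeen: now, metadata: { ...record.metadata, ...patch } }
}

const alice = createPresenceRecord(
  { userId: 'u_abc123', metadata: { displayName: 'Alice', editingParagraph: 3 } },
  1000
)
const moved = updateMetadata(alice, { editingParagraph: 7 }, 2000)
```

Keeping metadata as a separate, mergeable object means a client can send small patches (for example, only the paragraph it moved to) instead of resending the whole record.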

The Naive Approach: Polling

The most common first attempt at presence is a REST endpoint that clients call periodically to report that they are still alive, and a second endpoint that other clients poll to fetch the current status of users they care about.

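// Client sends a heartbeat every 30 seconds
setInterval(async () => {
  try {
    await fetch('/api/presence/heartbeat', { method: 'POST' })
  } catch (err) {
    // A failed heartbeat is simply retried on the next tick
    console.warn('heartbeat failed', err)
  }
}, 30_000)

// Other clients poll for status every 5 seconds
setInterval(async () => {
  try {
    const res = await fetch('/api/presence?userIds=u1,u2,u3')
    if (res.ok) updatePresenceUI(await res.json())
  } catch (err) {
    console.warn('presence poll failed', err)
  }
}, 5_000)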

This works for a demo. It falls apart in production for several reasons.

First, the polling interval is a trade-off you cannot win. Poll every 5 seconds and you hammer your API with requests. Poll every 30 seconds and the UI feels stale: a user can log out and their presence stays green for half a minute.

Second, the load scales badly. With 1,000 users each polling every 5 seconds, you have 200 requests per second just to maintain a feature that should be passive. Add more users and you add proportionally more load, with zero actual information being delivered most of the time, since status has not changed.

Third, ungraceful disconnects are invisible. If a user closes their laptop, the heartbeat simply stops. Your server only knows they are gone after the heartbeat timeout expires, and you probably set that to around 60 seconds so that one late or dropped heartbeat does not mark a live user as offline. So users stay "online" for up to a minute after they have left.
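The numbers in the points above are easy to check with a back-of-the-envelope calculation. The sketch below uses a hypothetical helper, pollingLoad, and assumes the intervals from the example: a 5-second status poll, a 30-second heartbeat, and a 60-second timeout measured from the last heartbeat.

```javascript
// Rough load and staleness estimate for the polling design.
// All inputs are assumptions taken from the example above.
function pollingLoad({ users, statusPollSeconds, heartbeatSeconds, timeoutSeconds }) {
  const statusPollsPerSecond = users / statusPollSeconds
  const heartbeatsPerSecond = users / heartbeatSeconds
  return {
    statusPollsPerSecond,
    heartbeatsPerSecond,
    totalPerSecond: statusPollsPerSecond + heartbeatsPerSecond,
    // Worst case: the user vanishes right after a heartbeat, the server waits
    // out the full timeout, and observers only notice on their next poll.
    worstCaseStaleSeconds: timeoutSeconds + statusPollSeconds,
  }
}

const load = pollingLoad({
  users: 1000,
  statusPollSeconds: 5,
  heartbeatSeconds: 30,
  timeoutSeconds: 60,
})
console.log(load.statusPollsPerSecond) // 200
console.log(load.worstCaseStaleSeconds) // 65
```

Heartbeats add roughly another 33 requests per second on top of the 200 status polls, and both figures grow linearly with the number of users.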

The Event-Driven Approach

A better model flips the direction. Instead of clients asking "who is online?", the server pushes presence changes to subscribers as they happen. This is the pub/sub model applied to presence.

When a user connects, the server emits a join event to the relevant channel. When they disconnect, it emits a leave event. Subscribers receive these events in real time and update their local state accordingly.

// Server emits on connection
presence.emit('join', {
  userId: 'u_abc123',
  status: 'online',
  metadata: { displayName: 'Alice', avatar: 'https://...' },
  roomId: 'room_xyz',
  timestamp: Date.now(),
})

// Server emits on disconnect
presence.emit('leave', {
  userId: 'u_abc123',
  roomId: 'room_xyz',
  timestamp: Date.now(),
})

Clients subscribe to the presence channel for the rooms they care about. No polling, no wasted requests, no artificial delay. Updates arrive as soon as the server learns of a state change, which for a normal connect or graceful disconnect is typically within milliseconds.
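On the receiving side, each subscriber folds the event stream into local state. The following is a minimal sketch of that reducer, assuming events shaped like the join and leave payloads above; the rooms map and the applyPresenceEvent function are hypothetical names, and how events reach the function depends on your transport.

```javascript
// Local presence state: roomId -> Map(userId -> entry)
const rooms = new Map()

// Apply one join/leave event and return the user IDs now present in the room.
function applyPresenceEvent(type, event) {
  const members = rooms.get(event.roomId) ?? new Map()
  const current = members.get(event.userId)

  // Ignore events older than the entry we already hold (out-of-order delivery).
  if (current && event.timestamp < current.updatedAt) {
    return [...members.keys()]
  }

  if (type === 'join') {
    members.set(event.userId, {
      status: event.status,
      metadata: event.metadata,
      updatedAt: event.timestamp,
    })
  } else if (type === 'leave') {
    members.delete(event.userId)
  }

  rooms.set(event.roomId, members)
  return [...members.keys()]
}

applyPresenceEvent('join', {
  userId: 'u_abc123',
  status: 'online',
  metadata: { displayName: 'Alice' },
  roomId: 'room_xyz',
  timestamp: 1000,
})
```

The timestamp check only guards against a stale event arriving while the user's entry still exists. A production system would more likely rely on per-channel ordering or sequence numbers from the transport.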

Per-Room vs Global Presence

There are two presence scopes you will need to think about separately.

Per-room presence tracks who is in a specific room right now. This is the right model for chat rooms, collaborative documents, video calls, or any context where membership is bounded. A user joining room A does not need to know about users in rooms B through Z.

Global presence (sometimes called a lobby) tracks who is online across the entire application, regardless of which room they are in. This is useful for friend lists, direct message availability indicators, or dashboards where you want to see all connected users.

Most applications need both. A chat product might show global presence on a contacts sidebar while showing per-room presence inside a conversation. The architecture handles them differently. Per-room presence is scoped to the room's pub/sub channel. Global presence runs on a separate shared channel that all authenticated connections subscribe to on connect and unsubscribe from on disconnect.
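The split between the two scopes can be sketched with a tiny in-memory pub/sub. This is only an illustration of the channel layout: the PresenceHub class, the channel names, and the announceJoin helper are assumptions, and a real deployment would use a broker or a hosted service rather than a single process.

```javascript
// Hypothetical channel naming: one shared lobby plus one channel per room.
const LOBBY = 'presence:lobby'
const roomChannel = (roomId) => `presence:room:${roomId}`

class PresenceHub {
  constructor() {
    this.channels = new Map() // channel name -> Set of handlers
  }

  // Register a handler and return a function that unsubscribes it.
  subscribe(channel, handler) {
    if (!this.channels.has(channel)) this.channels.set(channel, new Set())
    this.channels.get(channel).add(handler)
    return () => this.channels.get(channel)?.delete(handler)
  }

  publish(channel, event) {
    for (const handler of this.channels.get(channel) ?? []) handler(event)
  }
}

// A join is announced globally and, separately, to the one room it concerns.
function announceJoin(hub, userId, roomId) {
  hub.publish(LOBBY, { type: 'join', userId })
  hub.publish(roomChannel(roomId), { type: 'join', userId, roomId })
}

const hub = new PresenceHub()
const seenInRoomA = []
const seenInLobby = []
hub.subscribe(roomChannel('A'), (e) => seenInRoomA.push(e.userId))
hub.subscribe(LOBBY, (e) => seenInLobby.push(e.userId))

announceJoin(hub, 'u1', 'A')
announceJoin(hub, 'u2', 'B')
```

After these two calls, the room A subscriber has seen only u1, while the lobby subscriber has seen both users.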

Edge Cases That Will Bite You

Tab Switching

When a user switches to another browser tab, the Page Visibility API fires a visibilitychange event. Most presence systems interpret this as "away" after a short delay. The trick is debouncing: a user switching tabs to check Slack and switching back within 10 seconds should not flip to away and back. Set a threshold, typically 30 to 60 seconds of hidden state, before emitting an away transition.
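The debounce described above can be sketched as a small state machine around a timer. The createAwayTracker function and its onStatus callback are hypothetical. The timer functions are injectable so the logic can be exercised outside a browser, and the commented wiring at the bottom assumes a presence.emit call that your client would provide.

```javascript
// Emit 'away' only after the page has been hidden for thresholdMs,
// and 'online' only if an 'away' was actually emitted before.
function createAwayTracker({
  thresholdMs = 30_000,
  onStatus,
  setTimer = setTimeout,
  clearTimer = clearTimeout,
}) {
  let timer = null
  let away = false

  return function handleVisibility(hidden) {
    if (hidden) {
      // Start the countdown once; repeated hidden events do not restart it.
      if (timer === null && !away) {
        timer = setTimer(() => {
          timer = null
          away = true
          onStatus('away')
        }, thresholdMs)
      }
      return
    }
    // Page is visible again: cancel a pending transition.
    if (timer !== null) {
      clearTimer(timer)
      timer = null
    }
    if (away) {
      away = false
      onStatus('online')
    }
  }
}

// Browser wiring (left as comments so the sketch also runs outside a browser):
// const track = createAwayTracker({
//   onStatus: (status) => presence.emit('status', { status }),
// })
// document.addEventListener('visibilitychange', () => track(document.hidden))
```

A quick tab switch that returns before the threshold cancels the timer and emits nothing, so other users never see the flicker.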

Mobile Backgrounding

On mobile, the OS treats a backgrounded app far more aggressively than a browser treats a hidden tab. The OS can suspend the JavaScript thread entirely, which means the client stops sending keepalives over the WebSocket connection. The server will see the connection as dead after the heartbeat timeout. When the user foregrounds the app, you need to detect that the socket was dropped (the client may still report it as open), reconnect, and then re-emit a join event with updated metadata.
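One way to handle the foreground case is to check both the socket state and how long it has been since anything arrived on it. The sketch below is an assumption-heavy illustration: createResumableConnection, its connect factory, and the join message format are hypothetical, and the staleness check assumes the server sends some application-level message at least every staleMs, since the browser WebSocket API does not expose protocol-level ping frames to scripts.

```javascript
const WS_CONNECTING = 0 // numeric value of WebSocket.CONNECTING
const WS_OPEN = 1 // numeric value of WebSocket.OPEN

function createResumableConnection({ connect, getMetadata, staleMs = 45_000, now = Date.now }) {
  let socket = null
  let lastActivityAt = 0

  function open() {
    const ws = connect()
    lastActivityAt = now()
    ws.onmessage = () => {
      lastActivityAt = now()
    }
    ws.onopen = () => {
      lastActivityAt = now()
      // Re-announce presence with fresh metadata on every (re)connect.
      ws.send(JSON.stringify({ type: 'join', metadata: getMetadata() }))
    }
    socket = ws
  }

  // Call when the app returns to the foreground. Returns true if it reconnected.
  function onForeground() {
    const alive = socket.readyState === WS_CONNECTING || socket.readyState === WS_OPEN
    const stale = now() - lastActivityAt > staleMs
    if (alive && !stale) return false
    socket.close() // harmless if the socket is already closed
    open()
    return true
  }

  open()
  return { onForeground }
}

// Browser wiring (hypothetical URL, left as comments):
// const conn = createResumableConnection({
//   connect: () => new WebSocket('wss://example.invalid/presence'),
//   getMetadata: () => ({ page: location.pathname }),
// })
// document.addEventListener('visibilitychange', () => {
//   if (!document.hidden) conn.onForeground()
// })
```

Checking for staleness in addition to readyState matters because a suspended client can still report its socket as open long after the server has given up on it.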

Ungraceful Disconnects

A graceful disconnect happens when the client calls socket.close() and the server receives the close frame. An ungraceful disconnect happens when the network drops, the process is killed, or the device loses power. In the ungraceful case, the server will not know the client is gone until the heartbeat timeout fires.

The standard solution is a server-side heartbeat: the server expects a ping from each client on a regular interval, typically every 20 to 30 seconds. If no ping arrives within the window, the server marks the client as disconnected and emits a leave event. Clients that are truly connected will send their ping and stay online. Clients that have silently dropped will be cleaned up within one heartbeat window.
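The server-side bookkeeping for this is small. Here is a minimal sketch, assuming a hypothetical HeartbeatMonitor class keyed by connection ID, with the current time passed in so the logic is easy to test. The 30-second timeout and the sweep interval in the closing comment are illustrative values.

```javascript
class HeartbeatMonitor {
  constructor({ timeoutMs = 30_000, onLeave }) {
    this.timeoutMs = timeoutMs
    this.onLeave = onLeave
    this.lastPing = new Map() // connectionId -> timestamp of the last ping
  }

  // Record a ping from a client (also call this when the client first connects).
  ping(connectionId, now = Date.now()) {
    this.lastPing.set(connectionId, now)
  }

  // Remove every connection whose last ping is older than the timeout.
  sweep(now = Date.now()) {
    for (const [connectionId, seenAt] of this.lastPing) {
      if (now - seenAt > this.timeoutMs) {
        this.lastPing.delete(connectionId)
        this.onLeave(connectionId)
      }
    }
  }
}

const left = []
const monitor = new HeartbeatMonitor({
  timeoutMs: 30_000,
  onLeave: (id) => left.push(id),
})
monitor.ping('conn_a', 0)
monitor.ping('conn_b', 20_000)
monitor.sweep(35_000) // conn_a is 35s old and expires; conn_b is 15s old and stays

// In a real server the sweep would run on a timer, for example:
// setInterval(() => monitor.sweep(), 5_000)
```

How quickly a silent drop is detected depends on both the timeout and how often the sweep runs, so the sweep interval should be short relative to the timeout.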

Multiple Connections

A single user can be connected from multiple tabs or devices simultaneously. A naive implementation will emit a leave event when one tab closes, even though the user is still connected from another tab. The fix is reference counting at the user level (or per user and room, for per-room presence). The server tracks how many active connections exist per user ID. It emits a join event only when the count goes from zero to one, and a leave event only when the count returns to zero.
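A minimal sketch of that counting logic follows. It keeps a set of connection IDs per user rather than a bare integer, so a duplicate close for the same connection cannot decrement the count twice. The ConnectionCounter class and its callbacks are hypothetical names.

```javascript
class ConnectionCounter {
  constructor({ onJoin, onLeave }) {
    this.onJoin = onJoin
    this.onLeave = onLeave
    this.connections = new Map() // userId -> Set of connection IDs
  }

  add(userId, connectionId) {
    let ids = this.connections.get(userId)
    if (!ids) {
      ids = new Set()
      this.connections.set(userId, ids)
    }
    const wasOffline = ids.size === 0
    ids.add(connectionId)
    // Only the first connection makes the user visible.
    if (wasOffline) this.onJoin(userId)
  }

  remove(userId, connectionId) {
    const ids = this.connections.get(userId)
    // Unknown user or already-removed connection: nothing to do.
    if (!ids || !ids.delete(connectionId)) return
    // Only the last connection closing makes the user leave.
    if (ids.size === 0) {
      this.connections.delete(userId)
      this.onLeave(userId)
    }
  }
}

const events = []
const counter = new ConnectionCounter({
  onJoin: (userId) => events.push(`join:${userId}`),
  onLeave: (userId) => events.push(`leave:${userId}`),
})
counter.add('u_abc123', 'tab_1')
counter.add('u_abc123', 'tab_2')
counter.remove('u_abc123', 'tab_1') // still connected from tab_2, no leave yet
counter.remove('u_abc123', 'tab_2') // last connection closed, leave is emitted
```

For per-room presence the same structure works with a composite key such as the room ID plus the user ID.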

How NoLag Handles Presence

NoLag builds presence into the room and lobby primitives so you do not have to wire it yourself.

When a client joins a room, NoLag automatically tracks membership and emits join/leave events to other room members. Heartbeat timeouts, graceful disconnects, and ungraceful drops are all handled at the infrastructure level. You subscribe to presence events on a room and receive a consistent stream of join, leave, and metadata update events without writing any server-side heartbeat logic.

For global presence, the lobby channel serves as the application-wide presence tracker. Every connected client is visible in the lobby, and you can query the current snapshot on join to bootstrap your contacts list with the current online state before live updates start flowing.

Custom metadata is first-class. When a client connects, it can include a metadata payload that gets broadcast to other presence subscribers. Update the metadata at any time, for example when the user navigates to a different page, and the update propagates immediately. No custom presence server, no polling, no heartbeat management code in your application.