WebSocket, as a low-latency, bidirectional communications protocol, has become a mainstay of the modern realtime landscape. Developers turn to WebSocket to power chat, live experiences, fan engagement, and countless other realtime use cases at scale. But how reliable is WebSocket in supporting those experiences?

In this article, we explore what we mean when we refer to reliability within a realtime WebSocket infrastructure, how to ensure WebSocket reliability at scale, and what you need to build yourself (vs. what you get out-of-the-box with a realtime platform provider).

With all of this in mind, we’ll help you determine the best WebSocket implementation for your use case.

What do we mean by reliability?

When we talk about reliability in the context of WebSockets, we're referring to a system's ability to deliver data consistently over time, as expected, and without interruptions. This should be true even if an individual component in the infrastructure fails.

Availability - uptime - is naturally tied to reliability, but reliability also requires additional mechanisms. More specifically, in the context of WebSocket-powered realtime experiences, there are a few distinct concepts around reliability to unpack:

Failure recovery: A realtime system needs to be able to continue operating should a single component fail. In practice, strategies like message acknowledgments, retries, and message persistence can overcome such failures. For example, a chat user should be able to receive messages even if the server in their local region unexpectedly goes offline.
Redundancy and global distribution: Redundancy means provisioning more capacity than is strictly required, so that service continues when part of the system fails. For example, multiple databases that retain the same data set provide data redundancy. A globally-distributed network is often an important element of this: if a local database region goes down, the rest of the system should keep running regardless.
Data integrity: Integrity refers to the accuracy of the data delivered. Guaranteed message delivery (and to the right recipients), correct message ordering, and delivery semantics (for example, exactly-once or at-least-once delivery) are all aspects of data integrity. Integrity is a benchmark of a dependable realtime service in its own right, but it is deeply connected to reliability.

When a realtime system can remain available and reliable even when multiple components fail, it is considered to be fault-tolerant. For a deeper dive into fault-tolerant architecture in distributed systems, we recommend this article from our CTO, Paddy Byers.

What reliability do WebSockets provide on their own?

While WebSockets provide a good foundation for bidirectional communication, they don’t inherently come with any functionality to ensure reliable service:

No delivery guarantees or message ordering

WebSockets themselves don’t guarantee message delivery, or that a message will be delivered exactly once. Messages can get lost if the connection drops unexpectedly, or if your application doesn’t have built-in mechanisms to handle failed deliveries. Without these guarantees, you may experience incomplete or misordered message delivery, particularly across reconnections. To avoid message loss, you will have to build retries, acknowledgments, and message persistence yourself, or use a library or platform-as-a-service (PaaS) that adds these guarantees.
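
As a rough illustration of the acknowledgment-and-retry idea, the sketch below simulates a lossy link in plain Python rather than using a real WebSocket library. The names `Receiver`, `send_with_retries`, `loss_rate`, and the in-memory `outbox` are all hypothetical, chosen for this example. The sender retries until it sees an acknowledgment (at-least-once delivery), and the receiver discards repeated message IDs so that each payload is processed only once.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Processes each message ID at most once, even if it arrives repeatedly."""
    seen: set = field(default_factory=set)
    processed: list = field(default_factory=list)

    def on_message(self, msg_id: int, payload: str) -> int:
        if msg_id not in self.seen:          # drop duplicates caused by retries
            self.seen.add(msg_id)
            self.processed.append(payload)
        return msg_id                        # acknowledge every copy, including duplicates


def send_with_retries(receiver, msg_id, payload, rng, loss_rate=0.3, max_attempts=50):
    """Resend until an ack comes back; the message or the ack may be lost (simulated)."""
    for attempt in range(1, max_attempts + 1):
        if rng.random() < loss_rate:         # message lost on the way out
            continue
        ack = receiver.on_message(msg_id, payload)
        if rng.random() < loss_rate:         # ack lost on the way back
            continue
        if ack == msg_id:
            return attempt                   # number of attempts it took
    raise TimeoutError(f"no ack for message {msg_id} after {max_attempts} attempts")


rng = random.Random(7)                       # seeded so the simulation is repeatable
receiver = Receiver()
outbox = {i: f"msg-{i}" for i in range(5)}   # stand-in for a persisted outbox
for msg_id, payload in list(outbox.items()):
    send_with_retries(receiver, msg_id, payload, rng)
    del outbox[msg_id]                       # safe to forget only once acknowledged
```

In a real system the outbox would live in durable storage, and the set of seen IDs would need to be bounded or expired. Both are left out here to keep the sketch short.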

No automatic reconnections

If a WebSocket connection drops, it doesn’t automatically reconnect. To restore service, you need logic that detects dropped connections and re-establishes them. As with delivery guarantees, this means writing custom handling yourself or relying on a PaaS.
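
One common way to approach reconnection is exponential backoff with jitter, so that many clients dropped at the same moment do not all retry in lockstep. The sketch below is one possible shape for that logic, not a prescribed implementation. `connect_with_retry`, `backoff_delay`, and their parameters are hypothetical names, and `connect` stands in for whatever function opens your WebSocket and raises `ConnectionError` on failure.

```python
import random
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def connect_with_retry(connect: Callable[[], object],
                       sleep: Callable[[float], None],
                       max_attempts: int = 8,
                       rng: Optional[random.Random] = None):
    """Call `connect` until it succeeds, waiting with backoff between failures."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts - 1:   # no point waiting after the final failure
                sleep(backoff_delay(attempt, rng=rng))
    raise ConnectionError("gave up reconnecting") from last_error


# Usage: a stand-in connect function that fails three times, then succeeds.
attempts = {"count": 0}


def flaky_connect():
    attempts["count"] += 1
    if attempts["count"] <= 3:
        raise ConnectionError("server unavailable")
    return "connected"


waits = []                                   # record delays instead of really sleeping
result = connect_with_retry(flaky_connect, sleep=waits.append, rng=random.Random(1))
```

Passing `sleep` in as a parameter keeps the function easy to test. In production you would pass `time.sleep` or an async equivalent.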

Complex to scale

WebSocket is stateful, so horizontally scaling WebSocket to multiple servers means adding to the complexity of your infrastructure. This can compromise global distribution and redundancy if built out incorrectly.
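
To make the routing problem concrete: because each connection's state lives on one server, every client needs to land on the same server consistently, and a server failure should disturb as few clients as possible. The sketch below uses rendezvous (highest-random-weight) hashing as one illustrative technique. It is an assumption of this example, not something WebSocket or any particular load balancer mandates, and the server names and `pick_server` helper are hypothetical.

```python
import hashlib


def _score(server: str, client_id: str) -> int:
    """Deterministic pseudo-random weight for a (server, client) pair."""
    digest = hashlib.sha256(f"{server}|{client_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_server(client_id: str, servers: list[str]) -> str:
    """Rendezvous hashing: the server with the highest score for this client wins."""
    if not servers:
        raise ValueError("no servers available")
    return max(servers, key=lambda s: _score(s, client_id))


servers = ["ws-a", "ws-b", "ws-c"]
clients = [f"client-{i}" for i in range(1000)]
before = {c: pick_server(c, servers) for c in clients}

# Simulate "ws-b" failing: only clients that were on it should be re-routed.
survivors = [s for s in servers if s != "ws-b"]
after = {c: pick_server(c, survivors) for c in clients}
moved = [c for c in clients if before[c] != after[c]]
```

Here every client in `moved` was previously on the failed server, while everyone else keeps their existing assignment. The moved clients still lose their in-memory session state, which is why routing alone is not enough and state recovery is also needed.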

Building a reliable WebSocket infrastructure yourself

Given that reliability involves failure recovery, data integrity and global redundancy, building a reliable WebSocket infrastructure essentially means supplying the reliability functionality that WebSockets can’t provide on their own. We’ll now cover the details of what building these components entails, and the time and cost implications of undertaking this yourself.

Making the infrastructure reliable

In the last section, we mentioned delivery guarantees, message ordering, automatic reconnections, and flexible horizontal scaling. These are all things you need to build out yourself if you’re creating infrastructure from scratch. Here are more specifics on what this involves:

Message delivery guarantees: You need to build custom mechanisms for message acknowledgments, retries, and persistence. This prevents message loss if a connection drops or an unexpected failure occurs.
Reconnection logic: Reliable infrastructure requires reconnection logic that can detect and manage dropped connections, so that users can resume sessions without interruption. You’ll need to build out state management to keep track of connection status and sync lost messages once reconnected.
Horizontal scaling with sticky sessions: To scale WebSocket connections horizontally, you’ll need to configure load balancers and enable sticky sessions so that each client is always routed to the same server. It’s important to properly manage this, since failures could lead to increased latencies and poor user experiences.
Data replication and redundancy: Reliable WebSocket infrastructure requires data replication across multiple servers and regions. This goes hand in hand with scaling - achieving this requires a globally-distributed network with redundant systems that can automatically route traffic to an available server if one fails.
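
The "sync lost messages once reconnected" step can be sketched with per-channel serial numbers and a bounded replay log. This is a minimal in-memory illustration under assumed names (`Channel`, `Client`, `replay_after`, `history_size`), not a description of how any specific platform implements resume. A real deployment would persist and replicate the log rather than hold it in one process.

```python
from collections import deque


class Channel:
    """Server side: assigns increasing serials and keeps a bounded replay log."""

    def __init__(self, history_size: int = 100):
        self.next_serial = 1
        self.log = deque(maxlen=history_size)    # oldest entries are evicted first

    def publish(self, payload: str) -> int:
        serial = self.next_serial
        self.next_serial += 1
        self.log.append((serial, payload))
        return serial

    def replay_after(self, last_serial: int):
        """Return messages newer than last_serial, or None if history has a gap."""
        if self.log and self.log[0][0] > last_serial + 1:
            return None                          # too far behind: a full resync is needed
        return [(s, p) for s, p in self.log if s > last_serial]


class Client:
    """Client side: remembers the last serial it applied, in order."""

    def __init__(self):
        self.last_serial = 0
        self.received = []

    def deliver(self, serial: int, payload: str):
        if serial == self.last_serial + 1:       # accept only the next message in sequence
            self.received.append(payload)
            self.last_serial = serial


channel, client = Channel(history_size=3), Client()
client.deliver(channel.publish("a"), "a")        # delivered while connected
channel.publish("b")                             # published while the client is offline
channel.publish("c")
missed = channel.replay_after(client.last_serial)
for serial, payload in missed:                   # on reconnect, replay what was missed
    client.deliver(serial, payload)
```

When `replay_after` returns `None`, the client has been offline for longer than the retained history covers, and the application has to fall back to reloading full state.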

In addition to all of this, you’ll need to consider building out auxiliary systems to support these major reliability functions, like monitoring and alerting in case something goes wrong.

The cost of a self build

With all that in mind, the financial and time cost of actually building and maintaining reliable WebSocket infrastructure at scale is significant, to the point where we’ve written a whole research report on this. Engineers planning a self build often underestimate the resources required to deliver WebSocket infrastructure, leading to project delays and cost overruns:

Time and resources: Over 70% of companies reported that building WebSocket infrastructure required more than 3 months of engineering time.
High infrastructure costs: The operating cost of building globally distributed infrastructure can be unpredictable and substantial. Those surveyed in our report quoted between $100k-200k in annual costs.
Global scaling issues: As mentioned, WebSockets (and their libraries) are hard to scale. Socket.IO, for instance, is limited to a single datacenter or region. Without a fallback mechanism, this means that if the region goes offline, the entire messaging system goes down. High latencies for users located further away from the datacenter can also degrade the user experience.

Even after considerable time and resources have been invested, self-built infrastructure often fails to meet evolving business needs or user expectations over time. A custom-built solution is typically designed for a specific use case, making it difficult to adapt or innovate as requirements shift. In contrast, managed platforms offer composable realtime, allowing you to easily scale and extend your use case without being constrained by your self build’s parameters.

An alternative to building it yourself: WebSocket services

If you are looking to build out realtime infrastructure, reliability and integrity are non-negotiable to the quality of your service. However, as we’ve covered, building this level of dependability in-house is resource-intensive and risky.

Reliability at scale is especially fraught with challenges, since building out global replication sharply increases infrastructural complexity, and has high cost implications.

By using a realtime PaaS, you can avoid the challenges of managing WebSocket infrastructure yourself, and allow your team to focus on what matters most - a great user experience, and product innovation.

These managed platforms eliminate the need to invest heavily in building and maintaining WebSocket infrastructure in-house. There are plenty available, but naturally, we are most familiar with Ably. Our Four Pillars of Dependability (performance, integrity, reliability and availability) guide everything we do, and have enabled us to provide a WebSocket platform with:

Predictable performance: A low-latency and high-throughput global edge network, with median latencies of <50ms.
Guaranteed ordering & delivery: Messages are delivered in order and exactly once, even after disconnections.
Fault-tolerant infrastructure: Redundancy at regional and global levels with 99.999% uptime SLAs. 99.999999% (8x9s) message availability and survivability, even with datacenter failures.
High scalability & availability: Built and battle-tested to handle millions of concurrent connections at scale.
Optimized build times and costs: Deployments typically see a 21x lower cost and upwards of $1M saved in the first year.
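
To put an uptime percentage in perspective, it can help to convert it into permitted downtime. The helper below is a hypothetical illustration that assumes a 365-day measurement window. Actual SLA terms may define the window and exclusions differently.

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes per year")
```

Under that assumption, 99.999% works out to roughly 5.26 minutes of downtime per year, compared with about 8.76 hours at 99.9%.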

WebSocket reliability in realtime infrastructure