Introduction
In this article, we are going to discuss etcd (pronounced et-see-dee), a strongly consistent, distributed key-value datastore used for shared configuration, service discovery, and scheduler coordination in Kubernetes. This article is part of a series focused on understanding, mastering, and designing efficient etcd clusters. Here we will discuss the justification for using etcd, the leader election process, and finally the consensus algorithm used in etcd; in the following parts, we will follow up with a technical implementation of a highly available etcd cluster and with backing up an etcd database to prevent data loss. This article requires a basic understanding of Kubernetes, algorithms, and system design.
etcd
etcd is an open-source, leader-based, distributed key-value datastore designed by a team of engineers at CoreOS in 2013 and donated to the Cloud Native Computing Foundation (CNCF) in 2018. Since then, etcd has been adopted as the datastore in major projects like Kubernetes, CoreDNS, OpenStack, and other relevant tools. etcd is built to be simple, secure, reliable, and fast (benchmarked at 10,000 writes per second). It is written in Go and uses the Raft consensus algorithm to manage a highly available replicated log. etcd is strongly consistent because it offers strict serializability, meaning a consistent global ordering of events: in practical terms, no client reading from an etcd cluster will ever see a stale view of the data (unlike the eventual consistency offered by many NoSQL databases). Also, unlike traditional SQL databases, etcd is distributed by nature, allowing high availability without sacrificing consistency.
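To make the key-value model concrete, here is a minimal sketch of writing and reading a key with the official Go client (go.etcd.io/etcd/client/v3). It assumes a single etcd member is reachable at localhost:2379; the endpoint and the key name are illustrative, not part of any standard setup.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the assumed local etcd member.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Write a key, then read it back; the read reflects the committed write
	// because etcd's default reads are linearizable.
	if _, err := cli.Put(ctx, "/config/feature-flag", "enabled"); err != nil {
		panic(err)
	}
	resp, err := cli.Get(ctx, "/config/feature-flag")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```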
etcd is that guy.
Why etcd?
Why is etcd used in Kubernetes as the key-value store? Why not some SQL database or a NoSQL database? The key to answering this question is understanding the core storage requirements of the Kubernetes API-server.
The datastore attached to Kubernetes must meet the following requirements:
- Change notification: Kubernetes is designed to watch the state of a cluster and drive it toward the desired state, so the API server acts as a central coordinator between different clients, streaming changes to the various parts of the control plane. The ideal datastore for Kubernetes lets the API server conveniently subscribe to a key or a set of keys and promptly notifies it of any change to those keys, while still handling updates efficiently at scale (see the watch sketch after this list).
- Consistency: Kubernetes is a fast-moving orchestrator, and it would be a disaster if the datastore powering the kube-apiserver were only eventually consistent. Imagine your Kubernetes cluster creating two Deployments from one Deployment spec because the datastore never broadcast that the Deployment already exists. etcd is strongly consistent, meaning clients see the same data regardless of which node they ask; in fact, etcd does not consider a write complete until it has been committed to a quorum (a majority) of the cluster's members.
- Availability: the kube-apiserver requires its datastore to be highly available, because if the datastore is unavailable the kube-apiserver goes down with it and Kubernetes becomes unavailable. etcd solves this problem by running as a distributed cluster of several nodes: if the leader node becomes unhealthy another leader is elected, and if a follower node becomes unhealthy requests are no longer sent to it.
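As a sketch of the change-notification requirement above, the following example uses the Go client's Watch API to subscribe to a key prefix, again assuming a member at localhost:2379; the /registry/deployments/ prefix is only an illustration of how an API server might organize its keys.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Subscribe to every key under an illustrative prefix; etcd streams each
	// change (PUT or DELETE) to the subscriber instead of being polled.
	watchCh := cli.Watch(context.Background(), "/registry/deployments/", clientv3.WithPrefix())
	for resp := range watchCh {
		for _, ev := range resp.Events {
			fmt.Printf("%s %s -> %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```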
The aforementioned characteristics highlight why etcd is the best choice as a datastore for Kubernetes. Other datastores like HashiCorp Consul, ZooKeeper, and CockroachDB could replace etcd in the role it plays in Kubernetes, but etcd is by far the most popular, largely due to its sheer ability to be highly performant.
etcd internals
etcd is itself a distributed, consensus-based system, which means that for the cluster to function there must be a leader while the rest of the nodes act as followers. Distributed consensus poses one of computer science's classic problems: "How do multiple independent processes decide on a single value for something?" etcd solves this problem by using the Raft algorithm. The Raft algorithm is etcd's secret tool for maintaining a balance between strong consistency and high availability.
The Raft algorithm works by electing a leader among the nodes in the cluster and ensuring that all write requests go through that leader. Changes made by the leader are broadcast to the other nodes, and a write is not complete until a majority (a quorum) of nodes have acknowledged it, hence the larger the etcd cluster, the longer a write tends to take. Since all nodes converge on the same state, a read request can be served by any node.
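The following toy sketch illustrates that commit rule; it is a simplified illustration under assumed node names, not etcd's actual implementation. The leader counts its own log entry plus acknowledgements from healthy followers and commits once the count reaches a majority.

```go
package main

import "fmt"

// replicate pretends to send a log entry to each follower and counts
// acknowledgements; the leader's own log entry always counts as one ack.
func replicate(entry string, followers map[string]bool) int {
	acks := 1 // the leader appends to its own log first
	for _, healthy := range followers {
		if healthy {
			acks++ // a healthy follower appends the entry and acknowledges
		}
	}
	return acks
}

func main() {
	// A 3-node cluster seen from the leader, with one follower down.
	followers := map[string]bool{"node-2": true, "node-3": false}
	clusterSize := len(followers) + 1
	quorum := clusterSize/2 + 1 // 2 of 3

	acks := replicate("put /config/feature-flag enabled", followers)
	if acks >= quorum {
		fmt.Printf("committed with %d/%d acks (quorum %d)\n", acks, clusterSize, quorum)
	} else {
		fmt.Printf("not committed: %d/%d acks (quorum %d)\n", acks, clusterSize, quorum)
	}
}
```

Note that even with one follower down, the write still commits, which is exactly why a 3-node cluster tolerates a single failure.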
How does Raft elect the leader in a group of nodes in a cluster? At the start of every etcd cluster, every node is a follower. In general, nodes can exist in three states: follower, candidate, and leader. If a follower at any point stops hearing the leader's heartbeat, it can become a candidate and request votes from the other nodes; the nodes respond with their votes, and a candidate that gathers votes from a majority of the cluster becomes the leader.
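Here is a minimal sketch of that election rule, again a toy illustration with made-up node names rather than etcd's code: a follower whose heartbeat timer expires becomes a candidate and wins only if its votes, including its own, reach a majority.

```go
package main

import "fmt"

type state string

const (
	follower  state = "follower"
	candidate state = "candidate"
	leader    state = "leader"
)

// requestVotes stands in for Raft's RequestVote RPC: each peer's bool says
// whether it granted its vote for this term.
func requestVotes(peers map[string]bool) int {
	votes := 1 // a candidate always votes for itself
	for _, granted := range peers {
		if granted {
			votes++
		}
	}
	return votes
}

func main() {
	clusterSize := 5
	peers := map[string]bool{"node-2": true, "node-3": true, "node-4": false, "node-5": false}

	s := follower
	// The heartbeat timeout elapsed, so the follower starts an election.
	s = candidate
	if requestVotes(peers) >= clusterSize/2+1 {
		s = leader // 3 of 5 votes, including its own, is a majority
	}
	fmt.Println("final state:", s)
}
```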
Maintaining Quorum
In distributed consensus-based systems that use the Raft algorithm for leader election and voting, decisions are made by majority vote, which means that if a majority cannot be reached the cluster becomes unavailable. In a cluster of 3 nodes, the majority is 2: if one node goes offline, the remaining 2 can still elect a leader and accept writes, but if 2 nodes go offline the lone survivor cannot reach a majority and the cluster becomes unavailable. Hence the quorum of this cluster is 2.
Theoretically, quorum refers to the minimum number of members or nodes in a distributed system that must agree (reach consensus) for the cluster to make a decision or perform an operation. The quorum is typically defined as (N/2) + 1, where N is the total number of nodes in the cluster and N/2 is rounded down. For example, in a 5-node cluster, the quorum would be 3 nodes (5/2 rounded down is 2, plus 1 = 3). This means at least 3 nodes must agree to proceed with a given operation.
Fault tolerance is the number of nodes that can fail while the cluster still maintains quorum. For example, in a 5-node cluster, losing 2 nodes still allows the system to maintain a quorum (3 nodes), but losing 3 nodes would cause the cluster to lose its quorum; hence the fault tolerance of a 5-node cluster is 2.
Fault Tolerance = N - Quorum
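A quick sketch that applies both formulas for cluster sizes 1 through 7; the table it prints is just the arithmetic above.

```go
package main

import "fmt"

func main() {
	fmt.Println("nodes  quorum  fault tolerance")
	for n := 1; n <= 7; n++ {
		quorum := n/2 + 1 // integer division already rounds N/2 down
		faultTolerance := n - quorum
		fmt.Printf("%5d  %6d  %15d\n", n, quorum, faultTolerance)
	}
}
```

The output makes the next observation easy to see: a 3-node and a 4-node cluster both tolerate exactly one failure, and a 5-node and a 6-node cluster both tolerate two.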
If you compare fault tolerance across cluster sizes, you will notice that an odd-numbered cluster and the next even-numbered cluster have the same fault tolerance (for example, 3-node and 4-node clusters both tolerate one failure, and 5-node and 6-node clusters both tolerate two). There is genuinely no competitive advantage to an even-numbered cluster, hence the rule of thumb that consensus-based clusters should have an odd number of nodes.
Conclusion
In this article, we discussed one of the most fundamental parts of Kubernetes, its datastore "etcd". We discussed the requirements the Kubernetes API server has for its datastore, went in-depth on why etcd is the most popular choice, and covered the competitive advantages it has over SQL and NoSQL databases. Finally, we discussed concepts fundamental to the design of distributed clusters and how to guarantee availability for distributed clusters in general, not just etcd. In the next article, we will build an etcd cluster from the ground up and put into practice the things discussed here. If you enjoyed reading this article, feel free to share it and subscribe to my page.