INTRODUCTION
Designing a highly available and fault-tolerant database can be one of the most challenging tasks for any service. There is never a “one size fits all” approach to achieving both reliably, and often, based on organizational and business needs, one is prioritized over the other. But keeping a fine balance between the two can prevent disastrous failovers and complete loss of data. This article focuses on clearly defining the two paradigms and understanding the basics of the various techniques and principles that help achieve them.
HIGH AVAILABILITY
High availability is the ability of a system or service to keep providing service while minimizing downtime. When designing a highly available database service, the following key principles are kept in mind:
Single Points of Failure: These can be eliminated by adding redundancy, which prevents the failure of the entire service when one part of the system fails. Creating a failover service, or standby, is an effective way to avoid a single point of failure. In case of a failure, the standby service can start taking the traffic. It can be achieved by:
Hot failovers: In this case, all the servers (primary and backup) are running simultaneously, but the traffic is routed to only one server at a time. In case of failure, the traffic gets redirected to the backup server.
Cold failovers: In this case, the backup server is started only after the primary server has completely shut down, so recovery takes longer than with a hot failover.
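To make the difference between the two concrete, here is a minimal sketch of a router that sends all traffic to one active server and switches to the standby on failure. The `Server` and `FailoverRouter` classes are hypothetical and exist only for illustration; a real setup would rely on the failover tooling of the database in use.

```python
import time


class Server:
    """A hypothetical database server, used only for illustration."""

    def __init__(self, name, running):
        self.name = name
        self.running = running

    def start(self, boot_seconds=0.0):
        # A cold standby pays a startup cost before it can serve traffic.
        time.sleep(boot_seconds)
        self.running = True

    def handle(self, query):
        if not self.running:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} answered: {query}"


class FailoverRouter:
    """Routes all traffic to one active server and fails over to the backup."""

    def __init__(self, primary, backup):
        self.active = primary
        self.backup = backup

    def handle(self, query):
        try:
            return self.active.handle(query)
        except ConnectionError:
            # Hot standby: the backup is already running, so we only switch.
            # Cold standby: the backup has to be started first.
            if not self.backup.running:
                self.backup.start()
            self.active, self.backup = self.backup, self.active
            return self.active.handle(query)


primary = Server("primary", running=True)
standby = Server("hot-standby", running=True)
router = FailoverRouter(primary, standby)
primary.running = False            # simulate a failure of the primary
print(router.handle("SELECT 1"))   # hot-standby answered: SELECT 1
```

Passing a standby created with `running=False` turns the same router into a cold failover: the request still succeeds, but only after the standby has finished starting.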
Clustering: For a highly available database service, clustering makes it possible to draw on resources from other nodes within the cluster in case of a failure. A database cluster includes several nodes that communicate with each other, and when one of the nodes fails, the rest of the cluster can operate normally. The cluster can continue to operate while the faulty node is being recovered.
Load Balancing: This is an important principle for a highly available database service because, during a failure, the load balancer is the component that detects the failed server and redirects the traffic to the healthy ones. Apart from high availability, a load balancer also provides incremental stability to the entire system.
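As a rough sketch of that detect-and-redirect behaviour, the snippet below rotates through a set of backends and skips any that fail a health check. The `Backend` and `LoadBalancer` classes and the `ping` method are assumptions made up for this example; real load balancers typically run health checks in the background rather than on every request.

```python
import itertools


class Backend:
    """A hypothetical database server behind the load balancer."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def ping(self):
        # Stand-in for a real health check (for example a TCP connect or a test query).
        return self.healthy


class LoadBalancer:
    """Round-robin selection that skips backends failing their health check."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._cycle = itertools.cycle(self.backends)

    def pick(self):
        # Try each backend at most once per call.
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if backend.ping():
                return backend
        raise RuntimeError("no healthy backends available")


a, b, c = Backend("db-a"), Backend("db-b"), Backend("db-c")
balancer = LoadBalancer([a, b, c])
b.healthy = False  # simulate a failed server
print([balancer.pick().name for _ in range(4)])  # ['db-a', 'db-c', 'db-a', 'db-c']
```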
Redundancy: Geographic redundancy is very important for a database service and helps prevent outages & loss of data due to natural disasters.
FAULT TOLERANCE
Fault tolerance is the ability of a service or system to continue operating in spite of failures in one of its own components. When designing a fault-tolerant database system, several techniques should be applied across categories such as replication, failure detection, throttling, etc.
Data Replication: In order to maintain high durability of data, storing multiple copies of data is preferred. Some of the popular ways to replicate the data are as follows (let’s assume there are N replicas for the database):
Synchronous Replication: When a client sends a write request to the database, the write is applied synchronously to all N replicas, one by one, before the client request is acknowledged. The leader receives all the requests, writes them in order, and then replicates the data to the followers.
Paxos-based replication: It is very similar to synchronous replication, but it requires communication with only a majority of the nodes.
Leader Follower Replication: It is a widely popular replication methodology, used by MySQL. The client writes the data to the leader, and the leader then asynchronously writes the data to all of its followers.
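One way to compare the three approaches is by how many replicas must confirm a write before the client gets an acknowledgement. The sketch below models only that counting rule. The `Replica` class, the mode names, and the `write` helper are assumptions for illustration, and real Paxos involves proposal numbers and multiple phases that are not modelled here. Rollback of a rejected write is also left out.

```python
class Replica:
    """A hypothetical replica that stores records in an in-memory log."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.log = []

    def append(self, record):
        if not self.reachable:
            return False
        self.log.append(record)
        return True


def required_acks(mode, n):
    """Replica confirmations needed (out of n) before answering the client."""
    if mode == "synchronous":
        return n            # every replica must confirm
    if mode == "quorum":
        return n // 2 + 1   # a strict majority, as in Paxos-style protocols
    if mode == "asynchronous":
        return 1            # only the leader; followers catch up later
    raise ValueError(f"unknown mode: {mode}")


def write(record, leader, followers, mode):
    """Return True if the write can be acknowledged to the client."""
    if not leader.append(record):
        return False
    n = 1 + len(followers)
    acks = 1 + sum(1 for follower in followers if follower.append(record))
    return acks >= required_acks(mode, n)


leader = Replica("leader")
followers = [Replica("follower-1"), Replica("follower-2", reachable=False)]
print(write("x=1", leader, followers, "synchronous"))   # False: one replica is unreachable
print(write("x=2", leader, followers, "quorum"))        # True: 2 of 3 confirmed
print(write("x=3", leader, followers, "asynchronous"))  # True: the leader has the record
```

The trade-off shows up directly: the stricter the acknowledgement rule, the safer the data, but the more easily a single unreachable replica blocks writes.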
Disaster Recovery: It is the ability to recover from large-scale failures with as little interruption to the services as possible. The key objectives of a good disaster recovery plan revolve around the Recovery Time Objective (RTO), the maximum acceptable time to restore service, and the Recovery Point Objective (RPO), the maximum acceptable window of lost data. In order to achieve these, there are two major elements:
Backup: We should always keep copies of important data so that it can be restored during disaster recovery. One of the major concepts built on top of this is Point in Time Recovery, which helps in recovering a database to a previously known good point.
Tolerance: We should also deploy two or more database services that are kept far away from each other and continuously monitor the health status of each other. During a failure, the traffic can be switched over to the other healthy service without interruption.
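Point in Time Recovery is usually described as restoring a base backup and then replaying logged changes up to the chosen moment. The snippet below is a toy sketch of that idea, assuming the backup is a plain dictionary and the change log is a list of `(timestamp, key, value)` tuples. Both function names are hypothetical, and real databases replay their own write-ahead or binary logs instead.

```python
def restore_to_point_in_time(base_backup, change_log, target_time):
    """Rebuild state from a base backup plus logged changes up to target_time.

    base_backup: dict snapshot of key -> value
    change_log:  list of (timestamp, key, value) tuples sorted by timestamp;
                 a value of None means the key was deleted
    """
    state = dict(base_backup)  # never mutate the backup itself
    for timestamp, key, value in change_log:
        if timestamp > target_time:
            break  # stop at the last known good point
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


def meets_rpo(last_recoverable_time, failure_time, rpo_seconds):
    """True if the data lost between the two timestamps fits within the RPO."""
    return (failure_time - last_recoverable_time) <= rpo_seconds


backup = {"balance": 100}
log = [(10, "balance", 120), (20, "owner", "alice"), (30, "balance", None)]
print(restore_to_point_in_time(backup, log, target_time=25))
# {'balance': 120, 'owner': 'alice'}
print(meets_rpo(last_recoverable_time=100, failure_time=160, rpo_seconds=60))  # True
```

Here the change at timestamp 30 is treated as the bad event, so recovery stops just before it.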
CONCLUSION
Despite the best intentions, failure scenarios are inevitable; hence, preparing for them and putting processes in place helps mitigate disastrous scenarios.
(The contents of this blog are of my personal opinion and/or self-reading a bunch of articles and in no way influenced by my employer.)
Great article! Completely agree, there isn't one size fits all approach. Also, with major cloud providers today, most of these techniques are available out of the box (magic of serverless computing :D). This certainly helps in focusing on application logic, rather than worrying about infrastructure level stuff. But, it's important to keep these fundamental techniques in mind while designing a highly available and fault tolerant system.