High Availability vs. Fault Tolerance: 3 Key Differences

  • High availability is defined as the ability of a system to operate continuously with minimal risk of failure.
  • Fault tolerance is defined as the ability of a system to continue operating without interruption, even if several components fail.
  • This article covers the key differences between high availability and fault tolerance.

High Availability and Fault Tolerance: An Overview

High availability is the ability of a system to operate continuously with minimal risk of failure. On the other hand, fault tolerance is the ability of a system to continue operating without interruption, even if several components fail.

Architectural Overview: Fault Tolerance vs. High Availability

Sources: Avi Networks and Imperva

Before diving into the key differences between high availability and fault tolerance, let’s learn more about these concepts.

What Is High Availability?

In any IT ecosystem, “availability” is the ability of a system to respond to a request. High availability, as the name suggests, refers to a system that can keep responding to requests with minimal downtime.

High-availability systems are not completely immune to failure and downtime. Rather, a high-availability system is designed to remain responsive as consistently as possible. High availability also doesn’t reflect the speed or quality of a system’s output; it only refers to the ability of the IT system to respond to requests.

High availability as a feature is commonly seen among cloud service providers, who typically attach a service level agreement (SLA) to the availability of their cloud systems. For instance, blob storage services such as Azure Blob Storage, Google Cloud Storage, and Amazon S3 all feature an availability SLA of 99.99%.

How exactly does this 99.99% availability translate to the real world? The percentage is calculated in terms of annual availability, which means that over any 365-day period, the system is guaranteed to be online 99.99% of the time.

A quick calculation would reveal that 99.99% uptime equals 0.01% downtime. This means that out of the 525,600 minutes in any 365-day period, approximately 53 minutes of downtime can be expected. That’s not even an hour in a full year!

Now comes an interesting calculation: what if a system offers 99.9% uptime instead of 99.99% uptime? Not that big a difference — or is it?

A system with 99.9% availability would feature a downtime of 0.1%, which, if we follow the same calculation as above, translates to about 8.8 hours per year. That’s a big jump from 53 minutes!
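To make these figures concrete, here is a minimal Python sketch (the function name and the chosen SLA values are illustrative) that converts an availability percentage into expected annual downtime:

```python
# Convert an availability SLA percentage into expected annual downtime.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def annual_downtime_minutes(availability_percent: float) -> float:
    """Expected downtime, in minutes, over a 365-day period."""
    downtime_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * downtime_fraction

for sla in (99.9, 99.99, 99.999):
    minutes = annual_downtime_minutes(sla)
    print(f"{sla}% availability -> {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
# 99.9%   -> 525.6 minutes/year (8.76 hours)
# 99.99%  -> 52.6 minutes/year  (0.88 hours)
# 99.999% -> 5.3 minutes/year   (0.09 hours)
```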

So, is 99.9% availability high or low? It depends on the application. For a super-critical system like air traffic control or payments processing, roughly 8 hours of downtime per year is simply unacceptable. On the other hand, some applications (such as blob storage) have limited room to push their availability SLA much higher (say, to 99.999%).

Simply put, the higher the aim for a system’s availability, the more complex and expensive it becomes. Let’s understand why.

Say, there exists an infrastructure as a service (IaaS) provider that requires its IT systems to be available 24/7/365. Such an enterprise would not be able to achieve its intended 100% availability SLA if it has only one server, right? This is because one server can only handle so much traffic. Plus, it is likely to experience downtime due to hardware failures or maintenance requirements.

To increase system availability, the IaaS vendor could commission more servers to handle more traffic through workload distribution. The greater the number of servers commissioned, the closer the vendor reaches their intended 100% availability mark.

However, no matter how many servers are commissioned to handle active traffic, availability can never be truly 100% because servers are bound to experience downtime at some point. The solution here would be to commission standby servers that don’t actively provide IaaS services but simply await an opportunity to take over if one of the primary servers goes down. While this increases system availability even further, it also increases costs: standby servers must be paid for even though they do not process traffic at the throughput of the primary servers.

The higher the number of backup servers added, the closer any IT system would reach 100% availability (at an ever-increasing cost). However, as mentioned earlier, achieving true 100% availability is virtually impossible.
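This relationship can be roughed out with a little probability. The sketch below assumes server failures are independent and that any single healthy server can serve requests, which real deployments rarely guarantee, so treat it as an upper bound rather than a plan:

```python
# Estimated availability of N redundant servers, assuming independent failures
# and that any one healthy server is enough to keep serving requests.
def combined_availability(single_server_availability: float, num_servers: int) -> float:
    """Probability that at least one of num_servers identical servers is up."""
    probability_all_down = (1 - single_server_availability) ** num_servers
    return 1 - probability_all_down

for n in (1, 2, 3):
    pct = combined_availability(0.99, n) * 100
    print(f"{n} server(s) at 99% each -> {pct:.4f}% availability")
# 1 server(s) at 99% each -> 99.0000% availability
# 2 server(s) at 99% each -> 99.9900% availability
# 3 server(s) at 99% each -> 99.9999% availability
```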

See More: What Is Private Cloud Storage? Definition, Types, Examples, and Best Practices

What Is Fault Tolerance?

As we’ve established already, failure occurs in all IT systems at some point. When a failure happens, availability is maintained if the failed system goes down and another system steps in to take its place. However, what if the system within which the failure occurred continues operating without downtime? Such a system would be known as fault tolerant.

This is the main difference between high availability and fault tolerance. In a highly available system, failures can lead to downtime and the denial of requests. However, this would happen rarely, perhaps only a few minutes or hours per year.

On the other hand, fault-tolerant systems can recover from failure and will be able to continue responding to requests without a similar backup system having to take over.

Imagine the IaaS provider example from above again. This provider has stocked its data center with several servers and achieved high availability. But suddenly, the data center experiences a power outage. No number of primary or backup servers can respond to user requests, since they all need a power supply to operate.

But now, our IaaS vendor has installed a well-connected backup power generator that instantly fulfills the data center’s power supply needs in case of power loss. This makes the IaaS IT systems both highly available and fault tolerant.

See More: What Is ETL (Extract, Transform, Load)? Meaning, Process, and Tools

High Availability vs. Fault Tolerance: Top 3 Differences

Both high availability and fault tolerance aim to achieve business continuity and system reliability. However, they differ in terms of their design and approach. Let’s look at the top differences between high availability and fault tolerance.

1. Operations

High Availability
High availability systems are designed to minimize downtime and avoid loss of service.

As a metric, high availability is expressed as the percentage of total operating time during which the system is up.

The gold standard for high availability is 99.999% (five nines) uptime.

Both high availability and fault tolerance are relevant to total system uptime and long-term operations. However, they are not mutually exclusive, and both strategies are generally combined in organizational deployments.

For instance, some operations would require an entirely mirrored system, which is both highly available and fault-tolerant. In such a setup, the failure of one mirror would lead to the other kicking in and downtime being averted. However, such a setup would be costly and cumbersome to manage.

Conversely, a highly available system that is not necessarily fault-tolerant would leverage a load balancer. Such a setup would allow for minimal downtime in service but lack total redundancy in case of failure.

High-availability operations in organizational settings have several distinct advantages (and some limitations) compared to fault-tolerant systems.

Regarding cost, fault tolerance can be more expensive to implement than high availability. This is because fault-tolerant systems demand the continuous operation and maintenance of redundant system components. On the other hand, high availability can be achieved by a subset of a vast IT system, such as by installing a load-balancing solution.

Another key operational difference lies in downtime, with high availability still allowing minimal service interruption levels. Even gold standard “five nines” systems are allowed to experience around 5 minutes of annual downtime. Conversely, fault-tolerant systems are required to continue working without downtime, even in the case of component failure.

High-availability systems are also lighter to set up because they are built to share resources, with the aim of minimizing downtime and managing failures collectively. In contrast, fault tolerance requires more hardware and software investment to detect failures and immediately switch to redundant components and power backups.

Finally, not all systems are required to be fault-tolerant by design. In many applications, high availability is sufficient.

Fault Tolerance

Fault-tolerant systems are designed to go beyond high availability and ensure robust business continuity and, in some applications, even assist in disaster recovery. Fault tolerance solutions specifically target disruptions caused by a single point of failure.

Fault-tolerant systems operate by ensuring seamless automated service switchovers to backup components in case of primary component failure.

A common example of hardware-based fault tolerance operations is seen in server setups. Here, a critical server would have an identical fault-tolerant server that mirrors all its operations running in parallel. Such a system eliminates single points of failure and enables hardware fault tolerance through redundancy. This makes components and systems more reliable.

Software systems, too, can be created to be fault tolerant. Think of an application backed up by another instance of the same application, like a payments database that is replicated continuously. In such a setup, primary database operations would be automatically redirected to the backup database in case of application failure.
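As a rough sketch of that redirect (the classes below are hypothetical stand-ins, not a real database driver), the application simply falls back to the replica when the primary raises an error:

```python
# Toy illustration of redirecting queries from a failed primary to a replica.
class PrimaryDown(Exception):
    """Raised by the (hypothetical) primary connection when it is unreachable."""

class FailingPrimary:
    def execute(self, sql):
        raise PrimaryDown("primary database is offline")

class Replica:
    def execute(self, sql):
        return f"result of {sql!r} served by replica"

def query_with_fallback(primary, replica, sql):
    """Try the primary first; redirect to the replica if the primary fails."""
    try:
        return primary.execute(sql)
    except PrimaryDown:
        return replica.execute(sql)

print(query_with_fallback(FailingPrimary(), Replica(), "SELECT * FROM payments"))
```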

Fault tolerance can even exist at the level of supporting infrastructure, with redundant power sources and internet connections helping avoid system faults by automatically switching over in case of failure.

Fault tolerance is generally focused on mission-critical systems and applications and can cover several levels:

  • The most basic level of fault tolerance is generally the ability of a system to respond to challenges such as power failure and internet outages.
  • The next level is usually the ability of a system to switch over to a backup setup instantaneously in case of failure.
  • Fault tolerance can also be for partial components. For instance, in case of disk failure, a fault-tolerant system is expected to switch over to a mirrored disk instantly. Such a setup would stay functional even during partial system failure without having to switch over to a mirror completely.
  • High-level fault tolerant systems also exist — these leverage multiple processors to collaboratively scan data to spot and instantly correct any arising errors.
  • Finally, fault tolerance can be built directly into the operating system, allowing programmers to monitor critical data at specific system points.

 

2. Techniques

High Availability
High availability uses techniques including load balancing, clustering, and redundancy to achieve high uptime.

Load balancing

High availability load balancing (HALB) systems automatically distribute workloads among data centers. This is achieved by leveraging primary and secondary load balancers to ensure near-continuous application delivery.

Load balancers assign workloads to different servers, track server health, and reroute workloads from faulty to healthy ones as required. High availability is achieved when HALB systems use redundancy in load balancers and servers, including backup systems, to replace components facing downtime.

Load balancers rely on session persistence for optimized performance and prevention of application failure. Workload distribution algorithms can include the least response time, least connections, hash, IP hash, round-robin, and random.
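For a concrete (and deliberately simplified) picture of two of these algorithms, the sketch below shows round-robin and least-connections selection; the server names and connection counts are illustrative, and a real load balancer would also track health, persistence, and weights:

```python
import itertools

servers = ["srv-a", "srv-b", "srv-c"]

# Round robin: hand out servers in a fixed, repeating order.
round_robin = itertools.cycle(servers)

def pick_round_robin() -> str:
    return next(round_robin)

# Least connections: pick the server currently handling the fewest requests.
active_connections = {"srv-a": 12, "srv-b": 3, "srv-c": 7}

def pick_least_connections() -> str:
    return min(active_connections, key=active_connections.get)

print(pick_round_robin())        # srv-a
print(pick_least_connections())  # srv-b
```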

HALB also ensures high availability by protecting organizational systems from distributed denial of service (DDoS) attacks, conducting scheduled health checks to ensure servers are up and handling requests, and speeding up responses.

Clustering

When a group of hosts combines resources to act as a single system and ensure continuous uptime, it is known as high availability clustering.

High availability clusters are capable of load balancing, failover, and backup. All hosts in the cluster share storage access, allowing virtual machines (VMs) and other applications on any host to fail over to another host without downtime.

Redundancy

Redundancy is common to both high availability and fault tolerance and is an important strategy for maintaining each.

In case of primary component failure, redundancy ensures that a secondary copy of the same component is ready to take over automatically and instantly. A component that is not redundant is likely to be a single point of failure and can be detrimental to high availability efforts.

Redundancy in high availability systems can be achieved in several ways. These include backup power provisions, uninterruptible power supplies (UPS), load balancing or bonding of network cards, multiple network fibers linking components, CPU clusters, and multiple hard drives set up in a redundant array.

Fault Tolerance

To ensure continuous operations even during partial failure, fault tolerance uses techniques such as replication, redundancy, and failover.

Replication

In replication-based fault tolerance, data is copied into multiple systems. Incoming requests can be transmitted to any replica, thus ensuring system continuity even during a node failure.

Replication protocol phases typically include client request, server coordination, execution, agreement coordination, and client response. Criteria to ensure consistency among replicated components include sequential consistency, causal consistency, and linearizability.

The degree of replication defines the number of replicas created. High fault tolerance is achieved through the creation of a large number of replicas. A low degree of replication can impact scalability, fault tolerance, and system performance.
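The toy sketch below illustrates the idea with hypothetical in-memory nodes rather than a real replication protocol: each write is copied to every live replica, so a read can still succeed after a node fails:

```python
# Toy replication: every write is copied to all live replicas.
class Node:
    def __init__(self, name: str):
        self.name = name
        self.data = {}
        self.alive = True

replicas = [Node(f"node-{i}") for i in range(3)]  # degree of replication = 3

def replicated_write(key, value):
    for node in replicas:
        if node.alive:
            node.data[key] = value

def replicated_read(key):
    for node in replicas:
        if node.alive and key in node.data:
            return node.data[key]
    raise KeyError(key)

replicated_write("order-42", "paid")
replicas[0].alive = False           # simulate a node failure
print(replicated_read("order-42"))  # still returns "paid"
```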

Redundancy

Process-level redundancy is a software-based fault tolerance technique for handling transient failures, such as system malfunction and environmental interference.

System data is retained in stable storage to allow for easy rollback in case of node failure. Rollback and checkpointing are generally used to store the current system state, including process state, register values, and environment.
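As a minimal illustration of checkpointing and rollback (the file name and state contents are assumptions made for the example), the current state is written to stable storage after each unit of work, so a restarted process resumes from the last checkpoint instead of starting over:

```python
import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # stands in for "stable storage"

def save_checkpoint(state: dict) -> None:
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(state, f)

def restore_checkpoint() -> dict:
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)      # roll back to the last saved state
    return {"processed": 0}          # no checkpoint yet: start from scratch

state = restore_checkpoint()
for item in range(state["processed"], 10):
    # ... do one unit of work here ...
    state["processed"] = item + 1
    save_checkpoint(state)           # checkpoint after each unit of work
```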

Failover

In fault tolerance, a failover system is designed to automatically activate a secondary platform to keep a system or application running in case of primary platform failure. During this time, IT personnel are usually expected to fix the primary platform and bring it back online on high priority.
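A highly simplified sketch of the switchover decision (the platform names and health inputs are illustrative; a real system would use heartbeats or health-check endpoints) might look like this:

```python
def choose_active(primary_healthy: bool, secondary_healthy: bool) -> str:
    """Decide which platform should serve traffic right now."""
    if primary_healthy:
        return "primary"
    if secondary_healthy:
        # Automatic switchover: the secondary serves while the primary is repaired.
        return "secondary"
    raise RuntimeError("no healthy platform available")

print(choose_active(primary_healthy=False, secondary_healthy=True))  # secondary
```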

 

3. System design

High Availability
Adding too many components to any system will make it complex and difficult to manage. A complicated system design increases the potential for failure and makes it hard to ensure high availability.

Therefore, it is critical to analyze the redundancy setup required to achieve streamlined high availability.

In the zero downtime system design, modeling and simulation are used to plan maintenance and upgrades before failure can occur. Redundancy is ensured at every level, failed components are hot-swapped automatically, and the system undergoes thorough testing before being deployed.

Fault instrumentation is also useful for high availability, especially in setups with limited redundancy. Fault indicators direct maintenance efforts to faulty components as soon as they go down, so they can be brought back up swiftly and effectively.

In a passive redundancy system design, sufficient excess capacity is provisioned to absorb performance issues. However, this capacity is kept inactive until needed, which reduces costs but carries a higher potential for downtime.

Conversely, active redundancy refers to a system design wherein multiple identical components operate in parallel so that if one fails, the other continues to operate without downtime.

Redundancy simulation is also a useful methodology in high availability systems. It involves stress testing system capacity by intentionally shutting down components.

Finally, it is critical to remember that human error is a prevalent cause of system outages, making automation essential for effective high availability.

Fault Tolerance

The basic system design for fault tolerance requires the following features.

Fault detection 

The system must be able to detect the faulty component in case of failure. To achieve this, dedicated failure detection mechanisms must be added to the system. Faults can be categorized based on cause, effect, locality, and duration.

Single point of failure removal

Any single point of failure within the system must be made redundant so that the overall system can continue operating without interruption, even during failure.

Reversion mode availability

Reversion mode allows the system to switch back to its original configuration after the failure is repaired. For instance, once a system switches to backup mode during a failure and the fault is fixed, reversion mode ensures that it can revert to the primary mode without system downtime or loss of information.

Fault containment

Finally, some faults can drive a system to failure by propagating the cause of failure throughout the system. To ensure the failure doesn’t spread in such cases, a firewall or other mechanism must be implemented to contain the fault and protect the system.
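One common software mechanism for this is a circuit breaker, which stops calling a repeatedly failing component so its errors do not cascade through the rest of the system. The sketch below is illustrative; the threshold and class name are assumptions, not a specific library's API:

```python
class CircuitBreaker:
    """Isolate a failing dependency after too many consecutive errors."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, func, *args, **kwargs):
        if self.failures >= self.max_failures:
            # Circuit is "open": fail fast instead of propagating the fault.
            raise RuntimeError("circuit open: dependency isolated")
        try:
            result = func(*args, **kwargs)
            self.failures = 0          # a success resets the failure counter
            return result
        except Exception:
            self.failures += 1         # record the fault and contain it locally
            raise

breaker = CircuitBreaker(max_failures=2)
```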

See More: What Is a Security Vulnerability? Definition, Types, and Best Practices for Prevention

Takeaway

High availability is typically expressed as a percentage; a higher percentage means less downtime and a more highly available system. Outside of truly life-critical devices such as pacemakers, very few systems strive to achieve true 100% availability. This is because 99.999% availability (five nines uptime) already translates to only about five minutes of downtime annually, which is sufficient for most real-world systems.

On the other hand, fault tolerance is not easily measurable. Generally, a system is either classified as fault-tolerant or not. If a system is fault-tolerant, it means it is designed to keep responding even if some of its components fail.

While both these concepts are similar, key differences exist in terms of operations, techniques, and system design. However, the ultimate goal of both solutions is to ensure a high response rate and low downtime. This often means both are deployed in enterprise systems.
