When it comes to the reliability of your organizational systems, what can you do?
A fault-tolerant organization manages risk and uncertainty by designing and building resilient systems that keep operating, without loss of service, when certain failures occur. The critical elements of this approach include:
• Designing for resilience from the beginning of the project;
• Building with redundancy (for example, dual redundancy) wherever possible;
• Using multiple layers of protection against single points of failure;
• Testing and verifying the protections;
• Monitoring and maintaining the system's health; and
• Taking action when things go wrong.
The goal of a fault-tolerant organization is to reduce the probability of failure and minimize the impact of such failures on the business. This means that organizations must be able to detect problems early and take corrective action before they escalate into significant disruptions.
Let's talk about each of these in some more detail.
1. Designing for resilience from the beginning of the project
It is essential to start any new project with an understanding of how the system will behave if something goes wrong. In other words, you need to know how the system will fail so that you can build it accordingly.
For example, suppose you're planning to build a new factory. You could make sure that everything is built to withstand earthquakes, tornadoes, floods, fires, explosions, etc., but there's no point in doing that unless you have a good idea of which of those risks are actually likely at your site. Similarly, if you're planning to build an airplane, you'd better understand the potential hazards associated with flying, including turbulence, lightning strikes, engine failure, etc.
One essential tool for this is Failure Mode and Effects Analysis (FMEA), which lists the ways each component can fail and the effect each failure would have on the system.
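As a rough illustration of how an FMEA worksheet is commonly scored, the sketch below ranks failure modes by a risk priority number (RPN): the product of severity, occurrence, and detection ratings, each on a 1 to 10 scale. The failure modes and ratings are invented for illustration and do not come from any real analysis; the names FailureMode and rank_by_risk are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (almost certain)
    detection: int   # 1 (always detected) to 10 (almost never detected)

    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        for rating in (self.severity, self.occurrence, self.detection):
            if not 1 <= rating <= 10:
                raise ValueError("ratings must be between 1 and 10")
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn(), reverse=True)


# Hypothetical failure modes for a pumped supply line (illustrative only).
EXAMPLE_MODES = [
    FailureMode("pump seal leak", severity=6, occurrence=4, detection=3),
    FailureMode("power loss", severity=9, occurrence=2, detection=1),
    FailureMode("sensor drift", severity=5, occurrence=5, detection=8),
]

if __name__ == "__main__":
    for mode in rank_by_risk(EXAMPLE_MODES):
        print(f"{mode.name}: RPN {mode.rpn()}")
```

The highest-ranked items are the ones to design against first. Note that a hard-to-detect fault can outrank a more severe one that is always noticed.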
2. Building with redundancy
Redundancy is simply the presence of two or more copies of a component or system, either physical or logical. For example, if you want to ensure that your data center remains online, you might put several servers in different rooms. In a refinery, you might use two pumps instead of one on a critical supply line.
Redundancy can be achieved through physical duplication, i.e., having multiple identical copies of the same equipment, or logical duplication, i.e., running the same service on multiple networked computers so that one can take over if another fails. It can be applied at many levels, from a single spare component to a fully duplicated site.
In general, the higher the level of redundancy, the greater the cost. However, the benefits of redundancy often outweigh the costs.
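To get a feel for this trade-off, the sketch below estimates the combined availability of several redundant copies, where service survives as long as at least one copy is working. It assumes that copies fail independently of one another, which is a simplifying assumption: shared power, shared software bugs, or a shared location can make real failures correlated. The function name is hypothetical.

```python
def parallel_availability(unit_availability: float, copies: int) -> float:
    """Availability of `copies` redundant units when any one is sufficient.

    Assumes independent failures: the system is down only when every
    copy is down at once, which has probability (1 - a) ** copies.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** copies


if __name__ == "__main__":
    # A unit that is up 99% of the time, with one, two, and three copies.
    for copies in (1, 2, 3):
        combined = parallel_availability(0.99, copies)
        print(f"{copies} copies: {combined:.6f}")
```

Under these assumptions, each additional copy removes a further factor of the remaining downtime, so the first duplicate usually buys the most improvement per unit of cost.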
3. Using multiple layers of protection
There are many ways to protect a system from failure. Each has advantages and disadvantages. Some examples include:
• Physical barriers – These include walls, doors, fences, locks, guards, etc. They prevent unauthorized access and help protect against natural disasters such as fire, flood, and earthquake.
• Security devices – These include alarms, cameras, sensors, intrusion detection devices, etc. They help identify threats and allow you to respond quickly.
• Backup power supplies – These provide continuous power to the system in case of primary power loss.
• Backups – These are copies of the original data stored elsewhere. The backups may be local (on the same computer) or remote (in another location).
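For the last item, a backup only protects you if the copy is intact. The following is a minimal sketch, assuming a single file and a backup directory of your choosing, that copies the file and then confirms the copy matches the original by comparing SHA-256 checksums. The function names back_up and sha256_of are hypothetical helpers, not part of any particular backup tool.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and verify the copy.

    Raises OSError if the copied file does not match the original.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copies contents and file metadata
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} does not match the original")
    return target
```

For a remote backup, backup_dir would point at storage in another location, such as a mounted network drive.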
4. Testing and verifying the protections
Testing and verification are essential parts of building resilient systems. To do this well, you need to test both the software and the hardware to see what happens when things go wrong. This testing should cover both the regular operation of the system and extreme conditions, such as large numbers of simultaneous requests or unexpected input.
A fire drill is a familiar example: organizations conduct one to test their building evacuation process. During the drill, each team member leaves the building and afterwards reports on the experience.
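On the software side, the equivalent of a fire drill is to inject a fault on purpose and check that the protection behaves as intended. The sketch below is one possible way to do this: FlakyService is a hypothetical stand-in for a dependency that fails a chosen number of times, and call_with_retry is a hypothetical retry helper under test. Neither comes from a specific library.

```python
class FlakyService:
    """Hypothetical dependency that fails a set number of times, then recovers."""

    def __init__(self, failures_before_success: int):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retry(operation, attempts: int = 3):
    """Call `operation` up to `attempts` times, re-raising the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
    raise last_error


def test_recovers_from_transient_faults():
    # Two injected failures, three attempts: the third call should succeed.
    service = FlakyService(failures_before_success=2)
    assert call_with_retry(service.fetch, attempts=3) == "ok"
    assert service.calls == 3


def test_gives_up_on_persistent_faults():
    # More failures than attempts: the error must surface, not be hidden.
    service = FlakyService(failures_before_success=5)
    try:
        call_with_retry(service.fetch, attempts=3)
    except ConnectionError:
        pass
    else:
        raise AssertionError("expected the fault to surface")
    assert service.calls == 3
```

The second test matters as much as the first: a protection that silently swallows a persistent failure is itself a hazard.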
5. Monitoring and maintaining the system's health
Health monitoring is vital for safety-critical applications. It helps keep the system running smoothly and provides early warning of problems. It includes checking the system's status, identifying any issues, and taking corrective action before those issues become serious.
Common monitoring methods include:
• Logging. System logs contain information about how the system operates. These logs can be examined to find out why the system failed.
• Alerting. Systems can send alerts to people whenever something goes wrong. These alerts can tell people to contact the support staff or to restart the system.
• Error detection. Systems can check their own status periodically to see whether any errors have occurred. If an error does occur, the system can report it to the administrator.
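These three methods are often combined. The sketch below shows one possible shape for a periodic self-check: each probe is run, the result is logged, and a single alert is sent if anything is unhealthy. The check_health function, the probe names, and the alert callable are all hypothetical; in practice the alert might send an email or page the support staff.

```python
import logging

logger = logging.getLogger("health")


def check_health(probes, alert):
    """Run each probe, log the result, and alert on any failure.

    `probes` maps a component name to a zero-argument function that
    returns True when the component is healthy. `alert` is any callable
    that delivers a message to a person. Returns the failed names.
    """
    failed = []
    for name, probe in probes.items():
        try:
            healthy = bool(probe())
        except Exception:
            # A probe that crashes is treated as an unhealthy component.
            logger.exception("probe %s raised an error", name)
            healthy = False
        if healthy:
            logger.info("%s is healthy", name)
        else:
            logger.error("%s is unhealthy", name)
            failed.append(name)
    if failed:
        alert("Unhealthy components: " + ", ".join(failed))
    return failed


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Illustrative probes only; real probes would query the actual system.
    example_probes = {"database": lambda: True, "disk": lambda: False}
    check_health(example_probes, alert=print)
```

A scheduler would call check_health at a fixed interval, and the log it produces becomes the record examined after a failure.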
6. Taking action when things go wrong
When something goes wrong with the system, it is called an incident. An incident management plan defines how the organization will handle incidents. There are three main types of actions taken in response to an incident:
• Recovery - restoring service to customers as soon as possible.
• Investigation - finding out why the incident happened.
• Prevention - changing the system or its processes so that similar incidents do not recur.
Considerations in Building a Fault-Resistant System
In designing a resilient system, you need to consider the following factors:
• Availability: What percentage of time should your system be up and running?
• Resiliency: How much disruption can your application absorb, and how quickly must it recover, before service is lost completely?
• Scalability: Can your system handle increasing loads without breaking down?
• Performance: How fast should your system run?
• Cost: Is it possible to implement a robust solution with minimal cost?
• Reliability: How likely is it that your system will break down?
• Maintainability: How easy is it to make changes to your system?
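The availability question becomes more concrete once a percentage is converted into the downtime it allows. The sketch below does this arithmetic for a year of 365 days; the function name and the example targets are illustrative assumptions, not recommendations.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% allows {hours:.2f} hours of downtime per year")
```

A 99.9% target, for instance, leaves about 8.76 hours of downtime per year, while 99.99% leaves under one hour. This is why each additional "nine" tends to raise the cost question in the list above.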
The primary goal of a fault-tolerant organization is to protect its business from loss due to system failure. A reliable system ensures the availability of critical services while keeping costs low. To achieve these goals, you must design and build fault-resistant systems.