Reliability Pillar - This article is part of a series.
If you work with AWS, you are likely familiar with the AWS Well-Architected Framework and understand how crucial it is to periodically evaluate your system against it to ensure its stability and efficiency.
For those who may not be familiar, the AWS Well-Architected Framework provides a set of design principles, best practices and questions that help you to evaluate architectures.
The AWS Well-Architected Framework #
The set of design principles, best practices and questions is organized in pillars. The AWS Well-Architected Framework is based on six pillars as defined in the documentation:
- Operational excellence: the ability to support development and run workloads effectively, gain insight into their operations, and to continuously improve supporting processes and procedures to deliver business value.
- Security: the ability to protect data, systems, and assets to take advantage of cloud technologies to improve your security.
- Reliability: the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.
- Performance efficiency: the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.
- Cost optimization: the ability to run systems to deliver business value at the lowest price point.
- Sustainability: the ability to continually improve sustainability impacts by reducing energy consumption and increasing efficiency across all components of a workload by maximizing the benefits from the provisioned resources and minimizing the total resources required.
I mentioned earlier that the framework helps you to evaluate architectures. But how is the evaluation of the architecture done in practice?
The evaluation is essentially a conversation, lasting a couple of hours at most, among the team members who designed and built a particular architecture. This discussion is preferably held in front of a whiteboard, with architectural diagrams on hand. To ensure a positive experience and extract the most value from the conversation, it’s important to avoid any blame-oriented attitudes during the session.
For enterprise projects, it’s common that only a subset of the team is involved in designing and building specific services or components of the system, rather than the entire team.
AWS uses the term Workload to indicate a collection of resources and code that delivers business value, such as a customer-facing application or a backend process. A workload might consist of a subset of resources in a single AWS account or be a collection of multiple resources spanning multiple AWS accounts. When reviewing a workload, it’s important to involve only those who designed and built that particular workload.
The evaluation isn’t a one-time process carried out after the architecture is built; rather, it’s an ongoing part of the project’s lifecycle. It should begin early in the design phase and occur prior to going live.
The purpose of the review is to identify critical issues and areas for improvement, organizing them into a list of action points that can be prioritized based on the project’s needs.
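As a rough illustration of that last step, the outcome of a review could be captured as a small, sortable list. The sketch below is only one way to model it: the `ActionPoint` type, the `Risk` scale, the `effort_days` field and the sample findings are all hypothetical and are not part of the framework or of any AWS tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Risk(IntEnum):
    """Hypothetical risk scale; a higher value means more urgent."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class ActionPoint:
    pillar: str        # pillar the finding belongs to, e.g. "reliability"
    description: str   # what was found during the review
    risk: Risk         # how severe the team judged the issue to be
    effort_days: int   # rough estimate of the work needed to fix it


def prioritize(points: list[ActionPoint]) -> list[ActionPoint]:
    """Order action points by highest risk first, then by lowest effort."""
    return sorted(points, key=lambda p: (-p.risk, p.effort_days))


# Made-up findings, purely for illustration.
findings = [
    ActionPoint("reliability", "No alarm on the dead-letter queue", Risk.MEDIUM, 1),
    ActionPoint("reliability", "Backups are never restored in a test", Risk.HIGH, 3),
    ActionPoint("security", "Function role has wildcard permissions", Risk.HIGH, 2),
]

for point in prioritize(findings):
    print(f"[{point.risk.name}] {point.pillar}: {point.description}")
```

Sorting by risk and then by effort is just one possible policy; a team could equally weight business impact or deadlines, which is what "based on the project’s needs" leaves open.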
Reliability pillar #
Let’s revisit the definition of reliability: The ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle.
But what does that really mean?
To keep it simple, a reliable system is a system that can be trusted. A system can be trusted when failures are not common, when the impact of each failure is minimized and the system continues to run despite them. In today’s world, reliability is a keystone for the success of any system.
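One common way to put numbers on "failures are not common" and "the impact of each failure is minimized" is availability, computed from the mean time between failures (MTBF) and the mean time to repair (MTTR). The sketch below is a minimal illustration of that arithmetic; the function names and the sample figures are assumptions chosen for the example, not values taken from the framework.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_hours(avail: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1 - avail) * HOURS_PER_YEAR


# Illustrative figures: one failure every 30 days, one hour to recover.
baseline = availability(mtbf_hours=30 * 24, mttr_hours=1.0)

# Same failure rate, but recovery takes half as long.
faster_recovery = availability(mtbf_hours=30 * 24, mttr_hours=0.5)

print(f"baseline:        {baseline:.5f}")
print(f"faster recovery: {faster_recovery:.5f}")
print(f"downtime saved per year: "
      f"{yearly_downtime_hours(baseline) - yearly_downtime_hours(faster_recovery):.2f} h")
```

The formula shows both levers mentioned above: making failures rarer raises MTBF, while limiting the impact of each failure lowers MTTR, and either change increases availability.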
In this series I would like to dig deeper into the reliability pillar. We will examine the questions presented by the framework and try to address some of them within a serverless context.
There are 13 questions, grouped into 4 areas:
- Foundations
- Workload Architecture
- Change Management
- Failure Management
Foundations #
How do you manage Service Quotas and constraints?
How do you plan your network topology?
Workload Architecture #
How do you design your workload service architecture?
How do you design interactions in a distributed system to prevent failures?
How do you design interactions in a distributed system to mitigate or withstand failures?
Change Management #
How do you monitor workload resources?
How do you design your workload to adapt to changes in demand?
How do you implement change?
Failure Management #
How do you back up data?
How do you use fault isolation to protect your workload?
How do you design your workload to withstand component failures?
How do you test reliability?
How do you plan for disaster recovery (DR)?
Given the number of questions, this is going to be less of a series and more of a saga.
Stay tuned! 📻 📡