Reliability pillar - The structure

Reliability Pillar - This article is part of a series.

Part 0: Series - The Well-Architected Framework: Reliability pillar

Part 1: This Article

Part 2: Reliability pillar: How do you manage Service Quotas and constraints?

In the introductory article of this series, I described the structure of the AWS Well-Architected Framework and how to use it to evaluate your architectures.

I also briefly introduced the Reliability Pillar and listed the 13 questions we will examine in this series.

Before we get started with the questions, let’s delve into the Reliability Pillar structure.

The Reliability pillar structure

The Reliability pillar is the set of design principles, best practices and questions related to Reliability.

Design Principles

There are five design principles for reliability in the cloud:

Automatically recover from failure
Test recovery procedures
Scale horizontally to increase aggregate workload availability
Stop guessing capacity
Manage change in automation

Automatically recover from failure

The only way to recover from failure automatically is to monitor your system. Identify metrics that measure business value and monitor them, so that you can act as soon as a threshold is breached. The action has to be automatic. The more advanced the automation, the easier it is to be proactive and prevent failures before they happen.
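
To make the idea concrete, here is a minimal sketch of a threshold alarm that triggers a recovery action without human intervention. Everything in it is an assumption made for illustration: the `Alarm` class, the `orders_per_minute` metric, the threshold and the recovery action are hypothetical and do not correspond to any AWS API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Alarm:
    """Hypothetical alarm: fires when a business metric stays below a threshold."""
    metric_name: str
    threshold: float
    datapoints_to_alarm: int        # consecutive breaching samples required
    action: Callable[[], None]      # automated recovery step, no human involved

    def evaluate(self, samples: List[float]) -> bool:
        """Run the action if the most recent samples all breach the threshold."""
        recent = samples[-self.datapoints_to_alarm:]
        breached = (
            len(recent) == self.datapoints_to_alarm
            and all(value < self.threshold for value in recent)
        )
        if breached:
            self.action()
        return breached


# Stand-in for a real recovery step such as replacing an unhealthy instance.
recovery_log: List[str] = []

alarm = Alarm(
    metric_name="orders_per_minute",   # assumed business value metric
    threshold=10.0,
    datapoints_to_alarm=3,
    action=lambda: recovery_log.append("replace unhealthy instance"),
)

healthy = alarm.evaluate([42.0, 40.0, 38.0, 41.0])   # no breach, nothing happens
degraded = alarm.evaluate([42.0, 9.0, 4.0, 0.0])     # three breaches in a row
```

Requiring several consecutive breaching samples is a design choice in this sketch: it keeps a single noisy data point from triggering a recovery action.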

Test recovery procedures

You should regularly test and validate your recovery procedures. With automation, you can simulate different failure scenarios: completely new ones, or scenarios that have led to failures before. Testing recovery procedures reduces risk, because you can find and fix the weak points before a real failure exposes them.
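
As a small illustration, the sketch below injects a failure into a hypothetical recovery procedure and checks that the procedure behaves as expected. The function names (`read_with_failover`, `failing_primary`, `healthy_replica`) are made up for this example, and a real test would exercise actual infrastructure rather than in-process functions.

```python
from typing import Callable


def read_with_failover(primary: Callable[[], str], replica: Callable[[], str]) -> str:
    """Hypothetical recovery procedure: fall back to the replica if the primary fails."""
    try:
        return primary()
    except ConnectionError:
        return replica()


def failing_primary() -> str:
    # Injected failure: simulates the primary being unreachable.
    raise ConnectionError("primary unavailable")


def healthy_replica() -> str:
    return "data from replica"


def test_failover_when_primary_is_down() -> None:
    # The recovery procedure must serve data even though the primary is down.
    assert read_with_failover(failing_primary, healthy_replica) == "data from replica"


test_failover_when_primary_is_down()
```

Running a test like this on a schedule, rather than once, is what turns it into a validation of the recovery procedure.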

Scale horizontally to increase aggregate workload availability

You should reduce the impact of a single failure by using multiple small resources instead of one large resource, so that losing any one of them affects only a fraction of the workload.
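
A bit of arithmetic shows why this helps. The sketch below assumes that resources are identical, share the load evenly and fail independently of each other, which is a simplification; the function names and the 99% figure are mine, not part of the framework.

```python
def capacity_lost_on_single_failure(resource_count: int) -> float:
    """Fraction of total capacity lost when exactly one resource fails."""
    return 1 / resource_count


def availability_of_any(resource_availability: float, resource_count: int) -> float:
    """Probability that at least one resource is up, assuming independent failures."""
    return 1 - (1 - resource_availability) ** resource_count


# One large resource: a single failure takes out all capacity.
single = capacity_lost_on_single_failure(1)

# Four small resources: a single failure takes out a quarter of the capacity.
four = capacity_lost_on_single_failure(4)

# Two resources at an assumed 99% availability each.
pair = availability_of_any(0.99, 2)
```

The independence assumption matters: if the resources share a common dependency, they can fail together and the benefit shrinks.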

Stop guessing capacity

You should monitor resource utilization and automate the addition or removal of resources to maintain an optimal level of provisioning.
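
One simple way to automate this is a proportional rule that resizes the fleet so that average utilization moves back toward a target. The sketch below is only an illustration under that assumption: the function name, its parameters and the numbers are hypothetical, and it is not a description of how any specific AWS scaling policy is implemented.

```python
import math


def desired_capacity(
    current_capacity: int,
    average_utilization: float,   # e.g. average CPU utilization in percent
    target_utilization: float,    # utilization we want to converge to
    min_capacity: int,
    max_capacity: int,
) -> int:
    """Scale capacity proportionally to the observed load, within fixed bounds."""
    raw = math.ceil(current_capacity * average_utilization / target_utilization)
    return max(min_capacity, min(max_capacity, raw))


scale_out = desired_capacity(4, 90.0, 60.0, min_capacity=2, max_capacity=10)   # 6
scale_in = desired_capacity(4, 15.0, 60.0, min_capacity=2, max_capacity=10)    # 2 (lower bound)
capped = desired_capacity(8, 120.0, 60.0, min_capacity=2, max_capacity=10)     # 10 (upper bound)
```

The lower bound keeps a minimum of redundancy in place, while the upper bound protects against runaway scaling and unexpected cost.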

Manage change in automation

Any changes to your infrastructure should be executed via automation. This encompasses changes to the automation process itself, which can then be tracked and subjected to review.

Best Practices and Questions

Reliability best practices and questions are tightly coupled and grouped by the following areas:

Foundations
Workload architecture
Change management
Failure management

Each question comes with a list of related best practices. Examining and answering a question therefore means going through its best practices and checking whether each one is established in your workload.

Each best practice has a well-defined structure with the following components:

Desired outcome: a description of how the workload should behave if the best practice has been established.
Common anti-patterns: a list of poor decisions that should be avoided.
Benefits of establishing this best practice: a list of reasons why the best practice should be implemented.
Level of risk exposed if this best practice is not established: a label that can be low, medium or high.
Implementation guidance: a section that describes how to establish the best practice.
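
If you want to track a review in code, the components above map naturally onto a small data model. The sketch below is one possible shape, not an official schema: the class names, the `established` flag and the example best practice are hypothetical placeholders, and only the question text comes from the framework.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class BestPractice:
    title: str
    desired_outcome: str
    common_anti_patterns: List[str]
    benefits: List[str]
    risk_if_not_established: RiskLevel
    implementation_guidance: str
    # Review result for your workload; not part of the framework text.
    established: bool = False


@dataclass
class Question:
    text: str
    best_practices: List[BestPractice] = field(default_factory=list)

    def open_risks(self, level: RiskLevel) -> List[str]:
        """Titles of best practices at the given risk level that are not established."""
        return [
            bp.title
            for bp in self.best_practices
            if not bp.established and bp.risk_if_not_established is level
        ]


question = Question(
    text="How do you manage Service Quotas and constraints?",
    best_practices=[
        BestPractice(
            title="Example best practice (placeholder)",
            desired_outcome="Placeholder description of the expected behavior.",
            common_anti_patterns=["Placeholder anti-pattern"],
            benefits=["Placeholder benefit"],
            risk_if_not_established=RiskLevel.HIGH,
            implementation_guidance="Placeholder guidance.",
        )
    ],
)

high_risks = question.open_risks(RiskLevel.HIGH)
```

Filtering by risk level gives you a simple way to prioritize: the high-risk best practices that are not yet established are the ones to address first.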

Conclusions

In this article, we delved into the Reliability Pillar structure, describing:

the reliability design principles
the structure of the best practices we are going to see

You need this background before we get started with the questions.

In the next article, we are finally going to examine the first question of the Reliability pillar:

How do you manage Service Quotas and constraints?

See you soon! 👋

Resources

Reliability Pillar Design principles
