
Reliability pillar - The structure

The indie coder
Software Engineer
Reliability Pillar - This article is part of a series.
Part 1: This Article

In the introductory article of this series, I described the structure of the AWS Well-Architected Framework and how to use it to evaluate your architectures.

I also briefly introduced the Reliability Pillar and listed the 13 questions we will examine in this series.

Before we get started with the questions, let’s delve into the Reliability Pillar structure.

The Reliability pillar structure

The Reliability pillar is the set of design principles, best practices and questions related to Reliability.

Design Principles

There are five design principles for reliability in the cloud:

  1. Automatically recover from failure

  2. Test recovery procedures

  3. Scale horizontally to increase aggregate workload availability

  4. Stop guessing capacity

  5. Manage change in automation

Automatically recover from failure

You can only recover from failure automatically if you monitor your system. Identify the metrics that reflect business value and monitor them, so that you can act as soon as a threshold is breached. The action itself has to be automatic. The more advanced the automation, the easier it is to be proactive and prevent failures before they happen.
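To make the idea concrete, here is a minimal sketch of a threshold rule wired to an automatic action. Everything in it is an assumption for illustration: the `MetricRule` type, the `failed_orders_per_minute` metric, the threshold value and the `restart_checkout_service` action are hypothetical, and a real workload would delegate this to its monitoring and alarming service.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MetricRule:
    """Hypothetical rule tying a business metric to a recovery action."""
    name: str
    threshold: float
    recover: Callable[[], str]  # automatic action executed on breach


def check(rule: MetricRule, observed: float) -> Optional[str]:
    """Run the recovery action only when the threshold is breached."""
    if observed > rule.threshold:
        return rule.recover()
    return None


def restart_checkout_service() -> str:
    # Placeholder for a real remediation (restart, failover, rollback...).
    return "checkout service restarted"


rule = MetricRule(
    name="failed_orders_per_minute",  # illustrative business metric
    threshold=5.0,                    # illustrative value
    recover=restart_checkout_service,
)

print(check(rule, observed=2.0))  # below the threshold: nothing happens
print(check(rule, observed=9.0))  # breach: the action runs without a human
```

The point of the sketch is that the metric and the action are declared together, so nobody has to decide what to do at the moment the threshold is crossed.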

Test recovery procedures

You should regularly test and validate your recovery procedures. With automation, you can simulate different failure scenarios: completely new ones, or scenarios that have led to failures before. Testing recovery procedures reduces risk, because you can find and fix weaknesses before a real failure occurs.
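A recovery drill can be sketched as a small automated test: inject a failure, run the recovery procedure, then validate the result. The `FakeDatabase` class and the `recover` function below are hypothetical stand-ins that I made up for the example. A real drill would target an actual dependency in a test environment.

```python
class FakeDatabase:
    """In-memory stand-in for a real dependency (hypothetical)."""

    def __init__(self) -> None:
        self.healthy = True
        self.restored_from_backup = False

    def inject_failure(self) -> None:
        # Simulate the failure scenario we want to rehearse.
        self.healthy = False


def recover(db: FakeDatabase) -> None:
    """The recovery procedure under test: restore from backup."""
    db.restored_from_backup = True
    db.healthy = True


def run_recovery_drill(db: FakeDatabase) -> bool:
    """Inject a failure, run the procedure and validate the outcome."""
    db.inject_failure()
    if db.healthy:
        return False  # the failure was not injected, so the drill is void
    recover(db)
    return db.healthy and db.restored_from_backup


print(run_recovery_drill(FakeDatabase()))
```

Running a drill like this on a schedule is what turns a documented procedure into a validated one.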

Scale horizontally to increase aggregate workload availability

You should reduce the impact of a single point of failure by using multiple small resources instead of a large one.
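A quick calculation shows why this helps. The sketch below assumes that resources fail independently and that each one is available 99% of the time. Both assumptions are mine and are only there to illustrate the reasoning. It compares one large resource with four small ones, where any three of them can carry the load.

```python
from math import comb


def availability_at_least(needed: int, total: int, single: float) -> float:
    """Probability that at least `needed` of `total` independent resources are up."""
    return sum(
        comb(total, up) * single**up * (1 - single) ** (total - up)
        for up in range(needed, total + 1)
    )


# Illustrative numbers: every resource is up 99% of the time.
one_large = availability_at_least(needed=1, total=1, single=0.99)
four_small = availability_at_least(needed=3, total=4, single=0.99)

print(f"one large resource:   {one_large:.4%}")
print(f"3 of 4 small healthy: {four_small:.4%}")
```

Under these assumptions the four small resources give a higher aggregate availability than the single large one, because the workload survives the loss of any one of them.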

Stop guessing capacity

You should monitor resource utilization and automate the addition or removal of resources to maintain an optimal level of provisioning.
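As a rough illustration, the scaling decision can be reduced to a proportional rule: resize the fleet so that utilization moves back towards a target. The function name, the 60% target and the capacity limits below are assumptions chosen for the example, not values taken from the framework.

```python
import math


def desired_capacity(
    current: int,
    utilization_pct: int,
    target_pct: int = 60,  # illustrative target utilization
    minimum: int = 1,      # illustrative lower bound
    maximum: int = 10,     # illustrative upper bound
) -> int:
    """Resize proportionally to observed load, within fixed bounds."""
    raw = math.ceil(current * utilization_pct / target_pct)
    return max(minimum, min(maximum, raw))


print(desired_capacity(current=4, utilization_pct=90))  # overloaded: scale out
print(desired_capacity(current=4, utilization_pct=30))  # underused: scale in
```

The bounds matter as much as the formula: the minimum protects availability during quiet periods, while the maximum keeps a faulty metric from scaling the fleet without limit.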

Manage change in automation

Any changes to your infrastructure should be executed via automation. This encompasses changes to the automation process itself, which can then be tracked and subjected to review.
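One way to picture this principle is a tiny "plan and apply" loop: the desired state is declared, the automation computes the changes, and every applied change is recorded so it can be reviewed later. The `plan` and `apply` functions and the configuration keys below are hypothetical. Real workloads would normally use an infrastructure-as-code tool for this.

```python
from typing import Any, Dict, List, Tuple

Change = Tuple[str, str, Any]  # (action, key, new value)


def plan(current: Dict[str, Any], desired: Dict[str, Any]) -> List[Change]:
    """Compute the changes needed to move from current to desired state."""
    changes: List[Change] = []
    for key in sorted(desired):
        if key not in current:
            changes.append(("add", key, desired[key]))
        elif current[key] != desired[key]:
            changes.append(("update", key, desired[key]))
    for key in sorted(current):
        if key not in desired:
            changes.append(("remove", key, None))
    return changes


def apply(current: Dict[str, Any], changes: List[Change], audit_log: List[str]) -> Dict[str, Any]:
    """Apply the planned changes and record each one for later review."""
    result = dict(current)
    for action, key, value in changes:
        if action == "remove":
            result.pop(key)
        else:
            result[key] = value
        audit_log.append(f"{action} {key}")
    return result


current_state = {"instances": 2, "debug": True}               # illustrative
desired_state = {"instances": 3, "region": "eu-west-1"}       # illustrative
log: List[str] = []
new_state = apply(current_state, plan(current_state, desired_state), log)
print(new_state)
print(log)
```

Because the plan is computed before anything is applied, it can be reviewed like any other change, and the log gives a trace of what the automation actually did.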

Best Practices and Questions

Reliability best practices and questions are tightly coupled and grouped by the following areas:

  1. Foundations

  2. Workload architecture

  3. Change management

  4. Failure management

Each question comes with a list of related best practices. Answering a question therefore means going through its best practices and checking whether each one is established in your workload.

Each best practice has a well-defined structure with the following components:

  • Desired outcome: a description of how the workload should behave if the best practice has been established.
  • Common anti-patterns: a list of poor decisions that should be avoided.
  • Benefits of establishing this best practice: a list of reasons why the best practice should be implemented.
  • Level of risk exposed if this best practice is not established: a label that can be low, medium or high.
  • Implementation guidance: a section that describes how to establish the best practice.
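The structure above maps naturally onto a small data model, which can be handy if you want to track a review in code. This is only a sketch: the class and field names are my own, the `established` flag is something I added to record the review result, and the two best practices in the example are placeholders rather than real entries from the framework.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class BestPractice:
    title: str
    desired_outcome: str
    anti_patterns: List[str]
    benefits: List[str]
    risk_if_not_established: Risk
    implementation_guidance: str
    established: bool = False  # filled in during the review (my addition)


@dataclass
class Question:
    text: str
    area: str
    best_practices: List[BestPractice] = field(default_factory=list)

    def gaps(self) -> List[BestPractice]:
        """Best practices that are not yet established in the workload."""
        return [bp for bp in self.best_practices if not bp.established]


question = Question(
    text="How do you manage Service Quotas and constraints?",
    area="Foundations",
    best_practices=[
        # Placeholder entries, not real best practices from the framework.
        BestPractice("Placeholder A", "outcome", ["anti-pattern"], ["benefit"],
                     Risk.HIGH, "guidance", established=True),
        BestPractice("Placeholder B", "outcome", ["anti-pattern"], ["benefit"],
                     Risk.MEDIUM, "guidance"),
    ],
)

for bp in question.gaps():
    print(f"{bp.title}: risk {bp.risk_if_not_established.value}")
```

Listing the gaps together with their risk level gives a simple way to decide which best practice to tackle first.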


In this article we delved into the Reliability Pillar structure, describing:

  • the reliability design principles
  • the structure of the best practices we are going to see.

This knowledge is fundamental to have before we get started with the questions.

In the next article, we are finally going to examine the first question of the Reliability pillar:

How do you manage Service Quotas and constraints?

See you soon! 👋

