Failure as part of cloud software architecture

The cloud journey requires organizations to re-evaluate “constants” which were taboo in legacy environments. One point that deserves particular emphasis is that failure must be taken into account when planning and deciding on an architecture for a cloud environment. The reasons and logic behind this approach are what this article is about.

Vantage point

Understanding the “whys” and “wheres” depends mainly on the observer, and it can be very difficult to align all the relevant teams (DevOps, Dev, Ops etc.) on the fact that an application can be destroyed at any given point in time, more often than not without warning, aggressively and abruptly.


It is important to emphasize here that when I say application, I do not mean the end-user service. That should be kept as close to 100% uptime as possible, but we’ll touch on that in the next sections.


The logic – All applications fail eventually

Cloud environments and Kubernetes specifically are designed with a basic fact in mind:

“All applications fail eventually”.  

There are myriad reasons for failure, ranging from bugs and load to platform maintenance, server maintenance and even cloud provider outages. So there is no doubt that “all applications fail eventually”; the question is what happens when they do.


In Kubernetes environments, the basic approach is: if an application fails, restart it. Kubernetes has several components (such as the controller-manager, the scheduler and the kubelet on each node) which constantly monitor and report each Pod’s state. When that state changes, actions are taken automatically on the assumption that the application might recover, which keeps the SLO high: end users are not impacted, or are less impacted, because the service is renewed.


The above fact leads us to two very important guidelines when designing cloud applications:

  1. Design for failure
  2. Fail fast


Moreover, Kubernetes provides tooling, in the form of liveness, readiness and startup probes, for detecting when an application should be restarted or taken out of rotation even though it did not crash. We will discuss these probes in the next sections.


Design for failure

I’ve often been asked: what does that mean? How can an organization design an application that will fail? (Isn’t the point of good code not to fail?)


Design for failure does not mean that the application should be designed to fail, but rather that the design should answer the question: what happens when the application eventually fails?


This means taking into account some things that were never an issue in legacy applications, and some that were an issue but were easy to overcome on static server locations.


Some of the main architecture guidelines to take into account when designing a cloud application are:

“In flight” Data persistency

If a Pod (or a process inside a container inside a Pod) that is consuming data from a queue, database or any other source for processing dies, the data needs to remain available and be processed by another Pod, so that the data being processed in the failing Pod (in-flight data) is not lost.
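
One common way to get this behaviour is to remove a message from the source only after the consumer acknowledges it, and to hand it out again if no acknowledgement arrives in time. The sketch below is a toy in-memory illustration of that idea, not the API of any real queue; the class name `AckQueue` and the `visibility_timeout` parameter are hypothetical names chosen for this example.

```python
import time
from collections import deque


class AckQueue:
    """Toy in-memory queue: a message is dropped only after it is acknowledged."""

    def __init__(self, visibility_timeout=30.0, clock=time.monotonic):
        self._ready = deque()
        self._in_flight = {}  # message id -> (payload, redelivery deadline)
        self._timeout = visibility_timeout
        self._clock = clock
        self._next_id = 0

    def put(self, payload):
        self._ready.append((self._next_id, payload))
        self._next_id += 1

    def get(self):
        """Hand out the next message, or None if nothing is ready."""
        self._requeue_expired()
        if not self._ready:
            return None
        msg_id, payload = self._ready.popleft()
        self._in_flight[msg_id] = (payload, self._clock() + self._timeout)
        return msg_id, payload

    def ack(self, msg_id):
        """Called by the consumer only after processing has fully succeeded."""
        self._in_flight.pop(msg_id, None)

    def _requeue_expired(self):
        # A consumer that died never acknowledges, so its message becomes
        # available again once the deadline has passed.
        now = self._clock()
        expired = [m for m, (_, deadline) in self._in_flight.items() if deadline <= now]
        for msg_id in expired:
            payload, _ = self._in_flight.pop(msg_id)
            self._ready.append((msg_id, payload))
```

A consumer Pod would call `get()`, process the payload, and call `ack()` last. Because a message can be delivered more than once under this scheme, the processing step should be safe to repeat.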

Graceful termination

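In Kubernetes there is a grace period before a Pod is aggressively terminated: the process inside first receives a SIGTERM signal and, if it is still running when the grace period ends, it is killed with SIGKILL.

If your Pod has other replicas dependent on its operation, make sure it transfers its responsibilities to other replicas while it is terminating (that is, once the process inside receives the termination signal).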
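
As a rough sketch of what this can look like inside the application, the Python example below reacts to SIGTERM (the signal Kubernetes sends at the start of the grace period) by finishing the current job and then handing over. The function names `run_worker`, `fetch_job`, `process_job` and `hand_over` are hypothetical placeholders for whatever the application actually does.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Only record the request; the main loop decides how to wind down."""
    shutdown_requested.set()


def install_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(fetch_job, process_job, hand_over):
    """Process jobs until termination is requested, then hand over and return."""
    while not shutdown_requested.is_set():
        job = fetch_job()
        if job is None:
            # Idle briefly, but wake up early if shutdown is requested.
            shutdown_requested.wait(timeout=0.1)
            continue
        process_job(job)  # the job in progress is always finished
    hand_over()  # e.g. release a leader lock, flush buffers, close connections
```

The handler does as little as possible and leaves the real work to the main loop, so a job is never interrupted halfway. Everything in `hand_over` has to fit inside the Pod's grace period, because the process is killed when it ends.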

Aggressive termination

In some cases, such as node power-down (due to cluster scaling, for example) or node failure, Pods may become unavailable immediately, with no grace period or warning.

A good cloud native application should take such a scenario into account and make sure there are other replicas ready to keep the service available to end users and possibly apply data redundancy or replication. 

Persistent storage

Storage in Kubernetes environments can be attached to a Pod by binding a PersistentVolume (PV) through a PersistentVolumeClaim (PVC). This gives the application running inside the Pod a “disk” on which data survives even if the Pod is restarted.

It is, however, important to understand that there are limitations and vast differences between storage providers in how they are configured to behave and how they are accessed.

Unlike in legacy systems, data on a PV is not necessarily available after a Pod crashes, gets evicted or is replaced with a new instance for any reason, and it might not be accessible from multiple locations.


In order to help facilitate the Design for failure concept (and Fail fast, which is discussed further in this article), Kubernetes gives users the ability to decide when a Pod should be restarted and when it should be allowed to receive traffic.


The idea is to help existing applications, or applications which do not have such mechanisms built in, notify Kubernetes whether it should restart the application or send traffic to it.


The probes currently provided by Kubernetes, and the logic behind them, are intended for the following scenarios:


Liveness Probe

If an application is in a deadlock, or in any state in which it stops performing its intended tasks, the logic is to restart it and make it available again despite bugs that may exist in the application, in an effort to keep the SLO high.


Readiness Probe

The main use of readiness probes is to control whether an application should receive traffic from Services. When the readiness probe fails, Services configured to route to that Pod stop sending it traffic; other Pods that are in the Ready state continue to receive traffic.


Startup Probe

This probe can help applications being migrated survive the move to Kubernetes by delaying the liveness probe until the startup probe succeeds. For example, imagine an application with a warm-up time of 2 minutes (meaning it takes 2 minutes until the application is functioning). A liveness probe on its own may fail the Pod constantly, because the container is restarted before the application ever reaches a functioning state. A properly configured startup probe can account for the warm-up, data population or provisioning stage of the application and hold back the liveness probe until the application has started. From then on, the liveness and readiness probes can be checked more often to establish application availability.
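
To make the three probes more concrete, here is a minimal sketch of the application side, assuming the probes are configured as HTTP checks. The paths `/startupz`, `/readyz` and `/livez`, the port 8080 and the names `AppState`, `probe_status` and `start_probe_server` are all assumptions made for this example, not anything Kubernetes prescribes.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppState:
    """Flags the application flips as it moves through its lifecycle."""

    def __init__(self):
        self.started = threading.Event()  # warm-up has finished
        self.ready = threading.Event()    # able to serve traffic right now
        self.alive = threading.Event()    # main loop is still making progress
        self.alive.set()


def probe_status(state, path):
    """Map a probe path to an HTTP status code (the paths are hypothetical)."""
    checks = {
        "/startupz": state.started,
        "/readyz": state.ready,
        "/livez": state.alive,
    }
    flag = checks.get(path)
    if flag is None:
        return 404
    return 200 if flag.is_set() else 503


def start_probe_server(state, port=8080):
    """Serve the probe endpoints from a background thread."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(probe_status(state, self.path))
            self.end_headers()

        def log_message(self, format, *args):
            pass  # keep probe traffic out of the application log

    server = HTTPServer(("0.0.0.0", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The application would set `started` once its warm-up is done, toggle `ready` as it becomes able or unable to serve, and clear `alive` when it detects that it is stuck. Kubernetes treats an HTTP status from 200 to 399 as success, so the 503 responses count as probe failures. For the 2 minute warm-up above, the startup probe needs a failure threshold multiplied by check period of at least 120 seconds, for example a little more than 12 checks spaced 10 seconds apart.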

Fail Fast

Once we have established that every application is destined to fail at some point, it is also important to make sure that when a failure occurs, the application does not always try to recover from it. Instead, it should allow the failure to surface as fast as possible, so that the environment can detect it and spin up another instance as fast as possible to keep the SLO high.


This is application specific, but a good rule of thumb is to have deadman switches which stop the application in key situations where processing is not occurring properly or in a timely manner. For example:

If an application should be writing data to a file on a PV and the write fails, restarting the application might overcome a problematic mount.
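
A deadman switch of this kind could be sketched as follows. This is one possible shape under stated assumptions rather than a standard implementation: the class name `DeadmanSwitch`, its methods and the choice of `os._exit(1)` as the default reaction are all hypothetical choices made for this example.

```python
import os
import threading
import time


class DeadmanSwitch:
    """Stops the process if kick() is not called within `timeout` seconds."""

    def __init__(self, timeout, on_expire=None, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        # Default reaction: exit immediately with a non-zero code so the
        # platform sees a failed container and can replace it.
        self._on_expire = on_expire or (lambda: os._exit(1))
        self._last_kick = clock()
        self._lock = threading.Lock()

    def kick(self):
        """Call after every unit of successful work, e.g. a completed write."""
        with self._lock:
            self._last_kick = self._clock()

    def expired(self):
        with self._lock:
            return self._clock() - self._last_kick > self._timeout

    def check(self):
        """Fire the expiry action if progress has stalled."""
        if self.expired():
            self._on_expire()

    def watch(self, interval=1.0):
        """Run check() from a daemon thread every `interval` seconds."""

        def loop():
            while True:
                time.sleep(interval)
                self.check()

        threading.Thread(target=loop, daemon=True).start()
```

In the file-writing case, the application would call `kick()` only after a write has succeeded. If the mount goes bad and writes hang or keep failing, the switch expires and the process exits, leaving the restart to the environment.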

Chaotic behavior

One simple way to prepare for the behaviors and situations depicted above is to run your applications in chaotic environments and measure the SLO of the application in such environments.


You can do this manually as part of your testing process, but it is more efficient to perform it regularly and at random, with tooling to support it. One option is chaoskube (see operator), which kills random Pods at configured intervals.


Having your application survive a chaotic test environment increases the resulting SLO and makes the application more robust, as it tests both your Design for failure level and the Fail fast mechanisms in your application.
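
To get a feel for why replicas matter under random kills, the toy simulation below estimates availability when one random replica is killed at a fixed interval. It is only an illustration and does not model chaoskube or a real cluster; the function name `simulate_availability` and all of its parameters are assumptions made for this sketch.

```python
import random


def simulate_availability(replicas, ticks, kill_every, restart_ticks, seed=0):
    """Return the fraction of ticks in which at least one replica was up."""
    rng = random.Random(seed)
    down_until = [0] * replicas  # tick at which each replica is back up
    served = 0
    for tick in range(ticks):
        if tick % kill_every == 0:
            # Kill one random replica; it needs restart_ticks to come back.
            victim = rng.randrange(replicas)
            down_until[victim] = tick + restart_ticks
        if any(tick >= back_at for back_at in down_until):
            served += 1
    return served / ticks
```

With a single replica, a kill every 10 ticks and a restart time of 3 ticks, the service is down for 3 ticks out of every 10. With three replicas under the same kill schedule, at least two replicas are up at any moment, so the service as a whole stays available even though individual instances keep failing.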