Crafting Resilient Systems on the Cloud by Parna Mehta
Parna Mehta
Cloud Support Engineer
Building Resilient Cloud Systems: A Guide to Ensuring Uninterrupted Business Operations
In today’s fast-paced digital environment, the concept of cloud resilience has become more crucial than ever. As businesses grapple with the challenges of maintaining continuous operations, understanding how to create resilient systems in the cloud is essential to safeguard your brand and customer relationships. Let’s dive into the key components of crafting resilient cloud systems.
The Importance of Cloud Resilience
Cloud resilience refers to a system's ability to withstand and recover from disruptions, ensuring that applications run 24/7 without interruption. As a cloud support engineer, I've witnessed first-hand the repercussions of inadequate resilience:
- Financial Losses: Every second of downtime can cost businesses millions.
- Brand Damage: Outages can tarnish your reputation.
- Customer Churn: Trust is hard to regain once lost.
In high-stakes domains like healthcare, finance, and aviation, even minor disruptions can have catastrophic consequences. To avoid these pitfalls, proactive planning for resilience is fundamental.
Common Pitfalls of Poor Planning
Let’s look at some real-life scenarios illustrating the consequences of failing to plan for disruptions:
- A power outage in Spain and Portugal disrupted multiple critical services, but businesses with multi-region deployments managed to maintain access.
- A winter storm in December 2022 canceled over 17,000 flights due to a failure in the crew scheduling system.
- In India, a major healthcare institute experienced a 12-hour downtime due to scheduled maintenance, severely affecting patient care.
Strategies for Building Resilience
Creating resilient cloud systems involves a multi-faceted approach:
1. **Multi-Region Deployments**
Distributing resources across multiple geographical regions ensures availability, even during regional outages.
2. **Automated Failover Systems**
Implement redundant systems that can automatically take over when failures occur, minimizing downtime.
3. **Data Replication**
Keep your data synchronized across different geographic locations to prevent data loss. Choose between:
- Synchronous replication when you cannot afford to lose any committed data (a write is confirmed only once both locations have it).
- Asynchronous replication when lower write latency matters more than a small replication lag.
4. **Regular Testing and Drills**
Conduct ongoing disaster recovery drills to ensure your team is prepared for actual events. Identify weaknesses before they become critical failures.
5. **Continuous Monitoring**
Utilize monitoring tools to track system performance metrics, collect logs, and set up alerts to detect issues before they escalate.
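To make the continuous-monitoring idea above concrete, here is a minimal sketch of a threshold-based alert check. The metric source and on-call notifier are hypothetical placeholders; a real setup would delegate both to a monitoring service such as CloudWatch, Prometheus, or a third-party tool.

```python
import statistics
import time

LATENCY_THRESHOLD_MS = 500        # assumed alerting threshold
CHECK_INTERVAL_SECONDS = 60       # how often the metric is evaluated

def fetch_recent_latencies_ms():
    """Placeholder: pull the last minute of request latencies from your metrics store."""
    return [120, 135, 610, 98, 720]  # sample data for illustration only

def notify_on_call(message: str):
    """Placeholder: page the on-call engineer (SNS, PagerDuty, email, ...)."""
    print(f"ALERT: {message}")

def evaluate_once():
    latencies = fetch_recent_latencies_ms()
    p95 = statistics.quantiles(latencies, n=20)[18]  # rough 95th percentile
    if p95 > LATENCY_THRESHOLD_MS:
        notify_on_call(f"p95 latency {p95:.0f} ms exceeds {LATENCY_THRESHOLD_MS} ms")

if __name__ == "__main__":
    while True:
        evaluate_once()
        time.sleep(CHECK_INTERVAL_SECONDS)
```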
Implementing Best Practices for Resilient Systems
Developing resilient systems doesn't happen overnight. It is a continuous cycle that involves:
- Business Impact Analysis: Identify mission-critical systems and assess potential risks.
- Design and Architecture: Embrace fault isolation and loose coupling, particularly through microservices to enhance flexibility.
- Incident Response Planning: Create detailed playbooks and conduct regular practice drills.
- Securing Your Infrastructure: Implement strong security measures, including encryption and multi-factor authentication (MFA).
Measuring Resilience: RPO and RTO
Two major metrics critical to measuring your system’s resilience are:
- Recovery Point Objective (RPO): The maximum amount of data loss, measured as a window of time, that the business can tolerate after a disruption.
- Recovery Time Objective (RTO): The maximum acceptable time to restore services after a disruption.
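As a hedged, illustrative example of how RPO plays out in practice (the backup times, outage time, and recovery duration below are made up), the worst-case data loss is simply the gap between the disruption and the last good backup:

```python
from datetime import datetime, timedelta

# Illustrative figures only: a twice-daily backup schedule and a mid-morning outage.
backups = [datetime(2024, 1, 1, 5, 0), datetime(2024, 1, 1, 17, 0)]
outage = datetime(2024, 1, 1, 10, 0)

last_backup = max(b for b in backups if b <= outage)
data_loss_window = outage - last_backup            # effective RPO exposure
print(f"Data written in the last {data_loss_window} is at risk")  # 5:00:00

# RTO is measured separately: time from the outage until service is restored.
restored_at = outage + timedelta(hours=2)          # assume a two-hour recovery
print(f"Actual recovery time: {restored_at - outage}")
```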
Designing for High Availability vs. Disaster Recovery
Focusing on high availability addresses frequent, low-impact disruptions, while disaster recovery strategies cover rare, large-scale failures. Here are some strategies for each:
High Availability Strategies
- Redundancy: Ensure you have multiple components so if one fails, another can step in.
- Load Balancing: Distribute workloads across multiple servers so that no single server is overloaded and traffic is routed away from unhealthy instances.
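Below is a minimal sketch of the redundancy and load-balancing ideas above: a toy round-robin balancer that skips servers failing their health checks. The server addresses and health-check function are illustrative stand-ins for what a managed load balancer (ELB, NGINX, HAProxy, and so on) does for you.

```python
import itertools

class RoundRobinBalancer:
    """Toy load balancer: rotate through servers, skipping unhealthy ones."""

    def __init__(self, servers, health_check):
        self._servers = servers
        self._health_check = health_check
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        # Try each server at most once per call; skip those failing health checks.
        for _ in range(len(self._servers)):
            candidate = next(self._cycle)
            if self._health_check(candidate):
                return candidate
        raise RuntimeError("No healthy servers available")

# Illustrative usage with fake servers and a stub health check.
healthy = {"10.0.1.10": True, "10.0.1.11": False, "10.0.1.12": True}
balancer = RoundRobinBalancer(list(healthy), lambda host: healthy[host])
for _ in range(4):
    print(balancer.next_server())   # alternates between the two healthy hosts
```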
Disaster Recovery Strategies
- Backup and Restore: Restore data from backups and re-provision resources after the disruption; suitable for non-critical internal applications.
- Pilot Light: Keep data live and a minimal set of resources ready to be switched on in the recovery region.
- Warm Standby: Run a scaled-down but functional copy of the workload in the recovery region and scale it out on failover.
- Multi-Site Active-Active: Serve traffic from both regions simultaneously for near real-time failover with practically no data loss.
Video Transcription
So a very warm welcome to all of you. Welcome to my session on crafting resilient systems on the cloud. And if you're here and you've left your applications in the able hands of your team, I'm guessing they're thriving, not just surviving. Who am I? I'm your friendly cloud support engineer, the one you reach out to when your production systems are down or suffering. Imagine a world where your business never sleeps, where your data is accessible 24/7, uninterrupted, come rain or shine. Well, that's no longer a utopian dream, is it? It's the promise of cloud resilience. And in today's digital world, every second of downtime counts. It can cost millions of dollars. It can cost you your brand image and reputation and, of course, customer churn. Failure is, after all, a given, and everything will fail over time.
So what can we do? We can ensure that those small little faults do not escalate into failures, because in mission-critical domains like healthcare, aviation, or even finance, even a few seconds of downtime cannot be tolerated. Anyone who has ever worked in a production environment will identify with this graph. No, this is not a scientific plot, but it does show the emotion, the stress that one goes through when you're handling, when you're managing a production environment, right from the point that there is a disruption. You can see that there is a slow increase in your stress levels. You don't want to find out about your disruptions from social media, do you? So obviously, I'm sure you must have some automated systems which will trigger, which will let you know that something is wrong.
And over a period of time, your stress does not reduce, does it? Ultimately, you see towards the end that your stress level increases even more after your systems are back up. That's because you want to ensure that they're maintaining the same level of operation and functionality as they were before the disruption. So it increases. This is just crowdsourced data, just to give you an idea of what happens when an outage takes place in your environment. Take, for instance, a month ago, some of you must have already experienced this: there was a major power outage in Spain and Portugal. So imagine what was impacted. Transit systems, banks, cellular networks, everything was impacted.
However, industries and systems which had planned in advance with multi-region deployments did not suffer as much, because their global users could still access their systems even though the local grids were down. And there were banks which had backup systems due to which they could keep operating. However, if you're based out of Portugal or Spain, you must have experienced that traffic had plummeted. Now, what's scary about this situation is not that there was a power outage, because it did come back ultimately. What is scary is that if you don't know what caused it, how will you plan for it? How will you make sure it doesn't happen again? Take, for instance, December 2022: there was a major winter snowstorm, and holiday travellers were impacted massively.
There were over 17,000 flights impacted and around two million passengers stranded. There were cancellations everywhere, and all because the crew scheduling system of a famous airline was overwhelmed and crashed. They learned that the hard way, but this is what happens when you don't plan for disaster recovery. In healthcare, there was a famous medical institute in India which was impacted by a downtime due to scheduled maintenance. It lasted over twelve hours, and the emergency section and even the outpatient unit were impacted. Medics could not access the medical reports and the patient records. Imagine you have to undergo a critical surgery. Would you go under the knife of that medic? I wouldn't. Definitely not.
They were also impacted earlier that year by cybersecurity attacks that wiped out crucial data like billing and transaction records of the patients. Not acceptable, is it? Again, earlier last year, some major banks in the UK experienced outages of their digital as well as mobile banking systems. People were not able to make their payments or access their funds on a crucial day like their payday. How frustrating that can be, you can imagine. Put yourself in that person's position: you have bills to pay, you have mortgages, everything. And then, of course, I'm sure a lot of you would have seen this blue screen. I surely did. And all because of what? Because of a faulty software update and release that went out. There were a whole lot of industries and companies which were impacted globally.
Whoever thought that you would actually see drawing boards like this used in airports, or handwritten tickets, all because of a software update which brought down systems? Again, companies which had planned in advance, which had planned for automated failovers or multi-region deployments or multi-cloud environments, survived. Now tell me, who pays the price? Of course, the end users pay a price. They see unexpected behavior, they experience latency, they see inconsistencies in the responses from the application. But mainly, the businesses get impacted. There is a cost to all the downtime: their brand image suffers, customers churn and lose trust in those companies, and there is a cost to the data loss and a cost to the repair as well.
Now what can you do about it? Well, you can build resilience. A resilient system is one which can withstand disruptions to infrastructure as well as service disruptions. Now, resilience is not only about when the services go down. It can also be about the scalability of an application. So suppose your users are in the hundreds, but you released a very fancy new feature, or there's a promotion or a sale, and suddenly there's a huge rush and the number of users grows from hundreds to thousands to millions. Is your application able to sustain and withstand that? That is also resilience. Over a period of time, organizations have developed their own mechanisms and strategies to withstand all such uncertainties. A few of them you are seeing on your screen. There are multi-region deployments, where you have identical setups in a primary as well as a recovery region.
You can plan for automated failovers by having identical or redundant resources, or an entire redundant application stack, so that failing components can fail over to them or get replaced by them. Then you have data replication. You need to constantly replicate your data, either synchronously so it is always in sync, or asynchronously, whatever your mechanism is. And you should keep testing over time. A lot of organizations do plan for disaster recovery; it's not that they leave it to the last moment. But you might have plan A, plan B, plan C, and if you've not tested them frequently enough, if you've not done enough drills, then your incident response team may not know how to detect, react, and remediate quickly. Once you test your systems, you will be able to identify weaknesses and vulnerabilities and fix them before D-day.
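As a rough sketch of the synchronous-versus-asynchronous replication choice mentioned a moment ago (the store and replica classes are invented for illustration), the difference comes down to whether a write waits for the replica's acknowledgement before it is confirmed:

```python
import queue
import threading

class Replica:
    """Stand-in for a copy of the data in another region."""
    def __init__(self):
        self.data = {}
    def apply(self, key, value):
        self.data[key] = value          # pretend this crosses a region boundary

class SyncReplicatedStore:
    """Synchronous replication: a write is confirmed only after the replica has it (RPO ~ 0)."""
    def __init__(self, replica):
        self.data, self.replica = {}, replica
    def put(self, key, value):
        self.data[key] = value
        self.replica.apply(key, value)  # block until the replica acknowledges

class AsyncReplicatedStore:
    """Asynchronous replication: confirm immediately, ship to the replica in the background."""
    def __init__(self, replica):
        self.data, self.replica = {}, replica
        self._queue = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()
    def put(self, key, value):
        self.data[key] = value
        self._queue.put((key, value))   # recent writes may be lost if we crash here
    def _drain(self):
        while True:
            key, value = self._queue.get()
            self.replica.apply(key, value)
```

Synchronous replication buys a near-zero RPO at the cost of write latency; asynchronous replication keeps writes fast but accepts a small window of potential data loss.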
And of course, all this can happen only if you're constantly monitoring your systems. When I say monitoring, you have metrics and dashboards in place, you are collecting logs and traces, and, of course, the all-important alarms and notifications to trigger and wake you up in the middle of the night if something goes wrong. These strategies have been used by organizations in real-world situations; you've seen it as well. Some time ago, I spoke about the power outage which happened in the Iberian Peninsula, which impacted Spain and Portugal. There were organizations which had planned for multi-region deployment or data replication across regions. They had near-zero data loss, and their global users could still access their systems. Similarly, in the case of automated failover, a classic example is that of Thomson Reuters.
They moved their disaster recovery to AWS, and by doing so, their recovery time objective, that is RTO, reduced from about forty-eight hours to a couple of minutes. Chaos engineering, as I said, is where you keep on testing and running drills. It is a mechanism where you slowly, and in a managed way, introduce some disruptions into your system just to test its reliability and resilience and check whether the auto-remediation is working or the failover is working or not. Adobe did that too. They made use of AWS's Fault Injection Simulator and tested multi-availability-zone failures, and that helped them maintain their uptime at up to 99.95%.
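This is not the AWS Fault Injection Simulator itself, just a hedged sketch of the chaos-engineering idea in plain Python: wrap a dependency call and deliberately inject failures or latency for a small fraction of requests, then observe whether your retries, timeouts, and failover behave as expected.

```python
import random
import time

def inject_faults(func, error_rate=0.1, extra_latency_s=2.0):
    """Wrap a callable and make a fraction of calls fail or slow down on purpose."""
    def wrapper(*args, **kwargs):
        roll = random.random()
        if roll < error_rate:
            raise ConnectionError("chaos: injected dependency failure")
        if roll < 2 * error_rate:
            time.sleep(extra_latency_s)           # injected latency
        return func(*args, **kwargs)
    return wrapper

# Illustrative use: wrap a (hypothetical) downstream call during a game day.
def fetch_inventory(item_id):
    return {"item": item_id, "in_stock": True}

chaotic_fetch = inject_faults(fetch_inventory, error_rate=0.2)
```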
Similarly, HSBC followed some good security best practices like encryption, MFA, and conducting regular security audits, which considerably reduced security breaches. And if security breaches are reduced, of course your systems are going to thrive. The other thing you can do is constant monitoring, like I spoke to you about. Slack, for instance, has systems in place whereby they can detect any increase in API latency much earlier than their end users will experience it, so they can proactively go and fix it. Similarly, organizations like Capital One: they had downtime in 2023 when a DNS outage happened, and what they learned from that is that if they adopt a multi-cloud strategy, they can reduce such outages by almost 90%. So what are we doing?
We are adapting, continuously learning and improving. That's also part of your resilience cycle. Now, this exercise of building resilience doesn't happen overnight. It is a complete life cycle, and there is a resilience life cycle framework that you can use in your organizations to build such reliable systems. How do you start? The first step is to do a business impact analysis. You identify all your resources, your systems, and your processes which are mission critical or business critical, and tier them. Next, you identify the risks and the potential threats that could bring down those systems. You then measure or define the RPO and RTO goals that you need to achieve, or your availability goals.
Once you have that in place, the next thing you'll do is start designing and implementing. This is where your architects come into the picture, your cloud architects or your solution architects. There are any number of design and solution strategies you can use; some of the very common ones I've highlighted over here. The first is fault isolation. Fault isolation and reducing the blast radius point to more or less the same thing: you want your defect or your fault to be isolated from other parts of the system so that it doesn't spill over and impact other parts, or other customers. You don't want that to happen. Then auto scaling. If you don't have enough resources, if you are facing an increase in load, an increase in the number of requests, then your application should have the capability to automatically scale out, whether in terms of compute, memory, storage, whatever it might be.
Loose coupling. Again, you can use architectures like microservices, which are loosely coupled, where you've broken down that siloed monolithic application into multiple services, so even if one service is impacted, it doesn't pull down the other services. That's loose coupling. And, of course, another famous strategy is to fail over to healthy resources or get replaced by healthy resources; there will be health checks which do that automatically. That should be your target. After designing, you can't stop right there. Test, test, and more testing. You can use mechanisms such as chaos engineering, which I spoke to you about earlier. A lot of organizations conduct game days or mock drills. They refine their playbooks and runbooks based on the defects and gaps they find and the processes they should follow.
If you do it frequently enough, your teams will develop what we call muscle memory, so they immediately know how to react if there is a disruption. Of course, none of this will work if you don't monitor and observe your systems. There are any number of third-party monitoring tools that you can use. You can create customized dashboards and have the important metrics that you are evaluating in place. You can mark thresholds, put alarms and notifications in place, and constantly ensure that you're collecting enough logs and traces. So if a problem occurs, you will be able to detect and mitigate it faster. And this is not a one-time exercise: you constantly have to respond, learn, feed those learnings back, and refine your processes.
There are some pillars identified by the resilience life cycle where you need to build that resilience. You have to build resilience at your compute layer, meaning you can have multi-availability-zone deployments, you can plan for auto scaling, or you can have redundant servers and resources. In terms of storage, ensure that you're replicating your data across regions or taking regular automated backups. Automated backups are important so that you don't rely on manual backups, which you might just forget about one day. Then network resilience: if you have a load balancer in place, it will be able to distribute the load, the traffic, among the healthy resources. That is one way.
Security resilience. Again, you can maintain encryption, encrypt your data at rest, and have least-privilege access controls, so whoever is not allowed to access your systems or your data cannot do so; they could have malicious intent. And have your compliance requirements in place as well. And then, of course, I keep reiterating operational readiness through monitoring. Now, how will you measure your resilience? There are two very important measurements that you use: RPO and RTO. Some of you must have already heard about them. RPO is recovery point objective. It is the amount of data that you can afford to lose and still resume your operations.
So, for example, say you're taking data backups at 5 AM and 5 PM every day. If your systems go down at 10 AM, you've lost about five hours' worth of data, but you're okay to restart from that point, from the last backup you took at 5 AM. So based on how much data you can afford to lose and still resume normal operations, different organizations will have different RPOs. Similarly, RTO, recovery time objective: how much downtime can that organization or that business-critical application endure? This will also vary from application to application, depending on how business critical or mission critical they are. And both of these are measured in terms of time.
Then we have disaster recovery strategies, where RPO and RTO are both considered. Now say we are talking about an intranet or internal application within our organization which is non-customer-facing and just for administrative purposes; it's not so critical, so you can afford a few hours for its restoration. You can restore your data from backup and provision your resources after the disruption has happened. That's okay for you. However, for business-critical applications, one option is pilot light. If they're non-customer-facing, you can keep a few resources ready: your data has to be live, the resources are ready, they just have to be turned on, so there'll be some cold-start time for them to come up. That's called pilot light, and it's okay for non-customer-facing business applications.
If it's a customer-facing business application, then you go for warm standby. By the word warm, I mean that there are some resources which are up and running and serving some requests, but a small number, maybe, in the recovery site. As soon as there's a disaster and all your requests fail over to your recovery site, those resources will suddenly scale out. And then finally, an active-active strategy is where you have a primary region as well as a recovery region, both active and both actively serving requests. This is like a near real-time failover, and there's practically no data loss. However, this again depends upon your organization: maintaining two active sites costs a lot of money, so it depends on your organization's willingness to invest in that as well. How do you measure availability?
Availability is measured by the number of nines. I'm sure a lot of you must have heard of two nines of availability, five nines of availability; that is what we are talking about. You can think of five nines of availability as equivalent to about five and a quarter minutes of downtime per year that the application can withstand. There are a few other KPIs which are used to measure resilient systems. One of the popular ones is failure rate. Now, what is failure rate? If you're deploying your application frequently, and out of your ten deployments, say, three of them fail, then 70% is your success rate and 30% is your failure rate. For certain organizations, that may not be acceptable. So that's your failure rate.
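A small arithmetic sketch of the "number of nines" figure just mentioned: the yearly downtime budget is simply the minutes in a year multiplied by the unavailability fraction, which for five nines comes to roughly 5.26 minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # ignoring leap years for simplicity

def allowed_downtime_minutes(nines: int) -> float:
    """Downtime budget per year for an availability of e.g. 99.999% (nines=5)."""
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

for n in range(2, 6):
    print(f"{n} nines -> {allowed_downtime_minutes(n):.2f} minutes of downtime per year")
# 2 nines -> 5256.00, 3 nines -> 525.60, 4 nines -> 52.56, 5 nines -> 5.26
```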
Scalability, again, as I said, is not only about disruption; it is about how capable your system is of scaling out when there is a barrage of requests, an increase in load. Right? That's the scalability measure. Fault tolerance is to do with having no single point of failure: even if one component or one part fails, it should not bring down the entire system. That's the measure of fault tolerance. Then there are the three Ms, which I'll show you in a diagram, which might be easier to understand. The first M is mean time to detect. You obviously want your organization to detect what caused the problem really quickly, and detect it in the sense of alarms going off or notifications or SMSes going out. That is mean time to detect, and it should be as short as possible.
You don't want to find out from your users, and you don't want to find out from social media. Right? Then mean time to recovery, or mean time to repair. That includes the detection time plus the time that your team takes to bring the systems back, the repair time; it's the sum of both. Again, you want it as short as possible. The third M is mean time between failures. Say your team is really good at fixing things, but if they do quick fixes, it's possible that it's only a short-term solution and the system might fail again. Right? So the time between two failures should be as long as possible. Problems can happen if you don't do regression tests: you might go to fix one thing and break something else. So the mean time between failures should be as large as possible.
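Here is a small sketch, using made-up incident timestamps, of how the three Ms can be derived from an incident log.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: when the fault started, was detected, and was resolved.
incidents = [
    {"start": datetime(2024, 3, 1, 2, 0),   "detected": datetime(2024, 3, 1, 2, 5),   "restored": datetime(2024, 3, 1, 3, 0)},
    {"start": datetime(2024, 6, 15, 14, 0), "detected": datetime(2024, 6, 15, 14, 2), "restored": datetime(2024, 6, 15, 14, 30)},
]

mttd = mean((i["detected"] - i["start"]).total_seconds() for i in incidents) / 60
mttr = mean((i["restored"] - i["start"]).total_seconds() for i in incidents) / 60
mtbf_days = (incidents[1]["start"] - incidents[0]["restored"]).days

# Shorter MTTD and MTTR, and a longer MTBF, all indicate a more resilient system.
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min, MTBF: {mtbf_days} days")
```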
Now, how do you design such resilient systems? There are broadly two ways you can do it. Either you plan for high availability, which is for common faults, or you plan for disaster recovery, which is for very rare, uncommon, large-scale disruptions. Common faults could be things like component failures, your rack going down, some kind of power disruption in the data center, or network latency; these are transient. Whereas the large-scale ones, as I said, are the internet going down, a faulty deployment, or an operator failure; these are large scale, even natural disasters. So the frequency is a decision point over here. Plus, as I told you, the scope: transient small failures can be easily handled in the primary site itself.
However, in the case of disaster recovery, you need a separate site, called a recovery site, that you can fail over to. High availability is measured in the number of nines, whereas disaster recovery is usually measured with RPO and RTO. The strategies you use for high availability are to replace failed components with healthy resources, add more healthy resources, or add more capacity. Whereas in the case of disaster recovery, it will usually be failing over to the recovery site itself, or failing over and promoting your secondary databases; that could also be another strategy. Now, there are five characteristics of a highly available distributed system: it should have redundancy, sufficient capacity, timely output, correct output, and fault isolation. What does that mean? It translates to five categories of errors.
Let me explain. You will have single points of failure if you do not have redundancy in place. What does that mean? You don't have sufficient components; it may be hardware components or a complete replica of your application stack. If you don't have that in place, it could lead to single points of failure. Excessive load: these kinds of errors happen when you have not planned for sufficient compute capacity, storage, or memory, so your servers are overloaded. This could lead to throttling, or to you reaching your quotas and limits. The third one is excessive latency, meaning there is a reasonable amount of time within which your users expect a response.
If you're not able to meet that, if you're not able to meet your SLAs and SLOs, then that is excessive latency. There could be some kind of incorrect output coming out, which could be because there are bugs and defects in your system; these are all controllable and common ones. And then finally, shared fate. What does that mean? If there's a disruption in one place, it spills over, crosses the boundary, and impacts other parts of the system, or could even impact other users of that system as well. Okay, so these are the common categories, and if you don't have these characteristics, these are the kinds of errors you will get. Of course, we do not have enough time to go through all the techniques, best practices, and strategies.
We'll go through a couple of them here, just graphically, to show you. The first one I spoke about is redundancy. Redundancy can be as simple as having multiple availability zones in AWS. These availability zones are clusters of data centers; they are physically separated from each other but close enough for low-latency connectivity. So if a natural disaster strikes one of them, the others are still available. That is redundancy. Then failover. We usually see this in databases. You would have seen that there's a primary database, and there is a secondary database which is kept in sync. In case the primary goes down, you promote the secondary which has been kept in sync. Another strategy is called the bulkhead pattern.
To give you an example, this is also used in ship design. The hull is compartmentalized, so even if water seeps into one of the compartments, it will not sink the ship. I'm sure this was not done for the Titanic. But that is what a bulkhead pattern is: you contain the problem, you don't want it to spill over. For example, if you have a single service which is responding to all your client requests, and that service is heavily loaded and goes down, none of your clients will be able to reach it. However, if you have broken it down into multiple service instances, as we see in a microservice architecture, then even if one service instance is impacted, your other clients will not have a bad experience, because they can still reach the others.
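A minimal sketch of the bulkhead pattern in code, assuming a threaded service: each downstream dependency gets its own small, bounded worker pool, so an overloaded dependency can exhaust only its own compartment.

```python
from concurrent.futures import ThreadPoolExecutor

# One small, bounded pool per downstream dependency: a "compartment" each.
bulkheads = {
    "payments":  ThreadPoolExecutor(max_workers=4, thread_name_prefix="payments"),
    "reporting": ThreadPoolExecutor(max_workers=2, thread_name_prefix="reporting"),
}

def call_dependency(name, task, *args):
    """Run a task inside that dependency's bulkhead; a slow 'reporting'
    backend can only tie up its own 2 threads, never the payments pool."""
    return bulkheads[name].submit(task, *args)

# Illustrative usage with a stub task.
def charge_card(order_id):
    return f"charged {order_id}"

future = call_dependency("payments", charge_card, "order-42")
print(future.result())
```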
Another common strategy is retry with backoff. Commonly, we see the 500-series errors. We send out a request; the server might be busy or down, or there could be some network issue, so the request is not able to reach and complete, and there'll be a 500 response. The client will wait for some time and try again. It waited for one second; again a 500 response, so probably the server has not recovered. We'll wait a little longer. This is exponential backoff: one second, then a little longer, allowing it more time, and this time, hopefully, we'll get a 200 OK response. These are a couple of common strategies, I would say.
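And here is a minimal standard-library sketch of the retry-with-backoff flow just described: retry on 5xx or transient network errors, double the wait on each attempt, and add a little jitter so many clients don't retry in lockstep. The URL in the usage comment is a placeholder.

```python
import random
import time
import urllib.error
import urllib.request

def get_with_backoff(url, max_attempts=5, base_delay_s=1.0):
    """GET a URL, retrying on 5xx/transient errors with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                return response.read()                    # 2xx: success, stop retrying
        except urllib.error.HTTPError as err:
            if err.code < 500:
                raise                                     # 4xx: retrying will not help
        except urllib.error.URLError:
            pass                                          # network blip: worth retrying
        if attempt < max_attempts - 1:
            delay = base_delay_s * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)                             # wait 1s, 2s, 4s, ... plus jitter
    raise RuntimeError(f"{url} still failing after {max_attempts} attempts")

# body = get_with_backoff("https://api.example.com/health")  # placeholder URL
```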
Then, in terms of infrastructure resilience planning, how can we evolve the design? We start off with a single server. Everything is fine, everything is hunky-dory, until the server goes down. That's not a good design. Why not distribute the load across multiple servers, with a load balancer in between? Alright, everything is good, the load is being distributed. What if one of the servers goes down? Still okay: your customers can still reach you because another server is there. However, it's not operating at the same capacity. We can make use of auto scaling, maybe, and once one server is detected as unhealthy, we can replace it with a healthy instance. That's taken care of. But what if all the servers in the same availability zone or data center are impacted? Now what? Instead, why not spread the resources across multiple availability zones, so the same load balancer can distribute across both of them? Everything is well, everything is working.
One availability zone goes down; still, the requests can be served from the other one. So from this demonstration, you saw that the common kinds of problems, latency-related issues, component failures, rack failures, intermittent power outages in a data center, can be handled with high availability. But when it comes to mitigating bigger problems, natural disasters, the internet going down, a very bad deployment, or some kind of operator issue which has caused the entire dataset to be corrupted or wiped out, you need a multi-region deployment. Across your multiple regions, you will probably have one primary region and a secondary region which is inactive. It's just like your plan B, and you route all your requests to one region.
The moment there is a problem, you immediately start routing all your requests to the other region, and slowly, as the number of requests increases, if you've planned for auto scaling, the resources will increase too. Of course, your data has to be in sync, so you will do live replication and syncing of data beforehand; you can't do it later. Alright, so I'm guessing you will have picked up some good tips and strategies. Apart from those, here are some other well-known strategies: adopting a microservice architecture, containerizing your workload, and using Kubernetes for orchestration and scaling. You can use durable storage. Durability is the measure that your data won't be lost.
So if you use durable storage, for example S3, it has eleven nines of durability, because data is stored across multiple availability zones; even if your data is lost in one, it is still recoverable from the others. Infrastructure as code: if you've heard of Terraform or CloudFormation templates, you can structure or declare your application stack in the form of code. And that is very simple code; it's not like a programming language, it's declarative code. And you can spin up identical environments in a couple of minutes, even during outages or when you're testing. Say you want to spin up a development as well as a staging environment: the same infrastructure-as-code template can be used.
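As a hedged illustration of the infrastructure-as-code idea (shown here with the AWS CDK's Python flavour rather than raw Terraform or CloudFormation; the stack and bucket names are invented), one declarative definition can be instantiated for several environments or re-created quickly after an outage:

```python
from aws_cdk import App, Stack, RemovalPolicy
from aws_cdk import aws_s3 as s3
from constructs import Construct

class StorageStack(Stack):
    """One declarative definition, reusable for any environment."""
    def __init__(self, scope: Construct, stack_id: str, **kwargs) -> None:
        super().__init__(scope, stack_id, **kwargs)
        s3.Bucket(
            self,
            "ResilientDataBucket",
            versioned=True,                          # keep older object versions
            removal_policy=RemovalPolicy.RETAIN,     # don't delete data with the stack
        )

app = App()
StorageStack(app, "staging-storage")   # identical environments from the same template
StorageStack(app, "prod-storage")
app.synth()
```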
You can also make use of content delivery networks. For example, if you have content which is relatively static and not changing over time, you can cache it at your edge locations. So even if your backend servers are down, because your data is cached at those edge locations, your content delivery network will keep delivering it to your end users, and they will not face any issues. Similarly, caching can be used even for databases, for frequently queried information. You don't want to bombard your databases with the same queries over and over again, especially if the data is not changing, so again, you can use caching for that. And then, of course, the security best practices I spoke to you about: controlled access, the principle of least privilege, security at every tier. Awesome. So I guess I have armed you with all the knowledge and know-how to build a resilient application, and these are our final takeaways.
Design always for failure; things will fail over time. All we can do is plan and design to make sure that things are automated. If we can reduce manual intervention, that's the best; we humans are notorious for making mistakes, so it's better to automate. Keep testing regularly, so we know our strategies are working and we identify the weaknesses early. Secure and maintain: maintenance of our servers is also important, so ensure you're patching them regularly, so that if there are any security vulnerabilities, they can be caught early or mitigated. And, of course, monitor, adapt, and improve. This is a continuous cycle; it doesn't stop. You need to refine your architecture, refine your strategies, and see what works best for you. Thank you so much, everyone.