IT Resilience, as Explained by an Expert

Data security is an ever-present concern in the Software as a Service industry.

Being able to protect both proprietary information and consumer privacy is generally understood to be the only way for companies to ensure their survival.

But, how can businesses strengthen their security measures and increase their resilience in the increasingly vulnerable IT sector?

To get the answer to that question, we spoke with CAKE.com’s Director of Infrastructure Engineering, Slobodan Gacesa.

Backed by years of experience in site reliability engineering (SRE) and DevOps, Gacesa ensures that CAKE.com software products — Clockify, Pumble, and Plaky — as well as the company itself are resilient to infrastructure failure and security threats of all kinds.

Having said that, let’s see what he has to say on the subject of IT resilience.

What is IT resilience and why does it matter?

IT resilience refers to a company’s ability to resist and recover from all manner of digital security threats, whether they’re the result of targeted phishing campaigns, infrastructure failure, or poor software architecture.

As Slobodan Gacesa told us, even minor mishaps have the potential to ruin a business:

“Small holes can sink ships. We’ve seen huge companies fall because of stupid mistakes, whether they’re technical, infrastructural, because they went offline for a week, or because they had a security breach. It’s all equally dangerous.”

Making sure your company’s security measures are impenetrable is certainly a valid reason to create a strong resilience plan.

In fact, in Slobodan’s opinion, it’s the only way to stay afloat in this industry:

“In the IT sector, having a resilience plan is the only way to keep your company afloat.

There are a lot of things you can never plan for, things you’ve never lived through or prepared for before. But, having a plan that accounts for the majority of known risk factors and outlines the steps you need to take to reduce them makes it easier to survive those situations if/when they come to pass.

Forming a resilience plan requires you to get in the mindset of knowing that things will inevitably go wrong. But, it’s ultimately about being able to let that happen without letting it affect you or your business.”

So, longevity is certainly the primary purpose of having a robust resilience plan.

However, that’s not the only thing companies stand to gain from planning ahead.

A hidden purpose of resilience in the SaaS industry: Building trust

During our conversation, Gacesa mentioned the fickle nature of the average customer in the SaaS industry, saying:

“As a SaaS company, we have people around the world using our products.

If your service is online and you’re gone for half an hour, people won’t forgive you. If you’re gone for a day, you’re done. By tomorrow, they’ll have already found an alternative and moved on.”

By and large, software products are supposed to provide a solution to a problem their users are having.

The moment they stop doing that, the company behind them should expect to see a portion of its customers migrate to a product that will satisfy their needs.

Thinking that your company will never go through those moments of crisis is simply unrealistic, as Gacesa explained:

“Any business that claims to have never gone through infrastructure failures or security threats is going to look pretty suspicious. Once that kind of doubt sets in — you’re done.”

With that in mind, the secondary purpose of having a strong resilience plan is to demonstrate to your customers that, when something inevitably goes wrong, you’ll be able to handle it:

“So, having a good resilience plan is ultimately about survival. And it’s not even about complete prevention of failures — though that is always the ultimate goal.

Instead, having a solid plan is about showing your customers/clients how you behave under pressure.

When something happens, people need to know about it. They need to know that you survived — that’s how you gain trust.”

So, even if all your resilience efforts fail and a breach happens, being transparent about it and explaining the situation to your customers as soon as it happens can help you build a more loyal user base.

What’s the best way for an IT company to ensure its resilience?

According to Slobodan Gacesa, a good resilience plan should make your service more reliable and secure:

“There are many moving parts you need to consider when creating a resilience plan, but most of them have something to do with one of two things: reliability and security.

Working on the reliability of your service means ensuring constant presence or availability for your clients. In other words, you need to make sure you have the infrastructure necessary to support your service or products.

The other half of your plan will focus on preventing data breaches and other security risks.”

So, how can a company be sure that its resilience plan covers both of these points?

In Slobodan’s opinion, the best way to keep your infrastructure secure is to attack it yourself.

Break it before it breaks: The Chaos Engineering philosophy

Attack your own infrastructure?

Wouldn’t that just expose its flaws?

According to Gacesa, that’s precisely the point of this approach:

“Ultimately, the best way to ensure resilience as an IT company is to constantly test yourself. In other words, you have to keep pulling the rug out from under yourself to train your system to regain its footing quickly when disaster strikes.

That’s the Chaos Engineering philosophy. You have software that attacks parts of your infrastructure and replicates various scenarios that can happen to Internet-based businesses.”

Even though Netflix’s Chaos Monkey is the most famous example of Chaos Engineering software, Gacesa revealed that his team uses many different programs to run these tests:

“We use a variety of tools and scripts — many of which we make ourselves — to attack parts of our service, infrastructure, applications, network, whatever, and reveal potential vulnerabilities we need to work on.”

So, the goal of this approach is to jostle the system in a relatively controlled environment to find its weaknesses.

That allows the SRE team to fortify those vulnerabilities before looking for new ones.

That cycle of testing the system and fixing its flaws should be continuous.
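To make that test-and-fix cycle more concrete, here is a minimal Python sketch of a chaos experiment. It is only an illustration and not the tooling Gacesa's team uses: the `Instance` class, the `web-0` style names, and the rule that the service is "available" while at least one instance is healthy are all assumptions made for the example, and failures are simulated by flipping a flag rather than shutting anything down.

```python
import random


class Instance:
    """A hypothetical service instance; real ones would be VMs or containers."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


def service_available(pool):
    """Assumed rule: the service is up while at least one instance is healthy."""
    return any(instance.healthy for instance in pool)


def run_chaos_experiment(pool, rng, kills=1):
    """Knock out up to `kills` random healthy instances and report the result."""
    healthy = [instance for instance in pool if instance.healthy]
    victims = rng.sample(healthy, min(kills, len(healthy)))
    for victim in victims:
        victim.healthy = False  # simulated failure, not a real shutdown
    return {
        "killed": [victim.name for victim in victims],
        "survived": service_available(pool),
    }


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a surprising run can be replayed
    pool = [Instance(f"web-{i}") for i in range(3)]
    # Two of three instances fail: the simulated service should survive.
    print(run_chaos_experiment(pool, rng, kills=2))
    # The last healthy instance fails: the experiment exposes a weakness.
    print(run_chaos_experiment(pool, rng, kills=1))
```

A run that reports `"survived": False` is the useful one. It points at a weakness to fix before the next round of tests.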

Of course, you can never plan for every contingency — which is something Slobodan is all too aware of:

“Of course, these scenarios are so diverse, that you can never really plan for everything. After all, the people who write this software can only create situations they have experienced or heard of before. So, there are a lot of situations that aren’t covered, that may not have happened to anyone yet — but that doesn’t mean they won’t happen in the future.”

With that in mind, the only thing that can increase your chances of covering every eventuality is experience.

The secret sauce: Experience

In addition to the tried-and-true method of Chaos Engineering, Gacesa also relies on his experience to create a resilience plan:

“The only way to be reasonably sure that you’re not leaving anything out of your resilience plan is to have a lot of experience. Someone who’s seen a lot of things go wrong will have a better idea of everything that can happen to a product or a company. From there, you just have to fortify all those weaknesses one by one until you have a relatively secure product and company.”

During our conversation, Gacesa recalled an anecdote from earlier in his career when a team member who was on call at night had to wake him up to deal with an urgent issue.

That rude awakening impressed upon him the importance of resilience more than any theoretical knowledge could have.

Having to resolve these kinds of incidents when you least expect it will allow you to understand the diverse nature of the problems a good resilience strategy can defend against.

Of course, even that wouldn’t cover absolutely everything that could go wrong.

After all, new threats are being devised even as your security measures are being implemented.

Supporting components of a resilience strategy in the SaaS industry

Having discovered the best way to ensure resilience in the IT industry, we wanted to ask Gacesa about the different elements that constitute a comprehensive resilience strategy.

In addition to the Chaos Engineering software, Gacesa highlighted 3 other components that should provide the necessary support to your resilience plan:

Solid products achieved through careful software architecture planning,
Infrastructure that’s capable of self-healing when parts of it go offline, and
Tools for measuring the effectiveness of your strategy.

Let’s see what he had to say about these individual elements of a resilience plan.

Component #1: Software architecture planning

According to Slobodan, having a solid foundation is a good way to set your resilience plan up for success:

“The way I see it, a good resilience plan starts with a good product — so, you have to start with software architecture planning.

Coming from the perspective of a SaaS company, it’s important for our products to be well-constructed so they don’t have any obvious flaws that are built into them. That makes my job much easier.”

As we have previously mentioned, Gacesa is in charge of making sure all 3 CAKE.com products, as well as the company itself, are protected against cyber attacks and other security threats.

Therefore, building products that don’t have any obvious vulnerabilities is a great way to ensure resilience.

As Slobodan told us, even a great resilience plan wouldn’t be able to save a poorly constructed product:

“I simply can’t produce an infrastructure that’s sturdy enough to support a poor base product. So, all those details are important when the application or service is created.”

Having said that, it’s important to acknowledge that having a solid product wouldn’t be enough to save a company if its infrastructure was lacking.

Component #2: Infrastructure

At its most basic level, resilience in the SaaS industry is about making sure that your product can function even when failures do occur.

A big part of that is making sure that the product, and your service in general, is supported by a robust infrastructure.

According to Slobodan, any good resilience plan requires infrastructure that’s capable of self-healing:

“Of course, infrastructure is another important component of any resilience plan. You have to have infrastructure that’s capable of self-healing. If something breaks, you should be able to cut it off and regrow it. While that’s happening, you should have infrastructure on other sides of the planet that can pick up the slack.”

In the IT industry, self-healing infrastructure refers to systems that automatically detect and resolve issues, and in some cases use machine learning to predict them, with minimal direction from human operators.

Therefore, having a self-healing infrastructure should allow you to prioritize issues that really need your attention.

Most importantly, since some part of your infrastructure will always be online, this should allow you to resolve those problems without inconveniencing customers.
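The "cut it off and regrow it" idea, together with another region picking up the slack, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of CAKE.com's setup: the `Node` and `Region` classes, the `desired` node count, and the region names `eu` and `us` are all hypothetical, and health is a plain boolean rather than the result of a real health check.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class Region:
    name: str
    desired: int  # how many healthy nodes this region should run
    nodes: list = field(default_factory=list)
    replacements: int = 0

    def heal(self):
        """Cut off unhealthy nodes, then regrow replacements up to `desired`."""
        self.nodes = [node for node in self.nodes if node.healthy]
        while len(self.nodes) < self.desired:
            self.replacements += 1
            self.nodes.append(Node(f"{self.name}-replacement-{self.replacements}"))

    def can_serve(self):
        return any(node.healthy for node in self.nodes)


def route(regions):
    """Return the name of the first region able to serve traffic, or None."""
    for region in regions:
        if region.can_serve():
            return region.name
    return None


if __name__ == "__main__":
    eu = Region("eu", desired=2, nodes=[Node("eu-1"), Node("eu-2")])
    us = Region("us", desired=2, nodes=[Node("us-1"), Node("us-2")])
    for node in eu.nodes:
        node.healthy = False   # simulate a full outage in one region
    print(route([eu, us]))     # us: the other region picks up the slack
    eu.heal()                  # broken nodes are replaced automatically
    print(route([eu, us]))     # eu: the healed region serves again
```

In a real system, `heal` would typically be run continuously by an orchestrator, and routing would be handled by DNS or a load balancer rather than a function call.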

Component #3: Measuring tools

The third component Slobodan pointed out has to do with being able to precisely describe and quantify the weaknesses your tests are unearthing.

As Slobodan put it, you need to be able to say when and why certain issues are occurring before you can figure out how to resolve them once and for all:

“The third component you need to make a good resilience plan consists of tools for measuring the efficacy of your plan. For one, you have to have some kind of idea of how frequently certain problems are occurring.

Saying, ‘Hey, this component breaks sometimes’ doesn’t mean much. But, saying, ‘This breaks every Thursday at 6 p.m. and this is why’ — that’s another story altogether. That helps the developers find and solve the problem, ensuring that it never happens again.

After all, finding a problem in your system is fine — but, having the same issue come up again and again is not. To make sure that doesn’t happen, you need to be able to track and document these issues.”
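As a rough illustration of the difference between "this breaks sometimes" and "this breaks every Thursday at 6 p.m.", the Python sketch below groups logged incidents by component, weekday, and hour. The function name `recurring_slots`, the `min_count` threshold, the component names, and the timestamps are all made up for the example. A real team would pull this data from its own monitoring or incident tracker.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def recurring_slots(incidents, min_count=3):
    """Count incidents per (component, weekday, hour); keep the frequent slots.

    `incidents` is an iterable of (component_name, datetime) pairs.
    """
    counts = Counter(
        (component, WEEKDAYS[when.weekday()], when.hour)
        for component, when in incidents
    )
    return [
        (component, weekday, hour, count)
        for (component, weekday, hour), count in counts.most_common()
        if count >= min_count
    ]


if __name__ == "__main__":
    # Hypothetical incident log, invented for this example.
    incidents = [
        ("billing-worker", datetime(2024, 5, 2, 18, 4)),
        ("billing-worker", datetime(2024, 5, 9, 18, 1)),
        ("billing-worker", datetime(2024, 5, 16, 18, 7)),
        ("search-api", datetime(2024, 5, 6, 3, 30)),
    ]
    for component, weekday, hour, count in recurring_slots(incidents):
        print(f"{component}: {count} incidents on {weekday}s around {hour}:00")
```

With this sample data, only the `billing-worker` pattern on Thursdays in the 18:00 hour is reported. The one-off `search-api` incident falls below the threshold.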

Crucially, tracking and documenting the issues you find in your products and infrastructure shouldn’t fall on the SRE team alone.

So, let’s talk about the other teams that should participate in the realization of your resilience plan.

Who should be involved in the formation and execution of a resilience strategy?

Even though site reliability engineers like Slobodan are a crucial piece of the resilience puzzle, they can’t be solely responsible for maintaining the resilience of a SaaS company.

Instead, the plan has to include developers, QA testers, and even management, according to Gacesa:

“In addition to site reliability engineers, a good resilience plan should also include DevOps (or app architecture engineers), the quality assurance team, and even management.

If you want your business to be truly resilient, everyone has to work together. Division between teams can lead to a lack of accountability, which can negatively impact a company’s resilience.”

Collaboration is crucial when it comes to protecting both your data and your clients’ privacy.

So, let’s see which roles each of these teams can play in the process of executing a resilience plan.

#1: Developers

The first group Slobodan singled out when asked about the different teams that contribute to resilience plans was developers:

“Developers are just as essential to a good resilience plan as site reliability engineers because, if they write something bad, nothing will be able to force it to become functional.”

This goes back to software architecture planning — the first component Slobodan highlighted earlier:

“It’s definitely handy if the system you’re trying to improve was built to survive some of those disaster scenarios in the first place. Building on top of a good foundation is easier than trying to fix something that’s poorly made at its core. You have to have a good foundation that allows you to fix individual components when they inevitably break.”

So, developers are a necessary part of the equation because they create the base product a resilience plan is built to support.

If that base product is solid, that automatically minimizes the amount of work the SRE team will have to do to support it.

#2: QA testers

Software quality assurance specialists are the people who catch mistakes that could end up in a SaaS company’s product before the faulty version of the software is released to the public.

That puts QA testers in the position of being able to see some of those potential vulnerabilities that might crop up and immediately notify developers to make sure those flaws don’t end up in the final version of the software.

According to Slobodan, that frees the SRE team to focus on issues that can’t be prevented ahead of time:

“QA also has an important role to play. Without their tests, a lot of junk would make it into the final product — especially when the product in question is big. Preventing that lets the SRE team prioritize other unavoidable issues.”

#3: Management

Last but not least, Slobodan highlighted the importance of management for the execution of a resilience plan.

Namely, if management doesn’t understand the necessity of a resilience plan, it could effectively stomp out all resilience efforts and compromise the whole organization.

Conversely, good management could give the SRE team the resources and space it needs to operate, according to Slobodan:

“Management is also necessary for the implementation of your resilience plan. After all, managers are the ones who will approve the resources that need to go into ensuring resilience.

Keep in mind, some of the problems we’re trying to locate and resolve may be invisible to them. When everything seems to be perfectly functional, management might ask why we’re purposely breaking and fixing parts of our infrastructure.

But, when SRE has enough freedom, we can take control of the resources we need to conduct these tests and solve these problems. That’s why managers need to be understanding and permissive for this whole process to take place.”

In other words, management can:

Approve resources for the creation and implementation of resilience plans,
Coordinate communication between teams to ensure collaboration, and
Make sure the SRE team has enough freedom to work.

Further reading

Ensuring resilience in the IT sector starts with getting all the right people on your team. Take your tech company’s talent strategy to the next level by following the tips we shared in this blog post:

Why you need to rethink your tech company’s talent strategy: 3 experts share their tips

Final thoughts: Everyone in the SaaS industry must have a resilience strategy

Most professionals working to ensure the resilience of their organizations are well aware that they can never plan for every eventuality, which is something Slobodan emphasized throughout our conversation:

“Obviously, you can never be completely bulletproof. Well, you can, but it takes a lot of effort and money — and that kind of investment is something you’d have to justify.

Still, running tests and then improving the segments that faltered during those tests is the best way to protect your business. That’s what makes implementing Chaos Engineering software into your resilience plan so effective. It reveals vulnerabilities and allows you to patch them up before they become a real issue.”

But, at the end of the day, Slobodan is also a realist, noting that not every company will have the resources to invest in building its resilience right off the bat:

“Eventually, you’ll have to justify your pursuit of resilience. There’s no point in making your product and your company impenetrable if you only have 3 users. But, when a company is up and running like ours is, it makes sense to do whatever it takes to ensure survival.”

Still, some aspects of a resilience plan are well within the scope of smaller SaaS companies.

Namely, even if you can’t invest in Chaos Engineering software, you can still give your developers the resources they need to make sure your product has as few vulnerabilities as possible built into it.

That alone will go a long way toward ensuring resilience until you get enough users to justify the introduction of other resilience components we have discussed.

Olga Milicevic

Communication researcher and author at Pumble

Olga Milicevic is a communication researcher and author dedicated to making your professional life a bit easier. She believes that everyone should have the tools necessary to respond to their coworkers’ requests and communicate their own professional needs clearly and kindly.
