How to improve infrastructure resilience


Keeping your IT infrastructure running smoothly takes more than proper design and configuration. You also need to anticipate and mitigate risks in advance to make the system more resilient. Below we explain some of the ways to do this.

Who is responsible for infrastructure resilience

Infrastructure resilience is usually the responsibility of the company's IT department. If part or all of the infrastructure is hosted in the cloud, however, the provider handles many of the issues. For example, the provider is in charge of redundant communication channels, UPS devices, server room racks, and so on.

Fault tolerance can also be built into out-of-the-box services, where the user runs the application and the cloud provider's technical department handles the underlying technical aspects.

Of course, customers can also work on the resilience of their own systems. It is important that this is done expertly and not just for the sake of appearances. For example, we had a case where a company hosted all its critical services on one server in a high-quality data center, and one of the disks failed. No one is immune to such risks, which have nothing to do with the provider or the data center.

Yet you can avoid or at least minimize the consequences, for example by asking the provider to distribute services across several locations. This increases fault tolerance because even if one server goes down, there will be a backup.
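As a rough illustration of why a second location helps, the sketch below estimates the chance that at least one copy of a service is up when it is hosted in several locations. It assumes that locations fail independently of each other, and the 99.5% figure is an invented example value, not a measurement from any provider.

```python
def combined_availability(per_location: float, locations: int) -> float:
    """Probability that at least one of `locations` independent copies is up."""
    if not 0.0 <= per_location <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if locations < 1:
        raise ValueError("at least one location is required")
    # The service is down only if every location is down at the same time.
    return 1.0 - (1.0 - per_location) ** locations


# Hypothetical example: each location is available 99.5% of the time.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.995, n):.6f}")
```

In practice failures are rarely fully independent, so real figures will be lower than this estimate. That is one reason to place copies in locations that share as little as possible.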

Such opportunities are only available from large providers who are able to host customer data in different data centers. Cloud4Y recently opened access to a Turkish data center, and there are several data centers in Russia. This allows you to increase the resilience of your infrastructure at the provider's expense.

Fault tolerance at different infrastructure levels

Resilience is built at three basic levels.


Region level. Put simply, this is the location of the data center. One region should not be affected by disruptions in another. For example, if an emergency happens in our Moscow data center, the data center in St. Petersburg will not be affected. Geographical remoteness does not have to mean delays: high-performance communication channels let the distributed infrastructure work with an acceptable level of latency. Regional separation ensures high system resilience and enables the Disaster Recovery service.

Availability zone level. Several data centers can operate in one region, for example in Moscow on Korovinskoe shosse and on 8th Marta Street. These availability zones are interconnected by optical fibre.

Pool level. This level refers to what is inside an individual data center, i.e. sets of servers that have technical or logical links between them. For example, these may be servers of the same series from the same vendor, located on different floors of the data center. Thus, each server room is an independent mini-DC with separate communications. Keeping critical services in different pools increases their fault tolerance. For example, it protects against network equipment failures or local power failures.
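To make the three levels more concrete, here is a minimal sketch that checks how widely the copies of a service are spread. The `Placement` class, the function names, and the region, zone, and pool labels are all hypothetical and chosen for illustration. They do not describe any provider's actual API or naming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one copy of a service runs (all labels are made up)."""
    region: str
    zone: str
    pool: str


def distinct_domains(replicas: list[Placement]) -> dict[str, int]:
    """Count how many different regions, zones, and pools the copies use."""
    return {
        "region": len({r.region for r in replicas}),
        "zone": len({(r.region, r.zone) for r in replicas}),
        "pool": len({(r.region, r.zone, r.pool) for r in replicas}),
    }


def survives_loss_of(level: str, replicas: list[Placement]) -> bool:
    """True if losing any single domain at `level` still leaves one copy."""
    return distinct_domains(replicas)[level] > 1


# Two copies in the same region but in different availability zones.
replicas = [
    Placement(region="moscow", zone="zone-a", pool="pool-1"),
    Placement(region="moscow", zone="zone-b", pool="pool-1"),
]
print(distinct_domains(replicas))
print("survives a zone outage:", survives_loss_of("zone", replicas))
print("survives a region outage:", survives_loss_of("region", replicas))
```

With this layout the service would survive the loss of one pool or one availability zone, but not the loss of the whole region.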

The physical layer of providing fault-tolerant infrastructure

When it comes to specific solutions for building resilient systems, the following are available:

  • Fault-tolerant power supply. Two independent lines power each cloud element. An ATS (Automatic Transfer Switch) protects against power interruptions by switching the load between the main and reserve lines.

  • Resource redundancy. Resources are backed up on another host, which can be located anywhere. If the master server fails, the VMs are started on the backup server and the cloud keeps running without downtime.

  • Migration during maintenance. If hardware maintenance is needed, workloads are migrated seamlessly and automatically to another host. This has no impact on the company's infrastructure and does not affect the operation of the cloud servers.
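The resource redundancy idea from the list above can be sketched as a toy model. This is only an illustration of the principle, not how any hypervisor or cloud platform implements it. The `Host` class, the `fail_over` function, and the host and VM names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """A simplified server that runs a set of named VMs."""
    name: str
    healthy: bool = True
    vms: set[str] = field(default_factory=set)


def fail_over(primary: Host, backup: Host) -> list[str]:
    """Restart the VMs of a failed primary host on a healthy backup host.

    Returns the names of the VMs that were moved, in sorted order.
    """
    if primary.healthy:
        return []  # nothing to do while the primary is working
    if not backup.healthy:
        raise RuntimeError("no healthy backup host available")
    moved = sorted(primary.vms)
    backup.vms |= primary.vms  # VMs are started on the backup host
    primary.vms.clear()
    return moved


primary = Host("host-a", vms={"db", "web"})
backup = Host("host-b")
primary.healthy = False  # simulate a hardware failure
print(fail_over(primary, backup))
```

The same function could be triggered before planned maintenance as well as after a failure. The difference is only in what causes the move.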

There are two other interesting approaches to fault tolerance. The first implies that systems keep operating normally in the event of any failure: response times and bandwidth are unaffected. The second assumes graceful degradation of performance. The principle is simple: the impact of a failure on the infrastructure is proportional to its significance. Minor problems will have almost no impact on performance, and certainly will not lead to system failure.
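The difference between the two approaches can be shown with a small calculation. The sketch below is a simplified model under stated assumptions: all nodes are identical, the load is spread evenly, and the node counts are invented for the example.

```python
def served_fraction(needed: int, provisioned: int, failed: int) -> float:
    """Fraction of the required capacity still served after failures.

    needed      -- nodes required to carry the full load
    provisioned -- nodes actually deployed (spares included)
    failed      -- nodes that are currently down
    """
    if needed < 1 or provisioned < needed:
        raise ValueError("need at least `needed` provisioned nodes")
    if not 0 <= failed <= provisioned:
        raise ValueError("failed must be between 0 and provisioned")
    alive = provisioned - failed
    # Spare nodes absorb failures first; after that, capacity shrinks
    # in proportion to the number of nodes lost.
    return min(1.0, alive / needed)


# Hypothetical cluster: 4 nodes carry the load, 2 more are spares.
for failed in range(5):
    print(failed, served_fraction(needed=4, provisioned=6, failed=failed))
```

While spare nodes remain, failures do not reduce performance at all, which corresponds to the first approach. Once the spares are used up, each further failure removes a proportional share of capacity instead of stopping the service, which corresponds to the second.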

Improving Fault Tolerance with WAFs

There is a stereotype that firewalls act crudely, blocking ports, addresses, and protocols in an attempt to stop malicious traffic. As a result, important 'normal' services become vulnerable.

A WAF (Web Application Firewall) is a more advanced solution that prevents intruders from finding and exploiting vulnerabilities in services. The false positive rate of the WAF is less than 0.01%, and it can find vulnerabilities in the code and suggest ways to eliminate them.

The firewall provides better protection for web applications, which in turn raises the fault tolerance of the infrastructure.

If you want to find out more about the provider's fault-tolerant infrastructure, send us a message via live chat or call Cloud4Y's managers. We are ready to answer your questions.

author: John
published: 06/16/2023