Multi-Region AWS – Can Your Business Afford an Outage?

There are many strategies for fault tolerance within a single AWS region, and they may be sufficient for most businesses. Region-level service failures are rare, but some organizations cannot tolerate any outage and require a secondary region for failover. Running dual-region applications can get expensive quickly, so organizations should take the time to understand both the risk of a regional outage and what such an outage would cost their business. There are ways to mitigate the cost of dual region, but in addition to the AWS fees there is also the overhead of developing and testing a dual-region architecture.
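
One way to frame that trade-off is a simple break-even calculation. The sketch below is only an illustration: the `OutageModel` class, its field names, and the figures in the usage note are hypothetical planning inputs, not AWS data, and a real assessment would also weigh reputation and contractual penalties.

```python
from dataclasses import dataclass


@dataclass
class OutageModel:
    """Hypothetical planning inputs supplied by the business, not by AWS."""
    loss_per_hour: float                   # estimated cost of one hour of downtime
    expected_outage_hours_per_year: float  # your own estimate of regional downtime
    dual_region_cost_per_year: float       # AWS fees plus development and testing overhead


def expected_outage_cost(model: OutageModel) -> float:
    """Expected yearly loss if the application stays in a single region."""
    return model.loss_per_hour * model.expected_outage_hours_per_year


def dual_region_pays_off(model: OutageModel) -> bool:
    """True when the expected yearly loss exceeds the yearly cost of dual region."""
    return expected_outage_cost(model) > model.dual_region_cost_per_year
```

For example, with an assumed loss of 50,000 per hour, 4 expected outage hours per year, and a dual-region cost of 150,000 per year, the expected loss (200,000) exceeds the cost, so `dual_region_pays_off` returns `True`.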

Recently, Amazon released a Multi-Region Application Architecture to demonstrate dual-region, fault-tolerant applications.

This works well if it matches your current or planned architecture.
However, what are your dual-region options for existing applications that use other services?

Before we discuss dual-region failover strategies for other services, it’s important to set priorities for choosing services with scale and redundancy in mind, particularly if dual region might be in your future.

Cloud Native Services/Think “Serverless”
- These are services designed to run solely on AWS technology, taking advantage of the virtually unlimited scale and fault tolerance within a region (and in some cases, across regions or edge locations). Some examples are Lambda, S3, DynamoDB, API Gateway, SNS, Kinesis, SQS, Aurora, CloudFront, etc.
Fully Managed Services
- These are applications that were not architected natively for the cloud but are fully managed by AWS. Some examples are Elasticsearch, ElastiCache, Managed Kafka (MSK), RDS, etc.
Anything else that doesn’t fit the above
- These are applications that need to run on EC2 instances or in containers.

Some of the cloud native services have dual-region features built in. DynamoDB has global tables. S3 has Cross-Region Replication (be aware of replication latency if this is part of a failover strategy). Aurora has Global Database. For many of the other services, custom replication strategies have to be employed to support dual-region failover.
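
As an illustration of a built-in feature, the sketch below adds a replica region to an existing DynamoDB table through the `update_table` call with `ReplicaUpdates`. The helper name `add_replica_region` and the table and region names in the usage note are assumptions for the example; the client is passed in so the request can be inspected without touching AWS.

```python
def add_replica_region(dynamodb_client, table_name: str, replica_region: str) -> dict:
    """Ask DynamoDB to create a global-table replica of `table_name` in `replica_region`.

    `dynamodb_client` is expected to behave like boto3.client("dynamodb")
    created in the table's current region.
    """
    request = {
        "TableName": table_name,
        # One "Create" entry per new replica region.
        "ReplicaUpdates": [{"Create": {"RegionName": replica_region}}],
    }
    dynamodb_client.update_table(**request)
    return request
```

With boto3 installed, a call might look like `add_replica_region(boto3.client("dynamodb", region_name="us-east-1"), "orders", "us-west-2")`, where `orders` is a hypothetical table name.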

For one of our customers, we built a highly available SaaS application layer on an existing data analytics platform, mixing cloud native services with managed services for dual-region failover. To ensure a dual-region architecture could fit into the budget, we had to make architectural changes to some existing parts of the application. Here are some key points of the architecture:

Deployment: CI/CD
- To continue to support Continuous Integration and Continuous Deployment, Jenkins was used in combination with CloudFormation to deploy code changes into multiple regions.
Lambda
- All applications that were running on EC2 instances were refactored into Lambda functions wherever possible. Applications that could not run in Lambda were ported to run as Docker containers on Fargate. All new code was implemented using a combination of Fargate and Lambda.
API Gateway
- All SaaS endpoints were implemented/exposed using API Gateway.
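
The multi-region deployment step can be sketched as a loop that applies the same CloudFormation template in each region. This is a minimal illustration rather than the pipeline described above: `deploy_to_regions` and `client_factory` are hypothetical names, the sketch only covers first-time stack creation (updates would go through `update_stack` or change sets), and the Jenkins wiring is left out.

```python
from typing import Callable, Dict, Iterable


def deploy_to_regions(
    regions: Iterable[str],
    stack_name: str,
    template_body: str,
    client_factory: Callable[[str], object],
) -> Dict[str, str]:
    """Create the same stack in each region, in order, and return region -> StackId.

    `client_factory(region)` is expected to return an object that behaves like
    boto3.client("cloudformation", region_name=region). An exception in one
    region stops the loop, so later regions are not touched.
    """
    stack_ids: Dict[str, str] = {}
    for region in regions:
        cfn = client_factory(region)
        response = cfn.create_stack(
            StackName=stack_name,
            TemplateBody=template_body,
            Capabilities=["CAPABILITY_NAMED_IAM"],
        )
        stack_ids[region] = response["StackId"]
    return stack_ids
```

Listing the primary region first means a broken template fails there before it reaches the secondary region.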

Elasticsearch and Managed Kafka – Replication Strategy
- Several Kafka topics running in Amazon MSK are used to stream documents into Elasticsearch, where they are indexed by a container running in ECS Fargate.

  To guarantee the data is replicated to a second region, a Kafka replication consumer process was implemented, which runs in ECS Fargate. This process consumes messages from various Kafka topics and writes them to the secondary MSK cluster running in the secondary region.

  To mitigate any potential data loss from either region, a Kafka persistence consumer was implemented. The persistence consumer runs in both regions, consuming from its respective MSK cluster and persisting messages into S3. In the rare event that we lose an MSK cluster, we can rehydrate the entire cluster using a custom S3-to-MSK producer.
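
The replication and persistence consumers described above can be sketched as follows. This is a simplified model under stated assumptions, not the customer's implementation: `Message`, `replicate`, `persist`, `rehydrate`, and the object key layout are all hypothetical, a plain dictionary stands in for the S3 bucket, and `send` stands in for a Kafka producer's send call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# Stand-in for a Kafka producer: send(topic, value).
Send = Callable[[str, bytes], None]


@dataclass(frozen=True)
class Message:
    topic: str
    partition: int
    offset: int
    value: bytes


def replicate(messages: Iterable[Message], send_to_secondary: Send) -> int:
    """Replication consumer: forward every message to the secondary cluster."""
    count = 0
    for msg in messages:
        send_to_secondary(msg.topic, msg.value)
        count += 1
    return count


def object_key(msg: Message) -> str:
    """Zero-padded key so that sorting keys restores per-partition offset order."""
    return f"{msg.topic}/{msg.partition:05d}/{msg.offset:020d}"


def persist(messages: Iterable[Message], store: Dict[str, bytes]) -> None:
    """Persistence consumer: write each message under its key (dict stands in for S3)."""
    for msg in messages:
        store[object_key(msg)] = msg.value


def rehydrate(store: Dict[str, bytes], send: Send) -> int:
    """S3-to-MSK producer: replay stored messages in key order into a new cluster."""
    for key in sorted(store):
        topic = key.split("/", 1)[0]
        send(topic, store[key])
    return len(store)
```

Keying objects by topic, partition, and offset also makes `persist` idempotent: a message consumed twice overwrites the same object instead of creating a duplicate.
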
Aurora MySQL – Replication Strategy
- The built-in Aurora Global Database was used here. During a failover, we remove the secondary cluster from the Aurora global database, which promotes it to a standalone cluster that allows full read and write operations.
SQL Server – Replication Strategy
- For SQL Server cross-region replication, we opted for Basic Availability Groups with SQL Server Standard on Linux. A secondary server runs in the backup region. Running on Linux and using the Standard edition provides significant cost savings over running on Windows Server with the Enterprise edition of SQL Server. During a failover, we promote the backup to be the primary.
Testing and Automation
- Failover must take place as quickly as possible, so all failover procedures were automated and tested thoroughly to ensure dual-region failover completes within minutes.
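
An automated failover can be sketched as an ordered runbook whose steps are timed, so tests can check the total against a target. The runbook structure, `run_failover`, `within_budget`, and the step names are hypothetical. The Aurora step uses the RDS `remove_from_global_cluster` call, which detaches the secondary cluster from the global database; the client is injected so the runbook can be exercised without AWS.

```python
import time
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], None]]


def detach_aurora_secondary(rds_client, global_cluster_id: str, secondary_cluster_arn: str) -> None:
    """Detach the secondary Aurora cluster so it can serve reads and writes on its own.

    `rds_client` is expected to behave like boto3.client("rds") in the secondary region.
    """
    rds_client.remove_from_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        DbClusterIdentifier=secondary_cluster_arn,
    )


def run_failover(steps: List[Step], clock: Callable[[], float] = time.monotonic) -> Dict[str, float]:
    """Run steps in order and return the seconds each one took.

    An exception in any step propagates and stops the runbook.
    """
    timings: Dict[str, float] = {}
    for name, action in steps:
        started = clock()
        action()
        timings[name] = clock() - started
    return timings


def within_budget(timings: Dict[str, float], budget_seconds: float) -> bool:
    """True when the whole runbook finished inside the failover time target."""
    return sum(timings.values()) <= budget_seconds
```

Running the same runbook in regular failover drills and asserting `within_budget(timings, 300)` (an assumed five-minute target) turns "within minutes" into something a test can enforce.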

Implementing dual region architectures on existing applications and staying within budget can seem elusive at first. Working with the right team of AWS Solutions Architects can get you there.

Get your free consultation today.