The recent high-profile AWS outage has highlighted, in excruciating detail, what can happen when you haven’t designed your solutions for the inevitability of your cloud provider going offline.  No cloud provider promises 100% up-time because they can’t.  It’s a hard problem to design for regardless of whether you solve it at the application, infrastructure or network level.  But design for it you must if your deployed solution is truly mission-critical and outages translate directly into significant lost profit.
 
You have two choices:
  1. Design your solution to run in different data centres, and manage the synchronisation of application state as best you can based on your requirements.  Mexia is an Australian cloud consultancy focused on Microsoft technologies, so we solve this problem by designing solutions that span the Azure data centres in Australia East and Australia Southeast and manage state at the application layer; or
  2. Design your solution to take advantage of a mythical platform that spans multiple cloud providers without requiring code re-writes, auto-heals itself when one of those clouds goes offline, manages and auto-replicates all application state, and supports modern DevOps practices.  Sound too good to be true?  Read on.
Mexia has recently demonstrated to one of our clients how they could deploy a highly-available, geo-redundant and multi-cloud microservices platform using Azure Service Fabric deployed over an interconnected network of 3 x Windows Server virtual machines each in Microsoft Azure, Amazon AWS and Google Cloud, creating a 9-server cluster with 3 independent fault domains.
NB: We could have easily configured this solution to span multiple Microsoft Azure data centres around the world, but we wanted to demonstrate a multi-provider approach.
Before we talk more about this particular solution, let’s learn a little more about Service Fabric.

Azure Service Fabric

Service Fabric is Microsoft’s next-generation application service platform with mission-critical features built into the platform itself, to overcome a lot of the hurdles usually faced when implementing resiliency with traditional infrastructure solutions. With Service Fabric, virtual machines are provisioned on-premises or in the cloud, and the Service Fabric facade is laid over the top, forming a cluster of nodes capable of delivering microservices.
The Service Fabric cluster provides a rich array of features including load balancing, health monitoring, replication & failover and self-healing for the deployed services. Because fault domains are defined across your data centres and individual servers, individual nodes within the cluster may be patched and upgraded without compromising availability, with Service Fabric dynamically redistributing workloads on the fly.
Azure Service Fabric was born from years of experience at Microsoft delivering mission-critical cloud services and has been proven in production since 2010. It is the foundation technology on which Microsoft run the core Azure infrastructure, powering services including Skype for Business, Intune, Azure Event Hubs, Azure Data Factory, Azure DocumentDB, Azure SQL Database and Cortana.
Microsoft have now made the platform components available to the world, and Mexia are leveraging its power and flexibility on a number of enterprise-scale digital transformation projects on-premises and in Microsoft Azure for our customers.

Imagine the Impossible

We said above that a microservices hosting platform with the following attributes is mythical:
  • spans multiple cloud providers without requiring code re-writes
  • auto-heals itself when one of those clouds goes offline
  • manages and auto-replicates all application state
  • supports modern DevOps practices
To prove that Service Fabric can provide all this to our client, Mexia built a multi-cloud, geographically dispersed infrastructure platform leveraging 9 x Windows Server IaaS machines across Microsoft Azure, Amazon AWS and Google Cloud data centres.  We then interconnected those machines using site-to-site IPsec VPN connections to provide a secure, private network between our cloud providers.
We then deployed the Service Fabric components to each of those nodes (3 in each cloud, so 9 nodes in total) and configured them into a self-healing, self-load-balancing cluster that is resilient to single server, single data centre, or entire cloud provider outages.
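To see why three providers with three nodes each can ride out the loss of a whole provider, here is a minimal Python sketch. It is not Service Fabric configuration and makes simplifying assumptions: each cloud provider is treated as one fault domain, and the cluster is considered healthy while a strict majority of its nodes is still reachable. The node names, fault domain labels and function names are hypothetical.

```python
# Hypothetical inventory: three nodes per provider, one fault domain per provider.
NODES = {
    f"{provider}-{i}": f"fd:/{provider}"
    for provider in ("azure", "aws", "gcp")
    for i in range(1, 4)
}


def surviving_nodes(nodes, failed_domain):
    """Return the nodes that remain when an entire fault domain goes offline."""
    return [name for name, fd in nodes.items() if fd != failed_domain]


def has_majority(nodes, survivors):
    """True if a strict majority of all nodes is still reachable."""
    return len(survivors) > len(nodes) // 2


def tolerates_any_single_domain_loss(nodes):
    """True if every possible single-fault-domain outage still leaves a majority."""
    return all(
        has_majority(nodes, surviving_nodes(nodes, fd))
        for fd in set(nodes.values())
    )
```

With the 9-node layout above, losing any one provider leaves 6 of 9 nodes, which is still a majority. Under the same assumption, an evenly split two-provider layout would not pass, because losing either half leaves exactly 50%.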
Next we built a demonstration solution that followed the principles of microservices, including standards-based communication protocols (i.e. HTTP) and internalising the management of application state rather than pushing to external queues and databases.  We built the services using stateful Reliable Services and C#.NET, and implemented business logic in a loop whilst incrementing a persisted sequential counter. In this architecture only one instance of the service is considered ‘active’ whilst replica services are deployed in other cluster nodes and ready to become active when needed. The state of the active service is persisted into the Service Fabric persistence layer, which is then shared automatically across the other instances of the same service running on other nodes.
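The active/replica pattern described above can be illustrated with a small Python model. This is a sketch only: it is not the C# Reliable Services code from the demonstration and does not use any Service Fabric API. The class and method names are hypothetical, and replication is assumed to be synchronous to every live replica.

```python
class Replica:
    """One copy of the service's state on one cluster node (hypothetical model)."""

    def __init__(self, node):
        self.node = node
        self.counter = 0
        self.alive = True


class CounterService:
    """One active replica; the others are standbys that receive every write."""

    def __init__(self, nodes):
        self.replicas = [Replica(n) for n in nodes]
        self.active = self.replicas[0]

    def increment(self):
        """Run one unit of 'business logic' and copy the new value to live replicas."""
        value = self.active.counter + 1
        for replica in self.replicas:
            if replica.alive:
                replica.counter = value
        return value

    def fail(self, node):
        """Take a node down; if it hosted the active replica, promote a standby."""
        for replica in self.replicas:
            if replica.node == node:
                replica.alive = False
        if not self.active.alive:
            live = [r for r in self.replicas if r.alive]
            if not live:
                raise RuntimeError("no live replicas left")
            # Promote the standby holding the most recent state.
            self.active = max(live, key=lambda r: r.counter)


svc = CounterService(["azure-1", "aws-1", "gcp-1"])
for _ in range(3):
    svc.increment()
svc.fail("azure-1")          # the node hosting the active replica goes away
next_value = svc.increment()  # the sequence continues on a different cloud
```

In this model the counter carries on from where the failed node left off, because every standby already holds the last replicated value when it is promoted.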
The final step was to configure Azure Service Fabric’s Chaos service to randomly restart nodes, restart deployed services, remove replica services and generally try to compress years of infrastructure errors into a few minutes and hours. If our active service got taken out, one of the replicas retrieved the last known state and assumed the role of the active service. It did not matter that the newly active service could have been running in a different data centre, let alone a different cloud provider, than the previous active service.
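The idea behind this kind of test can be sketched as a toy fault-injection loop in Python. It does not call the real Chaos service, and everything in it is an assumption made for illustration: at most one node is down at a time, a restarted node catches up from the active node, and a seeded random generator decides which node to restart. The function name `run_chaos` is hypothetical.

```python
import random


def run_chaos(nodes, steps, seed=0):
    """Randomly restart nodes while a counter increments; return the observed values."""
    rng = random.Random(seed)        # seeded so a run can be reproduced
    state = {n: 0 for n in nodes}    # each node's copy of the counter
    down = None                      # node currently restarting, if any
    active = nodes[0]
    observed = []

    for _ in range(steps):
        # Bring back the previous victim; it catches up from the active node.
        if down is not None:
            state[down] = state[active]
            down = None
        # Fault injection: half of the time, restart a randomly chosen node.
        if rng.random() < 0.5:
            down = rng.choice(nodes)
            if down == active:
                # Failover: any live node already holds the latest value.
                active = rng.choice([n for n in nodes if n != down])
        # Business logic: increment on the active node, replicate to live nodes.
        value = state[active] + 1
        for n in nodes:
            if n != down:
                state[n] = value
        observed.append(value)
    return observed
```

The property worth checking is that the observed sequence has no gaps and never goes backwards, whichever nodes the loop happens to restart.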
The resulting solution was a thing of beauty to watch.  The chaos monkey was trying its hardest to put a dent in the up-time and availability of the deployed service, but to no avail.  The highly available, geo-redundant & multi-cloud Service Fabric environment ensured the .NET service just kept on working, oblivious to the carnage going on in the underlying infrastructure management layer.

What Does This Mean for Your Cloud Strategy?

We’ve already spoken about the need for organisations that require mission-critical applications to design highly resilient architecture that is not limited to a single cloud provider’s data centre.  You can do what we do in Option 1 above and design for a dual data centre approach.  This is hard and requires application-level code to be aware of its multi-data centre state persistence environment.
Alternatively you can choose Option 2 now: architect a resilient microservices-based application and deploy it onto a platform that scales beyond any one cloud provider’s data centres, or across the global data centres of your preferred cloud provider.  Naturally, we recommend Microsoft’s data centres because they’re the only global vendor with the mature on-premises server products AND modern PaaS capability, all with a common DevOps/ALM story and a common enterprise procurement relationship.
Whether you run a platform like this purely in the cloud, on-premises or a hybrid of both, the imperative when designing mission-critical systems is to not place all your eggs in one basket when you absolutely cannot be offline.  Microsoft’s Service Fabric is a production-ready microservices framework that can make that possible for you today.
If you’d like some help, drop Mexia a note at enquiries@mexia.com.au.  We can design it for you, build it for you, and manage it for you.  We’d love to help.