You are viewing version 2.22 of the documentation, which is no longer maintained. For up-to-date documentation, see the latest version.

Configuring Armory on AWS for Disaster Recovery

Overview

This guide describes how to configure your Armory on AWS deployment to be more resilient and to support disaster recovery (DR). Armory does not function in multi-master mode, so an active-active setup is not supported at this time. Instead, this guide describes how to achieve an active-passive setup: two instances of Armory deployed into two regions that can fail independently.

Diagram of Armory deployment on AWS with disaster recovery

Requirements

  • The passive Armory will have the same permissions as the active Armory
  • The active Armory is configured to use AWS Aurora and S3 for persistent storage
  • Your Secret engine/store has been configured for Disaster Recovery (DR)
  • All other services integrated with Armory, such as your Continuous Integration (CI) system, are configured for DR

What is a passive Armory

A passive Armory means that the deployment:

  • Is not reachable by its known endpoints while passive (external and internal)
  • Does not schedule pipelines
  • Cannot have pipelines triggered by CI jobs

Storage considerations

Armory recommends using a relational database for Orca and Clouddriver. For Orca, a relational database helps maintain integrity. For Clouddriver, it reduces the time to recovery. Even though any MySQL version 5.7+ database can be used, Armory recommends using AWS Aurora MySQL for the following reasons:

  • More performant than RDS MySQL
  • Better high availability than RDS MySQL
  • Less downtime for patching and maintenance
  • Support for cross-region replication

Note the following guidelines about Armory storage and caching:

  • S3 buckets should be set up with cross-region replication turned on. See Replication in the AWS documentation.
  • Consider the following if you plan to use Aurora MySQL:
  • Redis - Configure each service to use its own Redis. When Armory services use a relational database or S3 as their permanent backing store, Redis is used for caching, so Redis no longer needs to be recoverable for disaster recovery purposes. If the Redis data is lost, expect the following:
    • Gate - Users need to log in again
    • Fiat - Needs to sync user permissions and warm up
    • Orca - Loses pending executions
    • Rosco - Loses bake logs
    • Igor - Loses the last executed Jenkins job cursor
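
The S3 replication guideline above could be captured as infrastructure code. The following is a minimal CloudFormation sketch, not a template provided by Armory. The bucket names, account ID, IAM role, and regions are hypothetical placeholders. It assumes that the destination bucket already exists in the passive region with versioning enabled and that the role allows S3 to replicate objects on your behalf.

```yaml
# Hypothetical CloudFormation fragment for the bucket in the active region.
Resources:
  ArmoryPersistentBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: acme-armory-us-west-2            # placeholder name
      VersioningConfiguration:
        Status: Enabled                            # replication requires versioning
      ReplicationConfiguration:
        # Placeholder role ARN; the role must permit S3 replication.
        Role: arn:aws:iam::111111111111:role/acme-armory-s3-replication
        Rules:
          - Id: replicate-to-passive-region
            Status: Enabled
            Prefix: ""                             # empty prefix: replicate all objects
            Destination:
              # Placeholder bucket in the passive region.
              Bucket: arn:aws:s3:::acme-armory-us-east-1
```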

Kubernetes guidelines

Keep the following guidelines in mind when configuring Kubernetes.

Control plane

  • The Kubernetes control plane should be configured to use multiple availability zones so that it can handle an availability zone failure. EKS control planes run across multiple availability zones by default.

Workers

The following guidelines are meant for EKS workers:

  • The Kubernetes cluster should be able to support the Armory load. Use the same instance type and configure the same number of worker nodes as the primary.
  • There needs to be at least 1 node in each availability zone the cluster is using.
  • The autoscaling group needs a termination policy that removes the oldest workers first. Use one or more of the following policies: OldestLaunchConfiguration, OldestLaunchTemplate, OldestInstance. This allows the underlying worker AMIs to be rotated more easily.
  • Ideally, for each Armory service that runs more than one replica, the pods should be spread out among the various workers. This means that pod affinity/anti-affinity should be configured. With this configuration, Armory can handle availability zone failures better.
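
One way to express the spreading guideline above is a pod anti-affinity rule. The fragment below is a sketch of the pod spec portion of a Deployment, shown for Gate. The `cluster: spin-gate` label is an assumption about how the pods are labeled, so check the labels on your own pods before using it. How the fragment is merged into the Armory-managed Deployment depends on your Operator or Halyard setup and is not shown here.

```yaml
# Pod template fragment (Deployment .spec.template.spec), shown for Gate.
affinity:
  podAntiAffinity:
    # "preferred" keeps pods schedulable even when a zone has no free capacity.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          # Spread across availability zones rather than only across nodes.
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              cluster: spin-gate   # assumed label; verify on your pods
```

A preferred rule is used instead of a required one so that a replica can still be scheduled when only one zone has capacity, at the cost of weaker spreading.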

DNS considerations

A good way to handle failover is to set up DNS entries as a CNAME for each Armory installation.

For example:

  • Active Armory accessible through us-west-spinnaker.acme.com and api.us-west-spinnaker.acme.com load balancers.
  • Passive Armory accessible through us-east-spinnaker.acme.com and api.us-east-spinnaker.acme.com load balancers.
  • Add a DNS entry spinnaker.acme.com with a CNAME pointing to us-west-spinnaker.acme.com (same for the api subdomain) and a small TTL (1 to 5 minutes).

In this setup, point your CNAME to us-east when a disaster event happens.
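
As an illustration of the CNAME approach described above, the following CloudFormation sketch defines the record. It assumes the acme.com zone is hosted in Route 53, which this guide does not require; any DNS provider that supports CNAME records with a short TTL works the same way. The resource name is hypothetical.

```yaml
# Hypothetical CloudFormation fragment; assumes acme.com is a Route 53 hosted zone.
Resources:
  SpinnakerCname:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: acme.com.          # trailing dot is required
      Name: spinnaker.acme.com.
      Type: CNAME
      TTL: "60"                          # 1 minute, the low end of the suggested range
      ResourceRecords:
        # Points at the active installation; change to the us-east name on failover.
        - us-west-spinnaker.acme.com
```

A matching record for the api subdomain would follow the same pattern.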

Setting up a Passive Armory

To make a passive version of Armory, use the same configuration files as the current active installation as your starting point. Then, modify them to deactivate certain services before deployment.

To keep the configurations in sync, set up automation to create a passive Armory configuration every time a configuration is changed for the active Armory. An easy way to do this is to use Kustomize Overlays.
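
A passive overlay might look like the following sketch. The directory layout (`base` and `overlays/passive`) is hypothetical, and the example assumes the base directory contains the active SpinnakerService manifest named `spinnaker`. Only two services are patched here for brevity; a real overlay would list every service deployed in your installation.

```yaml
# overlays/passive/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base        # assumed to hold the active SpinnakerService manifest
patches:
  # Merge patch that overrides replica counts for the passive installation.
  - patch: |-
      apiVersion: spinnaker.armory.io/v1alpha2
      kind: SpinnakerService
      metadata:
        name: spinnaker
      spec:
        spinnakerConfig:
          config:
            deploymentEnvironment:
              customSizing:
                gate:
                  replicas: 0
                orca:
                  replicas: 0
```

Because the overlay only patches the base, any change to the active configuration flows into the passive one the next time the overlay is built.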

Configuration modifications

Set replicas to 0 for all Armory services. The following SpinnakerService manifest shows this for the Gate service:

apiVersion: spinnaker.armory.io/v1alpha2
kind: SpinnakerService
metadata:
  name: spinnaker
spec:
  spinnakerConfig:
    config:
      deploymentEnvironment:
        customSizing: # Configure, validate, and view the component sizings for the Armory services.
          gate:
            replicas: 0

Once you’re done configuring for the passive Armory, run kubectl -n <spinnaker namespace> apply -f <SpinnakerService manifest> if using Operator, or hal deploy apply if using Halyard to deploy.

Performing disaster recovery

If the active Armory is failing, the following actions need to be taken:

Activating the passive Armory

Perform the following tasks when you make the passive Armory into the active Armory:

  • Use the same version of Operator or Halyard to deploy the passive Armory installation that was used to deploy the active Armory.
  • AWS Aurora
    • Promote another cluster in the global database to have read/write capability.
    • If the database endpoint or the database credentials have changed, update the SpinnakerService manifest (if using Operator) or the Halyard configuration (if using Halyard) to point to the promoted database.
  • Create the Redis clusters.
  • Activate the passive instance.
    • Set the replicas to more than 0. Ideally, this should be set to the same number of replicas that the active Armory used.
  • Change the DNS CNAME if it is not already pointing to the passive Armory installation.
  • If the failed Armory is still accessible, deactivate it.
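
The manifest changes in the steps above can be sketched as follows for an Operator-based installation. This is an illustration under assumptions, not a complete manifest: the replica count and endpoint are placeholders, only Gate and Orca are shown, and it assumes your manifest configures Orca's SQL connection under `profiles.orca.sql`. If your installation keeps these settings elsewhere, apply the same change there.

```yaml
apiVersion: spinnaker.armory.io/v1alpha2
kind: SpinnakerService
metadata:
  name: spinnaker
spec:
  spinnakerConfig:
    config:
      deploymentEnvironment:
        customSizing:
          gate:
            replicas: 2   # assumed value; match the former active Armory
    profiles:
      orca:
        sql:
          connectionPool:
            # Placeholder for the writer endpoint of the promoted Aurora cluster.
            jdbcUrl: jdbc:mysql://<promoted-cluster-endpoint>:3306/orca
          migration:
            jdbcUrl: jdbc:mysql://<promoted-cluster-endpoint>:3306/orca
```

After editing, apply the manifest with the same kubectl command used to deploy the passive Armory.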

Recovery Time Objective (RTO)

Restoration time depends on how long it takes to restore the database, restore the Armory services, and update DNS. Most Armory services that fail should recover within 10 minutes. Clouddriver may take longer, especially at scale, because it needs to reconnect to all configured cloud accounts. Note that services are limited to local resources, which are configured to be redundant (databases, nodes, etc.) or highly available. In addition to Clouddriver, the following services may take additional time to restore because Redis needs time to warm up the cache:

  • Orca
  • Igor
  • Echo
  • Fiat

Recovery Point Objective (RPO)

This is the state to which Armory will recover the affected systems in case of a failure, such as database corruption. The current Armory RPO target is 24 hours maximum, tied to the last snapshot of the database.

Other resources