Data and disaster recovery

Disaster recovery is a strange topic in the world of security and compliance. There is a lot of conversation about it, but often little gets done in practical terms. This may be because, for most organisations, it’s perceived as too expensive or too remote. In normal circumstances, a true disaster is a High Impact — Low Frequency (HILF) event and something in the human psyche prefers to focus on Low Impact — High Frequency (LIHF) events. This is why you tend to worry more about burning dinner than getting hit by a meteorite.

When you design your data storage and manipulation systems, the most common route to take is to consider redundancy: RAID storage, clustered/load-balanced servers and so on. What receives much less consideration is resiliency, which is relevant to HILF events: fault recovery and persistent service dependability.

Choose wisely

The underlying technology that creates the Aiven platform is (perhaps unsurprisingly) built in part on Aiven products. We utilize all of the advanced redundancy and resiliency features to ensure the durability and persistence of Aiven services. Replication of data and services to multiple continents across a large number of systems and disks with regularly tested backup and restore procedures are table stakes for us. And remember, these are all features we also offer to you, our customers. We keep your infra and your data as safe as your own data architecture lets us.

So far so theoretical. By now you’re probably asking “what’s that got to do with me?” and “what should I be doing?” And maybe even “what do you mean my own data architecture?”

Making a data recovery plan

The key is to design your data architecture in such a way that it is safe from both LIHF and HILF events. Remember, Aiven can keep your infra and data safe, but even though we can “see” that you have two Postgres services running in different availability zones, we can’t know what you keep in them — whether they’re separate databases or whether they’re the same database replicated safely across AZs.

Step 1: Ensure system-level redundancy

RAID is a real-life array but, instead of containing numbers or strings, each disk is a hard drive containing your critical system data. As storage has become cheaper, using local NVMe drives with your compute instances is a very real option. Depending on your cloud, you can use “ephemeral” storage and create a RAID setup or you can leave all of that to be managed by the cloud provider; i.e. using EBS with AWS.

You can also provide your own keys to encrypt the disks of your bare metal instance, with most cloud providers supporting this through their Command Line Tools or API. Often, they will integrate with the authentication services such as AWS KMS or GCP Cloud KMS.

Of course, this all depends on the operating system you run on your instances and your needs. For some, it might be enough to use Networked Storage and backup snapshots to Object Storage regularly.

Step 2: Distribute HA plans securely

We support single node instances because we know that HA is less important for development environments. For production environments, our HA plans deploy instances in the region you choose, ensuring that they are spread across Availability Zones (AZ). If you launch Postgres in an HA setup, you will have your primary in one AZ and your standby in another. If you launch a 6 node Kafka cluster then we launch each instance into a different AZ until there are none left for that region, then we will deploy multiple instances to the same AZ. You don’t need to worry about this, however, as Kafka is aware of these locations and will assign partitions so that data is spread across AZs as much as possible.

Backups for all of our database services are taken at different intervals and kept at different times, depending on the plan you have selected. Your backups are stored in Object Storage in the same region and can be used to restore your service to a particular point in time (PITR) or to create a fork of your service that is an exact clone of your current service but you can select the region and plan. This is particularly useful when you want to make a fork of your production database for your engineering teams.

Step 3: Set up read replicas and mirrored copies

In Postgres or MySQL, add a read replica to your service and deploy that into the cloud (and region) of your choice.

For Kafka, use the magic of MirrorMaker 2 (and 1), a service that is bundled with Open Source Kafka and allows you to replicate data from one cluster to another. Whether you want to run a one-off migration or if you have some geographically distributed clusters that you need to be in constant sync (either with either or with a large, centralised cluster), MirrorMaker is the optimal solution.

And for services that do not have such simple options, Kafka can come to the rescue! Through Kafka Connect, Aiven offers a wealth of open source connectors that can read from and/or push to many sources. This can be useful if you would like to move parts of your database into Elasticsearch for analytics or to move data between InfluxDB using the HTTP API.

Further reading

Wrapping up

In the meantime, make sure you follow our changelog and blog RSS feeds or our LinkedIn and Twitter accounts to stay up-to-date with product and feature-related news.

Your database in the cloud, www.aiven.io

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store