Testing GitLab disaster recovery: what we learned

After watching gitlab.com go through a major database incident on February 1st, 2017, I decided to put our backup and recovery processes to the test. This post describes what we learned.

To be clear: our gitlab instance was not affected by this outage. We chose to self-host gitlab 3 years ago – migrating from an SVN SaaS solution – and have been very happy campers ever since, without any major (unplanned) outages.

TL;DR

It worked but we can improve. Read the section “Lessons learned” at the end of this blog.

Setting the scene

We actually migrated this instance a couple of times: moving to another server and cloud provider, migrating from MySQL to PostgreSQL, moving from a source-based installation to the Omnibus package, upgrading the base operating system and so on. As a result we have had some exposure to recovery, restore and reconfiguration exercises.

First I will give an overview of what we have right now, and then describe how we performed the disaster recovery.

Hosting is done on a single public cloud virtual server at Transip: an Ubuntu 14.04 box with 2 cores, 4 GB of RAM and 150 GB of SSD storage (single disk). Our instance hosts 425 projects in 60 groups for 94 users.

Our procedures

What backup and/or disaster recovery processes do we have in place already?

  • Regular VM snapshots with a 4 hour interval. Retention is 48 hours. Fastest way to recover, if the cloud provider is not out of business 😉
  • A manual snapshot which we use for major maintenance cycles. Not up to date and could be 3 months old. Still could be used to recover to a well known state.
  • Daily dumps of the gitlab database and repositories – using the regular gitlab backup tasks. Stored on the server for 4 days. Enables quick restores without accessing off site storage.
  • Weekly dump of that same data plus artefacts, builds and registries.
  • All exports are uploaded to an AWS S3 bucket. Retention is not set at the moment. We could go back years.
  • Both the dumps and important file system mount points are pulled with rsnapshot to another (cloud) server every 4 hours. This data is persisted every day, week and month for about half a year (default rsnapshot configuration). This server also holds other backup data from all of our servers.
  • The rsnapshot data is replicated every 4 hours to redundant storage in a different data center for up to 48 hours. This is just for our peace of mind.

As you can see, we don’t go easy on redundancy. Adding up all the options, we store our data at least 5 times in different locations. Should be enough, we figure.
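The daily and weekly dumps from the list above could be driven by something like the sketch below. It assumes an Omnibus install with gitlab-rake on the PATH. The function names and the RAKE override variable are made up for illustration, and the SKIP list is my reading of "database and repositories only", so check it against the backup documentation for your gitlab version.

```shell
#!/bin/sh
# Sketch of the daily and weekly backup jobs (function names are hypothetical).
# RAKE can be overridden, e.g. RAKE=echo, to print the task instead of running it.
RAKE="${RAKE:-gitlab-rake}"

# Daily: leave out artifacts, builds and registry images to keep the dump small.
daily_backup() {
    $RAKE gitlab:backup:create SKIP=artifacts,builds,registry CRON=1
}

# Weekly: everything the backup task knows about.
weekly_backup() {
    $RAKE gitlab:backup:create CRON=1
}
```

CRON=1 only silences the progress output, so cron does not mail you on every successful run.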

Going into disaster recovery mode

For this situation we will assume we lost our virtual server and the provider is unable to recover for some reason. A recovery from scratch using off site sources is the best way to test for the worst possible scenario. The primary goal is to get gitlab up and running with access to our data. Configuration details can be sorted out later.

I created a new virtual server – using another public cloud provider (Digital Ocean) – with nearly identical specifications. The first issue was disk space: the standard $40/month model only has a 60 GB disk. Adding an additional 150 GB block storage volume (an extra $15/month) fixed that. That took about 5 minutes in total to get running.

A quick apt-get update/upgrade brings the system up to date with recent patches. I also set the time zone to Europe/Amsterdam. Another 5 minutes in total, including some time to make sure the box works as expected.

Then install gitlab using the instructions at https://about.gitlab.com/downloads/#ubuntu1404

Make sure you are installing the exact same version of gitlab! In our case: sudo apt-get install gitlab-ce=8.15.5-ce.0
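If you no longer know which version the lost server was running, the backup archive itself should tell you. The sketch below assumes the tar file contains a backup_information.yml with a ":gitlab_version:" entry, which is what our backups look like. The function name is hypothetical.

```shell
#!/bin/sh
# Print the gitlab version recorded inside a backup archive.
# Assumes the tar holds backup_information.yml with a ":gitlab_version:" key.
backup_gitlab_version() {
    tar -xOf "$1" backup_information.yml |
        sed -n 's/^:gitlab_version:[[:space:]]*//p'
}

# Example (not executed here; the archive path is a placeholder):
#   version="$(backup_gitlab_version /path/to/backup.tar)"
#   sudo apt-get install "gitlab-ce=${version}-ce.0"
```

The "-ce.0" suffix matches the package name used above. Other releases may use a different suffix, so check with apt-cache policy gitlab-ce first.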

Configure gitlab as instructed: sudo gitlab-ctl reconfigure. Now you will have a running gitlab instance using all the defaults. That took 15 minutes. Half an hour in total to get a ‘vanilla’ installation from scratch. Not bad.

Now to restore our data

As mentioned, our backups are uploaded to an S3 bucket. How do I get that data onto the server?
Using Google I found the official AWS CLI tools. I had to install python-pip first and then run pip install awscli.

Next, I set up my credentials (key and secret) with aws configure. Now I was able to access our bucket with the most recent backup. This step took me about 25 minutes (including copy time for 8 GB of data).
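Finding and copying the most recent backup can be scripted roughly like this. It is a sketch under a few assumptions: the backups sit in the root of the bucket, their names end in _gitlab_backup.tar and start with a Unix timestamp, and the target is the default Omnibus backup directory. The function names are mine, and the bucket name is whatever you pass in.

```shell
#!/bin/sh
# Pick the newest backup from `aws s3 ls` output read on stdin.
# Backup names start with a Unix timestamp, so a numeric sort on the
# file name column puts the newest one last.
latest_backup() {
    awk '{ print $4 }' | grep '_gitlab_backup\.tar$' | sort -n | tail -n 1
}

# Download the newest backup from a bucket into the backup directory.
# $1 = bucket name, $2 = destination (defaults to the Omnibus backup dir).
fetch_latest_backup() {
    bucket="$1"
    dest="${2:-/var/opt/gitlab/backups}"
    name="$(aws s3 ls "s3://$bucket/" | latest_backup)"
    [ -n "$name" ] || return 1
    aws s3 cp "s3://$bucket/$name" "$dest/$name"
}
```

Nothing runs until you call fetch_latest_backup with your own bucket name, after aws configure has been completed.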

Time to look up how to restore data. Follow the restore (omnibus) instructions on this page: https://gitlab.com/gitlab-org/gitlab-ce/blob/master/doc/raketasks/backup_restore.md.

Note that the restore unpacks the backup first, then moves all repositories into position. You need some extra disk space: roughly twice what the data needs in production. This part can take longer than expected, in our case about 30 minutes, depending on the volume of data and the I/O speed of your server.
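A quick check before starting saves you from a restore that dies halfway on a full disk. The sketch below encodes the "twice the production size" rule of thumb and wraps the restore steps as I read them in the Omnibus instructions for our version. The function names are hypothetical, and the factor of two is an estimate, not a guarantee.

```shell
#!/bin/sh
# Succeeds when free space is at least twice the production data size.
# Both arguments are in kilobytes (as printed by `du -sk` and `df -k`).
enough_space_for_restore() {
    data_kb="$1"
    free_kb="$2"
    [ "$free_kb" -ge $((data_kb * 2)) ]
}

# Free kilobytes on the file system that holds the given path.
free_kb_on() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Restore steps for an Omnibus install. $1 is the part of the backup
# file name that comes before "_gitlab_backup.tar".
restore_backup() {
    gitlab-ctl stop unicorn
    gitlab-ctl stop sidekiq
    gitlab-rake gitlab:backup:restore BACKUP="$1"
    gitlab-ctl start
}
```

For example, enough_space_for_restore 50000000 "$(free_kb_on /var/opt/gitlab)" tells you whether roughly 50 GB of production data would fit before you call restore_backup.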

Also copy the files from your original /etc/gitlab over to the new server. These files contain the essential configuration for gitlab. If you don’t, accounts with 2FA won’t work! Make sure to run gitlab-ctl reconfigure after this.
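Packing the configuration directory into a tarball makes it easy to store next to the data backup and to drop back into place later. On an Omnibus install the directory holds gitlab.rb and gitlab-secrets.json, and as far as I can tell the secrets file is what the 2FA data depends on. The function names and the default tarball name below are made up for illustration.

```shell
#!/bin/sh
# Pack a configuration directory (normally /etc/gitlab) into a tarball.
# $1 = directory to pack, $2 = tarball to write.
pack_gitlab_config() {
    src="${1:-/etc/gitlab}"
    out="${2:-gitlab-etc.tar.gz}"
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")"
}

# Unpack that tarball under a parent directory (normally /etc).
# $1 = tarball, $2 = parent directory.
unpack_gitlab_config() {
    tar -xzf "$1" -C "${2:-/etc}"
}
```

On the new server, run unpack_gitlab_config as root and follow it with gitlab-ctl reconfigure. The tarball contains secrets, so treat it with the same care as the database dump.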

The last step took another 10 minutes. But now we have a visual! Everything was back where we left it, and cloning, pushing and commenting all work (on a different URL!). Total time: around 90 minutes.
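Adding up the step times mentioned so far shows where those 90 minutes went. The numbers are the rough timings from this post. The variable names are just labels I picked.

```shell
#!/bin/sh
# Minutes per step, as reported in this post.
server=5; patch=5; install=15; download=25; restore=30; config=10

total=$((server + patch + install + download + restore + config))
data=$((download + restore))

echo "total: ${total} min"
echo "moving and unpacking data: ${data} min ($((data * 100 / total))%)"
```

This prints a total of 90 minutes, of which 55 minutes (61%) went into downloading and unpacking the data.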

Still missing a couple of things

  • SSL configuration (we use LetsEncrypt)
  • DNS reconfiguration (using the original URL)
  • GitLab configuration (from the original /etc/gitlab/gitlab.rb)
  • E-mail setup (set up your MTA)
  • Security hardening (firewall, fail2ban, etc.)
  • Test if our CI runners still work
  • Making sure backups run regularly and stuff like that

I estimate this will take another hour of tweaking and tuning but most of these tasks can be performed without downtime for the gitlab service.

Lessons learned

  • Have a copy of the /etc/gitlab directory alongside your gitlab backup. We do have this in our regular backup but not in S3.
  • Take into account the time needed to retrieve the backups and restore them. This is quite a big factor in the total return-to-operation time.
  • Automate-all-the-things: we probably want to write an Ansible playbook to set up a new server, or create a virtual server template that can be deployed quickly. This will probably cut the total restore time in half. And as we noticed, there are a number of manual steps involved, which are error-prone.
  • Investigate if we can increase the frequency of the database and file system snapshots. Currently we could lose 24 hours in the database and 4 hours for the file system.
  • Perform this recovery every couple of months! Practice makes perfect.

Hope this helps somebody. Feel free to ask questions or provide feedback.