Backup Strategy (Under the hood)

This article is part of our "Tiger Bridge - Under the hood" series, where we uncover technical details about our product, dive deep into the solutions it offers, and explain which decisions we had to make while designing it.

Definition/Background

The main purpose of any backup solution is to keep data secure. You should never lose anything without the option to recover it, regardless of what happens – a disaster, virus attack, or accidental user action.

A good backup strategy is judged by the following qualities:

  1. Granularity – how precisely you can recover; the smaller the interval between recovery points, the closer you can get to the desired point in time
  2. Preparation cost – how much time and money it takes to create a backup
  3. Restoration cost – how much time and money it takes to restore from a backup
  4. Upkeep cost / depth – the price of keeping your backups; the larger the restoration timeframe you want to support (a couple of days, weeks, months, or years), the higher the upkeep cost will be

Problem

Our clients are looking for the best solution, measured by the above qualities. They also expect to invest as little money as possible; the required investment is largely determined by the complexity of the solution, i.e., how hard it is to build.

When using the cloud, it is also important to assess the upkeep cost of the backed-up data over its full retention period.

Existing Solutions

In terms of existing solutions, we have identified the following options:

  1. Full backup – the traditional way of keeping data secure; it creates a mirror of all your data at a given moment, which you can store in a secure location (in the cloud or elsewhere); it can be performed every night; during restoration, everything gets restored
  2. Incremental backup – it backs up the delta, i.e., the data that has changed in one way or another since the previous backup (of any type); it was introduced to speed up backup creation and reduce the space backups consume; it can be performed every day of the week, but only once a full backup has been created
  3. Differential backup – it backs up the delta since the most recent full backup; it was introduced to reduce restoration time, since restoring requires only the full backup plus the latest differential, rather than every increment in between
  4. CDP (Continuous Data Protection) – stores changes to your data as they happen; it captures creations, edits, and deletions of files and folders nearly in real time

We put together a graphic to illustrate the different options. It shows the lifecycle of a couple of files with their normal operations – create/rename/change/delete. Underneath the timeline you can see what gets captured at which point with the different backup solutions.

With a full backup approach, every night (or whatever the schedule is) the whole dataset gets collected and saved for backup purposes.
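
As a minimal sketch of the idea, a scheduled full backup simply archives the entire dataset every time it runs; the paths below are hypothetical placeholders:

```python
import tarfile
from datetime import datetime
from pathlib import Path

DATASET = Path("/data")        # hypothetical source dataset
BACKUP_DIR = Path("/backups")  # hypothetical backup destination

def full_backup() -> Path:
    """Archive the whole dataset; every run captures everything again."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = BACKUP_DIR / f"full-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(DATASET, arcname=DATASET.name)
    return archive
```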

With incremental or differential backups (the difference between the two, while significant, is irrelevant for this discussion), only a subset of the data (the changes since the last backup) gets collected every time a new backup is created.
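
To make the distinction concrete, here is a minimal sketch of the selection logic, assuming we only track file modification times plus the timestamps of the last backup and the last full backup (all names are illustrative):

```python
from pathlib import Path

def changed_since(root: Path, cutoff: float) -> list[Path]:
    """Files modified after the given cutoff timestamp."""
    return [p for p in root.rglob("*")
            if p.is_file() and p.stat().st_mtime > cutoff]

def incremental_selection(root: Path, last_backup: float) -> list[Path]:
    # Delta since the previous backup of *any* type: small backups,
    # but restoring needs the full backup plus every increment since.
    return changed_since(root, last_backup)

def differential_selection(root: Path, last_full: float) -> list[Path]:
    # Delta since the previous *full* backup: each differential grows,
    # but restoring needs only the full backup plus the latest one.
    return changed_since(root, last_full)
```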

With CDP, every change gets captured as soon as it happens and, as visible in the graphic, we can potentially collect more states of a file if it changes more frequently than once a day.
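
As a rough sketch of what a CDP agent does, the loop below uses the third-party watchdog package to keep a timestamped copy of every file the moment it is created or modified; the watched and version-store paths are hypothetical, and a real agent would replicate to durable storage rather than a local folder:

```python
import time
from pathlib import Path
from shutil import copy2

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

WATCHED = Path("/data")       # hypothetical watched dataset
VERSIONS = Path("/versions")  # hypothetical version store

class CaptureHandler(FileSystemEventHandler):
    """React to changes the moment they happen."""

    def on_created(self, event):
        if not event.is_directory:
            self._capture(event.src_path)

    def on_modified(self, event):
        if not event.is_directory:
            self._capture(event.src_path)

    def on_deleted(self, event):
        # Record deletions too, so the file remains restorable.
        print(f"deleted: {event.src_path} at {int(time.time())}")

    def _capture(self, src: str) -> None:
        # Keep a timestamped copy of the changed file.
        dest = VERSIONS / f"{Path(src).name}.{int(time.time())}"
        copy2(src, dest)

observer = Observer()
observer.schedule(CaptureHandler(), str(WATCHED), recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```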

We summarized our assessment of the existing solutions’ main qualities in the following table, together with their complexity cost (the cost to implement such a solution):

| Quality\Type                    | Full     | Incremental | Differential | CDP       |
| ------------------------------- | -------- | ----------- | ------------ | --------- |
| Preparation cost                | High     | Low         | High         | Low       |
| Granularity (recovery interval) | High     | Medium/Low  | Medium/High  | Low       |
| Upkeep cost                     | High     | Low         | Low          | Low       |
| Restoration cost                | High/Low | High        | High/Low     | Low/Lower |
| Complexity cost                 | Low      | Medium      | Medium       | High      |

Looking at this table, we cannot help but wonder why anyone performs any backup other than CDP: it consumes less space and operates much faster. The answer lies in the last row. It is up to each organization to decide how much they are ready to sacrifice for a lower cost.

Cons

The full backup solution comes with a very high preparation cost.

If data in your business does not grow rapidly, this solution might be a good fit. However, rapid data growth makes the traditional solution increasingly impractical. Furthermore, higher demands might get introduced within the organization, like keeping backups for longer periods of time.

The IT industry has been working on reducing the "High" values in our table. Storage prices are definitely getting lower, but data is only growing larger: although it is now cheaper to save the same volume of information, that volume keeps increasing, so the total price stays roughly the same.

Typical backup solutions (full, incremental, and differential) wait for their scheduled hour and sit idle the rest of the time. While a backup is being created, it can consume a lot of resources, especially with larger datasets, and normal applications may be interrupted as a result.

Integrity of the data is another problem with traditional backup solutions. Backup creation is a lengthy process, especially for a full backup of a large dataset, and the longer it takes, the more likely integrity is to suffer. We do not get the state of the whole dataset at a single point in time; the state is smeared across the minutes or hours the backup takes to complete. Some files might change during this interval, and the integrity of the whole dataset gets compromised. There are ways to achieve integrity even with traditional backup solutions, but they add complexity, cost, and time.
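
One way to observe the problem is to hash every file when the backup starts and again when it finishes; any mismatch means the "snapshot" mixes states from different moments. A minimal sketch, with a hypothetical dataset path:

```python
import hashlib
from pathlib import Path

def manifest(root: Path) -> dict[Path, str]:
    """Map every file under root to a SHA-256 digest of its contents."""
    return {p: hashlib.sha256(p.read_bytes()).hexdigest()
            for p in root.rglob("*") if p.is_file()}

before = manifest(Path("/data"))   # taken when the backup starts
# ... the long-running backup job executes here ...
after = manifest(Path("/data"))    # taken when the backup finishes

drifted = [p for p in before if after.get(p) != before[p]]
if drifted:
    print(f"{len(drifted)} files changed mid-backup; snapshot is inconsistent")
```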

CDP solutions never scan the disk; they work live, whenever data changes. They are active at all times and require some kind of agent that stays online and handles the data. Such an agent is expensive and complex to build and maintain. It is similar to an antivirus solution and inherits the same limitations: the agent might stop working, it constantly consumes resources, and so on.

Building a reliable, resource-sparing CDP solution is complicated, which is why businesses avoid it until they reach a point where traditional solutions simply cease to work. Because of their complexity, CDP products are usually high-end and tailored to large enterprises.

Our Solution

Most traditional solutions that work with snapshots of the dataset have simply moved their storage to the cloud without really changing how they operate. The same solutions that used to work with physical tapes are now working with cloud buckets; that is the extent of their cloud adoption. It does bring much better redundancy and high availability of the "tapes", but it is still not good enough for us.

In the world of tapes, CDP was simply impractical. You could not create and store a new tape for each file or folder change; there would be simply too many to keep track of, and restoration would be almost impossible. With the help of the cloud, however, CDP became usable.

According to our table, CDP is clearly the more flexible and powerful backup strategy. The preparation cost is minimal because we track all changes as they happen; it is paid in many small pieces, one per change. The recovery interval can be far smaller than with any traditional option, nearly instant, with only practical limits: while we can reduce it to one second, an hour is good enough for any use case. You are also able to create policies and replicate changes in some folders more frequently than in others. The upkeep cost is very low, and since we replicate individual files, we can lay the foundation for complex versioning strategies. The restoration cost is also minimal and can be further reduced thanks to our partial restoration feature.
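
Purely as an illustration of the idea (not Tiger Bridge's actual configuration format), a per-folder replication policy could look like this, with hypothetical paths and intervals:

```python
from datetime import timedelta

# Hypothetical policy: hot project folders replicate near-instantly,
# cold archives only once an hour.
REPLICATION_POLICY = {
    "/data/projects/active": timedelta(seconds=1),
    "/data/shared":          timedelta(minutes=5),
    "/data/archive":         timedelta(hours=1),
}
DEFAULT_INTERVAL = timedelta(hours=1)

def interval_for(path: str) -> timedelta:
    """Pick the most specific policy prefix that covers the path."""
    matches = [prefix for prefix in REPLICATION_POLICY if path.startswith(prefix)]
    if not matches:
        return DEFAULT_INTERVAL
    return REPLICATION_POLICY[max(matches, key=len)]
```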

We have put a lot of effort into making a usable CDP solution and a much better backup option. This was possible because we took advantage of the new features supported by cloud providers, and we benefited from the knowledge and growth of OS vendors (Microsoft), which allowed them to enhance their antivirus solutions. As a result, our solution turned out cheaper and lighter: we lowered the "High" complexity in the last row of the table to a "Medium".

What we achieved with Tiger Bridge is the ability to restore specific files whenever needed, to the particular point in time of interest. With traditional backup solutions, it is really hard to recover just a few files or folders; it usually even requires a systems administrator's help.

In general, our CDP implementation allows you to do things that are simply impossible with traditional backup solutions, like restoring a file to a certain point in time, or taking a version of one file from one time and a version of another file from a different time.
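
To illustrate why this works, assume each captured change is stored as `name.timestamp`, as in the earlier watcher sketch; point-in-time restoration then amounts to picking the newest stored version at or before the target moment, independently per file. The version-store layout here is hypothetical:

```python
from pathlib import Path
from shutil import copy2

VERSIONS = Path("/versions")  # hypothetical store of "name.timestamp" copies

def restore(name: str, at: int, dest: Path) -> Path:
    """Bring back the newest captured version of `name` taken at or before `at`."""
    candidates = []
    for p in VERSIONS.glob(f"{name}.*"):
        stamp = p.name.rsplit(".", 1)[1]
        if stamp.isdigit() and int(stamp) <= at:
            candidates.append((int(stamp), p))
    if not candidates:
        raise FileNotFoundError(f"no version of {name} at or before {at}")
    _, newest = max(candidates)
    return Path(copy2(newest, dest / name))

# Mix points in time per file, which a monolithic snapshot cannot do:
# restore("report.docx", at=1_700_000_000, dest=Path("/restore"))
# restore("budget.xlsx", at=1_700_086_400, dest=Path("/restore"))
```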