Database Usage (Under the hood)

This article is part of our "Tiger Bridge - Under the hood" series, where we uncover technical details about our product, dive deep into the solutions it offers, and explain which decisions we had to make while designing it.

Dilemma

There were two options to choose from in deciding where to store our data. Option 1 was to use a separate database instance. Option 2 involved using the native file system.

Background

The following remark about the IBM PC, from the early days of the computer industry, is widely attributed to Bill Gates, although he has denied making it:

“640K ought to be enough for anyone.”

This statement sounds odd from today’s perspective, but it goes to show how fast the industry has evolved. More and more complex applications were built, demanding the manufacture of increasingly sophisticated hardware. As a result, demands for operating system storage have grown rapidly.

Engineers used to optimize every line of code so that it consumed as few PC resources as possible. Today we see the exact opposite: new software sacrifices efficiency without the end user noticing, because it runs on extremely powerful machines. Even a system that is not optimized runs fast enough on the newest hardware.

The history of file systems is similar. In the beginning, demands were modest: applications were simple and small, and end users did not ask for much. File systems started as simple lists, then evolved to use tree structures such as B-trees. Over time, more and more optimizations appeared because applications and end users needed more.

File systems have a hierarchical structure. In Windows, hierarchy exists on a directory level. You start on the root level and make your way through different folders and subfolders until you reach the exact file that you need. This is what we call browsability.

File systems do not have access to all of your data at once. If you want to find all the files whose names start with ‘a’, for example, the system has to go through every file and check whether its name begins with ‘a’.
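The search described above can be sketched in a few lines of Python. This is only an illustration of the access pattern, not how any particular file system or our product is implemented; the function name `find_by_prefix` and the case-insensitive comparison are assumptions made for the example.

```python
import os
from typing import Iterator


def find_by_prefix(root: str, prefix: str) -> Iterator[str]:
    """Yield paths of files under root whose names start with prefix.

    There is no index to consult, so every directory is visited and
    every file name is compared, including those that cannot match.
    """
    prefix = prefix.lower()  # assumption: match case-insensitively
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().startswith(prefix):
                yield os.path.join(dirpath, name)
```

The cost grows with the total number of files under the root, not with the number of matches.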

[Figure: starting from one thing, such as the drive letter that is common to all files on the drive, you browse from a directory to its sub-directories until you reach the desired file.]

Databases work in reverse: they have a flat structure with indexing capabilities, on top of which you can build the hierarchy. You do not get the same browsability, but great search capabilities compensate for this. Google's search engine is a good example: you type something in and results come back in seconds. The larger your data sets, the more tempting it is to use a database.

[Figure: a table/database that holds information about all the files and can be sorted by any field, but has no hierarchy.]
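As a rough illustration of the flat, indexed model, the sketch below uses Python's built-in `sqlite3` module. The table layout, the index name, the sample rows, and both helper functions are hypothetical and exist only for this example. A range condition on the indexed `name` column lets the database answer the "starts with 'a'" question without looking at every row, and sorting by any field is a single query.

```python
import sqlite3
from typing import List

# Hypothetical flat catalog: one row per file, no directory hierarchy.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT, name TEXT, size INTEGER)")
conn.execute("CREATE INDEX idx_files_name ON files (name)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("C:/docs/apple.txt", "apple.txt", 120),
        ("C:/docs/reports/avocado.txt", "avocado.txt", 300),
        ("C:/media/banana.mp4", "banana.mp4", 9000),
    ],
)


def names_with_prefix(prefix: str) -> List[str]:
    """Return file names starting with a non-empty prefix, in name order."""
    # Exclusive upper bound: the prefix with its last character bumped by one.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    rows = conn.execute(
        "SELECT name FROM files WHERE name >= ? AND name < ? ORDER BY name",
        (prefix, upper),
    )
    return [row[0] for row in rows]


def largest_first() -> List[str]:
    """Return all file names sorted by size, largest first."""
    rows = conn.execute("SELECT name FROM files ORDER BY size DESC")
    return [row[0] for row in rows]
```

Note that the full path is just another text column here: the directory structure is not something the table itself understands.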

However, while Google has massive servers that back every search, on-premises infrastructure usually doesn’t. Databases are designed to work with huge sets of data, but they leave you without any hierarchy or browsability. To build these on top of a database, you inevitably sacrifice performance. In addition, a database cannot work without access to the entire data set.

In the early 2000s, Microsoft started working on WinFS. The purpose of this project was to use a database as the foundation of its file system. After millions of dollars and considerable resources had been spent, it became clear that WinFS could never reach the level of performance that native file systems provide. Although the benefits were really attractive, the lack of reasonable performance was a show stopper. Microsoft eventually cancelled the project.

Pros/Cons Analysis

We had to analyze the pros and cons of using a database in our solution. They are summarized below.

| Pros | Cons |
| --- | --- |
| File system agnostic: you can build whatever structure you need, with the fields you require | Point of failure: the database becomes critical because the file system cannot function without it; since databases can become corrupted, this adds a point of failure |
| Policy and operations friendly: you can build indexes that change dynamically | Performance: the database still uses the file system underneath while adding latency and consuming RAM and disk resources, which degrades system performance significantly (by multiples) |
| Easier to implement | Difficult local operations: the database can only function with the full data set. If you need to recover from a disaster, there might be billions of files to recover, and the system will not be functional while that is happening |
| | No full functionality: file systems have features in place (like oplocks or ACLs) that we would miss if using a database |
| | Separate data/metadata |
| | Sync issues: it is possible to end up in a situation where your database is not in sync with the file system |
| | No dynamic population: all the data in the data set is required |

Decision

We want to serve customers running mission-critical systems, who are not ready to sacrifice even a little performance. No other solution on the market can serve them. For this reason, we decided to go with option 2: using the native file system with no separate database.

Arguments

It may seem that the pros are more important than the cons in our table.

However, we are not ready to add another point of failure to our solution as that would be problematic for our bigger clients. As long as customers protect their file systems, they will not lose important data or functionality.

We do not intend to sacrifice any performance because we work with businesses which cannot afford to lose even 5% of it.

We are not file system agnostic, so we need to build a separate solution for each OS. However, our expertise in all OSs makes this limitation easy to overcome. We decided to start with Windows for a number of reasons which we will explain later.

The full file system functionality is important to us. Features like opportunistic locks (oplocks) or access control help us meet the needs of our clients, and databases do not support them.

It is also essential for us to work the way the operating system does, with local operations.

When we create a policy, we need to take all the data and sort it by some criterion. For example, while doing tiering, you may want to take the oldest files and move them to the cloud. The file system cannot do this directly; you need a database. To overcome that, we build temporary databases whenever one is needed to accomplish the task. This is more difficult to implement initially, but it pays off once it is ready.
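The idea of a temporary, task-scoped database can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual implementation: the function name `oldest_files`, the in-memory SQLite store, and the use of the last-modified time as the meaning of "oldest" are all choices made for the example.

```python
import os
import sqlite3
from typing import List


def oldest_files(root: str, limit: int) -> List[str]:
    """Return up to `limit` file paths under root, least recently modified first."""
    db = sqlite3.connect(":memory:")  # temporary: discarded when closed
    try:
        db.execute("CREATE TABLE candidates (path TEXT, mtime REAL)")
        # Populate the temporary table from the file system itself.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                db.execute(
                    "INSERT INTO candidates VALUES (?, ?)",
                    (path, os.stat(path).st_mtime),
                )
        # Let the database do the sorting the file system cannot do directly.
        rows = db.execute(
            "SELECT path FROM candidates ORDER BY mtime ASC, path ASC LIMIT ?",
            (limit,),
        ).fetchall()
    finally:
        db.close()
    return [row[0] for row in rows]
```

Because the table is rebuilt from the file system each time and thrown away afterwards, the file system remains the only source of truth and nothing has to be kept in sync between runs.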

The no-database solution we’ve decided on is technically harder to implement, but we are always open to challenges. Our engineers are ready to work through all the difficulties and open questions. While dealing with the file system, we had to find where to put the important data for each file system object.

In databases, metadata is separate from the data – it resides in the database while the actual data is still in the file system. With our solution, we keep them together.

Sync issues can become a big problem if the software that keeps the database in step with the file system stops running for a while.
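To make the risk concrete, the sketch below compares a separately maintained catalog with what is actually on disk. The catalog is modeled as a plain set of relative paths and the function name `find_drift` is hypothetical; a real database-backed product would have its own schema and reconciliation logic.

```python
import os
from typing import Set, Tuple


def find_drift(root: str, catalog: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Compare a catalog of relative paths with the files under root.

    Returns (missing_from_catalog, stale_in_catalog):
    files on disk the catalog does not know about, and catalog
    entries whose files no longer exist.
    """
    on_disk = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            on_disk.add(os.path.relpath(os.path.join(dirpath, name), root))
    return on_disk - catalog, catalog - on_disk
```

Every file created or deleted while the catalog is not being updated ends up in one of the two sets, and a check like this has to walk the entire file system to find them.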

Conclusion

We feel comfortable helping the few companies which cannot work with any other solution on the market. Take a look at Milestone, for example. Our clients cannot use solutions which Google or SharePoint provide. However, we are always available to assist them.