Metadata Handling (Under the hood)

This article is part of our "Tiger Bridge - Under the hood" series, where we uncover technical details about our product, dive deep into the solutions it offers, and explain which decisions we had to make while designing it.

Dilemma

We had to decide whether to keep data and metadata together or to split them for greater flexibility.

Background

When synchronizing data between two locations, a key requirement is that changes made in either location are propagated through the cloud to the other one. Tiger Bridge fully supports this mode, which is called full sync, and it is not the dilemma we were facing.

Sometimes a specific type of collaboration is needed, where the data is generated in one location and consumed in another. However, the consumer can also generate additional information about the data (metadata), which is then consumed in the first location. This is typical of AI workflows: the data is generated first, sent elsewhere so that additional information can be created, and eventually received back at the original location for further consumption.

If files are relatively small, it is easy to move them between the locations and the cloud as needed, and synchronization can be implemented simply by sending the files back and forth. A typical example is a local file server holding a large number of small Word and Excel files. In such environments, full sync works well and is simple to implement.

When collaboration is needed for larger files, syncing data and metadata separately may become a requirement, because moving large files back and forth is inefficient. Industries we work with, such as healthcare, media and entertainment, construction, and oil and gas, generate enormous files, sometimes hundreds of gigabytes in size. Moving them more than the bare minimum is slow and expensive.

There are three possible methods for handling the metadata.

    1.   Embedded
    2.   Attached
    3.   Disconnected

Embedded means that metadata becomes part of the data, like adding an author to a Word file. If the metadata is embedded into the data, they will always travel together and you cannot separate them, hence your only option is full sync. We have implemented some features like partial update and partial restoration to help in this scenario, and we will discuss them separately.

If metadata is attached to the data, it can be generated and sent separately with instructions about the data it will be attached to. We have implemented a lot of modifications to better support this scenario.

Disconnected metadata is represented as a separate object. It can be generated and sent separately, but the application itself must associate it with the data it describes. Since this is a responsibility of the application, there was not much we could do in this scenario.
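To make the attached method more concrete, here is a minimal Python sketch of metadata that travels on its own, together with instructions naming the data it belongs to. It is an illustration only and not how Tiger Bridge is implemented: the names StoredFile, MetadataUpdate, and apply_update, as well as the file name and comment text, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StoredFile:
    """A file at one location: bulk data plus named metadata attachments."""
    data: bytes
    attachments: dict[str, bytes] = field(default_factory=dict)


@dataclass(frozen=True)
class MetadataUpdate:
    """Metadata sent on its own, with instructions naming its target."""
    target: str     # which file the metadata belongs to (hypothetical key)
    name: str       # which attachment on that file to create or replace
    payload: bytes  # the metadata itself


def apply_update(store: dict[str, StoredFile], update: MetadataUpdate) -> int:
    """Attach the payload to the target file and return the bytes received."""
    store[update.target].attachments[update.name] = update.payload
    return len(update.payload)


# The large file already exists at the receiving location.
store = {"exam-001.img": StoredFile(data=b"\x00" * 1_000_000)}

# Only the small metadata update crosses the wire.
update = MetadataUpdate("exam-001.img", "comments", b"No abnormality seen.")
received = apply_update(store, update)
print(f"received {received} bytes; data size unchanged: "
      f"{len(store['exam-001.img'].data)} bytes")
```

In this sketch the receiving side never touches the bulk data. In the disconnected method, the equivalent of the target field would not be interpreted by the storage layer at all, and the consuming application would have to make the link itself.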

The dilemma was to decide which of the three methods to implement in our solution.

Pros/Cons Analysis

Embedded metadata makes files difficult to move, and this is especially true for larger files. Any time you make a change, the whole file needs to be transferred again. However, this approach is good for keeping all of the information together.

Attached metadata might be challenging to implement, especially on some operating systems. You still have the data together with the metadata, but they can travel separately.

The main problem with disconnected metadata and data is that the consuming application or user needs to know how to bring them together. This method is easiest to implement, but it does not give you the flexibility of the other solutions.

Decision

After careful consideration, we believe there is no right or wrong choice here. Applications exist for every approach, as do workflows that require them. Knowing our clients and their needs, we decided to develop a solution that supports all available approaches. However, it is best suited for the attached scenario, as we have put a lot of effort into optimizing it. For that purpose, we also take advantage of the partial update and partial restore functionality.

Arguments

Embedded metadata works well with smaller files. For them, the resources required to manage attached or disconnected metadata outweigh the time and money needed to move the files around freely. The larger the data, the more impractical the embedded approach becomes. For really big files, we had to decide between having the metadata attached or disconnected.

With regard to disconnected metadata, imagine you and your teammates need to collaborate on a Word file. You can put it in Google Drive or a similar solution, open it from a browser in one location, and make some edits; then your colleague can access it from a browser in another location, make more edits, and so on.

We have a real-life use case for this: a group of medical centers sharing one pathologist for expert advice. All medical centers had the local equipment needed to perform medical examinations. For more complex cases, they sent a patient’s data to the pathologist, who looked at it and commented based on the results. The data was eventually returned to the medical center with its attached metadata for further testing or treatment based on the doctor’s opinion. If the returning data (the pathologist’s comments) were embedded, the whole image would have to travel back, which is not optimal. In a disconnected scenario, the comments would travel separately (via email, for example), but then someone would have to link them to the respective examination. In the attached scenario we offered, only the delta (the pathologist’s comments) had to be returned, and it was made part of the original examination. This made it the best solution for the given use case: it eliminated the problem of the embedded approach (big file transfers) without introducing the problem of the disconnected approach (linking the comments to the examination after receiving them).
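The comparison in this use case can be put into numbers with a short Python sketch. The sizes are invented purely for illustration (a 2 GiB examination image and 4 KiB of comments), and Mode and return_trip are hypothetical names. The model is deliberately simplified: it ignores the partial update feature mentioned earlier and any protocol overhead.

```python
from enum import Enum


class Mode(Enum):
    EMBEDDED = "embedded"
    ATTACHED = "attached"
    DISCONNECTED = "disconnected"


def return_trip(mode: Mode, data_bytes: int, metadata_bytes: int) -> tuple[int, bool]:
    """Return (bytes sent back, whether someone must link the metadata by hand)."""
    if mode is Mode.EMBEDDED:
        # Metadata lives inside the file, so the whole file travels again.
        return data_bytes + metadata_bytes, False
    if mode is Mode.ATTACHED:
        # Only the delta travels, and it is joined to the data on arrival.
        return metadata_bytes, False
    # Disconnected: only the delta travels, but nothing links it to the data.
    return metadata_bytes, True


IMAGE = 2 * 1024**3  # hypothetical 2 GiB examination image
COMMENTS = 4 * 1024  # hypothetical 4 KiB of pathologist comments

for mode in Mode:
    sent, manual = return_trip(mode, IMAGE, COMMENTS)
    print(f"{mode.value:<13} {sent:>13,} bytes  manual linking: {manual}")
```

Under these assumptions, the attached and disconnected methods send the same small payload, and the only difference between them is who has to reconnect the comments to the examination.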

Now, imagine the same story, but for a specific on-premises application, like the pathologist’s software. Unlike a Word file in Google Drive, applications such as those used in the medical industry cannot be opened directly in the cloud from a browser. If the file system is not able to natively see the file together with its entire metadata, then we cannot solve these companies’ collaboration needs.

Having metadata attached to the data allows the on-premises application to see and work with every bit of important information.

The main factor in the usefulness of the different solutions is where the data is consumed. If data is consumed where it is generated, it makes sense to go with embedded metadata and have all of the information together. If data is consumed at another location, then you can opt for the attached or the disconnected approach. Keep in mind that the disconnected approach places specific requirements on the application, so not all applications will be built for it.

Whether metadata can be attached to the data (for example, as extended attributes) depends on the operating system and file system, which is why it was important for us to support all existing approaches. Windows supports this well, while support on Linux and macOS varies by file system and works differently. The cloud has a similar but very limited functionality, allowing you to add only a small amount of metadata. Windows also provides so-called streams (alternate data streams on NTFS), which behave like subfiles of a file and can be as large as needed. Since we have built a Windows solution, we were able to take advantage of all of the functionality it supports.
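For readers unfamiliar with Windows streams, the following Python sketch shows the general idea using the "file:stream" naming that NTFS accepts. It is only an illustration and not a description of how Tiger Bridge uses streams; the helper names, file name, and stream name are hypothetical. On file systems other than NTFS, the colon is typically treated as part of an ordinary file name (or rejected), so the code creates a separate file rather than a stream there.

```python
import tempfile
from pathlib import Path


def stream_path(file_path: Path, stream: str) -> str:
    """Build the 'file:stream' name used to address a named stream on NTFS."""
    return f"{file_path}:{stream}"


def write_stream(file_path: Path, stream: str, payload: bytes) -> None:
    """Write metadata into a named stream next to the file's main content."""
    with open(stream_path(file_path, stream), "wb") as fh:
        fh.write(payload)


def read_stream(file_path: Path, stream: str) -> bytes:
    """Read the metadata back from the named stream."""
    with open(stream_path(file_path, stream), "rb") as fh:
        return fh.read()


with tempfile.TemporaryDirectory() as tmp:
    exam = Path(tmp) / "exam-001.img"       # hypothetical examination file
    exam.write_bytes(b"large image bytes")  # stands in for the bulk data

    write_stream(exam, "comments", b"No abnormality seen.")

    # The main content is untouched; the comments are read via the stream name.
    print(exam.read_bytes())
    print(read_stream(exam, "comments"))
```

The point of the sketch is that an application opening the file by its usual name still sees the original data, while the metadata stays reachable under the same file name.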

Conclusion

We took workflows that move data and metadata separately into account when developing our application, because we know our clients and their needs. Most other solutions do not facilitate these workflows or are not optimized for them.