technical August 20, 2015

What is a Data Fingerprint?

Matchlight was born out of a simple conversation. A Chief Information Security Officer at a bank asked us to tell him if one of his client lists was ever leaked to the internet. The catch? He could never actually provide us with the list.

We built Matchlight to do just that: to search, on behalf of our clients, for information so sensitive that they wouldn’t even trust us with it. But being blind has another advantage. Our founding thesis at Terbium is that no organization is 100% safe, and that all sensitive data is at risk of breach by sufficiently motivated actors. So by avoiding the need to store our clients’ sensitive data, we avoid increasing their attack surface, even if they do trust us. In the unlikely event that our data store were compromised, there would be nothing in there worth stealing.

So how do we do it? How does Matchlight search for data without needing to know what the actual data is? We accomplish that through a technique we developed called Data Fingerprinting. All data is transformed into what we call a fingerprint, and all calculations and correlations are done on only the fingerprints, allowing us to measure similarity among sets of data without needing access to the data itself. Our clients create these fingerprints — which are very simple to compute — on their own networks, meaning their sensitive data never needs to leave their exclusive control.
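To make the idea concrete, here is a minimal sketch of what computing and comparing fingerprints could look like. It is not Matchlight's actual algorithm, which this post does not spell out. The function names, the four-word window, the use of SHA-256, and the truncation to 16 hex characters are all assumptions chosen for illustration.

```python
import hashlib


def fingerprints(text, window=4):
    """Return a set of short hashes, one per overlapping run of `window` words.

    Illustrative only: the window size, SHA-256, and the 16-character
    truncation are assumptions, not Matchlight's real parameters.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < window:
        shingles = [" ".join(words)]
    else:
        shingles = [
            " ".join(words[i:i + window])
            for i in range(len(words) - window + 1)
        ]
    return {hashlib.sha256(s.encode("utf-8")).hexdigest()[:16] for s in shingles}


def similarity(a, b):
    """Jaccard similarity of two fingerprint sets (0.0 when either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# The client computes this on its own network; only the hashes leave.
secret = "alice smith 42 elm street springfield account 1001"
# Text found on the public internet that embeds the secret record.
leaked = "posted dump: alice smith 42 elm street springfield account 1001 more soon"
unrelated = "the quick brown fox jumps over the lazy dog"

secret_fp = fingerprints(secret)
print(similarity(secret_fp, fingerprints(leaked)))
print(similarity(secret_fp, fingerprints(unrelated)))
```

Because each fingerprint covers a small overlapping window of the text, a document that contains only part of the original still shares some fingerprints with it, which is what lets the comparison be fuzzy and not all-or-nothing.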

The fingerprinting protocol is based on the idea of fuzzy hashing. It is an example of a kind of one-way Private Set Intersection protocol, a cryptographic technique that allows two parties to measure the intersection of their sets without revealing the contents of those sets to each other. In our case, we want to give a client the ability to query an index with one primary requirement: we must be blind to the content of the client’s query. We care much less about the security of the index since, in our case, it is derived from data available on the publicly accessible internet.
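The one-way arrangement described above can be sketched as follows: the index side fingerprints public documents in the clear, while the client submits only the fingerprints of its query. This is a simplified illustration under stated assumptions, not the production protocol. The `FingerprintIndex` class, its methods, the document identifiers, and the three-word window are all hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprints(text, window=3):
    """Hash each overlapping run of `window` words (illustrative parameters)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < window:
        shingles = [" ".join(words)]
    else:
        shingles = [
            " ".join(words[i:i + window])
            for i in range(len(words) - window + 1)
        ]
    return {hashlib.sha256(s.encode("utf-8")).hexdigest()[:16] for s in shingles}


class FingerprintIndex:
    """Hypothetical index over public documents, keyed by fingerprint."""

    def __init__(self):
        # Maps each fingerprint to the ids of the documents containing it.
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index a public document; its content need not be kept secret."""
        for fp in fingerprints(text):
            self._postings[fp].add(doc_id)

    def query(self, client_fps):
        """Score documents by the fraction of the client's fingerprints they
        contain. Only hashes are received here, never the client's text."""
        hits = defaultdict(int)
        for fp in client_fps:
            for doc_id in self._postings.get(fp, ()):
                hits[doc_id] += 1
        total = len(client_fps)
        return {doc_id: n / total for doc_id, n in hits.items()} if total else {}


# Index side: built from publicly accessible data.
index = FingerprintIndex()
index.add("paste-1", "leaked list: alice smith 42 elm street springfield")
index.add("blog-7", "weather report for springfield today")

# Client side: fingerprints are computed locally, and only they are sent.
query = fingerprints("alice smith 42 elm street")
print(index.query(query))
```

The asymmetry is deliberate in this sketch: the index stores plain document identifiers because its source material is public, while the query arrives only as hashes, so the party running the index never handles the client's underlying data.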

While there are a number of ways people have devised to operate on private or encrypted data (the ultimate, of course, being fully homomorphic encryption), data fingerprinting is a simple, highly scalable, and effective way to do fuzzy matching of private queries against an index.