Enabling textual search of encrypted personally identifiable information (PII)

In recent years, there have been many news reports of security breaches involving personally identifiable information (PII). In this article, we describe our solution to managing encrypted data records whilst enabling searches based on customer PII.

Introduction

Security breaches involving personally identifiable information (PII) have contributed to the loss of millions of records over the past few years across the world. Breaches involving PII are hazardous to both individuals and organizations when such sensitive data is leaked.

Individual impact: Identity theft, embarrassment or blackmail. Identity theft also increases the user’s vulnerability to cross-site compromise of their access identities
Organizational impact: Loss of public trust, legal liability, reduced enterprise value etc.

As an operator of platforms across multiple countries, Ascenda has to observe the common jurisdictional requirements for managing sensitive customer data. We therefore have to take steps to protect our customer information as best we can.

PII in general can be defined as including data like:

Customer first & last name
Personal identification numbers like national IDs or member information
Address info like addresses, email or phone numbers

As a loyalty platform first and foremost, we don’t have to deal with much critical information like customer payment info but we still work with a good amount of PII like names & addresses (even though we attempt to reduce the amount of PII collected based on least privilege as much as possible).

In this article, we wish to focus specifically on the tension between having PII-encrypted data records and enabling searches of customer PII practically (against said raw encrypted data).

Brief - PII Safeguards, What & Why PII encryption here?

Ascenda refers to a few different guides from various countries, like NIST in the US, the Privacy Act 1988 for Australia, PDPA in Singapore and so on.

Common PII safeguard expectations include:

Data masking: Data is stored and transmitted with only details required for the transaction and nothing more.
Access control and monitoring: Monitor privilege changes and excessive, inappropriate or unused privileges.
User log tracking: Ensure audit trails are archived securely to ensure data integrity is not compromised.
Ethical walls: Access policies to limit access to PII that is not relevant to an individual's work.

A common standard expected across guidelines is that PII data should be encrypted at all times (at-rest, in-transit). Encrypting your data is an industry expectation as one of the key safeguards against data theft. One of the common breach points of data exfiltration is the breakage or misconfiguration (unwittingly or otherwise) of access controls. This is also why you see plenty of guidance in HIPAA, ISO27001, SOC2 standards focusing on the governance of access.

Here, we treat PII encryption as a safeguard in depth, which becomes crucially important if an attacker has already breached the system and has the ability to exfiltrate data.

Encryption in practice

Before we begin, an important question to ask is: Where can data be?

Given the general structure of web applications, the data can be in the:

Database server - When the data is at rest
Application server - When the data is in use
Client side - Client specific data in use upon transit

Focus: Application-level encryption of the data

For data stored in our systems, we have already set up disk-level and database-level encryption. This is industry practice and it is usually provided out-of-the-box by most cloud services. Here, we focus on an additional layer involving Application-level encryption; This is where the data in the database records themselves are encrypted by the application.

With Application-level encryption, we are targeting protection against the scenario where the bad actor has already (somehow) achieved access to our raw data (bypassing above said checks - rendering safeguards like disk-level encryption irrelevant). Here’s how it works:

the encryption key is only accessible by the application (and not the database server)
the data is encrypted by the application before it is stored in the database
the database server never sees the unencrypted data

This protective measure introduces a pretty big tradeoff complexity-wise for fulfilling typical data search and retrieval actions in applications. Typical RDBMS queries like text comparisons are rendered irrelevant in this case since the database no longer “sees” plaintext PII data, adversely impacting the work needed to enable simple actions like searching for customers using said PII datapoints.

We will next look at techniques for performing encrypted data search (bearing in mind not to compromise the safety of our raw PII datapoints).

Encrypted data retrieval

The main problem raised by encrypting our data is that web applications tend to rely on being able to access the data in its’ raw form.

Previously, something like this would have worked:

records where email = 'thisshouldbeprivate@email.com'

With encryption introduced, our values in the database now look something like this:

   email
1 | QEVuQwFAEAAMXyBHBSaN7xRuDZnsyccS50hii9BC8hqJiwM8SHX7bJ7LHHTieaY7sMKEDdhFTBk=
2 | QEVuQwFAEAC/DEZ/FKvOexYzgTaWe8azbkaMRkslU+JKZTRhSKn52JUubuKz6Om6/WerPlfLKXc=

We have a problem.

Encrypt the value before searching

The most immediate solution to this problem is to encrypt the value before using it to search:

encrypted_email = encrypt('thisshouldbeprivate@email.com') # QEVuQwFAEAC/DEZ/FKvOexYzgTaWe8azbkaMRkslU+JKZTRhSKn52JUubuKz6Om6/WerPlfLKXc=
records where email = encrypted_email

This introduces two search limitations:

It cannot support partial text searches since we can only do a meaningful comparison of the full text values
The search term is case-sensitive

Even if you are able to accept the search limitations, a good security practice is to include a randomised initialisation vector (IV) for each value to be encrypted.

This makes the above solution unworkable.

Encrypted data retrieval with randomised IV

Once you introduce a randomised IV, the encrypted value is different every time you encrypt it.

encrypt('thisshouldbeprivate@email.com') # QEVuQwFAEABIuil3e4aBmKz/oRMtGG+wD/jTJs6xFZi1gRHlQKKpyP3hzlDVzLATZoSCaOUYQHA=
encrypt('thisshouldbeprivate@email.com') # QEVuQwFAEACCeAs0dmQ86ZCAb0MvZ/+cq+lEe7OleIjJEfAFAZJCrnVrpFcrJEhve4JjPFqZCiw=

This means that you can no longer rely on this value to be consistent for data retrieval.

Decrypt everything and filter

The naive strategy is (we don’t expect most engineers to arrive at this unless you’re being cynical):

retrieve all data,
decrypt , and
filter

However, this has an obvious performance tradeoff when done within the application in-memory; We retrieve, say, 100K records and decrypt them in memory before performing each search. We end up easily “DDoS-ing” ourselves with excessive computation whenever the traffic for customer searches increases 😅

Blind indexing: Storing hashed value for matching

Here’s the challenge: we need some way to narrow down the initial retrieval of data without access to the unencrypted data. That means we have to get creative. We can’t use the unencrypted data directly, but we can derive some sort of proxy that we can consistently reproduce when we need to retrieve the data.

To solve this problem, we use a Bloom filter as the first step. This gives us a much reduced number of records to decrypt and work with, which we can use the naive strategy to process.

💡 Bloom filters
These are probabilistic data structures that work by storing hashes of the values that you provide.
As an example, when constructing the bloom filter (storing hashes): ”Something” maps to “abcd” ”Something else” maps to “efgh” ”This also” maps to “abcd” ”Some text” maps to “bcde” ”Yet another thing” maps to “abcd”
The above example illustrates a key property of Bloom filters: multiple values can map to the same hashed value (known as *hash collisions*). This gives rise to false positives but no false negatives, that is:

if the filter does not return an item, we can be sure that the item is not in the data.
if the filter does return an item, it might belong in the data

To use this property to help us, in step one of the search, we run our search term “Something” through the Bloom filter, giving us [“Something“, “This also”, “Yet another thing”], since they all map to “abcd”.
This means we have now narrowed down our search scope (from 5 records to 3 in this case, but in practice, it can mean the difference between 1 million records and 50). We can now focus the actual search within the items in (2).

Practically in this strategy, we:

Compute hashes based on the values
Store the hashes in the database
At search time, we hash search values and compare them to the pre-calculated hashes (Bloom filter)
Decrypt the records that are returned (there shouldn’t be that many after the Bloom filter)
Filter the plaintext

We mitigate this by never hashing the full value. Instead, we chunk the string and use that as the basis for comparison. The main purpose of this is to increase the number of potential hash collisions (see above on Bloom filter) and therefore restrict the amount of information that an attacker can gain by observing search results. To understand why this is important, imagine the opposite case where an attacker is able to observe a whole set of queries which each return only 1 result. In effect, they are building their hash table with a 1-to-1 mapping between the plaintext query and the hashed response.

💡 Blind indexing example
When preparing the index
We start with a plaintext string “Something”.
This original string can be split into substrings like “Som”, “Ome”, “Met”, and so on.
Each of these substrings is hashed using a hash function and stored in the database, resulting in something like “fc3ee4996f800bd9fd3609d002ebac05”, “38138f2a843046be40d952bbcca2a13b”, “af125e92cfefb13abd826a06734cdb36”, …

At search time,
We might have a search term “Some”.
We will process the search string using the same strategy as above, giving “Som” and “Ome”.
Using the same hash function, we generate the hashes “fc3ee4996f800bd9fd3609d002ebac05” and “38138f2a843046be40d952bbcca2a13b”.
We can now run a standard database query to find any rows that have these hashes in the blind index.

Bonus: A side benefit is that hashing substrings allows us to perform partial searches on our encrypted data (somewhat similar to our usual SQL LIKE queries). We will discuss this further in a separate post!

Learnings: Enabling searches (largely) whilst keeping our PII secure

With the above model, we enable our engineers to still write deliberately constructed queries against customer PII, whilst protecting the raw customer PII.

If any of the blind indices involved in the query match, the record is potentially a match.
- Our app decrypts each candidate row and then applies its own text comparison before serving up the best matches
- There’s a slight performance hit but we still save a lot of memory tradeoffs (whilst maintaining security)
If the blind indices in the query do not match, then the data is definitely not a match.

To keep the search up to date, we also ensure we regenerate the corresponding blind indices every time PII data is added or updated.

I hope this was a useful demonstration of a practical technique to search data against encrypted datasets - one that is still relatively fast whilst reducing informational leakage when faced with highly privileged bad actors. Stay tuned for the next post on how this enables partial match searches.

References

Guide to Protecting the Confidentiality of Personally Identifiable Information (PII) from NIST

Application-level Encryption

The Three States of Data Guide - Description and How to Secure them

The need for a randomised initialisation vector (IV)

The need for a randomised salt

Bloom Filter | Brilliant Math & Science Wiki

Blind indexing

Building Searchable Encrypted Databases with PHP and SQL - Paragon Initiative Enterprises Blog