Enabling textual search of encrypted personally identifiable information (PII)
In recent years, there have been many news reports of security breaches involving personally identifiable information (PII). In this article, we describe our solution to managing encrypted data records whilst enabling searches based on customer PII.
Security breaches involving personally identifiable information (PII) have contributed to the loss of millions of records over the past few years across the world. Breaches involving PII are hazardous to both individuals and organizations when such sensitive data is leaked.
- Individual impact: Identity theft, embarrassment or blackmail. Identity theft also increases the user’s vulnerability to cross-site compromise of their access identities
- Organizational impact: Loss of public trust, legal liability, reduced enterprise value etc.
As an operator of platforms across multiple countries, Ascenda has to observe the common jurisdictional requirements for managing sensitive customer data. We therefore have to take steps to protect our customer information as best we can.
PII in general can be defined as including data like:
- Customer first & last name
- Personal identification numbers like national IDs or member information
- Address info like addresses, email or phone numbers
As a loyalty platform first and foremost, we don’t have to deal with much critical information like customer payment info but we still work with a good amount of PII like names & addresses (even though we attempt to reduce the amount of PII collected based on least privilege as much as possible).
In this article, we wish to focus specifically on the tension between having PII-encrypted data records and enabling searches of customer PII practically (against said raw encrypted data).
Brief - PII Safeguards, What & Why PII encryption here?
Common PII safeguard expectations include:
- Data masking: Data is stored and transmitted with only details required for the transaction and nothing more.
- Access control and monitoring: Monitor privilege changes and excessive, inappropriate or unused privileges.
- User log tracking: Ensure audit trails are archived securely to ensure data integrity is not compromised.
- Ethical walls: Access policies to limit access to PII that is not relevant to an individual's work.
A common standard expected across guidelines is that PII data should be encrypted at all times (at-rest, in-transit). Encrypting your data is an industry expectation as one of the key safeguards against data theft. One of the common breach points of data exfiltration is the breakage or misconfiguration (unwittingly or otherwise) of access controls. This is also why you see plenty of guidance in HIPAA, ISO27001, SOC2 standards focusing on the governance of access.
Here, we treat PII encryption as a safeguard in depth, which becomes crucially important if an attacker has already breached the system and has the ability to exfiltrate data.
Encryption in practice
Before we begin, an important question to ask is: Where can data be?
Given the general structure of web applications, the data can be in the:
- Database server - When the data is at rest
- Application server - When the data is in use
- Client side - Client specific data in use upon transit
Focus: Application-level encryption of the data
For data stored in our systems, we have already set up disk-level and database-level encryption. This is industry practice and it is usually provided out-of-the-box by most cloud services. Here, we focus on an additional layer involving Application-level encryption; This is where the data in the database records themselves are encrypted by the application.
With Application-level encryption, we are targeting protection against the scenario where the bad actor has already (somehow) achieved access to our raw data (bypassing above said checks - rendering safeguards like disk-level encryption irrelevant). Here’s how it works:
- the encryption key is only accessible by the application (and not the database server)
- the data is encrypted by the application before it is stored in the database
- the database server never sees the unencrypted data
This protective measure introduces a pretty big tradeoff complexity-wise for fulfilling typical data search and retrieval actions in applications. Typical RDBMS queries like text comparisons are rendered irrelevant in this case since the database no longer “sees” plaintext PII data, adversely impacting the work needed to enable simple actions like searching for customers using said PII datapoints.
We will next look at techniques for performing encrypted data search (bearing in mind not to compromise the safety of our raw PII datapoints).
Encrypted data retrieval
The main problem raised by encrypting our data is that web applications tend to rely on being able to access the data in its’ raw form.
Previously, something like this would have worked:
records where email = 'firstname.lastname@example.org'
With encryption introduced, our values in the database now look something like this:
email 1 | QEVuQwFAEAAMXyBHBSaN7xRuDZnsyccS50hii9BC8hqJiwM8SHX7bJ7LHHTieaY7sMKEDdhFTBk= 2 | QEVuQwFAEAC/DEZ/FKvOexYzgTaWe8azbkaMRkslU+JKZTRhSKn52JUubuKz6Om6/WerPlfLKXc=
We have a problem.
Encrypt the value before searching
The most immediate solution to this problem is to encrypt the value before using it to search:
encrypted_email = encrypt('email@example.com') # QEVuQwFAEAC/DEZ/FKvOexYzgTaWe8azbkaMRkslU+JKZTRhSKn52JUubuKz6Om6/WerPlfLKXc= records where email = encrypted_email
This introduces two search limitations:
- It cannot support partial text searches since we can only do a meaningful comparison of the full text values
- The search term is case-sensitive
Even if you are able to accept the search limitations, a good security practice is to include a randomised initialisation vector (IV) for each value to be encrypted.
This makes the above solution unworkable.
Encrypted data retrieval with randomised IV
Once you introduce a randomised IV, the encrypted value is different every time you encrypt it.
encrypt('firstname.lastname@example.org') # QEVuQwFAEABIuil3e4aBmKz/oRMtGG+wD/jTJs6xFZi1gRHlQKKpyP3hzlDVzLATZoSCaOUYQHA= encrypt('email@example.com') # QEVuQwFAEACCeAs0dmQ86ZCAb0MvZ/+cq+lEe7OleIjJEfAFAZJCrnVrpFcrJEhve4JjPFqZCiw=
This means that you can no longer rely on this value to be consistent for data retrieval.
Decrypt everything and filter
The naive strategy is (we don’t expect most engineers to arrive at this unless you’re being cynical):
- retrieve all data,
- decrypt , and
However, this has an obvious performance tradeoff when done within the application in-memory; We retrieve, say, 100K records and decrypt them in memory before performing each search. We end up easily “DDoS-ing” ourselves with excessive computation whenever the traffic for customer searches increases 😅
Blind indexing: Storing hashed value for matching
Here’s the challenge: we need some way to narrow down the initial retrieval of data without access to the unencrypted data. That means we have to get creative. We can’t use the unencrypted data directly, but we can derive some sort of proxy that we can consistently reproduce when we need to retrieve the data.
To solve this problem, we use a Bloom filter as the first step. This gives us a much reduced number of records to decrypt and work with, which we can use the naive strategy to process.
Practically in this strategy, we:
- Compute hashes based on the values
- Store the hashes in the database
- At search time, we hash search values and compare them to the pre-calculated hashes (Bloom filter)
- Decrypt the records that are returned (there shouldn’t be that many after the Bloom filter)
- Filter the plaintext
We mitigate this by never hashing the full value. Instead, we chunk the string and use that as the basis for comparison. The main purpose of this is to increase the number of potential hash collisions (see above on Bloom filter) and therefore restrict the amount of information that an attacker can gain by observing search results. To understand why this is important, imagine the opposite case where an attacker is able to observe a whole set of queries which each return only 1 result. In effect, they are building their hash table with a 1-to-1 mapping between the plaintext query and the hashed response.
Bonus: A side benefit is that hashing substrings allows us to perform partial searches on our encrypted data (somewhat similar to our usual SQL
LIKE queries). We will discuss this further in a separate post!
Learnings: Enabling searches (largely) whilst keeping our PII secure
With the above model, we enable our engineers to still write deliberately constructed queries against customer PII, whilst protecting the raw customer PII.
- If any of the blind indices involved in the query match, the record is potentially a match.
- Our app decrypts each candidate row and then applies its own text comparison before serving up the best matches
- There’s a slight performance hit but we still save a lot of memory tradeoffs (whilst maintaining security)
- If the blind indices in the query do not match, then the data is definitely not a match.
To keep the search up to date, we also ensure we regenerate the corresponding blind indices every time PII data is added or updated.
I hope this was a useful demonstration of a practical technique to search data against encrypted datasets - one that is still relatively fast whilst reducing informational leakage when faced with highly privileged bad actors. Stay tuned for the next post on how this enables partial match searches.