How to Search on Securely Encrypted Database Fields

Key Takeaways

Utilize non-deterministic encryption to ensure each encrypted message is unique, enhancing security by making each ciphertext indistinguishable from random noise and safeguarding against chosen-ciphertext attacks.
Implement blind indexing for encrypted data to enable efficient and secure search functionality; store a keyed hash of the plaintext in a separate database column to facilitate queries without decrypting the entire database.
Avoid traditional insecure or experimental encryption methods, such as ECB mode or homomorphic encryption, which can expose sensitive data or are not yet secure enough for production environments.
Consider the use of authenticated encryption schemes like XSalsa20-Poly1305 or XChacha20-Poly1305, which provide both confidentiality and integrity, ensuring that encrypted data remains secure and unaltered.
Address the limitations of encrypted search by creating multiple blind indexes for different types of queries, and ensure comprehensive security measures are in place to protect against potential attacks and unauthorized access.

This post was originally published on the ParagonIE blog and reposted here with their permission.

We [ParagonIE] get asked the same question a lot (or some remix of it).

This question shows up from time to time in open source encryption libraries’ bug trackers. This was one of the “weird problems” covered in my talk at B-Sides Orlando (titled Building Defensible Solutions to Weird Problems), and we’ve previously dedicated a small section to it in one of our white papers.

You know how to search database fields, but the question is, How do we securely encrypt database fields but still use these fields in search queries?

Our secure solution is rather straightforward, but the path between most teams asking that question and discovering our straightforward solution is fraught with peril: bad designs, academic research projects, misleading marketing, and poor threat modeling.

If you’re in a hurry, feel free to skip ahead to the solution.

Towards Searchable Encryption

Let’s start with a simple scenario (which might be particularly relevant for a lot of local government or health care applications):

You are building a new system that needs to collect social security numbers (SSNs) from its users.
Regulations and common sense both dictate that users’ SSNs should be encrypted at rest.
Staff members will need to be able to look up users’ accounts, given their SSN.

Let’s first explore the flaws with the obvious answers to this problem.

Insecure (or otherwise ill-advised) Answers

Non-randomized Encryption

The most obvious answer to most teams (particularly teams that don’t have security or cryptography experts) would be to do something like this:

<?php
class InsecureExampleOne
{
    protected $db;
    protected $key;

    public function __construct(\PDO $db, string $key = '')
    {
        $this->db = $db;
        $this->key = $key;
    }

    public function searchByValue(string $query): array
    {
        $stmt = $this->db->prepare('SELECT * FROM table WHERE column = ?');
        $stmt->execute([
            $this->insecureEncryptDoNotUse($query)
        ]);
        return $stmt->fetchAll(\PDO::FETCH_ASSOC);
    }

    protected function insecureEncryptDoNotUse(string $plaintext): string
    {
        return \bin2hex(
            \openssl_encrypt(
                $plaintext,
                'aes-128-ecb',
                $this->key,
                OPENSSL_RAW_DATA | OPENSSL_ZERO_PADDING
            )
        );
    }
}

In the above snippet, the same plaintext always produces the same ciphertext when encrypted with the same key. But more concerning with ECB mode is that every 16-byte chunk is encrypted separately, which can have some extremely unfortunate consequences.

Formally, these constructions are not semantically secure: If you encrypt a large message, you will see blocks repeat in the ciphertext.

In order to be secure, encryption must be indistinguishable from random noise to anyone that does not hold the decryption key. Insecure modes include ECB mode and CBC mode with a static (or empty) IV.

You want non-deterministic encryption, which means each message uses a unique nonce or initialization vector that never repeats for a given key.

Experimental Academic Designs

There is a lot of academic research going into such topics as homomorphic, order-revealing, and order-preserving encryption techniques.

As interesting as this work is, the current designs are nowhere near secure enough to use in a production environment.

For example, order-revealing encryption leaks enough data to infer the plaintext.

Homomorphic encryption schemes are often repackaging vulnerabilities (practical chosen-ciphertext attacks) as features.

Unpadded RSA is homomorphic with respect to multiplication.
- If you multiply a ciphertext by an integer, the plaintext you get will be equal to the original message multiplied by the same integer. There are several possible attacks against unpadded RSA, which is why RSA in the real world uses padding (although often an insecure padding mode).
AES in Counter Mode is homomorphic with respect to XOR.
- This is why nonce-reuse defeats the confidentiality of your message in CTR mode (and non-NMR stream ciphers in general).

As we’ve covered in a previous blog post, when it comes to real-world cryptography, confidentiality without integrity is the same as no confidentiality. What happens if an attacker gains access to the database, alters ciphertexts, and studies the behavior of the application upon decryption?

There’s potential for ongoing cryptography research to one day produce an innovative encryption design that doesn’t undo decades of research into safe cryptography primitives and cryptographic protocol designs. However, we’re not there yet, and you don’t need to invest into a needlessly complicated research prototype to solve the problem.

Dishonorable Mention: Decrypt Every Row

I don’t expect most engineers to arrive at this solution without a trace of sarcasm. The bad idea here is, because you need secure encryption (see below), your only recourse is to query every ciphertext in the database and then iterate through them, decrypting them one-by-one and performing your search operation in the application code.

If you go down this route, you will open your application to denial of service attacks. It will be slow for your legitimate users. This is a cynic’s answer, and you can do much better than that, as we’ll demonstrate below.

Secure Searchable Encryption Made Easy

Let’s start by avoiding all the problems outlined in the insecure/ill-advised section in one fell swoop: All ciphertexts will be the result of an authenticated encryption scheme, preferably with large nonces (generated from a secure random number generator).

With an authenticated encryption scheme, ciphertexts are non-deterministic (same message and key, but different nonce, yields a different ciphertext) and protected by an authentication tag. Some suitable options include: XSalsa20-Poly1305, XChacha20-Poly1305, and (assuming it’s not broken before CAESAR concludes) NORX64-4-1. If you’re using NaCl or libsodium, you can just use crypto_secretbox here.

Consequently, our ciphertexts are indistinguishable from random noise, and protected against chosen-ciphertext attacks. That’s how secure, boring encryption ought to be.

However, this presents an immediate challenge: We can’t just encrypt arbitrary messages and query the database for matching ciphertexts. Fortunately, there is a clever workaround.

Important: Threat Model Your Usage of Encryption

Before you begin, make sure that encryption is actually making your data safer. It is important to emphasize that “encrypted storage” isn’t the solution to securing a CRUD app that’s vulnerable to SQL injection. Solving the actual problem (i.e. preventing the SQL injection) is the only way to go.

If encryption is a suitable security control to implement, this implies that the cryptographic keys used to encrypt/decrypt data are not accessible to the database software. In most cases, it makes sense to keep the application server and database server on separate hardware.

Implementing Literal Search of Encrypted Data

Possible use-case: Storing social security numbers, but still being able to query them.

In order to store encrypted information and still use the plaintext in SELECT queries, we’re going to follow a strategy we call blind indexing. The general idea is to store a keyed hash (e.g. HMAC) of the plaintext in a separate column. It is important that the blind index key be distinct from the encryption key and unknown to the database server.

For very sensitive information, instead of a simple HMAC, you will want to use a key-stretching algorithm (PBKDF2-SHA256, scrypt, Argon2) with the key acting as a static salt, to slow down attempts at enumeration. We aren’t worried about offline brute-force attacks in either case, unless an attacker can obtain the key (which must not stored in the database).

So if your table schema looks like this (in PostgreSQL flavor):

CREATE TABLE humans (

    humanid BIGSERIAL PRIMARY KEY,
    first_name TEXT,
    last_name TEXT,
    ssn TEXT, /* encrypted */
    ssn_bidx TEXT /* blind index */
);
CREATE INDEX ON humans (ssn_bidx);

You would store the encrypted value in humans.ssn. A blind index of the plaintext SSN would go into humans.ssn_bidx. A naive implementation might look like this:

<?php
/* This is not production-quality code.
 * It's optimized for readability and understanding, not security.
 */

function encryptSSN(string $ssn, string $key): string
{
    $nonce = random_bytes(24);
    $ciphertext = sodium_crypto_secretbox($ssn, $nonce, $key);
    return bin2hex($nonce . $ciphertext);
}

function decryptSSN(string $ciphertext, string $key): string
{
    $decoded = hex2bin($ciphertext);
    $nonce = mb_substr($decoded, 0, 24, '8bit');
    $cipher = mb_substr($decoded, 24, null, '8bit');
    return sodium_crypto_secretbox_open($cipher, $nonce, $key);
}

function getSSNBlindIndex(string $ssn, string $indexKey): string
{
    return bin2hex(
        sodium_crypto_pwhash(
            32,
            $ssn,
            $indexKey,
            SODIUM_CRYPTO_PWHASH_OPSLIMIT_MODERATE,
            SODIUM_CRYPTO_PWHASH_MEMLIMIT_MODERATE
        )
    );
}

function findHumanBySSN(PDO $db, string $ssn, string $indexKey): array
{
    $index = getSSNBlindIndex($ssn, $indexKey);
    $stmt = $db->prepare('SELECT * FROM humans WHERE ssn_bidx = ?');
    $stmt->execute([$index]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

A more comprehensive proof-of-concept is included in the supplemental material for my B-Sides Orlando 2017 talk. It’s released under the Creative Commons CC0 license, which for most people means the same thing as “public domain”.

Security Analysis and Limitations

Depending on your exact threat model, this solution leaves two questions that must be answered before it can be adopted:

Is it safe to use, or does it leak data like a sieve?
What are the limitations on its usefulness? (This one is sort of answered already.)

Given our example above, assuming your encryption key and your blind index key are separate, both keys are stored in the webserver, and the database server doesn’t have any way of obtaining these keys, then any attacker that only compromises the database server (but not the web server) will only be able to learn if several rows share a social security number, but not what the shared SSN is. This duplicate entry leak is necessary in order for indexing to be possible, which in turn allows fast SELECT queries from a user-provided value.

Furthermore, if an attacker is capable of both observing/changing plaintexts as a normal user of the application while observing the blind indices stored in the database, they can leverage this into a chosen-plaintext attack, where they iterate every possible value as a user and then correlate with the resultant blind index value. This is more practical in the HMAC scenario than in the e.g. Argon2 scenario. For high-entropy or low-sensitivity values (not SSNs), the physics of brute force can be on our side.

A much more practical attack for such a criminal would be to substitute values from one row to another then access the application normally, which will reveal the plaintext unless a distinct per-row key was employed (e.g. hash_hmac('sha256', $rowID, $masterKey, true) could even be an effective mitigation here, although others would be preferable). The best defense here is to use an AEAD mode (passing the primary key as additional associated data) so that the ciphertexts are tied to a particular database row. (This will not prevent attackers from deleting data, which is a much bigger challenge.)

Compared to the amount of information leaked by other solutions, most applications’ threat models should find this to be an acceptable trade-off. As long as you’re using authenticated encryption for encryption, and either HMAC (for blind indexing non-sensitive data) or a password hashing algorithm (for blind indexing sensitive data), it’s easy to reason about the security of your application.

However, it does have one very serious limitation: It only works for exact matches. If two strings differ in a meaningless way but will always produce a different cryptographic hash, then searching for one will never yield the other. If you need to do more advanced queries, but still want to keep your decryption keys and plaintext values out of the hands of the database server, we’re going to have to get creative.

It is also worth noting that, while HMAC/Argon2 can prevent attackers that do not possess the key from learning the plaintext values of what is stored in the database, it might reveal metadata (e.g. two seemingly-unrelated people share a street address) about the real world.

Implementing Fuzzier Searching for Encrypted Data

Possible use-case: Encrypting peoples’ legal names, and being able to search with only partial matches.

Let’s build on the previous section, where we built a blind index that allows you to query the database for exact matches.

This time, instead of adding columns to the existing table, we’re going to store extra index values into a join table.

CREATE TABLE humans (
    humanid BIGSERIAL PRIMARY KEY,
    first_name TEXT, /* encrypted */
    last_name TEXT, /* encrypted */
    ssn TEXT, /* encrypted */
);
CREATE TABLE humans_filters (
    filterid BIGSERIAL PRIMARY KEY,
    humanid BIGINT REFERENCES humans (humanid),
    filter_label TEXT,
    filter_value TEXT
);
/* Creates an index on the pair. If your SQL expert overrules this, feel free to omit it. */
CREATE INDEX ON humans_filters (filter_label, filter_value);

The reason for this change is to normalize our data structures. You can get by with just adding columns to the existing table, but it’s likely to get messy.

The next change is that we’re going to store a separate, distinct blind index per column for every different kind of query we need (each with its own key). For example:

Need a case-insensitive lookup that ignores whitespace?
- Store a blind index of preg_replace('/[^a-z]/', '', strtolower($value)).
Need to query the first letter of their last name?
- Store a blind index of strtolower(mb_substr($lastName, 0, 1, $locale)).
Need to match “beings with this letter, ends with that letter”?
- Store a blind index of strtolower($string[0] . $string[-1]).
Need to query the first three letters of their last name and the first letter of their first name?
- You guessed it! Build another index based on partial data.

Every index needs to have a distinct key, and great pains should be taken to prevent blind indices of subsets of the plaintext from leaking real plaintext values to a criminal with a knack for crossword puzzles. Only create indexes for serious business needs, and log access to these parts of your application aggressively.

Trading Memory for Time

Thus far, all of the design propositions have been in favor of allowing developers to write carefully considered SELECT queries, while minimizing the number of times the decryption subroutine is invoked. Generally, that is where the train stops and most peoples’ goals have been met.

However, there are situations where a mild performance hit in search queries is acceptable if it means saving a lot of disk space.

The trick here is simple: Truncate your blind indexes to e.g. 16, 32, or 64 bits, and treat them as a Bloom filter:

If the blind indices involved in the query match a given row, the data is probably a match.
- Your application code will need to perform the decryption for each candidate row and then only serve the actual matches.
If the blind indices involved in the query do not match a given row, then the data is definitely not a match.

It may also be worth converting these values from a string to an integer, if your database server will end up storing it more efficiently.

Conclusion

I hope I’ve adequately demonstrated that it is not only possible to build a system that uses secure encryption while allowing fast queries (with minimal information leakage against very privileged attackers), but that it’s possible to build such a system simply, out of the components provided by modern cryptography libraries with very little glue.

If you’re interested in implementing encrypted database storage into your software, we’d love to provide you and your company with our consulting services. Contact ParagonIE if you’re interested.

Frequently Asked Questions (FAQs) on Securely Searching Encrypted Database Fields

What is the importance of searching encrypted database fields securely?

Searching encrypted database fields securely is crucial to protect sensitive data from unauthorized access. Encryption converts data into a code to prevent unauthorized access. However, if the search process isn’t secure, it can expose the data to potential threats. A secure search process ensures that even if someone intercepts the data during the search, they cannot decipher it without the correct decryption key.

How does searchable encryption work?

Searchable encryption is a cryptographic system that allows searches to be carried out on encrypted data. It works by creating an encrypted index for each word in the database. When a search query is made, it is also encrypted and matched with the encrypted index. This way, the data remains encrypted throughout the process, ensuring its security.

What are the challenges of searching encrypted data?

Searching encrypted data presents several challenges. First, encryption makes data unreadable, making it difficult to perform operations like searching. Second, traditional encryption methods require decryption before searching, which can expose sensitive data. Lastly, ensuring the search process is secure without compromising the speed and efficiency of database operations can be challenging.

What is the role of a secure index in encrypted database search?

A secure index plays a crucial role in encrypted database search. It allows for efficient searching without needing to decrypt the data. The index is created by associating each word in the database with an encrypted identifier. When a search query is made, the system checks the secure index instead of the actual data, ensuring the data remains encrypted and secure.

How can I ensure the security of my encrypted database search?

Ensuring the security of an encrypted database search involves several steps. First, use a strong encryption algorithm to encrypt your data. Second, create a secure index for efficient searching. Third, ensure the search process is also encrypted to prevent exposure during the search. Lastly, regularly update your security measures to protect against new threats.

What is the difference between deterministic and probabilistic encryption in database search?

Deterministic encryption always produces the same encrypted output for a given input, making it efficient for searching as identical terms will have identical encrypted values. However, it is less secure as it can be vulnerable to frequency analysis attacks. On the other hand, probabilistic encryption produces different outputs for the same input, making it more secure but less efficient for searching.

Can I use SQL queries on encrypted database fields?

Yes, you can use SQL queries on encrypted database fields, but it requires additional steps. You need to use a searchable encryption scheme that allows SQL queries to be performed on the encrypted data. This involves creating a secure index and encrypting the SQL queries to match the encrypted data.

How does homomorphic encryption help in searching encrypted database fields?

Homomorphic encryption is a form of encryption that allows computations to be performed on encrypted data without decrypting it. This means you can search encrypted database fields without exposing the data. The results of the search are also encrypted and can only be viewed with the correct decryption key.

What is the impact of encrypted database search on performance?

Encrypted database search can impact performance as it requires additional computational resources. Creating a secure index, encrypting and decrypting data, and performing computations on encrypted data can slow down database operations. However, with efficient encryption schemes and hardware, the impact can be minimized.

How can I balance security and performance in encrypted database search?

Balancing security and performance in encrypted database search involves choosing the right encryption scheme. Deterministic encryption is faster but less secure, while probabilistic and homomorphic encryption are more secure but slower. You need to choose based on your specific needs. Additionally, optimizing your database structure and using efficient hardware can also help balance security and performance.