How to Create a Unique 64bit Integer from String

PHP provides the popular md5() hash function out of the box, which returns 32 a hex character string. It’s a great way to generate a fingerprint for any arbitrary length string. But what if you need to generate an integer fingerprint out of a URL?

Key Takeaways

URL Canonization is a necessary step when generating a unique Int64 ID from a string, as different URLs may load the same page. The canonization function gets rid of the protocol and ignores everything after the hashmark fragment, ensuring that identical pages have identical Int64 IDs.
PHP’s GMP library is a useful tool for generating a 64-bit integer out of an arbitrary length string. The function get64BitHash() is a one-line hash function that generates a 64-bit integer using this library, providing a solution to PHP’s 32-bit limitations.
The combination of URL canonization and string to Int64 mapping provides a reliable solution for creating unique Int64 IDs from URLs. This technique is efficient and versatile, with potential applications in cryptography and dynamic widget development.

Challenge

We faced that challenge in RatingWidget when we had to bind our rating widgets to a unique Int64 IDs based on the website’s page it’s being loaded from. Theoretically we could just store the URLs and query the URL column, but URLs can be very long and creating an index for text column with unknown length is very inefficient.

So if you are working on any kind of dynamic widget development that should load different data based on the URL it’s loaded from, this post will save you tonnes of time.

To simplify the problem, let’s divide it into two sub-challenges:

URL Canonization
String to unique Int64 conversion

URL Canonization

In our case, we wanted to assign a unique Int64 for a page, not for a URL. For instance, http://domain.com?x=1&y=2 and http://domain.com?y=2&x=1 are different URLs but in fact both of them will load the exact same page. Therefore, we wanted to assign them an identical Int64 ID. Thus, by canonizing the URLs before mapping them to Int64, we can convert the URLs to uniform representation.



    function canonizeUrl($url)

    {

        $url = parse_url(strtolower($url));
        $canonic = $url['host'] . $url['path'];
        if (isset($url['query']))

        {

            parse_str($url['query'], $queryString);

            $canonic .= '?' . canonizeQueryString($queryString);

        }
        return $canonic;

    }
    function canonizeQueryString(array $params)

    {

        if (!is_array($params) || 0 === count($params))

            return '';
        // Urlencode both keys and values.

        $keys = urlencode_rfc3986(array_keys($params));

        $values = urlencode_rfc3986(array_values($params));

        $params = array_combine($keys, $values);
        // Parameters are sorted by name, using lexicographical byte value ordering.

        // Ref: Spec: 9.1.1 (1)

        uksort($params, 'strcmp');
        $pairs = array();

        foreach ($params as $parameter => $value)

        {

            $lower_param = strtolower($parameter);
            if (is_array($value))

            {

                // If two or more parameters share the same name, they are sorted by their value

                // Ref: Spec: 9.1.1 (1)

                natsort($value);

                foreach ($value as $duplicate_value)

                    $pairs[] = $lower_param . '=' . $duplicate_value;

            }

            else

            {

                $pairs[] = $lower_param . '=' . $value;

            }

        }            
        if (0 === count($pairs))

            return '';
        return implode('&', $pairs);

    }
    function urlencode_rfc3986($input)

    {

        if (is_array($input))

            return array_map(array(&$this, 'urlencode_rfc3986'), $input);

        else if (is_scalar($input))

            return str_replace('+', ' ', str_replace('%7E', '~', rawurlencode($input)));
        return '';

    }

Basically what this code does is reorder the query string parameters by lexicographical order, and slightly tweak the URL encoding based on RFC 3986 URI syntax standard, to compensate for the different browsers + server URL encoding inconsistency.

Notes:

In our case canonizeUrl, the canonization function, gets rid of the protocol. So https://domain.com and http://domain.com are both canonized to domain.com because we wanted to show the same rating widget on HTTP and HTTPS equivalent pages.
As you can notice, we also ignore everything the after hashmark fragment. Therefore, if you would like to generate unique IDs for SPA (Single Page Application) different states like http://my-spa.com/#state1 and http://my-spa.com/#state2, the URL canonization function has to be modified to support that.

Converting String to unique Int64 ID for MySql BIGINT Indexed Column

After fooling around with various bit conversion functions like bindec(), decbin(), base_convert(). We have found out that 64 bit integers and PHP are not playing well. None of the mentioned functions consistently supports 64 bit. After digging around on Google, we were lead to a post about 32 bit limitations in PHP which included the suggestion to use GMP, a really cool library for multiple precision integers. Using this library, we managed to create this one line hash function that generates a 64 bit integer out of arbitrary length string.



    function get64BitHash($str)

    {

        return gmp_strval(gmp_init(substr(md5($str), 0, 16), 16), 10);

    }

Post factum, we could have implemented the CRC64 algorithm which generates a string checksum and should perform faster than MD5. But the advantage of the technique we’ve used over CRC is that we’ve created a one-way-hash function, so we can reuse it for various cryptography purposes in the code.

To find out more about GMP, see here.

Grand Finale

Combining the URL canonization with the String to Int64 mapping, the final solution looks like this:



    function urlTo64BitHash($url)

    {

        return get64BitHash(canonizeUrl($url));

    }

Collision and Performance Test of get64BitHash

Platform: Intel i3, Windows 7 64 bit, PHP 5.3
Iterations: 10,000,000 Times generated get64BitHash
Elapsed Time: 460 millisecond for every 100,000 generations
Collision: Not found

Summary

I hope this straightforward solution will save you time on your next project. If you have comments or any additional use-cases where this technique can be applied, please feel free to comment below.

Frequently Asked Questions (FAQs) about Creating Unique 64-bit Integer Strings

What is a 64-bit integer string and why is it important?

A 64-bit integer string is a data type that can hold a whole number ranging from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. It’s important because it allows for a vast range of unique values, which is crucial in many computing scenarios such as database indexing, hashing algorithms, and unique identifier generation. The ability to create unique 64-bit integer strings can significantly enhance the efficiency and security of your software applications.

How does a hashing function work in creating a unique 64-bit integer string?

A hashing function takes an input (or ‘message’) and returns a fixed-size string of bytes, typically in the form of a 64-bit integer string. The output is unique to each unique input – a small change in the input will produce such a drastic change in output that the new hash value appears uncorrelated with the old hash value. This property is critical in various computer science applications, including data retrieval and encryption.

Can I convert a string to a unique integer in Ruby?

Yes, you can convert a string to a unique integer in Ruby. Ruby provides several built-in methods to convert strings to integers, such as the to_i method. However, to ensure the integer is unique for each unique string, you may need to implement a custom hashing function. This function could combine the ASCII values of the characters in the string in a specific way to generate a unique integer.

What are the potential issues with string hashing?

While string hashing is a powerful tool, it’s not without its potential issues. One of the main concerns is the possibility of collisions, where two different strings produce the same hash value. While the chances are low with a good hashing function and a large enough hash size (like 64-bit), it’s still a possibility. Another issue is the irreversibility of hashes – once you’ve hashed a string, you can’t get the original string back from the hash value.

How can I ensure the uniqueness of my 64-bit integer string?

Ensuring the uniqueness of your 64-bit integer string largely depends on the quality of your hashing function. A good hashing function will produce a unique hash value for each unique input string. You can also add a ‘salt’ – a random data that is used as an additional input to the hashing function – to further ensure uniqueness. However, remember that while you can minimize the risk of collisions, it’s impossible to completely eliminate it.

What is the role of a 64-bit integer string in cryptography?

In cryptography, a 64-bit integer string can be used as a hash value in a hashing algorithm. Hashing is a fundamental part of many cryptographic protocols. It’s used to ensure data integrity, create digital signatures, and generate unique keys for encryption and decryption. The 64-bit size offers a good balance between computational efficiency and security, as it’s large enough to make brute-force attacks impractical.

Can I use a 64-bit integer string as a unique identifier?

Yes, a 64-bit integer string can be used as a unique identifier, especially in scenarios where a large number of unique IDs are required. Because of the vast range of values a 64-bit integer can hold, it allows for a huge number of unique IDs. However, remember that the uniqueness is dependent on the quality of your hashing function.

How does a 64-bit integer string compare to other data types in terms of storage efficiency?

A 64-bit integer string is more storage-efficient than many other data types. For example, it can hold a much larger range of values than a 32-bit integer while only taking up twice as much storage space. This makes it a good choice for scenarios where storage efficiency is a concern, such as in large databases or high-performance computing applications.

How can I convert a 64-bit integer string back to the original string?

In general, you can’t convert a 64-bit integer string back to the original string. This is because hashing is a one-way function – it’s designed to be easy to compute in one direction (from input to hash value), but practically impossible to compute in the reverse direction. This property is what makes hashing useful in many applications, including data integrity checks and password storage.

What are some practical applications of 64-bit integer strings?

64-bit integer strings have a wide range of practical applications. They’re used in database indexing to quickly locate data, in hashing algorithms to ensure data integrity and create unique identifiers, and in cryptography for secure data transmission. They’re also used in high-performance computing applications where a large range of unique values is required.