How to Create a Unique 64bit Integer from String

PHP provides the popular md5() hash function out of the box, which returns 32 a hex character string. It’s a great way to generate a fingerprint for any arbitrary length string. But what if you need to generate an integer fingerprint out of a URL?

Challenge

We faced that challenge in RatingWidget when we had to bind our rating widgets to a unique Int64 IDs based on the website’s page it’s being loaded from. Theoretically we could just store the URLs and query the URL column, but URLs can be very long and creating an index for text column with unknown length is very inefficient.

So if you are working on any kind of dynamic widget development that should load different data based on the URL it’s loaded from, this post will save you tonnes of time.

To simplify the problem, let’s divide it into two sub-challenges:

  1. URL Canonization
  2. String to unique Int64 conversion

URL Canonization

In our case, we wanted to assign a unique Int64 for a page, not for a URL. For instance, http://domain.com?x=1&y=2 and http://domain.com?y=2&x=1 are different URLs but in fact both of them will load the exact same page. Therefore, we wanted to assign them an identical Int64 ID. Thus, by canonizing the URLs before mapping them to Int64, we can convert the URLs to uniform representation.

    function canonizeUrl($url)
    {
        $url = parse_url(strtolower($url));
        
        $canonic = $url['host'] . $url['path'];
        
        if (isset($url['query']))
        {
            parse_str($url['query'], $queryString);
            $canonic .= '?' . canonizeQueryString($queryString);
        }
        
        return $canonic;
    }
    
    function canonizeQueryString(array $params)
    {
        if (!is_array($params) || 0 === count($params))
            return '';

        // Urlencode both keys and values.
        $keys = urlencode_rfc3986(array_keys($params));
        $values = urlencode_rfc3986(array_values($params));
        $params = array_combine($keys, $values);
    
        // Parameters are sorted by name, using lexicographical byte value ordering.
        // Ref: Spec: 9.1.1 (1)
        uksort($params, 'strcmp');
    
        $pairs = array();
        foreach ($params as $parameter => $value) 
        {
            $lower_param = strtolower($parameter);
            
            if (is_array($value)) 
            {
                // If two or more parameters share the same name, they are sorted by their value
                // Ref: Spec: 9.1.1 (1)
                natsort($value);
                foreach ($value as $duplicate_value)
                    $pairs[] = $lower_param . '=' . $duplicate_value;
            } 
            else 
            {
                $pairs[] = $lower_param . '=' . $value;
            }
        }            
        
        if (0 === count($pairs))
            return '';
    
        return implode('&', $pairs);
    }

    function urlencode_rfc3986($input) 
    {
        if (is_array($input))
            return array_map(array(&$this, 'urlencode_rfc3986'), $input);
        else if (is_scalar($input)) 
            return str_replace('+', ' ', str_replace('%7E', '~', rawurlencode($input)));
            
        return '';
    }

Basically what this code does is reorder the query string parameters by lexicographical order, and slightly tweak the URL encoding based on RFC 3986 URI syntax standard, to compensate for the different browsers + server URL encoding inconsistency.

Notes:

  1. In our case canonizeUrl, the canonization function, gets rid of the protocol. So https://domain.com and http://domain.com are both canonized to domain.com because we wanted to show the same rating widget on HTTP and HTTPS equivalent pages.
  2. As you can notice, we also ignore everything the after hashmark fragment. Therefore, if you would like to generate unique IDs for SPA (Single Page Application) different states like http://my-spa.com/#state1 and http://my-spa.com/#state2, the URL canonization function has to be modified to support that.

Converting String to unique Int64 ID for MySql BIGINT Indexed Column

After fooling around with various bit conversion functions like bindec(), decbin(), base_convert(). We have found out that 64 bit integers and PHP are not playing well. None of the mentioned functions consistently supports 64 bit. After digging around on Google, we were lead to a post about 32 bit limitations in PHP which included the suggestion to use GMP, a really cool library for multiple precision integers. Using this library, we managed to create this one line hash function that generates a 64 bit integer out of arbitrary length string.

    function get64BitHash($str)
    {
        return gmp_strval(gmp_init(substr(md5($str), 0, 16), 16), 10);
    }

Post factum, we could have implemented the CRC64 algorithm which generates a string checksum and should perform faster than MD5. But the advantage of the technique we’ve used over CRC is that we’ve created a one-way-hash function, so we can reuse it for various cryptography purposes in the code.

To find out more about GMP, see here.

Grand Finale

Combining the URL canonization with the String to Int64 mapping, the final solution looks like this:

    function urlTo64BitHash($url)
    {
        return get64BitHash(canonizeUrl($url));
    }

Collision and Performance Test of get64BitHash

Platform: Intel i3, Windows 7 64 bit, PHP 5.3
Iterations: 10,000,000 Times generated get64BitHash
Elapsed Time: 460 millisecond for every 100,000 generations
Collision: Not found

Summary

I hope this straightforward solution will save you time on your next project. If you have comments or any additional use-cases where this technique can be applied, please feel free to comment below.

Free book: Jump Start HTML5 Basics

Grab a free copy of one our latest ebooks! Packed with hints and tips on HTML5's most powerful new features.

  • http://mrxxiv.com/ Terrence Campbell

    This is highly useful for any service I attempt to create in the future. I do wonder, how does this compare to bcrypt? I’ve always wanted to learn more about PHP’s hashing capabilities.

    By the way, out of topic, PHP 5.3 support expires tomorrow.

    • svovaf

      Usually encryption takes more time than hashing. I’m not familiar with bcrypt, but I would carefully guess that the hashing technique should be more efficient. Obviously it requires performance testing.

    • http://evanbyrne.com Evan Byrne

      Bcrypt is the standard for password hashing because it is slow. However, if all you are doing is something like in the article (i.e., for data that is not sensitive), then something in the SHA family would probably suffice.

  • griesx

    why are you using substr(md5($string), 16); ? this will create possible collisions ?!

    • svovaf

      Using gmp_init with HEX numbers which are longer than 16 characters may result numbers out of the scale of 64 Bit.

      • griesx

        yeah but that is a problem as a md5 sum that gets split in half is not unique anymore so your proposed solution has a major flaw or am I getting something wrong here?

        • svovaf

          MD5 has collisions anyways. It’s impossible to map endless number of string variations to 32 characters string. But you are right, in theory it increase the probability for collision. Practically, it depends on the problem you are trying to solve. If your set of input strings is very large, you’ll have to deal with collisions.

          • http://yapf.blogspot.nl/ vinny42

            “MD5 has collisions anyways.”

            The first collision found was on binary data of >2kByte.

    • http://www.coolfactor.ca/ Ted Wood

      I agree with this assumption, and my modification was to take the first 8 characters of the MD5 string and join that with the last 8 characters, instead of always taking the first 16 characters. That _should_ reduce the chances of collisions, eh?

  • svovaf

    Thanks for the comment @disqus_ZtzWSTh07Z:disqus – sounds like a good idea. The reason we used Int64 is because of legacy reasons – all our ratings’ IDs has to be numeric.

    • https://www.raptor-editor.com/ Petah

      Whats wrong with crc64?

      • svovaf

        Petah, you are right! Thanks for the heads up. MD5 is not very much secure like encryption algorithms (AES, 3DES, etc). Instead of “various cryptography purposes” it would be more accurate to write “some cryptography purposes”. It really depends on the use case.

  • Gabriel

    Hi, I have just finished an unique ID generator (numeric, incremental and resilient) written in php (and willing to port it in C). Can you check http://www.webhead.ro/ and tell me your opinion?
    Thank you.