PHP - - By Vova Feldman

How to Create a Unique 64bit Integer from String

PHP provides the popular md5() hash function out of the box, which returns 32 a hex character string. It’s a great way to generate a fingerprint for any arbitrary length string. But what if you need to generate an integer fingerprint out of a URL?

Challenge

We faced that challenge in RatingWidget when we had to bind our rating widgets to a unique Int64 IDs based on the website’s page it’s being loaded from. Theoretically we could just store the URLs and query the URL column, but URLs can be very long and creating an index for text column with unknown length is very inefficient.

So if you are working on any kind of dynamic widget development that should load different data based on the URL it’s loaded from, this post will save you tonnes of time.

To simplify the problem, let’s divide it into two sub-challenges:

  1. URL Canonization
  2. String to unique Int64 conversion

URL Canonization

In our case, we wanted to assign a unique Int64 for a page, not for a URL. For instance, http://domain.com?x=1&y=2 and http://domain.com?y=2&x=1 are different URLs but in fact both of them will load the exact same page. Therefore, we wanted to assign them an identical Int64 ID. Thus, by canonizing the URLs before mapping them to Int64, we can convert the URLs to uniform representation.

function canonizeUrl($url)
    {
        $url = parse_url(strtolower($url));
        
        $canonic = $url['host'] . $url['path'];
        
        if (isset($url['query']))
        {
            parse_str($url['query'], $queryString);
            $canonic .= '?' . canonizeQueryString($queryString);
        }
        
        return $canonic;
    }
    
    function canonizeQueryString(array $params)
    {
        if (!is_array($params) || 0 === count($params))
            return '';

        // Urlencode both keys and values.
        $keys = urlencode_rfc3986(array_keys($params));
        $values = urlencode_rfc3986(array_values($params));
        $params = array_combine($keys, $values);
    
        // Parameters are sorted by name, using lexicographical byte value ordering.
        // Ref: Spec: 9.1.1 (1)
        uksort($params, 'strcmp');
    
        $pairs = array();
        foreach ($params as $parameter => $value) 
        {
            $lower_param = strtolower($parameter);
            
            if (is_array($value)) 
            {
                // If two or more parameters share the same name, they are sorted by their value
                // Ref: Spec: 9.1.1 (1)
                natsort($value);
                foreach ($value as $duplicate_value)
                    $pairs[] = $lower_param . '=' . $duplicate_value;
            } 
            else 
            {
                $pairs[] = $lower_param . '=' . $value;
            }
        }            
        
        if (0 === count($pairs))
            return '';
    
        return implode('&', $pairs);
    }

    function urlencode_rfc3986($input) 
    {
        if (is_array($input))
            return array_map(array(&$this, 'urlencode_rfc3986'), $input);
        else if (is_scalar($input)) 
            return str_replace('+', ' ', str_replace('%7E', '~', rawurlencode($input)));
            
        return '';
    }

Basically what this code does is reorder the query string parameters by lexicographical order, and slightly tweak the URL encoding based on RFC 3986 URI syntax standard, to compensate for the different browsers + server URL encoding inconsistency.

Notes:

  1. In our case canonizeUrl, the canonization function, gets rid of the protocol. So https://domain.com and http://domain.com are both canonized to domain.com because we wanted to show the same rating widget on HTTP and HTTPS equivalent pages.
  2. As you can notice, we also ignore everything the after hashmark fragment. Therefore, if you would like to generate unique IDs for SPA (Single Page Application) different states like http://my-spa.com/#state1 and http://my-spa.com/#state2, the URL canonization function has to be modified to support that.

Converting String to unique Int64 ID for MySql BIGINT Indexed Column

After fooling around with various bit conversion functions like bindec(), decbin(), base_convert(). We have found out that 64 bit integers and PHP are not playing well. None of the mentioned functions consistently supports 64 bit. After digging around on Google, we were lead to a post about 32 bit limitations in PHP which included the suggestion to use GMP, a really cool library for multiple precision integers. Using this library, we managed to create this one line hash function that generates a 64 bit integer out of arbitrary length string.

function get64BitHash($str)
    {
        return gmp_strval(gmp_init(substr(md5($str), 0, 16), 16), 10);
    }

Post factum, we could have implemented the CRC64 algorithm which generates a string checksum and should perform faster than MD5. But the advantage of the technique we’ve used over CRC is that we’ve created a one-way-hash function, so we can reuse it for various cryptography purposes in the code.

To find out more about GMP, see here.

Grand Finale

Combining the URL canonization with the String to Int64 mapping, the final solution looks like this:

function urlTo64BitHash($url)
    {
        return get64BitHash(canonizeUrl($url));
    }

Collision and Performance Test of get64BitHash

Platform: Intel i3, Windows 7 64 bit, PHP 5.3
Iterations: 10,000,000 Times generated get64BitHash
Elapsed Time: 460 millisecond for every 100,000 generations
Collision: Not found

Summary

I hope this straightforward solution will save you time on your next project. If you have comments or any additional use-cases where this technique can be applied, please feel free to comment below.

Sponsors