PHP provides the popular md5() hash function out of the box, which returns 32 a hex character string. It’s a great way to generate a fingerprint for any arbitrary length string. But what if you need to generate an integer fingerprint out of a URL?

Challenge

We faced that challenge in RatingWidget when we had to bind our rating widgets to a unique Int64 IDs based on the website’s page it’s being loaded from. Theoretically we could just store the URLs and query the URL column, but URLs can be very long and creating an index for text column with unknown length is very inefficient.

So if you are working on any kind of dynamic widget development that should load different data based on the URL it’s loaded from, this post will save you tonnes of time.

To simplify the problem, let’s divide it into two sub-challenges:

  1. URL Canonization
  2. String to unique Int64 conversion

URL Canonization

In our case, we wanted to assign a unique Int64 for a page, not for a URL. For instance, http://domain.com?x=1&y=2 and http://domain.com?y=2&x=1 are different URLs but in fact both of them will load the exact same page. Therefore, we wanted to assign them an identical Int64 ID. Thus, by canonizing the URLs before mapping them to Int64, we can convert the URLs to uniform representation.

function canonizeUrl($url)
    {
        $url = parse_url(strtolower($url));
        
        $canonic = $url['host'] . $url['path'];
        
        if (isset($url['query']))
        {
            parse_str($url['query'], $queryString);
            $canonic .= '?' . canonizeQueryString($queryString);
        }
        
        return $canonic;
    }
    
    function canonizeQueryString(array $params)
    {
        if (!is_array($params) || 0 === count($params))
            return '';

        // Urlencode both keys and values.
        $keys = urlencode_rfc3986(array_keys($params));
        $values = urlencode_rfc3986(array_values($params));
        $params = array_combine($keys, $values);
    
        // Parameters are sorted by name, using lexicographical byte value ordering.
        // Ref: Spec: 9.1.1 (1)
        uksort($params, 'strcmp');
    
        $pairs = array();
        foreach ($params as $parameter => $value) 
        {
            $lower_param = strtolower($parameter);
            
            if (is_array($value)) 
            {
                // If two or more parameters share the same name, they are sorted by their value
                // Ref: Spec: 9.1.1 (1)
                natsort($value);
                foreach ($value as $duplicate_value)
                    $pairs[] = $lower_param . '=' . $duplicate_value;
            } 
            else 
            {
                $pairs[] = $lower_param . '=' . $value;
            }
        }            
        
        if (0 === count($pairs))
            return '';
    
        return implode('&', $pairs);
    }

    function urlencode_rfc3986($input) 
    {
        if (is_array($input))
            return array_map(array(&$this, 'urlencode_rfc3986'), $input);
        else if (is_scalar($input)) 
            return str_replace('+', ' ', str_replace('%7E', '~', rawurlencode($input)));
            
        return '';
    }

Basically what this code does is reorder the query string parameters by lexicographical order, and slightly tweak the URL encoding based on RFC 3986 URI syntax standard, to compensate for the different browsers + server URL encoding inconsistency.

Notes:

  1. In our case canonizeUrl, the canonization function, gets rid of the protocol. So https://domain.com and http://domain.com are both canonized to domain.com because we wanted to show the same rating widget on HTTP and HTTPS equivalent pages.
  2. As you can notice, we also ignore everything the after hashmark fragment. Therefore, if you would like to generate unique IDs for SPA (Single Page Application) different states like http://my-spa.com/#state1 and http://my-spa.com/#state2, the URL canonization function has to be modified to support that.

Converting String to unique Int64 ID for MySql BIGINT Indexed Column

After fooling around with various bit conversion functions like bindec(), decbin(), base_convert(). We have found out that 64 bit integers and PHP are not playing well. None of the mentioned functions consistently supports 64 bit. After digging around on Google, we were lead to a post about 32 bit limitations in PHP which included the suggestion to use GMP, a really cool library for multiple precision integers. Using this library, we managed to create this one line hash function that generates a 64 bit integer out of arbitrary length string.

function get64BitHash($str)
    {
        return gmp_strval(gmp_init(substr(md5($str), 0, 16), 16), 10);
    }

Post factum, we could have implemented the CRC64 algorithm which generates a string checksum and should perform faster than MD5. But the advantage of the technique we’ve used over CRC is that we’ve created a one-way-hash function, so we can reuse it for various cryptography purposes in the code.

To find out more about GMP, see here.

Grand Finale

Combining the URL canonization with the String to Int64 mapping, the final solution looks like this:

function urlTo64BitHash($url)
    {
        return get64BitHash(canonizeUrl($url));
    }

Collision and Performance Test of get64BitHash

Platform: Intel i3, Windows 7 64 bit, PHP 5.3
Iterations: 10,000,000 Times generated get64BitHash
Elapsed Time: 460 millisecond for every 100,000 generations
Collision: Not found

Summary

I hope this straightforward solution will save you time on your next project. If you have comments or any additional use-cases where this technique can be applied, please feel free to comment below.

Vova is a passionate serial entrepreneur and has been a professional full-stack developer since the age of 12. After his previous company Senexx was acquired by Gartner Inc., Vova left his position as CTO and founded RaitingWidget.com. His newest company is disrupting the internet ratings and reviews industry for publishers and eCommerce by providing a ridiculously simple, yet highly user friendly star ratings solution for web developers.

Get your free chapter of Level Up Your Web Apps with Go

Get a free chapter of Level Up Your Web Apps with Go, plus updates and exclusive offers from SitePoint.


  • http://mrxxiv.com/ Terrence Campbell

    This is highly useful for any service I attempt to create in the future. I do wonder, how does this compare to bcrypt? I’ve always wanted to learn more about PHP’s hashing capabilities.

    By the way, out of topic, PHP 5.3 support expires tomorrow.

    • svovaf

      Usually encryption takes more time than hashing. I’m not familiar with bcrypt, but I would carefully guess that the hashing technique should be more efficient. Obviously it requires performance testing.

      • Daniel

        of course it’s more efficient. the whole point of using blowfish hashing is to be slower to make bruteforce attacks take forever

    • http://evanbyrne.com Evan Byrne

      Bcrypt is the standard for password hashing because it is slow. However, if all you are doing is something like in the article (i.e., for data that is not sensitive), then something in the SHA family would probably suffice.

  • griesx

    why are you using substr(md5($string), 16); ? this will create possible collisions ?!

    • svovaf

      Using gmp_init with HEX numbers which are longer than 16 characters may result numbers out of the scale of 64 Bit.

      • griesx

        yeah but that is a problem as a md5 sum that gets split in half is not unique anymore so your proposed solution has a major flaw or am I getting something wrong here?

        • svovaf

          MD5 has collisions anyways. It’s impossible to map endless number of string variations to 32 characters string. But you are right, in theory it increase the probability for collision. Practically, it depends on the problem you are trying to solve. If your set of input strings is very large, you’ll have to deal with collisions.

          • http://yapf.blogspot.nl/ vinny42

            “MD5 has collisions anyways.”

            The first collision found was on binary data of >2kByte.

    • http://www.coolfactor.ca/ Ted Wood

      I agree with this assumption, and my modification was to take the first 8 characters of the MD5 string and join that with the last 8 characters, instead of always taking the first 16 characters. That _should_ reduce the chances of collisions, eh?

  • Michael Gelfand

    Good article! The only question I have, is why you decided to use Int64 and not md5 with binary representation (I mean store it not as string(32), but use binary(16) representation (128 bit), which should be almost same performance as int64).

  • Marc

    In my opinion this code has a big flaw, which is that URL paths, parameters and their values are case sensitive, and you are ‘strlowering’ them (in the case of parameter names not once, but twice, the second being unnecessary).

    This will lead the system to think that possibly two different pages are the same. For example, http://www.example.com/news/first-ufo-contact could be a different news than http://www.example.com/news/first-UFO-contact or http://www.example.com/?section=news could be a different section than http://www.example.com/?section=News. The second one is not likely, but it’s possible; the first one is a bit more likely in a site with a lot of news or other content (I’ve myself faced similar cases).

    Another big flaw is that you are not “canonizing” the host. This is a bit more complicated issue with no easy solution, but with your code, http://www.example.com/this/is/a/page would get a different ID than example.com/this/is/a/page, which in 99,99% of cases would be the same page. The “easy” way would be to add “www.” to any host missing a “subdomain”, but in some cases that wouldn’t be the best fit (it shouldn’t, but some people thinks it’s a good idea to make http://www.example.com a different page than example.com, which I’ve sadly seen too).

    On a side note, I’ve been checking your project’s documentation, and I have a feature suggestion for you (sorry if it’s already there and I missed it). In almost all multilingual projects, you will have different URLs for the “same” page, but in different languages. There’s mostly 3 use-cases: a different domain per language (example-en.com and example-es.com), a different subdomain per language (en.example.com and es.example.com) or a different url base path (example.com/en and example.com/es). In most cases, the URL path will also be different, especially when using clean URLs, which is almost all sites made nowadays.

    The suggestion is that there should be some way to tell the system that different pages are in reality the same page in different languages, and I would want all the ratings on them to be ratings to the same “page”. Even if they are in different languages, they are still the same page, with the same content, and I want all the ratings taken into account to the same thing. This would be specially important in multilingual eCommerce sites, where the same product is being rated but your code will think they are different products. This could be easily done checking for the ” head tag and/or the ‘Link: ; rel=”alternate”; hreflang=”es”‘ http header. It could also be done providing some API functionality.

    As a last comment, and as a Drupal developer myself, I can’t miss saying that I feel bad not seeing a Drupal module for the project! There are some good rating modules for it, but none “cloud” based, so they all add load and space consumed to your server and so a Drupal module would surely drive more customers to you, specially on big sites, which are Drupal’s specialty ;) I am maintainer of some modules so if you are interested in some advice or help, I’ll be glad to lend you a hand.

    And just to finish this wall of text (sorry! I can’t help it), I’d like to congratulate you for the project. Seems like a great idea that worked wonders, 180,000 sites using something isn’t easily achieved. Grats!

    • https://freemius.com Vova Feldman

      Hey Marc, thanks for the informative comment and for the warm feedback about RatingWidget.com. URI canonization is a real challenge and a topic by itself. As you’ve mentioned, each site has it’s own “server hacks”. To highlight the challenge, Google’s Blogger/Blogspot is even using totally different domains according to your Geo Location. So for example opening myblog.com from Israel will auto redirect to myblog.co.il (which will show the exact same content). Therefore, Blogger is using the canonical URL tag to specify the “right” URL of every page (read more: http://moz.com/blog/canonical-url-tag-the-most-important-advancement-in-seo-practices-since-sitemaps).

    • https://freemius.com Vova Feldman

      Hey Marc, thanks for the informative comment and for the warm feedback about RatingWidget.com. URI canonization is a real challenge and a topic by itself. As you’ve mentioned, each site has it’s own “server hacks”. To highlight the challenge, Google’s Blogger/Blogspot is even using totally different domains according to your Geo Location. So for example opening myblog.com from Israel will auto redirect to myblog.co.il (which will show the exact same content). Therefore, Blogger is using the canonical URL tag to specify the “right” URL of every page (read more: http://moz.com/blog/canonical-url-tag-the-most-important-advancement-in-seo-practices-since-sitemaps).

  • svovaf

    Thanks for the comment @disqus_ZtzWSTh07Z:disqus – sounds like a good idea. The reason we used Int64 is because of legacy reasons – all our ratings’ IDs has to be numeric.

    • https://www.raptor-editor.com/ Petah

      Whats wrong with crc64?

      • svovaf

        Petah, you are right! Thanks for the heads up. MD5 is not very much secure like encryption algorithms (AES, 3DES, etc). Instead of “various cryptography purposes” it would be more accurate to write “some cryptography purposes”. It really depends on the use case.

  • Gabriel

    Hi, I have just finished an unique ID generator (numeric, incremental and resilient) written in php (and willing to port it in C). Can you check http://www.webhead.ro/ and tell me your opinion?
    Thank you.

Related books & courses
Jump Start MySQL

Available on SitePoint Premium. Check it out now!