Ruby
Article

Diving into How Hashes Work in Ruby

By David Bush

Ruby’s Hash is arguably the most useful data type in the language. Utilizing a structure that lends itself well to modelling real-world problems, combined with speedy lookups, it is frequently the backbone of scripts, web pages, and complex business logic. It was even once used internally by Ruby to store all your functions, variables, and constants (although modern Ruby approaches this differently by using branch prediction of hot paths and call-site method caching; that’s another post!) We often take it for granted that it “just works”. Today we’re going to delve into the internals of Hash, seeing exactly why it works, how it is so efficient, and how it is implemented. Onward!

Implementation

big_four_stock_prices = {
    :google => 801.23,
    :apple => 115.57,
    :facebook => 128.35,
    :microsoft => 57.19
}

What’s going on here? Let’s walk through it.

First, the runtime allocates a contiguous block of memory for the data structure. Next it hashes the first key:

:google.hash
=> 3592

The hash function is declared in Ruby’s kernel class, so it’s available to any class with that in its inheritance chain. The actual number will vary depending on a number of factors, but what is important is that it remains consistent across a session.

Next, we have to store the value. The value is placed in what is called a ‘bucket’ (or a ‘bin’ if you’re coming from other languages). A Ruby hash will allocate 11 buckets initially, so we need to decide in which bucket to store our hash value. For that, the runtime simply performs the following calculation:

 :google.hash % 11
 => 5

This means that the value for :google, which is 801.23, is placed in the 5th bucket of the hash table, with an identifier for its key that is something like:

[:google, 801.23]

It repeats this process for the rest of the key/value pairs. To pull it back out, the interpreter simply repeats the calculation and pulls out the value from the bucket. Simple right? Let’s get into the edge cases.

Collisions

The more astute readers among you may have caught on to the fact that if there are only 11 possible buckets, sooner or later the calculation is going to place two values in the same bucket. What happens then? Well, Ruby’s hash implementation will store the items in the same bin in the form of a linked list. If the identifier (the key) matches the one stored (it checks this with equal?), it returns the value, and if it doesn’t, it simply continues traversing the list.

Of course, this process could be time consuming – what if an item shared a bucket with 1000 others? It could theoretically take a large amount of time to return the desired value, and the whole point of using a hash table is that it is fast. Ruby, of course, has you covered. It does this by keeping a running total of the average number of items in each bin across the whole table. Once this value exceeds 5, Ruby discards the array of 11 bins, creates a new one and it is at this point where Ruby will rehash the values. By doing this, the linked lists in bins are kept short, which decreases overall lookup time.

Lookup time is the key selling point for a hash. For the Big O Notation-inclined amongst you, a hash table will look up values in constant time; that is that as the number of values in the table increases, the lookup time remains the same. In practice, it’s closer to O (log n), but let’s not get too pedantic!

The Importance of a Good Hashing Algorithm

To cement our understanding of the hash implementation, let’s try a brief experiment: What would happen if a custom hashing function returned the same integer each time? Our mental model of the hash suggests that this will append each item to a linked list in the same bin, and the time to look up an item towards the end of that list will be much higher. Let’s try with some code:

require 'benchmark'

class GoodHash
end

class BadHash
  def hash
    7
  end
end

hashing_algorithm == BadHash  # switch to GoodHash for inverse

17.times do |ex|
  lookup_key = nil
  size = 2**ex
  test_hash = {}
  (1..size).each do |n|
    index = hashing_algorithm.new
    lookup_key = index if n > size/2 and lookup_key.nil?
    test_hash[index] = rand
  end

  Benchmark.bm do |bench|
    bench.report("Hash size: #{size} elements 10000 times\n") do
      10000.times do
        test_hash[lookup_key]
      end
    end
  end
end

This will generate a hash (of varying sizes) similar to this:

test_hash = {
    #<BadHash:0x007f92941eaae0> => 0.029655456018705673,
    #<BadHash:0x007f92941e3c68> => 0.973576967117432,
    #<BadHash:0x007f92941e1508> => 0.2443654110192186,
    #<BadHash:0x007f92941d1ab8> => 0.7536013342129247
}

Let’s walk through the code:

First we create a class, GoodHash, which uses Ruby’s native hashing algorithm. Then we loop 17 times (65536 items seems a good place to stop!), exponentiating a base of 2 to the power (ex). We then insert a random value into test_hash using an instance of GoodHash as the key. We look up on that value 1000 times and benchmark the results, before repeating the process with BadHash

Results (readings taken from ‘real’ column):

Hash size GoodHash BadHash
1 0.002808 0.003077
2 0.002841 0.003663
4 0.003018 0.004167
8 0.005757 0.004750
16 0.003502 0.006898
32 0.003174 0.011138
64 0.002843 0.019534
128 0.003080 0.041806
256 0.003159 0.086643
512 0.003137 0.147724
1024 0.003179 0.287432
2048 0.002843 0.554049
4096 0.002828 1.101022
8192 0.005380 2.192948
16384 0.003037 4.385111
32768 0.002830 8.739381
65536 0.002830 17.314654

This proves our hypothesis fairly conclusively. As the number of items in the hash grows, the lookup time with the BadHash class grows exponentially, while the GoodHash implementation remains firmly grouped around the same sort of time. The outliers in the GoodHash column also illustrate Ruby rehashing once the buckets get too full.

Interestingly, Ruby 2.0 onwards doesn’t work like this for smaller hashes. If your hash is 6 items or fewer, it is stored as an array list and traversed until the key matches the lookup. While hash tables are quick, by far the biggest cost is calculating the hash value; a process neatly sidestepped with this approach. In the event that the hash grows to 7 items, the array is discarded and the traditional hash is used. This is illustrated in the results table as with the entry at a hash size of 8.

How Does Ruby Allocate Bins?

Since the number of bins don’t increase linearly with the number of items in them, how does Ruby decide how many bins a hash should have? Well, Ruby calculates a list of prime numbers close to powers of 2. The benefit of this is that the number when combined with the hash value is much less likely to share a common factor. If this wasn’t true, then the modulus calculation we covered earlier could yield the same number each time, which would put everything into the same bin, not unlike our experiment prior.

Conclusion

For a dive into the internals of Ruby, this was relatively straightforward. While we’ve covered hashes pretty extensively, it is important to know when and where to use them. By far the largest costs in hashing are the rehashing, hash calculation, and bucketing, so be mindful of these when performance is a factor (the JRuby implementation avoids relying on hashes internally for this exact reason). Generally however, a hash is an excellent tool to reach for, given its intelligent design and careful optimizations. We now know exactly how a hash works, why it is so performant, and never to override the hash function for an object!

More:
  • Joe

    Great post!

    Just a quick edit for your test code:

    `hashing_algorithm == BadHash # switch to GoodHash for inverse` has an extra `=` so it doesn’t assign the variable.

    • dbush

      Good catch Joe! That’s what I get for editing it after I’ve run it!

  • Fred@Bootstrap

    Great insight into the Hash Internals! Performance isn’t something you always think of when using hashes but this article demonstrates why we should!

    • dbush

      Absolutely! I think knowing your data structures well is a powerful tool in and of itself.

  • Great article, David! When you override the `hash` method in an Object you will definitely notice performance issues and most likely unexpected behavior in your application.

    I’ve learned this the hard way (by accidentally doing it in the past). Hopefully this article will help other developers not to make that mistake. 👍

    • dbush

      Thanks for the kind words Ernesto. That’s the beauty of Ruby I guess – with great power, comes great responsibility!

Recommended
Sponsors
Get the latest in Ruby, once a week, for free.