Data Structures, Big O and You
Data is at the heart of any nontrivial application – and while Ruby provides excellent Array and Hash classes with some seriously feel-good syntax support, anything else is rather sparse in the standard library.
When you’re picking a data structure for your project beyond the standard Hash and Array the main consideration, and indeed the main motivation for switching, is often the efficiency of the data structure. In this article I’ll be taking a high level look at how we can go about comparing data structures in terms of efficiency.
We’ll use link lists as an example.
A Quick Aside on Array
For the purposes of this article, when I talk about an Array I mean Ruby’s implementation of Array, which is a fair bit different to that of many lower level programming languages. For one, it has no restrictions on the type nor size of the data that can be inserted. Secondarily, it’s dynamically re-sizable, meaning operations which would increase size no longer (or rarely) require the expensive step of copying the array into a bigger brother.
All examples and benchmarks are given using ruby 1.9.3, with it’s naively implemented
Array class – meaning instances where two data structures have the same complexity class for a given operation but one is in pure ruby and one is in C the naively implemented structure will be quicker.
In order to understand why you’d choose one data structure over another, it’s useful to be able to measure and categorize how they perform as their input grows. We place algorithms with a similar cost into complexity classes in order to compare them. Some algorithms take the same amount of time to run no matter the input, some grow linearly with their input and some grow exponentially, for example.
When we talk about ‘time’ to run, it’s rare that two algorithms would be compared using benchmarks – which are susceptible to differences in things like the programming language, machine speed and garbage collector activity. Instead, computer scientists will look at the amount of meaningful actions an algorithm has to perform in order to complete.
A meaningful action is rather domain dependent. When talking about sorting algorithms, for instance, a meaningful action might be a comparison of one value with another.
Algorithms also tend to have a constant cost. That is, the part of the total cost which is not dependent on the size of the input. When you’re thinking about large data sets you can usually disregard this cost. For any non-trivial application with a large data set the growth rate will quickly dwarf even the largest constant cost as the input grows.
For example, if an algorithm has a constant cost of ’10’ (let’s say seconds) but takes input^2 time to run then for an input of 2 the constant might be significant chunk of the 14 seconds it would take to run. But what if your input was, say, 1 billion? Well, you would have needed to start the process some time before the big bang. In that context, the constant pales into insignificance.
There’s some very handy notation we can use when we’re comparing algorithms. Big O is used to express the worst case scenario for a given algorithm. So, we could say that algorithm x is O(n) meaning that, in its worst case algorithm x will have a runtime of 15 meaningful operations for a input of 15. If x had a runtime of O(n*15), however, we’d be looking at 225 meaningful operations.
Linked lists are one of the simplest data structures you can get, along with traditional arrays. Whereas an array is a set of allocated memory segments with stuff inside them, a linked list is a set of nodes where each node has a pointer to the next node.
Think of a chest of 10 drawers. With an Array you have each of these drawers numbered 0 to 9 with some value in each of the draws. To find a element in drawer 2 you simply open that drawer and you’ve got your item.
In a linked list, however, the numbers of the draws have no relevance. They could be random numbers, colours or pictures of cats. All you know from the outset is which drawer is the first in the list. To find the second item you must open the first drawer and read an instruction that points you to the location of the next drawer in the sequence.
If we’re iterating though both an Array and a Linked list from start to finish, then it’ll take O(n) to go though the whole list. For each element in the list we must open exactly one drawer.
The difference comes, however, when we try and find an element at i in the list. In the case of the Array we’re able to just open the drawer and be done with it – giving random access a worst case scenario of O(1). In the linked list, however, we’ve got to run through all the drawers by stepping though the pointers and find our element. If the element we’re after is at the end of the list then we’ll have to do n drawer openings giving a complexity of O(n).
You might ask why we’d ever use a linked list then? Well, in cases where we need to insert a element somewhere in the list (and assuming we already know which drawer the element we want to insert after is in) we can simply replace the pointer to the next element in the previous elements drawer to our new element in our new drawer and then have our new drawer point to the element which was previously in that position. That means that insertion will take no more that one drawer open so we have a complexity of O(1).
With arrays, however, if we want to put a new element in the top drawer then we have have to move every element in every drawer below that down one. That means we’ll have open up to n drawers giving a complexity of O(n).
Let’s look at some sample code, with a simple linked list implementation, and some benchmarks to see how this plays out.
In the first example, we’re building up our list. Both these operations have a complexity of O(n). You’ll notice that the Array version comfortably beats the linked list. That’s because, as I mentioned earlier, we’re comparing a native 1.9.3 Array class with a pure Ruby implementation which has some built in performance costs to each operation.
In the second benchmark we’re choosing 10 random elements in the lists and adding 10000 items after that point to each collection. This example really shows the strength of linked lists and how the difference in their complexity classes for that operation translates to actual benchmark differences.
Hopefully, this article has given you a reasonable, if brief, introduction to the world of algorithm and data structure complexity, as well as a sense of how a complexity class impacts real world performance. If you’re looking to write applications which deal with large amounts of data in a workable manner, it’s vital you work out which operations on that data you’ll end up doing most often and choose a data structure that does that operation well. Our linked list class, for example, would be a great candidate if you’re planning on doing lots of inserting into the middle of lists and not much random access on those lists.
If you’ve got any questions or if I’ve been wrong about anything I’ll be hanging around the comment section.