Danny Sullivan on Google Print

Dan Thies

Danny Sullivan has posted an excellent analysis of the technical issues involved with Google Print, in “Indexing Versus Caching & How Google Print Doesn’t Reprint.”

The thrust of Danny’s argument, and I agree 100%, is that indexing the content of a book so that it becomes searchable is not the same thing as creating or publishing a copy of the book. He is correct about that, but his post perpetuates a misunderstanding about how search engines work. This misunderstanding is part of the reason why publishers think Google is “stealing” their intellectual property.

Danny describes the search engine index as resembling a big spreadsheet (emphasis added):

I’ve described the index… to being like a “big book of the web.” But it’s not, really. It’s more like a giant spreadsheet, where all the words of a page are in one row of the spreadsheet, each word to a different column, then the next page in the row below that, and so on.

Actually, the index is far less readable than a spreadsheet, because search engines are storing word occurrences, not documents, when they create their index. It’s not a row for every document; it’s a table of occurrences for every word.

If the word “defenestration” appears on a web page, search engines like Google will store a Document ID (referencing the URL), the location within the page (the 342nd word), and some other stuff like whether it was in italics or whatever. This occurrence will be stored in the database with all the other occurrences of “defenestration,” not in a separate record for that document.
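To make that concrete, here is a minimal sketch of a word-occurrence index in Python. It is only an illustration, not how Google actually does it: the function name build_index, the sample documents, and the choice to store nothing more than a (document ID, position) pair are all my own assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to a list of (doc_id, position) occurrences.

    Hypothetical helper for illustration; a real engine would also store
    things like formatting and would normalize words more carefully.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Positions are 1-based, matching "the 342nd word" in the text above.
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].append((doc_id, position))
    return index

# Made-up sample documents, keyed by document ID.
docs = {
    1: "defenestration is a fine word",
    2: "a manual on defenestration",
}
index = build_index(docs)
print(index["defenestration"])  # [(1, 1), (2, 4)]
```

Notice that there is no entry anywhere for "document 2" as a whole. Its words are scattered across the lists for "a," "manual," "on," and "defenestration."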

Indexing by words comes in very handy, because people search with words. If I search for a “defenestration instruction manual,” the search engine can quickly find all of the documents that are listed in the index under all three of those words. Searching through three word indexes is a lot faster than searching through 8 billion documents.
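A rough sketch of that lookup, again in Python: pull the list of occurrences for each query word and keep only the document IDs that show up in every list. The search function and the tiny hand-built index are hypothetical, and real engines add ranking, phrase matching, and a great deal more.

```python
def search(index, query):
    """Return the sorted IDs of documents that contain every query word."""
    matches = None
    for word in query.lower().split():
        # Collect the document IDs from this word's occurrence list.
        doc_ids = {doc_id for doc_id, _ in index.get(word, [])}
        matches = doc_ids if matches is None else matches & doc_ids
    return sorted(matches or [])

# Hand-built sample index: word -> [(doc_id, position), ...]
index = {
    "defenestration": [(1, 1), (2, 4), (3, 2)],
    "instruction": [(2, 1), (3, 5)],
    "manual": [(2, 2), (4, 7)],
}
print(search(index, "defenestration instruction manual"))  # [2]
```

Only three lists get touched, no matter how many documents were indexed in the first place.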

There is no separate “index” of the document itself. The word occurrences are stored; the document is disposable. To reconstruct a document, you’d have to look into every word index, find all the word occurrences that matched the document ID, and put them all back together.
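Here is what that reconstruction would look like as a sketch, assuming a simple index that maps each word to a list of (document ID, position) pairs. The reconstruct function and the sample data are made up for illustration.

```python
def reconstruct(index, doc_id):
    """Rebuild a document's word sequence by scanning every word's list."""
    found = []
    for word, occurrences in index.items():
        for occ_doc_id, position in occurrences:
            if occ_doc_id == doc_id:
                found.append((position, word))
    # Sort by position to put the words back in their original order.
    return " ".join(word for _, word in sorted(found))

# Hand-built sample index: word -> [(doc_id, position), ...]
index = {
    "a": [(1, 3), (2, 1)],
    "defenestration": [(1, 1), (2, 4)],
    "fine": [(1, 4)],
    "is": [(1, 2)],
    "manual": [(2, 2)],
    "on": [(2, 3)],
    "word": [(1, 5)],
}
print(reconstruct(index, 2))  # a manual on defenestration
```

Even in this toy version, getting one document back means walking through the entire index, and whatever the indexer threw away (capitalization and punctuation here) is gone for good.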

In the case of web pages, a cached copy is usually kept, but Google Print isn’t offering that. If you search for some words with Google Print, they’ll tell you which books the words occurred in, and give you a very small snippet of context, which is about as close a fit to “fair use” as you’ll find… I just quoted a far more substantive piece of Danny’s intellectual property than Google Print ever would, and I’m well within the bounds of reasonable fair use.

I know that Danny understands all this stuff perfectly well, and he’s just trying to make a point. So let me hammer on that point, because I agree with him.

I used to work at FedEx Kinko’s, a world leader in document management solutions. I know how much businesses are willing to pay to get their legacy documents into a searchable electronic format. What I can’t understand is why publishers aren’t doing cartwheels when they see Google doing the job for them, for free.