Danny Sullivan on Google Print

Danny Sullivan has posted an excellent analysis of the technical issues involved with Google Print, in Indexing Versus Caching & How Google Print Doesn’t Reprint.

The thrust of Danny’s argument, and I agree 100%, is that indexing the content of a book so that it becomes searchable is not the same thing as creating or publishing a copy of the book. He is correct about that, but his post perpetuates a misunderstanding about how search engines work. This misunderstanding is part of the reason why publishers think Google is “stealing” their intellectual property.

Danny describes the search engine index as resembling a big spreadsheet (emphasis added):

I’ve described the index… to being like a “big book of the web.” But it’s not, really. It’s more like a giant spreadsheet, where all the words of a page are in one row of the spreadsheet, each word to a different column, then the next page in the row below that, and so on.

Actually, the index is far less readable than a spreadsheet, because search engines store word occurrences, not documents, when they create their index. It’s not a row for every document; it’s a table of occurrences for every word.

If the word “defenestration” appears on a web page, search engines like Google will store a Document ID (referencing the URL), the location within the page (say, the 342nd word), and some other attributes, like whether it was in italics. This occurrence will be stored in the database with all the other occurrences of “defenestration,” not in a separate record for that document.
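To make that concrete, here is a minimal sketch of a word-occurrence index in Python. It is a toy illustration, not Google’s actual storage format: the `build_index` function, the sample documents, and the choice to record only a document ID and a word position (no italics flag or other attributes) are all my own assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to a list of (doc_id, position) occurrences."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

# Hypothetical two-document collection.
docs = {
    1: "defenestration is the act of throwing someone out of a window",
    2: "an instruction manual for window repair",
}
index = build_index(docs)

# All occurrences of a word live together, regardless of document.
print(index["window"])
```

Notice that nothing in `index` is keyed by document. The documents only show up as ID numbers scattered across the per-word lists.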

Indexing by words comes in very handy, because people search with words. If I search for a “defenestration instruction manual,” the search engine can quickly find all of the documents listed in the index for all three of those words. Looking through the occurrence lists for three words is a lot faster than searching through 8 billion documents.
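A rough sketch of that lookup, assuming a toy index where each word maps to a list of (document ID, position) pairs. The `search` function and the sample data are hypothetical, and a real engine would also rank the results rather than just intersect them.

```python
def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    result = None
    for word in query.lower().split():
        # Collect the documents listed under this one word.
        doc_ids = {doc_id for doc_id, _ in index.get(word, [])}
        # Keep only documents that also matched the earlier words.
        result = doc_ids if result is None else result & doc_ids
    return result or set()

# Hypothetical occurrence lists: word -> [(doc_id, position), ...]
index = {
    "defenestration": [(1, 0), (3, 5)],
    "instruction": [(1, 1), (2, 1)],
    "manual": [(1, 2), (2, 2), (3, 9)],
}

matches = search(index, "defenestration instruction manual")
print(matches)
```

The query touches exactly three lists, one per word, no matter how many documents were indexed.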

There is no separate “index” of the document itself. The word occurrences are stored; the document is disposable. To reconstruct a document, you’d have to look into every word index, find all the word occurrences that matched the document ID, and put them all back together.
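Here is a small sketch of what that reconstruction would involve, again assuming a toy index of (document ID, position) pairs per word. The `reconstruct` function and the data are made up for illustration.

```python
def reconstruct(index, doc_id):
    """Rebuild a document's word sequence by scanning every word's occurrence list."""
    found = []
    for word, occurrences in index.items():
        for occ_doc_id, position in occurrences:
            if occ_doc_id == doc_id:
                found.append((position, word))
    # Sort by position to put the words back in their original order.
    return " ".join(word for _, word in sorted(found))

# Hypothetical occurrence lists: word -> [(doc_id, position), ...]
index = {
    "manual": [(1, 2), (2, 1)],
    "defenestration": [(1, 0)],
    "instruction": [(1, 1)],
    "repair": [(2, 0)],
}

text = reconstruct(index, 1)
print(text)
```

Even in this toy version, rebuilding one document means visiting every word in the whole index, and what comes back is only the word sequence. Anything the index didn’t record, like punctuation or layout, is gone.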

In the case of web pages, a cached copy is usually kept, but Google Print isn’t offering that. If you search for some words with Google Print, they’ll tell you which books the words occurred in, and give you a very small snippet of context, which is about as close a fit to “fair use” as you’ll find… I just quoted a far more substantive piece of Danny’s intellectual property than Google Print ever would, and I’m well within the bounds of reasonable fair use.

I know that Danny understands all this stuff perfectly well, and he’s just trying to make a point. So let me hammer on that point, because I agree with him.

I used to work at FedEx Kinko’s, a world leader in document management solutions. I know how much businesses are willing to pay to get their legacy documents into a searchable electronic format. What I can’t understand is why publishers aren’t doing cartwheels when they see Google doing the job for them, for free.

  • http://www.lowter.com charmedlover

    They probably like it, but they know they can get some money being a pain about it. That’s how it works, sadly.

  • http://boyohazard.net Octal

    As you say, it’s the misunderstanding of the system. I also think the fact that it’s a free service, as compared to a paid-for service (e.g. from FedEx Kinko’s), is something they cannot readily accept. A lot of people are skeptical of the word “free”.

  • imran

    Hi

    I did like your blog postings and comments. I have one question.

    Can anyone harm our rankings in Google or any other search engine?

    Regards

    Imran Hashmi

  • chukshen

    There are definite barriers to Google winning this war of print. It would require changes to legislation on privacy, copyright and a host of other areas, similar to the music and copyright laws that drew attention when MP3 downloading became big business.

    Chukshen, DMAFB
    http://www.domeafavorbuddy.com

  • http://www.seoresearchlabs.com DanThies

    Just to clarify, as it’s come up a couple of times by email, the big issue is with Google’s library initiative, not the Google Print product.