Danny Sullivan on Google Print

Danny Sullivan has posted an excellent analysis of the technical issues involved with Google Print, in “Indexing Versus Caching & How Google Print Doesn’t Reprint.”

The thrust of Danny’s argument, and I agree 100%, is that indexing the content of a book so that it becomes searchable is not the same thing as creating or publishing a copy of the book. He is correct about that, but his post perpetuates a misunderstanding about how search engines work. This misunderstanding is part of the reason why publishers think Google is “stealing” their intellectual property.

Danny describes the search engine index as resembling a big spreadsheet (emphasis added):

I’ve described the index… to being like a “big book of the web.” But it’s not, really. It’s more like a giant spreadsheet, where all the words of a page are in one row of the spreadsheet, each word to a different column, then the next page in the row below that, and so on.

Actually, the index is far less readable than a spreadsheet, because search engines are storing word occurrences, not documents, when they create their index. It’s not a row for every document; it’s a table of occurrences for every word.

If the word “defenestration” appears on a web page, search engines like Google will store a Document ID (referencing the URL), the location within the page (the 342nd word), and some other stuff like whether it was in italics or whatever. This occurrence will be stored in the database with all the other occurrences of “defenestration,” not in a separate record for that document.
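To make that layout concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not how Google or any real engine stores its data: the `Occurrence` record, the `italic` flag, and the two tiny sample documents are all made up for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Occurrence(NamedTuple):
    """One hit for one word (a hypothetical record layout)."""
    doc_id: int    # stands in for the URL
    position: int  # 1-based word position within the page
    italic: bool   # example of the "other stuff"; always False in this sketch


def build_index(docs):
    """Map each word to the list of places it occurs, across all documents."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].append(Occurrence(doc_id, position, False))
    return index


# Made-up sample documents, keyed by document ID.
docs = {
    1: "defenestration instruction manual",
    2: "a short history of defenestration",
}
index = build_index(docs)

# Every occurrence of the word sits together, whichever document it came from.
print(index["defenestration"])
```

Notice that nothing in `index` is keyed by document. The only way to learn anything about document 2 is to go looking through the word entries.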

Indexing by words comes in very handy, because people search with words. If I search for a “defenestration instruction manual,” the search engine can quickly find all of the documents listed in the index for all 3 of those words. Searching through 3 word indexes is a lot faster than searching through 8 billion documents.
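A rough sketch of that lookup, again in Python and again under simplifying assumptions: each word maps to a plain set of document IDs, the sample documents are invented, and a "match" just means every query word appears somewhere in the document. Real engines rank results and do a great deal more.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # One lookup per query word; no document is ever scanned.
    doc_sets = sorted((index.get(word, set()) for word in words), key=len)
    result = set(doc_sets[0])  # start from the rarest word
    for doc_set in doc_sets[1:]:
        result &= doc_set
    return result


# Made-up sample documents, keyed by document ID.
docs = {
    1: "defenestration instruction manual for beginners",
    2: "instruction manual for a toaster",
    3: "a history of defenestration",
}
index = build_index(docs)
matches = search(index, "defenestration instruction manual")
print(matches)
```

The work done by `search` depends on how many query words there are and how long their entries are, not on how many documents were indexed in total.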

There is no separate “index” of the document itself. The word occurrences are stored; the document is disposable. To reconstruct a document, you’d have to look into every word index, find all the word occurrences that matched the document ID, and put them all back together.
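Here is what that reconstruction would look like as a small Python sketch. The index layout (each word mapped to a list of `(doc_id, position)` pairs) and the sample documents are assumptions made for the illustration, not a description of any real engine.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].append((doc_id, position))
    return index


def reconstruct(index, doc_id):
    """Rebuild one document by visiting every word entry in the index."""
    found = []
    for word, occurrences in index.items():
        for occurrence_doc_id, position in occurrences:
            if occurrence_doc_id == doc_id:
                found.append((position, word))
    # Sort by position to put the words back in their original order.
    return " ".join(word for _, word in sorted(found))


# Made-up sample documents, keyed by document ID.
docs = {
    1: "defenestration instruction manual",
    2: "a short history of defenestration",
}
index = build_index(docs)
rebuilt = reconstruct(index, 2)
print(rebuilt)
```

Even in this toy version, `reconstruct` has to touch every word in the whole index to recover a single document, and it gets back only what the index kept. Here that means lowercase words with no layout, since that is all `build_index` stored.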

In the case of web pages, a cached copy is usually kept, but Google Print isn’t offering that. If you search for some words with Google Print, they’ll tell you which books the words occurred in, and give you a very small snippet of context, which is about as close a fit to “fair use” as you’ll find… I just quoted a far more substantive piece of Danny’s intellectual property than Google Print ever would, and I’m well within the bounds of reasonable fair use.

I know that Danny understands all this stuff perfectly well, and he’s just trying to make a point. So let me hammer on that point, because I agree with him.

I used to work at FedEx Kinko’s, a world leader in document management solutions. I know how much businesses are willing to pay to get their legacy documents into a searchable electronic format. What I can’t understand is why publishers aren’t doing cartwheels when they see Google doing the job for them, for free.

Frequently Asked Questions about Danny Sullivan and Google Print

Who is Danny Sullivan?

Danny Sullivan is a renowned technologist and journalist who has been a significant figure in the search engine industry for over two decades. He is the co-founder of Search Engine Land, an industry publication that provides news and insights about SEO and SEM. Currently, he is the Public Liaison for Search at Google, where he helps the public understand Google’s search products and policies.

What is Google Print?

Google Print, now known as Google Books, is a service from Google that searches the full text of books and magazines that Google has scanned, converted to text using optical character recognition, and stored in its digital database. It’s a valuable resource for researchers, students, and anyone interested in exploring the world’s knowledge.

What is Danny Sullivan’s role at Google?

As the Public Liaison for Search at Google, Danny Sullivan’s role is to help the public better understand Google’s search products and the policies surrounding them. He acts as a bridge between Google and the public, answering questions, addressing concerns, and providing insights into how Google Search works.

How has Danny Sullivan contributed to the search engine industry?

Danny Sullivan has made significant contributions to the search engine industry. He co-founded Search Engine Land, a leading industry publication that provides news and insights about SEO and SEM. His expertise and insights have helped shape the industry’s understanding of search engines and their impact on digital marketing.

What is the significance of Google Books in the digital world?

Google Books, formerly known as Google Print, has revolutionized the way we access information. By digitizing books and making them searchable, Google has made a vast amount of knowledge accessible to anyone with an internet connection.

How can I follow Danny Sullivan’s work?

You can follow Danny Sullivan’s work through various platforms. He often shares insights and updates on his Twitter account. You can also read his articles on Search Engine Land and his personal website. Additionally, you can follow his work at Google through the Google SearchLiaison Twitter account.

What is Danny Sullivan’s background in technology?

Danny Sullivan has a rich background in technology, particularly in the field of search engines. He has been a leading voice in the search engine industry for over two decades, co-founding Search Engine Land and serving as an advisor for Third Door Media. He is currently the Public Liaison for Search at Google.

How does Google Books work?

Google Books works by scanning books and magazines and converting them into text using optical character recognition. These texts are then stored in Google’s digital database and made searchable. Users can search for specific phrases or keywords and see snippets of text from the books where those phrases appear.

What is Danny Sullivan’s impact on Google?

As the Public Liaison for Search, Danny Sullivan has a significant impact on Google. He helps the public understand Google’s search products and policies, acting as a bridge between Google and its users. His expertise and insights have helped shape Google’s approach to search and its communication with the public.

Where can I find more information about Danny Sullivan?

You can find more information about Danny Sullivan on his personal website, his LinkedIn profile, and his Twitter account. You can also read his articles on Search Engine Land and follow his work at Google through the Google SearchLiaison Twitter account.