Is it ok to crawl Google?

Hi, I am looking into ideas for my final-year dissertation at university. One idea involves creating a web crawler that samples Google search results and analyses the HTML of the pages retrieved, to gain some insight into Google's page ranking.

I have been told that Google doesn’t like people crawling it (sort of ironic :-p) but that you are allowed to do it as long as you stay under a specific limit per day. The problem is, I can’t find anything about it from Google itself.

Does anyone have any experience with this, or know of a useful link about it?

Thanks, ro0bear :slight_smile:

During my final year at university I built a search engine, although I created the entire thing myself (crawler, indexer, etc.). I did, however, build a comparison tool to use with Google searches.

From what I vaguely remember, Google has a limit on the number of searches one can perform with its API (JSON, if I recall), and the searches themselves are limited. However, even this limit is quite large, and for something like a dissertation that likely won’t be stress tested through thousands of queries, I think you’ll be fine.
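If you want to be careful about staying under whatever the daily limit turns out to be, a small client-side counter is one way to do it. This is only a rough sketch in Python: `DailyQueryBudget` is a made-up helper, and the cap of 100 in the usage note is a placeholder, not Google's actual figure, so check the current API documentation for the real number.

```python
import datetime


class DailyQueryBudget:
    """Hypothetical helper: enforces a self-imposed cap on queries per day."""

    def __init__(self, daily_limit, today=datetime.date.today):
        self.daily_limit = daily_limit
        self._today = today          # injectable clock, handy for testing
        self._day = today()
        self._used = 0

    def _roll_over(self):
        # Reset the counter when the calendar day changes.
        current = self._today()
        if current != self._day:
            self._day = current
            self._used = 0

    def remaining(self):
        self._roll_over()
        return self.daily_limit - self._used

    def try_consume(self):
        """Return True and count one query if budget is left, else False."""
        self._roll_over()
        if self._used >= self.daily_limit:
            return False
        self._used += 1
        return True
```

You would create it once, e.g. `budget = DailyQueryBudget(100)`, call `budget.try_consume()` before each search request, and skip or postpone the request when it returns `False`.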

One tip I’ll give you: if you’re using .NET, give HtmlAgilityPack a try. If you’re not and you want to analyse HTML, you’ll likely need an HTML parser. Look into what kind of language HTML is (it isn’t a regular language, since elements can nest to arbitrary depth) and why regex isn’t adequate for parsing it, and I’m sure you’ll get another mark or two. It certainly worked for me.
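To give a rough idea of what using a parser looks like, here is a minimal sketch in Python (not .NET) using the standard library's `html.parser`. The `PageSummary` class and `summarise` function are names I made up for illustration; a real analysis would probably want a more forgiving third-party parser.

```python
from collections import Counter
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Hypothetical example: counts tags and collects link targets."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag, however deeply it is nested.
        self.tag_counts[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def summarise(html):
    """Parse an HTML string and return the populated PageSummary."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return parser


sample = (
    '<div><div><a href="/a">A</a></div>'
    '<a href="/b">B</a><!-- <a href="/c">C</a> --></div>'
)
summary = summarise(sample)
print(summary.links)
print(summary.tag_counts["div"])
```

Note that the link inside the comment is ignored because the parser knows it is a comment, whereas a naive pattern such as `<a href="(.*?)">` would pick it up. That kind of context is exactly what regex struggles with.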

Thanks ULTiMATE :slight_smile: