I have a client who has handed me quite the doozy of a task, and I am not sure of the best way to help them, so I thought I would ask here for suggestions.
The task involves scraping or otherwise getting a list of all pages Google has indexed for a site, then comparing that list of URLs to the actual page URLs on the site, so that any discrepancies between the two can be found and corrected with 301 redirects or whatever else is appropriate.
As to the first part…here is what seems best.
- Create a PHP script to scrape the URLs from one saved SERP page.
- Go to Google and, using the site: operator, manually get Google to return all pages from the site that it has in its index (100 results at a time).
- Save each such SERP page of 100 results, one at a time, to a file in a directory until I have gone through all Google results for the site.
- Run the script on all files in that directory to scrape each saved page and write out a CSV file containing all indexed pages.
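For what it's worth, the scraping step could look something like the sketch below. It is in Python rather than PHP, and it rests on assumptions: that the saved pages end in `.html`, that the result links appear as plain `<a href>` values pointing at the site, and that the names `urls_for_site` and `scrape_directory` are just made up for illustration. Saved Google pages may wrap result links in a redirect URL, which this sketch does not unwrap.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def urls_for_site(html, site_host):
    """Return unique hrefs whose host is site_host or one of its subdomains."""
    parser = LinkCollector()
    parser.feed(html)
    found = []
    for href in parser.hrefs:
        host = urlparse(href).netloc.lower()
        if host == site_host or host.endswith("." + site_host):
            if href not in found:
                found.append(href)
    return found


def scrape_directory(directory, site_host, out_csv):
    """Scan every .html file in directory and write the unique URLs to a CSV."""
    seen = []
    for path in sorted(Path(directory).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        for url in urls_for_site(html, site_host):
            if url not in seen:
                seen.append(url)
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["url"])  # header row
        writer.writerows([u] for u in seen)
    return seen
```

Calling `scrape_directory("serps", "example.com", "indexed.csv")` would then produce a one-column CSV, where the directory name, host, and output file are placeholders.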
For the second part…it would seem that I need to run a web crawler on the site to get a list of its current pages.
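A crawler for this purpose does not have to be fancy. Here is a rough Python sketch of a breadth-first crawl that stays on the starting host and follows only `<a href>` links. The function names, the `max_pages` cap, and the injectable `fetch` argument are all assumptions made for illustration; a real crawl would also want to respect robots.txt, skip non-HTML files, and throttle its requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _Links(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch_html(url):
    """Default fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=500):
    """Breadth-first crawl of same-host links; returns the sorted URLs fetched."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # page could not be fetched: skip it
        visited.append(url)
        parser = _Links()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return sorted(visited)
```

Passing a different `fetch` function makes it possible to try the crawl logic against canned pages before pointing it at the live site.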
Once I have the two lists of URLs, I can use Meld under Linux to compare them, see where the discrepancies are, and then visit each such URL if I need to in order to decide what to do with it.
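As an alternative to eyeballing the lists in a diff tool, the comparison can be done with set differences. The Python sketch below is only an illustration: `normalize` and `compare` are made-up names, and treating URLs that differ only in host capitalization or a trailing slash as the same page is an assumption that may not hold for every site.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Lower-case scheme and host, drop a trailing slash and any #fragment."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    query = "?" + parts.query if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def compare(indexed, crawled):
    """Return URLs found in only one of the two lists."""
    indexed_set = {normalize(u) for u in indexed}
    crawled_set = {normalize(u) for u in crawled}
    return {
        # In Google's index but not on the site: candidates for a 301.
        "indexed_only": sorted(indexed_set - crawled_set),
        # On the site but not in Google's index.
        "crawled_only": sorted(crawled_set - indexed_set),
    }
```

The `indexed_only` list is the one to go through by hand, since those are the URLs Google knows about that the crawl did not find.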
Anybody know of a better way to do this?
Instead of scraping I’d use their API at http://code.google.com/apis/customsearch/v1/using_rest.html
JSON is a lot easier to parse than HTML and you can be fairly certain the JSON will not change, at least not in a big way, whereas with the HTML this is not guaranteed at all. Besides, I’m pretty sure scraping Google is against their TOS.
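To give an idea of how little parsing the JSON route needs, here is a small Python sketch. It assumes the endpoint `https://www.googleapis.com/customsearch/v1`, the query parameters `key`, `cx`, `q` and `start`, and a response whose `items` array holds objects with a `link` field, which is how I understand the linked documentation; check it before relying on any of this. The function names and the placeholder key and engine ID are made up.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Custom Search JSON API; verify against the docs.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_request_url(api_key, engine_id, site, start=1):
    """Build a request URL for one page of site: results."""
    params = {"key": api_key, "cx": engine_id, "q": f"site:{site}", "start": start}
    return API_ENDPOINT + "?" + urlencode(params)


def links_from_response(body):
    """Pull the result URLs out of a JSON response body (assumed format)."""
    data = json.loads(body)
    return [item["link"] for item in data.get("items", [])]


# Example with a hand-written response in the assumed format.
sample = '{"items": [{"link": "http://example.com/a"}, {"link": "http://example.com/b"}]}'
print(links_from_response(sample))
```

Compare that with scraping, where the extraction code has to guess at whatever HTML structure the results page happens to use that week.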
Also, be aware that the site: operator does not necessarily return all the results Google has indexed from your domain. See the discussion at http://www.webmasterworld.com/google/3587770.htm
Other than that, your method of comparing the lists seems correct.
Thanks for the input, Scallio (not sure what to call you). Yes, this is me, carlos12345. I got confused, did not realize that I had a username from way, way back, and posted under that one.
Hmm…very interesting indeed. Do you (or anyone) happen to know if one can get a list of all URLs indexed for a given site through their API?
Is it difficult to get an API key from Google?
“Besides, I’m pretty sure scraping Google is against their TOS.”
It is. However, saving SERP pages to your hard drive is not. If I scrape those saved pages one at a time from the directory where I keep them, I do not believe that violates their TOS as written. They don’t want automated scraping directly against Google that takes up bandwidth or otherwise interferes with their service. Saving Google SERPs manually and working with the saved pages does neither.
Any further input would be appreciated.
I don’t know about an API key for this search API, but getting the API keys I did get from them (for Google Maps) was easy as pie. Just enter a few details and you instantly get one (this was for the now-deprecated v2 of the Google Maps API).
I’ve not yet worked with their search API, so I can’t answer your first question. Hopefully someone else can.