I have a client who has handed me quite the doozy of a task, and I'm not altogether sure of the best way to help them with what they need, so I thought I would ask here for any suggestions anyone might care to give.
The task involves scraping or otherwise getting a list of all pages Google has indexed for a site, and comparing that list of URLs to the actual page URLs on the site, so that any discrepancies between the two can be found and corrected with 301 redirects or whatever else is appropriate.
As to the first part, here is what seems best:
- Create a PHP script to scrape the URLs from one SERP results page.
- Go to Google and, using the site: operator, manually get it to return all the pages from the site that it has in its index (100 results at a time).
- Copy each such SERP page of 100 results, one at a time, to a file in a directory until I have gone through all of Google's results for the site.
- Run the script on every file in that directory to scrape each saved page and spit out a CSV file containing all the indexed pages (see the sketch after this list).
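To make that last step concrete, here is roughly the kind of script I have in mind. Treat it as a sketch: the directory and output filenames are placeholders, and the XPath query is only a guess at how the organic result links are marked up, since Google's SERP HTML changes and the query would likely need tweaking.

```php
<?php
// Sketch: pull result URLs out of saved Google SERP HTML files
// and dump them all into one CSV. Paths and the XPath query are
// assumptions; Google's markup changes often, so expect to adjust it.

$serpDir = __DIR__ . '/serps';                  // directory of saved SERP pages
$out     = fopen(__DIR__ . '/indexed.csv', 'w');

foreach (glob($serpDir . '/*.html') as $file) {
    $doc = new DOMDocument();
    @$doc->loadHTMLFile($file);                 // suppress warnings from sloppy HTML
    $xpath = new DOMXPath($doc);

    // Guess: organic results are links wrapping an <h3> heading.
    foreach ($xpath->query('//h3/parent::a/@href') as $href) {
        $url = $href->nodeValue;
        // Some SERP dumps wrap the target in /url?q=...
        if (preg_match('#^/url\?q=([^&]+)#', $url, $m)) {
            $url = urldecode($m[1]);
        }
        if (strpos($url, 'http') === 0) {
            fputcsv($out, array($url));
        }
    }
}

fclose($out);
```

Each saved page of 100 results would then just contribute its URLs to indexed.csv.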
For the second part, it seems I need to run a web crawler on the site to get a list of its present pages.
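For a small site, I'm picturing something like the minimal crawler below. The start URL is a placeholder, and it ignores robots.txt, redirects, and proper relative-URL resolution, all of which a real crawl would need to handle.

```php
<?php
// Minimal same-site crawler sketch to list the site's current pages.
// $startUrl is a placeholder; this is a rough breadth-first walk, not
// a production crawler.

$startUrl = 'http://www.example.com/';
$host     = parse_url($startUrl, PHP_URL_HOST);

$queue = array($startUrl);
$seen  = array($startUrl => true);
$out   = fopen(__DIR__ . '/site-urls.csv', 'w');

while ($queue) {
    $url = array_shift($queue);
    fputcsv($out, array($url));

    $html = @file_get_contents($url);
    if ($html === false) {
        continue;
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '') {
            continue;
        }
        // Crude relative-link handling; a real crawler should do full
        // RFC 3986 resolution.
        if (strpos($href, 'http') !== 0) {
            $href = rtrim($startUrl, '/') . '/' . ltrim($href, '/');
        }
        if (parse_url($href, PHP_URL_HOST) !== $host) {
            continue;                 // stay on the client's site
        }
        $href = strtok($href, '#');   // drop fragments
        if (!isset($seen[$href])) {
            $seen[$href] = true;
            $queue[] = $href;
        }
    }
}

fclose($out);
```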
Once I have the two lists of URLs, I can use Meld under Linux to compare them and point me to the discrepancies, and then visit each such URL, if need be, to decide what to do with it.
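Before opening the lists in Meld, a quick pass like the following might narrow things down (it assumes both files hold one URL per line):

```php
<?php
// Sketch: show URLs Google has indexed that no longer exist on the site,
// and vice versa. Assumes both CSVs hold one URL per line.

$indexed = array_map('trim', file('indexed.csv'));
$onSite  = array_map('trim', file('site-urls.csv'));

echo "Indexed by Google but not found on the site (301 candidates):\n";
foreach (array_diff($indexed, $onSite) as $url) {
    echo $url, "\n";
}

echo "\nOn the site but not in Google's index:\n";
foreach (array_diff($onSite, $indexed) as $url) {
    echo $url, "\n";
}
```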
Anybody know of a better way to do this?