Run Your Own Spider
I came across Carlos Perez’s blog, manageability.org, while Googling for some research today. Carlos had a great list of open source web crawlers that included JSpider, a tool I have used for error checking on web sites.
JSpider is written entirely in Java and can be configured extensively for spidering, error checking and downloading. It obeys robots.txt files (http://www.robotstxt.org/wc/norobots-rfc.txt), as well as any additional options set in its configuration.
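For readers who have not looked at how robots.txt works, here is a minimal sketch in Java of the kind of check a polite crawler makes before fetching a page. It is not JSpider's code or API. The class name RobotsCheck, its method names and the "JSpider" user-agent string are all made up for illustration, and it handles only User-agent and Disallow lines, skipping the rest of the draft (Allow lines, for instance).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of JSpider: a simplified robots.txt check.
public class RobotsCheck {

    // Returns the Disallow prefixes that apply to userAgent. Records naming
    // the agent win over the "*" record, which is used only as a fallback.
    static List<String> disallowedFor(String robotsTxt, String userAgent) {
        List<String> specific = new ArrayList<>();
        List<String> wildcard = new ArrayList<>();
        String agent = userAgent.toLowerCase(Locale.ROOT);
        boolean lastWasAgent = false;
        boolean appliesSpecific = false;
        boolean appliesWildcard = false;
        boolean sawSpecific = false;

        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#');               // strip comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;                   // blank or malformed line
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!lastWasAgent) {                   // a new record starts here
                    appliesSpecific = false;
                    appliesWildcard = false;
                }
                if (value.equals("*")) {
                    appliesWildcard = true;
                } else if (!value.isEmpty()
                        && agent.contains(value.toLowerCase(Locale.ROOT))) {
                    appliesSpecific = true;
                    sawSpecific = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && !value.isEmpty()) {
                    if (appliesSpecific) specific.add(value);
                    if (appliesWildcard) wildcard.add(value);
                }
            }
        }
        return sawSpecific ? specific : wildcard;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean isAllowed(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /private/\n";
        List<String> rules = disallowedFor(robotsTxt, "JSpider"); // agent name is an assumption
        System.out.println(isAllowed(rules, "/index.html"));
        System.out.println(isAllowed(rules, "/private/notes.html"));
    }
}
```

A real spider would fetch /robots.txt once per host and cache the parsed rules rather than re-reading the file for every URL.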
I thought the added downloading option was nice as I had been using a separate application to pull down entire web sites for offline use. Now this can be accomplished with the JSpider engine.
The tool has a plug-in architecture that lets users extend JSpider to meet their own needs (and perhaps contribute back to the project). JSpider is released under the LGPL.
JSpider requires a J2SE 1.3+ runtime and an installed XML parser (Xerces, …); a parser comes bundled with JDK 1.4. The app will run on any system that supports Java and meets these requirements.
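If you are not sure what your machine has installed, a quick check along these lines may help. It is only a sketch: the class name EnvCheck is made up, and it assumes the XML parser is looked up through the standard JAXP factory, which may not be how JSpider itself locates one.

```java
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper, not part of JSpider: reports the Java runtime
// version and the XML parser factory that JAXP resolves on this machine.
public class EnvCheck {

    // newInstance() throws FactoryConfigurationError if no parser is available.
    static String describeXmlParser() {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println("JAXP parser factory: " + describeXmlParser());
    }
}
```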
There is even a simple sample site that the JSpider project has created for testing once you are up and running. Additionally, a fairly comprehensive 120-page user manual is available in PDF format.