Run Your Own Spider

By Blane Warrene

I came across Carlos Perez’s blog, manageability.org, while Googling for some research today. Carlos had a great list of open source web crawlers that included JSpider, a tool I have used for error checking on web sites.

JSpider is written entirely in Java and can be configured extensively for spidering, error checking, and downloading. It obeys robots.txt files (http://www.robotstxt.org/wc/norobots-rfc.txt), along with any additional options set in its configuration.
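To make the robots.txt point concrete, here is a minimal sketch of the kind of check a polite crawler performs before fetching a page. It is not JSpider's code: the class name RobotsCheck, its methods, and the "ExampleSpider" user agent are all made up for illustration. It also simplifies the standard by handling only User-agent and Disallow lines and by merging the rules for "*" with any agent-specific rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of JSpider: a simplified robots.txt check.
class RobotsCheck {

    // Collects the Disallow prefixes from every group whose User-agent line
    // is "*" or is contained in the given user agent name (case-insensitive).
    static List<String> disallowedPrefixes(String robotsTxt, String userAgent) {
        List<String> prefixes = new ArrayList<>();
        String agent = userAgent.toLowerCase(Locale.ROOT);
        boolean applies = false;      // does the current group apply to us?
        boolean lastWasAgent = false; // consecutive User-agent lines share a group
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#'); // strip comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) {
                    applies = false; // a new group starts here
                }
                String token = value.toLowerCase(Locale.ROOT);
                if (token.equals("*") || (!token.isEmpty() && agent.contains(token))) {
                    applies = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean isAllowed(List<String> prefixes, String path) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /private/\n";
        List<String> rules = disallowedPrefixes(robotsTxt, "ExampleSpider");
        System.out.println(isAllowed(rules, "/index.html"));     // true
        System.out.println(isAllowed(rules, "/private/a.html")); // false
    }
}
```

A real crawler would fetch /robots.txt once per host and cache the parsed rules before requesting any other URL on that site.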

I thought the added downloading option was nice, as I had been using a separate application to pull down entire web sites for offline use. Now this can be accomplished with the JSpider engine.

The tool has a plug-in architecture that opens the door for users to extend JSpider to meet their own needs (and perhaps contribute to the project). JSpider is released under the LGPL.

JSpider requires a J2SE 1.3+ runtime and an installed XML parser (Xerces, …); a parser comes with JDK 1.4. The app will run on any system that supports Java and meets these requirements.
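If you are unsure whether your Java installation already provides an XML parser, a quick check along these lines can tell you. This is a rough sketch that uses only the standard JAXP API (javax.xml.parsers); the class name XmlParserCheck and the sample XML string are made up for illustration, and the code is written for a current JDK rather than J2SE 1.3.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of JSpider: reports which XML parser JAXP finds.
class XmlParserCheck {

    // Returns the class name of the DocumentBuilder that JAXP selects.
    static String parserClassName() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            return factory.newDocumentBuilder().getClass().getName();
        } catch (Exception e) {
            throw new RuntimeException("No usable XML parser found", e);
        }
    }

    // Parses a small XML string and returns the name of its root element.
    static String rootElementName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new RuntimeException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println("XML parser:   " + parserClassName());
        System.out.println("Root element: " + rootElementName("<config><spider/></config>"));
    }
}
```

If the program prints a parser class name and the root element "config", the runtime can parse XML without any extra libraries on the classpath.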

The JSpider project even provides a simple sample site for testing once you are up and running. Additionally, a fairly comprehensive 120-page user manual is available in PDF format.
