  1. #1
     SitePoint Addict
     Join Date: Aug 2001 · Location: Los Angeles, CA · Posts: 346

    Any definite, official definition of a search engine spider?

    I'm pretty positive I know what a search engine spider is: a module that visits web pages and indexes them to put them in the search engine's database. But I wanted to know if there is any definite, official definition anywhere on the web for what a spider (as in search engine spider) is.

    Thanks,

    GregC

  2. #2
     SitePoint Columnist Skunk
     Join Date: Jan 2001 · Location: Lawrence, Kansas · Posts: 2,066
    I doubt there is an official definition anywhere, as it's not something that is governed by standards of any sort. A spider is just a program that "crawls" the web, following links and indexing the pages it finds. There are plenty of ways of implementing such a thing and plenty of reasons for doing so, ranging from the beneficial (search engine spiders) to the unpleasant (ripping email addresses from web pages to feed spam mailing lists). Anyone can write one; in fact it's pretty trivial to put one together in Perl, Python or even PHP, along the lines of the sketch below.
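
    Just to show how little it takes, here's a bare-bones version in Python using only the standard library. The seed URL, the depth cap and the print() standing in for real indexing are placeholders for this example, not anything official:

        from html.parser import HTMLParser
        from urllib.parse import urljoin
        from urllib.request import urlopen

        class LinkExtractor(HTMLParser):
            """Collect the href of every <a> tag on a page."""
            def __init__(self):
                super().__init__()
                self.links = []

            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    for name, value in attrs:
                        if name == "href" and value:
                            self.links.append(value)

        def crawl(url, seen, depth=2):
            """Fetch url, 'index' it (here: just print it), then
            recursively follow the links found on the page."""
            if depth == 0 or url in seen:
                return
            seen.add(url)
            try:
                page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                return  # unreachable, non-HTML or malformed page: skip it
            print("indexed:", url)
            parser = LinkExtractor()
            parser.feed(page)
            for link in parser.links:
                crawl(urljoin(url, link), seen, depth - 1)

        crawl("http://example.com/", seen=set())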

  3. #3
     SitePoint Wizard Ian Glass
     Join Date: Oct 2001 · Location: Beyond yonder · Posts: 2,384
    Writing a paper? ;-)

    Anyway, the definition straight from dictionary.com:
    spider

    <World-Wide Web> (Or "robot", "crawler") A program that automatically explores the World-Wide Web by retrieving a document and recursively retrieving some or all of the documents that are referenced in it. This is in contrast with a normal web browser operated by a human, which doesn't automatically follow links other than inline images and URL redirection.

    The algorithm used to pick which references to follow strongly depends on the program's purpose. Index-building spiders usually retrieve a significant proportion of the references. The other extreme is spiders that try to validate the references in a set of documents; these usually do not retrieve any of the links apart from redirections.

    The standard for robot exclusion is designed to avoid some problems with spiders.

    Early examples were Lycos and WebCrawler.

    Home (http://info.webcrawler.com/mak/proje...ts/robots.html).

    (2001-04-30)
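
    The "standard for robot exclusion" it mentions is just a robots.txt file that polite spiders fetch before crawling a site. If you do end up writing one, Python's standard library can check it for you. A quick sketch; the user-agent string and URLs here are made up for the example:

        from urllib.robotparser import RobotFileParser

        robots = RobotFileParser("http://example.com/robots.txt")
        robots.read()  # fetch and parse the site's robots.txt

        # Ask before crawling: may this user-agent fetch this page?
        if robots.can_fetch("MySpider/1.0", "http://example.com/private/page.html"):
            print("allowed to crawl")
        else:
            print("disallowed by robots.txt")
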
    ~~Ian

