  1. #1
    SitePoint Enthusiast
    Join Date
    Mar 2005
    Posts
    65

    creating a temporary file?

    I am writing a program that reads through a website, grabbing the content of each page, then following its links and grabbing the contents of those pages, and putting them all into one linear file.

    Any ideas on how best to store the output of the content grabber so it can move on to the next page?

    I was thinking that I could do it in a temp file, but I am not really sure how I would go about implementing it.

    Any ideas?
    Last edited by strugglingon; Mar 22, 2005 at 16:30.
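The thread never names a language, so as a minimal sketch of the temp-file idea the post asks about, here is one way it might look in Python: each fetched page is appended to a single shared temporary file. The `pages` list here is a stand-in for fetched content; a real crawler would download it over HTTP.

```python
import tempfile

def append_pages(pages):
    """Write each page's content into one shared temp file and return its path."""
    # delete=False keeps the file around after close, so it can be
    # re-read (or printed) later rather than vanishing immediately.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".txt",
                                     delete=False) as tmp:
        for url, content in pages:
            tmp.write(f"==== {url} ====\n")   # separator so pages stay distinguishable
            tmp.write(content + "\n")
        return tmp.name

# Stand-in for fetched pages; a real crawler would download these.
pages = [("http://example.com/a", "first page"),
         ("http://example.com/b", "second page")]
path = append_pages(pages)
print(open(path).read())
```

Because the file is written incrementally, the crawler can move on to the next page as soon as the current one is flushed.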

  2. #2
    Compulsive Clubber icky_bu's Avatar
    Join Date
    Aug 2003
    Location
    Portugal
    Posts
    351
    Uhm, you would certainly have to work in some regular expressions; I'd think that all that content would add up like mad.

    Btw, what would you be using the content for?
    Please be careful with copyright issues.

  3. #3
    SitePoint Enthusiast Grubilo's Avatar
    Join Date
    Oct 2002
    Posts
    41
    strugglingon, the best way to store this content will depend on what you want to do with it.

    But one thing I can tell you for sure: storing all the data in one file is not efficient. It's better to use a separate file for each node/domain you are processing; that will save time when you read the content back locally.

    For a more advanced scheme, I would recommend using a database for indexing URLs and mapping a file to each database record.
    Good luck
    Web tutorials
    http://webclass.ru
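As a rough sketch of the scheme described above, assuming Python and SQLite (neither is named in the thread): each page is saved to its own file, and a database row maps the URL to that file so content can be looked up without re-crawling.

```python
import hashlib
import os
import sqlite3
import tempfile

def store(conn, directory, url, content):
    """Save content in its own file and index the URL -> file mapping."""
    # Hash the URL to get a safe, unique filename.
    name = hashlib.md5(url.encode()).hexdigest() + ".html"
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write(content)
    conn.execute("INSERT INTO pages (url, path) VALUES (?, ?)", (url, path))
    return path

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, path TEXT)")
storage_dir = tempfile.mkdtemp()

store(conn, storage_dir, "http://example.com/help", "<html>manual page</html>")

# Look a page up by URL and read its file back.
row = conn.execute("SELECT path FROM pages WHERE url = ?",
                   ("http://example.com/help",)).fetchone()
print(open(row[0]).read())
```

The UNIQUE constraint on `url` also gives a cheap way to avoid fetching the same page twice.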

  4. #4
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Quote Originally Posted by strugglingon
    I am writing a program that reads through a website grabbing the content of each page and then following links and grabbing the contents of those pages and putting them all in to one linear file.
    You should be able to use wget for this.

  5. #5
    SitePoint Enthusiast
    Join Date
    Mar 2005
    Posts
    65
    Quote Originally Posted by icky_bu
    Uhm, you would certainly have to work in some regular expressions; I'd think that all that content would add up like mad.

    Btw, what would you be using the content for?
    Please be careful with copyright issues.
    The idea is to serialise the pages of online help manuals so that the user then has the ability to print the whole thing as a book, rather than going through each page and printing it individually.
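As a rough illustration of that use case, here is a hypothetical Python sketch that follows links from page to page and concatenates the manual into one printable string. The `site` dict stands in for real HTTP fetches, and the link-following logic is a plain depth-first walk, not anything from the thread.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Fake two-page manual: url -> page HTML. A real run would fetch over HTTP.
site = {
    "/manual/1": '<h1>Intro</h1><a href="/manual/2">next</a>',
    "/manual/2": "<h1>Usage</h1>",
}

def crawl(start):
    """Walk the manual from `start`, concatenating every visited page."""
    seen, out, stack = set(), [], [start]
    while stack:
        url = stack.pop()
        if url in seen or url not in site:
            continue                      # skip revisits and external links
        seen.add(url)
        html = site[url]
        out.append(html)
        parser = LinkCollector()
        parser.feed(html)                 # queue this page's links
        stack.extend(parser.links)
    return "\n".join(out)

book = crawl("/manual/1")
print(book)
```

The `seen` set prevents infinite loops when manual pages link back to each other.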

