  1. #1
    SitePoint Enthusiast
    Join Date
    Dec 2002
    Posts
    34

    grab the content from a link

    For example, I need to get the full contents (source code) of the index pages of yahoo.com, about.com, google.com, and many more sites.

    I use fsockopen and fopen, but it's too slow when I am fetching more than 100 sites.

    Can I use other ways to grab the content of over 100 sites at the same time?

    Are there any other good ways to do it?

    I am considering using sockets, but would that work for this?


    Thank You

  2. #2
    ********* wombat firepages
    Join Date
    Jul 2000
    Location
    Perth Australia
    Posts
    1,717
    You can use non-blocking sockets to speed things up a bit, but you are then using an awful lot of bandwidth for something you probably should not be doing in the first place anyway. Either your provider or one of the sites you are mining will get narked with you sooner rather than later.
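
As a rough illustration of the non-blocking socket idea, here is a minimal sketch in Python rather than PHP. The function name `fetch_all` and the `(host, port)` target format are made up for this example, and it assumes plain HTTP/1.0 over IPv4 with no HTTPS, redirects, or per-site timeouts. Every socket is opened at once and a single loop services whichever one is ready, so one slow site does not hold up the others.

```python
import selectors
import socket


def fetch_all(targets, timeout=10.0):
    """Fetch "/" from each (host, port) pair at the same time over plain HTTP.

    Returns a dict mapping (host, port) to the raw response bytes
    (status line, headers and body). Hypothetical helper, not a library API.
    """
    sel = selectors.DefaultSelector()
    results = {}
    for host, port in targets:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        # connect_ex returns at once; the TCP handshake finishes in the background.
        # Note: resolving a host name here can still block on DNS.
        sock.connect_ex((host, port))
        request = f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
        state = {"key": (host, port), "out": request, "in": bytearray()}
        sel.register(sock, selectors.EVENT_WRITE, state)

    while sel.get_map():
        events = sel.select(timeout)
        if not events:
            break  # nothing became ready within the timeout; give up on the rest
        for key, mask in events:
            sock, state = key.fileobj, key.data
            try:
                if mask & selectors.EVENT_WRITE:
                    # Socket is connected and writable: send what is left of the request.
                    sent = sock.send(state["out"])
                    state["out"] = state["out"][sent:]
                    if not state["out"]:
                        sel.modify(sock, selectors.EVENT_READ, state)
                    continue
                chunk = sock.recv(65536)
                if chunk:
                    state["in"] += chunk
                    continue
            except BlockingIOError:
                continue  # spurious wake-up; try again on the next pass
            except OSError:
                pass  # connection refused or reset: keep whatever arrived
            # Empty read or error: this site is finished.
            results[state["key"]] = bytes(state["in"])
            sel.unregister(sock)
            sock.close()

    # Close anything still pending after a timeout.
    for key in list(sel.get_map().values()):
        sel.unregister(key.fileobj)
        key.fileobj.close()
    sel.close()
    return results
```

Calling `fetch_all([("example.com", 80)])` would return the raw response for that one host. The bandwidth and politeness concerns above still apply, however the requests are issued.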

  3. #3
    SitePoint Enthusiast
    Join Date
    Dec 2002
    Posts
    34
    How do non-blocking sockets work?

    Can you give me an example of it, or a URL?

    Thanks for your help

  4. #4
    ********* Victim lastcraft
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Hi...

    Quote Originally Posted by kaikai
    Can I use other ways to grab the content of over 100 sites at the same time?
    PHP is completely inappropriate for this. You need a tool that works outside of Apache's memory management and is multi-threaded/multi-process. Your only hope is to shell out to a command line tool and pick up the results later, or to write a PHP extension.

    yours, Marcus
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things
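
One possible reading of the "multi-threaded" suggestion above, sketched in Python with only the standard library. The names `fetch` and `fetch_many` and the worker count of 20 are assumptions for illustration, not anything the posters described. A fixed pool of threads downloads the pages in parallel, and a failure on one site is recorded instead of stopping the rest.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10.0):
    """Download one URL; return (url, body_bytes, error), where one of the last two is None."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return url, response.read(), None
    except Exception as exc:  # broad on purpose: one bad site must not abort the batch
        return url, None, exc


def fetch_many(urls, workers=20):
    """Fetch all URLs with a pool of worker threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

For around 100 sites, the total time is then roughly the time of the slowest few requests rather than the sum of all of them. The worker count is the knob to turn down if the provider or the target sites complain.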

  5. #5
    SitePoint Enthusiast
    Join Date
    Dec 2002
    Posts
    34
    Hi

    Can you give me an example or a site which shows it?

    Thanks

    Quote Originally Posted by lastcraft
    Hi...


    PHP is completely inappropriate for this. You need a tool that works outside of Apache's memory management and is multi-threaded/multi-process. Your only hope is to shell out to a command line tool and pick up the results later, or to write a PHP extension.

    yours, Marcus

  6. #6
    ********* Victim lastcraft
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Hi...

    Quote Originally Posted by kaikai
    Can you give me an example or a site which shows it?
    My work involves meta crawlers and spiders, so I cannot say more without giving away information which is commercially sensitive to my clients. Sorry and all that. All I can say is have a look around at other languages and tools and see what you can work with.

    yours, Marcus

  7. #7
    SitePoint Guru dagfinn
    Join Date
    Jan 2004
    Location
    Oslo, Norway
    Posts
    894
    Quote Originally Posted by lastcraft
    All I can say is have a look around at other languages and tools and see what you can work with.
    There are a number of Perl modules you could check out, including this one. Also, there's a book called something like Web Content Mining in Java, but I've only glanced at it.
    Dagfinn Reiersøl
    PHP in Action / Blog / Twitter
    "Making the impossible possible, the possible easy,
    and the easy elegant"
    -- Moshe Feldenkrais

  8. #8
    Team SitePoint Lucas Chan
    Join Date
    Sep 2002
    Location
    Melbourne
    Posts
    59
    You may have some luck with wget or curl.
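
A rough sketch of how that could look when shelling out, written in Python and assuming `curl` is installed and on the PATH. The helper names `launch_downloads` and `wait_all` and the `page_N.html` file naming are invented for this example. One `curl` process is started per URL without waiting for it, and the exit codes and output files are picked up afterwards.

```python
import subprocess
from pathlib import Path

# Default command line; these are standard curl options.
CURL = ("curl", "--silent", "--location", "--max-time", "30")


def launch_downloads(urls, out_dir, command=CURL):
    """Start one downloader process per URL and return at once.

    Each page is written to out_dir/page_<index>.html (hypothetical naming).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    procs = []
    for index, url in enumerate(urls):
        target = out_dir / f"page_{index}.html"
        proc = subprocess.Popen([*command, "--output", str(target), url])
        procs.append((url, target, proc))
    return procs


def wait_all(procs):
    """Wait for every process; map each URL to (output_path, exit_code)."""
    return {url: (target, proc.wait()) for url, target, proc in procs}
```

Starting 100 processes in one go is heavy, so for a long list it is probably kinder to launch them in batches. An exit code of 0 means the downloader reported success for that URL.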

