Thread: Scraping Script

  1. #1
    SitePoint Member
    Join Date
    Feb 2004
    Location
    www.bulldog.name
    Posts
    0
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Scraping Script

    Are there any examples of scraping scripts on the net?

  2. #2
    SitePoint Wizard HarryR's Avatar
    Join Date
    Dec 2004
    Location
    London, UK
    Posts
    1,376
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    A scraping script is just something which connects to a webserver, pretends to perform an action or just gets a page, and interprets the resulting HTML.

    There's no generic "scraping script", as everything is different.
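
To make the description above concrete, here is a minimal sketch in Python using only the standard library. It fetches a page and "interprets" the HTML by pulling out the title and the link targets. The names `LinkCollector`, `fetch`, and `scrape` and the user-agent string are made up for illustration; a real scraper would be tailored to the specific page it reads.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag and the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape(html):
    """Interpret the HTML: return (page title, list of link targets)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links
```

Usage would be something like `title, links = scrape(fetch(some_url))`. Everything after the fetch depends on the markup of the page in question, which is why no generic script exists.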

  3. #3
    dooby dooby doo silver trophybronze trophy
    spikeZ's Avatar
    Join Date
    Aug 2004
    Location
    Manchester UK
    Posts
    13,807
    Mentioned
    158 Post(s)
    Tagged
    3 Thread(s)
    lol, thought I might see you in here, Bulldog!
    Here are a few links for you to browse.....

    http://www.daniweb.com/code/snippet293.html
    http://www.devnewz.com/devnewz-3-200...eInternet.html
    http://codingforums.com/archive/index.php?t-36563.html

    Should set you on the right path
    Mike Swiffin - Community Team Advisor
    Only a woman can read between the lines of a one word answer.....

  4. #4
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,323
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    i would hardly call any scraping script "the right path"
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  5. #5
    dooby dooby doo silver trophybronze trophy
    spikeZ's Avatar
    Join Date
    Aug 2004
    Location
    Manchester UK
    Posts
    13,807
    Mentioned
    158 Post(s)
    Tagged
    3 Thread(s)
    Quote: Originally Posted by r937
    i would hardly call any scraping script "the right path"
    true....

  6. #6
    SitePoint Wizard HarryR's Avatar
    Join Date
    Dec 2004
    Location
    London, UK
    Posts
    1,376
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally Posted by r937
    i would hardly call any scraping script "the right path"
    REST FTW

    But for those situations where you can't get it from anything other than scraping?

  7. #7
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,323
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    Quote: Originally Posted by HarryR
    But for those situations where you can't get it from anything other than scraping?
    for those situations, resist the urge to steal

    because that's what screen scraping usually is -- theft of copyright material

  8. #8
    Worship the Krome kromey's Avatar
    Join Date
    Sep 2006
    Location
    Fairbanks, AK
    Posts
    1,621
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally Posted by r937
    for those situations, resist the urge to steal

    because that's what screen scraping usually is -- theft of copyright material
    There are countless legitimate uses of scrapers - I've written more than a dozen here at work in the last year, none of which are stealing copyrighted material and all of which were the only option available for the purposes we needed. Most of them are scraping content from our own servers (e.g. pulling in dashboard data from a myriad different monitoring utilities that don't provide alternative access means such as SOAP or RSS); the ones that reach out across the internet I took great pains to make as "friendly" as possible - they connect directly to what they need and nothing more, and all of them implement local caching so that at most I'm only sucking down the remote page once per hour (most are cached for a full 24 hours).
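
The local caching described above could look roughly like the following Python sketch. It is not the poster's actual code: the cache directory name, the one-hour default, and the `cached_fetch` / `download` helpers are all assumptions for illustration. The idea is simply to store each page on disk and only hit the remote server again once the stored copy is older than the allowed age.

```python
import hashlib
import time
from pathlib import Path
from urllib.request import urlopen

CACHE_DIR = Path("scrape_cache")  # hypothetical cache location
MAX_AGE_SECONDS = 60 * 60         # one hour, as in the post above


def download(url):
    """Fetch the raw bytes of a page (hypothetical default fetcher)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def cached_fetch(url, max_age=MAX_AGE_SECONDS, cache_dir=CACHE_DIR, fetcher=download):
    """Return the page body, contacting the remote server only if the
    cached copy is missing or older than max_age seconds."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One cache file per URL, named by a hash of the URL.
    cache_file = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if cache_file.exists():
        age = time.time() - cache_file.stat().st_mtime
        if age < max_age:
            return cache_file.read_bytes()
    body = fetcher(url)
    cache_file.write_bytes(body)
    return body
```

Passing `max_age=24 * 60 * 60` would give the 24-hour variant. The `fetcher` parameter is only there so the download step can be swapped out, for example when testing without network access.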
    PHP questions? RTFM
    MySQL questions? RTFM

  9. #9
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,323
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    i was very careful to insert the word "usually" in my statement

    bulldog's previous thread was about large databases for sale on the web, for example lyric databases over 500,000+ and recipe database and so on, and how do people get the data for those databases... and i answered "they scrape them from other sites" ... and then he started this new thread

  10. #10
    SitePoint Wizard TheRedDevil's Avatar
    Join Date
    Sep 2004
    Location
    Norway
    Posts
    1,198
    Mentioned
    4 Post(s)
    Tagged
    1 Thread(s)
    Quote: Originally Posted by kromey
    There are countless legitimate uses of scrapers - I've written more than a dozen here at work in the last year, none of which are stealing copyrighted material and all of which were the only option available for the purposes we needed. Most of them are scraping content from our own servers (e.g. pulling in dashboard data from a myriad different monitoring utilities that don't provide alternative access means such as SOAP or RSS); the ones that reach out across the internet I took great pains to make as "friendly" as possible - they connect directly to what they need and nothing more, and all of them implement local caching so that at most I'm only sucking down the remote page once per hour (most are cached for a full 24 hours).
    If you are "scraping" your own sites, then you would be better off writing a simple API instead. In most cases you would finish the code faster, and it would be a lot more effective.
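
As a rough idea of what such a simple API might look like, here is a sketch in Python using the standard library's `http.server`. The `/status` path, the `STATUS` values, and the function names are invented for illustration; the point is only that the server hands out structured JSON directly, so the consumer never has to parse HTML.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical monitoring values; a real service would read these
# from its own data store.
STATUS = {"cpu_load": 0.42, "disk_free_gb": 118, "services_up": 7}


def render_status(status):
    """Serialize the status dictionary as UTF-8 encoded JSON."""
    return json.dumps(status, sort_keys=True).encode("utf-8")


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only one endpoint is served; anything else is a 404.
        if self.path != "/status":
            self.send_error(404)
            return
        body = render_status(STATUS)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve the API on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), StatusHandler).serve_forever()
```

Calling `serve()` starts the endpoint, and a client then reads `http://127.0.0.1:8000/status` and decodes the JSON. This only helps when you control the server; third-party monitoring tools without such an interface are a different matter.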

  11. #11
    Worship the Krome kromey's Avatar
    Join Date
    Sep 2006
    Location
    Fairbanks, AK
    Posts
    1,621
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally Posted by r937
    i was very careful to insert the word "usually" in my statement

    bulldog's previous thread was about large databases for sale on the web, for example lyric databases over 500,000+ and recipe database and so on, and how do people get the data for those databases... and i answered "they scrape them from other sites" ... and then he started this new thread
    Well, now that we've got this context around this thread, it is sounding pretty shady.

  12. #12
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,323
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    Off Topic:

    Quote: Originally Posted by HarryR
    REST FTW
    Restricted Environmental Stimulation Technique in Fort Worth?

  13. #13
    SitePoint Wizard HarryR's Avatar
    Join Date
    Dec 2004
    Location
    London, UK
    Posts
    1,376
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally Posted by r937
    Off Topic:


    Restricted Environmental Stimulation Technique in Fort Worth?
    For The Win.

    Quote: Originally Posted by r937
    i was very careful to insert the word "usually" in my statement

    bulldog's previous thread was about large databases for sale on the web, for example lyric databases over 500,000+ and recipe database and so on, and how do people get the data for those databases... and i answered "they scrape them from other sites" ... and then he started this new thread
    So in this specific instance it's copyright plagurism.

  14. #14
    dooby dooby doo silver trophybronze trophy
    spikeZ's Avatar
    Join Date
    Aug 2004
    Location
    Manchester UK
    Posts
    13,807
    Mentioned
    158 Post(s)
    Tagged
    3 Thread(s)
    Quote: Originally Posted by HarryR
    For The Win.
    I thought you meant "For The Wicked" as in No Rest For The Wicked.....

    Quote: Originally Posted by HarryR
    So in this specific instance it's copyright plagurism.
    "Possibly copyright plagiarism......"
    Last edited by spikeZ; Jun 19, 2007 at 03:25. Reason: curse you Rudy ;)

  15. #15
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,323
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    REST?

    plagiarism

  16. #16
    SitePoint Wizard HarryR's Avatar
    Join Date
    Dec 2004
    Location
    London, UK
    Posts
    1,376
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote: Originally Posted by r937
    Yeah, or any other type of structured data access method (SOAP/XML-RPC/RSS) that the site explicitly provides (i.e. they want you to use or syndicate their content/site).
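
For comparison with scraping, reading a feed the site deliberately publishes takes very little code. A minimal Python sketch, assuming a plain RSS 2.0 document; the `parse_rss` name and the sample feed are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document standing in for a feed a site chooses to publish.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return a list of {'title', 'link'} dictionaries, one per <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall("./channel/item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items
```

Because the structure is defined by the feed format rather than by page layout, this keeps working when the site redesigns its HTML, which is the practical advantage over a scraper.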

