Thread: Site Rippers

  1. #1
    Serial Publisher silver trophy aspen's Avatar
    Join Date
    Aug 1999
    Location
    East Lansing, MI USA
    Posts
    12,937

    Site Rippers

    Has anyone had a problem with site rippers, a.k.a. offline browsers?

    I have about 10 IPs banned from my site for doing it, but I have to think there is a better way.

    Is there a way I can ban all offline browsing via the user agent instead of the IP?
    Chris Beasley - I publish content and ecommerce sites.
    Featured Article: Free Comprehensive SEO Guide
    My Guide to Building a Successful Website
    My Blog|My Webmaster Forums

  2. #2
    SitePoint Guru
    Join Date
    Sep 1999
    Location
    Singapore
    Posts
    854
    No, because most site rippers can spoof the User-Agent string.

  3. #3
    SitePoint Member
    Join Date
    Jun 2001
    Location
    Indianapolis, IN
    Posts
    12
    I have always wondered about this myself.

    I think you can add some kind of JavaScript to the page that looks for the host; if it can't find it, or isn't directly connected to it, the page fails or keeps refreshing. I have seen something that freezes apps like FrontPage, but I'm not sure what it was.
    BradC

  4. #4
    Serial Publisher silver trophy aspen's Avatar
    Join Date
    Aug 1999
    Location
    East Lansing, MI USA
    Posts
    12,937
    Most robots don't have JavaScript turned on. I'd need a server-side solution using .htaccess or similar.

  5. #5
    SitePoint Evangelist thewitt's Avatar
    Join Date
    Apr 2001
    Posts
    468
    Other than looking at how fast the ripper goes from one URL on your site to another, and then making sure you don't mistake it for a search engine, you can't tell.

    I have written a couple of these in the past and had to masquerade as a web browser to not be treated like a web crawler by an interactive site. It's not hard to do.

    -t
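
    The request-speed heuristic described above could be sketched roughly as follows. This is only an illustration, not a tool anyone in the thread wrote: the function name, the thresholds, and the crawler tokens are all assumptions, and the crawler check relies on the User-Agent string, which can be spoofed.

```python
from statistics import median

# Assumption: illustrative substrings for search-engine crawlers to leave alone.
KNOWN_CRAWLER_TOKENS = ("googlebot", "slurp")


def looks_like_ripper(timestamps, user_agent, min_hits=20, max_median_gap=1.0):
    """Guess whether one client's hits look like a site ripper.

    timestamps     -- request times for a single client, in seconds
    user_agent     -- the User-Agent string that client sent
    min_hits       -- hypothetical minimum sample size before judging
    max_median_gap -- hypothetical threshold: median seconds between hits
    """
    # Do not flag clients that identify themselves as known crawlers.
    if any(token in user_agent.lower() for token in KNOWN_CRAWLER_TOKENS):
        return False
    # Too few requests to say anything about browsing speed.
    if len(timestamps) < min_hits:
        return False
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    # A human reader pauses between pages; a ripper usually does not.
    return median(gaps) <= max_median_gap
```

    A client that requests 50 pages about 0.2 seconds apart would be flagged, while one that requests a page every 30 seconds would not.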

  6. #6
    code addict Abstraction's Avatar
    Join Date
    Apr 2001
    Location
    Des Moines, IA
    Posts
    346
    Just curious. Why don't you want people to rip your site? Possibly they just want to do some offline browsing or possibly not.

  7. #7
    SitePoint Enthusiast
    Join Date
    Jun 2001
    Location
    New Zealand
    Posts
    74
    Where I live - which is New Zealand - schools often download sites for offline browsing. Firstly the staff check to make sure it is "kid safe" and secondly it cuts down the cost of being online.
    <help>StIcKs</help>

  8. #8
    SitePoint Zealot Aonghus's Avatar
    Join Date
    Feb 2001
    Location
    Ireland
    Posts
    116
    Here's the simplest solution: make a zip file containing all the files from your site (you could write a Perl script to update the file whenever the site gets updated), then put it on a free host's server and let people download it there. This is the perfect solution if it's bandwidth you're worried about. If it's content being stolen, you could write a CGI script that only lets a visitor download a certain number of pages a minute and a certain number of pages a day. That wouldn't be too hard.

  9. #9
    Serial Publisher silver trophy aspen's Avatar
    Join Date
    Aug 1999
    Location
    East Lansing, MI USA
    Posts
    12,937
    My site is driven by a 150 MB database. There are over 8,000 unique pages. People requesting all of those pages that quickly put too much load on the server and slow the site down for everyone else.

    Additionally, I don't want people ripping my content. It's my content; if they want to read it, they can do so online.

    I'll just continue banning IP addresses.

  10. #10
    SitePoint Zealot Aonghus's Avatar
    Join Date
    Feb 2001
    Location
    Ireland
    Posts
    116
    Well, go with solution two then. I'm sure it would be quite easy to set up a tracking script that only lets each visitor download a reasonable amount of data.

  11. #11
    code addict Abstraction's Avatar
    Join Date
    Apr 2001
    Location
    Des Moines, IA
    Posts
    346
    Banning an IP after the fact is kind of pointless.

  12. #12
    SitePoint Addict -TheDarkEye-'s Avatar
    Join Date
    Feb 2001
    Location
    canada
    Posts
    286
    Originally posted by Abstraction
    Banning an IP after the fact is kind of pointless.
    lol, no kidding. and that really doesn't solve the problem.

    i believe most of those "site ripper" programs can be set to only update or download a page if it is new or changed. so it's not as likely that these people will rip your entire site again.

    i think there is a better solution anyway... something like what Aonghus was suggesting. it would be really easy to implement an IP/time-based page download limit. you could easily make this a feature that no regular user would even notice, but that would discourage site ripping just because it would take so damn long (e.g. 8000 pages, with a limit of 15 to 30 pages every 15 minutes: time to download would be roughly 3 to 6 days).
    Last edited by -TheDarkEye-; Jul 27, 2001 at 11:34.
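
    The estimate above can be checked with a few lines of arithmetic. The function name is made up for illustration, and the calculation assumes the ripper always downloads at exactly the permitted rate.

```python
def days_to_rip(total_pages, pages_per_window, window_minutes):
    """Days needed to fetch total_pages at a fixed per-window page limit."""
    windows_needed = total_pages / pages_per_window
    minutes_needed = windows_needed * window_minutes
    return minutes_needed / (24 * 60)


# Figures from the post: 8000 pages, 15 to 30 pages every 15 minutes.
fastest = days_to_rip(8000, pages_per_window=30, window_minutes=15)
slowest = days_to_rip(8000, pages_per_window=15, window_minutes=15)
print(f"{fastest:.1f} to {slowest:.1f} days")
```

    That works out to about 2.8 to 5.6 days, which is consistent with the "3 to 6 days" figure.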

  13. #13
    SitePoint Addict -TheDarkEye-'s Avatar
    Join Date
    Feb 2001
    Location
    canada
    Posts
    286
    ACK! sorry i was trying to edit a typo but i hit the "quote" button instead of the "edit" button.
    Last edited by -TheDarkEye-; Jul 27, 2001 at 11:36.

  14. #14
    SitePoint Zealot Aonghus's Avatar
    Join Date
    Feb 2001
    Location
    Ireland
    Posts
    116
    aspen: how much of your site is based around server-side scripting? Obviously there's a lot for talking to the database, but is a script run every time any page is loaded? If so, it would be quite easy to add a subroutine that runs before the rest of the script: it checks a log to see how many pages an IP address has downloaded in the past 10 minutes or whatever, and if that is over the limit, prints an error message and exits the script.

    -Aonghus
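
    A minimal sketch of that kind of per-IP check, assuming a single long-running process that can keep recent hits in memory (a CGI script started fresh for every request would need a shared store instead). The class name and the limits are hypothetical.

```python
import time
from collections import defaultdict, deque


class PageQuota:
    """Allow at most max_pages requests per IP within a sliding time window."""

    def __init__(self, max_pages=30, window_seconds=600):
        self.max_pages = max_pages
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent allowed hits

    def allow(self, ip, now=None):
        """Return True and record the hit, or False if the IP is over quota."""
        now = time.time() if now is None else now
        hits = self._hits[ip]
        # Forget hits that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_pages:
            return False
        hits.append(now)
        return True
```

    The page script would call something like `quota.allow(client_ip)` first and print the error message when it returns False. Each check only touches that one IP's recent timestamps, so nothing has to be re-read from a full access log on every hit.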

  15. #15
    Serial Publisher silver trophy aspen's Avatar
    Join Date
    Aug 1999
    Location
    East Lansing, MI USA
    Posts
    12,937
    Every page is dynamic, and 99% of them are run off the same script too.

    Last night someone ripped my entire site again, 8000+ hits.

    Anyways I usually catch them in the middle of it since it is a rather long process.

    But what I've done is add a check on the HTTP user agent. Yes, it can be faked. But chances are they won't fake it the first time around, and when they try I'll automatically log and ban their IP address.

    If that doesn't work I'll try a time based limit.
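
    In application code, a user-agent check with an automatic ban might look roughly like the sketch below. It is not the script described in this post; the function name and the list of offline-browser tokens are assumptions chosen for illustration.

```python
# Assumption: example substrings of offline-browser User-Agent strings.
BLOCKED_AGENT_TOKENS = ("httrack", "teleport", "webzip", "wget")


def check_request(ip, user_agent, banned_ips):
    """Return True if the request should be served.

    An IP that sends a blocked User-Agent is added to banned_ips, so it
    stays blocked even if it later switches to a browser-like string.
    """
    if ip in banned_ips:
        return False
    agent = (user_agent or "").lower()
    if any(token in agent for token in BLOCKED_AGENT_TOKENS):
        banned_ips.add(ip)  # the "log and ban" step
        return False
    return True
```

    Because the ban is keyed on the IP, a ripper that starts faking its user agent after the first attempt is still refused from that address.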

  16. #16
    SitePoint Zealot Aonghus's Avatar
    Join Date
    Feb 2001
    Location
    Ireland
    Posts
    116
    Maybe I'll write up a script to take care of it... How much would you be willing to pay, aspen? heh, just kidding... Open Source and all that...

    Joking aside, it would be extremely easy to add a subroutine to every script that checks a log and exits the script if the visitor has met their 'quota'.

    -Aonghus

  17. #17
    Serial Publisher silver trophy aspen's Avatar
    Join Date
    Aug 1999
    Location
    East Lansing, MI USA
    Posts
    12,937
    Considering the kind of traffic I get I don't know if I'd like the overhead of doing that.

  18. #18
    SitePoint Zealot Aonghus's Avatar
    Join Date
    Feb 2001
    Location
    Ireland
    Posts
    116
    Well, maybe you should weigh the options. Either let them rip your site, keep chasing them manually and banning their IPs (not always a good idea: what about the next person using that ISP who gets that IP address and visits your site?), and pay the bandwidth cost of that; or set up a system that limits the number of requests to a 'realistic' amount, and pay the cost of that.

    If you think the second option will use up a lot of resources, then fine, but if you think the site-rippers would use up more resources, then set it up... totally your decision, but they're more or less the only options you have.

    -Aonghus

  19. #19
    Serial Publisher silver trophy aspen's Avatar
    Join Date
    Aug 1999
    Location
    East Lansing, MI USA
    Posts
    12,937
    I don't chase them manually; I have a script that auto-bans them now. Scanning an access log for every hit isn't feasible at all.

  20. #20
    Serial Publisher silver trophy aspen's Avatar
    Join Date
    Aug 1999
    Location
    East Lansing, MI USA
    Posts
    12,937
    This is basically what I'm doing:

    http://66.33.83.213/forums/showthrea...threadid=11744

  21. #21
    Say WHA?! goober's Avatar
    Join Date
    Sep 2000
    Location
    United States
    Posts
    1,921
    Chris, can't you do something with global.asa that would check to see how fast they're loading successive pages? I mean, there's gotta be some hidden function like onPageLoad() or something. Should I look around for things like this, or do you want to still do it the .htaccess way?
    Sean Killeen [LinkedIn] [Twitter] [Web]

    Warning: Reality.sys corrupted. Universe halted. Reboot? (Y/N)

  22. #22
    Serial Publisher silver trophy aspen's Avatar
    Join Date
    Aug 1999
    Location
    East Lansing, MI USA
    Posts
    12,937
    Wrong platform. .htaccess is how you control website access and server settings in Apache; there is no better way to do it.

  23. #23
    Say WHA?! goober's Avatar
    Join Date
    Sep 2000
    Location
    United States
    Posts
    1,921
    Ah, yes, thanks for the reminder.

