Oops, There Goes Another Web Site: The Web Is Disappearing

By Josh Catone

If there’s one thing we know for sure about the Internet, it is that by its very nature it is a transient medium. What existed 10 years ago is generally nothing like what exists today, which will be nothing like what exists 10 years from now. Blogger Robert Scoble posted today that Google’s recently released Search 2001 archived search engine (our coverage) highlights the web’s transient nature well. Many of the sites that appeared in Google’s search results in 2001 not only don’t appear in the results for those searches today, but don’t exist on the web at all.

Unlike printed information, which requires a physical act to destroy, information on the web is lost to the ether passively and immediately when you change something on a page or stop paying your hosting bill. In April, the Library of Congress in the US completed a project that restored the original 6,000 books from Thomas Jefferson’s personal library, which made up the LoC’s first collection. Most of the books were originally lost in a fire about 150 years ago.

It’s somewhat alarming that printed books from a century and a half ago can still be archived today, while much of the information created on the web in the past decade is already gone forever. The biggest reason for that is probably the sheer amount of information that we’re creating.

Last year, Google’s cache contained 100 exabytes of data — or almost three quarters of a million times the size of the information contained in the Library of Congress, one of the world’s largest libraries. That’s far more data than existed to be archived 150 years ago, and the Internet, where everyone is a publisher, is causing us to create data perhaps faster than we can hold on to it.

We’re living in an age where data is being created at an overwhelming rate (to the point where many of us are feeling overloaded). Billions of gigabytes of data are pushed out over the web every year — and with the growing popularity of microblogging and the realization of the ubiquitous Internet, the rate of information creation is only going to grow.

Dave Morin from Facebook and Nova Spivack from Radar Networks said last week that the “ephemerality of the web” was a huge problem that we have to figure out how to address. Before we can begin to figure out how to filter and make use of all the information we’re creating on the web, we’re going to have to figure out how to keep it from disappearing.

150 years from now, will this blog post still exist? Robert Scoble doesn’t think most of what we’re writing today will survive the century. How about you? And the corollary question to all this is, should we even bother archiving most of the stuff on the web? Who gets to decide what should be saved and how? I’d be interested to hear your thoughts in the comments.

  • jpchasepoint

    I don’t think most of what is being written today will make it 10 years, let alone the century!

  • Do we need to keep everything? Surely only the best, plus a sample of the rest, of the content online really need be stored indefinitely. I know this comment of mine doesn’t need to be around in a decade.

  • nedlud

    Let’s face it, there’s so much rubbish on the net. Why should we want to keep it all anyway?

    Many sites’ content has “graduated” from the web into print. Not that one medium should necessarily be put above the other, but print has the advantage of having a perceived permanency (over the web’s “ephemerality”), and books *are* archived by libraries etc.

    Also, many people recycle content on the web, keeping it alive. Many fan sites exist which re-publish content and BBS discussions from the ancient days of the web (10 years ago). Aren’t these sites forming a kind of ad-hoc archiving process for content that would otherwise have gone the way of the dodo by now?

    I tend to think that 90% of what’s out there simply isn’t worth saving, and the remaining 10% will attract enough of a following that it will either make it into print or somebody will care enough to make a copy.

  • Danny Cooper

    I think every human being should create their own archive, whether it be on the web, stored on the computer, or on paper.

  • Gabe

    Why would we need to keep everything? There were thousands of years of information passed down from one generation to the next BEFORE written words. The internet is similar to storytelling in that way. Everyone takes what they read, see, or learn from the net, and as they redistribute that information, they put their own spin on it. Then someone else picks it up and carries it along in the same fashion. Add in the fact that such a high percentage of the information becomes irrelevant so quickly, and there really is no need to store any of it – the important stuff will get passed on in new messages and formats.

  • Anonymous

    50 years ago, people wrote to each other on paper rather than discussing items online, and paper newspapers were the way to spread news. Community news was no doubt written on paper and posted in the local shop.

    Very little of this information will have lasted the 50-year gap, primarily because it’s not that important.

    The same goes for the web: most of it is simply not worth saving, transient as it is by nature. You would hope that whatever is good enough has enough backing and community around it to be maintained, even when one person decides not to proceed with it. It’s an interesting question, though…

  • Chris Pratt

    This process of losing the web happens quicker than many of us think. I’ve seen blog posts created within the last year chalk up 404’s when I finally stumble upon them in Google search results. In those cases, Google keeping a cache has been invaluable, as I can still get the information I was looking for.

    However, it really doesn’t make sense to keep a cache of everything ever created for all time. There’s a lot of garbage out there. There is, though, a mechanism for creative caching already in place. Just use bookmarking and other site recommendation services already out there to sift the wheat from the chaff. This isn’t 100% foolproof, but for the most part, it would ensure that the most important information does get saved.

  • I often wonder how I will keep track of sites/pages that I link to on my site.

    It looks bad when a link is broken – even if it is in an article I posted a couple of years earlier.

    Maybe I should make a page where I can track and monitor all the external links I use…

    Does anyone else think about this for their sites?

  • Sean Tierney

    I’d agree with the sentiments expressed that we don’t have to preserve _everything_ created. It’s almost like hoarding possessions in your house: every one you keep dilutes the value of your existing collection because it makes the truly interesting ones harder to find.

    So if the problem can be broken into two components, #1 filtering out the crap and #2 preserving the data created, I’d say things like Akismet, the PageRank algorithm, relying on trusted authorities to distill info on a certain space, and the discovery aspects of social bookmarking sites are helping us deal with #1 adequately. #2 has an interesting “thermostat” effect working in its favor, in that people are more likely to protect valuable info. For instance, I’ve put a lot of time into writing what I believe are useful posts on my blog over the past 3 years. You can believe that I’ve taken precautions to back up the writings that I’ve labored hard to produce.

    So #2 should actually help with #1, in that the less valuable data will die off as people fail to protect it (I’ve let sites go that sucked and I’ve revived sites that went down because they’re worthy of preserving). What survives should be the cream of the crop mixed in with the inevitable spam, but hopefully our spam-fighting technology will stay one step ahead in the arms race.


  • Yes, I’m for recycling all unwanted articles and data, but archiving them is surely the main idea behind making the web’s history. I believe that one day college students and others will hit the search engines to learn about the web in our time: how it used to be, what we were thinking of, what we used and how we communicated, which services came first and what came after. Many, many things should be archived for the history of the web :)

    PS. I’m not talking about personal blogs, hi/bye tweets, etc., of course; I’m talking about the useful information out there in the cloud.

  • I don’t know if the size of the Google cache really means anything. Google might have just come up with better ways of filtering (and not caching) a lot of junk/spam sites.

  • nevillef

    Hi Josh,
    A lot of information on the Web has a limited useful lifetime, for example camera or mobile phone reviews. The shelf life of these products is so short that it is unlikely anyone would be interested in them a few years down the track.

    But there is a lot of content which is far more useful, and which in the ideal world would stay accessible for a long time.

    But sadly this is only too often not the case. And even with the likes of Google it can be difficult to re-find content you know you’ve found in the past, assuming it still exists.

    A solution is to take control of content that is important to you by saving it to your PC. There are several programs that do this, such as the Scrapbook Firefox Extension and our product Surfulater.


  • Abhay Bakshi

    Josh Catone asks: “150 years from now, will this blog post still exist?” My response is: who is worried about it? Who cares if it doesn’t exist? Worry is something that is not going to take us anywhere. Worry doesn’t last for long, and that’s good news.

    My response to the *whole* blog post is: we cannot forget that “energy can neither be created nor be destroyed, it just changes from one form to another”. If some good content is being created and posted on the web, it *will* be picked up by those (brains) to whom it is most applicable AND *will also* be retained by them in one form or another!!

    The other law that we should not forget about is: “the law of sowing and reaping”. Whatever you sow, you reap it, now or later. Whatever goes around comes around.

    Lastly, the best news is – the human has made progress in the whole process, and so has the Universe. Who is worried? Can we focus on our task at hand for today instead? “The Web Is Disappearing” – I am confident it is not. I am sorry. (Nothing personal against Josh Catone)


  • Let’s face it, there’s so much rubbish on the net. Why should we want to keep it all anyway?
    Many sites’ content has “graduated” from the web into print.

    Well said nedlud.

  • Dorsey

    While a lot, if not most, of what’s on the net is crap-ola (vanity blogs, porn, outdated catalogs, et al), what remains is likely worth holding on to. Back in the paper-only era, studies showed that at most 3% of what was being saved was ever needed again. The problem was that you never knew in advance what that 3% would be. I’m sure that all of you who are old enough to have maintained a paper filing system are nodding your heads about now.

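
One commenter above wondered about making a page to track and monitor all the external links used on a site, so that links that have gone dead can be spotted. Here is a minimal sketch of that idea in Python, using only the standard library. The names `LinkCollector`, `external_links`, `http_status` and `broken_links` are hypothetical, made up for this illustration, and treating anything other than a 200 status as "broken" is an assumption you may want to loosen.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import urllib.error
import urllib.request


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def external_links(html, own_host):
    """Return absolute http(s) links that point away from own_host."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for link in collector.links:
        parts = urlparse(link)
        if parts.scheme in ("http", "https") and parts.netloc != own_host:
            found.append(link)
    return found


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a page that has disappeared
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, timeout, refused connection


def broken_links(html, own_host, fetch=http_status):
    """List (url, status) pairs for external links that did not return 200.

    Assumption: any status other than 200 counts as broken.
    """
    report = []
    for url in external_links(html, own_host):
        status = fetch(url)
        if status != 200:
            report.append((url, status))
    return report
```

You would feed `broken_links` the HTML of one of your own pages along with your own host name, and it reports the outbound links that no longer answer. The `fetch` parameter is there so the check can be tried out without touching the network. Some servers refuse HEAD requests, so a real monitor would probably need to fall back to GET before declaring a link dead.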