Caching a page

I would like to capture an external web page and cache it.

For example, I would like to be able to cache a snapshot and display it on mysite/bbc

Capturing the HTML seems simple enough; a curl function should do it…

The problem I am going to have is that all the relative links, images and scripts will be broken…

An image in the form <img src="/images/pic.png"> will be broken because it will no longer point to the original site but rather to mysite/images/pic.png

A solution to this would be to run a regex and replace ="/ with =" followed by the original site's address, but this is a bit cumbersome and unreliable…

so… is there another way of doing it?

I was thinking that there might be a way to set the headers to tell the browser to resolve relative URLs against the original site rather than mysite

(Note: I know there are copyright issues etc. with this… I am just displaying a snapshot of a page, not copying entire pages from a site… and all the references are in place)


How do you plan to show this “snapshot”?
As an image or via something like an iframe?

Most likely via an iframe


Found it

<base href="…
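
For anyone finding this later, here is a rough sketch of how the captured HTML could be patched before caching it. It is only an illustration in Python rather than the curl function mentioned above; the names insert_base_tag and fetch_snapshot are made up for this example, and http://example.com/ is a placeholder for the original site's address. It assumes the page has an ordinary <head> tag and simply puts the <base> tag right after it, so relative and root-relative URLs resolve against the original site.

```python
import html
import re
import urllib.request


def insert_base_tag(page_html, base_url):
    """Return page_html with a <base> tag placed right after the opening <head> tag."""
    # Escape the URL so it is safe inside a double-quoted attribute.
    base_tag = '<base href="{}">'.format(html.escape(base_url, quote=True))
    # Match <head> or <head ...attributes>, but not <header>.
    head_open = re.compile(r"<head\b[^>]*>", re.IGNORECASE)
    match = head_open.search(page_html)
    if match is None:
        # Assumption: with no <head> at all, prepending the tag is good enough.
        return base_tag + page_html
    return page_html[:match.end()] + base_tag + page_html[match.end():]


def fetch_snapshot(url):
    """Download a page and return it as text (hypothetical helper, not called below)."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    sample = '<html><head><title>t</title></head><body><img src="/images/pic.png"></body></html>'
    print(insert_base_tag(sample, "http://example.com/"))
```

The patched string is what you would write to your cache and serve in the iframe. Links written as absolute URLs are unaffected, and anything the page loads via script may still need separate handling.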

You might also want to look at the many options that wget offers; you can run it on a schedule through cron.