I would like to capture an external web page and cache it.
For example, I would like to be able to cache a snapshot of www.bbc.co.uk and display it at mysite/bbc.
Capturing the HTML seems simple enough; a cURL call should do it…
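To show what I mean, here is roughly the capture step I have in mind (a minimal sketch using PHP's cURL extension; the target URL and the cache/bbc.html path are just placeholders):

```php
<?php
// Fetch the remote page and store a snapshot locally.
// URL and cache path are illustrative, not a fixed design.
$url = 'http://www.bbc.co.uk/';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
$html = curl_exec($ch);
curl_close($ch);

if ($html !== false) {
    file_put_contents('cache/bbc.html', $html); // save the snapshot
}
```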
The problem I am going to have is that all the links, images and scripts will be broken…
An image in the form <img src="/images/pic.png"> will be broken, because it will no longer point to http://www.bbc.co.uk/images/pic.png but rather to mysite/images/pic.png.
A solution to this would be to run a regex and replace ="/ with ="http://www.bbc.co.uk/, but this is a bit cumbersome and unreliable… (see the sketch below).
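By way of illustration, the kind of rewrite I mean would be something like this (a rough sketch; the pattern and file paths are assumptions, and it only catches root-relative URLs in double-quoted src/href attributes):

```php
<?php
// Rewrite root-relative src/href attributes to point back at the original host.
$base = 'http://www.bbc.co.uk';
$html = file_get_contents('cache/bbc.html');

$html = preg_replace(
    '/(src|href)="\//i',   // match src="/ or href="/
    '$1="' . $base . '/',  // prepend the original host
    $html
);

file_put_contents('cache/bbc.html', $html);
```

Even this misses single-quoted attributes, protocol-relative URLs, CSS url() references and anything generated by JavaScript, which is why it feels so fragile.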
So… is there another way of doing it?
I was thinking that there might be a way to set a response header to tell the browser to resolve relative URLs against http://www.bbc.co.uk rather than mysite.co.uk…
(Note:… I know there are copyright issues etc. with this… I am just displaying a snapshot of a page, not copying entire pages from a site… and all their references are left in place.)
Thanks