I would like to capture an external web page and cache it.
For example, I would like to be able to cache a snapshot of www.bbc.co.uk and display it at mysite/bbc.
Capturing the HTML seems simple enough; a cURL call should do it…
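To show what I mean, here is roughly the capture step I have in mind (a minimal sketch using PHP's cURL extension; the target URL and the cache/bbc.html path are just placeholders):

```php
<?php
// Fetch the remote page and store a snapshot locally.
// URL and cache path are illustrative, not a fixed design.
$url = 'http://www.bbc.co.uk/';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
$html = curl_exec($ch);
curl_close($ch);

if ($html !== false) {
    file_put_contents('cache/bbc.html', $html); // save the snapshot
}
```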
The problem I am going to have is that all the links, images and scripts will be broken…
An image in the form <img src="/images/pic.png"> will be broken, because it will no longer point to http://www.bbc.co.uk/images/pic.png but rather to mysite/images/pic.png.
A solution to this would be to run a regex and replace ="/ with ="http://www.bbc.co.uk/, but this is a bit cumbersome and unreliable… (see the sketch below).
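By way of illustration, the kind of rewrite I mean would be something like this (a rough sketch; the pattern and file paths are assumptions, and it only catches root-relative URLs in double-quoted src/href attributes):

```php
<?php
// Rewrite root-relative src/href attributes to point back at the original host.
$base = 'http://www.bbc.co.uk';
$html = file_get_contents('cache/bbc.html');

$html = preg_replace(
    '/(src|href)="\//i',   // match src="/ or href="/
    '$1="' . $base . '/',  // prepend the original host
    $html
);

file_put_contents('cache/bbc.html', $html);
```

Even this misses single-quoted attributes, protocol-relative URLs, CSS url() references and anything generated by JavaScript, which is why it feels so fragile.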
So… is there another way of doing it?
I was thinking that there might be a way to set a response header to tell the browser to resolve relative URLs against http://www.bbc.co.uk rather than mysite.co.uk…
(Note:… I know there are copyright issues etc. with this… I am just displaying a snapshot of a page, not copying entire pages from a site… and all their references are left in place.)
Thanks