I’ve been trying to figure out how to get the target URL of a link using PHP. I found that this is normally done with cURL, which would have worked fine if the links I am testing redirected via HTTP headers; however, they use JavaScript to do the redirecting.
Does anyone know of a different PHP technique I can use to figure out the final URL of a link, no matter what kind of redirect is in place? I feel like it may be impossible, but I wanted to see what others thought…
Essentially, I have a script that generates a sitemap for my site from a database created by a website-spidering program. Within this database are a bunch of URLs whose pages use JavaScript to redirect to the actual content, say a PDF file:
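For example (these URLs are hypothetical, just to illustrate the pattern), a page like http://example.com/docs/report.php might contain nothing but a script that sends the browser on to the real document:

<script type="text/javascript">
    // the actual content lives at the second URL
    window.location = "http://example.com/files/report.pdf";
</script>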
The problem I am facing is that, because the pages redirect using JavaScript, I can’t scrape the target URL (the second URL above) out of the response headers. All I get is the original URL before the page redirects. I’ve been trying to do this using cURL:
/*
 * Fetch a web page from a URL and return the final (effective) URL that
 * cURL ends up on after following any HTTP redirects. Note that this only
 * follows Location headers, not JavaScript redirects.
 */
function get_web_page( $url )
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP Location redirects
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);     // keep the outgoing request headers for debugging
    curl_exec($ch);
    $final_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $final_url;
}
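Calling it against one of these pages (again a hypothetical URL) just echoes back the URL I started with, because there is no HTTP redirect for cURL to follow:

// Hypothetical usage: the page only redirects via JavaScript, so there is no
// Location header and the effective URL is the same one that was passed in.
echo get_web_page('http://example.com/docs/report.php');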
Unfortunately, I don’t have access to modify the redirect strategy of these pages. I’m trying to find a solution that would work with any redirect.
You can’t really do that without emulating JavaScript, which in turn would mean emulating an entire browser. Firefox can be scripted remotely, so it is possible, but it would be a lot of work. You’re probably better off writing a per-site regular expression to parse the URL out of the JavaScript code. Alternatively, you could simply treat the page as one big text file, parse out everything that looks like a URL, and then try to grab that.
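As a rough sketch of those last two approaches (the window.location pattern and the example URL are assumptions about what these pages look like, not something taken from your site), you could fetch the page body and pull out the first thing that looks like a redirect target:

// Rough sketch: fetch a page and try to pull a JavaScript redirect target
// out of its body. Assumes the redirect looks like window.location = "...";
// failing that, it falls back to the first absolute URL found anywhere in the page.
function guess_js_redirect_target( $url )
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // still follow ordinary HTTP redirects
    $body = curl_exec($ch);
    curl_close($ch);

    if ($body === false) {
        return null;
    }

    // Case 1: an explicit window.location (or window.location.href) assignment
    if (preg_match('/window\.location(?:\.href)?\s*=\s*["\']([^"\']+)["\']/i', $body, $m)) {
        return $m[1];  // may be relative; resolve it against $url if needed
    }

    // Case 2: fall back to the first thing in the page that looks like an absolute URL
    if (preg_match('#https?://[^\s\'"<>]+#i', $body, $m)) {
        return $m[0];
    }

    return null;
}

// Hypothetical usage:
// echo guess_js_redirect_target('http://example.com/docs/report.php');

This won’t catch every redirect technique (meta refresh, setTimeout tricks, and so on), but for a sitemap generator a simple pattern match per site is usually enough.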