Get final URL of a JavaScript Redirect using PHP

Hi,

I’ve been trying to figure out how to get the target URL of a link using PHP. This is normally done with cURL, which would work fine if the links I am testing issued ordinary HTTP redirects. However, they use JavaScript to do the redirecting.

Does anyone know of a different PHP technique I can use to figure out the final URL of a link, no matter what kind of redirect is in place? I feel like it may be impossible, but I wanted to see what others thought…

Thanks for the help!

Give an example of the code/flow of the redirect.

Essentially, I have a script that generates a sitemap for my site from a database built by a website spidering program. The database contains a number of URLs whose pages use JavaScript to redirect to the actual content, say a PDF file:

something.com/linkothumbredirect.jsp?im_dbkey=123456

redirects via JavaScript to:

something.com/content/somepdf.pdf

The problem is that, because the pages redirect using JavaScript, I can’t read the target URL (the second URL above) from the response headers. All I get is the original URL from before the page redirects. I’ve been trying to do this using cURL:

/*
 * Fetch a URL with cURL and return the effective URL after any HTTP
 * redirects. JavaScript and meta-refresh redirects are not followed.
 */
function get_web_page( $url )
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Follow Location headers; without this the effective URL is just $url.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $finalUrl;
}

Unfortunately, I don’t have access to modify the redirect strategy of these pages. I’m trying to find a solution that would work with any redirect.

You can’t really do that without emulating JavaScript, which in turn would mean emulating an entire browser. Firefox can be remotely scripted, so it is possible, but it would be a lot of work. You’re probably better off writing a site-specific regular expression to parse the URL out of the JavaScript code. Alternatively, you could simply treat the page as one big text file, parse out everything that looks like a URL, and then try to fetch each candidate.
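A rough sketch of the "one big text file" approach, in case it helps. The function name `extract_url_candidates` and the sample markup are made up for illustration, and the pattern only catches absolute `http`/`https` URLs, so relative paths would need separate handling.

```php
<?php
// Hypothetical helper: collect everything in a page body that looks like an absolute URL.
function extract_url_candidates(string $html): array
{
    // Match http(s) URLs up to the first whitespace, quote, angle bracket, or closing paren.
    preg_match_all('#https?://[^\s"\'<>)]+#i', $html, $matches);

    // Drop duplicates and reindex so the result is a plain list.
    return array_values(array_unique($matches[0]));
}

// Illustrative input only; a real page would come from cURL or file_get_contents().
$html = '<script>location.replace("http://something.com/content/somepdf.pdf");</script>';
print_r(extract_url_candidates($html));
```

Each candidate would still have to be requested to see which one is the real content, so this is probably only worth it when no site-specific pattern is available.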

So how does the JavaScript know how to translate url1 into url2?

Or are you saying that the server does the redirect via an HTTP header, and you just don’t yet know how to get that header?
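One way to rule out the HTTP header case is to look for a `Location` header in the response. This is only a sketch: `find_location_header` is a made-up helper name, and the sample header lines are invented to mirror the shape of what PHP's `get_headers()` returns.

```php
<?php
// Hypothetical helper: return the first Location header value from raw header lines, or null.
function find_location_header(array $headerLines): ?string
{
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so compare with stripos().
        if (stripos($line, 'Location:') === 0) {
            return trim(substr($line, strlen('Location:')));
        }
    }
    return null;
}

// Invented sample data in the shape of get_headers($url) output.
$sample = ['HTTP/1.1 302 Found', 'Location: http://something.com/content/somepdf.pdf'];
var_dump(find_location_header($sample));

// Against a live URL, something like: find_location_header(get_headers($url) ?: []);
```

If this returns null for the real page, the redirect is not happening at the HTTP level and has to be read out of the page body instead.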

The JSP dynamically populates the values of the redirect, referencing a database.

You’re not providing enough information.

The JSP page looks like this:

<html>
<head>
<title><%= (ci.getAltTitle() == null || ci.getAltTitle().equals("")) ?
ci.getTitle() + " - PTC.com" :
ci.getAltTitle()%></title>
<script type="text/javascript"><!--//--><![CDATA[//><!--
var isNew = "1";
var isBack;

function cleanRedirect() {
	isBack = (isNew != document.backCheck.a1.value);
	document.backCheck.a1.value=2;
	document.backCheck.a1.defaultValue=2;
	
	var url = '<%=file.getPath()%>'	
	if (isBackButtonUsed()) {
		url = document.referrer;
	}
	location.replace(url);

}

function isBackButtonUsed () {
	return isBack;
}

//--><!]]></script>

<meta http-equiv="Refresh" content="1;url=<%=file.getPath()%>">

</head>
<body onload="cleanRedirect();">
<form name="backCheck" id="backCheck">
<input type="hidden" name="a1" value="1" style="visibility:hidden">
</form>
</body>
</html>

It is getting the path of the file and plugging it into the JavaScript function which will redirect to the URL of the actual content.

Okay.

So using


$resp = file_get_contents('http://something.com/linkothumbredirect.jsp?im_dbkey=123456');

returns the HTML/JavaScript you just posted.

Use a regex to parse the URL out.


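$resp = file_get_contents('http://something.com/linkothumbredirect.jsp?im_dbkey=123456');
// file_get_contents() returns false on failure, so check before matching.
if ($resp !== false && preg_match("#var url = '([^']+)'#", $resp, $matches)) {
    echo $matches[1]; // the path that cleanRedirect() passes to location.replace()
} else {
    // request failed, or the page did not contain the expected "var url = '...'" line
}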
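The page you posted also carries the same path in a `<meta http-equiv="Refresh">` tag, so that could serve as a fallback when the `var url` line is missing. Below is a minimal sketch under a few assumptions: the names `find_redirect_target` and `resolve_against` are made up, `http-equiv` is assumed to appear before `content` (as it does in your page), and the path is assumed to be relative to the same host. It does not handle `../` segments.

```php
<?php
/** Return the redirect target found in the page body, or null if none is recognised. */
function find_redirect_target(string $html): ?string
{
    // 1. The JavaScript variable used by cleanRedirect() in the posted page.
    if (preg_match("#var url = '([^']+)'#", $html, $m)) {
        return $m[1];
    }
    // 2. Fallback: <meta http-equiv="Refresh" content="1;url=...">.
    $meta = '#<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*content=["\']\s*\d+\s*;\s*url=([^"\'>]+)#i';
    if (preg_match($meta, $html, $m)) {
        return trim($m[1]);
    }
    return null;
}

/** Turn a root-relative or relative target into an absolute URL (no ../ handling). */
function resolve_against(string $base, string $target): string
{
    // Already absolute: nothing to do.
    if (preg_match('#^https?://#i', $target)) {
        return $target;
    }
    $p = parse_url($base);
    $origin = $p['scheme'] . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');
    // Root-relative path such as /content/somepdf.pdf.
    if ($target !== '' && $target[0] === '/') {
        return $origin . $target;
    }
    // Otherwise resolve against the directory of the base path.
    $path = $p['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $target;
}

// Illustrative input: only the meta tag is present, so the fallback is used.
$page = '<meta http-equiv="Refresh" content="1;url=/content/somepdf.pdf">';
$target = find_redirect_target($page);
echo resolve_against('http://something.com/linkothumbredirect.jsp?im_dbkey=123456', $target), "\n";
// prints http://something.com/content/somepdf.pdf
```

Trying the JavaScript variable first and the meta tag second matches what the browser would do here, since `onload` fires before the one-second refresh delay runs out.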