cURL and relative links base href problem

Hey guys,

I am using cURL to get a few web pages. To solve the problem of relative paths, I set the base tag href attribute to be the page cURL downloads.

But I want all the clickable links on the page returned by cURL to have “http://example.com?url=” before the actual link. So I want every link to be preceded by the above URL.

So on the page, <a href="/about.php"> would become <a href="http://google.com/about.php"> because of the base tag, but then I want it finally to be <a href="http://example.com?url=http://google.com/about.php">

How can I go about this?

Well, actually, base would not change the href attribute itself; it only changes where the link resolves when clicked. But how can I make each link resolve to what I want on click (preceded by “http://example.com?url=”), in addition to the absolute mapping?

eek. does anybody at least understand what I’m asking and trying to do?

Why don’t you use preg_replace?

Can you elaborate on how I would do this? Let’s say I wanted to change the href attribute of every <a> tag on the page so that each link is preceded by a string of my choice (e.g. turn every href="index.php" into href="mystring?index.php"). What would the preg_replace look like?

In its simplest form:


<?php 
// set up our test string
$string = '
<html>
<body>
Some html here
<a href="file1.php">link</a> 
some more html here and then 
<a HREF="file2.php">second link</a>
</body>
</html>.';
 
// specify our find pattern and replacement string; the /i flag makes the match case-insensitive
$find = '/href="/i';
$replace = 'href="http://mystring?';
 
// parse the string
$string = preg_replace($find, $replace, $string);
 
// echo results to browser
echo htmlspecialchars($string); 
?> 

Yes but see I’m trying to ONLY do it for <a> tags. This would change all tags with a set href attribute.

And you can’t say $find = '<a href="' because not all sites have href as the first attribute in all their links. So it wouldn’t work for <a title="blah" href="blah">blah</a>

Also, how would you distinguish between href='blah.html' and href='http://externalsite.com/blah.html'?

Because I want to prepend the website’s root domain to the relative links, but not to the absolute links.

And remember, no base tag.

something like this should get you close:

$fetchUrl = 'http://www.google.com/';
$rewriteBaseUrl = 'http://example.com?url=';

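// real-world HTML is rarely valid, so collect parser warnings instead of printing them
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTMLFile($fetchUrl); // loading a URL this way needs allow_url_fopen enabled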

foreach($doc->getElementsByTagName('a') as $link) {
    $href = preg_replace('#^/#', '', $link->getAttribute('href'));
    $href = preg_match('#^http#i', $href) ? $rewriteBaseUrl.$href : $rewriteBaseUrl.$fetchUrl.$href;
    $link->setAttribute('href', $href);
} 


$html = $doc->saveHTML();
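
Since the pages here are downloaded with cURL anyway, the same loop can run on an HTML string instead of letting DOMDocument fetch the URL itself. A minimal sketch, assuming the curl and dom extensions are installed; fetchHtml() and parseHtml() are hypothetical helper names, not something from this thread, and the timeout value is arbitrary:

```php
<?php
// Hypothetical helper: download a page with cURL.
// Returns the response body, or null if the request failed.
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // assumed value, tune as needed
    ]);
    $body = curl_exec($ch);

    return is_string($body) ? $body : null;
}

// Hypothetical helper: parse an HTML string without printing markup warnings.
function parseHtml(string $html): DOMDocument
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return $doc;
}
```

Usage would look something like $html = fetchHtml($fetchUrl); followed by $doc = parseHtml($html); once $html is known not to be null, and then the same foreach over getElementsByTagName('a') as above.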

Thanks for your help Monkey. Would you mind explaining what exactly the replace functions look for and replace / what they do? They are my weakness as you can see. Thank you!!

$href = preg_replace('#^/#', '', $link->getAttribute('href'));

This removes a single leading '/' from the link, so it can be joined to the base URL, which already ends in '/'.

$href = preg_match('#^http#i', $href) ? $rewriteBaseUrl.$href : $rewriteBaseUrl.$fetchUrl.$href;

If the link starts with http, we assume it is an absolute link and use it as-is. If it does not, we assume it’s a relative path and add the base URL of the site to the beginning of the link, turning it into an absolute path. Either way, the rewrite URL is then put in front.
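
The "starts with http" check is a rough heuristic: it would also treat a file named httpfoo.php as absolute, and it rewrites mailto: links and #anchors too. A slightly more careful version of the same idea might look like the sketch below. rewriteHref() is a hypothetical helper name, and the rules it applies (skip fragments and non-HTTP schemes, handle //host links, resolve plain relative links against the page's directory) are assumptions about what is wanted, not something settled in this thread:

```php
<?php
// Hypothetical helper: turn one href into an absolute URL based on the page
// it came from, then put the rewrite prefix in front of it.
function rewriteHref(string $href, string $pageUrl, string $prefix): string
{
    $href = trim($href);

    // Leave empty links, #fragments and non-HTTP schemes (mailto:, javascript:) alone.
    if ($href === '' || $href[0] === '#'
        || preg_match('#^(?!https?:)[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }

    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'] ?? 'http';
    $root   = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    if (preg_match('#^https?://#i', $href)) {
        $absolute = $href;                  // already absolute
    } elseif (strpos($href, '//') === 0) {
        $absolute = $scheme . ':' . $href;  // protocol-relative: //host/path
    } elseif ($href[0] === '/') {
        $absolute = $root . $href;          // relative to the site root
    } else {
        // Relative to the directory of the fetched page.
        $path = $parts['path'] ?? '/';
        $dir  = substr($path, 0, strrpos($path, '/') + 1);
        $absolute = $root . $dir . $href;
    }

    return $prefix . $absolute;
}
```

With the URLs from the first post, rewriteHref('/about.php', 'http://google.com/', 'http://example.com?url=') gives http://example.com?url=http://google.com/about.php. The sketch does not collapse ../ segments, and it leaves the target URL unencoded to match the format asked for above; wrapping $absolute in urlencode() is probably safer when the target has its own query string.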

Very very neat Monkey. I really appreciate this. I’m gonna give it a shot later tonight. :)
Thank you!