Dealing with unqualified HREF values

When I was building my extension for finding unused CSS rules, I needed a way of qualifying any href value into a complete URI. I needed this because I wanted it to support stylesheets inside IE conditional comments, but of course to Firefox these are just comments — I had to parse each comment node with a regular expression to extract what’s inside it, and therefore, the href value I got back was always just a string, not a property or a qualified path.

And it’s not the first time I’ve needed this ability, but in the past it’s been with predictable circumstances where I already know the domain name and path. But here those circumstances were not predictable — I needed a solution that would work for any domain name, any path, and any kind of href format (remembering that an href value could be any one of several formats):

relative: "test.css"
relative with directories: "foo/test.css"
relative from here: "./test.css"
relative from higher up the directory structure: "../../foo/test.css"
relative to the http root: "/test.css"
absolute: "https://www.sitepoint.com/test.css"
absolute with port: "http://www.sitepoint.com:80/test.css"
absolute with different protocol: "https://www.sitepoint.com/test.css"

When are HREFs qualified?

When we retrieve an href with JavaScript, the value that comes back has some cross-browser quirks. What mostly happens is that a value retrieved with the shorthand .href property will come back as a qualified URI, whereas a value retrieved with getAttribute('href') will (and should, according to specification) come back as the literal attribute value. So with this link:

<a id="testlink" href="/test.html">test page</a>

We should get these values:

document.getElementById('testlink').href == 'https://www.sitepoint.com/test.html';
document.getElementById('testlink').getAttribute('href') == '/test.html';

And in Opera, Firefox and Safari that is indeed what we get. However in Internet Explorer (all versions, up to and including IE7) that isn’t what happens — for both examples we get back a fully-qualified URI, not a raw attribute value:

document.getElementById('testlink').href == 'https://www.sitepoint.com/test.html';
document.getElementById('testlink').getAttribute('href') == 'https://www.sitepoint.com/test.html';

This behavioral quirk is documented in Kevin Yank and Cameron Adams’ recent book, Simply JavaScript; but it gets quirkier still. Although this behavior applies with the href of a regular link (an <a> element), if we do the same thing for a <link> stylesheet, we get exactly the opposite behavior in IE. This HTML:

<link rel="stylesheet" type="text/css" href="/test.css" />

Produces this result:

document.getElementById('teststylesheet').href == '/test.css';
document.getElementById('teststylesheet').getAttribute('href') == '/test.css';

In both cases we get the raw attribute value (whereas in other browsers we get the same results as for an anchor — .href is fully qualified while getAttribute produces a literal value).

Anyway…

Behavioral quirks aside, I have to say that IE‘s behavior with links is almost always what I want. Deriving a path or file name from a URI is fairly simple, but doing the opposite is rather more complex.

So I wrote a helper function to do it. It accepts an href in any format and returns a qualified URI based on the current document location (or if the value is already qualified, it’s returned unchanged):

//qualify an HREF to form a complete URI
function qualifyHREF(href)
{
	//get the current document location object
	var loc = document.location;

	//build a base URI from the protocol plus host (which includes port if applicable)
	var uri = loc.protocol + '//' + loc.host;

	//if the input path is relative-from-here
	//just delete the ./ token to make it relative
	if(/^(./)([^/]?)/.test(href))
	{
		href = href.replace(/^(./)([^/]?)/, '$2');
	}

	//if the input href is already qualified, copy it unchanged
	if(/^([a-z]+):///.test(href))
	{
		uri = href;
	}

	//or if the input href begins with a leading slash, then it's base relative
	//so just add the input href to the base URI
	else if(href.substr(0, 1) == '/')
	{
		uri += href;
	}

	//or if it's an up-reference we need to compute the path
	else if(/^((../)+)([^/].*$)/.test(href))
	{
		//get the last part of the path, minus up-references
		var lastpath = href.match(/^((../)+)([^/].*$)/);
		lastpath = lastpath[lastpath.length - 1];

		//count the number of up-references
		var references = href.split('../').length - 1;

		//get the path parts and delete the last one (this page or directory)
		var parts = loc.pathname.split('/');
		parts = parts.splice(0, parts.length - 1);

		//for each of the up-references, delete the last part of the path
		for(var i=0; i<references; i++)
		{
			parts = parts.splice(0, parts.length - 1);
		}

		//now rebuild the path
		var path = '';
		for(i=0; i<parts.length; i++)
		{
			if(parts[i] != '')
			{
				path += '/' + parts[i];
			}
		}
		path += '/';

		//and add the last part of the path
		path += lastpath;

		//then add the path and input href to the base URI
		uri += path;
	}

	//otherwise it's a relative path,
	else
	{
		//calculate the path to this directory
		path = '';
		parts = loc.pathname.split('/');
		parts = parts.splice(0, parts.length - 1);
		for(var i=0; i<parts.length; i++)
		{
			if(parts[i] != '')
			{
				path += '/' + parts[i];
			}
		}
		path += '/';

		//then add the path and input href to the base URI
		uri += path + href;
	}

	//return the final uri
	return uri;
}

One more for the toolkit!