Dealing with unqualified HREF values

When I was building my extension for finding unused CSS rules, I needed a way of qualifying any href value into a complete URI. The extension had to support stylesheets inside IE conditional comments, but to Firefox these are just comments, so I had to parse each comment node with a regular expression to extract what’s inside it. The href value I got back was therefore always just a string, not a property or a qualified path.

It’s not the first time I’ve needed this ability, but in the past the circumstances were predictable: I already knew the domain name and path. Here they were not. I needed a solution that would work for any domain name, any path, and any kind of href format, remembering that an href value could be in any one of several formats:

  • relative: "test.css"
  • relative with directories: "foo/test.css"
  • relative from here: "./test.css"
  • relative from higher up the directory structure: "../../foo/test.css"
  • relative to the http root: "/test.css"
  • absolute: "http://www.sitepoint.com/test.css"
  • absolute with port: "http://www.sitepoint.com:80/test.css"
  • absolute with different protocol: "https://www.sitepoint.com/test.css"
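To make the distinctions in that list concrete, here is a minimal sketch that sorts an href string into those categories. The function name classifyHREF and the category labels are made up for illustration, and the patterns only cover the formats listed above.

```javascript
//classify an href string into one of the formats listed above
//(hypothetical helper: the name and the labels are illustrative only)
function classifyHREF(href)
{
	//a scheme followed by :// means it's already absolute
	if(/^[a-z]+:\/\//i.test(href)) { return 'absolute'; }

	//a leading slash means it's relative to the http root
	if(href.charAt(0) == '/') { return 'root-relative'; }

	//one or more ../ tokens means it points higher up the directory structure
	if(/^(\.\.\/)+/.test(href)) { return 'up-reference'; }

	//a ./ token means it's explicitly relative from here
	if(/^\.\//.test(href)) { return 'relative-from-here'; }

	//anything else is a plain relative path, with or without directories
	return 'relative';
}
```

Each category needs different handling when it is turned into a complete URI, which is why it helps to tell them apart first.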

When are HREFs qualified?

When we retrieve an href with JavaScript, the value that comes back has some cross-browser quirks. What mostly happens is that a value retrieved with the shorthand .href property will come back as a qualified URI, whereas a value retrieved with getAttribute('href') will (and should, according to specification) come back as the literal attribute value. So with this link:

<a id="testlink" href="/test.html">test page</a>

We should get these values:

document.getElementById('testlink').href == 'http://www.sitepoint.com/test.html';
document.getElementById('testlink').getAttribute('href') == '/test.html';

And in Opera, Firefox and Safari that is indeed what we get. However in Internet Explorer (all versions, up to and including IE7) that isn’t what happens — for both examples we get back a fully-qualified URI, not a raw attribute value:

document.getElementById('testlink').href == 'http://www.sitepoint.com/test.html';
document.getElementById('testlink').getAttribute('href') == 'http://www.sitepoint.com/test.html';

This behavioral quirk is documented in Kevin Yank and Cameron Adams’ recent book, Simply JavaScript; but it gets quirkier still. Although this behavior applies with the href of a regular link (an <a> element), if we do the same thing for a <link> stylesheet, we get exactly the opposite behavior in IE. This HTML:

<link rel="stylesheet" type="text/css" href="/test.css" id="teststylesheet" />

Produces this result:

document.getElementById('teststylesheet').href == '/test.css';
document.getElementById('teststylesheet').getAttribute('href') == '/test.css';

In both cases we get the raw attribute value (whereas in other browsers we get the same results as for an anchor — .href is fully qualified while getAttribute produces a literal value).
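If you want to check what a particular browser does rather than take the lists above on trust, a small helper can read the value both ways and report whether they match. This is only a sketch: compareHREF is a hypothetical name, and it assumes nothing more than an element that has an .href property and a getAttribute method.

```javascript
//compare the two ways of reading an href from an element
//(hypothetical helper, not part of the original code; it makes no assumption
//about which browser it runs in and just reports what it finds)
function compareHREF(element)
{
	var property = element.href;
	var attribute = element.getAttribute('href');

	return {
		property: property,
		attribute: attribute,
		//true if both reads returned exactly the same string
		identical: property === attribute
	};
}
```

In a page this could be called as compareHREF(document.getElementById('testlink')). For an href like "/test.html", an identical result of true would indicate the behavior described for IE, where both reads return the same string.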

Anyway…

Behavioral quirks aside, I have to say that IE’s behavior with links is almost always what I want. Deriving a path or file name from a URI is fairly simple, but doing the opposite is rather more complex.

So I wrote a helper function to do it. It accepts an href in any format and returns a qualified URI based on the current document location (or if the value is already qualified, it’s returned unchanged):

//qualify an HREF to form a complete URI
function qualifyHREF(href)
{
	//get the current document location object
	var loc = document.location;

	//build a base URI from the protocol plus host (which includes port if applicable)
	var uri = loc.protocol + '//' + loc.host;

	//if the input path is relative-from-here
	//just delete the ./ token to make it relative
	if(/^(\.\/)([^\/]?)/.test(href))
	{
		href = href.replace(/^(\.\/)([^\/]?)/, '$2');
	}

	//if the input href is already qualified, copy it unchanged
	if(/^([a-z]+):\/\//.test(href))
	{
		uri = href;
	}

	//or if the input href begins with a leading slash, then it's base relative
	//so just add the input href to the base URI
	else if(href.substr(0, 1) == '/')
	{
		uri += href;
	}

	//or if it's an up-reference we need to compute the path
	else if(/^((\.\.\/)+)([^\/].*$)/.test(href))
	{
		//get the last part of the path, minus up-references
		var lastpath = href.match(/^((\.\.\/)+)([^\/].*$)/);
		lastpath = lastpath[lastpath.length - 1];

		//count the number of up-references
		var references = href.split('../').length - 1;

		//get the path parts and delete the last one (this page or directory)
		var parts = loc.pathname.split('/');
		parts = parts.splice(0, parts.length - 1);

		//for each of the up-references, delete the last part of the path
		for(var i=0; i<references; i++)
		{
			parts = parts.splice(0, parts.length - 1);
		}

		//now rebuild the path
		var path = '';
		for(i=0; i<parts.length; i++)
		{
			if(parts[i] != '')
			{
				path += '/' + parts[i];
			}
		}
		path += '/';

		//and add the last part of the path
		path += lastpath;

		//then add the path and input href to the base URI
		uri += path;
	}

	//otherwise it's a relative path,
	else
	{
		//calculate the path to this directory
		path = '';
		parts = loc.pathname.split('/');
		parts = parts.splice(0, parts.length - 1);
		for(var i=0; i<parts.length; i++)
		{
			if(parts[i] != '')
			{
				path += '/' + parts[i];
			}
		}
		path += '/';

		//then add the path and input href to the base URI
		uri += path + href;
	}

	//return the final uri
	return uri;
}
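The up-reference branch is the least obvious part, so here it is on its own as a rough sketch. The helper name resolveUpReferences is hypothetical, and it takes the document path as a plain string argument instead of reading it from document.location, so that the steps can be followed without a page to run them in.

```javascript
//the up-reference step in isolation (hypothetical helper for illustration)
//pathname is the path of the current page, e.g. loc.pathname
function resolveUpReferences(pathname, href)
{
	//split the href into its leading run of ../ tokens and the remainder
	var match = href.match(/^((\.\.\/)+)([^\/].*$)/);
	if(!match) { return null; }

	//each ../ token is three characters long
	var references = match[1].length / 3;
	var lastpath = match[3];

	//get the path parts and delete the last one (this page or directory)
	var parts = pathname.split('/');
	parts.pop();

	//for each of the up-references, delete the last part of the path
	for(var i=0; i<references; i++)
	{
		parts.pop();
	}

	//now rebuild the path, skipping the empty parts left by the split
	var path = '';
	for(i=0; i<parts.length; i++)
	{
		if(parts[i] != '')
		{
			path += '/' + parts[i];
		}
	}

	//and add the last part of the path
	return path + '/' + lastpath;
}
```

For a page at "/blogs/2007/index.html", the href "../../foo/test.css" climbs out of both directories and comes back as "/foo/test.css", while "../test.css" comes back as "/blogs/test.css".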

One more for the toolkit!

  • http://www.sitepoint.com/ Kevin Yank

    Also mentioned in Simply JavaScript, here’s some further reading on the mess of retrieving attributes in IE: Attribute Nightmare In IE.

  • http://www.tyssendesign.com.au Tyssen

    Sitepoint needs to update your article bio info, well the first part at least. ;)

  • Steve Clay

    Considering the first half of the article, I thought you were gearing up to do this:

    function qualifyHREF(href) {
        var div = document.body.appendChild(document.createElement('div'));
        div.innerHTML = '<a href="' + href + '"></a>';
        var ret = div.getElementsByTagName('a')[0].href;
        document.body.removeChild(div);
        return ret;
    }

    :) Kind of a joke, but…

  • http://www.brothercake.com/ brothercake

    LOL, nice hack – never even occurred to me :)

  • nick

    How do browsers (and this script) deal with a tag? The full path may not necessarily use document.location, but some other location specified by the author.

  • http://www.brothercake.com/ brothercake

    (Your tag was edited out – remember that HTML needs to be entities to show up as code.)

    Anyway, assuming you meant an <iframe> – that’s a very good question. I did originally have a context parameter that could accept an input document location object rather than using the main one:

    function qualifyHREF(href, context)
    {
        if(typeof context == 'undefined')
        {
            var loc = document.location;
        }
        else
        {
            loc = context;
        }

    And that could then be called like this:

    qualifyHREF('whatever.html', document.getElementById('myiframe').location);

    But that failed in Internet Explorer. I’m not sure why (I never got as far as investigating) but I think it might be that IE doesn’t expose the location object of an iframe page in that straightforward way. It might be necessary to get the iframe reference through the document.frames collection, or perhaps you need to drill down into the iframe object to get its contentDocument, and then that will have a location object.

  • nic

    Actually I meant a <base> tag!

  • nic

    sorry, submitted too soon.

    With a <base> tag, the document.location may refer to one URL, but the base tag would override this — for example a page on http://www.domain-one.com might have a base tag with an href of http://www.domain-two.com — any relative links which appeared on the page, a browser would send to domain-two, but because you’re looking at document.location, this script would still see it as domain-one.

  • http://www.brothercake.com/ brothercake

    Oh, the <base> element, yeah I hadn’t thought of that.

    Well that would similarly require feeding the function with a context location, and to do that it would need to start parsing location URIs as strings instead of relying on the location object; but doing that might also fix the <iframe> problem in IE, as well as allowing for processing of any context however created, not just window locations.

    I’ll give it some thought and post again in due course!
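For readers who land on this thread with a current browser, one possible shape for the string-based context discussed above is sketched below. It is not the follow-up promised in the comment and not a drop-in replacement for the function in the article: it leans on the standard URL constructor, which did not exist in the browsers discussed here, and the name qualifyWithBase is hypothetical.

```javascript
//qualify an href against an explicit base URI given as a string
//(hypothetical sketch: relies on the standard URL constructor,
//so it will not work in the older browsers discussed in this article)
function qualifyWithBase(href, base)
{
	//the URL constructor resolves relative, root-relative and up-reference
	//paths against the base, and leaves absolute URIs as they are
	return new URL(href, base).href;
}

//in a browser the base could be document.baseURI, which takes any
//<base> element into account; here it is passed in as a plain string
var qualified = qualifyWithBase('test.css', 'http://www.domain-two.com/');
```

Because the base is passed in as a string, the same call could take an iframe's address or any other context, not just the address of the current window.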