Need help fine-tuning my CF link checker

For the past several years my office has been using an in-house link checker to validate links on web pages. The reason for the in-house solution is that a large percentage of our links go through our exit door ("You are now leaving our site…") and none of the commercial checkers would check past it.

The in-house link checker had some limitations, the main one being that it could only check one page at a time. Also, it called Perl (through a <cfexecute> tag), so when we upgraded our Perl version a few weeks ago, the link checker broke.

I’ve taken it upon myself to write a new link checker using CF only, and the task wasn’t as difficult as I had assumed. It took me only a few hours to come up with a working version using the <cfhttp> tag, and I’ve managed to duplicate the functionality of the original link checker.

Now that I have it working, I’d like to try to do the following:

  1. Get a better regex for my rematch() statement. The current one matches any http:// URL, so it trips over the link in the doctype tag and any links in the head section of the page. I’d like to get it to look for <a href= instead, but I’m not so good with regexes. Here’s the one I’m currently using:
([A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]+)
  2. Get it to follow links. If I load a page in, I’d like it to check all the child pages. I’m not 100% sure how to do this, but I’m thinking I need to build some sort of loop that loads each link back into the <cfhttp> tag, as long as the link is on our domain (I don’t want it to endlessly follow links until the end of time).
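
For the first item, one possible direction is to match only the opening part of anchor tags and then pull the quoted href value out of each match. This is a rough sketch rather than a tested solution: sampleHtml, hrefPattern, anchors and links are made-up names, and the pattern assumes the href values are quoted.

```cfml
<!--- Sketch only: sampleHtml stands in for cfhttp.filecontent. --->
<cfset sampleHtml = '<link href="http://example.com/style.css"><a class="nav" href="http://example.com/page.cfm">Page</a>'>

<!--- Match "<a", some whitespace, any other attributes, then a quoted href value.
      Inside a single-quoted CFML string, a literal single quote is written as two. --->
<cfset hrefPattern = '<a\s[^>]*href\s*=\s*["''][^"'']+["'']'>
<cfset anchors = reMatchNoCase(hrefPattern, sampleHtml)>

<cfset links = arrayNew(1)>
<cfloop array="#anchors#" index="anchor">
    <!--- Keep only the URL between the quotes (back-reference \1). --->
    <cfset arrayAppend(links, reReplaceNoCase(anchor, '^<a\s[^>]*href\s*=\s*["'']([^"'']+)["'']$', '\1'))>
</cfloop>

<cfdump var="#links#">
```

Because the pattern requires whitespace right after <a, tags such as <link> and the doctype are skipped. Unquoted href values would be missed, so the pattern may need loosening for older markup.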

Any advice on that would be appreciated.

Here’s my current code (sans the page formatting and code to pull out the exit door):


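<cfif structkeyexists(form, "submitform")>

<!--- The pattern from item 1 above --->
<cfset regex = "([A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]+)">

<!--- Fetch the page to be checked; resolveurl turns relative links into absolute ones --->
<cfhttp method="get" url="#FORM.url#" resolveurl="yes">

<!--- Pound signs are not needed inside a cfset expression --->
<cfset result = rematch(regex, cfhttp.filecontent)>

<table border="1" cellspacing="0" cellpadding="5">
<tr><th>URL</th><th>Status</th></tr>

<!--- Request each link found and report its status code --->
<cfloop from="1" to="#arraylen(result)#" index="i">
	<cfhttp method="get" url="#result[i]#" resolveurl="yes">
	<cfoutput><tr><td><a href="#result[i]#">#result[i]#</a></td><td class="<cfif listfirst(cfhttp.statusCode," ") eq 200>good<cfelse>bad</cfif>">#cfhttp.statusCode#</td></tr></cfoutput>
</cfloop>

</table>

</cfif>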
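
For the second item (following links), the loop idea could be sketched as a queue of URLs plus a struct of pages already visited, so nothing is fetched twice and the crawl stops at a fixed cap. This is only a sketch under assumptions: startUrl, siteHost, maxPages, queue, visited and hostPattern are hypothetical names, www.example.gov is a placeholder host, and the href pattern assumes quoted attribute values.

```cfml
<!--- Sketch only: placeholder start page, host and page cap. --->
<cfset startUrl = "http://www.example.gov/index.cfm">
<cfset siteHost = "www.example.gov">
<cfset maxPages = 50>

<!--- Escape the dots so the host name is matched literally. --->
<cfset hostPattern = "^https?://" & replace(siteHost, ".", "\.", "all") & "(/|$)">

<cfset queue = arrayNew(1)>
<cfset arrayAppend(queue, startUrl)>
<cfset visited = structNew()>
<cfset pos = 1>

<!--- A conditional loop is used because the queue grows while it is being walked. --->
<cfloop condition="pos lte arrayLen(queue) and structCount(visited) lt maxPages">
    <cfset pageUrl = queue[pos]>
    <cfset pos = pos + 1>

    <cfif not structKeyExists(visited, pageUrl)>
        <cfhttp method="get" url="#pageUrl#" resolveurl="yes" timeout="15" result="page">
        <cfset visited[pageUrl] = page.statusCode>

        <!--- Every link is checked, but only text pages on our own host are parsed for more links. --->
        <cfif reFindNoCase(hostPattern, pageUrl) and listFirst(page.statusCode, " ") eq 200 and isSimpleValue(page.fileContent)>
            <cfset anchors = reMatchNoCase('<a\s[^>]*href\s*=\s*["''][^"'']+["'']', page.fileContent)>
            <cfloop array="#anchors#" index="anchor">
                <cfset link = reReplaceNoCase(anchor, '^<a\s[^>]*href\s*=\s*["'']([^"'']+)["'']$', '\1')>
                <!--- Skip mailto:, javascript: and anything already checked. --->
                <cfif reFindNoCase("^https?://", link) and not structKeyExists(visited, link)>
                    <cfset arrayAppend(queue, link)>
                </cfif>
            </cfloop>
        </cfif>
    </cfif>
</cfloop>

<!--- visited now maps each URL to the status code it returned. --->
<cfdump var="#visited#">
```

The maxPages cap and the host test are what keep the crawl from running forever. Struct keys are case-insensitive in CF, so URLs that differ only by case would be treated as the same page. Exit-door links live on our own host, so they would still need to be unwrapped before this test is applied.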

OK, this maybe isn’t answering the question as such :blush: but I thought I’d add: rather than “reinventing the wheel”, have you looked at open source products that do link checking and reporting for you?

The one we tend to use is Linklint (“Linklint - fast html link checker”), which does a great job for our Web Editing Team.

Well, like I mentioned in the first paragraph, none of the commercial products seem to be able to handle our exit door (“You are now leaving our site…”). Any link that leads away from our site goes to a CF page that displays a notice like that, then forwards you with a meta refresh tag after a few seconds. Or you can click the link on that page.
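
To illustrate what a checker has to do to get past a page like that, here is a rough sketch of pulling the forwarding target out of a meta refresh tag. The markup in exitHtml is made up for the example (the real exit page may look different), and exitHtml, metaTags, m and targetUrl are hypothetical names.

```cfml
<!--- Sketch only: exitHtml is invented markup standing in for the exit-door page. --->
<cfset exitHtml = '<html><head><meta http-equiv="refresh" content="5;url=http://www.example.com/"></head><body>You are now leaving our site...</body></html>'>

<!--- Find the meta refresh tag, whatever order its attributes are in. --->
<cfset metaTags = reMatchNoCase('<meta\s[^>]*http-equiv\s*=\s*["'']?refresh["'']?[^>]*>', exitHtml)>

<cfset targetUrl = "">
<cfif arrayLen(metaTags)>
    <!--- Ask reFindNoCase for subexpression positions so the url= value can be cut out with mid(). --->
    <cfset m = reFindNoCase('url\s*=\s*([^"''>]+)', metaTags[1], 1, true)>
    <cfif arrayLen(m.pos) gte 2 and m.pos[2] gt 0>
        <cfset targetUrl = trim(mid(metaTags[1], m.pos[2], m.len[2]))>
    </cfif>
</cfif>

<!--- An empty targetUrl means no refresh target was found. --->
<cfoutput>#htmlEditFormat(targetUrl)#</cfoutput>
```

The extracted targetUrl is what would then be passed to <cfhttp> in place of the exit-door link itself.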

Another reason I’m “reinventing the wheel” is that with this being a federal gov’t agency, if I asked for something that needed to be bought, I’d likely get a copy by the time I retire. I’ve been waiting almost a year for a copy of CF Builder, and that was actually approved. I can’t imagine how long I’d have to wait if the approval process was thrown into the mix.