Link extraction from HTML, with noting of nofollow and capturing of anchor text?

What code/regex could one use to extract links from HTML, with noting of whether each individual link was rel=nofollow or not, and capturing the anchor text associated with each link?

I’m confused by what regex one would use, since href and rel could occur in any order, their contents could be enclosed in either single or double quotes, and there are probably many other issues that are not occurring to me. Perhaps regex is not even the proper approach.

Also, how would one robustly identify relative links - e.g. “/path”, “#path” etc. rather than “http://www.example.com/path”?

Use the DOM extension rather than regex: http://us.php.net/manual/en/book.dom.php
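A rough sketch of what the DOM approach might look like, in case it helps. The function names extractLinks() and isRelativeUrl() and the array keys are made up for illustration; the DOMDocument calls are the standard ones from the DOM extension. It assumes "relative" simply means "has no scheme and is not protocol-relative", and that the input is non-empty and ASCII or declares its own charset.

```php
<?php
// Hypothetical helper: true for "/path", "#path", "page.html", "?q=1", etc.
function isRelativeUrl(string $href): bool
{
    // Protocol-relative URLs ("//host/path") name a host, so treat them as absolute.
    if (strpos($href, '//') === 0) {
        return false;
    }
    // An absolute URL starts with a scheme such as "http:" or "mailto:".
    return preg_match('~^[a-z][a-z0-9+.\-]*:~i', $href) !== 1;
}

// Hypothetical helper: returns one array per <a href=...> found in $html.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute('href')) {
            continue; // skip named anchors such as <a name="top">
        }
        $href = trim($a->getAttribute('href'));
        // rel is a space-separated token list, e.g. rel="external nofollow".
        $rel = preg_split('/\s+/', strtolower($a->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
        $links[] = [
            'href'     => $href,
            'text'     => trim($a->textContent), // anchor text with inner tags stripped
            'nofollow' => in_array('nofollow', $rel, true),
            'relative' => isRelativeUrl($href),
        ];
    }
    return $links;
}
```

Because the parser handles attribute order and quoting itself, the single-quote/double-quote and href-before-rel questions do not need separate handling here.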

Does anybody have any suggestions on a proper robust way to do this?

Thanks - I was thinking of using

preg_match_all("/<a\s[^>]*?href=\"([^>]*?)\".*?[^>]*?>(.*?)<\/a>/i", $html, $matches[1]);
preg_match_all("/<a\s[^>]*?href='(.*?)'[^>]*?>(.*?)<\/a>/i", $html, $matches[2]);

That would capture links whose href is enclosed in either single or double quotes, after which I would merge the two arrays. But there remains the issue of possible rel attributes, which could also be enclosed in either single or double quotes, and may come either before or after the href attribute.

I’m not sure what the proper approach here is - whether to try to match all the href/rel/single-or-double-quote possibilities with regex and then merge them, or to use the two regexes above and then parse the sub-patterns for the presence of rel attributes - or something completely different.

All links are built the same way

  • they start with <a
  • then there is some stuff (space, rel, onclick, whatever)
  • then there is href="some_value"
  • then there possibly is some more stuff (space, rel, onclick, whatever)
  • then there is a >
  • then there is the link caption
  • and it ends with </a>

To translate to regex:
~ - start regex
<a - match <a literally
.*? - match some stuff, lazily
href=" - match href=" literally
(.*?) - match the contents of the href attribute, and add a backreference to that
.*? - match some more stuff, lazily
> - match > literally
(.*?) - match the link caption, lazily, and add a backreference
</a> - match </a> literally
~ - end regex
i - make the regex case insensitive, to also match <A HREF="some_value">some_caption</A>

To sum up:


~<a.*?href="(.*?)".*?>(.*?)</a>~i

:)

PS. If you know what values to expect, you would be better off using character classes like [a-zA-Z0-9] or \w instead of .*
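For what it's worth, here is a sketch of how a variant of that pattern could be applied with preg_match_all. This is not the exact regex above: it assumes a backreference to the opening quote so that single- and double-quoted hrefs are handled in one pass, and it captures the whole opening tag so rel can be checked afterwards regardless of attribute order. The function name regexLinks() and the array keys are made up for illustration.

```php
<?php
// Hypothetical helper; regex-based, so it can be fooled by unusual markup
// (unquoted attributes, ">" inside attribute values, attributes like data-href).
function regexLinks(string $html): array
{
    // Group 1 = opening tag, 2 = quote character, 3 = href value, 4 = caption.
    // "i" = case insensitive, "s" = let "." match newlines inside captions.
    $pattern = '~(<a\s[^>]*?href\s*=\s*(["\'])(.*?)\2[^>]*>)(.*?)</a>~is';
    // Looks for the nofollow token inside a quoted rel attribute value.
    $relPattern = '~\brel\s*=\s*(["\'])[^"\']*\bnofollow\b[^"\']*\1~i';

    $links = [];
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $links[] = [
            'href'     => $m[3],
            'text'     => trim(strip_tags($m[4])), // drop markup nested in the caption
            'nofollow' => preg_match($relPattern, $m[1]) === 1,
        ];
    }
    return $links;
}

$html = '<a rel="nofollow" href=\'/path\'>First</a> <A HREF="http://www.example.com/">Second</A>';
print_r(regexLinks($html));
```

Checking rel against the captured opening tag keeps the main pattern short, at the cost of a second preg_match per link.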