What code/regex could one use to extract links from HTML, noting whether each individual link is rel=nofollow or not, and capturing the anchor text associated with each link?
I’m confused by what regex one would use, since href and rel could occur in any order, their contents could be enclosed in either single or double quotes, and there are probably many other issues that are not occurring to me. Perhaps regex is not even the proper approach.
Also, how would one robustly identify relative links - e.g. “/path”, “#path” etc. rather than “http://www.example.com/path”?
My idea so far is to use two regexes to capture links whose href is enclosed in either single or double quotes, and then merge the two arrays. But there remains the issue of a possible rel attribute, which could also be enclosed in either single or double quotes, and may come either before or after the href attribute.
I’m not sure what the proper approach is here: whether to try to match all the href/rel/single-or-double-quote possibilities with regex and then merge them, or to use the two regexes above and then parse the sub-patterns for the presence of rel attributes, or something completely different.
first there is <a
then there is some stuff (space, rel, onclick, whatever)
then there is href="some_value"
then there possibly is some more stuff (space, rel, onclick, whatever)
then there is a >
then there is the link caption
and it ends with </a>
To translate to regex:
~ - start regex
<a - match <a literally
.*? - match some stuff, lazily
href=" - match href=" literally
(.*?) - match the contents of the href attribute, and add a backreference to that
.*? - match some more stuff, lazily
> - match > literally
(.*?) - match the link caption, lazily, and add a backreference
</a> - match </a> literally
~ - end regex
i - make the regex case insensitive, to also match <A HREF="some_value">some_caption</A>
To sum up:
~<a.*?href="(.*?)".*?>(.*?)</a>~i
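As a quick check, the pattern can be tried out roughly like this. This is only a minimal sketch in Python (the ~...~i delimiters and flag become re.IGNORECASE); the sample HTML and the name LINK_RE are made up for illustration.

```python
import re

# The pattern from above; the ~...~i wrapper becomes the re.IGNORECASE flag.
LINK_RE = re.compile(r'<a.*?href="(.*?)".*?>(.*?)</a>', re.IGNORECASE)

# Made-up sample input, not taken from the question.
html = (
    '<p><a href="/about" rel="nofollow">About us</a> and '
    '<A HREF="http://www.example.com/path">Example</A></p>'
)

# findall returns one (href, caption) tuple per matched link.
links = LINK_RE.findall(html)
for href, caption in links:
    print(href, "->", caption)
```

Note that this pattern only sees href values in double quotes, does not match across line breaks, and says nothing about rel.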
PS. If you know what values to expect, it is better to use character classes like [a-zA-Z0-9] or \w etc. instead of .*
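Following that advice, one possible tightening is sketched below, again in Python. Negated character classes such as [^>]* and [^"]* replace .*?, either quote style is accepted, and rel is read from the captured attribute text, so its position relative to href does not matter. The names extract_links, A_TAG and ATTR, the sample input, and the rule used for "relative" (no scheme and no host) are all assumptions for illustration, not part of the answer above.

```python
import re
from urllib.parse import urlparse

# <a ...attributes...>caption</a>; [^>]* cannot run past the end of the tag.
A_TAG = re.compile(r'<a\b([^>]*)>(.*?)</a>', re.IGNORECASE | re.DOTALL)
# name="value" or name='value'; a value may not contain its own quote char.
ATTR = re.compile(r'''([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')''')

def extract_links(html):
    """Return one dict per link: href, text, nofollow, relative."""
    links = []
    for attr_text, caption in A_TAG.findall(html):
        # Collect the tag's quoted attributes, in whatever order they appear.
        attrs = {}
        for name, dq, sq in ATTR.findall(attr_text):
            attrs[name.lower()] = dq or sq
        if "href" not in attrs:
            continue  # e.g. <a name="top">, an anchor without a link
        href = attrs["href"]
        parts = urlparse(href)
        links.append({
            "href": href,
            # Strip nested tags such as <b> from the caption.
            "text": re.sub(r"<[^>]+>", "", caption).strip(),
            # rel may hold several space-separated tokens.
            "nofollow": "nofollow" in attrs.get("rel", "").lower().split(),
            # Assumed rule: relative means no scheme and no host.
            "relative": not parts.scheme and not parts.netloc,
        })
    return links

# Made-up sample input covering both quote styles and both attribute orders.
sample = (
    "<a rel='nofollow' href='/path'>One</a> "
    '<A HREF="http://www.example.com/path" onclick="x()">Two <b>bold</b></A> '
    '<a href="#top" rel="noopener nofollow">Three</a>'
)
result = extract_links(sample)
for link in result:
    print(link)
```

This is still a regex: it ignores unquoted attribute values, trips over a > inside an attribute value, and can be confused by malformed or nested markup. For arbitrary pages a real HTML parser (html.parser in Python's standard library, for instance) is probably the sturdier choice.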