I have a series of web pages with navigation links at the bottom to the other pages. All pages but the last one have a link to the next page. I want to capture that link, or get a no-match on the last page, which doesn’t have that link. I’ve looked at lookaheads, but I can’t get them to work. Here’s the text I’m working with.
I need a regular expression I can use in preg_match that says, “Look for the text Next » and then look for the first href= before that and capture the text between the href’s double quotes.” Thanks for any advice!
EDIT: Funny how you can get focused on a challenge and lose track of an easier solution. I can easily extract the whole <div class=pagination.*?<\/div> part, and then use PHP’s functions to capture the last href. Still, it would be nice to figure out how to do it all in regex.
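A minimal sketch of that two-step approach, assuming the pagination block contains no nested div and that the href uses double quotes. The function name hrefBeforeNext and the sample markup are made up for illustration, not taken from the actual page:

```php
<?php
/**
 * Return the href of the link that contains "Next", or null when the
 * pagination block has no such link (the last page).
 * hrefBeforeNext is an illustrative name, not an existing function.
 */
function hrefBeforeNext(string $html): ?string
{
    // Step 1: isolate the pagination block (assumes no nested <div>).
    if (preg_match('~<div class="?pagination.*?</div>~s', $html, $m) !== 1) {
        return null;
    }
    $block = $m[0];

    // Step 2: locate the "Next" text; the last page has none.
    $nextPos = strpos($block, 'Next');
    if ($nextPos === false) {
        return null;
    }

    // Step 3: find the last href=" that occurs before "Next".
    $marker = 'href="';
    $start = strrpos(substr($block, 0, $nextPos), $marker);
    if ($start === false) {
        return null;
    }
    $start += strlen($marker);

    // Step 4: read up to the closing double quote.
    $end = strpos($block, '"', $start);
    return $end === false ? null : substr($block, $start, $end - $start);
}

// Hypothetical sample markup.
$sample = '<div class="pagination">'
        . '<a href="/page/1"><span>1</span></a> '
        . '<a href="/page/3"><span>Next&nbsp;»</span></a>'
        . '</div>';
$result = hrefBeforeNext($sample);
```

Searching backwards from the position of "Next" avoids the first-href problem, because strrpos only looks at the text before that position.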
Fantastic! Thank you! And yes, you’re right about using DOM. Using regex isn’t very robust.
Edit: Darn, that expression didn’t work either. As happened with my efforts, it matches the first href, not the last. But no matter. I solved the problem with PHP’s string functions for now.
Indeed, the problem was that there are span tags inside the links, and the .*? that I meant to consume the spans actually consumed all the text up to the last link. This should work with the (optional) span tags:
I also added |\s into the subpatterns so that white space around the tags doesn’t break it.
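As an illustration of the idea (optional span tags plus whitespace alternatives, and nothing that can run across other links), here is a sketch of one possible pattern. It is not necessarily the exact expression used above. It assumes double-quoted href values and that only whitespace and span tags surround the link text; findNextHref is a hypothetical helper name.

```php
<?php
/**
 * Capture the href of the <a> whose text is "Next »", or return null.
 * findNextHref is an illustrative name, not an existing function.
 */
function findNextHref(string $html): ?string
{
    $pattern = '~<a\s[^>]*?href="([^"]*)"[^>]*>' // opening tag; [^>] keeps the href inside this tag
             . '(?:\s|<span[^>]*>)*'              // optional whitespace and opening span tags
             . 'Next(?:&nbsp;|\s|\x{00A0})'       // "Next" followed by an entity, a space, or a raw NBSP
             . '\x{00BB}'                         // the » character
             . '(?:\s|</span>)*</a>~u';           // optional whitespace and closing span tags

    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// Hypothetical sample markup.
$middlePage = '<a href="/page/1"><span>1</span></a> '
            . '<a href="/page/3"><span>Next&nbsp;»</span></a>';
$lastPage   = '<a href="/page/2"><span>«&nbsp;Prev</span></a>';

$found   = findNextHref($middlePage);
$missing = findNextHref($lastPage);
```

Because nothing between the opening tag and "Next" can match arbitrary text, the pattern cannot start at the first link and stretch to the last one.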
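As you can see, it can get quite complicated if you want to take into account all the possibilities that a normal HTML parser would allow. The DOM-based version below might be more robust. It is certainly longer, but I think more readable:
// $html holds the page source fetched earlier.
$doc = new DOMDocument;
// The XML encoding prefix makes loadHTML() treat the input as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
$link = null; // stays null on the last page, which has no "Next" link
foreach ($doc->getElementsByTagName('a') as $linkElem) {
    // textContent ignores nested tags such as <span>.
    if (trim($linkElem->textContent) === "Next\xC2\xA0»") {
        $link = $linkElem->getAttribute('href');
        break;
    }
}
echo $link;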
As you can see, the &nbsp; entity is converted to the Unicode non-breaking space (U+00A0, bytes C2 A0 in UTF-8), so we need to look for that when comparing link texts.
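A small sketch to check that conversion outside the DOM, using html_entity_decode with UTF-8 output. The normalizeSpaces helper is a hypothetical convenience, shown as one way to compare against a string with an ordinary space, and the script is assumed to be saved as UTF-8:

```php
<?php
// Decode the entity the way an HTML parser would, producing UTF-8.
$decoded = html_entity_decode('Next&nbsp;»', ENT_QUOTES | ENT_HTML5, 'UTF-8');

// The two bytes after "Next" are the UTF-8 encoding of U+00A0.
$nbspHex = bin2hex(substr($decoded, 4, 2));

/**
 * Replace UTF-8 non-breaking spaces with ordinary spaces.
 * normalizeSpaces is an illustrative name, not an existing function.
 */
function normalizeSpaces(string $text): string
{
    return str_replace("\xC2\xA0", ' ', $text);
}

// After normalizing, the link text can be compared with a plain space.
$isNext = normalizeSpaces(trim($decoded)) === 'Next »';
```

Normalizing first keeps the comparison string readable, at the cost of also matching links that use a regular space.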