I am using file() to get the contents of a URL, then looping through the resulting array to read the page line by line. How can I get PHP to find all <a title=""> links on the page? Basically, I want PHP to grab whatever value is in title="".
/ - start regex
< - match “<” literally, 1 time
a - match “a” literally, 1 time
.*? - match any character, zero or more times; the ? makes this part lazy (see the section “Laziness Instead of Greediness”) - this is because the title attribute does not have to be directly next to the start of the tag. This way you can also grab <a href="someurl" title="my title">
title=" - match title=" literally, 1 time
([^"]*) - match as many characters as possible, but not " (double quote), and capture them - this makes the regex stop when it finds the closing ". Take care when crawling websites that do not adhere to standards (<a title=mypicture href=someurl>): an unquoted attribute value like this will not be matched.
/i - end the regex, and make it case insensitive (the “i” at the end)
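Put together, the pieces above would give a pattern along the lines of the sketch below. This is an assumption about how the parts fit, not a tested crawler: the sample markup in $html is made up for illustration, and preg_match_all() is used to collect every capture. Since . does not match newlines by default, the same pattern can also be applied to each line returned by file().

```php
<?php
// Pattern assembled from the parts described above (assumed combination).
$pattern = '/<a.*?title="([^"]*)/i';

// Hypothetical sample markup, standing in for a line of the fetched page.
$html = '<a href="someurl" title="my title">link</a> <A TITLE="Other">x</A>';

// Collect all matches; $matches[1] holds the text captured by ([^"]*).
preg_match_all($pattern, $html, $matches);

print_r($matches[1]);
```

Note that <a.*? is loose: it would also match other tags whose name starts with "a" (such as <abbr title="...">), which is one more reason to prefer a real parser for anything beyond a quick script.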
It would be much more appropriate to load the HTML document into a proper parser (handily, we have the DOM) and use that to grab what you need. For example:
$dom = new DOMDocument;
$dom->loadHTMLFile('./myhtmlfile.html');

// Walk every <a> element and print its href, if it has one
foreach ($dom->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('href')) {
        echo $anchor->getAttribute('href') . PHP_EOL;
    }
}
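The example above prints href values; for the title attribute the question asks about, the same loop can be adapted as in the following sketch. The markup in $html is a made-up sample (in practice it could come from file_get_contents() on the URL), and libxml_use_internal_errors() is used on the assumption that real-world pages will trigger parser warnings you would rather not print.

```php
<?php
// Hypothetical sample markup, including a non-standard unquoted title.
$html = '<html><body>'
      . '<a href="a.html" title="First">a</a>'
      . '<a href="b.html">b</a>'
      . '<a title=mypicture href=someurl>c</a>'
      . '</body></html>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

// Gather the title of every <a> element that has one.
$titles = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('title')) {
        $titles[] = $anchor->getAttribute('title');
    }
}

print_r($titles);
```

Unlike the regex, the parser also picks up the unquoted title=mypicture, and it skips anchors that have no title at all.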