Loop through file()

Zaggs · January 29, 2010, 11:05am

Hi Guys!

I am using file() to get the contents of a URL, then looping through the array to read the file line by line. How can I get PHP to pull all the <a title=""> links off the page? Basically, I want PHP to grab whatever is in title="".

Hope someone can help.

Zaggs · January 29, 2010, 11:23am

I am guessing I need something like:


if(preg_match('<a title="/^[a-zA-Z]$/">', $line, $matches)) {
print_r($matches);
}

The above does not work, but would appreciate it if anyone could correct the code.

Thanks!

rpkamp · January 29, 2010, 11:48am

Almost correct. Use:


if (preg_match_all('/<a.*?title="([^"]*)/i', $str, $matches)) {
    print_r($matches);
}

To break it down:

/ - start regex
< - match "<" literally, 1 time
a - match "a" literally, 1 time
.*? - match any character, zero or more times; the ? makes this part lazy (see the linked section "Laziness Instead of Greediness"). This is needed because the title does not have to be directly next to the start of the tag, so you can also grab <a href="someurl" title="my title">
title=" - match title=" literally, 1 time
([^"]*) - match as many characters as possible, but not " (double quote), and capture them. This makes the regex stop when it finds a ". Take care when crawling websites that do not adhere to standards (<a title=mypicture href=someurl>)
/i - end the regex, and make it case insensitive (the "i" at the end)
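
To see what ends up in $matches, here is a minimal sketch that runs this pattern over a few lines, the way you would over the array returned by file(). The sample lines and the $titles variable are made up for illustration; in real use $lines would come from file() on your URL.

```php
<?php
// Hypothetical sample input; in practice this would be $lines = file($url);
$lines = [
    '<p><a href="one.html" title="First link">one</a></p>',
    '<p>No links on this line</p>',
    '<A TITLE="Second link" HREF="two.html">two</A>',
];

$titles = [];
foreach ($lines as $line) {
    // $matches[0] holds the full matches, $matches[1] the captured title text
    if (preg_match_all('/<a.*?title="([^"]*)/i', $line, $matches)) {
        $titles = array_merge($titles, $matches[1]);
    }
}

print_r($titles);
```

Since only the part in parentheses is of interest, the sketch keeps $matches[1] and ignores $matches[0]. Note that matching line by line will miss any tag that is split across two lines.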

Zaggs · January 29, 2010, 12:30pm

Thanks, worked a treat

salathe · January 29, 2010, 9:32pm

It would be much more appropriate to load the HTML document into a proper parser (handily, we have the DOM) and use that to grab what you need. For example:


$dom = new DOMDocument;
$dom->loadHTMLFile('./myhtmlfile.html');

foreach ($dom->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('href')) {
        echo $anchor->getAttribute('href') . PHP_EOL;
    }
}
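
Adapted to the original question, the same approach can read the title attribute instead of href. This is only a sketch: the $html string is a made-up sample, and the libxml_use_internal_errors() call is there on the assumption that real-world pages will trigger parser warnings you would rather not see.

```php
<?php
// Made-up sample markup; a real page could be loaded with loadHTMLFile() instead
$html = '<html><body>'
      . '<a href="one.html" title="First link">one</a>'
      . '<a title=mypicture href=someurl>unquoted</a>'
      . '<a href="three.html">no title</a>'
      . '</body></html>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$titles = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    // Skip anchors that carry no title attribute at all
    if ($anchor->hasAttribute('title')) {
        $titles[] = $anchor->getAttribute('title');
    }
}

print_r($titles);
```

Because the parser works on the whole document rather than on single lines, it should also cope with the unquoted title=mypicture case and with tags that wrap across lines.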

Mohandko · January 30, 2010, 12:47am

See the attached file.
