Getting Page Results

Hi there, I’ve created a simple screen-scraping script, but I want it to display every result on the page that sits inside a <tr class="rowalt"><td>. At the moment the script displays only one result, and I want them all.

I think I need a while loop. I have tried one, but it produces the same result over and over again in an infinite loop.

How can this be overcome?


<?php
$url = "http://www.test.com/home";
$raw = file_get_contents($url);

$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,'<tr class="rowalt"><td>');
$end = strpos($content,'</td>',$start) + 8;

$table = substr($content,$start,$end-$start);
echo $table;


?>

The problem is that every time you call strpos to determine the $start variable, it finds and returns the first occurrence.
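To make that concrete, here is a minimal sketch of a strpos loop that moves forward instead of restarting. It relies on strpos()'s optional third argument (the offset to start searching from). The sample string and the variable names $open, $close, $offset and $cells are made up for illustration; the real script would search the fetched page instead.

```php
<?php
// Stand-in for the fetched page; the real script would use file_get_contents($url).
$content = '<tr class="rowalt"><td>test 1</td></tr>'
         . '<tr class="rowalt"><td>test 2</td></tr>'
         . '<tr class="rowalt"><td>test 3</td></tr>';

$open   = '<tr class="rowalt"><td>';
$close  = '</td>';
$offset = 0;
$cells  = array();

// Pass the end of the previous match as the third argument to strpos() so
// each search resumes after the last cell instead of restarting at position 0.
while (($start = strpos($content, $open, $offset)) !== false) {
    $start += strlen($open);                 // skip past the opening tags
    $end = strpos($content, $close, $start); // find the end of this cell
    if ($end === false) {
        break;                               // unterminated cell: stop rather than loop forever
    }
    $cells[] = substr($content, $start, $end - $start);
    $offset  = $end + strlen($close);        // next search starts after this cell
}

print_r($cells);
```

The loop ends when strpos() returns false, which is why the comparison uses !== false rather than a plain truthiness check: a match at position 0 would otherwise look like "not found".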

You need to use something like preg_match_all or, even better, work through the HTML using an HTML parser. I’ve never actually used an HTML parser, so I’ll let someone else provide more information about that.

If you want to try the preg_match_all way, let me know and I’ll see if I can whip up some code.
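For anyone curious about the HTML parser route mentioned above, this is a rough sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). The sample markup, including the extra non-matching row, is invented for illustration and is not taken from the page being scraped.

```php
<?php
// Hypothetical sample markup; the real script would load the fetched page instead.
$html = '<table>'
      . '<tr class="rowalt"><td>test 1</td></tr>'
      . '<tr class="row"><td>skip me</td></tr>'
      . '<tr class="rowalt"><td>test 2</td></tr>'
      . '</table>';

// Real-world HTML is rarely valid, so collect libxml warnings quietly
// instead of letting them be printed as PHP warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <td> that is a direct child of a <tr class="rowalt">.
$xpath = new DOMXPath($doc);
$cells = array();
foreach ($xpath->query('//tr[@class="rowalt"]/td') as $td) {
    $cells[] = trim($td->textContent);
}

print_r($cells);
```

The XPath expression does the filtering here, so no manual string positions are needed. Note that @class="rowalt" only matches when the class attribute is exactly that value.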

Hey dude, yeah sure thing, if you could that would be a massive help, big time. I seem to be learning well, the help on here is incredibly useful. Cheers!

Have you thought about using the YQL Console to obtain/cache/parse your data?

For instance, if I wanted to obtain a list of all the PHP.net announcements that mention a new release, I could [build this in the console](http://developer.yahoo.com/yql/console/#h=SELECT%20*%20FROM%20html%20WHERE%20url%3D"http%3A//www.php.net"%20AND%20xpath%3D'//h1[@class%20%3D%20%22summary%20entry-title%22]/a[@class%20%3D%20%22bookmark%22%20and%20contains%28.%2C%20%22Released%22%29]%27).

This would give [this URI](http://query.yahooapis.com/v1/public/yql?q=SELECT%20*%20FROM%20html%20WHERE%20url%3D"http%3A%2F%2Fwww.php.net"%20AND%20xpath%3D'%2F%2Fh1[%40class%20%3D%20%22summary%20entry-title%22]%2Fa[%40class%20%3D%20%22bookmark%22%20and%20contains%28.%2C%20%22Released%22%29]%27) to obtain the data, which would be…


<query yahoo:count="5" yahoo:created="2010-12-18T19:56:15Z" yahoo:lang="en-US">
<results>
<a class="bookmark" href="http://www.php.net/archive/2010.php#id2010-12-16-1" id="id2010-12-16-1" name="id2010-12-16-1" rel="bookmark">PHP 5.2.16 Released!</a>
<a class="bookmark" href="http://www.php.net/archive/2010.php#id2010-12-10-1" id="id2010-12-10-1" name="id2010-12-10-1" rel="bookmark">PHP 5.3.4 Released!</a>
<a class="bookmark" href="http://www.php.net/archive/2010.php#id2010-12-09-1" id="id2010-12-09-1" name="id2010-12-09-1" rel="bookmark">PHP 5.2.15 Released!</a>
<a class="bookmark" href="http://www.php.net/archive/2010.php#id2010-07-22-2" id="id2010-07-22-2" name="id2010-07-22-2" rel="bookmark">PHP 5.3.3 Released!</a>
<a class="bookmark" href="http://www.php.net/archive/2010.php#id2010-07-22-1" id="id2010-07-22-1" name="id2010-07-22-1" rel="bookmark">PHP 5.2.14 Released!</a>
</results>
</query>
<!-- total: 657 -->
<!--
 yqlengine6.pipes.re4.yahoo.com compressed Sat Dec 18 11:56:15 PST 2010 
-->

That looks pretty cool! Bookmarked.


$content = '<tr class="rowalt"><td>test 1</td></tr><tr class="rowalt"><td>test 2</td></tr><tr class="rowalt"><td>test 3</td></tr>';
preg_match_all('~rowalt"><td>(.*?)</td~', $content, $matches);

If you print_r($matches), your desired results will be contained in $matches[1].
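Here is the same snippet as a complete script, in case it helps to see how $matches is laid out. The $count variable and the foreach loop are additions for illustration only.

```php
<?php
$content = '<tr class="rowalt"><td>test 1</td></tr>'
         . '<tr class="rowalt"><td>test 2</td></tr>'
         . '<tr class="rowalt"><td>test 3</td></tr>';

// preg_match_all() returns the number of full matches (or false on error).
$count = preg_match_all('~rowalt"><td>(.*?)</td~', $content, $matches);

// $matches[0] holds each full matched string,
// $matches[1] holds what the first (.*?) group captured.
foreach ($matches[1] as $cell) {
    echo $cell, "\n";
}
```

The ? after .* makes the group non-greedy, so each match stops at the first closing </td rather than running on to the last one in the string.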

Hi, thanks for your post. I have tried using this but it doesn’t seem to want to work. Any ideas?
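Without seeing the real page this is only a guess, but one common reason a pattern like that matches a test string and not a live page is whitespace: the markup may have line breaks or indentation between <tr class="rowalt"> and <td>, or inside the cell itself. The sketch below assumes that is the case. The sample markup is hypothetical, and the pattern is one possible variation, not the only fix.

```php
<?php
// Hypothetical markup with line breaks and indentation, unlike the one-line test string.
$content = "<tr class=\"rowalt\">\n  <td>\n    test 1\n  </td>\n</tr>\n"
         . "<tr class=\"rowalt\">\n  <td>test 2</td>\n</tr>";

// \s* tolerates any whitespace between the two tags;
// the trailing s modifier lets . match newlines inside the cell.
$pattern = '~<tr class="rowalt">\s*<td>(.*?)</td>~s';

$count = preg_match_all($pattern, $content, $matches);
if ($count === false) {
    // preg_last_error() reports why the regex engine gave up, if it did.
    echo 'Regex error code: ', preg_last_error(), "\n";
}

// Strip the leftover indentation from each captured cell.
$cells = array_map('trim', $matches[1]);
print_r($cells);
```

If this still finds nothing, echoing a short slice of the fetched content around the first "rowalt" should show what the markup really looks like, for example extra attributes on the <td> or single quotes around the class name.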