coxdabd
December 18, 2010, 5:03pm
1
Hi there, I’ve created a simple screen-scraping script, but I want it to display all the results from the page inside of the <tr class="rowalt"><td>. The script displays only one result, but I want them all.
I think I need a while loop, which I have tried, but it produces the same result over and over again: an infinite loop.
How can this be overcome?
<?php
$url = "http://www.test.com/home";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,'<tr class="rowalt"><td>');
$end = strpos($content,'</td>',$start) + 8;
$table = substr($content,$start,$end-$start);
echo $table;
?>
Immerse
December 18, 2010, 7:28pm
2
The problem is that every time you call strpos to determine the $start variable, it finds and returns the first occurrence.
You need to use something like preg_match_all or, even better, work through the HTML using an HTML parser. I’ve never actually used an HTML parser, so I’ll let someone else provide more information about that.
If you want to try the preg_match_all way, let me know and I’ll see if I can whip up some code.
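To illustrate the point about strpos: a while loop repeats forever when every call starts searching from the beginning of the string. A minimal sketch of one way around that, using the optional third (offset) argument of strpos so each search resumes after the previous match. The sample markup and the variable names ($open, $close, $offset, $cells) are assumptions for illustration, not part of the original script.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$content = '<tr class="rowalt"><td>test 1</td></tr>'
         . '<tr class="rowalt"><td>test 2</td></tr>'
         . '<tr class="rowalt"><td>test 3</td></tr>';

$open   = '<tr class="rowalt"><td>';
$close  = '</td>';
$offset = 0;
$cells  = array();

// Passing $offset as the third argument makes strpos() resume after the
// previous match instead of finding the first occurrence again.
while (($start = strpos($content, $open, $offset)) !== false) {
    $start += strlen($open);                 // skip past the opening tags
    $end = strpos($content, $close, $start);
    if ($end === false) {
        break;                               // unterminated cell, stop here
    }
    $cells[] = substr($content, $start, $end - $start);
    $offset  = $end + strlen($close);        // continue after this cell
}

print_r($cells);
```

The strict !== false comparison matters because strpos can legitimately return 0 when the match is at the very start of the string.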
coxdabd
December 18, 2010, 7:31pm
3
Hey dude, yeah sure thing, if you could that would be a massive help, big time. I seem to be learning well, the help on here is incredibly useful. Cheers!
Have you thought about using the YQL Console to obtain/cache/parse your data?
For instance, if I wanted to obtain a list of all the PHP.net announcements which mentioned a new release, I could [build this in the console](http://developer.yahoo.com/yql/console/#h=SELECT%20*%20FROM%20html%20WHERE%20url%3D%22http%3A//www.php.net%22%20AND%20xpath%3D%27//h1[@class%20%3D%20%22summary%20entry-title%22]/a[@class%20%3D%20%22bookmark%22%20and%20contains%28.%2C%20%22Released%22%29]%27).
This would give [this URI](http://query.yahooapis.com/v1/public/yql?q=SELECT%20*%20FROM%20html%20WHERE%20url%3D%22http%3A%2F%2Fwww.php.net%22%20AND%20xpath%3D%27%2F%2Fh1[%40class%20%3D%20%22summary%20entry-title%22]%2Fa[%40class%20%3D%20%22bookmark%22%20and%20contains%28.%2C%20%22Released%22%29]%27) to obtain the data, which would be…
<query yahoo:count="5" yahoo:created="2010-12-18T19:56:15Z" yahoo:lang="en-US">
<results>
<a class="bookmark" href="http://www.php.net/archive/2010.php#id2010-12-16-1" id="id2010-12-16-1" name="id2010-12-16-1" rel="bookmark">PHP 5.2.16 Released!</a>
<a class="bookmark" href="http://www.php.net/archive/2010.php#id2010-12-10-1" id="id2010-12-10-1" name="id2010-12-10-1" rel="bookmark">PHP 5.3.4 Released!</a>
<a class="bookmark" href="http://www.php.net/archive/2010.php#id2010-12-09-1" id="id2010-12-09-1" name="id2010-12-09-1" rel="bookmark">PHP 5.2.15 Released!</a>
<a class="bookmark" href="http://www.php.net/archive/2010.php#id2010-07-22-2" id="id2010-07-22-2" name="id2010-07-22-2" rel="bookmark">PHP 5.3.3 Released!</a>
<a class="bookmark" href="http://www.php.net/archive/2010.php#id2010-07-22-1" id="id2010-07-22-1" name="id2010-07-22-1" rel="bookmark">PHP 5.2.14 Released!</a>
</results>
</query>
<!-- total: 657 -->
<!--
yqlengine6.pipes.re4.yahoo.com compressed Sat Dec 18 11:56:15 PST 2010
-->
Immerse
December 18, 2010, 8:30pm
5
That looks pretty cool! *bookmarks*
$content = '<tr class="rowalt"><td>test 1</td></tr><tr class="rowalt"><td>test 2</td></tr><tr class="rowalt"><td>test 3</td></tr>';
preg_match_all('~rowalt"><td>(.*?)</td~', $content, $matches);
If you print_r($matches), your desired results will be contained in $matches[1].
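For comparison, the HTML parser route mentioned earlier in the thread could look roughly like this. It is only a sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath); the sample markup, including the wrapping <table>, and the variable names are assumptions for illustration.

```php
<?php
// Hypothetical sample markup; a real page would come from file_get_contents().
$content = '<table>'
         . '<tr class="rowalt"><td>test 1</td></tr>'
         . '<tr class="rowalt"><td>test 2</td></tr>'
         . '<tr class="rowalt"><td>test 3</td></tr>'
         . '</table>';

// Collect libxml warnings internally instead of printing them,
// since real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

// Select every <td> that is a direct child of a <tr class="rowalt">.
$xpath = new DOMXPath($doc);
$texts = array();
foreach ($xpath->query('//tr[@class="rowalt"]/td') as $cell) {
    $texts[] = trim($cell->textContent);
}

print_r($texts);
```

Unlike the regular expression, this does not depend on the exact spacing or attribute order in the markup, only on the element structure.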
coxdabd
December 20, 2010, 12:08am
6
Hi, thanks for your post. I have tried using this but it doesn’t seem to want to work. Any ideas?
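The thread does not say what went wrong, so the following is only a guess. The pattern above expects rowalt"><td> with nothing in between, and . does not match newlines by default, so a real page with line breaks or indentation between the tags, or inside the cell, would produce no matches. A more tolerant sketch, assuming that is the cause (the sample markup and variable names here are hypothetical):

```php
<?php
// Hypothetical markup with the line breaks and indentation a real page often has.
$content = "<tr class=\"rowalt\">\n  <td>test 1</td>\n</tr>\n"
         . "<tr class=\"rowalt\">\n  <td>\n    test 2\n  </td>\n</tr>";

// \s* tolerates whitespace between the tags; the s modifier lets . span newlines.
$count = preg_match_all('~<tr class="rowalt">\s*<td>(.*?)</td>~s', $content, $matches);

// $matches[1] holds the captured cell contents; trim the surrounding whitespace.
$cells = array_map('trim', $matches[1]);

print_r($cells);
```

If $count comes back as 0 on the real page, echoing the fetched HTML through htmlspecialchars() is a quick way to check what the markup around rowalt actually looks like.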