preg_match_all problem

What could be the problem? The <th> contents are extracted with no problem, but the <td> ones don't work:

//looking for the ths
$contents_of_page = file_get_contents('http://www.gbgrafix.com/wheelofgod/bible.htm');
preg_match_all("#<th.*>(.+)</th#Ui", $contents_of_page, $thInnerHTML);
//echo $contents_of_page;
echo " th: ";
print_r($thInnerHTML[1]);
echo " th count: ".count($thInnerHTML[1]);

//looking for the tds
//preg_match_all("#<td.*>(.+)</td#Ui", $contents_of_page, $tdInnerHTML);
//echo $contents_of_page;
echo " td: ";
//print_r($tdInnerHTML[1]);
echo " td count: ".count($tdInnerHTML[1]);

For what it's worth, both patterns work fine for me on a small test sample:

$html = '<p>sadf</p> <table> <tr><th>a</th><th>aaa aaa</th></tr> <tr><td>b</td><td>bbb</td></tr> </table> <p>asdf</p>';

preg_match_all('/(?<=<th>).+(?=<\\/th>)/iU', $html, $th);
preg_match_all('/(?<=<td>).+(?=<\\/td>)/iU', $html, $td);

print_r($th);
/*
Array
(
    [0] => Array
        (
            [0] => a
            [1] => aaa aaa
        )

)
*/
print_r($td);
/*
Array
(
    [0] => Array
        (
            [0] => b
            [1] => bbb
        )

)
*/

First of all, your code seems to work fine, but I can't be sure because the content retrieved from the given URL is very long and takes too much time. You have commented out the following line:

preg_match_all("#<td.*>(.+)</td#Ui", $contents_of_page, $tdInnerHTML);

Edit:
I got the following output from your script:

 th: Array
(
    [0] => id
    [1] => book
    [2] => book_spoke
    [3] => recordType
    [4] => book_title
    [5] => chapter
    [6] => chapter_spoke
    [7] => verse
    [8] => verse_spoke
    [9] => text_data
)
 th count: 10 td:  td count: 311020

Here's a simple workaround with DOMDocument for your goal:

$doc = new DOMDocument();
// loadHTMLFile() fetches and parses the remote page in one step
$doc->loadHTMLFile('http://www.gbgrafix.com/wheelofgod/bible.htm');
$ths = $doc->getElementsByTagName('th');
echo '<pre>';
foreach ($ths as $th) {
    echo $th->nodeValue . '<br />';
}
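
The same approach should work for the <td> cells; here is a minimal sketch (untested against that particular page, and with the output capped since it apparently has over 300,000 cells):

libxml_use_internal_errors(true); // the page's HTML may not be perfectly well-formed

$doc = new DOMDocument();
$doc->loadHTMLFile('http://www.gbgrafix.com/wheelofgod/bible.htm');
$tds = $doc->getElementsByTagName('td');

// only print the first 20 cells as a sanity check
for ($i = 0; $i < 20 && $i < $tds->length; $i++) {
    echo $tds->item($i)->nodeValue . '<br />';
}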

Yes, that page is quite lengthy. I took a look at its view-source and it took a long time to load on my slow dial-up connection. So the script could be timing out, or maybe you gave up waiting before it finished.

You could try adding

error_reporting(E_ALL);
ini_set('display_errors', true);

to the top of the file to make sure you’ll see errors. And you can try using

set_time_limit(###);

where ### is the amount of time in seconds, to give the script time to complete.

Then sit back, relax, and wait.

Hopefully this is a one-time-only script. The page takes so long to load now that you wouldn't want to slow it down even more.

Yeah, I thought that might be the reason, because from that line down nothing loads, even the things not associated with this. The earlier <th> part was no problem, though. But Firefox is asking me what program to use to open that page. When I choose Firefox, it just asks again. I'll get back to you later.

I was thinking maybe I can do it chunk by chunk by paging it like this: .php?page=1 and then using $_GET['page']…
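
Something like this, maybe (just a rough sketch; the dummy $allRows stands in for whatever actually gets extracted from the page):

// rough pagination sketch; $allRows stands in for the extracted rows
$allRows = range(1, 5000); // dummy data for illustration

$rowsPerPage = 500;
$page  = isset($_GET['page']) ? max(1, (int)$_GET['page']) : 1;
$start = ($page - 1) * $rowsPerPage;

foreach (array_slice($allRows, $start, $rowsPerPage) as $row) {
    echo $row . '<br />';
}

// simple prev/next links
if ($page > 1) {
    echo '<a href="?page=' . ($page - 1) . '">prev</a> ';
}
if ($start + $rowsPerPage < count($allRows)) {
    echo '<a href="?page=' . ($page + 1) . '">next</a>';
}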

If Firefox is asking if you want to download the PHP file, it suggests that your server isn't configured to serve files with the .php extension as HTML.

You should be able to get around this by putting this at the very first line of the PHP file:

<?php
header("Content-Type: text/html; charset=utf-8");

(or whatever charset you're using). The main thing is that there can be no output whatsoever before the header() call, not even whitespace. If there is output of any kind, the default content-type header will already have been sent and you will get a "headers already sent" error.
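
If you're not sure whether something has already been output, headers_sent() can tell you exactly where (a quick diagnostic you wouldn't leave in permanently):

if (headers_sent($file, $line)) {
    die("output already started in $file on line $line");
}
header('Content-Type: text/html; charset=utf-8');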

As for breaking the page up, IMHO that's a good idea. Splitting it by topic or by "pagination" should work better than one long page.

I have this:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

Does it matter?

After adding the error reporting, I got:

Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to allocate 11 bytes) in … on line 99

The PHP header() function sends an HTTP (Hypertext Transfer Protocol) header that tells the browser what kind of file to expect. So if the browser thinks it's being handed a PHP file, it won't see the meta http-equiv as being part of an HTML file to begin with.

I forgot that if your PHP is configured with safe mode on, the time limit won't work. Similarly, the memory limit won't either. But you could try.

I ran the URL through an online analyzer (http://www.websiteoptimization.com/services/analyze/), but it said

The size of this web page (7911433 bytes) has exceeded the maximum size of 3000000 bytes.

I agree that having the entire Bible on one page might be handy for doing "search", and it may not cause a problem on your own computer if it isn't fetching the data over the internet. But IMHO an 8MB page is a bit much, and your memory error message suggests it's even larger than that. I can't see many visitors with slow connections waiting around for it to load.

So I would suggest at least breaking that page down into books.
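
And if the script itself is what's exhausting memory, reading the page line by line instead of slurping the whole thing with file_get_contents() keeps memory use flat. A rough sketch (it assumes a match never spans a line break, which you'd want to verify against the actual markup):

// stream the ~8MB page instead of loading it all into memory at once
$fh = fopen('http://www.gbgrafix.com/wheelofgod/bible.htm', 'r');
if ($fh === false) {
    die('could not open the URL');
}

$tdCount = 0;
while (($line = fgets($fh)) !== false) {
    if (preg_match_all('#<td[^>]*>(.+)</td>#Ui', $line, $m)) {
        $tdCount += count($m[1]);
        // process $m[1] here rather than accumulating everything
    }
}
fclose($fh);

echo 'td count: ' . $tdCount;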

Anyway, you can find out your PHP configuration by creating a very simple file

<?php
phpinfo();
?>

uploading it to your site and then going to it.

Or you could just add this to your existing file and try it straight away

if ( !ini_get('safe_mode') )
{
	set_time_limit(300); // 300 secs = 5 mins
	ini_set('memory_limit', '64M'); // hopefully more than enough
}
else
{
	echo '<h4>in safe mode</h4>';
}

If you keep getting time or memory errors, bump up the numbers until it works.