Processing multiple html pages via a script?

Hello,

I'm wondering if anyone can point me towards some tutorials that would help with this. I want to create a folder on my server into which I drop HTML pages that are supplied to me.

In this folder I would run a script that loops through the HTML files, extracts data from each one, and saves it to a MySQL database, one record per HTML file.

For example, here is the standard layout of one of the pages:


<p class="smallfont">Headline Here</p>
<p class="smallfont"><img class="smallfont" src="file1.jpg" alt="image" align="right"/>Here is the opening Paragraph Blurb</p>
<p class="smallfont"><img class="smallfont" src="file2.jpg" alt="image" align="right"/>Here is the second Paragraph Blurb</p>
<p class="smallfont"><img class="smallfont" src="file3.jpg" alt="image" align="right"/>Here is the third Paragraph Blurb</p>

Then my DB layout would look something like:


id //Auto inc
headline //Opening <p> tag
text_supplied //The <p> tag stuff
image1 //ie file1.jpg
image2 //ie file2.jpg
image3 //ie file3.jpg

Any advice on tutorials or links that could help?

Also, someone mentioned it may be better to create an XML file out of this first before parsing it into MySQL. Would that be best?

Thanks

Chris

AFAIK, if you are the one who creates those HTML pages, then it is better to produce them as XML instead, since XML is easier to parse with tools built into PHP itself, such as SimpleXML. But you can still use [URL="http://www.php.net/manual/en/book.dom.php"]DOMDocument[/URL] to parse the HTML as well.

XML would be a better approach IMHO, and a lot easier to handle should your structure change.
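To make the DOMDocument route concrete, here is a minimal sketch run against the sample layout from the first post. The function name extractPage() and the array keys it returns are made up for illustration, and it assumes that every content paragraph carries class="smallfont" and that the first such paragraph is the headline.

```php
<?php
// Sketch only: extractPage() and its array keys are hypothetical names.
function extractPage(string $html): array
{
    $doc = new DOMDocument();

    // Supplied HTML is often not valid, so collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $page = ['headline' => null, 'text' => [], 'images' => []];

    foreach ($doc->getElementsByTagName('p') as $p) {
        // Assumption: content paragraphs are the ones with class="smallfont".
        if ($p->getAttribute('class') !== 'smallfont') {
            continue;
        }

        // Collect the src of every <img> inside this paragraph.
        foreach ($p->getElementsByTagName('img') as $img) {
            $page['images'][] = $img->getAttribute('src');
        }

        $text = trim($p->textContent);
        if ($page['headline'] === null) {
            $page['headline'] = $text;   // assumption: first paragraph is the headline
        } else {
            $page['text'][] = $text;
        }
    }

    return $page;
}
```

Calling extractPage(file_get_contents('page.html')) on one of the supplied files (the file name is just an example) should give you the headline, a list of paragraph texts, and a list of image file names, ready to be written to the database.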


<documents>
    <document>
        <title>Foo</title>
        <content>Foo html content</content>
        <images>
            <image src="/some/http/file/path/img.jpg" alt="Foo" />
            <image src="/some/http/file/path/img.jpg" alt="Foo" />
        </images>
    </document>
    <document>
        <title>Bar</title>
        <content>Bar html content</content>
        <images>
            <image src="/some/http/file/path/img.jpg" alt="Bar" />
            <image src="/some/http/file/path/img.jpg" alt="Bar" />
        </images>
    </document>
</documents>

Then a rough PHP script to save them would be…

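// Load the XML from a file; the third argument (true) marks the first one as a path.
$documents = new SimpleXMLElement('/path/to/xml.file', 0, true);

foreach($documents->document as $document){

    // Note: the mysql_* functions need an open connection and were removed in PHP 7;
    // mysqli or PDO are the current replacements.
    mysql_query(
        sprintf(
            "INSERT INTO article (title, content) VALUES ('%s', '%s')",
            mysql_real_escape_string((string) $document->title),
            mysql_real_escape_string((string) $document->content)
        )
    );

    // Id of the article row just inserted, used to link its images.
    $id = mysql_insert_id();

    // The <image> elements are children of <images>, so look one level deeper.
    if(isset($document->images->image)){
        $sql = "INSERT INTO article_image (article_id, src, alt) VALUES ";
        foreach($document->images->image as $image){
            $sql .= sprintf(
                "(%d, '%s', '%s'),",
                $id,
                mysql_real_escape_string((string) $image['src']),
                mysql_real_escape_string((string) $image['alt'])
            );
        }
        // Strip the trailing comma and insert all image rows in one query.
        mysql_query(rtrim($sql, ','));
    }
}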

Thanks guys, that's great.

Any suggestions for tutorials or links that would help me read the contents of the HTML pages in the folder, in preparation for creating an XML doc before going into the DB? (PS: they are not my HTML files, they are supplied to me.)

Check out the get_charset function on the following page to see how you can navigate through HTML tags using a DOMDocument object:
http://www.forkaya.com/scripts/url-fetch.php?source=1

If the HTML files are not under your control, then leave them as HTML and use DOMDocument to read and parse them with PHP. When you say ‘in the folder in prep to create an XML doc before going into the DB’, I think you have misunderstood the suggestion. It is not worth creating an XML doc after reading the HTML; insert the content you have read directly into the DB. But if you are the one who creates those HTML files, then it is better to create them as XML, which will be easier to read later.
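As a rough sketch of that "read the folder, insert directly" flow: the function below globs a directory for .html files, hands each file's contents to an extractor, and writes one row per file with a PDO prepared statement. Everything named here is an assumption for illustration: importFolder(), the table name article, the column names (taken from the layout in the first post), and the $extract callable, which is expected to return an array with 'headline', 'text' (list of paragraphs) and 'images' (list of file names) keys.

```php
<?php
// Sketch only: importFolder(), the "article" table and its columns are assumptions.
function importFolder(PDO $pdo, string $dir, callable $extract): int
{
    // A prepared statement handles escaping of the supplied content.
    $insert = $pdo->prepare(
        'INSERT INTO article (headline, text_supplied, image1, image2, image3)
         VALUES (?, ?, ?, ?, ?)'
    );

    $count = 0;
    // glob() can return false on error, so fall back to an empty list.
    foreach (glob(rtrim($dir, '/') . '/*.html') ?: [] as $file) {
        $html = file_get_contents($file);
        if ($html === false || $html === '') {
            continue; // unreadable or empty file: skip it
        }

        $page = $extract($html);

        // Keep at most three images and pad missing ones with NULL.
        $images = array_pad(array_slice($page['images'], 0, 3), 3, null);

        $insert->execute(array_merge(
            [$page['headline'], implode("\n\n", $page['text'])],
            $images
        ));
        $count++;
    }

    return $count; // number of files imported
}
```

You would call it with something like importFolder(new PDO($dsn, $user, $password), '/path/to/folder', $yourExtractor), where the DSN, credentials and path are placeholders for your own setup. Passing the extractor in as a callable keeps the folder loop separate from the page-specific parsing, so only the extractor has to change if the supplied layout changes.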