Old Aug 1, 2003, 04:50   #1
prefab
SitePoint Zealot
 
Join Date: Jan 2003
Location: Belgium
Posts: 133
parsing XML, but skipping entities

I'm working on a new template engine (yeah, I know...) based on SAX parsing. While parsing, I check for certain tags; all other tags are passed through untouched, so normal HTML tags etc. are not affected.

Now I've run into a problem. Because the SAX parser already tries to map (HTML) entities, they disappear when passed through (especially &nbsp;). In other words, I'd like to skip parsing those entities so they stay unaffected.

If only I could get the data as raw as possible...

The next best option is using Harry F's HTMLSax (PEAR). Although it works either way (and HTMLSax respects my entities!), pure SAX is a lot faster, of course.

Maybe someone can help. Maybe I should avoid entities altogether and rely on the document encoding only. As for &nbsp;... I only need those in tables, but it seems tables will soon be made obsolete by CSS anyway...

- prefab
Old Aug 1, 2003, 06:13   #2
HarryF
SitePoint Wizard
Join Date: Nov 2000
Location: Switzerland
Posts: 2,898
Not sure if it's possible with the native SAX parser - have run into similar problems with entities. Know that HTMLSax isn't fast but then again, if you combine it with PEAR::Cache_Lite, as I did with Simple Template, you can limit that delay to only those occasions when either the content or the template changes.

One thing perhaps to consider is to compile your template into native PHP. The template won't change often on a live site - only the content.
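
A minimal sketch of the compile-and-cache idea, written for current PHP rather than the PHP 4 of this thread. The `{name}` placeholder syntax and the functions `compileTemplate()`, `loadCompiled()` and `render()` are made up for illustration; they are not part of Simple Template or Cache_Lite.

```php
<?php
// Hypothetical compiler: turns {name} placeholders into native PHP echo statements.
function compileTemplate(string $source): string
{
    return preg_replace(
        '/\{(\w+)\}/',
        '<?php echo htmlspecialchars($vars[\'$1\']); ?>',
        $source
    );
}

// Recompile only when the compiled file is missing or older than the template.
function loadCompiled(string $templateFile, string $cacheFile): string
{
    clearstatcache(); // make sure filemtime() sees fresh values
    if (!is_file($cacheFile) || filemtime($cacheFile) < filemtime($templateFile)) {
        file_put_contents($cacheFile, compileTemplate(file_get_contents($templateFile)));
    }
    return $cacheFile;
}

// Render by including the compiled PHP with $vars in scope.
function render(string $templateFile, string $cacheFile, array $vars): string
{
    $compiled = loadCompiled($templateFile, $cacheFile);
    ob_start();
    include $compiled;
    return ob_get_clean();
}
```

With this arrangement the (slow) parse happens only when the template file changes; every other request just includes the compiled PHP.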

Also if you check out what's happening with Simple Test - there's another SAX like HTML parser in there (you'll need to dig a little) which uses regular expressions to parse rather than the character by character approach used by HTMLSax, so should be faster.
Old Aug 1, 2003, 06:22   #3
prefab
SitePoint Zealot
 
Join Date: Jan 2003
Location: Belgium
Posts: 133
I was actually thinking of adding caching... I think it'll do the trick.
I didn't know HTMLSax parses character by character, which explains a lot.

Thanks

- prefab
Old Aug 1, 2003, 07:08   #4
lastcraft
SitePoint Victim
 
Join Date: Apr 2003
Location: London
Posts: 2,385
Hi.

Quote:
Originally Posted by HarryF
Also if you check out what's happening with Simple Test - there's another SAX like HTML parser in there (you'll need to dig a little) which uses regular expressions to parse rather than the character by character approach used by HTMLSax, so should be faster.
Thanks for the plug, Harry (again). I think I owe 95% of my web traffic to you. On my TODO list was to backport the Lexer into your HTMLSax library and do a speed test. Would you be interested if I submitted to you such a version? It would save me doing the performance comparison, and I would be interested in the results as a benchmark of how fast the PHP regexes are.

The problem with the Lexer in SimpleTest though is that it is not easy to understand how it works (it was TDD rather than "designed"). The plus side is that it is a stack machine which would allow proper filtering of JavaScript tags and CSS.

yours, Marcus.
__________________
Marcus Baker
Testing: SimpleTest, Cgreen, Fakemail
Other: Phemto dependency injector
Books: PHP in Action, 97 things
Old Aug 1, 2003, 07:39   #5
HarryF
SitePoint Wizard
Join Date: Nov 2000
Location: Switzerland
Posts: 2,898
Quote:
Would you be interested if I submitted to you such a version?
Definitely!

Quote:
95% of my web traffic
Now I just got to get that site back up
Old Aug 1, 2003, 10:30   #6
prefab
SitePoint Zealot
 
Join Date: Jan 2003
Location: Belgium
Posts: 133
Quote:
Originally Posted by lastcraft
The problem with the Lexer in SimpleTest though is that it is not easy to understand how it works (it was TDD rather than "designed"). The plus side is that it is a stack machine which would allow proper filtering of JavaScript tags and CSS.
Could you give a small parsing example? In my test so far (with a custom listener), I only got all my markup as one big 'cdata' string, as if the start and end tag handlers weren't called. Also, some of the tags set up like:

SimpleSaxParser::_addTag($lexer, "title");

in createLexer() break all processing it seems.

Clearly, I haven't got a clue how it works, yet

- prefab
Old Aug 1, 2003, 17:06   #7
lastcraft
SitePoint Victim
 
Join Date: Apr 2003
Location: London
Posts: 2,385
Hi.

I really do need to refactor that part of the code, don't I?

Quote:
Originally Posted by prefab
Could you give a small parsing example? In my test so far (with a custom listener), I only got all my markup as one big 'cdata' string, as if the start and end tag handlers weren't called.
The parser was tuned to the task in hand and the lexer patterns were selected accordingly; it's not a general HTML parser as it stands. Er... I'll have to explain...

The Lexer works by building up a bunch of regexes with brackets around them, so if it has to look for the Perl patterns "a.*?b" and "fred" it constructs this call...
PHP Code:

preg_match('/(a.*?b)|(fred)/', $html, $matches) 

It actually builds a regex for each of its possible states (modes).

When run, this will find the earliest match for the mode it is in and hide it in $matches somewhere. The Lexer digs it out and uses the result to find the point of matching. That gives two tokens to return: the non-matching text up to the match and the match itself. The ordering of the patterns can be important, as a general pattern will mask a later, more specific one; thus "aaabbb" should come before "a*b*".
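
A minimal sketch of that step in current PHP, not the SimpleTest code itself. The helper name `nextToken()` is hypothetical, and it assumes the individual patterns contain no capturing brackets of their own and no unescaped "/".

```php
<?php
// Combine the patterns into one alternation and return
// [text before the match, matched text, index of the pattern that matched],
// or null when nothing matches.
function nextToken(array $patterns, string $subject): ?array
{
    $regex = '/(' . implode(')|(', $patterns) . ')/';
    if (!preg_match($regex, $subject, $matches, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    [$matched, $offset] = $matches[0];
    // preg_match() drops trailing groups that did not take part, so the last
    // entry in $matches belongs to the alternative that actually matched.
    $index = count($matches) - 2;
    return [substr($subject, 0, $offset), $matched, $index];
}

print_r(nextToken(['a.*?b', 'fred'], 'xx fred yy ab'));

// Ordering matters: with the general pattern first, the specific one never wins.
print_r(nextToken(['<[a-z]+', '<title'], 'x<title>'));
```

The first call returns the unmatched prefix `'xx '`, the match `'fred'` and pattern index 1. In the second call `'<title'` is reported under index 0 (the general pattern), which is the masking effect described above.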

Each pattern has a mode (state really), usually just the name of a callback (a handler in the parser) or if not then a name that maps to a callback. It also has an action which is either nothing (carry on in the same mode), enter the new name mode, leave this mode after this token or a special token which calls a different handler this once only. This way the modes nest, forming a stack machine rather than a state machine.
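
To make the mode stack concrete, here is a small sketch in current PHP. `ModeLexer` and its `add()` / `tokenize()` methods are hypothetical names, not SimpleTest's API, and the special one-off handler action is left out. It assumes patterns have no capturing brackets of their own.

```php
<?php
// Each rule is [pattern, action, target]; action is 'stay', 'enter' or 'exit'.
class ModeLexer
{
    private $rules = [];
    private $stack;

    public function __construct(string $startMode)
    {
        $this->stack = [$startMode];
    }

    public function add(string $mode, string $pattern, string $action = 'stay', ?string $target = null): void
    {
        $this->rules[$mode][] = [$pattern, $action, $target];
    }

    // Returns a flat list of [mode, text] tokens.
    public function tokenize(string $input): array
    {
        $tokens = [];
        $pos = 0;
        while ($pos < strlen($input)) {
            $mode = end($this->stack);
            $rules = $this->rules[$mode] ?? [];
            // One combined regex per mode, one bracketed group per pattern.
            $regex = '/(' . implode(')|(', array_column($rules, 0)) . ')/';
            if (!$rules || !preg_match($regex, $input, $m, PREG_OFFSET_CAPTURE, $pos)) {
                $tokens[] = [$mode, substr($input, $pos)]; // nothing left to match
                break;
            }
            [$text, $at] = $m[0];
            if ($text === '') {
                throw new LogicException('pattern matched an empty string');
            }
            if ($at > $pos) {
                $tokens[] = [$mode, substr($input, $pos, $at - $pos)]; // unmatched text
            }
            [, $action, $target] = $rules[count($m) - 2]; // last group = matching rule
            if ($action === 'enter') {
                $this->stack[] = $target; // push: the match belongs to the new mode
                $tokens[] = [$target, $text];
            } else {
                $tokens[] = [$mode, $text];
                if ($action === 'exit' && count($this->stack) > 1) {
                    array_pop($this->stack); // pop back to the enclosing mode
                }
            }
            $pos = $at + strlen($text);
        }
        return $tokens;
    }
}

$lexer = new ModeLexer('text');
$lexer->add('text', '<[a-zA-Z]+', 'enter', 'tag');
$lexer->add('tag', '>', 'exit');
$lexer->add('tag', '\s+');
print_r($lexer->tokenize('Hi <em class="x">there'));
```

Because modes are pushed and popped rather than switched, leaving 'tag' returns to whatever mode was active before it, which is what lets modes nest.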

If you didn't get all of that from the code, I don't blame you at all. I had to look at the code to write the above, and I wrote it.

So why am I going into all of this? Because the lexer is set up to match only the HTML tags it needs to recognise (anchors, title, etc.), attribute starts and finishes, and irrelevant whitespace. That's why it scooped up just about every other tag as text: I wanted it to go as fast as possible. Where it matches specific tag starts, you will probably want a general tag pattern in whatever factory function creates it. Try...
PHP Code:

$lexer->addSpecialPattern("</[a-zA-Z]+>", 'text', 'acceptEndToken');

$lexer->addEntryPattern("<[a-zA-Z]+", 'text', 'tag');
...although this is off the top of my head.

The first one is the end of a tag. It occurs in 'text' mode and invokes acceptEndToken() on the parser whilst staying in text mode. The second one is the start of a tag, which is found in 'text' mode and enters 'tag' mode as soon as it is encountered. For your own parser you can choose your own mode names and handler names, of course. In fact you will have to rename the classes as well to avoid clashing with the ones in SimpleTest, if that is what you use for testing.

I'll try to send patches along these lines to HTMLSax next week and hopefully make a clearer job of it. Harry, can you mail me the current unit tests for the parser? That would save a lot of time.

yours, Marcus.
Old Aug 2, 2003, 00:36   #8
prefab
SitePoint Zealot
 
Join Date: Jan 2003
Location: Belgium
Posts: 133
Thanks for your insights. I have to admit, it still has me rather stumped. But if I gather correctly, HTMLSax will benefit from your efforts soon? I think I'll stay with HTMLSax (or even SAX) for now; if all goes well, it should work just the same.

I'm looking forward to a speedier HTMLSax
Old Aug 2, 2003, 12:47   #9
prefab
SitePoint Zealot
 
Join Date: Jan 2003
Location: Belgium
Posts: 133
I decided to take another turn on the entities problem.
I guess this is the fastest method, although it involves a global preg_replace before and after parsing.

PHP Code:

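// Hide every entity reference behind a placeholder so the parser leaves it alone.
function _preParse(&$data) {
    $data = preg_replace("/&(.*?);/", '{ent{$1}}', $data);
}

// Turn the placeholders back into entity references after parsing.
function _postParse(&$data) {
    $data = preg_replace("/\{ent\{(.*?)\}\}/", '&$1;', $data);
}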
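
A short round-trip sketch of the same placeholder trick in current PHP. The function names `protectEntities()` and `restoreEntities()` are hypothetical, and the stricter pattern `&(#?\w+);` is an assumption: it only touches things that look like real entity references, whereas `&(.*?);` would also swallow any stray "&" followed later by a ";".

```php
<?php
// Replace &name; / &#160; style references with {ent{...}} placeholders.
function protectEntities(string $data): string
{
    return preg_replace('/&(#?\w+);/', '{ent{$1}}', $data);
}

// Reverse the substitution once the parser has finished.
function restoreEntities(string $data): string
{
    return preg_replace('/\{ent\{(#?\w+)\}\}/', '&$1;', $data);
}

$html = '<td>&nbsp;</td><p>&copy; 2003 &#160;</p>';
$safe = protectEntities($html);   // what the SAX parser would be given
echo $safe, "\n";
echo restoreEntities($safe), "\n"; // identical to $html again
```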
Still looking forward to HTMLSax v.2 though...

- prefab
Old Aug 2, 2003, 19:00   #10
Selkirk
SitePoint Guru
 
Join Date: Nov 2002
Posts: 846
Quote:
Originally Posted by lastcraft
The problem with the Lexer in SimpleTest though is that it is not easy to understand how it works (it was TDD rather than "designed"). The plus side is that it is a stack machine which would allow proper filtering of JavaScript tags and CSS.
I understood how it works. I only looked at your Lexer briefly, but I got the impression from it that you were familiar with tools like Flex? (If not, I am getting warm fuzzy feelings about TDD.)

I independently wrote a similar parser for WACT. It is not as generic or nice as yours (actually it is unfortunately over-integrated with a recursive descent parser), but it uses a similar regex approach.

I will be very much interested in your results.

I suspect that regex is slow. I suspect that a lexer hand optimized to the task using standard string functions will be faster.

I guess it depends on the number of patterns to match, pattern density and the length of the string.
__________________
Professional PHP Blog - twitter - about MVC
Old Aug 3, 2003, 06:29   #11
lastcraft
SitePoint Victim
 
Join Date: Apr 2003
Location: London
Posts: 2,385
Hi...

Quote:
Originally Posted by Selkirk
I only looked at your Lexer briefly, but I got the impression from it that you were familiar with tools like Flex?
You are correct, Lex and Awk were the starting points.


Quote:
Originally Posted by Selkirk
I will be very much interested in your results.
I was hoping that Harry would run the actual comparisons. It should be pretty fascinating.

Quote:
Originally Posted by Selkirk
I suspect that regex is slow. I suspect that a lexer hand optimized to the task using standard string functions will be faster.

I guess it depends on the number of patterns to match, pattern density and the length of the string.
I simply had no way of working it out and so took a guess, keeping the number of matches low to cut down on the number of PHP calls and separating the modes out to keep the regexes small. Matching every tag on a tag-dense page will be pretty brutal on it and heavily favours the current HTMLSax. At least if it wins that battle then the switch is a no-brainer. We have parsed commented PHP code for documentation extraction with a similar Lexer and it easily crunched a megabyte a second. That was on pages with about a third of the match density of tag-dense HTML, so if it comes in at about this level then it should be fine for parsing pages from a network.

yours, Marcus.
Old Aug 3, 2003, 17:06   #12
Selkirk
SitePoint Guru
 
Join Date: Nov 2002
Posts: 846
Here is an alternate way to implement a parser in php.

I suspect that it will be relatively fast for xml.
  • It never concatenates strings
  • It uses a Null object to avoid having to check for handler method existence on each event trigger. (hint hint, Harry)
  • It uses built in PHP functions when possible to skip over (hopefully) large tracts of uninteresting characters.
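
A tiny illustration of the Null object point from the list above, in current PHP. All class names here (`NullDataHandler`, `CollectingHandler`, `TinyParser`) are hypothetical and exist only for this sketch.

```php
<?php
class NullDataHandler
{
    public function data(string $text): void
    {
        // deliberately does nothing
    }
}

class CollectingHandler
{
    public $seen = [];

    public function data(string $text): void
    {
        $this->seen[] = $text;
    }
}

class TinyParser
{
    private $handler;

    public function __construct()
    {
        // Start with a handler that is always callable, so the parse loop
        // never needs isset() or method_exists() before firing an event.
        $this->handler = new NullDataHandler();
    }

    public function setHandler($handler): void
    {
        $this->handler = $handler;
    }

    public function parse(string $text): void
    {
        foreach (explode(' ', $text) as $word) {
            $this->handler->data($word); // unconditional call
        }
    }
}

$parser = new TinyParser();
$parser->parse('nobody listens');   // fine: events go to the null handler

$collector = new CollectingHandler();
$parser->setHandler($collector);
$parser->parse('now we listen');
print_r($collector->seen);
```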

I have done absolutely no optimization on this. Optimization-wise, it would be best to focus on the scan* methods.

This thing probably doesn't have enough states to robustly handle HTML. (It's just a proof of concept.)

PHP Code:

<?php

define('STATE_STOP', 0);
define('STATE_START', 1);

define('STATE_TAG', 2);
define('STATE_OPENING_TAG', 3);
define('STATE_CLOSING_TAG', 4);
define('STATE_TAG_CLEANUP', 5);
define('STATE_ATTRIBUTE', 6);

class StartingState {
    function parse(&$context) {
        $data = $context->scanUntilChar('<');
        if ($data == '') {
            return STATE_STOP;
        } else {
            $context->ignoreCharacter();
            $context->handler_object_data->{$context->handler_method_data}($data);
            return STATE_TAG;
        }
    }
}

class TagState {
    function parse(&$context) {
        $char = $context->scanCharacter();
        if ($char == '/') {
            return STATE_CLOSING_TAG;
        } else {
            $context->unscanCharacter();
            return STATE_OPENING_TAG;
        }
    }
}

class ClosingTagState {
    function parse(&$context) {
        $tag = $context->scanUntilChar('>');
        if ($tag == '') {
            return STATE_STOP;
        } else {
            $context->handler_object_element->{$context->handler_method_closing}($tag);
            return STATE_TAG_CLEANUP;
        }
    }
}

class OpeningTagState {

    var $attributes = array();

    function attributeHandler($attributename, $attributevalue) {
        $this->attributes[$attributename] = $attributevalue;
    }

    function parse(&$context) {
        $tag = $context->scanUntilCharSet("/> \n\r\t");
        if ($tag == '') {
            return STATE_STOP;
        } else {
            $context->_parse(STATE_ATTRIBUTE);
            $context->handler_object_element->{$context->handler_method_opening}($tag, $this->attributes);
            return STATE_TAG_CLEANUP;
        }
    }
}

class TagCleanupState {
    function parse(&$context) {
        $char = $context->scanCharacter();
        if ($char == '/') {
            $char = $context->scanCharacter();
            if ($char != '>') {
                $context->unscanCharacter();
            }
        }
        return STATE_START;
    }
}

class AttributeStart {

    var $attribute_handler;

    function parse(&$context) {
        $context->scanPastWhitespace();
        $attributename = $context->scanUntilCharSet("=/> \n\r\t");
        if ($attributename == '') {
            return STATE_STOP;
        } else {
            $attributevalue = NULL;
            $context->scanPastWhitespace();
            $char = $context->scanCharacter();
            if ($char == '=') {
                $context->scanPastWhitespace();
                $char = $context->scanCharacter();
                if ($char == '"') {
                    $attributevalue = $context->scanUntilChar('"');
                    $context->ignoreCharacter();
                } else if ($char == "'") {
                    $attributevalue = $context->scanUntilChar("'");
                    $context->ignoreCharacter();
                } else {
                    $context->unscanCharacter();
                    $attributevalue = $context->scanUntilCharSet("/> \n\r\t");
                }
            }
            $this->attribute_handler->attributeHandler($attributename, $attributevalue);
            return STATE_ATTRIBUTE;
        }
    }
}

class StateParser {
    var $rawtext;
    var $position;
    var $length;

    var $State = array();

    function unscanCharacter() {
        $this->position -= 1;  // $this->position--; is broken?
    }

    function ignoreCharacter() {
        $this->position++;
    }

    function scanCharacter() {
        if ($this->position < $this->length) {
            return $this->rawtext{$this->position++};
        } else {
            return '';
        }
    }

    function scanUntilCharSet($string) {
        $startpos = $this->position;
        $pos = $startpos;
        while ($pos < $this->length && strpos($string, $this->rawtext{$pos}) === FALSE) {
            $pos++;
        }
        $this->position = $pos;
        return substr($this->rawtext, $startpos, $pos - $startpos);
    }

    function scanUntilChar($char) {
        $pos = strpos($this->rawtext, $char, $this->position);
        if ($pos === FALSE) {
            $result = substr($this->rawtext, $this->position);
            $this->position = $this->length;
        } else {
            $result = substr($this->rawtext, $this->position, $pos - $this->position);
            $this->position = $pos;
        }
        return $result;
    }

    function scanPastWhitespace() {
        while ($this->position < $this->length &&
            strpos(" \n\r\t", $this->rawtext{$this->position}) !== FALSE) {
            $this->position++;
        }
    }

    function parse($test) {
        $this->rawtext = $test;
        $this->length = strlen($test);
        $this->position = 0;
        $this->_parse();
    }

    function _parse($state = STATE_START) {
        do {
            $StateObj =& $this->State[$state];
            $state = $StateObj->parse($this);
        } while ($state != STATE_STOP && $this->position < $this->length);
    }

}

class NullHandler {
    function DoNothing($text) {
    }
}

class HtmlParser extends StateParser {
    var $handler_object_data;
    var $handler_method_data;

    var $handler_object_element;
    var $handler_method_closing;
    var $handler_method_opening;

    function HtmlParser() {
        $nullhandler =& new NullHandler();
        $this->set_data_handler($nullhandler, 'DoNothing');
        $this->set_element_handler($nullhandler, 'DoNothing', 'DoNothing');

        $this->State[STATE_START] =& new StartingState();
        $this->State[STATE_CLOSING_TAG] =& new ClosingTagState();
        $this->State[STATE_TAG] =& new TagState();
        $this->State[STATE_OPENING_TAG] =& new OpeningTagState();
        $this->State[STATE_TAG_CLEANUP] =& new TagCleanupState();
        $this->State[STATE_ATTRIBUTE] =& new AttributeStart();

        $this->State[STATE_ATTRIBUTE]->attribute_handler =& $this->State[STATE_OPENING_TAG];
    }

    function set_data_handler($data_handler_obj, $data_method) {
        $this->handler_object_data =& $data_handler_obj;
        $this->handler_method_data = $data_method;
    }

    function set_element_handler($element_handler_obj, $opening_method, $closing_method) {
        $this->handler_object_element =& $element_handler_obj;
        $this->handler_method_opening = $opening_method;
        $this->handler_method_closing = $closing_method;
    }
}

class MyHandler {
    function openHandler($name, $attrs) {
        echo ('--Open Tag Handler: '.$name.'<br />');
        echo ('--Attrs:<pre>');
        print_r($attrs);
        echo ('</pre>');
    }
    function closeHandler($name) {
        echo ('--Close Tag Handler: '.$name.'<br />');
    }
    function dataHandler($data) {
        echo ('--Data Handler: '.$data.'<br />');
    }
}

$doc=<<<EOD
This is a <em>simple</em> example <tag test='attribute' />!
EOD;

$parser =& new HtmlParser();

$handler=& new MyHandler();
$parser->set_element_handler($handler, 'openHandler','closeHandler');
$parser->set_data_handler($handler, 'dataHandler');

$parser->parse($doc);

?>
Old Aug 3, 2003, 18:24   #13
Selkirk
SitePoint Guru
 
Join Date: Nov 2002
Posts: 846
Stupid me. A couple of bug fixes:

Use this version of StartingState instead:
PHP Code:

class StartingState {
    function parse(&$context) {
        $data = $context->scanUntilChar('<');
        $context->ignoreCharacter();
        if ($data != '') {
            $context->handler_object_data->{$context->handler_method_data}($data);
        }
        return STATE_TAG;
    }
}
Add $this->attributes = array(); after the else in OpeningTagState::parse
PHP Code:

class OpeningTagState {
...
    function parse(&$context) {
...
        } else {
            $this->attributes = array();
I bet there is an infinite loop waiting to happen somewhere in there, as well.
Old Aug 3, 2003, 19:50   #14
Selkirk
SitePoint Guru
 
Join Date: Nov 2002
Posts: 846
Ok, I just tried some performance tests, parsing a 124,423 byte XML file (timed using ab)

Code:
State based parser :   660 ms mean time per request.
xml_parse (expat)  :   433 ms
XML_HTMLSax        : 9,685 ms
wow. I don't know what to make of this.
Old Aug 3, 2003, 23:12   #15
Selkirk
SitePoint Guru
 
Join Date: Nov 2002
Posts: 846
Ok, here is an updated version. I fixed some bugs and updated the interface to more closely resemble XML_HTMLSax.
Attached Files
File Type: txt htmlparser.php.txt (9.9 KB, 295 views)
Old Aug 4, 2003, 02:08   #16
prefab
SitePoint Zealot
 
Join Date: Jan 2003
Location: Belgium
Posts: 133
Quote:
Originally Posted by Selkirk
wow. I don't know what to make of this.
Well, as far as I can see, this is great! Seems it's barely slower than the real SAX parser. In my test it works great

Thanks a bunch!

- prefab
Old Aug 4, 2003, 02:33   #17
Phil.Roberts
No.
 
Join Date: May 2001
Location: Nottingham, UK
Posts: 1,142
Quote:
Originally Posted by Selkirk
Ok, I just tried some performance tests, parsing a 124,423 byte XML file (timed using ab)

Code:
State based parser :   660 ms mean time per request.
xml_parse (expat)  :   433 ms
XML_HTMLSax        : 9,685 ms
wow. I don't know what to make of this.
Not bad for an un-optimised proof of concept.
Old Aug 4, 2003, 06:04   #18
lastcraft
SitePoint Victim
 
Join Date: Apr 2003
Location: London
Posts: 2,385
Hi.

Quote:
Originally Posted by Selkirk
Ok, I just tried some performance tests, parsing a 124,423 byte XML file (timed using ab)

Code:
State based parser :   660 ms mean time per request.
xml_parse (expat)  :   433 ms
XML_HTMLSax        : 9,685 ms
wow. I don't know what to make of this.
Fantastic! I don't know what's more dramatic, that this version is so fast or that the expat version is so slow. Is it I/O bound? Anyone fancy running them through apd?

Switching the state parser to a stack-based one would let it handle properly nested languages (a plain state machine only covers regular ones and falls far short of being Turing complete) and would probably add only 25% more code, mostly in passing the stack around and setting up the handlers. As for catching the infinite loop, just add a check that the position has advanced at least one space. IMO this could be needed if the state parser is to work on its own, as getting the states right could be rather tricky if they are created by hand. I needed it while debugging the SimpleTest one!

yours, Marcus.
Old Aug 4, 2003, 07:11   #19
HarryF
SitePoint Wizard
Join Date: Nov 2000
Location: Switzerland
Posts: 2,898
Outstanding!

Gobsmacked by those performance figures.

Selkirk - you mind if I use your code for PEAR::XML_HTMLSax v2?
Old Aug 4, 2003, 07:26   #20
Dr Livingston
Non-Member
 
Join Date: Jan 2003
Posts: 5,788
Looks smart doesn't it although is there any more sample Templates and script for parsing them ?

Please
Old Aug 4, 2003, 07:49   #21
Chris82
SitePoint Wizard
 
Join Date: Mar 2002
Location: Osnabrück
Posts: 1,003
The performance results look really impressive.
I am a bit at a loss as to how to actually use the parser.
I worked with XML_Transformer (seems to be down currently), and there you could define a handler for each tag. In the example there was one open/close handler. Is it possible to define a filter for each element?

This is what I currently use:

PHP Code:

$doc = <<<EOD
<article>
    <title>This is a test</title>
    <author>Some Guy</author>
</article>
EOD;

class MyHandler {
    function MyHandler() {}

    function openHandler(&$parser, $name, $attrs) {
        switch (strtolower($name)) {
            case 'title':
                echo '<h1>';
                break;
            case 'author':
                echo '<em>';
                break;
        }
    }

    function closeHandler(&$parser, $name) {
        switch (strtolower($name)) {
            case 'title':
                echo '</h1>', "\n";
                break;
            case 'author':
                echo '</em>', "\n";
                break;
        }
    }

    function dataHandler(&$parser, $data) {
        echo $data;
    }
}

$parser =& new HtmlParser();
$handler =& new MyHandler();

$parser->set_object($handler);
$parser->set_option('trimDataNodes', true);

$parser->set_element_handler('openHandler', 'closeHandler');
$parser->set_data_handler('dataHandler');

$parser->parse($doc);
Old Aug 4, 2003, 08:55   #22
Chris82
SitePoint Wizard
 
Join Date: Mar 2002
Location: Osnabrück
Posts: 1,003
Okay, I have created a class Transformer which has registered methods for some tags. The methods have to follow the naming convention start_<tag> and stop_<tag>.


PHP Code:

class MyHandler {

    var $transformer;

    function MyHandler(&$transformer) {
        $this->transformer =& $transformer;
    }

    function openHandler(&$parser, $name, $attr) {
        $method = 'start_' . $name;
        if (method_exists($this->transformer, $method)) {
            $this->transformer->$method($attr);
        }
        else {
            // handle opening other stuff (this is currently not working)
        }
    }

    function closeHandler(&$parser, $name) {
        $method = 'stop_' . $name;
        if (method_exists($this->transformer, $method)) {
            $this->transformer->$method($name);
        }
        else {
            // handle closing other stuff (this is currently not working)
        }
    }

    function dataHandler(&$parser, $data) {
        echo $data;
    }
}

class Transformer {
    function start_textbox($attr) {
        echo '<input type="text"', $this->attributesToString($attr) . '/>', "\n";
    }

    function attributesToString($attr) {
        $string = ' ';
        foreach (array_keys($attr) as $key) {
            $string .= $key . '="' . $attr[$key] . '" ';
        }
        return preg_replace('#\s$#', '', $string);
    }

    function stop_textbox($name) {}
}
Example:

PHP Code:

$doc = <<<EOD

<textbox name="name" length="20"></textbox>
This should be displayed
EOD;

$parser      =& new HtmlParser();
$transformer =& new Transformer();
$handler     =& new MyHandler($transformer);

$parser->set_object($handler);
$parser->set_option('trimDataNodes', true);

$parser->set_element_handler('openHandler','closeHandler');
$parser->set_data_handler('dataHandler');

$parser->parse($doc);
Old Aug 4, 2003, 09:17   #23
Dr Livingston
Non-Member
 
Join Date: Jan 2003
Posts: 5,788
Umm.... Got to have a study of this. Looks interesting though also looks complicated
Old Aug 4, 2003, 12:42   #24
Selkirk
SitePoint Guru
 
Join Date: Nov 2002
Posts: 846
Quote:
Originally Posted by Phil.Roberts
Not bad for an un-optimised proof of concept.
I think I stumbled on optimal implementations of the methods in the StateParser class. I spent an hour or so trying different things and I could not speed any of them up.

Combining operators that are frequently called together ended up slower.

strpos is just very fast in PHP. Part of the problem is that none of the other string searching functions in PHP can begin their search at a specific position, like strpos can. (In PHP 5, I think the preg_ functions will take a starting position parameter).
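
One candidate that may be worth measuring against the character loops: in current PHP, strcspn() and strspn() accept a start offset, so the character-set scans can be written without a PHP-level loop. This is only a sketch with hypothetical free functions (the thread's versions are StateParser methods), it assumes $position is never beyond the end of the string, and whether it is actually faster is untested here.

```php
<?php
// Scan forward from $position until any character in $charset is found.
// Returns [text scanned, new position].
function scanUntilCharSet(string $text, int $position, string $charset): array
{
    $run = strcspn($text, $charset, $position); // length of the run NOT in $charset
    return [substr($text, $position, $run), $position + $run];
}

// Skip over spaces, newlines, carriage returns and tabs; returns the new position.
function scanPastWhitespace(string $text, int $position): int
{
    return $position + strspn($text, " \n\r\t", $position);
}

[$name, $pos] = scanUntilCharSet('name="x">', 0, "=/> \n\r\t");
echo $name, ' ', $pos, "\n";
echo scanPastWhitespace("  \t<a", 0), "\n";
```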

Alternatively, there is room in the state definitions for further optimization. I got the mean time per request down to 614 ms by eliminating the cleanup state.

It is also possible to take advantage of the essentially sequential state transitions to completely unroll the parser and implement it in a single function: a big do loop, with break statements to return to the starting state at the top of the loop and fall-through for each of the other state transitions. Based on a couple of simple tests with eliminating states and unrolling the StateParser scan functions, I think it might cut execution time by as much as half. For some people without much OO experience, the resulting code might even be easier to understand, although insane to modify. I leave this as an exercise for the reader.

Quote:
Originally Posted by lastcraft
Fantastic! I don't know what's more dramatic, that this version is so fast or that the expat version is so slow. Is it I/O bound? Anyone fancy running them through apd?
Indeed. I am no expert with xml in PHP. I might have bungled the expat implementation. Here is the code I used:
PHP Code:

$xml_parser = xml_parser_create();

xml_set_object($xml_parser, $handler);
xml_set_element_handler($xml_parser, "openHandler", "closeHandler");
xml_set_character_data_handler($xml_parser, "dataHandler");
xml_parse($xml_parser, $doc);
I used the same handler for all three versions. One difference seems to be that expat called the dataHandler method a LOT more, possibly because of line breaks. What do you make of this?
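
If the extra calls come from expat delivering one text node in several chunks, one way to level the comparison would be to buffer the chunks and forward them once per text node. The sketch below is in current PHP; `BufferingHandler` is a hypothetical class, and the calls at the bottom simulate what a parser might send rather than running expat itself.

```php
<?php
// Collects consecutive character-data chunks and emits them as one event
// when the next tag event arrives.
class BufferingHandler
{
    private $buffer = '';
    public $events = []; // recorded here so the result can be inspected

    public function openHandler($parser, $name, $attrs)
    {
        $this->flush();
        $this->events[] = ['open', $name];
    }

    public function closeHandler($parser, $name)
    {
        $this->flush();
        $this->events[] = ['close', $name];
    }

    public function dataHandler($parser, $data)
    {
        $this->buffer .= $data; // may arrive in several pieces per text node
    }

    public function flush()
    {
        if ($this->buffer !== '') {
            $this->events[] = ['data', $this->buffer];
            $this->buffer = '';
        }
    }
}

// Simulated event sequence: one text node split at the line break.
$h = new BufferingHandler();
$h->openHandler(null, 'P', []);
$h->dataHandler(null, 'line one');
$h->dataHandler(null, "\n");
$h->dataHandler(null, 'line two');
$h->closeHandler(null, 'P');
print_r($h->events);
```

Note that this reintroduces string concatenation, so it is a tool for comparing like with like rather than a speed-up.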

Quote:
Originally Posted by lastcraft
As for catching the infinite loop, just add a check that the position has advanced at least one space.
Not every state advances the current position. An infinite loop detector would have to make sure that the state did not change AND the current position was not advanced.
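
A sketch of such a check in current PHP. `runStates()` is a hypothetical stand-in for StateParser::_parse(); it assumes states are callables that return the next state id and that the context exposes an integer `position` property.

```php
<?php
// $states maps a state id to a callable taking the context and returning
// the next state id.
function runStates(array $states, $context, int $start, int $stop, int $length): void
{
    $state = $start;
    do {
        $before = $context->position;
        $next = $states[$state]($context);
        // Stalled: same state again and no input consumed.
        if ($next === $state && $context->position === $before) {
            throw new RuntimeException("Parser stalled in state $state at offset $before");
        }
        $state = $next;
    } while ($state !== $stop && $context->position < $length);
}

$context = new stdClass();
$context->position = 0;

// A state that consumes one character per step runs to the end normally.
$states = [1 => function ($c) { $c->position++; return 1; }];
runStates($states, $context, 1, 0, 5);
echo $context->position, "\n";
```

This only catches a single state repeating in place. A cycle through several states that never consumes input would slip past it, so a stricter version would need to count transitions since the position last moved.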

Quote:
Originally Posted by HarryF
Selkirk - you mind if I use your code for PEAR::XML_HTMLSax v2?
Please do.


Bug wise, there is one thing that I would watch out for.

This thing is just WAY too tolerant of badly formatted files. Because of that, some of the states just keep going merrily along after things have completely failed to make sense.

Complex states that call the unscanCharacter method near the end could advance past the end of the file earlier in the state and then back up, leaving an extra character or two to be parsed twice, possibly also in the wrong state. (Hello, Mr. Infinite Loop.)

I think where this would show up is in abruptly truncated files.

I think it would show up as a couple of garbage events at the end of processing.

One of the things that I am not happy with is the end of file handling (well, really end of string).

Right now, many of the states implicitly transition to the STOP state by going past the end of the string and triggering the test in the main loop, rather than explicitly triggering the transition. This is probably confusing.

There is an elegant solution to this that I will think of in two weeks while eating dinner.
Old Aug 4, 2003, 13:20   #25
Dr Livingston
Non-Member
 
Join Date: Jan 2003
Posts: 5,788
Badly formatted files ? Umm... IMO though this shouldn't be an issue for you the developer to account for ?

Sure a few other members would agree on this point as well; You can't be responsible for those who have little idea of how to format a document, etc.

Myself included