  1. #1
    SitePoint Zealot prefab's Avatar
    Join Date
    Jan 2003
    Location
    Belgium
    Posts
    133

    parsing XML, but skipping entities

    I'm working on a new template engine (yah, I know...) based on SAX parsing. While parsing, I check for certain tags; all other tags are passed through untouched, so normal HTML tags etc. are not affected.

    Now I've run into a problem. Because SAX already tries to map (HTML) entities (especially &nbsp;), they disappear when passed through. In other words, I'd like to skip parsing those entities so they stay unaffected.

    If I only could get the data as raw as possible...

    The next best option is using Harry F's HTMLSax (PEAR). Although it works either way (and HTMLSax respects my entities!), pure SAX is a lot faster, of course.

    Maybe someone can help. Maybe I should avoid entities altogether and rely on the document encoding only. As for &nbsp;... I only need those in tables, but it seems tables will soon be made obsolete by CSS anyway...

    - prefab

  2. #2
    SitePoint Wizard gold trophysilver trophy
    Join Date
    Nov 2000
    Location
    Switzerland
    Posts
    2,479
    Not sure if it's possible with the native SAX parser - have run into similar problems with entities. Know that HTMLSax isn't fast but then again, if you combine it with PEAR::Cache_Lite, as I did with Simple Template, you can limit that delay to only those occasions when either the content or the template changes.

    One thing perhaps to consider is to compile your template into native PHP. The template won't change often on a live site - only the content.
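
    A minimal sketch of that idea, assuming a hypothetical $compile callback that turns template source into plain PHP (the function name, cache layout and callback are illustrative, not part of Simple Template or Cache_Lite). The slow parse only runs when the template file is newer than its compiled copy:

```php
<?php
// Hypothetical helper: renderCached() and the cache file naming are assumptions.
function renderCached(string $templateFile, string $cacheDir, callable $compile): string
{
    $cacheFile = $cacheDir . '/' . md5($templateFile) . '.php';

    // Recompile only when the cache is missing or older than the template.
    if (!is_file($cacheFile) || filemtime($cacheFile) < filemtime($templateFile)) {
        file_put_contents($cacheFile, $compile(file_get_contents($templateFile)));
    }

    // The compiled template is ordinary PHP, so rendering is just an include.
    ob_start();
    include $cacheFile;
    return ob_get_clean();
}
```

    The parser's speed then only matters on the first request after a template edit.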

    Also if you check out what's happening with Simple Test - there's another SAX like HTML parser in there (you'll need to dig a little) which uses regular expressions to parse rather than the character by character approach used by HTMLSax, so should be faster.

  3. #3
    SitePoint Zealot prefab's Avatar
    Join Date
    Jan 2003
    Location
    Belgium
    Posts
    133
    I was actually thinking of adding caching... I think it'll do the trick.
    I didn't know HTMLSax parses character by character, which explains a lot.

    Thanks

    - prefab

  4. #4
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Hi.

    Quote Originally Posted by HarryF
    Also if you check out what's happening with Simple Test - there's another SAX like HTML parser in there (you'll need to dig a little) which uses regular expressions to parse rather than the character by character approach used by HTMLSax, so should be faster.
    Thanks for the plug Harry (again). I think I owe 95% of my web traffic to you. On my TODO list was to backport the Lexer into your HtmlSax library and do a speed test. Would you be interested if I submitted to you such a version? It would save me doing the performance comparison and I would be interested in the results as a benchmark of how fast the PHP regexes are.

    The problem with the Lexer in SimpleTest though is that it is not easy to understand how it works (it was TDD rather than "designed"). The plus side is that it is a stack machine which would allow proper filtering of JavaScript tags and CSS.

    yours, Marcus.
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things

  5. #5
    SitePoint Wizard gold trophysilver trophy
    Join Date
    Nov 2000
    Location
    Switzerland
    Posts
    2,479
    Would you be interested if I submitted to you such a version?
    Definitely!

    95% of my web traffic
    Now I just got to get that site back up

  6. #6
    SitePoint Zealot prefab's Avatar
    Join Date
    Jan 2003
    Location
    Belgium
    Posts
    133
    Quote Originally Posted by lastcraft
    The problem with the Lexer in SimpleTest though is that it is not easy to understand how it works (it was TDD rather than "designed"). The plus side is that it is a stack machine which would allow proper filtering of JavaScript tags and CSS.
    Could you give a small parsing example? In my test so far (with a custom listener), I only got all my markup as one big 'cdata' string, as if the start and end tag handlers weren't called. Also, some of the tags set up like:

    SimpleSaxParser::_addTag($lexer, "title");

    in createLexer() break all processing it seems.

    Clearly, I haven't got a clue how it works, yet

    - prefab

  7. #7
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Hi.

    I really do need to refactor that part of the code, don't I?

    Quote Originally Posted by prefab
    Could you give a small parsing example? In my test so far (with a custom listener), I only got all my markup as one big 'cdata' string, as if start and end tag handlers weren't called.
    The parser was tuned to the task in hand and the lexer patterns were selected accordingly; it's not a general HTML parser as is. Er...I'll have to explain...

    The Lexer works by building up a bunch of regexes with brackets around them, so if it has to look for the Perl patterns "a.*?b" and "fred" it constructs this call...
    PHP Code:
    preg_match('/(a.*?b)|(fred)/', $html, $matches);
    It actually builds a regex for each of its possible states (modes).

    When run, this will find the earliest match for the mode it is in and hide it in $matches somewhere. The Lexer digs it out and uses the result to find the point of matching. That gives two tokens to return, the non-matching one up to the match and the match itself. The ordering of the patterns can be important, because a general pattern masks a later, more specific one; thus "aaabbb" should come before "a*b*".
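
    The step above can be sketched roughly as follows. This is not SimpleTest's code: nextToken() is a made-up name, and the sketch assumes the patterns contain no capturing groups or unescaped "/" of their own.

```php
<?php
// Sketch only: nextToken() is a hypothetical helper, not part of SimpleTest.
// Returns [text before the match, matched text, index of the pattern that fired],
// or null when none of the patterns match.
function nextToken(array $patterns, string $subject): ?array
{
    // Wrap every pattern in brackets and join them into one alternation.
    $regex = '/(' . implode(')|(', $patterns) . ')/';
    if (!preg_match($regex, $subject, $matches, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    [$matched, $offset] = $matches[0];

    // preg_match() drops trailing groups that did not take part, so the last
    // entry in $matches belongs to the alternative that actually matched.
    $index = count($matches) - 2;

    return [substr($subject, 0, $offset), $matched, $index];
}

// Two tokens come back: the unmatched text and the match itself.
var_dump(nextToken(['a.*?b', 'fred'], 'xx fred yy aXb'));
```

    Calling nextToken(['a*b*', 'aaabbb'], 'aaabbb') reports pattern 0, which is the masking effect described above: the general pattern wins because it comes first.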

    Each pattern has a mode (state really), usually just the name of a callback (a handler in the parser) or, if not, a name that maps to a callback. It also has an action, which is one of: nothing (carry on in the same mode), enter the newly named mode, leave this mode after this token, or a special token which calls a different handler this once only. This way the modes nest, forming a stack machine rather than a state machine.
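
    To make the nesting concrete, here is a small sketch of a mode stack. The class and method names are assumptions for illustration and do not come from the SimpleTest source.

```php
<?php
// Hypothetical sketch of nested lexer modes; names are illustrative only.
class ModeStack
{
    private array $stack;

    public function __construct(string $start)
    {
        $this->stack = [$start];
    }

    // The mode whose patterns (and handler) are currently active.
    public function current(): string
    {
        return $this->stack[count($this->stack) - 1];
    }

    // An entry pattern matched: nest a new mode inside the current one.
    public function enter(string $mode): void
    {
        $this->stack[] = $mode;
    }

    // An exit pattern matched: fall back to the enclosing mode.
    // Returns false if that would empty the stack (unbalanced input).
    public function leave(): bool
    {
        if (count($this->stack) <= 1) {
            return false;
        }
        array_pop($this->stack);
        return true;
    }
}
```

    A plain state machine would have to remember "which mode was I in before?" in the state name itself; the stack keeps that history for free.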

    If you didn't get all of that from the code, I don't blame you at all. I had to look at the code to write the above and I wrote it.

    So why am I going into all of this? Because the lexer is set up to only match the HTML tags it needs to recognise: anchors, title, etc., attribute starts and finishes, and irrelevant whitespace. That's why it scooped up just about every other tag as text; I wanted it to go as fast as possible. Where it matches specific tag starts, you will probably want a general tag pattern in whatever factory function creates it. Try...
    PHP Code:
    $lexer->addSpecialPattern("</[a-zA-Z]+>", 'text', 'acceptEndToken');
    $lexer->addEntryPattern("<[a-zA-Z]+", 'text', 'tag');
    ...although this is off the top of my head.

    The first one is the end of tag. It occurs in 'text' mode and invokes acceptEndToken() on the parser whilst staying in text mode. The second one is the start of the tag, which is found in 'text' mode and enters 'tag' mode as soon as it is encountered. For your own parser you can choose your own mode names and handler names of course. In fact you will have to rename the classes as well to avoid clashing with the ones in SimpleTest if that is what you use for testing.

    I'll try to send a patch along these lines to HTMLSax next week and hopefully make a clearer job of it. Harry, can you mail me the current unit tests for the parser, as that would save a lot of time?

    yours, Marcus.

  8. #8
    SitePoint Zealot prefab's Avatar
    Join Date
    Jan 2003
    Location
    Belgium
    Posts
    133
    Thanx for your insights. I'll have to admit, it still has me rather stumped. But if I gather correctly, HTMLSax will benefit from your efforts soon? I think I'll stay with HTMLSax (or even SAX) for now, if everything is well, it should work just the same.

    I'm looking forward to a speedier HTMLSax

  9. #9
    SitePoint Zealot prefab's Avatar
    Join Date
    Jan 2003
    Location
    Belgium
    Posts
    133
    I decided to take another turn on the entities problem.
    I guess this is the fastest method, although it involves a global preg_replace before and after parsing.

    PHP Code:
    function _preParse(&$data) {
        $data = preg_replace("/&(.*?);/", '{ent{$1}}', $data);
    }

    function _postParse(&$data) {
        $data = preg_replace("/\{ent\{(.*?)\}\}/", '&$1;', $data);
    }
    Still looking forward to HTMLSax v.2 though...
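
    For illustration, a round trip with the same placeholder trick. The function names and the stricter pattern are assumptions: it only accepts things shaped like &name; or &#160;, so a bare ampersand followed later by a semicolon ("Tom & Jerry;") is not mistaken for an entity.

```php
<?php
// Variant of the placeholder trick above; the stricter pattern is an assumption.
function protectEntities(string $data): string
{
    // Hide &name; and &#123; style entities from the parser.
    return preg_replace('/&(#?[a-zA-Z0-9]+);/', '{ent{$1}}', $data);
}

function restoreEntities(string $data): string
{
    // Turn the placeholders back into entities after parsing.
    return preg_replace('/\{ent\{(#?[a-zA-Z0-9]+)\}\}/', '&$1;', $data);
}

$html = '<td>&nbsp;</td> Tom & Jerry; &#160;';
$safe = protectEntities($html);
echo $safe, "\n";
echo restoreEntities($safe), "\n";
```

    The placeholder still has to be something that never occurs in the templates themselves.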

    - prefab

  10. #10
    SitePoint Guru
    Join Date
    Nov 2002
    Posts
    841
    Quote Originally Posted by lastcraft
    The problem with the Lexer in SimpleTest though is that it is not easy to understand how it works (it was TDD rather than "designed"). The plus side is that it is a stack machine which would allow proper filtering of JavaScript tags and CSS.
    I understood how it works. I only looked at your Lexer briefly, but I got the impression from it that you were familiar with tools like Flex? (if not, I am getting warm fuzzy feelings about TDD).

    I independently wrote a similar parser for WACT. It is not as generic or nice as yours (actually it is unfortunately over-integrated with a recursive descent parser), but it uses a similar regex approach.

    I will be very interested in your results.

    I suspect that regex is slow. I suspect that a lexer hand optimized to the task using standard string functions will be faster.

    I guess it depends on the number of patterns to match, pattern density and the length of the string.

  11. #11
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Hi...

    Quote Originally Posted by Selkirk
    I only looked at your Lexer briefly, but I got the impression from it that you were familiar with tools like Flex?
    You are correct, Lex and Awk were the starting points.


    Quote Originally Posted by Selkirk
    I will be very interested in your results.
    I was hoping that Harry would run the actual comparisons. It should be pretty fascinating.

    Quote Originally Posted by Selkirk
    I suspect that regex is slow. I suspect that a lexer hand optimized to the task using standard string functions will be faster.

    I guess it depends on the number of patterns to match, pattern density and the length of the string.
    I simply had no way of working it out and so took a guess, keeping the number of matches low to cut down on the number of PHP calls and separating the modes out to keep the regexes small. A tag-dense page, with every tag being matched, will be pretty brutal on it and heavily favours the current HTMLSax. At least if it wins that battle then the switch is a no-brainer. We have parsed commented PHP code for documentation extraction with a similar Lexer and it easily crunched a megabyte a second. This was on pages of about a third of the matching density of tag-dense HTML, so if it comes in at about this level then it should be fine for parsing pages from a network.

    yours, Marcus.

  12. #12
    SitePoint Guru
    Join Date
    Nov 2002
    Posts
    841
    Here is an alternate way to implement a parser in php.

    I suspect that it will be relatively fast for xml.

    • It never concatenates strings
    • It uses a Null object to avoid having to check for handler method existence on each event trigger. (hint hint, Harry)
    • It uses built in PHP functions when possible to skip over (hopefully) large tracts of uninteresting characters.
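
    The Null object point can be shown in isolation. The classes below are illustrative stand-ins, not the ones from the parser in this post: the emitter always holds some listener, so it never has to test whether a handler exists before calling it.

```php
<?php
// Illustrative names only; a sketch of the Null object idea.
class NullListener
{
    public function data(string $text): void
    {
        // Deliberately does nothing.
    }
}

class CollectingListener extends NullListener
{
    public array $seen = [];

    public function data(string $text): void
    {
        $this->seen[] = $text;
    }
}

class Emitter
{
    private NullListener $listener;

    public function __construct()
    {
        // Default to the null object so emit() needs no existence check.
        $this->listener = new NullListener();
    }

    public function setListener(NullListener $listener): void
    {
        $this->listener = $listener;
    }

    public function emit(string $text): void
    {
        $this->listener->data($text);
    }
}
```

    Until setListener() is called, emit() is a cheap no-op instead of an error or a method_exists() test on every event.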


    I have done absolutely no optimization on this. Optimization-wise, it would be best to focus on the scan* methods.

    This thing probably doesn't have enough states to robustly handle HTML (it's just a proof of concept).

    PHP Code:
    <?php

    define('STATE_STOP', 0);
    define('STATE_START', 1);

    define('STATE_TAG', 2);
    define('STATE_OPENING_TAG', 3);
    define('STATE_CLOSING_TAG', 4);
    define('STATE_TAG_CLEANUP', 5);
    define('STATE_ATTRIBUTE', 6);

    class StartingState {
        function parse(&$context) {
            $data = $context->scanUntilChar('<');
            if ($data == '') {
                return STATE_STOP;
            } else {
                $context->IgnoreCharacter();
                $context->handler_object_data->{$context->handler_method_data}($data);
                return STATE_TAG;
            }
        }
    }

    class TagState {
        function parse(&$context) {
            $char = $context->ScanCharacter();
            if ($char == '/') {
                return STATE_CLOSING_TAG;
            } else {
                $context->unscanCharacter();
                return STATE_OPENING_TAG;
            }
        }
    }

    class ClosingTagState {
        function parse(&$context) {
            $tag = $context->scanUntilChar('>');
            if ($tag == '') {
                return STATE_STOP;
            } else {
                $context->handler_object_element->{$context->handler_method_closing}($tag);
                return STATE_TAG_CLEANUP;
            }
        }
    }

    class OpeningTagState {

        var $attributes = array();

        function attributeHandler($attributename, $attributevalue) {
            $this->attributes[$attributename] = $attributevalue;
        }

        function parse(&$context) {
            $tag = $context->scanUntilCharSet("/> \n\r\t");
            if ($tag == '') {
                return STATE_STOP;
            } else {
                $context->_parse(STATE_ATTRIBUTE);
                $context->handler_object_element->{$context->handler_method_opening}($tag, $this->attributes);
                return STATE_TAG_CLEANUP;
            }
        }
    }

    class TagCleanupState {
        function parse(&$context) {
            $char = $context->scanCharacter();
            if ($char == '/') {
                $char = $context->scanCharacter();
                if ($char != '>') {
                    $context->unscanCharacter();
                }
            }
            return STATE_START;
        }
    }

    class AttributeStart {

        var $attribute_handler;

        function parse(&$context) {
            $context->scanPastWhitespace();
            $attributename = $context->scanUntilCharSet("=/> \n\r\t");
            if ($attributename == '') {
                return STATE_STOP;
            } else {
                $attributevalue = NULL;
                $context->scanPastWhitespace();
                $char = $context->scanCharacter();
                if ($char == '=') {
                    $context->scanPastWhitespace();
                    $char = $context->ScanCharacter();
                    if ($char == '"') {
                        $attributevalue = $context->scanUntilChar('"');
                        $context->IgnoreCharacter();
                    } else if ($char == "'") {
                        $attributevalue = $context->scanUntilChar("'");
                        $context->IgnoreCharacter();
                    } else {
                        $context->unscanCharacter();
                        $attributevalue = $context->scanUntilCharSet("/> \n\r\t");
                    }
                }
                $this->attribute_handler->attributeHandler($attributename, $attributevalue);
                return STATE_ATTRIBUTE;
            }
        }
    }

    class StateParser {
        var $rawtext;
        var $position;
        var $length;

        var $State = array();

        function unscanCharacter() {
            $this->position -= 1;  // $this->position--; is broken?
        }

        function ignoreCharacter() {
            $this->position++;
        }

        function scanCharacter() {
            if ($this->position < $this->length) {
                return $this->rawtext{$this->position++};
            } else {
                return '';
            }
        }

        function scanUntilCharSet($string) {
            $startpos = $this->position;
            $pos = $startpos;
            while ($pos < $this->length && strpos($string, $this->rawtext{$pos}) === FALSE) {
                $pos++;
            }
            $this->position = $pos;
            return substr($this->rawtext, $startpos, $pos-$startpos);
        }

        function scanUntilChar($char) {
            $pos = strpos($this->rawtext, $char, $this->position);
            if ($pos === FALSE) {
                $result = substr($this->rawtext, $this->position);
                $this->position = $this->length;
            } else {
                $result = substr($this->rawtext, $this->position, $pos - $this->position);
                $this->position = $pos;
            }
            return $result;
        }

        function scanPastWhitespace() {
            while ($this->position < $this->length &&
                strpos(" \n\r\t", $this->rawtext{$this->position}) !== FALSE) {
                $this->position++;
            }
        }

        function parse($test) {
            $this->rawtext = $test;
            $this->length = strlen($test);
            $this->position = 0;
            $this->_parse();
        }

        function _parse($state = STATE_START) {
            do {
                $StateObj =& $this->State[$state];
                $state = $StateObj->parse($this);
            } while ($state != STATE_STOP && $this->position < $this->length);
        }

    }

    class NullHandler {
        function DoNothing($text) {
        }
    }

    class HtmlParser extends StateParser {
        var $handler_object_data;
        var $handler_method_data;

        var $handler_object_element;
        var $handler_method_closing;
        var $handler_method_opening;

        function HtmlParser() {
            $nullhandler =& new NullHandler();
            $this->set_data_handler($nullhandler, 'DoNothing');
            $this->set_element_handler($nullhandler, 'DoNothing', 'DoNothing');

            $this->State[STATE_START] =& new StartingState();
            $this->State[STATE_CLOSING_TAG] =& new ClosingTagState();
            $this->State[STATE_TAG] =& new TagState();
            $this->State[STATE_OPENING_TAG] =& new OpeningTagState();
            $this->State[STATE_TAG_CLEANUP] =& new TagCleanupState();
            $this->State[STATE_ATTRIBUTE] =& new AttributeStart();

            $this->State[STATE_ATTRIBUTE]->attribute_handler =& $this->State[STATE_OPENING_TAG];
        }

        function set_data_handler($data_handler_obj, $data_method) {
            $this->handler_object_data =& $data_handler_obj;
            $this->handler_method_data = $data_method;
        }

        function set_element_handler($element_handler_obj, $opening_method, $closing_method) {
            $this->handler_object_element =& $element_handler_obj;
            $this->handler_method_opening = $opening_method;
            $this->handler_method_closing = $closing_method;
        }
    }

    class MyHandler {
        function openHandler($name, $attrs) {
            echo ( '--Open Tag Handler: '.$name.'<br />' );
            echo ( '--Attrs:<pre>' );
            print_r($attrs);
            echo ( '</pre>' );
        }
        function closeHandler($name) {
            echo ( '--Close Tag Handler: '.$name.'<br />' );
        }
        function dataHandler($data) {
            echo ( '--Data Handler: '.$data.'<br />' );
        }
    }

    $doc=<<<EOD
    This is a <em>simple</em> example <tag test='attribute' />! 
    EOD;

    $parser =& new HtmlParser();

    $handler=& new MyHandler();
    $parser->set_element_handler($handler, 'openHandler', 'closeHandler');
    $parser->set_data_handler($handler, 'dataHandler');

    $parser->parse($doc);

    ?>

  13. #13
    SitePoint Guru
    Join Date
    Nov 2002
    Posts
    841
    Stupid me. A couple of bug fixes:

    Use this version of StartingState instead:
    PHP Code:
    class StartingState {
        function parse(&$context) {
            $data = $context->scanUntilChar('<');
            $context->IgnoreCharacter();
            if ($data != '') {
                $context->handler_object_data->{$context->handler_method_data}($data);
            }
            return STATE_TAG;
        }
    }

    Add $this->attributes = array(); after the else in OpeningTagState::parse, so attributes from the previous tag do not leak into the next one:
    PHP Code:
    class OpeningTagState {
    ...
        function parse(&$context) {
    ...
            } else {
                $this->attributes = array();
    I bet there is an infinite loop waiting to happen somewhere in there, as well.

  14. #14
    SitePoint Guru
    Join Date
    Nov 2002
    Posts
    841
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Ok, just tried some performance tests parsing a 124,423 byte XML file (timed using ab)

    Code:
    State based parser :   660 ms mean time per request.
    xml_parse (expat)  :   433 ms
    XML_HTMLSax        : 9,685 ms
    wow. I don't know what to make of this.

  15. #15
    SitePoint Guru
    Join Date
    Nov 2002
    Posts
    841
    Ok, here is an updated version. I fixed some bugs and updated the interface to more closely resemble XML_HTMLSax.
    Attached Files

  16. #16
    SitePoint Zealot prefab's Avatar
    Join Date
    Jan 2003
    Location
    Belgium
    Posts
    133
    Quote Originally Posted by Selkirk
    wow. I don't know what to make of this.
    Well, as far as I can see, this is great! Seems it's barely slower than the real SAX parser. In my test it works great

    Thanks a bunch!

    - prefab

  17. #17
    No. Phil.Roberts's Avatar
    Join Date
    May 2001
    Location
    Nottingham, UK
    Posts
    1,142
    Quote Originally Posted by Selkirk
    Ok, just tried some performance tests parsing a 124,423 byte XML file (timed using ab)

    Code:
    State based parser :   660 ms mean time per request.
    xml_parse (expat)  :   433 ms
    XML_HTMLSax        : 9,685 ms
    wow. I don't know what to make of this.
    Not bad for an un-optimised proof of concept.

  18. #18
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Hi.

    Quote Originally Posted by Selkirk
    Ok, just tried some performance tests parsing a 124,423 byte XML file (timed using ab)

    Code:
    State based parser :   660 ms mean time per request.
    xml_parse (expat)  :   433 ms
    XML_HTMLSax        : 9,685 ms
    wow. I don't know what to make of this.
    Fantastic! I don't know what's more dramatic, that this version is so fast or that the expat version is so slow. Is it I/O bound? Anyone fancy running them through apd?

    Switching the state parser to a stack-based one would allow the processing of any language (state machines fall far short of being Turing complete) and would probably add only 25% more code, mostly in passing the stack around and setting up the handlers. As for catching the infinite loop, just add a check that the position has advanced at least one space. IMO this could be needed if the state parser is to work on its own, as getting the states right could be rather tricky if they are created by hand. I needed it while debugging the SimpleTest one!

    yours, Marcus.

  19. #19
    SitePoint Wizard gold trophysilver trophy
    Join Date
    Nov 2000
    Location
    Switzerland
    Posts
    2,479
    Outstanding!

    Gobsmacked by those performance figures.

    Selkirk - you mind if I use your code for PEAR::XML_HTMLSax v2?

  20. #20
    Non-Member
    Join Date
    Jan 2003
    Posts
    5,748
    Looks smart, doesn't it? Are there any more sample templates and scripts for parsing them, though?

    Please

  21. #21
    SitePoint Wizard Chris82's Avatar
    Join Date
    Mar 2002
    Location
    Osnabrück
    Posts
    1,003
    The performance results look really impressive.
    I am a bit at a loss as to how to actually use the parser.
    I worked with XML_Transformer (seems to be down currently) and there you could define a handler for each tag. In the example there was one open/close handler. Is it possible to define a filter for each element?

    This is what I currently use:

    PHP Code:
    $doc = <<<EOD
    <article>
        <title>This is a test</title>
        <author>Some Guy</author>
    </article>
    EOD;

    class MyHandler {
        function MyHandler() {}

        function openHandler(& $parser, $name, $attrs) {
            switch (strtolower($name)) {
                case 'title':
                    echo '<h1>';
                    break;
                case 'author':
                    echo '<em>';
                    break;
            }
        }

        function closeHandler(& $parser, $name) {
            switch (strtolower($name)) {
                case 'title':
                    echo '</h1>' . "\n";
                    break;
                case 'author':
                    echo '</em>' . "\n";
                    break;
            }
        }

        function dataHandler(& $parser, $data) {
            echo $data;
        }
    }

    $parser =& new HtmlParser();
    $handler=& new MyHandler();

    $parser->set_object($handler);
    $parser->set_option('trimDataNodes', true);

    $parser->set_element_handler('openHandler','closeHandler');
    $parser->set_data_handler('dataHandler');

    $parser->parse($doc); 

  22. #22
    SitePoint Wizard Chris82's Avatar
    Join Date
    Mar 2002
    Location
    Osnabrück
    Posts
    1,003
    Okay, I have created a class Transformer which has registered methods for some tags. The methods have to follow the naming convention start_<tag> and stop_<tag>.


    PHP Code:
    class MyHandler {
        var $transformer;

        function MyHandler(&$transformer) {
            $this->transformer =& $transformer;
        }

        function openHandler(&$parser, $name, $attr) {
            $method = 'start_' . $name;
            if (method_exists($this->transformer, $method)) {
                $this->transformer->$method($attr);
            }
            else {
                // handle opening other stuff (this is currently not working)
            }
        }

        function closeHandler(&$parser, $name) {
            $method = 'stop_' . $name;
            if (method_exists($this->transformer, $method)) {
                $this->transformer->$method($name);
            }
            else {
                // handle closing other stuff (this is currently not working)
            }
        }

        function dataHandler(&$parser, $data) {
            echo $data;
        }
    }

    class Transformer {
        function start_textbox($attr) {
            echo '<input type="text"' . $this->attributesToString($attr) . '/>' . "\n";
        }

        function attributesToString($attr) {
            $string = ' ';
            foreach (array_keys($attr) as $key) {
                $string .= $key . '="' . $attr[$key] . '" ';
            }
            return preg_replace('#\s$#', '', $string);
        }

        function stop_textbox($name) {}
    }

    Example:

    PHP Code:
    $doc = <<<EOD
    <textbox name="name" length="20"></textbox>
    This should be displayed
    EOD;

    $parser      =& new HtmlParser();
    $transformer =& new Transformer();
    $handler     =& new MyHandler($transformer);

    $parser->set_object($handler);
    $parser->set_option('trimDataNodes', true);

    $parser->set_element_handler('openHandler','closeHandler');
    $parser->set_data_handler('dataHandler');

    $parser->parse($doc); 

  23. #23
    Non-Member
    Join Date
    Jan 2003
    Posts
    5,748
    Umm.... Got to have a study of this. Looks interesting though also looks complicated

  24. #24
    SitePoint Guru
    Join Date
    Nov 2002
    Posts
    841
    Quote Originally Posted by Phil.Roberts
    Not bad for an un-optimised proof of concept.
    I think I stumbled on optimal implementations of the methods in the StateParser class. I spent an hour or so trying different things and I could not speed any of them up.

    Combining operators that are frequently called together ended up slower.

    strpos is just very fast in PHP. Part of the problem is that none of the other string searching functions in PHP can begin their search at a specific position, like strpos can. (In PHP 5, I think the preg_ functions will take a starting position parameter).
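
    For reference, in current PHP both strcspn() and preg_match() accept a starting offset, so the scan helpers can be written without a per-character loop. The functions below are a sketch with made-up signatures, not a drop-in replacement for the StateParser methods:

```php
<?php
// Sketch: read up to any character in $set, starting at $position.
// Returns [text that was scanned, new position].
function scanUntilCharSet(string $text, int $position, string $set): array
{
    // strcspn() counts the characters from $position that are NOT in $set.
    $length = strcspn($text, $set, $position);
    return [substr($text, $position, $length), $position + $length];
}

// Sketch: try a pattern exactly at $position (hypothetical helper).
function matchAt(string $pattern, string $text, int $position): ?string
{
    // The fifth argument is the start offset; \G pins the match to that
    // offset instead of letting the engine search further along.
    if (preg_match('/\G(?:' . $pattern . ')/', $text, $m, 0, $position)) {
        return $m[0];
    }
    return null;
}

print_r(scanUntilCharSet("<tag test='x' />", 1, "/> \n\r\t"));
var_dump(matchAt('[a-z]+', '<tag>', 1));
```

    Whether either beats the strpos loop would need measuring; the timings in this thread do not cover these variants.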

    Alternatively, there is room in the state definitions for further optimization. I got the mean time per request down to 614 ms by eliminating the cleanup state.

    It is also possible to take advantage of the essentially sequential state transitions to completely unroll the parser and implement it in a single function: one big do loop, with break statements to return to the starting state at the top of the loop, and fall-through for each of the other state transitions. Based on a couple of simple tests with eliminating state and unrolling the StateParser scan functions, I think it might cut execution time by as much as half. For some people without much OO experience, the resulting code might even be easier to understand, although insane to modify. I leave this as an exercise for the reader.

    Quote Originally Posted by lastcraft
    Fantastic! I don't know what's more dramatic, that this version is so fast or that the expat version is so slow. Is it I/O bound? Anyone fancy running them through apd?
    Indeed. I am no expert with xml in PHP. I might have bungled the expat implementation. Here is the code I used:
    PHP Code:
    $xml_parser = xml_parser_create();
    xml_set_object($xml_parser, $handler);
    xml_set_element_handler($xml_parser, "openHandler", "closeHandler");
    xml_set_character_data_handler($xml_parser, "dataHandler");
    xml_parse($xml_parser, $doc);
    I used the same handler for all three versions. One difference seems to be that expat called the dataHandler method a LOT more, possibly because of line breaks. What do you make of this?

    Quote Originally Posted by lastcraft
    As for catching the infinite loop, just add a check that the position has advanced at least one space.
    Not every state advances the current position. An infinite loop detector would have to make sure that the state did not change AND the current position was not advanced.
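
    A sketch of such a guard, detached from the real classes: $step is a hypothetical stand-in for one state's parse() call, taking (state, position) and returning the next pair. It only catches a single step that changes nothing; a cycle through several states that never advances would need a step counter as well.

```php
<?php
// Sketch of the guard described above; runGuarded() and $step are assumptions.
function runGuarded(callable $step, int $state, int $position, int $length, int $stop = 0): int
{
    while ($state !== $stop && $position < $length) {
        [$nextState, $nextPosition] = $step($state, $position);

        // Same state AND same position: the next iteration would be identical.
        if ($nextState === $state && $nextPosition === $position) {
            throw new RuntimeException("Parser stalled in state $state at offset $position");
        }

        [$state, $position] = [$nextState, $nextPosition];
    }
    return $position;
}
```

    A state that returns itself is still fine as long as it consumes input, which is the normal case for the attribute state.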

    Quote Originally Posted by HarryF
    Selkirk - you mind if I use your code for PEAR::XML_HTMLSax v2?
    Please do.


    Bug wise, there is one thing that I would watch out for.

    This thing is just WAY too tolerant of badly formatted files. Because of that, some of the states just keep going merrily along after things have completely failed to make sense.

    Some complex states that use the unscanCharacter method near the end could advance past the end of file earlier in the state and then end up backing up, leaving an extra character or two to be parsed twice. Possibly also in the wrong state. (hello Mr. infinite loop)

    I think where this would show up is in abruptly truncated files.

    I think it would show up as a couple of garbage events at the end of processing.

    One of the things that I am not happy with is the end of file handling (well, really end of string).

    Right now, many of the states implicitly transition to the STOP state by going past the end of string and triggering the test in the main loop, rather than explicitly triggering the transition. This is probably confusing.

    There is an elegant solution to this that I will think of in two weeks while eating dinner.

  25. #25
    Non-Member
    Join Date
    Jan 2003
    Posts
    5,748
    Badly formatted files? Umm... IMO this shouldn't be an issue for you, the developer, to account for, though?

    Sure a few other members would agree on this point as well; You can't be responsible for those who have little idea of how to format a document, etc.

    Myself included

