  1. #51
    SitePoint Wizard gold trophysilver trophy
    Join Date
    Nov 2000
    Location
    Switzerland
    Posts
    2,479
    Regarding strcspn: found this, so it's definitely 4.3.0+.

    That way it could select which implementation of the state parser it was going to use depending upon which version of PHP was running (really just picking which set of scan* methods to use). It could select the fastest version allowed by the current version of PHP.
    Like the way that sounds. More soon...

  2. #52
    SitePoint Wizard gold trophysilver trophy
    Join Date
    Nov 2000
    Location
    Switzerland
    Posts
    2,479
    OK - the "front end" class now uses a concrete subclass of the StateParser, depending on the PHP version. The approach used is cruder right now than Selkirk suggested: I simply moved all the handler variables to the StateParser, and the "front end" methods route through to those variable names, meaning it's tightly coupled.

    FYI, the performance difference between using strcspn / strspn (in PHP 4.3.0+) and not using them (PHP < 4.3.0) on that RSS document is typically:

    PHP 4.3.0 + took: 0.082s
    PHP < 4.3.0 took: 0.104s
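
    For anyone wanting to reproduce that kind of comparison, here is a rough sketch of how the two scanning strategies could be timed side by side. The function names are made up for illustration and are not part of XML_HTMLSax, the sample text is synthetic rather than the RSS document mentioned above, and the code is written for a current PHP rather than PHP 4, so absolute timings will not match the figures quoted.

```php
<?php
// Hypothetical helpers for illustration only; not part of XML_HTMLSax.

// Character-by-character scan, the way a pre-4.3.0 fallback has to do it.
function scan_until_loop(string $text, string $stops, int $pos): string
{
    $start = $pos;
    $length = strlen($text);
    while ($pos < $length && strpos($stops, $text[$pos]) === false) {
        ++$pos;
    }
    return substr($text, $start, $pos - $start);
}

// Single library call, using the offset argument of strcspn().
function scan_until_strcspn(string $text, string $stops, int $pos): string
{
    return substr($text, $pos, strcspn($text, $stops, $pos));
}

// Run one strategy repeatedly and return the elapsed time in seconds.
function time_scan(callable $scan, string $text, int $rounds): float
{
    $begin = microtime(true);
    for ($i = 0; $i < $rounds; $i++) {
        $scan($text, '<&', 0);
    }
    return microtime(true) - $begin;
}

// Synthetic sample: a long run of character data followed by a tag.
$sample = str_repeat('some character data ', 200) . '<tag>';
printf("loop:    %.4fs\n", time_scan('scan_until_loop', $sample, 500));
printf("strcspn: %.4fs\n", time_scan('scan_until_strcspn', $sample, 500));
```

    Both functions should return the same substring for the same input; only the time taken differs.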


    Gonna start making things more PEAR-shaped, so I tagged the last non-PEAR release. You should be able to check it out with:

    Code:
    cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/htmlsax login
    [Just press enter]
    cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/htmlsax co -r XML_HTMLSax20030811 htmlsax

  3. #53
    SitePoint Wizard Ren's Avatar
    Join Date
    Aug 2003
    Location
    UK
    Posts
    1,060
    Ah I was thinking along the lines of

    Code:
    class XML_HTMLSax_StateParser4_0 {
        var $rawtext;
        var $position;
        var $length;
    
        var $State = array();
    
        function unscanCharacter() {
            $this->position -= 1;
        }
    
        function ignoreCharacter() {
            $this->position += 1;
        }
    
        function scanCharacter() {
            if ($this->position < $this->length) {
                return $this->rawtext{$this->position++};
            }
        }
    
        function scanUntilString($string) {
            $start = $this->position;
            $this->position = strpos($this->rawtext, $string, $start);
            if ($this->position === FALSE) {
                $this->position = $this->length;
            }
            return substr($this->rawtext, $start, $this->position - $start);
        }
    
        function scanUntilCharacters($string) {
            $startpos = $this->position;
            while ($this->position < $this->length && strpos($string, $this->rawtext{$this->position}) === FALSE) {
                ++$this->position;
            }
            return substr($this->rawtext, $startpos, $this->position - $startpos);
        }
    
        function ignoreWhitespace() {
            while ($this->position < $this->length &&
                    strpos(" \n\r\t", $this->rawtext{$this->position}) !== FALSE) {
                ++$this->position;
            }
        }
    
        function parse($data) {
            $this->rawtext = $data;
            $this->length = strlen($data);
            $this->position = 0;
            $this->_parse();
        }
    
        function _parse($state = XML_HTMLSAX_STATE_START) {
            do {
                $state = $this->State[$state]->parse($this);
            } while ($state != XML_HTMLSAX_STATE_STOP &&
                        $this->position < $this->length);
        }
    }
    
    // version_compare is PHP 4.1
    if (function_exists('version_compare') && version_compare(phpversion(), '4.3') >= 0) {
    
        class XML_HTMLSax_StateParser4_3 extends XML_HTMLSax_StateParser4_0 {
            function scanUntilCharacters($string) {
                $startpos = $this->position;
                $length = strcspn($this->rawtext, $string, $startpos);
                $this->position += $length;
                return substr($this->rawtext, $startpos, $length);
            }
    
            function ignoreWhitespace() {
                $this->position += strspn($this->rawtext, " \n\r\t", $this->position);
            }
        }
    
        class XML_HTMLSax_StateParser extends XML_HTMLSax_StateParser4_3 { }
    } else {
        class XML_HTMLSax_StateParser extends XML_HTMLSax_StateParser4_0 { }
    }
    Then

    Code:
    class XML_HTMLSax extends XML_HTMLSax_StateParser {
    would inherit the correct version. Bit more of a hack maybe.
    Last edited by Ren; Aug 11, 2003 at 05:06.

  4. #54
    SitePoint Wizard gold trophysilver trophy
    Join Date
    Nov 2000
    Location
    Switzerland
    Posts
    2,479
    Things I'm supposed to be doing have come between me and things I like doing, hence the delayed announcement, but PEAR::XML_HTMLSax 2.0.1 (alpha) is now available.

    Before this goes stable I need to write a lot more extensive unit tests (actually testing the units) plus get some feedback.

    Final changes not yet discussed: the handlers now get back the "user facade" XML_HTMLSax itself, allowing modifications to be made while parsing is in progress, and <![CDATA[ ]]> is now better handled. Thinking about it, the former change introduces problems if people try to modify the decorators once parsing has started.

    Anyway - just to say it's out there.

  5. #55
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Hi Harry...

    After a few days off I am back to work. If you like I'll wade into the tests (I am just doing the checkout now). Just send me a quick email if that's OK.

    yours, Marcus
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things

  6. #56
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Hi Harry.

    Refactoring with unit tests with mocks has tightened the tests and shortened them too. I do like Mocks! To me it all looks rather clearer, but I had better let you be the judge.

    I haven't added any new nasty cases yet, but I'll be doing that next. I have managed to find the odd little glitch which should get me into doing some real code rather than test stuff. I'll tackle that later. In the meantime I have some interface concerns...

    1) For each call to the listener the parser sends itself as a reference. As the Listener will usually be creating the parser in a factory anyway this seems rather pointless. Not only that, but the information should flow downhill only, surely. It shouldn't need to make a call upstream. Could I just strip out all of these extra parameters?

    2) The case folding and whitespace stripping options should surely be done by a filter. It strikes me that the parser is gathering responsibilities that aren't really its concern. Again, stripping these and passing everything verbatim would simplify things.

    3) The return value of parse() is void. I think that a false return should indicate the parser has halted with an error, true otherwise. This will stop the script blindly feeding it more data, which could be a costly operation over a network.

    4) If parse is called twice with say "<ta" and "g>" does this screw things up? It shouldn't if it follows the expat interface. Expat marks the last chunk with a flag. I would imagine that if you are using SAX instead of DOM you will also be pulling the data in small pieces and so it makes sense to correctly buffer the input.
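
    To make point 4 concrete, here is a minimal sketch of the kind of input buffering meant, with a "final chunk" flag in the style of expat. The ChunkBuffer class is purely hypothetical (XML_HTMLSax has nothing by that name), and the rule it uses, holding back everything from the last '<' that has no '>' after it, is a deliberate simplification that ignores '>' characters inside comments or script blocks.

```php
<?php
// Hypothetical buffer for illustration; XML_HTMLSax has no such class.
class ChunkBuffer
{
    private string $pending = '';

    // Returns the part of the input that is safe to hand to the parser now.
    // Text from an unclosed '<' onward is held back until more data arrives,
    // unless $isFinal says that no more data is coming.
    public function feed(string $chunk, bool $isFinal = false): string
    {
        $data = $this->pending . $chunk;
        $this->pending = '';
        if ($isFinal) {
            return $data;
        }
        $open = strrpos($data, '<');
        if ($open !== false && strpos($data, '>', $open) === false) {
            $this->pending = substr($data, $open);
            return substr($data, 0, $open);
        }
        return $data;
    }
}

$buffer = new ChunkBuffer();
$first  = $buffer->feed('text<ta');   // the partial tag is held back
$second = $buffer->feed('g>more');    // the tag is now complete
$last   = $buffer->feed('', true);    // final call flushes whatever is left
```

    With a buffer like this in front of parse(), the "<ta" / "g>" case would reach the state machine as a single tag.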

    yours, Marcus

  7. #57
    SitePoint Guru
    Join Date
    Nov 2002
    Posts
    841
    As I said before, I am a bit ignorant about XML. I found Processing XML with Java to be an excellent introduction. It really cleared up a lot of issues for me. (Thanks, Harry, for the link)

    Quote Originally Posted by lastcraft
    1) For each call to the listener the parser sends itself as a reference. As the Listener will usually be creating the parser in a factory anyway this seems rather pointless. Not only that, but the information should flow downhill only, surely. It shouldn't need to make a call upstream. Could I just strip out all of these extra parameters?
    If you do, you lose compatibility with the built-in expat-based parser. (This parameter looks useless to me there, too.) The SAX API for handlers in Java does not have this parameter (and has a slightly different structure).

    I am not sure that parting from the expat API is a bad thing.


    My understanding is that the purpose of XML_HTMLSax is to be able to parse "badly formed XML documents, such as HTML."


    I did a thought experiment of writing an html syntax coloring function such as highlight_file(). This function should be able to output html with exactly the same text as its input, except with different elements colored. As a result of this experiment, I can think of two problems with the call back interface:

    Attributes without values are not allowed in XML. Thus
    Code:
    <input type="checkbox" name="remember" checked>
    would pass through the interface as
    Code:
    <input type="checkbox" name="remember" checked=true>
    Not a good transformation for a syntax coloring function.

    Also, self closing tags cannot be properly detected for the purposes of syntax coloring:
    Code:
    <br />
    would pass through the interface as
    Code:
    <br></br>
    So, it seems that if XML_HTMLSax is to be used to process HTML without transforming it, its callback API must diverge from the v1 and expat compatibility.


    After reading a bit on XML, I see now that my previous comment about validation was wrong. What I meant was not validation, but well-formedness. The trouble spots that I was talking about come from badly formed HTML. My gut feeling is that changing the state implementations to detect well-formedness will cause a more elegant solution to the end of string handling to appear.

    HTML is badly formed XML, but is there a need to also be able to parse badly formed HTML? Should the parser report when it has encountered badly formed HTML? badly formed XML?

  8. #58
    SitePoint Wizard Ren's Avatar
    Join Date
    Aug 2003
    Location
    UK
    Posts
    1,060
    Quote Originally Posted by Selkirk
    Attributes without values are not allowed in XML. Thus
    Code:
    <input type="checkbox" name="remember" checked>
    would pass through the interface as
    Code:
    <input type="checkbox" name="remember" checked=true>
    Not a good transformation for a syntax coloring function.
    Setting the minimized attribute value to true doesn't seem a good idea; I think setting the value to the attribute's name is a better solution.

    Code:
    <input type="checkbox" name="remember" checked>
    becomes
    Code:
    <input type="checkbox" name="remember" checked="checked">
    This would then agree with the XHTML1.0 standard on attribute minimization.
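
    A small sketch of that transformation, assuming the handler receives attributes as an associative array in which a valueless attribute arrives as boolean true (as described elsewhere in this thread). The function name expand_minimized is made up for the example.

```php
<?php
// Hypothetical helper: $attributes maps attribute names to values, with
// boolean true standing for an attribute that had no value in the source.
function expand_minimized(array $attributes): array
{
    foreach ($attributes as $name => $value) {
        if ($value === true) {
            // XHTML 1.0 style: checked becomes checked="checked"
            $attributes[$name] = $name;
        }
    }
    return $attributes;
}

$attrs = expand_minimized([
    'type'    => 'checkbox',
    'name'    => 'remember',
    'checked' => true,
]);
```

    The strict comparison matters: a loose == true would also rewrite ordinary values such as "1".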


    Quote Originally Posted by Selkirk
    Also, self closing tags cannot be properly detected for the purposes of syntax coloring:
    Code:
    <br />
    would pass through the interface as
    Code:
    <br></br>
    Been thinking that an extra parameter on the start_element, and end_element handlers would be required.

    Code:
    function start_element($name, $attributes, $isMinimized = FALSE)
    {
    }
    
    function end_element($name, $isMinimized = FALSE)
    {
    }
    If you wanted to output HTML 4, the output functions would ignore the $isMinimized parameter, whereas if XHTML was needed, element minimization could be handled.
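
    Roughly, an output handler built on those two signatures might look like the sketch below. MarkupWriter is a hypothetical class, written for a current PHP rather than PHP 4, and it assumes the parser calls both handlers for a tag like <br /> with $isMinimized set to true.

```php
<?php
// Hypothetical writer; the method names follow the handlers proposed above.
class MarkupWriter
{
    public string $output = '';
    private bool $xhtml;

    public function __construct(bool $xhtml)
    {
        $this->xhtml = $xhtml;
    }

    public function start_element(string $name, array $attributes, bool $isMinimized = false): void
    {
        $tag = '<' . $name;
        foreach ($attributes as $key => $value) {
            $tag .= ' ' . $key . '="' . htmlspecialchars((string) $value) . '"';
        }
        // Only the XHTML writer honours the flag; the HTML writer ignores it.
        $this->output .= $tag . (($this->xhtml && $isMinimized) ? ' />' : '>');
    }

    public function end_element(string $name, bool $isMinimized = false): void
    {
        if ($this->xhtml && $isMinimized) {
            return; // already closed by start_element
        }
        $this->output .= '</' . $name . '>';
    }
}

$html = new MarkupWriter(false);
$html->start_element('br', [], true);
$html->end_element('br', true);

$xhtml = new MarkupWriter(true);
$xhtml->start_element('br', [], true);
$xhtml->end_element('br', true);
```

    Ignoring the flag in HTML mode gives <br></br>, so a real HTML writer would probably also want a list of empty elements whose end tags it drops.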

  9. #59
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Hi...

    Quote Originally Posted by Selkirk
    If you do, you lose compatibility with the built-in expat-based parser. (This parameter looks useless to me there, too.)
    Well, expat is a C program, so I guess it might be missing there too (haven't looked). Looking at our own SAX parser, an expat wrapper at work, the very first thing we did was ditch that useless parameter. My feeling is that it should go.

    Quote Originally Posted by Selkirk
    I am not sure that parting from the expat API is a bad thing.
    Me neither. I am not sure that for the object interface we cannot just use default names for the handlers as well.

    Quote Originally Posted by Selkirk
    My understanding is that the purpose of XML_HTMLSax is to be able to parse "badly formed XML documents, such as HTML."

    I did a thought experiment of writing an html syntax coloring function such as highlight_file().
    Which is exactly the point. Your thought experiment highlights this in the most apposite way.

    There are two tasks here, parsing HTML and producing XML compatible output. I feel that the core parser should return the data verbatim (no options at all) and a SAXFilter (say 'WellFormer') could run over the events downstream. There are quite a few things to do for this filter that would complicate the parsing stage a great deal...
    1) Adding <?xml ... ?> header and doctype.
    2) Adding/removing entity definitions to match the big five and/or defining the HTML ones.
    3) Balancing tag nesting (may be tricky with things like <br> lying around).
    4) Stripping JASP stuff.
    5) Filling in attributes.
    6) Other stuff I haven't thought of.
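
    As a rough illustration of the filter idea, the sketch below covers only item 3 (tag balancing) and nothing else on the list. Both classes are hypothetical, the list of empty elements is a partial guess for the example, and the handler signatures are simplified (no parser reference is passed).

```php
<?php
// Hypothetical downstream handler that just records the events it receives.
class EventLog
{
    public array $events = [];

    public function start_element(string $name, array $attributes): void
    {
        $this->events[] = "start:$name";
    }

    public function end_element(string $name): void
    {
        $this->events[] = "end:$name";
    }
}

// Hypothetical filter that sits between the parser and the real handler.
class WellFormer
{
    private $next;
    private array $open = [];
    // Partial list, for illustration only.
    private array $empty = ['br', 'hr', 'img', 'input', 'meta', 'link'];

    public function __construct($next)
    {
        $this->next = $next;
    }

    public function start_element(string $name, array $attributes): void
    {
        $this->next->start_element($name, $attributes);
        if (in_array(strtolower($name), $this->empty, true)) {
            $this->next->end_element($name); // close empty elements at once
        } else {
            $this->open[] = $name;
        }
    }

    public function end_element(string $name): void
    {
        if (!in_array($name, $this->open, true)) {
            return; // stray close tag, or an empty element already closed
        }
        // Close everything opened since the matching start tag.
        do {
            $top = array_pop($this->open);
            $this->next->end_element($top);
        } while ($top !== $name);
    }

    // Call once parsing is over to close whatever is still open.
    public function finish(): void
    {
        while ($this->open) {
            $this->next->end_element(array_pop($this->open));
        }
    }
}

// Events as a parser might emit them for: <p><b>text<br></p>
$log = new EventLog();
$filter = new WellFormer($log);
$filter->start_element('p', []);
$filter->start_element('b', []);
$filter->start_element('br', []);
$filter->end_element('p');
$filter->finish();
```

    The parser itself stays verbatim; all the repair decisions live in the filter.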

    Quote Originally Posted by Selkirk
    After reading a bit on XML, I see now that my previous comment about validation was wrong. What I meant was not validation, but well-formedness. The trouble spots that I was talking about come from badly formed HTML. My gut feeling is that changing the state implementations to detect well-formedness will cause a more elegant solution to the end of string handling to appear.
    My gut feeling is the other way, but I am anything but sure.

    Quote Originally Posted by Selkirk
    HTML is badly formed XML, but is there a need to also be able to parse badly formed HTML? Should the parser report when it has encountered badly formed HTML? badly formed XML?
    I think the tasks should be split: HTML parsing and then HTML tidying. Otherwise there are going to be a lot of options to fill in on the class at instantiation. What do you think? What do other people expect from the interface?

    yours, Marcus

  10. #60
    SitePoint Guru
    Join Date
    Nov 2002
    Posts
    841
    Quote Originally Posted by lastcraft
    There are two tasks here, parsing HTML and producing XML compatible output. I feel that the core parser should return the data verbatim ... I think the tasks should be split, HTML parsing and then HTML tidying.
    I agree. However, well-formedness is not something that can be determined downstream.

    Here are some test cases for badly formed HTML:
    Code:
    <tag attribute="value>contents</tag>
    <tag attribute=">">contents</tag>
    <tag attribute=""value">
    <tag attribute="value"">
    <tag>contents<</tag>
    </tag attribute="value">
    </tag/>
    </>
    <>
    More badly formed HTML due to truncation:
    Code:
    <ta
    <tag attribute="val
    </
    These cases are impossible to detect by examining the event stream. Only at the parser level can you determine if the syntax is valid HTML.

    This parser will make its best attempt to issue some sort of event stream for any input. It just will not tell you when it encounters bad syntax.
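
    As a sketch of what such a check could look like at the text level, the hypothetical function below catches only the truncation cases listed above: it reports whether the input stops in the middle of a tag. It does nothing for the other malformed examples, such as unbalanced quotes, which would need the state machine itself to flag them.

```php
<?php
// Hypothetical check: true if the text ends inside an unfinished tag,
// i.e. the last '<' has no '>' after it.
function ends_inside_tag(string $html): bool
{
    $open = strrpos($html, '<');
    if ($open === false) {
        return false; // no tags at all
    }
    return strpos($html, '>', $open) === false;
}

$truncated = ends_inside_tag('<tag attribute="val');
$complete  = ends_inside_tag('<tag>contents</tag>');
```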

  11. #61
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Hi...

    Quote Originally Posted by Selkirk
    Here are some test cases for badly formed HTML:
    Which is a very convincing point. I was also thinking that excess whitespace within elements would be lost as well, so a syntax highlighter cannot really be part of the SAX domain unless two more handlers are added: one for whitespace and one for discarded junk. I think this is stretching the job description a bit far and I gather that you do too. Would make a good subclass though.

    Quote Originally Posted by Selkirk
    This parser will make its best attempt to issue some sort of event stream for any input. It just will not tell you when it encounters bad syntax.
    A 'lossy' conversion then from text to events.

    yours, Marcus

  12. #62
    SitePoint Wizard gold trophysilver trophy
    Join Date
    Nov 2000
    Location
    Switzerland
    Posts
    2,479
    Quote Originally Posted by lastcraft
    1) For each call to the listener the parser sends itself as a reference. As the Listener will usually be creating the parser in a factory anyway this seems rather pointless. Not only that, but the information should flow downhill only, surely. It shouldn't need to make a call upstream. Could I just strip out all of these extra parameters?
    The main reason, as mentioned, is compatibility with the Expat API. Perhaps it is just overhead, but depending on how you use the parser, it can be vaguely useful for a handler to be able to access the parser directly once parsing is taking place, for example to call the XML_HTMLSax::get_current_position() method. Hmmm - to break or not to break?

    Quote Originally Posted by lastcraft
    2) The case folding and whitespace stripping options should surely be done by a filter. It strikes me that the parser is gathering responsibilities that aren't really its concern. Again, stripping these and passing everything verbatim would simplify things.
    Agreed, although again it's mostly there to match Expat, which has most of these "behaviours" switched on by default. You're right, although the implementation with Decorators is reasonably well separated.

    Quote Originally Posted by lastcraft
    3) The return value of parse() is void. I think that a false return should indicate the parser has halted with an error, true otherwise. This will stop the script blindly feeding it more data, which could be a costly operation over a network.
    Good point - need to add that as well to look the same as Expat. The only thing is XML_HTMLSax won't stop - not for anything. My guess is you could probably even try using it to parse binary and it will still reach the end. I just tried parsing http://ch.php.net/images/php.gif and it did OK.

    Quote Originally Posted by lastcraft
    4) If parse is called twice with say "<ta" and "g>" does this screw things up? It shouldn't if it follows the expat interface. Expat marks the last chunk with a flag. I would imagine that if you are using SAX instead of DOM you will also be pulling the data in small pieces and so it makes sense to correctly buffer the input.
    Hmmm - there's a bug there. Just tried it and "ta" calls the open handler with "g" becoming an attribute. According to the XML specs, well-formed tags shouldn't contain spaces.

    Quote Originally Posted by Selkirk
    So, it seems that if XML_HTMLSax is to be used to process HTML without transforming it, its callback API must diverge from the v1 and expat compatibility.
    That's very true. Perhaps the name is bad - should really be XML_BadlyFormedSax or something. The main problem I was dealing with when I originally got into this is I needed a parser which was capable of capturing the "formatting knowledge" defined with HTML from a document I knew about. That's to some extent a precondition for HTMLSax - you need to know in what ways the document you are parsing is badly formed.

    I think to make HTMLSax a parser for HTML, capable of parsing and checking HTML well-formedness, it probably needs to know about HTML - i.e. recognise a tag. That's a lot more work, but perhaps the way to go is to make the state "engine" easy to plug into. Note there's a guy who's written a parser that "understands" HTML http://anton.concord.ru/ - only glanced at it (can't say what approach he's using). Thinking about HTML itself, there are probably about 10 "special cases" to deal with compared to XML, so perhaps by allowing for those, the rest can throw errors.

    On the issue of <tag selected> becoming <tag selected=true> that's a tricky one. If I'm reconstructing tags I've parsed, I make sure to exactly check the value of attributes with

    PHP Code:
    if ( $attrib === true 
    Not sure what the best way to go here is though. Can see it's not ideal.
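
    To spell out what that strict check buys, here is a sketch of rebuilding a start tag from the values a handler receives, assuming a valueless attribute arrives as boolean true. The function name rebuild_tag is made up for the example and no escaping of attribute values is attempted.

```php
<?php
// Hypothetical helper: rebuilds a start tag from a handler's arguments.
function rebuild_tag(string $name, array $attributes): string
{
    $tag = '<' . $name;
    foreach ($attributes as $key => $value) {
        if ($value === true) {
            $tag .= ' ' . $key; // attribute had no value in the source
        } else {
            $tag .= ' ' . $key . '="' . $value . '"';
        }
    }
    return $tag . '>';
}

$tag = rebuild_tag('input', ['type' => 'checkbox', 'checked' => true]);
```

    With == instead of ===, an attribute written as value="1" would be mistaken for a minimized one and lose its value.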

    Anyway - not much time to think about it right now. I've tagged the 2.0.1 release so feel free to see what works.

  13. #63
    SitePoint Wizard gold trophysilver trophy
    Join Date
    Nov 2000
    Location
    Switzerland
    Posts
    2,479
    One other possible example to consider, if you can get by with C and Perl, is http://search.cpan.org/dist/HTML-Parser/ - seems to be a lexer and delivers a SAX based API but it also "knows" what HTML is.

  14. #64
    SitePoint Enthusiast
    Join Date
    Dec 2003
    Location
    earth
    Posts
    43
    I hope this isn't a bad place to ask, but I need some help using Selkirk's XML class. I would like to loop through the values of the attributes and elements of XML data held in a string, as I would with an associative array. This is because I want to use those values in a loop while building a dynamic SQL statement.

    Any help is appreciated. Thanks!

  15. #65
    SitePoint Enthusiast
    Join Date
    Oct 2001
    Location
    London
    Posts
    26

    Getting Tidy to work in PHP 4.3.x

    Just a quick note to those trying to get Tidy to work in PHP 4.3.x.

    INSTALLATION

    1. Get libtidy source from http://tidy.sourceforge.net/src/tidy_src.tgz.

    Code:
    tar -zvxf tidy_src.tgz
    cd tidy/build/gmake
    make all
    make install
    2. Tidy is currently available for PHP 4.3.x and PHP 5 as a PECL extension from http://pecl.php.net/package/tidy. You can download it directly from http://pecl.php.net/get/tidy-1.0.tgz. Run the following commands to unpack and install:

    Code:
    tar -zvxf tidy-xxx.tgz
    cd tidy-xxx
    phpize
    ./configure && make && make install
    3. Then add

    Code:
    extension=tidy.so
    to your php.ini file and restart Apache.

    You should now see Tidy in the phpinfo();


    MY OBSERVATIONS

    Functions like tidy_get_html which return a TidyNode Object which you can traverse are not available in PHP 4.3.x, only PHP 5.

    Since tidy_node is only available in PHP >= 5.0.0, the only useful functions I can see are:

    tidy_parse_string($html);
    tidy_clean_repair();
    $html = tidy_get_output();
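
    For comparison, the same three steps might look like this with the tidy extension bundled with current PHP versions, where the functions take the tidy object explicitly instead of working on a hidden global document as the PHP 4.3.x calls above do. The wrapper name tidy_up and the pass-through fallback are assumptions made for the example.

```php
<?php
// Hypothetical wrapper around the tidy extension of current PHP versions.
function tidy_up(string $html): string
{
    if (!extension_loaded('tidy')) {
        return $html; // extension missing: pass the markup through unchanged
    }
    $tidy = tidy_parse_string($html, [], 'utf8');
    tidy_clean_repair($tidy);
    return tidy_get_output($tidy);
}

$clean = tidy_up('<p>unclosed <b>bold');
```

    The cleaned string can then be handed to the SAX parser in place of the raw markup.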

    ----

    The reason for this post is just a suggestion really. It might be a good idea to tidy up all the HTML with Tidy before running a parser over it. Especially if you are building a DOM like tree using a SAX parser.

    Cheers,

    Mike Mindel
    Wordtracker

