#1
SitePoint Zealot
Join Date: Jan 2003
Location: Belgium
Posts: 133
parsing XML, but skipping entities
I'm working on a new template engine (yah, I know...) based on SAX parsing. While parsing, I check for certain tags; other tags are passed through, so normal HTML tags etc. are not affected.
Now I've run into a problem. As SAX already tries to map (HTML) entities (especially &nbsp;), they disappear when passed through. In other words, I'd like to skip parsing those entities, so they'll stay unaffected.
If only I could get the data as raw as possible... The next best option is using Harry F's HTMLSax (PEAR). Although it works either way (and HTMLSax respects my entities!), pure SAX is a lot faster, of course.
Maybe anyone can help. Maybe I should avoid entities altogether and use document encoding only. As for &nbsp;... I only need those in tables, but it seems tables will soon be made unfashionable by CSS anyway...
- prefab

#2
SitePoint Wizard
Join Date: Nov 2000
Location: Switzerland
Posts: 2,898
Not sure if it's possible with the native SAX parser; I have run into similar problems with entities. I know that HTMLSax isn't fast, but then again, if you combine it with PEAR::Cache_Lite, as I did with Simple Template, you can limit that delay to only those occasions when either the content or the template changes.
One thing perhaps to consider is to compile your template into native PHP. The template won't change often on a live site, only the content. Also, if you check out what's happening with Simple Test, there's another SAX-like HTML parser in there (you'll need to dig a little) which uses regular expressions to parse rather than the character-by-character approach used by HTMLSax, so it should be faster.
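A minimal sketch of the caching idea above, not Cache_Lite's actual API: re-parse a template only when its source file is newer than the cached output. The function name `render_cached`, the cache file naming and the `$parse` callback are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: run the (slow) parser only when the template file has
// changed since the cached result was written.
function render_cached(string $templateFile, string $cacheDir, callable $parse): string
{
    // One cache file per template path (naming scheme is an assumption).
    $cacheFile = $cacheDir . '/' . md5($templateFile) . '.cache';

    // Cache hit: the cached output is at least as new as the template.
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($templateFile)) {
        return file_get_contents($cacheFile);
    }

    // Cache miss: parse once and store the result for later requests.
    $output = $parse(file_get_contents($templateFile));
    file_put_contents($cacheFile, $output);
    return $output;
}
```

With something like this in place, the speed of the parser only matters on the first request after a template change.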

#3
SitePoint Zealot
Join Date: Jan 2003
Location: Belgium
Posts: 133
I was actually thinking of adding caching... I think it'll do the trick.
...I didn't know HTMLSax parses character by character, which explains a lot. Thanks
- prefab

#4
SitePoint Victim
Join Date: Apr 2003
Location: London
Posts: 2,385
Hi.
I think I owe 95% of my web traffic to you.
On my TODO list was to backport the Lexer into your HtmlSax library and do a speed test. Would you be interested if I submitted such a version to you? It would save me doing the performance comparison, and I would be interested in the results as a benchmark of how fast the PHP regexes are.
The problem with the Lexer in SimpleTest, though, is that it is not easy to understand how it works (it was TDD rather than "designed"). The plus side is that it is a stack machine, which would allow proper filtering of JavaScript tags and CSS.
yours, Marcus.
__________________
Marcus Baker Testing: SimpleTest, Cgreen, Fakemail Other: Phemto dependency injector Books: PHP in Action, 97 things

#5
SitePoint Wizard
Join Date: Nov 2000
Location: Switzerland
Posts: 2,898

#6
SitePoint Zealot
Join Date: Jan 2003
Location: Belgium
Posts: 133
SimpleSaxParser::_addTag($lexer, "title"); in createLexer() breaks all processing, it seems. Clearly, I haven't got a clue how it works, yet.
- prefab

#7
SitePoint Victim
Join Date: Apr 2003
Location: London
Posts: 2,385
Hi.
I really do need to refactor that part of the code, don't I?
The Lexer works by building up a bunch of regexes with brackets around them, so if it has to look for the Perl patterns "a.*?b" and "fred" it constructs this call...
PHP Code:
When run, this will find the earliest match for the mode it is in and hide it in $matches somewhere. The Lexer digs it out and uses the result to find the point of matching. That gives two tokens to return: the non-matching one up to the match, and the match itself. The ordering of the patterns can be important, with a general pattern masking a later, more specific pattern; thus "aaabbb" should come before "a*b*".
Each pattern has a mode (state, really), usually just the name of a callback (a handler in the parser) or, if not, a name that maps to a callback. It also has an action, which is one of: nothing (carry on in the same mode), enter the newly named mode, leave this mode after this token, or a special token which calls a different handler this once only. This way the modes nest, forming a stack machine rather than a state machine.
If you didn't get all of that from the code, I don't blame you at all. I had to look at the code to write the above and I wrote it.
So why am I going into all of this? Because the lexer is set up to match only the HTML tags it needs to recognise: anchors, title, etc., attribute starts and finishes, and irrelevant whitespace. That's why it just scooped up just about every other tag; I wanted it to go as fast as possible. Where it matches specific tag starts, you will probably want a general tag in whatever factory function creates it. Try...
PHP Code:
The first one is the end of tag. It occurs in 'text' mode and invokes acceptEndToken() on the parser whilst staying in text mode. The second one is the start of the tag, which is found in 'text' mode and enters 'tag' mode as soon as it is encountered.
For your own parser you can choose your own mode names and handler names, of course. In fact you will have to rename the classes as well, to avoid clashing with the ones in SimpleTest if that is what you use for testing.
I'll try to send patches along these lines to HTMLSax next week and hopefully make a clearer job of it. Harry, can you mail me the current unit tests for the parser, as that would save a lot of time?
yours, Marcus.
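A minimal sketch of the "bracketed regexes" idea described above. This is not the SimpleTest Lexer itself: the function name `next_token` and the returned array keys are hypothetical, and it assumes the patterns contain no capturing groups of their own and no `/` characters.

```php
<?php
// Hypothetical helper: find the earliest match of any pattern and return the
// unmatched text before it, the match itself, and which pattern matched.
function next_token(array $patterns, string $subject): ?array
{
    // Wrap each pattern in brackets and join them into a single alternation,
    // e.g. ["a.*?b", "fred"] becomes /(a.*?b)|(fred)/.
    $regex = '/(' . implode(')|(', $patterns) . ')/';

    if (!preg_match($regex, $subject, $matches, PREG_OFFSET_CAPTURE)) {
        return null; // none of the patterns occur in the subject
    }

    [$matched, $offset] = $matches[0];

    // PHP drops trailing groups that did not take part in the match, so the
    // last entry in $matches belongs to the alternative that matched.
    $index = count($matches) - 2;

    return [
        'unmatched' => substr($subject, 0, $offset),
        'matched'   => $matched,
        'pattern'   => $patterns[$index],
    ];
}
```

Calling `next_token(['aaabbb', 'a*b*'], 'aaabbb')` reports the specific pattern, while swapping the order reports `a*b*`, which is the masking effect mentioned in the post.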

#8
SitePoint Zealot
Join Date: Jan 2003
Location: Belgium
Posts: 133
Thanx for your insights. I have to admit, it still has me rather stumped. But if I gather correctly, HTMLSax will benefit from your efforts soon? I think I'll stay with HTMLSax (or even SAX) for now; if all is well, it should work just the same.
I'm looking forward to a speedier HTMLSax.

#9
SitePoint Zealot
Join Date: Jan 2003
Location: Belgium
Posts: 133
I decided to take another turn on the entities problem.
I guess this is the fastest method, although it involves a global preg_replace before and after parsing. PHP Code:
- prefab
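A minimal sketch of the replace-before, restore-after idea, assuming a placeholder format of `[[ent:name]]` that never occurs in the real document. The function names `protect_entities` and `restore_entities` are hypothetical, not taken from this post.

```php
<?php
// Hypothetical step 1: hide every entity from the SAX parser by turning
// "&nbsp;" into "[[ent:nbsp]]" and "&#160;" into "[[ent:#160]]".
function protect_entities(string $xml): string
{
    return preg_replace('/&(#?[A-Za-z0-9]+);/', '[[ent:$1]]', $xml);
}

// Hypothetical step 2: after parsing, turn the placeholders back into the
// original entities so they reach the output untouched.
function restore_entities(string $text): string
{
    return preg_replace('/\[\[ent:(#?[A-Za-z0-9]+)\]\]/', '&$1;', $text);
}
```

The parser would run between the two calls, on the output of `protect_entities()`. Note that this also hides `&amp;` and `&lt;`, so the parser never decodes them either.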

#10
SitePoint Guru
Join Date: Nov 2002
Posts: 846
I only looked at your Lexer briefly, but I got the impression from it that you were familiar with tools like Flex? (If not, I am getting warm fuzzy feelings about TDD.)
I independently wrote a similar parser for WACT. It is not as generic or nice as yours (actually it is unfortunately over-integrated with a recursive descent parser), but it uses a similar regex approach.
I will be very much interested in your results. I suspect that regex is slow. I suspect that a lexer hand-optimized to the task using standard string functions will be faster. I guess it depends on the number of patterns to match, pattern density and the length of the string.

#11
SitePoint Victim
Join Date: Apr 2003
Location: London
Posts: 2,385
Hi...
It should be pretty fascinating.
...keeping the number of matches low to cut down on the number of PHP calls and separating the modes out to keep the regexes small. A tag-dense page whilst matching every tag will be pretty brutal on it and heavily favours the current HTMLSax. At least if it wins that battle then the switch is a no-brainer.
We have parsed PHP commented code for documentation extraction with a similar Lexer and it easily crunched a meg a second. This was on pages of about a third of the matching density of tag-dense HTML, so if it comes in at about this level then it should be fine for parsing pages from a network.
yours, Marcus.

#12
SitePoint Guru
Join Date: Nov 2002
Posts: 846
Here is an alternate way to implement a parser in PHP.
I suspect that it will be relatively fast for XML.
I have done absolutely no optimization on this. Optimization-wise, it would be best to focus on the scan* methods. This thing probably doesn't have enough states to robustly handle HTML (it's just a proof of concept).
PHP Code:

#13
SitePoint Guru
Join Date: Nov 2002
Posts: 846
Stupid me.
A couple of bug fixes:
Use this version of StartingState instead:
PHP Code:
PHP Code:

#14
SitePoint Guru
Join Date: Nov 2002
Posts: 846
Ok, just tried some performance tests parsing a 124,423 byte XML file (timed using ab).
Code:
State based parser : 660 ms mean time per request
xml_parse (expat)  : 433 ms
XML_HTMLSax        : 9,685 ms

#15
SitePoint Guru
Join Date: Nov 2002
Posts: 846
Ok, here is an updated version. I fixed some bugs and updated the interface to more closely resemble XML_HTMLSax.

#16
SitePoint Zealot
Join Date: Jan 2003
Location: Belgium
Posts: 133
Thanks a bunch!
- prefab

#17
No.
Join Date: May 2001
Location: Nottingham, UK
Posts: 1,142

#18
SitePoint Victim
Join Date: Apr 2003
Location: London
Posts: 2,385
Hi.
Is it I/O bound? Anyone fancy running them through apd?
Switching the state parser to a stack-based one would allow the processing of any language (state machines fall far short of being Turing complete) and would probably add only 25% more code, mostly in passing the stack around and setting up the handlers.
As for catching the infinite loop, just add a check that the position has advanced at least one space. IMO this could be needed if the state parser is to work on its own, as getting the states right could be rather tricky if they are created by hand. I needed it while debugging the SimpleTest one!
yours, Marcus.
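A minimal sketch of the "position must advance" check suggested above. The driver function `run_states`, the state names and the callable signature are assumptions for illustration; they are not taken from the state parser posted in this thread.

```php
<?php
// Hypothetical driver loop: each state is a callable that receives the input
// and the current position and returns [next state name, new position].
function run_states(array $states, string $input, string $start = 'text'): int
{
    $state = $start;
    $position = 0;
    $length = strlen($input);

    while ($state !== 'stop' && $position < $length) {
        [$next, $newPosition] = $states[$state]($input, $position);

        // Guard: a state that consumes nothing would otherwise spin forever.
        if ($newPosition <= $position) {
            throw new RuntimeException(
                "State '$state' did not advance past offset $position"
            );
        }

        $state = $next;
        $position = $newPosition;
    }

    return $position; // offset at which parsing stopped
}
```

A hand-written state that forgets to consume input then fails loudly at the offending offset instead of hanging the request.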

#19
SitePoint Wizard
Join Date: Nov 2000
Location: Switzerland
Posts: 2,898
Outstanding!
Gobsmacked by those performance figures. Selkirk, do you mind if I use your code for PEAR::XML_HTMLSax v2?

#20
Non-Member
Join Date: Jan 2003
Posts: 5,788
Looks smart, doesn't it? Although, are there any more sample templates and a script for parsing them?
Please.

#21
SitePoint Wizard
Join Date: Mar 2002
Location: Osnabrück
Posts: 1,003
The performance results look really impressive.
I am a bit at a loss as to how to actually use the parser. I worked with XML_Transformer (seems to be down currently), and there you could define a handler for each tag. In the example there was one open/close handler. Is it possible to define a filter for each element? This is what I currently use:
PHP Code:

#22
SitePoint Wizard
Join Date: Mar 2002
Location: Osnabrück
Posts: 1,003
Okay, I have created a class Transformers which has registered methods for some tags. The methods have to follow the convention start_tag and stop_tag.
PHP Code:
PHP Code:
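A minimal sketch of the start_tag/stop_tag naming convention described in this post. The `Transformers` class body and the `dispatch_tag` function are hypothetical stand-ins, not the code that was posted; tags without a matching method are passed through, with their attributes dropped for brevity.

```php
<?php
// Hypothetical handler class: one start_<tag>/stop_<tag> method pair per tag.
class Transformers
{
    public function start_b(array $attrs): string
    {
        return '<strong>';
    }

    public function stop_b(): string
    {
        return '</strong>';
    }
}

// Hypothetical dispatcher: look up "<event>_<tag>" on the handler object and
// call it if it exists; otherwise emit the tag unchanged.
function dispatch_tag(object $handlers, string $event, string $tag, array $attrs = []): string
{
    $method = $event . '_' . strtolower($tag);

    if (method_exists($handlers, $method)) {
        return $event === 'start'
            ? $handlers->$method($attrs)
            : $handlers->$method();
    }

    // No handler registered: pass the tag through (attributes omitted here).
    return $event === 'start' ? "<$tag>" : "</$tag>";
}
```

The parser's generic open and close callbacks would then only need to call `dispatch_tag()` with 'start' or 'stop'.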

#23
Non-Member
Join Date: Jan 2003
Posts: 5,788
Umm.... Got to have a study of this. Looks interesting, though it also looks complicated.

#24
SitePoint Guru
Join Date: Nov 2002
Posts: 846
Combining operators that are frequently called together ended up slower. strpos is just very fast in PHP. Part of the problem is that none of the other string searching functions in PHP can begin their search at a specific position, like strpos can. (In PHP 5, I think the preg_ functions will take a starting position parameter.)
Alternatively, there is room in the state definitions for further optimization. I got the mean time per request down to 614 ms by eliminating the cleanup state. It is also possible to take advantage of the essentially sequential state transitions to completely unroll the parser and implement it in a single function, using a big do loop with break statements to return to the starting state at the top of the loop and falling through for each of the other state transitions. Based on a couple of simple tests with eliminating state and unrolling the StateParser scan functions, I think it might possibly cut execution time by as much as half. For some people without much OO experience, the resulting code might even be easier to understand, although insane to modify. I leave this as an exercise for the reader.
PHP Code:
Bug-wise, there is one thing that I would watch out for. This thing is just WAY too tolerant of badly formatted files. Because of that, some of the states just keep going merrily along after things have completely failed to make sense.
For some complex states using the unscanCharacter method near the end, they could advance past the end of file earlier in the state and then end up backing up, leaving an extra character or two to be parsed twice, possibly also in the wrong state (hello, Mr. infinite loop). I think where this would show up is in abruptly truncated files, as a couple of garbage events at the end of processing.
One of the things that I am not happy with is the end-of-file handling (well, really end of string). Right now, many of the states implicitly transition to the STOP state by going past the end of string and triggering the test in the main loop, rather than explicitly triggering the transition. This is probably confusing. There is an elegant solution to this that I will think of in two weeks while eating dinner.
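A minimal sketch of a scan helper built on strpos with a starting offset, with the end-of-string case reported explicitly rather than left for the main loop to notice. The function name `scan_until` and its return shape are assumptions; this is not one of the scan* methods from the parser in this thread.

```php
<?php
// Hypothetical scan helper: return the text from $position up to the next
// occurrence of $needle and move $position just past that needle.
function scan_until(string $input, int &$position, string $needle): array
{
    // strpos can start searching at an offset, so no substring copy is needed.
    $found = strpos($input, $needle, $position);

    if ($found === false) {
        // Needle never appears again: consume the rest and flag end of string
        // so the caller can switch to its stop state explicitly.
        $rest = substr($input, $position);
        $position = strlen($input);
        return ['text' => $rest, 'eof' => true];
    }

    $text = substr($input, $position, $found - $position);
    $position = $found + strlen($needle);
    return ['text' => $text, 'eof' => false];
}
```

Because the position never moves beyond the end of the string, a truncated file ends with one flagged result instead of characters being parsed twice.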

#25
Non-Member
Join Date: Jan 2003
Posts: 5,788
Badly formatted files? Umm... IMO, though, this shouldn't be an issue for you, the developer, to account for?
I'm sure a few other members would agree on this point as well. You can't be responsible for those who have little idea of how to format a document, etc. Myself included.