Thread: need help on regex

  1. #1
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    need help on regex

    i could probably construct one myself, but it would take me hours. i wondered if any of you are able to string together a regex (javascript syntax) that does the following :

    the input string contains html (from MSHTML) with attributes that may or may not have quotes surrounding values. eg :
    <img href="http://www.example.com/image.gif" target=_blank>

    i need to transform this into
    <img href="http://www.example.com/image.gif" target="_blank">

    any help appreciated.

  2. #2
    SitePoint Wizard silver trophy
    Join Date
    May 2003
    Posts
    1,843
    This looks too simple to be correct:
    Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
        "http://www.w3.org/TR/html4/loose.dtd">
    
    <html>
    <head>
    <title>untitled</title>
    </head>
    <body>
    <script type="text/javascript">
    
    var RE = /\s*=\s*['"]?([^ '">]+)['"]?/g;
    var str = 'RE:\n\n' + RE;
    str += '\n\n';
    var HTML = '<img href="http://www.example.com/image.gif"  target=_blank>';
    str += 'HTML:\n\n\'' + HTML + '\'';
    str += '\n\n';
    str += 'HTML.replace(RE, \'="$1"\'):\n\n' + HTML.replace(RE, '="$1"');
    alert(str);
    
    str = 'RE:\n\n' + RE;
    str += '\n\n';
    HTML = '<img href  =    \'http://www.example.com/image.gif\' target=  "_blank">';
    str += 'HTML:\n\n\'' + HTML + '\'';
    str += '\n\n';
    str += 'HTML.replace(RE, \'="$1"\'):\n\n' + HTML.replace(RE, '="$1"');
    alert(str);
    
    str = 'RE:\n\n' + RE;
    str += '\n\n';
    HTML = '<img href =http://www.example.com/image.gif target=_blank  >';
    str += 'HTML:\n\n\'' + HTML + '\'';
    str += '\n\n';
    str += 'HTML.replace(RE, \'="$1"\'):\n\n' + HTML.replace(RE, '="$1"');
    alert(str);
    
    </script>
    </body>
    </html>
    Post back when you break it...
    ::: certified wild guess :::

  3. #3
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    it works like a charm. thanks a bunch!

  4. #4
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    the complete script, in case other people are looking for something similar:
    Code:
    var s = this.editor.getInnerHTML().toLowerCase();
    // put double quotes around every attribute value
    var RE_attr = /\s*=\s*['"]?([^ '">]+)['"]?/g;
    s = s.replace(RE_attr, '="$1"');
    // self-close the empty elements
    var RE_tag = /<(img|br|hr)([^>]*)>/g;
    s = s.replace(RE_tag, '<$1$2 />');
    it doesn't take into account that mshtml may write attributes in short form, e.g. <input type="checkbox" checked>

    also, all content is put into lowercase. that doesn't matter in my particular case, since i just need to get it wellformed, to validate it against a DTD.
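A minimal sketch of the two replacements above wrapped in a function, to make the short-form limitation concrete. The name quoteAndClose is made up for this illustration, and the editor object is replaced by a plain string argument.

```javascript
// Sketch only: quoteAndClose is a hypothetical name, not part of the posted script.
var RE_attr = /\s*=\s*['"]?([^ '">]+)['"]?/g;
var RE_tag = /<(img|br|hr)([^>]*)>/g;

function quoteAndClose(html) {
  var s = html.toLowerCase();          // everything is lowercased, values included
  s = s.replace(RE_attr, '="$1"');     // quote attribute values
  s = s.replace(RE_tag, '<$1$2 />');   // self-close img, br and hr
  return s;
}

console.log(quoteAndClose('<IMG src=a.gif target=_blank>'));
// → <img src="a.gif" target="_blank" />

console.log(quoteAndClose('<input type=checkbox checked>'));
// → <input type="checkbox" checked>   (the short-form attribute is left as it was)
```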

  5. #5
    SitePoint Wizard
    Join Date
    Mar 2001
    Posts
    3,513
    Quote Originally Posted by adios
    Post back when you break it
    class= first_class second_class


    Quote Originally Posted by kyberfabrikken
    the complete script, in case other people are looking for something similar:
    You made a mistake in your regex:

    s* looks for the letter s, zero or more times. You want to look for whitespace, zero or more times. Whitespace is denoted by \s.
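To see the difference, here is a small sketch that runs the same pattern with and without the backslashes. The sample markup and variable names are made up for the illustration.

```javascript
// Same pattern twice: once with the backslashes lost, once intact.
var withoutBackslash = /s*=s*['"]?([^ '">]+)['"]?/g;
var withBackslash = /\s*=\s*['"]?([^ '">]+)['"]?/g;

var html = '<p class=small>';

// s* swallows the letters "ss" before the = and the "s" after it.
console.log(html.replace(withoutBackslash, '="$1"'));
// → <p cla="mall">

// \s* only swallows whitespace around the =.
console.log(html.replace(withBackslash, '="$1"'));
// → <p class="small">
```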

  6. #6
    SitePoint Wizard silver trophy
    Join Date
    May 2003
    Posts
    1,843
    This is kinda fun.
    Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
        "http://www.w3.org/TR/html4/loose.dtd">
    
    <html>
    <head>
    <title>untitled</title>
    </head>
    <body>
    <script type="text/javascript">
    
    String.prototype.toXHTML = function()
    {
    	var addAttrQt = /\s*=\s*['"]?([^ '">]+)['"]?/gi;
    	var fixSingletTag = /<(img|input|br|hr)([^>]*)>/gi;
    	var fixSingletAttr = /((checked)|(selected)|(disabled))/gi;
    	return this.replace(addAttrQt, '="$1"').replace(fixSingletAttr, '$1="$1"').replace(fixSingletTag, '<$1$2 />');
    }
    
    var HTML = '<img href="http://www.example.com/image.gif"  target=_blank>';
    var str = 'HTML:\n\n\'' + HTML + '\'';
    str += '\n\n';
    str += 'HTML.toXHTML():\n\n' + HTML.toXHTML();
    alert(str);
    
    HTML = '<img href  =    \'http://www.example.com/image.gif\' target=  "_blank">';
    str = 'HTML:\n\n\'' + HTML + '\'';
    str += '\n\n';
    str += 'HTML.toXHTML():\n\n' + HTML.toXHTML();
    alert(str);
    
    HTML = '<input type=checkbox name=foo value=feh checked>';
    str = 'HTML:\n\n\'' + HTML + '\'';
    str += '\n\n';
    str += 'HTML.toXHTML():\n\n' + HTML.toXHTML();
    alert(str);
    
    </script>
    </body>
    </html>
    Sure I missed something, always do with eregs.

    Are you a real kyberfabrikken? Never met one.
    Last edited by adios; Oct 8, 2004 at 12:57.
    ::: certified wild guess :::

  7. #7
    SitePoint Wizard silver trophy
    Join Date
    May 2003
    Posts
    1,843
    class= first_class second_class
    No point in 'fixing' that: it's invalid HTML, as the second class will be ignored (being parsed as an attribute name) without quotes bounding both values.
    ::: certified wild guess :::

  8. #8
    SitePoint Wizard
    Join Date
    Mar 2001
    Posts
    3,513
    it's invalid HTML...without quotes bounding both values.
    I thought the whole point of the script was to put quote marks around attribute values.

    class= "first_class second_class" is valid HTML, but the regex can't put the quotes around that correctly.
    Last edited by 7stud; Oct 8, 2004 at 12:36.
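A short sketch of the case described above: the regex from post #2 stops at the first space inside a quoted value. The second pattern is one possible alternative, not something posted in this thread. It treats double-quoted, single-quoted and bare values as three separate cases. The names quoteAttrs and QUOTED_OR_BARE are hypothetical, and a single-quoted value that itself contains a double quote would still come out wrong.

```javascript
var original = /\s*=\s*['"]?([^ '">]+)['"]?/g;
var input = '<p class= "first_class second_class">';

console.log(input.replace(original, '="$1"'));
// → <p class="first_class" second_class">

// Hypothetical alternative: match the whole value, whichever way it is delimited.
var QUOTED_OR_BARE = /\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/g;

function quoteAttrs(html) {
  return html.replace(QUOTED_OR_BARE, function (match, dq, sq, bare) {
    var value = dq !== undefined ? dq : (sq !== undefined ? sq : bare);
    return '="' + value + '"';
  });
}

console.log(quoteAttrs('<p class= "first_class second_class" id=x>'));
// → <p class="first_class second_class" id="x">
```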

  9. #9
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    good point with the \s

    7stud is right. <foo value="foobar farfar" /> won't work properly.

    Are you a real kyberfabrikken? Never met one
    yep - now you have.

  10. #10
    SitePoint Wizard silver trophy
    Join Date
    May 2003
    Posts
    1,843
    Well, it's always something. I'll look at that later. Maybe a simple solution but doesn't appear to be.

    That guy wasn't 'right' - his first post specified invalid HTML (attribute values with included spaces and no bounding quotes) that would have required a lot more code to parse. Not writing a "tidy" here (been done). Then he changed the complaint the second time around to a valid one. Notice he didn't suggest a solution.
    ::: certified wild guess :::

  11. #11
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    after looking at it again, i'll agree to that. the class= first_class second_class won't ever happen with MSHTML. it may not conform to w3c, but it's at least consistent with itself.

  12. #12
    SitePoint Wizard
    Join Date
    Mar 2001
    Posts
    3,513
    it may not conform to w3c
    It absolutely conforms with w3c. You can have multiple classes for the class attribute. See here if you are interested:

    http://www.w3.org/TR/html4/struct/gl...tml#adef-class

  13. #13
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    i meant MSHTML output

  14. #14
    SitePoint Wizard silver trophy
    Join Date
    May 2003
    Posts
    1,843
    kyber...

    See if this is any better. Hard to know what to do with uppercase, as mixed case is acceptable for some attribute values. Any suggestions welcome.


    Code:
    
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
        "http://www.w3.org/TR/html4/loose.dtd">
    
    <html>
    <head>
    <title>untitled</title>
    <style type="text/css">
    
    body {
    	background: buttonface;
    }
    #t, #readout {
    	width: 90%;
    	padding: 4px;
    	margin: 20px auto;
    }
    #t td {
    	font: 12px monospace;
    	text-align: center;
    	border: 1px #000 solid;
    	background: #eee;
    }
    #readout {
    	height: 14px;
    	font: 11px monospace;
    	color: #000;
    	text-align: center;
    	margin-bottom: 100px;
    	background: #eee;
    }
    #z {
    	font: 12px monospace;
    	text-align: center;
    	background: #eee;
    }
    pre {
    	font: 11px monospace;
    }
    
    </style>
    </head>
    <body>
    <script type="text/javascript">
    
    String.prototype.toXHTML = function()
    {
    	var lowCaseAttr = /[< ]+([^= ]+)/gi,
    	addAttrQt = /\s*=\s*(['"])?(([^>"' ]| (?=[^"]+"))+)\1?/gi,
    	fixSingletTag = /<(br|hr|img|input|link|meta)([^>]*)>/gi,
    	fixSingletAttr = /((checked)|(selected)|(disabled)|(nowrap))/gi;
    	return this.replace(lowCaseAttr, function($1){return $1.toLowerCase();})
    	.replace(addAttrQt, '="$2"')
    	.replace(fixSingletTag, '<$1$2 />')
    	.replace(fixSingletAttr, '$1="$1"');
    }
    
    var x = [
    		'<img href="http://www.example.com/image.gif"  target=_blank>'			,
    		'<a href  =   \'http://www.example.com/image.gif\' target=  "_blank">hah</a>'	,
    		'<input type=checkbox name=foo value=feh checked>'				,
    		'<foo name="hoohah poobah" value="foobar farfar" nowrap>'			,
    		'<TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 ALIGN=center><FORM METHOD=get ACTION=/search><TR><TD NOWRAP>'
    	];
    
    function demo(idx)
    {
    	str = x[idx] || document.getElementById('z').value;
    	el = document.getElementById('readout');
    	while (el.hasChildNodes())
    		el.removeChild(el.lastChild);
    	el.appendChild(document.createTextNode(str.toXHTML()));
    }
    
    </script>
    <form>
    <table id="t">
    <tbody>
    <tr>
    <td><input type="radio" name="r" value="" onclick="demo(0)" /></td>
    <td>&lt;img href="http://www.example.com/image.gif"  target=_blank&gt;</td>
    </tr><tr>
    <td><input type="radio" name="r" value="" onclick="demo(1)" /></td>
    <td>&lt;a href  =   \'http://www.example.com/image.gif\' target=  "_blank"&gt;hah&lt;/a&gt;</td>
    </tr><tr>
    <td><input type="radio" name="r" value="" onclick="demo(2)" /></td>
    <td>&lt;input type=checkbox name=foo value=feh checked&gt;</td>
    </tr><tr>
    <td><input type="radio" name="r" value="" onclick="demo(3)" /></td>
    <td>&lt;foo name="hoohah poobah" value="foobar farfar" nowrap&gt;</td>
    </tr><tr>
    <td><input type="radio" name="r" value="" onclick="demo(4)" /></td>
    <td>&lt;TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 ALIGN=center&gt;&lt;FORM METHOD=get ACTION=/search&gt;&lt;TR&gt;&lt;TD NOWRAP&gt;</td>
    </tr><tr>
    <td><input type="radio" name="r" value="" onclick="z.focus()" /></td>
    <td><input type="text" id="z" name="z" style="width:99%;" onblur="demo(5)" /></td>
    </tr>
    </tbody>
    </table>
    </form>
    <div id="readout"></div>
    <h4>String.toXHTML()</h4>
    <pre>
    <script type="text/javascript">
    document.write(String.prototype.toXHTML);
    </script>
    </pre>
    </body>
    </html>
    
    Is it just me, or are these entry boxes getting smaller?
    ::: certified wild guess :::

  15. #15
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    you're pure magic adios. it even takes namespace prefixes.
    for the sake of completeness, fixSingletAttr should be :
    Code:
    fixSingletAttr = /((compact)|(checked)|(declare)|(readonly)|(disabled)|(selected)|(defer)|(ismap)|(nohref)|(noshade)|(nowrap)|(multiple)|(noresize))/gi;
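One thing to watch with this pattern: it is not anchored to a tag, so it also rewrites the same words in ordinary text. Below is a rough sketch of one way to limit the expansion to attributes inside tags. The names BOOLEAN_ATTR and expandBooleanAttrs are made up for the illustration, and the sketch still assumes that no attribute value contains a > or one of these words after a space.

```javascript
// Hypothetical variant: only expand a short-form attribute when it sits inside a tag,
// follows whitespace, and is not already followed by "=".
var BOOLEAN_ATTR = /(\s)(checked|compact|declare|defer|disabled|ismap|multiple|nohref|noresize|noshade|nowrap|readonly|selected)(?=[\s\/>])(?!\s*=)/gi;

function expandBooleanAttrs(html) {
  return html.replace(/<[^>]+>/g, function (tag) {
    return tag.replace(BOOLEAN_ATTR, function (match, space, name) {
      var lower = name.toLowerCase();
      return space + lower + '="' + lower + '"';
    });
  });
}

var sample = '<input type="checkbox" CHECKED> checked by default';

// The unanchored pattern (shortened here) also touches the text after the tag.
console.log(sample.replace(/((checked)|(selected)|(disabled))/gi, '$1="$1"'));
// → <input type="checkbox" CHECKED="CHECKED"> checked="checked" by default

console.log(expandBooleanAttrs(sample));
// → <input type="checkbox" checked="checked"> checked by default
```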
    about mixed case - isn't that deprecated in xhtml ? as far as i understand, onClick should be written onclick in the markup. (though referred to as onClick from javascript)

    Is it just me, or are these entry boxes getting smaller?
    yes that annoys me too. couldn't they just force some linebreak through on looong lines ?

  16. #16
    SitePoint Wizard silver trophy
    Join Date
    May 2003
    Posts
    1,843
    k-fab...

    Fixed a few bugs, found a few more. Regarding 'mixed case', I meant as far as attribute values were concerned; xhtml attribute names are definitely lower-case. afaik there's no hard & fast rule on the values; lower-case seems pretty universal for 'system' type values (e.g., "get"), but obviously, values for ids, names, CSS classes, etc., permit mixed case. Think I'll leave that alone. Thanks for the feedback; keep it coming if you notice anything.

    magic a.
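On the names-versus-values point, a rough sketch of lowercasing only tag names and attribute names while leaving values and text alone. This is an illustration, not the routine posted in this thread. The name lowerCaseNames is hypothetical, short-form attributes (no "=") are not lowercased, and a quoted value that contains something like " Foo=" would still be altered.

```javascript
// Hypothetical helper: lowercase tag names and attribute names, nothing else.
function lowerCaseNames(html) {
  return html.replace(/<[^>]+>/g, function (tag) {
    // Tag name, with an optional leading slash for closing tags.
    tag = tag.replace(/^<\/?[^\s>\/]+/, function (m) { return m.toLowerCase(); });
    // Attribute names: a run of name characters after whitespace, followed by "=".
    return tag.replace(/(\s)([^\s=>\/"']+)(?=\s*=)/g, function (m, space, name) {
      return space + name.toLowerCase();
    });
  });
}

var mixed = '<TABLE BORDER=0 Class="MyClass">Hello World</TABLE>';

console.log(lowerCaseNames(mixed));
// → <table border=0 class="MyClass">Hello World</table>

// For comparison, the first replace used in toXHTML also lowercases "World",
// because it matches any run of characters that follows a space.
console.log(mixed.replace(/[< ]+([^= ]+)/gi, function (m) { return m.toLowerCase(); }));
// → <table border=0 class="MyClass">Hello world</table>
```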

    Code:
    
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
        "http://www.w3.org/TR/html4/loose.dtd">
    
    <html>
    <head>
    <title>•• toXHTML ••</title>
    <style type="text/css">
    
    body {
    	background: buttonface;
    }
    #t, #readout {
    	width: 90%;
    	padding: 4px;
    	margin: 20px auto;
    }
    #t td {
    	font: 12px monospace;
    	text-align: center;
    	border: 1px #000 solid;
    	background: #eee;
    }
    #readout {
    	font: 11px monospace;
    	color: #000;
    	text-align: center;
    	margin-bottom: 40px;
    	border: 3px #fff groove;
    	background: #eee;
    	min-height: 14px;
    }
    #z {
    	font: 12px monospace;
    	text-align: center;
    	background: #eee;
    }
    
    </style>
    </head>
    <body>
    <script type="text/javascript">
    
    //    •••••••••••••••••••••••••••••••••••••••••
    //    •• String.prototype.toXHTML()          ••
    //    •• lightweight HTML »» XHTML convertor ••
    //    •• JavaScript String Class extension   ••
    //    •••••••••••••••••••••••••••••••••••••••••
    
    String.prototype.toXHTML = function()
    {
    	return this.replace(/[< ]+([^= ]+)/gi, function($1){return $1.toLowerCase();}).
    	replace(/\s*=\s*(['"])?(([^>" ]| (?=[^"=]+['"]))+)\1?/gi, '="$2"').
    	replace(/<(br|hr|img|input|link|meta)([^>]*)>/gi, '<$1$2 />').
    	replace(/((checked)|(compact)|(declare)|(defer)|(disabled)|(ismap)|(multiple)|(nohref)|(noresize)|(noshade)|(nowrap)|(readonly)|(selected))/gi, '$1="$1"').
    	replace(/(="[^']*)'([^'"]*")/, '$1$2').replace(/&/g, '&amp;').replace(/\s{2,}/g, ' ');
    }
    
    var x = [
    		'<img href="http://www.example.com/image.gif"  target=_blank>'							,
    		'<a href  =   \'http://www.example.com/image.gif\' target=  "_blank">hah</a>'					,
    		'<input type=checkbox name=foo value=feh checked>'								,
    		'<foo name="hoohah poobah" value="foobar farfar" nowrap>'							,
    		'<td class="nilink" align=right width="120">'									,
    		'<TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 ALIGN=center><FORM METHOD=get ACTION=/search><TR><TD NOWRAP>'	,
    		'<BUTTON OnCLICK="self.location=\'http://www.someplace.com/c.count?u=briggl&c=1\'">go !</BUTTON>'
    	];
    
    function demo(idx)
    {
    	str = x[idx] || document.getElementById('z').value;
    	el = document.getElementById('readout');
    	while (el.hasChildNodes())
    		el.removeChild(el.lastChild);
    	el.appendChild(document.createTextNode(str.toXHTML()));
    }
    
    String.prototype.HTMLose = function()
    {
    	return this.replace(/<([^>]+)>/gi, '&lt;$1&gt;');
    }
    
    </script>
    <form>
    <table id="t">
    <tbody>
    <tr>
    <td><input type="radio" name="r" value="" onclick="demo(0)" /></td>
    <td><script>document.write(x[0].HTMLose())</script></td>
    </tr><tr>
    <td><input type="radio" name="r" value="" onclick="demo(1)" /></td>
    <td><script>document.write(x[1].HTMLose())</script></td>
    </tr><tr>
    <td><input type="radio" name="r" value="" onclick="demo(2)" /></td>
    <td><script>document.write(x[2].HTMLose())</script></td>
    </tr><tr>
    <td><input type="radio" name="r" value="" onclick="demo(3)" /></td>
    <td><script>document.write(x[3].HTMLose())</script></td>
    </tr><tr>
    <td><input type="radio" name="r" value="" onclick="demo(4)" /></td>
    <td><script>document.write(x[4].HTMLose())</script></td>
    </tr><tr>
    <td><input type="radio" name="r" value="" onclick="demo(5)" /></td>
    <td><script>document.write(x[5].HTMLose())</script></td>
    </tr><tr>
    <td><input type="radio" name="r" value="" onclick="demo(6)" /></td>
    <td><script>document.write(x[6].HTMLose())</script></td>
    </tr><tr>
    <td><input type="radio" name="r" value="" onclick="z.focus();z.select()" /></td>
    <td><input type="text" id="z" name="z" style="width:99%;" onblur="demo(7)" /></td>
    </tr>
    </tbody>
    </table>
    </form>
    <div id="readout"></div>
    <h4>String.toXHTML() =</h4>
    <pre>
    <script type="text/javascript">
    document.write(String.prototype.toXHTML);
    </script>
    </pre>
    </body>
    </html>
    
    ::: certified wild guess :::

  17. #17
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    i've built a test suite that compares the performance against the original recursive function used to build valid xhtml. the improvement is amazing. on a large site i get averages of over 1000%
    to complete the picture i do need to strip off some weird attributes that mozilla adds in. these attributes all begin with _moz ... would you mind filling it in for me ?

    i don't think the engine should bother about attribute-values, since this is out of the scope of the html-specs.

  18. #18
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    this seems to do the trick
    Code:
    String.prototype.toXHTML = function()
    {
    	return this.replace(/[< ]+([^= ]+)/gi, function($1){return $1.toLowerCase();}).
    	replace(/\s*=\s*(['"])?(([^>" ]| (?=[^"=]+['"]))+)\1?/gi, '="$2"').
    	replace(/<(br|hr|img|input|link|meta)([^>]*)>/gi, '<$1$2 />').
    	replace(/((checked)|(compact)|(declare)|(defer)|(disabled)|(ismap)|(multiple)|(nohref)|(noresize)|(noshade)|(nowrap)|(readonly)|(selected))/gi, '$1="$1"').
    	replace(/(="[^']*)'([^'"]*")/, '$1$2').replace(/&/g, '&amp;').replace(/\s{2,}/g, ' ').
    	replace(/(_moz_dirty(="[^"]*")?)/g, '').
    	replace(/(\s*\S+="_moz[^"]*")/g, '');
    
    }

  19. #19
    SitePoint Wizard silver trophy
    Join Date
    May 2003
    Posts
    1,843
    Latest:

    Code:
    
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
        "http://www.w3.org/TR/html4/loose.dtd">
    
    <html>
    <head>
    <title>•• toXHTML ••</title>
    <style type="text/css">
    
    body {
    	background: buttonface;
    }
    form {
    	width: 90%;
    	font: 11px monospace;
    	margin: 20px auto;
    }
    #demo {
    	width: 100%;
    	height: 300px;
    	font: 12px monospace;
    	padding: 4px;
    	border: 1px #000 solid;
    }
    #readout {
    	width: 90%;
    	height: 100px;
    	font: 12px monospace;
    	padding: 4px;
    	margin: 20px auto;
    	border: 1px #000 solid;
    	background: #fff;
    	cursor: pointer;
    }
    
    </style>
    </head>
    <body>
    <script type="text/javascript">
    
    String.prototype.toXHTML = function()
    {
    	return this.replace(/[< ]+([^= ]+)/gi, function($1){return $1.toLowerCase();}).
    	replace(/\s*=\s*(['"])?(([^>" ]| (?=[^"=]+['"]))+)\1?/gi, '="$2"').
    	replace(/<(br|img|hr|input|link|meta)([^>]*)/gi, function(x,y,z){return '<'+y+z.replace(/\/$/,'')+' /'}).
    	replace(/(checked|compact|declare|defer|disabled|ismap|multiple|no(href|resize|shade|wrap)|readonly|selected)/gi, '$1="$1"').
    	replace(/_moz[^=]*=\s*\S*/g, '').
    	replace(/(="[^']*)'([^'"]*")/, '$1$2').
    	replace(/&/g, '&amp;').
    	replace(/\s{2,}/g, ' ');
    }
    
    function convert(obj)
    {
    	el = document.getElementById('readout');
    	while (el.hasChildNodes())
    		el.removeChild(el.lastChild);
    	el.appendChild(document.createTextNode(obj.value.toXHTML()));
    }
    
    </script>
    <form>
    &rarr; enter HTML &rarr; click below &darr;
    <textarea id="demo" onblur="return convert(this)"></textarea>
    </form>
    <div id="readout"></div>
    </body>
    </html>
    
    Fixed your singlet element 'closer', it was adding a slash if the element was already closed. The '_moz' stripper seems OK. Working on adding an alt="" to img tags that lack it, and I'll toss in anything else you can think of.

    This made me feel dumb:

    http://javascript.internet.com/generators/html2xhtml.js

    just noticed, this board added spaces to 'wrap' thanks...
    ::: certified wild guess :::

  20. #20
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Fixed your singlet element 'closer', it was adding a slash if the element was already closed
    yep, just noticed that too.

    The '_moz' stripper seems OK
    yes, but weirdly enough, adding these two (simple) regexes doubled the execution time, so the improvement now dropped to a mere +700% ... i wonder if it's possible to combine some of the expressions into fewer, but more complex ones - it should give better results.

    This made me feel dumb
    don't - the code i'm improving on looks a lot like that. the cardinal point is, that it is a recursive function, that traverses every single element through DOM. it's precise in terms of the final result, but it's very cpu-consuming too, and i'm using it in a time-critical place.
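The test suite itself is not shown in this thread, so the following is only a guess at what such a comparison could look like: a tiny timing harness that runs two candidate functions over the same input and reports the average time per run. All names (averageMs, quoteOnly, quoteAndStripMoz, sample) are made up, and the numbers will vary by browser and machine.

```javascript
// Hypothetical timing harness: average milliseconds per call over a number of runs.
function averageMs(fn, input, runs) {
  var start = Date.now();
  for (var i = 0; i < runs; i++) {
    fn(input);
  }
  return (Date.now() - start) / runs;
}

// Two candidates to compare: quoting only, and quoting plus a _moz_dirty stripper.
function quoteOnly(html) {
  return html.replace(/\s*=\s*['"]?([^ '">]+)['"]?/g, '="$1"');
}

function quoteAndStripMoz(html) {
  return quoteOnly(html).replace(/(_moz_dirty(="[^"]*")?)/g, '');
}

// Build a large input by repeating one cell 2000 times.
var sample = new Array(2001).join('<td align=right _moz_dirty="">x</td>');

console.log('quote only:       ' + averageMs(quoteOnly, sample, 20) + ' ms');
console.log('quote + strip moz: ' + averageMs(quoteAndStripMoz, sample, 20) + ' ms');
```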

  21. #21
    SitePoint Wizard silver trophy
    Join Date
    May 2003
    Posts
    1,843
    Actually, my damn moose found that script and left it up on my display...he thinks that's amusing. Impressive though...

    I'd guess it's the singlet expander that's slowing things down; that callback with the nested .replace() can't be helping. Couldn't find another way to do it (tried fashioning a lookahead, unsuccessfully). I'll get back to it, along with combining some of the routines if possible.

    Be interested in any details on your benchmarking.

    Vi ses!
    ::: certified wild guess :::

  22. #22
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Actually, my damn moose found that script and left it up on my display
    Choosing a pet of that size wasn't wise in the first place, so i'll opt to say you had it coming.

    I managed to improve further on performance by compiling the regexps beforehand. have a look at :

    http://www.kyberfabrikken.dk/opensou...te/regex-test/

  23. #23
    SitePoint Wizard silver trophy
    Join Date
    May 2003
    Posts
    1,843
    Wow, that's interesting. I always thought regular expression literals already were compiled. Have to look into that. This routine can (will) be syntactically improved on - soon.

    Finally ! A quality regex site.
    ::: certified wild guess :::

  24. #24
    SitePoint Wizard silver trophy
    Join Date
    May 2003
    Posts
    1,843
    Yet another take on this...

    Code:
    
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
        "http://www.w3.org/TR/html4/loose.dtd">
    
    <html>
    <head>
    <title>•• toXHTML ••</title>
    <style type="text/css">
    
    body {
    	background: buttonface;
    }
    form {
    	width: 90%;
    	font: 11px monospace;
    	margin: 40px auto;
    }
    .tbox {
    	width: 100%;
    	height: 200px;
    	font: 12px monospace;
    	padding: 4px;
    	border: 1px #000 solid;
    }
    
    </style>
    </head>
    <body>
    <script type="text/javascript">
    
    String.prototype.H2X =
    [
    	new RegExp().compile(/[< ]+([^= ]+)/gi),
    	new RegExp().compile(/(\S*\s*=\s*)?_moz[^=>]*(=\s*[^>]*)?/gi),
    	new RegExp().compile(/\s*=\s*(['"])?(([^>" ]| (?=[^"=]+['"]))+)\1?/gi),
    	new RegExp().compile(/\/>/),
    	new RegExp().compile(/<(br|hr|img|input|link|meta)([^>]*)>/gi),
    	new RegExp().compile(/(checked|compact|declare|defer|disabled|ismap|multiple|no(href|resize|shade|wrap)|readonly|selected)/gi),
    	new RegExp().compile(/(="[^']*)'([^'"]*")/),
    	new RegExp().compile(/&(?=[^<]*>)/g),
    	new RegExp().compile(/<\s+/g),
    	new RegExp().compile(/\s+(\/)?>/g),
    	new RegExp().compile(/\s{2,}/g)
    ]
    
    String.prototype.toXHTML = function()
    {
    	return this.replace(this.H2X[0], function($1){return $1.toLowerCase();}).
    	replace(this.H2X[1], ' ').replace(this.H2X[2], '="$2"').replace(this.H2X[3], '>').
    	replace(this.H2X[4], '<$1$2 />').replace(this.H2X[5], '$1="$1"').replace(this.H2X[6], '$1$2').
    	replace(this.H2X[7], '&amp;').replace(this.H2X[8], '<').replace(this.H2X[9], '$1>').replace(this.H2X[10], ' ');
    }
    
    </script>
    <form>
    1) &rarr; enter HTML &darr;
    <textarea class="tbox" onblur="readout.value=this.value.toXHTML()"></textarea>
    2) &rarr; click below &darr;
    <textarea class="tbox" name="readout"></textarea>
    </form>
    </body>
    </html>
    
    A little more OO, I think. Lost that callback, by the simple expedient of pre-stripping any slashes from closing brackets first. Added some cleanup.

    Are you sure using .compile() makes a difference? All the documentation I could find would indicate that regular expression literals are compiled at load-time, and only RegExp objects evaluated from strings need be compiled to run more efficiently. Obviously wrong, based on your tests.
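For what it's worth, the same "build the patterns once, reuse them on every call" idea can be written with plain regex literals and no .compile() call, which current ECMAScript editions list only as a legacy feature. The sketch below is an assumption-laden illustration rather than a drop-in replacement: STEPS and applySteps are made-up names, only four of the steps are shown, and whether this is faster or slower than .compile() depends on the engine.

```javascript
// Hypothetical step list; each regex literal is created once, when the script loads.
var STEPS = [
  [/\s*\/>/g, '>'],                                        // drop existing " />" closers
  [/\s*=\s*['"]?([^ '">]+)['"]?/g, '="$1"'],               // quote attribute values (regex from post #2)
  [/<(br|hr|img|input|link|meta)([^>]*)>/gi, '<$1$2 />'],  // self-close empty elements
  [/\s{2,}/g, ' ']                                         // collapse runs of whitespace
];

function applySteps(html) {
  var out = html;
  for (var i = 0; i < STEPS.length; i++) {
    out = out.replace(STEPS[i][0], STEPS[i][1]);
  }
  return out;
}

console.log(applySteps('<img src=a.gif   alt=x><br />'));
// → <img src="a.gif" alt="x" /><br />
```

Keeping the patterns in an ordinary array rather than on String.prototype also avoids adding an enumerable H2X property to every string.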

    As always, feedback welcome. Every time I dump real-world tagsoup in there, I discover something else unexpected happening. cheers /adios
    Last edited by adios; Oct 15, 2004 at 00:54.
    ::: certified wild guess :::

  25. #25
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    I found a bug. RegExp #4 should be changed to :
    new RegExp().compile(/<(br|hr|img|input|link|meta)([^>\/]*)\/?>/g)

    otherwise, if the tag is already correctly closed, it will get an extra slash, which is obviously incorrect (e.g. <br /> became <br //>).
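One caveat with the replacement regex: because [^>\/]* excludes every slash, a tag whose attribute value contains one (a URL, for instance) no longer matches and is left unclosed. A possible alternative, offered as a sketch and not taken from this thread, is to match the attributes lazily and treat only a trailing slash as the closer. The names VOID_TAG and closeVoidTags are hypothetical.

```javascript
// Regex as posted above: skips tags that contain a slash inside an attribute value.
var posted = /<(br|hr|img|input|link|meta)([^>\/]*)\/?>/g;

// Hypothetical alternative: lazy attribute match, optional whitespace and slash at the end.
var VOID_TAG = /<(br|hr|img|input|link|meta)\b([^>]*?)\s*\/?>/gi;

function closeVoidTags(html) {
  return html.replace(VOID_TAG, '<$1$2 />');
}

var img = '<img src="http://www.example.com/image.gif">';

console.log(img.replace(posted, '<$1$2 />'));
// → <img src="http://www.example.com/image.gif">   (unchanged, no match)

console.log(closeVoidTags(img));
// → <img src="http://www.example.com/image.gif" />

console.log(closeVoidTags('<br /><br><hr/>'));
// → <br /><br /><hr />
```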
