  1. #1
    SitePoint Evangelist
    Join Date
    May 2006
    Posts
    457

    Split string with HTML tags

    Hello All,

    I'm looking to split a string containing HTML tags into two sections: the first with a limited number of characters, the second with the remainder of the original string.

    I'm currently using the following to split the string:

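    Code:
    // Returns at most the first `length` characters of x.
    public static string SplitWord(string x, int length)
    {
        if (x.Length > length)
        {
            x = x.Substring(0, length);
        }
        return x;
    }

    // "text" holds the original HTML; "string" is a C# keyword and cannot name a variable.
    string1 = SplitWord(text, 1000);
    // Substring(1000) throws when text is shorter than 1000 characters.
    string2 = text.Length > 1000 ? text.Substring(1000) : "";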
    With my current solution, the split occasionally lands inside an HTML tag or between an opening tag and its closing tag. For example:

    string1 = "<di"
    string2 = "v>";

    or

    string1 = "<div>"
    string2 = "</div>";

    Is there a way of dividing a string into two parts while making sure the split comes after a closing HTML tag?

    Ideal solution:
    string1 = "<div></div>"
    string2 = "";

  2. #2
    SitePoint Evangelist
    Join Date
    Jun 2007
    Location
    North Yorkshire, UK
    Posts
    483
    I believe what you are after is a regular expression.

    Code JavaScript:
    var strText="<div>when <b>in doubt</b> do nothing</div>";
    var strBits = strText.match(/<[^> ]+[^>]*>[^<]*/g);

    This splits the string so you end up with an array:
    strBits[0] = "<div>when "
    strBits[1] = "<b>in doubt"
    strBits[2] = "</b> do nothing"
    strBits[3] = "</div>"
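
    The chunks above could then be rejoined into two parts without ever cutting inside a tag. This is only a rough sketch: the helper name splitAtChunkBoundary and the limit of 25 are made up for illustration. It assumes the markup starts with a tag (the pattern drops any text before the first tag) and contains no stray "<" or ">" characters in text or attribute values. It does not guarantee that the first part ends after a closing tag, so tags can still be left open.

    Code JavaScript:
```javascript
// Hypothetical helper (not from the thread): joins whole chunks until adding
// the next one would exceed `limit`, so the cut never lands inside a tag.
function splitAtChunkBoundary(html, limit) {
  // Same pattern as in the post: one tag plus the text that follows it.
  const chunks = html.match(/<[^> ]+[^>]*>[^<]*/g) || [];
  let head = "";
  let i = 0;
  while (i < chunks.length && head.length + chunks[i].length <= limit) {
    head += chunks[i];
    i++;
  }
  // Everything that did not fit goes into the second part.
  return [head, chunks.slice(i).join("")];
}

const [first, rest] = splitAtChunkBoundary(
  "<div>when <b>in doubt</b> do nothing</div>", 25);
console.log(first); // <div>when <b>in doubt
console.log(rest);  // </b> do nothing</div>
```

    Note that the first part here still has <div> and <b> left open, which is the kind of problem a real HTML parser avoids.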

  3. #3
    Community Advisor ULTiMATE's Avatar
    Join Date
    Aug 2003
    Location
    Bristol, United Kingdom
    Posts
    2,160
    Never, ever, ever use a regular expression or string manipulation when dealing with HTML. Tasks like this are what HTML parsers such as HtmlAgilityPack were made for!

    If you can get to grips with a bit of XPath, then you can separate the content from the HTML, do your necessary splits and then rebuild the HTML around it however you wish.

  4. #4
    SitePoint Evangelist
    Join Date
    Jun 2007
    Location
    North Yorkshire, UK
    Posts
    483
    The question related to JavaScript. HtmlAgilityPack says it is a .NET code library, so it is server-side and therefore not appropriate in this instance.

  5. #5
    Community Advisor ULTiMATE's Avatar
    Join Date
    Aug 2003
    Location
    Bristol, United Kingdom
    Posts
    2,160
    Quote Originally Posted by PhilipToop
    The question related to JavaScript. HtmlAgilityPack says it is a .NET code library, so it is server-side and therefore not appropriate in this instance.
    This thread is in the .NET forum, which is the reason I posted a .NET library, unless I'm missing something?

    Regardless, it's a matter of language. HTML is too complex a language to parse with Regular Expressions, as this question on SO shows.

    I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.
    http://stackoverflow.com/questions/1...732454#1732454
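
    As a small illustration of the nesting problem (in JavaScript, like the earlier snippet in this thread; the sample strings are invented), neither a lazy nor a greedy plain pattern keeps count of how many <div> tags are open. One cuts a nested element short and the other runs across two siblings:

    Code JavaScript:
```javascript
// Sample inputs are invented for illustration.
const nested = "<div>outer <div>inner</div> tail</div>";
const siblings = "<div>a</div><div>b</div>";

// Lazy match stops at the first closing tag, which belongs to the inner div.
const lazy = nested.match(/<div>.*?<\/div>/)[0];
console.log(lazy); // <div>outer <div>inner</div>

// Greedy match runs to the last closing tag, swallowing both siblings.
const greedy = siblings.match(/<div>.*<\/div>/)[0];
console.log(greedy); // <div>a</div><div>b</div>
```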

  6. #6
    SitePoint Evangelist
    Join Date
    Jun 2007
    Location
    North Yorkshire, UK
    Posts
    483
    This thread is in the .NET forum,
    Indeed it is - my mistake.

  7. #7
    SitePoint Member williamjerry's Avatar
    Join Date
    Jul 2011
    Posts
    10
    I think only a regular expression can help in this case; otherwise the code will get very heavy.

  8. #8
    Community Advisor ULTiMATE's Avatar
    Join Date
    Aug 2003
    Location
    Bristol, United Kingdom
    Posts
    2,160
    Quote Originally Posted by williamjerry
    I think only a regular expression can help in this case; otherwise the code will get very heavy.
    Read my post again.

    It is a linguistic fact that a regular expression is not capable of handling HTML.

  9. #9
    SitePoint Wizard
    Join Date
    Feb 2007
    Posts
    1,274
    Quote Originally Posted by ULTiMATE
    Read my post again.

    It is a linguistic fact that a regular expression is not capable of handling HTML.
    Ahem. Actually, with the .NET extensions to regular expressions (specifically the way you can create expressions which match levels), I would argue that it *can* be done.

    It will not be pretty - and you are correct to point to alternative solutions. I'm just being obnoxious.

  10. #10
    Community Advisor ULTiMATE's Avatar
    Join Date
    Aug 2003
    Location
    Bristol, United Kingdom
    Posts
    2,160
    Quote Originally Posted by honeymonster
    Ahem. Actually, with the .NET extensions to regular expressions (specifically the way you can create expressions which match levels), I would argue that it *can* be done.

    It will not be pretty - and you are correct to point to alternative solutions. I'm just being obnoxious.
    I think this SO post (despite it relating to JSON) covers my opinion on that method.

    "Some systems offer extensions to regular expressions that kinda-sorta handle balanced expressions. However they're all ugly hacks, they're all unportable, and they're all ultimately the wrong tool for the job."
    I've used such extensions before (admittedly not with .NET). They were far more trouble than they were worth, and they didn't handle HTML found in the wild very well. More often than not, if you're performing a simple string task on a bit of HTML, then the HtmlAgilityPack will do it in a couple of lines. I'd argue that it's the best .NET library I've ever used, and like many developers, I find it painful to see anyone using a regular expression to parse HTML.

