  1. #1
    SitePoint Member
    Join Date
    Sep 2006
    Location
    Currently Toronto, Canada.
    Posts
    5
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Question regex: remove everything not div tags

    Hi,

    Searched but haven't found a solution to this.
    I want to remove everything from html code that is not a <div> or </div> tag (opening or closing).
    Since this matches the divs:
    Code:
    <div.*?>|</div>
    I thought I could just negate it somehow, such as:
    Code:
    [^(<div.*?>)]|[^(</div>)]
    (does not work)

    Any ideas?
    Cheers
    Last edited by pog; Jul 20, 2010 at 15:48. Reason: pasted wrong code
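
    A minimal sketch of why the attempted negation fails, assuming PHP's preg_* functions as used later in this thread (the sample string is hypothetical). Square brackets create a character class, so [^(<div.*?>)] matches one single character that is not in the set ( < d i v . * ? > ), rather than "anything that is not a div tag":

```php
<?php
// Inside [...] the characters ( ) . * ? lose their special meaning,
// so this is just "any one character except ( < d i v . * ? > )".
$negated_class = '~[^(<div.*?>)]~';

var_dump(preg_match($negated_class, 'd')); // int(0): 'd' is in the excluded set
var_dump(preg_match($negated_class, 'x')); // int(1): 'x' is not in the set

// The full attempt therefore deletes individual characters, not tags.
// Hypothetical sample input:
$attempt = '~[^(<div.*?>)]|[^(</div>)]~';
$result = preg_replace($attempt, '', '<p>hi</p><div>x</div>');
var_dump($result); // the <p> tags are mangled rather than removed cleanly
```

    Negating a whole sequence such as "div" needs a lookahead instead, which is what the replies below use.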

  2. #2
    SitePoint Evangelist
    Join Date
    Jun 2007
    Location
    North Yorkshire, UK
    Posts
    483
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I want to remove everything from html code that is not a <div> or </div> tag (opening or closing).
    Am I misreading this? Surely you will then just be left with a string containing only <div>s and </div>s, which doesn't seem much use.

  3. #3
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Join Date
    Aug 2008
    Location
    The Netherlands
    Posts
    9,094
    Mentioned
    153 Post(s)
    Tagged
    2 Thread(s)
    This works for me:

    Code:
    ~</(?!div).*?>|<(?!/)(?!div).*?>~is
    Use in PHP as follows:
    PHP Code:
    $some_string = preg_replace('~</(?!div).*?>|<(?!/)(?!div).*?>~is', '', $some_html);
    Breakdown of this regex:

    ~ - Start regex
    </ - match </ literally
    (?!div) - Negative lookahead for the literal string div
    .*? - match anything, lazily. Shouldn't be needed here, but without it the regex doesn't work!?
    > - match > literally
    | - OR match the following:
    < - match < literally
    (?!/) - Negative lookahead for the literal string /
    (?!div) - Negative lookahead for the literal string div
    .*? - match anything, lazily.
    > - match > literally
    ~ - End regex
    is - Modifiers: Case Insensitive (i) and Single Line mode (s), which lets . match newlines too

    Single line mode is needed to also remove tags that span multiple lines, like

    <script language="javascript"
    src="/some/path/to/some/javascript.js">
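
    A small sketch of the difference the s modifier makes, assuming a hypothetical input in which the script tag above sits inside a div:

```php
<?php
// Hypothetical input: a script tag whose attributes continue on a second line.
$html = "<div>\n<script language=\"javascript\"\nsrc=\"/some/path/to/some/javascript.js\"></script>\n</div>";

// With the s modifier, . also matches newlines, so the whole opening tag is removed.
$with_s = preg_replace('~</(?!div).*?>|<(?!/)(?!div).*?>~is', '', $html);

// Without it, .*? cannot cross the line break, so the opening script tag survives.
$without_s = preg_replace('~</(?!div).*?>|<(?!/)(?!div).*?>~i', '', $html);

var_dump($with_s);
var_dump($without_s);
```

    In the second call only the closing </script> is stripped, because it fits on one line.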

    For info on negative lookahead, see here: http://www.regular-expressions.info/lookaround.html

    Hope that helps
    Rémon - Hosting Advisor

    SitePoint forums will switch to Discourse soon! Make sure you're ready for it!

    Minimal Bookmarks Tree
    My Google Chrome extension: browsing bookmarks made easy

  4. #4
    SitePoint Wizard Stomme poes's Avatar
    Join Date
    Aug 2007
    Location
    Netherlands
    Posts
    10,281
    Mentioned
    51 Post(s)
    Tagged
    2 Thread(s)
    Agreed with Phillip. Is this the story where someone asks how to move a mountain because they want to lay a pipeline from point A to point B?

  5. #5
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Quote Originally Posted by Stomme poes View Post
    Agreed with Phillip. Is this the story where someone asks how to move a mountain because they want to lay a pipeline from point A to point B?
    How I understood it is that the OP wished to remove all tags except for div tags, thus leaving everything outside tags (content) and the div tags intact. Which is exactly what my regex provided in post #3 does.

  6. #6
    SitePoint Wizard Stomme poes's Avatar
    I'll have to see it to understand it then.

  7. #7
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Quote Originally Posted by Stomme poes View Post
    I'll have to see it to understand it then.
    PHP Code:
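    $some_html = <<<HT
    <div id="some_div"><a href="#">some link</a></div><hr /><abbr>PM</abbr>
    HT;

    // Strip every tag except opening and closing div tags
    $some_string = preg_replace('~</(?!div).*?>|<(?!/)(?!div).*?>~is', '', $some_html);

    // htmlentities() escapes the result so the tags show up as text in the browser
    var_dump(htmlentities($some_string));

    /* OUTPUT (as rendered in the browser):
    string(58) "<div id="some_div">some link</div>PM"
    */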
    Does that help?

  8. #8
    SitePoint Wizard Stomme poes's Avatar
    If that's the type of input, with no bizarre nesting or whatever, and it's never actually output to a real HTML page, then yes.

    .*? - match anything, lazily. Shouldn't be needed here, but without it the regex doesn't work!?
    Because the lookahead doesn't match stuff, just looks? But I'm probably misunderstanding that question too.
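
    A short sketch of that point, assuming the same PCRE syntax as the regex from post #3 (the sample tags are hypothetical). A lookahead is zero-width, so something else still has to consume the tag name and attributes before the closing >:

```php
<?php
// A lookahead only checks what follows; it consumes no characters.
preg_match('~<(?!div)~', '<p>', $m);
var_dump($m[0]); // string(1) "<"

// Without .*? the pattern demands ">" immediately after "<", so "<p>" cannot match.
var_dump(preg_match('~<(?!/)(?!div)>~', '<p>')); // int(0)

// With .*? the tag name and attributes are consumed up to the closing ">".
preg_match('~<(?!/)(?!div).*?>~', '<p class="x">', $m);
var_dump($m[0]); // string(13) "<p class="x">"
```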

  9. #9
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Quote Originally Posted by Stomme poes View Post
    Because the lookahead doesn't match stuff, just looks?
    That makes sense. Thank you for that.

  10. #10
    SitePoint Member
    Thanks ScallioXTX! That's pretty much what I was after. And thanks for the detailed explanation. I remember look-ahead now, but it's been a while. Thanks also to the other comments.

    I was, in fact, trying to get a string containing only div tags (as mentioned by Phillip). The reason for this is that when examining generated pages (e.g. from WordPress) it can be useful to have a skeleton outline of the (potentially bloated) div structure. This can be done by hand, of course, but that seems to be against the spirit of computing.

    Since the tags contain id and class properties, which are useful to know, combining the regex from Scallio with the following gives a visual guide viewable in a browser, showing the nesting and naming of each div without other clutter:

    Code:
    preg_replace('~<div(.*?)>~is', "<div$1>\n$1<br>\n", $some_html)
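
    The two steps described above could be chained roughly as follows. This is a sketch under assumptions: the sample markup and the variable names $divs_only and $outline are hypothetical, and the second call reads the snippet above as "search for <div(.*?)>, replace with <div$1>, a newline, $1<br> and another newline":

```php
<?php
// Hypothetical sample markup standing in for a generated page.
$some_html = '<div id="page"><h1>Title</h1><div class="post"><p>Text</p></div></div>';

// Step 1: strip every tag that is not an opening or closing div (regex from post #3).
$divs_only = preg_replace('~</(?!div).*?>|<(?!/)(?!div).*?>~is', '', $some_html);

// Step 2: after each opening div, repeat its attributes as visible text.
// Double quotes are used so that \n becomes a real newline; $1 is still the backreference.
$outline = preg_replace('~<div(.*?)>~is', "<div$1>\n$1<br>\n", $divs_only);

echo $outline;
```

    Note that the regex from post #3 keeps text outside tags, so text such as "Title" still appears in the outline.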
    Is this the story where someone asks how to move a mountain because they want to lay a pipeline from point A to point B?
    Possibly. Although I knew it would be relatively straightforward to some. I know there are various tools for examining source code, but this seems like a fair use of regexps and can be done in a text editor.

    Cheers

  11. #11
    SitePoint Wizard Stomme poes's Avatar
    They can be, just be careful. Regular expressions work on regular languages. HTML isn't a regular language. Meaning, for small things, a regex will be fine, but when there's complicated nesting and possibly strange content floating around, you'll want to check by hand afterwards if it matters.
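
    One concrete way this can go wrong, sketched with the regex from post #3 and a hypothetical input whose attribute value contains a > character:

```php
<?php
// Hypothetical input: an attribute value that itself contains ">".
$html = '<div><img alt="1 > 0"></div>';

$result = preg_replace('~</(?!div).*?>|<(?!/)(?!div).*?>~is', '', $html);

// The lazy .*? stops at the first ">", which is inside the attribute,
// so the tail of the img tag is left behind as stray text.
var_dump($result); // string(15) "<div> 0"></div>"
```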

  12. #12
    SitePoint Evangelist
    I actually find that the hierarchical HTML view shown when you use the "Inspect element" context-menu option, in for example Chrome and Firefox, is invaluable for this.

