  1. #1
    SitePoint Enthusiast Zeldinha's Avatar
    Join Date
    Sep 2004
    Location
    Barcelona [Spain]
    Posts
    89
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Regular expressions, conditional element

    Hello,

    I have a problem with a rather complicated regular expression I'm building, and I hope someone can help me find out what's wrong.

    I can't type the full regexp here, but the problem is that I have to parse the HTML of a forum and take out the posts and info I'm interested in. All the posts look more or less the same, so it's easy until I get to an element that is sometimes there and sometimes not. I tried adding a '?' at the end of that element to make it optional, but then it is always ignored. If I take out the '?' (thus making the element mandatory), then the regexp jumps to the element of the next post when it doesn't find one in the current post, and all the other elements of the current post get ignored as well.

    This part of the regular expression would be something like this:

    PHP Code:
    preg_match_all( "/(?:.*)(<li class=\"icon\">(.*)<\/li>)(?:.*)/Uis", $contents, $posts, PREG_SET_ORDER );
    How do I make this element (the whole <li></li> thing) optional and avoid the problems I'm running into? Does anyone have any ideas? I feel it's rather simple, though I've been stuck on it for a while now.

    Thanks in advance,

  2. #2
    Theoretical Physics Student bronze trophy Jake Arkinstall's Avatar
    Join Date
    May 2006
    Location
    Lancaster University, UK
    Posts
    7,062
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    I don't know much about regex (it's over-complicated, and slow at processing), but try this:

    PHP Code:
    preg_match_all( "/(?:.*)((<li class=\"icon\">(.*)<\/li>)*)(?:.*)/Uis", $contents, $posts, PREG_SET_ORDER );
    Jake Arkinstall
    "Sometimes you don't need to reinvent the wheel;
    Sometimes its enough to make that wheel more rounded"-Molona

  3. #3
    SitePoint Enthusiast Zeldinha's Avatar
    Join Date
    Sep 2004
    Location
    Barcelona [Spain]
    Posts
    89
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Quote Originally Posted by arkinstall View Post
    PHP Code:
    preg_match_all( "/(?:.*)((<li class=\"icon\">(.*)<\/li>)*)(?:.*)/Uis", $contents, $posts, PREG_SET_ORDER );
    Hello and thanks a lot for your reply. It doesn't work; it has the same effect as the '?'.

    I also tried using a pipe sign '|' with a space as the second alternative or something, but that doesn't work either.

    Any other ideas? Should be something silly that I overlooked, maybe a parenthesis or so...

    Thanks in advance,

  4. #4
    Theoretical Physics Student bronze trophy Jake Arkinstall's Avatar
    Join Date
    May 2006
    Location
    Lancaster University, UK
    Posts
    7,062
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    What about the second condition being

    (.*)

  5. #5
    SitePoint Enthusiast Zeldinha's Avatar
    Join Date
    Sep 2004
    Location
    Barcelona [Spain]
    Posts
    89
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hmmmm, I just tried that, what happens then is that it always picks the second case (I think) and never shows that element, even if it exists.

  6. #6
    SitePoint Enthusiast Zeldinha's Avatar
    Join Date
    Sep 2004
    Location
    Barcelona [Spain]
    Posts
    89
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Does anyone else have any more ideas?

    Thanks,

  7. #7
    Chessplayer kleineme's Avatar
    Join Date
    Apr 2004
    Location
    Germany
    Posts
    608
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    (?:.*)((<li class=\"icon\">(.*)<\/li>)*)(?:.*)
    This regex won't work because all three (resp. four) parts are now optional, so it would match any input string. Whether your problem can be solved depends on the context in which this list element appears, so I would have to see a little bit more information about your input string.
    Never ascribe to malice,
    that which can be explained by incompetence.
    Your code should not look unmaintainable, just be that way.
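
The point about everything being optional can be checked directly. The following is a minimal sketch, assuming a made-up one-line input (`$html` is hypothetical, not the poster's real page): the pattern does match, but only the empty string at the very start, so the list item is never captured.

```php
<?php
// Hypothetical sample input: one list item of the kind the thread is about.
$html = 'before <li class="icon">some text</li> after';

// The pattern quoted above. Under /U every quantifier is lazy by default,
// and every part of the pattern is allowed to match nothing.
$pattern = '/(?:.*)((<li class="icon">(.*)<\/li>)*)(?:.*)/Uis';

$found = preg_match($pattern, $html, $m);

var_dump($found);  // 1: the pattern does match
var_dump($m[0]);   // the whole match is the empty string at offset 0
var_dump($m[1]);   // the outer group is empty too, so the <li> is not captured
```

A successful but empty match is why the element looks "ignored": the engine has nothing forcing it to consume the `<li>`.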

  8. #8
    SitePoint Enthusiast Zeldinha's Avatar
    Join Date
    Sep 2004
    Location
    Barcelona [Spain]
    Posts
    89
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Quote Originally Posted by kleineme View Post
    This regex won't work because all three (resp. four) parts are now optional, so it would match any input string. Whether your problem can be solved depends on the context in which this list element appears, so I would have to see a little bit more information about your input string.
    Ah, alright, that makes sense. I know where that "li" is (or well, should be). I tried to add the surroundings to the mix, more or less like this:

    PHP Code:
    preg_match_all( "/(?:<div id=\"avatar(?:.*)(?:<li class=\"icon\">(.*)<\/li>|(.*))(?:.*)<\/div>)/Uis", $contents, $posts, PREG_SET_ORDER );
    With and without '?' elements etc.; tried a lot of combinations. But it keeps either ignoring or looking for the next post in line (I can understand why, it really doesn't change anything, does it?).

    Another issue, if anyone could answer it (though maybe I should make a new post about it): I was browsing around looking for a solution and I read a message that said that the modifier 'U' shouldn't be used. Would anyone care to explain why that is? Thanks,

  9. #9
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Using 'U' causes a lot of confusion for those reading your regexp. It's the same as writing "#define TRUE FALSE" in a C program, if you understand what I mean.

    As for your regexp, it would help if you posted an example of what you've got and what you want to match.

  10. #10
    SitePoint Enthusiast Zeldinha's Avatar
    Join Date
    Sep 2004
    Location
    Barcelona [Spain]
    Posts
    89
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Using 'U' causes a lot of confusion for those reading your regexp. It's the same as writing "#define TRUE FALSE" in a C program, if you understand what I mean.
    Well, I understand your example, but I can't quite see the connection, sorry (brain is on vacation lately). Could you elaborate, please? I'm really interested in knowing what's wrong. Do you think that my approach is incorrect and that I should not use that modifier?


    As for the HTML code, it's something like this (well, not really, but this is the rebellious part):

    HTML Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <HTML>
    <HEAD>
    <TITLE> Blah blah </TITLE>
    </HEAD>
    
    <BODY>
    <div>some useless code, part 1</div>
    <div id="avatar1">
    	<p>random stuff</p>
    	<ul>
    		<li class="icon"><small><b>some text</b></small></li>	
    	</ul>			
    	<p>more random stuff</p>
    </div>
    <div>some useless code, part 2</div>
    <div id="avatar2">
    	<p>random stuff</p>
    	<ul>
    		<li class="icon"><small><b>some text</b></small></li>	
    	</ul>			
    	<p>more random stuff</p>
    </div>
    <div>some useless code, part 3</div>
    </BODY>
    </HTML>
    There can be lots and lots of divs that have an id that starts with "avatar", and I want to get them all and find what is inside. The problem is that the unordered list is there most of the time, but not always. Right now, if the first div did not have a '<ul><li></li></ul>' structure, my regexp would run on to the second div and keep going from there (ignoring everything on the way).

    Thanks for your attention and help,
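
The "jumping to the next post" behaviour described above can be reproduced in a few lines. This is a sketch under assumptions: the input is a shortened, made-up version of the HTML in this post, and the pattern is a simplified stand-in for the real (unposted) expression, with the `<li>` mandatory.

```php
<?php
// Hypothetical input modeled on the HTML above: the first avatar div has no
// <li class="icon">, the second one does.
$html = <<<'HTML'
<div id="avatar1">
    <p>random stuff</p>
</div>
<div id="avatar2">
    <ul><li class="icon">icon of post 2</li></ul>
</div>
HTML;

// The <li> is mandatory here, and with the s modifier the dot also matches
// newlines, so the lazy .*? keeps expanding until it finds an <li> anywhere.
$pattern = '~<div id="avatar(\d+)">.*?<li class="icon">(.*?)</li>~s';

preg_match_all($pattern, $html, $posts, PREG_SET_ORDER);

foreach ($posts as $post) {
    // Reports avatar1 together with the icon that belongs to avatar2.
    echo "avatar{$post[1]}: {$post[2]}\n";
}
```

Nothing in the pattern tells the engine where one post ends, so a lazy `.*?` is free to cross the closing `</div>` in search of the next `<li>`.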

  11. #11
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Zeldinha View Post
    Do you think that my approach is incorrect and that I should not use that modifier?
    Metachars like dot or star are basic symbols for the regexp engine, like letters in a human language. 'U' changes the meanings of those letters, using 'U' is basically like saying "in this text we're using A in place of Z and vice versa". You probzbly won't be hzppy rezding z text like this.

    There can be lots and lots of divs that have an id that starts with "avatar", and I want to get them all and find what is inside.
    Unless "avatar" divs can contain other divs, you can simply match everything from <div> to </div>
    Code:
    ~<div id="avatar\d">(.*?)</div>~si
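
One way to use the suggestion above is a two-step approach: capture each avatar div as a whole, then look for the optional list item inside that div only. The sketch below assumes a made-up input and uses `\d+` rather than `\d` so that ids with more than one digit would also match; the variable names are illustrative.

```php
<?php
// Hypothetical input: the second avatar div has no icon list item.
$html = <<<'HTML'
<div>some useless code, part 1</div>
<div id="avatar1">
    <ul><li class="icon"><small><b>some text</b></small></li></ul>
</div>
<div id="avatar2">
    <p>no list here</p>
</div>
HTML;

$icons = [];

// Step 1: grab each avatar div as a whole (assumes no nested divs).
// \d+ instead of \d is an assumption, to allow ids such as avatar12.
preg_match_all('~<div id="avatar(\d+)">(.*?)</div>~si', $html, $divs, PREG_SET_ORDER);

foreach ($divs as $div) {
    // Step 2: look for the optional <li> only inside this one div.
    $hasIcon = preg_match('~<li class="icon">(.*?)</li>~si', $div[2], $li);
    $icons[$div[1]] = $hasIcon ? $li[1] : '';
}

print_r($icons);
```

Because the second search never sees text outside the current div, a missing `<li>` simply yields an empty string for that post instead of borrowing one from the next post.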

  12. #12
    SitePoint Enthusiast Zeldinha's Avatar
    Join Date
    Sep 2004
    Location
    Barcelona [Spain]
    Posts
    89
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Quote Originally Posted by stereofrog View Post
    Metachars like dot or star are basic symbols for the regexp engine, like letters in a human language. 'U' changes the meanings of those letters, using 'U' is basically like saying "in this text we're using A in place of Z and vice versa". You probzbly won't be hzppy rezding z text like this.
    Hehe, ok, but it's my regexp so if I do it like that it's no problem, right? I thought it would be something related to performance or so, by the tone of that post I read. Thanks for the clarification!

    Quote Originally Posted by stereofrog View Post
    Unless "avatar" divs can contain other divs, you can simply match everything from <div> to </div>
    Code:
    ~<div id="avatar\d">(.*?)</div>~si
    Ok, that was the last option I had in mind, since there is some 'garbage' inside that div that I would need to filter manually. I really thought there would be a way to have the regexp return something like an empty string, or just keep working normally, instead of doing the silly stuff I got. But if that isn't the case, then I'll just stick to this and parse it later somehow.

    Thank you all for your answers!
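
For completeness, a single expression can come close to "returning an empty string" when the element is missing, provided the optional part is prevented from crossing the closing `</div>`. The following is only a sketch, assuming avatar divs never contain nested divs and using a made-up input; the negative lookahead `(?!</div>)` is an addition not suggested in the thread.

```php
<?php
// Hypothetical input: the second avatar div has no icon list item.
$html = <<<'HTML'
<div id="avatar1"><p>x</p><ul><li class="icon">first</li></ul><p>y</p></div>
<div>some useless code</div>
<div id="avatar2"><p>no list here</p></div>
<div id="avatar3"><ul><li class="icon">third</li></ul></div>
HTML;

$pattern = '~<div id="avatar(\d+)">'       // start of one post
         . '(?:(?:(?!</div>).)*?'          // any text, but never across </div>
         . '<li class="icon">(.*?)</li>)?' // the optional element
         . '.*?</div>~si';                 // rest of the same div

preg_match_all($pattern, $html, $posts, PREG_SET_ORDER);

$icons = [];
foreach ($posts as $post) {
    // Group 2 is missing from $post when the optional part did not match.
    $icons[$post[1]] = $post[2] ?? '';
}

print_r($icons);
```

The optional group is tried first at each div; when no `<li>` can be reached before `</div>`, the group is skipped and the rest of the pattern still consumes that div, so every post produces exactly one match.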

  13. #13
    SitePoint Wizard TheRedDevil's Avatar
    Join Date
    Sep 2004
    Location
    Norway
    Posts
    1,188
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by stereofrog View Post
    Metachars like dot or star are basic symbols for the regexp engine, like letters in a human language. 'U' changes the meanings of those letters, using 'U' is basically like saying "in this text we're using A in place of Z and vice versa". You probzbly won't be hzppy rezding z text like this.
    Aren't you overcomplicating the issue?
    The U only makes the quantifiers ungreedy: the regex gathers as little as possible before returning the result, instead of being "greedy" and gathering as much as possible.
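
A small sketch of what the modifier does, on a made-up string: `U` inverts the default, so a plain `.*` becomes lazy and `.*?` becomes greedy again.

```php
<?php
$text = '<b>one</b> and <b>two</b>';

preg_match('~<b>.*</b>~',   $text, $greedy);   // default: .* is greedy
preg_match('~<b>.*</b>~U',  $text, $withU);    // U: .* becomes lazy
preg_match('~<b>.*?</b>~',  $text, $lazy);     // explicit lazy quantifier
preg_match('~<b>.*?</b>~U', $text, $flipped);  // U: .*? becomes greedy again

echo $greedy[0], "\n";   // <b>one</b> and <b>two</b>
echo $withU[0], "\n";    // <b>one</b>
echo $lazy[0], "\n";     // <b>one</b>
echo $flipped[0], "\n";  // <b>one</b> and <b>two</b>
```

The second and third calls behave the same, which is the readability argument made in this thread: writing `.*?` where laziness is wanted keeps the usual meaning of every quantifier intact.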

  14. #14
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thank you, I know what 'U' does. My point was that it reduces readability, without any real benefit.

