SitePoint Sponsor

User Tag List

Results 1 to 7 of 7
  1. #1
    SitePoint Member
    Join Date
    Jul 2011
    Posts
    8
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    php recursive regex problem

    I want to parse SQL statements with extra notations. I am trying to use the recursive
    notation to match nested parentheses. The following example reduces the problem to
    the simplest form:
    Code:
    <?php
    header("Content-type: text/plain");
    
    function doMatch($re, $string)
    {            
       printf("\nusing \"%s\":\n", $re);
       if (preg_match($re, $string, $matches))
       {
          printf("%d Matches\n",count($matches));
          for ($i=0; $i<count($matches); $i++)
             printf("%d: %s\n", $i, $matches[$i]);
       }         
       else          
          printf("No match (boo hoo)\n");
    }            
                                                                                                                                                                                       
    $string = "<click:blah,blah> (SELECT GROUP_CONCAT(s.fname) FROM Scouts) AS bogus";
    $reThatWorks = "/\((?:([^()]+)|(?R))*\)/";
    $reThatFails = "/ \((?:([^()]+)|(?R))*\)/";
    
    doMatch($reThatWorks,$string);
    doMatch($reThatFails,$string);
    ?>
    This is the output:

    Code:
    using "/\((?:([^()]+)|(?R))*\)/":
    2 Matches
    0: (SELECT GROUP_CONCAT(s.fname) FROM Scouts)
    1:  FROM Scouts
    
    using "/ \((?:([^()]+)|(?R))*\)/":
    No match (boo hoo)
    The only difference between the regexes is the failing regex tries to match a space in
    front of the nested parentheses part. If fails with a space after as well.

    Ultimately, I'd like to match the contents of the <...>, content up to optional "AS alias",
    and the alias if used. For now, I'd be happy if I could understand why the space causes
    failure, and a workaround, if possible.

    Thanks in advance (if anyone takes this on)
    Chuck Jungmann

    PS. I know I could do this by iterating through each character, but using regexes would
    be more flexible.
    or after the regex for the nested parentheses prevents a match.

  2. #2
    Avid Logophile silver trophy
    ParkinT's Avatar
    Join Date
    May 2006
    Location
    Central Florida
    Posts
    2,332
    Mentioned
    192 Post(s)
    Tagged
    4 Thread(s)
    Can you include the desired space in square brackets, as a literal?
    Code:
    "/[ ]\((?:([^()]+)|(?R))*\)[ ]/":
    Don't be yourself. Be someone a little nicer. -Mignon McLaughlin, journalist and author (1913-1983)


    Git is for EVERYONE
    Literally, the best app for readers.
    Make Your P@ssw0rd Secure
    Leveraging SubDomains

  3. #3
    SitePoint Member
    Join Date
    Jul 2011
    Posts
    8
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thank you for your reply. I did try the [ ], but it still fails to match.

    I have played with it a bit and found that one or more . (wildcard) in place of
    the space character will match, but any specific character causes it to fail.

    Another thing was removing the left-paren "(" from the character class
    results in a match, but it's not the right one, as it doesn't handle the recursion,
    and stops the match at the first right-paren ")".

    Although I'm still interested in why it fails and how to fix it, I decided to put
    in the non-recursive match for up to three nested levels:

    \w*\s*\(([^()]+|\(([^()]+|\(([^()]+)*\))*\))*\)

    to match a parenthetical expression:
    (SELECT GROUP_CONCAT(name) FROM Scouts)

    or a function:
    CONCAT(fname,' ', lname)

    Since I posted, I've read some things that suggest that recursive matching
    steps outside of what one should expect from regular expressions. It could
    be that the exceptional nature of recursion prevents it from playing nice
    with the rest of the statement.

  4. #4
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I think the recursion is acting greedily (I don't quite understand it, but that's the best way I can put it) and trying to match everything in parentheses - so in the part "GROUP_CONCAT(name)" there is no space in front of "(name)" which causes it to fail.

    try:

    PHP Code:
    $re "#\s*\((?:([^\(\)]+)|(?R))*\)#"
    aaron-fisher.com - PHP articles and more

  5. #5
    SitePoint Member
    Join Date
    Jul 2011
    Posts
    8
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    aamonkey,

    First, thanks for your reply. It helped me learn something new about regular expressions.

    At first, I thought that your example was identical to mine, except for the escaping the parentheses in the character class. As I continued to look for differences, I noticed the critical one: you used \s* where I had used ' ' or \s+, knowing that my string included a space. Further experimentation showed that completely optional (either * or ?) preceding text would still allow a match, but anything specific fails.

    Here's my mostly working regex that matches the line in the example:
    $re = "/(<([^>]+)>\s*)?((?:\((?:([^()]+)|(?R))*\))(\s+as\s+(\w+))?)/i";

    What I want from the match is to save the contents of <...>, and the alias that follows the "as" and return what follows the <...> ($matches[3] for this regex). I don't care about the parentheses except that I need to know when they end to match the proper "as." The problem is that the regex only matches when the <blah,blah> is optional, and I'm only interested in the match if the <blah,blah> part is there. I know this is a small difference, and that I can work around it by checking for $matches[2], but it seems like there should be a way to positively match the <...> and still recursively match the parentheses.

    You speculated that perhaps the lack of a space before the parenthesis after GROUP_CONCAT might be the problem, and I worked with that assumption for a while, and I don't think it matters. I tried putting an optional space at several locations within the recursive part, but nothing seems to make it possible to positively match <...>.

  6. #6
    Keeper of the SFL StarLion's Avatar
    Join Date
    Feb 2006
    Location
    Atlanta, GA, USA
    Posts
    3,748
    Mentioned
    71 Post(s)
    Tagged
    0 Thread(s)
    I dont understand what you're trying to get out of your parsing.

    More specifically define your input, and your intended output.

    Will there only be one SQL statement in the string? What parts do you actually want to capture? Will the <> bit always be first?
    Never grow up. The instant you do, you lose all ability to imagine great things, for fear of reality crashing in.

  7. #7
    SitePoint Member
    Join Date
    Jul 2011
    Posts
    8
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I am parsing complete SQL statements. In defining the statements, I use the angle-brackets to pass extra information to the program that uses the SQL output. I am using preg_replace_callback() with the callback in a closure to extract the text within the angle-brackets, putting them into an associative array where the key is the either the field name or the alias. For matches, I return the text that follows the angle-brackets. So...I need to capture the contents of <> and the fieldname or alias. In the example, I obviously need to capture the alias because there will be no field name for the parenthetical phrase.

    The <> is always first, and it defines an action or attribute of the field in the output. The <> can also precede the SELECT to define an action or attribute of the row. I only parse a single SQL statement at a time.

    This fragment I'm working on is a small part of a larger regular expression, but it stands alone for the purposes of figuring out how, or why it doesn't, work.


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •