  1. #1
    SitePoint Enthusiast

    preg match syntax issue?

    PHP Code:
    $String = 'page=4&amp;searchId=2" title="Go to page 4">4</a>, <a href="Results.jsp?page=5&amp;searchId=2" title="Go to page 5">5</a>, <a href="Results.jsp?page=6&amp;searchId=2" title="Go to page 6">6</a>, <a href="Results.jsp?page=7&amp;searchId=2" title="Go to page 7">7</a>, <a href="Results.jsp?page=8&amp;searchId=2" title="Go to page 8">8</a> [<a href="Results.jsp?page=2&amp;searchId=2">Next</a><a href="Results.jsp?page=180&amp;searchId=2">Last</a>]</span>';
    preg_match_all('#page=180&amp;searchId=2">(.*?)</a>#', $String, $Values);
    print_r($Values);
    returns "Last" in the array, as expected.

    PHP Code:
    $String = 'page=4&amp;searchId=2" title="Go to page 4">4</a>, <a href="Results.jsp?page=5&amp;searchId=2" title="Go to page 5">5</a>, <a href="Results.jsp?page=6&amp;searchId=2" title="Go to page 6">6</a>, <a href="Results.jsp?page=7&amp;searchId=2" title="Go to page 7">7</a>, <a href="Results.jsp?page=8&amp;searchId=2" title="Go to page 8">8</a> [<a href="Results.jsp?page=2&amp;searchId=2">Next</a><a href="Results.jsp?page=180&amp;searchId=2">Last</a>]</span>';
    preg_match_all('#page=(.*?)&amp;searchId=2">Last</a>#', $String, $Values);
    print_r($Values);
    doesn't return 180. Anyone know why?

  2. #2
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Because (.*) matches everything you throw at it, it's very greedy. If you change your code to indicate you're only interested in numbers it works just fine:

    PHP Code:
    $String = 'page=4&amp;searchId=2" title="Go to page 4">4</a>, <a href="Results.jsp?page=5&amp;searchId=2" title="Go to page 5">5</a>, <a href="Results.jsp?page=6&amp;searchId=2" title="Go to page 6">6</a>, <a href="Results.jsp?page=7&amp;searchId=2" title="Go to page 7">7</a>, <a href="Results.jsp?page=8&amp;searchId=2" title="Go to page 8">8</a> [<a href="Results.jsp?page=2&amp;searchId=2">Next</a><a href="Results.jsp?page=180&amp;searchId=2">Last</a>]</span>';
    preg_match_all('#page=(\d+)&amp;searchId=2">Last</a>#', $String, $Values);
    print_r($Values);
    Code:
    Array
    (
        [0] => Array
            (
                [0] => page=180&amp;searchId=2">Last</a>
            )
    
        [1] => Array
            (
                [0] => 180
            )
    
    )
    Rémon - Hosting Advisor

    SitePoint forums will switch to Discourse soon! Make sure you're ready for it!

    Minimal Bookmarks Tree
    My Google Chrome extension: browsing bookmarks made easy

  3. #3
    SitePoint Enthusiast
    Hi Scallio,

    I used ([0-9]{0,4}) and it worked too. Yours is a bit nicer though. I still don't *understand* why, because I figured it would match any string between the "page=" and "&amp;searchId=2">Last</a>". Obviously, that's not so. I just need to learn syntax better :S lol

  4. #4
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Quote Originally Posted by biglittle View Post
    Hi Scallio,

    I used ([0-9]{0,4}) and it worked too. Yours is a bit nicer though. I still don't *understand* why, because I figured it would match any string between the "page=" and "&amp;searchId=2">Last</a>". Obviously, that's not so. I just need to learn syntax better :S lol
    \d is shorthand for [0-9], and + means '1 or more times' (the same as {1,}), whereas {0,4} means 'between 0 and 4 times'. So in effect they do the same thing here, yes, except that mine won't accept an empty number and will accept more than 4 digits, whereas yours won't.
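
    To make that difference concrete, here is a minimal sketch that runs both patterns over three made-up inputs (the short 'page=...&' strings are assumptions for illustration, not taken from the real page):

```php
<?php
// Hypothetical inputs: a normal number, an empty number, and a 5-digit number.
$inputs = ['page=180&', 'page=&', 'page=12345&'];

$results = [];
foreach ($inputs as $input) {
    $results[$input] = [
        // \d+ needs at least one digit and has no upper limit.
        'plus'  => preg_match('#page=(\d+)&#', $input),
        // {0,4} accepts zero digits but never more than four.
        'range' => preg_match('#page=([0-9]{0,4})&#', $input),
    ];
}

// preg_match() returns 1 on a match and 0 on no match.
print_r($results);
```

    Both patterns match 'page=180&'. Only the {0,4} version matches 'page=&', and only the \d+ version matches 'page=12345&'.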

    It does, just not the page= you were expecting:

    Code:
    page=4&amp;searchId=2" title="Go to page 4">4</a>, <a href="Results.jsp?page=5&amp;searchId=2" title="Go to page 5">5</a>, <a href="Results.jsp?page=6&amp;searchId=2" title="Go to page 6">6</a>, <a href="Results.jsp?page=7&amp;searchId=2" title="Go to page 7">7</a>, <a href="Results.jsp?page=8&amp;searchId=2" title="Go to page 8">8</a> [<a href="Results.jsp?page=2&amp;searchId=2">Next</a><a href="Results.jsp?page=180&amp;searchId=2">Last</a>]</span>
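    Your regex matches everything from the very first page= at the start of that string through Last</a> (the part marked in red in the original post), since (.*) will just grab everything and anything. Think about it.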

  5. #5
    Keeper of the SFL StarLion's Avatar
    (except his regex was (.*?), which makes it NON greedy...)
    Never grow up. The instant you do, you lose all ability to imagine great things, for fear of reality crashing in.

  6. #6
    Utopia, Inc. silver trophy
    ScallioXTX's Avatar
    Quote Originally Posted by StarLion View Post
    (except his regex was (.*?), which makes it NON greedy...)
    Yes, but that only works going forward. For example, if you have a string like /this/is/a/string and you match /this/(.*)/ against it, it will match /this/is/a/, i.e., everything up until the last / in the string.
    Whereas /this/(.*?)/, i.e., making it non-greedy, will match /this/is/, i.e., it stops directly after the first slash it finds and doesn't "eat" other slashes in between.
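
    The two cases above can be checked with a short script. This is only a sketch; the variable names are made up, and # is used as the delimiter so the slashes don't need escaping:

```php
<?php
$subject = '/this/is/a/string';

// Greedy: (.*) grabs as much as it can, then backs off to the last slash.
preg_match('#/this/(.*)/#', $subject, $greedy);

// Non-greedy: (.*?) grows one character at a time and stops at the first slash.
preg_match('#/this/(.*?)/#', $subject, $lazy);

echo $greedy[0], ' | ', $greedy[1], "\n"; // /this/is/a/ | is/a
echo $lazy[0], ' | ', $lazy[1], "\n";     // /this/is/ | is
```

    In both cases the match starts at the same place; the quantifier only changes where it ends.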

    With the problem of the OP however he wants to match as little as possible before the subject string, as far as I know there is nothing you can do to make that happen. Making the .* non-greedy in his case has no effect whatsoever.

  7. #7
    @php.net Salathe's Avatar
    Off Topic:

    > With the problem of the OP however he wants to match as little as possible before the subject string, as far as I know there is nothing you can do to make that happen.

    It could be done, assuming I'm understanding what you're looking for correctly. However, in this case, using \d is the right and proper thing to be doing.
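
    For the curious, here are two possible ways it might be done. This is a sketch only, not necessarily what was meant above, and the subject is a shortened, hypothetical version of the string from the thread:

```php
<?php
// Shortened, hypothetical version of the pagination HTML from the thread.
$subject = '<a href="Results.jsp?page=8&amp;searchId=2" title="Go to page 8">8</a> '
    . '[<a href="Results.jsp?page=2&amp;searchId=2">Next</a>'
    . '<a href="Results.jsp?page=180&amp;searchId=2">Last</a>]</span>';

// Option 1: a leading greedy .* pushes the "page=" as far right as it can go.
// Note that . does not match newlines unless the s modifier is added.
preg_match('#.*page=(.*?)&amp;searchId=2">Last</a>#', $subject, $viaPrefix);

// Option 2: a negated class cannot cross an "&", so the earlier page= links
// fail to match and the engine moves on to the next one.
preg_match('#page=([^&]*)&amp;searchId=2">Last</a>#', $subject, $viaClass);

echo $viaPrefix[1], "\n"; // 180
echo $viaClass[1], "\n";  // 180
```

    Both capture 180 here, but \d+ says more clearly what is actually expected in that position.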


    biglittle, it looks like your confusion arises from not quite understanding how PCRE (the regex library used for the preg_* functions) chooses what to return.

    Put simply, it returns the first valid match (of course, if there is one). The subject string is searched from left to right, character by character, when looking for a match.

    Given your regex, upon reaching the very first page= and matching it against the regex's page=, things are looking good. The next part is then executed, the (.*?), which eats up one character at a time, taking only as much as it must for the whole regex to still match. Since you only ask that what comes after the (.*?) be the literal &amp;searchId=2">Last</a>, it keeps eating until it reaches that point.

    As an aside, a greedy version like (.*) would continue looking through the whole subject string after noticing that &amp;searchId=2">Last</a> had been seen. It's greedy and wants to eat as much as possible. In your case, since &amp;searchId=2">Last</a> does not occur later in the string, both greedy and non-greedy would eat the same amount. The only difference is how much of the string is examined after finding that part of the string.

    So, after (.*?) noms everything that it can, the rest of the regex goes on to try and get matched. The &amp;searchId=2">Last</a> is there at this stage so the regex has found its first match. At this point, nothing else is done. The match is returned and processing of the subject string stops immediately. A different regex engine, POSIX, would continue on in the string to try and find any more matches and would return the longest (leftmost) match possible (POSIX doesn't have the concept of greedy/non-greedy): in your case, there isn't a longer match from the initial page= starting point. However, PCRE gives up at the very first match that it can find.

    Hopefully that hasn't confused you entirely. In short, PCRE finds the first matching part of the subject string possible.

    A final point: since you were using preg_match_all(), after finding the first match the subject string is examined again, starting at the ending point of the previous match (i.e., between > and ] near the end of the string). From this point, the rest of the string (only ]</span>) does not match, so only the one match is pushed into the array.
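
    The behaviour described in this post can be observed with PREG_OFFSET_CAPTURE, which reports where each match starts. A minimal sketch, using a shortened, hypothetical version of the subject string:

```php
<?php
// Shortened, hypothetical version of the string from the first post.
$subject = 'page=8&amp;searchId=2" title="Go to page 8">8</a> '
    . '[<a href="Results.jsp?page=2&amp;searchId=2">Next</a>'
    . '<a href="Results.jsp?page=180&amp;searchId=2">Last</a>]</span>';

$count = preg_match_all(
    '#page=(.*?)&amp;searchId=2">Last</a>#',
    $subject,
    $matches,
    PREG_OFFSET_CAPTURE
);

// With PREG_OFFSET_CAPTURE each entry is [matched text, byte offset].
$whole     = $matches[0][0][0];
$offset    = $matches[0][0][1];
$captured  = $matches[1][0][0];
$remainder = substr($subject, $offset + strlen($whole));

echo $count, "\n";     // 1
echo $offset, "\n";    // 0
echo $remainder, "\n"; // ]</span>
echo $captured, "\n";  // everything from "8&amp;..." up to "...page=180"
```

    The match starts at offset 0, i.e. at the very first page=, and the only text left to search afterwards is ]</span>, so there is exactly one match.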
    Salathe
    Software Developer and PHP Manual Author.

