  1. #1
    solidcodes

    What does this regular expression do?

    Below is the regular expression,

    "|<[^>]+>(.*)</[^>]+>|U"

    I got this regular expression from the PHP manual, and I'm trying to study it.

    Below is the text to be matched using the above regular expression.

    "<b>example: </b><div align=\"left\">this is a test</div>"

    Can someone please explain it to me in detail? I will compare your explanation
    with my own understanding and research.

    Thanks in advance.

  2. #2
    Dan Grossman
    ("|<[^>]+>(.*)</[^>]+>|U
    In pieces:

    Code:
    ("
    is something you accidentally copied; it's not part of the pattern.

    Code:
    |
    A PHP regular expression begins and ends with a delimiter, a character that only marks the boundaries of the pattern. It doesn't have to be one specific character; this example uses |. So the | is not part of the expression, it just marks the start and the end. Any characters after the final | are modifiers that change how the pattern behaves.

    Code:
    <[^>]+>
    The < and > are literal characters; they match those same characters in the input.

    [] surrounds a set of characters you want to match from. The ^ (caret) at the start of the set means "not", so this is the set of all characters that are not a right angle bracket >. The + after the set means "one or more instances of the preceding".

    So in total, this piece means: match "<", followed by one or more characters other than ">", followed by ">". This matches HTML tags nicely (<b> is a < followed by one or more characters that aren't >, namely "b", followed by >).

    Code:
    (.*)
    The dot "." is a wildcard that matches any character. The * means "zero or more instances of the preceding", so .* matches anything, including the empty string.

    The parentheses around this piece mean you want to capture whatever it matches and save it for later use. This piece matches everything between two HTML tags, including nothing in the case of <tag></tag>.

    Code:
    </[^>]+>
    Same as the first piece, with the addition of a / so that it matches closing tags.

    Code:
    |
    Marks the end of the pattern, because it is the same character that opened it.

    Code:
    U
    No idea
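
    Putting the pieces together, here is a minimal sketch of running the whole pattern against the text from the question. It assumes PHP's preg_match_all() with its default result ordering; the variable names are only illustrative.

```php
<?php
// Subject string from the original question (the inner quotes are part of the HTML).
$subject = '<b>example: </b><div align="left">this is a test</div>';

// | ... | are the delimiters; the trailing U is PCRE's "ungreedy" modifier.
$pattern = '|<[^>]+>(.*)</[^>]+>|U';

// Default ordering (PREG_PATTERN_ORDER): $matches[0] holds the full matches,
// $matches[1] holds whatever the parenthesised (.*) captured in each match.
$count = preg_match_all($pattern, $subject, $matches);

print_r($matches[0]); // each opening tag, its content, and its closing tag
print_r($matches[1]); // only the text between the tags
```

    On this input the pattern should match twice, once per tag pair, and the captured group holds the text between each pair.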

  3. #3
    solidcodes
    Thank you, Dan, very nice.

    I'll double-check those, especially the U.

    thanks again.

  4. #4
    Salathe
    The U modifier changes the pattern to what is called "ungreedy" (compared to the default which is "greedy"). It affects how the quantifiers (* and + and so on) act.

    By default, patterns are greedy and will consume as many characters as possible before moving along. Consider the pattern d.*d with the subject string dad is mad. When the pattern is greedy, the match would look for the letter d, followed by any number of other characters (including d!) and stop at another letter d. In other words, it would match dad is mad because being greedy means it will try to consume as much as possible.

    On the other hand, if the pattern is ungreedy, it will match as little as possible. This time the .* matches only the letter a in dad, because the shortest match for the whole pattern d.*d is dad.
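
    The dad is mad example can be checked directly. This is a small sketch using preg_match(); the / delimiters and the variable names are my own choice, not part of the example above.

```php
<?php
$subject = 'dad is mad';

// Greedy (the default): .* takes as much as it can while still letting the final d match.
preg_match('/d.*d/', $subject, $greedy);

// Ungreedy (U modifier): .* takes as little as it can.
preg_match('/d.*d/U', $subject, $ungreedy);

echo $greedy[0], "\n";   // the longest possible match
echo $ungreedy[0], "\n"; // the shortest possible match
```
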
    Salathe
    Software Developer and PHP Manual Author.

  5. #5
    nrg_alpha
    To expand on Dan's explanation:

    |<[^>]+>(.*)</[^>]+>|U

    The | characters are called delimiters. In PHP's PCRE (Perl Compatible Regular Expressions) functions, a delimiter can be any ASCII character that is not alphanumeric, not whitespace, and not a backslash. Typically the delimiter of choice is the slash (/), but you can use #...# or ~...~ or !...!, etc. A few things to note about delimiters:

    1) You can use bracket-style delimiters, so {...} or <...> is acceptable.
    2) Delimiters must have matching opening and closing characters. If you choose # as your opening delimiter, the closing delimiter has to be # as well. And as just mentioned, if you open with ( < [ or {, you need to close with the appropriate counterpart: ) > ] or }.
    3) In most cases (there are oddball circumstances), a delimiter character that also appears within the pattern itself must be escaped with a backslash. So if you use /.../, any / character within the pattern should be written as \/. For this reason alone, I prefer #...# as my delimiter of choice, as the / character is used in things like file paths.
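
    The three points above can be sketched like this. The path and the variable names are hypothetical, chosen only because the pattern contains slashes; preg_quote() is included as one way to escape a delimiter automatically.

```php
<?php
$path = '/usr/local/bin';

// Slash delimiters: every literal / inside the pattern needs a backslash.
preg_match('/^\/usr\/(\w+)/', $path, $slash);

// Hash delimiters: the slashes can stay unescaped.
preg_match('#^/usr/(\w+)#', $path, $hash);

// Bracket-style delimiters: open with { and close with }.
preg_match('{^/usr/(\w+)}', $path, $brace);

// preg_quote() escapes regex metacharacters, plus the delimiter given as its second argument.
$escaped = preg_quote('a/b', '/');

echo $slash[1], ' ', $hash[1], ' ', $brace[1], "\n"; // same capture from all three forms
echo $escaped, "\n";
```

    All three forms describe the same pattern; only the amount of escaping differs.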

    The [] characters form what is called a character class. What is important to understand is that a character class does not look for the listed characters as a sequence. Rather, at the position in the string the regex engine is currently checking, it tests whether that one character matches (or doesn't match) any of the characters listed within the class. So [abc] means: check whether the current character is an a, a b or a c. Likewise, [^abc] negates the class (meaning: check whether the current character is NOT an a, b or c). Many newcomers mistake the class [abc] for "look for the letter a followed by b, followed by c in the string".
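
    A short sketch of the difference between a class and a sequence. The sample strings are made up for illustration.

```php
<?php
// [abc] tests ONE character: is it an a, a b, or a c?
preg_match('/[abc]/', 'xcab', $class);      // x fails, so the first hit is the c

// The literal sequence abc is a different pattern; "cab" does not contain it.
$sequenceFound = preg_match('/abc/', 'cab');

// [^abc] also tests one character: anything that is NOT an a, b, or c.
preg_match('/[^abc]/', 'cabx', $negated);

echo $class[0], "\n";
echo $sequenceFound, "\n";
echo $negated[0], "\n";
```
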

    As Dan mentioned, the dot is a wildcard. To be pedantic, by default the dot matches any character except a newline (\n). To include newlines, you can add the s modifier (PCRE_DOTALL) after the closing delimiter. (I would provide a link to the modifiers page in the PHP manual, but it appears I need 10 posts or more to post links, which is understandable given spam these days. Just go to the PHP manual and enter PCRE in the search field at the top; there is plenty of information there.)
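
    A minimal sketch of the newline behaviour, assuming a two-character subject separated by a line feed (the strings and names are only illustrative):

```php
<?php
$subject = "a\nb"; // double quotes so that \n is a real newline

// Without the s modifier the dot will not match the newline.
$withoutS = preg_match('/a.b/', $subject);

// With the s modifier the dot matches the newline too.
$withS = preg_match('/a.b/s', $subject);

echo $withoutS, ' ', $withS, "\n";
```
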

    As Salathe pointed out, .* (or even .+) is greedy by default. In general, using such notation is frowned upon, mainly for speed and accuracy reasons. At the very least, I would make those quantifiers lazy by adding a ? after the quantifier (in this case, * or +), like this: (.*?) or (.+?). This does basically what the U modifier does, though I personally prefer this method over the modifier.
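
    As a rough comparison, here is the pattern from the question written with a lazy (.*?) and no U modifier, next to the original form. The last line shows a PCRE detail worth knowing: under U the meaning of ? is inverted, so .*? becomes greedy. Variable names are illustrative.

```php
<?php
$subject = '<b>example: </b><div align="left">this is a test</div>';

// Lazy quantifier inside the group, no modifier.
preg_match_all('|<[^>]+>(.*?)</[^>]+>|', $subject, $lazy);

// The original form with the U modifier, for comparison.
preg_match_all('|<[^>]+>(.*)</[^>]+>|U', $subject, $ungreedy);

// Under U the ? flips the quantifier back to greedy.
preg_match('/d.*?d/U', 'dad is mad', $flipped);

print_r($lazy[1]);
echo $flipped[0], "\n";
```

    On this input both forms should give the same matches, which is why the two are often treated as interchangeable.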

    I would post some tutorial links, but alas, I have not yet passed the 10 post limit. Just google regex tutorials and you'll find them.

  6. #6
    felgall
    Using .*? or .+? to find the shortest matching string is more common than using the U modifier. The U modifier is not recognised on all platforms that support regular expressions, whereas the trailing ? is recognised by just about everything that knows what a regular expression is.
    Stephen J Chapman
