  1. #1
    SitePoint Zealot
    Join Date
    Dec 2008
    Posts
    125
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)

    preg_replace regex help needed

    I'm trying to write a function to encode a string as per RFC3987, similar to what rawurlencode does per RFC3986:
    PHP Code:
    function preg_iriencode($url){
        return preg_replace('/[\x0-\xc2\x9f]|[\xef\xbf\xb0-\xef\xbf\xbd]/eu', 'rawurlencode("$0")', $url);
    }
    $url = 'Exclamation!Question?NBSP*Newline
    Atsign@Tab    Hyphen-Plus+Tilde~好';
    echo 
    'iriencode='.number_format($iriencode/50030).' preg_iriencode='.number_format($preg_iriencode/50030);

    //Expected result: Exclamation%21Question%3FNBSP Newline%0AAtsign%40Tab%09Hyphen-Plus%2BTilde~好
    //Actual result: Exclamation%21Question%3FNBSP%C2%A0Newline%0AAtsign%40Tab%09Hyphen-Plus%2BTilde~好 
    The non-breaking space (NBSP) is \xC2\xA0 in UTF-8, and the first character class in the regex only goes up to \xc2\x9f, so I don't understand why it is being matched and therefore encoded.

    (VBulletin seems to convert the nbsp in the test string into a * but the actual code does have an nbsp in it)

  2. #2
    Keeper of the SFL StarLion's Avatar
    Join Date
    Feb 2006
    Location
    Atlanta, GA, USA
    Posts
    3,748
    Mentioned
    73 Post(s)
    Tagged
    0 Thread(s)
    Well, I think the problem you're facing (not 100% sure, but somewhere around 90%) is this:
    You write:
    [\x0-\xc2\x9f]
    and you think it says:
    "Anything from character code 0 to character code C29F".
    What it actually says is:
    "Anything from character code 0 to character code C2, or character code 9f".
    So it'll match the \xc2, but not the \xA0.

    Try
    \x{c29f} instead.
    Never grow up. The instant you do, you lose all ability to imagine great things, for fear of reality crashing in.

  3. #3
    SitePoint Zealot
    Join Date
    Dec 2008
    Posts
    125
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Thanks, I didn't realise that and haven't seen the \x{c29f} syntax before.
    Unfortunately \x{c29f} didn't work either.
    I changed the function as follows:
    PHP Code:
    function preg_iriencode($url){
        return preg_replace('/[\x0-\x{c29f}]/eu', 'rawurlencode("$0")', $url);
    }

    But it now encodes the entire string, including the Chinese character, which I expected to be way outside that range (\xe5\xa5\xbd).

  4. #4
    Keeper of the SFL StarLion's Avatar
    Join Date
    Feb 2006
    Location
    Atlanta, GA, USA
    Posts
    3,748
    Mentioned
    73 Post(s)
    Tagged
    0 Thread(s)
    e5, a5, and bd would all be between 00 and c29f.
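
    To make the over-matching concrete, here is a small sketch. It assumes PHP 7 or later (for the "\u{...}" string escape) and relies on the fact that, with the /u modifier, PCRE compares Unicode code points rather than UTF-8 bytes. The variable names $nbsp and $hao are just for illustration.

```php
<?php
// With /u, both the pattern and the subject are treated as code points.
$nbsp = "\u{00A0}"; // non-breaking space, UTF-8 bytes C2 A0
$hao  = "\u{597D}"; // 好, UTF-8 bytes E5 A5 BD

// Original class: the range U+0000-U+00C2 plus the single character U+009F.
// U+00A0 falls inside that range, so NBSP matches.
var_dump(preg_match('/[\x0-\xc2\x9f]/u', $nbsp)); // int(1)

// \x{c29f} is code point U+C29F, so this range also covers U+597D.
var_dump(preg_match('/[\x0-\x{c29f}]/u', $hao)); // int(1)

// A range written in code points stops at U+009F and matches neither.
var_dump(preg_match('/[\x{0000}-\x{009F}]/u', $nbsp)); // int(0)
var_dump(preg_match('/[\x{0000}-\x{009F}]/u', $hao));  // int(0)
```

    In other words, under /u the comparison is on code points (U+00A0 and U+597D), not on the individual bytes of the UTF-8 encoding.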

  5. #5
    SitePoint Zealot
    Join Date
    Dec 2008
    Posts
    125
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Cheers, I get you now. I was using UTF-8 byte values rather than Unicode code points, so the character class should be [\x{0000}-\x{009F}] to match what I wanted.
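
    For anyone landing here later, a sketch of how the whole function might look with code-point ranges. Some assumptions of mine, not from the posts above: the second byte range \xef\xbf\xb0-\xef\xbf\xbd is rewritten as U+FFF0-U+FFFD (its code-point equivalent), preg_replace_callback stands in for the /e modifier (which no longer exists in PHP 7+), and the input is returned unchanged if PCRE rejects it as invalid UTF-8.

```php
<?php
// Sketch only: percent-encode everything in the two code-point ranges,
// leaving other characters (NBSP, CJK, ...) untouched.
function preg_iriencode(string $url): string
{
    $encoded = preg_replace_callback(
        // Single-quoted so PCRE, not PHP, interprets the \x{...} escapes.
        '/[\x{0000}-\x{009F}\x{FFF0}-\x{FFFD}]/u',
        static function (array $m): string {
            // rawurlencode leaves A-Z a-z 0-9 - _ . ~ as they are.
            return rawurlencode($m[0]);
        },
        $url
    );

    // preg_replace_callback returns null on error (e.g. invalid UTF-8);
    // falling back to the input is an assumption made for this sketch.
    return $encoded ?? $url;
}

$url = "Exclamation!Question?NBSP\u{00A0}Newline\nAtsign@Tab\tHyphen-Plus+Tilde~好";
echo preg_iriencode($url), "\n";
```

    With this test string the NBSP and the Chinese character should come through unencoded, while !, ?, the newline, @, the tab and + are percent-encoded.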

