  1. #1
    SitePoint Member
    Join Date
    Jul 2007
    Posts
    12

    A VERY serious problem related to functions' compatibility for UTF-8 encoding

    Hello,

    I'm a new user in this forum.
    First, I would like to apologize for my bad English.

    Now for my problem: I couldn't find a solution anywhere, so you're kind of my last hope.

    I'm writing a system in PHP that uses UTF-8 throughout. Everything is encoded as UTF-8.

    To work with UTF-8 encoded strings, I need to use special functions: the mbstring functions (mbstring stands for Multibyte String), which are designed to handle UTF-8 and other multi-byte encodings.

    The problem is that there aren't enough mbstring functions for me to work properly with UTF-8 encoded strings. Many important functions have no mbstring equivalent.

    I wrote a list of regular functions, and I need to know whether they work correctly with UTF-8 encoded strings.

    Here is the list (links to the functions are included):

    mysql_real_escape_string() - http://il2.php.net/manual/en/functio...ape-string.php
    stripslashes() - http://il2.php.net/manual/en/function.stripslashes.php
    addslashes() - http://il2.php.net/manual/en/function.addslashes.php
    strstr() - http://il2.php.net/manual/en/function.strstr.php
    trim() - http://il2.php.net/manual/en/function.trim.php
    wordwrap() - http://il2.php.net/manual/en/function.wordwrap.php
    vsprintf() - http://il2.php.net/manual/en/function.vsprintf.php
    nl2br() - http://il.php.net/manual/en/function.nl2br.php

    The list above contains only some of the functions that I need to check for use with UTF-8 encoded strings.

    Does anyone know whether the above functions are compatible with UTF-8 encoded strings?
    How can I tell which functions are suitable for UTF-8 encoded strings?
    If the above functions aren't compatible with UTF-8 encoded strings, what should I use to replace them?
    What is the solution?

    THANK YOU VERY MUCH !!!,
    neo444.

  2. #2
    SitePoint Enthusiast
    Join Date
    Nov 2006
    Posts
    50
    If you haven't already read it, check the WACT unicode notes. There are some good extra unicode functions in DokuWiki, and I had a skim through this project when I was messing around with UTF-8.

    With regard to the functions listed:

    mysql_real_escape_string will work OK, as long as you have set the DB connection encoding to UTF-8.

    trim is OK too, as long as you don't pass in unicode characters to remove (i.e. it is OK with whitespace and newlines). You can write your own mb_*trim replacements (but these will be slower):
    PHP Code:
    /**
     * Unicode aware replacement for ltrim.
     *
     * Trimming can corrupt a Unicode string by removing single bytes from a
     * multi-byte sequence. Used in a default manner, ltrim is UTF-8 safe, but
     * with the optional charlist variable specified it can corrupt strings.
     *
     * @see ltrim
     * @param string $str  string to trim
     * @param string $charlist  list of characters to trim
     * @return string  trimmed string
     */
    function mb_ltrim($str, $charlist = '')
    {
        if (strlen($charlist) == 0) {
            return ltrim($str);
        } else {
            // Escape regex metacharacters so the list is safe inside [...]
            $charlist = preg_quote($charlist, '#');
            return preg_replace('#^[' . $charlist . ']+#u', '', $str);
        }
    }

    /**
     * Unicode aware replacement for rtrim.
     *
     * @see rtrim
     * @param string $str  string to trim
     * @param string $charlist  list of characters to trim
     * @return string  trimmed string
     */
    function mb_rtrim($str, $charlist = '')
    {
        if (strlen($charlist) == 0) {
            return rtrim($str);
        } else {
            $charlist = preg_quote($charlist, '#');
            return preg_replace('#[' . $charlist . ']+$#u', '', $str);
        }
    }

    /**
     * Unicode aware replacement for trim.
     *
     * @see trim
     * @param string $str  string to trim
     * @param string $charlist  list of characters to trim
     * @return string  trimmed string
     */
    function mb_trim($str, $charlist = '')
    {
        if (strlen($charlist) == 0) {
            return trim($str);
        } else {
            return mb_ltrim(mb_rtrim($str, $charlist), $charlist);
        }
    }
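
    To make the charlist problem concrete, here is a minimal sketch. The helper name utf8_trim_chars is made up for this example (it is not a PHP or mbstring function), it assumes a non-empty charlist, and the "\u{...}" string escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper for this example only; assumes $charlist is not empty.
function utf8_trim_chars(string $str, string $charlist): string
{
    $class = preg_quote($charlist, '#');
    // One pattern covers both ends; the u modifier makes [...] work on whole characters.
    $result = preg_replace('#^[' . $class . ']+|[' . $class . ']+$#u', '', $str);
    return $result === null ? $str : $result; // null means the input was not valid UTF-8
}

$input    = "\u{00E0}bc";                 // "àbc", bytes C3 A0 62 63
$bytewise = trim($input, "\u{00E3}");     // charlist "ã" is bytes C3 A3
$safe     = utf8_trim_chars($input, "\u{00E3}");

var_dump(bin2hex($bytewise));  // string(6) "a06263": the C3 lead byte was stripped
var_dump($safe === $input);    // bool(true): "à" is not "ã", so nothing is removed
```

    The byte-wise trim sees C3 in its list and removes it, leaving a dangling continuation byte; the pattern with the u modifier compares whole characters instead.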

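    wordwrap and nl2br will be OK, I think, as the space and line-break bytes never occur inside a UTF-8 multi-byte sequence. Bear in mind, though, that wordwrap counts its width in bytes, so lines with multi-byte characters may wrap earlier than expected, and with the cut option enabled it can split a character.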
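
    A rough sketch of what that means for wordwrap. The two helpers (utf8_length, utf8_wordwrap_sketch) are hypothetical names for this example, and the wrapper is deliberately simple: it only breaks at spaces and never splits a word that is longer than the width.

```php
<?php
// Hypothetical helpers for illustration; the names are not part of PHP.
function utf8_length(string $s): int
{
    // With the u modifier "." matches one whole character, so this counts code points.
    return (int) preg_match_all('/./us', $s);
}

function utf8_wordwrap_sketch(string $text, int $width, string $break = "\n"): string
{
    $lines = [];
    $line  = '';
    foreach (explode(' ', $text) as $word) {
        $candidate = $line === '' ? $word : $line . ' ' . $word;
        if ($line !== '' && utf8_length($candidate) > $width) {
            $lines[] = $line;   // current line is full, start a new one
            $line    = $word;
        } else {
            $line = $candidate;
        }
    }
    $lines[] = $line;
    return implode($break, $lines);
}

// Byte-based cut: "ééé" is 6 bytes, so cutting at width 3 lands inside a character.
$cut = wordwrap("\u{00E9}\u{00E9}\u{00E9}", 3, "\n", true);
var_dump(preg_match('//u', $cut));   // bool(false): no longer valid UTF-8

echo utf8_wordwrap_sketch("\u{00E9}cole \u{00E9}t\u{00E9} ao\u{00FB}t", 9), "\n";
// école été
// août
```

    Without the cut option the native wordwrap only breaks at spaces, so it cannot corrupt the string; it just measures the lines in bytes rather than characters.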

    For strstr, you can use the mbstring replacement, mb_strstr.
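
    For what it's worth, plain strstr also finds a valid UTF-8 needle correctly, because UTF-8 sequences cannot match in the middle of another character. What differs is positions: the byte-based functions report byte offsets. A small sketch (variable names are just for the example, and it avoids mbstring so it runs anywhere):

```php
<?php
$haystack = "caf\u{00E9} au lait";

// A valid UTF-8 needle can only match on character boundaries.
$tail = strstr($haystack, "\u{00E9}");

// strpos reports a byte offset; "é" takes two bytes, so "au" starts at byte 6.
$byteOffset = strpos($haystack, 'au');

// Count the characters before that byte offset to get a character offset.
$charOffset = (int) preg_match_all('/./us', substr($haystack, 0, $byteOffset));

var_dump($tail === "\u{00E9} au lait"); // bool(true)
var_dump($byteOffset);                   // int(6)
var_dump($charOffset);                   // int(5)
```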
    Last edited by robt; Jul 25, 2007 at 06:24. Reason: speeling

  3. #3
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Hi neo, wake up welcome to the forums.

    This page should get you started with encoding issues
    http://www.phpwact.org/php/i18n/charsets

  4. #4
    SitePoint Member
    Join Date
    Jan 2005
    Location
    Barcelona
    Posts
    16
    Check out PHP UTF-8 if you are in a rush; otherwise read the page that stereofrog suggested.

  5. #5
    SitePoint Member
    Join Date
    Jul 2007
    Posts
    12
    Thank you for the info, I will check it all!

    This UTF-8 subject truly is complex...much to learn

  6. #6
    SitePoint Member
    Join Date
    Jul 2007
    Posts
    12

    OK I read the info you all sent me and learned about UTF-8

    But I still have some questions to ask.


    (OMGGGG, I wrote so much and, when I had almost finished writing my post, I accidentally pressed the "back" button in IE.)

    1. When you wrote:
    mysql_real_escape_string will work OK, as long as you have set the DB connection encoding to UTF-8.
    Did you mean that using the following code:
    PHP Code:
    mysql_set_charset(encoding, $this->currentLink);
    it would be OK?
    (encoding is "UTF-8" of course, and $this->currentLink is the MySQL connection resource)


    2. About the trim() function, I just need it to remove the extra whitespace from both ends of the string. So I guess it will be OK according to what you wrote.

    3. In the mb_ltrim() function you wrote, I noticed the following line:
    PHP Code:
      return preg_replace('#^['.$charlist.']+#u','',$str);
    Why not use the mb_ereg_replace() function instead of the preg_replace() function?
    On one of the pages you [all] linked to, someone wrote that the preg_replace() function (even with the u flag) doesn't fully support UTF-8.


    4. As I understood from the pages you all linked to, the addslashes() and stripslashes() functions are OK to use with
    UTF-8 encoded strings, aren't they? Because they deal only with unambiguous characters (code points under 128, i.e. ASCII).
    Correct me if I'm wrong.

    5. About the vsprintf() function, on the following page:
    http://www.phpwact.org/php/i18n/utf-8
    the writer hasn't checked the function yet.
    If I use vsprintf() with UTF-8 encoded strings, will it be OK? I MUST know whether this function is safe,
    because I'm using it to prevent SQL injection. (I suppose you're using it too.)

    6. Does anyone have more good, up-to-date info about UTF-8 issues?

    7. If I understand how UTF-8 "works" (code points etc.), can I be sure I'm right about a function's compatibility with UTF-8 encoded strings?

    8. Just to be sure: if I write "\u0065" in a string, will it be the "e" character?
    Or should I write "\U+0065"? How are the characters actually represented, and how can I handle them?
    For example, when I'm looking for "e" in a string (or another character with a very high code point),
    what am I supposed to write in the regular expression?

    9. Sorry that I'm nagging you so much...

    10. THANK YOU ALL VERY MUCH FOR YOUR HELP!!! You can't believe how much you've helped me!
    This UTF-8 story drives me crazy.
    and again, sorry for my bad English, I'm not an English speaker.

  7. #7
    Non-Member
    Join Date
    Jan 2003
    Posts
    5,748
    > Did you mean that using the next code...

    Yes

    PHP Code:
    // here is what I have,
    // ...
    public function __construct( $srce, $host, $user, $pass ) {
        if( $this -> connection_id = @mysql_connect( $host, $user, $pass ) ) {
            if( @mysql_select_db( $srce, $this -> connection_id ) ) {
                if( $this -> beginTransaction() ) {
                    @mysql_query( "set names 'utf8'" );
                    return true;
                }
            }
        }
        return false;
    }

  8. #8
    SitePoint Member
    Join Date
    Jul 2007
    Posts
    12
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Oh, so using mysql_set_charset() isn't enough? oO
    I've heard about the query you showed, but I didn't think about using it.
    THX

  9. #9
    SitePoint Member
    Join Date
    Jul 2007
    Posts
    12
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    please help, these are the last things I need to know about this complex subject!

    THX!

  10. #10
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    mysql_set_charset does the same as the query in Dr Livingston's function. It doesn't work on older versions of MySQL, though (it needs PHP 5.2.3 and MySQL 5.0.7 or later).
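
    One difference worth knowing: the set_charset call also tells the client library which encoding is in use, which the real_escape_string functions rely on, while a plain SET NAMES query only changes the server side. Below is a hedged sketch using the newer mysqli extension (the old mysql_* functions no longer exist in PHP 7 and later). The function name connect_utf8_sketch is made up, the host and credentials are placeholders supplied by the caller, and 'utf8mb4' assumes a reasonably recent MySQL server. Note that MySQL spells the charset utf8 or utf8mb4, not "UTF-8".

```php
<?php
// Hypothetical helper; all connection details come from the caller.
function connect_utf8_sketch(string $host, string $user, string $pass, string $db): mysqli
{
    // Depending on the PHP version, a failed connection either throws
    // or has to be checked through $link->connect_error.
    $link = new mysqli($host, $user, $pass, $db);

    // Switch the connection charset before running any other query,
    // so that real_escape_string() escapes with the right encoding in mind.
    if (!$link->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not switch the connection to utf8mb4');
    }
    return $link;
}
```

    It would be called once at startup, for example connect_utf8_sketch('localhost', 'user', 'secret', 'mydb'), with your own values in place of these placeholders.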

  11. #11
    Non-Member
    Join Date
    Jan 2003
    Posts
    5,748
    Use the approach from the example I posted and you should be safe enough with that. Is there anything else you want to know?

    Remember, the database call shown in the above script must be the first query you make on the connection, before any others; otherwise you may encounter side effects.

    PHP Code:
    // important!! before anything else,
    // ...
    @mysql_query( "set names 'utf8'" );
    // ... etc ...

  12. #12
    SitePoint Member
    Join Date
    Jul 2007
    Posts
    12

    Thanks for your answers!

    Quote Originally Posted by Dr Livingston
    Is there anything else you want to know?
    Yes

    I asked 10 additional questions, and they are the last things I need to know!
    Please answer my additional questions. Thank you all!

  13. #13
    Non-Member
    Join Date
    Jan 2003
    Posts
    5,748
    > ...doesn't fully support the UTF-8 encoding.

    Not sure where you heard that, but I can neither confirm nor deny it. However, if you use \w in your regular expression with the u modifier, then it should allow the appropriate characters based on your LOCALE, according to the manual (see notes).

    PHP Code:
    ... preg_match( "@[\w\ ]+$@uD", ... );
    > what am I supposed to write in the regular expression?

    The above snippet matches the character you asked about for me without any problems.

    From what I can tell, addslashes and stripslashes are UTF-8 safe, but don't quote me on that, as I don't know everything about PHP's unicode support.
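
    The reasoning can be checked directly: in UTF-8 every byte of a multi-byte character is 0x80 or higher, so the bytes addslashes cares about (quote, double quote, backslash, NUL) can never hide inside one. A small sketch, with a made-up helper name and sample string; this says nothing about other multi-byte encodings, where the same assumption does not always hold.

```php
<?php
// Hypothetical check: does any multi-byte character contain an ASCII byte?
function has_ascii_byte_inside_multibyte(string $utf8): bool
{
    // Split into whole characters, then inspect the bytes of each multi-byte one.
    foreach (preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (strlen($char) > 1 && preg_match('/[\x00-\x7F]/', $char)) {
            return true;
        }
    }
    return false;
}

$sample    = "d'\u{00E9}t\u{00E9} \"\u{65E5}\u{672C}\" \\ \u{20AC}";
$escaped   = addslashes($sample);
$roundTrip = stripslashes($escaped);

var_dump(has_ascii_byte_inside_multibyte($sample)); // bool(false)
var_dump(preg_match('//u', $escaped));              // int(1): still valid UTF-8
var_dump($roundTrip === $sample);                   // bool(true)
```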

    > I MUST know if this function safe, because I'm using this functions to prevent SQL
    > injections.

    Use PDO instead, as it's just safer in any case... You get that extra reassurance with PDO.
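
    A minimal sketch of what that looks like with prepared statements. To keep it runnable without a server it uses an in-memory SQLite database, which is an assumption of this example and not what the thread uses; for MySQL the DSN would take the form "mysql:host=...;dbname=...;charset=utf8mb4". The table, the sample names and both function names are made up.

```php
<?php
// Hypothetical demo database; requires the pdo_sqlite extension.
function make_demo_db(): PDO
{
    $pdo = new PDO('sqlite::memory:');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $pdo->exec('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)');
    $insert = $pdo->prepare('INSERT INTO users (name) VALUES (?)');
    foreach (["Zo\u{00EB}", "O'Brien", "\u{05E0}\u{05D0}\u{05D5}"] as $name) {
        $insert->execute([$name]);
    }
    return $pdo;
}

// The value is sent as a bound parameter, so it is never spliced into the SQL text.
function find_user_ids(PDO $pdo, string $name): array
{
    $stmt = $pdo->prepare('SELECT id FROM users WHERE name = :name');
    $stmt->execute([':name' => $name]);
    return array_map('intval', $stmt->fetchAll(PDO::FETCH_COLUMN));
}
```

    With this in place, find_user_ids($pdo, "O'Brien") finds the row with the apostrophe, and an input such as "x' OR '1'='1" is simply compared as a name and matches nothing.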

  14. #14
    SitePoint Addict GeertDD's Avatar
    Join Date
    Feb 2005
    Location
    Belgium
    Posts
    334
    Very recently I've been working on a utf8 class that should make the Kohana framework fully support unicode. The code is based on the phputf8 project by Harry Fuecks. The biggest differences are that all functions are included in one file, as a single utf8 class with static functions, and that $_GLOBALS is cleaned automatically. If you want to check it out, follow this link: http://kohanaphp.com/trac/browser/br.../core/utf8.php

  15. #15
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Quote Originally Posted by Dr Livingston View Post
    However, if you use \w in your regular expression with u then it should allow based on your LOCALE the appropriate characters, according to the manual (see notes).

    PHP Code:
    ... preg_match( "@[\w\ ]+$@uD", ... );
    Quite the contrary. \w without 'u' will match in a locale-aware mode, thus possibly corrupting UTF-8 sequences in the subject. \w with 'u' will match only latin letters (a-z), leaving UTF-8 sequences intact.

    PHP Code:
    $a = "a " . utf8_encode("\x80") . " b";

    echo $a, "\n"; // a € b
    echo preg_replace('~\w~',  '*', $a), "\n"; // * * * - utf broken
    echo preg_replace('~\w~u', '*', $a), "\n"; // * € * - utf ok
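
    If the goal is to match letters from any script, the Unicode property classes are the explicit way to say so. A short sketch, assuming PHP's PCRE was built with Unicode property support (the bundled library is, as far as I know); the sample text is made up. What \w matches together with u may differ between PHP versions, so \p{L} avoids depending on that.

```php
<?php
$text = "na\u{00EF}ve \u{65E5}\u{672C} 42";   // "naïve 日本 42"

// \p{L} matches a letter in any script; it needs the u modifier to see whole characters.
$masked = preg_replace('~\p{L}+~u', '*', $text);

var_dump($masked);   // string(6) "* * 42"
```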

  16. #16
    SitePoint Wizard stereofrog's Avatar
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Quote Originally Posted by neo444 View Post
    How are the characters actually represented, and how can I handle them?
    That's perhaps the question you should have asked first. I'd suggest you follow the link from that wact page and also read

    http://www.phpwact.org/php/i18n/charsets

    This should give you a basic idea of what 'unicode', 'encodings', 'code points' and all that stuff are.
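
    On the concrete part of the question (how to write a character by its code point), here is a small sketch. It assumes PHP 7.0 or later, where double-quoted strings accept the "\u{...}" escape; the variable names are just for the example. In the PHP versions current when this thread was written that escape did not exist, and you would write the UTF-8 bytes yourself, e.g. "\xE2\x82\xAC".

```php
<?php
$e       = "\u{0065}";   // code point escape (PHP 7.0+), the single byte 0x65
$euro    = "\u{20AC}";   // U+20AC, stored as the three bytes E2 82 AC
$literal = "\u0065";     // no braces, so not an escape: six plain characters

var_dump($e === 'e');               // bool(true)
var_dump(bin2hex($euro));           // string(6) "e282ac"
var_dump($literal === '\u0065');    // bool(true)

// In a pattern with the u modifier, \x{...} names a code point.
var_dump(preg_match('/\x{20AC}/u', "10 \u{20AC}"));   // int(1)
```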

