-
Jul 25, 2007, 05:14 #1
- Join Date
- Jul 2007
- Posts
- 12
A VERY serious problem related to functions' compatibility for UTF-8 encoding
Hello,
I'm a new user in this forum.
First, I would like to apologize for my bad English.
Now for my problem: I couldn't find a solution anywhere, so you're kind of my last hope.
I'm writing a system in PHP in which everything is encoded as UTF-8.
To work with UTF-8 encoded strings, I need to use the special mbstring functions (mbstring stands for Multibyte String), which are designed for UTF-8 and other multi-byte encodings.
The problem is that there aren't enough mbstring functions to work comfortably with UTF-8 encoded strings: many important functions have no mbstring equivalent.
I wrote a list of regular functions, and I need to know whether they work correctly with UTF-8 encoded strings.
Here is the list (links to the functions are included):
mysql_real_escape_string() - http://il2.php.net/manual/en/functio...ape-string.php
stripslashes() - http://il2.php.net/manual/en/function.stripslashes.php
addslashes() - http://il2.php.net/manual/en/function.addslashes.php
strstr() - http://il2.php.net/manual/en/function.strstr.php
trim() - http://il2.php.net/manual/en/function.trim.php
wordwrap() - http://il2.php.net/manual/en/function.wordwrap.php
vsprintf() - http://il2.php.net/manual/en/function.vsprintf.php
nl2br() - http://il.php.net/manual/en/function.nl2br.php
The list above contains only some of the functions I need to check for use with UTF-8 encoded strings.
Does anyone know whether the functions above are compatible with UTF-8 encoded strings?
How can I tell which functions are suitable for UTF-8 encoded strings?
If the functions above aren't compatible with UTF-8 encoded strings, what should I use to replace them?
What is the solution?
THANK YOU VERY MUCH !!!,
neo444.
-
Jul 25, 2007, 06:15 #2
- Join Date
- Nov 2006
- Posts
- 50
If you haven't already read them, check the WACT Unicode notes. There are some good extra Unicode functions in DokuWiki, and I had a skim through this project when I was messing around with UTF-8.
With regard to the functions listed:
mysql_real_escape_string will work OK, as long as you have set the DB connection encoding to UTF-8.
trim is OK too, as long as you don't pass in Unicode characters to remove (i.e. it's OK with whitespace and newlines). You can write your own mb_*trim replacements (but these will be slower):
PHP Code:
/**
 * Unicode aware replacement for ltrim.
 *
 * Trimming can corrupt a Unicode string by removing single bytes from a
 * multi-byte sequence. Used in a default manner, ltrim is UTF-8 safe, but
 * with the optional charlist variable specified it can corrupt strings.
 *
 * @see ltrim
 * @param string $str string to trim
 * @param string $charlist list of characters to trim
 * @return string trimmed string
 */
function mb_ltrim($str, $charlist = '')
{
    if (strlen($charlist) == 0) {
        return ltrim($str);
    } else {
        // Escape the characters so they are safe inside a character class.
        $charlist = preg_quote($charlist, '#');
        return preg_replace('#^[' . $charlist . ']+#u', '', $str);
    }
}

/**
 * Unicode aware replacement for rtrim.
 *
 * @see rtrim
 * @param string $str string to trim
 * @param string $charlist list of characters to trim
 * @return string trimmed string
 */
function mb_rtrim($str, $charlist = '')
{
    if (strlen($charlist) == 0) {
        return rtrim($str);
    } else {
        // Escape the characters so they are safe inside a character class.
        $charlist = preg_quote($charlist, '#');
        return preg_replace('#[' . $charlist . ']+$#u', '', $str);
    }
}

/**
 * Unicode aware replacement for trim.
 *
 * @see trim
 * @param string $str string to trim
 * @param string $charlist list of characters to trim
 * @return string trimmed string
 */
function mb_trim($str, $charlist = '')
{
    if (strlen($charlist) == 0) {
        return trim($str);
    } else {
        return mb_ltrim(mb_rtrim($str, $charlist), $charlist);
    }
}
For strstr you can use the mbstring replacement, mb_strstr.
Last edited by robt; Jul 25, 2007 at 06:24. Reason: speeling
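To make the trim point concrete, here is a minimal sketch of how a multi-byte charlist can break a string. The helper is_valid_utf8() is a hypothetical name introduced only for this illustration, and the bytes are written as escapes so the example does not depend on the file's own encoding.

```php
<?php
// "è" (U+00E8) is the two bytes C3 A8 in UTF-8; "é" (U+00E9) is C3 A9.
$str      = "\xC3\xA8abc";   // "èabc"
$charlist = "\xC3\xA9";      // "é"

// Hypothetical helper: PCRE's u modifier rejects invalid UTF-8, so a
// successful match of the empty pattern means the string is well formed.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// trim() treats the charlist as separate bytes (C3 and A9), so it strips
// the lead byte C3 of "è" and leaves a stray continuation byte behind.
$byteTrimmed = trim($str, $charlist);

// A regex with the u modifier compares whole code points, so "è" survives.
$safeTrimmed = preg_replace('#^[' . preg_quote($charlist, '#') . ']+#u', '', $str);

echo bin2hex($byteTrimmed), "\n";                          // a8616263
echo is_valid_utf8($byteTrimmed) ? "valid" : "broken", "\n"; // broken
echo $safeTrimmed === $str ? "unchanged" : "changed", "\n";  // unchanged
```

With the default charlist (ASCII whitespace only) trim() never touches bytes of a multi-byte sequence, which is why the plain call stays safe.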
-
Jul 25, 2007, 06:16 #3
- Join Date
- Apr 2004
- Location
- germany
- Posts
- 4,324
Hi neo,
wake up... welcome to the forums.
This page should get you started with encoding issues
http://www.phpwact.org/php/i18n/charsets
-
Jul 25, 2007, 08:40 #4
- Join Date
- Jan 2005
- Location
- Barcelona
- Posts
- 16
Check out PHP UTF-8 if you are in a rush; otherwise read the post that stereofrog suggested.
-
Jul 25, 2007, 11:58 #5
- Join Date
- Jul 2007
- Posts
- 12
Thank you for the info, I will check it all!
This UTF-8 subject truly is complex...much to learn
-
Jul 26, 2007, 13:23 #6
- Join Date
- Jul 2007
- Posts
- 12
OK, I read the info you all sent me and learned about UTF-8.
But I still have some questions to ask.
(OMG, I had written so much, and when I had almost finished my post I accidentally pressed the "back" button in IE.)
1. When you wrote:
mysql_real_escape_string will work OK, as long as you have set the DB connection encoding to UTF-8.
did you mean using code like the following?
PHP Code:
mysql_set_charset(encoding, $this->currentLink);
(encoding is "UTF-8", of course, and $this->currentLink is the MySQL connection resource)
2. About the trim() function: I just need it to remove extra whitespace from both ends of the string, so I guess it will be OK according to what you wrote.
3. In the mb_ltrim() function you wrote, I noticed the following line:
PHP Code:
return preg_replace('#^['.$charlist.']+#u','',$str);
On one of the pages you all linked to, someone wrote that preg_replace() doesn't fully support UTF-8, even with the u flag.
4. As I understood from the pages you all linked to, addslashes() and stripslashes() are OK to use with
UTF-8 encoded strings, aren't they? They only deal with single-byte characters (code points under 128, i.e. ASCII).
Correct me if I'm wrong.
5. About the vsprintf() function: on the following page,
http://www.phpwact.org/php/i18n/utf-8
the writer hasn't checked the function yet.
If I use vsprintf() with UTF-8 encoded strings, will it be OK? I MUST know if this function is safe,
because I'm using it to prevent SQL injections. (I suppose you're using it too.)
6. Does anyone have more good, up-to-date info about UTF-8?
7. If I understand how UTF-8 "works" (code points etc.), can I be sure I'm right about whether a given function is compatible with UTF-8 encoded strings?
8. Just to be sure: if I write "\u0065" in a string, will it be the "e" character?
Or should I write "\U+0065"? How are the characters actually represented, and how should I treat them?
For example, when I'm looking for "e" in a string (or another character with a very high code point),
what am I supposed to write in the regular expression?
9. Sorry for nagging you so much...
10. THANK YOU ALL VERY MUCH FOR YOUR HELP!!! You can't believe how much you've helped me!
This UTF-8 story drives me crazy.
And again, sorry for my bad English; I'm not a native English speaker.
-
Jul 26, 2007, 13:42 #7
- Join Date
- Jan 2003
- Posts
- 5,748
> Did you mean that using the next code...
Yes
PHP Code:
// here is what I have,
// ...
public function __construct( $srce, $host, $user, $pass ) {
    if( $this -> connection_id = @mysql_connect( $host, $user, $pass ) ) {
        if( @mysql_select_db( $srce, $this -> connection_id ) ) {
            if( $this -> beginTransaction() ) {
                @mysql_query( "set names 'utf8'" );
                return true;
            }
        }
    }
    return false;
}
-
Jul 26, 2007, 13:57 #8
- Join Date
- Jul 2007
- Posts
- 12
-
Jul 30, 2007, 10:20 #9
- Join Date
- Jul 2007
- Posts
- 12
Please help, these are the last things I need to know about this complex subject!
THX!
-
Jul 30, 2007, 12:28 #10
- Join Date
- Jun 2004
- Location
- Copenhagen, Denmark
- Posts
- 6,157
mysql_set_charset does the same as Dr Livingston's code. It doesn't work with older versions of MySQL, though.
-
Jul 31, 2007, 08:01 #11
- Join Date
- Jan 2003
- Posts
- 5,748
Use the approach from the example I posted and you should be safe enough. Is there anything else you want to know?
Remember, the database call shown in the script above must be the first query you make, before any others; otherwise you may encounter side effects.
PHP Code:
// important!! before anything else,
// ...
@mysql_query( "set names 'utf8'" );
// ... etc ...
-
Jul 31, 2007, 10:25 #12
- Join Date
- Jul 2007
- Posts
- 12
Thanks for your answers!
Originally Posted by Dr Livingston
I asked 10 additional questions, and they are the last things I need to know!
Please answer my additional questions. Thank you all!
-
Jul 31, 2007, 10:50 #13
- Join Date
- Jan 2003
- Posts
- 5,748
> ...doesn't fully support the UTF-8 encoding.
Not sure where you heard that; I can neither confirm nor deny it. However, if you use \w in your regular expression together with the u modifier, it should accept the appropriate characters based on your locale, according to the manual (see the notes).
PHP Code:
... preg_match( "@[\w\ ]+$@uD", ... );
With the snippet above, the character you asked about passes for me without any problems.
From what I can tell, addslashes and stripslashes are UTF-8 safe, but don't quote me on that, as I don't know everything about PHP's Unicode support.
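A quick way to check the addslashes and stripslashes claim is a round trip. The sketch below rests on one property of UTF-8: every byte of a multi-byte sequence is 0x80 or higher, so none of them can be mistaken for an ASCII quote (0x27) or backslash (0x5C). That does not hold for some other multi-byte encodings (Shift_JIS, GBK and Big5 can use 0x5C as a trailing byte), so treat this as a UTF-8-only illustration.

```php
<?php
// "é" = C3 A9, "€" = E2 82 AC, U+1D11E (musical G clef) = F0 9D 84 9E.
// The string also contains two single quotes and one backslash.
$input = "caf\xC3\xA9 'n' \xE2\x82\xAC \\ \xF0\x9D\x84\x9E";

$escaped  = addslashes($input);     // adds a backslash before ' and \
$restored = stripslashes($escaped); // removes them again

// Exactly three bytes were added: one per quote and one for the backslash.
echo strlen($escaped) - strlen($input), "\n";                 // 3

// The escaped string is still well-formed UTF-8 (the u modifier checks it).
echo preg_match('//u', $escaped) === 1 ? "valid" : "broken", "\n"; // valid

// And the round trip returns the original bytes.
echo $restored === $input ? "same" : "different", "\n";       // same
```

This only shows that the two functions do not damage UTF-8 text; whether addslashes is adequate protection for SQL is a separate question.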
> I MUST know if this function is safe, because I'm using it to prevent SQL
> injections.
Use PDO instead, as it's just safer in any case... You get that extra reassurance with PDO.
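For reference, a minimal sketch of what the PDO suggestion could look like with a prepared statement. To stay self-contained it uses an in-memory SQLite database (this assumes the pdo_sqlite extension is available); the table name notes and its columns are made up for the example. With MySQL the DSN would be something along the lines of 'mysql:host=localhost;dbname=test;charset=utf8mb4', so that the connection encoding is set at connect time.

```php
<?php
// In-memory SQLite keeps the sketch runnable without a database server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical table used only for this illustration.
$pdo->exec('CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)');

// UTF-8 text with quotes, a backslash and an injection attempt,
// passed in without any manual escaping or vsprintf().
$body = "caf\xC3\xA9 'n' \xE2\x82\xAC \\ x'); DROP TABLE notes; --";

// The value travels separately from the SQL text, so it is never
// interpreted as part of the statement.
$insert = $pdo->prepare('INSERT INTO notes (body) VALUES (:body)');
$insert->execute([':body' => $body]);

$select = $pdo->prepare('SELECT body FROM notes WHERE body = :body');
$select->execute([':body' => $body]);
$stored = $select->fetchColumn();

echo $stored === $body ? "round trip ok" : "mismatch", "\n";
```

The point of the design is that no string function ever has to inspect the UTF-8 bytes, so the question of whether vsprintf() or addslashes() is multi-byte safe does not come up for the SQL part.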
-
Jul 31, 2007, 12:29 #14
- Join Date
- Feb 2005
- Location
- Belgium
- Posts
- 334
Very recently I've been working on a utf8 class that should make the Kohana framework fully support Unicode. The code is based on Harry Fuecks's phputf8 project. The biggest differences are that all functions are included in one file, as a single utf8 class with static functions, and that it adds autocleaning of $_GLOBALS. If you want to check it out, follow this link: http://kohanaphp.com/trac/browser/br.../core/utf8.php
-
Jul 31, 2007, 16:12 #15
- Join Date
- Apr 2004
- Location
- germany
- Posts
- 4,324
Quite the contrary. \w without 'u' will match in a locale-aware mode, thus possibly corrupting utf-8 sequences in the subject. \w with 'u' will match only latin letters (a-z), leaving utf8 sequences intact.
PHP Code:
$a = "a " . utf8_encode("\x80") . " b";
echo $a, "\n"; // a € b
echo preg_replace('~\w~', '*', $a), "\n"; // * *€ * - utf broken
echo preg_replace('~\w~u', '*', $a), "\n"; // * € * - utf ok
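Since what \w matches under the u modifier appears to depend on the PHP and PCRE version (as far as I know, newer builds also match non-Latin letters there), a sketch that avoids \w may be easier to reason about. It shows the two things the u modifier reliably changes, and how to name a specific code point or "any letter" explicitly in a pattern.

```php
<?php
$euro = "\xE2\x82\xAC";   // "€" (U+20AC), three bytes in UTF-8

// Without u the dot matches single bytes; with u it matches code points.
$bytes      = preg_match_all('/./',  $euro);   // 3
$codePoints = preg_match_all('/./u', $euro);   // 1

// With u, \x{...} names a code point by its hexadecimal number ...
$hasEuro = preg_match('/\x{20AC}/u', "price: 5 $euro");   // 1

// ... and \p{L} means "any Unicode letter", independent of locale.
$isLetters = preg_match('/^\p{L}+$/u', "caf\xC3\xA9");    // 1 ("café")

echo $bytes, " ", $codePoints, " ", $hasEuro, " ", $isLetters, "\n"; // 3 1 1 1
```

So to look for a character with a high code point in a regular expression, writing it as \x{...} together with the u modifier is one option; pasting the character itself into a UTF-8 encoded source file also works.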
-
Jul 31, 2007, 16:26 #16
- Join Date
- Apr 2004
- Location
- germany
- Posts
- 4,324
That's perhaps the question you should have asked first. I'd suggest you follow the link from that wact page and also read
http://www.phpwact.org/php/i18n/charsets
This should give you a basic idea of what 'Unicode', 'encodings', 'code points' and all that stuff are.
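As a rough illustration of the difference between a code point (a number such as U+20AC) and its encoding (the bytes actually stored in the string), here is a small sketch. codepoint_to_utf8() is a hypothetical helper written for this note; it does no validation of surrogates or values above U+10FFFF. The "\u{...}" string escape shown at the end exists in PHP 7.0 and later; "\u0065" and "\U+0065" are not PHP escapes.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// No range validation is done; this is only meant to show the bit layout.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0x80) {              // 1 byte:  0xxxxxxx (plain ASCII)
        return chr($cp);
    }
    if ($cp < 0x800) {             // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {           // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo codepoint_to_utf8(0x65), "\n";              // e
echo bin2hex(codepoint_to_utf8(0xE9)), "\n";     // c3a9     ("é")
echo bin2hex(codepoint_to_utf8(0x20AC)), "\n";   // e282ac   ("€")
echo bin2hex(codepoint_to_utf8(0x1D11E)), "\n";  // f09d849e (U+1D11E)

// PHP 7+ string escape for the same code point.
echo "\u{20AC}" === codepoint_to_utf8(0x20AC) ? "same" : "different", "\n"; // same
```

Code points below 0x80 come out as the same single byte as in ASCII, which is why functions that only look for ASCII characters tend to be safe on UTF-8 strings, while anything that counts, cuts or replaces by byte position is not.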