  1. #1
    SitePoint Wizard
    Join Date
    Jan 2005
    Location
    blahblahblah
    Posts
    1,447
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    utf-8 encoding problem

    Hi,

    I am trying to utf-8 encode a string containing the html that the client will display.

    I used this code found in the manual:

    PHP Code:
    protected function goUtf8($in_str)
    {
        $cur_encoding = mb_detect_encoding($in_str);
        if ($cur_encoding == "UTF-8" and mb_check_encoding($in_str, "UTF-8"))
            return $in_str;
        else
            return utf8_encode($in_str);
    }
    When I display my html page and I view the source, I have this part, which is needed I guess:

    HTML Code:
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    And yet... characters like "é", "è" etc. are displayed in a very weird, funky manner.

    Where could the problem come from?

    Regards,

    -jj.

  2. #2
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2008
    Posts
    5,757
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Don't use a meta tag; use a real HTTP header to tell the browser the content type. Browsers generally ignore a meta tag if a real header is also present, which is probably the case for you.

  3. #3
    SitePoint Wizard REMIYA's Avatar
    Join Date
    May 2005
    Posts
    1,351
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    It depends very much what exactly you are trying to do.

    Keep in mind that Unicode conversion with PHP doesn't always work as expected; it depends very much on the character set used for the input.

    If the input comes from a form, specify the needed encoding for the form as well. In your case, UTF-8.

    BTW, keep the meta tag in all cases.

  4. #4
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    The code you picked up only takes in data encoded in UTF-8 or ISO-8859-1.

    Where did you get the text from? Did you write it in your editor?

  5. #5
    SitePoint Wizard
    Join Date
    Jan 2005
    Location
    blahblahblah
    Posts
    1,447
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks for your replies.

    @crmalibu: how would I send headers?

    @REMIYA: wouldn't a submitted form be, by default, the same encoding as your webpage?

    @sk89q: I wrote it in my editor, which is to say, Notepad2. It has the nasty habit of ANSI encoding. Is there a way to solve it?

    Regards,

    -jj.

  6. #6
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by jjshell View Post
    @crmalibu: how would I send headers?
    PHP Code:
    <?php
    header('Content-type: text/html; charset=utf-8');
    ?>
    @AnthonySterling: I'm a PHP developer, a consultant for oopnorth.com and the organiser of @phpne, a PHP User Group covering the North-East of England.

  7. #7
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Does Notepad2 let you select a different encoding?

    And does the stuff you write in Notepad2 pass through goUtf8()?

  8. #8
    hi galen's Avatar
    Join Date
    Jan 2006
    Location
    New Haven, CT
    Posts
    1,228
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    If you are getting the text from a database, make sure you set the connection charset.

    Also make sure your editor is saving the files in UTF-8.
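
    A minimal sketch of what "setting the connection charset" can look like. The thread never says which database is in use, so MySQL accessed through PDO is an assumption here, and `buildUtf8Dsn()`, the host, the database name and the credentials are hypothetical placeholders.

    ```php
    <?php
    // Hypothetical helper: builds a PDO DSN for MySQL that asks for UTF-8
    // on the connection itself (assumes MySQL; not confirmed by the thread).
    function buildUtf8Dsn(string $host, string $dbName): string
    {
        // utf8mb4 is MySQL's name for full UTF-8 (up to 4 bytes per character).
        return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
    }

    $dsn = buildUtf8Dsn('localhost', 'example_db');

    // Placeholder credentials; uncomment against a real server:
    // $pdo = new PDO($dsn, 'user', 'password');
    // With mysqli, the equivalent call is $mysqli->set_charset('utf8mb4');
    ```

    The point is that the charset is declared once, when connecting, so that strings coming back from the database are already in the encoding the page announces.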

  9. #9
    SitePoint Wizard
    Join Date
    Jan 2005
    Location
    blahblahblah
    Posts
    1,447
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    if you are getting the text from a database make sure you set the connection charset
    How would I do that?

    also make sure your editor is saving the files in utf-8
    Notepad2 lets me select a different encoding, UTF-8 included. And I did save my files as UTF-8. At some point I must have missed a file, or cut and pasted an old file, or who knows... But there sure are ANSI-encoded files in a library that now counts close to a hundred files. It's such a hassle to go through all the files now and re-save them as UTF-8 that I wondered if there was another way to output a string as UTF-8.

    And does the stuff you write in Notepad2 pass through goUtf8()?
    The php code written in notepad2 creates an $html string which I then echo() to the client. This is what is passed to goUtf8(), this string. So my guess is: encode from ANSI to utf-8...

  10. #10
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Mentioned
    17 Post(s)
    Tagged
    1 Thread(s)
    These utf-8 tagged conversations might provide some leads for you to chase this problem down.

    http://www.sitepoint.com/forums/tags.php?tag=utf-8

    Off Topic:

    Why is it so difficult to navigate this site by tags? Surely there are enough intelligently tagged discussions to make it worthwhile having a menu item?

  11. #11
    SitePoint Wizard REMIYA's Avatar
    Join Date
    May 2005
    Posts
    1,351
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by jjshell View Post
    @REMIYA: wouldn't a submitted form be, by default, the same encoding as your webpage?
    Actually, the form uses your page's encoding by default, but the user input does not necessarily.

    I have had a similar problem with an online address book that had trouble with user input encodings, and specifying the encoding on the form solved it.

  12. #12
    SitePoint Wizard
    Join Date
    Jan 2005
    Location
    blahblahblah
    Posts
    1,447
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    @REMIYA: how would I do that? I have absolutely no clue where to start...

    As for the encoding problem, is there a php way to open a file and save it as UTF-8? I could do a bit of recursion and avoid the unpleasantness of doing all this manually...

    Regards, and thanks to you all.

  13. #13
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    The browser converts the user input to the page encoding.

    You can open a file normally, convert it to UTF-8 using the function that you already have, and then save it back.
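
    A rough sketch of that open, convert, save-back loop, run recursively over a directory. It swaps in `mb_convert_encoding()` with an explicit source encoding instead of the `goUtf8()` function from the first post. The big assumption is that every file that is not already valid UTF-8 is Windows-1252 (what Windows editors usually mean by "ANSI"); the function names and the `.php` filter are made up for illustration.

    ```php
    <?php
    // Re-encode one file to UTF-8 unless it already is valid UTF-8.
    // ASSUMPTION: anything that is not valid UTF-8 is Windows-1252 ("ANSI").
    function convertFileToUtf8(string $path, string $assumedSource = 'Windows-1252'): bool
    {
        $bytes = file_get_contents($path);
        if ($bytes === false || mb_check_encoding($bytes, 'UTF-8')) {
            return false; // unreadable, or already valid UTF-8 (pure ASCII included)
        }
        $converted = mb_convert_encoding($bytes, 'UTF-8', $assumedSource);
        return file_put_contents($path, $converted) !== false;
    }

    // Walk a directory tree and convert every .php file found in it.
    // Returns the number of files that were rewritten.
    function convertTreeToUtf8(string $root): int
    {
        $changed = 0;
        $files = new RecursiveIteratorIterator(
            new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
        );
        foreach ($files as $file) {
            if ($file->isFile()
                && strtolower($file->getExtension()) === 'php'
                && convertFileToUtf8($file->getPathname())
            ) {
                $changed++;
            }
        }
        return $changed;
    }

    // Usage (hypothetical path): echo convertTreeToUtf8('/path/to/library');
    ```

    Files that are already valid UTF-8 are left alone, so running it twice should not double-encode anything. It overwrites files in place, so trying it on a copy of the library first seems wise.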

  14. #14
    SitePoint Wizard REMIYA's Avatar
    Join Date
    May 2005
    Posts
    1,351
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by jjshell View Post
    @REMIYA: how would I do that? I have absolutely no clue where to start...
    See these links, they will give you a good start:
    http://www.w3.org/TR/html401/interact/forms.html#h-17.3

    http://www.intertwingly.net/blog/1761.html

    http://www.phpwact.org/php/i18n/charsets

    Especially pay attention to the following:
    http://stackoverflow.com/questions/1...ernet-explorer

    If it is a crucial application you may even consider using JavaScript for converting the form data prior to sending.

  15. #15
    SitePoint Wizard REMIYA's Avatar
    Join Date
    May 2005
    Posts
    1,351
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by sk89q View Post
    You can open a file normally, convert it to UTF-8 using the function that you already have, and then save it back.
    These functions are not trustworthy. I have tried all of them, and they had problems with files that used different encodings. Because of this, I had to rewrite a finished program, initially written in PHP (using the PHP-GTK project), in Java in order to read and work with the file data correctly.

    Hopefully PHP 6 will be fully Unicode compatible and all these problems will be solved.

  16. #16
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Uh, what? What functions are you talking about? mbstring converts encodings just fine, but you need to know how to use it. I hope you're not talking about utf8_encode(), because utf8_encode() is only for an individual case (ISO-8859-1 -> UTF-8). People also confuse CP1252 for ISO-8859-1 sometimes too, and use utf8_encode() when it is not appropriate.
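
    To illustrate the CP1252 / ISO-8859-1 confusion, here is a small sketch using one byte that differs between the two. It assumes the mbstring extension is loaded; the variable names are only for illustration.

    ```php
    <?php
    // Byte 0x80 is the euro sign in CP1252, but a C1 control character in ISO-8859-1.
    $byte = "\x80";

    $fromCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
    $fromLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

    echo bin2hex($fromCp1252), "\n"; // e282ac (U+20AC, the euro sign)
    echo bin2hex($fromLatin1), "\n"; // c280   (U+0080, an invisible control character)
    ```

    utf8_encode() treats its input as ISO-8859-1, so it gives the second result. That is why text saved as "ANSI" on Windows can lose its euro signs and curly quotes when passed through it.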

    The only thing that PHP can absolutely not do is access filenames with Unicode characters in them on Windows, because Windows requires using the Unicode-aware API (the Linux kernel just passes filenames as UTF-8 as-is).

  17. #17
    SitePoint Wizard REMIYA's Avatar
    Join Date
    May 2005
    Posts
    1,351
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by sk89q View Post
    mbstring converts encodings just fine, but you need to know how to use it. I hope you're not talking about utf8_encode(), because utf8_encode() is only for an individual case (ISO-8859-1 -> UTF-8).
    What I am talking about is that currently there are no trustworthy Unicode compatible functions. If ASCII only is used it is OK. Whenever Unicode is needed, there are multiple solutions for each single case, but not for all.

    If mbstring were the panacea, Unicode wouldn't be the next big thing everybody expects to come with PHP 6.

  18. #18
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    The mbstring functions are trustworthy. There are also pure-PHP implementations. It's just that mbstring is not at all portable and it's not a standard way to handle strings. Overriding the default string functions of PHP with mbstring causes more problems than it is worth.
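
    For anyone wondering what the practical difference is, a short sketch comparing the native byte-oriented functions with their mbstring counterparts on a UTF-8 string. The sample string and variable names are just for illustration, and mbstring is assumed to be installed.

    ```php
    <?php
    $word = "caf\xC3\xA9"; // "café" encoded as UTF-8: 4 characters, 5 bytes

    echo strlen($word), "\n";             // 5, strlen() counts bytes
    echo mb_strlen($word, 'UTF-8'), "\n"; // 4, mb_strlen() counts characters

    // substr() works on bytes, so it can cut a multibyte character in half.
    $brokenPrefix = substr($word, 0, 4);           // "caf" plus a lone 0xC3 byte
    // mb_substr() works on characters, so the last letter comes out whole.
    $lastChar = mb_substr($word, 3, 1, 'UTF-8');   // "é" (bytes C3 A9)

    var_dump(mb_check_encoding($brokenPrefix, 'UTF-8')); // bool(false)
    echo bin2hex($lastChar), "\n";                       // c3a9
    ```

    Calling the mb_* functions explicitly like this, with the encoding passed in, avoids having to override the native functions globally.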

  19. #19
    SitePoint Wizard
    Join Date
    Jan 2005
    Location
    blahblahblah
    Posts
    1,447
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I must admit that I am a bit lost...

    1. Should mbstring functions be used instead of native php string functions?
    2. What should I try to do? a) encode the $html string into utf-8 or b) save all the files as utf-8.

    thanks to you all for your help

  20. #20
    SitePoint Wizard
    Join Date
    Jan 2005
    Location
    blahblahblah
    Posts
    1,447
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Ok here comes a very strange thing.

    Just before echo() my $html string, I have done this:

    PHP Code:
    die(mb_detect_encoding($html)); 
    And guess what the output is.... UTF-8.

    Then why do I get these strange characters instead of the accented letters (é, è, à etc.)?

    I have also added this bit of code just before echo() the $html string:

    PHP Code:
    header ('Content-type: text/html; charset=utf-8'); 
    And if I do this instead:

    PHP Code:
    header ('Content-type: text/html; charset=iso-8859-1'); 
    Everything suddenly works fine... which gets me pretty lost.

    Can someone solve this mystery?


  21. #21
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by jjshell View Post
    I must admit that I am a bit lost...

    1. Should mbstring functions be used instead of native php string functions?
    2. What should I try to do? a) encode the $html string into utf-8 or b) save all the files as utf-8.

    thanks to you all for your help
    #1. If you need to do string manipulation and you don't want to mess up the string, then you should use mbstring, but be aware that you will have to convert all your code for PHP 6 later.
    #2. The latter option will be more efficient, obviously.

    PHP Code:
    mb_detect_encoding($html, "ascii, cp1252, iso-8859-1, utf-8");
    The default list of encodings to detect is "ASCII, JIS, UTF-8, EUC-JP, SJIS".

  22. #22
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by sk89q View Post
    #2. The latter option will be more efficient, obviously.
    Also, it will allow you to use the entire unicode range of characters.

  23. #23
    SitePoint Wizard
    Join Date
    Jan 2005
    Location
    blahblahblah
    Posts
    1,447
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Will do

    I'm still lost regarding this issue...
    PHP Code:
    die(mb_detect_encoding($html)); 
    --> "UTF-8."

    I have also added this bit of code just before echo() the $html string:

    PHP Code:
    header ('Content-type: text/html; charset=utf-8'); 
    which does not stop the strange chars from appearing.

    And if I do this instead:

    PHP Code:
    header ('Content-type: text/html; charset=iso-8859-1'); 
    Everything is displayed perfectly.

    Could someone explain to me what is going on?

  24. #24
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by jjshell View Post
    Could someone explain to me what is going on?
    mb_detect_encoding isn't accurate. It guesses the encoding, based on heuristics. You can call it with a list of candidate character sets, and it will return the first one that matches. If you don't pass a list, it will use the default list, which, as sk89q mentioned, is "ASCII, JIS, UTF-8, EUC-JP, SJIS". The best match in that list happens to be UTF-8, which is why you're getting that back. In general, you shouldn't rely on guesswork to figure out the charset; it's much better to actually know what encoding your strings are in. For example, it's very hard to guess the difference between CP1252 and ISO-8859-1.
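
    If you want a check that doesn't guess, one option is to validate the bytes against a single encoding with mb_check_encoding(). A minimal sketch, with made-up sample strings and assuming mbstring is available:

    ```php
    <?php
    $latin1Bytes = "\xE9";     // "é" as a single ISO-8859-1 / CP1252 byte
    $utf8Bytes   = "\xC3\xA9"; // "é" as UTF-8 (two bytes)

    // A lone 0xE9 byte is not a complete UTF-8 sequence, so validation fails.
    var_dump(mb_check_encoding($latin1Bytes, 'UTF-8')); // bool(false)
    var_dump(mb_check_encoding($utf8Bytes, 'UTF-8'));   // bool(true)

    // Dumping the raw bytes shows what is really in the string.
    echo bin2hex($latin1Bytes), "\n"; // e9
    echo bin2hex($utf8Bytes), "\n";   // c3a9
    ```

    A page that only looks right when served as iso-8859-1 would be consistent with the string holding single bytes like the first example, but dumping the bytes of the actual $html is the way to be sure.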

