  1. #1
    SitePoint Guru

    PHP and encodings?

    I have an issue where files are edited by different people in different editors, so they end up stored in a mix of encodings:
    UTF-8
    UTF-8 + BOM
    ISO-8859-1
    TIS-620

    This causes encoding errors in certain browsers. Is there a tool or a way to get PHP to treat all files the same way, e.g. as UTF-8 without BOM?

    Or a tool to convert all the files?

  2. #2
    SitePoint Wizard
    If you want your output to be in UTF-8, you need to save the files in UTF-8 (and without BOM). Converting each file to UTF-8 every time a team member edits a file is more hassle than anything.

    PHP doesn't actually care about encodings at all. It passes what you give it, verbatim, to the browser.

  3. #3
    kyberfabrikken (SitePoint Wizard)
    If you know which encoding a file is in, you can convert it with iconv or a similar tool. But you would be much better off making sure everybody uses the same encoding.
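
    A minimal sketch of that conversion in PHP, assuming the iconv extension is available and that you really do know the source encoding of each file. The function name and the file path in the usage comment are made up for illustration:

```php
<?php
// Convert a string from a known source encoding to UTF-8 without a BOM.
// $from must be the encoding the file was actually saved in; iconv cannot guess it.
function to_utf8_without_bom(string $bytes, string $from): string
{
    $bom = "\xEF\xBB\xBF"; // UTF-8 byte order mark

    if (strcasecmp($from, 'UTF-8') === 0) {
        $utf8 = $bytes; // already UTF-8, only the BOM may need removing
    } else {
        $utf8 = iconv($from, 'UTF-8', $bytes);
        if ($utf8 === false) {
            throw new RuntimeException("Could not convert from $from to UTF-8");
        }
    }

    // Strip a leading BOM if present.
    if (strncmp($utf8, $bom, 3) === 0) {
        $utf8 = (string) substr($utf8, 3);
    }

    return $utf8;
}

// Hypothetical usage: rewrite one file in place.
// $path = 'templates/header.php';
// file_put_contents($path, to_utf8_without_bom(file_get_contents($path), 'ISO-8859-1'));
```

    Run it once per file with the right source encoding. Converting a file that is already UTF-8 as if it were ISO-8859-1 will double-encode it, so keep a backup.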

  4. #4
    Cups (SitePoint Wizard)
    hmmm ... ran slap bang into this issue over the weekend ... 30 mins before a demo.
    yuk.
    "PHP doesn't actually care about encodings at all. It passes what you give it, verbatim, to the browser."
    But Apache apparently does.

    I am thinking of making a really low-tech multi-language (Latin) CMS with some flat text files. Running iconv over the text as it is saved to the file seems a nice way round it for me.

    But starting from my IDE, I still cannot work out which setting is better for all western Latin-type languages (en, fr, de etc.): ISO-8859-1 or UTF-8?

    If I have understood what I have read and retained on this subject so far, I have to be explicit about the encoding at these levels:

    IDE settings
    database storage settings
    apache settings
    and finally browser encoding

    Or are there settings in PHP too?

  5. #5
    kyberfabrikken (SitePoint Wizard)
    Originally posted by Cups:
    "But Apache apparently does."
    Nope. But Apache sends HTTP headers, and the browser cares about the HTTP headers. PHP can tell Apache which headers to send, and it will, but if PHP doesn't tell it anything, it will pick a default. What that default is depends on the whim of the sysadmin on your server.
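
    As a rough sketch of PHP telling the server which charset to declare: the helper name below is invented for the example, and the header() call has to run before any output is sent. The server-side defaults mentioned above usually come from settings such as default_charset in php.ini or AddDefaultCharset in the Apache config.

```php
<?php
// Build the Content-Type header line that declares the charset to the browser.
function content_type_header(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Send it only while headers can still be changed, i.e. before any output.
if (!headers_sent()) {
    header(content_type_header('UTF-8'));
}
```

    The declared charset must match the bytes you actually send. The header does not convert anything.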

    Originally posted by Cups:
    "But starting from my IDE, I still cannot work out which setting is better for all western Latin-type languages (en, fr, de etc.): ISO-8859-1 or UTF-8?"
    If you have any choice, e.g. you're starting an application from scratch, use utf-8 for everything.

    Originally posted by Cups:
    "If I have understood what I have read and retained on this subject so far, I have to be explicit about the encoding at these levels:"
    You need to encode the source files too, if they contain non-ASCII characters. That would generally mean any HTML/template files. This list covers the most important points.

    Originally posted by Cups:
    "Or are there settings in PHP too?"
    PHP as such doesn't know anything about charsets/encodings. However, most of the built-in string functions assume that strings are byte streams. This affects functions such as strlen, which returns the size of the input in bytes. For a single-byte encoding, the byte size equals the number of characters. For utf-8, this is not true. Some functions also expect strings to be in a single-byte encoding such as iso-8859-1. For example, strtolower does. There's a quite complete list over here. These things are mostly edge cases, and you can get around them by using mbstring and iconv.
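
    To illustrate the byte-versus-character difference, here is a small sketch that assumes the mbstring extension is loaded. The sample strings are written as byte escapes so the result does not depend on how the file itself is saved:

```php
<?php
// "café" in UTF-8: the "é" takes two bytes (C3 A9).
$word = "caf\xC3\xA9";

$byteCount = strlen($word);             // counts bytes
$charCount = mb_strlen($word, 'UTF-8'); // counts characters

// "ÉCOLE" in UTF-8: lowercasing the "É" (C3 89) needs an encoding-aware function.
$lower = mb_strtolower("\xC3\x89COLE", 'UTF-8');

printf("%d bytes, %d characters, lowercased: %s\n", $byteCount, $charCount, $lower);
```

    The same pattern applies to the other per-character functions: swap in the mb_ variant and pass the encoding explicitly.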

  6. #6
    Cups (SitePoint Wizard)
    @kyberfabrikken.

    Thanks for identifying those two links, very helpful of you.

    So, yes, it looks like I take a big breath and configure the IDE, database and server to accept and output utf-8.

    I get the feeling that the side effects on code previously built around native pcre and string functions could be quite subtle though.

    As a matter of interest, can you tell me what you think will be different about moving (or developing) applications from latin to utf-8 if I faced the same situation using PHP6?

    Thanks again.

  7. #7
    kyberfabrikken (SitePoint Wizard)
    Originally posted by Cups:
    "I get the feeling that the side effects on code previously built around native pcre and string functions could be quite subtle though."
    One might expect that, but my experience is that it isn't really any harder to use all utf-8 than all latin1. The thing is that even if you try to use latin1 everywhere, you will run into charset issues anyway when dealing with other applications etc. Some of PHP's extensions use utf-8 (most notably the XML libraries), so there you actually have to do more work if you want to use latin1.

    The number of things that will break on utf-8 is fairly limited too. It's generally only functions that need to do something on a per-character basis, which is a small subset of operations. Most string functions are safe enough. For example, most regular expressions will work transparently (although there is a utf-8 switch, the u modifier, if you need the pattern to treat utf-8 sequences as characters).
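
    A small sketch of where the utf-8 switch matters and where it doesn't, using the u pattern modifier. The variable names are only for illustration:

```php
<?php
$e = "\xC3\xA9"; // "é" as UTF-8: one character, two bytes

// Without /u the pattern sees two separate bytes, so "exactly one character" fails.
$withoutU = preg_match('/^.$/', $e);

// With /u the two bytes are treated as a single character.
$withU = preg_match('/^.$/u', $e);

// A pattern that only looks for ASCII literals behaves the same either way.
$plain = preg_match('/caf/', "caf" . $e);

var_dump($withoutU, $withU, $plain);
```

    In other words, the modifier is only needed when the pattern counts or classifies characters (".", quantifiers, character classes with non-ASCII members), not when it merely passes them through.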

    Originally posted by Cups:
    "As a matter of interest, can you tell me what you think will be different about moving (or developing) applications from latin to utf-8 if I faced the same situation using PHP6?"
    You'll certainly have a situation then, but it'll be a different one. PHP 6 is going to distinguish between byte streams and unicode strings. The latter are used for in-memory strings, and all internal functions will use these. I would assume that utf-8 is going to be the default format for PHP source files. I'm not entirely sure of the implications of an upgrade, but I don't think it will be more or less work whether you use latin1 or utf-8 today. The complexity of charsets is mostly in configuration: if you have everything sorted out nicely, it's simple enough to change from one charset to another. It's when things get mixed up that problems occur.

  8. #8
    Cups (SitePoint Wizard)
    I find your comments reassuring, and thanks for the "heads-up" on these issues.

