  1. #26
    SitePoint Wizard stereofrog
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Actually, the PHP core and most functions are completely unaware of the encoding. For the PHP interpreter, a "string" is just a sequence of bytes; it doesn't make any assumptions about how the string is encoded. However, some functions do use encoding info. This is usually documented on the function's manual page, e.g. htmlentities.
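
    A minimal sketch of the "sequence of bytes" point, assuming PHP 7 or later. The variable names are only for illustration. It shows strlen() counting bytes rather than characters, and htmlentities() as one of the functions that is told the encoding explicitly.

```php
<?php
// Two byte sequences that both stand for "é", in different encodings.
$latin1 = "\xE9";      // ISO-8859-1: one byte
$utf8   = "\xC3\xA9";  // UTF-8: two bytes

// strlen() counts bytes, because PHP does not know how the string is encoded.
echo strlen($latin1), "\n"; // 1
echo strlen($utf8), "\n";   // 2

// htmlentities() takes the encoding as its third argument.
echo htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1'), "\n"; // &eacute;
echo htmlentities($utf8, ENT_QUOTES, 'UTF-8'), "\n";        // &eacute;
```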

    For more info on charsets, encodings and all that stuff, I'd recommend this excellent article:

    http://www.phpwact.org/php/i18n/charsets

  2. #27
    SitePoint Zealot
    Join Date
    Nov 2006
    Posts
    119
    You're right, I did find mentions of utf8 / iso8859 in the docs for functions like htmlentities or utf8_decode, which say you can convert from iso8859 to utf8 or other encodings. But I found nothing saying specifically that php sees strings as iso8859 by default.

    I was just wondering if I could use hex codes for e.g. accented letters that are on the standard iso 8859 list (http://www.ascii.cl/htmlcodes.htm) without having set the charset in the php script itself. I wouldn't want them to fail in someone's browser because, e.g., that browser was set to a different charset.

    I did notice that with a php script containing only an echo "é", my browser automatically switched to iso 8859 even when I had first set it to unicode.

    Perhaps, even though php doesn't see strings as iso8859-1 encoded by default (it just sees a sequence of bytes, not a particular encoding), it would be safer to set it in the html page with a charset=iso8859-1 line? Or otherwise with a charset header line in php?
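
    For reference, a sketch of what such a charset header line could look like in php. The helper name contentTypeHeader is made up for this example, and the sketch assumes the script itself writes iso 8859-1 bytes.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for a given charset.
function contentTypeHeader(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() has to be called before any output is sent.
header(contentTypeHeader('ISO-8859-1'));

// "\xE9" is the single ISO-8859-1 byte for é, matching the declared charset.
echo "caf\xE9\n";
```

    The html equivalent would be a meta tag in the page head carrying the same content type and charset value.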

  3. #28
    SitePoint Wizard stereofrog
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Yes, literal byte values like "\xc3\xbc" will be interpreted by the browser according to the encoding in use. For example, the bytes above will be displayed as "Ã¼" in ISO-8859-1, as "УМ" in Cyrillic (ISO-8859-5) and as "ü" in utf8. You can use html entities to force encoding-independent display, e.g. "&uuml;" will be rendered as "ü" no matter which encoding is used, but the more general and better approach is always to specify the intended charset with a content-type header, as you suggested. The page I linked above has an example of this.
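
    The same thing can be reproduced in a script. This is only a sketch: it assumes the mbstring extension is loaded and PHP 7 or later, and it reads "Cyrillic" as ISO-8859-5. Each interpretation is converted to UTF-8 so the three results can be printed side by side on a UTF-8 terminal.

```php
<?php
// The two bytes from the example above.
$bytes = "\xc3\xbc";

// Reinterpret the same bytes under three charsets. mb_convert_encoding()
// takes (string, to_encoding, from_encoding).
$asLatin1   = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
$asCyrillic = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-5');
$asUtf8     = $bytes; // the bytes already are valid UTF-8

echo $asLatin1, "\n";
echo $asCyrillic, "\n";
echo $asUtf8, "\n";

// An entity consists of plain ASCII characters, so it reads the same
// under all three charsets.
$entity = htmlentities($asUtf8, ENT_QUOTES, 'UTF-8');
echo $entity, "\n";
```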

  4. #29
    SitePoint Zealot
    Join Date
    Nov 2006
    Posts
    119
    So I have my content type header already set to iso 8859-1. I'm using this regular expression range to check for the special characters in that range (192-255), À-ÿ:
    \xC0-\xFF
    I tested this, and it behaves the same as checking for À-ÿ.
    Having set the iso 8859 content type header, I guess it would be safe to use either way of checking this range?
    It's probably not possible to use html entities in regular expressions, like I did with the hexadecimal values above?
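
    A sketch of how that range behaves in preg_match, assuming the input really is iso 8859-1. The function name isLatin1Word is invented for this example. Without the /u modifier the pattern is matched byte by byte, so utf-8 input fails the check. An entity is not understood by the regex engine itself, but it can be decoded to a byte first.

```php
<?php
// Hypothetical validator: accepts only ASCII letters plus bytes 0xC0-0xFF.
// Assumes $value is ISO-8859-1; no /u modifier, so matching is per byte.
function isLatin1Word(string $value): bool
{
    return preg_match('/^[a-zA-Z\xC0-\xFF]+$/', $value) === 1;
}

var_dump(isLatin1Word("caf\xE9"));      // ISO-8859-1 "café": bool(true)
var_dump(isLatin1Word("caf\xC3\xA9"));  // UTF-8 "café": bool(false), 0xA9 is out of range

// Entities have to be decoded to ISO-8859-1 bytes before matching.
$decoded = html_entity_decode('caf&eacute;', ENT_QUOTES, 'ISO-8859-1');
var_dump(isLatin1Word($decoded));       // bool(true)
```

    Note that 0xC0-0xFF also contains the multiplication sign (0xD7) and the division sign (0xF7), so the range is slightly wider than "accented letters".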

  5. #30
    SitePoint Wizard stereofrog
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    The "in" encoding is generally not the same as the "out" one. While for html pages they mostly match, flash and javascript (ajax) requests are always in utf-8. Unfortunately, there's no way to tell in which encoding a specific http request was sent (there's no "encoding" field for requests), so all you can do is hope that the client's encoding is the same as yours. If you have to support multiple charsets (e.g. iso for html and utf for flash) this can end up in a huge mess. That's why some people recommend using utf8 exclusively, even though php lacks proper support for it (this is going to change in php6, btw).

  6. #31
    SitePoint Zealot
    Join Date
    Nov 2006
    Posts
    119
    Yes, I've noticed that flash sends its post values as utf.
    So what I do is always set the content type header to iso 8859-1 and convert any values posted by flash to iso 8859-1 using utf8_decode.
    This basically makes everything iso 8859 for the php script I'm using.
    I'm right in doing it that way, aren't I?
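
    A sketch of that conversion step. utf8_decode is deprecated as of PHP 8.2, so this version uses mb_convert_encoding instead, which assumes the mbstring extension is available. The helper name toLatin1 is made up for this example.

```php
<?php
// Hypothetical helper: normalise an incoming value to ISO-8859-1.
// Requires the mbstring extension.
function toLatin1(string $value): string
{
    // Only convert input that is valid UTF-8; leave any other bytes untouched.
    if (!mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
}

echo bin2hex(toLatin1("caf\xC3\xA9")), "\n"; // UTF-8 in:  636166e9
echo bin2hex(toLatin1("caf\xE9")), "\n";     // Latin-1 in: 636166e9 (unchanged)
```

    The validity check is only a heuristic: some iso 8859-1 byte pairs also happen to be valid utf-8, so it is safer to convert only the values known to come from flash.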

  7. #32
    SitePoint Wizard stereofrog
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    digitalecartoons, it really depends on many factors. Sometimes it's better to convert, sometimes it's better to keep everything in utf. You should really read the page I linked. It explains the stuff pretty well.

  8. #33
    SitePoint Zealot
    Join Date
    Nov 2006
    Posts
    119
    It's a little too complicated for a novice like me, especially why you should write a page in utf-8 instead of a more default iso 8859-1 page. Isn't that more for when you write a more international webpage? I want to make a typical Dutch page with the normal characters in the 0-191 range and the accented ones that fall in the 192-255 range. Is iso 8859-1 good enough in my case?

  9. #34
    SitePoint Zealot
    Join Date
    Nov 2006
    Posts
    119
    I guess I'm a little confused about when to use utf-8. Programs like Dreamweaver start a default html page as iso 8859-1, and most pages I visit are iso 8859 too. So I thought iso 8859 was the usual default, and you would only switch to utf-8 when a site needs more characters. Like I said, I would use nothing more than a-z and some accented ones like é ë á ë ö ó.
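
    One way to check whether iso 8859-1 covers the characters a page needs is a round trip through it. This is a sketch, assuming the mbstring extension and PHP 7 or later. The function name survivesLatin1 is invented for the example. The accented letters listed above survive, but the euro sign and the Dutch ĳ ligature are not in iso 8859-1.

```php
<?php
// Hypothetical check: does a UTF-8 string survive a round trip through
// ISO-8859-1 unchanged? Requires the mbstring extension.
function survivesLatin1(string $utf8): bool
{
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $utf8;
}

var_dump(survivesLatin1("\u{00E9}\u{00EB}\u{00E1}\u{00F6}\u{00F3}")); // é ë á ö ó: bool(true)
var_dump(survivesLatin1("\u{20AC}")); // euro sign: bool(false)
var_dump(survivesLatin1("\u{0133}")); // ĳ ligature: bool(false)
```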

  10. #35
    SitePoint Wizard stereofrog
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    Let me just quote
    Quote Originally Posted by http://www.phpwact.org/php/i18n/charsets
    ...in PHP, if you simply “accept the defaults” you probably will end up with all kinds of strange characters and question marks the moment anyone outside the US or Western Europe submits some content to your site

    This page won’t rehash existing discussions suffice to say you should be thinking in terms of Unicode, the grand unified solution to all character issues and, in particular, UTF-8, a specific encoding of Unicode and the best solution for PHP applications.

