  1. #1
    SitePoint Addict
    Join Date
    Feb 2009
    Location
    Austin Texas
    Posts
    289

    Charset for Spanish language site

    I have been assigned to do a site in Spanish only. We want to do away with character entities like &iacute; and just use í, for example. I've read up on charsets and I don't fully understand the difference between iso-8859-1 and UTF-8. Which should I use to get this result?
    Is it possible not to use &iacute;?

    I'm not sure of the origin of the text, but it'll likely be coming from Microsoft Word and be pasted into my text editor (since I don't speak Spanish and won't be writing the copy anyway).

    Up until this point, my text editor (Notepad++) was saving in ANSI, and I just switched it to UTF-8. Will it still work properly with pages encoded in ISO 8859-1?

    This is what is on es.yahoo.com
    Code:
    <html lang="es">
    <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
    Is this the correct way of doing things?

  2. #2
    SitePoint Author

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Quote Originally Posted by vanishdesign View Post
    I've read up on charsets and I don't fully understand the difference between iso-8859-1 and UTF-8.
    First of all, you shouldn't use the term 'charset' at all, because it's ambiguous.

    There are two concepts of importance here: the character repertoire and the encoding. The term 'charset' has been used for both of these, which can lead to some confusion.

    The character repertoire is the total set of available characters. For HTML this is defined to be ISO/IEC 10646, which to all intents and purposes is equivalent to Unicode. You cannot change this; it's built into HTML.

    So what you can vary is the encoding, which is how the characters in a given repertoire are represented numerically, i.e., in a form that computers can understand.

    ISO 8859-1 is actually both a repertoire and an encoding. As a repertoire it's a small subset of ISO/IEC 10646, so we can regard it as an encoding capable of representing a small part of Unicode. UTF-8 can represent any Unicode character.

    In ISO 8859-1, each character is encoded using a single octet (an octet is an 8-bit number, i.e., an integer between 0 and 255, inclusive). In UTF-8, characters are encoded using a variable number of octets. The first 128 positions, equivalent to US-ASCII, are encoded as a single octet. Most additional characters used in European languages, as well as Hebrew, Arabic and others, use two octets. Most characters in East Asian writing systems like Japanese and Chinese require three octets, and characters outside the Basic Multilingual Plane require four.
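    To make the octet counts concrete, here is a minimal Python sketch. The helper name `octets` and the sample characters are only illustrative choices, not anything from the posts above.

```python
def octets(ch, encoding):
    """Return the encoded octets as space-separated hex, or None if the
    character cannot be represented in the given encoding."""
    try:
        return " ".join(f"{b:02x}" for b in ch.encode(encoding))
    except UnicodeEncodeError:
        return None

# US-ASCII letter, two Spanish characters, and one CJK character.
for ch in ["a", "í", "¿", "日"]:
    print(ch, "| ISO 8859-1:", octets(ch, "iso-8859-1"),
          "| UTF-8:", octets(ch, "utf-8"))
```

    The Spanish characters take one octet in ISO 8859-1 and two in UTF-8, while the CJK character takes three octets in UTF-8 and cannot be encoded in ISO 8859-1 at all.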

    The important thing to understand is that the encoding you declare for your web page must match the encoding under which you saved your files! Browsers don't automatically convert anything; they trust what you tell them.

    Quote Originally Posted by vanishdesign View Post
    Which should I use to get this result?
    Is it possible not to use &iacute;?
    In the case of Spanish you can choose either ISO 8859-1 or UTF-8. Both will let you use a literal 'í' (and the other letters with diacritical marks used in Spanish, plus the '¿' and '¡' punctuation characters).

    If you choose UTF-8, make sure you save the files without a BOM (byte order mark). A BOM is completely unnecessary in UTF-8, and will cause problems with some browsers.
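    If you are unsure whether an editor has written a BOM, it can be checked for and removed. This is only a sketch: the function name `strip_utf8_bom` is hypothetical, and it works on bytes already read from a file.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Simulate a file saved as "UTF-8 with BOM".
with_bom = codecs.BOM_UTF8 + "¿Cómo?".encode("utf-8")
clean = strip_utf8_bom(with_bom)

# What the three BOM octets look like when misread as ISO 8859-1.
bom_as_latin1 = codecs.BOM_UTF8.decode("iso-8859-1")
print(clean.decode("utf-8"), repr(bom_as_latin1))
```

    Python's built-in `utf-8-sig` codec does the same job when decoding: it skips a BOM if one is present.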

    Quote Originally Posted by vanishdesign View Post
    I'm not sure of the origin of the text, but it'll likely be coming from Microsoft Word and be pasted into my text editor (since I don't speak Spanish and won't be writing the copy anyway).
    Then be careful, because the encoding used for the original text may then come into play as well. Your editor may convert the pasted text automatically, or it may not.

    Quote Originally Posted by vanishdesign View Post
    Up until this point, my text editor (Notepad++) was saving in ANSI, and I just switched it to UTF-8. Will it still work properly with pages encoded in ISO 8859-1?
    As I said, the declared encoding must match the encoding used in the file. If you save as UTF-8 and declare as ISO 8859-1 – or vice versa – you'll run into problems with all characters outside the US-ASCII range.
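    A small Python sketch of both kinds of mismatch, using "España" as an arbitrary sample string. Decoding with the wrong codec stands in for what a browser does when the declaration is wrong.

```python
text = "España"

utf8_bytes = text.encode("utf-8")         # file saved as UTF-8
latin1_bytes = text.encode("iso-8859-1")  # file saved as ISO 8859-1

# Saved as UTF-8 but declared as ISO 8859-1: each octet of the
# two-octet sequence for 'ñ' is shown as a separate character.
misread_as_latin1 = utf8_bytes.decode("iso-8859-1")

# Saved as ISO 8859-1 but declared as UTF-8: the single octet for 'ñ'
# is not valid UTF-8, so a replacement character (U+FFFD) appears.
misread_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(misread_as_latin1)
print(misread_as_utf8)
```

    The US-ASCII letters survive in both cases; only the 'ñ' is damaged.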

    Quote Originally Posted by vanishdesign View Post
    Is this the correct way of doing things?
    Yes and no.
    The <meta> element is good to have there, but it will be ignored if your web server sends encoding information in the real Content-Type HTTP header. (Many web servers do, by default.)

    If you cannot affect the server-side setting, then you have to choose the encoding declared by your server, unless you use a server-side scripting language like PHP, which lets you override the headers.
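    The post mentions PHP. The same idea is sketched here in Python as a minimal WSGI application instead, so the names (`app`, the sample markup) are assumptions for illustration only. The point is that the script sets the real Content-Type header itself and encodes the body to match.

```python
def app(environ, start_response):
    """Minimal WSGI app that declares its encoding in the HTTP header."""
    # Encode the body with the same encoding the header declares.
    body = "<p>¿Cómo es posible?</p>".encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

    For a quick local try-out, an app like this can be served with `wsgiref.simple_server.make_server` from the standard library.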
    Birnam wood is come to Dunsinane

  3. #3
    Mittineague
    Programming Team
    Join Date
    Jul 2005
    Location
    West Springfield, Massachusetts
    Posts
    17,154
    I like UTF-8 because it's more portable, e.g., content -> feed, compared to using something like Windows-1252.

    But if your content will only ever appear in web pages, you could use whatever, I suppose (as long as it's a commonly supported charset). Is your example of ES Yahoo what you want?
    HTML Code:
    <title>Yahoo! Espa&ntilde;a</title>
    ...
    ... Im&aacute;genes</a>
    ... V&iacute;deos</a>
    <label for="v11">en espa&ntilde;ol</label>
    Una mujer da a luz a dos beb&eacute;s y resulta que son hijos de padres distintos
    ... &#187; &iquest;C&oacute;mo es posible?</a>

  4. #4
    SitePoint Addict
    Join Date
    Feb 2009
    Location
    Austin Texas
    Posts
    289
    Thanks Tommy for your detailed explanation. It is very helpful. I've read that ANSI is a superset of ISO8859-1, and learned that the server I'm going to use is transmitting in ISO 8859-1. Should I assume I'm safe using ANSI in my text editor and ISO on the server?

    Actually mittineague, I noticed that Yahoo was using html entities as you posted, which spurred me to start the thread. I'd prefer not to use html entities.

  5. #5
    Mittineague
    Programming Team
    Join Date
    Jul 2005
    Location
    West Springfield, Massachusetts
    Posts
    17,154
    Yes, Tommy, thanks for that. No matter how many times I read about it I still have an uneasy feeling that I'm over my head.

    I think the main thing is to be consistent. The most common problems involving "weird characters" are almost always a result of using something different in the text editor, stated encoding, database, etc. So the best thing is to pick something and stick with it across the board.

    The other common problem is the BOM. AFAIK Notepad++ refers to this as the "signature" rather than BOM. Don't use it for UTF-8 or you may see "weird characters" at the beginning of your files.

  6. #6
    gary.turner
    Resident curmudgeon
    Join Date
    Jan 2009
    Location
    Dallas
    Posts
    990
    Some reading:

    The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

    Subject: UTF-8 history ⇦ Absotively, posilutely fascinating reading, and the best description of utf-8 anywhere.

    If you don't have a Spanish keyboard, use this Free Online Unicode Character Map.

    If you configure Tidy to output utf-8, it will convert character entities to the character its ownself.
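    This is not Tidy, but the same kind of conversion can be sketched with Python's standard `html.unescape`. The sample strings are taken from the Yahoo markup quoted earlier in the thread.

```python
import html

# Character entity references as found in the quoted markup.
title = "<title>Yahoo! Espa&ntilde;a</title>"
link_text = "&iquest;C&oacute;mo es posible?"

# html.unescape replaces named and numeric references with the
# characters themselves; the surrounding tags are left untouched.
converted_title = html.unescape(title)
converted_link = html.unescape(link_text)

print(converted_title)
print(converted_link)
```

    One caveat: `html.unescape` also converts `&lt;` and `&amp;`, so running it over a whole page can alter the markup. A tool that parses the HTML, as Tidy does, is the safer choice for full documents.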

    cheers,

    gary
    Anyone can build a usable website. It takes a graphic
    designer to make it slow, confusing, and painful to use.

    Simple minded html & css demos and tutorials

  7. #7
    SitePoint Author

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Quote Originally Posted by vanishdesign View Post
    I've read that ANSI is a superset of ISO8859-1
    I think that 'ANSI' is what Microsoft sometimes call their proprietary encoding Windows-1252. That is similar to ISO 8859-1, except that it adds a number of characters in the 0x80-0x9F range. In the ISO encodings that range is reserved for C1 control characters.

    Quote Originally Posted by vanishdesign View Post
    and learned that the server I'm going to use is transmitting in ISO 8859-1. Should I assume I'm safe using ANSI in my text editor and ISO on the server?
    Yes, if you're careful. You must not use any of the characters that Microsoft put in the 0x80-0x9F range, because those code positions are not allowed in ISO 8859-1. Browsers often assume that the encoding is Windows-1252 if it's declared as ISO 8859-1, because many non-savvy authors don't understand about encoding concepts and believe that Microsoft complies with standards. But your page may fail validation – and display incorrectly in some browsers – if you use literal representations for characters in this range (such as dashes, ellipses or typographically correct quotation marks).
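    One way to be careful with text pasted from Word is to scan it for characters that ISO 8859-1 cannot represent. A rough Python sketch follows; the function name `outside_iso_8859_1` and the sample text are hypothetical.

```python
def outside_iso_8859_1(text):
    """Return the distinct characters in text that ISO 8859-1 cannot encode,
    in order of first appearance."""
    found = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            if ch not in found:
                found.append(ch)
    return found

# Curly quotes (U+201C, U+201D), a long dash (U+2014) and an ellipsis
# (U+2026), as Word typically produces, mixed with ordinary Spanish text.
pasted = "\u201cHola\u201d \u2014 ¿qué tal\u2026?"
problems = outside_iso_8859_1(pasted)
print([f"U+{ord(ch):04X}" for ch in problems])

# Windows-1252 places these characters in the 0x80-0x9F range.
cp1252_octets = [ch.encode("cp1252")[0] for ch in problems]
print([hex(b) for b in cp1252_octets])
```

    Anything the function reports would need to be replaced with a character entity reference or a plain ASCII equivalent before the page is served as ISO 8859-1.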

    Quote Originally Posted by Mittineague View Post
    No matter how many times I read about it I still have an uneasy feeling that I'm over my head.
    It can be confusing, but once the penny drops it becomes fairly clear.
    • Computers can only deal with (binary) numbers.
    • Characters must therefore be represented by numeric values.
    • There are a lot of different characters used in writing.
    • Many different characters require large numbers to represent them all.
    • Most authors use only a limited subset of the total character repertoire.
    • Using large numbers to represent few characters is a waste of space.
    • Thus various encodings try to represent such subsets as efficiently as possible.
    • The Writer and the Reader must agree on which representation to use.
    • The Writer chooses an encoding and declares what it is.
    • The Reader trusts the declaration and interprets the numbers accordingly.
    • If the Writer is lying about the encoding, chaos and mayhem may follow.
    • The encoding declaration should be sent by the Writer's server for HTML.
    • The encoding declaration should be in the XML declaration for XHTML.
    • Using a <meta> equivalent in HTML is good practice, to ensure correct interpretation in the absence of a server.
    • A <meta> equivalent must match the server's declaration.

