DOCTYPE and Cyrillic

Hi,
I want to write clean XHTML pages in Cyrillic, but I can't find out which DOCTYPE to use and which tag to put in the head for the language.

Do I just change the EN in the DOCTYPE to RU?
What is the best way to make clear in the head of the page that it is Cyrillic?

Thanks for the answers and tips.
DancingMathilde

I think that’s right, but I won’t swear to it. :slight_smile: I have certainly seen this before:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
[COLOR="Red"]<html lang="ru">[/COLOR]

Here is a very old but still useful thread on this topic:

we are talking about the charset here, rather than the DTD.

first, your page should be written and encoded using utf-8, which is a large charset that includes cyrillic. doing that is also a life saver when you have mixed language content: russian, german, french.

then, you need to make the server send out the appropriate charset information: AddType directives or AddDefaultCharset for Apache. by doing that, the line in the HTTP header will look like this: Content-Type: text/html; charset=utf-8. as you’ll see, it’s also recommended you put this header info in the head section of your page.
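for example, in an .htaccess file it could look like this (just a sketch, assuming apache and that your host allows .htaccess overrides; check your host’s docs):

# adds charset=UTF-8 to text/html (and text/plain) responses
AddDefaultCharset UTF-8
# or, to label only .html files explicitly:
AddType "text/html; charset=UTF-8" .html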

and finally, you should include these in your page:


<html lang="ru">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Language" content="ru">
</head>

you could use a smaller cyrillic charset, like iso-8859-5, or windows-1251, but utf-8 is a safer bet.

noonope is correct, and you do NOT change the “EN” in the doctype!
That is not part of the charset; instead, it has to do with the published language of the DTD.

Keep the doctype the same as on any Western, Asian or whatever page, and set your language as noonope said: with the lang attribute in the HTML tag (plus the xml:lang attribute if you’re writing XHTML), the meta content-language tag, and most importantly on your server (if the server and your meta tags conflict, the server wins out).

Lastly, the only thing noonope didn’t mention was: also save your document in UTF-8 (or a lesser charset if you choose some other one; just keep them all consistent with each other). If the document’s editor saves in some other charset, and the server tries to send it as another charset, the user will see a lot of ??? everywhere.

you could use a smaller charset, like iso-8859-5, or windows-1251, but utf-8 is a safer bet.

A much safer bet… I don’t run Windows, so my computer doesn’t always do well with Windows-only charsets (1251, 1252).

Thanks a lot for the to-the-point information.
Stomme poes: the ??? is exactly what I noticed in the browser.
Thanks again.

but i did…

first, your page should be written and encoded using utf-8

Ah true… I only think of it as “saving as” because in most editors, the charset is set when you save. But that may also be different per editor.

i use notepad++ for my html and css. in it you can change the encoding of your file on-the-fly, using the options in the Encoding menu. and i set the default for a new file in notepad++ to UTF-8 w/o BOM. so no save as for me :)

this is the essential part for non-ANSI content in your web page: starting with a UTF-8 encoded document for your page, and this is why i put it first:

first, your page should be written and encoded using utf-8, which is a large charset that includes cyrillic. doing that is also a life saver when you have mixed language content: russian, german, french.

if you serve such a file as UTF-8 at the server side, it should be enough. there isn’t really a need for lang=“ru” or charset=utf-8, at least not to ensure proper character display on your site :slight_smile:

I assume you mean you would like to change the human language used on a full page rather than use a combination of languages on a single page?

For example azbuka (ISO-8859-5), as was mentioned earlier: you would typically set the “charset” parameter of the “Content-Type” header field of the HTTP protocol.

Failing that, in (x)html you’d probably use a META ‘Content-Type’ declaration, which should appear as early as possible in the HEAD element, i.e. before the TITLE. As a final fail-safe you could use the ‘charset’ attribute.

<?xml version="1.0" encoding="UTF-8"?>
. . .
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ru" lang="ru">
. . .
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
. . .
<meta http-equiv="content-language" content="ru" />

In other words if there is a conflict between multiple encoding declarations within XHTML it follows:

  1. HTTP Content-Type header
  2. byte-order mark (BOM)
  3. XML declaration
  4. meta element
  5. link charset attribute
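That last one would look something like this: in HTML 4.01 / XHTML 1.0 the charset attribute on a linking element (e.g. an anchor or link pointing at the document) hints at the encoding of the linked resource. The filename here is just an illustration:

<a href="stranitsa.html" charset="koi8-r">Russian-language page</a>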

As was mentioned, UTF-8 will usually cover most things, although in some instances there can be rendering issues due to fonts/glyphs in some languages, etc.

Unicode as a general rule does not include precomposed accented Cyrillic letters. KOI8-R encoding is popular for Russian text and is used more than ISO-8859-5, but Unicode support is slowly replacing them.

Don’t use Windows-only charsets; they are totally evil!

Poes was probably talking about the ‘replacement character’ � (often a black diamond with a white question mark), a symbol found in the Unicode standard at codepoint U+FFFD in the Specials block.

there isn’t really a need for lang=“ru”,

Actually, there is. Screen readers and other assistive technology are supposed to pay attention to lang attributes, and they do.
If my reader defaults to English, I don’t want it trying to read out Russian with English pronunciation. I shouldn’t have to listen, figure out what the language really is, and fiddle with my settings.
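As a tiny made-up illustration of what I mean: an inline lang switch is what tells the reader to change pronunciation mid-sentence.

<p>The Russian word <span lang="ru" xml:lang="ru">да</span> means “yes”.</p>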

or charset=utf-8,

I always include it for two reasons:

  1. validation
  2. I don’t have control over the server, and I want mismatches to show up when they set goofy charsets on the server. I also hard-code all my non-ASCII chars as decimal HTML entities when they must always appear correctly (see the small example after this list).
    It’s true that if your server is set correctly there is no need for the meta tag, as the browser will ignore it, but the validator insists on that one.
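A made-up example of that entity habit: the word below stays readable even if the charset gets mangled somewhere along the way, because the file itself contains only ASCII bytes.

<!-- “нет” written as decimal entities: н = &#1085;, е = &#1077;, т = &#1090; -->
<p lang="ru">&#1085;&#1077;&#1090;</p>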

Poes was probably talking about the ‘replacement character’ � (often a black diamond with a white question mark), a symbol found in the Unicode standard at codepoint U+FFFD in the Specials block.

I noticed Safari uses a weird one, and the Doze machine just shows empty boxes.
When there’s a charset mismatch you can get the �, but if you don’t have the font on the system, at least on Linux, you’ll get a box with little symbols in it.

did a little research and i think this is worth mentioning:

Unicode does not include accented Cyrillic letters, but they can be combined by adding U+0301 (“combining acute accent”) after the accented vowel (e.g., ы́ э́ ю́ я́). Some languages, including modern Church Slavonic, are still not fully supported.

www.google.ru also uses utf-8. i think it’s safe to assume utf-8 is ok to use for cyrillic pages.
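just as a sketch of what that combining accent looks like in markup (the codepoints are the ones mentioned in the quote above):

<!-- stressed ы́: the base vowel U+044B followed by the combining acute accent U+0301 -->
<p lang="ru">&#1099;&#769;</p>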

Unicode does not include accented Cyrillic letters, but they can be combined by adding U+0301 (“combining acute accent”) after the accented vowel (e.g., ы́ э́ ю́ я́). Some languages, including modern Church Slavonic, are still not fully supported.

This is a problem. There are many characters that can be represented either as a single character or as a combination of two. There’s a letter, I think it was a small j but with a circumflex instead of a dot on top? Or some similar letter, where there was a two-char combo for the uppercase but not for the lowercase, or vice versa… I was reading about that particular letter in a book on regexes. Regular expressions can puke on those differences. That’s why I’m scared to get too far into regular expressions and Unicode!
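For instance (my own illustration, not from the book): the Cyrillic short i “й” can be written either as the single precomposed character U+0439 or as “и” (U+0438) followed by the combining breve (U+0306). Both render the same, but a regex or string comparison that doesn’t normalise first will treat them as different strings.

<!-- both lines display as “й”, but the underlying character sequences differ -->
<p lang="ru">&#1081;</p>
<p lang="ru">&#1080;&#774;</p>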

i think it’s safe to assume utf-8 is ok to use for cyrillic pages.

I would.

you are right but… :wink:

there isn’t really a need for lang=“ru” or charset=utf-8, at least not to ensure proper character display on your site

i was trying to make sure the OP understands:

  • what the minimum requirements are to make a cyrillic web page: the text file containing the page has to be encoded properly and sent with the proper content-type header.

  • what the other inclusions are for: lang, meta :slight_smile:

and your additions were indeed necessary to make things even clearer :slight_smile: there are cases when those are a must.

For those looking to cut to the chase, here’s the code I use religiously. I use XHTML so I need to include both lang and xml:lang attributes in the html tag. You define the character set in a meta tag like so.


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">  
     <head>
          <meta http-equiv="content-type" content="text/html; charset=utf-8" />
     </head>

“<?xml version="1.0" encoding="UTF-8"?>” before the doctype is technically correct for defining the character set. However, it causes IE (or at least certain versions) to melt into small puddles of non-functional goo, so I’ve never used it and I don’t recommend it. I haven’t used a meta tag to define the language; however, without knowing some key reason to include it, it’s bad practice to define any setting in multiple locations. As soon as one is updated but not the other, your page will contradict itself, which can lead to problems.

I think we should go back to this. This is the real reason why you use utf-8. There are all kinds of subtle hiccups that can occur if you try to code one page in one charset and a different page on the same site in another. Trying to use two different languages on the same page might work, depending on the exact characters you use and whether or not they’re available… ha ha… try keeping tabs on THAT bit of maintenance on pages across your website. This is the nightmare which caused people to create Unicode in the first place, and I use it consistently on all my pages on all my websites. Rarely a small bump may surface, but it’s nothing compared to trying to manage the alternatives, especially if you are building a website for international use.

Note that this does not mean you will always get good display out of Unicode. Using Unicode simply means that the HTTP response you send to a visitor’s browser is more likely to be understood than with any other technique. Sadly, whether or not the characters render is ALSO dependent on the font support on the visitor’s computer. So if you use Unicode, but your computer doesn’t have fonts which include those glyphs, you will see question marks, or little rectangles, or boxes with four-digit Unicode codes, or whatever other wildcard your OS uses. This isn’t the fault of Unicode; it’s a font issue. A different character set would not render any better, because the problem is lack of font support, and on top of that a different character set is more likely to cause problems of its own.

You can also do everything right, but if your web host serves content headers that specify a different character set, it will override the settings in your XHTML. You’ll need to contact your web host to correct this situation (and there are some web hosts who refuse). Fortunately, this is much less common than it used to be because Unicode is finally emerging as a dominant standard.
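If you want to see what your host is actually sending before you contact them, one quick check (a sketch, assuming you have curl available; the URL is just a placeholder) is to look at the response headers:

curl -I http://www.example.com/page.html
# look for a line like:
# Content-Type: text/html; charset=utf-8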

Also, you should ALWAYS include the language of the document. There is probably some logic that defaults to English or something, but you can cause serious issues with usability and accessibility, especially for non-browser interfaces. Having worked at the Braille Institute, I can cite a couple of cases where not specifying the language, or doing so incorrectly, will result in complete gibberish for some users.

  1. Voice output programs read what’s on screen for users. They have completely different code, recordings, accents, etc. depending on the language you specify. If you do so incorrectly, it will try to read English as French, or Russian as English, etc. The result is usually unintelligible.

  2. Braille displays render completely differently depending on the language; this is not even a technology-specific issue, but a Braille issue. Braille characters are completely different in every language. Someone can learn English Braille and read Braille books; however, if they pick up a book in French Braille, all they can make out is gibberish even if they are fluent in French, because the symbols actually stand for totally different letter combinations. And if English and French are incompatible, you can imagine the problems for non-European languages.

The only versions of IE that don’t like that tag are versions that do not support XHTML at all. It doesn’t have any issues in any browser that actually does support XHTML.

The earliest version of IE to actually support XHTML is IE9. For all earlier versions you can’t serve the page as XHTML and have it display in Internet Explorer (it gets offered for download if you try), and so you have to serve it as HTML if you want the page to display. That declaration is invalid for HTML, but only IE6 actually has any problem with it being there.

  1. Voice output programs read what’s on screen for users. They have completely different code, recordings, accents, etc. depending on the language you specify. If you do so incorrectly, it will try to read English as French, or Russian as English, etc. The result is usually unintelligible.

Sadly, I don’t have a Dutch voice on my copy of JAWS (unless I wanna pay about a thousand euros). For JAWS testing of any of my work pages, this means I must translate everything I want to test into English. Dutch-pronounced-as-English is very difficult to follow.

  1. Braille displays render completely differently depending on the language; this is not even a technology-specific issue, but a Braille issue. Braille characters are completely different in every language. Someone can learn English Braille and read Braille books; however, if they pick up a book in French Braille, all they can make out is gibberish even if they are fluent in French, because the symbols actually stand for totally different letter combinations. And if English and French are incompatible, you can imagine the problems for non-European languages.

Cool info. I’ve always wondered if it was like Sign Language (different in different languages).

That declaration is invalid for HTML, but only IE6 actually has any problem with it being there.

Yes, they’ve trained IE7 to ignore the XML declaration, even though other comments before the doctype throw IE7 into Quirks Mode.

Thanks a lot for the above discussion, I learned a lot from it :slight_smile:

I had a look for the server settings, and I can choose 3 options:

  • koi8-r
  • windows-1251
  • x-mac-cyrillic

As far as I followed the above (and understood it), I can best select the ‘koi8-r’ option.
In the html document I start with this (and remove the red coding, as explained by felgall):
<code>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Language" content="ru" />
</head>
</code>

Mathilde

Well, you certainly don’t want lang=“en” on a Russian-language site!

But you do want lang=“ru” and xml:lang=“ru”!

Side note: why transitional doctype? Use Strict! : )

Note 2: our code tags (and other tags) use square brackets

Stephen was making a joke about the fact that most live versions of M$IE cannot handle XHTML, so he was probably advocating HTML 4.01.

<?xml version="1.0" encoding="UTF-8"?>
   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   [B]<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ru" 
    lang="ru">[/B]
     <head>
       <meta http-equiv="Content-Type" content=
       "text/html; charset=UTF-8" />
       <title>
         The guzzlin' wizard
       </title>
     </head>
     <body>
     </body>
   </html>

If you are writing XHTML grammar it is required that you have the xml namespace, and it is usually a good idea to add the ‘xml:lang’ and ‘lang’ values. XHTML will typically default to UTF-8 in the absence of other higher-level protocols, etc.

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

should be, from what options i understand you have,

<meta http-equiv="content-type" content="text/html; charset=koi8-r" />

and you should have

<html lang="ru">

or

<html lang="ru" xml:lang="ru" xmlns="http://www.w3.org/1999/xhtml">

see more on this here.