Frequently Asked Questions about HTML

AutisticCuckoo · October 9, 2006, 7:38am

Frequently Asked Questions About HTML

Table of contents

[list=1][]What is HTML?
[]What does the DOCTYPE declaration do?
[]What is a DTD?
[]What is the difference between Strict, Transitional and Frameset?
[]Which DOCTYPE should I use?
[]Why should I validate my markup?
[]Why does HTML allow sloppy coding?
[]Why does the validator complain about my <embed> tag?
[]What does character encoding (charset) mean?
[]What is a BOM?
[]What encoding should I declare?
[]How do I insert characters outside the encoding range?
[]Why do I need to write & instead of &?
[]How should heading elements be used?
[]What are block-level and inline elements?
[]Can I make an inline element block-level with CSS?
[]Why are external CSS and JavaScript files a good idea?
[]Should I use P or BR?
[]What does ‘semantic’ mean?
[]Should I replace B/I with STRONG/EM?
[]Why are layout tables considered harmful?
[]Should I use DIVs instead of layout tables?
[]Are all TABLEs bad?
[]What is the use of the ADDRESS element type?
[]What is the use of the DFN element type?
[]What is the use of the VAR element type?
[]Should I use quotation marks within or around a Q element?
[]What is the difference between ABBR and ACRONYM?
[]Why is <feature X> deprecated?
[]Must I have an ALT attribute for every image?
[]What is the difference between CLASS and ID?
[]Why doesn’t id="123" work?
[]Why doesn’t href=My Cool Page.html work?
[]How can I include an HTML page in another HTML page?[/list]

This FAQ deals with HTML. For information about the differences between HTML and XHTML, please see the [thread=393445]XHTML vs HTML FAQ[/thread].

What is HTML?
HTML, or HyperText Markup Language, is a tagged markup language primarily used for web documents. A tagged markup language means that the content is interspersed with instructions, tags, that mark up the semantic meaning of certain passages. HTML is an application of SGML (Standard Generalized Markup Language), a more generic markup language.

HTML defines a number of element types (written in uppercase throughout this FAQ, although HTML is case insensitive). An element type, e.g., EM, assigns some semantic meaning to its content.

An element is a concrete instance of an element type. An element usually consists of a start tag (), some content, and an end tag (). Tags are written in lowercase in this FAQ. HTML allows some end tags (and even a few start tags) to be omitted. Do not confuse tags with elements; the BODY element will be present even if the <body> and </body> tags are omitted. Certain element types – declared as EMPTY – must not have an end tag. One example is the IMG element type.

A start tag can contain attributes, comprising an attribute name, an equals sign (=), and an attribute value. Example: lang="en". Attribute values must be quoted in some instances, so it is good practice to always quote all attribute values. Some boolean attributes are allowed to be minimised in HTML, which means the name and the equals sign are omitted; e.g. selected. Some attributes are required for some element types, e.g., the alt attribute in an IMG element.

An example of an EM element with a lang attribute:

<em lang="en">content</em>

Beginners often use phrases like ‘alt tag’, but that is not correct nomenclature; alt is an attribute, not a tag. Tags are surrounded by <…>.

The first version of HTML (1989) didn’t have a version number; it was just ‘HTML’.
The first ‘standardised’ version of HTML (IETF, 1995) was called HTML 2.0.
Then the World Wide Web Consortium (W3C) was formed. It presented its first ‘standard’ version (W3C isn’t a standards body, so their ‘standards’ are really called ‘recommendations’) in 1997: HTML 3.2.
HTML 4.0 came out in 1998, and was quickly replaced by HTML 4.01 in 1999. That is the latest and current version of HTML. The W3C has announced that it will not create further versions of HTML. HTML 4.01 is the recommended version for creating HTML documents.
However, the Web Hypertext Application Technology Working Group (WHATWG) are working on what is referred to as [url=http://whatwg.org/specs/web-apps/current-work/#html5]HTML5, hoping that it will eventually be accepted as a W3C recommendation.

What does the DOCTYPE declaration do?
The DOCTYPE declaration, which must precede any other markup in the document, can look something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

It specifies the element type of the document’s root element (HTML), a public identifier and a system identifier.

The public identifier (-//W3C//DTD HTML 4.01//EN) shows who has issued the document type definition, or DTD, (W3C); the name of the DTD (DTD HTML 4.01); and the language in which the DTD is written (EN, for English). Note that it doesn’t say anything about the language of the web page itself; it is the language of the DTD that is specified.

The system identifier (http://www.w3.org/TR/html4/strict.dtd) is the URI (uniform resource identifier, or ‘web address’) for the actual DTD.

The DOCTYPE declaration tells a validator (a program that checks the syntactic validity of a web page) against which DTD to test for compliance. Browsers didn’t used to care about the DOCTYPE declaration, but modern browsers use it for a completely different purpose: deciding if the page is ‘modern’ (and presumably standards compliant) or old-school. This affects the rendering mode in IE5/Mac, IE6+/Win, Opera, Firefox (and other Gecko browsers), Safari, etc. A complete DOCTYPE declaration (including the system identifier) tells the browser that it is a modern document. If the system identifier is missing, or if there is no DOCTYPE declaration at all, browsers assume it is an old document and render it in ‘quirks mode’.

What is a DTD?
A DTD, or document type definition, specifies the element types and attributes that we can use in our web page. It also specifies how these element types relate to one another, e.g., which type can be subordinate to which. We can regard the DTD as the grammar specification for our markup language. The DTD can also declare the character entities we can use; more about those later.

A validator will test a web page for compliance with the DTD specified in the DOCTYPE declaration; either explicitly via the system identifier or implicitly using the public identifier. Browsers use non-validating parsers and do not actually read the DTD. They have built-in knowledge about the various element types and usually a hard-coded list of character entities as well.

For HTML 4.01, which is the latest and greatest version of HTML and the only one we should consider when creating new web pages, there are three different DTDs: Strict, Transitional and Frameset.

What is the difference between Strict, Transitional and Frameset?
The difference is which element types and attributes they declare, and how they allow or require element types to nest.

The HTML 4.01 Strict DTD emphasises the separation of content from presentation and behaviour. This is the DTD that the W3C recommend for all new documents.

The HTML 4.01 Transitional DTD is meant to be used transitionally when converting an old-school (pre-HTML4) document into modern markup. It is not intended to be used for creating new documents. It contains 11 presentational element types and a plethora of presentational attributes that are deprecated in the Strict DTD. The Transitional DTD is also often necessary for pages that reside within a frameset, because it declares the target attribute required for opening links in another frame.

The HTML 4.01 Frameset DTD is for frameset pages. Frames are deprecated by the W3C. For modern websites, using server-side scripting technologies is usually regarded as a far better solution.

Which DOCTYPE should I use?
If you are creating a new web page, the W3C recommend using HTML 4.01 Strict.

If you are trying to convert an ancient HTML 2.0 or HTML 3.2 document to the modern world, you can use HTML 4.01 Transitional until you have managed to transfer all presentational issues to CSS and all behavioural issues to JavaScript.

Why should I validate my markup?
Why should you spell-check your text before publishing it on the Web? Because mistakes and errors can confuse your readers and detract from the important information. It is the same with markup. Invalid markup can confuse browsers, search engines and other user agents. The result can be improper rendering, dysfunctional pages, unindexed pages in the search engines, program crashes, or the end of the universe as we know it.

If your page doesn’t display the way you intended, always validate your markup before you start looking for other problems (or asking for help on SitePoint). With invalid markup, there are no guarantees.

Use the HTML validator at W3C to check for compliance. Don’t forget to include a DOCTYPE declaration, so the validator knows what to check against.

HTML Tidy is a free tool that can help you tidy up sloppy markup and make it nicely formatted and easier to read.

Why does HTML allow sloppy coding?
It doesn’t, but it recommends that user agents handle markup errors and try to recover.

It is sometimes alleged – usually as an argument for why XHTML is superior to HTML – that HTML allows improperly nested elements like foo. That is not true; the validator will complain about that because it is not valid HTML. However, browsers will usually guess what the author meant, so the error can go by undetected.

Some dislike that HTML allows certain (but not all!) end tags to be omitted. That is not a problem for browsers, because valid markup can be parsed unambiguously anyway. In the early years it was very common to omit certain end tags, e.g.,  and </li>. Nowadays it’s usually considered good practice to use explicit end tags for all elements except those where it is forbidden (like BR and IMG).

Why does the validator complain about my <embed> tag?
Because EMBED has never been part of any HTML recommendation. It is a non-standard extension which, although supported by most browsers, is not part of HTML.

During the ‘browser wars’ of the late 1990s, browser vendors like Microsoft and Netscape competed by adding lots of ‘cool’ features to HTML, to make it possible to style web pages. The problem with those additions was that they were not standardised and that they were mostly incompatible between browsers.

There are other elements that used to be quite common (MARQUEE anyone?) that have never been included in an HTML recommendation. Don’t use them, if you can avoid it.

There are also a number of attributes that were very common in the 1990s, but which have never been included in an official HTML recommendation. For example, marginwidth.

What does character encoding (charset) mean?
Computers can only deal with numbers. What we see on the screen as letters or images are really just numeric codes, which the computer sees as groups of binary digits (ones and zeros).

First, we need to define a minimum unit capable of conveying some sort of information. This is called a character. This is a rather abstract concept. The character known as ‘uppercase A’ has no defined visual appearance; it’s more like ‘the idea of an A’.

Then we need to establish a set of such abstract characters that we will want to use. That is called a character repertoire, or sometimes a ‘character set’, but that term is used for several different things, so I will avoid it here. A character repertoire is the total set of abstract characters that we have at our disposal. For HTML, the character set is ISO 10646, which is virtually the same thing as Unicode. It is a repertoire of tens of thousands of characters representing most of the written languages on the planet.

The visual appearance of a character is called a glyph. A certain set of glyphs is known as a font. The glyph for ‘uppercase A’ will differ between fonts, but that doesn’t change the underlying meaning of the abstract character.

Now, since computers only deal with numbers, we must have a way to represent each character with a numeric code. Each character in a repertoire has a code position, or code point. The code point is the numeric representation (index) of the character within the repertoire. Code points in Unicode are usually expressed in hexadecimal, e.g., U+0041 for ‘uppercase A’.

Finally, the encoding – sometimes, unfortunately, called a ‘character set’ or ‘charset’ – is a mechanism for expressing those code points, usually with octets, which are groups of 8 binary digits (thus capable of representing numbers between 0 and 255, inclusive).

In the early days of computer communication, people used small character repertoires containing only the bare necessities for a specific language. The most well-known one is probably ASCII (ISO 646), which only contains 128 characters – and 33 of those are unprintable ‘control codes’ (the C0 range plus DEL). The repertoire has 128 code points numbered sequentially from 0 to 127. The encoding is a simple one-to-one: the codepoint for ‘uppercase A’ is 65 (0x41), which is encoded as 65 (the octet 01000001, in binary).

ASCII isn’t very useful outside the English-speaking world, because it only contains the letters A-Z, digits 0-9, and some punctuation. ISO issued a set of standards called ISO 8859, which augment the ASCII repertoire with characters that are needed in other languages. In the Western world, the most common one is ISO 8859-1, known as Latin-1. It contains characters needed to write most Western European languages. The ISO 8859 series are both character repertoires and encodings (one-to-one). Each repertoire contains 256 characters, which can be encoded using single octets. They use the ASCII repertoire as a subset, i.e., the first 128 code points are the same.

But even 256 characters is not enough to write some languages. Chinese, for instance, needs thousands of characters. Several mutually incompatible encodings for Chinese were devised, but there was still a big problem when you wanted to exchange information across linguistic and cultural barriers. The ISO 8859-1 encoding for ‘uppercase A’ might be something totally different in one of the Chinese encodings; perhaps even something rude!

An easy solution would be to create a 32-bit encoding that would enable direct access to four billion code points, or at least a 16-bit encoding (65,536 code points). Both of those exist, but there is a drawback: there will be a lot of useless octets for most Western languages. With a 32-bit encoding, every document would be four times are large as with an 8-bit encoding.

The solution was a variable-length encoding called UTF-8. It uses between one and six octets to encode each code point, and it can address the entire Unicode (or ISO 10646) character repertoire. The first 128 code points are encoded with single octets, and are identical to the same code points in the ISO 8859 series or in ASCII (US version). Most Western European languages can be encoded with single octets, sprinkled with the occasional double octet for letters with diacritical marks (e.g., ‘Ä’).

OK, so how does this affect us as authors of web documents? If we use characters whose code point is outside the ASCII range, the encoding becomes really crucial. Specify the wrong encoding, and the page will be difficult – or even impossible – to read.

So how do we go about specifying the encoding? The proper way to do it is to send this information in the Content-Type HTTP header:

Content-Type: text/html; charset=utf-8

The HTTP headers are sent by our web server, so we must tweak the server to change the encoding information. How to do that depends on which web server we use. For Apache, it can be specified in the global configuration file (/etc/httpd.conf) or in local .htaccess files. But if we are on a shared host, we may not have sufficient privileges to tweak the configuration. In that case, we need a server-side scripting language to send our own HTTP header; e.g., with PHP:

header('Content-Type: text/html; charset=utf-8');

We can also specify the encoding using an HTTP equivalent in a META element:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

This META element will be ignored if the real HTTP header contains encoding information. It can be useful anyway, however, because it will be used if a visitor saves our page to the hard drive and looks at it locally. In that situation there is no web server to send HTTP headers, so the META element will be used instead.

There is no default encoding for HTML, so we should always make sure to specify it.

Under Microsoft Windows, a common encoding is Windows-1252. It is very similar to ISO 8859-1, but there are differences. In ISO 8859-1, the range of code points between 128 and 159 (0x80-0x9F) is reserved for C1 control characters. In Windows-1252, that range is instead used for a number of useful characters that are missing from the ISO encoding, e.g., typographically correct quotation marks. This is not an encoding that I would recommend for use on the Web, since it’s Windows specific. It is, however, the default encoding in many text editors under Windows.

What is a BOM?
The BOM, or byte order mark, is used for some encodings that use multiple octets to encode code points, e.g., UTF-8 and UTF-16. Computer processors (CPUs) can employ different schemes for storing large integer numbers, e.g., ‘big-endian’ or ‘little-endian’ (this has to do with whether the least significant byte comes first or last). The BOM is a representation of the value 0xFFFE (65,535), which means that it’s possible to detect the byte order (FF FE vs FE FF). The BOM is thus two octets which are written at the very beginning of the file, to tell the parser how to interpret multi-octet values.

Unfortunately, many older browsers cannot handle this, so they display these octets as character data. If you see something like ‘ï»¿’ at the top of the page, the reason is probably a BOM that isn’t handled by the browser (or an incorrectly specified encoding).

The only resolution is to avoid using the BOM. Editors that can save as UTF-8 will usually allow us to choose whether or not to include the BOM.

What encoding should I declare?
It’s very, very simple: we must specify the encoding that we used when saving our .html file! If we save the file as ISO 8859-1, we must specify the encoding as iso-8859-1; if we save as UTF-8, we specify it as utf-8. The only problem here is that we may not always know what encoding our editor is using to save the file. Any editor worth its salt should give us an option to specify the encoding, though.

If we are writing in English, it doesn’t matter all that much what encoding we choose, because we are mostly going to use characters that are encoded the same in most encodings. US-ASCII, ISO 8859-1, UTF-8, … take a pick. For those of us who write in other languages, the choice becomes more important. My native language – Swedish – uses three letters more than what the English alphabet has to offer. Those are present in ISO 8859-1, though, so I can choose between that and UTF-8. If you are writing in an East Asian language like Chinese or Japanese, you may want to look into UTF-16, since UTF-8 can be a bit inefficient for those languages. Otherwise, though, I wouldn’t recommend UTF-16, because there seems to be quite a bit of problems with that in Western browsers.

Avoid Windows-1252 on public web pages, since it’s a Windows specific encoding. Use ISO 8859-1 instead (or ISO 8859-15, if you need the Euro sign, €).

If at all possible, my recommendation is to use UTF-8. It can natively represent any character in the Unicode repertoire.

How do I insert characters outside the encoding range?
What if we are using ISO 8859-1 as the encoding and wish to include an em-dash in our content? There is no em-dash in that character repertoire, and hence no way to encode it, although it is present in ISO 10646 and can be used on a web page.

We have two choices: a named entity or a numeric reference.
The named entity for an em-dash is —. Entities start with an ampersand (&) and end with a semicolon. In some circumstances we can get away with omitting the semicolon, but it is definitely good practice to always put it in.

A numeric reference can be either decimal (&#8212;) or hexadecimal (&#x2014;), but it’s generally safer to stick with decimal notation, because some old browsers don’t handle the hex version. Note that the numeric value references the code point in ISO 10646; it has nothing whatsoever to do with what encoding we have specified for our document.

References (in decimal) always work. Named entities may cause problems in older browsers, because some of them only support a subset of HTML entities.

Why do I need to write & instead of &?
(How do I show HTML markup on the page?)
Certain characters have special meaning in HTML: ‘<’ (less than), ‘>’ (greater than), ‘&’ (ampersand), ‘"’ (double quote) and ‘’’ (single quote). In some circumstances, we need to ‘escape’ them. For instance, the ‘<’ character signals the start of a tag, and needs to be escaped. The ampersand signals the start of a named entity or numeric reference, and must always be escaped (except in CDATA elements like SCRIPT and STYLE). The quotation marks need only be escaped within attribute values surrounded by the same quotation marks.

The first four have predefined entities in HTML, but not the single quote. XML defines ', but HTML does not, so a single quote (apostrophe, really) can only be escaped using a reference (&#39;). The entities for the other four are as follows:

< (<)
> (>)
& (&)
" (")

Since the ampersand is very special, it must always be escaped, including when it’s used inside attribute values. Like the href attribute of links. Unfortunately, the ampersand is a very common argument separator in URIs, which means that it’s quite common to encounter ampersands in URIs.

Most of the time, it doesn’t break anything (in HTML; XHTML is a different story). The error handling routines in browsers recover from the error and it all works. But if we should happen to have a query parameter whose name matches one of the predefined named entities in HTML …

How should heading elements be used?
HTML heading element types are H1, H2, H3, H4, H5 and H6. The number denotes the structural level of the heading, which means we should look at headings as in those outlines we had to learn in school (and promptly forgot about right after graduation).

The top-level heading on a page must be an H1. It should describe what the page is about. Most pages will only have one H1 heading, but very complex documents that deal with several disparate topics may need more.

H2 headings will mark up the next structural level. Any sub-levels under that will be H3, and so on. We can never skip a heading level (downward). An H4 cannot possibly follow an H2; there must be an H3 in between. (The validator will not complain, this is merely an issue of good practice.)

It’s important to mark up headings with the Hx element types. Assistive technologies, e.g., screen readers, can make use of a proper heading hierarchy to present an outline of the document. If we use ..., they cannot.

What are block-level and inline elements?
There are two main categories of element types in HTML: block-level elements and inline elements. The differences between them are mainly semantic and grammatic.

Block-level elements are usually ‘containers’ for other elements. Examples of block-level elements are DIV, P, FORM and TABLE. Some block-level elements, e.g., P, can only contain text and inline elements. Others, e.g., FORM, can only contain block-level elements (in the Strict DTD). And some, like DIV, can contain text, inline elements and block-level elements. Block-level elements are by default rendered with an implicit line break before and after; in other words, we cannot have two block-level elements side by side using only Strict HTML (that requires CSS).

Inline elements are elements that can occur ‘inline’ within text. Examples: A, EM, Q and SPAN. An inline element can only contain text and other inline elements. An inline element cannot contain a block-level element, with one exception: OBJECT (which is known as a replaced inline element, the same as IMG). Inline elements, when rendered, do not have any implied line breaks before or after.

Sometimes there are additional restrictions on child element types. For instance, anchor links (A) can contain text and inline elements, but not other A elements; you cannot nest links.

The rules are somewhat different between the Strict and the Transitional DTD. In the Strict DTD, some block-level elements like BODY, BLOCKQUOTE and FORM can only have block-level children. In the Transitional DTD they can also contain text and inline elements as immediate children.

Can I make an inline element block-level with CSS?
No. This is a common misconception. Beginners sometimes think that using display:block on an A element will allow them to put a block-level H1 inside the link. That is not the case.

HTML has block-level and inline elements. CSS has block and inline boxes (plus a few others). Those are very different things. The distinction in HTML has to do with semantics and syntax, while the distinction in CSS has to do with rendering and presentation. By default, block-level elements generate block boxes, and inline element generate inline boxes (grossly simplified). The display property can change the type of the generated box, but CSS cannot change the grammatical or syntactical rules of HTML.

Why are external CSS and JavaScript files a good idea?
From a maintenance perspective, a full separation between content, presentation and behaviour is something to strive for. If we want to redesign our site, we can simply edit a single style sheet instead of updating possibly thousands of HTML documents. If we use style attributes and write inline CSS, we will have to edit those HTML documents when redesigning our site, instead of simply editing a single style sheet file.

There is also another issue: both CSS and JavaScript often contain characters with special meaning in HTML. If the CSS code or JavaScript code is embedded into the HTML document, these characters need to be escaped. If we have embedded JavaScript, and use the archaic practice of ‘hiding’ the script code within SGML comments (), we cannot use the decrement operator (e.g., --i), because the double hyphen will terminate the comment.

Should I use P or BR?
The P element marks up a paragraph of text. A paragraph is one or more sentences that deal with a single thought.

A line break (BR) is mostly a presentational thing, and should be handled by CSS rather than HTML. However, there are a few cases where line breaks can be said to have semantic meaning, for instance in poetry, song lyrics, postal addresses and computer code samples. These can be legitimate uses for BR, but using BR to separate ‘paragraphs’ is definitely not a legitimate use.

On the other hand, P has a very clear semantic meaning: it denotes a paragraph. Sometimes web authors tend to treat P as a generic block-level container element, but that’s not correct. It’s not uncommon to see a LABEL and an INPUT field wrapped inside a P within a FORM, but I would argue that it’s semantically wrong. A label and an input field does not constitute a ‘paragraph’.

What does ‘semantic’ mean?
se·man·tic [si-'man-tik]
adj. Of, pertaining to, or arising from the different meanings of words or other symbols.

When we talk about ‘semantic markup’, we mean the proper use of element types – based on their meaning – to mark up content. The opposite is ‘presentational markup’ or ‘tag soup’, where authors choose element types because of their default rendering, rather than their semantic meaning.

An example: This is a semantically correct way to mark up the top-level heading of a web page:

<h1>Heading Text</h1>

This is an unsemantic (presentational) way to do it:

<br><font size="7"><b>Heading Text</b></font><br>

The semantic richness of HTML is quite limited. HTML was originally used by physicists to exchange scientific documents, and that shows quite clearly in the set of available element types. HTML would probably have had a very different set of element types if it had been invented by accountants or librarians.

HTML has two semantically neutral element types as well: the block-level DIV and the inline-level SPAN. Neither of those two implies any particular semantics about its content; DIV is just a ‘division of the document’, while SPAN is a ‘span of characters’. On the other side of the spectrum we have element types with clearly defined semantics: P (paragraph of text), TABLE (tabular data), UL (unordered list), etc.

The purpose of HTML is to mark up the semantics of a document, and – to some extent – to show the structure of its content. It has nothing at all to do with the way this document looks in a browser (although browsers have a default style for each element type).

Should I replace B/I with STRONG/EM?
Only if we really mean to emphasise something. They are not interchangeable.

In the Bad Old Days, authors would use B and I to emphasise words.
In the Equally Bad Modern Days, authors will use STRONG and EM to make text boldfaced or italic.

EM signifies semantic emphasis. The content should have some sort of emphasis (louder, more slowly) when read out loud. STRONG indicates even stronger emphasis, but is now often considered to be redundant (you could nest EM elements to indicate increasing emphasis). Some experts recommend that STRONG be used only for certain page elements that should be clearly indicated (like a ‘current page’ indicator), and not to mark up words or phrases in the body copy.

B and I have no semantics, they only indicate bold or italics. They are useful for adhering to typographic conventions that do not have a semantically correct element type in HTML. For instance, ship names are traditionally written in italics, but there is no SHIP element type in HTML. Thus we can use Titanic.

Why are layout tables considered harmful?

Because it is semantically wrong to mark up non-tabular information as a TABLE.
Because they can cause accessibility or usability problems (especially with some assistive technologies), particularly when nested several levels deep.
Because they mix presentational issues with the content, making it difficult or impossible to achieve alternate styling and output device independence.
Because they bloat the markup with lots of unnecessary HTML tags, which can be detrimental for low-bandwidth users (dial-up, mobile devices) as well as for the web server’s load and bandwidth.

Should I use DIVs instead of layout tables?
No, we should use semantically correct element types as far as possible, and only revert to DIVs when there are no other options.

Abusing DIVs is no better than abusing TABLEs. We can set id and class attributes on virtually any element type. We can assign CSS rules to virtually any element type. Not only DIVs.

Are all TABLEs bad?
Not at all. TABLE is the proper, semantically correct element type to use for marking up tabular data: information with relationships in two or more dimensions. Tables are not deprecated, only layout tables.

What is the use of the ADDRESS element type?
To mark up contact information for the page (or for a part of a page). This can be a postal address, an email address, a telephone number, or virtually anything. ADDRESS is a block-level element which can only contain text and inline elements. The default rendering is italic in most browsers, but that can easily be changed with CSS.

A common misconception is that ADDRESS is meant to be used for marking up any postal address, but that is not the case.

What is the use of the DFN element type?
To mark up the ‘defining instance’ of a term. It is a typographic convention, especially in scientific documents, that the first time a new term – with which the reader cannot be expected to be familiar – appears in the text, it is italicised. The default rendering of DFN is thus italic.

A common misconception is that DFN means ‘definition’, and many authors use it in the same what that they use ABBR or ACRONYM: using the title attribute to provide an explanation of the term. A certain term should only be marked up with DFN once in a document (where it is first used and explained).

What is the use of the VAR element type?
To mark up a variable, or placeholder, part of an example. It is a typographic convention to italicise such variables, which will be replaced by actual data in real life use. For instance, in a telephone system manual, the instruction for relaying incoming calls to another extension could look something like this:

<kbd>* 21 * <var>extension</var> #</kbd>

Here, a VAR element is used to mark up ‘extension’ (which will be italic by default). Someone trying to program the telephone system to relay his incoming calls to extension 942 would type ‘*21*942#’. Thus the VAR element indicates that you shouldn’t actually type ‘e-x-t-e-n-s-i-o-n’, but enter the actual extension number instead. The word ‘extension’ is a variable.

A common misconception is that VAR should be used for marking up variables in programming code samples.

Should I use quotation marks within or around a Q element?
No, the specification clearly says that it is the responsibility of the user agent to add quotation marks to inline quotations. Unfortunately, Internet Explorer 6 and older do not comply with the specification and will not add quotation marks. An option is to insert those with JavaScript for IE, and use some special styling with CSS to distinguish quotations for IE users with JavaScript disabled. Some CSS-only solutions have been proposed, but they will fail in non-CSS browsers like Lynx.

What is the difference between ABBR and ACRONYM?
No one really seems to know. Even the HTML specification is contradicting itself.

ABBR was a Netscape extension to HTML during the browser wars. ACRONYM was Microsoft’s extension. Both meant the same thing, more or less. Both element types were incorporated into the HTML specification, with different semantics. The problem is that no one seems to be able to explain what those semantics are.

Let us look at a couple of dictionary definitions, then:

ab·bre·vi·a·tion [uh-bree-vee-'ey-shuhn]
n. A shortened or contracted form of a word or phrase, used to represent the whole.

ac·ro·nym ['ak-ruh-nim]
n. A word formed from the initial letters or groups of letters of words in a set phrase or series of words.

The definition for acronym says that it is a word, i.e., it can be pronounced. Thus, NATO would be an acronym, formed from the initial letters in the phrase North Atlantic Treaty Organization. FBI, however, would not be an acronym according to the dictionary definition, because it is not pronounced as a word, but rather spelled out (eff bee eye). And this is where the problems begin. FBI is technically known as an initialism, about which the dictionary has the following to say:

in·i·tial·ism [i-'nish-uh-liz-uhm]
n. 1. A name or term formed from the initial letters of a group of words and pronounced as a separate word.
2. A set of initials representing a name, organization, or the like, with each letter pronounced separately.

The first definition is almost the same as for acronym, but the second is more relaxed. But there is no INITIALISM element type in HTML, and the confusion is exacerbated by the fact that ‘acronym’ in normal American parlance is used as a synonym for ‘initialism’.

The HTML specification has the following definitions:
ABBR: Indicates an abbreviated form (e.g., WWW, HTTP, URI, Mass., etc.).
ACRONYM: Indicates an acronym (e.g., WAC, radar, etc.).

So far it looks like it is adhering to the dictionary definitions, which means that FBI should be marked up with ABBR since it’s not pronounceable as a word. However, a few paragraphs further down, the specification says,

Western languages make extensive use of acronyms such as “GmbH”, “NATO”, and “F.B.I.”, as well as abbreviations like “M.”, “Inc.”, “et al.”, “etc.”.

Are you confused yet? I am. The safe thing to do then, should be to always use ABBR, since all acronyms are abbreviations, but not vice versa. Aaahh … ahem … there’s a slight problem with that. Microsoft were so miffed when the W3C decided(?) to use ABBR for abbreviations and initialisms instead of their ACRONYM, that they actually refused to support ABBR! They’ve finally promised to support ABBR in IE7, only eight years after HTML4 became a recommendation, but there will still be millions of users with older IE versions out there for many years to come.

So what is a poor web author to do? Why should we even bother? It might be nice to have an element to attach a title attribute to, but we could use SPAN for that. The idea, allegedly, is that marking up abbreviations and acronyms would be beneficial for assistive technologies; especially screen readers. But screen readers tend to ignore ABBR and ACRONYM, since no one knows how to use them properly and Microsoft doesn’t support ABBR. Catch-22.

The answer to this frequently asked question is: I don’t know! I personally use ABBR for obvious abbreviations like ‘Inc.’ and for initialisms like ‘FBI’, and I use ACRONYM for things that can be pronounced as words, like ‘GIF’. But due to the ambiguity of the specification, I cannot fault anyone for marking up ‘FBI’ as an acronym (although ‘Inc.’ certainly is not). And what about ‘SQL’, which some spell out and others pronounce as ‘sequel’? (I would use ABBR.)

Why is <feature X> deprecated?
The most common ‘feature’ that beginners ask about is the target attribute for links. This is deprecated (disapproved) in HTML 4.01 Strict, but it’s still valid in HTML 4.01 Transitional. Many other element types and attributes that are allowed in Transitional are removed from Strict.

The reason for deprecating those things is that the W3C want to promote the separation between content (HTML), presentation (CSS) and behaviour (JavaScript). Making an element centred within the viewport is a presentational issue, thus it should be handled with CSS instead of a CENTER element. Opening a new browser window is a behavioural issue, thus it should be handled with JavaScript instead of a target attribute.

The deprecated features are those that arose during the browser war era of the late 1990s, when browser vendors were competing by adding various extensions to HTML to make it into some sort of page layout language. They were included in HTML 3.2 to bring some sort of order to the chaos, but this is not what HTML was intended for. When HTML4 came out, the authors tried to ‘reclaim the Web’ by deprecating what they saw as ‘harmful’ parts of HTML 3.2, at least in the Strict DTD.

In other words, things are deprecated for a reason. Don’t use them unless you absolutely have to.

Must I have an ALT attribute for every image?
Yes, the alt attribute is required for the IMG element type. Why? Because not all users can perceive images, and because not all user agents can understand or display images. Examples:

A person who is blind or has very low vision cannot see an image. A screen reader cannot describe an image.
Users with slow connections (dial-up or mobile) sometimes disable images for faster surfing.
Text browsers like Lynx do not support images.
Search engine 'bots cannot understand images.

Thus we have to provide a text equivalent for each image, using the alt attribute. This should not describe the image; it should convey the equivalent information. Writing good text equivalents is not easy, and it takes a lot of practice. The text equivalent is displayed instead of the image.

So what is a good text equivalent for a given image? That depends on the context in which the image is used! It’s not like there is a single ‘perfect’ text equivalent for each image. Let us look at an example: say we have an image of a grazing cow. This particular cow happens to be an Aberdeen Angus. Let us then consider a few use cases for this image.

In the first case, this image is used as a generic illustration for an article about beef cattle farming in Scotland. The actual cow isn’t germane to the article; it’s just an illustration, a decorative design element that draws the reader’s eye and relieves the monotony of the text. In this case, the image doesn’t convey any relevant information. Therefore it should have a null (empty) text equivalent: alt="".

In the second case, the image is used on a children’s website about farm animals. The page shows pictures of various animals: a horse, a sheep, a pig, a cow, etc. Next to each image is a block of text that presents some facts about each species. In this case, alt="Cow:" could be appropriate. It’s not important that it’s an Aberdeen Angus; it represents bovine quadrupeds in general.

In the third case, the image is used on a site about different breeds of cattle. Here it is used to illustrate what an Aberdeen Angus looks like, and how it is different from other breeds. The page comprises a number of images, each with a caption that identifies the breed, but no other textual information. In this case, the text equivalent should describe the particular attributes and traits that are specific to an Aberdeen Angus: the robust build, the massive chest, the relatively short legs, the buffalo-like hump behind the head, etc.

In the fourth case, the image is used on a photographer’s portfolio page. It’s one image among several others, with very different motifs. This is one of the few cases where the alt attribute might actually include a description of the image itself, e.g., ‘A black Aberdeen Angus grazing in the sunshine with Ben Nevis in the background.’

As we can see, the appropriate text equivalent depends on the context. Sometimes (often, actually) it should be null, because the image doesn’t convey any information that isn’t available in the accompanying text. Some claim that such images should be background images specified via CSS, but there are many cases where that is impractical and where the image is really part of the content – even though it doesn’t convey any useful information to those who cannot see it.

For images containing text, the text equivalent should of course be the same text as in the image. For things like pie charts, the text equivalent should convey information about the percentages: the same information as the image conveys.

The alt text shouldn’t be too long. Some browsers don’t word wrap them, and they cannot be formatted in any way. If we need a longer text equivalent, we should put it somewhere else and link to it via the longdesc attribute.

Internet Explorer and old Netscape browsers display the alt attribute in a tool-tip when the user hovers the mouse pointer over the image. This is wrong. We should use the title attribute for ‘tool-tip’ information. To suppress the tool-tip for alt texts, we can use an empty title: title="" (at least in IE).

What is the difference between CLASS and ID?
An ID uniquely identifies a particular element in an HTML document. It’s like a social security number, providing a unique handle for that element. Just as two people cannot have the same SS#, no two elements in a document can have the same ID. IDs must be unique within the page.

A class says that an element has some traits which it (possibly) shares with other elements. An element can belong to more than one class. An analogy could be professions: a person could be both a carpenter and a nurse, and there are many carpenters and many nurses. (They all have unique social security numbers, though.)

Both IDs and classes are mainly used with CSS and/or JavaScript. In CSS, an ID has higher specificity than a class, making it easy to specify special rules for a specific element. With JavaScript we can look up an element using its ID (document.getElementById()).

We assign IDs to page elements that can occur at most once per page, like a navigation menu, a footer, a sidebar, etc. We can also assign IDs to specific elements that only occur once in the whole site, like a specific image, if we want to have certain CSS rules for it or manipulate it with JavaScript.

We assign classes to elements that share some common traits, usually display properties via CSS rules.

IDs and class names should be as ‘semantic’ as possible. They should describe what something is, not what it looks like. Thus, id="menu" is much better than id="left"; especially if we redesign and move the menu to the right-hand side.

IDs and class names are case sensitive, even in HTML. We shouldn’t rely on that, though – ie., we should not have names that differ only in case.

Why doesn’t id="123" work?
Values for the id, name and class attributes must start with a letter (A-Z or a-z).

Why doesn’t href=My Cool Page.html work?
There are two reasons in this case.

Attribute values that contain characters other than letters, digits and a few others must be enclosed in quotation marks (double or single). Any attribute value that needs to contain a space, for instance, must be quoted. The easiest and safest solution is to always quote attribute values. To include quotation marks in a quoted value, we can either use the ‘other’ quotation mark to enclose the value (alt='My "new" car', alt="Jane's car") or use an entity or reference (alt="My "new" car", alt='Jane&#39;s car'). (Note that the ' entity cannot be used with HTML.)

The second reason is the spaces in the URI. These need to be encoded: href="My%20Cool%20Page.html". ‘%20’ means ‘a character with code point 20(hex)’. 20(hex) – 32(decimal) – is the code point for SPACE (U+0020). This applies to URIs only, not attribute values in general.

How can I include an HTML page in another HTML page?
With a Strict DTD, there is only one valid option: the OBJECT element type:

<object type="text/html" href="http://example.com/foo.html">
  alternate content here
</object>

Unfortunately, support for OBJECT is all but non-existent in Internet Explorer.

With a Transitional DTD, we can also use the IFRAME element type:

<iframe src="http://example.com/foo.html">
  alternate content here
</iframe>

A much better way is to handle inclusion on the server-side. Server-side includes (SSI) is the simplest way to include a file into another, as long as they are from the same domain:

<!--#include virtual="/foo.shtml"-->

(This cannot be used to include a complete HTML document into another, though; only fragments.)

With other server-side technologies you can do more advanced things. Your web server must support those technologies, of course. Shared servers with free hosting often don’t provide any such technologies – not even SSI.

Using JavaScript to ‘include’ things like navigation menus is not a very brilliant idea. That means visitors without JavaScript support won’t be able to navigate your site. In that case, using a FRAMESET might be the lesser evil.

zcorpan · October 9, 2006, 4:47pm

Awesome Tommy. :tup: Here are my remarks:

Actually, from an SGML point of view you omit the attribute name but keep the value, e.g. <input radio> is equivalent to <input type=“radio”> (though browsers only support this feature for boolean attributes).

The above is true, but the FPI is actually also a reference to the DTD, just like the system identifier.

Spec-wise it is undefined how invalid markup should be handled. Both SGML and HTML 4.01 generally don’t define error handling at all. An HTML4 or SGML UA that aborts processing (or crashes, or whatever) at a markup error for which there is no defined error handling would still be a conforming HTML4 or SGML UA, for instance.

According to SGML it only has to be esacaped when it is followed by a name start character. Also, for CDATA elements (SCRIPT, STYLE) it can’t be escaped.

xhtmlcoder · October 9, 2006, 6:05pm

Hmmm.

[quote=““Mr T.””]
…the range of code points be[t]ween 128 and 159…
[/quote]

AutisticCuckoo · October 10, 2006, 5:56am

Somehow I knew you’d have some…

Thanks! :tup: I didn’t know that. I don’t know much about the nitty-gritty details of SGML. I’ve updated the FAQ.

Of course (I even said so), but it’s not a URI; it doesn’t tell the Average Joe where to find the DTD.

That is correct. However, the spec has a non-normative note on recommended error handling for invalid documents. I don’t know of any HTML browser that doesn’t attempt to recover from markup errors.

I didn’t want to add too many special cases, since the post was long enough anyway. Also, I happen to think it’s much easier to always escape ampersands, add semicolons to entities, insert all end tags, and quote attribute values, than to try to remember when I can get away with a shortcut.

That’s not a problem, since the FAQ clearly says that style sheets and scripts should be external.

CDATA is a complex topic in HTML, because it means different things in different context. In CDATA attribute values, for instance, entities are parsed, but not in CDATA element content. You cannot escape ampersands in CDATA elements because you don’t need to; they don’t have their special meaning there.

Fixed. But if I got away with a single typo in a post that long, I’ll be quite satisfied.

I don’t know, exactly. A couple of hours or so.

system · October 15, 2006, 4:58am

Tommy. You must have a massive hard on for all the acolades . You have done a fabulous job and I thank you also.
The section concerning "validation of mark and spell checking “concerned me a little.
We are all aware of using the spell check, some are great and others totally useless depending on content but,you say " invalid mark-up can confuse browsers, search engines and other user agents.” The results you named and thanks.
Y’know I’m English. The way i spell certain words often varies considerably between English(UK) and English (USA). Do the search engines and browsers take this into consideration or to I need to start writing like a ‘Yank’. I ask because most search engines and browsers are predominantly American are they not?
the baldchemist

AutisticCuckoo · October 15, 2006, 6:22am

I used spell checking as an analogy, because for most professionals it is natural to make sure that there are no embarrassing typos before you publish anything. For some reason, web professionals don’t seem to think it’s equally natural to ‘spell check’ the markup.

Whether you use English or American spelling is up to you. The search engines as such don’t care. They don’t ‘understand’ what they are indexing. What it is about is the people using the search engines to find information. If a majority of them search for ‘tires’ and you have written ‘tyres’, your page may not be found.

If you want to reach the greatest number of people, you can try to use both spellings in your document. Either as in ‘we sell quality tires (tyres) at low prices’, or by using content negotiation to serve the page with UK or US spelling depending on the visitor’s Accept-Language header.

mattalexx · October 21, 2006, 9:16pm

Wow, that was great! I never knew the difference between and and between and .

Here’s a question: if you want to completely separate presentation and content, what would the appropriate way of making a piece of text bold in the midde of a paragraph where the semantic would not be appropriate? Adding a would be purely presentational, therefore it would not work in this case-- has the same overall purpose as something like , right?.

gnarly · October 21, 2006, 9:49pm

No. There are certain typographical conventions that require that text be bolded or italicised without it carrying any additional semantic meaning. A frequently-cited example is that people tend to italicise the names of ships. Now, you could either The Mary Rose or The Mary Rose - but only one of those will remain italicised when CSS in unavailable.

Tommy can probably word this better than I have

AutisticCuckoo · October 22, 2006, 8:16am

Olly explained this very well, I think.

STRONG implies strong semantic emphasis, not boldfacing.

B is suitable if there is a typographic convention to fall back upon, i.e., if the piece of text should stand out in some way even in non-CSS browsers.

If neither of those conditions apply, use a CLASS and define the styling with CSS.

astariza · October 22, 2006, 9:32am

As always, AutisticCuckoo, provides useful insight about the technologies we use every day but are not REALLY sure about the hows and whys…

Thank you AutisticCuckoo once again…

P.S. I only wished you didn’t have to “shut down” your personal website. Your articles there were pretty fascinating…nice to continue providing for the rest of us in your personal time!

gollux · January 30, 2007, 6:16am

Great usefull information! Only one more item to add to the FAQ.

I’m such a newb that I think my html code is so awsome that I want to hide it so nobody else on the planet can rip off what they could do better from scratch in 15 minutes. How do I do that?

AutisticCuckoo · January 30, 2007, 6:33am

LOL

You are right, ‘how do I stop others from viewing my markup’ is a fairly common question for beginners. They’ve toiled with their design for days and don’t want anyone to copy their singularly brilliant HTML.

I know you meant this in jest, but I think putting the answer into this FAQ may be a good idea.

The answer, of course, is: you can’t. The only way to protect markup, content, style sheets, JavaScript code, images, etc. is to not publish it on the Internet in the first place.

A browser needs to ‘see’ the HTML, CSS and so on in order to display the page. If the browser can see it, so can people.

There are various ways to make it slightly more inconvenient to see the code, by obfuscation and a few other tricks, but all they do is slow thieves down for a few seconds. There is no cross-browser way of disabling ‘view source’, and there shouldn’t be.

Content (text, images, etc.) is automatically copyrighted as soon as it is created. That doesn’t prevent theft, but it gives you legal leverage to go after the thieves and seek compensation.

gollux · January 30, 2007, 7:01am

Hello Autistic Cuckoo,

Chris Pederick’s Developer Toolbar for FireFox has this feature “View Generated Source”. Enough said…

Topic		Replies	Views
HTML 4.01 Transitional DTD is the answer? HTML & CSS	147	9392	July 26, 2010
<p> vs <span> HTML & CSS	102	43802	August 7, 2009
What DOCTYPE! to use for my site Get Started	47	3243	December 23, 2011
<strong> or <b> HTML & CSS	48	3282	November 14, 2011
If I code in strict XHTML will it validate in transitional doctype? HTML & CSS	59	3774	November 18, 2010

Frequently Asked Questions about HTML

Related topics