Today’s web publishing technology has come a long way from its US ASCII origins – these days, the typographic punctuation enjoyed for over four centuries by its venerable print publishing cousin is fully supported for use in our web documents.
Beautiful typography does require some additional configuration on the server side and a bit more attention by the document author – but in this article, I hope to convince you that it’s time well spent. Rich punctuation is a tool that, when used properly, will make your web site’s text come alive!
Considering “Proper” Punctuation
Looking around the Web, it’s clear that using and delivering proper punctuation is a basic problem for many authors of otherwise slick, professional web sites, including SitePoint until only recently. (Ahem, better late than never – Ed.) Although the QWERTY keyboard has much the same limitations as it ever did, the advent of Unicode and widespread browser support for it mean that there’s really no excuse these days for online content bristling with unattractive straight quotes and hyphens used as textual dashes. Most web browsers and many publishing tools already support rich punctuation. It’s just a matter of getting started!
There are myriad style guides and local variations for punctuation. A good-quality style guide, or even one found on the Web, will provide a good starting point when expanding your punctuation repertoire. Reading a print magazine or book can also provide a lot of inspiration and exemplify the uses of rich punctuation marks.
Here are some typographic characters that, when used properly, will raise the quality of an article.
|Code point||Character||Name||Use|
|2026||…||ELLIPSIS||deliberate omissions of words or characters|
|2010||‐||HYPHEN||word contraction and hyphenation|
|2013||–||EN DASH||relationships and closed ranges|
|2014||—||EM DASH||open ranges and abrupt interruptions|
|2019||’||APOSTROPHE||omissions and possessive identification|
|201C/1D||“ and ”||QUOTATION||enclosing contextual quotes|
|2018/19||‘ and ’||2nd LEVEL QUOTATION||second level enclosed quotes|
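A code point is easy to mistype, so it can help to check each one against its official Unicode name. The following is a minimal Python sketch using the standard `unicodedata` module; Python is my choice for illustration rather than anything this article depends on, and `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(char: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


# The punctuation marks discussed in this article, written as escapes so
# the script does not depend on the encoding of its own source file.
for char in "\u2026\u2010\u2013\u2014\u2018\u2019\u201C\u201D":
    print(describe(char))
```

Note that the Unicode Standard has no separate "curly apostrophe": U+2019 is officially named RIGHT SINGLE QUOTATION MARK and does double duty as the apostrophe.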
As an example, compare the two following excerpts, courtesy of Lewis Carroll – see the difference between straight and gracefully curly quotes, and two hyphens versus an elegant em dash.
See what I’m talking about? Now imagine this in an attractive font on your thoughtfully produced classic children’s fiction web site!
Producing Non-keyboard Characters
The common problem with many punctuation marks is that they are simply not present on regular QWERTY keyboards. These keyboards were designed by and for computer programmers, and not even when typographers started working with computers did the keyboards get a real overhaul.
The most common misapprehension, though, surrounds several heavily used characters that are present on the keyboard: the straight or “dumb” double quote ("), the straight apostrophe ('), and the hyphen or minus (-). Although these glyphs look like punctuation marks on QWERTY keyboards, they were originally intended only for use in computer programming, not in text documents. And though some text editor software, like WordPress Rich Text and OpenOffice.org, tries to convert these programming glyphs to punctuation marks on the fly, this usually creates more wrongs than rights.
To produce the proper “curly” quotation marks, the apostrophe, the ellipsis, the hyphen, as well as the en and em dashes absent from the keyboard, some work is required on the part of the author. I’ve listed a few options below:
Utilize Keyboard Shortcuts
On many operating systems, keyboard shortcuts for producing many Unicode characters are built in.
On a Mac running OS X, for example, these shortcuts take the form Option + <key> and Shift + Option + <key>. The 68kMLA wiki contains a list of all of these keyboard shortcuts; here’s a sampling:
|Character||Keyboard Shortcut (OS X)|
|…||Option + ;|
|–||Option + -|
|—||Option + Shift + -|
|‘||Option + ]|
|’||Option + Shift + ]|
|“||Option + [|
|”||Option + Shift + [|
In many distributions of Linux, the keyboard shortcuts listed above are also valid; just use the ALT key in place of the Option key.
Windows users can also input these characters by using a keyboard shortcut, but there are a few caveats. For one, it requires that you know the numeric code for the character you wish to enter (not quite as intuitive to remember as the approach taken on the Mac, and more time-consuming). Secondly, the characters produced depend on the application you have open and on your language settings.
Entering Unicode characters in Windows can be done with the following key sequence:
- Hold down the ALT key.
- Type 0 on the number pad.
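- Type the decimal code of the character, again on the number pad. With the leading 0, Windows reads this as a position in the system’s code page (Windows-1252 on Western European systems), which for these punctuation marks is not the same number as the Unicode code point.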
- Release the ALT key.
Here are the key sequences for entering the commonly used characters on which we’ve focused for this article. In each case, the ALT key should be held down while typing the digits that follow it:
|Character||Keyboard Shortcut (Windows)|
|…||ALT + 0133|
|–||ALT + 0150|
|—||ALT + 0151|
|‘||ALT + 0145|
|’||ALT + 0146|
|“||ALT + 0147|
|”||ALT + 0148|
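On a Western European Windows system, the numbers in this table correspond to positions in the Windows-1252 code page. A small Python sketch can make the mapping to Unicode visible by decoding each number as a single Windows-1252 byte. This is my own illustration: `alt_code_to_char` is a hypothetical helper, and the code page is an assumption that will not hold for every language setting.

```python
def alt_code_to_char(code: int) -> str:
    """Decode a decimal ALT code (0-255) as one Windows-1252 byte."""
    return bytes([code]).decode("cp1252")


# The seven codes from the table, with the Unicode code point each yields.
for code in (133, 150, 151, 145, 146, 147, 148):
    char = alt_code_to_char(code)
    print(f"ALT + 0{code} -> U+{ord(char):04X}")
```

Running the loop shows, for instance, that ALT + 0151 lands on U+2014, the em dash.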
There are a few alternatives to this approach – many Microsoft applications, for example, also support the ALT-X keyboard shortcut for converting numbers to Unicode characters. However, as this approach requires the author to know the hexadecimal number for the character, and is specific to Microsoft applications, I consider it to be less universally useful, though it may suit your workflow. (In fact, this technique also works in OS X, by holding down the Option key, but unless you speak fluent hexadecimal, I’d recommend sticking with the regular shortcut keys.)
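If you do want to try the hexadecimal route, the number you need is simply the character's code point written in base 16. As a rough sketch (the helper name `hex_code` is hypothetical), Python can look it up for any character you paste in:

```python
def hex_code(char: str) -> str:
    """Return the four-digit (or longer) hexadecimal code point of a character."""
    return f"{ord(char):04X}"


# Print the hexadecimal code for an ellipsis, en dash and em dash.
for char in "\u2026\u2013\u2014":
    print(hex_code(char))
```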
Read more about entering Unicode characters on Windows at FileFormat.info.
Create Your Own Keyboard Shortcuts
Another great method is to manually assign characters to keys on your keyboard. For example, you may want to create a keyboard shortcut that produces an ellipsis when you hold down the right side Alt/Option and press the . key.
Depending on the operating system, it’s generally easy to modify keyboard layouts to assign keys different meanings. Refer to your particular operating system’s documentation or search for tutorials on the Web for full details on how to achieve this.
At first, you may find using dedicated keys slow when you have to stop and think through the necessary key combinations, but over time it will become second nature to rattle off these keystrokes.
Use a Third-party Application
Outside the browser, numerous shareware and freeware applications also exist for assisting in the entry of Unicode characters. One lightweight (and free) utility that I’ve tried on Windows is simply called UnicodeInput. It allows for entry of Unicode characters using a customizable keystroke.
Use a Cheat Sheet
If your data entry is restricted to a web browser, and you are reluctant (or unable) to install an application such as UnicodeInput, you might consider using a cheat sheet – a plugin reference that contains the desired characters, delivering an easy copy-and-paste workflow.
This is obviously a more time-consuming approach compared with having dedicated keys. A more convenient version of the cheat sheet approach involves web authors taking advantage of their browser’s sidepanel. There are already many browser panels for Opera, Firefox, and Internet Explorer: Edicode offers quick access to rich punctuation via the Latin blocks and the General Punctuation block.
On the Mac, this type of cheat sheet is built into the operating system – but only if your application is a native Cocoa app. You can determine whether this is the case by clicking the Edit menu. If you have a Special Characters menu option (keyboard shortcut: Option + Command + T) then you’re in business!
Of course, for the one-off lookup, there are plenty of online references including this one that lists the first 65,536 Unicode characters and the numbers that represent them.
Accessing More Characters
Whichever method you adopt to access non-keyboard characters, remember that there are many more characters and punctuation marks available than the most-used ones we have discussed. There are dedicated symbols such as smiley faces, degree Celsius, fractions, Roman numerals, mathematical operators, primes, and the minus sign. For a full overview of available symbols, explore the General Punctuation block and other Latin language blocks from the Unicode Standard.
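When you know what a symbol is called but not where it lives, a lookup by name can be quicker than scanning code charts. Here is a minimal Python sketch using the standard `unicodedata` module; the names in the list are official Unicode character names, while `find` is a hypothetical helper introduced for illustration.

```python
import unicodedata


def find(name: str) -> str:
    """Return the character with the given official Unicode name."""
    return unicodedata.lookup(name)


NAMES = [
    "WHITE SMILING FACE",
    "DEGREE CELSIUS",
    "VULGAR FRACTION ONE HALF",
    "ROMAN NUMERAL FOUR",
    "PRIME",
    "MINUS SIGN",
]

# Print each character's code point next to its name.
for name in NAMES:
    print(f"U+{ord(find(name)):04X} {name}")
```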
That said, the availability of some typographic punctuation marks and characters is limited to font faces with broad support for the Unicode Standard. Whenever using non-typical characters, the designer must always check whether the font faces used support the desired characters. Cascading Style Sheets (CSS) can specify stacks of font faces, thus ensuring broad support across platforms and devices. Font stacks should contain a generic fallback type as well as fonts from different operating systems.
But wait – that’s not all!
Delivering Web Documents in Unicode
Accessing non-keyboard characters and using key combinations at lightning-fast speed isn’t quite the end of the story, unfortunately. Your web document not only has to be encoded in a Unicode encoding, but also identified as a Unicode document.
By default, most web browsers still assume an older, single-byte document encoding (typically ISO-8859-1 or Windows-1252, both supersets of ASCII) rather than Unicode. This default must be overridden by specifying the encoding used through an HTTP header in the response given by the server to the browser’s request for the document. Refer to your particular server’s documentation for full details on how to achieve this. There’s also an excellent article by Tommy Olsson that deals with web character encoding – in fact, it’s helpful to read that article as a companion to this one.
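To see why the declared encoding matters, the sketch below encodes a few marks as UTF-8 and then decodes the same bytes as Windows-1252, standing in for a browser that has guessed a legacy encoding. Both the choice of Windows-1252 and the helper name `misread` are assumptions made for illustration.

```python
def misread(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong encoding."""
    # errors="replace" covers the few byte values Windows-1252 leaves undefined.
    return text.encode("utf-8").decode(wrong_encoding, errors="replace")


# Show the three UTF-8 bytes of each mark and what a misreading produces.
for char in "\u2013\u2014\u2026":
    raw = char.encode("utf-8")
    print(f"U+{ord(char):04X} -> {raw.hex(' ')} -> {misread(char)!r}")
```

Each mark occupies three bytes in UTF-8, so a misreading turns one character into three unrelated ones. Plain ASCII text survives untouched, which is why the problem often goes unnoticed until the first curly quote appears.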
The UTF-8 encoding will suffice for all Latin-based languages, as it will create small files and support the whole Unicode Standard. HTML files and all kinds of XML files, including Atom web feeds, sitemap indexes, and XHTML, can be encoded using UTF-8.
Apache servers need just a little more persuasion to serve web documents as UTF-8. To achieve a site-wide change, the following can be added to a configuration file called .htaccess at the web-root directory:
AddCharset UTF-8 .atom .htm .html .xht .xhtml .xml
Files with any of the extensions in this space-separated list, all standard formats commonly found on a web server, will then be served using UTF-8 encoding. The same can be achieved on a per-document basis when using server-side coding, using this PHP code snippet:
<?php header("Content-Type: text/html;charset=UTF-8"); ?>
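Take extra-special care to get these codes right, as a misspelled header or charset name is likely to be silently ignored! (Header names and charset names are not, strictly speaking, case-sensitive, but sticking to one consistent spelling avoids surprises.)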
Other server-side languages have very similar approaches to modifying HTTP headers. XML files require encoding information in the XML declarations (<?xml [...] encoding="UTF-8"?>). Though it may not affect the actual document parsing, it’s good practice to always include encoding information in HTML files as well.
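As one example of another server-side language, here is how the same header might be set in Python. This is a sketch rather than a recommended setup: it uses the standard WSGI calling convention, and the application name `app` and the sample markup are my own placeholders.

```python
def app(environ, start_response):
    """Minimal WSGI application that declares its body as UTF-8."""
    # Encode the body explicitly so the bytes match the declared charset.
    body = "<p>\u201cCurly quotes\u201d served as UTF-8.</p>".encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the application can be passed to `wsgiref.simple_server.make_server` from the standard library. The important design point is that the text is encoded with the same encoding the header announces.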
The following HTML code snippet must be the very first child of the <head/> element to have any effect. Again, take care to spell the charset name correctly!
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/>
XHTML files use both the encoding specified in the XML declaration and the HTML method, for backwards compatibility reasons. Both are only fallback mechanisms for when the server fails to deliver the appropriate HTTP header, as described above.
Escaping Parsing Problems
Escaping characters is a method of writing out characters as a code sequence in the markup, and performs the same function in HTML and XML alike. When presented, the escaped characters will appear as the actual characters.
Including the actual character glyph itself in the web document, instead of escaping it, is a wiser approach for non-typical characters – just to be on the safe side. Moreover, with proper document encoding and delivery, the practice of escaping everything only causes the document to become significantly larger in size than it has to be.
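The size difference is easy to measure. The following Python sketch compares an em dash stored directly as UTF-8 with two escaped spellings of the same character; `utf8_size` is a hypothetical helper, and the em dash is just a convenient sample.

```python
import html


def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


literal = "\u2014"      # the character itself
numeric = "&#8212;"     # decimal character reference
named = "&mdash;"       # named entity

for form in (literal, numeric, named):
    # html.unescape shows that all three spellings present the same character.
    print(f"{utf8_size(form)} bytes, presented as U+{ord(html.unescape(form)):04X}")
```

The literal character takes three bytes, while either escaped form takes seven, so a document that escapes every quote and dash grows accordingly.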
That said, there are four particular Unicode Standard characters which must always be escaped. These characters, shown in the table below, hold special meanings in the HTML and XML markup languages, and escaping them avoids potential parsing and rendering problems.
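Most languages ship a helper for this kind of escaping. As an illustration only, Python's standard `html.escape` function escapes the markup-significant characters `<`, `>` and `&`, plus both straight quote characters by default. That set is Python's own and may not match this article's list exactly; `safe_text` is a hypothetical wrapper name.

```python
import html


def safe_text(text: str) -> str:
    """Escape markup-significant characters, leaving other characters as-is."""
    return html.escape(text)


print(safe_text('<p class="x">Tom & Jerry</p>'))
print(safe_text("\u201cCurly quotes pass through untouched\u201d"))
```

Typographic punctuation is left alone, which fits the advice above: escape only what the parser would otherwise misinterpret.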
Negotiating Search Engines and Incompatible Devices
When working with any web document, authors have one crucial decision to make at almost every stage of development: favor the reader or the search engines?
As we’ve seen, a richer repertoire of punctuation marks will undoubtedly give visitors to your site a much better reading experience. But there’s one unfortunate downside – to some degree, it can give search engines a harder time understanding the non-typical punctuation in the document. We can live in certain hope that, in the not-too-distant future, the widespread uptake of rich punctuation will increase search engines’ understanding of a document instead of entailing a risk of decreasing it.
Unicode-incompatible devices and search engines may require an alternate version of your page, where everything is automatically mapped against ASCII/ANSI. This practice, however, is becoming more and more redundant as handheld devices and search engines smarten up. And web sites that offer syndication through Atom web feeds can easily work around the problem.
For example, when constructing the feed, the publishing tool should replace Unicode characters in the text with their almost-equivalent keyboard character or character pair. The same goes for mobile or hand-held versions of the document, as these devices tend not to fully support the Unicode Standard. So the hyphen (U+2010) would be replaced with the hyphen-minus (U+002D), the en dash (U+2013) with two hyphen-minuses (U+002D U+002D), the curly apostrophe (U+2019) with the straight apostrophe (U+0027), and so on. Once these characters are taken care of, the feed will contain only ASCII/ANSI letters and symbols.
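A replacement step like this one can be sketched with a simple translation table. In the Python example below, the first three mappings come straight from the paragraph above; the remaining ones are my own assumptions about what "and so on" might cover, and the names `ASCII_FALLBACKS` and `to_basic_punctuation` are hypothetical.

```python
ASCII_FALLBACKS = {
    # Mappings named in the article:
    "\u2010": "-",    # hyphen -> hyphen-minus
    "\u2013": "--",   # en dash -> two hyphen-minuses
    "\u2019": "'",    # curly apostrophe -> straight apostrophe
    # Assumed extensions, not specified by the article:
    "\u2018": "'",    # left single quote
    "\u201C": '"',    # left double quote
    "\u201D": '"',    # right double quote
    "\u2014": "---",  # em dash
    "\u2026": "...",  # ellipsis
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(ASCII_FALLBACKS)


def to_basic_punctuation(text: str) -> str:
    """Replace rich punctuation with keyboard-character equivalents."""
    return text.translate(_TABLE)


print(to_basic_punctuation("It\u2019s open 1990\u20131995\u2026"))
```

Keeping the mappings in one table makes it easy to adjust them to a house style without touching the replacement logic.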
Offer the web feed simultaneously with the web version, but with only basic punctuation. Then search engines will find the <link/> between the two published formats and treat them as the one document. As a side effect, the click-through rate from web feeds may also increase as readers click through to the web version of the document for a better reading experience.
Web designers the world over have run up against the issue of characters in online content – generally letting straight quotes hold sway, detracting from fine web design. As we’ve seen though, in this day and age there actually isn’t very much to prevent us from ensuring our typography enhances our sleek content presentation in the best way it can. It’s time for a turning point in the way designers present our text, now that we have a choice in the matter; in fact, you’ll notice that we at SitePoint have finally put this preaching into practice. Three weeks ago, we took the plunge and embraced the offerings of today’s technology for our articles – and we’d never go back!
The first problem designers run into when using rich punctuation is the limitations of the input method: the keyboard. But, as we’ve seen, there are several decent and quite trivial solutions to circumventing this limitation. The second problem that trips us up is the document delivery. It is necessary to carefully declare the encoding used in the document through minor server-side changes.
With these obstacles out of the way, there is no reason to hold back on the punctuation – so embrace Unicode and give your readers a richer typographic experience!