Byte-Order Mark found in UTF-8 File

I just got the following warning when trying to validate my website:

Byte-Order Mark found in UTF-8 File.

The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.

This also seems to be the cause of the funny characters I see in the top-left corner of the browser as the page loads.

Can someone tell me how to remove the BOM? I don't have Microsoft Expression, only Dreamweaver.

Is there any other way to fix this error?

Thank you

Was it made from scratch in DW? When you save it out, there should be an option to include or not include a BOM. You could paste the content into a new file and try resaving it.

Most text editors have options in the save dialog related to character encoding and the BOM. Look for an option to save “without BOM”, “without Byte Order Mark”, or “without Unicode signature.”
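
If the editor doesn't offer such an option, the BOM can also be stripped programmatically. A minimal Python sketch (the helper names are mine, not from any particular tool):

```python
# Remove a leading UTF-8 BOM (bytes EF BB BF) from a file, if present.
UTF8_BOM = b"\xef\xbb\xbf"

def strip_bom(data: bytes) -> bytes:
    """Return data with a leading UTF-8 BOM removed, if any."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data

def strip_bom_file(path: str) -> bool:
    """Rewrite the file at path without its BOM; return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_bom(data)
    if len(cleaned) != len(data):
        with open(path, "wb") as f:
            f.write(cleaned)
        return True
    return False
```

Working on raw bytes (mode `"rb"`) matters here: opening the file in text mode with a UTF-8-aware editor or library may hide the BOM entirely.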

I'd never noticed the checkbox in the save window for “Include Unicode Signature (BOM)” in Dreamweaver!

Thank you, I will always check that now. Not sure how it got ticked; probably a slip of the keyboard.

Topic Solved!

The BOM is only needed where you use UTF-16 or UTF-32. UTF-8 always uses the same byte order for those characters that need more than one byte.
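
For the curious, this is easy to see by encoding U+FEFF (the BOM character) in each form; a quick Python check:

```python
bom = "\ufeff"  # the BOM is just the character U+FEFF at the start of a file

# In UTF-16 the two bytes reveal the byte order...
print(bom.encode("utf-16-be").hex())  # feff   -- big-endian
print(bom.encode("utf-16-le").hex())  # fffe   -- little-endian

# ...but in UTF-8 the sequence is always the same three bytes,
# so it carries no ordering information at all.
print(bom.encode("utf-8").hex())      # efbbbf
```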

And since this thread should appear in searches about the BOM: Notepad users are pretty much stuck with a BOM in any UTF-8 pages they create. Saving the file in almost any other text editor (including Notepad++) will fix this.

When saving for the first time in Notepad, you're given a choice of character encodings: ANSI (ASCII), Unicode, Unicode big endian, and utf-8. Both of the Unicode choices will prepend a BOM. Use utf-8 as your choice.

If you’ve already saved with the BOM, open the file and ‘save as’ to get a new shot at setting the proper encoding.

cheers,

gary

KK5st from Devshed?

On a semi-related note, isn't UTF-8 the only byte order required by XHTML?

Ryan, yup that’s the same Gary : )

UTF-8 doesn't require a BOM at all. A byte-order mark lets computers know where the ones, tens, and hundreds places are in a number, so to speak. We humans all write our numbers the same way (is that little endian, because the end of the number is the ones place? I forget), but computers can go either way: 813 could be read as three hundred eighteen if the BOM said that's the order (my example is imperfect because those are digits, not bytes, but oh well). When both orders are possible, you need to tell the reader which way to go. UTF-16 and anything wider needs a BOM (or some other byte-order label). Lucky for us, we don't ever need to use UTF-anything higher than 8 : )
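
The digit analogy can be made concrete with a real two-byte encoding; a small Python illustration:

```python
# "A" is code point U+0041; in UTF-16 it occupies two bytes,
# and the byte order decides which byte comes first.
print("A".encode("utf-16-be").hex())  # 0041
print("A".encode("utf-16-le").hex())  # 4100

# Decode with the wrong assumed order and you get a completely
# different character (U+4100, a CJK ideograph) instead of "A".
print(b"\x00\x41".decode("utf-16-le"))
```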

XHTML doesn't require anything anyway, unless it's real XHTML (i.e. actually served as XML), which does require UTF-something (XML 1.0 requires unicode). But fake XHTML could be in a Windows charset for all it cares, because it's really just HTML anyway.

Thanks, Gary, for the info. I thought people were getting the BOM when they selected utf-8, rather than just with the Unicode options.

Since UTF-8 is one variant of Unicode, the Unicode options presumably use either UTF-16 or UTF-32 (the other two Unicode variants), with the two alternate orders in which the bytes that make up each character can occur.

I’m not sure I understand what you’re asking, but I think you may be thinking of the fact that an XML parser is only required to support UTF-8 and UTF-16 (if I remember correctly).

That isn’t strictly correct (you’re comparing apples to oranges).
Unicode (or, strictly speaking, the variant standardised as ISO/IEC 10646) is the character repertoire used with HTML and XML.
UTF-8 is an encoding: a method for specifying Unicode code positions using 1, 2, 3 or 4 octets.

So Unicode is the whole set of available characters, where every character has an index number (code position). UTF-8 is one of many ways to represent those code positions, numerically.

US-ASCII (ISO 646), ISO 8859-1, Windows-1252, etc. are both repertoires and encodings. Since the repertoires are very limited, any character can be represented with a single octet.
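
A short Python example makes the repertoire/encoding split concrete:

```python
ch = "é"             # code position U+00E9 in the Unicode repertoire
print(hex(ord(ch)))  # 0xe9 -- the code position, independent of any encoding

# The same code position, represented by different encodings:
print(ch.encode("latin-1"))  # b'\xe9'     -- one octet, equal to the code position
print(ch.encode("utf-8"))    # b'\xc3\xa9' -- two octets
print("€".encode("utf-8"))   # b'\xe2\x82\xac' -- U+20AC needs three octets in UTF-8
```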

I’m not sure I understand what you’re asking, but I think you may be thinking of the fact that an XML parser is only required to support UTF-8 and UTF-16 (if I remember correctly).

That’s what I was thinking. I knew there was an X in there :}.

As others have said, the byte-order mark for UTF-8 is not necessary, and I would even recommend against it (for the same reason that the W3C does).

Unfortunately getting rid of it may be tricky, depending on which text editor you use. If the text editor supports UTF-8 (which most do now) then you won’t even see the byte-order mark in the file when you open it. You would have to rely on there being a menu item to choose between character encodings, one of which may be “UTF-8 (no byte-order mark)” or even just “UTF-8”.

As a quick-and-dirty hack you could try opening it in a program that does NOT support UTF-8; you'll then see the byte-order mark as three strange characters and can simply delete them. Take care that it doesn't mess up any special characters elsewhere in the page, though. I can't suggest a non-UTF-8-aware application off the top of my head, but most full-featured text editors like Notepad++ or PSPad will let you switch between UTF-8 and other modes.
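
Those "three strange characters" are predictable: they are the UTF-8 BOM bytes (EF BB BF) interpreted through a single-byte encoding such as Windows-1252. In Python:

```python
bom = b"\xef\xbb\xbf"  # the UTF-8 BOM bytes

# Viewed through the Windows-1252 code page, each byte becomes one
# character -- the classic tell-tale at the top of a mis-handled page.
print(bom.decode("cp1252"))  # ï»¿
```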

The purpose of the byte-order mark is to make sure that your computer system is not reading every sequence of 2 bytes in the wrong order. However, that is not relevant to UTF-8, because UTF-8 is encoded to a stream of single bytes and thus it has no byte-order issues. The byte-order mark is therefore useless (except as a “hint” that the document uses UTF-8 encoding, which is unnecessary).
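
If you process files in code rather than in an editor, note that some tools treat the BOM as exactly that optional hint; Python's `utf-8-sig` codec, for instance, strips it on decode:

```python
data = b"\xef\xbb\xbf<!DOCTYPE html>"

# Plain UTF-8 decoding keeps the BOM as a leading U+FEFF character...
print(repr(data.decode("utf-8")))      # '\ufeff<!DOCTYPE html>'

# ...while the "utf-8-sig" codec consumes it if present.
print(repr(data.decode("utf-8-sig")))  # '<!DOCTYPE html>'
```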

Edit: oops, missed the fact that the OP’s problem has already been solved. Oh well, looks like I wasn’t the only one

Technically, that's not completely correct. While some implementations may determine the default character encoding of an HTML or XML document differently in the absence of any indication, normally you would indicate the character encoding somewhere in the document. It is indeed legitimate to have ISO-8859-1 or even CP-1251 (Microsoft) in an XML document, as with HTML. The statement "XML 1.0 requires unicode" is misleading here. Both HTML and XML always represent characters internally by their Unicode code point, and all XML implementations need to be able to support UTF-8 and UTF-16, but that doesn't mean the document needs to use either of those encodings where the intended implementation supports others.
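
For example, an XML document can declare a non-Unicode encoding in its prologue, and any conforming parser that supports that encoding will handle it; a quick Python check:

```python
import xml.etree.ElementTree as ET

# An XML document encoded as ISO-8859-1, declaring that in its prologue:
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>ol\u00e9</p>'.encode("latin-1")

root = ET.fromstring(doc)  # the parser honours the declared encoding
print(root.text)           # olé
```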

Thanks for that. It made me go back to the XML spec and re-read it. Where I saw that processors needed to accept Unicode, I read it as XML needing to be in Unicode.
: )