
December 8, 2004

Character Encoding issues distilled

I've always known that Microsoft did something non-standard with regard to character sets, but I never really knew specifically what their wrongdoing was. In attempting to convert some textual data into XML for purposes of an RSS feed, I ran into trouble. While this issue has been around for a while, I thought I might post my findings as a reference for later.

What Microsoft did wrong

The SGML declaration for HTML 4.0 explicitly marks character positions 128-159 (just above the ASCII range, which ends at 127) as unused.


CHARSET
      BASESET  "ISO Registration Number 177//CHARSET
                ISO/IEC 10646-1:1993 UCS-4 with
                implementation level 3//ESC 2/5 2/15 4/6"
     DESCSET 0       9       UNUSED
             9       2       9
             11      2       UNUSED
             13      1       13
             14      18      UNUSED
             32      95      32
             127     1       UNUSED
             128     32      UNUSED
             160     55136   160
             55296   2048    UNUSED  -- SURROGATES --
             57344   1056768 57344

Specifically the line:


128 32 UNUSED

This means that the 32 positions starting at 128 (that is, 128 through 159) are unused in HTML. Unicode and ISO 10646 in fact reserve this range for control characters. The Windows-1252 encoding, however, assigns printable characters to most positions in this range. The most frequently used of these offending characters are the now-infamous smart quotes, apostrophes, and dashes (‘ ’ “ ” —).
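One way to read the DESCSET table is as a list of (start, count) ranges that either map to themselves or are UNUSED. Here is a minimal PHP sketch under that reading. The constant and function names are made up for illustration and are not part of any library.

```php
<?php
// Ranges that the HTML 4 SGML declaration maps to real characters, written
// as [first, last] inclusive. Everything else (including 128-159) is UNUSED.
// HTML4_ALLOWED_RANGES and is_html4_document_char are illustrative names.
const HTML4_ALLOWED_RANGES = [
    [9, 10],          // 9      2       9
    [13, 13],         // 13     1       13
    [32, 126],        // 32     95      32
    [160, 55295],     // 160    55136   160
    [57344, 1114111], // 57344  1056768 57344
];

function is_html4_document_char(int $codePoint): bool
{
    foreach (HTML4_ALLOWED_RANGES as [$first, $last]) {
        if ($codePoint >= $first && $codePoint <= $last) {
            return true;
        }
    }
    return false;
}

var_dump(is_html4_document_char(147));  // bool(false): inside 128-159
var_dump(is_html4_document_char(8220)); // bool(true): U+201C, the real left double quote
```

Position 147 is where Windows-1252 puts the left smart quote, which is exactly why it fails the check.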

One workaround to this problem is to declare the character encoding used by the document, like this:
HTML:
Content-Type: text/html; charset=windows-1252 or
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
XML:
<?xml version="1.0" encoding="windows-1252"?>

However, this declaration can be overridden (an HTTP Content-Type header takes precedence over a meta tag, for example), and should you fail to specify the character set at all, the content will be "broken." Also, if you ever need to merge data from different sources, having two completely disparate character sets makes the merge considerably more difficult.
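Since a missing or overridden declaration leaves these bytes open to misinterpretation, it can help to check whether the data contains any of them in the first place. The following is only a sketch: the function name is hypothetical, and it assumes the input is in a single-byte encoding (in UTF-8 the same byte values show up legitimately as continuation bytes).

```php
<?php
// Return the offsets of bytes in the 0x80-0x9F range. These are printable
// characters in Windows-1252 but control codes in ISO-8859-1 and Unicode.
// find_cp1252_only_bytes is an illustrative name, not a library function.
// Assumes $bytes is single-byte encoded text, not UTF-8.
function find_cp1252_only_bytes(string $bytes): array
{
    $offsets = [];
    $length = strlen($bytes);
    for ($i = 0; $i < $length; $i++) {
        $value = ord($bytes[$i]);
        if ($value >= 0x80 && $value <= 0x9F) {
            $offsets[] = $i;
        }
    }
    return $offsets;
}

// "\x93" and "\x94" are the Windows-1252 curly double quotes.
$sample = "He said \x93hello\x94";
print_r(find_cp1252_only_bytes($sample));
```

An empty result suggests the text would read the same whether it is labelled ISO-8859-1 or Windows-1252.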

What should you do?
In my opinion, at the very least, any characters that fall within the forbidden range should be escaped with the proper Unicode Numeric Character References (NCRs). NCRs and entities are ways of representing any Unicode character in XHTML/HTML using only ASCII characters. I read a great W3C i18n tutorial regarding character sets and encodings in HTML, XML, and CSS. You should choose a character set such as UTF-8 or ISO-8859-1 that is most likely to represent most characters without escaping, and then escape any characters above 127 (including the invalid range) with the Unicode NCR.
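To make the NCR format concrete: a reference is just the Unicode code point written in decimal (&#N;) or hexadecimal (&#xN;) between an ampersand-hash and a semicolon. A small sketch, with helper names invented for this post:

```php
<?php
// Build a numeric character reference from a Unicode code point.
// Both forms refer to the same character. ncr_decimal and ncr_hex are
// illustrative helper names, not built-in PHP functions.
function ncr_decimal(int $codePoint): string
{
    return sprintf('&#%d;', $codePoint);
}

function ncr_hex(int $codePoint): string
{
    return sprintf('&#x%X;', $codePoint);
}

// U+2014 is the em dash.
echo ncr_decimal(0x2014), "\n"; // &#8212;
echo ncr_hex(0x2014), "\n";     // &#x2014;
```

Either form is plain ASCII, so it survives no matter which encoding the document ends up being served in.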

Here's a PHP array that maps each Windows-1252 byte value in that range to its Unicode NCR, which might be helpful.


// Windows-1252 byte value => Unicode NCR.
// 129, 141, 143, 144 and 157 are missing on purpose:
// Windows-1252 leaves those positions undefined.
$mapping = array(
    128 => '&#8364;',
    130 => '&#8218;',
    131 => '&#402;',
    132 => '&#8222;',
    133 => '&#8230;',
    134 => '&#8224;',
    135 => '&#8225;',
    136 => '&#710;',
    137 => '&#8240;',
    138 => '&#352;',
    139 => '&#8249;',
    140 => '&#338;',
    142 => '&#381;',
    145 => '&#8216;',
    146 => '&#8217;',
    147 => '&#8220;',
    148 => '&#8221;',
    149 => '&#8226;',
    150 => '&#8211;',
    151 => '&#8212;',
    152 => '&#732;',
    153 => '&#8482;',
    154 => '&#353;',
    155 => '&#8250;',
    156 => '&#339;',
    158 => '&#382;',
    159 => '&#376;',
);
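
The mapping above can be applied byte by byte. The sketch below is one hedged way to do that, not necessarily the code used for the original conversion. The function name is invented for illustration, the fallback to U+FFFD for undefined positions is an assumption, and markup characters such as & and < are deliberately left alone.

```php
<?php
// Convert a Windows-1252 byte string to pure ASCII with NCRs.
// $mapping is expected to have the shape of the array above; only a short
// excerpt is repeated here so that the example runs on its own.
// cp1252_to_ascii_ncr is an illustrative name, not a library function.
function cp1252_to_ascii_ncr(string $bytes, array $mapping): string
{
    $out = '';
    $length = strlen($bytes);
    for ($i = 0; $i < $length; $i++) {
        $value = ord($bytes[$i]);
        if ($value < 128) {
            $out .= $bytes[$i];          // plain ASCII passes through
        } elseif (isset($mapping[$value])) {
            $out .= $mapping[$value];    // 128-159: look up the NCR
        } elseif ($value >= 160) {
            $out .= '&#' . $value . ';'; // 160-255 match Unicode directly
        } else {
            $out .= '&#65533;';          // not in the mapping (assumption: emit U+FFFD)
        }
    }
    return $out;
}

$excerpt = array(147 => '&#8220;', 148 => '&#8221;', 151 => '&#8212;');
echo cp1252_to_ascii_ncr("\x93caf\xE9\x94 \x97 ok", $excerpt), "\n";
// &#8220;caf&#233;&#8221; &#8212; ok
```

With the full array passed in instead of the excerpt, only the five undefined positions would ever reach the fallback branch.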

An important distinction to remember is that a character set is a definition of a particular set of characters for a particular purpose (e.g., the characters necessary to represent Western languages). The character encoding refers to the mapping of these characters to bytes on a computer. A single character set can have multiple encodings: Unicode, for instance, can be encoded as UTF-8, UTF-16, or UTF-32. While they all refer to the same characters, they have different byte mappings.
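
As a quick illustration of one character with several byte mappings, here is the euro sign (U+20AC) in a few encodings. This is a sketch that assumes PHP 7 or later for the \u{...} escape, and the pack('n', ...) shortcut for UTF-16 is only valid for code points in the Basic Multilingual Plane.

```php
<?php
// The euro sign is a single character (U+20AC) with several byte encodings.
$codePoint = 0x20AC;

$utf8    = "\u{20AC}";            // PHP 7+ emits UTF-8 for \u{...} escapes
$utf16be = pack('n', $codePoint); // 16-bit big-endian; BMP code points only
$utf32be = pack('N', $codePoint); // 32-bit big-endian
$cp1252  = "\x80";                // Windows-1252 puts the euro sign at byte 128

echo 'UTF-8:        ', bin2hex($utf8), "\n";    // e282ac
echo 'UTF-16BE:     ', bin2hex($utf16be), "\n"; // 20ac
echo 'UTF-32BE:     ', bin2hex($utf32be), "\n"; // 000020ac
echo 'Windows-1252: ', bin2hex($cp1252), "\n";  // 80
```

Same character every time, four different byte sequences.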

The Microsoft Windows 1252 specification shows the offending range (Hex rows 80 and 90). More importantly, here's a document that contains mappings from Windows 1252 to the proper Unicode NCR.

This is obviously NOT an exhaustive diagnosis of the challenges and proper usage of character sets and their respective encodings, but it was enough for me to transform a Windows 1252 encoded document into an XML document that rendered correctly within multiple newsreaders, blog readers, and multiple browsers.


Posted by mark at December 8, 2004 7:20 PM

Comments

um. what?

Posted by: Nate at December 11, 2004 10:02 PM

This is awesome research man. Stuff that I've always intended to dig up on my own and never taken the time.

Posted by: Topher at May 5, 2005 7:42 AM

Awesome stuff,

Posted by: devzer0 at August 2, 2006 5:28 PM
