When Bad Things Happen to Good Characters
Get to Know a Character
It can be useful to know your characters, but it is more practically useful to know one character well. My character is an "e" with an acute accent, character code 233 (decimal) in Latin-1 and Unicode.
Inserting Characters
There are many ways it can be inserted into a document:
- On Windows, I hold down the Alt key, type 0233 on the numeric keypad, and release the Alt key. I could use the charmap program, too. Or I could copy and paste it (e.g., é). But entering the code directly is risky because, if the character encoding changes, e.g., from Latin-1 to UTF-8, then the meaning of code 233 changes.
- In an HTML document, I can enter these magical incantations, which are displayed correctly regardless of encoding:
  - `&#233;` (decimal) ⇒ é
  - `&#xE9;` (hex) ⇒ é
  - `&eacute;` (mnemonic) ⇒ é
- In Microsoft Word, I type an accent code followed by the letter to be accented. On Windows, Ctrl+quote, then 'e'. On Mac, Option+quote, then 'e'. Accent codes include: grave=backquote, acute=quote, circumflex=hat, umlaut=colon, cedilla=comma, tilde=tilde, slash=slash, and perhaps others.
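The character code and the three HTML forms can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative, and the character is written as an escape so the example does not depend on how this text itself is encoded.

```python
import html

E_ACUTE = "\u00e9"  # é, written as an escape so the source file's encoding cannot change it

# The character code: 233 decimal, E9 hex.
code = ord(E_ACUTE)
print(code, format(code, "X"))  # 233 E9

# The three HTML forms all decode to the same single character.
entities = ["&#233;", "&#xE9;", "&eacute;"]
decoded = [html.unescape(entity) for entity in entities]
print(all(ch == E_ACUTE for ch in decoded))  # True

# The risk of entering the code directly: byte 233 means é in Latin-1,
# but a lone byte 233 is not valid UTF-8 at all.
raw = bytes([code])
as_latin1 = raw.decode("latin-1")
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(as_latin1 == E_ACUTE, valid_utf8)  # True False
```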
What Could Possibly Go Wrong?
If é is UTF-8 encoded, but displayed without decoding, it looks like this: Ã©
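That misreading is easy to reproduce. The sketch below assumes the viewer interprets the bytes as Latin-1, which is the case this article discusses; the code points are printed rather than the characters so the output is unambiguous.

```python
E_ACUTE = "\u00e9"  # é

# UTF-8 stores é as two bytes.
utf8_bytes = E_ACUTE.encode("utf-8")
print(utf8_bytes)  # b'\xc3\xa9'

# Reading those two bytes as Latin-1 yields two characters instead of one:
# U+00C3 (A with tilde) and U+00A9 (copyright sign).
misread = utf8_bytes.decode("latin-1")
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+00C3', 'U+00A9']

# Decoding with the right codec recovers the original character.
recovered = utf8_bytes.decode("utf-8")
print(recovered == E_ACUTE)  # True
```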
The first 128 characters in the Latin-1 character set (the same as ASCII) are simply represented as themselves in UTF-8. The second half of the Latin-1 characters is split. The first half of the non-ASCII Latin-1 characters are represented by themselves, preceded by code 194 decimal or C2 hex, so the UTF-8 encoding for character code 191 (decimal), ¿, is Â¿
The second half of the non-ASCII Latin-1 characters are represented by a different character, preceded by code 195 decimal or C3 hex. So, when looking at UTF-8 encodings of Latin-1 characters, if you see Â or Ã where you do not expect it, there are probably too many UTF-8 encodings. Multiple extra encodings have a pattern to them (the invisible control characters that fall between the letters are not shown):
0 é
1 Ã©
2 ÃÂ©
3 ÃÂÃÂ©
4 ÃÂÃÂÃÂÃÂ©
5 you get the idea
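The two-byte rule and the repeating pattern can both be checked in Python. In this sketch, `utf8_of_latin1` and `over_encode` are hypothetical helper names; the first restates the rule above by hand and is compared against Python's own encoder, and the second assumes each extra encoding means "encode as UTF-8, then misread the bytes as Latin-1".

```python
def utf8_of_latin1(code):
    """Return the UTF-8 bytes for a Latin-1 code (0-255), per the rule above."""
    if code < 128:
        return bytes([code])          # ASCII: represented as itself
    if code < 192:
        return bytes([0xC2, code])    # 128-191: itself, preceded by C2
    return bytes([0xC3, code - 64])   # 192-255: a different byte, preceded by C3


# Check the hand-written rule against Python's encoder for all 256 codes.
rule_holds = all(utf8_of_latin1(c) == chr(c).encode("utf-8") for c in range(256))
print(rule_holds)  # True


def over_encode(text, times):
    """Apply `times` extra rounds of 'encode as UTF-8, misread as Latin-1'."""
    for _ in range(times):
        text = text.encode("utf-8").decode("latin-1")
    return text


# Each extra round doubles the length: 1, 2, 4, 8, 16 characters.
for n in range(5):
    layered = over_encode("\u00e9", n)
    print(n, len(layered), ascii(layered))
```

`ascii()` is used for printing because, from the second round on, some of the resulting characters are invisible control characters.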
Too few encodings can have a bad effect that looks different. When é is not UTF-8 encoded but is read as if it were, it can appear as this very high-numbered character (U+FFFD, the replacement character):
�
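A minimal Python sketch of under-encoding, assuming the text was stored as Latin-1 and then decoded as UTF-8 with a lenient error policy (`errors="replace"`). A further lossy conversion to ASCII, also an assumption made for illustration, turns the replacement character into a plain question mark.

```python
# é stored as Latin-1 is the single byte E9, which is not valid UTF-8 by itself.
latin1_bytes = "\u00e9".encode("latin-1")

# A lenient UTF-8 decoder substitutes U+FFFD for the undecodable byte.
shown = latin1_bytes.decode("utf-8", errors="replace")
print(f"U+{ord(shown):04X}")  # U+FFFD

# Converting that to ASCII, again leniently, leaves only a question mark.
lossy = shown.encode("ascii", errors="replace")
print(lossy)  # b'?'
```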
Progressive under-encoding can result in a question mark being displayed.
Diagnostic Reference
You are now ready to diagnose UTF-8 encoding problems (e.g., with é):
Symptom | Diagnosis |
---|---|
é | no problems |
Ã© | too much UTF-8 encoding, or viewing UTF-8 encoded text with Latin-1 encoding |
ÃÂ© | much too much UTF-8 encoding |
� | too little UTF-8 encoding |
? | something bad happened to this character |
(nothing at all) | wild animals have eaten this character |
𐀓 | If you see a box, the font in use is missing this character. Firefox 3's boxes contain the hexadecimal value for the missing character, but it's still just a missing character. |
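Part of the table can be turned into a rough automatic check. The following Python sketch is a heuristic under stated assumptions, not a definitive tool: `count_extra_encodings` and `diagnose` are hypothetical names, the text is assumed to have been misread as Latin-1, and a string that legitimately contains a sequence such as Ã© would be misdiagnosed.

```python
def count_extra_encodings(text):
    """Undo 'misread as Latin-1' as often as possible.

    Returns (number of layers removed, repaired text).
    """
    layers = 0
    while True:
        try:
            repaired = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return layers, text       # no further layer can be removed
        if repaired == text:          # pure ASCII: nothing left to undo
            return layers, text
        text, layers = repaired, layers + 1


def diagnose(text):
    """Map a symptom to the wording used in the table above."""
    if "\ufffd" in text:
        return "too little UTF-8 encoding"
    layers, _ = count_extra_encodings(text)
    if layers == 0:
        return "no problems"
    if layers == 1:
        return "too much UTF-8 encoding"
    return "much too much UTF-8 encoding"


print(diagnose("\u00e9"))                    # no problems
print(diagnose("\u00c3\u00a9"))              # too much UTF-8 encoding
print(diagnose("\u00c3\u0083\u00c2\u00a9"))  # much too much UTF-8 encoding
print(diagnose("\ufffd"))                    # too little UTF-8 encoding
```

The second value returned by `count_extra_encodings` is the repaired text, so the same loop can be used to fix a string once the diagnosis is trusted.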
Background Information from Wikipedia
- ASCII: letters, numbers, punctuation with no accents/diacritics
- Latin-1 / ISO 8859-1: ASCII + some accented and special characters
- Diacritic: accents added to letters
- Unicode: huge international character set
- Character Encoding: representations for characters, particularly for storage and transmission
- UTF-8 Encoding: character encoding often used for Unicode
- Precomposed Character: single character codes that represent both the letter and accents
- Combining Characters: characters that combine with the preceding letter to display as diacritics/accented characters
- Precomposed Latin-1 Characters in Unicode (remember, characters missing from your character set may display as boxes)
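The difference between a precomposed character and a combining sequence can be seen with Python's standard `unicodedata` module. This is a small illustrative sketch; the variable names are assumptions, and é is again the example character.

```python
import unicodedata

precomposed = "\u00e9"   # é as one precomposed code point
combining = "e\u0301"    # plain e followed by COMBINING ACUTE ACCENT

# The two look the same on screen but are different strings.
print(precomposed == combining)  # False

# Normalization converts between the two forms.
composed = unicodedata.normalize("NFC", combining)
decomposed = unicodedata.normalize("NFD", precomposed)
print(composed == precomposed, decomposed == combining)  # True True

# They also differ in size once UTF-8 encoded.
sizes = (len(precomposed.encode("utf-8")), len(combining.encode("utf-8")))
print(sizes)  # (2, 3)
```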