ISO Latin-8 (aka ISO 8859-14): the 8-bit encoding standard with dotted consonants

(8-bit character encodings are relevant to Windows 3.x, 9x and ME.  Windows 2000 and XP use Unicode.)

Latin-8 is the 14th of a series of 8-bit character encodings, collectively known as ISO 8859.  It is the 8th part of ISO 8859 to deal with the Latin alphabet, whence its usual names 8859-14 and Latin-8.  Latin-8 was formally defined in 1998, and specifically addresses the encoding needs of the Celtic languages.  It is distinguished from the other parts of ISO 8859, in particular from the more common Part 1 (8859-1, Latin-1), by the fact that it contains dotted consonants.  Further details of Latin-8 are available here or here or here.  The characters which are common to Latin-1 and Latin-8 — including the acute- and grave-accented vowels — are identically encoded in both.
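The overlap between the two encodings can be checked mechanically. The following is a minimal Python sketch, assuming the standard-library codecs "latin_1" and "iso8859_14" faithfully represent Latin-1 and Latin-8; the function name differing_positions is purely illustrative. It lists the code positions in the range A0–FF at which the two encodings disagree.

```python
# Sketch: find the byte values in A0-FF whose meaning differs between
# Latin-1 and Latin-8, relying on Python's built-in codecs.

def differing_positions():
    """Return {byte value: (Latin-1 char, Latin-8 char)} where the two differ."""
    diffs = {}
    for code in range(0xA0, 0x100):
        raw = bytes([code])
        latin1_char = raw.decode("latin_1")
        latin8_char = raw.decode("iso8859_14")
        if latin1_char != latin8_char:
            diffs[code] = (latin1_char, latin8_char)
    return diffs


if __name__ == "__main__":
    # Print code points rather than the characters themselves, so the
    # output does not depend on the console's own encoding.
    for code, (l1, l8) in sorted(differing_positions().items()):
        print(f"{code:02X}: Latin-1 U+{ord(l1):04X}, Latin-8 U+{ord(l8):04X}")
```

Positions absent from the result, such as a-acute at hex E1, are encoded identically in both; positions present, such as hex BF, are where Latin-8 substitutes its own characters (there, lowercase s-dot).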

The following chart is the Latin-8 table found here, augmented by adding MS extensions in columns 8 and 9.

The region 80–9F is left blank in ISO 8859, but MS Windows places extra characters there.  The characters shown above in this region are those which MS have added to Latin-1 to make their extended version of it, known as codepage 1252 (CP1252).  MS have not defined a codepage corresponding to Latin-8.
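To see what the 80–9F region holds under each scheme, one can decode those bytes with both codecs. This is a sketch only, assuming Python's "cp1252" and "iso8859_14" codecs; high_region is a hypothetical helper name. Note that Python's ISO 8859 codecs map this region to control codes rather than leaving it strictly empty.

```python
# Sketch: tabulate what bytes 80-9F decode to under a given codec.

def high_region(codec):
    """Map each byte 0x80-0x9F to its character under codec, or None if undefined."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode(codec)
        except UnicodeDecodeError:
            # The codec assigns no character to this position.
            table[code] = None
    return table


cp1252 = high_region("cp1252")       # MS extension of Latin-1
latin8 = high_region("iso8859_14")   # plain Latin-8, no MS extensions
```

Under these assumptions cp1252[0x9F] is capital Y-diaeresis, while latin8[0x9F] is only a control code, which is the situation discussed in note 1.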

Notes

1. Y-diaeresis

MS extensions to Latin-1 include capital Y-diaeresis at hex 9F.  However, this character is already present in Latin-8 at hex AF, which is the proper code to use.  In considering extensions to Latin-8, it would be best not to assign position hex 9F to any character, to avoid uncertainty over what is intended to be represented when an actual instance of that code is encountered.
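The two code positions for this character can be confirmed, and stray instances of hex 9F located, with a few lines of Python. This is a minimal sketch assuming the standard "cp1252" and "iso8859_14" codecs; find_ambiguous_9f is a hypothetical helper, not part of any library.

```python
# Sketch: capital Y-diaeresis sits at different positions in CP1252 and Latin-8.

Y_DIAERESIS = "\u0178"  # LATIN CAPITAL LETTER Y WITH DIAERESIS

CP1252_CODE = Y_DIAERESIS.encode("cp1252")      # the MS position, hex 9F
LATIN8_CODE = Y_DIAERESIS.encode("iso8859_14")  # the Latin-8 position, hex AF


def find_ambiguous_9f(data):
    """Return the offsets of every hex 9F byte in data.

    In text intended as Latin-8, each such byte needs a human decision,
    since its intended meaning cannot be recovered from the byte alone.
    """
    return [offset for offset, byte in enumerate(data) if byte == 0x9F]
```

For example, find_ambiguous_9f(b"a\x9fb\xaf") reports only offset 1; the hex AF byte is unambiguous Latin-8 and is left alone.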

2. Variant glyphs

Lowercase r (hex 72), lowercase s (hex 73) and lowercase s-dot (hex BF) are realised in some styles of Gaelic script as so-called long forms, and in others as short forms — actually, reduced capitals.  Likewise, ampersand (hex 26) is realised in some styles in the familiar Latin way, and in others as the Tironian-et, looking like a lowered figure 7.

Whichever forms of these characters are provided in a Latin-8 font, it is important for text processing that they occupy the standard Latin-8 code positions, and no others.  The practice adopted in some Gaelic fonts of offering both variant glyphs of these characters, in different code positions (and produced by different keystrokes), is to be avoided, as it will lead to text processing problems.  The practice is sometimes encountered under the name of "extended Latin-8", mentioned in the note which follows. 

For further discussion of this problem (which also arises with some Unicode Gaelic fonts), including advice to font designers, see here.  It also contains instructions for a user to change the encodings of the variant glyphs, if other avenues fail.

3. Other extensions of Latin-8

The Latin-8 table shown above has been extended into the 80–9F region in the same way as MS have extended other ISO 8859 tables to produce Windows codepages.  But as MS have not implemented a codepage around Latin-8, it is open to others to extend Latin-8 into the 80–9F region in ways of their own.  Michael Everson has done this in his Extended Latin-8 v2.0.  Most of his extensions coincide with those of MS, but he adds, as separate characters, lowercase long r (at hex 89), lowercase long s (at hex 8A), lowercase long s-dot (at hex 9A), and Tironian-et (at hex 84).  However, as explained in the previous note, to assign separate code positions to these four variant glyphs is not a good idea.  Everson's extensions also place dotless-i at hex 9F.


Ciarán Ó Duibhín
2006/05/26