Chapter 3 - Using Character Sets in GEDCOM

Introduction

GEDCOM must accommodate different character sets so that genealogical data can be shared across languages. To minimize the number of differing standards involved, we have chosen to have each system convert its usage to ANSEL and, eventually, Unicode. In January 1991 the Unicode Consortium was founded to promote the Unicode standard, which accommodates all characters in one character set (see the section on Unicode below). The Unicode Consortium and the ISO 10646 effort have agreed to merge, so that Unicode will be a subset of the ISO 10646 international character encoding standard. The difficulty lies in handling the two-byte character codes. Therefore, until multi-byte handling becomes more common, ANSEL remains the standard for representing the Latin-based international characters.

The GEDCOM specification does not address implementation methods for multilingual processing, such as keyboard arrangements, sorting sequences, or character and graphic representations (font styles, proportional spacing, and so forth) on the CRT or printers. The Unicode standard, however, defines formatting characters that indicate the direction of text presentation, as well as other text-formatting character codes.

Most of the genealogy systems developed so far utilize either ASCII or ANSEL, or both. ANSEL accommodates the set of Latin-based languages, as explained below.

8-Bit ANSEL

The 8-Bit ANSEL (American National Standard for Extended Latin Alphabet Coded Character Set for Bibliographic Use, Z39.47, 1985 copyright) is the default character set for GEDCOM. It is used for all transmissions of information unless another character set is specified. This standard makes it possible to preserve the full integrity of a language by supplementing the standard ASCII character set with both non-spacing character modifiers (diacritics) and spacing special characters. ANSEL is a superset of ASCII: the standard ASCII characters, including the control characters, are preserved, and the extended 8-bit range (codes 128 and above) represents the spacing and non-spacing graphic characters needed by most Latin-based languages.

Non-spacing means that the diacritic is printed without advancing the device's print position. The character being modified is then printed in the same position, resulting in a combined image of the character and its diacritic(s). In storage, the non-spacing graphic character(s) must precede the ASCII character that they modify.
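
The storage rule above (diacritic first, base character second) is the reverse of the order Unicode uses for combining marks. The following is a minimal sketch of that reordering, not a complete converter. The function name ansel_to_unicode and the six-entry table are illustrative assumptions; the byte values shown should be checked against the ANSEL code description before being relied on, and the many ANSEL codes not listed are deliberately rejected.

```python
import unicodedata

# Partial, illustrative table of ANSEL non-spacing diacritics and the
# Unicode combining marks they correspond to. Assumption: verify these
# byte values against the ANSEL standard; most codes are omitted here.
ANSEL_COMBINING = {
    0xE1: "\u0300",  # grave accent
    0xE2: "\u0301",  # acute accent
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # umlaut / diaeresis
    0xF0: "\u0327",  # cedilla
}


def ansel_to_unicode(data: bytes) -> str:
    """Reorder ANSEL 'diacritic, then base' into Unicode 'base, then mark'."""
    out = []
    pending = []  # diacritics seen but not yet attached to a base character
    for byte in data:
        if byte in ANSEL_COMBINING:
            pending.append(ANSEL_COMBINING[byte])
        elif byte < 0x80:
            # Codes 0-127 are plain ASCII; attach any waiting diacritics.
            out.append(chr(byte))
            out.extend(pending)
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} is not covered by this sketch")
    if pending:
        raise ValueError("diacritic at end of data with no base character")
    # Compose base + mark pairs into single characters where Unicode has them.
    return unicodedata.normalize("NFC", "".join(out))
```

For example, the ANSEL bytes for "Rene" with an acute-accent code placed before the final "e" come back as the four-character string "René".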

ANSEL is known by two other names: (1) ANSI Z39.47-1985 and (2) the American Library Association character set, used in library systems worldwide, including the MARC (MAchine-Readable Cataloging) format.

A description of the codes for the ANSEL character set has been reproduced with permission and is included with the printed version of The GEDCOM Standard. The description of ANSEL codes is not included in the electronic version. This description may be purchased from the American National Standards Institute at 1430 Broadway, New York, N.Y. 10018. The description of the ANSEL character set standard includes the following:

Character-set codes 0 through 127 are the same for 8-Bit ANSEL and 8-Bit ASCII (USA version--ANSI 8-Bit). Character-set codes 128 through 255 are unique to the ANSEL character set.

ASCII (USA version)

When there isn't a need for diacritics or other special characters, and if you are not transmitting binary data, you will find it convenient to use ASCII (8-bit USA version) if your computer already supports it. This is a standard of the American National Standards Institute (ANSI). Most of the basic printable characters of ANSEL and ASCII (USA version--ANSI 8-Bit) are identical.
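
Whether plain ASCII is enough for a given transmission can be checked mechanically. The sketch below assumes the lines are already available as Python strings; the function name ascii_is_sufficient is hypothetical and not part of GEDCOM.

```python
def ascii_is_sufficient(lines):
    """Return True when every character in every line has a code below 128."""
    # str.isascii() is available in Python 3.7 and later.
    return all(line.isascii() for line in lines)
```

A name such as "John /Smith/" passes this check, while a name containing any accented letter does not and would need ANSEL (or another character set) instead.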

Binary Character Set

Binary formats for representing photographs and other bit-mapped graphics should use the escape sequence "escape_to_supplementary_processing" for linking supplementary files to the GEDCOM context (see chapter 2).

Unicode (ISO 10646)

The Unicode standard is a new character code designed to encode text for storage in computer files. It is a subset of the upcoming ISO 10646 standard. Its design is based on the simplicity and consistency of today's prevalent character code set, extended ASCII, but goes far beyond ASCII's limited ability to encode only the Latin alphabet: the Unicode encoding provides the capacity to encode all of the characters used for written languages throughout the world. To accommodate the many thousands of characters used in international text, the Unicode standard uses a 16-bit code set instead of extended ASCII's 8-bit code set. This expansion provides codes for more than 65,000 characters. The Unicode standard assigns each character a unique 16-bit value and does not use complex modes or escape codes to specify modified characters or special cases. Unicode values are written as U+ followed by four hexadecimal digits; for example, U+0041 (65 decimal) is assigned to the letter A.

The Unicode standard includes the Latin alphabet used for English, the Cyrillic alphabet used for Russian, and the Greek, Hebrew, and Arabic alphabets. It also includes other alphabets and scripts used in countries across Europe, Africa, the Indian subcontinent, and Asia, such as Japanese Kana, Korean Hangul, and Chinese Bopomofo. The largest part of the Unicode standard is devoted to thousands of unified character codes for Chinese, Japanese, and Korean ideographs. (See The Unicode Standard, vol. 1 and 2, published by Addison-Wesley Publishing, for character code standards.)
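
The U+ notation and the size of a 16-bit code set can be illustrated with a few lines of Python. This is only a sketch; the helper name u_notation and the constant name are assumptions made for the example.

```python
# Number of distinct values a 16-bit code set can hold.
CODE_SPACE_16_BIT = 2 ** 16  # 65536, i.e. "more than 65,000"


def u_notation(ch: str) -> str:
    """Format a single character's code value in U+nnnn form."""
    return f"U+{ord(ch):04X}"
```

Here u_notation("A") gives "U+0041", and ord("A") is 65, matching the decimal value quoted above.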

The Unicode character set environment, which contains a character set for all languages, minimizes the previous GEDCOM requirement to provide escape_sequences for moving from one character set to another. If the Unicode environment is used to produce a GEDCOM transmission, the header record is also in Unicode, so a receiving system must determine whether the transmission is Unicode or 8-bit (ASCII or ANSEL) before it can interpret the GEDCOM header. This is done by reading the first two bytes of the transmission. If the first two bytes are 0x30 and 0x20 (the digit 0 followed by a space), the transmission is in either ASCII or ANSEL, as determined by the header record. If the first two bytes are 0x30 and 0x00, the transmission should be processed as a Unicode transmission. (Different platforms may reverse the position of the null byte, in which case the test would be for 0x00 and 0x30.)
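
The two-byte test described above might be sketched as follows. The function name and the returned labels are hypothetical, chosen only for this example. The sketch checks exactly the three byte patterns named in the text and reports anything else (for instance, a file that begins with some other marker) as unknown.

```python
def sniff_gedcom_encoding(first_bytes: bytes) -> str:
    """Classify a transmission by its first two bytes (labels are illustrative)."""
    head = first_bytes[:2]
    if head == b"\x30\x20":
        # "0 " in an 8-bit set: the header's CHAR line decides ASCII vs ANSEL.
        return "ascii-or-ansel"
    if head == b"\x30\x00":
        # 16-bit "0" with the null byte second.
        return "unicode-null-second"
    if head == b"\x00\x30":
        # 16-bit "0" with the null byte first (reversed byte order).
        return "unicode-null-first"
    return "unknown"
```

A reader would typically open the file in binary mode, pass the first two bytes to this function, and only then choose how to decode the header record.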

How to change character sets

The character set for an entire transmission is specified in the character-set line of the header record.

The example below shows the specification in the header record.

Example:

Lvl     Tag     Value
0       HEAD
1       SOUR    PAF
2       VERS    2.1
1       DEST    ANSTFILE
1       CHAR    ANSEL

The character-set change remains in effect until the TRLR record is encountered at the end of the transmission.
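
A receiving system needs to pull the CHAR value out of the header before decoding the rest of the transmission. The following is a minimal sketch of that lookup, assuming the header lines have already been read as text and that each line begins with a level number and a tag separated by whitespace. The function name header_charset is hypothetical; the fallback to ANSEL reflects its role as the default character set.

```python
def header_charset(lines, default="ANSEL"):
    """Return the value of the level-1 CHAR line inside the HEAD record."""
    in_head = False
    for raw in lines:
        parts = raw.split(None, 2)  # level, tag, optional value
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank lines and anything that is not a GEDCOM line
        level, tag = int(parts[0]), parts[1]
        value = parts[2].strip() if len(parts) > 2 else ""
        if level == 0:
            if in_head:
                break  # a new level-0 record means the header has ended
            in_head = tag == "HEAD"
        elif in_head and level == 1 and tag == "CHAR":
            return value
    return default  # no CHAR line found: fall back to the default set
```

Applied to the example header above, this returns "ANSEL"; applied to a header with no CHAR line, it also returns "ANSEL" because that is the default.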

The lineage_linked form no longer makes use of the character escape_sequence to change a character set context inside of the transmission. Unicode does not require shifting from character set to character set and we should encourage its use for multi-language support.

For more information about character sets, see the following: