GEDCOM needs to accommodate different character sets to facilitate the sharing of genealogical data in different languages. To minimize the number of differing standards needed to accomplish this, we have chosen to have each system convert its usage to ANSEL and eventually to Unicode. In January of 1991 the Unicode Consortium was founded to promote the use of the Unicode standard, which accommodates all characters in one character set (see the section on Unicode below). The Unicode Consortium has agreed to merge its work with the ISO 10646 standard, and Unicode will be a subset of the ISO 10646 international character encoding standard. The difficulty is in handling the two-byte character code sequences. Therefore, until multi-byte handling becomes more common, ANSEL will be the standard for representing the Latin-based international characters.
The GEDCOM specification does not address the implementation methods for multilingual processing, such as keyboard arrangements, sorting sequences, or character and graphic representations (font styles, proportional spacing, and so forth) on the CRT or printers. However, the Unicode standard has defined formatting characters that indicate the direction of the text presentation, as well as other text formatting character codes.
Most of the genealogy systems developed so far utilize either ASCII or ANSEL, or both. ANSEL
accommodates the set of Latin-based languages, as explained below.
The 8-Bit ANSEL (American National Standard for Extended Latin Alphabet Coded Character Set
for Bibliographic Use, Z39.47, 1985 copyright) is the default character set for GEDCOM. It is used
for all transmissions of information unless another character set is specified. The use of this
character set standard makes it possible to preserve the full integrity of the language by providing a
method of using the standard ASCII character set and supplementing it with both non-spacing
character modifiers (diacritics) and spacing special characters. Non-spacing means that the
diacritic is printed without advancing the device's print position. The character being modified is
then printed in the same position, resulting in a combined image of both the character and the
diacritic(s). The storage of ANSEL requires storing the non-spacing graphic character(s) preceding
the ASCII character that the diacritic is to modify. The ANSEL standard specifies an extended 8-bit
configuration (codes 128 and above) to represent the spacing and non-spacing graphic characters
that make up most of the Latin-based languages. ANSEL is a superset of ASCII. The standard ASCII
characters, including the control characters, are preserved.
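The storage rule above, in which each non-spacing diacritic precedes the character it modifies, can be sketched in Python as follows. This is a minimal illustration and not part of the GEDCOM standard: the function name ansel_to_unicode is hypothetical, the five code values in the table are taken from commonly published ANSEL tables and should be verified against the standard itself, and every other code in the 128 to 255 range is deliberately left unhandled.

```python
import unicodedata

# ASSUMPTION: partial mapping of ANSEL non-spacing codes to Unicode combining
# marks, taken from commonly published ANSEL tables. Verify before relying on it.
ANSEL_NON_SPACING = {
    0xE1: "\u0300",  # grave accent
    0xE2: "\u0301",  # acute accent
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # umlaut / diaeresis
}


def ansel_to_unicode(data: bytes) -> str:
    """Convert ANSEL bytes (ASCII plus the diacritics above) to a Unicode string.

    ANSEL stores a diacritic BEFORE its base character, so the diacritics are
    held back until the base character has been emitted.
    """
    out = []
    pending = []  # diacritics waiting for their base character
    for byte in data:
        if byte in ANSEL_NON_SPACING:
            pending.append(ANSEL_NON_SPACING[byte])
        elif byte < 0x80:
            out.append(chr(byte))  # codes 0-127 are identical to ASCII
            out.extend(pending)    # attach the held diacritics after the base
            pending.clear()
        else:
            raise ValueError(f"ANSEL code 0x{byte:02X} is not handled by this sketch")
    if pending:
        raise ValueError("diacritic with no following base character")
    # Compose base + combining mark into a single character where one exists.
    return unicodedata.normalize("NFC", "".join(out))


print(ansel_to_unicode(b"Ren\xe2e"))    # René
print(ansel_to_unicode(b"M\xe8uller"))  # Müller
```

A full converter would also need the spacing special characters and the remaining non-spacing codes, which are omitted here.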
ANSEL is known by two other names: (1) ANSI Z39.47-1985 and (2) the American Library
Association character set, used in library systems worldwide, including the MARC (MAchine-Readable Cataloging) format.
A description of the codes for the ANSEL character set has been reproduced with permission and is
included with the printed version of The GEDCOM Standard. The description of ANSEL codes is
not included in the electronic version. This description may be purchased from the American
National Standards Institute at 1430 Broadway, New York, N.Y. 10018.
The description of the ANSEL character set standard includes the following:
Character-set codes 0 through 127 are the same for 8-Bit ANSEL and 8-Bit ASCII (USA version--ANSI 8-Bit).
Character-set codes 128 through 255 are unique to the ANSEL character set.
When there isn't a need for diacritics or other special characters, and if you are not transmitting
binary data, you will find it convenient to use ASCII (8-bit USA version) if your computer already
supports it. This is a standard of the American National Standards Institute (ANSI). Most of the
basic printable characters of ANSEL and ASCII (USA version--ANSI 8-Bit) are identical.
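Because codes 0 through 127 are shared and codes 128 through 255 are unique to ANSEL, the choice between the two can be sketched as a simple byte-range check. The function name suggest_char_set is hypothetical, and the sketch assumes the data is text only (no binary content).

```python
def suggest_char_set(data: bytes) -> str:
    """Return "ASCII" when every byte is in the shared 0-127 range, else "ANSEL".

    ASSUMPTION: `data` holds text already encoded as ANSEL; binary data is
    out of scope for this check.
    """
    return "ASCII" if all(byte < 128 for byte in data) else "ANSEL"


print(suggest_char_set(b"John Smith"))  # ASCII
print(suggest_char_set(b"Ren\xe2e"))    # ANSEL
```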
Binary formats for representing photographs and other bit-mapped graphics should use the escape
sequence "escape_to_supplementary_processing" for linking supplementary files to the GEDCOM
context (see chapter 2).
The Unicode standard is a new character code designed to encode text for storage in computer files.
It is a subset of the upcoming ISO 10646 standard. The design of the Unicode standard is based on
the simplicity and consistency of today's prevalent character code set, the extended ASCII code set, but
goes far beyond ASCII's limited ability to encode only the Latin alphabet: the Unicode encoding
provides the capacity to encode all of the characters used for written languages throughout the
world. To accommodate the many thousands of characters used in international text, the
Unicode standard uses a 16-bit code set instead of extended ASCII's 8-bit code set. This expansion
provides codes for more than 65,000 characters. The Unicode standard assigns each character a
unique 16-bit value, and does not use complex modes or escape codes to specify modified characters
or special cases. Unicode 16-bit values are written in text as U+ followed by four hexadecimal
digits; for example, U+0041 is assigned to the letter A (65 decimal). The Unicode standard
includes the Latin alphabet used for English, the Cyrillic alphabet used for Russian, and the
Greek, Hebrew, and Arabic alphabets. Other alphabets used in countries across Europe, Africa,
the Indian subcontinent, and Asia, such as Japanese Kana, Korean Hangul, and Chinese Bopomofo,
are also included. The largest part of the
Unicode standard is devoted to thousands of unified character codes for Chinese, Japanese, and
Korean ideographs. (See The Unicode Standard, vols. 1 and 2, published by Addison-Wesley
Publishing, for character code standards.)
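The U+ notation and the 16-bit storage of a character can be illustrated with a short Python sketch. The helper name u_plus is hypothetical, and the sketch assumes the 16-bit model described in this section, so it rejects any value that does not fit in 16 bits.

```python
def u_plus(ch: str) -> str:
    """Format a single character in U+XXXX notation (16-bit values only)."""
    code = ord(ch)
    if code > 0xFFFF:
        # ASSUMPTION: only the 16-bit code set described in this section is covered.
        raise ValueError("value does not fit in 16 bits")
    return f"U+{code:04X}"


print(u_plus("A"))                     # U+0041
print(ord("A"))                        # 65
print("A".encode("utf-16-be").hex())   # 0041 (high byte first)
print("A".encode("utf-16-le").hex())   # 4100 (low byte first)
```

The last two lines show the same 16-bit value stored in the two possible byte orders.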
The Unicode character set environment, which contains a character set for all languages, minimizes
previous GEDCOM requirements to provide escape_sequences for moving from one character set to
another. If the Unicode environment is used to produce a GEDCOM transmission, the header
record would also be in Unicode, requiring receiving systems to determine whether the transmission
is Unicode or ASCII before they could interpret the GEDCOM header. This would be done by
reading the first two bytes of the transmission. If the first two bytes are 0x30 and 0x20 (the
digit 0 followed by a space, as at the start of the header line 0 HEAD), then the transmission
will be in either ASCII or ANSEL, as determined by the header record. If the first two bytes are
0x30 and 0x00 (the digit 0 stored as a 16-bit value), then the transmission should be processed
as a Unicode transmission. (Different platforms may reverse the position of the null byte, in
which case the test would be for 0x00 and 0x30.)
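The first-two-bytes test described above can be sketched in Python as follows. The function name and the returned labels are hypothetical; only the three byte patterns come from the text.

```python
def detect_transmission_encoding(first_two: bytes) -> str:
    """Classify a GEDCOM transmission from its first two bytes."""
    if len(first_two) < 2:
        raise ValueError("need at least two bytes")
    pair = bytes(first_two[:2])
    if pair == b"\x30\x20":
        # "0" then a space: 8-bit text; the header record says ASCII or ANSEL.
        return "ASCII or ANSEL"
    if pair == b"\x30\x00":
        return "Unicode (low byte first)"
    if pair == b"\x00\x30":
        return "Unicode (high byte first)"
    raise ValueError("unrecognized start of transmission")


print(detect_transmission_encoding(b"0 HEAD"))                      # ASCII or ANSEL
print(detect_transmission_encoding("0 HEAD".encode("utf-16-le")))   # Unicode (low byte first)
print(detect_transmission_encoding("0 HEAD".encode("utf-16-be")))   # Unicode (high byte first)
```

Raising an error for any other pattern is a design choice of this sketch; a receiving system might prefer to report the problem differently.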
The character set for an entire transmission is specified in the character-set line (CHAR) of the
header record.
The lineage_linked form no longer makes use of the character escape_sequence to change a
character set context inside of the transmission. Unicode does not require shifting from character
set to character set and we should encourage its use for multi-language support.
For more information about character sets, see the following:
8-Bit ANSEL
ASCII (USA version)
Binary Character Set
Unicode (ISO 10646)
How to change character sets
The example below shows the specification in the header record.
The character-set change remains in effect until the TRLR record is encountered at the end of the
transmission.
Lvl Tag Value
0 HEAD
1 SOUR PAF
2 VERS 2.1
1 DEST ANSTFILE
1 CHAR ANSEL
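A receiving system could read the character-set line from a header like the one above roughly as follows. This is a minimal sketch: the function name header_char_set is hypothetical, and it assumes each line has the simple "level tag value" shape shown in the example.

```python
def header_char_set(lines):
    """Return the value of the level-1 CHAR line in the HEAD record, or None."""
    in_head = False
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank lines and non-GEDCOM lines such as a column heading
        level, tag = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == 0:
            if in_head:
                break  # a new level-0 record means the header has ended
            in_head = tag == "HEAD"
        elif in_head and level == 1 and tag == "CHAR":
            return value
    return None


header = [
    "0 HEAD",
    "1 SOUR PAF",
    "2 VERS 2.1",
    "1 DEST ANSTFILE",
    "1 CHAR ANSEL",
]
print(header_char_set(header))  # ANSEL
```

When the function returns None, a caller could fall back to ANSEL, since ANSEL is used unless another character set is specified.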