Chapter 1 - Data Representation Grammar

Introduction

This chapter describes the core GEDCOM data representation language.

The generic data representation language defined in this chapter may be used to represent any form of structured information, not just genealogical data, using a sequential stream of characters.

Concepts

A GEDCOM transmission represents a database in the form of a sequential stream of related records. A record is represented as a sequence of tagged, variable-length lines, arranged in a hierarchy. A line always contains a hierarchical level number and a tag, and may contain a value. A line may also contain a cross-reference identifier or a pointer. The GEDCOM-line is terminated by a carriage return, a line feed character, or a combination of the two.

The tag in the GEDCOM-line identifies the type of information contained in the line, in the same sense that a field-name identifies a field in a database record. This means that the data is self-defining. Tags allow a field to occur any number of times within a record, including zero times. They also allow different or new fields to be included in the GEDCOM data without introducing incompatibility, because the receiving system ignores data that it does not understand and processes only the data that it does understand.

The hierarchical relationships are indicated by the hierarchical level number. Subordinate lines have a higher level number. The hierarchy allows a line to have sub-lines, which in turn may have their own sub-lines, and so forth. A line and its sub-lines constitute a context or enclosure, that is, a cluster of information pertaining directly to the same thing. This hierarchical arrangement corresponds with the natural hierarchy found in most structured information.

A series of one or more lines constitutes a record. The beginning of a new record is indicated by a line whose level number is 0 (zero).
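
The record boundary rule above can be sketched as follows. This is a minimal illustration, not part of the standard; the function name split_records is hypothetical, and the sketch assumes each input line already has its terminator removed and begins with a valid level number.

```python
from typing import Iterable, Iterator, List


def split_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of lines per record; a level-0 line starts a new record."""
    record: List[str] = []
    for line in lines:
        # The level number is everything before the first space (delim).
        level = int(line.split(" ", 1)[0])
        if level == 0 and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


records = list(split_records([
    "0 @1234@ INDI",
    "1 OCCU Teacher",
    "0 @5678@ INDI",
    "1 AGE 13",
]))
```

Here records holds two lists of two lines each, one per level-0 line.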

A GEDCOM receiver system scans the input for expected information by looking for specific tags and processing the associated values. Unrecognized tags (perhaps from a sending system whose database contains some different information) are handled by processing neither the associated value nor its enclosed sub-lines; that is, the entire context is ignored. Such lines are treated as exceptions, either by printing them in an exception report or by saving them in some generic way. Saved exception lines may be recombined when the data is exported.
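
One way a receiver might set aside an unrecognized context is sketched below. The function name filter_known, the set of known tags, and the tag _PET are all hypothetical; the sketch assumes well-formed lines whose cross-reference identifiers contain no spaces.

```python
from typing import Iterable, List, Optional, Set, Tuple


def filter_known(lines: Iterable[str], known_tags: Set[str]) -> Tuple[List[str], List[str]]:
    """Split lines into (accepted, exceptions), skipping whole unknown contexts."""
    accepted: List[str] = []
    exceptions: List[str] = []
    skip_level: Optional[int] = None  # level of the unrecognized line being skipped
    for line in lines:
        parts = line.split(" ")
        level = int(parts[0])
        # The tag is the second field unless a cross-reference identifier precedes it.
        tag = parts[2] if parts[1].startswith("@") else parts[1]
        if skip_level is not None and level > skip_level:
            exceptions.append(line)  # sub-line of an unrecognized line
            continue
        skip_level = None
        if tag in known_tags:
            accepted.append(line)
        else:
            skip_level = level
            exceptions.append(line)
    return accepted, exceptions


accepted, exceptions = filter_known(
    ["0 @1234@ INDI", "1 _PET Rex", "2 DATE 1960", "1 BIRT", "2 DATE 12 MAY 1920"],
    {"INDI", "BIRT", "DATE"},
)
```

The DATE line under _PET lands in exceptions even though DATE is a known tag, because the entire enclosing context is ignored.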

In addition to hierarchical relationships, GEDCOM defines inter-record relationships which allow a record to be logically related to other records, without introducing redundancy. These relationships are represented by two additional but optional parts of a line: a cross-reference pointer and a cross-reference identifier. The cross-reference pointer "points at" a related record, identified by a required, matching unique cross-reference identifier. The cross-reference identifier is analogous to a primary key in relational database terminology.

Grammar

The grammar for the GEDCOM data format--a data representation language--is defined in this chapter. The grammar is a set of rules that specify what sequences of characters are valid GEDCOM expressions. The rules are expressed as a set of pattern definitions, where each pattern is defined in terms of either a more primitive sub-pattern, or a constant. Pattern definitions consist of the pattern name, a separator (:=), followed by either a constant, a more primitive sub-pattern, or a set of alternatives of these. When a set is used, the alternatives are enclosed in square brackets [] with the alternatives separated by a vertical bar ([alternative_1 | alternative_2]). Only one is to be selected. The user can read the grammar components of the selected sub-pattern by substituting any sub-patterns until all sub-patterns are resolved.

A GEDCOM transmission consists of a sequence of physical records, each of which consists of a sequence of gedcom_lines, all contained in a sequential file or stream of characters. The following rules pertain to the gedcom_line:

Grammar Syntax

A gedcom_line has the following syntax:

gedcom_line:=
level delim opt_xref_id tag opt_line_value terminator
for example:
1 OCCU Teacher
The components of the sub-patterns above are defined below in alphabetical order. Some of the components are defined in terms of more primitive sub-patterns:
alpha:=
[ (0x41)-(0x5A) | (0x61)-(0x7A) | 0x5F ]
Any ASCII letter: A-Z, a-z, and (_) underscore
alphanum:=
[ alpha | digit ]
any_char:=
[ alpha | digit | otherchar | (#) | ( ) | (@) (@) ]
delim:=
[ (0x20) ]
space_character
digit:=
[ (0x30)-(0x39) ]
One of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
escape:=
[ (@) (#) escape_text (@) non_at ]
escape_text:=
[ any_char | escape_text any_char ]
The escape_text is coded to meet the rules of a particular GEDCOM form. For the lineage-linked form the definitions are found in Chap. 2.
level:=
[ digit | level digit ]
(Do not use non-significant leading zeroes such as 02.)
line_item:=
[ pointer | escape | any_char ]
line_value:=
[ line_item | line_value line_item ]
non_at:=
[ alpha | digit | otherchar | (#) | ( ) ]
null:=
()
nothing
opt_line_value:=
[ null | delim | delim line_value ]
opt_xref_id:=
[ null | pointer delim ]
otherchar:=
[(0x21)-(0x22) | (0x24)-(0x2F) | (0x3A)-(0x3F) | (0x5B)-(0x5E) | (0x60) | (0x7B)-(0x7E) | (0x80)-(0xFF)]
Any ASCII character except control characters (0x00 - 0x1F), alphanum, space ( ), number sign (#), at character (@), and the DEL character (0x7F).
pointer:=
[ "@" alphanum pointer_string "@" ]
pointer_char:=
[ non_at ]
pointer_string:=
[ null | pointer_char | pointer_string pointer_char ]
tag:=
[ alphanum | tag alphanum ]
terminator:=
[ carriage_return | line_feed | carriage_return line_feed | line_feed carriage_return ]
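
The gedcom_line pattern above can be approximated with a regular expression. The following is a sketch under stated assumptions, not a complete validator: the names GedcomLine, LINE_PATTERN, and parse_line are hypothetical, the terminator is stripped before matching, and the value is only checked for control characters (pointers, escapes, and doubled @ characters inside it are not analyzed).

```python
import re
from typing import NamedTuple, Optional


class GedcomLine(NamedTuple):
    level: int
    xref_id: Optional[str]
    tag: str
    value: Optional[str]


# level delim opt_xref_id tag opt_line_value, with the terminator removed first.
LINE_PATTERN = re.compile(
    r"(?P<level>0|[1-9][0-9]*) "                            # level, no leading zeroes, then delim
    r"(?:(?P<xref_id>@[A-Za-z0-9_][^@\x00-\x1f\x7f]*@) )?"  # optional pointer followed by delim
    r"(?P<tag>[A-Za-z0-9_]+)"                               # one or more alphanum characters
    r"(?: (?P<value>[^\x00-\x1f\x7f]*))?"                   # optional delim and line_value
)


def parse_line(raw: str) -> Optional[GedcomLine]:
    """Return the parts of one gedcom_line, or None if it does not match."""
    match = LINE_PATTERN.fullmatch(raw.rstrip("\r\n"))
    if match is None:
        return None
    return GedcomLine(int(match["level"]), match["xref_id"], match["tag"], match["value"])


example = parse_line("1 OCCU Teacher\r\n")
```

A line with a non-significant leading zero, such as 01 AGE 13, is rejected by this sketch and parse_line returns None.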

Usage Description

alpha:=
The alpha characters include the underscore which is used to link word pieces together in forming tag names or tag labels.
any_char:=
Any character except the control characters found in the range of 0x00 - 0x1F. If an @ is desired as part of the line_value, it must be written in GEDCOM as a double @, i.e., "3 doz. @ $20.00" must be stored as "3 doz. @@ $20.00".
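
The doubling rule for the at character can be sketched as a pair of helper functions. The names encode_at and decode_at are hypothetical, and decode_at assumes the value holds plain text only, with no pointer or escape sequence in it.

```python
def encode_at(text: str) -> str:
    """Double every @ so that plain text can be stored in a line_value."""
    return text.replace("@", "@@")


def decode_at(value: str) -> str:
    """Reverse encode_at for a value known to contain no pointer or escape."""
    return value.replace("@@", "@")


stored = encode_at("3 doz. @ $20.00")
restored = decode_at(stored)
```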
delim:=
The delim (delimiter), a single space character, terminates both the variable-length level number and the variable-length tag. Note that space characters may also be present in a value.
escape:=
The escape is a sequence in the grammar used to specify special processing, such as switching character sets or calendars for date interpretation, or for indicating an inclusion of a non-GEDCOM data form into the GEDCOM structure. The form of the escape sequence is:
@# escape_text @ non_at.
for example:
@#DJULIAN@.
The non_at after the final at character (@) should be discarded if it is a space ( ). Otherwise, it should be retained as part of the text following the escape. Output systems should always place a space ( ) after the escape sequence.
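
The handling of the character after the closing at character can be sketched as follows. The names split_escape and write_escape are hypothetical, and the sketch assumes the escape appears at the start of the value and that the escape_text itself contains no @.

```python
import re
from typing import Optional, Tuple

# "@#" escape_text "@", anchored at the start of the value by match().
ESCAPE = re.compile(r"@#([^@]+)@")


def split_escape(value: str) -> Tuple[Optional[str], str]:
    """Return (escape_text, remaining text); escape_text is None if absent."""
    match = ESCAPE.match(value)
    if match is None:
        return None, value
    rest = value[match.end():]
    # A space directly after the escape is discarded; anything else is kept.
    if rest.startswith(" "):
        rest = rest[1:]
    return match.group(1), rest


def write_escape(escape_text: str, rest: str) -> str:
    """Output side: always place a space after the escape sequence."""
    return f"@#{escape_text}@ {rest}"


with_space = split_escape("@#DJULIAN@ 12 MAY 1920")
without_space = split_escape("@#DJULIAN@12 MAY 1920")
```

Both inputs yield the same remaining text, because only a space is dropped and the digit 1 is retained.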

The specific format of the escape sequence is defined for the specific GEDCOM form being defined. (See chapter 2 for the escape sequence definition for the lineage-linked form).

escape_text:=
The escape_text is defined to meet the requirements of a particular GEDCOM form. For the lineage-linked form the definitions are found in Chap. 2.
level:=
The level number works the same way as the level of indentation in an indented outline, where indented lines provide detail about the item under which they are indented. A line at any level L is enclosed by and pertains directly to the nearest preceding line at level L-1. From one line to the next, the level number may increase by at most 1. Level numbers must not contain non-significant leading zeroes; for example, level one must be written (1), not (01).
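
The two level rules can be checked with a short routine such as the sketch below. The function name check_levels is hypothetical; it takes the level fields as strings, and it assumes that the first line of the input is expected to be at level 0.

```python
from typing import Iterable, List


def check_levels(levels: Iterable[str]) -> List[str]:
    """Return a list of problems found in a sequence of level fields."""
    problems: List[str] = []
    previous = -1  # assumption: the first line must therefore be at level 0
    for position, text in enumerate(levels, start=1):
        if not (text.isascii() and text.isdigit()):
            problems.append(f"line {position}: level {text!r} is not a number")
            continue
        if len(text) > 1 and text.startswith("0"):
            problems.append(f"line {position}: level {text!r} has a leading zero")
        level = int(text)
        if level > previous + 1:
            problems.append(f"line {position}: level increases by more than 1")
        previous = level
    return problems


clean = check_levels(["0", "1", "2", "1", "2"])
padded = check_levels(["0", "01"])
jumped = check_levels(["0", "2"])
```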

The enclosed subordinate lines at level L are said to be in the context of the enclosing superior line at level L-1. The meaning of a tag (see tag below) is interpreted in the context of the tags of the enclosing line(s). Take the following record about an individual's birth and death dates, for example:

0 INDI
  1 BIRT
    2 DATE 12 MAY 1920
  1 DEAT
    2 DATE 1960
In this example, the expression DATE 12 MAY 1920 is interpreted within the INDI (individual) BIRT (birth) context, representing the Individual's birth date. The second DATE is in the INDI DEAT (death) context. The complete meaning of DATE depends on the context. (Note: the above example is indented according to the level numbers to make the concept more obvious. In the actual GEDCOM data there is no indentation, just level numbers lined up vertically on the left margin).
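
The way context gives DATE its meaning can be sketched by tracking the enclosing tags for each line. The function name context_paths is hypothetical; the sketch assumes lines without cross-reference identifiers, as in the example above, and tolerates the illustrative indentation.

```python
from typing import Iterable, List, Optional, Tuple


def context_paths(lines: Iterable[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each line's value with the tags of its enclosing lines."""
    stack: List[str] = []  # stack[i] is the tag of the current line at level i
    result: List[Tuple[str, Optional[str]]] = []
    for line in lines:
        parts = line.strip().split(" ", 2)
        level, tag = int(parts[0]), parts[1]
        del stack[level:]  # leave only the enclosing lines at levels below this one
        stack.append(tag)
        value = parts[2] if len(parts) > 2 else None
        result.append((" ".join(stack), value))
    return result


paths = context_paths([
    "0 INDI",
    "  1 BIRT",
    "    2 DATE 12 MAY 1920",
    "  1 DEAT",
    "    2 DATE 1960",
])
```

The two DATE lines come out with the contexts INDI BIRT DATE and INDI DEAT DATE respectively.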

NOTE: Some existing systems provide an option to produce an indented GEDCOM output for user readability, using space or tab characters between the terminator and the level number of the next line to visibly show the hierarchy. Also, some have suggested allowing extra blank lines to visibly separate physical records. These features may be incorporated into the GEDCOM standard at some future time, but for now, such a change would render some existing systems incompatible. Therefore, we recommend that new systems be prepared to discard extra carriage returns, line feeds, spaces and tabs immediately preceding the level number during input. Output should still be constrained to level numbers without indentation or blank lines, until most receiving systems are prepared to deal with this change.
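
A tolerant reader along the lines recommended in the note might look like the following sketch. The function name read_lines is hypothetical, and the input is assumed to be already decoded into a string.

```python
import re
from typing import List


def read_lines(text: str) -> List[str]:
    """Split a transmission into lines, discarding blank lines and indentation."""
    # Any run of carriage returns and line feeds ends a line, so all four
    # terminator forms are accepted and blank lines collapse away.
    pieces = re.split(r"[\r\n]+", text)
    # Discard spaces and tabs immediately preceding the level number.
    stripped = (piece.lstrip(" \t") for piece in pieces)
    return [line for line in stripped if line]


lines = read_lines("0 INDI\r\n  1 BIRT\n\n\t\t2 DATE 12 MAY 1920\r1 DEAT\n\r")
```

Only leading whitespace is removed; spaces inside or at the end of a value are left untouched.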

line_value:=
The line_value identifies an object within the domain of possible values allowed in the context of the tag. The combination of the tag, the line_value, and the hierarchical context of the supporting gedcom_lines provides the understanding of the enclosed values. This domain is defined by a specific grammar for representing a given GEDCOM form (see chapter 2 for Lineage-linked grammar).

When part of a value is illegible in the source information, the illegible part should be replaced with an ellipsis (...).

Values are generally not encoded in binary or other abbreviation schemes for reducing space requirements, and they are generally constrained to be understandable by a typical user without decoding. This is intended to reduce the decoding burden on the receiving software. A GEDCOM-optimized data compression standard will be defined in the future to reduce space requirements. Meanwhile, users may agree to compress and decompress GEDCOM files using any compression system available to both sender and receiver.

The line_value within the context of a tag hierarchy of gedcom_lines represents one piece of information and corresponds to one field in traditional database or file terminology.

opt_xref_id:=
(See pointer.)
The opt_xref_id is formed by any arbitrary combination of characters from the pointer_char set. The first character must be an alpha or a digit. The opt_xref_id is not retained in the receiving system, and may therefore be formed from any convenient combination of identifiers from the sending system. No meaning is attributed by the receiver to any part of the opt_xref_id, other than its unique association with the associated record. The use of the colon (:) character is also reserved.
otherchar:=
[(0x21)-(0x22) | (0x24)-(0x2F) | (0x3A)-(0x3F) | (0x5B)-(0x5E) | (0x60) | (0x7B)-(0x7E) | (0x80)-(0xFF)]
Any ASCII character except control characters (0x00 - 0x1F), alphanum, space ( ), number sign (#), at character (@), and the DEL character (0x7F).

If any of the excluded control characters appear in the level, xref_ID, or pointer segments of the GEDCOM line, then that substructure should be written to an exception file. If any of them appear in the value segment and the proper escape processing has not been invoked, then they should be replaced by a (^) (0x5E) character, unless the character is a TAB (0x09) character, which can be replaced with a space (0x20) character. These changes should also be recorded in an exception file.
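
The replacement rule for the value segment can be sketched as below. The function name sanitize_value and the report format are hypothetical. The sketch assumes that the characters in question are the control characters 0x00 - 0x1F; treating DEL (0x7F) the same way is a further assumption.

```python
from typing import List, Tuple


def sanitize_value(value: str) -> Tuple[str, List[str]]:
    """Return the cleaned value and a list of exception report entries."""
    cleaned: List[str] = []
    report: List[str] = []
    for index, char in enumerate(value):
        code = ord(char)
        if char == "\t":
            cleaned.append(" ")  # TAB becomes a space
            report.append(f"position {index}: TAB replaced with space")
        elif code < 0x20 or code == 0x7F:  # DEL handling is an assumption
            cleaned.append("^")
            report.append(f"position {index}: 0x{code:02X} replaced with ^")
        else:
            cleaned.append(char)
    return "".join(cleaned), report


cleaned_value, report = sanitize_value("a\tb\x07c")
```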

pointer:=
A pointer stands in the place of the context identified by the matching xref_id. Theoretically, a receiving system should be prepared to follow a pointer to find any needed value in a manner that is transparent to the logic of the subsystem that is looking for specific tags. This highly flexible facility will probably be used more in the future. For the time being, however, the use of pointers is explicitly defined within the GEDCOM form (such as the one defined in Chapter 2).

The pointer represents the association between two objects that usually reside in different records. There can, however, be an association between objects within the same logical record. This condition is indicated by an exclamation mark (!) in the pointer, separating the parent record's cross-reference ID from the cross-reference ID of the specific substructure, which is at some level subordinate to the logical record at level zero. The cross-reference ID of a substructure subordinate to a zero-level record is always composed of the record ID number and the substructure ID number, such as @I132!1@. Including the record ID number in pointers that associate objects within a record allows GEDCOM processors to build the index only at the record level and then search sequentially for the appropriate substructure cross-reference ID.
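
Separating the record ID from the substructure ID might be done as in the sketch below. The function name split_pointer is hypothetical; the sketch assumes that at most one exclamation mark is significant and does not validate the individual characters.

```python
from typing import Optional, Tuple


def split_pointer(pointer: str) -> Tuple[str, Optional[str]]:
    """Return (record ID, substructure ID or None) for a pointer such as @I132!1@."""
    if len(pointer) < 3 or not (pointer.startswith("@") and pointer.endswith("@")):
        raise ValueError(f"not a pointer: {pointer!r}")
    body = pointer[1:-1]
    record_id, separator, substructure_id = body.partition("!")
    return record_id, (substructure_id if separator else None)


within_record = split_pointer("@I132!1@")
whole_record = split_pointer("@1234@")
```

A processor can then look up the record ID in a record-level index and scan that record sequentially for the substructure.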

Complex logical record structures are divided into small physical records to accommodate memory constraints, many-to-many relationships, and independent record creation and deletion.

The pointer must match a corresponding xref_id within the transmission, unless the colon (:) character is present (future network reference to a permanent file record). A pointer is given instead of duplicating an object, though the logical result is equivalent. An expanded traversal of a record tree includes following the pointers to related records to some depth, and splicing those records (logically) into the resultant expanded tree. Pointers may refer either to records that have not yet appeared in the transmission (forward reference) or to records that have already appeared earlier in the transmission (backward reference). This arrangement usually requires a preliminary pass to construct a lookup table to support random access by xref_id during subsequent passes.
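
The preliminary pass can be sketched as follows. The names build_index and unresolved_pointers are hypothetical. The sketch indexes only level-0 records, assumes identifiers contain no spaces, and only recognizes a pointer when it is the entire value of a line. The sample lines are grammatical illustrations, not meaningful genealogical data.

```python
import re
from typing import Dict, List

POINTER = re.compile(r"@[A-Za-z0-9_][^@]*@")


def build_index(lines: List[str]) -> Dict[str, int]:
    """First pass: map each record-level xref_id to the index of its line."""
    index: Dict[str, int] = {}
    for number, line in enumerate(lines):
        parts = line.split(" ", 2)
        if parts[0] == "0" and len(parts) == 3 and POINTER.fullmatch(parts[1]):
            index[parts[1]] = number
    return index


def unresolved_pointers(lines: List[str], index: Dict[str, int]) -> List[str]:
    """Second pass: list pointer values that match no xref_id in the index."""
    missing: List[str] = []
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) == 3 and POINTER.fullmatch(parts[2]):
            pointer = parts[2]
            # A colon marks a reserved future network reference; skip it.
            if ":" not in pointer and pointer not in index:
                missing.append(pointer)
    return missing


lines = ["0 @5678@ INDI", "1 CHIL @1234@", "0 @1234@ INDI", "1 AGE 13"]
index = build_index(lines)
missing = unresolved_pointers(lines, index)
```

Because the index is complete before any pointer is checked, the forward reference to @1234@ on the second line resolves without special handling.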

tag:=
A tag consists of a variable-length sequence of alphanum characters. All user-defined tags, that is, tags that have not been defined by the GEDCOM standard, must begin with an underscore character (0x5F). All user-defined tags must be defined in the SCHEMA substructure of the HEADer record.

The tag represents the meaning of the line_value within the context of the enclosing lines, and contributes to the meaning of enclosed subordinate lines. Specific tags are defined in Appendix A.

Although existing tags are only three or four characters long, systems should prepare to handle tags of any length. Tags will be unique within the first 15 characters.
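
The tag rules above can be sketched with a few small helpers. The names is_valid_tag, is_user_defined, and tag_key are hypothetical, as is the tag _PET; comparing tags by their first 15 characters is one possible reading of the uniqueness rule, not a requirement stated by the standard.

```python
import re

TAG_PATTERN = re.compile(r"[A-Za-z0-9_]+")
SIGNIFICANT_LENGTH = 15  # tags are unique within their first 15 characters


def is_valid_tag(tag: str) -> bool:
    """A tag is one or more alphanum characters (letters, digits, underscore)."""
    return TAG_PATTERN.fullmatch(tag) is not None


def is_user_defined(tag: str) -> bool:
    """User-defined tags begin with an underscore."""
    return tag.startswith("_")


def tag_key(tag: str) -> str:
    """Key for comparing tags of any length by their significant prefix."""
    return tag[:SIGNIFICANT_LENGTH]


long_tag = "A" * 20
```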

Valid combinations of specific tags, line_values, xref_ids, and pointers are constrained by the GEDCOM form defined for representing a given kind of information (see chapter 2 for the Lineage-linked form grammar).

terminator:=
The terminator delimits the variable-length line_value and signals the end of the gedcom_line. The valid terminator characters are:
[ carriage_return | line_feed | carriage_return line_feed | line_feed carriage_return ]
Examples:
The following are examples of valid but unrelated GEDCOM-lines:
0 @1234@ INDI
1 AGE 13
1 CHIL @1234@
The first line has a level number 0, an xref_id of @1234@, an INDI tag, and no value. The second line has a level number 1, no xref_id, an AGE tag, and a value of 13. The third line has a level number 1, no xref_id, a CHIL tag, and a value that is a pointer to the xref_id @1234@.