This chapter describes the core GEDCOM data representation language.
The generic data representation language defined in this chapter may be used to represent any form
of structured information, not just genealogical data, using a sequential stream of characters.
A GEDCOM transmission represents a database in the form of a sequential stream of related
records. A record is represented as a sequence of tagged, variable-length lines, arranged in a
hierarchy. A line always contains a hierarchical level number and a tag, and may contain a value. A line
may also contain a cross-reference identifier or a pointer. The GEDCOM-line is terminated by a
carriage return, a line feed character, or any combination of these.
The tag in the GEDCOM-line identifies the type of information contained in the line, in the same
sense that a field-name identifies a field in a database record. This means that the data is
self-defining. Tags allow a field to occur any number of times within a record, including zero times.
They also allow different or new fields to be included in the GEDCOM data without
introducing incompatibility, because the receiving system ignores data that it does not
understand and processes only the data that it does understand.
The hierarchical relationships are indicated by the hierarchical level number. Subordinate lines have
a higher level number. The hierarchy allows a line to have sub-lines, which in turn may have their
own sub-lines, and so forth. A line and its sub-lines constitute a context or enclosure, that is, a
cluster of information pertaining directly to the same thing. This hierarchical arrangement
corresponds with the natural hierarchy found in most structured information.
A series of one or more lines constitutes a record. The beginning of a new record is indicated by a
line whose level number is 0 (zero).
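The level-number rules above can be sketched in Python. This is a minimal illustration, not part of the GEDCOM definition: the `Node` class and `build_records` function are hypothetical names, and the input is assumed to be already split into (level, text) pairs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    level: int
    text: str                      # everything after the level number
    children: list = field(default_factory=list)


def build_records(lines):
    """Group (level, text) pairs into records; each level-0 line starts a record."""
    records = []
    stack = []                     # stack[i] is the currently open line at level i
    for level, text in lines:
        node = Node(level, text)
        if level == 0:
            records.append(node)
            stack = [node]
            continue
        if level > len(stack):
            raise ValueError(f"level {level} has no enclosing line")
        del stack[level:]          # close open lines at this level or deeper
        stack[-1].children.append(node)
        stack.append(node)
    return records
```

Each element of the returned list is one record, and the sub-lines of any line are found in its `children`.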
A GEDCOM receiver system scans the input for expected information by looking for specific tags
and processing the associated values. Unrecognized tags (perhaps from a sending system whose
database contains some different information) are handled by processing neither the associated
value nor its enclosed sub-lines; that is, the entire context is ignored. Such lines are treated as
exceptions: they are printed in an exception report or saved in some generic way. Saved exception
lines may be recombined when the data is exported.
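One way a receiver might set aside an unrecognized tag together with its whole context is sketched below. The function name `filter_known` and the (level, tag, value) tuple shape are assumptions made for illustration only.

```python
def filter_known(lines, known_tags):
    """Split (level, tag, value) tuples into processed lines and exceptions.

    An unrecognized tag causes that line and all of its enclosed sub-lines
    (the following lines with a higher level number) to be set aside.
    """
    accepted, exceptions = [], []
    skip_below = None              # level of the unrecognized line being skipped
    for level, tag, value in lines:
        if skip_below is not None and level > skip_below:
            exceptions.append((level, tag, value))
            continue
        skip_below = None
        if tag in known_tags:
            accepted.append((level, tag, value))
        else:
            exceptions.append((level, tag, value))
            skip_below = level
    return accepted, exceptions
```

The `exceptions` list keeps the skipped lines in their original order, so they could be printed in a report or saved and written back out on export.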
In addition to hierarchical relationships, GEDCOM defines inter-record relationships which allow a
record to be logically related to other records, without introducing redundancy. These relationships
are represented by two additional but optional parts of a line: a cross-reference pointer and a
cross-reference identifier. The cross-reference pointer "points at" a related record, identified by a
required, matching unique cross-reference identifier. The cross-reference identifier is analogous to a
primary key in relational database terminology.
The grammar for the GEDCOM data format--a data representation language--is defined in this
chapter. The grammar is a set of rules that specify what sequences of characters are valid
GEDCOM expressions. The rules are expressed as a set of pattern definitions, where each pattern
is defined in terms of either a more primitive sub-pattern, or a constant. Pattern definitions consist
of the pattern name, a separator (:=), followed by either a constant, a more primitive sub-pattern,
or a set of alternatives of these. When a set is used, the alternatives are enclosed in square brackets
[] with the alternatives separated by a vertical bar ([alternative_1 | alternative_2]). Only one
alternative is to be selected. The reader can expand the selected sub-pattern by substituting
the definitions of its sub-patterns until every sub-pattern is resolved.
A GEDCOM transmission consists of a sequence of physical records, each of which consists of a
sequence of gedcom_lines, all contained in a sequential file or stream of characters. The
following rules pertain to the gedcom_line:
A gedcom_line has the following syntax:
The specific format of the escape sequence is defined by each GEDCOM form.
(See chapter 2 for the escape sequence definition for the lineage-linked form.)
The enclosed subordinate lines at level L are said to be in the context of the enclosing
superior line at level L-1. The meaning of a tag (see tag below) is interpreted in the context
of the tags of the enclosing line(s). Take the following record about an individual's birth
and death dates, for example:
NOTE: Some existing systems provide an option to produce an indented GEDCOM output
for user readability, using space or tab characters between the terminator and the level
number of the next line to visibly show the hierarchy. Also, some have suggested allowing
extra blank lines to visibly separate physical records. These features may be incorporated
into the GEDCOM standard at some future time, but for now, such a change would render
some existing systems incompatible. Therefore, we recommend that new systems be
prepared to discard extra carriage returns, line feeds, spaces and tabs immediately preceding
the level number during input. Output should still be constrained to level numbers without
indentation or blank lines, until most receiving systems are prepared to deal with this change.
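The input-side recommendation in this note might be implemented along the following lines. This is a sketch only; `normalize_input_lines` is a hypothetical helper, and it assumes the stream has already been split into lines.

```python
def normalize_input_lines(lines):
    """Drop blank lines and strip whitespace that precedes the level number."""
    cleaned = []
    for raw in lines:
        # Discard carriage returns, line feeds, spaces, and tabs before the level.
        stripped = raw.lstrip(" \t\r\n")
        if stripped:               # skip lines that were blank or only whitespace
            cleaned.append(stripped)
    return cleaned
```

Only leading whitespace is removed; anything after the level number, including trailing spaces in a value, is left untouched.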
Values whose source information contains illegible parts should be indicated by
replacing the illegible part with an ellipsis (...).
Values are generally not encoded in binary or other abbreviation schemes for reducing space
requirements, and they are generally constrained to be understandable by a typical user
without decoding. This is intended to reduce the decoding burden on the receiving software.
A GEDCOM-optimized data compression standard will be defined in the future to reduce
space requirements. Meanwhile, users may agree to compress and decompress GEDCOM
files using any compression system available to both sender and receiver.
The line_value within the context of a tag hierarchy of gedcom_lines represents one piece of
information and corresponds to one field in traditional database or file terminology.
If any of these characters appear in the level, xref_ID, or pointer segments of the GEDCOM
line, then that substructure should be written to an exception file. If any of these characters
appear in the value segment and the proper escape processing has not been invoked, then
they should be replaced by a caret (^) (0x5E) character.
The pointer represents the association between two objects that usually reside in different
records. There can, however, be an association between objects within the same logical
record. If this condition exists, it is indicated in the pointer composition by an
exclamation mark (!) character.
Complex logical record structures are divided into small physical records to accommodate
memory constraints, many-to-many relationships, and independent record creation and
deletion.
The pointer must match a corresponding xref_id within the transmission, unless the colon (:)
character is present.
The tag represents the meaning of the line_value within the context of the enclosing lines,
and contributes to the meaning of enclosed subordinate lines. Specific tags are defined in
Appendix A.
Although existing tags are only three or four characters long, systems should prepare to
handle tags of any length. Tags will be unique within the first 15 characters.
Valid combinations of specific tags, line_values, xref_ids, and pointers are constrained by
the GEDCOM form defined for representing a given kind of information (see chapter 2 for
the Lineage-linked form grammar).
Concepts
Grammar
1
, not 01
.
_
). The schema allows a receiving system to interpret the associated
data. (See the User Defined Tags section in chapter 2 for more information).
Grammar Syntax
The components of the sub-patterns above are defined below in alphabetical order. Some of the
components are defined in terms of more primitive sub-patterns:
1 OCCU Teacher
Any ASCII letter: A-Z, a-z, and underscore (_)
#
) | (
) | (@
) (@
) ]
space_character
One of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
@
) (#
) escape_text (@
) non_at ]
The escape_text is coded to meet the rules of a particular GEDCOM form. For the lineage-linked form the definitions are found in Chap. 2.
(Do not use non-significant leading zeroes such as 02.)
#
) | (
) ]
nothing
Any ASCII character except control characters (0x00 - 0x1F), alphanum, space ( ), number sign (#),
at character (@), and the DEL character (0x7F).
@
" alphanum pointer_string "@
" ]
Usage Description
@, i.e., "3 doz. @ $20.00" must be stored as "3 doz. @@ $20.00".
The non_at after the final at character (@) of an escape (@# escape_text @ non_at, for example
@#DJULIAN@) should be discarded if it is a space ( ). Otherwise, it should be retained as part
of the text following the escape. Output systems should always place a space ( ) after the
escape sequence.
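The two at-sign rules described here, doubling a literal at sign and handling the character after an escape, can be sketched as follows. The function names are hypothetical, and the sketch assumes that an escape occurs only at the start of a value and that escape_text itself contains no at sign; the date text used in the usage note is likewise only illustrative.

```python
import re


def encode_at_signs(text):
    """Double every literal at sign before storing text in a line value."""
    return text.replace("@", "@@")


def decode_at_signs(text):
    """Collapse doubled at signs in plain text (text with no pointer or escape)."""
    return text.replace("@@", "@")


# "@#" escape_text "@" at the start of a value (assumed position).
_ESCAPE = re.compile(r"@#([^@]*)@")


def split_escape(value):
    """Return (escape_text, remaining_text); escape_text is None if absent."""
    m = _ESCAPE.match(value)
    if m is None:
        return None, value
    rest = value[m.end():]
    if rest.startswith(" "):       # a space after the final "@" is discarded
        rest = rest[1:]
    return m.group(1), rest        # any other character is kept as text


def join_escape(escape_text, text):
    """Write an escape followed by the single space that output should include."""
    return f"@#{escape_text}@ {text}"
```

For instance, `split_escape("@#DJULIAN@ 12 MAY 1920")` separates the escape text `DJULIAN` from the text that follows it, and `join_escape` writes the pair back with the space that output systems are asked to include.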
1
), not (01
).
0 INDI
  1 BIRT
    2 DATE 12 MAY 1920
  1 DEAT
    2 DATE 1960
In this example, the expression DATE 12 MAY 1920 is interpreted within the INDI
(individual) BIRT (birth) context, representing the individual's birth date. The second
DATE is in the INDI DEAT (death) context. The complete meaning of DATE depends on
the context. (Note: the above example is indented according to the level numbers to make
the concept more obvious. In the actual GEDCOM data there is no indentation, just level
numbers lined up vertically on the left margin.)
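A receiver could make the context of each line explicit by tracking the tags of the enclosing lines. The sketch below is one possible approach; the name `tag_paths` and the dotted path notation are assumptions for illustration and are not GEDCOM syntax. It assumes well-formed level numbers.

```python
def tag_paths(lines):
    """Yield (path, value) for each (level, tag, value) tuple.

    The path joins the tags of the enclosing lines with the line's own tag,
    for example "INDI.BIRT.DATE".
    """
    stack = []
    for level, tag, value in lines:
        del stack[level:]          # keep only the tags of the enclosing lines
        stack.append(tag)
        yield ".".join(stack), value
```

Applied to the INDI example above, the two DATE lines receive the distinct paths `INDI.BIRT.DATE` and `INDI.DEAT.DATE`.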
...
(ellipses).
The opt_xref_id is formed by any arbitrary combination of characters from the pointer_char
set. The first character must be an alpha or a digit. The opt_xref_id is not retained in the
receiving system, and may therefore be formed from any convenient combination of
identifiers from the sending system. No meaning is attributed by the receiver to any part of
the opt_xref_id, other than its unique association with the associated record. The use of the
colon (:) character is also reserved.
Any ASCII character except control characters (0x00 - 0x1F), alphanum, space ( ), number
sign (#), at character (@), and the DEL character (0x7F).
The replacement character is the caret (^) (0x5E), unless the character is a TAB (0x09)
character, which can be replaced with a space (0x20) character. These changes should also
be recorded in an exception file.
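The replacement rule can be sketched as below. This is an illustration under a stated assumption: the characters to be replaced are taken to be the control characters (0x00 - 0x1F) and DEL (0x7F) excluded in the definition above. The function name `sanitize_value` and the shape of the returned change list are hypothetical.

```python
def sanitize_value(value):
    """Replace disallowed characters in a line value.

    Returns (cleaned_value, changes), where changes lists
    (position, original_character) pairs for an exception report.
    """
    cleaned, changes = [], []
    for i, ch in enumerate(value):
        code = ord(ch)
        if ch == "\t":
            cleaned.append(" ")    # TAB (0x09) becomes a space (0x20)
            changes.append((i, ch))
        elif code <= 0x1F or code == 0x7F:
            cleaned.append("^")    # assumed disallowed set; becomes a caret (0x5E)
            changes.append((i, ch))
        else:
            cleaned.append(ch)
    return "".join(cleaned), changes
```

The `changes` list gives the caller what it needs to record each substitution in an exception file.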
The exclamation mark (!) character separates the parent record's cross-reference ID from the
specific substructure's cross-reference ID, which is at some level subordinate to the logical
record at level zero. The cross-reference ID of a substructure subordinate to a zero level record
is always composed of the Record ID number and the Substructure ID number, such as @I132!1@.
Including the Record ID number in the pointers that associate objects within a record
allows GEDCOM processors to build the index only at the record level and then
search sequentially for the appropriate substructure cross-reference ID.
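Separating the two parts of such a pointer might look like the following sketch. The function name `split_pointer` is hypothetical, and the code assumes the pointer is delimited by at signs as in the @I132!1@ example.

```python
def split_pointer(pointer):
    """Split "@I132!1@" into ("I132", "1").

    The substructure ID is None when the pointer has no exclamation mark.
    """
    if len(pointer) < 3 or not (pointer.startswith("@") and pointer.endswith("@")):
        raise ValueError(f"not a pointer: {pointer!r}")
    record_id, sep, sub_id = pointer[1:-1].partition("!")
    return record_id, (sub_id if sep else None)
```

A processor could look up the record ID in a record-level index and then scan that record sequentially for the substructure ID.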
A colon (:) character in a pointer indicates a future network reference to a permanent file
record. A pointer is
given instead of duplicating an object, though the logical result is equivalent. An expanded
traversal of a record tree includes following the pointers to related records to some depth,
and splicing those records (logically) into the resultant expanded tree. Pointers may refer
either to records which have not yet appeared in the transmission (forward reference) or to
records that have already appeared earlier in the transmission (backward reference). This
arrangement usually requires a preliminary pass to construct a lookup table to support
random access by xref_id during subsequent passes.
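The preliminary pass can be sketched as follows. The record shape used here (a dict with optional "xref_id" and "pointers" keys) is a hypothetical in-memory representation chosen for illustration; GEDCOM does not define it.

```python
def build_xref_index(records):
    """First pass: map each xref_id to its record."""
    index = {}
    for record in records:
        xref_id = record.get("xref_id")
        if xref_id is None:
            continue
        if xref_id in index:
            raise ValueError(f"duplicate xref_id {xref_id}")
        index[xref_id] = record
    return index


def unresolved_pointers(records, index):
    """Second pass: list pointers that match no xref_id in the transmission."""
    missing = []
    for record in records:
        for pointer in record.get("pointers", []):
            # Pointers containing a colon are not checked in this sketch.
            if ":" not in pointer and pointer not in index:
                missing.append(pointer)
    return missing
```

Because the index is complete before any pointer is checked, forward and backward references are handled the same way.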
[ carriage_return | line_feed | carriage_return line_feed | line_feed carriage_return ]
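A reader that accepts all four terminator alternatives could split the stream as in this sketch. The name `split_gedcom_lines` is hypothetical, and dropping empty results is a choice made here for simplicity.

```python
import re

# Two-character alternatives come first so that a carriage return and line
# feed pair, in either order, counts as one terminator.
_TERMINATOR = re.compile(r"\r\n|\n\r|\r|\n")


def split_gedcom_lines(data):
    """Split a character stream into lines on any accepted terminator."""
    return [line for line in _TERMINATOR.split(data) if line]
```

Filtering out empty strings also discards blank lines, so a stricter reader might keep them and report them instead.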
0 @1234@ INDI
1 AGE 13
1 CHIL @1234@
1 NOTE This is a note field that is
2 CONT continued on the next line.
The first line has a level number 0, an xref_id of @1234@, an INDI tag, and no value. The
second line has a level number 1, no xref_id, an AGE tag, and a value of 13. The third line
has a level number 1, no xref_id, a CHIL tag, and a value of a pointer to an xref_id named
@1234@.
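The parts of a line described in this example can be extracted with a short parser. The sketch below is not the full grammar: the regular expression is a hypothetical simplification that assumes a single space as delimiter and covers only the simple cases shown above.

```python
import re

# level, optional xref_id, tag, optional value (simplified, assumed pattern)
_LINE = re.compile(
    r"(?P<level>0|[1-9][0-9]*) "       # no non-significant leading zeroes
    r"(?:(?P<xref_id>@[^@ ]+@) )?"
    r"(?P<tag>[A-Za-z0-9_]+)"
    r"(?: (?P<value>.*))?$"
)
_POINTER = re.compile(r"@[^@#][^@]*@")


def parse_line(line):
    """Parse one GEDCOM line into a dict of its parts."""
    m = _LINE.match(line)
    if m is None:
        raise ValueError(f"not a valid line: {line!r}")
    parts = m.groupdict()
    parts["level"] = int(parts["level"])
    value = parts["value"]
    # A value that consists entirely of "@...@" is treated as a pointer.
    is_pointer = value is not None and _POINTER.fullmatch(value) is not None
    parts["pointer"] = value if is_pointer else None
    return parts
```

Parsing the first example line yields level 0, the xref_id `@1234@`, the tag `INDI`, and no value; the third line yields a pointer equal to its value.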