This is an exploratory draft of the serialisation format for FHISO’s proposed suite of Extended Legacy Format (ELF) standards. This document is not endorsed by the FHISO membership, and may be updated, replaced or obsoleted by other documents at any time.
Comments on this draft should be directed to the tsc-public@fhiso.org mailing list.
FHISO’s Extended Legacy Format (or ELF) is a hierarchical serialisation format and genealogical data model that is fully compatible with GEDCOM, but with the addition of a structured extensibility mechanism. It also clarifies some ambiguities that were present in GEDCOM and documents best current practice.
The GEDCOM file format developed by The Church of Jesus Christ of Latter-day Saints is the de facto standard for the exchange of genealogical data between applications and data providers. Its most recent version is GEDCOM 5.5.1 which was produced in 1999, but despite many technological advances since then, GEDCOM has remained unchanged.
FHISO are undertaking a program of work to produce a modernised yet backward-compatible reformulation of GEDCOM under the name ELF, the new name having been chosen to avoid confusion with any other updates or extensions to GEDCOM, or any future use of the name by The Church of Jesus Christ of Latter-day Saints. This document is one of five that form the initial suite of ELF standards, known collectively as ELF 1.0.0:
ELF: Primer. This is not a formal standard, but is being released alongside the ELF standards to provide a broad overview of ELF written in a less formal style. It gives particular emphasis to how ELF differs from GEDCOM.
ELF: Serialisation Format. This standard defines a general-purpose serialisation format based on the GEDCOM data format which encodes a dataset as a hierarchical series of lines, and provides low-level facilities such as escaping.
ELF: Schemas. This standard defines flexible extensibility and validation mechanisms on top of the serialisation layer. Although it is an optional component of ELF 1.0.0, future extensions to ELF will be defined using ELF schemas.
ELF: Date, Age and Time Microformats. This standard defines microformats for representing dates, ages and times in arbitrary calendars, together with how they are applied to the Gregorian, Julian, French Republican and Hebrew calendars.
ELF: Data Model. This standard defines a data model based on the lineage-linked GEDCOM form, reformulated to be usable with the ELF serialisation model and schemas. It is not a major update to the GEDCOM data model, but rather a basis for future extension and revision.
Where this standard gives a specific technical meaning to a word or phrase, that word or phrase is formatted in bold text in its initial definition, and in italics when used elsewhere. The key words must, must not, required, shall, shall not, should, should not, recommended, not recommended, may and optional in this standard are to be interpreted as described in [RFC 2119].
An application is conformant with this standard if and only if it obeys all the requirements and prohibitions contained in this document, as indicated by use of the words must, must not, required, shall and shall not, and the relevant parts of its normative references. Standards referencing this standard must not loosen any of the requirements and prohibitions made by this standard, nor place additional requirements or prohibitions on the constructs defined herein.
This standard depends on FHISO’s Basic Concepts for Genealogical Standards standard. To be conformant with this standard, an application must also be conformant with the referenced parts of [Basic Concepts]. Concepts defined in that standard are used here without further definition.
Certain facilities in this standard are described as deprecated, which is a warning that they are likely to be removed from a future version of this standard. This has no bearing on whether a conformant application must implement a facility: it may be required, recommended or optional as described in this standard.
Indented text in grey or coloured boxes does not form a normative part of this standard, and is labelled as either an example or a note.
The grammar given here uses the form of EBNF notation defined in §6 of [XML], except that no significance is attached to the capitalisation of grammar symbols. Conforming applications must not generate data not conforming to the syntax given here, but non-conforming syntax may be accepted and processed by a conforming application in an implementation-defined manner, providing a warning is issued to the user, except where this standard says otherwise.
The grammar productions in this standard use the S and Char productions defined in §2 of [Basic Concepts] to match any non-empty sequence of whitespace characters or any valid character, respectively.
The ELF serialisation format is a structured, line-based text format for encoding data in a hierarchical manner that is both machine-readable and human-readable.
At a logical level, an ELF document is built from structures, the name ELF gives to the basic hierarchical data structures used to represent data. ELF uses two types of structure: tagged structures and typed structures. The serialisation layer described in this standard only deals with tagged structures, and the word structure is frequently used in this document to refer to what is properly a tagged structure.
A tagged structure consists of:
FAMC structure is an example. Such structures do not neatly fit into the entity–attribute–value paradigm.
The tag describes how the structure is to be interpreted, and structures are commonly referred to by their tag in this standard.
For example, a structure whose tag is “NOTE” will often be called a NOTE structure.
The payload is either a language-tagged string or a pointer to another structure. A payload which is a language-tagged string is referred to as a string payload.
For example, a structure might have a tag of “AUTH” and a payload which is a language-tagged string consisting of the string “鈴木眞年” tagged with the language tag ja. The AUTH tag is defined in [ELF Data Model] as meaning “the name of the primary creator of the source”, and 鈴木眞年 is the name of genealogist Suzuki Matoshi, written in his native Japanese language, which is denoted by the language tag ja.
When the payload of a structure is a pointer, this represents a link between two structures, with the pointer in one structure referencing the cross-reference identifier in a second structure.
FAM tag, and individual records denoted by the INDI tag. These links are how genealogical relationships are represented in ELF. A FAM structure may contain a CHIL substructure whose payload is a pointer. Elsewhere in the document, there will be an INDI structure whose cross-reference identifier is identical to the pointer in the payload of the CHIL substructure of the FAM structure. This states that the person represented by the INDI structure is a child of the family represented by the FAM structure.
A top-level structure, meaning a structure which is not a substructure of any other structure, is called a record. An ELF document or dataset can have arbitrarily many records.
HEAD and TRLR are not records. Probably.
At a lexical level, a structure is encoded as a sequence of lines, each terminated with a line break. The first line encodes the cross-reference identifier, tag and payload of the structure, while any substructures are encoded in order on subsequent lines. Each line consists of the following components, in order, separated by whitespace: a level, an optional cross-reference identifier, a tag, and an optional payload.
0 HEAD
1 CHAR UTF-8
1 GEDC
2 VERS 5.5.1
2 FORM LINEAGE-LINKED
1 ELF 1.0.0
0 INDI
1 NAME Charlemagne
0 TRLR
This ELF document has three lines with level 0 which mark the start of the three top-level structures or records. These records have, respectively, three, one and zero substructures, which are denoted by the lines with level 1. The structure represented by the line with an ELF tag is a substructure of the HEAD record because there is no intervening line with a level of 0, that is, one less than 1; the structure represented by the NAME line naming Charlemagne is a substructure of the INDI record, as that is the preceding line with a level of 0. The TRLR record is an example of a record with no substructures.
Five of the lines in this example document have a payload. For example, the payload of the FORM line is the string “LINEAGE-LINKED”, while the payload of the NAME line is the string “Charlemagne”. None of the lines in this example have payloads which are pointers, nor do any have a cross-reference identifier.
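The way levels determine nesting can be sketched in Python. This is only an illustration and not part of the standard: the `Structure` class and `build_records` function are hypothetical names, and the sketch assumes well-formed lines with no cross-reference identifiers and single-space separators.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Structure:
    """Hypothetical in-memory form of a tagged structure."""
    tag: str
    payload: Optional[str] = None
    substructures: List["Structure"] = field(default_factory=list)


def build_records(lines: List[str]) -> List[Structure]:
    """Nest lines into structures using only their levels."""
    records: List[Structure] = []
    stack: List[Structure] = []  # stack[n] is the open structure at level n
    for line in lines:
        level_text, _, rest = line.partition(" ")
        tag, _, payload = rest.partition(" ")
        level = int(level_text)
        node = Structure(tag, payload or None)
        del stack[level:]  # close any structures at this level or deeper
        if level == 0:
            records.append(node)  # a level 0 line starts a new record
        else:
            stack[level - 1].substructures.append(node)
        stack.append(node)
    return records


EXAMPLE = [
    "0 HEAD", "1 CHAR UTF-8", "1 GEDC", "2 VERS 5.5.1",
    "2 FORM LINEAGE-LINKED", "1 ELF 1.0.0", "0 INDI",
    "1 NAME Charlemagne", "0 TRLR",
]
records = build_records(EXAMPLE)
```

Applied to the example document, `records` holds three top-level structures with three, one and zero substructures respectively, matching the description above.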
A conformant application which parses the ELF serialisation format is called an ELF parser. A conformant application which outputs data in the ELF serialisation format is called an ELF writer.
The input to an ELF parser and output of an ELF writer is an octet stream, which is a sequence of 8-bit bytes or octets each with a value between 0 and 255.
This standard defines how an octet stream is parsed into a dataset, and how a dataset is serialised into an octet stream. Overviews of these processes can be found in §2.2 and §2.3, respectively. An octet stream which this standard requires an ELF parser to be able to read is called a conformant source.
An octet stream which is not a conformant source is called a non-conformant source. If the input to an ELF parser is not a conformant source, unless this standard says otherwise, the application must either terminate processing that octet stream or present a warning or error message to the user. If it continues processing, it does so in an implementation-defined manner.
This standard also recognises a class of application which reads data in the ELF serialisation format, applies a small number of changes to that data, and immediately produces output in the ELF serialisation format which is identical to the input, octet for octet, other than where the requested changes have been made. Such an application is called an ELF editor.
ELF editors are not required to conform to the full requirements of an ELF parser or ELF writer. The only requirement this standard places on ELF editors is that, when acting on a conformant source, they must either generate output which is a conformant source, or present a warning or error message to the user, or terminate.
This standard has an optional dependency on the [ELF Schemas] standard, which provides additional functionality for validating ELF documents and extending the ELF data model. An application which conforms to the [ELF Schemas] standard is described as schema-aware; other applications are described as non-schema-aware.
The parsing process can be summarised as follows:
An octet stream is converted to a sequence of line strings by:
determining its character encoding by
splitting on line breaks per §3.4.
Line strings are converted into records by:
The header record is parsed for serialisation metadata per §5.2.
A second pass is made recursively over each record, processing it per §4.2.2:
if the parser is schema-aware, converting tagged structures into typed structures, as described in [ELF Schemas]; and
each string payload is unescaped by:
The semantics of serialisation are defined by the following procedural outline.
The tagged structures are ordered and additional tagged structures created to represent serialisation metadata.
This step cannot happen before tagging because tagging may generate serialisation metadata that needs to be included in the tagged structures.
Payloads are converted to create xref structures by simultaneously
@ characters
Semantically, these actions must happen concurrently because none of them should be applied to the others’ results.
This step cannot happen before tagging because tags are needed to determine the set of valid escapes. This step cannot happen before adding serialisation metadata because it is applied to the serialisation metadata as well.
The dataset is converted to a sequence of lines by
CONT and CONC
This step cannot happen before payload conversion because valid split points are dependent on proper escaping. This step must happen before encoding as octets because valid split points are determined by character, not octet.
The sequence of lines is converted to an octet stream by
A collection of structures intended to describe information about the dataset as a whole.
The relative order of structures with the same structure type identifier SHALL be preserved within this collection; the relative order of structures with distinct structure type identifiers is not defined by this specification.
A collection of any number of substructures, which are structures.
The relative order of structures with the same structure type identifier SHALL be preserved within this collection; the relative order of structures with distinct structure type identifiers is not defined by this specification.
In order to parse an ELF document, an ELF parser shall first convert the octet stream into a sequence of line strings, which are strings containing the unparsed lexical representations of lines.
The way in which octets are mapped to characters is called the character encoding of the document. ELF supports several different character encodings. Determining which is used is a two-stage process, with the first stage being to determine the detected character encoding of the octet stream per §3.1. Frequently there will be no detected character encoding.
Next, the initial portion of the octet stream is converted to characters using the detected character encoding or, if there is none, in an ASCII-compatible manner. This character sequence is then scanned for a CHAR line whose payload identifies the specified character encoding. This process is described in §3.2. If there is a specified character encoding, it is used as the character encoding for the ELF document; otherwise the detected character encoding is used, failing which the default is the UTF-8 character encoding. Considerations for reading specific character encodings can be found in §3.3.
Once the character encoding is determined, the octet stream can be converted into a sequence of characters which are assembled into line strings as described in §3.4. The process of serialising a line string back into an octet stream is far simpler as the intended character encoding is already known; this process is described in §3.5.
If the octet stream begins with a byte-order mark (U+FEFF) encoded in UTF-8, the detected character encoding shall be UTF-8; or if the application supports the optional UTF-16 encoding and the octet stream begins with a byte-order mark encoded in UTF-16 of either endianness, the detected character encoding shall be UTF-16 of the appropriate endianness. The byte-order mark shall be removed from the octet stream before further processing.
Otherwise, if the application supports the optional UTF-16 encoding and the octet stream begins with any ASCII character (U+0001 to U+007F) encoded in UTF-16 of either endianness, this encoding shall be the detected character encoding.
An ELF document begins with the line “0 HEAD”, so its first character is “0”. In the big endian form of UTF-16, sometimes called UTF-16BE, this is encoded with the hexadecimal octets 00 30. These two octets will be detected as an ASCII character encoded in UTF-16, and the detected character encoding will be determined to be UTF-16BE.
Otherwise, applications may try to detect other encodings by examining the octet stream in an implementation-defined manner, but this is not recommended.
Otherwise, there is no detected character encoding.
In this case, for the octet stream to be understood, it must use a 7- or 8-bit character encoding that is sufficiently compatible with ASCII that the CHAR line can be read. The only 7- or 8-bit character encodings defined in this standard are ASCII, ANSEL and UTF-8, which encode ASCII characters identically. These will all be understood correctly if there is no detected character encoding.
Some character encodings with minor differences from ASCII can also be understood correctly. An example is the Japanese Shift-JIS character encoding which uses the octets 5C and 7E to encode the yen currency sign (U+00A5) and overline character (U+203E) where ASCII has a backslash (U+005C) and tilde (U+007E). An application does not need to understand these characters in order to scan for a CHAR line.
These cases can be summarised as follows, where xx denotes any octet with a hexadecimal value between 01 and 7F, inclusive:
Initial octets | Detected character encoding |
---|---|
EF BB BF | UTF-8, with byte-order mark |
FF FE | UTF-16, little endian, with byte-order mark |
FE FF | UTF-16, big endian, with byte-order mark |
xx 00 | UTF-16, little endian, without byte-order mark |
00 xx | UTF-16, big endian, without byte-order mark |
Otherwise | None |
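The detection rules in this table can be sketched as a small Python function. This is an illustrative sketch only: the function name `detect_encoding`, the `utf16_supported` flag and the labels "UTF-16LE" and "UTF-16BE" are assumptions made for the example, and the optional implementation-defined detection of other encodings is omitted.

```python
from typing import Optional, Tuple


def detect_encoding(octets: bytes,
                    utf16_supported: bool = True) -> Tuple[Optional[str], bytes]:
    """Return the detected character encoding (or None) and the octets
    with any byte-order mark removed."""
    if octets.startswith(b"\xef\xbb\xbf"):
        return "UTF-8", octets[3:]
    if utf16_supported:
        if octets.startswith(b"\xff\xfe"):
            return "UTF-16LE", octets[2:]
        if octets.startswith(b"\xfe\xff"):
            return "UTF-16BE", octets[2:]
        if len(octets) >= 2:
            first, second = octets[0], octets[1]
            # An ASCII character (01 to 7F) encoded in UTF-16, no byte-order mark.
            if 0x01 <= first <= 0x7F and second == 0x00:
                return "UTF-16LE", octets
            if first == 0x00 and 0x01 <= second <= 0x7F:
                return "UTF-16BE", octets
    return None, octets  # no detected character encoding
```

For instance, an octet stream starting with the octets 00 30 is reported as big endian UTF-16, while a stream starting with the plain ASCII octets for “0 HEAD” has no detected character encoding.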
To determine the specified character encoding, the initial portion of the octet stream shall temporarily be converted to characters using the detected character encoding.
If there is no detected character encoding, the application shall convert each octet to the character whose code point is the value of the octet. An application shall issue an error and stop processing the octet stream if the null octet 00 is encountered. Restricted characters, as defined in §2.3 of [Basic Concepts], must be accepted without error while determining the specified character encoding.
The octet 00 might occur in the representation of a valid character in some character encoding, but almost all character encodings avoid this and it cannot happen in the ASCII, ANSEL or UTF-8 character encodings.
Characters from the initial portion of the octet stream are parsed into line strings as described in §3.4. Each line string is whitespace normalised as described in §2.1 of [Basic Concepts], and all lowercase ASCII characters (U+0061 to U+007A) are converted to the corresponding uppercase characters (U+0041 to U+005A).
Once normalised in this manner, the first line string of the file must be exactly “0 HEAD”; otherwise the application must issue an error and cease parsing the octet stream as ELF. If the application encounters a subsequent normalised line string beginning with a 0 digit (U+0030) followed by a space character (U+0020), the application shall stop scanning for a specified character encoding.
A line string beginning with “0” encodes the start of the next record, and therefore the end of the HEAD record. The specified character encoding is given in a CHAR line in the HEAD record; a CHAR line found elsewhere in the file must not be used to supply the specified character encoding.
If the application encounters a line string beginning with “1 CHAR” followed by a space character (U+0020) while scanning for the specified character encoding, then the remainder of the line string shall be used to determine the specified character encoding.
If the remainder of the line string is exactly “ASCII”, “ANSEL” or “UTF-8”, then the specified character encoding shall be ASCII, ANSEL or UTF-8, respectively.
It is recommended that all ELF documents use UTF-8 and record this using a CHAR line as follows:
0 HEAD
1 CHAR UTF-8
This CHAR line string will be found while scanning for the specified character encoding. The line string begins with “1 CHAR” followed by a space character; the remainder of the line string is “UTF-8” so the specified character encoding is recognised as UTF-8.
Otherwise, if the remainder of the line string is exactly “UNICODE” and the detected character encoding is UTF-16 in either endianness, the specified character encoding shall be UTF-16 in that endianness.
The string “UNICODE” is used to specify the UTF-16 encoding, though without naming the encoding as such, and without specifying which endianness is meant. If the octet stream is a valid ELF document encoded in UTF-16 and the application supports UTF-16, then the detected character encoding will have been determined accordingly.
Otherwise, the application may determine the specified character encoding from the remainder of the line string and the detected character encoding in an implementation-defined way. The application may read one further line string, and if it begins with “2 VERS” followed by a space character (U+0020), the application may also use the remainder of that line string in determining the specified character encoding.
It is fairly common to find “ANSI” on the CHAR line, though this has never been a legal option in any version of GEDCOM. It typically refers to one of several Windows code pages, most frequently CP-1252 which was the Windows default code page for English language installations and for several other Western European languages. However other code pages exist, and an application localised for, say, Hungarian might encode the file using CP-1250. In principle a VERS line could contain information to specify the particular code page used, as in the following ELF fragment, but in practice this is rare.
0 HEAD
1 CHAR ANSI
2 VERS 1250
Otherwise, there is no specified character encoding.
If there is a specified character encoding, it shall be used as the character encoding of the octet stream. Otherwise, if there is a detected character encoding, it shall be used as the character encoding of the octet stream. Otherwise, the character encoding shall default to be UTF-8.
As a CHAR line string is required in all versions of GEDCOM since 5.4, and ELF does not aim to be compatible with versions older than 5.5, GEDCOM’s default is largely moot. ELF changes the default, though it requires ELF writers to include a CHAR serialisation metadata structure. A future version of ELF will likely remove this requirement.
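The scan for a specified character encoding, and the fallback to the detected encoding and then to UTF-8, might be sketched as follows. The function names are hypothetical, the `normalise` helper assumes that whitespace normalisation collapses runs of whitespace to a single space and trims both ends, and the implementation-defined handling of other CHAR payloads (such as “ANSI”) is deliberately left out.

```python
import re
from typing import Iterable, Optional


def normalise(line_string: str) -> str:
    """Assumed whitespace normalisation, plus ASCII-only uppercasing."""
    collapsed = re.sub(r"[ \t\r\n]+", " ", line_string).strip(" ")
    return "".join(c.upper() if "a" <= c <= "z" else c for c in collapsed)


def specified_encoding(line_strings: Iterable[str],
                       detected: Optional[str]) -> Optional[str]:
    """Scan the HEAD record for a CHAR line naming the encoding."""
    lines = iter(line_strings)
    first = next(lines, None)
    if first is None or normalise(first) != "0 HEAD":
        raise ValueError("first line string is not '0 HEAD'")
    for raw in lines:
        line = normalise(raw)
        if line.startswith("0 "):
            break  # start of the next record: stop scanning
        if line.startswith("1 CHAR "):
            name = line[len("1 CHAR "):]
            if name in ("ASCII", "ANSEL", "UTF-8"):
                return name
            if name == "UNICODE" and detected in ("UTF-16LE", "UTF-16BE"):
                return detected
            return None  # implementation-defined cases omitted here
    return None


def character_encoding(specified: Optional[str],
                       detected: Optional[str]) -> str:
    """Specified encoding first, then detected, then the UTF-8 default."""
    return specified or detected or "UTF-8"
```

A CHAR line that appears after the first subsequent level 0 line is never reached by the loop, which reflects the rule that only the HEAD record may supply the specified character encoding.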
If the character encoding is one which the application does not support, the application shall issue an error and stop reading the file.
ELF parsers are required to support reading the ASCII, ANSEL and UTF-8 character encodings. ELF writers are only required to support the UTF-8 character encoding. Support for the UTF-16 character encoding is optional, and applications may support it in either its big or little endian forms, both, or neither. The ASCII, ANSEL and UTF-16 character encodings are all deprecated.
The UTF-8 and UTF-16 character encodings are the Unicode encoding forms defined in §9.2 of [ISO 10646], and the specifics of the big and little endian forms of UTF-16 are defined in §9.3 of [ISO 10646].
4D 69 6C 6F C5 A1 where the last two octets encode the character “š”. Only characters outside Unicode’s Basic Multilingual Plane, that is, characters with a code point of U+10000 or higher, are encoded with four octets. An example is the ancient Chinese character “𠀡” which is encoded using the octets F0 A0 80 A1. Such characters can occasionally be found encoded using six octets (e.g. ED A1 80 ED B0 A1 for “𠀡”). This form, which is called CESU-8 and is not valid UTF-8, typically results from an incorrect serialisation of UTF-16 data as UTF-8. Input containing CESU-8 forms but purporting to be UTF-8 is not a conformant source, however ELF parsers may read it providing they issue a warning to the user. ELF writers must not generate CESU-8 when serialising data as UTF-8.
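One possible way for a parser to tolerate CESU-8 input while warning the user is sketched below in Python. The function name `decode_utf8` is hypothetical, and using the `surrogatepass` error handler for recovery is a choice made for this sketch rather than anything this standard prescribes. Python's strict UTF-8 decoder already rejects the six-octet forms, so a writer built on `str.encode("utf-8")` will not produce them.

```python
import warnings


def decode_utf8(octets: bytes) -> str:
    """Decode UTF-8, falling back to a CESU-8 repair with a warning."""
    try:
        return octets.decode("utf-8")  # strict: rejects encoded surrogates
    except UnicodeDecodeError:
        # Let the three-octet surrogate forms through as lone surrogates ...
        text = octets.decode("utf-8", errors="surrogatepass")
        warnings.warn("input is not valid UTF-8; treating it as CESU-8")
        # ... then round-trip through UTF-16 so that each surrogate pair
        # is recombined into a single supplementary character.
        return text.encode("utf-16", errors="surrogatepass").decode("utf-16")
```

Input that is invalid for some other reason still raises `UnicodeDecodeError`, since the fallback only accepts surrogate forms.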
The character encoding referred to as ASCII in this standard is the US version of ASCII which, for the purpose of this standard, is defined as the subset of UTF-8 which uses only Unicode characters U+0001 to U+007F.
ANSEL refers to the Extended Latin Alphabet Coded Character Set for Bibliographic Use defined in [ANSEL]. If an ELF file is determined to use the ANSEL character encoding it must be converted into a sequence of Unicode characters before it can be processed further. This is discussed in §3.3.1.
If other character encodings are supported, they too must be converted into a sequence of Unicode characters for further processing.
ansel-to-unicode.md.
Before characters from the octet stream can be parsed into lines, they must be assembled into line strings. This is done by appending characters to the line string until a line break is encountered, at which point the character or characters forming the line break are discarded and a new line string is begun.
ELF parsers must be able to handle arbitrarily long line strings, subject to limits of available system resources.
Any leading whitespace shall be removed from the line string, but trailing whitespace must not be removed, except where the line string consists entirely of whitespace. If this results in a line string which is an empty string, the empty line string is discarded.
These operations resolve ambiguities in [GEDCOM 5.5.1], and might therefore be a change from some current implementations’ interpretation of the GEDCOM standard. On the one hand, §1 of [GEDCOM 5.5.1] says that leading whitespace, including extra line terminators, should be allowed and ignored when reading; on the other hand, the relevant grammar production does not permit any such leading whitespace. For maximal compatibility with existing data, a conformant ELF application must accept and ignore leading whitespace and blank lines, but must not generate them.
For trailing whitespace, [GEDCOM 5.5.1] is even less clear. Twice, once in §2 and once in Appendix A, it states that applications sometimes remove trailing whitespace, but without saying whether this behaviour is legal; certainly it implies it is not required. There is little consistency in the behaviour of current applications, so any resolution to this will result in an incompatibility with some applications. In ELF, the trailing whitespace must be preserved.
The Unicode escape mechanism defined in §6.3 provides ELF applications with a way of serialising a value which legitimately ends in whitespace without it being removed by older, non-ELF-aware applications.
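The assembly of characters into line strings can be sketched as follows. The function name `to_line_strings` is hypothetical, and the sketch assumes that a line break is a carriage return, a line feed, or a carriage return followed by a line feed, and that the only other whitespace characters are space and horizontal tab.

```python
import re
from typing import List

# Assumed lexical forms of a line break: CR LF, CR alone, or LF alone.
LINE_BREAK = re.compile(r"\r\n|\r|\n")


def to_line_strings(text: str) -> List[str]:
    """Split decoded text into line strings."""
    line_strings = []
    for raw in LINE_BREAK.split(text):
        line_string = raw.lstrip(" \t")  # leading whitespace only
        if line_string:                  # discard empty line strings
            line_strings.append(line_string)
    return line_strings
```

Note that only `lstrip` is used: trailing whitespace survives unless the whole line string is whitespace, in which case stripping the leading whitespace empties it and it is discarded.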
Line strings are serialised by concatenating them together to form a single string, inserting a line break between each line string and after the last one. All the inserted line breaks must have identical lexical forms.
Finally, the resulting string is encoded into an octet stream using the character encoding that was documented in the serialisation metadata tagged structure with tag “CHAR” (see §8.1). ELF writers are only required to support the UTF-8 character encoding, and this should be the default in applications supporting additional character encodings.
If the character encoding is one which allows a byte-order mark (U+FEFF) to be encoded, an ELF writer may prepend one to the octet stream. This is recommended when serialising to UTF-16, but is not recommended when serialising to UTF-8.
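A minimal sketch of this final serialisation step, under the assumption that a line break is one of CR, LF or CR LF, might look like this in Python. The function name `serialise` and its parameters are hypothetical and chosen only for the example.

```python
from typing import Iterable


def serialise(line_strings: Iterable[str], line_break: str = "\n",
              encoding: str = "utf-8", byte_order_mark: bool = False) -> bytes:
    """Join line strings with one consistent line break and encode them."""
    if line_break not in ("\n", "\r", "\r\n"):
        raise ValueError("unsupported line break")
    # The same line break follows every line string, including the last.
    text = "".join(line_string + line_break for line_string in line_strings)
    if byte_order_mark:
        text = "\ufeff" + text  # encoded along with the rest of the text
    return text.encode(encoding)
```

When writing UTF-16 with this sketch, an endianness-specific codec name such as `"utf-16-be"` should be passed, because Python's plain `"utf-16"` codec prepends its own byte-order mark.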
For a line string to be parsed into a line, it must match the following Line production:
Line ::= Number S (XRefLabel S)? Tag (PayloadSep Payload)?
PayloadSep ::= #x20 | #x9
The Line production does not allow leading whitespace because this has already been removed in the process of creating line strings. The S production is defined in §2.1 of [Basic Concepts] and matches any non-empty sequence of whitespace characters, though because carriage returns and line feeds are always treated as line breaks which delimit line strings, in practice the S production can only match space or horizontal tab characters. Allowing tabs or multiple space characters is a departure from [GEDCOM 5.5.1], but one that is commonly implemented in current applications.
Only a single character of whitespace is permitted before the payload in the PayloadSep production. This clarifies an ambiguity in [GEDCOM 5.5.1] where Appendix A warns that some applications look for the first non-space character as the start of the payload. There is no explicit statement that such applications are non-compliant, and this has left some doubt as to whether or not this behaviour is permitted. In ELF this is explicitly not allowed for payloads which are strings.
Whitespace is required between each of the four components of the line. This is arguably a change from [GEDCOM 5.5.1] where the delim grammar production says that the delimiter is an optional space character. Almost certainly that is a typo in the grammar that has persisted through several versions of GEDCOM, and GEDCOM does not intend the space to be optional. Documents written using very early versions of GEDCOM (long before its current grammar productions were written) did frequently merge the level, cross-reference identifier and tag together, as in “0@I1@INDI”, but this is not permitted in ELF.
If this whitespace were optional, “0@I1@INDI” would be supported, and this could help make ELF Serialisation backwards compatible with GEDCOM 1.0. However the TSC know of no uses of this in files identifying as GEDCOM 5.x files, and it is not generally supported in applications. Almost certainly it is an error arising from confusion over the two different uses of […] in GEDCOM grammar productions. Files created using earlier versions of GEDCOM are only very rarely encountered and their data model is incompatible with [ELF Data Model]. There seems to be little benefit to supporting earlier versions of GEDCOM in the serialisation layer but not in the data model.
0 @I1@ INDI
1 NAME Cleopatra
1 FAMC @F2@
This ELF fragment contains three lines. The first line has a level of 0, a cross-reference identifier of @I1@, and a tag of INDI; it has no payload. Neither the second nor the third line has a cross-reference identifier, and both have a payload: on the second line the payload is the string “Cleopatra”, while the payload of the third line is a pointer, @F2@.
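The Line production can be approximated with a regular expression, as in the Python sketch below. The names `Line`, `LINE_RE` and `parse_line` are hypothetical. The sketch uses the Number, XRefLabel and Tag productions given later in this section, restricts S to spaces and tabs, and, because the Payload production is not shown here, simply assumes the payload is the whole remainder of the line string.

```python
import re
from typing import NamedTuple, Optional

# Character class body for IDChar: the ASCII part, then the non-ASCII ranges.
ID_CHAR = (r"A-Za-z0-9?$&'*+,;=._~\-"
           "\u00A0-\uD7FF\uF900-\uFFEF\U00010000-\U000EFFFF")

LINE_RE = re.compile(
    r"(?P<level>0|[1-9][0-9]*)[ \t]+"             # Number S
    r"(?:@(?P<xref>[" + ID_CHAR + r"]+)@[ \t]+)?"  # (XRefLabel S)?
    r"(?P<tag>[0-9A-Za-z_]+)"                      # Tag
    r"(?:[ \t](?P<payload>.*))?"                   # (PayloadSep Payload)?
)


class Line(NamedTuple):
    level: int
    xref_id: Optional[str]
    tag: str
    payload: Optional[str]


def parse_line(line_string: str) -> Line:
    """Parse one line string, raising ValueError for a malformed line."""
    match = LINE_RE.fullmatch(line_string)
    if match is None:
        raise ValueError("malformed line: %r" % line_string)
    return Line(int(match["level"]), match["xref"],
                match["tag"], match["payload"])
```

Because only one separator character is consumed before the payload, any further whitespace becomes part of the payload, and a merged form such as “0@I1@INDI” is rejected.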
Malformed lines are lines or line strings which contain certain particular types of syntactic error. Input containing a malformed line is a non-conformant source. If an ELF parser encounters a malformed line, it shall terminate processing the input file.
Any line string which does not match the Line production is a malformed line.
Line production, because they have already been removed from the input stream.
The Number production encodes the level of the line, which is a non-negative decimal integer that records how many levels of substructures deep the current structure is nested.
Number ::= "0" | [1-9] [0-9]*
The previous level of a line is defined as the level of the closest preceding line. The first line in the input stream has no previous level.
0 INDI
1 NOTE The 16th President of the United States.
2 CONT Assassinated by John Wilkes Booth.
0 TRLR
In this example, the previous level of the TRLR line is 2, which is the level of the CONT line.
Any line that has a level more than one greater than its previous level is a malformed line. This does not apply to the first line in the input stream which is never a malformed line.
The following ELF fragment has a missing line.
0 @I1@ INDI
2 PLAC Москва
3 ROMN Moscow
1 NAME Иван Васильевич
0 TRLR
The second line of this example is a malformed line because it has a level of 2 and a previous level of 0.
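The level rule can be checked with a single pass over the levels, as in this Python sketch. The function name `malformed_level_indexes` is hypothetical, and the input is assumed to be the list of levels already extracted from each line, in order.

```python
from typing import List, Optional


def malformed_level_indexes(levels: List[int]) -> List[int]:
    """Return the zero-based indexes of lines whose level is more than
    one greater than their previous level."""
    malformed = []
    previous: Optional[int] = None  # the first line has no previous level
    for index, level in enumerate(levels):
        if previous is not None and level > previous + 1:
            malformed.append(index)
        previous = level
    return malformed


# Levels of the fragment above: INDI, PLAC, ROMN, NAME, TRLR.
print(malformed_level_indexes([0, 2, 3, 1, 0]))  # prints [1]
```

Only the PLAC line is reported: the ROMN line has a level of 3 and a previous level of 2, so it is not itself a malformed line under this rule.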
The first line string has already been checked to be “0 HEAD” while determining the specified character encoding per §3.2, which means the first line must always have a level of 0.
The XRefLabel production encodes the cross-reference identifier of the line, which is used when referencing one structure from another using a pointer, and may be omitted when there is no need to refer to the structure. It is encoded with “at” signs (@; U+0040) before and after it, which are not themselves part of the cross-reference identifier.
XRefLabel ::= "@" XRefID "@"
XRefID ::= IDChar+
IDChar ::= [A-Za-z0-9] | [?$&'*+,;=._~-]
| [#xA0-#xD7FF] | [#xF900-#xFFEF] | [#x10000-#xEFFFF]
The following is a well-formed line with a cross-reference identifier of “I1”:
0 @I1@ INDI
[GEDCOM 5.5.1] allows cross-reference identifiers to contain any character other than a space (U+0020), the “at” sign (U+0040), and the C0, C1 and DEL control characters (U+0001 to U+001F, U+0080 to U+009F, and U+007F), so long as it starts with an alphanumeric ASCII character. ELF removes the requirement that the first character of a cross-reference identifier be an alphanumeric ASCII character, and explicitly allows non-ASCII characters in cross-reference identifiers, though it prohibits the following characters which were allowed in GEDCOM:
Characters | Reason for exclusion |
---|---|
! : | Reserved in ELF and GEDCOM pointers |
# % [ ] < > " { } \| \ ^ \` | Require escaping in IRI fragment identifiers |
( ) / | Reserved for future FHISO use |
Private use characters | Ambiguous without agreed meaning |
[#xFFF0-#xFFFE] | Require escaping in IRI fragment identifiers |
FHISO anticipates using cross-reference identifiers in IRI fragment identifiers in a future ELF standard, and has therefore prohibited all characters which [RFC 3987] says have to be escaped in this context.
For maximum compatibility, ELF writers should prefer cross-reference identifiers which only use ASCII characters, and should make the first character of a cross-reference identifier a letter (U+0041 to U+005A or U+0061 to U+007A), decimal digit (U+0030 to U+0039) or underscore (U+005F).
otherchar production is intended to include all non-ASCII characters and not just U+0080 to U+00FE.
The Tag production encodes the tag of the line, which is a required string that denotes the meaning of the data encoded on the line.
Tag ::= [0-9a-zA-Z_]+
The ELF suite of standards defines a selection of tags for representing genealogical data.
Third parties may define additional tags for use in ELF documents in two ways. The first way, which is deprecated, is to use a legacy extension tag. These are tags beginning with an underscore (_, U+005F). No legacy extension tags are defined in the ELF standards, and third parties can use them arbitrarily.
The _UID tag is a legacy extension tag which has been implemented in a number of current applications and typically contains a 128-bit UUID as defined in [RFC 4122].
1 _UID 40ea7ad8-a5ba-4a7a-bb89-615cc2bf6639
The _UID legacy extension tag described in the previous example has also been used in some applications to contain a 144-bit identifier, which was a UUID followed by a 16-bit checksum. Applications expecting to find a standard 128-bit UUID will likely fail to parse this 144-bit form.
The second and preferred means of adding third-party tags is to define them in an ELF schema and reference that schema using a schema reference.
The HEAD, TRLR, CONC, CONT, PLANG and DTYPE tags are reserved in all contexts for recording header records, trailer records, continuation lines, payload languages and payload datatypes, and must not be used in any other way.
A tag should be no more than 15 characters in length.
_FATHER_OF_BRIDE is a valid tag, but should not be used because it is 16 characters long.
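The tag rules above can be gathered into a small checker. This is a sketch only; the names `TAG`, `RESERVED_TAGS` and `describe_tag` are hypothetical and the shape of the returned dictionary is an assumption made for illustration.

```python
import re

TAG = re.compile(r"[0-9a-zA-Z_]+")
RESERVED_TAGS = {"HEAD", "TRLR", "CONC", "CONT", "PLANG", "DTYPE"}


def describe_tag(tag):
    """Return None if tag does not match the Tag production,
    otherwise a summary of how the rules above apply to it."""
    if not TAG.fullmatch(tag):
        return None
    return {
        "legacy_extension": tag.startswith("_"),
        "reserved": tag in RESERVED_TAGS,
        "recommended_length": len(tag) <= 15,
    }


print(describe_tag("_FATHER_OF_BRIDE"))
# {'legacy_extension': True, 'reserved': False, 'recommended_length': False}
```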
The payload of a line is an optional value associated with the line, which is encoded by the Payload production. If present, it shall be either a string or a pointer, which are encoded by the PayloadString and Pointer productions, respectively. The String production is given in §2 of [Basic Concepts] as a sequence of zero or more characters.
Payload ::= S? Pointer S? | PayloadString
PayloadString ::= String - ( S? Pointer S? )
Applications must treat a line with an omitted payload identically to a line with a payload consisting of an empty string.
Line production matches via the S String alternative, with an empty string; however if the line string ends with a tag with no subsequent whitespace, then the Line production matches without the final optional Payload component.
The PayloadString production explicitly excludes any string which matches the Pointer production (with or without leading or trailing whitespace), even though such strings also match the String production. This means ELF parsers must treat the payload as a pointer if it matches the Pointer production, and only as a string if it does not.
An earlier draft of this standard used the following PayloadString production.
PayloadString ::= PayloadItem*
PayloadItem ::= PayloadChar | EscapedAt | EscapeSeq
PayloadChar ::= [^#x40#xA#xD]
EscapedAt ::= "@@"
EscapeSeq ::= "@#" [A-Z] PayloadChar* "@"
This ensures that only strings with correctly escaped “at” signs (U+0040) are allowed in a payload. This draft does not do this because it would require all “at” signs to be correctly escaped. In practice, unescaped “at” signs are fairly commonly found in GEDCOM files, particularly in the payload of EMAIL lines. It is fairly easy to specify ELF so that these can be accommodated and this draft does so. Many current products appear to allow unescaped “at” signs in the manner proposed here.
A pointer is a payload which represents a link to another structure. It is encoded using the following Pointer production.
Pointer ::= "@" [^#x23#x40#xA#xD] [^#x40#xA#xD]* "@"
@, U+0040), which is used to mark the end of the pointer; and the number sign (#, U+0023), which is only prohibited as the first character in order to distinguish pointers from escape sequences.
Pointer production as a pointer, in practice only those matching the XRefLabel production in §4.1.2 are valid as pointers in ELF 1.0. Any other pointers will be discarded as invalid in §XXX, but are permitted in the grammar for future use.
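The decision between a pointer payload and a string payload can be sketched as follows. This is an illustration under stated assumptions: the S production is taken to mean runs of spaces or tabs, and the names `POINTER` and `classify_payload` are hypothetical.

```python
import re

# Pointer ::= "@" [^#x23#x40#xA#xD] [^#x40#xA#xD]* "@", with optional
# surrounding whitespace as allowed by the Payload production.
POINTER = re.compile(r"[ \t]*@([^#@\n\r][^@\n\r]*)@[ \t]*")


def classify_payload(raw):
    """Return ("pointer", target) or ("string", text) for a raw payload.
    An omitted payload (None) is treated as an empty string."""
    if raw is None:
        return ("string", "")
    match = POINTER.fullmatch(raw)
    if match:
        return ("pointer", match.group(1))
    return ("string", raw)


print(classify_payload("@F9@"))              # ('pointer', 'F9')
print(classify_payload("name@example.com"))  # ('string', 'name@example.com')
```

An escape sequence alone, such as a payload beginning with “@#”, is classified as a string because the number sign is excluded as the first character of a pointer.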
[GEDCOM 5.5.1] describes a pointer syntax similar to the following production:
GEDCOMPointer ::= "@" (IDChar+ ":")? XRefID ("!" IDChar+)? "@"
The optional identifier before the colon (:, U+003A) is used to reference a remote file, and the optional identifier following the exclamation mark (!, U+0021) is used to reference a structure within a record. However, GEDCOM provides no means of using these, so they are effectively reserved for a future version of GEDCOM. They remain reserved for these purposes in ELF, and a future version of ELF is likely to provide a means of referencing structures outside the current document.
Once line strings have been parsed into lines, the sequence of lines is converted into a sequence of records.
This process starts by parsing the first line of the input as the first line of a tagged structure using the procedure given in §4.2.1. If that record has substructures then additional lines will be read while parsing it. This structure is the first record in the dataset, and shall be the header record.
Once the header record has been read, it shall be parsed according to §5.2 to extract the serialisation metadata, which affects the subsequent parsing of the file.
If further lines remain after the header record has been fully parsed, then the first of the remaining lines is parsed as the first line of the next record in the dataset, again using the procedure given in §4.2.1. This process is repeated until no further lines remain, at which point the dataset has been fully read.
“0 HEAD”, ensures that the first line of every record necessarily has a level of 0.
If the last record has a tag of TRLR, and no cross-reference identifier, payload or substructures, it is discarded. Such a record is called a trailer record. If the last record is not a trailer record, it is a malformed structure as defined in §4.2.3.
Once each record has been assembled, an ELF parser shall make a second pass over the record processing it as described in §4.2.2. This does not apply to the discarded trailer record.
The conversion of lines into structures is defined recursively. To read a structure, the parser starts by reading its first line, and creates a tagged structure whose components are as follows:
“und” language tag if the payload of the first line is a string rather than a pointer; and
The level of the first line of the structure is referred to in this section as the current level.
The parser then repeatedly inspects the next line to determine whether it represents the start of a substructure of the structure being read. If the next line has a level less than or equal to the current level, there are no further substructures and the application has finished reading the structure.
1 DEAT Y
0 TRLR
In the above ELF fragment, the parser reads the first line and creates a structure with a DEAT tag and a payload of “Y”. It then inspects the following line. Because that line has a level of 0, which is less than the level of the first line of the DEAT structure, the DEAT structure has no substructures.
Otherwise, the application shall recursively parse the next line as the first line of a new structure and append it to the list of substructures being read. Parsing continues by inspecting the following line to see if it is the start of another substructure, as described above.
0 @I1@ INDI
1 NAME Elizabeth
1 BIRT
2 DATE 21 APR 1926
0 TRLR
In this fragment, an application reads the first line and creates an INDI structure. The next line has a level one greater than the level of the INDI line, so is parsed as the start of a substructure. The parser creates a NAME structure, and as the level of the following line is no greater than the level of the NAME line, the NAME structure has no substructures. The NAME structure is appended as a substructure of the INDI structure.
The parser then repeats the process, looking for further substructures of the INDI structure. The BIRT line also has a level one greater than the level of the INDI line, so is also parsed as the start of a substructure, but this time it has a substructure of its own, namely the DATE structure. The TRLR line has a level of 0 which tells the parser there are no further substructures of the INDI structure.
The result is an INDI structure with two substructures with tags NAME and BIRT, respectively, the latter of which has a substructure of its own with tag DATE.
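The recursive procedure walked through above can be sketched in Python. This is a minimal illustration, not a complete parser: it assumes each line has already been parsed into a `(level, xref_id, tag, payload)` tuple, that malformed lines have already been rejected, and it ignores language tags, trailer handling and the second pass. The names `Structure`, `read_structure` and `read_records` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed representation of a parsed line: (level, xref_id, tag, payload).
Line = Tuple[int, Optional[str], str, str]


@dataclass
class Structure:
    tag: str
    payload: str = ""
    xref_id: Optional[str] = None
    substructures: List["Structure"] = field(default_factory=list)


def read_structure(lines: List[Line], pos: int):
    """Read one structure starting at lines[pos].
    Returns the structure and the index of the first unread line."""
    level, xref_id, tag, payload = lines[pos]  # level is the current level
    structure = Structure(tag, payload, xref_id)
    pos += 1
    # A next line with a level greater than the current level starts a
    # substructure; otherwise this structure is complete.
    while pos < len(lines) and lines[pos][0] > level:
        child, pos = read_structure(lines, pos)
        structure.substructures.append(child)
    return structure, pos


def read_records(lines: List[Line]) -> List[Structure]:
    """Read records one after another until no lines remain."""
    records = []
    pos = 0
    while pos < len(lines):
        record, pos = read_structure(lines, pos)
        records.append(record)
    return records
```

Applied to the five lines of the fragment above, `read_records` yields an INDI record whose substructures are NAME and BIRT, followed by a TRLR record that a full parser would then discard as the trailer record.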
Once each record has been assembled, an ELF parser shall make a second pass over the record, processing it and its substructures recursively. Each step of the recursion proceeds as follows.
First, if the structure has a tag of CONC, CONT or TRLR, or if the tag is HEAD and the structure is not the first record of the input, it is a malformed structure.
PLANG and DTYPE may need adding to this list.
The CONC and CONT tags must only be used in continuation lines, as described in §6.4. They are removed when their parent structure is being processed in this second pass, and therefore no longer exist when processing recurses into the substructures. The TRLR tag must only be used for the trailer record, which is removed before this second pass. The HEAD tag must only be used for the header record. If any of these tags remain at this stage, it is because they have been misused.
Next, if the ELF parser is schema-aware, the tagged structure shall be converted into a typed structure as described in [ELF Schemas].
A typed structure is defined in [ELF Schemas] as consisting of:
This differs from a tagged structure in two ways: first, the tag is replaced with a structure type, which is an IRI; and secondly, string payloads are literals rather than language-typed strings. A literal is a tagged string which has both a language tag and a datatype as tags.
In later stages of parsing, the ELF parser either acts on a tagged structure or a typed structure, depending on whether this conversion has taken place. The word structure is used to refer to either.
Next, if the payload of the structure is a string payload, it is unescaped as described in §6.5.
Finally, each substructure of the structure is processed recursively, in order, as described in this section.
This standard defines two classes of error that can arise when processing a structure.
A malformed structure is a structure with a sufficiently serious error that an ELF parser must detect the error and must terminate processing the input file upon encountering one.
A non-conformant structure is a structure with a less serious error. Input containing either a malformed structure or a non-conformant structure is a non-conformant source.
Each xref structure is encoded as a sequence of one or more lines.
These are of three kinds, in order:
The level of each line is a non-negative integer. The level of a first line is 0 if the xref structure is a record or one of the serialisation metadata tagged structures with tag “HEAD” or “TRLR”; otherwise it is one greater than the level of the first line of its superstructure. The level of an additional line is one greater than the level of its xref structure’s first line.
Each first line has the same xref_id (if any) and tag as its corresponding xref line. Each additional line has no xref_id and either “CONT” or “CONC” as its tag.
Because every additional line has “CONC” or “CONT” as its tag, it is unambiguous which lines are additional lines and which first line they correspond to.
The payload of the xref structure is the concatenation of the payloads of the first line and all additional lines, with a line break inserted before the payload of each additional line with tag “CONT”. Because the payload of a line must not contain a line-break, there must be exactly one “CONT”-tagged additional line per line-break in the xref structure’s payload. The number of “CONC”-tagged additional lines may be picked arbitrarily, subject to the following:
“CONC”-tagged line should not have an empty payload.
“CONC”-tagged line must NOT end with whitespace.
“CONC”-tagged line’s payload should not begin with whitespace.
[GEDCOM 5.5.1] is inconsistent in its discussion of leading and trailing whitespace.
CONC split; they (nonsensically) require the same for CONTs as well.
optional_line_value in Chapter 1 allows both leading and trailing space, with no permission to remove it.
CONC {CONCATENATION} in Appendix A says an implementation may “look for the first non-space starting after the tag to determine the beginning of the value” and hence leading spaces must not appear.
CONT {CONTINUED} in Appendix A says an implementation must keep leading spaces in a CONT as an exception to the usual rules.
NOTE_STRUCTURE in Chapter 2 says that “most operating systems will strip off the trailing space and the space is lost in the reconstitution of the note.”
The RECOMMENDATIONS above are compatible with the most restrictive of these, while the REQUIREMENTS with the most limiting of them.
Suppose an xref structure’s tag is “NOTE”; its payload is “This is a test\nwith one line break”; and its superstructure’s superstructure is a record. This xref structure requires at least two lines (because it contains one line break) and may use more. It could be serialised in many ways, such as
2 NOTE This is a test
3 CONT with one line break
or
2 NOTE This i
3 CONC s a test
3 CONT with on
3 CONC e line break
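Both serialisations decode to the same payload. A minimal Python sketch of the concatenation rule follows, assuming the additional lines are supplied as `(tag, payload)` pairs; the function name `join_payload` is hypothetical.

```python
def join_payload(first_payload, additional_lines):
    """Rebuild a payload from the first line's payload and its
    additional lines, given as (tag, payload) pairs."""
    result = first_payload
    for tag, payload in additional_lines:
        if tag == "CONT":
            # A CONT line contributes a line break before its payload.
            result += "\n" + payload
        elif tag == "CONC":
            # A CONC line is appended with nothing inserted.
            result += payload
        else:
            raise ValueError("not an additional line: " + tag)
    return result


second = join_payload("This i", [("CONC", "s a test"),
                                 ("CONT", "with on"),
                                 ("CONC", "e line break")])
print(second == "This is a test\nwith one line break")  # True
```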
@). However, during parsing, this constraint SHALL NOT be enforced in any way.
@”, but they are relatively common in GEDCOM files. The above policy is intended to resolve common invalid files in an intuitive way.
Given the following non-conformant data
1 EMAIL name@example.com
2 DATE @#DGREG
3 CONC ORIAN@ 2 JAN 2019
a conformant application will concatenate these lines normally during parsing
1 EMAIL name@example.com
2 DATE @#DGREGORIAN@ 2 JAN 2019
creating a valid date escape in the DATE-tagged extended line. The unmatched @ in the EMAIL-tagged line is left unchanged during parsing.
Upon re-serialisation, the unmatched @ in the “EMAIL”-tagged line will be doubled when converting to an xref structure, but the date escape will not be modified
1 EMAIL name@@example.com
2 DATE @#DGREGORIAN@ 2 JAN 2019
If the serialisation decides to split either extended line with CONCs, it must not do so in a way that splits up the pairs of “@”s.
Each line shall be converted to a line string by concatenating together the level, cross-reference identifier, tag and payload as described by the Line production given in §4.1. The application must serialise all line strings with a single space character (U+0020) for each S or PayloadSep production in the Line production, and must not put additional whitespace before or after payloads which are pointers.
Although ELF parsers are required to be able to read the following line string, ELF writers must not produce this line string.
1 FAMC  @F9@
There are two space characters after the FAMC tag in this example. When parsing, the first space is matched by the PayloadSep production while the second is matched by the optional S production that comes before the pointer in the Payload production. ELF writers must not insert additional whitespace before the pointer, and therefore must not produce this line string.
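A writer that follows these rules might assemble line strings along the following lines. This is only a sketch: the function name `serialise_line` and its parameters are hypothetical, and it assumes any escaping of “at” signs in string payloads has already been done.

```python
def serialise_line(level, tag, payload="", xref_id=None, is_pointer=False):
    """Build a line string using exactly one space between components."""
    parts = [str(level)]
    if xref_id is not None:
        parts.append("@" + xref_id + "@")
    parts.append(tag)
    if is_pointer:
        # No additional whitespace before or after a pointer payload.
        parts.append("@" + payload + "@")
    elif payload:
        # An empty string payload is simply omitted.
        parts.append(payload)
    return " ".join(parts)


print(serialise_line(1, "FAMC", "F9", is_pointer=True))  # 1 FAMC @F9@
print(serialise_line(0, "INDI", xref_id="I1"))           # 0 @I1@ INDI
```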
The header record is the first record in an ELF document. It shall have a HEAD tag, no payload and no cross-reference identifier. The substructures of the header record are called metadata structures, and contain information about the dataset as a whole.
Certain metadata structures, which are referred to as serialisation metadata structures, are processed by the ELF parser during parsing and then removed from the dataset. Each serialisation metadata structure encodes one piece of serialisation metadata, as determined by the tag of the serialisation metadata structure. The serialisation metadata affects how the ELF parser processes the file.
This standard defines five types of serialisation metadata, as given in the following table.
Tag | Serialisation metadata |
---|---|
CHAR | specified character encoding, as defined in §3.2 |
ELF | ELF serialisation version, as defined in §5.1.1 |
GEDC | legacy GEDCOM version, as defined in §5.1.2 |
PLANG | default payload language |
SCHMA | schema reference |
PLANG tag. This standard does not reserve any tags for future use as serialisation metadata structures. If a future standard adds new ones, they will only be interpreted conditionally based on the ELF serialisation version.
The following fragment does not contain a Unicode escape in the ELF serialisation metadata structure, and so does not represent the version 1.0. It is simply interpreted as the string “1@#U2E@0”. This is not a valid version number, as defined in §5.1, and therefore the ELF structure is a non-conformant structure. An ELF parser must either terminate processing on encountering it, or issue a warning.
0 HEAD
1 ELF 1@#U2E@0
The following fragment contains a NOTE metadata structure whose payload, after unescaping, is the string “Ceci est une note longue à propos de ce document”.
0 HEAD
1 NOTE Ceci est une note longue @#UE0@ pro
2 CONC pos de ce document
2 PLANG fr
0 TRLR
This is allowed because the NOTE tag does not denote a serialisation metadata structure. The PLANG substructure does not denote a serialisation metadata structure because it is not a direct substructure of the header record.
The payload of the ELF serialisation metadata structure, and the payload of the VERS substructure of the GEDC serialisation metadata structure both contain a version number, which is a string used to record the version of a standard that matches the following Version production:
Version ::= Integer "." Integer ( "." Integer )?
Integer ::= [0-9]+
The three components represented by the Integer production are decimal integers, and may include leading zeros which are ignored. These components are called the major version, minor version and revision number, respectively. If the revision number is omitted, a value of 0 is assumed.
The following three version numbers are exactly equivalent:
1 ELF 1.0
1 ELF 1.0.0
1 ELF 1.000
The ELF serialisation version is a version number located in the payload of the ELF serialisation metadata structure, and indicates the version of the ELF Serialisation standard with which the document complies.
The version number of this version of the standard is 1.0.0. An ELF writer producing output according to this standard must include this ELF serialisation version in the output if the generated file contains any Unicode escapes, schema references, payload languages or payload datatypes.
If an ELF parser is reading a document with an ELF serialisation version which differs from the version number of this standard only by the revision number, the ELF parser must parse the input according to this standard.
If an ELF parser encounters an ELF serialisation version which has a different minor version to this standard, but the same major version, it should parse the input according to this standard, but should issue a warning to the user that the document is in an unknown version of ELF.
If an ELF parser encounters an ELF serialisation version with a different major version, the document is a non-conformant source.
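The version rules can be sketched as follows. This is an illustration only: the names `parse_version` and `check_elf_version` and the three result strings are hypothetical, and a real parser would act on the result as described above rather than return a label.

```python
import re

# Version ::= Integer "." Integer ( "." Integer )?
VERSION = re.compile(r"([0-9]+)\.([0-9]+)(?:\.([0-9]+))?")
THIS_VERSION = (1, 0, 0)


def parse_version(text):
    """Return (major, minor, revision), or None if text is not a version number."""
    match = VERSION.fullmatch(text)
    if match is None:
        return None
    major, minor, revision = match.groups()
    # int() ignores leading zeros; an omitted revision number is 0.
    return (int(major), int(minor), int(revision or "0"))


def check_elf_version(text):
    """Classify an ELF serialisation version relative to this standard."""
    version = parse_version(text)
    if version is None or version[0] != THIS_VERSION[0]:
        return "non-conformant"
    if version[1] != THIS_VERSION[1]:
        return "warn"  # parse according to this standard, but warn the user
    return "ok"        # differs at most in the revision number


print(parse_version("1.0") == parse_version("1.0.0") == parse_version("1.000"))  # True
```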
The legacy GEDCOM version is a version number located in the payload of the VERS substructure of the GEDC serialisation metadata structure, and indicates the version of GEDCOM which the document is compatible with.
This standard, when used together with the [ELF Data Model], is compatible with both GEDCOM 5.5 and GEDCOM 5.5.1. An ELF writer producing output according to this standard must include a legacy GEDCOM version of either 5.5 or 5.5.1 in the output if it omitted the ELF serialisation version or if it included no schema references in the output, and should do so otherwise if the document conforms to the [ELF Data Model].
If an ELF parser encounters a legacy GEDCOM version other than 5.5 or 5.5.1, the document is a non-conformant source.
The following ELF fragment encodes a legacy GEDCOM version of 5.3, which was used by an abandoned draft of GEDCOM back in 1993.
0 HEAD
1 GEDC
2 VERS 5.3
An ELF parser may accept this and continue parsing the data in an implementation-defined manner, which might involve handling some constructs contrary to the ELF standards. If an ELF parser does continue parsing this non-conformant source, it must issue a warning to the user.
Once a header record has been assembled as described in §4.2.1, the ELF parser shall iterate over its substructures looking for structures with a tag of CHAR, ELF, GEDC, PLANG or SCHMA. These substructures are identified as serialisation metadata structures and each is processed as specified in this section.
Any serialisation metadata structure, or any structure nested within a serialisation metadata structure regardless of the depth of the nesting, is a non-conformant structure if it has a cross-reference identifier, or if it has a tag of HEAD, TRLR, CONC or CONT, or if it has a payload which is a pointer.
The SCHMA structure in the following document is a non-conformant structure:
0 HEAD
1 SCHMA https://example.com/this/is/a/very/long/IRI
2 CONC /which/has/been/continued/on/to/two/lines
0 TRLR
If a header record contains two or more serialisation metadata structures with the same tag, and that tag is not SCHMA, the second and subsequent serialisation metadata structures are non-conformant structures.
The second PLANG structure in this ELF fragment is a non-conformant structure as a document must not have multiple default payload languages. An ELF parser must either terminate processing the file or issue a warning.
0 HEAD
1 PLANG nds
1 PLANG de
If the serialisation metadata structure has a tag of CHAR, it is deleted from the header record with no further processing.
If the serialisation metadata structure has a tag of ELF, and its payload is not a valid version number, it is a non-conformant structure. Otherwise, the version number in its payload is interpreted as the ELF serialisation version as described in §5.1.1, and the structure is deleted from the header record.
The following fragment of a header record encodes an ELF serialisation version of 1.0:
0 HEAD
1 ELF 1.0
If the serialisation metadata structure has a tag of GEDC, it is used to determine the legacy GEDCOM version as follows. The serialisation metadata structure is a non-conformant structure if it has a payload, or if it does not have exactly one substructure with a VERS tag and exactly one substructure with a FORM tag, or if the payload of the VERS substructure is not a valid version number, or if the payload of the FORM substructure is not the string “LINEAGE-LINKED”. Otherwise, the version number in the payload of the VERS substructure is interpreted as the legacy GEDCOM version as described in §5.1.2, and the whole serialisation metadata structure is deleted from the header record.
The following fragment of a header record encodes a legacy GEDCOM version of 5.5:
0 HEAD
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
The GEDC serialisation metadata structure in the following header record is a non-conformant structure for two reasons: first, the payload of its VERS substructure is not a valid version number because of the trailing “EL”; and secondly, there is no FORM substructure.
0 HEAD
1 GEDC
2 VERS 5.5.1 EL
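The conditions on the GEDC serialisation metadata structure can be checked with a short function. This is a sketch under assumptions: the substructures are supplied as `(tag, payload)` pairs, and the name `gedc_problems` and its message strings are hypothetical.

```python
import re

VERSION = re.compile(r"[0-9]+\.[0-9]+(?:\.[0-9]+)?")


def gedc_problems(payload, substructures):
    """Return the reasons why a GEDC structure is non-conformant.
    An empty list means the structure passes all the checks above."""
    problems = []
    if payload:
        problems.append("GEDC has a payload")
    vers = [p for tag, p in substructures if tag == "VERS"]
    form = [p for tag, p in substructures if tag == "FORM"]
    if len(vers) != 1:
        problems.append("expected exactly one VERS substructure")
    elif not VERSION.fullmatch(vers[0]):
        problems.append("VERS payload is not a valid version number")
    if len(form) != 1:
        problems.append("expected exactly one FORM substructure")
    elif form[0] != "LINEAGE-LINKED":
        problems.append("FORM payload is not LINEAGE-LINKED")
    return problems


print(gedc_problems("", [("VERS", "5.5"), ("FORM", "LINEAGE-LINKED")]))  # []
print(len(gedc_problems("", [("VERS", "5.5.1 EL")])))                    # 2
```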
Once structures have been assembled from the lines forming them, and converted to typed structures if the application is schema-aware, any string payloads need to be unescaped.
ELF uses the “at” sign (@; U+0040) in the representation of pointers, as well as in escape sequences which are used to encode special processing instructions in a string payload. Other uses of the “at” sign in payloads which are strings should be escaped, and must be when not escaping it would result in an ambiguity.
EMAIL structure which almost invariably has a payload containing one “at” sign, and is often not properly escaped in real-world data. Payloads with a single “at” sign are never legal in GEDCOM. ELF requires such payloads to be interpreted as if the “at” sign had been escaped.
ELF provides two escape mechanisms which can escape an “at” sign in a payload. The recommended mechanism is to use an escaped at sign, defined in §6.1. The alternative is to use a Unicode escape, which is a more general escape mechanism defined in §6.3 that allows arbitrary Unicode characters to be encoded. Unicode escapes are an example of an escape sequence, which is a general facility for embedding special processing instructions in a string payload. Escape sequences are defined in §6.2.
An escaped at sign is a string matching the EscapedAt production below, and is used to represent a single “at” sign in a string payload.
EscapedAt ::= "@@"
An escaped at sign simply doubles up the “at” sign. Thus, the email address name@example.com should be encoded as follows:
1 EMAIL name@@example.com
An escape sequence is a string that can be used in a string payload to denote some form of special processing instruction.
An escape sequence shall match the following EscapeSeq production.
EscapeSeq ::= "@#" EscapeType EscapeValue "@"
EscapeType ::= [A-Z]
EscapeValue ::= [^#x40#xA#xD]*
The following line contains an escape sequence:
2 DATE @#DFRENCH R@ 6 COMP 11
Escape sequences containing internal spaces are explicitly allowed by this standard and this example uses the D escape type to write a date using the French Republican calendar defined in §4.3 of [ELF Dates].
This production differs in two ways from the equivalent production in [GEDCOM 5.5.1]. First, the character immediately following the initial “@#” must be an upper-case ASCII letter in ELF. This was formerly a requirement in GEDCOM too, but was dropped after GEDCOM 5.3; nevertheless, all uses of escape sequences in past and present GEDCOM standards have conformed to this syntax requirement, and ELF reintroduces it.
Secondly, the production does not require a character after the final “at” sign, meaning that a space character immediately after an escape sequence is treated as part of a string payload and not as part of the escape sequence. This change has been made so that Unicode escapes can be used internally in a word, without requiring a space afterwards. For example, the Portuguese name João might be encoded as:
1 NAME Jo@#UE3@o
Is the second change likely to cause problems? Are there current applications which will issue an error when they encounter an escape sequence which is not followed by a space, but will accept unknown escape sequences?
The escape type of an escape sequence is the single character matched by the EscapeType production. It defines how the escape sequence is to be interpreted. This standard defines one escape type: the character U is used to represent Unicode escapes, as defined in §6.3.
D escape type for specifying calendar names, and this is the sole use of escape sequences in [GEDCOM 5.5.1]. Previous versions of GEDCOM have used the A escape type for referencing multimedia objects in auxiliary files, C for switching character encoding, F for including data from another file, and L for recording the number of octets of binary data immediately following. ELF does not support these character escapes, but FHISO is unlikely to reuse these escape types in a future version of ELF unless for a compatible feature.
This standard reserves all possible escape types for future FHISO use. Third parties must not use their own escape sequences, except as permitted by a FHISO standard.
This extensibility mechanism is likely to be in [ELF Schemas], and could be as simple as an escape-type-to-IRI mapping to define how the escape type is used in that particular document. For example,
0 SCHMA
1 ESC B https://example.com/binary-escape
Possibly this will be included in ELF 1.0, and if so, the paragraph above reserving all escape sequences will need changing. But the TSC do not consider this feature a priority for ELF 1.0.
The escape value of an escape sequence is the string matched by the EscapeValue production. The meaning of the escape value and any restrictions on its content or format depend on the particular escape type. The only general restriction placed on all escape values is that they must not contain the “at” sign (U+0040), line feed (U+000A), or carriage return (U+000D).
ELF 1.0 defines only two escape types – D for calendar escapes and U for Unicode escapes – and does not allow third parties to define their own. The Unicode escape syntax defined in §6.3 only allows whitespace and hexadecimal digits to appear in the escape value, while the calendar escape syntax defined in §3.1 of [ELF Dates] only allows whitespace and ASCII letters. This means no punctuation characters can actually occur in an escape value in ELF 1.0, even though they are permitted in the generic syntax and must be accepted in unknown escape sequences. A future version of ELF might reserve one or more currently unused characters for a specific purpose within an escape sequence.
In particular, it is not possible to put arbitrary IRIs in an escape value, something which may need considering more carefully in the future, especially if there is any plan to turn calendar escapes into a more general datatype escape mechanism. The problem is that “at” signs are allowed in IRIs, and do occur in mailto IRIs or http IRIs with embedded userinfo. A future version of ELF might reserve a character for escaping characters within escape sequences. For example, %{…} might be used, something like this:
@#T<https://userinfo%{40}example.com/>@
A Unicode escape is a type of escape sequence that allows arbitrary Unicode characters to be encoded in ELF files, regardless of the character encoding used for the file. ELF parsers are required to support Unicode escapes.
A Unicode escape uses the U escape type and has an escape value which is a sequence of zero or more uppercase hexadecimal integers, separated by spaces. The hexadecimal integers are the code points of the characters encoded by the Unicode escape. Its escape value shall match the following UnicodeEsc production.
UnicodeEsc ::= S? ( HexNumber (S HexNumber)* S? )?
HexNumber ::= [0-9A-F]+
If the Portuguese name “João” is used in an ELF file encoded with the ASCII character encoding, it must be encoded using a Unicode escape such as this:
1 NAME Jo@#UE3@o
This is not the only possible encoding of the name João. If it is written with a combining tilde character (U+0303) instead of a precomposed ‘a’ with tilde character (U+00E3), it could be encoded:
1 NAME Joa@#U303@o
[Basic Concepts] allows any string to be converted into Unicode Normalization Form C, which converts the latter form to the former, so an ELF writer need not preserve the form in which the accented character was originally entered.
The Unicode escape syntax allows multiple characters to be encoded in a single escape sequence. This allows a shorter and easier to read encoding of names in non-Latin scripts. For example, the Arabic name عزيز (Aziz) could be encoded in any of the following ways:
1 NAME عزيز
1 NAME @#U639@@#U632@@#U64A@@#U632@
1 NAME @#U 639 632 64A 632@
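Decoding Unicode escapes can be sketched with a regular expression transcribed from the UnicodeEsc production. This is only an illustration and not the full unescaping procedure of §6.5: it assumes S means spaces or tabs, it does not handle escaped at signs or other escape types, and the names `UNICODE_ESCAPE` and `decode_unicode_escapes` are hypothetical.

```python
import re

# "@#U" followed by UnicodeEsc ::= S? ( HexNumber (S HexNumber)* S? )? and "@".
UNICODE_ESCAPE = re.compile(
    r"@#U([ \t]*(?:[0-9A-F]+(?:[ \t]+[0-9A-F]+)*[ \t]*)?)@")


def decode_unicode_escapes(payload):
    """Replace each Unicode escape with the characters it encodes."""
    def replace(match):
        # Each hexadecimal integer is one code point; an empty escape
        # value produces an empty string, so the escape is deleted.
        return "".join(chr(int(h, 16)) for h in match.group(1).split())
    return UNICODE_ESCAPE.sub(replace, payload)


print(decode_unicode_escapes("Jo@#UE3@o"))             # João
print(decode_unicode_escapes("@#U 639 632 64A 632@"))  # عزيز
```

Code points beyond U+10FFFF would make `chr` raise `ValueError`; how such input is treated is not covered by this sketch.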
@#U11f@.
@#U@. These get deleted by an ELF parser during unescaping, as described in §6.5. They are permitted because they provide an alternative means of protecting necessary trailing whitespace in a string payload that is to be read by a legacy application or transmitted in a way that would otherwise remove the whitespace. Putting a @#U@ at the end of the encoded payload might be preferable to encoding the final character of whitespace if the receiving application ignores the unknown Unicode escape.
ELF writers must use a Unicode escape to encode characters that cannot be encoded in the target character encoding, but should not use them otherwise without a specific need, and should prefer an escaped at sign to a Unicode escape when escaping an “at” sign (U+0040).
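As an illustration of the UnicodeEsc production, the following Python sketch decodes the escape value of a Unicode escape (the text between `@#U` and the closing `@`) into the characters it encodes. The function name `decode_unicode_escape` is hypothetical, and the sketch assumes that the S production matches one or more spaces or tabs, which is not stated in this section.

```python
import re

# UnicodeEsc ::= S? ( HexNumber (S HexNumber)* S? )?
# HexNumber  ::= [0-9A-F]+
# Assumption: S is one or more space or tab characters.
_UNICODE_ESC = re.compile(r"[ \t]*(?:[0-9A-F]+(?:[ \t]+[0-9A-F]+)*[ \t]*)?")


def decode_unicode_escape(value: str) -> str:
    """Return the characters encoded by the escape value of a U escape."""
    if not _UNICODE_ESC.fullmatch(value):
        raise ValueError(f"escape value does not match UnicodeEsc: {value!r}")
    # Each hexadecimal integer is the code point of one character.
    return "".join(chr(int(number, 16)) for number in value.split())
```

An empty escape value decodes to the empty string, which is consistent with empty Unicode escapes being deleted during unescaping.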
ELF allows the string payload of a structure to be split across two or more consecutive lines. When this is done, the first line which contains the start of the string payload is called the continued line and the subsequent line or lines which contain the remainder of the string payload are called continuation lines. Any line with a tag of CONT or CONC is a continuation line.
CONT continuation lines are used when the value encoded in a string payload needs to contain line breaks. The part of the string payload following each line break is placed on a continuation line using the CONT tag, and the line break itself is removed from the encoded version of the payload.
CONT continuation lines are commonly used when preserving the layout of a fragment of text found in a source, such as the following three lines of text found on a sepulchral brass:
4 TEXT Pray for the soule of Edward Cowrtney esquyer secunde son
5 CONT of sr Willm Cowrtney knyght of Povderam, which dyed the
5 CONT firrst day of mch Ano dom mvcix on whos soule ihu have mci
CONC continuation lines are used when it is desirable to split a string payload which does not contain a convenient line break across several lines. The payload is split at an arbitrary place which should be between two characters that are not whitespace.
CONC continuation lines can also be useful for breaking string payloads when shorter lines are desirable – such as to prevent the examples in this standard from line-wrapping.
1 NOTE Prof. D. H. Kelley speculates that the mother of King Ecg
2 CONC berht of Wessex was a daughter of Æthelbeorht II of Kent.
In the fragment above, the NOTE structure has a string payload which contains no line breaks and where the name Ecgberht is a single word.
Applications must not assign significance to where CONC continuation lines are inserted nor to how many are present in the serialisation of a string payload.
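A writer might split a payload along these lines. The following Python sketch is one possible approach, not a normative algorithm: it uses CONT for each line break and CONC wherever a line exceeds a chosen width, cutting only between two non-whitespace characters. The names `split_payload` and `_chunks` and the `width` parameter are hypothetical, line breaks are assumed to be `"\n"`, and escaping of “at” signs is outside the scope of the sketch.

```python
def _chunks(text: str, width: int) -> list[str]:
    """Cut text into pieces of at most `width` characters where possible,
    cutting only between two characters that are not whitespace."""
    chunks = []
    while len(text) > width:
        cut = width
        # Move the cut left until neither neighbouring character is whitespace.
        while cut > 0 and (text[cut - 1].isspace() or text[cut].isspace()):
            cut -= 1
        if cut == 0:  # no safe split point found: leave the rest unsplit
            break
        chunks.append(text[:cut])
        text = text[cut:]
    chunks.append(text)
    return chunks


def split_payload(level: int, tag: str, payload: str, width: int = 60) -> list[str]:
    """Serialise one structure as a continued line plus continuation lines."""
    lines = []
    for i, part in enumerate(payload.split("\n")):
        for j, chunk in enumerate(_chunks(part, width)):
            if i == 0 and j == 0:
                prefix = f"{level} {tag}"      # the continued line
            elif j == 0:
                prefix = f"{level + 1} CONT"   # replaces a line break
            else:
                prefix = f"{level + 1} CONC"   # arbitrary split, no line break
            lines.append(f"{prefix} {chunk}" if chunk else prefix)
    return lines
```

Because the cut never falls next to whitespace, the result does not depend on leading or trailing whitespace being preserved on any line.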
The TSC considered adding a third type of continuation line, which would have provisionally used a CONSP tag. It was designed for splitting on a space character without relying on leading or trailing whitespace being preserved in the payload of lines. It would have worked like CONT, except that instead of replacing a line break it would replace a space character (U+0020).
1 NOTE This is a long line which has been
2 CONSP split using the new mechanism.
After further consideration and consultation it was felt that the use cases for this were not sufficient to justify adding a new feature to ELF; however, the TSC welcome further opinions on this.
In order to unescape a string payload of a structure, an ELF parser shall first identify all escaped at signs and escape sequences in the string payload per §6.5.1, and verify that each identified escape sequence is a permitted escape for the structure in whose payload it was found, as described in §6.5.2.
Next, each identified escaped at sign is replaced with a single “at” sign, and each identified Unicode escape is replaced with the character it encodes. Escape sequences other than Unicode escapes are left unaltered.
Because all escaped at signs and escape sequences are identified before any are unescaped, it is not possible to apply both forms of escaping sequentially to a single character. For example, neither of the following structures is a valid way of encoding a string payload consisting of a single “at” sign.
0 NOTE @@#U40@@
0 NOTE @#U40@@#U40@
The former is the recommended way of encoding a payload which consists of the string “@#U40@”, while the latter is an alternate encoding (which is not recommended) of the string “@@”.
Finally, any substructures corresponding to continuation lines are identified and their payloads merged into the payload of their parent structure, as described in §6.5.3.
As continuation lines are merged after escaped at signs and Unicode escapes are unescaped, the payload of the following structure is the literal string “@#U21@” and not an exclamation mark (U+0021):
0 NOTE @
1 CONC #U21@
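The ordering described above can be sketched in Python as follows. This is an illustrative sketch only: the function name `unescape` is hypothetical, the shape assumed for an escape sequence (`@#`, one uppercase ASCII letter, any characters other than `@`, then `@`) is an assumption rather than the EscapeSeq production itself, and S is assumed to match spaces and tabs. Every token is identified in a single pass over the original payload, so the result of one replacement is never rescanned.

```python
import re

# Assumed token shapes: escaped at sign, escape sequence, or bare at sign.
_TOKEN = re.compile(r"@@|@#([A-Z])([^@]*)@|@")
# UnicodeEsc, taking S to be one or more spaces or tabs (an assumption).
_UNICODE_VALUE = re.compile(r"[ \t]*(?:[0-9A-F]+(?:[ \t]+[0-9A-F]+)*[ \t]*)?")


def unescape(payload: str) -> str:
    """Unescape the payload of one line; continuation lines are merged later."""

    def replace(match: re.Match) -> str:
        token, esc_type, esc_value = match.group(0), match.group(1), match.group(2)
        if token == "@@":
            return "@"  # escaped at sign
        if esc_type == "U":
            if not _UNICODE_VALUE.fullmatch(esc_value):
                raise ValueError(f"malformed Unicode escape: {token!r}")
            return "".join(chr(int(h, 16)) for h in esc_value.split())
        if esc_type is not None:
            return token  # other escape sequences are left unaltered
        if payload.startswith("#", match.end()):
            # "@#" that does not form a complete escape sequence
            raise ValueError("non-conformant structure: malformed escape sequence")
        return "@"  # deprecated bare at sign

    return _TOKEN.sub(replace, payload)
```

Applying `unescape` to the payloads of the two lines in the preceding example separately and then concatenating the results gives the literal string `@#U21@`, because neither line contains a complete escape sequence on its own.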
To identify all the escaped at signs and escape sequences in a string payload, an ELF parser scans the string from beginning to end looking for “at” signs (U+0040), and then inspects the next character, if there is one, to determine how the “at” sign is to be interpreted.
If the following character is another “at” sign, then an ELF parser shall identify the two “at” signs as an escaped at sign, and then resume scanning for “at” signs from the character following the second “at” sign.
The @@ in the payload of the following structure is identified as an escaped at sign.
1 EMAIL name@@example.com
Otherwise, if the following character is the number sign (#; U+0023), then an ELF parser shall identify these two characters as the start of an escape sequence, terminating at the subsequent “at” sign. If there is no subsequent “at” sign, or if the string identified as an escape sequence does not match the EscapeSeq production, the structure containing this string payload is a non-conformant structure. If a syntactically correct escape sequence was identified, the ELF parser shall resume scanning for “at” signs from the character following the second “at” sign.
In this example, the @# is treated as the start of an escape sequence, but because there is no subsequent @, it is a non-conformant structure, and an ELF parser must either terminate parsing or issue a warning to the user.
0 NOTE Lines containing only a @# are non-conformant.
If the character immediately after the @# is not an upper-case ASCII letter, the escape sequence does not match the EscapeSeq production and the result is also a non-conformant structure. This example is a non-conformant structure for that reason.
0 NOTE Following a @# with a @ isn't necessarily conformant.
Otherwise, the “at” sign is treated as a regular character, and scanning for “at” signs continues from the next character. This facility for treating unescaped “at” signs as regular characters is deprecated.
This applies in the following structure, where the “at” sign has not been properly escaped.
1 EMAIL name@example.com
ELF parsers must accept this, but a future version of ELF is likely to make this a non-conformant structure.
The following table illustrates how some more complicated string payloads are parsed into strings, escaped at signs, escape sequences and bare “at” signs.
String payload | Parsed as |
---|---|
“name@example.com” | “name”, “@”, “example.com” |
“name@@example.com” | “name”, “@@”, “example.com” |
“name@@@example.com” | “name”, “@@”, “@”, “example.com” |
“name@@@@example.com” | “name”, “@@”, “@@”, “example.com” |
“some@#XYZ@thing” | “some”, “@#XYZ@”, “thing” |
“some@@#XYZ@thing” | “some”, “@@”, “#XYZ”, “@”, “thing” |
“some@@@#XYZ@thing” | “some”, “@@”, “@#XYZ@”, “thing” |
“@#XA@@#YB@” | “@#XA@”, “@#YB@” |
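The scanning procedure can be sketched as a small left-to-right tokeniser. The following Python is a sketch under stated assumptions: the function name `scan` is hypothetical, and the pattern used for an escape sequence (`@#`, one uppercase ASCII letter, any characters other than `@`, then `@`) only approximates the EscapeSeq production, which is defined elsewhere in this standard.

```python
import re

# Assumed approximation of the EscapeSeq production.
_ESCAPE_SEQ = re.compile(r"@#[A-Z][^@]*@")


def scan(payload: str) -> list[str]:
    """Split a payload into plain strings, escaped at signs, escape sequences
    and bare at signs, in the order they occur."""
    parts = []
    text_start = 0  # start of the plain text not yet emitted
    i = 0
    while i < len(payload):
        if payload[i] != "@":
            i += 1
            continue
        if i > text_start:
            parts.append(payload[text_start:i])
        following = payload[i + 1:i + 2]  # next character, or "" at the end
        if following == "@":
            token_end = i + 2             # escaped at sign
        elif following == "#":
            match = _ESCAPE_SEQ.match(payload, i)
            if match is None:
                raise ValueError("non-conformant structure: malformed escape sequence")
            token_end = match.end()       # escape sequence
        else:
            token_end = i + 1             # deprecated bare at sign
        parts.append(payload[i:token_end])
        i = text_start = token_end
    if text_start < len(payload):
        parts.append(payload[text_start:])
    return parts
```

For instance, `scan("some@@#XYZ@thing")` returns `["some", "@@", "#XYZ", "@", "thing"]`, matching the corresponding row of the table.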
A permitted escape is an escape sequence with an escape type that is permitted to occur in a particular structure. If a string payload contains an escape sequence other than a permitted escape, the structure is a non-conformant structure.
If the application is schema-aware, permitted escapes are identified as described in the [ELF Schemas] standard. Otherwise, permitted escapes are identified as described in this section.
If the escape type is U, then the escape sequence is a permitted escape.
If the escape type is D, then the escape sequence is a permitted escape.
The following structure contains two instances of escape sequences with the escape type D, which denotes a calendar escape in §3.1.1 of [ELF Dates]. Both uses are permitted escapes, despite the fact that ages, as defined in §6 of [ELF Dates], do not allow the use of calendar escapes.
1 DEAT
2 DATE @#DJULIAN@ 30 JAN 1649
2 AGE @#DJULIAN@ 48y
Escape sequences with the D escape type are permitted escapes everywhere so that the serialisation layer is compatible with future versions of ELF which may choose to allow calendar escapes in other contexts. For example, a future version of ELF could allow calendar escapes to be used with ages because the length of a year can depend on the calendar being used. Schema-aware applications are better able to determine whether the calendar escape is really a permitted escape.
An escape sequence with any other escape type is not a permitted escape.
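For an application that is not schema-aware, the rule above reduces to a membership test on the escape type. A minimal Python sketch, using the hypothetical names `PERMITTED_ESCAPE_TYPES` and `is_permitted_escape` and assuming the escape type is the single character following `@#`:

```python
# Escape types permitted everywhere by an application that is not schema-aware.
PERMITTED_ESCAPE_TYPES = frozenset({"U", "D"})


def is_permitted_escape(escape_sequence: str) -> bool:
    """Return True if an escape sequence such as "@#DJULIAN@" is a permitted escape."""
    # The escape type is assumed to be the character immediately after "@#".
    return escape_sequence[2:3] in PERMITTED_ESCAPE_TYPES
```

A schema-aware application would replace this test with the per-structure rules of [ELF Schemas].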
Substructures with a tag of CONC or CONT are called continuation substructures. They correspond to continuation lines.
A continuation substructure is a malformed structure if it has a cross-reference identifier, or has a non-empty list of substructures, or is a substructure of a continuation substructure, or is preceded in the list of substructures by a structure other than a continuation substructure. Likewise, any record whose tag is CONT or CONC is a malformed structure.
CONC
and CONT
continuation substructures.
The third line of this example is a malformed structure because the NOTE structure has another substructure before the continuation substructure – namely, the REFN structure.
0 NOTE Start of note
1 REFN 5bb43407-9f24-4b42-b00e-c32cc0f09d21
1 CONT End of note
A continuation substructure is a non-conformant structure if it has a payload which is a pointer.
The second line of this example is a non-conformant structure because it is a continuation substructure whose payload is a pointer.
0 @N1@ NOTE This can be found in:
1 CONT @F1@
The NOTE line is the continued line, and has a valid cross-reference identifier. It is only continuation lines and not continued lines that must not have cross-reference identifiers.
If a structure has any continuation substructures, each is merged with the parent structure in the order they appear in the list of substructures, as follows.
If a CONT continuation substructure is encountered, an ELF parser shall first append a line break to the payload of the parent structure. The form of line break appended is implementation-defined, but all inserted line breaks must have identical lexical forms.
Then, regardless of the type of continuation substructure, the payload of the continuation substructure shall be appended to the payload of the parent structure, and the continuation substructure is removed from the parent’s list of substructures.
This NOTE structure has three continuation substructures followed by one other substructure.
0 NOTE This paragraph is sufficiently long that it has proved con
1 CONC venient to wrap it onto a second line.
1 CONT
1 CONT This is a short paragraph.
1 REFN 8e445bb6-cb27-4c12-8c74-e051395639c2
None of the lines in this example contain trailing whitespace. Once continuation substructures have been merged, this example consists of a NOTE structure whose string payload is “This paragraph is sufficiently long that it has proved convenient to wrap it onto a second line.\n\nThis is a short paragraph.” In this explanation, \n denotes a line break of unspecified form. This is for exposition only and does not form part of the ELF syntax.
After merging continuation substructures, the NOTE structure has just one substructure – the REFN structure.
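The merging step can be sketched in Python as follows, consistent with the worked example above in which CONT introduces a line break and CONC does not. The `Structure` class and the `merge_continuations` function are hypothetical names, a missing payload is modelled as an empty string, and `"\n"` stands in for the implementation-defined line break.

```python
from dataclasses import dataclass, field


@dataclass
class Structure:
    tag: str
    payload: str = ""
    substructures: list["Structure"] = field(default_factory=list)


def merge_continuations(parent: Structure, line_break: str = "\n") -> None:
    """Merge CONC and CONT substructures into the parent's payload, in order."""
    remaining = []
    for sub in parent.substructures:
        if sub.tag not in ("CONC", "CONT"):
            remaining.append(sub)
            continue
        if remaining:
            # Preceded by a structure other than a continuation substructure.
            raise ValueError("malformed structure: misplaced continuation substructure")
        if sub.tag == "CONT":
            parent.payload += line_break  # CONT stands for a line break
        parent.payload += sub.payload
    # Continuation substructures are removed from the list of substructures.
    parent.substructures = remaining
```

Applied to the NOTE structure in the example, this leaves the REFN structure as the only substructure.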
If a tagged structure is pointed to by the pointer-valued payload of another tagged structure, the pointed-to tagged structure’s corresponding xref structure shall be given an xref_id, a string matching production XrefID.
XrefID ::= "@" ID "@"
ID ::= [0-9A-Z_a-z] [#x20-#x3F#x41-#x7E]*
Two different xref structures must not be given the same xref_id. Conformant implementations must not attach semantic importance to the contents of an xref_id.
It is recommended that an xref_id be no more than 22 characters (20 characters plus the leading and trailing U+0040).
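The XrefID production translates directly into a regular expression. The following Python sketch, with the hypothetical function names `is_xref_id` and `has_recommended_length`, checks a candidate string against the production and against the recommended length.

```python
import re

# XrefID ::= "@" ID "@"
# ID     ::= [0-9A-Z_a-z] [#x20-#x3F#x41-#x7E]*
_XREF_ID = re.compile(r"@[0-9A-Z_a-z][\x20-\x3F\x41-\x7E]*@")


def is_xref_id(candidate: str) -> bool:
    """Return True if the whole string matches the XrefID production."""
    return _XREF_ID.fullmatch(candidate) is not None


def has_recommended_length(xref_id: str) -> bool:
    """Return True if the xref_id is within the recommended 22 characters."""
    return len(xref_id) <= 22
```

Because U+0040 is excluded from the ID production, an xref_id can never contain an “at” sign other than its leading and trailing ones.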
Each record should be given an xref_id; each non-record structure should not; and each serialisation metadata tagged structure must not be given an xref_id.
The xref structure that corresponds to a tagged structure with a pointer-valued payload has, as its payload, an xref: a string identical to the xref_id of the xref structure corresponding to the pointed-to tagged structure.
When parsing, if an xref payload is encountered that does not correspond to exactly one xref structure’s xref_id, that payload shall be converted to a pointer to a record with tag “UNDEF”, which shall not have a payload nor substructures. It is recommended that one such “UNDEF” tagged structure be inserted for each distinct xref.
If the escape type is U (U+0055), the escape is a Unicode escape and its handling is discussed in §6.3; otherwise, it is handled according to this section.
If an escape is in the payload of a tagged structure whose tag is an escape preserving tag, and if the escape’s escape type is in the tag’s set of preserved escape types, then the escape shall be preserved unmodified in the corresponding xref structure’s payload.
If a “DATE” tagged structure has payload “ABT @#DJULIAN@ 1540”, its corresponding xref structure’s payload is also “ABT @#DJULIAN@ 1540”.
Otherwise, a modification of the escape shall be placed in the xref structure’s payload which is identical to the original escape except that each of the two U+0040 (@) shall be replaced with a pair of consecutive U+0040 (@@).
If a “NOTE” tagged structure has payload “ABT @#DJULIAN@ 1540”, its corresponding xref structure’s payload is “ABT @@#DJULIAN@@ 1540”.
If an escape is in the payload of an xref structure whose tag is an escape preserving tag, and the escape’s escape type is in the tag’s set of preserved escape types, the escape shall be preserved unmodified in the corresponding tagged structure’s payload.
If a “DATE” xref structure has payload “ABT @#DJULIAN@ 1540”, its corresponding tagged structure’s payload is also “ABT @#DJULIAN@ 1540”.
Otherwise, the escape shall be omitted from the corresponding tagged structure’s payload.
If a “NOTE” xref structure has payload “ABT @#DJULIAN@ 1540”, its corresponding tagged structure’s payload is “ABT 1540”.
It might be worthwhile to restrict this entire section to non-escape preserving tags; without that we have a (somewhat obscure) problem with the current system:
Consider the escape-preserving tag DATE. A serialisation/parsing sequence applied to the string “@@#Dx@@ yz” yields
“@@#Dx@@ yz”
“@#Dx@ yz”
“@#Dx@ yz” – not with @@ because it matches a date escape
During serialisation, each U+0040 (@) that is not part of an escape shall be encoded as two consecutive U+0040 (@@).
The payload “name@example.com” is serialised as the xref structure payload “name@@example.com”.
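The serialisation rules for “at” signs and escapes in this section can be sketched together in Python. This is only a sketch under several assumptions: the function name `encode_at_signs` and the `preserved_types` parameter are hypothetical, an escape is assumed to have the shape `@#`, one uppercase ASCII letter, any characters other than `@`, then `@`, and Unicode escapes (which are handled separately in §6.3) get no special treatment here.

```python
import re

# Assumed shape of an escape within a tagged structure's payload.
_ESCAPE = re.compile(r"@#([A-Z])[^@]*@")


def encode_at_signs(payload: str, preserved_types: frozenset[str] = frozenset()) -> str:
    """Encode a tagged structure's payload as an xref structure's payload.

    preserved_types is the set of preserved escape types of the structure's
    tag; it is empty for tags that are not escape preserving tags.
    """
    out = []
    pos = 0
    for match in _ESCAPE.finditer(payload):
        # At signs that are not part of an escape are doubled.
        out.append(payload[pos:match.start()].replace("@", "@@"))
        escape = match.group(0)
        if match.group(1) in preserved_types:
            out.append(escape)                     # preserved unmodified
        else:
            out.append(escape.replace("@", "@@"))  # both at signs doubled
        pos = match.end()
    out.append(payload[pos:].replace("@", "@@"))
    return "".join(out)
```

For example, `encode_at_signs("ABT @#DJULIAN@ 1540", frozenset({"D"}))` leaves the payload unchanged, while the same call with no preserved escape types gives `ABT @@#DJULIAN@@ 1540`.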
The tagged structures representing the dataset are ordered as follows:
A serialisation metadata tagged structure with tag “HEAD” and the following substructures:
A serialisation metadata tagged structure with tag “CHAR” and payload identifying the character encoding used; see §8.1 for details.
A serialisation metadata tagged structure with tag “SCHMA” and no payload, with substructures encoding the ELF Schema.
Each tagged structure with the superstructure type identifier elf:Metadata, in an order consistent with the partial order of structures present in the metadata.
Each tagged structure with the superstructure type identifier elf:Document, in arbitrary order.
A serialisation metadata tagged structure with tag “TRLR” and no payload or substructures.
The character encoding shall be serialised in the “CHAR” tagged structure’s payload using the encoding name given in the following table:
Encoding | Description |
---|---|
ASCII | The US version of ASCII defined in [ASCII]. |
ANSEL | The extended Latin character set for bibliographic use defined in [ANSEL]. |
UNICODE | Either the UTF-16LE or the UTF-16BE encodings of Unicode defined in [ISO 10646]. |
UTF-8 | The UTF-8 encoding of Unicode defined in [ISO 10646]. |
It is required that the encoding used be able to represent all code points within the string; Unicode escapes (see §6.3) allow this to be achieved for any supported encoding. It is recommended that UTF-8 be used for all datasets.
Copyright © 2017–19, Family History Information Standards Organisation, Inc. The text of this standard is available under the Creative Commons Attribution 4.0 International License.