A citation elements vocabulary

This is a first draft of material intended to form the introduction to a FHISO Citation Elements standard.

FHISO's citation elements vocabulary provides a standard, extensible framework for encoding all the data about a genealogical source that might reasonably be included in a formatted citation to that source. It does not seek to provide an exhaustive description of sources.

This citation element vocabulary covers just a small part of the genealogy domain, and it is anticipated that parties adopting this standard will wish to incorporate it in whatever serialisation format they currently use. For this reason, this standard does not define a serialisation format.

General

The key words must, must not, required, shall, shall not, should, should not, recommended, may and optional in this standard are to be interpreted as described in [RFC 2119].

An application is conformant with this standard if and only if it follows all the requirements and prohibitions contained in this document, as indicated by use of the words must, must not, required, shall and shall not, and the relevant parts of its normative references. Standards referencing this standard must not loosen any of the requirements and prohibitions made by this standard, nor place additional requirements or prohibitions on the constructs defined herein.

Adding requirements or prohibitions is disallowed so as to preserve interoperability between applications: data generated by one conformant application must always be acceptable to another conformant application, regardless of what additional standards each may conform to.

Indented text in coloured boxes, such as the preceding paragraph, does not form a normative part of this standard, and is labelled as either an example or a note.

Editorial notes, such as this, are used to record outstanding issues, or points where there is not yet consensus; they will be resolved and removed for the final standard. Examples and notes will be retained in the standard.

Characters and strings

The grammar given here uses the same EBNF notation as [XML], except that no significance is attached to the capitalisation of grammar symbols. Conforming applications must not generate data not conforming to the syntax given here, but non-conforming syntax may be accepted and processed by a conforming application in an implementation-defined manner.

Characters are specified by reference to their code point number in [ISO 10646], without regard to any particular character encoding. In this standard, characters may be identified by their hexadecimal code point prefixed with "U+".

The character encoding is a property of the serialisation, and not defined in this standard. Non-Unicode encodings are not precluded, so long as it is defined how characters in that encoding correspond to Unicode characters.

Characters must match the Char production from [XML].

Char  ::=  [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
This includes all code points except the null character, surrogates (which are reserved for encodings such as UTF-16 and not characters in their own right), and the invalid characters U+FFFE and U+FFFF.
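
As an illustration only, the Char production can be checked one code point at a time. The following Python sketch is not part of this standard; the function names is_char and is_valid_string are hypothetical.

```python
def is_char(ch: str) -> bool:
    """Return True if the single character ch matches the Char production."""
    cp = ord(ch)
    return (
        0x1 <= cp <= 0xD7FF            # everything below the surrogates, except U+0000
        or 0xE000 <= cp <= 0xFFFD      # above the surrogates, excluding U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def is_valid_string(s: str) -> bool:
    """Return True if every character of s matches Char; the empty string passes."""
    return all(is_char(ch) for ch in s)
```

An application generating data might call is_valid_string on each string before serialising it.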

A string is a sequence of zero or more characters.

The definition of a string is identical to the definition of the xs:string datatype defined in [XSD Pt2], used in many XML and Semantic Web technologies.

Characters matching the RestrictedChar production from [XML] should not appear in strings, and applications may process such characters in an implementation-defined manner or reject strings containing them.

RestrictedChar  ::=  [#x1-#x8] | [#xB-#xC] | [#xE-#x1F]
                       | [#x7F-#x84] | [#x86-#x9F]
This includes all C0 and C1 control characters except tab (U+0009), line feed (U+000A), carriage return (U+000D) and next line (U+0085).
As applications can process C1 control characters in an implementation-defined manner, they can opt to handle Windows-1252 quotation marks in data masquerading as Unicode.
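
A minimal Python sketch of one possible implementation-defined treatment follows. It detects characters matching RestrictedChar and, as one option among many, reinterprets restricted C1 controls as Windows-1252 bytes. The names has_restricted_char and repair_windows_1252 are hypothetical, and the repair policy is an assumption rather than a requirement of this standard.

```python
import re

# The RestrictedChar production written as a regular expression character class.
RESTRICTED_CHAR = re.compile(
    r"[\u0001-\u0008\u000B-\u000C\u000E-\u001F\u007F-\u0084\u0086-\u009F]"
)


def has_restricted_char(s: str) -> bool:
    """Return True if s contains any character matching RestrictedChar."""
    return RESTRICTED_CHAR.search(s) is not None


def repair_windows_1252(s: str) -> str:
    """Reinterpret restricted C1 controls as Windows-1252 bytes where possible."""
    out = []
    for ch in s:
        cp = ord(ch)
        if cp >= 0x80 and RESTRICTED_CHAR.match(ch):
            try:
                ch = bytes([cp]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # byte has no Windows-1252 meaning; leave the character as it is
        out.append(ch)
    return "".join(out)
```

Because U+0085 (next line) does not match RestrictedChar, this sketch leaves it untouched even though byte 0x85 has a meaning in Windows-1252.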

Whitespace is defined as a sequence of one or more space characters, carriage returns, line feeds, or tabs. It matches the production S from [XML].

S  ::=  (#x20 | #x9 | #xD | #xA)+

Whitespace normalisation is the process of discarding any leading or trailing whitespace, and replacing other whitespace with a single space (U+0020) character.

The definition of whitespace normalisation is identical to that in [XML].
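
Whitespace normalisation as defined above can be sketched in Python as follows. The function name normalise_whitespace is hypothetical. An explicit character class is used because Python's built-in str.split() and str.strip() treat additional characters, such as U+000B and U+00A0, as whitespace, which the S production does not.

```python
import re

# The S production: one or more of space, tab, carriage return or line feed.
_S = re.compile(r"[\x20\x09\x0D\x0A]+")


def normalise_whitespace(s: str) -> str:
    """Collapse each run of whitespace to one space, then drop leading and trailing spaces."""
    return _S.sub(" ", s).strip(" ")
```

For instance, normalise_whitespace("  Royal \t\n Ancestry \r\n") returns "Royal Ancestry".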

In the event of a difference between the definitions of the Char, RestrictedChar and S productions given here and those in [XML], the definitions in the latest edition of the XML 1.1 specification are applicable.

Sources and citations

A source is any resource from which information is obtained during the genealogical research process. Sources come in many forms, including manuscripts, artefacts, books, films, people, recordings and websites. A full mechanism for describing sources is beyond the scope of this standard.

A citation is an abstract reference to a specific source from which information has been used in some context. It should include sufficient detail that a third party could readily locate the information themselves, assuming the source remains accessible.

A formatted citation is a citation that has been rendered into human-readable form, typically as a sentence or short paragraph that might be used as a footnote, endnote or bibliography entry. There is no single standard for the correct form of formatted citations; many different style guides exist, each giving their own rules on how to construct a formatted citation.

A formatted citation produced for use in a footnote on the first use of the source, and conforming to [Chicago] might read:

1   Christian Settipani, Les ancêtres de Charlemagne, 2nd ed. (Oxford: Prosopographia et Genealogica, 2015), 129–31.

The 1 at the start of the citation is the hypothetical footnote number.

A citation element is a representation of a logically self-contained piece of information about a source that might reasonably be included in a formatted citation. The information is stored in a sufficiently structured way that applications can parse and reformat it as needed when producing a formatted citation.

A citation element set is an unordered set of citation elements that completely encode the information about a source required to produce a formatted citation. Given a citation element set and any necessary internal state, an application should be able to produce algorithmically a formatted citation in any mainstream citation style; it need not use every citation element in doing so if the style dictates that certain information is omitted in certain contexts.

The earlier example formatted citation to Les ancêtres de Charlemagne is represented by a citation element set containing the following seven citation elements:

The footnote number is not a citation element as it does not pertain to the source. The author and page range are not expressed here in quite the same form as the formatted citation, but an application can readily parse them to convert them to the required format because their format is defined by this standard.

Citation element sets should not include citation elements for information that is not normally included in a formatted citation. They are not intended to provide a general mechanism for storing arbitrary information about sources.

Formatted citations do not normally include details such as the email addresses, phone numbers or academic affiliations of authors, so they should not be included in the citation element set.

A citation element consists of three parts: a citation element name, a citation element value, and an optional language tag.

A citation element set must not contain more than one citation element with the same citation element name and language tag.

Many languages and serialisation formats provide a map or dictionary type which could be suitable for storing a citation element set. JSON's object type, defined in §4 of [RFC 7159], is an example. The language tag complicates the use of these types, as it is unclear whether it belongs in the map key or the map value. It is recommended that if it is not possible to use a pair comprising both the citation element name and the language tag as the map key, then the language tag should be included in the map value.
This simple model, where a citation is represented by a name-value map, may not be sufficient for layered citations, that is, citations containing information on the resource consulted as well as the resources from which it is derived. FHISO have deferred discussion of this issue until other aspects of this standard are more complete.
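
The recommendation to key the map on a pair of citation element name and language tag can be illustrated with a Python dictionary. This is a sketch only: the class CitationElementSet and its methods are hypothetical, and comparing language tags by exact string equality is an assumption, since this draft does not say how language tags are compared.

```python
from typing import Dict, List, Optional, Tuple, Union

Value = Union[str, List[str]]       # single-valued or list-valued
Key = Tuple[str, Optional[str]]     # (citation element name, language tag or None)

TITLE = "http://terms.fhiso.org/sources/title"


class CitationElementSet:
    """Hypothetical in-memory model of a citation element set."""

    def __init__(self) -> None:
        self._elements: Dict[Key, Value] = {}

    def add(self, name: str, value: Value, lang: Optional[str] = None) -> None:
        key = (name, lang)
        if key in self._elements:
            # At most one citation element per name and language tag is permitted.
            raise ValueError(f"duplicate citation element: {key!r}")
        self._elements[key] = value

    def get(self, name: str, lang: Optional[str] = None) -> Value:
        return self._elements[(name, lang)]


elements = CitationElementSet()
elements.add(TITLE, "Η Γενεαλογία των Κομνηνών")
elements.add(TITLE, "Hē Genealogia tōn Komnēnōn", lang="el-Latn")
```

Adding a second untagged title to this set raises ValueError, reflecting the rule that a citation element set must not contain two citation elements with the same name and language tag.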

Citation element names

The citation element name is an identifier used to identify what information the citation element contains. It shall take the form of an IRI matching the IRI production in §2.2 of [RFC 3987].

This standard defines a citation element for the title of a source. It has the citation element name http://terms.fhiso.org/sources/title.
IRIs have been chosen in preference to URIs because it is recognised that certain culture-specific genealogical concepts may not have English names, and in such cases the human-legibility of IRIs is advantageous.

This standard defines many citation elements, all of which have a citation element name that begins http://terms.fhiso.org/. These aim to cover the information normally found in formatted citations to a wide range of sources, but applications may define their own citation elements or use those defined by a third-party standard; such citation elements are known as extension citation elements. It is recommended that any extension citation elements also use the http IRI scheme defined in §2.7.1 of [RFC 7230], and an authority component consisting of just a domain name (or subdomain) under the control of the party defining the extension citation elements. Conforming applications must not discard unrecognised extension citation elements, other than at the instruction of the user, but may opt not to display them.

It is recommended that an HTTP 1.1 GET request made without an Accept header to the citation element name IRI (once converted to a URI per §3.1 of [RFC 3987]) should result in a 303 "See Other" redirect to a document containing a human-readable definition of the element.

A 303 redirect is considered best practice for [Linked Data], so as to avoid confusing the citation element name IRI with its definition, which is found at the post-redirect URL. The citation elements defined in this standard are not specifically designed for use in Linked Data, but the same considerations apply.
A future draft of this standard is likely to add support for a discovery mechanism, whereby an HTTP 1.1 GET request to the citation element name IRI, made with an appropriate Accept header, yields a machine-readable definition of the citation element. Support for this by the authors of extension citation elements is likely to be recommended but not required, while application support for it would be optional.

Citation element names are compared using the "simple string comparison" algorithm given in §5.3.1 of [RFC 3987]. If a citation element name does not compare equal to an IRI known to the application, it must not make any assumptions on the purpose of the citation element or the meaning of its value based on the IRI.

This is a simple character-by-character comparison, with no normalisation carried out on the IRIs prior to comparison. This is how XML namespace names are compared in [XML Names].

For the purpose of comparing citation element names, the following IRIs are all distinct, even though an HTTP request to them would fetch the same resource.

http://éléments.example.com/nationalité
HTTP://ÉLÉMENTS.EXAMPLE.COM/nationalit%C3%A9
http://xn--lments-9uab.example.com/nationalit%c3%a9
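
In Python, the simple string comparison corresponds to the built-in equality operator on str, which compares code point by code point and performs no case folding, percent-decoding, Punycode conversion or Unicode normalisation. The sketch below uses the three IRIs above; the function name same_citation_element_name is hypothetical.

```python
NAMES = [
    "http://éléments.example.com/nationalité",
    "HTTP://ÉLÉMENTS.EXAMPLE.COM/nationalit%C3%A9",
    "http://xn--lments-9uab.example.com/nationalit%c3%a9",
]


def same_citation_element_name(a: str, b: str) -> bool:
    """Simple string comparison: equal only if the code point sequences are identical."""
    return a == b


# Collect every pair of different list entries that nevertheless compares equal.
clashes = [
    (a, b)
    for i, a in enumerate(NAMES)
    for b in NAMES[i + 1:]
    if same_citation_element_name(a, b)
]
```

The list clashes is empty here, so an application treats these as three unrelated citation element names.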

In addition to describing the intended purpose of the citation element, the definition of a citation element (regardless of whether it is one of those defined in this standard, or whether it is an extension citation element) shall state:

A list-valued citation element is one that can logically have multiple values. It should be reserved for situations where the values genuinely contain different information, and not used to accommodate transliterations, translations, or variant forms of text. Citation elements that are not list-valued are single-valued.

The http://terms.fhiso.org/sources/creators-name citation element used to record the authors, editors and compilers of a source is list-valued because sources may have multiple authors. The http://terms.fhiso.org/sources/title citation element is single-valued, as sources normally only have one title; language tags provide a way of giving a transliterated version of the title.

A citation element declaration is a serialisation of the formal definition of a citation element. It must include the citation element name and its cardinality, should include its range, and may include other details.

This draft does not define a format for citation element declarations. In earlier discussion, a format based on JSON-LD's @context was mooted, setting @container to either @list or null for each citation element name; however, the handling of language tags seemingly prevents this from being done in a way that is compatible with JSON-LD.

Citation element values

The citation element value is the content of the citation element, and is either a string or an ordered list of strings, depending on whether the citation element is single-valued or list-valued. This standard does not state how a list should be represented, and it will depend on the serialisation being used.

Initial discussion suggested that an unordered set of strings should be possible as a citation element value. As no use case has yet been found for this, it has not been included in this draft; it may be added to subsequent drafts.

In the earlier example of Les ancêtres de Charlemagne, the title could be encoded in a citation element with:

The author could be encoded with a citation element with:

In the former case, the value is a string because the title citation element is defined to be single-valued; in the latter, it is a list because the creators-name citation element is defined as being list-valued.
The precise details of citation elements for authorship (and how to accommodate concepts like editors, compilers and translators) are as yet undecided. The citation element name used in this example may need updating to reflect the final decision.

A citation element defined as being list-valued must not have a citation element value which is a string; a citation element defined as being single-valued must not have a citation element value which is a list. In the conceptual model defined by this standard, a string is different to a list of one string.
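
The distinction between a string and a list of one string can be enforced with a simple type check. The following Python sketch assumes the application already knows, from the citation element's definition, whether it is list-valued; the function name check_value is hypothetical.

```python
from typing import List, Union


def check_value(value: Union[str, List[str]], list_valued: bool) -> None:
    """Raise TypeError if the shape of value does not match the declared cardinality."""
    if list_valued:
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise TypeError("list-valued citation element requires a list of strings")
    elif not isinstance(value, str):
        raise TypeError("single-valued citation element requires a string")
```

With this check, "Settipani, Christian" is rejected for a list-valued element, whereas ["Settipani, Christian"] is accepted.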

Applications may convert any citation element value into Unicode Normalization Form C, as defined in any version of Unicode Standard Annex #15 [UAX 15].

This allows applications to store citation element values internally in either Normalization Form C or Normalization Form D for ease of searching, sorting and comparison, without also retaining the original, unnormalised form.
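
As a sketch of conversion to Normalization Form C, Python's standard unicodedata module can be applied to either shape of citation element value. The function name to_nfc is hypothetical.

```python
import unicodedata
from typing import List, Union


def to_nfc(value: Union[str, List[str]]) -> Union[str, List[str]]:
    """Convert a citation element value, string or list of strings, to NFC."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    return [unicodedata.normalize("NFC", v) for v in value]
```

A title typed with a combining circumflex (U+0065 followed by U+0302) is converted to the precomposed form using U+00EA.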

Applications may whitespace-normalise citation element values, in which case leading and trailing whitespace is discarded and every other sequence matching the S production collapses to a single space (U+0020).

Language Tags

Each citation element may contain a language tag which shall, if present, match the Language-Tag production from [RFC 5646]. Because several citation elements with the same citation element name are permitted if they have different language tags, even for single-valued citation elements, this may be used to provide transliterated versions of the citation element value. The script subtags described in §2.2.3 of [RFC 5646] should be used when an element has been transliterated.

Some citation elements have non-textual citation element values (for example if they are numbers), and such citation elements should be defined not to allow language tags. When this is the case, a language tag must not be provided. When a citation element contains an untranslated, untransliterated citation element value, as found in the source, the language tag should be omitted. In all other circumstances, citation elements should contain a language tag.

If there is not an obvious language tag to use, [ISO 639-2] provides the codes mul for when multiple languages are used, und for when the language is undetermined, and zxx for when there is no linguistic content present.
The title element is defined to be single-valued in this standard, but a citation element set may contain a title element with the value "Η Γενεαλογία των Κομνηνών" and no language code (it being in the original, untransliterated Greek), and another with the value "Hē Genealogia tōn Komnēnōn" tagged with the language tag "el-Latn". Were the original language unknown, which might be the case if transliteration were supplied by a computer, it could be tagged "und-Latn".
Translation has not been included in this draft because the citation element containing the original is untagged, meaning there is no way of distinguishing between translated and transliterated versions. Adding a language tag to the original would not help, as there would then be no way of knowing it is the original, and still no way of knowing what was a translation. In any case, most style guides advocate only transliterating and not translating citation elements. If there is a good use case for allowing translation, a work-around would be to define a private use subtag like x-original to mark the original; a better mechanism would be preferable.

If the citation elements are being serialised in XML, it is recommended that the special xml:lang attribute defined in §2.12 of [XML] is used to encode the language tag.

A JSON serialisation

There has been little discussion, and therefore no consensus has been established, on whether to include a serialisation format in this standard. The principal motivation in providing one is to provide an easy way of providing example data in this standard. For this purpose, a JSON syntax is more compact than an XML one.

This section defines a JSON serialisation of citation element sets. Support for it is optional; even if an application wishes to support serialisation to JSON, it may opt to do it differently.

A citation element name shall be serialised as a JSON string (per §7 of [RFC 7159]) containing the citation element name.

A citation element value which is a string shall be serialised as a JSON string; a citation element value which is a list shall be serialised as a JSON array of strings (per §5 of [RFC 7159]).

A book title would be serialised "Royal Ancestry" as the title citation element is single-valued. Its author would be serialised [ "Richardson, Douglas" ] as the creators-name citation element is list-valued.

The serialised citation element value shall be wrapped in a JSON object (per §4 of [RFC 7159]) comprising two members: one named value or list (depending on whether the citation element is single-valued or list-valued) whose value is the serialised citation element value, the other named lang whose value is the language tag as a JSON string. If the language tag is omitted, either the lang member may be omitted or it may be explicitly set to null.

The wrapped versions of book title and author from the previous example are:

{ "value": "Royal Ancestry", "lang": null }
{ "list": [ "Richardson, Douglas" ] }
In this example, the former explicitly sets the lang to null, while the latter does so implicitly. Both are acceptable.

A citation element set is serialised as a JSON object, with each distinct citation element name serialised as the JSON member name. If the citation element name is declared not to allow language tags, then the JSON member value is the serialised citation element value; if language tags are allowed, then the JSON member value is a JSON array containing the wrapped version of every citation element value from a citation element with the current citation element name.

A simple citation element set containing just the title and author of Royal Ancestry would be:

{ "http://terms.fhiso.org/sources/title":
    [ { "value": "Royal Ancestry" } ],
  "http://terms.fhiso.org/sources/creators-name":
    [ { "list": [ "Richardson, Douglas" ] } ] }

For an example demonstrating transliteration, consider the title and author of the book Η Γενεαλογία των Κομνηνών:

{ "http://terms.fhiso.org/sources/title": 
    [ { "value": "Η Γενεαλογία των Κομνηνών" },
      { "value": "Hē Genealogia tōn Komnēnōn", "lang": "el-Latn" } ],
  "http://terms.fhiso.org/sources/creators-name":
    [ { "list": [ "Βαρζος, Κωνσταντίνος" ] },
      { "list": [ "Varzos, Konstantinos" ], "lang": "el-Latn" } ] }

The syntax has deliberately been chosen to be compatible with [JSON-LD]. To parse it as JSON-LD, the following @context must be available to the parser:

"@context": { "value": "@value", "lang": "@language", 
              "list": "@list" }
As this syntax is valid JSON-LD, this defines a conversion to RDF using the algorithm set out in §10 of [JSON-LD API].
The combination of language tags and list-valued citation elements means that the serialisation cannot take advantage of JSON-LD's compact list representation by setting @container to @list. This is the reason why separate value and list keywords are needed in the serialisation proposed here. If a simplification is possible, it would be beneficial.
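
The serialisation rules of this section can be sketched in Python using the standard json module. This is a minimal illustration, not a reference implementation: the functions wrap and serialise are hypothetical, a citation element set is assumed to be supplied as (name, value, language tag) triples, and the set of names declared not to allow language tags is passed in by the caller.

```python
import json
from typing import FrozenSet, Iterable, List, Optional, Tuple, Union

Value = Union[str, List[str]]
Element = Tuple[str, Value, Optional[str]]   # (name, value, language tag or None)


def wrap(value: Value, lang: Optional[str] = None) -> dict:
    """Wrap a value in an object with a value or list member and an optional lang."""
    wrapped: dict = {"list" if isinstance(value, list) else "value": value}
    if lang is not None:
        wrapped["lang"] = lang   # when there is no language tag, lang is simply omitted
    return wrapped


def serialise(elements: Iterable[Element],
              no_language_tags: FrozenSet[str] = frozenset()) -> str:
    """Serialise a citation element set as a JSON object keyed by element name."""
    out: dict = {}
    for name, value, lang in elements:
        if name in no_language_tags:
            out[name] = value    # bare value, as language tags are not allowed
        else:
            out.setdefault(name, []).append(wrap(value, lang))
    return json.dumps(out, ensure_ascii=False, indent=2)


TITLE = "http://terms.fhiso.org/sources/title"
CREATORS = "http://terms.fhiso.org/sources/creators-name"

text = serialise([
    (TITLE, "Η Γενεαλογία των Κομνηνών", None),
    (TITLE, "Hē Genealogia tōn Komnēnōn", "el-Latn"),
    (CREATORS, ["Βαρζος, Κωνσταντίνος"], None),
    (CREATORS, ["Varzos, Konstantinos"], "el-Latn"),
])
```

Parsing text with json.loads yields the same structure as the transliteration example given earlier in this section.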

List-flattening formats

It is anticipated that some adopters may need to serialise citation element sets in a list-flattening format that does not allow a list of one string to be distinguished from a single string.

An application may wish to serialise citation element sets as XML in a format with one element per string value, such as follows:

<elements>
  <element name="http://terms.fhiso.org/sources/title"
           value="[Eirene?], First Wife of Emperor Isaakios II" />
  <element name="http://terms.fhiso.org/sources/creators-name"
           value="Stone, Don C." />
  <element name="http://terms.fhiso.org/sources/creators-name"
           value="Owens, Charles R." />
</elements>
In such a serialisation, there's no way of telling from the data that the creators-name citation element is list-valued. It can be determined empirically when more than one author is given (as in this example), but when only a single creators-name is given, the serialisation gives no indication of whether a creators-name is list-valued or single-valued. This format is therefore list-flattening.

List-flattening formats are permitted by this standard; however, if an application uses such a format, it must ensure that the serialised data includes a citation element declaration for every extension citation element used in the data. For consistency, it is recommended that citation element declarations are also given for citation elements defined by this standard. Citation element declarations may be included in non-list-flattening formats too.

This is essential to ensure that an application can convert data from a list-flattening format to a non-list-flattening format, even if unknown extension citation elements are present. As these must not be discarded, the data cannot be processed without a citation element declaration.
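
The role of citation element declarations when reading a list-flattening format can be sketched as follows, using the illustrative XML format above. Since this draft defines no declaration format, the declarations are assumed to have been reduced already to a mapping from citation element name to a list-valued flag. The function name unflatten is hypothetical, and language tags are ignored for brevity.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Union


def unflatten(xml_text: str,
              list_valued: Dict[str, bool]) -> Dict[str, Union[str, List[str]]]:
    """Rebuild string or list values using the cardinality from the declarations."""
    result: Dict[str, Union[str, List[str]]] = {}
    for el in ET.fromstring(xml_text).iter("element"):
        name, value = el.get("name"), el.get("value")
        if name not in list_valued:
            # Without a declaration the cardinality cannot be known.
            raise KeyError(f"no citation element declaration for {name}")
        if list_valued[name]:
            result.setdefault(name, []).append(value)
        elif name in result:
            raise ValueError(f"single-valued citation element {name} given twice")
        else:
            result[name] = value
    return result
```

With a declaration marking creators-name as list-valued, a document containing a single creators-name element still yields a list of one string rather than a bare string.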

References

Normative references

[ISO 10646]
ISO (International Organization for Standardization). ISO/IEC 10646:2014. Information technology — Universal Coded Character Set (UCS). 2014.
[ISO 639-2]
ISO (International Organization for Standardization). ISO 639-2:1998. Codes for the representation of names of languages — Part 2: Alpha-3 code. 1998. (See http://www.loc.gov/standards/iso639-2/.)
[RFC 2119]
IETF (Internet Engineering Task Force). RFC 2119: Key words for use in RFCs to Indicate Requirement Levels. Scott Bradner, 1997. (See http://tools.ietf.org/html/rfc2119.)
[RFC 3987]
IETF (Internet Engineering Task Force). RFC 3987: Internationalized Resource Identifiers (IRIs). Martin Duerst and Michel Suignard, 2005. (See http://tools.ietf.org/html/rfc3987.)
[RFC 5646]
IETF (Internet Engineering Task Force). RFC 5646: Tags for Identifying Languages. Addison Phillips and Mark Davis, eds., 2009. (See http://tools.ietf.org/html/rfc5646.)
[RFC 7230]
IETF (Internet Engineering Task Force). RFC 7230: Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. Roy Fielding and Julian Reschke, eds., 2014. (See http://tools.ietf.org/html/rfc7230.)
[UAX 15]
The Unicode Consortium. "Unicode Standard Annex 15: Unicode Normalization Forms" in The Unicode Standard, Version 8.0.0. Mark Davis and Ken Whistler, eds., 2015. (See http://unicode.org/reports/tr15/.)
[XML]
W3C (World Wide Web Consortium). Extensible Markup Language (XML) 1.1, 2nd edition. Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, François Yergeau, and John Cowan eds., 2006. W3C Recommendation. (See https://www.w3.org/TR/xml11/.)

Other references

[Chicago]
The Chicago Manual of Style, 16th ed. Chicago: University of Chicago Press, 2010.
[JSON-LD]
W3C (World Wide Web Consortium). JSON-LD 1.0 — A JSON-based Serialization for Linked Data. Manu Sporny, Gregg Kellogg and Markus Lanthaler, eds., 2014. W3C Recommendation. (See https://www.w3.org/TR/json-ld/.)
[JSON-LD API]
W3C (World Wide Web Consortium). JSON-LD 1.0 Processing Algorithms and API. Manu Sporny, Gregg Kellogg and Markus Lanthaler, eds., 2014. W3C Recommendation. (See https://www.w3.org/TR/json-ld-api/.)
[Linked Data]
Heath, Tom and Christian Bizer. Linked Data: Evolving the Web into a Global Data Space, 1st edition. Morgan & Claypool, 2011. (See http://linkeddatabook.com/editions/1.0/.)
[RFC 7159]
IETF (Internet Engineering Task Force). RFC 7159: The JavaScript Object Notation (JSON) Data Interchange Format. Tim Bray, ed., 2014. (See http://tools.ietf.org/html/rfc7159.)
[XML Names]
W3C (World Wide Web Consortium). Namespaces in XML 1.1, 2nd edition. Tim Bray, Dave Hollander, Andrew Layman and Richard Tobin, eds., 2006. W3C Recommendation. (See https://www.w3.org/TR/xml-names11/.)
[XSD Pt2]
W3C (World Wide Web Consortium). W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. W3C Recommendation. (See https://www.w3.org/TR/xmlschema11-2/.)