This is a first public draft of a standard defining a regular expression dialect for use in FHISO standards. This document is not endorsed by the FHISO membership, and may be updated, replaced or obsoleted by other documents at any time.
The public tsc-public@fhiso.org mailing list is the preferred place for comments, discussion and other feedback on this draft.
Latest public version: | https://fhiso.org/TR/patterns |
This version: | https://fhiso.org/TR/patterns-20180316 |
This document defines a “least-common denominator” regular expression dialect. An explicit goal is to have all patterns be trivially modifiable to work with as many mainstream regular expression engines and libraries as possible.
In particular, in interest of compatibility, the pattern in this document
:alphanum:
, \pL
, etc.).(a|an)
and (an|a)
match differently.Where this standard gives a specific technical meaning to a word or phrase, that word or phrase is formatted in bold text in its initial definition, and in italics when used elsewhere. The key words must, must not, required, shall, shall not, should, should not, recommended, not recommended, may and optional in this standard are to be interpreted as described in [RFC 2119].
An application is conformant with this standard if and only if it obeys all the requirements and prohibitions contained in this document, as indicated by use of the words must, must not, required, shall and shall not, and the relevant parts of its normative references. Standards referencing this standard must not loosen any of the requirements and prohibitions made by this standard, nor place additional requirements or prohibitions on the constructs defined herein.
If a conformant application encounters data that does not conform to this standard, it may issue a warning or error message, and may terminate processing of the document or data fragment.
This standard depends on FHISO’s Basic Concepts for Genealogical Standards standard. To be conformant with this standard, an application must also be conformant with [Basic Concepts]. Concepts defined in that standard are used here without further definition.
Indented text in grey or coloured boxes does not form a normative part of this standard, and is labelled as either an example or a note.
The grammar given here uses the form of EBNF notation defined in §6 of [XML], except that no significance is attached to the capitalisation of grammar symbols. Conforming applications must not generate data not conforming to the syntax given here, but non-conforming syntax may be accepted and processed by a conforming application in an implementation-defined manner.
This standard uses prefix notation when discussing specific terms. The following prefix bindings are assumed in this standard:
types |
https://terms.fhiso.org/types/ |
As defined in [Basic Concepts], a pattern is a regular expression intended to provides a constraint on the lexical space of a datatype. This document defines both the types:Pattern
datatype and the semantics of what it means for a string to match a pattern.
A pattern is an element of the types:Pattern
datatype. Every pattern is said to match a (possibly-infinite) set of strings. The set of strings that a particular pattern matches is defined by the contents of the pattern itself, using the rules specified in this document.
This draft is inconsistent in its use of “match” vs “match”. It is not clear which is preferable.
[Basic Concepts] currently refers to “match” as an undefined word, which is convenient because not all uses of “match” in that document refer to matching a pattern (others match EBNF productions, Accept headers, informal requirements, etc.) That flexibility is supported by not formally defining “match”.
Conversely, much of this document presents a formal definition of what it means to match a pattern or its component parts. That formality suggests “match” should be a formally-defined word.
As an alternative, the entire document could be re-written to define the “language of” a pattern as many theoretical computer science text do, which would greatly reduce the occurrences of the word “match”. This draft has taken the position that “language” it too important a word in other genealogical contexts for its formal definition herein to be wise.
Consider the pattern “[ab][cd]?
”. The set of strings that that pattern matches is {“a
”, “ac
”, “ad
”, “b
”, “bc
”, “bd
”}. We say that “[ab][cd]?
” matches the string “a
” and does not match the string “aa
”.
This section presents a set of definitions that fully define both the lexical space of the types:Pattern
datatype and the set of strings matched by a given pattern. It does this by introducing and naming several intermediate concepts. These intermediate concepts are presented for expository purposed only and may be changed or removed from future versions of this standard. In particular, this documents’ presentation of the following key words should not be directly referenced in other documents: regular expression, branch, piece, quantifier, atom, normal character, metacharacter, banned character, escaped character, character class, positive character class, negative character class, character range, wildcard.
The following components define portions of patterns which match strings
A regular expression consists of one or more branches. Between each branch is a single U+007C |
character.
regExp ::= branch ( '|' branch )*
The set of strings matched by a regular expression is the union of the sets of strings matched by its branches.
A branch consists of one or more pieces. The pieces of a branch appear adjacent to one another with no intervening characters.
branch ::= piece piece*
A branch matches a string s if and only if a prefix of s matches the first piece of the branch and the remainder of s matches the remaining pieces.
While the above is definition of branch matching does not appear to be ambiguous, is it formally incomplete as it does not define what it means for a string to “match the remaining pieces”. The following is more rigorous, but also more verbose:
A sequence of n pieces (where n ≥ 2) matches a string s if and only if s can be represented as a concatenation of n strings s1 s2 … sn such that s1 matches the first piece in the sequence, s2 matches the second piece, and so on with sn matching the last piece.
A branch matches a string if either (a) the branch consists of only a single piece and that piece matches the string, or (b) the branch consists of a sequence of pieces and that sequence of pieces matches the string.
Feedback is invited on which version is preferable.
A piece consists of an atom, possibly followed by a quantifier.
piece ::= atom quantifier?
quantifier ::= [?*+] | ( '{' quantity '}' )`
quantity ::= quantRange | quantMin | QuantExact
quantRange ::= QuantExact ',' QuantExact
quantMin ::= QuantExact ','
QuantExact ::= 0 | [1-9] [0-9]*
The set of strings matched by a piece depends on the quantifier used. If S is an atom and L(S) is the set of strings matched by S then
Piece | Set of strings matched |
---|---|
S | L(S) |
S ? |
the empty string, and all strings in L(S) |
S * |
all concatenations of zero or more strings in L(S) |
S + |
all concatenations of one or more strings in L(S) |
S {n,m} |
all concatenations of at least n and no more than m strings in L(S) |
S {n} |
all concatenations of exactly n strings in L(S) |
S {n,} |
all concatenations of at least n strings in L(S) |
When the above table refers to “strings in L(S)”, the strings do not need to be distinct.
a
”, “b
”} then L(S *
) includes an infinite number of strings, including "“,”a
“,”b
“,”aa
“,”ab
“,”ba
“,”bb
“,”aaa
", etc.
a
”, “b
”} then L(S {3}
) contains 8 strings: {“aaa
”, “aab
”, “aba
”, “abb
”, “baa
”, “bab
”, “bba
”, “bbb
”}.
{,n}
, which some regex dialects allow as a shorthand for {0,n}
.
*?
, +?
, etc., which some dialects allow to select between derivations of a particular string.
{02,12}
and other leading zeros in quantities.
An atom is either a normal character, an escaped character, a character class, or a parenthesised regular expression.
atom ::= NormalChar | escapedChar | charClass | '(' regExp ')'
An atom that is a parenthesized regular expression matches the same set of strings as its regular expression (the parentheses do not directly contribute to the match).
An atom that is a normal character or escaped character matches any single-character string containing the character represented by the normal character or escaped character.
An atom that is a character class matches any single-character string containing a character within the character class.
The following components define portions of patterns which represent single characters
A normal character is any character that is not a metacharacter or a banned character.
Each normal character represents itself.
The metacharacters are ‘.
’, ‘\
’, ‘?
’, ‘*
’, ‘+
’, ‘{
’, ‘}
’, ‘(
’, ‘)
’, ‘|
’, ‘[
’, and ‘]
’.
The banned characters are ‘^
’, ‘$
’, ‘&
’, ‘/
’, and the escapable control characters U+0009, U+000A, and U+000D.
}
to appear unescaped, but that must not be done in patterns.
An escaped character is a U+005C \
followed by a single character, which must be a metacharacter, a class metacharacter, a banned character, or one of U+0074 t
, U+006E n
, or U+0072 r
.
An escaped metacharacter, class metacharacter, or banned character represents the metacharacter, a class metacharacter, or banned character itself.
An escaped U+0074 \t
represents the character U+0009 (the horizontal tab).
An escaped U+006E \n
represents the character U+000A (the line fed).
An escaped U+0072 \r
represents the character U+000D (the carriage return).
\f
, \A
, etc). For maximal compatibility, patterns must not escape characters other than those listed above.
Code-point escapes (e.g., \x{2F2E}
for ⼮
) are not provided in this specification because they are not supported in some common regular expression engines such as POSIX and XML. Instead, Unicode should be expressed in the same encoding used by the strings being checked for membership in a regular expression’s language.
If the chosen engine is byte- rather than code-point-oriented, care should be made that (a) quantifiers bind to characters, not bytes; and (b) character class ranges are correctly handled. Binding can be achieved by adding parentheses around each multi-byte character; how to achieve character class ranges is not known in general by the authors of this specification.
A class character is either an escaped character or a single character that is neither a class metacharacter nor a banned character.
The class metacharacters are ‘.
’, ‘\
’, ‘-
’, ‘|
’, ‘[
’, and ‘]
’.
[A-^]
” is not a pattern (as it includes the banned character ‘^
’) even though it is accepted by some regular expression engines.
The following components define sets of individual characters.
A character class is either a positive character class, a negative character class, or a wildcard.
charClass ::= posCharClass | negCharClass | wildcard
A positive character class is a set of one or more character ranges within brackets.
posCharClass ::= '[' charRange+ ']'
A positive character class defines the union of the sets defined by its character ranges.
A character range is either a single class character or two class characters separated by a U+002D -
.
charRange ::= classChar | classChar '-' classChar
If a character range has two class characters the second must not have a numerically lesser code point than the first.
A single-class character character range defines the singleton set containing only the character that its class character represents. A two-character character range defines the set of all characters with code points that are both
A negative character class is a set of one or more character ranges, preceded by U+005E ^
, within brackets.
negCharClass ::= '[^' charRange+ ']'
A negative character class defines the set of all characters that are not within the union of the sets defined by its character ranges.
A wildcard is represented as U+002E .
.
wildcard ::= '.'
The wildcard defines the set of all characters.
.
. When using an engine that does not do so, replace all .
with something else, such as (.|[\r\n])
, (.|\s)
, or [\s\S]
. Which one works depends on the engine in question.
types:Pattern
datatypeFHISO uses the types:Pattern
datatype to represent patterns. It must not be used for pattern-like regular expression variants that do not conform to this standard’s definition of a pattern.
/^.[.]$/
” is a regular expression that matches a two-character string ending with a period. Because this is not a valid pattern, it does not have the types:Pattern
datatype.
The lexical space of this datatype is the space of all strings that match the regExp
production in §2.2.1.1.
([A-Z][a-z]+ )*
” is within the lexical space of this datatype, but “^\x{FFEF}.*$
” is not, despite being a valid regular expression in some engines.
This is a structured non-language-tagged datatype which has the following properties:
Name | https://terms.fhiso.org/types/Pattern |
Type | http://www.w3.org/2000/01/rdf-schema#Datatype |
Pattern | .* |
Supertype | No non-trivial supertypes |
Abstract | false |
Following are some suggestions for making regular expressions in the above dialect work with various regular expression engines.
ECMAScript
variety and regex_match
(not regex_search
). Replace non-escaped .
with (.|\s)
.
ECMAScript
variety. How to ensure full match not known to the author of this document.
^(
…)$
. Replace non-escaped .
with (.|\s)
.
^(?s
…)$
.
RegexOptions.Multiline
option or replace non-escaped .
with (.|\n)
. Surround expression with ^(
…)$
.
m/^(
…)$/s
.
PCRE_UTF8
option. Surround expression with ^(
…)$
.
PCRE2_UTF
and PCRE2_DOTALL
options. Surround expression with ^(
…)$
.
/^(
…)$/us
with the preg_
… functions.
\
s.
re.DOTALL
option. In Python 3.4 and later, use the fullmatch
function; otherwise use match
and surround the expression with (
…)$
.
/\A(
…)\Z$/m
.
.
with [\s\S]
.
Copyright © 2017–18, Family History Information Standards Organisation, Inc. The text of this standard is available under the Creative Commons Attribution 4.0 International License.