April 2002 Draft
JavaScript 2.0
Formal Description
Regular Expression Semantics
previousupnext

Thursday, February 7, 2002

The regular expression semantics describe the actions the regular expression engine takes in order to transform a regular expression pattern into a function for matching against input strings. For convenience, the regular expression grammar is repeated here. See also the description of the semantic notation.

This document is also available as a Word 98 rtf file.

The regular expression semantics below are working (except for case-insensitive matches) and have been tried on sample cases, but they could be formatted better.

Semantics

tag syntaxError;
SemanticException = {syntaxError};

Unicode Character Classes

Syntax

UnicodeCharacter  Any Unicode character
UnicodeAlphanumeric  Any Unicode alphabetic or decimal digit character (includes ASCII 0-9, A-Z, and a-z)
LineTerminator  «LF» | «CR» | «u2028» | «u2029»

Semantics

lineTerminatorsCharacter{} = {‘«LF»’, ‘«CR»’, ‘«u2028»’, ‘«u2029»’};
reWhitespacesCharacter{} = {‘«FF»’, ‘«LF»’, ‘«CR»’, ‘«TAB»’, ‘«VT»’, ‘ ’};
reDigitsCharacter{} = {‘0’ ... ‘9’};
reWordCharactersCharacter{} = {‘0’ ... ‘9’, ‘A’ ... ‘Z’, ‘a’ ... ‘z’, ‘_’};

Regular Expression Definitions

Semantics

tuple REInput
strString,
ignoreCaseBoolean,
multilineBoolean,
spanBoolean
end tuple;

Field str is the input string. ignoreCase, multiline, and span are the corresponding regular expression flags.

tag undefined;
Capture = String  {undefined};
tuple REMatch
endIndexInteger,
capturesCapture[]
end tuple;
tag failure;
REResult = REMatch  {failure};

A REMatch holds an intermediate state during the pattern-matching process. endIndex is the index of the next input character to be matched by the next component in a regular expression pattern. If we are at the end of the pattern, endIndex is one plus the index of the last matched input character. captures is a zero-based array of the strings captured so far by capturing parentheses.

Continuation = REMatch  REResult;

A Continuation is a function that attempts to match the remaining portion of the pattern against the input string, starting at the intermediate state given by its REMatch argument. If a match is possible, it returns a REMatch result that contains the final state; if no match is possible, it returns a failure result.

Matcher = REInput  REMatch  Continuation  REResult;

A Matcher is a function that attempts to match a middle portion of the pattern against the input string, starting at the intermediate state given by its REMatch argument. Since the remainder of the pattern heavily influences whether (and how) a middle portion will match, we must pass in a Continuation function that checks whether the rest of the pattern matched. If the continuation returns failure, the matcher function may call it repeatedly, trying various alternatives at pattern choice points.

The REInput parameter contains the input string and is merely passed down to subroutines.

A Integer Matcher is a function executed at the time the regular expression is compiled that returns a Matcher for a part of the pattern. The Integer parameter contains the number of capturing left parentheses seen so far in the pattern and is used to assign static, consecutive numbers to capturing parentheses.

proc characterSetMatcher(acceptanceSetCharacter{}, invertBoolean): Matcher
proc m(tREInputxREMatchcContinuation): REResult
iInteger  x.endIndex;
sString  t.str;
if i = |sthen return failure
elsif s[i acceptanceSet xor invert then
return c(REMatchendIndexi + 1, capturesx.captures)
else return failure
end if
end proc;
return m
end proc;

characterSetMatcher returns a Matcher that matches a single input string character. If invert is false, the match succeeds if the character is a member of the acceptanceSet set of characters (possibly ignoring case). If invert is true, the match succeeds if the character is not a member of the acceptanceSet set of characters (possibly ignoring case).

proc characterMatcher(chCharacter): Matcher
return characterSetMatcher({ch}, false)
end proc;

characterMatcher returns a Matcher that matches a single input string character. The match succeeds if the character is the same as ch (possibly ignoring case).

Regular Expression Patterns

Syntax

RegularExpressionPattern  Disjunction

Semantics

Execute[RegularExpressionPattern  Disjunction]: REInput  Integer  REResult
begin
m1Matcher  GenMatcher[Disjunction](0);
proc e(tREInputindexInteger): REResult
xREMatch  REMatchendIndexindexcapturesfillCapture(CountParens[Disjunction]);
return m1(txsuccessContinuation)
end proc;
return e
end;
proc successContinuation(xREMatch): REResult
return x
end proc;
proc fillCapture(iInteger): Capture[]
if i = 0 then return [] else return fillCapture(i – 1)  [undefined] end if
end proc;

Disjunctions

Syntax

Disjunction 
   Alternative
|  Alternative | Disjunction

Semantics

proc GenMatcher[Disjunction] (parenIndexInteger): Matcher
[Disjunction  Alternativedo return GenMatcher[Alternative](parenIndex);
[Disjunction0  Alternative | Disjunction1do
m1Matcher  GenMatcher[Alternative](parenIndex);
m2Matcher  GenMatcher[Disjunction1](parenIndex + CountParens[Alternative]);
proc m3(tREInputxREMatchcContinuation): REResult
yREResult  m1(txc);
case y of
REMatch do return y;
{failuredo return m2(txc)
end case
end proc;
return m3
end proc;
CountParens[Disjunction]: Integer;
CountParens[Disjunction  Alternative] = CountParens[Alternative];
CountParens[Disjunction0  Alternative | Disjunction1] = CountParens[Alternative] + CountParens[Disjunction1];

Alternatives

Syntax

Alternative 
   «empty»
|  Alternative Term

Semantics

proc GenMatcher[Alternative] (parenIndexInteger): Matcher
[Alternative  «empty»] do
proc m(tREInputxREMatchcContinuation): REResult
return c(x)
end proc;
return m;
[Alternative0  Alternative1 Termdo
m1Matcher  GenMatcher[Alternative1](parenIndex);
m2Matcher  GenMatcher[Term](parenIndex + CountParens[Alternative1]);
proc m3(tREInputxREMatchcContinuation): REResult
proc d(yREMatch): REResult
return m2(tyc)
end proc;
return m1(txd)
end proc;
return m3
end proc;
CountParens[Alternative]: Integer;
CountParens[Alternative  «empty»] = 0;
CountParens[Alternative0  Alternative1 Term] = CountParens[Alternative1] + CountParens[Term];

Terms

Syntax

Term 
   Assertion
|  Atom
|  Atom Quantifier

Semantics

proc GenMatcher[Term] (parenIndexInteger): Matcher
[Term  Assertiondo
proc m(tREInputxREMatchcContinuation): REResult
if TestAssertion[Assertion](txthen return c(xelse return failure end if
end proc;
return m;
[Term  Atomdo return GenMatcher[Atom](parenIndex);
[Term  Atom Quantifierdo
mMatcher  GenMatcher[Atom](parenIndex);
minInteger  Minimum[Quantifier];
maxLimit  Maximum[Quantifier];
greedyBoolean  Greedy[Quantifier];
if max  + then if max < min then throw syntaxError end if end if;
return repeatMatcher(mminmaxgreedyparenIndexCountParens[Atom])
end proc;
CountParens[Term]: Integer;
CountParens[Term  Assertion] = 0;
CountParens[Term  Atom] = CountParens[Atom];
CountParens[Term  Atom Quantifier] = CountParens[Atom];

Syntax

Quantifier 
   QuantifierPrefix
|  QuantifierPrefix ?
QuantifierPrefix 
   *
|  +
|  ?
|  { DecimalDigits }
|  { DecimalDigits , }
|  { DecimalDigits , DecimalDigits }
DecimalDigits 
   DecimalDigit
|  DecimalDigits DecimalDigit
DecimalDigit  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Semantics

Limit = Integer  {+};
proc resetParens(xREMatchpIntegernParensInteger): REMatch
capturesCapture[]  x.captures;
iInteger  p;
while i < p + nParens do captures  captures[i \ undefined]; i  i + 1 end while;
return REMatchendIndexx.endIndexcapturescaptures
end proc;
proc repeatMatcher(bodyMatcher, minInteger, maxLimit, greedyBoolean, parenIndexInteger, nBodyParensInteger): Matcher
proc m(tREInputxREMatchcContinuation): REResult
if max = 0 then return c(xend if;
proc d(yREMatch): REResult
if min = 0 and y.endIndex = x.endIndex then return failure end if;
newMinInteger  min;
if min  0 then newMin  min – 1 end if;
newMaxLimit  max;
if max  + then newMax  max – 1 end if;
m2Matcher  repeatMatcher(bodynewMinnewMaxgreedyparenIndexnBodyParens);
return m2(tyc)
end proc;
xrREMatch  resetParens(xparenIndexnBodyParens);
if min  0 then return body(txrd)
elsif greedy then
zREResult  body(txrd);
case z of
REMatch do return z;
{failuredo return c(x)
end case
else
zREResult  c(x);
case z of
REMatch do return z;
{failuredo return body(txrd)
end case
end if
end proc;
return m
end proc;
Minimum[Quantifier]: Integer;
Minimum[Quantifier  QuantifierPrefix] = Minimum[QuantifierPrefix];
Minimum[Quantifier  QuantifierPrefix ?] = Minimum[QuantifierPrefix];
Maximum[Quantifier]: Limit;
Maximum[Quantifier  QuantifierPrefix] = Maximum[QuantifierPrefix];
Maximum[Quantifier  QuantifierPrefix ?] = Maximum[QuantifierPrefix];
Greedy[Quantifier]: Boolean;
Greedy[Quantifier  QuantifierPrefix] = true;
Greedy[Quantifier  QuantifierPrefix ?] = false;
Minimum[QuantifierPrefix]: Integer;
Minimum[QuantifierPrefix  *] = 0;
Minimum[QuantifierPrefix  +] = 1;
Minimum[QuantifierPrefix  ?] = 0;
Minimum[QuantifierPrefix  { DecimalDigits }] = IntegerValue[DecimalDigits];
Minimum[QuantifierPrefix  { DecimalDigits , }] = IntegerValue[DecimalDigits];
Minimum[QuantifierPrefix  { DecimalDigits1 , DecimalDigits2 }] = IntegerValue[DecimalDigits1];
Maximum[QuantifierPrefix]: Limit;
Maximum[QuantifierPrefix  *] = +;
Maximum[QuantifierPrefix  +] = +;
Maximum[QuantifierPrefix  ?] = 1;
Maximum[QuantifierPrefix  { DecimalDigits }] = IntegerValue[DecimalDigits];
Maximum[QuantifierPrefix  { DecimalDigits , }] = +;
Maximum[QuantifierPrefix  { DecimalDigits1 , DecimalDigits2 }] = IntegerValue[DecimalDigits2];
IntegerValue[DecimalDigits]: Integer;
IntegerValue[DecimalDigits  DecimalDigit] = DecimalValue[DecimalDigit];
IntegerValue[DecimalDigits0  DecimalDigits1 DecimalDigit] = 10IntegerValue[DecimalDigits1] + DecimalValue[DecimalDigit];
DecimalValue[DecimalDigit]: Integer = digitValue(DecimalDigit);

Assertions

Syntax

Assertion 
   ^
|  $
|  \ b
|  \ B

Semantics

proc TestAssertion[Assertion] (tREInputxREMatch): Boolean
[Assertion  ^do
return x.endIndex = 0 or (t.multiline and t.str[x.endIndex – 1]  lineTerminators);
[Assertion  $do
return x.endIndex = |t.stror (t.multiline and t.str[x.endIndex lineTerminators);
[Assertion  \ bdo return atWordBoundary(x.endIndext.str);
[Assertion  \ Bdo return not atWordBoundary(x.endIndext.str)
end proc;
proc atWordBoundary(iIntegersString): Boolean
return inWord(i – 1, sxor inWord(is)
end proc;
proc inWord(iIntegersString): Boolean
if i = –1 or i = |sthen return false else return s[i reWordCharacters end if
end proc;

Atoms

Syntax

Atom 
   PatternCharacter
|  .
|  NullEscape
|  \ AtomEscape
|  CharacterClass
|  ( Disjunction )
|  ( ? : Disjunction )
|  ( ? = Disjunction )
|  ( ? ! Disjunction )
PatternCharacter  UnicodeCharacter except ^ | $ | \ | . | * | + | ? | ( | ) | [ | ] | { | } | |

Semantics

proc GenMatcher[Atom] (parenIndexInteger): Matcher
[Atom  PatternCharacterdo return characterMatcher(PatternCharacter);
[Atom  .do
proc m1(tREInputxREMatchcContinuation): REResult
aCharacter{}  t.span ? {} : lineTerminators;
m2Matcher  characterSetMatcher(atrue);
return m2(txc)
end proc;
return m1;
[Atom  NullEscapedo
proc m(tREInputxREMatchcContinuation): REResult
return c(x)
end proc;
return m;
[Atom  \ AtomEscapedo return GenMatcher[AtomEscape](parenIndex);
[Atom  CharacterClassdo
aCharacter{}  AcceptanceSet[CharacterClass];
return characterSetMatcher(aInvert[CharacterClass]);
[Atom  ( Disjunction )do
m1Matcher  GenMatcher[Disjunction](parenIndex + 1);
proc m2(tREInputxREMatchcContinuation): REResult
proc d(yREMatch): REResult
refCapture  t.str[x.endIndex ... y.endIndex – 1];
updatedCapturesCapture[]  y.captures[parenIndex \ ref];
return c(REMatchendIndexy.endIndexcapturesupdatedCaptures)
end proc;
return m1(txd)
end proc;
return m2;
[Atom  ( ? : Disjunction )do return GenMatcher[Disjunction](parenIndex);
[Atom  ( ? = Disjunction )do
m1Matcher  GenMatcher[Disjunction](parenIndex);
proc m2(tREInputxREMatchcContinuation): REResult
yREResult  m1(txsuccessContinuation);
case y of
REMatch do return c(REMatchendIndexx.endIndexcapturesy.captures);
{failuredo return failure
end case
end proc;
return m2;
[Atom  ( ? ! Disjunction )do
m1Matcher  GenMatcher[Disjunction](parenIndex);
proc m2(tREInputxREMatchcContinuation): REResult
case m1(txsuccessContinuationof
REMatch do return failure;
{failuredo return c(x)
end case
end proc;
return m2
end proc;
CountParens[Atom]: Integer;
CountParens[Atom  PatternCharacter] = 0;
CountParens[Atom  .] = 0;
CountParens[Atom  NullEscape] = 0;
CountParens[Atom  \ AtomEscape] = 0;
CountParens[Atom  CharacterClass] = 0;
CountParens[Atom  ( Disjunction )] = CountParens[Disjunction] + 1;
CountParens[Atom  ( ? : Disjunction )] = CountParens[Disjunction];
CountParens[Atom  ( ? = Disjunction )] = CountParens[Disjunction];
CountParens[Atom  ( ? ! Disjunction )] = CountParens[Disjunction];

Escapes

Syntax

NullEscape  \ _
AtomEscape 
   DecimalEscape
|  CharacterEscape
|  CharacterClassEscape

Semantics

proc GenMatcher[AtomEscape] (parenIndexInteger): Matcher
[AtomEscape  DecimalEscapedo
nInteger  EscapeValue[DecimalEscape];
if n = 0 then return characterMatcher(‘«NUL»’)
elsif n > parenIndex then throw syntaxError
else return backreferenceMatcher(n)
end if;
[AtomEscape  CharacterEscapedo
return characterMatcher(CharacterValue[CharacterEscape]);
[AtomEscape  CharacterClassEscapedo
return characterSetMatcher(AcceptanceSet[CharacterClassEscape], false)
end proc;
proc backreferenceMatcher(nInteger): Matcher
proc m(tREInputxREMatchcContinuation): REResult
refCapture  nthBackreference(xn);
case ref of
String do
iInteger  x.endIndex;
sString  t.str;
jInteger  i + |ref|;
if j  |sand s[i ... j – 1] = ref then
return c(REMatchendIndexjcapturesx.captures)
else return failure
end if;
{undefineddo return c(x)
end case
end proc;
return m
end proc;
proc nthBackreference(xREMatchnInteger): Capture
return x.captures[n – 1]
end proc;

Syntax

CharacterEscape 
   ControlEscape
|  c ControlLetter
|  HexEscape
|  IdentityEscape
ControlLetter 
   A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
|  a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
IdentityEscape  UnicodeCharacter except _ | UnicodeAlphanumeric
ControlEscape 
   f
|  n
|  r
|  t
|  v

Semantics

CharacterValue[CharacterEscape]: Character;
CharacterValue[CharacterEscape  ControlEscape] = CharacterValue[ControlEscape];
CharacterValue[CharacterEscape  c ControlLetter] = codeToCharacter(bitwiseAnd(characterToCode(ControlLetter), 31));
CharacterValue[CharacterEscape  HexEscape] = CharacterValue[HexEscape];
CharacterValue[CharacterEscape  IdentityEscape] = IdentityEscape;
CharacterValue[ControlEscape]: Character;
CharacterValue[ControlEscape  f] = ‘«FF»’;
CharacterValue[ControlEscape  n] = ‘«LF»’;
CharacterValue[ControlEscape  r] = ‘«CR»’;
CharacterValue[ControlEscape  t] = ‘«TAB»’;
CharacterValue[ControlEscape  v] = ‘«VT»’;

Decimal Escapes

Syntax

DecimalEscape  DecimalIntegerLiteral [lookahead{DecimalDigit}]
DecimalIntegerLiteral 
   0
|  NonZeroDecimalDigits
NonZeroDecimalDigits 
   NonZeroDigit
|  NonZeroDecimalDigits DecimalDigit
NonZeroDigit  1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Semantics

EscapeValue[DecimalEscape  DecimalIntegerLiteral [lookahead{DecimalDigit}]]: IntegerIntegerValue[DecimalIntegerLiteral];
IntegerValue[DecimalIntegerLiteral]: Integer;
IntegerValue[DecimalIntegerLiteral  0] = 0;
IntegerValue[DecimalIntegerLiteral  NonZeroDecimalDigits] = IntegerValue[NonZeroDecimalDigits];
IntegerValue[NonZeroDecimalDigits]: Integer;
IntegerValue[NonZeroDecimalDigits  NonZeroDigit] = DecimalValue[NonZeroDigit];
IntegerValue[NonZeroDecimalDigits0  NonZeroDecimalDigits1 DecimalDigit] = 10IntegerValue[NonZeroDecimalDigits1] + DecimalValue[DecimalDigit];
DecimalValue[NonZeroDigit]: Integer = digitValue(NonZeroDigit);

Hexadecimal Escapes

Syntax

HexEscape 
   x HexDigit HexDigit
|  u HexDigit HexDigit HexDigit HexDigit
HexDigit  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | a | b | c | d | e | f

Semantics

CharacterValue[HexEscape]: Character;
CharacterValue[HexEscape  x HexDigit1 HexDigit2] = codeToCharacter(16HexValue[HexDigit1] + HexValue[HexDigit2]);
CharacterValue[HexEscape  u HexDigit1 HexDigit2 HexDigit3 HexDigit4] = codeToCharacter(4096HexValue[HexDigit1] + 256HexValue[HexDigit2] + 16HexValue[HexDigit3] + HexValue[HexDigit4]);
HexValue[HexDigit]: Integer = digitValue(HexDigit);

Character Class Escapes

Syntax

CharacterClassEscape 
   s
|  S
|  d
|  D
|  w
|  W

Semantics

AcceptanceSet[CharacterClassEscape]: Character{};
AcceptanceSet[CharacterClassEscape  s] = reWhitespaces;
AcceptanceSet[CharacterClassEscape  S] = {‘«NUL»’ ... ‘«uFFFF»’} – reWhitespaces;
AcceptanceSet[CharacterClassEscape  d] = reDigits;
AcceptanceSet[CharacterClassEscape  D] = {‘«NUL»’ ... ‘«uFFFF»’} – reDigits;
AcceptanceSet[CharacterClassEscape  w] = reWordCharacters;
AcceptanceSet[CharacterClassEscape  W] = {‘«NUL»’ ... ‘«uFFFF»’} – reWordCharacters;

User-Specified Character Classes

Syntax

CharacterClass 
   [ [lookahead{^}] ClassRanges ]
|  [ ^ ClassRanges ]
ClassRanges 
   «empty»
|  NonemptyClassRangesdash
  {dashnoDash}
NonemptyClassRanges 
   ClassAtomdash
|  ClassAtom NonemptyClassRangesnoDash
|  ClassAtom - ClassAtomdash ClassRanges
|  NullEscape ClassRanges

Semantics

AcceptanceSet[CharacterClass]: Character{};
AcceptanceSet[CharacterClass  [ [lookahead{^}] ClassRanges ]] = AcceptanceSet[ClassRanges];
AcceptanceSet[CharacterClass  [ ^ ClassRanges ]] = AcceptanceSet[ClassRanges];
Invert[CharacterClass]: Boolean;
Invert[CharacterClass  [ [lookahead{^}] ClassRanges ]] = false;
Invert[CharacterClass  [ ^ ClassRanges ]] = true;
AcceptanceSet[ClassRanges]: Character{};
AcceptanceSet[ClassRanges  «empty»] = {};
AcceptanceSet[ClassRanges  NonemptyClassRangesdash] = AcceptanceSet[NonemptyClassRangesdash];
AcceptanceSet[NonemptyClassRanges]: Character{};
AcceptanceSet[NonemptyClassRanges  ClassAtomdash] = AcceptanceSet[ClassAtomdash];
AcceptanceSet[NonemptyClassRanges0  ClassAtom NonemptyClassRangesnoDash1] = AcceptanceSet[ClassAtom AcceptanceSet[NonemptyClassRangesnoDash1];
AcceptanceSet[NonemptyClassRanges  ClassAtom1 - ClassAtomdash2 ClassRanges] = characterRange(AcceptanceSet[ClassAtom1], AcceptanceSet[ClassAtomdash2])  AcceptanceSet[ClassRanges];
AcceptanceSet[NonemptyClassRanges  NullEscape ClassRanges] = AcceptanceSet[ClassRanges];
proc characterRange(lowCharacter{}, highCharacter{}): Character{}
if |low 1 or |high 1 then throw syntaxError end if;
lCharacter  min low;
hCharacter  min high;
if l  h then return {l ... helse throw syntaxError end if
end proc;

Character Class Range Atoms

Syntax

ClassAtom 
   ClassCharacter
|  \ ClassEscape
ClassCharacterdash  UnicodeCharacter except \ | ]
ClassCharacternoDash  ClassCharacterdash except -
ClassEscape 
   DecimalEscape
|  b
|  CharacterEscape
|  CharacterClassEscape

Semantics

AcceptanceSet[ClassAtom]: Character{};
AcceptanceSet[ClassAtom  ClassCharacter] = {ClassCharacter};
AcceptanceSet[ClassAtom  \ ClassEscape] = AcceptanceSet[ClassEscape];
AcceptanceSet[ClassEscape]: Character{};
AcceptanceSet[ClassEscape  DecimalEscape]
begin
if EscapeValue[DecimalEscape] = 0 then return {‘«NUL»’}
else throw syntaxError
end if
end;
AcceptanceSet[ClassEscape  b] = {‘«BS»’};
AcceptanceSet[ClassEscape  CharacterEscape] = {CharacterValue[CharacterEscape]};
AcceptanceSet[ClassEscape  CharacterClassEscape] = AcceptanceSet[CharacterClassEscape];

Waldemar Horwat
Last modified Thursday, February 7, 2002
previousupnext