JavaScript 2.0 Regular Expression Semantics

April 2002 Draft

JavaScript 2.0

Formal Description

Regular Expression Semantics

Thursday, February 7, 2002

The regular expression semantics describe the actions the regular expression engine takes in order to transform a regular expression pattern into a function for matching against input strings. For convenience, the regular expression grammar is repeated here. See also the description of the semantic notation.

This document is also available as a Word 98 rtf file.

The regular expression semantics below are working (except for case-insensitive matches) and have been tried on sample cases, but they could be formatted better.

Semantics

tag syntaxError;

SemanticException = {syntaxError};

Unicode Character Classes

Syntax

UnicodeCharacter ⇒ Any Unicode character

UnicodeAlphanumeric ⇒ Any Unicode alphabetic or decimal digit character (includes ASCII 0-9, A-Z, and a-z)

LineTerminator ⇒ «LF» | «CR» | «u2028» | «u2029»

Semantics

lineTerminators: Character{} = {‘«LF»’, ‘«CR»’, ‘«u2028»’, ‘«u2029»’};

reWhitespaces: Character{} = {‘«FF»’, ‘«LF»’, ‘«CR»’, ‘«TAB»’, ‘«VT»’, ‘ ’};

reDigits: Character{} = {‘0’ ... ‘9’};

reWordCharacters: Character{} = {‘0’ ... ‘9’, ‘A’ ... ‘Z’, ‘a’ ... ‘z’, ‘_’};

Regular Expression Definitions

Semantics

tuple REInput

str: String,

ignoreCase: Boolean,

multiline: Boolean,

span: Boolean

end tuple;

Field str is the input string. ignoreCase, multiline, and span are the corresponding regular expression flags.

tag undefined;

Capture = String ∪ {undefined};

tuple REMatch

endIndex: Integer,

captures: Capture[]

end tuple;

tag failure;

REResult = REMatch ∪ {failure};

An REMatch holds an intermediate state during the pattern-matching process. endIndex is the index of the next input character to be matched by the next component in a regular expression pattern. If we are at the end of the pattern, endIndex is one plus the index of the last matched input character. captures is a zero-based array of the strings captured so far by capturing parentheses.

Continuation = REMatch → REResult;

A Continuation is a function that attempts to match the remaining portion of the pattern against the input string, starting at the intermediate state given by its REMatch argument. If a match is possible, it returns an REMatch result that contains the final state; if no match is possible, it returns a failure result.

Matcher = REInput × REMatch × Continuation → REResult;

A Matcher is a function that attempts to match a middle portion of the pattern against the input string, starting at the intermediate state given by its REMatch argument. Since the remainder of the pattern heavily influences whether (and how) a middle portion will match, we must pass in a Continuation function that checks whether the rest of the pattern matched. If the continuation returns failure, the matcher function may call it repeatedly, trying various alternatives at pattern choice points.

The REInput parameter contains the input string and is merely passed down to subroutines.

An Integer → Matcher is a function executed at the time the regular expression is compiled that returns a Matcher for a part of the pattern. The Integer parameter contains the number of capturing left parentheses seen so far in the pattern and is used to assign static, consecutive numbers to capturing parentheses.

proc characterSetMatcher(acceptanceSet: Character{}, invert: Boolean): Matcher

proc m(t: REInput, x: REMatch, c: Continuation): REResult

i: Integer ← x.endIndex;

s: String ← t.str;

if i = |s| then return failure

elsif s[i] ∈ acceptanceSet xor invert then

return c(REMatch⟨endIndex: i + 1, captures: x.captures⟩)

else return failure

end if

end proc;

return m

end proc;

characterSetMatcher returns a Matcher that matches a single input string character. If invert is false, the match succeeds if the character is a member of the acceptanceSet set of characters (possibly ignoring case). If invert is true, the match succeeds if the character is not a member of the acceptanceSet set of characters (possibly ignoring case).
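The continuation-passing shape of characterSetMatcher can be easier to follow in executable form. The following is a minimal Python sketch, not part of the specification. It assumes that None stands in for failure and for undefined, that the input is passed as a plain string instead of an REInput tuple, and that case-insensitive matching is ignored. All Python names are illustrative.

```python
from typing import Callable, FrozenSet, NamedTuple, Optional, Tuple


class REMatch(NamedTuple):
    end_index: int                       # index of the next character to match
    captures: Tuple[Optional[str], ...]  # None stands in for `undefined`


Result = Optional[REMatch]               # None stands in for `failure`
Continuation = Callable[[REMatch], Result]
Matcher = Callable[[str, REMatch, Continuation], Result]


def character_set_matcher(acceptance_set: FrozenSet[str], invert: bool) -> Matcher:
    def m(s: str, x: REMatch, c: Continuation) -> Result:
        i = x.end_index
        if i == len(s):
            return None                          # no input left to consume
        if (s[i] in acceptance_set) != invert:   # boolean xor
            return c(REMatch(i + 1, x.captures))
        return None
    return m


def success_continuation(x: REMatch) -> Result:
    return x


DIGITS = frozenset("0123456789")
digit = character_set_matcher(DIGITS, invert=False)
non_digit = character_set_matcher(DIGITS, invert=True)
start = REMatch(0, ())
```

Calling digit("7a", start, success_continuation) consumes one character and hands the advanced state to the continuation; the inverted matcher accepts exactly the characters the plain one rejects.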

proc characterMatcher(ch: Character): Matcher

return characterSetMatcher({ch}, false)

end proc;

characterMatcher returns a Matcher that matches a single input string character. The match succeeds if the character is the same as ch (possibly ignoring case).

Regular Expression Patterns

Syntax

RegularExpressionPattern ⇒ Disjunction

Semantics

Execute[RegularExpressionPattern ⇒ Disjunction]: REInput × Integer → REResult

begin

m1: Matcher ← GenMatcher[Disjunction](0);

proc e(t: REInput, index: Integer): REResult

x: REMatch ← REMatch⟨endIndex: index, captures: fillCapture(CountParens[Disjunction])⟩;

return m1(t, x, successContinuation)

end proc;

return e

end;

proc successContinuation(x: REMatch): REResult

return x

end proc;

proc fillCapture(i: Integer): Capture[]

if i = 0 then return [] else return fillCapture(i – 1) ⊕ [undefined] end if

end proc;

Disjunctions

Syntax

Disjunction ⇒

Alternative

| Alternative | Disjunction

Semantics

proc GenMatcher[Disjunction] (parenIndex: Integer): Matcher

[Disjunction ⇒ Alternative] do return GenMatcher[Alternative](parenIndex);

[Disjunction₀ ⇒ Alternative | Disjunction₁] do

m1: Matcher ← GenMatcher[Alternative](parenIndex);

m2: Matcher ← GenMatcher[Disjunction₁](parenIndex + CountParens[Alternative]);

proc m3(t: REInput, x: REMatch, c: Continuation): REResult

y: REResult ← m1(t, x, c);

case y of

REMatch do return y;

{failure} do return m2(t, x, c)

end case

end proc;

return m3

end proc;
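Because the left alternative receives the full continuation c, a failure in the rest of the pattern causes the right alternative to be tried. A minimal Python sketch of this behavior follows. It assumes None for failure and a plain string for the input; literal and at_end_of are hypothetical helpers introduced only for the illustration.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class REMatch(NamedTuple):
    end_index: int
    captures: Tuple[Optional[str], ...]


Result = Optional[REMatch]               # None stands in for `failure`
Continuation = Callable[[REMatch], Result]
Matcher = Callable[[str, REMatch, Continuation], Result]


def literal(text: str) -> Matcher:
    """Hypothetical helper: match `text` verbatim at the current position."""
    def m(s: str, x: REMatch, c: Continuation) -> Result:
        if s.startswith(text, x.end_index):
            return c(REMatch(x.end_index + len(text), x.captures))
        return None
    return m


def disjunction(m1: Matcher, m2: Matcher) -> Matcher:
    def m3(s: str, x: REMatch, c: Continuation) -> Result:
        y = m1(s, x, c)          # try the left alternative with the full continuation
        if y is not None:
            return y
        return m2(s, x, c)       # fall back to the right alternative
    return m3


def at_end_of(s: str) -> Continuation:
    """Hypothetical continuation that succeeds only at the end of the input."""
    def c(x: REMatch) -> Result:
        return x if x.end_index == len(s) else None
    return c


a_or_ab = disjunction(literal("a"), literal("ab"))
start = REMatch(0, ())
first_wins = a_or_ab("ab", start, lambda x: x)
backtracked = a_or_ab("ab", start, at_end_of("ab"))
```

With a continuation that always succeeds, the left alternative wins and one character is consumed. With a continuation that demands the end of the input, the left alternative's match is rejected and the matcher backtracks into the right alternative.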

CountParens[Disjunction]: Integer;

CountParens[Disjunction ⇒ Alternative] = CountParens[Alternative];

CountParens[Disjunction₀ ⇒ Alternative | Disjunction₁] = CountParens[Alternative] + CountParens[Disjunction₁];

Alternatives

Syntax

Alternative ⇒

«empty»

| Alternative Term

Semantics

proc GenMatcher[Alternative] (parenIndex: Integer): Matcher

[Alternative ⇒ «empty»] do

proc m(t: REInput, x: REMatch, c: Continuation): REResult

return c(x)

end proc;

return m;

[Alternative₀ ⇒ Alternative₁ Term] do

m1: Matcher ← GenMatcher[Alternative₁](parenIndex);

m2: Matcher ← GenMatcher[Term](parenIndex + CountParens[Alternative₁]);

proc m3(t: REInput, x: REMatch, c: Continuation): REResult

proc d(y: REMatch): REResult

return m2(t, y, c)

end proc;

return m1(t, x, d)

end proc;

return m3

end proc;
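Sequencing works by wrapping the second matcher in a new continuation d and handing it to the first. A minimal Python sketch under the same simplifying assumptions (None for failure, plain string input) follows; char is a hypothetical single-character helper.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class REMatch(NamedTuple):
    end_index: int
    captures: Tuple[Optional[str], ...]


Result = Optional[REMatch]               # None stands in for `failure`
Continuation = Callable[[REMatch], Result]
Matcher = Callable[[str, REMatch, Continuation], Result]


def char(ch: str) -> Matcher:
    """Hypothetical helper: match exactly the character `ch`."""
    def m(s: str, x: REMatch, c: Continuation) -> Result:
        i = x.end_index
        if i < len(s) and s[i] == ch:
            return c(REMatch(i + 1, x.captures))
        return None
    return m


def empty_alternative() -> Matcher:
    # The empty alternative consumes nothing and just continues.
    return lambda s, x, c: c(x)


def sequence(m1: Matcher, m2: Matcher) -> Matcher:
    def m3(s: str, x: REMatch, c: Continuation) -> Result:
        def d(y: REMatch) -> Result:
            return m2(s, y, c)   # after m1 succeeds, run m2, then the rest
        return m1(s, x, d)
    return m3


# The pattern "ab" built left-recursively, as the grammar does:
# Alternative => (Alternative => («empty») a) b
ab = sequence(sequence(empty_alternative(), char("a")), char("b"))
start = REMatch(0, ())
```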

CountParens[Alternative]: Integer;

CountParens[Alternative ⇒ «empty»] = 0;

CountParens[Alternative₀ ⇒ Alternative₁ Term] = CountParens[Alternative₁] + CountParens[Term];

Terms

Syntax

Term ⇒

Assertion

| Atom

| Atom Quantifier

Semantics

proc GenMatcher[Term] (parenIndex: Integer): Matcher

[Term ⇒ Assertion] do

proc m(t: REInput, x: REMatch, c: Continuation): REResult

if TestAssertion[Assertion](t, x) then return c(x) else return failure end if

end proc;

return m;

[Term ⇒ Atom] do return GenMatcher[Atom](parenIndex);

[Term ⇒ Atom Quantifier] do

m: Matcher ← GenMatcher[Atom](parenIndex);

min: Integer ← Minimum[Quantifier];

max: Limit ← Maximum[Quantifier];

greedy: Boolean ← Greedy[Quantifier];

if max ≠ +∞ then if max < min then throw syntaxError end if end if;

return repeatMatcher(m, min, max, greedy, parenIndex, CountParens[Atom])

end proc;

CountParens[Term]: Integer;

CountParens[Term ⇒ Assertion] = 0;

CountParens[Term ⇒ Atom] = CountParens[Atom];

CountParens[Term ⇒ Atom Quantifier] = CountParens[Atom];

Syntax

Quantifier ⇒

QuantifierPrefix

| QuantifierPrefix ?

QuantifierPrefix ⇒

*

| +

| ?

| { DecimalDigits }

| { DecimalDigits , }

| { DecimalDigits , DecimalDigits }

DecimalDigits ⇒

DecimalDigit

| DecimalDigits DecimalDigit

DecimalDigit ⇒ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Semantics

Limit = Integer ∪ {+∞};

proc resetParens(x: REMatch, p: Integer, nParens: Integer): REMatch

captures: Capture[] ← x.captures;

i: Integer ← p;

while i < p + nParens do captures ← captures[i \ undefined]; i ← i + 1 end while;

return REMatch⟨endIndex: x.endIndex, captures: captures⟩

end proc;

proc repeatMatcher(body: Matcher, min: Integer, max: Limit, greedy: Boolean, parenIndex: Integer, nBodyParens: Integer): Matcher

proc m(t: REInput, x: REMatch, c: Continuation): REResult

if max = 0 then return c(x) end if;

proc d(y: REMatch): REResult

if min = 0 and y.endIndex = x.endIndex then return failure end if;

newMin: Integer ← min;

if min ≠ 0 then newMin ← min – 1 end if;

newMax: Limit ← max;

if max ≠ +∞ then newMax ← max – 1 end if;

m2: Matcher ← repeatMatcher(body, newMin, newMax, greedy, parenIndex, nBodyParens);

return m2(t, y, c)

end proc;

xr: REMatch ← resetParens(x, parenIndex, nBodyParens);

if min ≠ 0 then return body(t, xr, d)

elsif greedy then

z: REResult ← body(t, xr, d);

case z of

REMatch do return z;

{failure} do return c(x)

end case

else

z: REResult ← c(x);

case z of

REMatch do return z;

{failure} do return body(t, xr, d)

end case

end if

end proc;

return m

end proc;
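The interplay of min, max, and greedy in repeatMatcher is easiest to see by running it. The following Python sketch mirrors the procedure above under stated assumptions: None stands in for failure and undefined, math.inf stands in for +∞, the input is a plain string, and char is a hypothetical single-character helper.

```python
import math
from typing import Callable, NamedTuple, Optional, Tuple


class REMatch(NamedTuple):
    end_index: int
    captures: Tuple[Optional[str], ...]


Result = Optional[REMatch]               # None stands in for `failure`
Continuation = Callable[[REMatch], Result]
Matcher = Callable[[str, REMatch, Continuation], Result]


def char(ch: str) -> Matcher:
    """Hypothetical helper: match exactly the character `ch`."""
    def m(s, x, c):
        i = x.end_index
        return c(REMatch(i + 1, x.captures)) if i < len(s) and s[i] == ch else None
    return m


def reset_parens(x: REMatch, p: int, n_parens: int) -> REMatch:
    caps = list(x.captures)
    for i in range(p, p + n_parens):
        caps[i] = None                   # captures inside the body start over
    return REMatch(x.end_index, tuple(caps))


def repeat_matcher(body: Matcher, lo: int, hi: float, greedy: bool,
                   paren_index: int, n_body_parens: int) -> Matcher:
    def m(s: str, x: REMatch, c: Continuation) -> Result:
        if hi == 0:
            return c(x)

        def d(y: REMatch) -> Result:
            # An optional iteration that consumed nothing would loop forever.
            if lo == 0 and y.end_index == x.end_index:
                return None
            new_lo = lo - 1 if lo != 0 else lo
            new_hi = hi - 1 if hi != math.inf else hi
            m2 = repeat_matcher(body, new_lo, new_hi, greedy,
                                paren_index, n_body_parens)
            return m2(s, y, c)

        xr = reset_parens(x, paren_index, n_body_parens)
        if lo != 0:
            return body(s, xr, d)        # mandatory iteration
        if greedy:
            z = body(s, xr, d)           # prefer one more iteration
            return z if z is not None else c(x)
        z = c(x)                         # prefer stopping here
        return z if z is not None else body(s, xr, d)
    return m


def success(x: REMatch) -> Result:
    return x


start = REMatch(0, ())
a = char("a")
star_greedy = repeat_matcher(a, 0, math.inf, True, 0, 0)    # a*
star_lazy = repeat_matcher(a, 0, math.inf, False, 0, 0)     # a*?
two_or_more = repeat_matcher(a, 2, math.inf, True, 0, 0)    # a{2,}
one_to_two = repeat_matcher(a, 1, 2, True, 0, 0)            # a{1,2}
```

Against the input "aaab" the greedy star consumes three characters while the lazy star consumes none, because the success continuation accepts the first state it is offered.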

Minimum[Quantifier]: Integer;

Minimum[Quantifier ⇒ QuantifierPrefix] = Minimum[QuantifierPrefix];

Minimum[Quantifier ⇒ QuantifierPrefix ?] = Minimum[QuantifierPrefix];

Maximum[Quantifier]: Limit;

Maximum[Quantifier ⇒ QuantifierPrefix] = Maximum[QuantifierPrefix];

Maximum[Quantifier ⇒ QuantifierPrefix ?] = Maximum[QuantifierPrefix];

Greedy[Quantifier]: Boolean;

Greedy[Quantifier ⇒ QuantifierPrefix] = true;

Greedy[Quantifier ⇒ QuantifierPrefix ?] = false;

Minimum[QuantifierPrefix]: Integer;

Minimum[QuantifierPrefix ⇒ *] = 0;

Minimum[QuantifierPrefix ⇒ +] = 1;

Minimum[QuantifierPrefix ⇒ ?] = 0;

Minimum[QuantifierPrefix ⇒ { DecimalDigits }] = IntegerValue[DecimalDigits];

Minimum[QuantifierPrefix ⇒ { DecimalDigits , }] = IntegerValue[DecimalDigits];

Minimum[QuantifierPrefix ⇒ { DecimalDigits₁ , DecimalDigits₂ }] = IntegerValue[DecimalDigits₁];

Maximum[QuantifierPrefix]: Limit;

Maximum[QuantifierPrefix ⇒ *] = +∞;

Maximum[QuantifierPrefix ⇒ +] = +∞;

Maximum[QuantifierPrefix ⇒ ?] = 1;

Maximum[QuantifierPrefix ⇒ { DecimalDigits }] = IntegerValue[DecimalDigits];

Maximum[QuantifierPrefix ⇒ { DecimalDigits , }] = +∞;

Maximum[QuantifierPrefix ⇒ { DecimalDigits₁ , DecimalDigits₂ }] = IntegerValue[DecimalDigits₂];

IntegerValue[DecimalDigits]: Integer;

IntegerValue[DecimalDigits ⇒ DecimalDigit] = DecimalValue[DecimalDigit];

IntegerValue[DecimalDigits₀ ⇒ DecimalDigits₁ DecimalDigit] = 10×IntegerValue[DecimalDigits₁] + DecimalValue[DecimalDigit];

DecimalValue[DecimalDigit]: Integer = digitValue(DecimalDigit);
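The IntegerValue rules above amount to the usual left-recursive positional evaluation of a digit string. A small Python sketch, with illustrative function names and the digit sequence assumed to arrive as a non-empty string of ASCII digits:

```python
def digit_value(ch: str) -> int:
    # Position of the digit within "0123456789"; raises ValueError otherwise.
    return "0123456789".index(ch)


def integer_value(digits: str) -> int:
    if len(digits) == 1:
        # DecimalDigits => DecimalDigit
        return digit_value(digits)
    # DecimalDigits => DecimalDigits DecimalDigit
    return 10 * integer_value(digits[:-1]) + digit_value(digits[-1])
```

For instance, integer_value("307") expands to 10 × (10 × 3 + 0) + 7.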

Assertions

Syntax

Assertion ⇒

^

| $

| \ b

| \ B

Semantics

proc TestAssertion[Assertion] (t: REInput, x: REMatch): Boolean

[Assertion ⇒ ^] do

return x.endIndex = 0 or (t.multiline and t.str[x.endIndex – 1] ∈ lineTerminators);

[Assertion ⇒ $] do

return x.endIndex = |t.str| or (t.multiline and t.str[x.endIndex] ∈ lineTerminators);

[Assertion ⇒ \ b] do return atWordBoundary(x.endIndex, t.str);

[Assertion ⇒ \ B] do return not atWordBoundary(x.endIndex, t.str)

end proc;

proc atWordBoundary(i: Integer, s: String): Boolean

return inWord(i – 1, s) xor inWord(i, s)

end proc;

proc inWord(i: Integer, s: String): Boolean

if i = –1 or i = |s| then return false else return s[i] ∈ reWordCharacters end if

end proc;
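The assertions never consume input; they only inspect the characters on either side of the current index. A Python sketch of the four tests follows, with illustrative names; the character sets are transcribed from reWordCharacters and lineTerminators above.

```python
import string

WORD_CHARACTERS = frozenset(string.ascii_letters + string.digits + "_")
LINE_TERMINATORS = frozenset("\n\r\u2028\u2029")


def in_word(i: int, s: str) -> bool:
    # Positions just outside the string count as non-word.
    if i == -1 or i == len(s):
        return False
    return s[i] in WORD_CHARACTERS


def at_word_boundary(i: int, s: str) -> bool:
    # A boundary lies between a word and a non-word position (xor).
    return in_word(i - 1, s) != in_word(i, s)


def at_line_start(i: int, s: str, multiline: bool) -> bool:      # ^
    return i == 0 or (multiline and s[i - 1] in LINE_TERMINATORS)


def at_line_end(i: int, s: str, multiline: bool) -> bool:        # $
    return i == len(s) or (multiline and s[i] in LINE_TERMINATORS)
```

Note that an empty string has no word boundary at index 0, since both neighboring positions are treated as non-word.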

Atoms

Syntax

Atom ⇒

PatternCharacter

| .

| NullEscape

| \ AtomEscape

| CharacterClass

| ( Disjunction )

| ( ? : Disjunction )

| ( ? = Disjunction )

| ( ? ! Disjunction )

PatternCharacter ⇒ UnicodeCharacter except ^ | $ | \ | . | * | + | ? | ( | ) | [ | ] | { | } | |

Semantics

proc GenMatcher[Atom] (parenIndex: Integer): Matcher

[Atom ⇒ PatternCharacter] do return characterMatcher(PatternCharacter);

[Atom ⇒ .] do

proc m1(t: REInput, x: REMatch, c: Continuation): REResult

a: Character{} ← t.span ? {} : lineTerminators;

m2: Matcher ← characterSetMatcher(a, true);

return m2(t, x, c)

end proc;

return m1;

[Atom ⇒ NullEscape] do

proc m(t: REInput, x: REMatch, c: Continuation): REResult

return c(x)

end proc;

return m;

[Atom ⇒ \ AtomEscape] do return GenMatcher[AtomEscape](parenIndex);

[Atom ⇒ CharacterClass] do

a: Character{} ← AcceptanceSet[CharacterClass];

return characterSetMatcher(a, Invert[CharacterClass]);

[Atom ⇒ ( Disjunction )] do

m1: Matcher ← GenMatcher[Disjunction](parenIndex + 1);

proc m2(t: REInput, x: REMatch, c: Continuation): REResult

proc d(y: REMatch): REResult

ref: Capture ← t.str[x.endIndex ... y.endIndex – 1];

updatedCaptures: Capture[] ← y.captures[parenIndex \ ref];

return c(REMatch⟨endIndex: y.endIndex, captures: updatedCaptures⟩)

end proc;

return m1(t, x, d)

end proc;

return m2;

[Atom ⇒ ( ? : Disjunction )] do return GenMatcher[Disjunction](parenIndex);

[Atom ⇒ ( ? = Disjunction )] do

m1: Matcher ← GenMatcher[Disjunction](parenIndex);

proc m2(t: REInput, x: REMatch, c: Continuation): REResult

y: REResult ← m1(t, x, successContinuation);

case y of

REMatch do return c(REMatch⟨endIndex: x.endIndex, captures: y.captures⟩);

{failure} do return failure

end case

end proc;

return m2;

[Atom ⇒ ( ? ! Disjunction )] do

m1: Matcher ← GenMatcher[Disjunction](parenIndex);

proc m2(t: REInput, x: REMatch, c: Continuation): REResult

case m1(t, x, successContinuation) of

REMatch do return failure;

{failure} do return c(x)

end case

end proc;

return m2

end proc;
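Three of the Atom cases differ mainly in what they do with the state returned by the inner matcher: a capturing group records the matched substring, a positive lookahead keeps the captures but restores the position, and a negative lookahead discards the inner result entirely. The Python sketch below illustrates this under the usual assumptions (None for failure and undefined, plain string input); char is a hypothetical helper, and the numbering of nested groups via parenIndex + 1 is omitted.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class REMatch(NamedTuple):
    end_index: int
    captures: Tuple[Optional[str], ...]


Result = Optional[REMatch]               # None stands in for `failure`
Continuation = Callable[[REMatch], Result]
Matcher = Callable[[str, REMatch, Continuation], Result]


def success(x: REMatch) -> Result:
    return x


def char(ch: str) -> Matcher:
    """Hypothetical helper: match exactly the character `ch`."""
    def m(s, x, c):
        i = x.end_index
        return c(REMatch(i + 1, x.captures)) if i < len(s) and s[i] == ch else None
    return m


def capture_group(m1: Matcher, paren_index: int) -> Matcher:
    def m2(s: str, x: REMatch, c: Continuation) -> Result:
        def d(y: REMatch) -> Result:
            caps = list(y.captures)
            caps[paren_index] = s[x.end_index:y.end_index]   # text the group matched
            return c(REMatch(y.end_index, tuple(caps)))
        return m1(s, x, d)
    return m2


def positive_lookahead(m1: Matcher) -> Matcher:
    def m2(s: str, x: REMatch, c: Continuation) -> Result:
        y = m1(s, x, success)
        if y is None:
            return None
        # Keep captures made inside the lookahead, but do not advance.
        return c(REMatch(x.end_index, y.captures))
    return m2


def negative_lookahead(m1: Matcher) -> Matcher:
    def m2(s: str, x: REMatch, c: Continuation) -> Result:
        if m1(s, x, success) is not None:
            return None
        return c(x)                      # captures and position unchanged
    return m2


start = REMatch(0, (None,))
group = capture_group(char("a"), 0)      # (a)
```

Because the lookaheads run the inner matcher with the success continuation rather than c, the remainder of the pattern cannot backtrack into a lookahead once it has been decided.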

CountParens[Atom]: Integer;

CountParens[Atom ⇒ PatternCharacter] = 0;

CountParens[Atom ⇒ .] = 0;

CountParens[Atom ⇒ NullEscape] = 0;

CountParens[Atom ⇒ \ AtomEscape] = 0;

CountParens[Atom ⇒ CharacterClass] = 0;

CountParens[Atom ⇒ ( Disjunction )] = CountParens[Disjunction] + 1;

CountParens[Atom ⇒ ( ? : Disjunction )] = CountParens[Disjunction];

CountParens[Atom ⇒ ( ? = Disjunction )] = CountParens[Disjunction];

CountParens[Atom ⇒ ( ? ! Disjunction )] = CountParens[Disjunction];

Escapes

Syntax

NullEscape ⇒ \ _

AtomEscape ⇒

DecimalEscape

| CharacterEscape

| CharacterClassEscape

Semantics

proc GenMatcher[AtomEscape] (parenIndex: Integer): Matcher

[AtomEscape ⇒ DecimalEscape] do

n: Integer ← EscapeValue[DecimalEscape];

if n = 0 then return characterMatcher(‘«NUL»’)

elsif n > parenIndex then throw syntaxError

else return backreferenceMatcher(n)

end if;

[AtomEscape ⇒ CharacterEscape] do

return characterMatcher(CharacterValue[CharacterEscape]);

[AtomEscape ⇒ CharacterClassEscape] do

return characterSetMatcher(AcceptanceSet[CharacterClassEscape], false)

end proc;

proc backreferenceMatcher(n: Integer): Matcher

proc m(t: REInput, x: REMatch, c: Continuation): REResult

ref: Capture ← nthBackreference(x, n);

case ref of

String do

i: Integer ← x.endIndex;

s: String ← t.str;

j: Integer ← i + |ref|;

if j ≤ |s| and s[i ... j – 1] = ref then

return c(REMatch⟨endIndex: j, captures: x.captures⟩)

else return failure

end if;

{undefined} do return c(x)

end case

end proc;

return m

end proc;

proc nthBackreference(x: REMatch, n: Integer): Capture

return x.captures[n – 1]

end proc;
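A backreference compares the input at the current position with the text of an earlier capture, and a capture that is still undefined matches the empty string. A minimal Python sketch under the usual assumptions (None for failure and undefined, plain string input, case-insensitive matching ignored):

```python
from typing import Callable, NamedTuple, Optional, Tuple


class REMatch(NamedTuple):
    end_index: int
    captures: Tuple[Optional[str], ...]


Result = Optional[REMatch]               # None stands in for `failure`
Continuation = Callable[[REMatch], Result]
Matcher = Callable[[str, REMatch, Continuation], Result]


def success(x: REMatch) -> Result:
    return x


def backreference_matcher(n: int) -> Matcher:
    def m(s: str, x: REMatch, c: Continuation) -> Result:
        ref = x.captures[n - 1]          # \1 refers to captures[0]
        if ref is None:
            return c(x)                  # undefined capture matches nothing, successfully
        i = x.end_index
        j = i + len(ref)
        if j <= len(s) and s[i:j] == ref:
            return c(REMatch(j, x.captures))
        return None
    return m


backref1 = backreference_matcher(1)
after_group = REMatch(2, ("ab",))        # state after (ab) has matched "ab"
```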

Syntax

CharacterEscape ⇒

ControlEscape

| c ControlLetter

| HexEscape

| IdentityEscape

ControlLetter ⇒

A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z

| a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z

IdentityEscape ⇒ UnicodeCharacter except _ | UnicodeAlphanumeric

ControlEscape ⇒

f

| n

| r

| t

| v

Semantics

CharacterValue[CharacterEscape]: Character;

CharacterValue[CharacterEscape ⇒ ControlEscape] = CharacterValue[ControlEscape];

CharacterValue[CharacterEscape ⇒ c ControlLetter] = codeToCharacter(bitwiseAnd(characterToCode(ControlLetter), 31));

CharacterValue[CharacterEscape ⇒ HexEscape] = CharacterValue[HexEscape];

CharacterValue[CharacterEscape ⇒ IdentityEscape] = IdentityEscape;

CharacterValue[ControlEscape]: Character;

CharacterValue[ControlEscape ⇒ f] = ‘«FF»’;

CharacterValue[ControlEscape ⇒ n] = ‘«LF»’;

CharacterValue[ControlEscape ⇒ r] = ‘«CR»’;

CharacterValue[ControlEscape ⇒ t] = ‘«TAB»’;

CharacterValue[ControlEscape ⇒ v] = ‘«VT»’;
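Masking a control letter's code with 31 keeps its low five bits, so upper- and lowercase letters map to the same control character. A short Python sketch of the \c ControlLetter rule and the ControlEscape table, with illustrative names:

```python
# ControlEscape letters and the characters they denote.
CONTROL_ESCAPES = {"f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v"}


def control_letter_value(letter: str) -> str:
    # codeToCharacter(bitwiseAnd(characterToCode(letter), 31))
    return chr(ord(letter) & 31)
```

For example, both \cJ and \cj denote a line feed, since the codes 74 and 106 share the low five bits 01010.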

Decimal Escapes

Syntax

DecimalEscape ⇒ DecimalIntegerLiteral [lookahead∉{DecimalDigit}]

DecimalIntegerLiteral ⇒

0

| NonZeroDecimalDigits

NonZeroDecimalDigits ⇒

NonZeroDigit

| NonZeroDecimalDigits DecimalDigit

NonZeroDigit ⇒ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Semantics

EscapeValue[DecimalEscape ⇒ DecimalIntegerLiteral [lookahead∉{DecimalDigit}]]: Integer = IntegerValue[DecimalIntegerLiteral];

IntegerValue[DecimalIntegerLiteral]: Integer;

IntegerValue[DecimalIntegerLiteral ⇒ 0] = 0;

IntegerValue[DecimalIntegerLiteral ⇒ NonZeroDecimalDigits] = IntegerValue[NonZeroDecimalDigits];

IntegerValue[NonZeroDecimalDigits]: Integer;

IntegerValue[NonZeroDecimalDigits ⇒ NonZeroDigit] = DecimalValue[NonZeroDigit];

IntegerValue[NonZeroDecimalDigits₀ ⇒ NonZeroDecimalDigits₁ DecimalDigit] = 10×IntegerValue[NonZeroDecimalDigits₁] + DecimalValue[DecimalDigit];

DecimalValue[NonZeroDigit]: Integer = digitValue(NonZeroDigit);

Hexadecimal Escapes

Syntax

HexEscape ⇒

x HexDigit HexDigit

| u HexDigit HexDigit HexDigit HexDigit

HexDigit ⇒ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | a | b | c | d | e | f

Semantics

CharacterValue[HexEscape]: Character;

CharacterValue[HexEscape ⇒ x HexDigit₁ HexDigit₂] = codeToCharacter(16×HexValue[HexDigit₁] + HexValue[HexDigit₂]);

CharacterValue[HexEscape ⇒ u HexDigit₁ HexDigit₂ HexDigit₃ HexDigit₄] = codeToCharacter(4096×HexValue[HexDigit₁] + 256×HexValue[HexDigit₂] + 16×HexValue[HexDigit₃] + HexValue[HexDigit₄]);

HexValue[HexDigit]: Integer = digitValue(HexDigit);
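The two HexEscape rules are plain positional arithmetic in base 16. A Python sketch with illustrative names follows; it takes only the digits after the x or u and assumes they are valid hexadecimal digits.

```python
HEX_DIGITS = "0123456789abcdef"


def hex_value(digit: str) -> int:
    # Upper- and lowercase digits have the same value.
    return HEX_DIGITS.index(digit.lower())


def hex_escape_value(digits: str) -> str:
    if len(digits) == 2:                 # \x HexDigit HexDigit
        h1, h2 = (hex_value(d) for d in digits)
        code = 16 * h1 + h2
    elif len(digits) == 4:               # \u HexDigit HexDigit HexDigit HexDigit
        h1, h2, h3, h4 = (hex_value(d) for d in digits)
        code = 4096 * h1 + 256 * h2 + 16 * h3 + h4
    else:
        raise ValueError("expected two or four hexadecimal digits")
    return chr(code)
```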