July 2000 Draft
JavaScript 2.0
Formal Description
Regular Expression Semantics
previousupnext

Thursday, November 11, 1999

The regular expression semantics describe the actions the regular expression engine takes in order to transform a regular expression pattern into a function for matching against input strings. For convenience, the regular expression grammar is repeated here. See also the description of the semantic notation.

This document is also available as a Word 98 rtf file.

The regular expression semantics below are working (except for case-insensitive matches) and have been tried on sample cases, but they could be formatted better.

Semantics

type SemanticException = oneof {syntaxError}

Unicode Character Classes

Syntax

UnicodeCharacter  Any Unicode character
UnicodeAlphanumeric  Any Unicode alphabetic or decimal digit character (includes ASCII 0-9, A-Z, and a-z)
LineTerminator  «LF» | «CR» | «u2028» | «u2029»

Semantics

lineTerminators : {Character} = {‘«LF»’, ‘«CR»’, ‘«u2028»’, ‘«u2029»’}

reWhitespaces : {Character} = {‘«FF»’, ‘«LF»’, ‘«CR»’, ‘«TAB»’, ‘«VT»’, ‘ ’}

reDigits : {Character} = {‘0’ ... ‘9’}

reWordCharacters : {Character} = {‘0’ ... ‘9’, ‘A’ ... ‘Z’, ‘a’ ... ‘z’, ‘_’}

Regular Expression Definitions

Semantics

type REInput = tuple {strStringignoreCaseBooleanmultilineBoolean}

Field str is the input string. ignoreCase and multiline are the corresponding regular expression flags.

type REResult = oneof {successREMatchfailure}

type REMatch = tuple {endIndexIntegercapturesCapture[]}

A REMatch holds an intermediate state during the pattern-matching process. endIndex is the index of the next input character to be matched by the next component in a regular expression pattern. If we are at the end of the pattern, endIndex is one plus the index of the last matched input character. captures is a zero-based array of the strings captured so far by capturing parentheses.

type Capture = oneof {presentStringabsent}

type Continuation = REMatch  REResult

A Continuation is a function that attempts to match the remaining portion of the pattern against the input string, starting at the intermediate state given by its REMatch argument. If a match is possible, it returns a success result that contains the final REMatch state; if no match is possible, it returns a failure result.

type Matcher = REInput  REMatch  Continuation  REResult

A Matcher is a function that attempts to match a middle portion of the pattern against the input string, starting at the intermediate state given by its REMatch argument. Since the remainder of the pattern heavily influences whether (and how) a middle portion will match, we must pass in a Continuation function that checks whether the rest of the pattern matched. If the continuation returns failure, the matcher function may call it repeatedly, trying various alternatives at pattern choice points.

The REInput parameter contains the input string and is merely passed down to subroutines.

type MatcherGenerator = Integer  Matcher

A MatcherGenerator is a function executed at the time the regular expression is compiled that returns a Matcher for a part of the pattern. The Integer parameter contains the number of capturing left parentheses seen so far in the pattern and is used to assign static, consecutive numbers to capturing parentheses.

characterSetMatcher(acceptanceSet{Character}invertBoolean) : Matcher
  = function(tREInputxREMatchcContinuation)
         let iInteger = x.endIndex;
             sString = t.str
         in if i = |s|
             then failure
             else if s[i acceptanceSet xor invert
             then c(endIndex (i + 1), captures x.captures)
             else failure

characterSetMatcher returns a Matcher that matches a single input string character. If invert is false, the match succeeds if the character is a member of the acceptanceSet set of characters (possibly ignoring case). If invert is true, the match succeeds if the character is not a member of the acceptanceSet set of characters (possibly ignoring case).

characterMatcher(chCharacter) : Matcher = characterSetMatcher({ch}, false)

characterMatcher returns a Matcher that matches a single input string character. The match succeeds if the character is the same as ch (possibly ignoring case).

Regular Expression Patterns

Syntax

RegularExpressionPattern  Disjunction

Semantics

action Exec[RegularExpressionPattern] : REInput  Integer  REResult

Exec[RegularExpressionPattern  Disjunction]
  = let matchMatcher = GenMatcher[Disjunction](0)
     in function(tREInputindexInteger)
             match(
                 t,
                 endIndex indexcaptures fillCapture(CountParens[Disjunction]),
                 successContinuation)

successContinuation(xREMatch) : REResult = success x

fillCapture(iInteger) : Capture[]
  = if i = 0
     then []Capture
     else fillCapture(i - 1)  [absent]

Disjunctions

Syntax

Disjunction 
   Alternative
|  Alternative | Disjunction

Semantics

action GenMatcher[Disjunction] : MatcherGenerator

GenMatcher[Disjunction  Alternative] = GenMatcher[Alternative]

GenMatcher[Disjunction  Alternative | Disjunction1](parenIndexInteger)
  = let match1Matcher = GenMatcher[Alternative](parenIndex);
         match2Matcher = GenMatcher[Disjunction1](parenIndex + CountParens[Alternative])
     in function(tREInputxREMatchcContinuation)
             case match1(txcof
                success(yREMatch): success y;
                failurematch2(txc)
                end

action CountParens[Disjunction] : Integer

CountParens[Disjunction  Alternative] = CountParens[Alternative]

CountParens[Disjunction  Alternative | Disjunction1]
  = CountParens[Alternative] + CountParens[Disjunction1]

Alternatives

Syntax

Alternative 
   «empty»
|  Alternative Term

Semantics

action GenMatcher[Alternative] : MatcherGenerator

GenMatcher[Alternative  «empty»](parenIndexInteger)
  = function(tREInputxREMatchcContinuation)
         c(x)

GenMatcher[Alternative  Alternative1 Term](parenIndexInteger)
  = let match1Matcher = GenMatcher[Alternative1](parenIndex);
         match2Matcher = GenMatcher[Term](parenIndex + CountParens[Alternative1])
     in function(tREInputxREMatchcContinuation)
             let dContinuation
                     = function(yREMatch)
                            match2(tyc)
             in match1(txd)

action CountParens[Alternative] : Integer

CountParens[Alternative  «empty»] = 0

CountParens[Alternative  Alternative1 Term]
  = CountParens[Alternative1] + CountParens[Term]

Terms

Syntax

Term 
   Assertion
|  Atom
|  Atom Quantifier

Semantics

action GenMatcher[Term] : MatcherGenerator

GenMatcher[Term  Assertion](parenIndexInteger)
  = function(tREInputxREMatchcContinuation)
         if TestAssertion[Assertion](tx)
         then c(x)
         else failure

GenMatcher[Term  Atom] = GenMatcher[Atom]

GenMatcher[Term  Atom Quantifier](parenIndexInteger)
  = let matchMatcher = GenMatcher[Atom](parenIndex);
         minInteger = Minimum[Quantifier];
         maxLimit = Maximum[Quantifier];
         greedyBoolean = Greedy[Quantifier]
     in if 
             (case max of
                finite(mInteger): m < min;
                infinitefalse
                end)
         then throw syntaxError
         else repeatMatcher(matchminmaxgreedyparenIndexCountParens[Atom])

action CountParens[Term] : Integer

CountParens[Term  Assertion] = 0

CountParens[Term  Atom] = CountParens[Atom]

CountParens[Term  Atom Quantifier] = CountParens[Atom]

Syntax

Quantifier 
   QuantifierPrefix
|  QuantifierPrefix ?
QuantifierPrefix 
   *
|  +
|  ?
|  { DecimalDigits }
|  { DecimalDigits , }
|  { DecimalDigits , DecimalDigits }
DecimalDigits 
   DecimalDigit
|  DecimalDigits DecimalDigit
DecimalDigit  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Semantics

type Limit = oneof {finiteIntegerinfinite}

resetParens(xREMatchpIntegernParensInteger) : REMatch
  = if nParens = 0
     then x
     else let yREMatch = endIndex x.endIndexcaptures x.captures[p  absent]
           in resetParens(yp + 1, nParens - 1)

repeatMatcher(bodyMatcherminIntegermaxLimitgreedyBooleanparenIndexIntegernBodyParensInteger)
  : Matcher
  = function(tREInputxREMatchcContinuation)
         if 
             (case max of
                finite(mInteger): m = 0;
                infinitefalse
                end)
         then c(x)
         else let dContinuation
                       = function(yREMatch)
                              if min = 0 and y.endIndex = x.endIndex
                              then failure
                              else let newMinInteger
                                            = if min = 0
                                               then 0
                                               else min - 1;
                                        newMaxLimit
                                            = case max of
                                                  finite(mInteger): finite (m - 1);
                                                  infiniteinfinite
                                                  end
                                    in repeatMatcher(
                                            body,
                                            newMin,
                                            newMax,
                                            greedy,
                                            parenIndex,
                                            nBodyParens)(tyc);
                   xrREMatch = resetParens(xparenIndexnBodyParens)
               in if min  0
                   then body(txrd)
                   else if greedy
                   then case body(txrdof
                             success(zREMatch): success z;
                             failurec(x)
                             end
                   else case c(xof
                            success(zREMatch): success z;
                            failurebody(txrd)
                            end

action Minimum[Quantifier] : Integer

Minimum[Quantifier  QuantifierPrefix] = Minimum[QuantifierPrefix]

Minimum[Quantifier  QuantifierPrefix ?] = Minimum[QuantifierPrefix]

action Maximum[Quantifier] : Limit

Maximum[Quantifier  QuantifierPrefix] = Maximum[QuantifierPrefix]

Maximum[Quantifier  QuantifierPrefix ?] = Maximum[QuantifierPrefix]

action Greedy[Quantifier] : Boolean

Greedy[Quantifier  QuantifierPrefix] = true

Greedy[Quantifier  QuantifierPrefix ?] = false

action Minimum[QuantifierPrefix] : Integer

Minimum[QuantifierPrefix  *] = 0

Minimum[QuantifierPrefix  +] = 1

Minimum[QuantifierPrefix  ?] = 0

Minimum[QuantifierPrefix  { DecimalDigits }] = IntegerValue[DecimalDigits]

Minimum[QuantifierPrefix  { DecimalDigits , }] = IntegerValue[DecimalDigits]

Minimum[QuantifierPrefix  { DecimalDigits1 , DecimalDigits2 }]
  = IntegerValue[DecimalDigits1]

action Maximum[QuantifierPrefix] : Limit

Maximum[QuantifierPrefix  *] = infinite

Maximum[QuantifierPrefix  +] = infinite

Maximum[QuantifierPrefix  ?] = finite 1

Maximum[QuantifierPrefix  { DecimalDigits }] = finite IntegerValue[DecimalDigits]

Maximum[QuantifierPrefix  { DecimalDigits , }] = infinite

Maximum[QuantifierPrefix  { DecimalDigits1 , DecimalDigits2 }]
  = finite IntegerValue[DecimalDigits2]

action IntegerValue[DecimalDigits] : Integer

IntegerValue[DecimalDigits  DecimalDigit] = DecimalValue[DecimalDigit]

IntegerValue[DecimalDigits  DecimalDigits1 DecimalDigit]
  = 10*IntegerValue[DecimalDigits1] + DecimalValue[DecimalDigit]

action DecimalValue[DecimalDigit] : Integer = digitValue(DecimalDigit)

Assertions

Syntax

Assertion 
   ^
|  $
|  \ b
|  \ B

Semantics

action TestAssertion[Assertion] : REInput  REMatch  Boolean

TestAssertion[Assertion  ^](tREInputxREMatch)
  = if x.endIndex = 0
     then true
     else t.multiline and t.str[x.endIndex - 1]  lineTerminators

TestAssertion[Assertion  $](tREInputxREMatch)
  = if x.endIndex = |t.str|
     then true
     else t.multiline and t.str[x.endIndex lineTerminators

TestAssertion[Assertion  \ b](tREInputxREMatch)
  = atWordBoundary(x.endIndext.str)

TestAssertion[Assertion  \ B](tREInputxREMatch)
  = not atWordBoundary(x.endIndext.str)

atWordBoundary(iIntegersString) : Boolean = inWord(i - 1, sxor inWord(is)

inWord(iIntegersString) : Boolean
  = if i = -1 or i = |s|
     then false
     else s[i reWordCharacters

Atoms

Syntax

Atom 
   PatternCharacter
|  .
|  \ AtomEscape
|  CharacterClass
|  ( Disjunction )
|  ( ? : Disjunction )
|  ( ? = Disjunction )
|  ( ? ! Disjunction )
PatternCharacter  UnicodeCharacter except ^ | $ | \ | . | * | + | ? | ( | ) | [ | ] | { | } | |

Semantics

action GenMatcher[Atom] : MatcherGenerator

GenMatcher[Atom  PatternCharacter](parenIndexInteger)
  = characterMatcher(PatternCharacter)

GenMatcher[Atom  .](parenIndexInteger) = characterSetMatcher(lineTerminatorstrue)

GenMatcher[Atom  \ AtomEscape] = GenMatcher[AtomEscape]

GenMatcher[Atom  CharacterClass](parenIndexInteger)
  = let a{Character} = AcceptanceSet[CharacterClass]
     in characterSetMatcher(aInvert[CharacterClass])

GenMatcher[Atom  ( Disjunction )](parenIndexInteger)
  = let matchMatcher = GenMatcher[Disjunction](parenIndex + 1)
     in function(tREInputxREMatchcContinuation)
             let dContinuation
                     = function(yREMatch)
                            let updatedCapturesCapture[]
                                    = y.captures[parenIndex 
                                           present t.str[x.endIndex ... y.endIndex - 1]]
                            in c(endIndex y.endIndexcaptures updatedCaptures)
             in match(txd)

GenMatcher[Atom  ( ? : Disjunction )] = GenMatcher[Disjunction]

GenMatcher[Atom  ( ? = Disjunction )](parenIndexInteger)
  = let matchMatcher = GenMatcher[Disjunction](parenIndex)
     in function(tREInputxREMatchcContinuation)
             case match(txsuccessContinuationof
                success(yREMatch): c(endIndex x.endIndexcaptures y.captures);
                failurefailure
                end

GenMatcher[Atom  ( ? ! Disjunction )](parenIndexInteger)
  = let matchMatcher = GenMatcher[Disjunction](parenIndex)
     in function(tREInputxREMatchcContinuation)
             case match(txsuccessContinuationof
                success(yREMatch): failure;
                failurec(x)
                end

action CountParens[Atom] : Integer

CountParens[Atom  PatternCharacter] = 0

CountParens[Atom  .] = 0

CountParens[Atom  \ AtomEscape] = 0

CountParens[Atom  CharacterClass] = 0

CountParens[Atom  ( Disjunction )] = CountParens[Disjunction] + 1

CountParens[Atom  ( ? : Disjunction )] = CountParens[Disjunction]

CountParens[Atom  ( ? = Disjunction )] = CountParens[Disjunction]

CountParens[Atom  ( ? ! Disjunction )] = CountParens[Disjunction]

Escapes

Syntax

AtomEscape 
   DecimalEscape
|  CharacterEscape
|  CharacterClassEscape

Semantics

action GenMatcher[AtomEscape] : MatcherGenerator

GenMatcher[AtomEscape  DecimalEscape](parenIndexInteger)
  = let nInteger = EscapeValue[DecimalEscape]
     in if n = 0
         then characterMatcher(‘«NUL»’)
         else if n > parenIndex
         then throw syntaxError
         else backreferenceMatcher(n)

GenMatcher[AtomEscape  CharacterEscape](parenIndexInteger)
  = characterMatcher(CharacterValue[CharacterEscape])

GenMatcher[AtomEscape  CharacterClassEscape](parenIndexInteger)
  = characterSetMatcher(AcceptanceSet[CharacterClassEscape], false)

backreferenceMatcher(nInteger) : Matcher
  = function(tREInputxREMatchcContinuation)
         case nthBackreference(xnof
            present(refString):
                  let iInteger = x.endIndex;
                      sString = t.str
                  in let jInteger = i + |ref|
                  in if j > |s|
                      then failure
                      else if s[i ... j - 1] = ref
                      then c(endIndex jcaptures x.captures)
                      else failure;
            absentc(x)
            end

nthBackreference(xREMatchnInteger) : Capture = x.captures[n - 1]

Syntax

CharacterEscape 
   ControlEscape
|  c ControlLetter
|  HexEscape
|  IdentityEscape
ControlLetter 
   A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
|  a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
IdentityEscape  UnicodeCharacter except UnicodeAlphanumeric
ControlEscape 
   f
|  n
|  r
|  t
|  v

Semantics

action CharacterValue[CharacterEscape] : Character

CharacterValue[CharacterEscape  ControlEscape] = CharacterValue[ControlEscape]

CharacterValue[CharacterEscape  c ControlLetter]
  = codeToCharacter(bitwiseAnd(characterToCode(ControlLetter), 31))

CharacterValue[CharacterEscape  HexEscape] = CharacterValue[HexEscape]

CharacterValue[CharacterEscape  IdentityEscape] = IdentityEscape

action CharacterValue[ControlEscape] : Character

CharacterValue[ControlEscape  f] = ‘«FF»

CharacterValue[ControlEscape  n] = ‘«LF»

CharacterValue[ControlEscape  r] = ‘«CR»

CharacterValue[ControlEscape  t] = ‘«TAB»

CharacterValue[ControlEscape  v] = ‘«VT»

Decimal Escapes

Syntax

DecimalEscape  DecimalIntegerLiteral [lookahead{DecimalDigit}]
DecimalIntegerLiteral 
   0
|  NonZeroDecimalDigits
NonZeroDecimalDigits 
   NonZeroDigit
|  NonZeroDecimalDigits DecimalDigit
NonZeroDigit  1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Semantics

action EscapeValue[DecimalEscape] : Integer

EscapeValue[DecimalEscape  DecimalIntegerLiteral [lookahead{DecimalDigit}]]
  = IntegerValue[DecimalIntegerLiteral]

action IntegerValue[DecimalIntegerLiteral] : Integer

IntegerValue[DecimalIntegerLiteral  0] = 0

IntegerValue[DecimalIntegerLiteral  NonZeroDecimalDigits]
  = IntegerValue[NonZeroDecimalDigits]

action IntegerValue[NonZeroDecimalDigits] : Integer

IntegerValue[NonZeroDecimalDigits  NonZeroDigit] = DecimalValue[NonZeroDigit]

IntegerValue[NonZeroDecimalDigits  NonZeroDecimalDigits1 DecimalDigit]
  = 10*IntegerValue[NonZeroDecimalDigits1] + DecimalValue[DecimalDigit]

action DecimalValue[NonZeroDigit] : Integer = digitValue(NonZeroDigit)

Hexadecimal Escapes

Syntax

HexEscape 
   x HexDigit HexDigit
|  u HexDigit HexDigit HexDigit HexDigit
HexDigit  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | a | b | c | d | e | f

Semantics

action CharacterValue[HexEscape] : Character

CharacterValue[HexEscape  x HexDigit1 HexDigit2]
  = codeToCharacter(16*HexValue[HexDigit1] + HexValue[HexDigit2])

CharacterValue[HexEscape  u HexDigit1 HexDigit2 HexDigit3 HexDigit4]
  = codeToCharacter(
         4096*HexValue[HexDigit1] + 256*HexValue[HexDigit2] + 16*HexValue[HexDigit3] +
         HexValue[HexDigit4])

action HexValue[HexDigit] : Integer = digitValue(HexDigit)

Character Class Escapes

Syntax

CharacterClassEscape 
   s
|  S
|  d
|  D
|  w
|  W

Semantics

action AcceptanceSet[CharacterClassEscape] : {Character}

AcceptanceSet[CharacterClassEscape  s] = reWhitespaces

AcceptanceSet[CharacterClassEscape  S] = {‘«NUL»’ ... ‘«uFFFF»’} - reWhitespaces

AcceptanceSet[CharacterClassEscape  d] = reDigits

AcceptanceSet[CharacterClassEscape  D] = {‘«NUL»’ ... ‘«uFFFF»’} - reDigits

AcceptanceSet[CharacterClassEscape  w] = reWordCharacters

AcceptanceSet[CharacterClassEscape  W] = {‘«NUL»’ ... ‘«uFFFF»’} - reWordCharacters

User-Specified Character Classes

Syntax

CharacterClass 
   [ [lookahead{^}] ClassRanges ]
|  [ ^ ClassRanges ]
ClassRanges 
   «empty»
|  NonemptyClassRangesdash
  {dashnoDash}
NonemptyClassRanges 
   ClassAtomdash
|  ClassAtom NonemptyClassRangesnoDash
|  ClassAtom - ClassAtomdash ClassRanges

Semantics

action AcceptanceSet[CharacterClass] : {Character}

AcceptanceSet[CharacterClass  [ [lookahead{^}] ClassRanges ]]
  = AcceptanceSet[ClassRanges]

AcceptanceSet[CharacterClass  [ ^ ClassRanges ]] = AcceptanceSet[ClassRanges]

action Invert[CharacterClass] : Boolean

Invert[CharacterClass  [ [lookahead{^}] ClassRanges ]] = false

Invert[CharacterClass  [ ^ ClassRanges ]] = true

action AcceptanceSet[ClassRanges] : {Character}

AcceptanceSet[ClassRanges  «empty»] = {}Character

AcceptanceSet[ClassRanges  NonemptyClassRangesdash]
  = AcceptanceSet[NonemptyClassRangesdash]

action AcceptanceSet[NonemptyClassRanges] : {Character}

AcceptanceSet[NonemptyClassRanges  ClassAtomdash] = AcceptanceSet[ClassAtomdash]

AcceptanceSet[NonemptyClassRanges  ClassAtom NonemptyClassRangesnoDash1]
  = AcceptanceSet[ClassAtom AcceptanceSet[NonemptyClassRangesnoDash1]

AcceptanceSet[NonemptyClassRanges  ClassAtom1 - ClassAtomdash2 ClassRanges]
  = let range{Character}
             = characterRange(AcceptanceSet[ClassAtom1], AcceptanceSet[ClassAtomdash2])
     in range  AcceptanceSet[ClassRanges]

characterRange(low{Character}high{Character}) : {Character}
  = if |low 1 or |high 1
     then throw syntaxError
     else let lCharacter = min low;
               hCharacter = min high
           in if l  h
               then {l ... h}
               else throw syntaxError

Character Class Range Atoms

Syntax

ClassAtom 
   ClassCharacter
|  \ ClassEscape
ClassCharacterdash  UnicodeCharacter except \ | ]
ClassCharacternoDash  ClassCharacterdash except -
ClassEscape 
   DecimalEscape
|  b
|  CharacterEscape
|  CharacterClassEscape

Semantics

action AcceptanceSet[ClassAtom] : {Character}

AcceptanceSet[ClassAtom  ClassCharacter] = {ClassCharacter}

AcceptanceSet[ClassAtom  \ ClassEscape] = AcceptanceSet[ClassEscape]

action AcceptanceSet[ClassEscape] : {Character}

AcceptanceSet[ClassEscape  DecimalEscape]
  = if EscapeValue[DecimalEscape] = 0
     then {‘«NUL»’}
     else throw syntaxError

AcceptanceSet[ClassEscape  b] = {‘«BS»’}

AcceptanceSet[ClassEscape  CharacterEscape] = {CharacterValue[CharacterEscape]}

AcceptanceSet[ClassEscape  CharacterClassEscape] = AcceptanceSet[CharacterClassEscape]


Waldemar Horwat
Last modified Thursday, November 11, 1999
previousupnext