April 2002 Draft
JavaScript 2.0
Formal Description
Regular Expression Semantics
|
Thursday, February 7, 2002
The regular expression semantics describe the actions the regular expression engine takes in order to transform a regular expression pattern into a function for matching against input strings. For convenience, the regular expression grammar is repeated here. See also the description of the semantic notation.
This document is also available as a Word 98 rtf file.
The regular expression semantics below are working (except for case-insensitive matches) and have been tried on sample cases, but they could be formatted better.
Field str is the input string. ignoreCase, multiline, and span are the corresponding regular expression flags.
A REMatch holds an intermediate state during the pattern-matching process. endIndex is the index of the next input character to be matched by the next component in a regular expression pattern. If we are at the end of the pattern, endIndex is one plus the index of the last matched input character. captures is a zero-based array of the strings captured so far by capturing parentheses.
A Continuation is a function that attempts to match the remaining portion of the pattern against the input string, starting at the intermediate state given by its REMatch argument. If a match is possible, it returns a REMatch result that contains the final state; if no match is possible, it returns a failure result.
A Matcher is a function that attempts to match a middle portion of the pattern against the input string, starting at the intermediate state given by its REMatch argument. Since the remainder of the pattern heavily influences whether (and how) a middle portion will match, we must pass in a Continuation function that checks whether the rest of the pattern matched. If the continuation returns failure, the matcher function may call it repeatedly, trying various alternatives at pattern choice points.
The REInput parameter contains the input string and is merely passed down to subroutines.
A Integer Matcher is a function executed at the time the regular expression is compiled that returns a Matcher for a part of the pattern. The Integer parameter contains the number of capturing left parentheses seen so far in the pattern and is used to assign static, consecutive numbers to capturing parentheses.
characterSetMatcher returns a Matcher that matches a single input string character. If invert is false, the match succeeds if the character is a member of the acceptanceSet set of characters (possibly ignoring case). If invert is true, the match succeeds if the character is not a member of the acceptanceSet set of characters (possibly ignoring case).
characterMatcher returns a Matcher that matches a single input string character. The match succeeds if the character is the same as ch (possibly ignoring case).
|
Disjunction1]
= CountParens[Alternative] + CountParens[Disjunction1];*
+
?
.
\
AtomEscapeA
| B
| C
| D
| E
| F
| G
| H
| I
| J
| K
| L
| M
| N
| O
| P
| Q
| R
| S
| T
| U
| V
| W
| X
| Y
| Z
a
| b
| c
| d
| e
| f
| g
| h
| i
| j
| k
| l
| m
| n
| o
| p
| q
| r
| s
| t
| u
| v
| w
| x
| y
| z
c
ControlLetter]
= codeToCharacter(bitwiseAnd(characterToCode(ControlLetter), 31));-
ClassAtomdash2 ClassRanges]
= characterRange(AcceptanceSet[ClassAtom1], AcceptanceSet[ClassAtomdash2])
AcceptanceSet[ClassRanges];
Waldemar Horwat Last modified Thursday, February 7, 2002 |