ECMAScript 4 Lexer

ECMAScript 4 Netscape Proposal

Wednesday, June 4, 2003

This section presents an informal overview of the ECMAScript 4 lexer. See the stages and lexical semantics sections in the formal description chapter for the details.

Changes since ECMAScript 3

The ECMAScript 4 lexer behaves in the same way as the ECMAScript 3 lexer except for the following:

There are additional punctuators and reserved words.
The lexer recognizes several nonreserved words that have special meanings in some contexts but can be used as identifiers.
Only semicolon insertion on line breaks is handled by the lexer; the ECMAScript 4 parser allows semicolons to be omitted before a closing }. In addition, the ECMAScript 4 parser allows semicolons to be omitted before the else of an if-else statement and before the while of a do-while statement.
Semicolon insertion on line breaks is disabled in strict mode.
[no line break] restrictions in grammar productions are ignored in strict mode.

Source Code

ECMAScript 4 source text is a sequence of UTF-16 characters from Unicode version 2.1 or later, normalized to Unicode Normalization Form C (canonical composition) as described in Unicode Technical Report #15.
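As an illustration only, a tool that prepares source text for the lexer might bring it into the required form along the following lines. This is a sketch in Python, not part of the proposal; the helper names to_normalized_source and utf16_length are hypothetical. Python strings hold code points rather than UTF-16 code units, so the UTF-16 length is computed separately.

```python
import unicodedata


def to_normalized_source(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def utf16_length(text: str) -> int:
    """Count UTF-16 code units in text (hypothetical helper, for illustration)."""
    return len(text.encode("utf-16-le")) // 2
```

For example, to_normalized_source("e\u0301") composes the letter e and the combining acute accent into the single character "\u00e9".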

Comments and White Space

Comments and white space behave just as they do in ECMAScript 3.

Punctuators

The following ECMAScript 3 punctuation tokens are recognized in ECMAScript 4:

! != !== % %= & && &= ( ) * *= + ++ += , - -- -= . / /= : ; < << <<= <= = == === > >= >> >>= >>> >>>= ? [ ] ^ ^= { | |= || } ~

The following punctuation tokens are new in ECMAScript 4:

&&= ... :: ^^ ^^= ||=
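Because many punctuators are prefixes of longer ones, a lexer typically takes the longest punctuator that matches at the current position, as ECMAScript 3 does for all input elements. The Python sketch below assumes that longest-match rule carries over unchanged; the function name match_punctuator is hypothetical.

```python
# All punctuators named in this section, ECMAScript 3 and new ones together.
PUNCTUATORS = """
! != !== % %= & && &= ( ) * *= + ++ += , - -- -= . / /= : :: ; < << <<= <=
= == === > >= >> >>= >>> >>>= ? [ ] ^ ^= { | |= || } ~ &&= ... ^^ ^^= ||=
""".split()

# Try longer punctuators first so that ">>>=" wins over ">>>", ">>" and ">".
_BY_LENGTH = sorted(PUNCTUATORS, key=len, reverse=True)


def match_punctuator(source: str, pos: int):
    """Return the longest punctuator starting at source[pos], or None.

    A leading "/" is ambiguous with the start of a regular expression
    literal or a comment; that decision is not made here.
    """
    for punctuator in _BY_LENGTH:
        if source.startswith(punctuator, pos):
            return punctuator
    return None
```

Under this rule two consecutive periods lex as two separate . tokens, because only the three-character ... is a punctuator.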

Keywords

The following reserved words are used in ECMAScript 4:

as break case catch class const continue default delete do else export extends false finally for function if import in instanceof is namespace new null package private public return super switch this throw true try typeof use var void while with

The following reserved words are reserved for future expansion:

abstract debugger enum goto implements interface native protected synchronized throws transient volatile

The following words have special meaning in some contexts in ECMAScript 4 but are not reserved and may be used as identifiers:

get set

Any of the above keywords may be used as an identifier by including a \_ escape anywhere within the identifier, which strips it of any keyword meaning. The two-, four-, and eight-digit hexadecimal escapes \xdd, \udddd, and \Udddddddd may also be used in identifiers; these strip the identifier of any keyword meaning as well.
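The word classification described above can be sketched as follows. This Python fragment is illustrative only: the function name classify_word and the kind labels are hypothetical, it assumes that a \_ escape contributes no character to the identifier's name, and it does not validate malformed escapes.

```python
import re

RESERVED = set("""
as break case catch class const continue default delete do else export
extends false finally for function if import in instanceof is namespace new
null package private public return super switch this throw true try typeof
use var void while with
""".split())

FUTURE_RESERVED = set("""
abstract debugger enum goto implements interface native protected
synchronized throws transient volatile
""".split())

CONTEXTUAL = {"get", "set"}

# \_ or a two-, four-, or eight-digit hexadecimal escape.
_ESCAPE = re.compile(
    r"\\(?:_|x([0-9a-fA-F]{2})|u([0-9a-fA-F]{4})|U([0-9a-fA-F]{8}))"
)


def _decode(match):
    digits = match.group(1) or match.group(2) or match.group(3)
    # Assumption: \_ is a null escape that adds nothing to the name.
    return chr(int(digits, 16)) if digits else ""


def classify_word(raw: str):
    """Return (kind, name) for an identifier-like word as written in source."""
    name, escapes = _ESCAPE.subn(_decode, raw)
    if escapes:
        # Any escape strips the word of its keyword meaning.
        return ("identifier", name)
    if name in RESERVED:
        return ("reserved", name)
    if name in FUTURE_RESERVED:
        return ("future-reserved", name)
    if name in CONTEXTUAL:
        return ("contextual", name)
    return ("identifier", name)
```

With this sketch, if is classified as reserved, while i\_f and \x69f both yield an ordinary identifier named if.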

Changes from ECMAScript 3

The following words were reserved in ECMAScript 3 but are not reserved in ECMAScript 4:

boolean byte char double final float int long short static

The following words were not reserved in ECMAScript 3 but are reserved in ECMAScript 4:

as is namespace use
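The two lists above can be cross-checked mechanically. The Python sketch below takes the ECMAScript 3 reserved words (keywords, future reserved words, and the null and boolean literals) from the ECMAScript 3 standard rather than from this section, and compares them with the reserved words listed under Keywords; the variable names are hypothetical.

```python
# Reserved words of ECMAScript 3: keywords, future reserved words, literals.
ES3_RESERVED = set("""
break case catch continue default delete do else finally for function if in
instanceof new return switch this throw try typeof var void while with
abstract boolean byte char class const debugger double enum export extends
final float goto implements import int interface long native package private
protected public short static super synchronized throws transient volatile
null true false
""".split())

# Reserved words of ECMAScript 4, including those reserved for the future.
ES4_RESERVED = set("""
as break case catch class const continue default delete do else export
extends false finally for function if import in instanceof is namespace new
null package private public return super switch this throw true try typeof
use var void while with
abstract debugger enum goto implements interface native protected
synchronized throws transient volatile
""".split())

no_longer_reserved = sorted(ES3_RESERVED - ES4_RESERVED)
newly_reserved = sorted(ES4_RESERVED - ES3_RESERVED)
```

The two set differences reproduce the ten words released and the four words newly reserved in ECMAScript 4.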

Semicolon Insertion

The ECMAScript 4 syntactic grammar explicitly makes semicolons optional in the following situations:

Before any }
Before the else of an if-else statement
Before the while of a do-while statement (but not before the while of a while statement)
Before the end of the program

Semicolons are optional in these situations even if they would construct empty statements. Strict mode has no effect on semicolon insertion in the above cases.

In addition, line breaks in the input stream are sometimes turned into VirtualSemicolon tokens. Specifically, if the first through the nth tokens of an ECMAScript program are grammatically valid but the first through the (n+1)st tokens are not, and there is a line break (or a comment containing a line break) between the nth and the (n+1)st tokens, then the parser tries to parse the program again after inserting a VirtualSemicolon token between the nth and the (n+1)st tokens. This kind of VirtualSemicolon insertion does not occur in strict mode.
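The line-break rule can be sketched as below. This Python fragment is a simplification under stated assumptions: the real parser is replaced by a caller-supplied is_valid_prefix predicate, the check is done incrementally instead of by reparsing the whole program, and the names Token, insert_virtual_semicolons, and toy_is_valid_prefix are hypothetical. The toy grammar accepts only statements of the form name = name, separated by semicolons.

```python
from typing import Callable, List, NamedTuple


class Token(NamedTuple):
    text: str
    newline_before: bool = False  # a line break precedes this token


VIRTUAL_SEMICOLON = Token("<VirtualSemicolon>")


def insert_virtual_semicolons(
    tokens: List[Token],
    is_valid_prefix: Callable[[List[Token]], bool],
    strict: bool = False,
) -> List[Token]:
    """Insert VirtualSemicolon tokens where a line break rescues the parse."""
    accepted: List[Token] = []
    for token in tokens:
        if not is_valid_prefix(accepted + [token]):
            # Tokens 1..n parse but 1..n+1 do not: only a line break helps,
            # and never in strict mode.
            if strict or not token.newline_before:
                raise SyntaxError(f"unexpected token {token.text!r}")
            accepted.append(VIRTUAL_SEMICOLON)
            if not is_valid_prefix(accepted + [token]):
                raise SyntaxError(f"unexpected token {token.text!r}")
        accepted.append(token)
    return accepted


def toy_is_valid_prefix(tokens: List[Token]) -> bool:
    """Toy stand-in for the parser: repeated 'name = name ;' statements."""
    expected = ["name", "=", "name", ";"]
    for index, token in enumerate(tokens):
        kind = expected[index % 4]
        if kind == "name":
            ok = token.text.isidentifier()
        elif kind == ";":
            ok = token.text == ";" or token == VIRTUAL_SEMICOLON
        else:
            ok = token.text == "="
        if not ok:
            return False
    return True
```

Given the tokens of a = b followed on the next line by c = d, the sketch inserts one VirtualSemicolon before c; with strict=True, or with no line break before c, it raises SyntaxError instead.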

Numeric Literals

The syntax for numeric literals is the same as in ECMAScript 3, with the addition of long, ulong, and float numeric literals. The rules for numeric literals are as follows:

A numeric literal without a suffix is converted to an IEEE double-precision floating-point number.
A numeric literal with the suffix l or L is interpreted as a long value and must be a decimal or hexadecimal constant without an exponent or decimal point and be in the range of 0 through 2⁶³; furthermore, if the value is exactly 2⁶³ then the literal can only be used as the operand of the - unary negation operator.
A numeric literal with the suffix ul, uL, Ul, or UL is interpreted as a ulong value and must be a decimal or hexadecimal constant without an exponent or decimal point and be in the range of 0 through 2⁶⁴–1.
A numeric literal with the suffix f or F is interpreted as a float value and must be a decimal constant. Hexadecimal float constants are not permitted because the suffix would be interpreted as a hexadecimal digit.

The suffix must be adjacent to the number with no intervening white space. A number may not be followed by an identifier without intervening white space.
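The suffix rules above can be sketched as a small classifier. This Python fragment is illustrative and makes several assumptions: the names NumericLiteral and classify_numeric_literal are hypothetical, the unsuffixed syntax follows the ECMAScript 3 decimal and hexadecimal grammar without legacy octal, the input is a single isolated token, and float values are kept as Python floats without rounding to single precision.

```python
import re
from typing import NamedTuple, Union


class NumericLiteral(NamedTuple):
    kind: str                    # "double", "long", "ulong", or "float"
    value: Union[int, float]
    negation_only: bool = False  # True only for the long literal 2**63


_EXP = r"(?:[eE][+-]?[0-9]+)?"
_INT = r"(?:0|[1-9][0-9]*)"
_DEC_RE = re.compile(
    rf"((?:{_INT}\.[0-9]*|\.[0-9]+|{_INT}){_EXP})([uU][lL]|[lL]|[fF])?"
)
# No f suffix here: a trailing f or F is just another hexadecimal digit.
_HEX_RE = re.compile(r"(0[xX][0-9a-fA-F]+)([uU][lL]|[lL])?")


def _integer(value: int, suffix: str) -> NumericLiteral:
    if suffix == "l":
        if value > 2**63:
            raise ValueError("long literal out of range")
        # Exactly 2**63 is legal only as the operand of unary minus.
        return NumericLiteral("long", value, negation_only=(value == 2**63))
    if value > 2**64 - 1:
        raise ValueError("ulong literal out of range")
    return NumericLiteral("ulong", value)


def classify_numeric_literal(text: str) -> NumericLiteral:
    match = _HEX_RE.fullmatch(text)
    if match:
        value = int(match.group(1), 16)
        suffix = (match.group(2) or "").lower()
        return _integer(value, suffix) if suffix else NumericLiteral("double", float(value))
    match = _DEC_RE.fullmatch(text)
    if not match:
        raise ValueError(f"not a numeric literal: {text!r}")
    body, suffix = match.group(1), (match.group(2) or "").lower()
    if suffix == "":
        return NumericLiteral("double", float(body))
    if suffix == "f":
        return NumericLiteral("float", float(body))
    if not body.isdigit():
        raise ValueError("long and ulong literals take no exponent or decimal point")
    return _integer(int(body), suffix)
```

In this sketch 0x1f is a double with value 31 because the trailing f is consumed as a hexadecimal digit, and 1.5L is rejected because a long literal may not contain a decimal point.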

Regular Expression Literals

Regular expression literals begin with a slash (/) character not immediately followed by another slash (two slashes start a line comment). As in ECMAScript 3, regular expression literals are ambiguous with the division (/) and division-assignment (/=) tokens. The lexer treats a / or /= as a division or division-assignment token if either of these tokens would be allowed by the syntactic grammar as the next token; otherwise, the lexer treats a / or /= as starting a regular expression.
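The decision described above can be sketched as a function that the lexer calls with a hint from the parser. This Python fragment is illustrative only: the function name classify_slash, the division_allowed flag, and the returned labels are hypothetical, and the block-comment case is included on the assumption that comments behave as in ECMAScript 3.

```python
def classify_slash(source: str, pos: int, division_allowed: bool) -> str:
    """Decide what the "/" at source[pos] begins.

    division_allowed is supplied by the parser: True when the syntactic
    grammar would accept a / or /= token as the next token.
    """
    if source[pos] != "/":
        raise ValueError("classify_slash must be called on a '/' character")
    following = source[pos + 1:pos + 2]
    if following == "/":
        return "line-comment"
    if following == "*":
        return "block-comment"
    if division_allowed:
        return "division-assignment" if following == "=" else "division"
    return "regexp-start"
```

For instance, the slash in a / b is a division token when the parser has just seen an operand, whereas the leading slash in /=+/ starts a regular expression when the parser expects the beginning of an expression.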

This unfortunate dependence of lexical parsing on grammatical parsing is inherited from ECMAScript 3. See the regular expression syntax rationale for a discussion of the issues.

Waldemar Horwat
Last modified Wednesday, June 4, 2003