JavaScript 2.0 Lexer

July 2000 Draft

JavaScript 2.0

Core Language

Lexer

Saturday, April 29, 2000

This section presents an informal overview of the JavaScript 2.0 lexer. See the stages and lexer semantics sections in the formal description chapter for the details.

Changes since JavaScript 1.5

The JavaScript 2.0 lexer behaves in the same way as the JavaScript 1.5 lexer except for the following:

There are additional punctuators and reserved words.
The lexer recognizes several nonreserved words that have special meanings in some contexts but can be used as identifiers.
Numerals may be followed by units.
Only semicolon insertion on line breaks is handled by the lexer; the JavaScript 2.0 parser allows semicolons to be omitted before a closing }. In addition, the JavaScript 2.0 parser allows semicolons to be omitted before the else of an if-else statement and before the while of a do-while statement.
Semicolon insertion on line breaks is disabled in strict mode.
[no line break] restrictions in grammar productions are ignored in strict mode.

Source Code

JavaScript 2.0 source text consists of a sequence of UTF-16 characters from Unicode version 2.1 or later, normalized to Unicode Normalization Form C (canonical composition) as described in Unicode Technical Report #15.
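
As an illustration of what Normalization Form C means for source text, the following is a minimal sketch written in present-day JavaScript (not JavaScript 2.0). It relies on the standard String.prototype.normalize method; the helper name isNormalizedNFC is hypothetical and does not come from this specification.

```javascript
// Hypothetical helper: returns true when `text` is already in
// Unicode Normalization Form C (canonical composition).
function isNormalizedNFC(text) {
  return text === text.normalize("NFC");
}

// "e" followed by U+0301 (combining acute accent) composes to U+00E9 under NFC.
const decomposed = "caf\u0065\u0301";
const composed = decomposed.normalize("NFC");

console.log(isNormalizedNFC(decomposed)); // false
console.log(isNormalizedNFC(composed));   // true
console.log(composed === "caf\u00e9");    // true
```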

Comments and White Space

Comments and white space behave just as they do in JavaScript 1.5.

Punctuators

The following JavaScript 1.5 punctuation tokens are recognized in JavaScript 2.0:

! != !== % %= & && &= ( ) * *= + ++ += , - -- -= . / /= : :: ; < << <<= <= = == === > >= >> >>= >>> >>>= ? [ ] ^ ^= { | |= || } ~

The following punctuation tokens are new in JavaScript 2.0:

# &&= -> .. ... @ ^^ ^^= ||=
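
One way a lexer might recognize these tokens is sketched below in present-day JavaScript. It assumes the usual longest-match rule (so that >>>= is read as one token rather than >> followed by >=), which this overview does not state explicitly. The helper name matchPunctuator is hypothetical, and the sketch ignores the context-dependent treatment of / and /= described under Regular Expression Literals.

```javascript
// Punctuators copied from the two lists above.
const PUNCTUATORS = (
  "! != !== % %= & && &= ( ) * *= + ++ += , - -- -= . / /= : :: ; " +
  "< << <<= <= = == === > >= >> >>= >>> >>>= ? [ ] ^ ^= { | |= || } ~ " +
  "# &&= -> .. ... @ ^^ ^^= ||="
).split(" ");

// Try longer candidates first so that ">>>=" wins over ">>" and ">".
// (Longest match is an assumption of this sketch.)
const BY_LENGTH = [...PUNCTUATORS].sort((a, b) => b.length - a.length);

// Hypothetical helper: returns the longest punctuator that starts at
// `pos` in `source`, or null if no punctuator starts there.
function matchPunctuator(source, pos) {
  for (const p of BY_LENGTH) {
    if (source.startsWith(p, pos)) return p;
  }
  return null;
}

console.log(matchPunctuator("a>>>=b", 1)); // ">>>="
console.log(matchPunctuator("x...y", 1));  // "..."
```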

Keywords

The following reserved words are used in JavaScript 2.0:

break case catch class const continue default delete do else eval export extends false final finally for function if implements import in instanceof interface new null package private public return static super switch this throw true try typeof var volatile while with

Of these, the only word that was not reserved in JavaScript 1.5 is eval.

The following words are reserved for future expansion:

abstract debugger enum goto native protected synchronized throws transient

The following words have special meaning in some contexts in JavaScript 2.0 but are not reserved and may be used as identifiers:

attribute constructor get language namespace set use
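
The three word lists above can be captured in a small lookup, sketched here in present-day JavaScript. The function names classifyWord and canBeIdentifier, and the category labels they return, are hypothetical and chosen only for illustration.

```javascript
const words = (list) => new Set(list.split(" "));

// Word lists copied from the text above.
const RESERVED = words(
  "break case catch class const continue default delete do else eval export " +
  "extends false final finally for function if implements import in " +
  "instanceof interface new null package private public return static super " +
  "switch this throw true try typeof var volatile while with"
);
const FUTURE_RESERVED = words(
  "abstract debugger enum goto native protected synchronized throws transient"
);
const NONRESERVED = words("attribute constructor get language namespace set use");

// Hypothetical helper: classifies a word according to the lists above.
function classifyWord(word) {
  if (RESERVED.has(word)) return "reserved";
  if (FUTURE_RESERVED.has(word)) return "future-reserved";
  if (NONRESERVED.has(word)) return "nonreserved";
  return "identifier";
}

// Nonreserved words keep their special meaning only in some contexts,
// so they remain usable as identifiers.
function canBeIdentifier(word) {
  const kind = classifyWord(word);
  return kind === "nonreserved" || kind === "identifier";
}

console.log(classifyWord("eval"));      // "reserved"
console.log(canBeIdentifier("get"));    // true
console.log(canBeIdentifier("class"));  // false
```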

Semicolon Insertion

The JavaScript 2.0 grammar explicitly makes semicolons optional in the following situations:

Before any }
Before the else of an if-else statement
Before the while of a do-while statement (but not before the while of a while statement)
Before the end of the program

Semicolons are optional in these situations even if they would construct empty statements. Strict mode has no effect on semicolon insertion in the above cases.

In addition, line breaks in the input stream are sometimes turned into VirtualSemicolon tokens. Specifically, if the first through the nth tokens of a JavaScript program are grammatically valid but the first through the (n+1)st tokens are not, and there is a line break (or a comment including a line break) between the nth token and the (n+1)st token, then the parser tries to parse the program again after inserting a VirtualSemicolon token between the nth and the (n+1)st tokens. This kind of VirtualSemicolon insertion does not occur in strict mode.
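
The VirtualSemicolon rule can be sketched as follows in present-day JavaScript. This is an illustration under stated assumptions, not the specified algorithm: tokens are plain objects with a hypothetical lineBreakBefore flag, and the real grammar is replaced by a toy prefix checker (isValidPrefix) that accepts only statements of the form id = num separated by semicolons. The name insertVirtualSemicolons is likewise hypothetical.

```javascript
const VIRTUAL_SEMICOLON = { type: "VirtualSemicolon", lineBreakBefore: false };

// Toy stand-in for the grammar (an assumption of this sketch): a program is a
// sequence of `id = num` statements separated by ";" or VirtualSemicolon.
function isValidPrefix(tokens) {
  const expected = ["id", "=", "num"];
  let state = 0; // index into `expected`; 3 means a statement is complete
  for (const t of tokens) {
    if (state === 3) {
      if (t.type === ";" || t.type === "VirtualSemicolon") state = 0;
      else return false;
    } else if (t.type === expected[state]) {
      state += 1;
    } else {
      return false;
    }
  }
  return true;
}

// Accept tokens one at a time. When token n+1 makes the prefix invalid and a
// line break precedes it, retry with a VirtualSemicolon inserted before it.
function insertVirtualSemicolons(tokens, isValid, strict = false) {
  const accepted = [];
  for (const token of tokens) {
    if (isValid([...accepted, token])) {
      accepted.push(token);
    } else if (!strict && token.lineBreakBefore &&
               isValid([...accepted, VIRTUAL_SEMICOLON, token])) {
      accepted.push(VIRTUAL_SEMICOLON, token);
    } else {
      throw new SyntaxError(`Unexpected token of type ${token.type}`);
    }
  }
  return accepted;
}

// Tokens for the two-line program "a = 1" / "b = 2".
const tok = (type, lineBreakBefore = false) => ({ type, lineBreakBefore });
const program = [tok("id"), tok("="), tok("num"),
                 tok("id", true), tok("="), tok("num")];
console.log(insertVirtualSemicolons(program, isValidPrefix)
  .map((t) => t.type).join(" "));
// id = num VirtualSemicolon id = num
```

Passing true as the third argument models strict mode, in which the same input is rejected instead of being repaired.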

Regular Expression Literals

Regular expression literals begin with a slash (/) character not immediately followed by another slash (two slashes start a line comment). As in JavaScript 1.5, regular expression literals are ambiguous with the division (/) and division-assignment (/=) tokens. The lexer treats a / or /= as a division or division-assignment token if either of these tokens would be allowed by the syntactic grammar as the next token; otherwise, the lexer treats a / or /= as starting a regular expression.

This unfortunate dependence of lexical parsing on grammatical parsing is inherited from JavaScript 1.5. See the regular expression syntax rationale for a discussion of the issues.
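
The disambiguation rule can be sketched as follows in present-day JavaScript. The function lexSlash and its divisionAllowed parameter are hypothetical: the flag stands in for the parser's answer to whether / or /= would be a legal next token. The regular expression scanning is deliberately simplified (body up to the next unescaped slash, then trailing flag letters) and assumes the caller has already ruled out comments.

```javascript
// Hypothetical helper: lexes the token that starts with the "/" at
// source[pos]. `divisionAllowed` is supplied by the parser.
function lexSlash(source, pos, divisionAllowed) {
  if (source[pos] !== "/") throw new Error("expected '/' at pos");
  if (divisionAllowed) {
    return source[pos + 1] === "="
      ? { type: "DivisionAssignment", text: "/=" }
      : { type: "Division", text: "/" };
  }
  // Otherwise scan a regular expression literal (simplified): the body runs
  // to the next unescaped "/" on the same line.
  let i = pos + 1;
  while (i < source.length && source[i] !== "/" && source[i] !== "\n") {
    i += source[i] === "\\" ? 2 : 1; // skip the character after a backslash
  }
  if (source[i] !== "/") throw new SyntaxError("Unterminated regular expression");
  i += 1; // consume the closing slash
  while (i < source.length && /[A-Za-z]/.test(source[i])) i += 1; // flags
  return { type: "RegularExpression", text: source.slice(pos, i) };
}

// After an operand, "/" can continue an expression, so it is a division.
console.log(lexSlash("a / b", 2, true).type);            // "Division"
// After "=", an operand is expected, so "/" starts a regular expression.
console.log(lexSlash("x = /ab\\/c/gi;", 4, false).text); // "/ab\/c/gi"
```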

Units

When a numeric literal is immediately followed by an optional underscore and an identifier, the lexer drops the underscore if it is present and converts the identifier to a string literal. The parser then treats the number and string as a unit expression. There are no reserved word restrictions on the identifier in this case; any identifier that begins with a letter will work, even if it is a reserved word.

For example, 3in and 3_in are both converted to 3 "in", and 5xena is converted to 5 "xena". On the other hand, 0xena is converted to 0xe "na". It is unwise to define unit names that begin with the letters e or E (either alone or followed by a decimal digit), or with x or X followed by a hexadecimal digit, because of potential ambiguities with exponential or hexadecimal notation.
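
The examples above can be reproduced with a small sketch in present-day JavaScript. It assumes the numeral is matched greedily before the unit is considered, which is what makes 0xena split as 0xe followed by na. The function lexNumberWithUnit and the three patterns are hypothetical, and how the sketch treats suffixes beginning with e or E is an assumption rather than specified behavior.

```javascript
// Simplified numeral and unit patterns (assumptions of this sketch).
const HEX = /^0[xX][0-9a-fA-F]+/;
const DECIMAL = /^(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?/;
const UNIT = /^_?([A-Za-z][A-Za-z0-9_$]*)$/;

// Hypothetical helper: splits text such as "3_in" into a numeric value and
// a unit string; `unit` is null when no unit follows the numeral.
function lexNumberWithUnit(text) {
  // Take the longest numeral first; whatever remains is the unit.
  const numeral = HEX.exec(text) || DECIMAL.exec(text);
  if (!numeral) throw new SyntaxError(`Not a numeric literal: ${text}`);
  const value = Number(numeral[0]);
  const rest = text.slice(numeral[0].length);
  if (rest === "") return { value, unit: null };
  // The optional underscore is dropped; the identifier becomes a string.
  const unit = UNIT.exec(rest);
  if (!unit) throw new SyntaxError(`Bad unit suffix: ${rest}`);
  return { value, unit: unit[1] };
}

for (const sample of ["3in", "3_in", "5xena", "0xena"]) {
  const { value, unit } = lexNumberWithUnit(sample);
  console.log(sample, "->", value, JSON.stringify(unit));
}
```

Because the reserved word lists are never consulted, 3in is accepted even though in is a reserved word, matching the rule stated above. Note that the value printed for 0xena is 14, the decimal value of 0xe.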

Waldemar Horwat
Last modified Saturday, April 29, 2000