ECMAScript 4 Netscape Proposal Core Language Lexer
Wednesday, June 4, 2003
This section presents an informal overview of the ECMAScript 4 lexer. See the stages and lexical semantics sections in the formal description chapter for the details.
The ECMAScript 4 lexer behaves in the same way as the ECMAScript 3 lexer except for the following:
}. In addition, the ECMAScript 4 parser allows semicolons to be omitted before the else of an if-else statement and before the while of a do-while statement.
ECMAScript 4 source text consists of a sequence of UTF-16 Unicode version 2.1 or later characters normalized to Unicode Normalization Form C (canonical composition), as described in Unicode Technical Report #15.
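The normalization requirement can be checked mechanically. The following is a minimal sketch in Python, not part of the proposal; the helper name is_nfc and the sample source strings are assumptions made for illustration. It uses the standard unicodedata module to test whether a piece of source text is already in Normalization Form C.

```python
import unicodedata

def is_nfc(source: str) -> bool:
    """Return True if the text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", source) == source

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: a decomposed spelling.
decomposed = "var caf\u0065\u0301 = 1;"
# Canonical composition folds the pair into the single character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(is_nfc(decomposed))
print(is_nfc(composed))
```

A tool that feeds text to an ECMAScript 4 lexer could normalize its input this way before lexing, or reject input for which is_nfc returns False.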
Comments and white space behave just as in ECMAScript 3.
The following ECMAScript 3 punctuation tokens are recognized in ECMAScript 4:
!   !=   !==
  %   %=   &
  &&   &=   (
  )   *   *=
  +   ++   +=
  ,   -   --
  -=   .   /
  /=   :   ::
  ;   <   <<
  <<=   <=   =
  ==   ===   >
  >=   >>   >>=
  >>>   >>>=   ?
  [   ]   ^
  ^=   {   |
  |=   ||   }
  ~
The following punctuation tokens are new in ECMAScript 4:
&&=   ...   ^^
  ^^=   ||=
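As an illustration of how the two token lists above combine, the following Python sketch recognizes a punctuation token at a given position by preferring the longest match. The longest-match rule is the usual convention for ECMAScript lexers and is assumed here rather than stated in this section; the names ES3_PUNCTUATORS, ES4_NEW_PUNCTUATORS, and next_punctuator are hypothetical. The sketch ignores the regular expression ambiguity of / and /=.

```python
# Token lists copied from the tables above.
ES3_PUNCTUATORS = [
    "!", "!=", "!==", "%", "%=", "&", "&&", "&=", "(", ")", "*", "*=",
    "+", "++", "+=", ",", "-", "--", "-=", ".", "/", "/=", ":", "::",
    ";", "<", "<<", "<<=", "<=", "=", "==", "===", ">", ">=", ">>", ">>=",
    ">>>", ">>>=", "?", "[", "]", "^", "^=", "{", "|", "|=", "||", "}", "~",
]
ES4_NEW_PUNCTUATORS = ["&&=", "...", "^^", "^^=", "||="]

# Longest tokens first, so that "&&=" is tried before "&&" and "&".
PUNCTUATORS = sorted(ES3_PUNCTUATORS + ES4_NEW_PUNCTUATORS, key=len, reverse=True)

def next_punctuator(text, pos=0):
    """Return the longest punctuation token starting at text[pos], or None."""
    for token in PUNCTUATORS:
        if text.startswith(token, pos):
            return token
    return None

print(next_punctuator("&&= b"))
print(next_punctuator("...args"))
```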
The following reserved words are used in ECMAScript 4:
as   break   case
  catch   class   const
  continue   default   delete
  do   else   export
  extends   false   finally
  for   function   if
  import   in   instanceof
  is   namespace   new
  null   package   private
  public   return   super
  switch   this   throw
  true   try   typeof
  use   var   void
  while   with
The following reserved words are reserved for future expansion:
abstract   debugger   enum
  goto   implements   interface
  native   protected   synchronized
  throws   transient   volatile
The following words have special meaning in some contexts in ECMAScript 4 but are not reserved and may be used as identifiers:
get   set
Any of the above keywords may be used as an identifier by including a \_ escape anywhere within the identifier,
which strips it of any keyword meaning. The two-, four-, and eight-digit hexadecimal escapes \xdd, \udddd,
and \Udddddddd may also be used in identifiers; these likewise strip the identifier of any keyword
meaning.
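The effect of escapes on keyword recognition can be sketched as follows. This Python fragment is only an illustration: the function classify and its result labels are hypothetical, and it classifies the raw lexeme without decoding the escapes into the identifier's final name.

```python
import re

RESERVED = {
    "as", "break", "case", "catch", "class", "const", "continue", "default",
    "delete", "do", "else", "export", "extends", "false", "finally", "for",
    "function", "if", "import", "in", "instanceof", "is", "namespace", "new",
    "null", "package", "private", "public", "return", "super", "switch",
    "this", "throw", "true", "try", "typeof", "use", "var", "void", "while",
    "with",
}
FUTURE_RESERVED = {
    "abstract", "debugger", "enum", "goto", "implements", "interface",
    "native", "protected", "synchronized", "throws", "transient", "volatile",
}
CONTEXTUAL = {"get", "set"}

# \_ plus the two-, four-, and eight-digit hexadecimal escapes.
ESCAPE = re.compile(r"\\(?:_|x[0-9A-Fa-f]{2}|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})")

def classify(lexeme: str) -> str:
    """Classify a raw identifier-like lexeme (labels are illustrative)."""
    if ESCAPE.search(lexeme):
        # Any escape strips the lexeme of keyword meaning.
        return "identifier"
    if lexeme in RESERVED:
        return "reserved"
    if lexeme in FUTURE_RESERVED:
        return "future-reserved"
    if lexeme in CONTEXTUAL:
        return "contextual"
    return "identifier"

print(classify("while"))
print(classify(r"wh\_ile"))
```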
The following words were reserved in ECMAScript 3 but are not reserved in ECMAScript 4:
boolean   byte   char
  double   final   float
  int   long   short
  static
The following words were not reserved in ECMAScript 3 but are reserved in ECMAScript 4:
as   is   namespace
  use
The ECMAScript 4 syntactic grammar explicitly makes semicolons optional in the following situations:
- Before a }
- Before the else of an if-else statement
- Before the while of a do-while statement (but not before the while of a while statement)
Semicolons are optional in these situations even if they would construct empty statements. Strict mode has no effect on semicolon insertion in the above cases.
In addition, line breaks in the input stream are sometimes turned into VirtualSemicolon tokens. Specifically, if the first through the nth tokens of an ECMAScript program are grammatically valid but the first through the n+1st tokens are not, and there is a line break (or a comment including a line break) between the nth token and the n+1st token, then the parser tries to parse the program again after inserting a VirtualSemicolon token between the nth and the n+1st tokens. This kind of VirtualSemicolon insertion does not occur in strict mode.
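The VirtualSemicolon rule can be sketched as a loop over tokens that consults the grammar after each one. The Python fragment below is a simplified illustration under stated assumptions: the Token type, the is_valid_prefix callback (standing in for the real parser), and the toy grammar toy_prefix_ok are all hypothetical and far smaller than the actual ECMAScript 4 grammar.

```python
from typing import Callable, List, NamedTuple

class Token(NamedTuple):
    text: str
    newline_before: bool = False  # line break between previous token and this one

VIRTUAL_SEMICOLON = Token("VirtualSemicolon")

def insert_virtual_semicolons(tokens: List[Token],
                              is_valid_prefix: Callable[[List[Token]], bool],
                              strict: bool = False) -> List[Token]:
    """Insert VirtualSemicolon tokens where a line break rescues the parse."""
    out: List[Token] = []
    for tok in tokens:
        if not is_valid_prefix(out + [tok]):
            # Tokens 1..n are valid, 1..n+1 are not: retry only across a
            # line break, and never in strict mode.
            if strict or not tok.newline_before:
                raise SyntaxError("unexpected token " + tok.text)
            out.append(VIRTUAL_SEMICOLON)
            if not is_valid_prefix(out + [tok]):
                raise SyntaxError("unexpected token " + tok.text)
        out.append(tok)
    return out

def toy_prefix_ok(toks: List[Token]) -> bool:
    """Toy grammar: (NAME "=" NAME (";" | VirtualSemicolon))*"""
    for i, t in enumerate(toks):
        slot = i % 4
        if slot in (0, 2) and not t.text.isalnum():
            return False
        if slot == 1 and t.text != "=":
            return False
        if slot == 3 and t.text not in (";", "VirtualSemicolon"):
            return False
    return True

# Tokens for the two-line program:  a = 1 <line break> b = 2;
program = [Token("a"), Token("="), Token("1"),
           Token("b", newline_before=True), Token("="), Token("2"), Token(";")]
result = insert_virtual_semicolons(program, toy_prefix_ok)
print([t.text for t in result])
```

In this toy run a VirtualSemicolon is inserted before b because b follows a line break; the same token stream with strict=True, or with b on the same line as 1, raises SyntaxError instead.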
See also the semicolon insertion syntax rationale.
The syntax for numeric literals is the same as in ECMAScript 3, with the addition of long, ulong,
and float numeric literals. The rules for numeric literals are as follows:
- A numeric literal with the suffix l or L is interpreted as a long value and must be a decimal or hexadecimal constant without an exponent or decimal point and be in the range of 0 through 2^63; furthermore, if the value is exactly 2^63 then the literal can only be used as the operand of the - unary negation operator.
- A numeric literal with the suffix ul, uL, Ul, or UL is interpreted as a ulong value and must be a decimal or hexadecimal constant without an exponent or decimal point and be in the range of 0 through 2^64 - 1.
- A numeric literal with the suffix f or F is interpreted as a float value and must be a decimal constant. Hexadecimal float constants are not permitted because the suffix would be interpreted as a hexadecimal digit.
The suffix must be adjacent to the number with no intervening white space. A number may not be followed by an identifier without intervening white space.
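The suffix rules can be sketched as a small classifier. The Python fragment below is an illustration only: the regular expression is a simplified approximation of ECMAScript 3 numeric literal syntax, the function name classify_literal and its result labels are hypothetical, and the restriction that 2^63 may appear only under unary negation is left to the parser.

```python
import re

LITERAL = re.compile(
    r"(?P<body>0[xX][0-9a-fA-F]+|(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)"
    r"(?P<suffix>[uU][lL]|[lL]|[fF])?"
)

def classify_literal(text: str) -> str:
    """Return 'long', 'ulong', 'float', or 'unsuffixed' for a numeric literal."""
    m = LITERAL.fullmatch(text)
    if m is None:
        # Covers white space before the suffix and a trailing identifier.
        raise ValueError("not a numeric literal: " + text)
    body = m.group("body")
    suffix = (m.group("suffix") or "").lower()
    is_hex = body[:2].lower() == "0x"
    if suffix == "":
        return "unsuffixed"
    if suffix == "f":
        # Never hexadecimal: a trailing f after 0x... is consumed as a digit.
        return "float"
    if not (is_hex or body.isdigit()):
        raise ValueError("long and ulong literals take no exponent or decimal point")
    value = int(body, 16) if is_hex else int(body)
    if suffix == "l":
        if value > 2**63:  # exactly 2**63 is legal only under unary negation
            raise ValueError("long literal out of range")
        return "long"
    if value > 2**64 - 1:
        raise ValueError("ulong literal out of range")
    return "ulong"

print(classify_literal("10L"))
print(classify_literal("0x1f"))
```

Note that 0x1f carries no suffix in this sketch, which mirrors the reason given above for disallowing hexadecimal float constants.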
Regular expression literals begin with a slash (/) character not immediately followed by another slash (two
slashes start a line comment). As in ECMAScript 3, regular expression literals are ambiguous with the division (/)
or division-assignment (/=) tokens. The lexer treats a / or /= as a division or division-assignment
token if either of these tokens would be allowed by the syntactic grammar as the next token; otherwise, the lexer treats a
/ or /= as starting a regular expression.
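The decision described above can be sketched as a function that receives, from the parser, the set of tokens the syntactic grammar would accept next. This Python fragment is an illustration only; the name classify_slash, the allowed_next parameter, and the result labels are hypothetical.

```python
def classify_slash(text: str, pos: int, allowed_next: set) -> str:
    """Decide what a '/' at text[pos] starts, given the parser's expectations."""
    if text[pos] != "/":
        raise ValueError("no slash at position %d" % pos)
    if text.startswith("//", pos):
        return "line-comment"
    if text.startswith("/*", pos):
        return "block-comment"
    # Division wins whenever the grammar would accept / or /= here.
    if "/" in allowed_next or "/=" in allowed_next:
        return "/=" if text.startswith("/=", pos) else "/"
    return "regexp"

# After an operand the grammar allows a binary operator, so / is division.
print(classify_slash("a / b", 2, {"/", "/=", "+", "-"}))
# After = the grammar expects an expression, so / starts a regular expression.
print(classify_slash("x = /ab/g", 4, {"identifier", "("}))
```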
This unfortunate dependence of lexical parsing on grammatical parsing is inherited from ECMAScript 3. See the regular expression syntax rationale for a discussion of the issues.
Waldemar Horwat. Last modified Wednesday, June 4, 2003