July 2000 Draft
JavaScript 2.0
Core Language
Lexer
Saturday, April 29, 2000
This section presents an informal overview of the JavaScript 2.0 lexer. See the stages and lexer semantics sections in the formal description chapter for the details.
The JavaScript 2.0 lexer behaves in the same way as the JavaScript 1.5 lexer except for the following:
}. In addition, the JavaScript 2.0 parser allows semicolons to be omitted before the else of an if-else statement and before the while of a do-while statement.
JavaScript 2.0 source text consists of a sequence of UTF-16 Unicode version 2.1 or later characters normalized to Unicode Normalization Form C (canonical composition), as described in Unicode Technical Report #15.
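As a small illustration of the normalization requirement, the sketch below checks whether a piece of source text is already in Normalization Form C. It relies on String.prototype.normalize, which postdates this draft and is assumed to be available in the host environment; the helper name isNormalizedNFC is hypothetical.

```javascript
// Hypothetical helper: true if the text is already in Normalization Form C.
function isNormalizedNFC(sourceText) {
  return sourceText.normalize("NFC") === sourceText;
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT is a decomposed sequence;
// canonical composition replaces the pair with the single character U+00E9.
const decomposed = "caf\u0065\u0301";
const composed = decomposed.normalize("NFC");

console.log(isNormalizedNFC(decomposed)); // false
console.log(isNormalizedNFC(composed));   // true
console.log(composed === "caf\u00E9");    // true
```

A tool that feeds text to the lexer could apply normalize("NFC") up front rather than reject input that is not yet normalized; which of the two is appropriate is a design choice this section does not settle.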
Comments and white space behave just like in JavaScript 1.5.
The following JavaScript 1.5 punctuation tokens are recognized in JavaScript 2.0:
!
!=
!==
%
%=
&
&&
&=
(
)
*
*=
+
++
+=
,
-
--
-=
.
/
/=
:
::
;
<
<<
<<=
<=
=
==
===
>
>=
>>
>>=
>>>
>>>=
?
[
]
^
^=
{
|
|=
||
}
~
The following punctuation tokens are new in JavaScript 2.0:
#
&&=
->
..
...
@
^^
^^=
||=
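The two token lists can be exercised with a small longest-match scanner. This is only a sketch: it assumes the usual ECMAScript convention that the lexer takes the longest possible token at each position, it ignores the regular expression ambiguity of /, and the function names readPunctuator and splitPunctuation are hypothetical.

```javascript
// Punctuation tokens copied from the two lists above.
const PUNCTUATORS = [
  "!", "!=", "!==", "%", "%=", "&", "&&", "&=", "(", ")", "*", "*=",
  "+", "++", "+=", ",", "-", "--", "-=", ".", "/", "/=", ":", "::", ";",
  "<", "<<", "<<=", "<=", "=", "==", "===", ">", ">=", ">>", ">>=",
  ">>>", ">>>=", "?", "[", "]", "^", "^=", "{", "|", "|=", "||", "}", "~",
  // New in JavaScript 2.0:
  "#", "&&=", "->", "..", "...", "@", "^^", "^^=", "||=",
].sort((a, b) => b.length - a.length); // try the longest candidates first

// Returns the longest punctuation token starting at pos, or null.
function readPunctuator(text, pos) {
  for (const candidate of PUNCTUATORS) {
    if (text.startsWith(candidate, pos)) return candidate;
  }
  return null;
}

// Splits a string made up only of punctuation into tokens.
function splitPunctuation(text) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    const token = readPunctuator(text, pos);
    if (token === null) throw new Error("not a punctuation token at " + pos);
    tokens.push(token);
    pos += token.length;
  }
  return tokens;
}

console.log(splitPunctuation("^^="));  // [ '^^=' ]
console.log(splitPunctuation("...."));  // [ '...', '.' ]
console.log(splitPunctuation("->>"));   // [ '->', '>' ]
```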
The following reserved words are used in JavaScript 2.0:
break
case
catch
class
const
continue
default
delete
do
else
eval
export
extends
false
final
finally
for
function
if
implements
import
in
instanceof
interface
new
null
package
private
public
return
static
super
switch
this
throw
true
try
typeof
var
volatile
while
with
Out of these, the only word that was not reserved in JavaScript 1.5 is eval.
The following reserved words are reserved for future expansion:
abstract
debugger
enum
goto
native
protected
synchronized
throws
transient
The following words have special meaning in some contexts in JavaScript 2.0 but are not reserved and may be used as identifiers:
attribute
constructor
get
language
namespace
set
use
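The three word lists above can be turned into a simple lookup. The sketch below is illustrative only; the names classifyWord and mayBeIdentifier and the category strings are hypothetical, and the check covers only the word lists, not the full identifier syntax.

```javascript
// Word lists copied from the text above.
const RESERVED = new Set((
  "break case catch class const continue default delete do else eval " +
  "export extends false final finally for function if implements import in instanceof " +
  "interface new null package private public return static super switch this throw true " +
  "try typeof var volatile while with").split(" "));
const FUTURE_RESERVED = new Set(
  "abstract debugger enum goto native protected synchronized throws transient".split(" "));
const CONTEXTUAL = new Set(
  "attribute constructor get language namespace set use".split(" "));

// Reports which of the lists above a word belongs to.
function classifyWord(word) {
  if (RESERVED.has(word)) return "reserved";
  if (FUTURE_RESERVED.has(word)) return "reserved for future expansion";
  if (CONTEXTUAL.has(word)) return "contextual";
  return "identifier";
}

// Contextual words are not reserved, so they remain usable as identifiers.
function mayBeIdentifier(word) {
  const kind = classifyWord(word);
  return kind === "contextual" || kind === "identifier";
}

console.log(classifyWord("eval"));      // reserved
console.log(classifyWord("goto"));      // reserved for future expansion
console.log(mayBeIdentifier("get"));    // true
console.log(mayBeIdentifier("class"));  // false
```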
The JavaScript 2.0 grammar explicitly makes semicolons optional in the following situations:
Before a closing }
Before the else of an if-else statement
Before the while of a do-while statement (but not before the while of a while statement)
Semicolons are optional in these situations even if they would construct empty statements. Strict mode has no effect on semicolon insertion in the above cases.
In addition, line breaks in the input stream are sometimes turned into VirtualSemicolon tokens. Specifically, if the first through the nth tokens of a JavaScript program are grammatically valid but the first through the (n+1)st tokens are not, and there is a line break (or a comment including a line break) between the nth token and the (n+1)st token, then the parser tries to parse the program again after inserting a VirtualSemicolon token between the nth and the (n+1)st tokens. This kind of VirtualSemicolon insertion does not occur in strict mode.
See also the semicolon insertion syntax rationale.
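The line-break rule above can be sketched as a retry loop around a prefix check. Everything here is an assumption made for illustration: the token shape ({ text, lineBreakBefore }), the helper names, and above all the toy grammar, which only accepts statements of the form name = number ; and stands in for the real syntactic grammar.

```javascript
const VIRTUAL_SEMICOLON = { text: "<VirtualSemicolon>", lineBreakBefore: false };

// Toy grammar (assumption): a program is a run of "name = number ;" statements.
// Reports whether the tokens could be the beginning of such a program.
function isValidPrefix(tokens) {
  const expected = [
    (t) => /^[A-Za-z]+$/.test(t.text),
    (t) => t.text === "=",
    (t) => /^[0-9]+$/.test(t.text),
    (t) => t.text === ";" || t === VIRTUAL_SEMICOLON,
  ];
  return tokens.every((t, i) => expected[i % 4](t));
}

// Whitespace-separated toy tokenizer that records line breaks.
function toTokens(source) {
  const tokens = [];
  source.split("\n").forEach((line) => {
    line.trim().split(/\s+/).filter(Boolean).forEach((text, i) => {
      tokens.push({ text, lineBreakBefore: i === 0 && tokens.length > 0 });
    });
  });
  return tokens;
}

function insertVirtualSemicolons(tokens, validPrefix, strict = false) {
  const result = tokens.slice();
  let n = 0; // number of leading tokens known to form a valid prefix
  while (n < result.length) {
    if (validPrefix(result.slice(0, n + 1))) {
      n += 1;
      continue;
    }
    // Tokens 1..n are valid but tokens 1..n+1 are not.
    const offending = result[n];
    if (strict || n === 0 || !offending.lineBreakBefore) {
      throw new SyntaxError("unexpected token " + offending.text);
    }
    // Retry with a VirtualSemicolon between the nth and (n+1)st tokens.
    result.splice(n, 0, VIRTUAL_SEMICOLON);
    if (!validPrefix(result.slice(0, n + 2))) {
      throw new SyntaxError("unexpected token " + offending.text);
    }
    n += 2;
  }
  return result;
}

const parsed = insertVirtualSemicolons(toTokens("a = 1\nb = 2 ;"), isValidPrefix);
console.log(parsed.map((t) => t.text).join(" "));
// a = 1 <VirtualSemicolon> b = 2 ;
```

With the same input and strict set to true, or with "a = 1 b = 2 ;" on a single line, the sketch throws a SyntaxError instead of inserting a token.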
Regular expression literals begin with a slash (/) character not immediately followed by another slash (two slashes start a line comment). Like in JavaScript 1.5, regular expression literals are ambiguous with the division (/) or division-assignment (/=) tokens. The lexer treats a / or /= as a division or division-assignment token if either of these tokens would be allowed by the syntactic grammar as the next token; otherwise, the lexer treats a / or /= as starting a regular expression.
This unfortunate dependence of lexical parsing on grammatical parsing is inherited from JavaScript 1.5. See the regular expression syntax rationale for a discussion of the issues.
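The decision described above can be sketched as follows. A real lexer would ask the parser whether a division token is allowed next; the function divisionAllowedAfter below is only a rough stand-in that guesses from the previous token, so it is a heuristic assumed for illustration and not the rule stated in this section. All names in the sketch are hypothetical.

```javascript
// Words after which an operand, not an operator, is expected (partial list).
const OPERAND_EXPECTING_WORDS = new Set([
  "return", "typeof", "delete", "in", "instanceof", "new", "throw",
  "case", "do", "else",
]);

// Heuristic stand-in for the parser query: division is plausible only after
// a token that can end an expression.
function divisionAllowedAfter(previousToken) {
  if (previousToken === null) return false;               // start of program
  if (/^[0-9]/.test(previousToken)) return true;          // numeric literal
  if (previousToken === ")" || previousToken === "]") return true;
  if (/^[A-Za-z_$]/.test(previousToken)) {
    return !OPERAND_EXPECTING_WORDS.has(previousToken);   // identifier, this, ...
  }
  return false;                                           // other punctuation
}

// Classifies the slash at text[pos], given the previous significant token.
function classifySlash(text, pos, previousToken) {
  if (text[pos] !== "/") throw new Error("no slash at position " + pos);
  if (text[pos + 1] === "/") return "line comment";
  if (text[pos + 1] === "*") return "block comment";
  if (divisionAllowedAfter(previousToken)) {
    return text[pos + 1] === "=" ? "division-assignment" : "division";
  }
  return "regular expression";
}

console.log(classifySlash("a / b", 2, "a"));            // division
console.log(classifySlash("a /= 2", 2, "a"));           // division-assignment
console.log(classifySlash("x = /ab+c/", 4, "="));       // regular expression
console.log(classifySlash("return /re/", 7, "return")); // regular expression
```

The heuristic is wrong in some cases that only the parser can resolve, for example after a closing }, which is one reason the dependence on the syntactic grammar exists.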
When a numeric literal is immediately followed by an optional underscore and an identifier, the lexer drops the underscore if it is present and converts the identifier to a string literal. The parser then treats the number and string as a unit expression. There are no reserved word restrictions on the identifier in this case; any identifier that begins with a letter will work, even if it is a reserved word.
For example, 3in and 3_in are both converted to 3 "in". 5xena is converted to 5 "xena". On the other hand, 0xena is converted to 0xe "na".
It is unwise to define unit names that begin with the letters e or E either alone or followed by a decimal digit, or x or X followed by a hexadecimal digit because of potential ambiguities with exponential or hexadecimal notation.
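The unit rule and its pitfalls can be reproduced with a short sketch. It assumes a simplified numeric literal syntax (hexadecimal, or decimal with an optional fraction and exponent) and ASCII-only unit names; the function name lexNumberWithUnit and the shape of its result are hypothetical.

```javascript
// Splits text such as "3_in" into a numeric literal and a unit name.
// Returns null if the text is not a number optionally followed by a unit.
function lexNumberWithUnit(text) {
  // The number is matched greedily: hexadecimal first, then decimal with an
  // optional fraction and an optional exponent.
  const numberPattern =
    /^(?:0[xX][0-9a-fA-F]+|(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?)/;
  const number = numberPattern.exec(text);
  if (number === null) return null;

  const rest = text.slice(number[0].length);
  if (rest === "") return { number: number[0], unit: null };

  // Optional underscore (dropped), then an identifier starting with a letter.
  const unit = /^_?([A-Za-z][A-Za-z0-9_$]*)$/.exec(rest);
  if (unit === null) return null;
  return { number: number[0], unit: unit[1] };
}

console.log(lexNumberWithUnit("3in"));   // { number: '3', unit: 'in' }
console.log(lexNumberWithUnit("3_in"));  // { number: '3', unit: 'in' }
console.log(lexNumberWithUnit("5xena")); // { number: '5', unit: 'xena' }
console.log(lexNumberWithUnit("0xena")); // { number: '0xe', unit: 'na' }
console.log(lexNumberWithUnit("3e5"));   // { number: '3e5', unit: null }
```

The last two calls show the ambiguity: the number pattern consumes 0xe and e5 before any unit name is considered, so a unit named xena or e5 cannot be written directly after a 0 or a 3.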
Waldemar Horwat Last modified Saturday, April 29, 2000