ECMAScript 4 Netscape Proposal
Rationale
Syntax
Tuesday, November 19, 2002
This section presents a number of syntactic alternatives that were considered while developing this proposal.
The term semicolon insertion informally refers to the ability to write programs while omitting semicolons between statements. In both ECMAScript 3 and ECMAScript 4 there are two kinds of semicolon insertion:
Semicolons before a closing } and the end of the program are optional in both ECMAScript 3 and ECMAScript 4. In addition, the ECMAScript 4 parser allows semicolons to be omitted before the else of an if-else statement and before the while of a do-while statement.
Grammatical semicolon insertion is implemented directly by the syntactic grammar’s productions, which simply do not require a semicolon in the aforementioned cases. Line breaks in the source code are not relevant to grammatical semicolon insertion.
Line-break semicolon insertion cannot be easily implemented in the syntactic grammar. This kind of semicolon insertion turns a syntactically incorrect program into a correct program and relies on line breaks in the source code.
Grammatical semicolon insertion is harmless. On the other hand, line-break semicolon insertion suffers from the following problems:
The first problem presents difficulty for some preprocessors such as the one for XML attributes which turn line breaks into spaces. The second and third ones are more serious. Programmers are confused when they discover that the program
a = b + c
(d + e).print()
doesn’t do what they expect:
a = b + c;
(d + e).print();
Instead, that program is parsed as:
a = b + c(d + e).print();
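The surprising parse can still be observed in present-day JavaScript engines, which retain the ECMAScript 3 line-break rule. The following is a minimal sketch rather than anything taken from the proposal: the helper name runTwoLineProgram, the sample values, and the stub function standing in for c are all assumptions made for illustration. It evaluates the two-line program and records whether c was called.

```javascript
// Hypothetical helper (not from the proposal): evaluates the two-line
// program from the text and reports how it was actually parsed.
function runTwoLineProgram() {
  const calls = [];
  // Stand-in for `c`: a function, so that a call to it can be observed.
  const c = function (arg) {
    calls.push(arg);
    return { print() { return 10; } };
  };
  // The line break after `c` does not trigger semicolon insertion,
  // because `c (d + e)` is a grammatically valid continuation.
  const program = new Function(
    "b", "c", "d", "e",
    "var a = b + c\n(d + e).print()\nreturn a;"
  );
  const a = program(1, c, 2, 3);
  return { a, calls };
}

const outcome = runTwoLineProgram();
console.log(outcome.calls.length); // 1: `c` was called with d + e
```

If a semicolon had been inserted after the first line, the calls array would be empty and a would be the sum of b and the function c rather than a number.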
The third problem is the most serious. New features added to the language turn illegal syntax into legal syntax. If an existing program relies on the illegal syntax to trigger line-break semicolon insertion, then the program will silently change behavior once the feature is added. For example, the juxtaposition of a numeric literal followed by a string literal (such as 4 "in") is illegal in ECMAScript 3. ECMAScript 4 makes this legal syntax for expressions with units. This syntax extension has the unfortunate consequence of silently changing the meaning of the following ECMAScript 3 program:
a = b + 4
"in".print()
from:
a = b + 4;
"in".print();
to:
a = b + 4"in".print();
ECMAScript 4 gets around this incompatibility by adding a [no line break] restriction in the grammar that requires the numeric and string literals to be on the same line. Unfortunately, this compatibility is a double-edged sword. Due to ECMAScript 3 compatibility, ECMAScript 4 has to have a large number of these [no line break] restrictions. It is hard to remember all of them, and forgetting one of them often silently causes an ECMAScript 4 program to be reinterpreted. Some programmers will be dismayed to find that:
internal
function f(x) {return x*x}
turns into:
internal;
function f(x) {return x*x}
(where internal; is an expression statement) instead of:
internal function f(x) {return x*x}
An earlier version of ECMAScript 4 disallowed line-break semicolon insertion. The current version allows it but only in non-strict mode. Strict mode removes all [no line break] restrictions, simplifying the language again. As a side effect, it is possible to write a program that does different things in strict and non-strict modes (the last example above is one such program), but this is the price to pay to achieve simplicity.
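A loosely analogous [no line break] restriction exists in present-day JavaScript, where a line break between async and function changes the parse in the same way as the internal example above. The sketch below is an analogy only and is not ECMAScript 4 code; the variable names are assumptions made for illustration, and async is passed as a parameter merely so that the stand-alone expression statement has something to evaluate.

```javascript
// Analogy only: `async` in present-day JavaScript, not `internal` in
// ECMAScript 4. Both bodies declare f and return the result of calling it.
const splitAcrossLines = new Function(
  "async",
  "async\nfunction f() { return 1 }\nreturn f();"
);
const onOneLine = new Function(
  "async function f() { return 1 }\nreturn f();"
);

// With the line break, `async` is parsed as an expression statement on its
// own (it names the parameter here), and f is an ordinary function.
const plainResult = splitAcrossLines("unused value");

// Without the line break, f is an async function and returns a Promise.
const asyncResult = onOneLine();

console.log(typeof plainResult);             // "number"
console.log(asyncResult instanceof Promise); // true
```

The two programs differ only in one line break, yet they behave differently and neither reports an error, which is the kind of silent reinterpretation described above.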
ECMAScript 4 retains compatibility with ECMAScript 3 by adopting the same rules for detecting regular expression literals. This complicates the design of programs such as syntax-directed text editors and machine scanners because it makes it impossible to find all of the tokens in an ECMAScript program without parsing the program.
Making ECMAScript 4’s lexical grammar independent of its syntactic grammar would have allowed tools to easily process an ECMAScript program and escape all instances of, say, </ to properly embed an ECMAScript 4 or later program in an HTML page. The full parser changes for each version of ECMAScript. To illustrate the difficulties, compare such ECMAScript 3 gems as:
for (var x = a in foo && "</x>" || mot ? z:/x:3;x<5;y</g/i) {xyz(x++);}
for (var x = a in foo && "</x>" || mot ? z/x:3;x<5;y</g/i) {xyz(x++);}
One idea explored early in the design of ECMAScript 4 was providing an alternate, unambiguous syntax for regular expressions and encouraging the use of the new syntax. A RegularExpression could have been specified unambiguously using « and » as its opening and closing delimiters instead of / and /.
For example, «3*» would be a regular expression that matches zero or more 3’s. Such a regular expression could be empty: «» is a regular expression that matches only the empty string, while // starts a comment. To write such a regular expression using the slash syntax one needs to write /(?:)/.
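The slash-syntax workaround for the empty pattern can be checked in a present-day JavaScript engine. This is a small illustrative sketch; the variable names are assumptions and nothing here comes from the proposal itself.

```javascript
// In slash syntax, `//` begins a comment, so the empty pattern is
// spelled /(?:)/ instead.
const emptyPattern = /(?:)/;

// The empty pattern matches the empty string at the start of any input.
const match = emptyPattern.exec("abc");
console.log(match[0] === "" && match.index === 0); // true

// Engines print an empty pattern the same way, to avoid producing `//`.
console.log(new RegExp("").source); // "(?:)"
```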
Syntactic resynchronization occurs when the lexer needs to find the end of a block (the matching }) in order to skip a portion of a program written in a future version of ECMAScript. Ordinarily this would not be a problem, but regular expressions complicate matters because they make lexing dependent on parsing. The rules for recognizing regular expression literals must be changed for those portions of the program. The rule below might work, or a simplified parse might be performed on the input to determine the locations of regular expressions. This is an area that needs further work.
During syntax resynchronization ECMAScript 4 determines whether a / starts a regular expression or is a division (or /=) operator solely based on the previous token:
| / interpretation | Previous token |
|---|---|
| / or /= | Identifier Number RegularExpression String ) ++ -- ] } class false null private protected public super this true get set Any other punctuation |
| RegularExpression | ! != !== # % %= & && &&= &= ( * *= + += , - -= -> . .. ... / /= : :: ; < << <<= <= = == === > >= >> >>= >>> >>>= ? @ [ ^ ^= ^^ ^^= { \| \|= \|\| \|\|= ~ abstract break case catch const continue debugger default delete do else enum export extends final finally for function goto if implements import in instanceof interface is namespace native new package return static switch synchronized throw throws transient try typeof use var volatile while with |
Regardless of the previous token, // is interpreted as the beginning of a comment.
The only controversial choices are ) and }. A / after either a ) or } token can be either a division symbol (if the ) or } closes a subexpression or an object literal) or a regular expression token (if the ) or } closes a preceding statement or an if, while, or for expression). Having / be interpreted as a RegularExpression in expressions such as (x+y)/2 would be problematic, so it is interpreted as a division operator after ) or }. If one wants to place a regular expression literal at the very beginning of an expression statement, it’s best to put the regular expression in parentheses. Fortunately, this is not common since one usually assigns the result of the regular expression operation to a variable.
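The previous-token rule described above can be sketched as a lookup. This is an illustrative sketch only: the function name slashStartsRegExp and the { kind, text } token shape are assumptions, not part of the proposal, and the sketch assumes the lexer has already handled // as a comment and always supplies a previous token (the table does not say what happens at the start of the input).

```javascript
// Tokens after which `/` is read as the start of a RegularExpression,
// copied from the table in the text.
const REGEXP_AFTER_PUNCTUATION = new Set([
  "!", "!=", "!==", "#", "%", "%=", "&", "&&", "&&=", "&=", "(", "*",
  "*=", "+", "+=", ",", "-", "-=", "->", ".", "..", "...", "/", "/=",
  ":", "::", ";", "<", "<<", "<<=", "<=", "=", "==", "===", ">", ">=",
  ">>", ">>=", ">>>", ">>>=", "?", "@", "[", "^", "^=", "^^", "^^=", "{",
  "|", "|=", "||", "||=", "~",
]);
const REGEXP_AFTER_KEYWORD = new Set([
  "abstract", "break", "case", "catch", "const", "continue", "debugger",
  "default", "delete", "do", "else", "enum", "export", "extends", "final",
  "finally", "for", "function", "goto", "if", "implements", "import", "in",
  "instanceof", "interface", "is", "namespace", "native", "new", "package",
  "return", "static", "switch", "synchronized", "throw", "throws",
  "transient", "try", "typeof", "use", "var", "volatile", "while", "with",
]);

// Hypothetical token shape: { kind, text }, where kind is one of
// "identifier", "number", "string", "regexp", "keyword", "punctuation".
function slashStartsRegExp(prevToken) {
  switch (prevToken.kind) {
    case "keyword":
      // class, false, null, private, ... are not in the set: division.
      return REGEXP_AFTER_KEYWORD.has(prevToken.text);
    case "punctuation":
      // ), ++, --, ], } and any unlisted punctuation: division.
      return REGEXP_AFTER_PUNCTUATION.has(prevToken.text);
    default:
      // Identifier, Number, String, RegularExpression: division.
      return false;
  }
}

// In (x+y)/2 the previous token is `)`, so `/` is a division operator.
console.log(slashStartsRegExp({ kind: "punctuation", text: ")" })); // false
// After `return`, `/` starts a regular expression.
console.log(slashStartsRegExp({ kind: "keyword", text: "return" })); // true
```

Because the decision uses only one token of context, it cannot distinguish a } that closes an object literal from one that closes a statement, which is why the text settles on division for both.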
The current ECMAScript 4 proposal uses Pascal-style colons to introduce types in declarations. For example:
var x:Integer = 7;
function square(a:Number):Number {return a*a}
This is due to a consensus decision of the ECMA working group, with Waldemar the only dissenter. There are a couple of alternative syntaxes:
We could allow modified C-style type declarations as long as a function’s return type is listed after its parameters:
var Integer x = 7;
var Integer y = 8, Integer z = 9; // Declares two Integer variables
function square(Number a) Number {return a*a}
A function’s return type cannot be listed before the parameters because this would make the grammar ambiguous.
In fact, an implementation could unambiguously admit both the Pascal-style and the modified C-style declarations by replacing the TypedIdentifier and Result grammar rules with the ones listed below. The resulting grammar is still LALR(1).
Advantages of using the modified C-style syntax include:
{a:17, b:33}
).
The latter would present a conundrum if we ever wanted to declare field types in an object literal. Some programmers have been using these as a convenient facility for passing named arguments to functions.
Since attributes are simple expressions, we could allow attributes that evaluate to types. For var and const declarations, these attributes would specify the type of the declared variables. For function declarations, these attributes would specify the function’s return type. For stylistic consistency, types of arguments would also be listed before their identifiers.
Integer var x = 7;
Integer var y = 8, z = 9; // Declares two Integer variables
Number function square(Number a) {return a*a}
This style is simple and reads fairly naturally.
Again, an implementation could unambiguously admit both the Pascal-style and the attribute-style declarations, with the resulting grammar still being LALR(1). However, it would be better for the language to make a choice rather than propagate the confusion of having two or three styles; this flexibility could be used for compatibility with existing programs.
ECMAScript 4 uses the same syntax for type expressions as for value expressions for the following reasons:
In (expr1)(expr2), is expr1 a type or a value expression? If the two have the same syntax, it doesn’t matter.
By consensus in the ECMA TC39 modularity subcommittee, we decided to use the syntax function get id (...) rather than getter function id (...) for defining a getter and function set id (...) rather than setter function id (...) for defining a setter. The latter would have simplified the FunctionName rule to:
while creating two additional attributes, getter and setter. The decision was based on aesthetics; neither syntax is more difficult to implement than the other.
An alternative to pragmas that was considered early was to report syntax errors at the time the relevant statement was executed rather than at the time it was parsed. This way a single program could include parts written in a future version of ECMAScript without getting an error unless it tries to execute those portions on a system that does not understand that version of ECMAScript. If a program part that contains an error is never executed, the error never breaks the script. For example, the following function finishes successfully if whizBangFeature is false:
function move(x:Integer, y:Integer, d:Integer) {
  x += 10;
  y += 3;
  if (whizBangFeature) {
    simulate{@x and #y} along path
  } else {
    x += d; y += d;
  }
  return [x,y];
}
The code simulate{@x and #y} along path is a syntax error, but this error does not break the script unless the script attempts to execute that piece of code.
One problem with this approach is that it frustrates debugging; a script author benefits from knowing about syntax errors at compile time rather than at run time.
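The deferred-error idea can be approximated in present-day JavaScript by keeping the future-version statement as source text and compiling it only when the branch is reached. This is a rough sketch under stated assumptions, not the mechanism the proposal considered: the type annotations are dropped, whizBangFeature is passed as a parameter instead of being a global, and the constant FUTURE_SOURCE is a name introduced here for illustration.

```javascript
// Hypothetical stand-in for the future-version statement in the text.
const FUTURE_SOURCE = "simulate{@x and #y} along path";

function move(x, y, d, whizBangFeature) {
  x += 10;
  y += 3;
  if (whizBangFeature) {
    // The source text is parsed only here, so a SyntaxError surfaces at
    // run time and only when this branch is taken.
    new Function(FUTURE_SOURCE)();
  } else {
    x += d; y += d;
  }
  return [x, y];
}

// Finishes normally: the unparseable branch is never compiled.
console.log(move(1, 2, 3, false));

// Fails only now, at run time, with a SyntaxError.
try {
  move(1, 2, 3, true);
} catch (error) {
  console.log(error instanceof SyntaxError); // true
}
```

The second call shows the debugging drawback: the malformed statement goes unnoticed until some execution path happens to reach it.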
Waldemar Horwat
Last modified Tuesday, November 19, 2002