ECMAScript 4 Syntax Rationale

ECMAScript 4 Netscape Proposal

Rationale

Syntax

Tuesday, November 19, 2002

This section presents a number of syntactic alternatives that were considered while developing this proposal.

Semicolon Insertion

Definitions

The term semicolon insertion informally refers to the ability to write programs while omitting semicolons between statements. In both ECMAScript 3 and ECMAScript 4 there are two kinds of semicolon insertion:

Grammatical Semicolon Insertion: Semicolons before a closing } and the end of the program are optional in both ECMAScript 3 and ECMAScript 4. In addition, the ECMAScript 4 parser allows semicolons to be omitted before the else of an if-else statement and before the while of a do-while statement.
Line-Break Semicolon Insertion: If the first through the nth tokens of an ECMAScript program are grammatically valid but the first through the (n+1)st tokens are not, and there is a line break between the nth and the (n+1)st tokens, then the parser tries to parse the program again after inserting a VirtualSemicolon token between the nth and the (n+1)st tokens.

Grammatical semicolon insertion is implemented directly by the syntactic grammar’s productions, which simply do not require a semicolon in the aforementioned cases. Line breaks in the source code are not relevant to grammatical semicolon insertion.

Line-break semicolon insertion cannot be easily implemented in the syntactic grammar. This kind of semicolon insertion turns a syntactically incorrect program into a correct program and relies on line breaks in the source code.
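
The two kinds of insertion can be probed in a present-day JavaScript engine, which inherits the ECMAScript 3 rules. The following is only a sketch: the helper name parses is hypothetical, and it relies on the standard Function constructor, which parses a source string as a function body without running it.

```javascript
// Hypothetical helper: reports whether `source` parses as a function body.
// The Function constructor parses its argument but does not execute it.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Grammatical insertion: no semicolon is needed before the closing }.
const beforeBrace = parses("{ var a = 1 }");

// Line-break insertion: the same tokens are rejected on one line...
const sameLine = parses("var a = 1 var b = 2");
// ...but accepted once a line break separates the two statements.
const withBreak = parses("var a = 1\nvar b = 2");

console.log(beforeBrace, sameLine, withBreak);  // prints: true false true
```

Only the last two cases depend on where the line break falls, which is what makes this kind of insertion awkward to express in a grammar.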

Discussion

Grammatical semicolon insertion is harmless. On the other hand, line-break semicolon insertion suffers from the following problems:

Line breaks are relevant in the program’s source code
The consequences of this kind of semicolon insertion appear inconsistent to programmers
Existing program behavior can change unexpectedly when new syntax is introduced

The first problem presents difficulty for some preprocessors, such as the one for XML attributes, which turn line breaks into spaces. The second and third problems are more serious. Programmers are confused when they discover that the program

a = b + c
(d + e).print()

doesn’t do what they expect:

a = b + c;
(d + e).print();

Instead, that program is parsed as:

a = b + c(d + e).print();
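
This parse can be observed directly in a present-day JavaScript engine. In the sketch below, the values of b, d, and e and the definition of c are assumptions made for the demonstration; c is a function that records being called and returns an object with a print method, so that the unintended call succeeds rather than throwing.

```javascript
let called = false;
const b = 1, d = 2, e = 3;

// c is deliberately a function (an assumption made for this demo) so that
// the unintended call succeeds instead of throwing a TypeError.
function c(arg) {
  called = true;
  return { print() { return arg * 10; } };
}

let a;
a = b + c
(d + e).print()

console.log(called, a);  // prints: true 51
```

No semicolon is inserted after c because the token sequence remains grammatically valid across the line break, so c is called with d + e and the result of print() is added to b.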

The third problem is the most serious. New features added to the language turn illegal syntax into legal syntax. If an existing program relies on the illegal syntax to trigger line-break semicolon insertion, then the program will silently change behavior once the feature is added. For example, the juxtaposition of a numeric literal followed by a string literal (such as 4 "in") is illegal in ECMAScript 3. ECMAScript 4 makes this legal syntax for expressions with units. This syntax extension has the unfortunate consequence of silently changing the meaning of the following ECMAScript 3 program:

a = b + 4
"in".print()

from:

a = b + 4;
"in".print();

to:

a = b + 4"in".print();

ECMAScript 4 gets around this incompatibility by adding a [no line break] restriction in the grammar that requires the numeric and string literals to be on the same line. Unfortunately, this compatibility is a double-edged sword. Due to ECMAScript 3 compatibility, ECMAScript 4 has to have a large number of these [no line break] restrictions. It is hard to remember all of them, and forgetting one of them often silently causes an ECMAScript 4 program to be reinterpreted. Some programmers will be dismayed to find that:

internal
  function f(x) {return x*x}

turns into:

internal;
function f(x) {return x*x}

(where internal; is an expression statement) instead of:

internal function f(x) {return x*x}

An earlier version of ECMAScript 4 disallowed line-break semicolon insertion. The current version allows it but only in non-strict mode. Strict mode removes all [no line break] restrictions, simplifying the language again. As a side effect, it is possible to write a program that does different things in strict and non-strict modes (the last example above is one such program), but this is the price to pay to achieve simplicity.

Regular Expression Literals

ECMAScript 4 retains compatibility with ECMAScript 3 by adopting the same rules for detecting regular expression literals. This complicates the design of programs such as syntax-directed text editors and machine scanners because it makes it impossible to find all of the tokens in an ECMAScript program without parsing the program.

Making ECMAScript 4’s lexical grammar independent of its syntactic grammar would have allowed tools to easily process an ECMAScript program and escape all instances of, say, </ to properly embed an ECMAScript 4 or later program in an HTML page, without depending on the full parser, which changes for each version of ECMAScript. To illustrate the difficulties, compare such ECMAScript 3 gems as the following two lines, which differ by a single character:

for (var x = a in foo && "</x>" || mot ? z:/x:3;x<5;y</g/i) {xyz(x++);}
for (var x = a in foo && "</x>" || mot ? z/x:3;x<5;y</g/i) {xyz(x++);}
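
A much smaller illustration of the same context dependence, runnable in a present-day JavaScript engine, is sketched below. The variable names and values are assumptions chosen for the demonstration; the point is only that the characters /g/i are tokenized differently depending on what precedes them.

```javascript
const x = 12, g = 2, i = 3;

// After the identifier x, each / is a division operator: (12 / 2) / 3.
const quotient = x /g/i;

// After the = punctuator, the very same characters form a regular
// expression literal with pattern "g" and flag "i".
const pattern = /g/i;

console.log(quotient, pattern instanceof RegExp, pattern.test("G"));
// prints: 2 true true
```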

Alternate Regular Expression Syntax

One idea explored early in the design of ECMAScript 4 was providing an alternate, unambiguous syntax for regular expressions and encouraging the use of the new syntax. A RegularExpression could have been specified unambiguously using « and » as its opening and closing delimiters instead of / and /. For example, «3*» would be a regular expression that matches zero or more 3’s. Such a regular expression could be empty: «» is a regular expression that matches only the empty string, while // starts a comment. To write such a regular expression using the slash syntax one needs to write /(?:)/.
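
The slash-syntax workaround for the empty pattern can be checked in a present-day JavaScript engine. This is a small sketch using only the standard RegExp object; the variable names are illustrative.

```javascript
// Two slashes with nothing between them start a comment, so the
// empty pattern has to be spelled with an empty non-capturing group.
const empty = /(?:)/;

// The match found is the empty string.
const match = empty.exec("abc");
console.log(match[0] === "", match.index);  // prints: true 0

// The RegExp constructor accepts an empty pattern string and reports
// its source in the slash-safe spelling.
const fromString = new RegExp("");
console.log(fromString.source);  // prints: (?:)
```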

Syntactic Resynchronization

Syntactic resynchronization occurs when the lexer needs to find the end of a block (the matching }) in order to skip a portion of a program written in a future version of ECMAScript. Ordinarily this would not be a problem, but regular expressions complicate matters because they make lexing dependent on parsing. The rules for recognizing regular expression literals must be changed for those portions of the program. The rule below might work, or a simplified parse might be performed on the input to determine the locations of regular expressions. This is an area that needs further work.

During syntactic resynchronization ECMAScript 4 determines whether a / starts a regular expression or is a division (or /=) operator solely based on the previous token:

/ interpretation      Previous token

/ or /=               Identifier Number RegularExpression String
                      ) ++ -- ] }
                      class false null private protected public super this true
                      get set
                      Any other punctuation

RegularExpression     ! != !== # % %= & && &&= &= ( * *= + += , - -= -> . .. ... / /= : :: ; < << <<= <= = == === > >= >> >>= >>> >>>= ? @ [ ^ ^= ^^ ^^= { | |= || ||= ~
                      abstract break case catch const continue debugger default delete do else enum export extends final finally for function goto if implements import in instanceof interface is namespace native new package return static switch synchronized throw throws transient try typeof use var volatile while with

Regardless of the previous token, // is interpreted as the beginning of a comment.

The only controversial choices are ) and }. A / after either a ) or } token can be either a division symbol (if the ) or } closes a subexpression or an object literal) or a regular expression token (if the ) or } closes a preceding statement or an if, while, or for expression). Having / be interpreted as a RegularExpression in expressions such as (x+y)/2 would be problematic, so it is interpreted as a division operator after ) or }. If one wants to place a regular expression literal at the very beginning of an expression statement, it’s best to put the regular expression in parentheses. Fortunately, this is not common since one usually assigns the result of the regular expression operation to a variable.
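
The previous-token rule lends itself to a table lookup. The following is a minimal sketch, not part of the proposal: the function name slashStartsRegExp and the token record shape { kind, text } are hypothetical, and the caller is assumed to have already handled // as a comment.

```javascript
// Previous tokens after which / starts a RegularExpression (from the table).
const REGEXP_PUNCTUATORS = new Set((
  "! != !== # % %= & && &&= &= ( * *= + += , - -= -> . .. ... / /= : :: ; " +
  "< << <<= <= = == === > >= >> >>= >>> >>>= ? @ [ ^ ^= ^^ ^^= { | |= || ||= ~"
).split(" "));

const REGEXP_WORDS = new Set((
  "abstract break case catch const continue debugger default delete do else " +
  "enum export extends final finally for function goto if implements import " +
  "in instanceof interface is namespace native new package return static " +
  "switch synchronized throw throws transient try typeof use var volatile " +
  "while with"
).split(" "));

// prev is a hypothetical token record { kind, text }, where kind is one of
// "Word" (identifier or reserved word), "Number", "String",
// "RegularExpression", or "Punctuator".
function slashStartsRegExp(prev) {
  switch (prev.kind) {
    case "Number":
    case "String":
    case "RegularExpression":
      return false;
    case "Word":
      // Identifiers and words such as this, true, get, set mean division.
      return REGEXP_WORDS.has(prev.text);
    case "Punctuator":
      // ) ++ -- ] } and any punctuation not listed mean division.
      return REGEXP_PUNCTUATORS.has(prev.text);
    default:
      throw new TypeError("unknown token kind: " + prev.kind);
  }
}

console.log(slashStartsRegExp({ kind: "Punctuator", text: ")" }));  // prints: false
console.log(slashStartsRegExp({ kind: "Word", text: "return" }));   // prints: true
```

Because only one token of context is consulted, the lookup stays independent of the full parser, at the cost of the ) and } cases discussed above.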

Type Declarations

The current ECMAScript 4 proposal uses Pascal-style colons to introduce types in declarations. For example:

var x:Integer = 7;
function square(a:Number):Number {return a*a}

This is due to a consensus decision of the ECMA working group, with Waldemar the only dissenter. There are a couple of alternative syntaxes:

C-Style

We could allow modified C-style type declarations as long as a function’s return type is listed after its parameters:

var Integer x = 7;
var Integer y = 8, Integer z = 9;  // Declares two Integer variables
function square(Number a) Number {return a*a}

A function’s return type cannot be listed before the parameters because this would make the grammar ambiguous.

In fact, an implementation could unambiguously admit both the Pascal-style and the modified C-style declarations by replacing the TypedIdentifier and Result grammar rules with the ones listed below. The resulting grammar is still LALR(1).

TypedIdentifier ⇒
    Identifier
  | Identifier : TypeExpression
  | TypeExpression Identifier

Result ⇒
    «empty»
  | : TypeExpression^allowIn
  | [lookahead∉{{}] TypeExpression^allowIn

Advantages of using the modified C-style syntax include:

On the Pascal/Modula/Ada vs. C/C++/Java syntax debate, ECMAScript tends to use syntax more similar to Java.
We already use the colon syntax for statement labels and object literal elements (for example {a:17, b:33}). The latter would present a conundrum if we ever wanted to declare field types in an object literal. Some programmers have been using these as a convenient facility for passing named arguments to functions.

Attribute-Style

Since attributes are simple expressions, we could allow attributes that evaluate to types. For var and const declarations, these attributes would specify the type of the declared variables. For function declarations, these attributes would specify the function’s return type. For stylistic consistency, types of arguments would also be listed before their identifiers.

Integer var x = 7;
Integer var y = 8, z = 9;        // Declares two Integer variables
Number function square(Number a) {return a*a}

This style is simple and reads fairly naturally.

Again, an implementation could unambiguously admit both the Pascal-style and the attribute-style declarations, with the resulting grammar still being LALR(1). However, it would be better for the language to make a choice than to propagate the confusion of having two or three styles; this flexibility could be used for compatibility with existing programs.

Type Expressions

ECMAScript 4 uses the same syntax for type expressions as for value expressions for the following reasons:

Creating two different syntaxes for two kinds of expressions would add to the complexity of the language.
ECMAScript is a dynamic language and it is useful to manipulate types as though they were first-class values.
It’s difficult to unambiguously distinguish type expressions from value expressions. In the expression (expr1)(expr2), is expr1 a type or a value expression? If the two have the same syntax, it doesn’t matter.
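
The last point has a rough analogue in present-day JavaScript, where constructors such as Number are ordinary values. The sketch below is an illustration under that assumption and does not reproduce ECMAScript 4 type semantics; the names T and len are hypothetical.

```javascript
// Number is both a type-like constructor and an ordinary value.
const T = Number;
const len = (s) => s.length;

// Both lines have the shape (expr1)(expr2); nothing in the syntax says
// whether expr1 names a type (a conversion) or a function (a call).
const converted = (T)("42");   // conversion to the number 42
const measured = (len)("42");  // ordinary call returning 2

console.log(converted, measured);  // prints: 42 2
```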

Function Declarations

Getters and Setters

By consensus in the ECMA TC39 modularity subcommittee, we decided to use the syntax function get id (...) rather than getter function id (...) for defining a getter and function set id (...) rather than setter function id (...) for defining a setter. The latter would have simplified the FunctionName rule to:

FunctionName ⇒ Identifier

while creating two additional attributes, getter and setter. The decision was based on aesthetics; neither syntax is more difficult to implement than the other.

Language Directives

An alternative to pragmas that was considered early was to report syntax errors at the time the relevant statement was executed rather than at the time it was parsed. This way a single program could include parts written in a future version of ECMAScript without getting an error unless it tries to execute those portions on a system that does not understand that version of ECMAScript. If a program part that contains an error is never executed, the error never breaks the script. For example, the following function finishes successfully if whizBangFeature is false:

function move(x:Integer, y:Integer, d:Integer) {
  x += 10;
  y += 3;
  if (whizBangFeature) {
    simulate{@x and #y} along path
  } else {
    x += d; y += d;
  }
  return [x,y];
}

The code simulate{@x and #y} along path is a syntax error, but this error does not break the script unless the script attempts to execute that piece of code.
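
The deferred-error behavior can be approximated in a present-day JavaScript engine by keeping the unknown syntax in a string and handing it to the standard Function constructor only when the branch is reached. This is a sketch of the idea rather than the mechanism that was considered: the type annotations are dropped because they are not valid in today's JavaScript, and whizBangFeature is passed as a parameter for the demonstration.

```javascript
function move(x, y, d, whizBangFeature) {
  x += 10;
  y += 3;
  if (whizBangFeature) {
    // The unknown syntax is kept in a string, so it is parsed (and
    // rejected with a SyntaxError) only if this branch is reached.
    new Function("simulate{@x and #y} along path")();
  } else {
    x += d; y += d;
  }
  return [x, y];
}

console.log(move(1, 2, 3, false).join(","));  // prints: 14,8

let deferredError = null;
try {
  move(1, 2, 3, true);
} catch (err) {
  deferredError = err;
}
console.log(deferredError instanceof SyntaxError);  // prints: true
```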

One problem with this approach is that it frustrates debugging; a script author benefits from knowing about syntax errors at compile time rather than at run time.

Waldemar Horwat
Last modified Tuesday, November 19, 2002