July 2000 Draft
JavaScript 2.0
Rationale
Syntax

Tuesday, February 15, 2000

Semicolon Insertion

Definitions

The term semicolon insertion informally refers to the ability to write programs while omitting semicolons between statements. In both JavaScript 1.5 and JavaScript 2.0 there are two kinds of semicolon insertion:

Grammatical Semicolon Insertion
Semicolons before a closing } and the end of the program are optional in both JavaScript 1.5 and 2.0. In addition, the JavaScript 2.0 parser allows semicolons to be omitted before the else of an if-else statement and before the while of a do-while statement.
Line-Break Semicolon Insertion
If the first through the nth tokens of a JavaScript program are grammatically valid but the first through the n+1st tokens are not, and there is a line break between the nth and the n+1st tokens, then the parser tries to parse the program again after inserting a VirtualSemicolon token between the nth and the n+1st tokens.

Grammatical semicolon insertion is implemented directly by the parser grammar's productions, which simply do not require a semicolon in the aforementioned cases. Line breaks in the source code are not relevant to grammatical semicolon insertion.

Line-break semicolon insertion cannot be easily implemented in the parser's grammar. This kind of semicolon insertion turns a syntactically incorrect program into a correct program and relies on line breaks in the source code.
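
The retry rule for line-break semicolon insertion can be sketched over a toy grammar. Everything here is an assumption made for illustration: the grammar (operands joined by + or =, statements ended by ;), the helper names tokenize, firstError and insertSemicolons, and the <;> rendering of the VirtualSemicolon token. It is not the JavaScript grammar or any real parser's implementation.

```javascript
// Toy grammar, assumed purely for illustration (not the JavaScript grammar):
//   Program   => Statement*
//   Statement => Operand (Operator Operand)* ";"
// The ";" is optional at the end of the program (grammatical insertion).
const isOperand = (text) => /^(?:[A-Za-z_$][\w$]*|\d+)$/.test(text);
const isOperator = (text) => text === "+" || text === "=";

// Split the source into tokens, recording whether a line break precedes each.
function tokenize(source) {
  const tokens = [];
  for (const line of source.split("\n")) {
    const words = line.match(/[A-Za-z_$][\w$]*|\d+|[+=;]/g) || [];
    words.forEach((text, i) => {
      tokens.push({ text, lineBreakBefore: i === 0 && tokens.length > 0 });
    });
  }
  return tokens;
}

// Return the index of the first token that cannot extend a valid program,
// tokens.length if the program ends in the middle of an expression,
// or -1 if the whole token list is valid.
function firstError(tokens) {
  let i = 0;
  while (i < tokens.length) {
    if (!isOperand(tokens[i].text)) return i;
    i++;
    while (i < tokens.length && isOperator(tokens[i].text)) {
      i++; // consume the operator
      if (i === tokens.length || !isOperand(tokens[i].text)) return i;
      i++; // consume the operand
    }
    if (i === tokens.length) return -1; // ";" optional at end of program
    if (tokens[i].text !== ";") return i;
    i++;
  }
  return -1;
}

// Reparse after inserting a virtual semicolon before each offending token
// that is preceded by a line break; give up on any other error.
function insertSemicolons(source) {
  const tokens = tokenize(source);
  for (;;) {
    const n = firstError(tokens);
    if (n === -1) {
      return tokens.map((t) => (t.virtual ? "<;>" : t.text)).join(" ");
    }
    const offending = tokens[n];
    if (!offending || !offending.lineBreakBefore) {
      throw new SyntaxError("unexpected token at index " + n);
    }
    tokens.splice(n, 0, { text: ";", virtual: true, lineBreakBefore: false });
  }
}

console.log(insertSemicolons("a = b + c\nd = e")); // a = b + c <;> d = e
console.log(insertSemicolons("a = b\n+ c"));       // a = b + c
```

The second call shows the property that makes this kind of insertion surprising: a line break alone inserts nothing as long as the next token can continue the statement.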

Discussion

Grammatical semicolon insertion is harmless. On the other hand, line-break semicolon insertion suffers from the following problems:

  1. Line breaks are relevant in the program's source code
  2. The consequences of this kind of semicolon insertion appear inconsistent to users
  3. Existing program behavior can change unexpectedly when new syntax is introduced

The first problem presents difficulty for some preprocessors, such as the one for XML attributes, which turn line breaks into spaces. The second and third ones are more serious. Users are confused when they discover that the program

a = b + c
(d + e).print()

doesn't do what they expect:

a = b + c;
(d + e).print();

Instead, that program is parsed as:

a = b + c(d + e).print();
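
The same behavior can be observed in any JavaScript engine. The following is a minimal sketch; the function demo, the values bound to b, d and e, and the stand-in function bound to c are all hypothetical and exist only to make the call visible.

```javascript
function demo() {
  const calls = []; // records every argument that c is called with
  const b = 1, d = 2, e = 3;
  // Stand-in for c: a function whose result has a print method.
  const c = function (arg) {
    calls.push(arg);
    return { print() { return 10; } };
  };
  let a;
  // No semicolon after c: the next line continues the expression.
  a = b + c
  (d + e).print()
  return { a, calls };
}

const result = demo();
console.log(result.a);     // 11, that is, b + c(d + e).print()
console.log(result.calls); // [ 5 ], so c really was called with d + e
```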

The third problem is the most serious. New features added to the language turn illegal syntax into legal syntax. If an existing program relies on the illegal syntax to trigger line-break semicolon insertion, then the program will silently change behavior once the feature is added. For example, the juxtaposition of a numeric literal followed by a string literal (such as 4 "in") is illegal in JavaScript 1.5. JavaScript 2.0 makes this legal syntax for expressions with units. This syntax extension has the unfortunate consequence of silently changing the meaning of the following JavaScript 1.5 program:

a = b + 4
"in".print()

from:

a = b + 4;
"in".print();

to:

a = b + 4"in".print();

JavaScript 2.0 gets around this incompatibility by adding a [no line break] restriction in the grammar that requires the numeric and string literals to be on the same line. Unfortunately, this compatibility is a double-edged sword. Due to JavaScript 1.5 compatibility, JavaScript 2.0 has to have a large number of these [no line break] restrictions. It is hard to remember all of them, and forgetting one of them often silently causes a JavaScript 2.0 program to be reinterpreted. Users will be dismayed to find that:

local
  function f(x) {return x*x}

turns into:

local;
function f(x) {return x*x}

(where local; is an expression statement) instead of:

local function f(x) {return x*x}

An earlier version of JavaScript 2.0 disallowed line-break semicolon insertion. The current version allows it but only in non-strict mode. Strict mode removes all [no line break] restrictions, simplifying the language again. As a side effect, it is possible to write a program that does different things in strict and non-strict modes (the last example above is one such program), but this is the price to pay to achieve simplicity.

Regular Expression Literals

JavaScript 2.0 retains compatibility with JavaScript 1.5 by adopting the same rules for detecting regular expression literals. This complicates the design of programs such as syntax-directed text editors and machine scanners because it makes it impossible to find all of the tokens in a JavaScript program without parsing the program.

Making JavaScript 2.0's lexical grammar independent of its syntactic grammar would have made it significantly easier for tools to process a JavaScript program and escape all instances of, say, </ to properly embed a JavaScript 2.0 or later program in an HTML page. Without that independence such tools need the full parser, which changes for each version of JavaScript. To illustrate the difficulties, compare such JavaScript 1.5 gems as:

for (var x = a in foo && "</x>" || mot ? z:/x:3;x<5;y</g/i) {xyz(x++);}
for (var x = a in foo && "</x>" || mot ? z/x:3;x<5;y</g/i) {xyz(x++);}

Alternate Regular Expression Syntax

One idea explored early in the design of JavaScript 2.0 was providing an alternate, unambiguous syntax for regular expressions and encouraging the use of the new syntax. A RegularExpression could have been specified unambiguously using « and » as its opening and closing delimiters instead of / and /. For example, «3*» would be a regular expression that matches zero or more 3's. Such a regular expression could be empty: «» is a regular expression that matches only the empty string, while // starts a comment. To write such a regular expression using the slash syntax one needs to write /(?:)/.
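
The slash-syntax half of this comparison can be checked in current JavaScript engines; the « » delimiters themselves were only a proposal and cannot be run. A small sketch, with hypothetical variable names:

```javascript
// /3*/ matches zero or more 3's.
const zeroOrMoreThrees = /3*/;

// An empty pattern cannot be written as //, which starts a comment,
// so the slash syntax needs the empty non-capturing group instead.
const emptyPattern = /(?:)/;

const results = {
  threes: zeroOrMoreThrees.exec("333x")[0], // "333"
  emptyMatch: emptyPattern.exec("")[0],      // ""
  // Engines print an empty pattern the same way, for the same reason.
  emptySource: new RegExp("").source,        // "(?:)"
};
console.log(results);
```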

Syntactic Resynchronization

Syntactic resynchronization occurs when the lexer needs to find the end of a block (the matching }) in order to skip a portion of a program written in a future version of JavaScript. Ordinarily this would not be a problem, but regular expressions complicate matters because they make lexing dependent on parsing. The rules for recognizing regular expression literals must be changed for those portions of the program. The rule below might work, or a simplified parse might be performed on the input to determine the locations of regular expressions. This is an area that needs further work.

During syntactic resynchronization JavaScript 2.0 determines whether a / starts a regular expression or is a division (or /=) operator solely based on the previous token:

/ interpretation    Previous token
/ or /=             Identifier   Number   RegularExpression   String
                    )   ++   --   ]   }
                    false   null   super   this   true
                    constructor   getter   method   override   setter   traditional   version
                    Any other punctuation
RegularExpression   !   !=   !==   #   %   %=   &   &&   &&=   &=   (   *   *=   +   +=   ,   -   -=   ->   .   ..   ...   /   /=   :   ::   ;   <   <<   <<=   <=   =   ==   ===   >   >=   >>   >>=   >>>   >>>=   ?   @   [   ^   ^=   ^^   ^^=   {   |   |=   ||   ||=   ~
                    abstract   break   case   catch   class   const   continue   debugger   default   delete   do   else   enum   eval   export   extends   field   final   finally   for   function   goto   if   implements   import   in   instanceof   native   new   package   private   protected   public   return   static   switch   synchronized   throw   throws   transient   try   typeof   var   volatile   while   with

Regardless of the previous token, // is interpreted as the beginning of a comment.

The only controversial choices are ) and }. A / after either a ) or } token can be either a division symbol (if the ) or } closes a subexpression or an object literal) or a regular expression token (if the ) or } closes a preceding statement or an if, while, or for expression). Having / be interpreted as a RegularExpression in expressions such as (x+y)/2 would be problematic, so it is interpreted as a division operator after ) or }. If one wants to place a regular expression literal at the very beginning of an expression statement, it's best to put the regular expression in parentheses. Fortunately, this is not common since one usually assigns the result of the regular expression operation to a variable.
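
The previous-token rule lends itself to a simple lookup. The sketch below is one possible encoding, not part of the proposal: the function name slashMeaning and the token shape { kind, text } are hypothetical, and the kind "Word" is assumed to cover both identifiers and keyword-like words. The // comment rule is assumed to be handled by the lexer before this function is consulted.

```javascript
// Punctuation after which a / starts a RegularExpression.
const REGEXP_PUNCTUATION = new Set([
  "!", "!=", "!==", "#", "%", "%=", "&", "&&", "&&=", "&=", "(", "*", "*=",
  "+", "+=", ",", "-", "-=", "->", ".", "..", "...", "/", "/=", ":", "::",
  ";", "<", "<<", "<<=", "<=", "=", "==", "===", ">", ">=", ">>", ">>=",
  ">>>", ">>>=", "?", "@", "[", "^", "^=", "^^", "^^=", "{", "|", "|=",
  "||", "||=", "~",
]);

// Words after which a / starts a RegularExpression.
const REGEXP_WORDS = new Set([
  "abstract", "break", "case", "catch", "class", "const", "continue",
  "debugger", "default", "delete", "do", "else", "enum", "eval", "export",
  "extends", "field", "final", "finally", "for", "function", "goto", "if",
  "implements", "import", "in", "instanceof", "native", "new", "package",
  "private", "protected", "public", "return", "static", "switch",
  "synchronized", "throw", "throws", "transient", "try", "typeof", "var",
  "volatile", "while", "with",
]);

// previous.kind is one of "Word", "Number", "String", "RegularExpression"
// or "Punctuation" (a hypothetical classification for this sketch).
function slashMeaning(previous) {
  if (previous.kind === "Punctuation") {
    // ) ++ -- ] } and any other punctuation fall through to division.
    return REGEXP_PUNCTUATION.has(previous.text) ? "RegularExpression" : "Division";
  }
  if (previous.kind === "Word") {
    // Identifiers and words such as this, null or getter mean division.
    return REGEXP_WORDS.has(previous.text) ? "RegularExpression" : "Division";
  }
  return "Division"; // Number, String, RegularExpression
}

console.log(slashMeaning({ kind: "Punctuation", text: ")" })); // Division
console.log(slashMeaning({ kind: "Word", text: "return" }));   // RegularExpression
```

With this encoding (x+y)/2 is read as a division, as the text requires, and a regular expression at the start of an expression statement after a } is misread unless it is parenthesized.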

Type Declarations

The current JavaScript 2.0 proposal uses Pascal-style colons to introduce types in declarations. For example:

var x:integer = 7;
function square(a:number):number {return a*a}

This is due to a consensus decision of the ECMA working group, with Waldemar the only dissenter.

We could allow modified C-style type declarations as long as a function's return type is listed after its parameters:

var integer x = 7;
function square(number a) number {return a*a}

A function's return type cannot be listed before the parameters because this would make the grammar ambiguous.

In fact, an implementation could unambiguously admit both the Pascal-style and the modified C-style declarations by replacing the TypedIdentifier and ResultSignature grammar rules with the ones listed below. The resulting grammar is still LALR(1).

TypedIdentifier ⇒
   Identifier
|  Identifier : TypeExpression
|  TypeExpression Identifier
ResultSignature ⇒
   «empty»
|  : TypeExpression^allowIn
|  [lookahead∉{{}] TypeExpression^allowIn

A few advantages of using the modified C-style syntax:

Type Expressions

We could define other useful type operators such as union, intersection, and difference as listed in the table below. s and t are type expressions.

Type: s + t
   Values: All values belonging to either type s or type t or both
   Coercion of value v: If v ∈ s+t, then use v; otherwise, if v@s is defined then use v@s; otherwise, if v@t is defined then use v@t.
Type: s * t
   Values: All values simultaneously belonging to both type s and type t
   Coercion of value v: If v@s@t is defined and is a member of s*t, then use v@s@t.
Type: s / t
   Values: All values belonging to type s but not type t
   Coercion of value v: If v@s is defined and is a member of s/t, then use v@s.

The following subtype and type equivalence relations hold. r, s, and t represent arbitrary types.

s ⊆ s + t                      s * t ⊆ s
t + t = t                      t * t = t
(r + s) + t = r + (s + t)      (r * s) * t = r * (s * t)
none ⊆ t                       t ⊆ any
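
The membership half of these operators can be sketched by modeling a type as a predicate on values. This is only an illustration under that assumption: the names union, intersection, difference and members, the three example types, and the sample values are all hypothetical, and coercion (the @ operator) is not modeled at all.

```javascript
// A "type" here is just a function that says whether a value belongs to it.
const union = (s, t) => (v) => s(v) || t(v);         // s + t
const intersection = (s, t) => (v) => s(v) && t(v);  // s * t
const difference = (s, t) => (v) => s(v) && !t(v);   // s / t

// Hypothetical example types.
const isNumber = (v) => typeof v === "number";
const isInteger = (v) => Number.isInteger(v);
const isString = (v) => typeof v === "string";

// List the sample values that belong to a type.
const samples = [0, 1.5, 7, "in", null, true];
const members = (type) => samples.filter(type);

console.log(members(union(isInteger, isString)));        // [ 0, 7, 'in' ]
console.log(members(intersection(isNumber, isInteger))); // [ 0, 7 ]
console.log(members(difference(isNumber, isInteger)));   // [ 1.5 ]
// t + t = t and t * t = t on the sample values:
console.log(members(union(isNumber, isNumber)));         // [ 0, 1.5, 7 ]
```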

JavaScript 2.0 uses the same syntax for type expressions as for value expressions for the following reasons:

Language Declarations

An alternative to language declarations that was considered early was to report syntax errors at the time the relevant statement was executed rather than at the time it was parsed. This way a single program could include parts written in a future version of JavaScript without getting an error unless it tries to execute those portions on a system that does not understand that version of JavaScript. If a program part that contains an error is never executed, the error never breaks the script. For example, the following function finishes successfully if whizBangFeature is false:

function move(integer x, integer y, integer d) {
  x += 10;
  y += 3;
  if (whizBangFeature) {
    simulate{@x and #y} along path
  } else {
    x += d; y += d;
  }
  return [x,y];
}

The code simulate{@x and #y} along path is a syntax error, but this error does not break the script unless the script attempts to execute that piece of code.
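
Existing JavaScript can emulate this deferred reporting by keeping the unparsable code in a string and compiling it only when it is reached. The sketch below does this with the Function constructor. It is an emulation, not the mechanism that was considered; the type annotations are dropped and whizBangFeature is passed as a parameter so that the example runs in current engines.

```javascript
function move(x, y, d, whizBangFeature) {
  x += 10;
  y += 3;
  if (whizBangFeature) {
    // The body is parsed only when this line runs, so the syntax error
    // surfaces here rather than when move itself is parsed.
    new Function("x", "y", "simulate{@x and #y} along path")(x, y);
  } else {
    x += d; y += d;
  }
  return [x, y];
}

console.log(move(0, 0, 1, false)); // [ 11, 4 ]

try {
  move(0, 0, 1, true);
} catch (error) {
  console.log(error instanceof SyntaxError); // true
}
```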

One problem with this approach is that it frustrates debugging; a script author benefits from knowing about syntax errors at compile time rather than at run time.


Waldemar Horwat
Last modified Tuesday, February 15, 2000