Written by Michael Ang <mang@subcarrier.org>
Comments to Norris Boyd <nboyd@atg.com>
I. Background
ECMAScript identifiers are currently specified as being Unicode. However, only the first 128 Unicode characters are allowed, effectively restricting identifiers to ASCII.
Implementations of ECMAScript are currently in use around the world. Developers whose native language is not English should be able to have identifiers that make sense to them. Although arbitrary strings can be used for named property lookup, allowing ideographs and other Unicode characters in identifiers will make it easier for global developers to write scripts.
Since implementations must currently accept Unicode characters, extending the range of characters allowed to that of the Unicode identifier class should not be an undue burden.
Java guarantees that escaped Unicode characters occurring in source code (in the form \uNNNN) will be unescaped before compilation. This can lead to problems in dynamic languages, for example when a newline character is escaped:
Program 1 (note that \u000A is the newline character):
int foo = 5;\u000Aint bar = 6;
Program 2 (equivalent in Java, but not ECMAScript):
int foo = 5;
int bar = 6;
Because allowing Unicode escapes in identifiers would complicate interpreter implementations, such escapes are forbidden in identifiers. Note that Unicode escapes are still allowed in comments and literal strings, but are not decoded.
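The difference between the two programs can be sketched with a small pre-pass that mimics the Java-style translation step. This is only an illustration: unescapeUnicode is a hypothetical helper, not part of any specification, and it ignores details of the real Java rules such as repeated "u" characters and backslash counting.

```javascript
// Hypothetical helper: decode every \uNNNN escape in raw source text,
// roughly as a Java-style translation pass would before compilation.
function unescapeUnicode(sourceText) {
  return sourceText.replace(/\\u([0-9a-fA-F]{4})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

// Program 1 as raw text: the escape is six ordinary characters.
const program1 = "int foo = 5;\\u000Aint bar = 6;";
// Program 2 as raw text: a real newline separates the statements.
const program2 = "int foo = 5;\nint bar = 6;";

const decoded = unescapeUnicode(program1);
const lineCountBefore = program1.split("\n").length;
const lineCountAfter = decoded.split("\n").length;
```

After the pre-pass the one-line Program 1 becomes the two-line Program 2, which is why a language that decodes escapes up front must treat them as equivalent.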
Section 5.14 of the Unicode Standard v2.0 gives implementation guidelines for identifiers. Most identifiers legal under these guidelines are legal in ECMAScript. ECMAScript differs in that no provisions are given for ignoring formatting characters (which are forbidden).
II. Recommendations
- since identifiers are compared based on the sequence of their code points, identifiers that appear identical may not be the same
- no provision is made for ignoring layout and format control characters
These recommendations are made against the April 22 ECMAScript draft. Specific changes to the document appear in bold type.
§6 Source Text
Amend the first section as follows:
"However, non-ASCII Unicode characters may appear
only within identifiers, comments, and string literals.
In identifiers, the exact set of Unicode characters allowed is
specified in Section 7.5 and corresponds to those Unicode
characters with the property of alphabetic, decimal digit,
combining mark, or ideographic. In string literals, any
Unicode character may also be expressed as a Unicode escape sequence
consisting of six ASCII characters, namely \u plus four hexadecimal
digits. Within a comment, such an escape sequence is effectively ignored
as part of the comment. Within a string literal, the Unicode
escape sequence contributes one character to the string value of the literal."
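A minimal sketch of the two behaviours described in the amended text, assuming a present-day ECMAScript engine (which may differ in other respects from the draft discussed here). The variable names are illustrative only.

```javascript
// Inside a string literal, a \uNNNN escape contributes exactly one character.
const ring = "\u00C5"; // LATIN CAPITAL LETTER A WITH RING ABOVE
const escapeLength = ring.length;

// Inside a comment, the same six characters are plain comment text.
// The program is built as a string so the escape reaches the parser undecoded.
const source = "// \\u000A return 'escaped';\nreturn 'comment ignored';";
const result = new Function(source)();
```

The escaped newline does not end the comment, so the first return statement is never seen by the parser.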
§7.5 Identifiers
Amend the first section as follows:
"An identifier is a character sequence of unlimited length, where each character
in the sequence must be a Unicode character with the property of
alphabetic (category "L"), decimal digit (category "Nd"), ideographic, or combining.
For historical reasons, the underscore (_) character and dollar sign ($) are also supported.
The first character may not be a Unicode decimal digit.
Two ECMAScript identifiers are the same only if they have the same sequence of Unicode characters (as defined by their Unicode code points). This means that two identifiers with the same external appearance may not be identical. Composite Unicode characters are treated as distinct from their decomposed equivalents. For example, LATIN CAPITAL LETTER A (\u0041) followed by COMBINING RING ABOVE (\u030A) is distinct from LATIN CAPITAL LETTER A WITH RING ABOVE (\u00C5)."
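The code point comparison rule can be illustrated with strings used as property names, which are compared the same way. This is a sketch under the assumption of a present-day engine; String.prototype.normalize is a later addition to the language and is shown only to make the contrast visible.

```javascript
// Two spellings that render alike but have different code point sequences.
const decomposed = "A\u030A"; // LATIN CAPITAL LETTER A + COMBINING RING ABOVE
const precomposed = "\u00C5"; // LATIN CAPITAL LETTER A WITH RING ABOVE

// Comparison is by code point sequence, so these are different names.
const sameCodePoints = decomposed === precomposed;

// Only an explicit normalization step makes them compare equal.
const sameAfterNfc = decomposed.normalize("NFC") === precomposed;

// Used as property names, they address two separate properties.
const table = {};
table[decomposed] = 1;
table[precomposed] = 2;
const keyCount = Object.keys(table).length;
```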
The Unicode Standard v2.0 specifies implementation guidelines for identifiers
(§5.14 Identifiers). These significant differences between ECMAScript
and these guidelines should be noted:
Amend the BNF as follows:
"IdentifierName ::
IdentifierLetter
IdeographicCharacter
IdentifierName CombiningCharacter
IdentifierName Extender
IdentifierName IdeographicCharacter
IdentifierName IdentifierLetter
IdentifierName DecimalDigit
CombiningCharacter ::
A CombiningCharacter is a Unicode character with the normative
combining property.
Extender ::
An Extender is a Unicode character in a set defined
in §5.14 of the Unicode Standard 2.0. (XXX should
expand this reference.)
IdentifierLetter :: one of
[ASCII table with _ and $]
Additionally, an IdentifierLetter may be a member of the Unicode letter class (those Unicode characters in category "L"), or the Unicode character FULLWIDTH LOW LINE (U+FF3F).
IdeographicCharacter ::
An IdeographicCharacter is a Unicode character with the ideographic property.
The ideographic property is an informative property of the Compatibility Han
characters, the Unified Han Set, the Hangzhou-style numerals, and the IDEOGRAPHIC
NUMBER ZERO.
DecimalDigit :: one of
0 1 2 3 4 5 6 7 8 9
Additionally, a DecimalDigit may be a member of the Unicode decimal
number class (those Unicode characters in category "Nd")."
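The grammar above can be approximated with a single regular expression. This is a rough sketch, not the proposal's definition: it assumes an engine with Unicode property escapes, it stands in \p{M} for the normative combining property and \p{Extender} for the Unicode 2.0 §5.14 extender set, and it uses current Unicode data rather than version 2.0. The name isProposedIdentifier is hypothetical.

```javascript
// First character: IdentifierLetter (category L, _, $, FULLWIDTH LOW LINE)
// or IdeographicCharacter. A decimal digit may not come first.
// Later characters: any of the above, plus DecimalDigit (category Nd),
// CombiningCharacter (approximated by category M) and Extender.
const PROPOSED_IDENTIFIER =
  /^[\p{L}\p{Ideographic}_$\uFF3F][\p{L}\p{Ideographic}\p{Nd}\p{M}\p{Extender}_$\uFF3F]*$/u;

// Hypothetical helper: true if text matches the sketched grammar.
function isProposedIdentifier(text) {
  return PROPOSED_IDENTIFIER.test(text);
}

const samples = ["foo", "$scope", "変数", "x\u0301", "a\u0663", "9lives", "\u0663a", "a-b", ""];
const verdicts = samples.map((name) => isProposedIdentifier(name));
```

Note that the check is purely per character. It performs no normalization, so the precomposed and decomposed spellings of a name are both accepted and remain distinct identifiers.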
§15.9.1 Regular Expression Pattern Matching
The textual descriptions of the \w and \W character classes do not match the character ranges given. The ranges given are what is intended, for historical reasons.
Amend the descriptions of \w and \W character classes:
\w | ASCII letters, digits, and underscore; equivalent to "[a-zA-Z0-9_]". |
\W | Any character not an ASCII letter, digit, or underscore; equivalent to "[^a-zA-Z0-9_]". |
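The ASCII-only behaviour of these classes can be checked directly. A short sketch, assuming a present-day engine and regular expressions written without the "u" and "i" flags; the variable names are illustrative.

```javascript
const wordOnly = /^\w+$/;

// ASCII letters, digits, and underscore match \w.
const asciiWord = wordOnly.test("foo_bar42");

// Non-ASCII letters and ideographs do not, even though they are
// legal identifier characters under the amended grammar.
const accentedWord = wordOnly.test("caf\u00E9");
const ideographWord = wordOnly.test("\u5909\u6570");

// \W is the complement, so it matches a non-ASCII letter.
const nonWordHit = /\W/.test("\u00E9");

// The class behaves like the explicit range given in the table.
const sameAsRange = /^[a-zA-Z0-9_]+$/.test("foo_bar42") === asciiWord;
```

One consequence is that \w cannot be used to recognise identifiers once non-ASCII characters are allowed in them.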
Last modified: Fri Dec 18 18:58:34 PST 1998