Valid identifiers in FORTRAN 66, C, Java, C++ current and future
What’s in a name? Identifiers are used in programming languages to refer to types, classes, variables and object instances. While the first programming languages were resource-constrained and ASCII-centered, modern languages are more flexible with regard to the forms identifiers can take.
This post compares the lexical conventions for identifiers (length and character set) in FORTRAN 66, C, Java, and current and future C++.
FORTRAN 66
The original FORTRAN 66 identifiers were defined based on digits and letters as follows:
A symbolic name consists of from one to six alphanumeric characters, the first of which must be alphabetic. A digit is one of the ten characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 A letter is one of the twenty-six characters; A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z.
So we only have the 26 upper-case letters and the 10 digits to build identifiers of at most 6 characters. There is no lower case at all, so the question of case sensitivity never arises. No underscores, no $ signs.
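The rule quoted above is small enough to express as a single pattern. The following is a minimal sketch in Python, not part of any FORTRAN toolchain; the name `is_fortran66_name` is made up for illustration.

```python
import re

# One upper-case letter followed by at most five upper-case letters or digits,
# following the "one to six alphanumeric characters" rule quoted above.
FORTRAN66_NAME = re.compile(r"[A-Z][A-Z0-9]{0,5}")


def is_fortran66_name(name: str) -> bool:
    """Return True if `name` fits the FORTRAN 66 symbolic-name rule (hypothetical helper)."""
    return FORTRAN66_NAME.fullmatch(name) is not None


for candidate in ["X", "COUNT1", "1ABC", "TOOLONG", "MY_VAR", "count"]:
    print(candidate, is_fortran66_name(candidate))
```

Only the first two candidates pass: the others start with a digit, exceed six characters, contain an underscore, or use lower-case letters.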
C
ANSI C (or ISO C or C90) as defined by ISO/IEC 9899:1990 says:
An identifier is a sequence of nondigit characters (including the underscore _ and the lower-case and upper-case letters) and digits. The first character shall be a nondigit character.
C is limited to ASCII letters, but it is case sensitive. Underscore OK, $ not OK.
ISO C lifted the length limitations set 15 years before in the C Reference Manual that came with 6th Edition Unix, where “no more than the first eight characters are significant, and only the first seven for external identifiers”. The practical length of identifiers in ISO C is constrained by the minimum translation limits that every compiler implementation must support: 31 significant characters for an internal identifier.
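To make the difference between "valid" and "significant" concrete, here is a small sketch. It assumes a compiler that keeps exactly the first 31 characters of an internal identifier (real compilers may keep more), and the helper names are invented for this example. The pattern does not exclude reserved words such as `int`.

```python
import re

# Nondigit first (letter or underscore), then any mix of nondigits and digits.
C90_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Assumption: the compiler keeps only the minimum number of significant characters.
SIGNIFICANT_INTERNAL = 31


def is_c90_identifier(name: str) -> bool:
    """Return True if `name` matches the C90 identifier syntax (keywords not excluded)."""
    return C90_IDENTIFIER.fullmatch(name) is not None


def same_internal_identifier(a: str, b: str, significant: int = SIGNIFICANT_INTERNAL) -> bool:
    """Return True if two names collide once truncated to `significant` characters."""
    return a[:significant] == b[:significant]


print(is_c90_identifier("_tmp1"), is_c90_identifier("1tmp"), is_c90_identifier("cash$"))
# Two 32-character names that differ only in their last character.
print(same_internal_identifier("a" * 31 + "x", "a" * 31 + "y"))
# The same check with the eight-character limit of the older C Reference Manual.
print(same_internal_identifier("filename1", "filename2", significant=8))
```

Both collision checks report a clash: the names are syntactically fine, but a compiler that only looks at the significant prefix cannot tell them apart.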
C++ current standard (2003)
The current C++ standard as implemented in currently available compilers has the same character set limitations as C:
identifier:
    nondigit
    identifier nondigit
    identifier digit
nondigit: one of
    _ a b c d e f g h i j k l m n o p q r s t u v w x y z
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
digit: one of
    0 1 2 3 4 5 6 7 8 9
The recommended minimum for the number of significant characters in an internal identifier, a macro name or an external identifier is raised to a grandiose 1024.
Java
In the Java Language Specification, Third Edition, an identifier is defined as an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.
The “Java digits” include the ASCII digits 0-9.
A “Java letter” is defined with reference to the Unicode General Categories. The 30 categories, together with the matching constants of the Java Character class, are listed in this table:
| Abbr | Long | Description | Java Constant Field Value |
|---|---|---|---|
| Cc | Control | a C0 or C1 control code | CONTROL |
| Cf | Format | a format control character | FORMAT |
| Cn | Unassigned | a reserved unassigned code point or a noncharacter | UNASSIGNED |
| Co | Private_Use | a private-use character | PRIVATE_USE |
| Cs | Surrogate | a surrogate code point | SURROGATE |
| Ll | Lowercase_Letter | a lowercase letter | LOWERCASE_LETTER |
| Lm | Modifier_Letter | a modifier letter | MODIFIER_LETTER |
| Lo | Other_Letter | other letters, including syllables and ideographs | OTHER_LETTER |
| Lt | Titlecase_Letter | a digraphic character, with first part uppercase | TITLECASE_LETTER |
| Lu | Uppercase_Letter | an uppercase letter | UPPERCASE_LETTER |
| Mc | Spacing_Mark | a spacing combining mark (positive advance width) | COMBINING_SPACING_MARK |
| Me | Enclosing_Mark | an enclosing combining mark | ENCLOSING_MARK |
| Mn | Nonspacing_Mark | a nonspacing combining mark (zero advance width) | NON_SPACING_MARK |
| Nd | Decimal_Number | a decimal digit | DECIMAL_DIGIT_NUMBER |
| Nl | Letter_Number | a letterlike numeric character | LETTER_NUMBER |
| No | Other_Number | a numeric character of other type | OTHER_NUMBER |
| Pc | Connector_Punctuation | a connecting punctuation mark, like a tie | CONNECTOR_PUNCTUATION |
| Pd | Dash_Punctuation | a dash or hyphen punctuation mark | DASH_PUNCTUATION |
| Pe | Close_Punctuation | a closing punctuation mark (of a pair) | END_PUNCTUATION |
| Pf | Final_Punctuation | a final quotation mark | FINAL_QUOTE_PUNCTUATION |
| Pi | Initial_Punctuation | an initial quotation mark | INITIAL_QUOTE_PUNCTUATION |
| Po | Other_Punctuation | a punctuation mark of other type | OTHER_PUNCTUATION |
| Ps | Open_Punctuation | an opening punctuation mark (of a pair) | START_PUNCTUATION |
| Sc | Currency_Symbol | a currency sign | CURRENCY_SYMBOL |
| Sk | Modifier_Symbol | a non-letterlike modifier symbol | MODIFIER_SYMBOL |
| Sm | Math_Symbol | a symbol of primarily mathematical use | MATH_SYMBOL |
| So | Other_Symbol | a symbol of other type | OTHER_SYMBOL |
| Zl | Line_Separator | U+2028 LINE SEPARATOR only | LINE_SEPARATOR |
| Zp | Paragraph_Separator | U+2029 PARAGRAPH SEPARATOR only | PARAGRAPH_SEPARATOR |
| Zs | Space_Separator | a space character (of various non-zero widths) | SPACE_SEPARATOR |
With the help of this table, we understand that a “Java letter” can be a currency symbol (such as “$”), a connecting punctuation character (such as “_”), or a character in one of the Unicode General Categories Lu, Ll, Lt, Lm or Lo.
It is not clear to me whether “currency symbol” and “connecting punctuation character” mean that the entire CURRENCY_SYMBOL (Sc), CONNECTOR_PUNCTUATION (Pc), DASH_PUNCTUATION (Pd), END_PUNCTUATION (Pe), FINAL_QUOTE_PUNCTUATION (Pf), INITIAL_QUOTE_PUNCTUATION (Pi), OTHER_PUNCTUATION (Po) and START_PUNCTUATION (Ps) Unicode General Categories are included. Maybe somebody with Java skills can fill this void.
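One way to explore the question is to classify a few characters by General Category. The sketch below uses Python's `unicodedata` module rather than Java itself, so it only approximates what a Java compiler does. It assumes the narrow reading in which the whole Sc and Pc categories are accepted and the other punctuation categories are not. That reading, and the name `is_java_letter_sketch`, are assumptions of this example.

```python
import unicodedata

# The letter categories named in the text above.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

# Assumption: "currency symbol" means all of Sc and "connecting punctuation"
# means all of Pc; Pd, Pe, Pf, Pi, Po and Ps are left out.
NARROW_READING = LETTER_CATEGORIES | {"Sc", "Pc"}


def is_java_letter_sketch(ch: str, allowed=NARROW_READING) -> bool:
    """Approximate the "Java letter" test by General Category (hypothetical helper)."""
    return unicodedata.category(ch) in allowed


# "\u203f" is UNDERTIE, a connector punctuation mark other than "_".
for ch in ["A", "é", "$", "€", "_", "\u203f", "-", "(", "1"]:
    print(repr(ch), unicodedata.category(ch), is_java_letter_sketch(ch))
```

Under this reading, “€” and the undertie are accepted alongside “$” and “_”, while “-” (Pd) and “(” (Ps) are rejected. Passing a different `allowed` set lets you try the broader reading. The result also depends on the Unicode version bundled with the Python interpreter, which may differ from the one a given Java release uses.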
The Java programming language gives programmers great liberty in naming identifiers: most Unicode code points are allowed (so programmers can basically use their native languages), and underscore and dollar sign ($) are both OK. An undesirable side effect is that two identifiers are different whenever their Unicode code points differ, even if the glyphs (what you see on the screen) are the same. For example, A and Α are different identifiers in Java because they are respectively LATIN CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA, and a is different from а because they are respectively LATIN SMALL LETTER A and CYRILLIC SMALL LETTER A.
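The look-alike pairs mentioned above are easy to inspect by printing the code point and Unicode name of each character. This is a small Python sketch; `describe` is a helper invented for the example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Each pair renders with (nearly) identical glyphs in most fonts.
look_alikes = [("A", "\u0391"), ("a", "\u0430")]

for left, right in look_alikes:
    print(describe(left))
    print(describe(right))
    print("same string:", left == right)
```

Each pair compares as unequal, which is exactly why a compiler that compares identifiers code point by code point treats them as different names.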
C++ upcoming standard (C++0x)
C++0x, the planned new standard for the C++ programming language due to come out in 2011 or 2012, is more elastic than current C++ in its definition of an identifier:
An identifier is an arbitrarily long sequence of letters and digits, starting with a letter. Upper- and lower-case letters are different. All characters are significant. A “letter” is the usual a-z, A-Z and _ or a “universal-character-name” or “other implementation-defined characters”.
The universal-character-names allowed in an identifier are defined with reference to Annex A (Recommended extended repertoire for user-defined identifiers) of ISO/IEC TR 10176:2003, Fourth edition: Guidelines for the preparation of programming language standards.
According to TR 10176, Annex A, these are the characters which “collectively can be used to generate word-like identifiers for most natural languages of the world”, including “letters (combining or not), syllables, and ideographs together with the modifier letters and marks conventionally used as parts of words”. The acceptable Unicode code points are:
Latin: 0041-005A, 0061-007A, 00AA, 00BA, 00C0-00D6, 00D8-00F6, 00F8-01F5, 01FA-0217, 0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F
Greek: 0386, 0388-038A, 038C, 038E-03A1, 03A3-03CE, 03D0-03D6, 03DA, 03DC, 03DE, 03E0, 03E2-03F3, 1F00-1F15, 1F18-1F1D, 1F20-1F45, 1F48-1F4D, 1F50-1F57, 1F59, 1F5B, 1F5D, 1F5F-1F7D, 1F80-1FB4, 1FB6-1FBC, 1FC2-1FC4, 1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 1FE0-1FEC, 1FF2-1FF4, 1FF6-1FFC
Cyrillic: 0401-040C, 040E-044F, 0451-045C, 045E-0481, 0490-04C4, 04C7-04C8, 04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9
Armenian: 0531-0556, 0561-0587
Hebrew: 05D0-05EA, 05F0-05F2
Hebrew (C): 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2
Arabic: 0621-063A, 0640-064A, 0671-06B7, 06BA-06BE, 06C0-06CE, 06D0-06D3, 06D5, 06E5-06E6
Arabic (C): 064B-0652, 0670, 06D6-06DC, 06E7-06E8, 06EA-06ED
Devanagari: 0905-0939, 0950, 0958-0961
Devanagari (C): 0901-0903, 093E-094D, 0951-0952, 0962-0963
Bengali: 0985-098C, 098F-0990, 0993-09A8, 09AA-09B0, 09B2, 09B6-09B9, 09DC-09DD, 09DF-09E1, 09F0-09F1
Bengali (C): 0981-0983, 09BE-09C4, 09C7-09C8, 09CB-09CD, 09E2-09E3
Gurmukhi: 0A05-0A0A, 0A0F-0A10, 0A13-0A28, 0A2A-0A30, 0A32-0A33, 0A35-0A36, 0A38-0A39, 0A59-0A5C, 0A5E, 0A74
Gurmukhi (C): 0A02, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D
Gujarati: 0A85-0A8B, 0A8D, 0A8F-0A91, 0A93-0AA8, 0AAA-0AB0, 0AB2-0AB3, 0AB5-0AB9, 0ABD, 0AD0, 0AE0
Gujarati (C): 0A81-0A83, 0ABE-0AC5, 0AC7-0AC9, 0ACB-0ACD
Oriya: 0B05-0B0C, 0B0F-0B10, 0B13-0B28, 0B2A-0B30, 0B32-0B33, 0B36-0B39, 0B5C-0B5D, 0B5F-0B61
Oriya (C): 0B01-0B03, 0B3E-0B43, 0B47-0B48, 0B4B-0B4D
Tamil: 0B85-0B8A, 0B8E-0B90, 0B92-0B95, 0B99-0B9A, 0B9C, 0B9E-0B9F, 0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9
Tamil (C): 0B82-0B83, 0BBE-0BC2, 0BC6-0BC8, 0BCA-0BCD
Telugu: 0C05-0C0C, 0C0E-0C10, 0C12-0C28, 0C2A-0C33, 0C35-0C39, 0C60-0C61
Telugu (C): 0C01-0C03, 0C3E-0C44, 0C46-0C48, 0C4A-0C4D
Kannada: 0C85-0C8C, 0C8E-0C90, 0C92-0CA8, 0CAA-0CB3, 0CB5-0CB9, 0CDE, 0CE0-0CE1
Kannada (C): 0C82-0C83, 0CBE-0CC4, 0CC6-0CC8, 0CCA-0CCD
Malayalam: 0D05-0D0C, 0D0E-0D10, 0D12-0D28, 0D2A-0D39, 0D60-0D61
Malayalam (C): 0D02-0D03, 0D3E-0D43, 0D46-0D48, 0D4A-0D4D
Thai: 0E01-0E30, 0E32-0E33, 0E40-0E46, 0E50-0E59
Thai (C): 0E31, 0E34-0E3A, 0E47-0E4E
Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A, 0E8D, 0E94-0E97, 0E99-0E9F, 0EA1-0EA3, 0EA5, 0EA7, 0EAA-0EAB, 0EAD-0EAE, 0EB0, 0EB2-0EB3, 0EBD, 0EC0-0EC4, 0EC6, 0EDC-0EDD
Lao (C): 0EB1, 0EB4-0EB9, 0EBB-0EBC, 0EC8-0ECD
Tibetan: 0F00, 0F40-0F47, 0F49-0F69, 0F88-0F8B
Tibetan (C): 0F18-0F19, 0F35, 0F37, 0F39, 0F71-0F84, 0F86-0F87, 0F90-0F95, 0F97, 0F99-0FAD, 0FB1-0FB7, 0FB9
Georgian: 10A0-10C5, 10D0-10F6
Hiragana: 3041-3093
Katakana: 30A1-30F6, 30FB-30FC
Bopomofo: 3105-312C
Hangul: AC00-D7A3
CJK Unified Ideographs: 4E00-9FA5
Digits: 0030-0039, 0660-0669, 06F0-06F9, 0966-096F, 09E6-09EF, 0A66-0A6F, 0AE6-0AEF, 0B66-0B6F, 0BE7-0BEF, 0C66-0C6F, 0CE6-0CEF, 0D66-0D6F, 0E50-0E59, 0ED0-0ED9, 0F20-0F29
Special characters: 00B5, 02B0-02B8, 02BB, 02BD-02C1, 02D0-02D1, 02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE, 203F-2040, 2102, 2107, 210A-2113, 2115, 2118-211D, 2124, 2126, 2128, 212A-2131, 2133-2138, 2160-2182, 3005-3007, 3021-3029
The upcoming version of C++ will be subject to the same confusing same-glyph, different-code-point syndrome as Java: A != Α and a != а.
The good news is that since the “good” code points are listed explicitly, it is easier for implementations to check whether a character is acceptable, whereas a Java implementation needs access to the Unicode tables to know whether a character belongs to a certain General Category.
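A check against a fixed list of ranges can be sketched as a binary search over a sorted table. The table below is only an excerpt (part of the Latin, Greek and Cyrillic entries from the list above), not the full Annex A repertoire, and the names `ALLOWED_RANGES` and `in_allowed_ranges` are invented for this example.

```python
from bisect import bisect_right

# Excerpt only: a few (first, last) code point ranges copied from the list above,
# kept sorted by their first code point.
ALLOWED_RANGES = [
    (0x0041, 0x005A), (0x0061, 0x007A), (0x00AA, 0x00AA), (0x00BA, 0x00BA),
    (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x01F5),
    (0x0386, 0x0386), (0x0388, 0x038A), (0x038C, 0x038C),
    (0x038E, 0x03A1), (0x03A3, 0x03CE),
    (0x0401, 0x040C), (0x040E, 0x044F),
]
RANGE_STARTS = [first for first, _ in ALLOWED_RANGES]


def in_allowed_ranges(ch: str) -> bool:
    """Return True if the code point of `ch` falls inside one of the listed ranges."""
    code_point = ord(ch)
    # Index of the last range whose first code point is <= code_point.
    index = bisect_right(RANGE_STARTS, code_point) - 1
    return index >= 0 and code_point <= ALLOWED_RANGES[index][1]


# "\u0391" is GREEK CAPITAL LETTER ALPHA, "\u0430" is CYRILLIC SMALL LETTER A,
# "\u00d7" is MULTIPLICATION SIGN, which sits in the gap between 00D6 and 00D8.
for ch in ["A", "\u0391", "\u0430", "\u00d7", "$"]:
    print(f"U+{ord(ch):04X}", in_allowed_ranges(ch))
```

The lookup needs nothing but the table itself, with no Unicode database behind it. Note that the underscore is not in these ranges, since the quoted definition allows it separately from the universal-character-names.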