Specification

23 July 2022

Introduction

This is the specification and reference for Yao, the language.

Yao aspires to be a general-purpose, extensible, systems, scripting, and shell programming language.

“General-purpose” means that it can be used for anything in programming.

“Extensible” means that users can extend it with their own ideas.

“Systems programming language” means that Yao can be used for building systems, which means that it can communicate with code outside itself (see “Some Were Meant for C” by Stephen Kell).

“Scripting programming language” means that Yao is suitable for fast prototyping and throw-away programs.

“Shell programming language” means that Yao is designed to be used in a shell, making it easy to run outside commands from the language itself, while still providing all of the capabilities and convenience of a normal programming language.

Note

This document uses some material copied from The Go Programming Language Specification, which is licensed under the Creative Commons Attribution 3.0 License.

Warning

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Terms

character

Any Unicode code point whether multiple bytes or not.

This definition is for simplicity in referring to Unicode code points throughout this document.

compiler

Any program that creates executable code from Yao source code or executes Yao code directly.

This means that interpreters fall under this definition.

implementation

A combination of compilers and standard library that implements this specification.

standard library

A library that contains the APIs, tokens, keywords, and everything else required to allow a compiler to translate Yao source code and make it runnable.

Notation

The syntax is defined using Extended Backus-Naur Form (EBNF):

Production  = production_name "=" [ Expression ] "." .
Expression  = Alternative { "|" Alternative } .
Alternative = Term { Term } .
Term        = production_name | token [ "…" token ] | Group | Option | Repetition .
Group       = "(" Expression ")" .
Option      = "[" Expression "]" .
Repetition  = "{" Expression "}" .

Productions are expressions constructed from terms and the following operators, in increasing precedence:

|  alternation
() grouping
[] option (0 or 1 times)
{} repetition (0 to n times)

Lower-case production names are used to identify lexical tokens. Non-terminals are in CamelCase. Lexical tokens are enclosed in double quotes “” or back quotes ``.

The form a … b represents the set of characters from a through b as alternatives. The horizontal ellipsis … is also used elsewhere in the spec to informally denote various enumerations or code snippets that are not further specified. The character … (as opposed to the three characters ...) is not a token of the Yao language.

Source Code Representation

Source code is Unicode text encoded in UTF-8.

Each code point is distinct; for instance, upper and lower case letters are different characters.

Warning

Implementation Restriction: The compiler MUST NOT canonicalize the text, so a single accented code point is distinct from the same character constructed from combining an accent and a letter; the latter is treated as two code points.

Warning

Implementation Restriction: A compiler MUST disallow the NUL character (U+0000) in the source text.

Warning

Implementation Restriction: A compiler MUST allow and ignore a UTF-8-encoded byte order mark (U+FEFF) if it is the first Unicode code point in the source text.

Warning

Implementation Restriction: A byte order mark MUST be disallowed anywhere else in the source.
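The NUL and byte order mark restrictions above can be sketched as a small source-loading routine. This is a minimal Python sketch, not part of any Yao implementation; the name load_source and the use of ValueError for diagnostics are assumptions made for illustration.

```python
def load_source(raw: bytes) -> str:
    """Decode source bytes and apply the NUL and byte order mark restrictions."""
    # Python's "utf-8" codec keeps a leading U+FEFF, so it can be inspected here.
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    if text.startswith("\ufeff"):
        text = text[1:]  # a leading byte order mark is allowed and ignored
    if "\ufeff" in text:
        raise ValueError("byte order mark is only allowed at the start of the source")
    if "\x00" in text:
        raise ValueError("NUL character (U+0000) is not allowed in source text")
    return text
```

The sketch deliberately performs no Unicode normalization of the decoded text.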

Characters

The following terms are used to denote specific Unicode character classes:

newline        = /* the Unicode code point U+000A */ .
unicode_char   = /* an arbitrary Unicode code point except newline */ .
unicode_letter = /* a Unicode code point classified as "Letter" */ .
unicode_digit  = /* a Unicode code point classified as "Number, decimal digit" */ .

In The Unicode Standard 8.0, Section 4.5 “General Category” defines a set of character categories. Yao treats all characters in any of the Letter categories Lu, Ll, Lt, Lm, or Lo as Unicode letters, and those in the Number category Nd as Unicode digits.
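As a rough illustration of these character classes, the following Python sketch classifies a code point by its General Category. The function names are hypothetical, and Python's unicodedata module uses the Unicode version bundled with the interpreter, which is not necessarily 8.0, so results for recently added characters may differ.

```python
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def is_unicode_letter(ch: str) -> bool:
    """True if the code point is in one of the Letter categories."""
    return unicodedata.category(ch) in LETTER_CATEGORIES

def is_unicode_digit(ch: str) -> bool:
    """True if the code point is in the category Nd (Number, decimal digit)."""
    return unicodedata.category(ch) == "Nd"
```

Under this classification the Roman numeral Ⅷ (U+2167, category Nl) is neither a letter nor a digit, and the underscore (category Pc) is not a Unicode letter.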

Architecture

Compilers are required to have, or appear to have, the following architecture:

                             |------------------------------------------------|
                             |                       Yvm                      |
|-------|     |--------|     | |----------|     |-----------|     |---------| |
| Lexer | --> | Parser | --> | | Analyzer | --> | Optimizer | --> | Backend | |
|-------|     |--------|     | |----------|     |-----------|     |---------| |
                             |------------------------------------------------|

Each box represents a stage of the compiler. Each stage must be independent of the others.

Warning

Implementation Restriction: All stages above must be implemented as though they only make one pass.

Lexer

The lexer stage is required to break the text of a Yao file into tokens.

Parser

The parser stage is required to parse the source code and generate the intermediate representation (IR) called YIR, which is part of Yvm.

Syntax

This section describes the syntax of Yao source code.

Letters and Digits

The underscore character _ (U+005F) is considered a letter.

letter        = unicode_letter | "_" .
decimal_digit = "0" … "9" .
binary_digit  = "0" | "1" .
octal_digit   = "0" … "7" .
hex_digit     = "0" … "9" | "A" … "F" | "a" … "f" .

Lexical Elements

Each of the elements below is indivisible. This means that implementations must not split them in any way.

Comments

Comments serve as program documentation. There are two forms:

  1. Line comments start with the character sequence // and stop at the end of the line.

  2. General comments start with the character sequence /* and stop with the Nth */ character sequence, where N is the number of times the character sequence /* appears in the comment, including the start of the comment. In other words, general comments nest.

A comment cannot start inside a rune literal or string literal, or inside a comment. Comments act like a space.
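The nesting rule for general comments amounts to keeping a depth counter. The following Python sketch shows one way to find the end of a general comment; the function name and error handling are assumptions, and the sketch looks only at the comment delimiters themselves.

```python
def skip_general_comment(text: str, start: int) -> int:
    """Return the index just past the general comment that begins at `start`."""
    if not text.startswith("/*", start):
        raise ValueError("no general comment starts here")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1          # every opener increases the nesting depth
            i += 2
        elif text.startswith("*/", i):
            depth -= 1          # every closer decreases it
            i += 2
            if depth == 0:      # the Nth closer ends the whole comment
                return i
        else:
            i += 1
    raise ValueError("unterminated general comment")
```

For the input /* a /* b */ c */ x, the comment ends after the second */, leaving " x" as the remaining text.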

Tokens

Tokens form the vocabulary of the Yao language. There are three classes: identifiers, operators and punctuation, and literals.

Whitespace, formed by horizontal tabs (U+0009), newlines (U+000A), carriage returns (U+000D), spaces (U+0020), and comments, is ignored, except as it separates tokens that would otherwise combine into a single token.

While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.
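The longest-match rule can be illustrated with a small Python sketch. The operator set below is purely hypothetical, since this section does not list Yao's operators; only the selection logic is the point.

```python
# Hypothetical operator set, for illustration only; not taken from this specification.
OPERATORS = ["<", "<<", "<<=", "=", "=="]

def next_operator(text: str, pos: int) -> str:
    """Return the longest operator from OPERATORS that starts at `pos`."""
    best = ""
    for op in OPERATORS:
        if text.startswith(op, pos) and len(op) > len(best):
            best = op
    if not best:
        raise ValueError(f"no operator at position {pos}")
    return best
```

With this set, the input <<=1 yields the single token <<= rather than < followed by <=, while << = yields << because the intervening space separates the tokens.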

Semicolons

The formal grammar uses semicolons ; as terminators in a number of productions. All of the semicolons used in the grammar are required.

Identifiers

Identifiers name program entities such as variables and types. An identifier is a sequence of one or more letters and digits. The first character in an identifier must be a letter and must not be an underscore.

identifier = unicode_letter { letter | unicode_digit } .
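A direct reading of the identifier production can be sketched in Python as follows. The function name is hypothetical, and the sketch checks only the production itself; it does not account for predeclared identifiers or for the underscore allowance granted to implementations.

```python
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def is_identifier(s: str) -> bool:
    """Check `s` against: identifier = unicode_letter { letter | unicode_digit } ."""
    # The first character must be a Unicode letter, which excludes the underscore.
    if not s or unicodedata.category(s[0]) not in LETTER_CATEGORIES:
        return False
    for ch in s[1:]:
        cat = unicodedata.category(ch)
        # Later characters may be letters (including "_") or decimal digits.
        if not (ch == "_" or cat in LETTER_CATEGORIES or cat == "Nd"):
            return False
    return True
```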

Note

Implementation Allowance: Identifiers that begin with an underscore are reserved for implementations and are thus allowed for implementations to use.

Warning

Implementation Restriction: Implementations are not allowed to use two consecutive underscores at the beginning of identifiers. Such identifiers are reserved for C implementations and using them would invoke Undefined Behavior in C.

Some identifiers are predeclared.

Integer Literals

An integer literal is a sequence of digits representing an integer constant. An optional prefix sets a non-decimal base: 0b for binary, 0o for octal, and 0x for hexadecimal.

In hexadecimal literals, letters a through f and A through F represent values 10 through 15.

For readability, an underscore character _ may appear after a base prefix or between successive digits; such underscores do not change the literal’s value.

int_lit = decimal_lit | binary_lit | octal_lit | hex_lit .

decimal_lit = "0" | ( "1" … "9" ) [ [ "_" ] decimal_digits ] .
binary_lit  = "0" ( "b" ) [ "_" ] binary_digits .
octal_lit   = "0" ( "o" ) [ "_" ] octal_digits .
hex_lit     = "0" ( "x" ) [ "_" ] hex_digits .

decimal_digits = decimal_digit { [ "_" ] decimal_digit } .
binary_digits  = binary_digit { [ "_" ] binary_digit } .
octal_digits   = octal_digit { [ "_" ] octal_digit } .
hex_digits     = hex_digit { [ "_" ] hex_digit } .

Examples:

42
4_2
0o600
0xBadFace
0x_67_7a_2f_cc_40_c6
170141183460469231731687303715884105727
170_141183_460469_231731_687303_715884_105727

_42         // an identifier, not an integer literal
42_         // invalid: _ must separate successive digits
4__2        // invalid: only one _ at a time
0_xBadFace  // invalid: _ must separate successive digits

Warning

It is a compile error if there is a digit in the number that does not make sense in the base. For example, it is a compile error if any digit besides 0 and 1 appears in a binary number, or if any digit greater than 7 appears in an octal number.
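The integer literal grammar above translates fairly directly into a regular expression. The following Python sketch validates a literal and computes its value; the names INT_LIT and parse_int_lit are assumptions, and a real lexer would report a compile error rather than raise ValueError.

```python
import re

# One alternative per production: binary_lit, octal_lit, hex_lit, then decimal_lit.
INT_LIT = re.compile(
    r"0b(_?[01])+"
    r"|0o(_?[0-7])+"
    r"|0x(_?[0-9A-Fa-f])+"
    r"|0"
    r"|[1-9](_?[0-9])*"
)

def parse_int_lit(text: str) -> int:
    """Return the value of an integer literal, or raise ValueError if it is malformed."""
    if INT_LIT.fullmatch(text) is None:
        raise ValueError(f"invalid integer literal: {text!r}")
    # Underscores do not change the value; base 0 lets int() honor the prefix.
    return int(text.replace("_", ""), 0)
```

For example, parse_int_lit("4_2") returns 42 and parse_int_lit("0o600") returns 384, while 42_, 4__2, and 0_xBadFace are rejected.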

Floating-Point Literals

A floating-point literal is a decimal or hexadecimal representation of a floating-point constant.

A decimal floating-point literal consists of an integer part (decimal digits), a decimal point, a fractional part (decimal digits), and an exponent part (e followed by an optional sign and decimal digits). One of the integer part or the fractional part may be elided; one of the decimal point or the exponent part may be elided. An exponent value exp scales the mantissa (integer and fractional part) by 10^exp.

A hexadecimal floating-point literal consists of a 0x prefix, an integer part (hexadecimal digits), a radix point, a fractional part (hexadecimal digits), and an exponent part (p followed by an optional sign and decimal digits). One of the integer part or the fractional part may be elided; the radix point may be elided as well, but the exponent part is required. (This syntax matches the one given in IEEE 754-2008 §5.12.3.) An exponent value exp scales the mantissa (integer and fractional part) by 2^exp.

For readability, an underscore character _ may appear after a base prefix or between successive digits; such underscores do not change the literal value.

float_lit = decimal_float_lit | hex_float_lit .

decimal_float_lit = decimal_digits "." [ decimal_digits ] [ decimal_exponent ]
                  | decimal_digits decimal_exponent
                  | "." decimal_digits [ decimal_exponent ] .

decimal_exponent = ( "e" ) [ "+" | "-" ] decimal_digits .

hex_float_lit = "0" ( "x" ) hex_mantissa hex_exponent .
hex_mantissa  = [ "_" ] hex_digits "." [ hex_digits ]
              | [ "_" ] hex_digits
              | "." hex_digits .

hex_exponent = ( "p" ) [ "+" | "-" ] decimal_digits .

Examples:

0.
72.40
072.40       // == 72.40
2.71828
1.e+0
6.67428e-11
.25
1_5.         // == 15.0
0.15e+0_2    // == 15.0

0x1p-2       // == 0.25
0x2.p10      // == 2048.0
0x1.Fp+0     // == 1.9375
0x.8p-0      // == 0.5
0x15e-2      // == 0x15e - 2 (integer subtraction)

0x.p1        // invalid: mantissa has no digits
1p-2         // invalid: p exponent requires hexadecimal mantissa
0x1.5e-2     // invalid: hexadecimal mantissa requires p exponent
1_.5         // invalid: _ must separate successive digits
1._5         // invalid: _ must separate successive digits
1.5_e1       // invalid: _ must separate successive digits
1.5e_1       // invalid: _ must separate successive digits
1.5e1_       // invalid: _ must separate successive digits
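The floating-point grammar can be checked in the same way as a regular expression. The Python sketch below is illustrative only: the names are assumptions, values are computed as IEEE 754 doubles because that is what Python's float provides, and the hexadecimal form relies on the standard float.fromhex function.

```python
import re

DEC = r"[0-9](?:_?[0-9])*"                # decimal_digits
HEX = r"[0-9A-Fa-f](?:_?[0-9A-Fa-f])*"    # hex_digits
DEC_EXP = rf"e[+-]?{DEC}"                 # decimal_exponent

FLOAT_LIT = re.compile(
    rf"(?P<hex>0x(?:_?{HEX}\.(?:{HEX})?|_?{HEX}|\.{HEX})p[+-]?{DEC})"
    rf"|{DEC}\.(?:{DEC})?(?:{DEC_EXP})?"
    rf"|{DEC}{DEC_EXP}"
    rf"|\.{DEC}(?:{DEC_EXP})?"
)

def parse_float_lit(text: str) -> float:
    """Return the value of a floating-point literal, or raise ValueError."""
    m = FLOAT_LIT.fullmatch(text)
    if m is None:
        raise ValueError(f"invalid floating-point literal: {text!r}")
    digits = text.replace("_", "")  # underscores do not change the value
    return float.fromhex(digits) if m.group("hex") else float(digits)
```

The sketch accepts 0x1p-2 (0.25) and 1_5. (15.0), and rejects 0x.p1, 1p-2, and 1.5_e1. It also rejects 0x15e-2, which is not a single floating-point literal but an integer subtraction.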

Imaginary Literals

An imaginary literal represents the imaginary part of a complex constant. It consists of an integer or floating-point literal followed by the lower-case letter i. The value of an imaginary literal is the value of the respective integer or floating-point literal multiplied by the imaginary unit i.

imaginary_lit = ( decimal_digits | int_lit | float_lit ) "i" .

Examples:

0i
0o123i        // == 0o123 * 1i == 83i
0xabci        // == 0xabc * 1i == 2748i
0.i
2.71828i
1.e+0i
6.67428e-11i
1e6i
.25i
.12345e+5i
0x1p-2i       // == 0x1p-2 * 1i == 0.25i

Rune Literals

A rune literal represents a rune constant, an integer value identifying a Unicode code point. A rune literal is expressed as one or more characters enclosed in single quotes, as in 'x' or '\n'. Within the quotes, any character may appear except newline and unescaped single quote. A single quoted character represents the Unicode value of the character itself, while multi-character sequences beginning with a backslash encode values in various formats.

The simplest form represents the single character within the quotes; since Yao source text is Unicode characters encoded in UTF-8, multiple UTF-8-encoded bytes may represent a single integer value. For instance, the literal 'a' holds a single byte representing a literal a, Unicode U+0061, value 0x61, while 'ä' holds two bytes (0xc3 0xa4) representing a literal a-dieresis, U+00E4, value 0xe4.

Several backslash escapes allow arbitrary values to be encoded as ASCII text. There are three ways to represent the integer value as a numeric constant: \x followed by exactly two hexadecimal digits, \u followed by exactly four hexadecimal digits, and \U followed by exactly eight hexadecimal digits. In each case the value of the literal is the value represented by the digits in the corresponding base.

Although these representations all result in an integer, they have different valid ranges. A \x escape must represent a value between 0 and 255 inclusive, which it satisfies by construction. The escapes \u and \U represent Unicode code points, so within them some values are illegal, in particular those above 0x10FFFF and surrogate halves.

After a backslash, certain single-character escapes represent special values:

\a   /* U+0007 alert or bell */
\b   /* U+0008 backspace */
\f   /* U+000C form feed */
\n   /* U+000A line feed or newline */
\r   /* U+000D carriage return */
\t   /* U+0009 horizontal tab */
\v   /* U+000B vertical tab */
\\   /* U+005C backslash */
\'   /* U+0027 single quote (valid escape only within rune literals) */
\"   /* U+0022 double quote (valid escape only within string literals) */

All other sequences starting with a backslash are illegal inside rune literals.

rune_lit = "'" ( unicode_value | hex_byte_value ) "'" .

unicode_value  = unicode_char | little_u_value | big_u_value | escaped_char .
hex_byte_value = `\` "x" hex_digit hex_digit .
little_u_value = `\` "u" hex_digit hex_digit hex_digit hex_digit .
big_u_value    = `\` "U" hex_digit hex_digit hex_digit hex_digit hex_digit hex_digit hex_digit hex_digit .
escaped_char   = `\` ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | `\` | "'" | `"` ) .

Examples:

'a'
'ä'
'本'
'\t'
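'\000'       // illegal: octal escapes are not defined
'\007'       // illegal: octal escapes are not defined
'\377'       // illegal: octal escapes are not defined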
'\x07'
'\xff'
'\u12e4'
'\U00101234'
'\''         // rune literal containing single quote character
'aa'         // illegal: too many characters
'\xa'        // illegal: too few hexadecimal digits
'\0'         // illegal: octal escapes are not defined
'\uDFFF'     // illegal: surrogate half
'\U00110000' // illegal: invalid Unicode code point
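The rune literal rules can be sketched as a small decoder that follows the rune_lit grammar. The Python below is illustrative: the names are hypothetical, it takes the literal including its quotes, and it accepts only the escapes that the grammar lists for rune literals.

```python
SIMPLE_ESCAPES = {"a": 0x07, "b": 0x08, "f": 0x0C, "n": 0x0A, "r": 0x0D,
                  "t": 0x09, "v": 0x0B, "\\": 0x5C, "'": 0x27}
HEX_ESCAPE_LENGTHS = {"x": 2, "u": 4, "U": 8}
HEX_DIGITS = set("0123456789abcdefABCDEF")

def rune_value(literal: str) -> int:
    """Return the integer value of a rune literal, or raise ValueError."""
    if len(literal) < 3 or literal[0] != "'" or literal[-1] != "'":
        raise ValueError("rune literal must be enclosed in single quotes")
    body = literal[1:-1]
    if body[0] != "\\":
        # A plain character: exactly one code point, not a quote or newline.
        if len(body) != 1 or body in ("'", "\n"):
            raise ValueError("expected exactly one character")
        return ord(body)
    kind, digits = body[1:2], body[2:]
    if kind in SIMPLE_ESCAPES and not digits:
        return SIMPLE_ESCAPES[kind]
    if kind in HEX_ESCAPE_LENGTHS:
        if len(digits) != HEX_ESCAPE_LENGTHS[kind] or not set(digits) <= HEX_DIGITS:
            raise ValueError("wrong number of hexadecimal digits")
        value = int(digits, 16)
        # \u and \U must name a valid code point that is not a surrogate half.
        if kind != "x" and (value > 0x10FFFF or 0xD800 <= value <= 0xDFFF):
            raise ValueError("not a valid Unicode code point")
        return value
    raise ValueError("unknown escape sequence")
```

Because the grammar has no octal production, this sketch treats a backslash followed by a digit as an unknown escape.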

String Literals

A string literal represents a string constant obtained from concatenating a sequence of characters. There are two forms: raw string literals and interpreted string literals.

Raw string literals are character sequences between two sets of three consecutive double quotes, as in """foo""". Within the quotes, any character may appear except three consecutive double quotes. The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the quotes; in particular, backslashes have no special meaning and the string may contain newlines. Carriage return characters (\r) inside raw string literals are discarded from the raw string value.

Interpreted string literals are character sequences between double quotes, as in "bar". Within the quotes, any character may appear except newline and unescaped double quote. The text between the quotes forms the value of the literal, with backslash escapes interpreted as they are in rune literals (except that \' is illegal and \" is legal), with the same restrictions. The two-digit hexadecimal (\xnn) escape represents individual bytes of the resulting string; all other escapes represent the (possibly multi-byte) UTF-8 encoding of individual characters. Thus inside a string literal \xFF represents a single byte of value 0xFF (255), while ÿ, \u00FF, \U000000FF, and \xc3\xbf represent the two bytes 0xc3 0xbf of the UTF-8 encoding of character U+00FF.

string_lit             = raw_string_lit | interpreted_string_lit .
raw_string_lit         = `"""` { unicode_char | newline } `"""` .
interpreted_string_lit = `"` { unicode_value | hex_byte_value } `"` .

Examples:

"""abc"""            // same as "abc"
"""\n
\n"""                // same as "\\n\n\\n"
"\n"
"\""                 // same as `"`
"Hello, world!\n"
"日本語"
"\u65e5本\U00008a9e"
"\xff\u00FF"
"\uD800"             // illegal: surrogate half
"\U00110000"         // illegal: invalid Unicode code point

These examples all represent the same string:

"日本語"                                 // UTF-8 input text
"""日本語"""                             // UTF-8 input text as a raw literal
"\u65e5\u672c\u8a9e"                    // the explicit Unicode code points
"\U000065e5\U0000672c\U00008a9e"        // the explicit Unicode code points
"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"  // the explicit UTF-8 bytes

Warning

If the source code represents a character as two code points, such as a combining form involving an accent and a letter, the result will be an error if placed in a rune literal (it is not a single code point), and will appear as two code points if placed in a string literal.
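Both points can be checked with a few lines of Python. This is only an illustration: Python strings are sequences of code points rather than bytes, so the sketch encodes to UTF-8 explicitly, and the helper name code_points is an assumption.

```python
# Four spellings of the same text; each must encode to the same nine UTF-8 bytes.
utf8_bytes = b"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"
spellings = [
    "日本語",
    "\u65e5\u672c\u8a9e",
    "\U000065e5\U0000672c\U00008a9e",
    utf8_bytes.decode("utf-8"),
]
same_bytes = all(s.encode("utf-8") == utf8_bytes for s in spellings)

def code_points(s: str) -> list:
    """List the code points of `s` in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

precomposed = "\u00e9"  # e with acute accent as a single code point
combining = "e\u0301"   # letter e followed by a combining acute accent
```

Here code_points(precomposed) is a one-element list and code_points(combining) has two elements, so only the precomposed form could appear in a rune literal.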

Constants

There are boolean constants, rune constants, integer constants, floating-point constants, complex constants, and string constants. Integer, floating-point, and complex constants are collectively called numeric constants.

A constant value is represented by a rune literal, integer literal, floating-point literal, imaginary literal, or string literal, an identifier denoting a constant, a constant expression, a conversion with a result that is a constant, or the result value of pure and total functions with all constant arguments. The boolean truth values are represented by the predeclared constants true and false.

In general, complex constants are a form of constant expression and are discussed in that section.

Numeric constants represent exact values of arbitrary precision and do not overflow.

All constants are typed. The boolean constants (true and false) are bool types. Rune constants are either char or rune, depending on their size. Numeric constants are num types.

A constant may be given a type explicitly by a constant declaration or conversion, or implicitly when used in a variable declaration or an assignment or as an operand in an expression. It is an error if the constant value cannot be represented as a value of the respective type.

Warning

Implementation Restriction: When an integer constant is assigned to a variable or constant of a type with limited width, the compiler is required to automatically convert from _ to the specified type.

Warning

It is a compile error if, when assigning an integer constant to a variable or constant of limited width, the specified type cannot represent the integer constant exactly.

Warning

Implementation Restriction: When a floating-point constant is assigned to a variable or constant of a type with limited width, the compiler is required to automatically convert from _ to the specified type.

Warning

It is a compile error if, when assigning a floating-point constant to a variable or constant of limited width, the specified type cannot represent the floating-point constant without overflow.

These requirements apply both to literal constants and to the result of evaluating constant expressions.
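The exactness requirement for integer constants can be sketched as a range check. The Python below is a minimal illustration; the function name and the bits and signed parameters are assumptions, since this section does not name Yao's fixed-width integer types, and two's complement ranges are assumed for signed types.

```python
def check_int_constant(value: int, bits: int, signed: bool) -> int:
    """Return `value` if a `bits`-wide integer type can represent it exactly."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    if not lo <= value <= hi:
        # A compiler would report a compile error at this point.
        raise OverflowError(f"constant {value} does not fit in {bits} bits")
    return value
```

Python integers have arbitrary precision, which mirrors the rule that numeric constants do not overflow until they are given a type of limited width.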

Variables

A variable is a storage location for holding a value. The set of permissible values is determined by the variable’s type.

A variable declaration or, for function parameters and results, the signature of a function declaration or function literal reserves storage for a named variable.

Structured variables of array and struct types have elements and fields that may be addressed individually. Each such element acts like a variable.

The static type (or just type) of a variable is the type given in its declaration, the type provided in the new call or composite literal, or the type of an element of a structured variable. Variables of interface type also have a distinct dynamic type, which is the concrete type of the value assigned to the variable at run time (unless the value is the predeclared identifier nil, which has no type). The dynamic type may vary during execution but values stored in interface variables are always assignable to the static type of the variable.

Examples:

x: Any  // x is nil and has static type Any
v: @T   // v has value nil, static type @T
x = 42  // x has value 42 and dynamic type int
x = v   // x has value (*T)(nil) and dynamic type @T

A variable’s value is retrieved by referring to the variable in an expression; it is the most recent value assigned to the variable. If a variable has not yet been assigned a value, its value is the zero value for its type.

Standard Library

Predeclared Identifiers

In defense against https://www.trojansource.codes/:

TODO: Require an explicit Unicode control character to allow switching scripts in the middle of an identifier?

TODO: Implement the mitigations suggested by the link above:

Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.

Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.

Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.
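One possible shape for the first mitigation is a check for unterminated bidirectional control characters within a comment or string literal. The Python sketch below uses a simplified stack model rather than the full Unicode Bidirectional Algorithm, so it is conservative: it may flag some sequences that the full algorithm would consider closed. The function name is hypothetical.

```python
EMBEDDINGS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATES = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING, closes an embedding or override
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE, closes an isolate

def has_unterminated_bidi(text: str) -> bool:
    """True if `text` opens a bidi embedding or isolate that it never closes."""
    expected_closers = []
    for ch in text:
        if ch in EMBEDDINGS:
            expected_closers.append(PDF)
        elif ch in ISOLATES:
            expected_closers.append(PDI)
        elif ch in (PDF, PDI):
            # Only a closer that matches the innermost opener pops it;
            # stray or mismatched closers are ignored.
            if expected_closers and expected_closers[-1] == ch:
                expected_closers.pop()
    return bool(expected_closers)
```

A compiler could run such a check on the contents of each comment and string literal and report an error or warning when it returns true.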

Also, implement https://news.ycombinator.com/item?id=29172311. To make things easier on users, have an easy way in the language to say “use the contents of such a file as a string”.

Use https://www.unicode.org/reports/tr31/ for identifiers, but also exclude Hangul filler and half-width Hangul filler letters from identifiers.