Specification¶
24 January 2023
Introduction¶
This is the specification for Yvm, the compiler library, and its Intermediate Representation.
Note
This document uses some material copied from The Go Programming Language Specification, which is licensed under the Creative Commons Attribution 3.0 License.
Warning
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
Terms¶
- backend
The portion of an implementation that translates YIR to machine code.
- bit width
The number of binary bits occupied by an object.
- compiler
Any program that either translates YIR to machine code, or interprets YIR.
- element
A single object, possibly one of many within an array type.
- field
An item in a struct type or a union type.
- machine code
Code that can be executed directly by a machine, whether the machine is physical or virtual.
- YIR
Yvm Intermediate Representation, the language on which Yvm is built.
Notation¶
The syntax is defined using Extended Backus-Naur Form (EBNF):
Production = production_name "=" [ Expression ] "." .
Expression = Alternative { "|" Alternative } .
Alternative = Term { Term } .
Term = production_name | token [ "…" token ] | Group | Option | Repetition .
Group = "(" Expression ")" .
Option = "[" Expression "]" .
Repetition = "{" Expression "}" .
Productions are expressions constructed from terms and the following operators, in increasing precedence:
| alternation
() grouping
[] option (0 or 1 times)
{} repetition (0 to n times)
Lower-case production names are used to identify lexical tokens. Non-terminals are in CamelCase. Lexical tokens are enclosed in double quotes “” or back quotes ``.
The form a … b represents the set of characters from a through b as alternatives. The horizontal ellipsis … is also used elsewhere in the spec to informally denote various enumerations or code snippets that are not further specified. The character … (as opposed to the three characters ...) is not a token of the YIR language.
Source Code Representation¶
Source code is Unicode text encoded in UTF-8.
Each code point is distinct; for instance, upper and lower case letters are different characters.
Warning
Implementation Restriction: The compiler MUST canonicalize the text, so a single accented code point is not distinct from the same character constructed from combining an accent and a letter.
Warning
Implementation Restriction: A compiler MUST disallow the NUL character (U+0000) in the source text.
Warning
Implementation Restriction: A compiler MUST disallow the byte order mark (U+FEFF) in the source text.
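As a rough illustration of the restrictions above, the following Python sketch decodes and checks a source file. The function name load_source is hypothetical, and the use of NFC as the canonical form is an assumption, since this specification does not name a normalization form.

```python
import unicodedata


def load_source(raw: bytes) -> str:
    """Decode and validate source text (hypothetical helper, not part of Yvm)."""
    # Source code is Unicode text encoded in UTF-8; invalid bytes raise here.
    text = raw.decode("utf-8")
    if "\u0000" in text:
        raise ValueError("NUL character (U+0000) is not allowed in source text")
    if "\ufeff" in text:
        raise ValueError("byte order mark (U+FEFF) is not allowed in source text")
    # Assumption: NFC is the canonical form; the specification does not say which.
    return unicodedata.normalize("NFC", text)
```

With this sketch, the letter a followed by a combining diaeresis (U+0308) and the single code point U+00E4 load as the same text.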
Characters¶
The following terms are used to denote specific Unicode character classes:
newline = /* the Unicode code point U+000A */ .
unicode_char = /* an arbitrary Unicode code point except newline */ .
unicode_letter = /* a Unicode code point classified as "Letter" */ .
unicode_digit = /* a Unicode code point classified as "Number, decimal digit" */ .
In The Unicode Standard 8.0, Section 4.5 “General Category” defines a set of character categories. YIR treats all characters in any of the Letter categories Lu, Ll, Lt, Lm, or Lo as Unicode letters, and those in the Number category Nd as Unicode digits.
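The two character classes can be checked with a Unicode database lookup. The sketch below uses Python's standard unicodedata module; the function names are hypothetical, and the module follows the Unicode version bundled with the interpreter, which may be newer than 8.0.

```python
import unicodedata

# The Letter categories named in the text above.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def is_unicode_letter(ch: str) -> bool:
    """True if ch is in one of the Letter categories."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_unicode_digit(ch: str) -> bool:
    """True if ch is in the Number category Nd (decimal digit)."""
    return unicodedata.category(ch) == "Nd"
```

Note that the underscore is in category Pc, so it is not a unicode_letter by this test; YIR adds it separately in the letter production.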
Architecture¶
Yvm implementations are required to have the following architecture:
|----------| |-----------| |---------|
| Analyzer | --> | Optimizer | --> | Backend |
|----------| |-----------| |---------|
Analyzer¶
The analyzer will be built with a series of stages, which should be controllable by the user.
Those stages will be run on the code that Yvm is given and produce annotations.
In order to run passes, users will also pass YIR code to Yvm to run. Yvm implementations should execute that code, and it is expected that users will encode their analysis logic in that code, using it to call analysis passes as desired.
Optimizer¶
The optimizer will be built with a series of stages, which should be controllable by the user.
Those stages will be run on the code that Yvm is given. These stages should either produce annotations or transform the code they are given.
In order to run passes, users will also pass YIR code to Yvm to run. Yvm implementations should execute that code, and it is expected that users will encode their optimization logic in that code, using it to call optimization passes as desired.
Backend¶
The backend will be built with a series of stages, which should be controllable by the user.
Those stages will be run on the code that Yvm is given. These stages should continually refine the Yvm code into machine code.
In order to run passes, users will also pass YIR code to Yvm to run. Yvm implementations should execute that code, and it is expected that users will encode their code generation logic in that code, using it to call code generation passes as desired.
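The three-part architecture can be pictured as a chain of user-selected stages. The Python sketch below is only an illustration: Module, the stage functions, and the instruction strings are all hypothetical, and the Python lists of stages stand in for the YIR driver code that users would pass to Yvm.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Module:
    """Hypothetical stand-in for YIR code plus the annotations stages attach."""
    code: List[str]
    annotations: Dict[str, int] = field(default_factory=dict)


Stage = Callable[[Module], Module]


def count_instructions(m: Module) -> Module:
    # Analyzer stage: produces an annotation and leaves the code alone.
    m.annotations["instruction_count"] = len(m.code)
    return m


def drop_nops(m: Module) -> Module:
    # Optimizer stage: transforms the code.
    m.code = [inst for inst in m.code if inst != "nop"]
    return m


def lower(m: Module) -> Module:
    # Backend stage: refines the code one step closer to machine code.
    m.code = ["mc:" + inst for inst in m.code]
    return m


def run_pipeline(m: Module, analyzer: List[Stage], optimizer: List[Stage],
                 backend: List[Stage]) -> Module:
    # Analyzer stages run first, then optimizer stages, then backend stages.
    for stage in [*analyzer, *optimizer, *backend]:
        m = stage(m)
    return m
```

Because each part is just a list of stages, the user controls which stages run and in what order within each part.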
Yvm Intermediate Representation¶
Yvm Intermediate Representation (YIR) is the language underlying Yvm, and it has three forms:
- Text
- Bitcode
- In-memory
The YIR Syntax section defines the text version of YIR, the YIR Bitcode section defines the bitcode version of YIR, and the YIR API section defines the in-memory form of YIR, as well as the Yao API.
The semantics of YIR are described in the YIR Semantics section.
YIR Syntax¶
This section defines the text format of the Yvm Intermediate Representation (YIR). It is the basis for the other forms.
Lexical Elements¶
Each of the elements below is indivisible. This means that implementations must not split them in any way.
Letters and Digits¶
The underscore character _ (U+005F) is considered a letter.
letter = unicode_letter | "_" .
decimal_digit = "0" … "9" .
binary_digit = "0" | "1" .
octal_digit = "0" … "7" .
hex_digit = "0" … "9" | "A" … "F" | "a" … "f" .
Tokens¶
Tokens form the vocabulary of the YIR language. There are three classes: identifiers, operators and punctuation, and literals.
Whitespace, formed by horizontal tabs (U+0009), newlines (U+000A), carriage returns (U+000D), spaces (U+0020), and comments, is ignored, except as it separates tokens that would otherwise combine into a single token.
While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.
Semicolons¶
The formal grammar uses semicolons ; as terminators in a number of productions. All of the semicolons used in the grammar are required.
Identifiers¶
Identifiers name program entities such as variables and types. An identifier is a sequence of one or more letters and digits. The first character in an identifier must be a letter and must not be an underscore.
identifier = unicode_letter { letter | unicode_digit } .
Note
Implementation Allowance: Identifiers that begin with an underscore are reserved for implementations, which are therefore allowed to use them.
Warning
Implementation Restriction: Implementations are not allowed to use two consecutive underscores at the beginning of identifiers. Such identifiers are reserved for C implementations and using them would invoke Undefined Behavior in C.
Some identifiers are predeclared.
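The identifier rules, together with the implementation allowance and restriction above, can be sketched as a small classifier. The function name classify_identifier and its four result strings are hypothetical labels chosen for this illustration.

```python
import unicodedata

_LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def _is_unicode_letter(ch: str) -> bool:
    return unicodedata.category(ch) in _LETTER_CATEGORIES


def _is_tail_char(ch: str) -> bool:
    # letter | unicode_digit, where "_" counts as a letter.
    return ch == "_" or _is_unicode_letter(ch) or unicodedata.category(ch) == "Nd"


def classify_identifier(name: str) -> str:
    """Return "user", "implementation", "forbidden", or "invalid"."""
    if not name or not all(_is_tail_char(ch) for ch in name):
        return "invalid"
    if name.startswith("__"):
        return "forbidden"       # reserved for C implementations
    if name.startswith("_"):
        return "implementation"  # reserved for Yvm implementations
    if _is_unicode_letter(name[0]):
        return "user"
    return "invalid"             # starts with a digit
```

For example, classify_identifier("usize") gives "user", while "_tmp" is only available to implementations and "__tmp" is available to no one.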
Integer Literals¶
An integer literal is a sequence of digits representing an integer constant. An optional prefix sets a non-decimal base: 0b for binary, 0o for octal, and 0x for hexadecimal. In hexadecimal literals, letters a through f and A through F represent values 10 through 15.
For readability, an underscore character _ may appear after a base prefix or between successive digits; such underscores do not change the literal’s value.
int_lit = decimal_lit | binary_lit | octal_lit | hex_lit .
decimal_lit = "0" | ( "1" … "9" ) [ [ "_" ] decimal_digits ] .
binary_lit = "0" ( "b" ) [ "_" ] binary_digits .
octal_lit = "0" ( "o" ) [ "_" ] octal_digits .
hex_lit = "0" ( "x" ) [ "_" ] hex_digits .
decimal_digits = decimal_digit { [ "_" ] decimal_digit } .
binary_digits = binary_digit { [ "_" ] binary_digit } .
octal_digits = octal_digit { [ "_" ] octal_digit } .
hex_digits = hex_digit { [ "_" ] hex_digit } .
Examples:
42
4_2
0o600
0xBadFace
0x_67_7a_2f_cc_40_c6
170141183460469231731687303715884105727
170_141183_460469_231731_687303_715884_105727
_42 // an identifier, not an integer literal
42_ // invalid: _ must separate successive digits
4__2 // invalid: only one _ at a time
0_xBadFace // invalid: _ must separate successive digits
Warning
It is a compile error if there is a digit in the number that does not make sense in the base. For example, it is a compile error if any digit besides 0 and 1 appears in a binary number, or if any digit greater than 7 appears in an octal number.
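The int_lit grammar above translates fairly directly into a regular expression. The following Python sketch is one way to recognize and evaluate integer literals; parse_int_lit is a hypothetical name, and a real lexer would do this as part of tokenization rather than on isolated strings.

```python
import re


def _digits(cls: str) -> str:
    # digit { [ "_" ] digit }
    return rf"{cls}(?:_?{cls})*"


_INT_LIT = re.compile(
    rf"0b_?(?P<bin>{_digits('[01]')})"
    rf"|0o_?(?P<oct>{_digits('[0-7]')})"
    rf"|0x_?(?P<hex>{_digits('[0-9A-Fa-f]')})"
    rf"|(?P<dec>0|[1-9](?:_?{_digits('[0-9]')})?)"
)


def parse_int_lit(text: str) -> int:
    """Return the value of an integer literal, or raise ValueError."""
    m = _INT_LIT.fullmatch(text)
    if m is None:
        raise ValueError(f"not an integer literal: {text!r}")
    for group, base in (("bin", 2), ("oct", 8), ("hex", 16), ("dec", 10)):
        digits = m.group(group)
        if digits is not None:
            # Underscores do not change the literal's value.
            return int(digits.replace("_", ""), base)
    raise AssertionError("unreachable: one alternative always matches")
```

A digit that does not fit the base, as in 0b102, simply fails to match and is reported as an error.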
Floating-Point Literals¶
A floating-point literal is a decimal or hexadecimal representation of a floating-point constant.
A decimal floating-point literal consists of an integer part (decimal digits), a decimal point, a fractional part (decimal digits), and an exponent part (e followed by an optional sign and decimal digits). One of the integer part or the fractional part may be elided; one of the decimal point or the exponent part may be elided. An exponent value exp scales the mantissa (integer and fractional part) by 10^exp.
A hexadecimal floating-point literal consists of a 0x prefix, an integer part (hexadecimal digits), a radix point, a fractional part (hexadecimal digits), and an exponent part (p followed by an optional sign and decimal digits). One of the integer part or the fractional part may be elided; the radix point may be elided as well, but the exponent part is required. (This syntax matches the one given in IEEE 754-2008 §5.12.3.) An exponent value exp scales the mantissa (integer and fractional part) by 2^exp.
For readability, an underscore character _ may appear after a base prefix or between successive digits; such underscores do not change the literal value.
float_lit = decimal_float_lit | hex_float_lit .
decimal_float_lit = decimal_digits "." [ decimal_digits ] [ decimal_exponent ]
| decimal_digits decimal_exponent
| "." decimal_digits [ decimal_exponent ] .
decimal_exponent = ( "e" ) [ "+" | "-" ] decimal_digits .
hex_float_lit = "0" ( "x" ) hex_mantissa hex_exponent .
hex_mantissa = [ "_" ] hex_digits "." [ hex_digits ]
| [ "_" ] hex_digits
| "." hex_digits .
hex_exponent = ( "p" ) [ "+" | "-" ] decimal_digits .
Examples:
0.
72.40
072.40 // == 72.40
2.71828
1.e+0
6.67428e-11
.25
1_5. // == 15.0
0.15e+0_2 // == 15.0
0x1p-2 // == 0.25
0x2.p10 // == 2048.0
0x1.Fp+0 // == 1.9375
0x.8p-0 // == 0.5
0x15e-2 // == 0x15e - 2 (integer subtraction)
0x.p1 // invalid: mantissa has no digits
1p-2 // invalid: p exponent requires hexadecimal mantissa
0x1.5e-2 // invalid: hexadecimal mantissa requires p exponent
1_.5 // invalid: _ must separate successive digits
1._5 // invalid: _ must separate successive digits
1.5_e1 // invalid: _ must separate successive digits
1.5e_1 // invalid: _ must separate successive digits
1.5e1_ // invalid: _ must separate successive digits
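To make the scaling rules concrete, the sketch below recognizes float_lit with regular expressions derived from the grammar and computes the value. The name parse_float_lit is hypothetical, and the result is a Python float (an IEEE 754 double), which is an assumption of this sketch rather than something the specification states.

```python
import math
import re

_DD = r"[0-9](?:_?[0-9])*"              # decimal_digits
_HD = r"[0-9A-Fa-f](?:_?[0-9A-Fa-f])*"  # hex_digits
_DEXP = rf"e[+-]?{_DD}"                 # decimal_exponent

_DEC_FLOAT = re.compile(
    rf"{_DD}\.(?:{_DD})?(?:{_DEXP})?|{_DD}{_DEXP}|\.{_DD}(?:{_DEXP})?"
)
_HEX_FLOAT = re.compile(
    rf"0x(?P<mant>_?{_HD}\.(?:{_HD})?|_?{_HD}|\.{_HD})p(?P<exp>[+-]?{_DD})"
)


def parse_float_lit(text: str) -> float:
    """Return the value of a floating-point literal, or raise ValueError."""
    if _DEC_FLOAT.fullmatch(text):
        # The mantissa is scaled by 10^exp; float() does this once the
        # underscores are removed.
        return float(text.replace("_", ""))
    m = _HEX_FLOAT.fullmatch(text)
    if m is None:
        raise ValueError(f"not a floating-point literal: {text!r}")
    int_part, _, frac_part = m.group("mant").replace("_", "").partition(".")
    exponent = int(m.group("exp").replace("_", ""))
    # Each fractional hex digit is worth 4 bits, so fold the fraction into
    # the integer mantissa and lower the power of two accordingly.
    mantissa = int(int_part + frac_part, 16)
    return math.ldexp(mantissa, exponent - 4 * len(frac_part))
```

Text such as 0x15e-2 is rejected here because it is not a floating-point literal at all; it is an integer literal followed by a subtraction.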
Rune Literals¶
A rune literal represents a rune constant, an integer value identifying a Unicode code point. A rune literal is expressed as one or more characters enclosed in single quotes, as in 'x' or '\n'. Within the quotes, any character may appear except newline and unescaped single quote. A single quoted character represents the Unicode value of the character itself, while multi-character sequences beginning with a backslash encode values in various formats.
The simplest form represents the single character within the quotes; since YIR source text is Unicode characters encoded in UTF-8, multiple UTF-8-encoded bytes may represent a single integer value. For instance, the literal 'a' holds a single byte representing a literal a, Unicode U+0061, value 0x61, while 'ä' holds two bytes (0xc3 0xa4) representing a literal a-dieresis, U+00E4, value 0xe4.
Several backslash escapes allow arbitrary values to be encoded as ASCII text. There are three ways to represent the integer value as a numeric constant: \x followed by exactly two hexadecimal digits, \u followed by exactly four hexadecimal digits, and \U followed by exactly eight hexadecimal digits. In each case the value of the literal is the value represented by the digits in the corresponding base.
Although these representations all result in an integer, they have different valid ranges. Hexadecimal escapes (\x) always represent a value between 0 and 255 inclusive by construction. The escapes \u and \U represent Unicode code points, so within them some values are illegal, in particular those above 0x10FFFF and surrogate halves.
After a backslash, certain single-character escapes represent special values:
\a /* U+0007 alert or bell */
\b /* U+0008 backspace */
\f /* U+000C form feed */
\n /* U+000A line feed or newline */
\r /* U+000D carriage return */
\t /* U+0009 horizontal tab */
\v /* U+000b vertical tab */
\\ /* U+005c backslash */
\' /* U+0027 single quote (valid escape only within rune literals) */
\" /* U+0022 double quote (valid escape only within string literals) */
All other sequences starting with a backslash are illegal inside rune literals.
rune_lit = "'" ( unicode_value | hex_byte_value ) "'" .
unicode_value = unicode_char | little_u_value | big_u_value | escaped_char .
hex_byte_value = `\` "x" hex_digit hex_digit .
little_u_value = `\` "u" hex_digit hex_digit hex_digit hex_digit .
big_u_value = `\` "U" hex_digit hex_digit hex_digit hex_digit hex_digit hex_digit hex_digit hex_digit .
escaped_char = `\` ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | `\` | "'" | `"` ) .
Examples:
'a'
'ä'
'本'
'\t'
'\x07'
'\xff'
'\u12e4'
'\U00101234'
'\'' // rune literal containing single quote character
'aa' // illegal: too many characters
'\xa' // illegal: too few hexadecimal digits
'\0' // illegal: not a valid escape (the grammar has no octal escapes)
'\000' // illegal: not a valid escape (the grammar has no octal escapes)
'\uDFFF' // illegal: surrogate half
'\U00110000' // illegal: invalid Unicode code point
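The rune_lit grammar and the range rules for \u and \U can be sketched as a small decoder. The function name decode_rune_lit is hypothetical. The sketch follows the grammar as written, so sequences with no production, such as a backslash followed by a digit, are rejected.

```python
import string

# Single-character escapes that are valid inside rune literals.
_SIMPLE = {"a": 0x07, "b": 0x08, "f": 0x0C, "n": 0x0A, "r": 0x0D,
           "t": 0x09, "v": 0x0B, "\\": 0x5C, "'": 0x27}
# Number of hexadecimal digits required after \x, \u, and \U.
_HEX_LEN = {"x": 2, "u": 4, "U": 8}


def decode_rune_lit(lit: str) -> int:
    """Return the integer value of a rune literal, or raise ValueError."""
    if len(lit) < 3 or lit[0] != "'" or lit[-1] != "'":
        raise ValueError("rune literal must be enclosed in single quotes")
    body = lit[1:-1]
    if body[0] != "\\":
        if len(body) != 1 or body in ("'", "\n"):
            raise ValueError("expected exactly one character")
        return ord(body)
    kind, digits = body[1:2], body[2:]
    if kind in _SIMPLE and not digits:
        return _SIMPLE[kind]
    if (kind in _HEX_LEN and len(digits) == _HEX_LEN[kind]
            and all(c in string.hexdigits for c in digits)):
        value = int(digits, 16)
        # \u and \U must name a Unicode code point; \x may be any byte value.
        if kind != "x" and (value > 0x10FFFF or 0xD800 <= value <= 0xDFFF):
            raise ValueError("not a valid Unicode code point")
        return value
    raise ValueError("illegal escape sequence")
```

Note that \" is absent from the table because it is only valid within string literals.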
String Literals¶
A string literal represents a string constant obtained from concatenating a sequence of characters. There are two forms: raw string literals and interpreted string literals.
Raw string literals are character sequences between two sets of three consecutive double quotes, as in """foo""". Within the quotes, any character may appear except three consecutive double quotes. The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the quotes; in particular, backslashes have no special meaning and the string may contain newlines. Carriage return characters (\r) inside raw string literals are discarded from the raw string value.
Interpreted string literals are character sequences between double quotes, as in "bar". Within the quotes, any character may appear except newline and unescaped double quote. The text between the quotes forms the value of the literal, with backslash escapes interpreted as they are in rune literals (except that \' is illegal and \" is legal), with the same restrictions. The two-digit hexadecimal (\xnn) escape represents individual bytes of the resulting string; all other escapes represent the (possibly multi-byte) UTF-8 encoding of individual characters. Thus inside a string literal \xFF represents a single byte of value 0xFF (255), while ÿ, \u00FF, \U000000FF, and \xc3\xbf represent the two bytes 0xc3 0xbf of the UTF-8 encoding of character U+00FF.
string_lit = raw_string_lit | interpreted_string_lit .
raw_string_lit = `"""` { unicode_char | newline } `"""` .
interpreted_string_lit = `"` { unicode_value | hex_byte_value } `"` .
Examples:
"""abc""" // same as "abc"
"""\n
\n""" // same as "\\n\n\\n"
"\n"
"\"" // same as `"`
"Hello, world!\n"
"日本語"
"\u65e5本\U00008a9e"
"\xff\u00FF"
"\uD800" // illegal: surrogate half
"\U00110000" // illegal: invalid Unicode code point
These examples all represent the same string:
"日本語" // UTF-8 input text
"""日本語""" // UTF-8 input text as a raw literal
"\u65e5\u672c\u8a9e" // the explicit Unicode code points
"\U000065e5\U0000672c\U00008a9e" // the explicit Unicode code points
"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e" // the explicit UTF-8 bytes
Warning
If the source code represents a character as two code points, such as a combining form involving an accent and a letter, the result will be an error if placed in a rune literal (it is not a single code point), and will appear as two code points if placed in a string literal.
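The difference between byte escapes and code point escapes is easiest to see by decoding literals to bytes. The following sketch does that for both string forms; decode_interpreted_string and decode_raw_string are hypothetical names, and returning Python bytes is a choice made for this illustration.

```python
import string

# Single-character escapes that are valid inside string literals.
_SIMPLE = {"a": 0x07, "b": 0x08, "f": 0x0C, "n": 0x0A, "r": 0x0D,
           "t": 0x09, "v": 0x0B, "\\": 0x5C, '"': 0x22}
_HEX_LEN = {"x": 2, "u": 4, "U": 8}


def decode_interpreted_string(lit: str) -> bytes:
    """Return the bytes of an interpreted string literal, or raise ValueError."""
    if len(lit) < 2 or lit[0] != '"' or lit[-1] != '"':
        raise ValueError("string literal must be enclosed in double quotes")
    body, out, i = lit[1:-1], bytearray(), 0
    while i < len(body):
        ch = body[i]
        if ch in ('"', "\n"):
            raise ValueError("unescaped double quote or newline")
        if ch != "\\":
            out += ch.encode("utf-8")
            i += 1
            continue
        kind = body[i + 1:i + 2]
        if kind in _SIMPLE:
            out.append(_SIMPLE[kind])
            i += 2
            continue
        n = _HEX_LEN.get(kind)
        digits = body[i + 2:i + 2 + n] if n else ""
        if not n or len(digits) != n or any(c not in string.hexdigits for c in digits):
            raise ValueError("illegal escape sequence")
        value = int(digits, 16)
        if kind == "x":
            out.append(value)                 # \xnn is one raw byte
        elif value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError("not a valid Unicode code point")
        else:
            out += chr(value).encode("utf-8")  # \u and \U are UTF-8 encoded
        i += 2 + n
    return bytes(out)


def decode_raw_string(lit: str) -> bytes:
    """Return the bytes of a raw string literal, or raise ValueError."""
    if len(lit) < 6 or not (lit.startswith('"""') and lit.endswith('"""')):
        raise ValueError("raw string literal must be enclosed in triple quotes")
    body = lit[3:-3]
    if '"""' in body:
        raise ValueError("raw string literal cannot contain three double quotes")
    # Backslashes are not interpreted; carriage returns are discarded.
    return body.replace("\r", "").encode("utf-8")
```

Decoding "\xff\u00FF" with this sketch yields three bytes: 0xff, followed by 0xc3 0xbf for the UTF-8 encoding of U+00FF.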
Syntax Elements¶
Every Yvm file is a series of declarations and definitions of constants, types, and functions.
Definitions¶
New constants, types, and functions are created in Yvm through definitions. Once a definition occurs, the item is “defined” for use after that point.
All of the data required for the item must be given at the definition point.
Declarations¶
It is possible to declare constants, types, and functions that are not defined; if so, they are expected to be defined elsewhere, including in code external to all Yvm modules, such as C code.
When declaring an item, only the name is necessary.
Types¶
A type in Yvm is either a primitive type, a compound type, or a pointer type.
Type names always begin with $:
type_name = "$" identifier /* with no spaces in between */ .
Examples:
$s64
$int
$ f64 // invalid: must not have a space between $ and the name
Primitive Types¶
A primitive type in Yvm is a type that has no parts. In other words, it is a type that cannot be split into parts.
There are two kinds of primitive types: integer types and floating-point types.
Integer Types¶
Integer types are types that use integer math instructions. They are defined with a certain bit width:
int_def = "int" type_name bit_width ";" .
bit_width = int_lit .
The bit_width must be an integer greater than 0. It specifies how many binary digits are in the type.
Warning
Implementation Allowance: Implementations are allowed to only support integer types whose width is a multiple of eight. This allowance is temporary, as it has not yet been decided whether to require support for arbitrary bit-widths, so implementations must be prepared for this to change.
Integer types cannot be declared. Integer types also cannot be generic.
Examples:
int $usize 64;
int $bool 8;
Floating-Point Types¶
Floating-point types are types that use floating-point math instructions. They are defined with a certain bit width:
float_def = "float" type_name bit_width ";" .
The bit_width must be an integer greater than 0. It specifies how many binary digits are in the type.
Floating-point types cannot be declared. Floating-point types also cannot be generic.
Examples:
float $f64 64;
declare float $f32; // illegal: floating-point types cannot be declared
Compound Types¶
A compound type in Yvm is a type that has different parts.
There are three kinds of compound types: struct types, union types, and array types.
Struct Types¶
Struct types are compound types that can hold disparate data of different types. Each field of the struct has a type and a name:
struct_def = "struct" type_name "{" field { "," field } "}" .
struct_decl = "declare" "struct" type_name ";" .
field = var_name ":" type_name .
var_name = "%" identifier /* with no spaces in between */ .
struct_def is a struct definition, and struct_decl is a struct declaration.
Examples:
struct $str
{
%len: $usize,
%idx: $usize,
%a: $char_ptr
}
// illegal: field name must not have space after %
struct $str2
{
% len: $usize,
%idx: $usize,
%a: $char_ptr
}
declare struct $FILE;
Warning
It is an error if a struct type contains a field of the same type as the struct, or contains a field whose type contains (directly or indirectly) a field with the same type as the struct, including in an array type.
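The containment rule above amounts to requiring that the "contains by value" relation between types has no cycle. A minimal sketch of such a check follows; the function name and the dictionary representation (type name mapped to the type names of its fields, with an array type mapped to its element type) are hypothetical. Whether a pointer field counts as containment is not stated here, so the sketch leaves it to the caller to decide which edges to include.

```python
from typing import Dict, List, Optional


def find_containment_cycle(contains: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle of type names as a list, or None if there is none."""
    in_progress, done = set(), set()
    path: List[str] = []

    def visit(name: str) -> Optional[List[str]]:
        in_progress.add(name)
        path.append(name)
        for child in contains.get(name, []):
            if child in in_progress:
                # child is already on the current path: a type contains itself.
                return path[path.index(child):] + [child]
            if child not in done:
                cycle = visit(child)
                if cycle is not None:
                    return cycle
        path.pop()
        in_progress.discard(name)
        done.add(name)
        return None

    for name in contains:
        if name not in done:
            cycle = visit(name)
            if cycle is not None:
                return cycle
    return None
```

The returned list names the offending chain of types, which gives a compiler something concrete to report in the error message.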
Struct types can also be generic:
struct_gen_def = "struct" type_name "<" gen_list ">" "{" gen_field_list "}" .
gen_list = identifier { "," identifier } .
gen_field_list = gen_field { "," gen_field } .
gen_field = var_name ":" (type_name | identifier) .
Generic struct types cannot be declared.
Warning
It is an error if the type of any field in a generic struct type is an identifier that does not exist in gen_list.
Examples:
struct $array<t>
{
%len: $usize,
%idx: $usize,
%a: t
}
// illegal: r does not exist as a generic parameter
struct $array<t>
{
%len: $usize,
%idx: $usize,
%a: r
}
Union Types¶
Union types are compound types that combine data of different types in the same storage. In other words, all fields in a union are stored in the same place.
union_def = "union" type_name "{" field { "," field } "}" .
union_decl = "declare" "union" type_name ";" .
union_def is a union definition, and union_decl is a union declaration.
Examples:
union $data
{
%len: $usize,
%idx: $usize,
%a: $char_ptr
}
// illegal: field name must not have space after %
union $data2
{
% len: $usize,
%idx: $usize,
%a: $char_ptr
}
declare union $DATA;
Warning
It is an error if a union type contains a field of the same type as the union, or contains a field whose type contains (directly or indirectly) a field with the same type as the union, including in an array type.
Union types can also be generic:
union_gen_def = "union" type_name "<" gen_list ">" "{" gen_field_list "}" .
Generic union types cannot be declared.
Warning
It is an error if the type of any field in a generic union type is an identifier that does not exist in gen_list.
Examples:
union $data<t>
{
%len: $usize,
%idx: $usize,
%a: t
}
// illegal: r does not exist as a generic parameter
union $data<t>
{
%len: $usize,
%idx: $usize,
%a: r
}
Array Types¶
Array types are compound types that have items that are all of the same type. Arrays have a length, which is the number of elements the array has.
array_def = type_name num_elems ";" .
num_elems = int_lit .
num_elems is the number of elements in the array and must be a constant integer.
Array types cannot be declared. Array types also cannot be generic.
Pointer Types¶
Pointer types are types whose values point to objects of another type:
pointer_def = "ptr" type_name type_of_pointer [ addr_space ] .
type_of_pointer = type_name .
addr_space = string_lit .
type_of_pointer is the type that the pointer will point to. addr_space is the name of the address space the pointer type is constrained to.
Pointer types cannot be declared. Pointer types also cannot be generic.
Pointer Provenance¶
Pointers have provenance. This means that there is extra information contained in a pointer. This information includes:
- The address that the pointer points to.
- The address space for the pointer.
- How many elements are allocated at the address (the allocation must have enough bytes to fit all of the elements).
- The type that the pointer points to.
Note
The type that the pointer points to is tracked as part of the provenance because the backend is the only part of the entire process that knows what size elements actually are, since it is the part that translates code into code for a specific machine.
Each of these items has its own separate type:
- The address has type $*<type_name> (equivalent to C’s <type_name>*).
- The address space has type $@<type_name>.
- The number of elements allocated has type $usize.
- The type that the pointer points to has type $type.
where <type_name> is the name of the pointer type (not type_of_pointer) as defined by pointer_def.
Since there can only be one address space per pointer type, there is no need to allow multiple address space types per pointer type.
When pointers are passed around, their provenance is passed with them. While most backends will be able to eliminate the address space and type, backends MUST keep the address and MUST keep the number of allocated elements.
However, it is possible to explicitly discard the provenance information by taking the address alone and passing only that around. In that case, the pointer provenance does not need to be passed around. This is for compatibility with C code.
Warning
If code that discards the provenance information ever accesses memory outside of the allocation for that pointer, the behavior is undefined.
The provenance of a pointer can be accessed as follows:
- The address can be accessed with %s.addr.
- The address space can be accessed with %s.addr_space.
- The number of elements allocated can be accessed with %s.len.
- The type that the pointer points to can be accessed with %s.type.
where %s is a register that is a pointer type.
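One way to picture a pointer with provenance is as a record with four fields. The Python sketch below models that record and shows one possible use of the element count, a bounds check. Pointer, element_address, and the elem_size parameter are hypothetical; in particular, elem_size stands for the element size that only the backend knows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    """Hypothetical model of a pointer together with its provenance."""
    addr: int        # %s.addr: the address the pointer points to
    addr_space: str  # %s.addr_space: the address space for the pointer
    length: int      # %s.len: how many elements are allocated at addr
    elem_type: str   # %s.type: the type the pointer points to


def element_address(ptr: Pointer, index: int, elem_size: int) -> int:
    """Address of element index, refusing to leave the allocation."""
    if not 0 <= index < ptr.length:
        raise IndexError("access outside of the pointer's allocation")
    return ptr.addr + index * elem_size
```

Code that keeps only ptr.addr has discarded the provenance; nothing can then check such an access, which is why going outside the allocation in that case is undefined behavior.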
Generic Types¶
Struct types and union types can be generic. However, when a generic type is used with concrete type arguments, it cannot be used directly; each use of a generic type with a unique set of concrete type arguments must be given its own type name.
That is done like so:
set_gen_def = "set" type_name concrete_type_instantiation ";" .
concrete_type_instantiation = type_name "<" type_name { "," type_name } ">" .
The type_name in set_gen_def is the name of the new type.
Warning
It is an error if any of the concrete type arguments do not exist.
Examples:
struct $array<t>
{
%len: $usize,
%idx: $usize,
%a: $*t
}
set $str $array<$uchar>;
set $str_ptr $array<$str>;
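Instantiating a generic type amounts to substituting each generic parameter with its concrete type argument. The sketch below illustrates this with plain dictionaries; the function name instantiate and the data representation are hypothetical, and the check on the number of arguments is an assumption, since it is not spelled out above.

```python
from typing import Dict, List, Set


def instantiate(gen_list: List[str], fields: Dict[str, str],
                type_args: List[str], known_types: Set[str]) -> Dict[str, str]:
    """Return the fields of the concrete type produced by a set statement."""
    if len(type_args) != len(gen_list):
        # Assumption: one concrete type argument per generic parameter.
        raise ValueError("wrong number of concrete type arguments")
    for arg in type_args:
        if arg not in known_types:
            raise ValueError(f"concrete type argument {arg} does not exist")
    binding = dict(zip(gen_list, type_args))
    concrete = {}
    for name, ty in fields.items():
        if ty.startswith("$"):
            concrete[name] = ty           # already a type_name
        elif ty in binding:
            concrete[name] = binding[ty]  # generic parameter, substitute
        else:
            raise ValueError(f"{ty} does not exist as a generic parameter")
    return concrete
```

Each distinct list of type arguments produces a distinct set of fields, which matches the requirement that every such use be given its own type name.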
Type Renaming¶
In addition, it is possible to create new types with a set statement.
set_def = "set" type_name type_name ";" .
The first type_name is the name of the new type, and the second is the name of the type it is created from.
Examples:
set $string $str;
set $uint8 $u8;
Constants¶
Constants, which are immutable data with a name, are defined as follows:
constant_def = "constant" constant_name ":" type_name "=" literal_inst ";" .
constant_decl = "declare" "constant" constant_name ":" type_name ";" .
constant_name = "&" identifier /* with no space in between */ .
constant_def is a constant definition, and constant_decl is a constant declaration.
Warning
It is an error if the arguments to the literal instruction are not literals or previously defined constants.
Warning
It is an error if a constant declaration does not have a definition elsewhere, or if the type of the definition does not match the type of the declaration.
Examples:
constant &newline_ptr : $*schar = string_literal<$schar> "\n";
constant &newline_len : $usize = int_literal<$usize> 2;
constant &newline : $str = struct_literal<$str> &newline_len, 0, &newline_ptr;
declare constant &newline2 : $str;
// illegal; no space allowed between "&" and the name.
constant & newline3 : $str = struct_literal<$str> &newline_len, 0, &newline_ptr;
Functions¶
TODO
In Yvm, functions are the items that contain actual code.
Note
While code may appear in other places, such code is only for the backend to execute or to define constants.
Functions are defined and declared as follows:
func_def = "define" func_name "(" [ func_param_list ] ")" "->" func_ret func_body .
func_decl = "declare" func_name "(" [ func_param_list ] ")" "->" func_ret ";" .
func_param_list = func_param { "," func_param } .
func_param = identifier ":" type_name .
func_ret = ("void" | type_name) .
func_body = "{" basic_block { basic_block } "}" .
Parameters¶
TODO
Basic Blocks¶
TODO
Instructions¶
TODO
Instruction Groups¶
TODO
Literal Instructions¶
TODO
literal_inst = /* TODO */ .
Generic Functions¶
TODO
YIR Semantics¶
TODO
Predeclared Types¶
TODO
- $usize (for object sizes).
- $type (struct that has size and alignment).
Registers¶
TODO
Allocation¶
TODO
Memory Model¶
TODO
YIR Bitcode¶
TODO
YIR API¶
TODO
In defense against https://www.trojansource.codes/:
TODO: Require an explicit Unicode control character to allow switching scripts in the middle of an identifier?
TODO: Implement the mitigations suggested by the link above:
Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.
Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.
Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.
Also, implement https://news.ycombinator.com/item?id=29172311. To make things easier on users, have an easy way in the language to say “use the contents of such a file as a string”.
Use https://www.unicode.org/reports/tr31/ for identifiers, but also exclude Hangul filler and half-width Hangul filler letters from identifiers.
Comments¶
Comments serve as program documentation. There are two forms:
- Line comments start with the character sequence // and stop at the end of the line.
- General comments start with the character sequence /* and stop with the Nth */ character sequence, where N is the number of times the character sequence /* appears in the comment, including the start of the comment. In other words, general comments nest.
A comment cannot start inside a rune literal or string literal, or inside a comment. Comments act like a space.
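Nesting general comments can be handled with a depth counter. The sketch below replaces every comment with a single space while leaving rune and string literals alone; strip_comments is a hypothetical name, and the literal scanning is deliberately simplified (it only finds the closing quote and does not validate escapes).

```python
def strip_comments(src: str) -> str:
    """Replace each comment in src with one space."""
    out = []
    i, n = 0, len(src)
    while i < n:
        if src.startswith('"""', i):                 # raw string literal
            end = src.find('"""', i + 3)
            if end < 0:
                raise ValueError("unterminated raw string literal")
            out.append(src[i:end + 3])
            i = end + 3
        elif src[i] in "\"'":                        # string or rune literal
            quote, j = src[i], i + 1
            while j < n and src[j] != quote:
                if src[j] == "\n":
                    raise ValueError("newline inside literal")
                j += 2 if src[j] == "\\" else 1      # skip escaped character
            if j >= n:
                raise ValueError("unterminated literal")
            out.append(src[i:j + 1])
            i = j + 1
        elif src.startswith("//", i):                # line comment
            end = src.find("\n", i)
            i = n if end < 0 else end                # the newline itself is kept
            out.append(" ")
        elif src.startswith("/*", i):                # general comment, nests
            depth, i = 1, i + 2
            while depth:
                if i >= n:
                    raise ValueError("unterminated general comment")
                if src.startswith("/*", i):
                    depth, i = depth + 1, i + 2
                elif src.startswith("*/", i):
                    depth, i = depth - 1, i + 2
                else:
                    i += 1
            out.append(" ")
        else:
            out.append(src[i])
            i += 1
    return "".join(out)
```

Because the scanner consumes literals first, a // or /* that appears inside a string or rune literal is never treated as the start of a comment.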