Specification

23 July 2022

Introduction

This is the specification for Yvm, the compiler library, and its Intermediate Representation.

Note

This document uses some material copied from The Go Programming Language Specification, which is licensed under the Creative Commons Attribution 3.0 License.

Warning

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Terms

backend

The portion of an implementation that translates YIR to machine code.

bit width

The number of binary bits occupied by an object.

compiler

Any program that either translates YIR to machine code, or interprets YIR.

element

A single object, possibly one of many within an array type.

field

An item in a struct type or a union type.

machine code

Code that can be executed directly by a machine, whether the machine is physical or virtual.

YIR

Yvm Intermediate Representation, the language on which Yvm is built.

Notation

The syntax is defined using Extended Backus-Naur Form (EBNF):

Production  = production_name "=" [ Expression ] "." .
Expression  = Alternative { "|" Alternative } .
Alternative = Term { Term } .
Term        = production_name | token [ "…" token ] | Group | Option | Repetition .
Group       = "(" Expression ")" .
Option      = "[" Expression "]" .
Repetition  = "{" Expression "}" .

Productions are expressions constructed from terms and the following operators, in increasing precedence:

|  alternation
() grouping
[] option (0 or 1 times)
{} repetition (0 to n times)

Lower-case production names are used to identify lexical tokens. Non-terminals are in CamelCase. Lexical tokens are enclosed in double quotes “” or back quotes ``.

The form a … b represents the set of characters from a through b as alternatives. The horizontal ellipsis … is also used elsewhere in the spec to informally denote various enumerations or code snippets that are not further specified. The character … (as opposed to the three characters ...) is not a token of the YIR language.

Source Code Representation

Source code is Unicode text encoded in UTF-8.

Each code point is distinct; for instance, upper and lower case letters are different characters.

Warning

Implementation Restriction: The compiler MUST NOT canonicalize the text, so a single accented code point is distinct from the same character constructed from combining an accent and a letter; the latter is treated as two code points.

Warning

Implementation Restriction: A compiler MUST disallow the NUL character (U+0000) in the source text.

Warning

Implementation Restriction: A compiler MUST disallow the byte order mark (U+FEFF) in the source text.
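
The restrictions above can be checked in a single pass before lexing. The following Python sketch is illustrative only and not part of Yvm; the function name decode_source and the choice to report errors as ValueError are assumptions made for the example.

```python
def decode_source(data: bytes) -> str:
    """Decode source bytes as UTF-8 and reject NUL and the byte order mark."""
    try:
        # The plain "utf-8" codec is strict and does not strip a leading BOM.
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"source is not valid UTF-8: {exc}") from exc
    for offset, ch in enumerate(text):
        if ch == "\u0000":
            raise ValueError(f"NUL (U+0000) at code point offset {offset}")
        if ch == "\ufeff":
            raise ValueError(f"byte order mark (U+FEFF) at code point offset {offset}")
    return text
```

Rejecting the byte order mark anywhere in the text, not only at the start, follows the wording of the restriction as written.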

Characters

The following terms are used to denote specific Unicode character classes:

newline        = /* the Unicode code point U+000A */ .
unicode_char   = /* an arbitrary Unicode code point except newline */ .
unicode_letter = /* a Unicode code point classified as "Letter" */ .
unicode_digit  = /* a Unicode code point classified as "Number, decimal digit" */ .

In The Unicode Standard 8.0, Section 4.5 “General Category” defines a set of character categories. YIR treats all characters in any of the Letter categories Lu, Ll, Lt, Lm, or Lo as Unicode letters, and those in the Number category Nd as Unicode digits.

Architecture

Yvm implementations are required to have the following architecture:

|----------|     |-----------|     |---------|
| Analyzer | --> | Optimizer | --> | Backend |
|----------|     |-----------|     |---------|

Analyzer

The analyzer will be built with a series of stages, which should be controllable by the user.

Those stages will be run on the code that Yvm is given and produce annotations.

In order to run passes, users will also pass YIR code to Yvm to run. Yvm implementations should execute that code, and it is expected that users will encode their analysis logic in that code, using it to call analysis passes as desired.

Optimizer

The optimizer will be built with a series of stages, which should be controllable by the user.

Those stages will be run on the code that Yvm is given. These stages should either produce annotations or transform the code they are given.

In order to run passes, users will also pass YIR code to Yvm to run. Yvm implementations should execute that code, and it is expected that users will encode their optimization logic in that code, using it to call optimization passes as desired.

Backend

The backend will be built with a series of stages, which should be controllable by the user.

Those stages will be run on the code that Yvm is given. These stages should progressively refine the YIR code into machine code.

In order to run passes, users will also pass YIR code to Yvm to run. Yvm implementations should execute that code, and it is expected that users will encode their code generation logic in that code, using it to call code generation passes as desired.

Yvm Intermediate Representation

Yvm Intermediate Representation (YIR) is the language underlying Yvm, and it has three forms:

  • Text

  • Bitcode

  • In-memory

The YIR Syntax section defines the text version of YIR, the YIR Bitcode section defines the bitcode version of YIR, and the YIR API section defines the in-memory form of YIR, as well as the Yao API.

The semantics of YIR are described in the YIR Semantics section.

YIR Syntax

This section defines the text format of the Yvm Intermediate Representation (YIR). It is the basis for the other forms.

Lexical Elements

Each of the elements below is indivisible. This means that implementations must not split them in any way.

Letters and Digits

The underscore character _ (U+005F) is considered a letter.

letter        = unicode_letter | "_" .
decimal_digit = "0" ... "9" .
binary_digit  = "0" | "1" .
octal_digit   = "0" ... "7" .
hex_digit     = "0" ... "9" | "A" ... "F" | "a" ... "f" .

Comments

Comments serve as program documentation. There are two forms:

  1. Line comments start with the character sequence // and stop at the end of the line.

  2. General comments start with the character sequence /* and stop with the Nth */ character sequence, where N is the number of times the character sequence /* appears in the comment, including the start of the comment. In other words, general comments nest.

A comment cannot start inside a rune literal or string literal, or inside a comment. Comments act like a space.
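
Nesting general comments can be handled with a depth counter. The Python sketch below is one possible approach, not the reference lexer; skip_general_comment is a hypothetical helper that returns the index just past the comment starting at a given position.

```python
def skip_general_comment(text: str, start: int) -> int:
    """Return the index just past the general comment that begins at start."""
    if not text.startswith("/*", start):
        raise ValueError("no general comment starts at this position")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1  # an opener, including the first one, adds a level
            i += 2
        elif text.startswith("*/", i):
            depth -= 1  # a closer removes a level
            i += 2
            if depth == 0:
                return i  # the outermost comment has been closed
        else:
            i += 1
    raise ValueError("unterminated general comment")
```

The comment ends at the closer that brings the depth back to zero, which is the Nth closer when N openers have been seen.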

Tokens

Tokens form the vocabulary of the YIR language. There are three classes: identifiers, operators and punctuation, and literals.

Whitespace, formed by horizontal tabs (U+0009), newlines (U+000A), carriage returns (U+000D), spaces (U+0020), and comments, is ignored, except as it separates tokens that would otherwise combine into a single token.

While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.

Semicolons

The formal grammar uses semicolons ; as terminators in a number of productions. All of the semicolons used in the grammar are required.

Identifiers

Identifiers name program entities such as variables and types. An identifier is a sequence of one or more letters and digits. The first character in an identifier must be a letter and must not be an underscore.

identifier = unicode_letter { letter | unicode_digit } .

Note

Implementation Allowance: Identifiers that begin with an underscore are reserved for implementations and are thus allowed for implementations to use.

Warning

Implementation Restriction: Implementations are not allowed to use two consecutive underscores at the beginning of identifiers. Such identifiers are reserved for C implementations and using them would invoke Undefined Behavior in C.

Some identifiers are predeclared.
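
As an illustration of the identifier rule, the Python sketch below classifies characters by Unicode general category and checks the identifier production. It is a sketch under stated assumptions: the helper names are hypothetical, and Python's unicodedata module reflects the Unicode version bundled with the interpreter, which may be newer than 8.0.

```python
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def is_unicode_letter(ch: str) -> bool:
    return unicodedata.category(ch) in LETTER_CATEGORIES

def is_unicode_digit(ch: str) -> bool:
    return unicodedata.category(ch) == "Nd"

def is_letter(ch: str) -> bool:
    # The underscore counts as a letter everywhere except the first position.
    return ch == "_" or is_unicode_letter(ch)

def is_identifier(text: str) -> bool:
    """Check text against: identifier = unicode_letter { letter | unicode_digit } ."""
    if not text or not is_unicode_letter(text[0]):
        return False
    return all(is_letter(ch) or is_unicode_digit(ch) for ch in text[1:])
```

Identifiers with a leading underscore fail this check on purpose, since they are reserved for implementations.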

Integer Literals

An integer literal is a sequence of digits representing an integer constant. An optional prefix sets a non-decimal base: 0b for binary, 0o for octal, and 0x for hexadecimal.

In hexadecimal literals, letters a through f and A through F represent values 10 through 15.

For readability, an underscore character _ may appear after a base prefix or between successive digits; such underscores do not change the literal’s value.

int_lit = decimal_lit | binary_lit | octal_lit | hex_lit .
decimal_lit = "0" | ( "1" ... "9" ) [ [ "_" ] decimal_digits ] .
binary_lit = "0" ( "b" ) [ "_" ] binary_digits .
octal_lit = "0" ( "o" ) [ "_" ] octal_digits .
hex_lit = "0" ( "x" ) [ "_" ] hex_digits .

decimal_digits = decimal_digit { [ "_" ] decimal_digit } .
binary_digits = binary_digit { [ "_" ] binary_digit } .
octal_digits = octal_digit { [ "_" ] octal_digit } .
hex_digits = hex_digit { [ "_" ] hex_digit } .

Examples:

42
4_2
0o600
0xBadFace
0x_67_7a_2f_cc_40_c6
170141183460469231731687303715884105727
170_141183_460469_231731_687303_715884_105727

_42         // an identifier, not an integer literal
42_         // invalid: _ must separate successive digits
4__2        // invalid: only one _ at a time
0_xBadFace  // invalid: _ must separate successive digits

Warning

It is a compile error if there is a digit in the number that does not make sense in the base. For example, it is a compile error if any digit besides 0 and 1 appears in a binary number, or if any digit greater than 7 appears in an octal number.
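
The int_lit grammar translates fairly directly into a regular expression. The Python sketch below is an illustration, not the reference lexer; parse_int_lit is a hypothetical name, and reporting errors as ValueError is an assumption.

```python
import re

# One alternative per production; each named group captures the digits only.
_INT_LIT = re.compile(
    r"""
      0b_?(?P<bin>[01](?:_?[01])*)
    | 0o_?(?P<oct>[0-7](?:_?[0-7])*)
    | 0x_?(?P<hex>[0-9A-Fa-f](?:_?[0-9A-Fa-f])*)
    | (?P<dec>0|[1-9](?:_?[0-9])*)
    """,
    re.VERBOSE,
)

def parse_int_lit(text: str) -> int:
    """Return the value of an integer literal, or raise ValueError."""
    match = _INT_LIT.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid integer literal: {text!r}")
    for group, base in (("bin", 2), ("oct", 8), ("hex", 16), ("dec", 10)):
        digits = match.group(group)
        if digits is not None:
            # Underscores are separators only and do not affect the value.
            return int(digits.replace("_", ""), base)
    raise AssertionError("unreachable: one alternative always matches")
```

Because each alternative only admits digits valid for its base, a literal such as 0b102 or 0o8 is rejected without a separate check.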

Floating-Point Literals

A floating-point literal is a decimal or hexadecimal representation of a floating-point constant.

A decimal floating-point literal consists of an integer part (decimal digits), a decimal point, a fractional part (decimal digits), and an exponent part (e followed by an optional sign and decimal digits). One of the integer part or the fractional part may be elided; one of the decimal point or the exponent part may be elided. An exponent value exp scales the mantissa (integer and fractional part) by 10^exp.

A hexadecimal floating-point literal consists of a 0x prefix, an integer part (hexadecimal digits), a radix point, a fractional part (hexadecimal digits), and an exponent part (p followed by an optional sign and decimal digits). One of the integer part or the fractional part may be elided; the radix point may be elided as well, but the exponent part is required. (This syntax matches the one given in IEEE 754-2008 §5.12.3.) An exponent value exp scales the mantissa (integer and fractional part) by 2^exp.

For readability, an underscore character _ may appear after a base prefix or between successive digits; such underscores do not change the literal value.

float_lit = decimal_float_lit | hex_float_lit .

decimal_float_lit = decimal_digits "." [ decimal_digits ] [ decimal_exponent ]
                  | decimal_digits decimal_exponent
                  | "." decimal_digits [ decimal_exponent ] .

decimal_exponent = ( "e" ) [ "+" | "-" ] decimal_digits .

hex_float_lit = "0" ( "x" ) hex_mantissa hex_exponent .
hex_mantissa  = [ "_" ] hex_digits "." [ hex_digits ]
              | [ "_" ] hex_digits
              | "." hex_digits .

hex_exponent = ( "p" ) [ "+" | "-" ] decimal_digits .

Examples:

0.
72.40
072.40       // == 72.40
2.71828
1.e+0
6.67428e-11
.25
1_5.         // == 15.0
0.15e+0_2    // == 15.0

0x1p-2       // == 0.25
0x2.p10      // == 2048.0
0x1.Fp+0     // == 1.9375
0x.8p-0      // == 0.5
0x15e-2      // == 0x15e - 2 (integer subtraction)

0x.p1        // invalid: mantissa has no digits
1p-2         // invalid: p exponent requires hexadecimal mantissa
0x1.5e-2     // invalid: hexadecimal mantissa requires p exponent
1_.5         // invalid: _ must separate successive digits
1._5         // invalid: _ must separate successive digits
1.5_e1       // invalid: _ must separate successive digits
1.5e_1       // invalid: _ must separate successive digits
1.5e1_       // invalid: _ must separate successive digits
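
To make the scaling rule for hexadecimal floating-point literals concrete, the Python sketch below validates a literal against hex_float_lit and computes its value as mantissa times 2^exp. It covers only the hexadecimal form; hex_float_value is a hypothetical name and the result is a Python float, which is an assumption about precision rather than something the specification states.

```python
import math
import re

_HD = r"[0-9A-Fa-f](?:_?[0-9A-Fa-f])*"  # hex_digits
_DD = r"[0-9](?:_?[0-9])*"              # decimal_digits
_HEX_FLOAT = re.compile(
    rf"0x(?:_?(?P<int>{_HD})(?:\.(?P<frac>{_HD})?)?|\.(?P<frac_only>{_HD}))"
    rf"p(?P<exp>[+-]?{_DD})"
)

def hex_float_value(text: str) -> float:
    """Return the value of a hexadecimal floating-point literal."""
    match = _HEX_FLOAT.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid hexadecimal float literal: {text!r}")
    int_part = (match.group("int") or "").replace("_", "")
    frac_part = (match.group("frac") or match.group("frac_only") or "").replace("_", "")
    exponent = int(match.group("exp").replace("_", ""))
    # Each fractional hex digit shifts the mantissa right by four bits.
    mantissa = int(int_part + frac_part, 16)
    return math.ldexp(mantissa, exponent - 4 * len(frac_part))
```

For example, 0x1.Fp+0 has the mantissa digits 1F (31) and one fractional digit, so its value is 31 × 2^-4 = 1.9375.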

Rune Literals

A rune literal represents a rune constant, an integer value identifying a Unicode code point. A rune literal is expressed as one or more characters enclosed in single quotes, as in 'x' or '\n'. Within the quotes, any character may appear except newline and unescaped single quote. A single quoted character represents the Unicode value of the character itself, while multi-character sequences beginning with a backslash encode values in various formats.

The simplest form represents the single character within the quotes; since YIR source text is Unicode characters encoded in UTF-8, multiple UTF-8-encoded bytes may represent a single integer value. For instance, the literal 'a' holds a single byte representing a literal a, Unicode U+0061, value 0x61, while 'ä' holds two bytes (0xc3 0xa4) representing a literal a-dieresis, U+00E4, value 0xe4.

Several backslash escapes allow arbitrary values to be encoded as ASCII text. There are three ways to represent the integer value as a numeric constant: \x followed by exactly two hexadecimal digits, \u followed by exactly four hexadecimal digits, and \U followed by exactly eight hexadecimal digits. In each case the value of the literal is the value represented by the digits in the corresponding base.

Although these representations all result in an integer, they have different valid ranges. Hexadecimal escapes (\x) represent a value between 0 and 255 inclusive by construction. The escapes \u and \U represent Unicode code points, so within them some values are illegal, in particular those above 0x10FFFF and surrogate halves.

After a backslash, certain single-character escapes represent special values:

\a   /* U+0007 alert or bell */
\b   /* U+0008 backspace */
\f   /* U+000C form feed */
\n   /* U+000A line feed or newline */
\r   /* U+000D carriage return */
\t   /* U+0009 horizontal tab */
\v   /* U+000b vertical tab */
\\   /* U+005c backslash */
\'   /* U+0027 single quote (valid escape only within rune literals) */
\"   /* U+0022 double quote (valid escape only within string literals) */

All other sequences starting with a backslash are illegal inside rune literals.

rune_lit = "'" ( unicode_value | hex_byte_value ) "'" .

unicode_value  = unicode_char | little_u_value | big_u_value | escaped_char .
hex_byte_value = `\` "x" hex_digit hex_digit .
little_u_value = `\` "u" hex_digit hex_digit hex_digit hex_digit .
big_u_value    = `\` "U" hex_digit hex_digit hex_digit hex_digit hex_digit hex_digit hex_digit hex_digit .
escaped_char   = `\` ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | `\` | "'" | `"` ) .

Examples:

'a'
'ä'
'本'
'\t'
'\x07'
'\xff'
'\u12e4'
'\U00101234'
'\''         // rune literal containing single quote character
'aa'         // illegal: too many characters
'\xa'        // illegal: too few hexadecimal digits
'\000'       // illegal: octal escapes are not defined
'\377'       // illegal: octal escapes are not defined
'\0'         // illegal: octal escapes are not defined
'\uDFFF'     // illegal: surrogate half
'\U00110000' // illegal: invalid Unicode code point
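
The rune literal rules can be summarized as a small decoder. The Python sketch below is illustrative only; decode_rune is a hypothetical name, and it follows the rune_lit grammar together with the note that \" is valid only within string literals.

```python
_SIMPLE_ESCAPES = {
    "a": 0x07, "b": 0x08, "f": 0x0C, "n": 0x0A, "r": 0x0D,
    "t": 0x09, "v": 0x0B, "\\": 0x5C, "'": 0x27,
}
_HEX_LENGTHS = {"x": 2, "u": 4, "U": 8}
_HEX_DIGITS = set("0123456789abcdefABCDEF")

def decode_rune(lit: str) -> int:
    """Return the integer value of a rune literal, quotes included."""
    if len(lit) < 3 or lit[0] != "'" or lit[-1] != "'":
        raise ValueError("rune literal must be enclosed in single quotes")
    body = lit[1:-1]
    if body[0] != "\\":
        # A plain character: exactly one code point, not a quote or newline.
        if len(body) != 1 or body in ("'", "\n"):
            raise ValueError("expected exactly one character")
        return ord(body)
    if len(body) < 2:
        raise ValueError("dangling backslash")
    kind, rest = body[1], body[2:]
    if kind in _SIMPLE_ESCAPES:
        if rest:
            raise ValueError("too many characters")
        return _SIMPLE_ESCAPES[kind]
    if kind in _HEX_LENGTHS:
        if len(rest) != _HEX_LENGTHS[kind] or any(c not in _HEX_DIGITS for c in rest):
            raise ValueError("wrong number of hexadecimal digits")
        value = int(rest, 16)
        # \u and \U must name a valid code point; \x is a byte value.
        if kind != "x" and (value > 0x10FFFF or 0xD800 <= value <= 0xDFFF):
            raise ValueError("invalid Unicode code point")
        return value
    raise ValueError("unknown escape sequence")
```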

String Literals

A string literal represents a string constant obtained from concatenating a sequence of characters. There are two forms: raw string literals and interpreted string literals.

Raw string literals are character sequences between two sets of three consecutive double quotes, as in """foo""". Within the quotes, any character may appear except three consecutive double quotes. The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the quotes; in particular, backslashes have no special meaning and the string may contain newlines. Carriage return characters (\r) inside raw string literals are discarded from the raw string value.

Interpreted string literals are character sequences between double quotes, as in "bar". Within the quotes, any character may appear except newline and unescaped double quote. The text between the quotes forms the value of the literal, with backslash escapes interpreted as they are in rune literals (except that \' is illegal and \" is legal), with the same restrictions. The two-digit hexadecimal (\xnn) escape represents an individual byte of the resulting string; all other escapes represent the (possibly multi-byte) UTF-8 encoding of individual characters. Thus inside a string literal \xFF represents a single byte of value 0xFF (255), while ÿ, \u00FF, \U000000FF, and \xc3\xbf represent the two bytes 0xc3 0xbf of the UTF-8 encoding of character U+00FF.

string_lit             = raw_string_lit | interpreted_string_lit .
raw_string_lit         = `"""` { unicode_char | newline } `"""` .
interpreted_string_lit = `"` { unicode_value | hex_byte_value } `"` .

Examples:

"""abc"""            // same as "abc"
"""\n
\n"""                // same as "\\n\n\\n"
"\n"
"\""                 // same as `"`
"Hello, world!\n"
"日本語"
"\u65e5本\U00008a9e"
"\xff\u00FF"
"\uD800"             // illegal: surrogate half
"\U00110000"         // illegal: invalid Unicode code point

These examples all represent the same string:

"日本語"                                // UTF-8 input text
"""日本語"""                            // UTF-8 input text as a raw literal
"\u65e5\u672c\u8a9e"                    // the explicit Unicode code points
"\U000065e5\U0000672c\U00008a9e"        // the explicit Unicode code points
"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"  // the explicit UTF-8 bytes
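
The equivalence of these forms can be checked at the byte level. The Python snippet below is only a demonstration using Python's own string escapes, which happen to share the \u and \U spellings; note that in Python a \x escape inside a text string denotes a code point, so the explicit-bytes form is written as a bytes literal instead.

```python
# The nine UTF-8 bytes of the three characters.
explicit_bytes = b"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"

from_text = "日本語".encode("utf-8")
from_short_escapes = "\u65e5\u672c\u8a9e".encode("utf-8")
from_long_escapes = "\U000065e5\U0000672c\U00008a9e".encode("utf-8")

# A single-byte escape versus a code point escape for U+00FF.
single_byte = b"\xff"
encoded_code_point = "\u00ff".encode("utf-8")  # two bytes: 0xc3 0xbf
```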

Warning

If the source code represents a character as two code points, such as a combining form involving an accent and a letter, the result will be an error if placed in a rune literal (it is not a single code point), and will appear as two code points if placed in a string literal.

Syntax Elements

Every Yvm file is a series of declarations and definitions of constants, types, and functions.

Definitions

When new constants, types, and functions are created in Yvm, it is through definitions. When such a definition occurs, the item is “defined” for use after that point.

All data that the item requires must be given at the definition point.

Declarations

It is possible to declare constants, types, and functions that are not defined; if so, they are expected to be defined elsewhere, including in code external to all Yvm modules, such as C code.

When declaring an item, only the name is necessary.

Types

A type in Yvm is either a primitive type, a compound type, or a pointer type.

Type names always begin with $:

type_name = "$" identifier /* with no spaces in between */ .

Examples:

$s64
$int
$ f64    // invalid: must not have a space between $ and the name

Primitive Types

A primitive type in Yvm is a type that has no parts. In other words, it is a type that cannot be split into parts.

There are two kinds of primitive types: integer types and floating-point types.

Integer Types

Integer types are types that use integer math instructions. They are defined with a certain bit width:

int_def  = "int" type_name bit_width ";" .

bit_width = int_lit .

The bit_width must be an integer greater than 0. It specifies how many binary digits are in the type.

Warning

Implementation Allowance: Implementations are allowed to only support integer types whose width is a multiple of eight. This allowance is temporary, as it has not yet been decided whether to require support for arbitrary bit-widths, so implementations must be prepared for this to change.

Integer types cannot be declared. Integer types also cannot be generic.

Examples:

int $usize 64;
int $bool 8;

Floating-Point Types

Floating-point types are types that use floating-point math instructions. They are defined with a certain bit width:

float_def  = "float" type_name bit_width ";" .

The bit_width must be an integer greater than 0. It specifies how many binary digits are in the type.

Floating-point types cannot be declared. Floating-point types also cannot be generic.

Examples:

float $f64 64;
declare float $f32;    // illegal: floating-point types cannot be declared

Compound Types

A compound type in Yvm is a type that has different parts.

There are three kinds of compound types: struct types, union types, and array types.

Struct Types

Struct types are compound types that can hold disparate data of different types. Each field of the struct has a type and a name:

struct_def  = "struct" type_name "{" field { "," field } "}" .
struct_decl = "declare" "struct" type_name ";" .

field = var_name ":" type_name .

var_name = "%" identifier /* with no spaces in between */ .

struct_def is a struct definition, and struct_decl is a struct declaration.

Examples:

struct $str
{
	%len: $usize,
	%idx: $usize,
	%a: $char_ptr
}

// illegal: field name must not have space after %
struct $str2
{
	% len: $usize,
	%idx: $usize,
	%a: $char_ptr
}

declare struct $FILE;

Warning

It is an error if a struct type contains a field of the same type as the struct, or contains a field whose type contains (directly or indirectly) a field with the same type as the struct, including in an array type.
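
The containment rule amounts to cycle detection on a "stores by value" graph. The Python sketch below is one way to check it, not the method Yvm prescribes; the dictionary layout and the name find_containment_cycle are assumptions, and pointer types are assumed not to count as containment.

```python
from typing import Dict, List, Optional

def find_containment_cycle(contains: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return a cycle of type names, or None if no type contains itself.

    contains maps a struct, union, or array type name to the types it
    stores by value (field types, or the element type for an array).
    """
    in_progress, done = set(), set()
    path: List[str] = []

    def visit(name: str) -> Optional[List[str]]:
        in_progress.add(name)
        path.append(name)
        for child in contains.get(name, []):
            if child in in_progress:
                # The child is already on the current path: a cycle.
                return path[path.index(child):] + [child]
            if child not in done:
                cycle = visit(child)
                if cycle is not None:
                    return cycle
        path.pop()
        in_progress.discard(name)
        done.add(name)
        return None

    for name in contains:
        if name not in done:
            cycle = visit(name)
            if cycle is not None:
                return cycle
    return None
```

A struct whose field is an array of the same struct is reported through the array type, which matches the "including in an array type" clause.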

Struct types can also be generic:

struct_gen_def = "struct" type_name "<" gen_list ">" "{" gen_field_list "}" .

gen_list       = identifier { "," identifier } .
gen_field_list = gen_field { "," gen_field } .
gen_field      = var_name ":" (type_name | identifier) .

Generic struct types cannot be declared.

Warning

It is an error if the type of any field in a generic struct type is an identifier that does not exist in gen_list.

Examples:

struct $array<t>
{
	%len: $usize,
	%idx: $usize,
	%a: t
}

// illegal: r does not exist as a generic parameter
struct $array<t>
{
	%len: $usize,
	%idx: $usize,
	%a: r
}
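
The gen_list rule can be checked while substituting concrete types for generic parameters. The Python sketch below is illustrative only; instantiate is a hypothetical helper, fields are modeled as (name, type) pairs, and the substitution step is an assumption about how concrete type arguments are bound.

```python
from typing import List, Tuple

Field = Tuple[str, str]

def instantiate(params: List[str], fields: List[Field], args: List[str]) -> List[Field]:
    """Replace generic parameters in fields with concrete type names."""
    if len(params) != len(args):
        raise ValueError("wrong number of concrete type arguments")
    binding = dict(zip(params, args))
    result: List[Field] = []
    for name, type_ref in fields:
        if type_ref.startswith("$"):
            # A type_name is already concrete and is kept as written.
            result.append((name, type_ref))
        elif type_ref in binding:
            result.append((name, binding[type_ref]))
        else:
            raise ValueError(f"{type_ref} does not exist as a generic parameter")
    return result
```

With params ["t"] and the fields of $array<t>, the argument $uchar turns the field %a: t into %a: $uchar, while a field of type r is rejected.
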

Union Types

Union types are compound types that combine data of different types. In other words, all fields in a union are stored in the same place.

union_def  = "union" type_name "{" field { "," field } "}" .
union_decl = "declare" "union" type_name ";" .

union_def is a union definition, and union_decl is a union declaration.

Examples:

union $data
{
	%len: $usize,
	%idx: $usize,
	%a: $char_ptr
}

// illegal: field name must not have space after %
union $data2
{
	% len: $usize,
	%idx: $usize,
	%a: $char_ptr
}

declare union $DATA;

Warning

It is an error if a union type contains a field of the same type as the union, or contains a field whose type contains (directly or indirectly) a field with the same type as the union, including in an array type.

Union types can also be generic:

union_gen_def = "union" type_name "<" gen_list ">" "{" gen_field_list "}" .

Generic union types cannot be declared.

Warning

It is an error if the type of any field in a generic union type is an identifier that does not exist in gen_list.

Examples:

union $data<t>
{
	%len: $usize,
	%idx: $usize,
	%a: t
}

// illegal: r does not exist as a generic parameter
union $data<t>
{
	%len: $usize,
	%idx: $usize,
	%a: r
}

Array Types

Array types are compound types that have items that are all of the same type. Arrays have a length, which is the number of elements the array has.

array_def  = type_name num_elems ";" .

num_elems = int_lit .

num_elems is the number of elements in the array and must be a constant integer.

Array types cannot be declared. Array types also cannot be generic.

Pointer Types

Pointer types are types that are pointers:

pointer_def = "ptr" type_name type_of_pointer [ addr_space ] .

type_of_pointer = type_name .
addr_space      = string_lit .

type_of_pointer is the type that the pointer will point to. addr_space is the name of the address space the pointer type is constrained to.

Pointer types cannot be declared. Pointer types also cannot be generic.

Pointer Provenance

Pointers have provenance. This means that there is extra information contained in a pointer. This information includes:

  • The address that the pointer points to.

  • The address space for the pointer.

  • How many elements are allocated at the address (the allocation must have enough bytes to fit all of the elements).

  • The type that the pointer points to.

Note

The reason that the type the pointer points to is tracked as part of the provenance is because the backend is the only part of the entire process that knows what size elements actually are since it is translating code into code for a specific machine.

Each of these items has its own separate type:

  • The address has type $*<type_name> (equivalent to C’s <type_name>*)

  • The address space has type $@<type_name>

  • The number of elements allocated has type $usize

  • The type that the pointer points to has type $type

where <type_name> is the name of the pointer type (not type_of_pointer) as defined by pointer_def.

Since there can only be one address space per pointer type, there is no need to allow multiple address space types per pointer type.

When pointers are passed around, their provenance is passed with them. While most backends will be able to eliminate the address space and type, backends MUST keep the address and MUST keep the number of allocated elements.

However, it is possible to explicitly discard the provenance information by grabbing the address alone, and then just pass that around. In that case, the pointer provenance does not need to be passed around. This is for compatibility with C code.

Warning

If code that discards the provenance information ever accesses memory outside of the allocation for that pointer, the behavior is undefined.

The provenance of a pointer can be accessed as follows:

  • The address can be accessed with %s.addr

  • The address space can be accessed with %s.addr_space

  • The number of elements allocated can be accessed with %s.len

  • The type that the pointer points to can be accessed with %s.type

where %s is a register that is a pointer type.
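
As a mental model, a pointer with provenance can be pictured as a small record. The Python sketch below is only an analogy and not the in-memory form Yvm defines; the class name, the field names num_elems and pointee_type (standing in for %s.len and %s.type), and the address space name used in the test are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pointer:
    addr: int          # corresponds to %s.addr
    addr_space: str    # corresponds to %s.addr_space
    num_elems: int     # corresponds to %s.len, the number of elements allocated
    pointee_type: str  # corresponds to %s.type, the type pointed to

    def element_in_bounds(self, index: int) -> bool:
        """True if index names an element inside the allocation."""
        return 0 <= index < self.num_elems

    def discard_provenance(self) -> int:
        """Keep only the address, as code interoperating with C might."""
        return self.addr
```

Once only the address is kept, nothing remains to check an access against, which is why out-of-allocation accesses through such an address are undefined.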

Generic Types

Struct types and union types can be generic. However, when a generic type is used with concrete type arguments, it cannot be used directly; each use of a generic type with a unique set of concrete type arguments must be given its own type name.

That is done like so:

set_gen_def = "set" type_name concrete_type_instantiation ";" .

concrete_type_instantiation = type_name "<" type_name { "," type_name } ">" .

The type_name in set_gen_def is the name of the new type.

Warning

It is an error if any of the concrete type arguments do not exist.

Examples:

struct $array<t>
{
	%len: $usize,
	%idx: $usize,
	%a: $*t
}
set $str $array<$uchar>;
set $str_ptr $array<$str>;

Type Renaming

In addition, it is possible to create new types with a set statement.

set_def = "set" type_name type_name ";" .

The first type_name is the name of the new type, and the second is the name of the type it is created from.

Examples:

set $string $str;
set $uint8 $u8;

Constants

Constants, which are immutable data with a name, are defined as follows:

constant_def  = "constant" constant_name ":" type_name "=" literal_inst ";" .
constant_decl = "declare" "constant" constant_name ":" type_name ";" .

constant_name = "&" identifier /* with no space in between */ .

constant_def is a constant definition, and constant_decl is a constant declaration.

Warning

It is an error if the arguments to the literal instruction are not literals or previously defined constants.

Warning

It is an error if a constant declaration does not have a definition elsewhere, or the type mismatches.

Examples:

constant &newline_ptr : $*schar = string_literal<$schar> "\n";
constant &newline_len : $usize = int_literal<$usize> 2;
constant &newline : $str = struct_literal<$str> &newline_len, 0, &newline_ptr;

declare constant &newline2 : $str;

// illegal; no space allowed between "&" and the name.
constant & newline3 : $str = struct_literal<$str> &newline_len, 0, &newline_ptr;

Functions

TODO

In Yvm, functions are the items that contain actual code.

Note

While code may appear in other places, such code is only for the backend to execute or to define constants.

Functions are defined and declared as follows:

func_def  = "define" func_name "(" [ func_param_list ] ")" "->" func_ret func_body .
func_decl = "declare" func_name "(" [ func_param_list ] ")" "->" func_ret ";" .

func_param_list = func_param { "," func_param } .

func_param = identifier ":" type_name .
func_ret   = ("void" | type_name) .
func_body  = "{" basic_block { basic_block } "}" .

Parameters

TODO

Basic Blocks

TODO

Instructions

TODO

Instruction Groups

TODO

Literal Instructions

TODO

literal_inst = /* TODO */ .

Generic Functions

TODO

YIR Semantics

TODO

Predeclared Types

TODO

  • $usize (for object sizes).

  • $type (struct that has size and alignment).

Registers

TODO

Allocation

TODO

Memory Model

TODO

YIR Bitcode

TODO

YIR API

TODO

In defense against https://www.trojansource.codes/:

TODO: Require an explicit Unicode control character to allow switching scripts in the middle of an identifier?

TODO: Implement the mitigations suggested by the link above:

Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.

Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.

Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.

Also, implement https://news.ycombinator.com/item?id=29172311. To make things easier on users, have an easy way in the language to say “use the contents of such a file as a string”.

Use https://www.unicode.org/reports/tr31/ for identifiers, but also exclude Hangul filler and half-width Hangul filler letters from identifiers.