Design

Yvm is a compiler library. It is similar to LLVM, but written in C (soon to be written in Yao) and focuses on functional programming and ease of optimization and analysis.

High-Level Design

When Yvm was being designed, there were some basic goals:

  1. Ease of optimization.

  2. Ease of analysis.

  3. Ease of lowering.

  4. Ease of targeting.

  5. Reproducible builds.

  6. Provability.

There are five major things that must be done right (easy to compute) in order to fulfill those requirements:

  1. Control Flow

  2. Data Flow

  3. Optimization

  4. API

  5. Proof System

Control Flow

Control flow is largely decided by the form of the IR. In general, there are three major options:

  1. Continuation-Passing Style (CPS)

  2. A-normal Form (ANF)

  3. Static Single Assignment (SSA)

CPS and ANF are equivalent in power (ease of optimization and analysis). SSA is almost equivalent to the others, but it isn’t quite because control flow is different. However, SSA is easier to translate into machine code; LLVM uses it for this reason.

The tradeoff becomes difficult then, because YIR needs to be easy to optimize and easy to lower to machine code.

Thankfully, there is one change that can be added to SSA to make it exactly equivalent to CPS: basic block arguments. This makes it equivalent to CPS because each branch instruction to another basic block then becomes equivalent to calling a continuation.

When the creator of LLVM (Chris Lattner) designed Swift for Apple, he introduced another intermediate language into the Swift compiler that he called Swift Intermediate Language (SIL), and he made SIL use basic block arguments instead of using a phi node like LLVM. This made Swift easier to optimize.

So it was decided that YIR would be SSA with basic block arguments.

Data Flow

Data flow is a little harder, at least in SSA. The reason for this is because in most SSA forms, there are instructions that do not have referential transparency (RT), which if they did, would make data flow fairly trivial. In fact, this is the reason that most SSA instructions do have RT (i.e., add takes two values and produces an entirely new value).

As it turns out, both LLVM IR and SIL are suboptimal because they have instructions that do not have RT (besides control flow instructions, which cannot be made referentially transparent, even in CPS). This is the biggest reason for Yvm’s existence.

Therefore, YIR must be designed such that every instruction (besides control flow) has RT.

But that presents a nasty problem: instructions that access memory inherently cannot have RT, and such instructions are important for being able to compile down to efficient machine code.

To solve this problem, YIR’s designer borrowed an idea from Clojure: the separation of identity from state. See also Are We There Yet? (YouTube) by Rich Hickey, the creator of Clojure.

The gist is this: programming languages are historically bad at managing “time,” or more concretely, values over time. (For more information on values, see Rich Hickey’s talk The Value of Values.) YIR should make that easy.

That’s hardest for load and store instructions because they don’t have RT by default. But there is a little we can do:

  1. A normal load instruction that takes a pointer and returns the value. This has RT, but it only works in the absence of data races. These can be reordered.

  2. An atomic load instruction that takes a pointer and returns the value. These will have “relaxed” semantics. These cannot be reordered, i.e., they will also act as volatile loads.

  3. A store instruction that returns nothing. This only works in the absence of data races. These can be reordered.

  4. An atomic store instruction that returns nothing. These will have “relaxed” semantics. These cannot be reordered, i.e., they will also act as volatile loads.

  5. Memory fence instructions for all 16 versions of store and load barriers. These will have to be control flow instructions because they don’t have RT.

The store instructions are the only thing that do not need RT because there is no relevant data flow, except out to memory, which is only relevant when it comes to memory fences/barriers.

Speaking of which, the memory barriers will be control flow instructions, which means they will split basic blocks. They will prevent reordering of load and store instructions, which actually makes it easy: load and store instructions are just not allowed to cross basic block boundaries, depending on the fence.

This design gives YIR several capabilities:

  • Loads of memory locations that store constant objects do not force a strict ordering and can be optimized heavily (#1). This also goes for objects that are not shared across threads or do not interfere with things like signals.

  • Only loads and stores that can’t be reordered will not be considered for optimization.

  • The translation to machine code is devilishly simple since most arches have actual memory fence instructions.

In addition to the above, all load and store instructions in Yvm must be defined as loading and storing entire objects, not just part of an object. This is done for several reasons, but among the reasons are:

  1. Eliminating load/store tearing.

  2. Giving precise semantics for signals.

Using all of this, YIR will be easier to optimize and to analyze.

However, I then have a problem where changing pointers can cause really bad undefined behavior. And I can’t just make a special instruction for updating pointers because C (and Yao) have the capability to treat anything in memory as a sequence of bytes, which includes pointers.

So what I think I will do is create a specific instruction for checking a pointer for validity. This instruction will be defined to check whether the data that a pointer points to is actually allocated. It will return a boolean value.

With the above instruction, it is possible to make dereferencing a pointer undefined behavior. Conscientious languages should always check pointers, but others would not have to.

SSI Form

This means that we need to not use SSA form, and instead we will use static single information (SSI) form.

Using SSI requires us to make a few more changes. First, any basic block that is a target of another basic block must take the same number and type of arguments as all other targets of that basic block. This will be equivalent to the SSI σ function. The ɸ is taken care of by basic block arguments.

Optimization

There are two things that need to be done right for optimizations to work:

  • Optimization passes must have a way to attach data to any item or element in Yvm.

  • Yvm must have excellent, well-defined semantics that can be implemented on any platform.

  • Correctness must be easy to keep and hard to lose.

  • If a pass preserves correctness, then it must be able to be applied in any order with all other correctness-preserving passes. In other words, such passes should be commutative by design of the system, not the passes themselves.

Data

It should be possible to attach data to any element in Yvm, from declarations, conditional blocks, functions, function parameters, basic blocks, basic block arguments, statements, instructions, instruction arguments, constants, etc.

This will allow Yvm to be data driven and make easy integration for superoptimizers. It will also make it easy to add a declarative optimization language for Yvm later.

Semantics

The semantics of Yvm must be such that generating actual machine code that has those exact semantics should be possible for any supported platform.

To do this, Yvm will define:

  1. Instruction types

  2. Instruction groups

  3. Semantics for time

  4. Semantics for interrupts

  5. Semantics for syscalls

Instruction Types

To allow Yvm to be flexible, it will not define a fixed set of instructions that are basically set in stone without horrendous effort.

To do this, Yvm will define instruction types that specify semantics, and then instructions can be added by frontends, backends, or anyone. When adding an instruction, its type must be specified, and that is how its semantics will be defined.

Each instruction type will also serve as a “generic” form of itself, which means they are valid instructions themselves. When a backend encounters one, it should be able to emit a set of machine instructions for the target machine that will match the semantics of the instruction type, even if the set of instructions is not efficient.

This means that instructions that have certain types can be used to tell backends how to generate more efficient code and will lead to better optimizations. It also means that if a backend does not recognize an instruction, it can just emit the generic set of instructions that it would emit for that instruction’s type.

Adding Instruction Types

In addition, users (frontends, backends, optimizations, etc.) should be able to add instruction types as well give formal models for their semantics.

This is possible because there are only a few things that computers can do:

  1. Move stuff around in memory.

  2. Math.

  3. Managing space resources.

  4. Managing time resources.

  5. Input/output.

Specifying formal semantics of number 1 should be easy, and for number 2, formal semantics are already well-defined by mathematics itself.

Numbers 3, 4, and 5 are a little harder.

First of all, we use “time” and “space” as defined in this blog post. With those definitions, we can better specify formal semantics for both including allocating memory, I/O ports, and CPU time, among other things.

For number 5, we can define all input and output as being in terms of byte streams, and then we can define formal semantics.

Instruction Groups

In addition to instruction types, there will be something called “instruction groups.”

What these are is a set of instructions that can only be used together.

These are necessary because if you have one instruction, say malloc, that allocates memory, and another instruction, say free, that frees memory, if they are in a group, they can be used together. However, another instruction that allocates memory, say arena_malloc, may not be able to be used with free and needs to be paired with arena_free.

Math Instructions

Add instructions return the value and an overflow bit. They can also take a carry bit. Subtraction does the same.

Multiplies return two values. The first is the value, the second is the same size, but it is the carry bits.

Divisions always return a quotient and either a remainder or a modulus. That means there are two division instructions for signed types, but since only mod works for unsigned, there will only be one instruction for unsigned types.

Thread Instructions

There should be instructions for thread creation and destruction. These instructions should only create the thread or destroy it. They should not create or destroy the thread stack, but the thread creation instruction should take a pointer to the thread stack.

Manipulating the thread stack should happen with memory allocation/deallocation instructions.

Memory Allocation Instructions

There should be instructions to allocate, reallocate, and deallocate memory.

These instructions should be defined as just defining, moving, or undefining arrays of bytes from the infinite array of bytes. See “Some Were Meant for C” by Stephen Kell.

These instructions should have the ability to “suballocate” memory. If a section is already existing, but unused, there should be an instruction to “allocate” a portion of it.

I/O Instructions

There should be instructions to do I/O. These instructions should be based on reading/writing bytes, like Unix. I think there should also be two different types, for char devices (non-seekable) and block devices (seekable).

Time

Yvm must know the concept of time, probably based on “machine units” (cycles) like HAL/S.

Interrupts

Yvm must have some concept of interrupts.

Interrupts should only be able to happen between Yvm instructions, but the completion of an instruction should always be a safe point for interrupts to happen.

Syscalls

Syscalls and their semantics should be defined by a series of instructions for each. This will have the great side effect of pinning provable semantics onto low-level syscalls.

Preserving Correctness

It is critical that optimization should never make correct code incorrect. However, unlike most compiler systems, Yvm should also never exploit incorrect code for optimization.

This means that Yvm optimization passes should be expected to not exploit the existence of undefined behavior for optimization. Instead, passes should optimize only in ways that do not have to assume that undefined behavior does not exist.

The reason for this is that exploiting the existence of UB leads to a compiler that will cause the behavior of the code to change. An example is the following:

int something(y_vec* v)
{
	y_usize len = v->len;

	if (v == NULL)
	{
		return -1;
	}

	// Rest of code...
}

Code like the above was added to the linux kernel. In this case, there is a dereference of a pointer that could be NULL. Because NULL behavior could be UB, and because the compiler decided to assume that UB could not happen, it removed the if statement that checked if the pointer was NULL. In Linux, this led to a security vulnerability.

Yvm passes should never do that. They should never assume that UB does not happen; instead, they should optimize as though it can happen unless it can be proven that it does not.

This will make Yvm unique among compiler systems: while most compilers make wrong code fast, Yvm will make wrong code slow and correct code fast.

Pass Commutativity

Yvm should be designed in such a way that passes that preserve correctness, including not using UB as an excuse to optimize (see Preserving Correctness above), such passes will always be commutative.

This means that running passes in any order will not change the correctness of the code, insofar as is possible. (Sometimes it is not, unfortunately. UB is a jerk.)

API

The API is how Yvm will implement the requirement to be easily targetable.

The API design is a more low-level detail than the other two and will have to be done as Yvm is being built.

However, one important thing that the API should expose is navigating between basic blocks in a function. Navigating forward will be easy, since the basic blocks that a particular basic block can go to are going to be encoded into the terminator (control flow instruction) at the end of the basic block. The API should also make it easy to go to any previous basic blocks from a basic block.

Compiler Plumbing

Yvm should be where the lexer, parser, and code gen plumbing is. Compilers should just have functions written to interfaces that Yvm provides. This will allow compilers to share as much code as possible.

Stages

  1. Lexing

  2. Parsing

  3. Yvm Code Gen

  4. Yvm Linking

  5. Abstraction Optimization

  6. Machine Code Gen

  7. Object Linking

  8. Machine Optimization

Lexing

Lexing in Yvm should be a “pull” model; the lexer only lexes the next token when the parser is ready for it.

Parsing

Parsing in Yvm should also be a “pull” model; the parser only parses the next token(s) when the Yvm code gen is ready for it.

The parser could be combined with the Yvm code gen stage.

Yvm Code Gen

Yvm code gen is the stage where the Yvm API will be used to generate the Yvm code by populating Yvm data structures.

Yvm Linking

Linking is the process of combining two Yvm files into one, resolving references between them, ensuring that no duplicates remain, etc.

Abstraction Optimization

Abstraction optimization is platform-agnostic optimization.

Machine Code Gen

Machine code gen is the stage where Yvm is turned into machine-specific code.

Object Linking

If there is still linking that needs to happen after machine code gen, then object linking is the stage where that happens. This linking does not happen on Yvm files; it happens on “object” (machine code) files and is machine-dependent.

This stage should be avoided as much as possible because it will require Yvm to either use external tools or support the specific machine itself.

Machine Optimization

Machine optimization is the stage where machine-specific optimizations can take place. This is done machine code.

As far as possible, optimizations should be done during the Abstraction Optimization stage. This is because Yvm will either need to call external tools, or it will need to support the specific machine itself.

However, some optimizations cannot be done effectively except on actual machine code, so this stage must still exist.

Platform-Specfic Code

Yvm also needs the ability to have platform-specific code, specifically ISA-specific assembly because when it comes down to it, all platforms have certain requirements that must be met, and sometimes, that requires ISA-specific assembly.

Thus, Yvm will support ISA-specific assembly, along with one instruction type, cp, to move a Yvm value into and out of an ISA-specific register.

For x86-64, such ISA-specific code could look like this:

define @_syscall6 : (%syscall_num : $ptr, %arg1 : $ptr, %arg2 : $ptr, %arg3 : $ptr,
              %arg4 : $ptr, %arg5 : $ptr, %arg6 : $ptr) -> $ptr
{
entry(%syscall_num : $ptr, %arg1 : $ptr, %arg2 : $ptr, %arg3 : $ptr,
    %arg4 : $ptr, %arg5 : $ptr, %arg6 : $ptr):
  !x86_64.rax = cp %syscall_num;
  !x86_64.rdi = cp %arg1;
  !x86_64.rsi = cp %arg2;
  !x86_64.rdx = cp %arg3;
  !x86_64.r10 = cp %arg4;
  !x86_64.r8 = cp %arg5;
  !x86_64.r9 = cp %arg6;
  !x86_64.rax = x86_64.syscall !x86_64.rcx, !x86_64.r11;
  %ret = cp !x86_64.rax;
  return %ret;
}

(The above code implements the Linux x86_64 syscall ABI.)

In addition, code with ISA-specific assembly should cause errors if it is in a file that is compiled with an incompatible target machine.

Certificates

Yvm should provide certificates about how Yvm is translated to machine code, as well as how optimizations modify code, including adding optimization passes that just add notes.

Yvm optimizations should also give their justifications for firing.

All of the above, plus Yao certificates, should make it possible for users to have assurance that the code they see (the source code) was translated correctly, without any funny business from the software engineers or their employers.

Versioning

Yvm must be able to check for breaking changes between versions of a piece of software, like Clojure’s Spec.

This includes the library name, which must be unique and the compiler needs to check that there are not breaking changes, or require that the library is renamed. (Or maybe require the major number to be incremented and be automatically appended to the library name?)

Hashbang

Yvm files should be runnable. This will be by hash bangs and an executable that can JIT the files. This means that the executable (yvm?) should be able to link files. This, in turn, means that Yvm files should be able to have references to other files.

Proof System

Yvm will be suitable for a proof system. The entire system must be provable.

The exception is platform-specific (ISA-specific) assembly, which must be tied to fake Yvm code to specify semantics (remember that Yvm code can specify anything regarding the large array of bytes). All proofs will have to make the assumption that the ISA-specific assembly is correct. The fake code can also take the place of missing ISA-specific assembly for ISA’s that don’t need it.

Design by Contract

To enable proving properties, Yvm should use “Design by Contract” (DbC) as a model and be able to:

  • Attach preconditions and postconditions to functions.

  • Attach preconditions, postconditions, invariants, and variants to basic blocks.

The preconditions and postconditions for functions are self-explanatory, but they should reference an already existing predicate function that takes the same amount of values of the same types as the functions.

For basic blocks, the same sort of thing applies, but while preconditions and postconditions are self-explanatory for a basic block, variants are a little more involved.

In essence, just as a function would need to keep track of the old values of the arguments, a basic block would need to keep track of all of the previous values of the basic block arguments. Such arguments should be given to the predicate function as a list, as well as the current values of the arguments. That way, all previous values can be taken into account.

To convert Yao function preconditions and postconditions to Yvm DbC, just convert them into Yvm function preconditions and postconditions.

To convert Yao asserts to Yvm Dbc, just split the basic block and add the assert as a postcondition of the first block and a precondition of the second.

To convert Yao type invariants to Yvm DbC, add the invariants as preconditions and postconditions of every function that is in scope for that type.

To convert loop preconditions to Yvm DbC, just make them preconditions to the basic block that starts the loop, the one with the loop condition.

To convert loop postconditions/invariants to Yvm DbC, just make them postconditions to all of the basic blocks that can jump out of the loop, whether to the loop condition or out of the loop.

To convert loop variants to Yvm DbC, just make them variants of the to the basic block that starts the loop, the one with the loop condition.

Requirements

YIR Requirements

  • Each source file (compilation unit) should be compiled into one Yvm file, like LLVM modules.

  • All modules should be able to be linked together. This mostly involves resolving declarations vs. defines and making sure nothing is defined twice.

  • There should be a predeclared type for types that contains:

    • The size of the type.

    • The alignment of the type.

  • Allow a way to create primitive types with certain numbers of bits for:

    • Integers.

    • Floats-point numbers.

  • Types and functions should be able to be public and private.

    • When linking, check that no private ones are used outside.

    • Then rename all private ones to random names in both units.

    • Then combine.

  • Allow ways to create:

    • Struct types.

    • Union types.

    • Enum types.

  • There will be built-in type modifiers:

    • Owned pointers (@).

    • Borrowed pointers/references (&).

    • Arrays ([]).

    • Mutable (!). This must be followed by a pointer/reference/array modifier because only data stored at an address can be mutable. (Register values are not mutable.)

  • Pointer types:

    • Must have explicit provenance:

      • Address. This must be like a void*, usable without provenance. Useful for implementing memcpy(). They’ll still be labelled as $@<type> or $&<type>.

      • Number of elements.

      • Type.

      • Address space.

      • Address and number of elements should always be kept, but type and address space can be discarded when generating code.

      • Lifetime. (This is for building in a borrow checker and other static analysis.)

  • In order to allow using pointers as void pointers, there needs to be two different instructions to calculate address from a pointer. One should be the one that takes into account type, and the other should treat it as a char*.

    • The type-based one should be like getelementptr, which uses as many index arguments as possible.

      • This one should also not be able to follow more than one pointer. In other words, if there is something like a->b->c, it should only be able to follow the first in one instruction, and getting the next would require a load followed by another one of these instructions.

    • The byte-based one should only take one index.

    • There should also be a type-based one that takes a non-pointer value. This one’s first index, unlike the pointer version and getelementptr, is not the index of the pointer itself. The indices should skip that one. This will allow me to use registers in getelementptr-like stuff.

      • This one should not be able to follow pointers, and must always return the actual value. Call it getelement?

  • Functions, structs, unions, and instructions should be able to be generic.

    • This should only extend one layer deep.

  • Have a statement to create a new type using a generic type with concrete type arguments.

    • This will take care of multiple layers of generics for functions.

  • Generic functions should exist in fully-linked Yvm files. Backends will be responsible for generating code for all concrete instantiations of generic functions.

  • Have a way to create types that are pointers to specific types. Could use the same statement for creating types from generic types.

  • Master Instructions:

    • There should be instructions that are considered to be the master type of instruction. These instructions define the semantics of all other instructions that are considered to be the same type.

    • Such instructions can include alloc, for example, which allocates memory. However, there may be more than one way to allocate memory, such as malloc(), mmap(), and even opening shared memory with another process.

  • Minimal undefined behavior, if any. Define everything possible without limiting programmers.

    • One instance of needed undefined behavior: accessing a raw pointer without a bounds check (still needed in some cases) could be UB if the access goes beyond the bounds of the allocation. I need to allow that, but it should never be UB to attempt to access out-of-bounds with a bounds check.

    • However, UB should be redefined away from what it is in C.

      • Instead, it should be defined such that compilers cannot assume the absence of it when doing optimization. Allowing it would incentivize evil compilers.

  • Functions are made up of basic blocks with arguments.

    • There is no code outside of basic blocks.

    • Basic blocks are made up of instructions.

    • All instructions return values except for the last instruction in the block, which is always a control flow instruction, which does not return a value, but either jumps to another basic block, returns from the function, or jumps away for some other reason.

    • The return values of instructions are captured in “registers”.

    • Only one instance of a register with a specific name per basic block is allowed.

      • This includes basic block arguments.

    • The entry block (the first block) must have the same names for arguments as the function parameters.

    • The entry block must also have the same number and types of arguments as the function parameters.

  • Must have debug info capability.

    • Does not need to be implemented during bootstrap, but must be designed in order to make things work together later.

    • DWARF can be attached to modules, subroutines, variables, parameters, constants, types, and labels, so YIR needs to be able to attach info to all of those entities.

      • Must have a way to tell what package/module a YIR module (compilation unit) came from.

      • Must be able to attach source info to compilation units.

      • Must be able to attach source info to instructions, basic blocks (labels), and functions.

      • Must be able to attach debug info (source variables) to registers and function parameters.

      • Must be able to attach debug info to constants and types.

      • Debug info should be able to be preserved after linking.

Interpreter/Backend Requirements

  • Must implement runtime capabilities.

    • Will be based on “provenance” of Yvm files, which must be a string provided by the user to the backend.

    • Backend users must be able to say which instructions that files of a certain provenance can and cannot use at all.

      • This is to make runtime capabilities easier and faster.

      • This should be a compile-time thing.

      • For example, a user might say that files from a certain provenance cannot use any instructions that might have UB.

        • This means that there has to be a getelementptr instruction that does a bounds check and panics on failure.

      • This also applies to certain instruction/arguments combos.

        • For example, the user might say to not let files from a certain provenance call ReadFile on Windows. (ReadFile is an argument to the call instruction.)

    • Backend users must be able to provide a runtime functions for checking permissions.

      • For example, a function to check that code from a certain provenance can only open files under a certain directory.

      • The functions will take the arguments to the instruction and return a boolean: true if allowed, false otherwise.

    • With this design, security, especially preventing confused deputy, is mostly a matter of deciding what code can do.

      • This means it’s up to users.

      • But a good list should be provided and should include:

        • Not allowing thread creation without using a function of the same provenance.

        • Not allowing use of certain functions, including passing them to other functions.

        • Not allowing int to pointer casts. (To prevent creation of function pointers to get around above restriction.)

        • Not allowing calls to OS functions.

  • Must allow built-in functions.

    • This will be things like Rig’s target registration, which will be its own instruction, but I want it to just call a C function for performance.

  • Must allow users to define their own instructions.

    • This will be implemented by a specific instruction that then calls a built-in function provided by the user.

API Requirements

TODO: Will be done in a future subproject.

Functional Requirements

  • Must be able to parse YIR.

  • Must be able to parse YIR bitcode or bytecode.

    • To be decided and designed later, in a separate subproject.

  • Must have a backend to generate C99.

    • The C99 must have no undefined behavior in C.

    • The C99 must use the safe arithmetic in yc/arith.h.

    • The C99 will use scopes with jumps as basic blocks.

      • All registers in a basic block that only appear in that basic block will be const variables local to the corresponding scope.

      • Basic block arguments will be implemented as function-level non-const variables with a prefix matching the basic block. Before jumping to the next scope, all of the variables will be assigned to.

    • The C99 must include full source information (debug info).

      • However, any debug info that cannot be represented with portable C99 should be excluded. I think only the #line directive works.

  • The bootstrapped compiler must have a Language Server Protocol mode.

Algorithmic Requirements

  • Algorithms:

    • Parsing YIR.

    • Parsing bitcode/bytecode. (To be implemented in a future subproject.)

    • Optimization.

    • Linking.

  • Parsing Algorithmic Requirements:

    • Time: O(n) average case, O(n^2) worst case (only if absolutely necessary), where n is the size of the input file.

    • Space: O(n), where n is the size of the input file.

  • Optimization Algorithmic Requirements:

    • Time: Varied, depending on the optimization.

      • Inlining: O(1) average case, O(n) worst case, where n is the size of the linked program.

      • Constant folding: O(1) average case, O(n) worst case, where n is the size of the linked program.

      • TODO: Add more as I think of them. Will do optimization in a future subproject.

    • Space: Varied, depending on the optimization.

      • Inlining: O(1) average case, O(n) worst case, where n is the size of the linked program.

      • Constant folding: O(1) average case, O(n) worst case, where n is the size of the linked program.

      • TODO: Add more as I think of them. Will do optimization in a future subproject.

  • Linking Algorithmic Requirements:

    • Time: O(n), where n is the size of the linked program.

    • Space: O(n) average case, O(n^2) worst case, where n is the size of the linked program.

Performance Requirements

  • Bootstrap compiler:

    • Scale: bootstrapped compiler.

    • Fast enough for quick turnaround (less than 5s) during development of bootstrapped compiler.

  • Bootstrapped compiler:

    • Scale:

      • 10’s of millions of files.

      • Each billions of lines long.

      • With lines megabytes long.

    • Time: 1M LoC/s, non-optimized, but with debug info.

    • Space: 3x size of file.

    • Response to each LSP request under 5ms. (Users notice anything above 10ms, and the editor needs enough time to do its thing too.)

Security Requirements

  • Bootstrap compiler:

    • No threat model, no requirements.

    • It’s going to be internal use only, so I don’t care.

  • Interpreter:

    • Ability to handle the dynamic capabilities mentioned in the Yao design document.

Safety Requirements

  • Bootstrap compiler:

    • Must not crash on compiling bootstrapped compiler.

    • Must not miscompile on compiling bootstrapped compiler.

    • Must not generate C99 with undefined behavior.

Glossary

The definitions of these terms apply only to this document.

Lowering

Changing IR into machine code.

Object

An object in the C sense.

Platform

The combination of ISA and Operating System.

Space

Any limited resource where instances of the resource can be used concurrently with other instances of the same resource. (See this blog post.)

Targeting

Writing a backend for a specific ISA.

Time

Any resource that cannot be used concurrently with other instances of the resource. (See this blog post.)

Undefined Behavior (UB)

A certain action that, if it occurs, causes all subsequent behavior of the software in question to be undefined, which means that anything can happen. This is necessary to have because there are some things, like out of bounds reads and writes, that may affect how the program behaves in ways that cannot be anticipated.