The entire reason for a Version Control System is to:
Store data, including ensuring its integrity.
Track changes to that data over time.
Easily recall that data on request from users.
These principles must guide the design.
In other words, Version Control Systems have stagnated since the introduction of git and its conquest of the VCS space for software.
But behind the scenes, other VCS’s continue to be used for cases where git does not work, such as the heavy use of Perforce in the game industry. People in the game industry want a VCS that will do code and assets.
In addition, there has been demand, but no fulfilling of that demand, for a “semantic-aware” source code VCS. People have even used XML-aware merge plugins for git!
So obviously, there are cases where git is insufficient. And it is time for a new generation, the fourth generation of Version Control Systems.
Version Control vs Software Configuration Management¶
Should Yar be a Software Configuration Management System?
SCM is about tracking, not just the versions, but also everything else around software, such as bug reports, build management, documentation, etc.
I believe that Yar should do all of those things. However, I do not want to label it as Software Configuration Management for two reasons:
I intend for it to be used outside of software contexts.
Yar will track all of the other items around projects in the same way it will track files.
Yes, bug reports and other things like that will be version controlled themselves.
Fourth Generation Version Control¶
So with all of that said, what will fourth generation version control look like?
I think it will look like this:
Ease-of-use, enabling non-technical software users to use it.
This includes integration with other software to provide version control as a replacement for undo/redo.
Non-text-line-based diffing and merging. Regardless of how this is done, it must enable:
Semantics-aware diffing and merging on source code.
Binary data-aware diffing and merging.
Version controlling all other aspects of project management.
Yar is going to be a manifestation of that vision.
The biggest item, however, is number 1: fourth generation version control must be accessible for non-technical software users. It’s time that we, as programmers, gave them the tools we have given ourselves.
The high-level requirements are, in order of priority:
Secure with the following properties (in order of priority):
Auditability (always, except in very specific cases that must be logged).
Trust and trustworthiness (always, but decided by humans and obeyed).
Authentication (when required).
Confidentiality (when required).
Privacy (when required).
Availability (as much as possible).
Easy to use.
Decentralized but centralizable.
Yar’s first and foremost purpose is to secure the information that it stores. However, that security takes many forms.
Full auditability is required.
There should never be a question of where and who data came from.
History must never be rewritten.
Have an option for tracking changes to the database.
Require future customers to have it enabled on central servers.
There should only be a linear history (no branching) of the database because only one change should happen at a time.
In fact, tracking the order of changes in the repo, like I would do to recover a repo, is going to be a killer feature. It enables:
rig for some assurance of reproducible builds from the exact configuration used for a build.
It will do this by adding artifacts to a commit and tracking the order they were added in, including locking the repo for a build.
It allows easy central file locking.
It allows easy repo recovery when corruption happens.
It allows auditable rebase.
Use SQLite’s transactions for this.
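A minimal sketch of how a linear, never-rewritten change log could sit on top of SQLite transactions. The table name `change_log`, its columns, and the helpers `open_log` and `record_change` are illustrative assumptions, not Yar's actual schema.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open a database holding an append-only change log (hypothetical schema)."""
    db = sqlite3.connect(path, isolation_level=None)  # transactions are explicit
    db.executescript("""
        CREATE TABLE IF NOT EXISTS change_log (
            seq    INTEGER PRIMARY KEY AUTOINCREMENT,  -- strictly increasing order
            author TEXT NOT NULL,
            action TEXT NOT NULL
        );
        -- Reject any attempt to rewrite or drop history.
        CREATE TRIGGER IF NOT EXISTS change_log_no_update
            BEFORE UPDATE ON change_log
            BEGIN SELECT RAISE(ABORT, 'change log is append-only'); END;
        CREATE TRIGGER IF NOT EXISTS change_log_no_delete
            BEFORE DELETE ON change_log
            BEGIN SELECT RAISE(ABORT, 'change log is append-only'); END;
    """)
    return db


def record_change(db, author, action):
    """Append one change; BEGIN IMMEDIATE takes the write lock so changes serialize."""
    db.execute("BEGIN IMMEDIATE")
    try:
        cur = db.execute(
            "INSERT INTO change_log (author, action) VALUES (?, ?)",
            (author, action),
        )
        db.execute("COMMIT")
        return cur.lastrowid
    except Exception:
        db.execute("ROLLBACK")
        raise
```

Because every change gets the next `seq` value inside its own transaction, replaying the log in `seq` order is one possible basis for the recovery and locking features listed above.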
Use a Merkle tree for integrity of commits.
It should be possible to use any cryptographically-secure hash function for the Merkle tree.
This should be per-commit, with a byte or two in the commit to say which function was used.
Defaults should be:
BLAKE2 (speed, also contemporary with SHA-3).
SHA-256 (used for compliance in government circles).
SHA-3 256-bit (should eventually replace SHA-256).
The Merkle tree should be done on the content of files/objects, not on the content of patches.
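A sketch of a content-based Merkle root with a per-commit hash identifier, using Python's `hashlib`. The one-byte identifiers in `HASHES`, the leaf and node prefixes, and the rule for unpaired nodes are assumptions made for illustration, not a defined Yar format.

```python
import hashlib

# Hypothetical one-byte identifiers recorded in a commit to name its hash function.
HASHES = {
    0x01: lambda: hashlib.blake2b(digest_size=32),
    0x02: hashlib.sha256,
    0x03: hashlib.sha3_256,
}


def merkle_root(blobs, algo_id=0x01):
    """Return the algorithm byte followed by the Merkle root of the file contents."""
    new_hash = HASHES[algo_id]

    def digest(prefix, data):
        h = new_hash()
        h.update(prefix + data)
        return h.digest()

    # Distinct prefixes keep a leaf from being mistaken for an interior node.
    level = [digest(b"\x00", blob) for blob in blobs] or [digest(b"\x00", b"")]
    while len(level) > 1:
        paired = [
            digest(b"\x01", level[i] + level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            paired.append(level[-1])  # an unpaired node moves up unchanged
        level = paired
    return bytes([algo_id]) + level[0]
```

The leaves here are hashes of file contents rather than patches, matching the requirement above; storing the identifier byte next to the root lets each commit pick its own function.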
The public key and password envelope (from OPAQUE) of a user is their identity.
The public key and password of a user must be able to change.
The history of the user’s keys must also be tracked.
Do this in a special “config” repo. (See below.)
I need to have some way to get around the fact that Mac OSX and Windows filesystems default to being case-insensitive. See Security Issues in Git.
Servers and users should all have a private “config” repo.
Should store global configs for users and servers.
It will also be a keystore.
This repo should track changes to identities over time.
Keys and Argon2i extra data (not passwords!).
This should not store passwords.
Users should use a password manager.
This is to make sure that if this repo is breached, only the keys are compromised.
It can, however, store the secrets needed for use with passwords:
Secret key given to Argon2i.
Salt given to Argon2i.
These are safe-ish to store because the password is still not known.
While they make it easier to crack the password, they are still not all that is needed.
All data should be encrypted by a master password.
Research the KeePass file format.
Should also store what users/keys are labelled as spam and other data like that.
Commits must be signed.
From a fellow programmer: combine Rig and Yar to make it so that users can have assurance that the binary they have is tied to the correct source code.
To do this, Yar must be able to attach artifacts to a commit after the commit.
And maybe lock the repo on that commit while adding them?
I think that the above can be done, if I can somehow tie the source code of each file to the resulting object files.
Rig would probably have to hash the source file and put the hash in the object file after the compiler is done? And fail if any of the attributes mismatch before and after? And then Yar fails if that hash doesn’t match the hash of the file in the commit?
I don’t think the above will work. More thinking required.
Trust and Trustworthiness¶
Trust should always be decided by humans.
Trust should never be decided by Yar.
Users should explicitly label keys as “trusted” or “not trusted”.
This includes when a user’s key is changed; other users should have a chance to un-trust the old key and trust the new key.
Changes to trust should be tracked in the central config repo for a user.
This will allow old commits under an old key to stay trusted.
It should be possible for “admin” (as in, sysadmin, manager) users to revoke access of subordinate users.
This should be done by having an option of querying a certificate server.
When a user logs in, go through the entire login, but once established, query the server, assumed to be directly controlled by the sysadmin.
If the server returns a valid signature of the user’s public key, continue; otherwise, bail.
This should be optional.
Authentication must be by the user’s key and by password (OPAQUE).
When either one needs to be changed, both must be used to do the change.
This is so that if a hacker compromises one, he still cannot take over the account.
When either one needs to be changed, both must be changed.
This is so that a compromise of one cannot extend to the other, since a historical record of both is kept.
Specifically, if a private key is compromised, it might be because the config repo was compromised (meaning keys are compromised), which has data used as part of OPAQUE, which might make it easier to crack the password.
Also support WebAuthn?
With non-repudiation, it is up to users to create and rotate keys on breach.
This includes distributing their new keys.
This means that Yar must be able to list all remote servers that have their key for a repo.
Do this in the central config repo mentioned under “Authenticity.”
Also, making it possible to easily push the new key to remotes is necessary.
Also, while Yar should provide tools for keeping accountability, Yar should never take responsibility for keeping it.
Keeping accountability is still the responsibility of humans.
Users should have the option of remaining anonymous.
Other users should still be able to decide how much they trust such users.
Have bug reports/anything be able to be private between any set of users.
Implement the above using groups, like Unix groups?
Provide rate-limiting to prevent DoS attacks on servers.
Specifically, make it possible for a server to reject requests from certain places with an HTTP error.
And to do it based on an interval. Or even a script. (Embed Yao interpreter.)
This should also be fine-grained.
For example, it should be possible to allow partial clones up to a certain size from certain origins.
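One common way to get interval-based, per-origin limits is a token bucket. The sketch below assumes a hypothetical `TokenBucket` class; the `cost` parameter is an assumed hook for fine-grained rules such as charging more for larger partial clones.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second per origin, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets = {}  # origin -> (tokens left, time of last request)

    def allow(self, origin, cost=1.0):
        now = self.clock()
        tokens, last = self.buckets.get(origin, (self.burst, now))
        # Refill in proportion to the time elapsed, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < cost:
            self.buckets[origin] = (tokens, now)
            return False  # the server would answer with an HTTP error here
        self.buckets[origin] = (tokens - cost, now)
        return True
```

A server could keep one bucket per origin address or per user key; which of those identifies a "place" is left open here.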
Reduce susceptibility to spam and reject early.
One possible way to reduce spam:
Make an ASCII- or Unicode-art captcha like the one Fossil uses.
But instead of putting it in to the website, maybe do it through Yar itself, using the user’s key?
In the web interface, allow bots to do automatic management of bad users. See this comment. This would mean having an API.
No, don’t have an API; that would just encourage spam.
Instead, use plugins.
I must have a detailed threat model:
What are the threats?
What are the assets?
What are the risks from each threat to each asset?
How bad are the consequences of the compromise of any asset by any threat?
What are the protections I will put into place?
Security should be “fail-secure,” meaning that if something fails, someone is locked out rather than continuing to be allowed in (“fail-open”).
Does this mean that trust should expire? Maybe not.
I need to understand SSH certificates before I do anything here.
For the web app, use the SameSite cookie attribute.
This is to prevent instances of the BREACH attack.
WebAuthn on keystore?
What’s contained in the keystore.
Certificates when checking users.
Look at MEGA attack paper for key handling ideas.
Use proper key wrapping on keystore.
Algorithms and Parameters¶
These are subject to change.
Use what BearSSL has.
SHA-256, or BLAKE2b, used in Argon2i.
Same kind of keys as key gen.
OPAQUE (see the RFC, page 35):
Memory: 16 (65536 KiB).
I want to go higher, but I probably can’t because of constrained environments.
This has to be high because of attacks on Argon2i specifically that are weakened by more iterations.
This is set so that my machine just about takes half a second with no parallelism using the reference impl, with the password set to “I am the best there ever was! In programming, of course.” and the salt set to “That’s why I am doing this.”.
A generated, random key, separate for each password.
This is so that adversaries have a harder time cracking the password.
A generated, random salt, separate for each password.
Arbitrary data: username.
NIST curve P-521.
3DH (3 Diffie-Hellman).
Easy to Use¶
MOST IMPORTANT: Yar must be easy, approachable, and give full confidence to the user. See A Case of Computational Thinking: The Subtle Effect of Hidden Dependencies on the User Experience of Version Control.
Also see “What’s Wrong with Git?”.
Beyond that, there are two aspects to this:
Ease of Use.
Ease of Use¶
The most important data stored by a VCS should be easily created and easily queried.
That is what this section is about.
Must be easy to use for individuals in personal repos.
Must also be easy to use for groups anywhere on the centralized/decentralized spectrum.
Study Mercurial’s CLI for ideas.
Must use terms and phrases whose meaning will make sense without tutorials, as much as possible.
There must be an early GUI for many operations.
This can be implemented with a server program to browse with a browser.
Operations that must be performable with this include:
Submodules must be easy.
Adding a submodule should be equivalent to adding a file.
Same with deleting a submodule and deleting a file.
Submodules should never require extra operations beyond add, delete, and update.
This means initializing submodules should happen automatically.
A server should be easy to set up, like Fossil.
There should be a way to auto-sync, but it should never prevent saving or committing if it fails.
There must be a way to report a bug with software through Yar.
Should be possible to set up a template.
The report should go to the server branch.
zv on IRC made a good point why a staging area is still needed:
“here’s a good reason to keep staging area: I want to work on anscillary files that help with debugging, but I only want to commit the real code files”
“having a staging area allows me to not commit every file I’ve modified in the current directory”
“stashing those is incorrect because stash should be for some arbitrary directory state, not for files that I specifically do not want to commit”
Some data in a VCS is not typical, but it should still be possible to query. In addition, it must be possible to get reports on that data. It should also be possible to create complex changes to data.
That is what this section is about.
All data in the database should be accessible to users.
With the exception of privacy and permissions, of course.
Instead of letting the user query stuff directly from the database, implement a way for them to run arbitrary code using some basic queries, like all commits, all users, etc.
Basically, the queries should be to get every instance of a single type of item in the database.
This should automatically be filtered by permissions for the user.
The plugin should then get access to the data it asked for, and it can then run whatever it wishes, whether to filter the data or run reports or something else.
These should still be heavily sandboxed, though, using the Yao interpreter.
There should be no access to filesystem or network.
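A sketch of the "one kind of item, filtered by permissions, then handed to a plugin" flow described above. `Item`, `query_all`, and `run_plugin` are hypothetical names, and the plugin is an ordinary Python callable here; real sandboxing through the Yao interpreter is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    kind: str           # e.g. "commit", "user", "bug"
    data: dict
    readers: frozenset  # users allowed to see the item; empty means everyone


def query_all(store, kind, user):
    """Return every item of one kind, already filtered by the user's permissions."""
    return [
        item for item in store
        if item.kind == kind and (not item.readers or user in item.readers)
    ]


def run_plugin(plugin, store, kind, user):
    """Give the plugin only the filtered items, never the store itself."""
    return plugin(query_all(store, kind, user))
```

For example, a report plugin that counts commits sees a different total depending on who runs it, because private items are removed before the plugin is called.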
Plugins should be able to do more than define file formats to track.
Like bzr, they should be able to do GUI’s, import/export, statistics, and more.
Maybe this won’t be necessary with it also being a library?
Allow the appearance of “rewriting” history on a single branch.
See these comments.
Also, an acquaintance made a good case for leaving in tools to make it appear as if history was rewritten.
I may call this “rebase,” but it won’t be like git’s.
Instead, it will be a mechanism whereby a certain number of commits at the end of the branch’s history are recommitted with different patches.
The recommitted commits will have different numbers, starting after the previous last commit.
Their parents will be both the previous commit and the commit on which they are based.
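A sketch of one possible reading of this "rebase": the old commits stay in history, and each recommitted commit records both the commit before it and the old commit it replaces. The `Commit` type, sequential numbering, and `recommit_tail` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    number: int
    patch: str
    parents: tuple  # numbers of the parent commits


def recommit_tail(history, count, new_patches):
    """Recommit the last `count` commits with new patches; nothing is deleted."""
    if not 0 < count <= len(history) or len(new_patches) != count:
        raise ValueError("need exactly one new patch per recommitted commit")
    result = list(history)
    previous = history[-1].number
    for old, patch in zip(history[-count:], new_patches):
        # Parents: the previous commit, and the commit this one is based on.
        new = Commit(previous + 1, patch, (previous, old.number))
        result.append(new)
        previous = new.number
    return result
```

Since the original commits remain and the new ones point back at them, the operation stays auditable even though the branch appears rewritten.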
Multiple checkouts from one clone.
Should also be distributable as a library.
The library should provide:
Undo/redo tree support.
All other Yar operations, except ones that don’t make sense.
Real-time collaboration using centralization and auto-sync with a defined server.
This is a future goal.
There are three parts to this:
Version the entire repo.
Be able to version any kind of file.
Version all artifacts in the repo, and do so recursively so that versions themselves can be versioned.
Version Entire Repo¶
The unit of tracking should be the entire repo.
However, the unit of working should be the branch.
This is to resolve the difference between git and bzr:
git tracks the entire repo, so branching is really easy, but
bzr has branches that are more graph-like and stronger than the branches of git.
In particular, branches are their whole history, not just a HEAD pointer.
Must have submodules, and they must be able to handle git repos as well.
See above for how to make submodules easy.
Version Any Kind of File¶
Use plugins to handle different file types.
Each plugin should be able to:
Identify dependencies on other data in other files, including other file types (with help from other plugins).
Read the file into memory.
Write the in-memory form of the file out to disk.
Define the data in the file type (so the general algorithm can diff and merge).
Each file plugin should track semantic elements.
For example, the C plugin should recognize C functions, structs, etc.
The Blender plugin should recognize verts, edges, faces, nodes, etc.
Plugins must be able to identify extraneous data that does not matter for merging. For example, whitespace in C, or empty space in SQLite databases, padding in C structs, etc.
This extraneous data can take any form, and it’s up to the plugin to identify it.
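A sketch of what the plugin contract above might look like as an interface. The class `FilePlugin`, its method names, and the toy `KeyValuePlugin` for "key = value" text files are all hypothetical; a real C or Blender plugin would be far more involved.

```python
from abc import ABC, abstractmethod


class FilePlugin(ABC):
    """Hypothetical contract for a file-type plugin."""

    @abstractmethod
    def read(self, raw: bytes): ...            # file bytes -> in-memory form

    @abstractmethod
    def write(self, model) -> bytes: ...       # in-memory form -> file bytes

    @abstractmethod
    def elements(self, model): ...             # semantic elements to diff and merge

    @abstractmethod
    def dependencies(self, model): ...         # data needed from other files

    @abstractmethod
    def is_extraneous(self, element) -> bool: ...  # does not matter for merging


class KeyValuePlugin(FilePlugin):
    """Toy plugin: each 'key = value' line is one element keyed by its name."""

    def read(self, raw):
        return raw.decode("utf-8").splitlines()

    def write(self, model):
        return "\n".join(model).encode("utf-8")

    def elements(self, model):
        return [(line.split("=", 1)[0].strip(), line) for line in model]

    def dependencies(self, model):
        # Assumption: an 'include = path' line names another file this one needs.
        return [
            line.split("=", 1)[1].strip()
            for line in model
            if "=" in line and line.split("=", 1)[0].strip() == "include"
        ]

    def is_extraneous(self, element):
        _key, line = element
        return not line.strip()  # blank lines do not matter for merging
```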
Version All Artifacts¶
Integrated bug and PR tracker.
Bugs and PR’s should be attached to a specific branch.
This is a sort of centralization that makes it easier to figure out what to do with bug reports and the like.
Integrated wiki or some way to present documentation.
Wiki should be attached to a specific branch.
It would be better if the wiki could be automatically generated from documentation in the repo.
Integrated web interface.
Be able to mark commits with certain labels.
This will make it so developers can mark certain commits with “Mistake” for example, which they should then be able to tell the software not to show by default. This will take care of people’s desire to undo their mistakes, for the most part.
This will also make it so that, if they accidentally mix implementing subfeatures while on a feature branch, they can mark the commits with the subfeatures’ names, making each subfeature easier to cherry pick.
I am probably only scratching the surface with the possibilities of marks; they seem extraordinarily powerful.
Maybe they can replace patch commutation/rebasing along with user branches below?
Be able to attach versioned artifacts to other artifacts.
For example, a commit message is just an artifact attached to a commit.
Commits don’t have to have them.
Likewise, an amendment on a commit message simply adds a new version to the commit message with the amendment, sort of like Fossil.
Do not include commit message in hash.
This can also be how marks are done; they are artifacts attached to commits.
This will also allow per-item (recursive) commit message, like BitKeeper allows per-file commit messages.
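A sketch of versioned artifacts attached to other artifacts, covering commit messages, amendments, and marks in one structure. The `Artifact` class and the `visible` helper are hypothetical; hashing is left out, in line with keeping the commit message out of the hash.

```python
class Artifact:
    """Anything that can be versioned and can carry other artifacts."""

    def __init__(self, kind, content):
        self.kind = kind
        self.versions = [content]  # every version is kept; none is overwritten
        self.attached = []         # artifacts attached to this one

    @property
    def current(self):
        return self.versions[-1]

    def amend(self, content):
        self.versions.append(content)

    def attach(self, artifact):
        self.attached.append(artifact)
        return artifact


def visible(commits, hidden_marks=("Mistake",)):
    """Hide commits carrying a mark the user asked not to see by default."""
    return [
        c for c in commits
        if not any(a.kind == "mark" and a.current in hidden_marks for a in c.attached)
    ]
```

Because a message is itself an `Artifact`, it can carry its own attachments, which is one way to read the recursive, per-item messages mentioned above.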
Decentralized but Centralizable¶
Have a difference between “user branches” and “main” or “master” branches.
User branches are owned by one user, and they can pull from other users’ branches and the “main” or “master” branches.
Then, those with the correct permissions can merge into the “master” branches and push.
A user’s branch named the same as a “master” branch target is the only branch that the user can use to merge into the “master” branch.
Users can also never commit directly to “master” branches; they can only merge into them.
The reason user branches are good is that they will force users to commit to their personal branches.
This doesn’t sound like much, but it means that there won’t be history rewriting when users pull branches because they can really only:
Merge into their personal branches,
Merge into the “master” branches,
Push the results back.
This means that local rebasing will not happen, even when pulling a branch with changes.
I will have to put a lot of effort into making user branches not clutter the output on logs.
Implement a way to require file locking on specific files on a specific, server-based main branch.
Performance is a feature!
Performance regressions are regressions.
Performance must be tested on every release.
Cloning does not need the full data.
It does not need full history; only send SQLite rows that matter.
Only rows for requested branches and their ancestors.
Nor does it need all of the files; see
This needs to be designed to work from the beginning.
This should be integrated with rig.
If rig needs a file that is not available, it should be able to pull it down.
This implies a command to pull down specific files, with or without their version history.
This, in turn, implies that the checkout should be per-file, not per-directory.
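The "only rows for requested branches and their ancestors" rule for cloning maps naturally onto a recursive query in SQLite. This is a sketch only: the tables `branches(name, head)` and `parents(child, parent)` and the function `rows_for_branches` are assumed for illustration.

```python
import sqlite3


def rows_for_branches(db, branches):
    """Return the ids of the requested branch heads and all of their ancestors."""
    placeholders = ",".join("?" * len(branches))
    sql = f"""
        WITH RECURSIVE wanted(id) AS (
            SELECT head FROM branches WHERE name IN ({placeholders})
            UNION
            SELECT p.parent FROM parents AS p JOIN wanted AS w ON p.child = w.id
        )
        SELECT id FROM wanted ORDER BY id
    """
    return [row[0] for row in db.execute(sql, list(branches))]
```

`UNION` (rather than `UNION ALL`) drops duplicates, so shared ancestors of several requested branches are sent once.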
status and other common commands should be fast.
Use an “untracked cache” like git’s.
Use everything like rig would, not just git update-index --test-untracked-cache.
Use a prefix-compressed path format in the cache?
Search for “prefix-compressed” to see what that means.
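As a rough illustration of one common meaning of prefix compression (front coding), each sorted path can be stored as the number of leading characters it shares with the previous path plus the remaining suffix. The functions `compress` and `decompress` are hypothetical and do not reproduce any particular on-disk cache format.

```python
import os


def compress(paths):
    """Encode sorted paths as (shared prefix length, suffix) pairs."""
    entries, previous = [], ""
    for path in sorted(paths):
        shared = len(os.path.commonprefix([previous, path]))
        entries.append((shared, path[shared:]))
        previous = path
    return entries


def decompress(entries):
    """Rebuild the sorted path list from (shared prefix length, suffix) pairs."""
    paths, previous = [], ""
    for shared, suffix in entries:
        previous = previous[:shared] + suffix
        paths.append(previous)
    return paths
```

Deep directory trees benefit most, since neighboring entries in a sorted cache tend to share long directory prefixes.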
Instead, when pulling, you should be able to pull (either separately or in combination):
A branch’s history (just the stuff in the database).
A file’s current content.
A branch’s data (patches), even if partially.
A branch’s recursive history (the stuff in the database for the branch and all of its ancestors).
A branch’s recursive data (patches for the branch and all of its ancestors).
Yes, this means that sometimes, not even the entire set of files for the repo will be pulled.
This is mostly for studios that will have large repos with large assets that many artists don’t need on their machines.
Yar needs to be able to figure out dependencies of what to pull based on what the user wants.
For example, if a blend file has a reference to a PNG texture, that should be a dependency labeled by the plugin, and in that case, Yar should pull down the contents of the PNG if the contents of the blend file are pulled.
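Working out what to pull amounts to a transitive closure over the dependencies that plugins report. A minimal sketch, assuming a hypothetical `dependencies` mapping from each file to the files it references:

```python
def files_to_pull(requested, dependencies):
    """Return the requested files plus everything they depend on, transitively."""
    wanted, stack = set(), list(requested)
    while stack:
        path = stack.pop()
        if path in wanted:
            continue  # already handled; this also makes dependency cycles safe
        wanted.add(path)
        stack.extend(dependencies.get(path, ()))
    return wanted
```

With `{"scene.blend": ["wood.png"]}` as the mapping, asking for `scene.blend` also pulls `wood.png`, while an artist who asks only for `wood.png` gets just that file.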
To be honest, this is a catch-all category for all other requirements I think of.
Be able to ignore files, like .gitignore.
Supported environment variables:
All of Yar’s dependencies should be vendored.
This means that I must stay on top of the latest, always.
Current list of planned dependencies:
For access to the Internet, to clone, pull, etc.
For SSL in Curl.
For cryptographic primitives to accomplish:
Signing of commits.
Authentication with public key.
Authentication with password (OPAQUE).
Currently does not provide:
Argon2i (for KSF in OPAQUE).
Could include the reference C impl of Argon2i, but it doesn’t look maintained.
If I still write my own, I might use the reference for testing, though.
Blake2 (for Argon2i).
I might also use this as the cryptographic hash since it’s fast and is more immune to length extension attacks than the SHA-2 family.
Look at Perforce change sets/lists since some centralization will be possible?
This could be to replace Git’s index.
Also look at Mercurial Queues for the same reason.
Based on the requirements, this is how Yar must be classified as a VCS (according to the table here):
Conflict Resolution: Modify-Commit-Merge
History Model: Patches
Unit of Change: Tree
I will go into detail about these below.
Yar must use the “modify-commit-merge” model.
This is because Yar will only work on actual committed changes. Changes that have not been committed cannot be counted on.
This includes using it as an undo/redo library because it should be told to make anonymous commits.
The reason for this is that while Yar will probably be able to tell if a file is incomplete or corrupted (from partially saving changes, for example), it will not be able to tell which changes are “atomic” with respect to operations of the software changing the file. For example, an operation on a file might change two disparate parts of it, and Yar should consider both of those as part of one change.
Yar must use the “patches” or “patch theory” history model.
This is for two reasons:
It is easier to pick apart unrelated changes in patch theory because the changes are expressed as separate patches anyway.
When tracking large binary assets, there will be many cases where small changes are made, and storing snapshots would not scale well in that case.
Unit of Change¶
Yar must make the unit of change be full trees.
This is for two reasons:
The entire repository can be made a tree, allowing tracking of repository-wide changes without any special code.
Sometimes, data will depend on other data in other files. In that case, treating groups of files as a single tree can simplify the merging algorithm when either piece of data changes.
However, it also has the concept of semantic elements, like Monticello. This is done with the recursive version control design.
TODO: Design all algorithms and calculate their Big O complexity.
Instead of needing to flatten a file from graggles, maybe I should store the graggles for each commit? I can reuse them where they are the same, both nodes and edges.
Add some way for a user to specify a command to run on a repo that will check for problems with a merge?
Software to Learn From¶
Zero One Zero Editor https://www.sweetscape.com/010editor/.
See the Binary Templates section.
See the Analysis Tools section.
The graph diff should be a model for diffing file types that don’t have custom diffing code (which should still be possible).
Also look at their graph diff, like Bindiff’s.
Look for ideas from their language: https://doc.kaitai.io/user_guide.html#_kaitai_struct_language
The algorithm must:
Be semantics/binary aware.
Have a concept of an item moving.
Have the concept of separating small changes, such as separating the move of an item from a change to that item.
After applying the merge algorithm in my head, it does both badmerges correctly (for now). I need to solidify the algorithm.
This also means that plugins must be able to identify when a patch is required by semantics, such as the patch using a function requiring the patch that adds the function.
Using semantics-aware plugins has an additional benefit: I could give them access to all files in the repo to search for necessary items. This could mean that Blender files could declare semantic needs on image texture files that are used for materials in the Blender file.
Types of patches:
Move items from one parent to another.
Move items to somewhere else in the parent.
Switch a set of items with another set. This is for when the two sets are next to each other.
This is meant to reduce spurious conflicts and to not have ambiguity of which items moved and which items did not move.
If items moved in the parent, then any items between where they were and where they are moved to are also affected by the change, in that their positions move.
Must track patches through cherry picks and use that info during merges to skip merging those patches.
Also allow defining operations in plugins.
Must always be invertible.
This should allow compressing some changes by storing the operation and its parameters.
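A sketch of two of the patch types above as invertible operations on an ordered list of items. `MoveWithin` and `SwapAdjacent` are hypothetical names; each stores only its parameters, which is also what would let such changes be stored compactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoveWithin:
    """Move one item to another position inside the same parent."""
    src: int  # index of the item before the move
    dst: int  # index of the item after the move

    def apply(self, items):
        items = list(items)
        items.insert(self.dst, items.pop(self.src))
        return items

    def inverse(self):
        return MoveWithin(self.dst, self.src)


@dataclass(frozen=True)
class SwapAdjacent:
    """Switch items[start:mid] with the neighboring items[mid:end]."""
    start: int
    mid: int
    end: int

    def apply(self, items):
        items = list(items)
        items[self.start:self.end] = (
            items[self.mid:self.end] + items[self.start:self.mid]
        )
        return items

    def inverse(self):
        # After the swap, the second set occupies the first (end - mid) slots.
        return SwapAdjacent(self.start, self.start + (self.end - self.mid), self.end)
```

Applying a patch and then its inverse returns the original list, which is the invertibility property required above.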
Things to Figure Out¶
Deleting a file and creating a new file with the same name.
Deleting a file and moving an existing file to have the same name.
Figuring out semantic ordering.
Merging multiple patches.
Merging multiple patches when some patches have already been merged from the branch being merged into.
Merging multiple patches when some patches have already been merged from the branch being merged from.
Ordering operations on data.
Binary data (recursive data in structs and unions, as well as arrays).
Keep the order of items.
Also be able to associate a key with an item.
For example, the key of a C function would be its name.
However, this should not be required.
There should be no staging area.
Instead, there should be two different kinds of commits: saves and commits.
Saves should be automatically in a separate branch, and they should be “anonymous” in that they are not labelled.
Commits are full commits.
Saves should happen fast and create separate branches as necessary.
In fact, this is how to implement undo/redo trees.
Commits should be a matter of smashing a series of saves together onto the branch in use.
Patches that reverse each other (such as debugging code added and then deleted) should be removed from the smash.
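A sketch of smashing saves into one commit while dropping patches that reverse each other. It assumes a deliberately simple model in which a patch is an `(op, item)` pair with `op` being `"add"` or `"delete"`, and patches on different items commute; the function `smash` is hypothetical.

```python
INVERSE = {"add": "delete", "delete": "add"}


def smash(saves):
    """Flatten a series of saves into one patch list, cancelling reversed pairs."""
    result = []
    for save in saves:
        for op, item in save:
            # Look for the most recent patch that touched the same item.
            for i in range(len(result) - 1, -1, -1):
                if result[i][1] == item:
                    if result[i][0] == INVERSE[op]:
                        del result[i]  # the pair cancels out
                    else:
                        result.append((op, item))
                    break
            else:
                result.append((op, item))
    return result
```

Debugging code that is added in one save and deleted in a later one therefore never reaches the commit, while the saves themselves can remain for undo/redo.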
This design will create this workflow:
Do a bunch of speculative work, saving as necessary.
When integrated with other software, saves should happen on every operation to implement undo/redo trees.
This also means that going back from a save should be possible and easy.
When done, clean up the assets.
This removes the need for:
Git’s staging area.
This also means that files should be automatically added in saves.
However, if they are only in a save and then ignored, they should also be automatically removed.
There should be a way to “bookmark” specific anonymous commits.
This is so the user can easily return to them after working on other commits.
There should be a way to move a tree of anonymous commits to use a new commit as a parent instead.
This is so that, after committing one set of changes, another can easily be moved to use the new commit as a parent and then commit.
Basically, this system should use a bunch of very low-level operations, but be (usually) manipulated through very high-level commands.
This will allow me to implement something close to what all VCS’s do.
It will also allow others to add their own workflow on top of it.
It will also make it easy to implement an undo/redo system.
Simulate Git’s staging area with this system.
The full history of branches must be kept, like Mercurial.
However, there should also be a marker for where the branch is.
The parent commit where branches split off must also be tracked.
Patches from other branches should also be tracked.
Branches should have hidden numbers so that branch names can be reused.
Only one branch with that name, per user, is considered alive at one time.
Use a bookmark for that?
Since branches can be associated per user, they should also be associated with servers, allowing the centralization for certain branches.
Require users to have both a key and a password.
Allow authentication with both, with the following exceptions:
Commit signing should be key-only.
Password rotation should be key-only.
Key rotation should be password-only.
Use PAKE to keep users’ passwords safe. See “Let’s Talk About PAKE”.
All of these command names are subject to change.
y missing: List either missing commits or patches in one branch relative to another. This is like
y send: Send a Merge/Pull request. This is like
y parents: Look up parent commits of a commit. This is like
Should this have different behavior if the working directory is dirty?
y log: Output the log.
Be able to output nested logs, like Bazaar.
This is especially important with anonymous commits.
y bind: Start auto-sync of the current branch. This is like bzr bind except that it should never error if sync fails.
y unbind: End auto-sync of the current branch.
Ideas to prevent spam:
Must send all keys to server when creating an identity.
Each key should be signed by the previous.
First key signed by itself.
If someone has a lot of keys, they are probably spammers.
If someone doesn’t send their whole key list, they are spammers.
Trying to get around the system will require rebuilding Yar, which will drive up the cost of spamming.
Have a spam “folder” for new issues, PR’s, etc.
Admins should be able to specify how many keys a user can have before being automatically labelled spam.
Admins should be able to label certain keys or email addresses as spam.
If such keys or email addresses pop up in new users, those users should immediately be labelled spam.
Make sure to not have the problems of Git CVE-2022-24765 and CVE-2022-24767.
Do the same thing: if the top-level directory (the one with the file pointing to the bare repo folder) has a different owner than the owner of the repo, error.
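A sketch of the ownership check on POSIX systems, where `os.stat` reports the owning user in `st_uid`. The function name `check_repo_owner` and its two path arguments are assumptions; Windows would need a different ownership query.

```python
import os


def check_repo_owner(top_level_dir, repo_dir):
    """Refuse to continue when the top-level directory and the repo differ in owner."""
    top_owner = os.stat(top_level_dir).st_uid
    repo_owner = os.stat(repo_dir).st_uid
    if top_owner != repo_owner:
        raise PermissionError(
            f"{top_level_dir!r} is owned by uid {top_owner}, "
            f"but {repo_dir!r} is owned by uid {repo_owner}"
        )
```

Raising instead of warning keeps the check fail-secure: a mismatch stops the command before any repo configuration is read.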
To help prevent collisions as much as possible with commit hashes, do the following:
First, have a limit on the number of commits in a repo. This should probably be
n is the number of bits in the hash algorithm that is used. Then check for the limit and refuse if there are that many commits. Otherwise…
Append random data to the Merkle tree. This data should be stored in the commit data in the database.
If the resulting hash equals another hash, go back to 1 and try again.
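The steps above can be sketched as a retry loop. The commit limit is passed in as `max_commits` because its exact value is not given here; SHA-256, the 16-byte nonce, and the names `commit_hash` and `rng` are likewise assumptions.

```python
import hashlib
import os


def commit_hash(merkle_root, existing, max_commits, nonce_size=16, rng=os.urandom):
    """Return (hash, nonce) for a new commit, retrying on any collision."""
    while True:
        # Step 1: refuse once the repo holds the maximum number of commits.
        if len(existing) >= max_commits:
            raise RuntimeError("commit limit reached for this hash size")
        # Step 2: append random data; the nonce must be stored with the commit.
        nonce = rng(nonce_size)
        digest = hashlib.sha256(merkle_root + nonce).digest()
        # Step 3: on a collision, go back to step 1 and try again.
        if digest not in existing:
            return digest, nonce
```

Storing the nonce in the commit data keeps the hash reproducible: anyone holding the Merkle root and the nonce can recompute and verify it.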