Also available in Japanese

Comments on Open Source Software / Free Software (OSS/FS) Software Configuration Management (SCM) / Revision-Control Systems

by David A. Wheeler
April 10, 2004; lightly revised May 18, 2005

This paper is getting increasingly obsolete, but I'm leaving it here because there are some broader principles noted here. Enjoy. More recent articles include Elijah's 2008-03-01 "Happenings in the VCS World", DVCS adoption is soaring among open source projects, and Making Sense of Revision-control Systems. As of 2011, distributed SCM systems have become much more common. When I wrote this paper, git was powerful but had two big problems that have since been addressed: git had a poor user interface (git has since greatly improved the user-level "porcelain" commands), and git didn't work well on Windows (since that time much of git has been rewritten in C, and "git for Windows" using msys is now available, so Windows use is workable though Unix-like systems are naturally better for development). Other major distributed SCM systems are mercurial (Hg), bazaar (bzr), and Monotone (among others), which have their supporters and major-project users. The subversion (svn) program is widely used by those who need a simple centralized SCM.

With the release of Subversion 1.0, lots of people are discussing the pros and cons of various software configuration management (SCM) / version control systems available as open source software / Free Software (OSS/FS). Indeed, the problem is now an embarrassment of reasonable choices: there are several OSS/FS SCM systems available today. Here's some information about SCM systems that I've learned that you may find helpful; I'll discuss four options (CVS, Subversion, GNU arch, and Monotone), the differences between centralized and decentralized SCM, using GNU arch to support centralized development, and a few links to other reviews. I think future SCM systems will need to counter more threats than today's SCM systems are designed to handle; feel free to also look at my paper on SCM security.

CVS, Subversion, GNU Arch, and Monotone

In my opinion three OSS/FS SCM systems got the most discussion in April 2004: CVS, Subversion, and GNU Arch. Two other SCM systems that are getting more than a little attention are Monotone and Bazaar-NG, so I have a few comments about them. As of April 2005, git/Cogito have entered the arena with a bang, since this pair of tools is being developed specifically for Linux kernel development (this is a large number of smart, motivated developers who have the most experience of anyone with distributed SCMs). There are certainly other SCM tools (such as Aegis and CodeVille), and I don't mean to intentionally exclude them, but I just haven't had the time to examine the others in as much depth. Besides, knowing about these four will help you understand the rest. So, here's a brief discussion about each.

CVS

CVS is extremely popular, and it does the job. In fact, when it was released, CVS was a major innovation in software configuration management. However, CVS is now showing its age through a number of awkward limitations: changes are tracked per-file instead of per-change, commits aren't atomic, renaming files and directories is awkward, and its branching limitations mean that you'd better faithfully tag things or there'll be trouble later. Some of the maintainers of the original CVS have declared that the CVS code has become too crusty to maintain effectively. These problems led the main CVS developers to start over and create Subversion.
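The tagging discipline CVS demands can be sketched as a short session (hypothetical; the module name "myproj" and the tag names are invented for illustration):

```shell
# Because CVS tracks files rather than changesets, you must tag every
# release point yourself, or you cannot reliably reconstruct it later:
cvs tag release-1_0                           # tag the current working copy
cvs rtag -b -r release-1_0 fixes-1_0 myproj   # create a branch from that tag
cvs update -r fixes-1_0                       # switch the working copy to the branch
# Forget the initial tag, and there is no easy way to recover the exact
# set of per-file revisions that made up release 1.0.
```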

Subversion

Subversion (SVN) is a new system, intended to be a simple replacement for CVS. I looked at Subversion 1.0, released February 24, 2004. Subversion is basically a re-implementation of CVS with its warts fixed, and it still works the same basic way (supporting a centralized repository). Like CVS, subversion by itself is intended to support a centralized repository for developers and doesn't handle decentralized development well; the svk project extends subversion to support decentralized development.

From a technology point of view you can definitely argue with some of subversion's decisions. For example, it doesn't handle changesets as directly as you'd expect given their centrality to the problem. But technical advancement is not the same as utility; for many people who currently use CVS and just want an incremental improvement, subversion is probably more or less what they were expecting and looking for. There are weaknesses, though. For example, Subversion doesn't keep track of which patches have already been applied to a given branch, and trying to reapply a patch more than once causes problems. Thus, subversion has trouble with history-sensitive merging of branches that share parts (GNU arch doesn't have this problem, because it does track which merges have been applied).
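The merge-tracking gap shows up in how merges were expressed in Subversion 1.0: you name the revision ranges yourself, and nothing records which ranges a branch has already received (a hedged sketch; the URL and revision numbers are invented):

```shell
# From a checkout of the branch, pull in trunk changes r10 through r20:
svn merge -r 10:20 http://svn.example.com/repos/myproj/trunk
svn commit -m "Merged r10:20 from trunk"
# Subversion 1.0 does not remember that r10:20 were merged; run the same
# command again later and the patches are re-applied, causing spurious
# conflicts.  Users kept track of merged ranges by hand (often in the
# commit message, as above).
```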

In 2004 some were concerned about Subversion's use of Berkeley DB to store data (rather than safer flat files), since in a few cases this can let things get "stuck". In practice this doesn't seem to be so bad (in part because the data can be extracted), but certainly some are concerned. Newer versions include a database backend called fsfs, which uses flat files. The fsfs backend was created because subversion had had some problems with the DB backend in debian-installer (a fairly large repository); fsfs works without any problems in that case.
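Choosing the backend is a repository-creation-time decision (a sketch; the paths are invented, and the --fs-type option appeared in Subversion 1.1):

```shell
# Create a repository using the flat-file FSFS backend:
svnadmin create --fs-type fsfs /var/svn/myproj
# Berkeley DB remained the default in that era:
svnadmin create --fs-type bdb  /var/svn/other
```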

Subversion uses a BSD-old-like license that, while OSS/FS, is GPL-incompatible, and that's unfortunate (GPL incompatibility can be a problem). Subversion can be used to maintain GPL software or any other kind, without restrictions.

Subversion depends on a large number of libraries and programs (and can be perceived as rather "heavyweight"), so it can take some effort to install currently; distributions will probably be quick to include it, so that problem should go away relatively soon. This book on Subversion gives more information about it.

By the way, there's a general problem with Subversion that is shared by many other SCM tools: Subversion tracks file contents, but it doesn't track the modification date/timestamp of individual files (i.e., it fails to record important metainformation). Retrieved files can be given the date/timestamp of the retrieval, or optionally of the changeset, but the latter is not the default. This can produce extra build work, or inaccurate builds. See the email "Should I really have to install Python before I can build it?" of December 13, 2005, for a more detailed explanation. SCM tools that record modification times, as well as the file names and contents, don't have this problem, though they can have a different one: if a user's clock is severely off, it can cause serious build problems. This can be partly but not completely alleviated by performing extra checks when the files are transferred, but some designs make this hard. Of course, this presumes that all times use a common standard (e.g., UTC); if clock times are recorded in local time you have even more trouble.
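Subversion does offer a partial mitigation: a client-side option that stamps checked-out files with their commit time instead of the checkout time. A sketch of the relevant fragment of the client's ~/.subversion/config:

```
[miscellany]
# Give working files the datestamp of their last commit rather than
# the time of checkout/update:
use-commit-times = yes
```

Note this is per-client configuration, not a repository setting, so every developer (and every build machine) must opt in.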

If you're using CVS and want a simple upgrade path to something better, Subversion appears to be the simplest approach. It works in a very similar way to CVS (in particular through a centralized repository), allowing any of the authorized developers to immediately modify a shared repository (with a record that it was done so and rollback capability). Subversion is what it intends to be: an improved CVS.

GNU Arch

GNU arch is a very interesting competitor, and works in a completely different way from CVS and Subversion. GNU Arch is released under the GNU GPL. I looked at GNU Arch version 1.2, released February 26, 2004. GNU arch is fully decentralized, which makes it work very well for decentralized development (like the Linux kernel's development process). It has a very clever and remarkably simple approach to handling data, so it works very easily with many other tools. The "smarts" are in the client tools, not the server, so a simple secure ftp site or shared directory can serve as the repository, an intriguing capability for such a powerful SCM system. It has simple dependencies, so it's easy to set up too.

Decentralized development has its strengths, particularly in allowing different people to try different approaches (e.g., independent branches and forks) independently and then bringing them together later. This ability to scale and support "survival of the fittest" is what makes decentralized development so important for Linux kernel maintenance. Arch can also be used for centralized development, but see my discussion below about that.

There are also a number of people who have built support tools for arch. For example, tla-graph can create a graph of the patchlogs in archives.

Indeed, I really like arch, yet I'm also frustrated by it. It has so many strengths that it may be unclear why I think it has problems. So, here's a discussion of its problems, which basically show GNU arch is a tool that's already very usable but needs some maturing.

A serious weakness of arch is that it doesn't work well on Windows-based systems, and it's not clear if that will ever change. There are ports of arch, both non-native (Cygwin and Services for Unix) and native. However, the current win32 port is only in its early stages, and the Win32 page on the Arch wiki says "Arch was never intended to run on a non-POSIX system. Don't expect to have a full blown arch on your Microsoft computer." At least part of the problem is the long filenames used internally by arch; arch could certainly be modified to help, though there doesn't seem to be much movement in that direction. Other problematic areas include symbolic links, proper file permissions, and newline handling, as well as the general immaturity of the port as of March 2004. Some people don't think that poor Windows support is a problem; to me (and others!), it's a serious one. Even if you don't use any Microsoft Windows systems yourself, people don't want to use many different SCM systems, so if one tool can handle many environments and another can't, people will use the one that handles more. I think GNU Arch's adoption will be hampered as long as this is true, even among people who never use Windows; good native Windows support is very important for an SCM tool.

Arch has some awkward weaknesses involving filenames. Arch uses extremely odd filenaming conventions that cause trouble for scripts, command-line use, and many common tools. Its "+" prefixes cause problems with extremely common tools like vi, vim, and the pager more (this is especially a problem when trying to enter change log information - why choose a convention that's inconvenient for one of the world's most popular text editors?). Its "=" prefixes expose a bug in bash filename completion (this bug will eventually be fixed in bash, but buggy implementations will be around for a long time to come, because this is such a rare need and bash is the default shell for many systems). And although this is less of a problem, it stores data in an "{arch}" directory, but the "{}" characters cause problems for many shells (particularly C shells) because they have a special meaning (they're filename globbing characters like "*"). For example, in C shells you can't "cd {arch}" or "vi {arch}/whatever"; you must quote the directory name.

The problem isn't that filename conventions are a bad idea; most CM systems have them! The problem is that some of the conventions chosen by arch seem designed to interfere with commonly-used tools, and thus require many work-arounds (such as prefixing the filename with "./" or using the "--" option). That's unfortunate, since GNU Arch's underlying concepts work well with other tools; if the developers had chosen better conventions these problems would never have occurred. I suspect these poorly-chosen conventions are too ingrained to be easily changed now, but there's always hope. There are ways to override the defaults in some cases, but not in many, and tools should choose good defaults. It's too bad, because nothing in arch's fundamental design requires these particular filename conventions.
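The standard work-arounds can be demonstrated with ordinary shell commands (a minimal sketch assuming a POSIX shell and coreutils; the filenames merely mimic arch's conventions):

```shell
# Work in a scratch directory with arch-style names.
dir=$(mktemp -d) && cd "$dir"
touch '+log' '=tagging-method'
mkdir '{arch}'
# Work-around 1: prefix with "./" so the name cannot be taken for an option:
ls ./+log
# Work-around 2: "--" tells most tools that option processing is over:
ls -- '=tagging-method'
# "{arch}" must be quoted in C shells, where {} are globbing characters;
# quoting is harmless in POSIX shells too:
cd '{arch}'
```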
In February 2004 arch couldn't handle spaces in filenames, but this significant defect has been fixed; version 1.2.1 and later support spaces in filenames.

GNU arch gives you a lot of control using lower-level commands, but it doesn't (yet) automate a number of tasks that it really should. Many common operations require multiple commands, when a single command with reasonable options should be enough for most people. If you use a single archive for a long time in GNU arch, it eventually accumulates a very large amount of data and becomes inconvenient to work with. arch's developer suggests dividing archives by time and including a date in the archive name. I think handling this accumulation is a nuisance; this kind of manual work is exactly what an SCM should handle automatically (e.g., perhaps arch could by default hide branches that have been unused for more than a year). Arch has nice caching facilities (both in archives and on individual workstations) which can speed access to specific versions; however, these caches often have to be created by hand (by default the tool should automatically create caches, and remove old automatically-created ones as well). Arch works slowly if the {arch} directory is on NFS; the tool should be able to detect slow execution and automatically try an efficient alternative, instead of requiring user workarounds.

Many arch developers seem to create a similar set of higher-level specialized scripts to automate common tasks, but that misses the point: you shouldn't have to write scripts to make a tool automate common tasks. An SCM tool should include commands that, through automation and good defaults, "do the right thing" for common tasks. The good news is that the arch developers are realizing this is a problem and correcting it. The "rm" (delete) command now deletes both the id and the corresponding file automatically (instead of requiring two steps); that capability was only added on February 23, 2004, though, so clearly automating steps has only begun. The documentation notes that automatic cache management is desirable; it just hasn't been done.
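The by-hand archive management looks roughly like this (a hedged sketch of tla usage from that era; the e-mail address, archive names, locations, and revision name are all invented):

```shell
# Hypothetical names throughout.
tla my-id "Jane Hacker <jane@example.com>"
# The developer-suggested convention: embed the year in the archive name,
# then start a fresh archive by hand when the old one grows too large:
tla make-archive jane@example.com--2004 /home/jane/archives/2004
# ...a year later, repeat manually:
tla make-archive jane@example.com--2005 /home/jane/archives/2005
# Caching a full copy of a revision in the archive (to speed later
# checkouts) is likewise a manual step:
tla cacherev myproj--main--1.0--patch-42
```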
The mirroring capability is clever, but if you download a mirror and make a change, you can't commit the change, and the tool isn't smart enough to help automatically (even though it does have information on the mirror's source). The website described a complicated workaround using undo and redo, and Jan Huldec described a simpler approach (using tag, sync-tree, and set-tree-version), but the tool should be able to help commit changes even if you downloaded from a mirror.

Arch will sometimes allow dangerous or problematic operations that just shouldn't be allowed. For example, branches should be either commit-based branches (all revisions after base-0 are created by commit) or tag-based branches (all revisions are created by tag); merging commands will not work otherwise, yet the tool doesn't enforce this limitation. The tla tool doesn't check if there are still pending merge rejections (.rej reject files), so operations such as commit, update, replay, or star-merge produce a scrambled workarea; users make mistakes, and an SCM system should work to protect data.

The user interface also has some problems. In the "user nightmare" category, the "mv" and "move" commands do different things: "mv" moves both the id and the file, while "move" moves only the id. This seems designed for confusion; why not make "move" and "mv" the same, and make "mv-id" the only command that manipulates only ids? Many commands are aliases, which simply makes the documentation unnecessarily complicated.

The arch documentation is weak and needs more work; that's especially unfortunate, because documentation problems hamper the early adopters who want to start using it today. A careful reading of what's available on-line should be enough for at least basic use of arch, though. Much of the documentation emphasizes lower-level implementation details (e.g., exactly how a command is implemented in the local filesystem) instead of the higher-level constructs. Some of the documentation emphasizes aliases, which is extremely distracting; if "add" and "add-id" mean the same thing, just document "add" (and later on, in an ignorable note, list the aliases). In some cases the documentation needs to be updated to match what the software actually does. The on-line tutorial at the FSF GNU arch website is a good place to start, and the Arch Wiki is an especially good place to find more detailed reference material.

In general, GNU arch isn't currently as mature as subversion. Its implementation needs more shaking down, its weird filename limitations should be fixed, and it sometimes requires users to do optimizations "by hand" that the tool should handle automatically. As noted above, its commands are sometimes on the low-level side; it can take several simple commands to set up values that should be defaults or built-in recipes/commands. And the documentation needs work.

But don't count out GNU arch for the long term based on these problems, most of which are short-term. Many of them simply reflect the fact that GNU arch hasn't had as much time to mature as tools like subversion. I'm documenting these problems because, in fact, GNU arch has a lot going for it. In my opinion, the GNU arch developers have emphasized simplicity, openness of design, and power (the ability to handle complex situations), and have paid less attention so far to ease of use (especially for simple situations). Thus, although it has the problems noted above, GNU arch is extremely powerful and its basic concepts are very flexible. More time, and tools that build on top of GNU arch, can resolve these issues. Arch is also endorsed by the Free Software Foundation (FSF) and directly supported by their Savannah system; that's certainly no guarantee of success, but endorsements like that often bring users and developers to a project, increasing its likelihood of success. GNU arch is frankly a more interesting approach to the problem, and it has a lot of promise.

This open letter from Tom Lord (GNU Arch's developer) to Linus Torvalds explains the basic concepts behind GNU Arch, in more detail.

Unfortunately, events in 2004 and 2005 make it a little less clear how well GNU Arch will move forward. Many developers seem to like many of the ideas in GNU Arch, but not the implementation. As a result, several other projects have been started that take some of the ideas of GNU Arch but are separate projects aiming to be much more user-friendly, portable to Microsoft Windows as well as Unix-like systems, and so on. SCM projects that are conceptual descendants of GNU arch include Arx (which has poor Windows support), Bazaar (also named baz), which is essentially a friendly fork of GNU Arch to improve it (primarily its UI), and especially Bazaar-NG (also named bzr). The Bazaar folks are working to ensure a smooth transition to Bazaar-NG once that becomes ready.

Bazaar-NG

Thus Bazaar-NG (also named bzr) is a new distributed SCM system that builds on the ideas of Bazaar (which extended GNU Arch), but it's essentially a new project. Here's how the Bazaar-NG developers compare their work with GNU arch. Bazaar-NG tries to exploit some of the major innovations in arch while providing an interface that's easier to use (e.g., "doing the right thing" and easily supporting common operations), making the transition to it easier, and borrowing many ideas from elsewhere.

I like much of what I see in Bazaar-NG. The main developer is developing the user documentation and code simultaneously (an approach I heartily recommend), and emphasizing common use cases. As a result, it appears that the most common use cases will be especially easy to do -- something very important in SCM systems. I like it when people write user documentation simultaneously, because if a common operation is hard to explain, that's a good signal that the tool isn't user-friendly enough. GNU Arch is an unfortunate example -- it needs good documentation because some of its operations are more complicated or awkward than necessary (some would say Arch has "unnecessary user-hostile complexity"). The Bazaar-NG developers plan to cryptographically sign changes to counter the dangers of repository subversion (see my companion paper on software configuration management (SCM) security for more information).

It's developed in Python, which means it should easily port to any system. Some may be concerned that the resulting system will be too slow; I suspect that concern isn't well-founded, and portions could be rewritten for speed if that becomes a problem, but that remains to be seen. Other SCM systems, such as CodeVille, are written in Python, so this isn't a strange choice.

Bazaar-NG is far less mature than many other projects. So keep that in mind; as of April 2005 I wouldn't commit a large, pre-existing project to Bazaar-NG! But since Bazaar-NG has financial backing from the company Canonical, who commercially support Ubuntu, it may catch up very rapidly. Its emphasis on ease-of-use is quite heartening.

Monotone

Monotone is another decentralized SCM. It's released under the GPL; it uses the programming language Lua (e.g., for hooks), whose implementation has been released under the MIT license (historically it was released under a zlib-like license). I looked at version 0.10, released March 1, 2004. Monotone is interesting because it takes a different approach to distributed SCM. As Shlomi Fish describes it,
"changesets are posted to a depot (that can be a CGI script, an NNTP newsgroup or a mailing list), which collects changesets from various sources. Afterwards, each developer commits the desirable changesets into his own private repository.... Monotone identifies the versions of files and directories using their SHA1 checksum. Thus, it can identify when a file was copied or moved, if the signature is identical and merge the two copies. It also has a command set that tries to emulate CVS as much as possible."

Monotone basically has a three-layer structure (working copy, local database, and net server). This is different from GNU Arch, which basically has only two layers (working copy and archive), though GNU Arch has a few tools that make archives work together in special cases (e.g., for mirroring). In some cases this is more convenient than GNU Arch; GNU Arch sometimes makes you enter hand-wringingly long commands to copy data between archives (say from "my local archive" to a "master shared archive"). If in contrast you're simply posting data from a local database to a net server in Monotone, it works well. Monotone is built around SHA-1 hashes: specific file versions are identified by their hashes, and sets of files are identified through the hash of their manifest. That means SHA-1 hashes even serve as a global namespace for version ids. This has some nice technical properties, but it also means that the normal version numbers used in Monotone aren't meaningful to humans. Thankfully, you don't have to type in long SHA-1 hashes everywhere, only enough of a prefix to be unique.
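The identification scheme itself is easy to illustrate with sha1sum (a sketch; Monotone computes these hashes internally):

```shell
# The SHA-1 of a file's contents serves as its global identifier.
# The hash of empty input is well known:
printf '' | sha1sum
# -> da39a3ee5e6b4b0d3255bfef95601890afd80709  -
# Two files with identical contents get identical ids, which is how
# Monotone can recognize a copied or moved file:
dir=$(mktemp -d)
printf 'hello world\n' > "$dir/a"
cp "$dir/a" "$dir/b"
sha1sum "$dir/a" "$dir/b"   # same hash, different names
```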

In Monotone, each person manages their own local database, and never automatically trusts anything sent by the net server. That can be a little disconcerting, and it doesn't appear to support centralized development as strongly. Internally Monotone uses a simple SQL database (SQLite). It's hard to say if that's good or bad.

One very nice property of Monotone is that it has good support for recording status about approvals and disapprovals, as well as for test results (this is something GNU Arch doesn't do well). Monotone can generate ancestry graphs in xvcg graph visualization format (a separate tool for GNU Arch can create graphs too).

Monotone supports handling file metadata such as file permissions (e.g., which files can be executed) and symbolic links by creating and editing a special file (.mt-attrs). This works, but it's nowhere near as convenient as tools like GNU Arch (which handle this automatically). Monotone requires you to "add" and "drop" each file to state which files in a working copy are managed. GNU Arch has this mode, but can also be used in a mode where the filenames themselves are enough to determine this. I prefer explicit add and drop commands, so I think this is fine, but some may not like this choice. Monotone can only commit entire sets of files; GNU Arch can also commit specific named files. This is an advantage for GNU Arch: if you find a minor unrelated problem while working on something else, in GNU Arch (and BitKeeper) you can make that small fix and commit just that one file.

There's ongoing work to port Monotone to Windows (using MinGW and Cygwin), but in 2004 this work was very preliminary. This lack of a Windows port is a problem, as I noted earlier with GNU Arch. As of 2005 this appears to have gotten better, but I haven't checked in detail.

Monotone has recently fixed some of its problems in handling unusual filenames (this seems to be a common problem in SCM systems). Monotone's emphasis on security, and its clear concepts, make it another SCM worth considering. Monotone's approach to merging is based on three-way merging and SHA-1 hashes. The Monotone folks argue that the Arch approach is somewhat weaker than Monotone's approach, but note that Monotone isn't nearly as good as Arch in supporting some kinds of "cherry-picking" (see the Monotone FAQ for more information), so it's hard for me to declare either one a "winner" in terms of merge capabilities.

Monotone's command set is intentionally similar to CVS's, and that can help old CVS users somewhat. But only to a point! The underlying concepts of Monotone are so different that the "same" commands aren't really the same. Monotone's documentation needs work too, but I can say that it was easy to get the current "depot" of Monotone -- while GNU Arch didn't have clear instructions for the equivalent action.

One unfortunate thing: if you forget to commit before merging, and there's a conflict, you could be in for a lot of problems. Here's what their documentation says:

Monotone makes very little distinction between a "pre-commit" merge (an update) and a "post-commit" merge. Both sorts of merge use the exact same algorithm. The major difference concerns the recoverability of the pre-merge state: if you commit your work first, and merge after committing, the merge can fail (due to difficulty in a manual merge step) and your committed state is still safe. It is therefore recommended that you commit your work first, before merging.
Shame, shame! SCM systems should work very hard to prevent data loss or scrambling. Please, SCM authors, build in protection mechanisms, or do an automatic commit-before-merge, or something else to keep developers out of trouble. Developers are only human, and commands that can cause data loss or scrambling should require explicit requests, not happen through the use of normal (and commonly-used) commands.

In 2004 Monotone was experimenting with a "netsync" protocol for synchronizing two databases, which was clever but needed shaking out. As of April 2005, Monotone has switched to using netsync exclusively. However, Monotone can't use a simple repository (like sftp) as a centralized repository, which is a minor negative compared to GNU Arch. In 2004 Monotone had nice email support, which I thought was a plus (GNU Arch, for example, doesn't do a very good job supporting email automatically). Monotone still supports some email work (e.g., using its Packet I/O capabilities), but it's not clear that it's as good as it was. Not everyone can run a server, and it's nice to allow email as a transport (because everyone can get email).

Monotone does appear to be less popular than GNU Arch (as determined by Google link counts), for what that's worth. Since Monotone seems to be less popular than GNU Arch, and has a version number less than one (suggesting that it's "not as ready"), I'm going to concentrate more on GNU Arch as an example of a decentralized SCM for the rest of the paper. But Monotone can't be counted out for the future.

Centralized vs. Decentralized SCM

As you can tell, there seem to be two different schools of thought on how SCM systems should work. Some people believe SCM systems should primarily aid in controlling a centralized repository, and so they design their tools to support one (such as CVS and Subversion). Others believe SCM systems should primarily aid independent developers in working asynchronously, then synchronizing and pulling in changes from each other, so they develop tools to support a decentralized approach (like GNU arch, monotone, darcs, Bazaar-NG, and BitKeeper). Tools built to support one approach can be used to support the other, but it's still important to understand the difference.

Tools built to support one camp can sometimes support the other approach, to at least some extent. However, it's not clear to me that this support for the "other" approach is always as good as that of a tool made to do the same thing natively. That's particularly true when centralized systems try to support decentralized development (in theory a distributed system should be able to support centralization easily, though a particular tool may not do a good job of it). Subversion has svk, which builds a distributed SCM system on top of subversion. However, implementing svk on top of subversion is a very heavyweight way to create a distributed SCM system, far exceeding what it takes to implement a natively distributed one. GNU arch can easily support a centralized repository by having developers share read/write privileges to a directory that implements the repository, but see the discussion below about security concerns I have (due to the users' direct control over the repository). There's also the extra tool arch-pqm, which can mitigate some of my security concerns, though it's not currently integrated into GNU arch. The various projects' supporters all seem to feel that "their side" adequately supports the other approach, though. I do expect that the different projects will keep working to get better at supporting the "other" approach, so in a few years this distinction may get really fuzzy.

A collection of messages in Kernel Traffic illuminate some of the advantages of distributed SCM, and some of the challenges in implementing such systems. In particular, Larry McVoy identifies some of the challenges he faced implementing BitKeeper: rename handling in a distributed system, security semantics (since each user controls their own area), and time semantics (time moves all around). He also claims that merging branches when things are truly distributed, in a way that eliminates unnecessary manual repairs and re-repairs, is not easy.

A posting by Bastiaan Veelo at Linux Weekly News has a nice summary:

"The most important thing to be aware of though is that Arch and Subversion differ in fundamental ways. Arch works in a decentralized way, while Subversion is designed on a client/server model. Indeed with Arch you can start coding and using version control without first applying for access to the server. However, [merging] your code with the main branch has to be done by the one project maintainer....

Development with Subversion (and CVS for that matter) is centralized in the sense that there is just one repository, but it is actually more decentralized in a social sense since there are as many code integrators as there are developers with write access to the repository.

In short, one could say that Arch is centralized around a code integrator, and that Subversion (like CVS) is centralized around a repository. You decide what fits best. If you are a heavy user of CVS... chances are that Subversion actually fits your needs best.

Linus Torvalds has an interesting post about the advantages of distributed development.

The subversion developers have a very enlightened post about this titled Please Stop Bugging Linus Torvalds About Subversion. In it, they say: "We, the Subversion development team, would like to explain why we agree that Subversion would not be the right choice for the Linux kernel. Subversion was primarily designed as a replacement for CVS. It is a centralized version control system. It does not support distributed repositories, nor foreign branching, nor tracking of dependencies between changesets. Given the way Linus and the kernel team work, using patch swapping and decentralized development, Subversion would simply not be much help. While Subversion has been well-received by many open source projects, that doesn't mean it's right for every project." In short, tools are typically developed to support certain approaches, and if you want to work in a certain way you need to choose tools that help (not hurt) the process, create those tools, or change your process to better fit the tools available.

Using Arch to Support Centralized Development

As I noted above, conceptually a distributed approach should be able to fully implement the centralized approach. I do have some concerns about the recommended method for using GNU arch to support a centralized repository of multiple developers. It appears that some support tools will deal with my concerns, though using them takes much more effort.

The GNU Arch wiki site provides basic information on how to use arch in a centralized way. It's easy to use GNU arch to implement a centralized repository: a particularly simple way is to grant all developers read/write access to a shared filesystem (say secure ftp) used to create the centralized repository. The "repository" is in some sense a pseudo-user that everyone can write to. Systems hosting many project repositories that need to be protected from each other will need to define users or groups (say one per project) to provide that separation. This can be viewed as a minor problem (now the system administrator or a special group management tool needs to get involved whenever a new project or new developer joins a project) or a big plus (operating system controls are heavily tested and far more reliable than application-level access controls). Once set up, there are certainly many advantages to this scheme. For example, it's often easier to set up a shared directory than a more complex server.

However, I think there are problems when using arch this way. This approach presumes that all the clients "work perfectly;" if there are many developers, the odds increase that some developer is using an older client with a bug or subtle semantic difference that could screw up the whole repository. More importantly, it presumes that developers, and attackers who temporarily gain developer privileges, are never malicious. Since a developer has complete unfettered read/write access to a shared repository, a malicious developer (or attacker taking the developer's credentials) could stomp over a shared arch repository, changing supposedly unchanging data to make the repository quite different than expected. Unless there's something to counteract it, a malicious developer or attacker with their privileges could insert malicious code without making it clear that they inserted it, make it appear that some other developer inserted malicious code, or erase data in a way that makes it unrecoverable. Obviously, malicious developers are a bad thing, but an SCM system should always be able to identify exactly who inserted any malicious code (in a nonrepudiable way), and protect the integrity of the SCM history so that changes can be easily undone (and re-checked, once you've found a culprit). In today's unfriendly world, where you're often working with people you don't really know, protection against malicious attack is important.

The recommended GNU arch setup for a central repository has all users sharing a single account, so the operating system and arch have no way to even distinguish between the users when they log in! It's possible to set up a shared directory repository so that users authenticate individually, and then set up a shared directory (using groups), but users can then accidentally (or intentionally) set their access control bits so that later developers won't be able to read or modify the files. So, the recommended approach has a lot of drawbacks if a client misbehaves, or you don't fully trust your developers, or an attacker might gain developer privileges.
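To make this access-control pitfall concrete, here's a small sketch (a hypothetical helper of my own, not part of arch) that audits a shared-directory archive for entries whose permission bits would lock out other members of the project group:

```python
import os
import stat

# Group members must be able to both read and write shared archive entries.
REQUIRED = stat.S_IRGRP | stat.S_IWGRP

def find_unshareable(repo_root):
    """List entries under a shared-directory archive whose permission
    bits would lock out other members of the project group."""
    locked_out = []
    for dirpath, dirs, files in os.walk(repo_root):
        for name in dirs + files:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            if (mode & REQUIRED) != REQUIRED:
                locked_out.append(path)
    return locked_out
```

A periodic job running something like this would at least catch accidental lockouts, though it does nothing against deliberate tampering.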

You can make backups and compare them with the original, which would at least detect malicious changes to the repository history if they happen after the backup. Backups would also allow people to replace the malicious change with the correct version. Note, however, that arch doesn't currently include tools to do this checking automatically (I don't think you can use arch's mirroring capability, since the arch data itself is suspect). So, you'll have to know a lot about arch's internals to do this currently, until arch adds such tools. This approach would not identify exactly who made the malicious change, even when the culprit could have been required to log in as a specific developer. But possibly more importantly, a malicious developer could trivially create a malicious change and forge it as though someone else made the change. A backup could only tell you that an addition had been made, but it can't say if the data in the addition is correct. So backups definitely help, but attackers can get around them.
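Since arch doesn't include such a checking tool, you'd have to roll your own. A minimal sketch of the idea (the function names and the flat file layout are my own invention; a real checker would need to understand arch's archive format):

```python
import hashlib
import os

def hash_tree(root):
    """Map each relative path under root to the SHA-1 of its contents."""
    hashes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, "rb") as f:
                hashes[rel] = hashlib.sha1(f.read()).hexdigest()
    return hashes

def find_history_tampering(repo_dir, backup_dir):
    """Anything in the backup that changed or vanished in the live
    archive is suspect.  Brand-new files can't be vetted this way --
    a backup can only vouch for history it already holds."""
    old, new = hash_tree(backup_dir), hash_tree(repo_dir)
    return sorted(rel for rel in old if new.get(rel) != old[rel])
```

This captures both the strength and the limit discussed above: changes to old history are caught, but the comparison says nothing about whether a new addition is legitimate or who really made it.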

Another partial (but significant) counter to these problems is the new signed-archive capability added in arch 1.2. You can optionally make an archive a "signed" archive, in which the changes are cryptographically signed. I've looked into this (my thanks to Colin Walters who helped me understand details of the signature process). When enabled, arch signs MD5 hashes, which are cryptographically much weaker than SHA-1 hashes, but that's certainly a step forward from having no cryptographic signatures. Some effort is definitely required to set up signed archives (e.g., now you need public keys of all developers), though it's a good idea for security-minded systems. The signatures sign the revision number as well as the change itself (they're both encoded in the signed tarball), so an attacker can't just change the patch order and can't silently remove a patch and renumber the later patches without detection. However, it appears to me that such signatures (at least as currently implemented) cannot detect the malicious substitution of whole signed patches (such as the silent replacement of a previous security fix with a non-fix), or removal of the "latest" fix before anyone else uses it. Unlike backups, signatures can detect many problems without comparing an external source (so it'll likely be faster to detect problems), and it's built-in to the tool already, which increases the likelihood it'll be used. For many developers, backups and signing archives may be enough. However, this mechanism still doesn't expose who made certain kinds of malicious changes (such as silent removal and replacement), in the case where the developer could have been identified.
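To illustrate why signing the revision number together with the change matters, here's a toy sketch. It uses an HMAC with a shared key as a stand-in for the GnuPG public-key signatures arch actually uses, and MD5 for the patch digest as arch 1.2 does; the structure, not the choice of primitives, is the point:

```python
import hashlib
import hmac

KEY = b"developer-signing-key"   # stand-in for a developer's private key

def sign_patch(revision: int, tarball: bytes) -> dict:
    """Sign the revision number together with the patch contents, so a
    patch can't be silently renumbered or reordered later."""
    digest = hashlib.md5(tarball).hexdigest()   # arch 1.2 signs MD5 hashes
    msg = b"%d:%s" % (revision, digest.encode())
    return {"revision": revision, "digest": digest,
            "sig": hmac.new(KEY, msg, hashlib.sha1).hexdigest()}

def verify_patch(entry: dict, tarball: bytes) -> bool:
    """Recompute the digest and check the signature over (revision, digest)."""
    digest = hashlib.md5(tarball).hexdigest()
    msg = b"%d:%s" % (entry["revision"], digest.encode())
    expected = hmac.new(KEY, msg, hashlib.sha1).hexdigest()
    return hmac.compare_digest(entry["sig"], expected)
```

Tampering with a patch body or its revision number breaks verification. But note the limit the paragraph above describes: if an attacker can obtain any validly-signed entry for a given revision, substituting it wholesale still verifies, so signatures alone don't prove you're looking at the intended contents.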

Arch-pqm (patch queue manager) is an arch extension that creates a central repository out of a decentralized tool. It allows developers to send their requests (such as changes) to a central location, then arch-pqm queues up those requests and has them automatically performed. Arch-pqm first checks the GNUPG signatures of the requests to determine if the requester is an authorized developer for that repository, and rejects changes by anyone else. This is closer in approach to how centralized tools like CVS and subversion work. I've had several email conversations with arch-pqm's developer, Colin Walters, and found that arch-pqm only permits operations that protect the history of the repository. In particular, arch-pqm supports the star-merge operation to merge in new changes, caching, uncaching, making new categories / branches / versions, and tagging -- none of which erase the history in the repository.

Thus, it currently appears to me that combining signed archives, backups, and arch-pqm will probably address my concerns. Arch-pqm prevents arbitrary developers, who have rights to the repository, from arbitrarily changing the frozen repository values. Signed archives and comparisons with backups allow the detection and repair of malicious changes to the repository if the attackers work around or subvert arch-pqm. If a malicious developer's changes can always be recorded correctly as theirs and undone later (by forcing them to sign their changes), and at least detected when the infrastructure can't do otherwise, then my concerns disappear. One caveat: I haven't done a detailed security analysis, and arch-pqm wasn't originally designed specifically to provide this security. For example, perhaps creating odd filenames or trying to change settings might subvert this protection. There may be ways to exploit a buffer overflow or use some other technique to subvert these checks. Still, the basic concepts seem sound, and some security analysis at least has a chance with this setup. Unfortunately, using arch-pqm isn't yet built into arch, and the backup checking isn't built into arch either, so there's more than a little "rolling your own" effort to implement and use this approach. Also, the documentation doesn't lay out a simple step-by-step method for setting it up.

I should note that currently I don't think Arch supports signing of signatures. In other words, if B accepts A's work, and C accepts B's work (which included A's work), then I should see signatures by A of A's work, and signatures of B indicating that they accepted A's work. To be fair, few SCM systems support that. But centralized systems have an easier time providing equivalent functionality; distributed systems should record more of this kind of information, because there's no central place to get it or trust it.

Note that Colin Walters is also creating a "smart server" for arch named "archd" and a protocol to support the server. In some ways this appears to be similar in concept to arch-pqm; it would be a program that would automatically execute SCM commands from authorized users. However, archd would use a specialized protocol designed for the purpose to transfer the data, rather than using email. It appears that it will have similar protections (it will limit the commands that can be executed), and if that's true, the same comments would probably apply. But this would be for the future; it's not ready for use at this time.

In all SCMs, if you're worried about malicious developers, you have to be careful about who can define "hooks" and the permissions they have when they run. Whenever GNU arch runs a command, GNU arch runs the program ~/.arch-params/hook (if it exists) to run additional actions ("hooks"). In other words, the hooks are defined on a per-user basis, not per-project basis. That design has some advantages from a security point-of-view; since the hook is not inside the maintained development area (normally), editing files shouldn't trick the CM system into running new commands. However, that has disadvantages if there's a shared repository, because that means that the shared repository can't run commands to enforce some requirements (e.g., to require that there be no compiler warnings, run regression tests, announce a change via email, or require two-person authorization before checking in). This can also be solved by arch-pqm or a smart server, since the server can run the hooks on its own in its own environment.

Other OSS/FS SCM systems

Besides CVS, Subversion (SVN), svk, GNU arch, and Monotone, there are many other OSS/FS SCM systems, such as Aegis, CVSNT, Darcs, FastCST, OpenCM, Vesta, Superversion, Codeville, git/Cogito, and Mercurial. I've already mentioned Bazaar, Arx, and Bazaar-NG.

That's not even a complete list! I'm not trying to completely exclude these others from consideration; I just don't have enough time to analyze them too, though for several of them I gathered enough information to decide that I wasn't as interested in learning more. You should certainly investigate the various alternatives before picking an SCM system, since your desires might be different than mine. For use right now, Aegis is reported to be quite mature and would be worth a look; Codeville looks like it will be ready soon and has some interesting merging capabilities; Bazaar-NG (as I mentioned earlier) emphasizes both ease-of-use and good technology, and its corporate backing may speed its development; Darcs is really interesting for its technology.

Here's some information I gathered on some of them:

  1. Aegis. The better SCM initiative's initial information about Aegis made me decide to skip it, but perhaps that was too hurried. The better SCM initiative claimed that Aegis requires running as root, which in my mind is an unfortunate security weakness that immediately turned me off. It also reported that Aegis is very hard to install, which again made me not very interested in examining it further. On the other hand, some Aegis users have since told me that Aegis is better than that review claims, so this may have been too harsh. Aegis has been around a long time (first released in 1991), and it's been widely reported as being mature (with lots of functionality) and very reliable; obviously those are important attributes in an SCM system! Aegis can validate received transactions before accepting them, which is an excellent capability; on bigger systems you often don't want to accept changes unless they pass a battery of tests in many environments. Aegis is released under the GNU GPL, the most common OSS/FS license (an advantage over some OSS/FS SCM systems such as CVS, which use odd one-off licenses that make merging functionality from elsewhere more complicated). Aegis supports "both push and pull" models; it's not clear to me that it supports fully distributed development, but it appears to be more flexible than the strictly centralized models supported by, say, CVS. Aegis' direct support of Windows is very poor, unfortunately; they say that "Most sites using Aegis and Windows together do so by running Aegis on the Unix systems, but building and testing on the NT systems. The work areas and repository are accessed via Samba or NFS" (that works, but it's awkward). Aegis supports many security capabilities (see their documentation for more). I hope to take a further look at Aegis in the future; I've received some emails from happy Aegis users, and its strengths are certainly worth considering.
  2. CVSNT. CVSNT is an active fork of CVS. It began life as a port of CVS to Windows NT; it now works on both Windows and Unix-like systems. And it has since added several features beyond the original CVS, such as better handling of merges without tagging requirements, per-branch access control, support for Unicode, more efficient binary diff storage, additional server triggers, and additional protocols. But it appears that CVSNT currently has some of the same limitations as the original CVS, such as not handling renaming well. If you look at this, be sure to check out other alternatives such as Subversion.
  3. FastCST. As of June 2004, FastCST is an interesting project in its early stages; only time will tell if it becomes a major project or not. The author's goal is to create a "completely distributed, fast, and secure revision control tool" but as of release 0.4 only its non-distributed parts are functional. It uses a novel delta algorithm (to minimize the size of a change), it focuses on security at every point, and tries to balance security, collaboration, and control. License: GPL.
  4. OpenCM. OpenCM looks very interesting; it's paid special attention to security, which I appreciate. But there is very little evidence that OpenCM is being maintained or will be maintained for the future. As of April 2004, it was only at version "0.1.2alpha7pl1" (a version number that doesn't inspire confidence!). Worse, that version was released 10 months earlier (on June 20, 2003). The mailing list archives show very little activity. I made a phone call to Jonathan S. Shapiro and learned that there was a small effort to "finish" a few things in OpenCM and call it a "version 1.0" release. But frankly, that doesn't bode well for future maintenance. This is too bad, because there's actually a lot of technical promise in OpenCM. OpenCM may get more support if they produce a "1.0" release. Indeed, it may just take one person to try it out and decide to run with it; there's a lot of technical merit in it. But OpenCM is hard to recommend right now unless you're willing to take the project on.
  5. RCS and SCCS. RCS is a much older SCM system, as is SCCS which came before it. There is a GNU implementation of SCCS, named cssc, but GNU only recommends it when interoperating with old SCCS data. The lock-based approach used by RCS and SCCS just doesn't work well with today's fast development cycles and large development groups. Some SCM systems (like Bitkeeper) use one of these as an infrastructure component to build their SCM system, but at that point they're just lower-level libraries.
  6. Vesta. The better SCM initiative review reported that "Vesta is reported to be mature", and Vesta has been used in many large projects. Vesta is a centralized SCM system with a built-in build system as well, and uses the older "locking style" for editing files. Vesta only supports Unix-like systems; there's no evidence at all that it could run on Windows.

    A major difference between Vesta and other tools is that Vesta is both an SCM and a build tool (like make plus related dependency-computing tools). There are many advantages to this approach; "make" has many known weaknesses, and Vesta automates more of the build process than make does. In particular, Vesta does automatic dependency detection, so you don't have to use a combination of other tools (like makedepend along with make) to build results. However, "make" is extremely popular and common, and that is a turnoff to some potential users. In 2004 I noted that because only Vesta can be used to build Vesta, I expect that it'll be hard for it to attract new users and developers. As of April 2005 I've been told that "bowing to popular demand" they've developed a "Make-based source distribution of Vesta", which eliminates one concern that I had.

    Vesta uses the older, traditional method of handling SCM. It controls a central repository (so it's a centralized system like CVS, Subversion, and Aegis), and you must lock files while they're being edited. Even more oddly, locking is at the granularity of "packages" (not individual files), which in some ways appears even more constricting. Unlike some older systems, that doesn't mean you can't edit files simultaneously. Instead, when two developers need to change files in the same package concurrently, at least one must create a branch in the version number sequence. Locking files for editing is an old, traditional (pre-CVS) way of handling multiple edits to the same file, and if people are essentially assigned to given files this can often work out okay. Old, traditional approaches aren't necessarily bad; many large systems have been created that way, and they work fine if you're used to them. However, having to handle locks can slow down development, especially if there are a large number of people who might need to edit a particular file. Eliminating the need for locks was CVS' major achievement. Vesta's alternative solution -- creating new branches -- appears to me to be a little more cumbersome than CVS's if you have to do it a lot, especially since Vesta doesn't seem to have built-in support for merging branches later. Vesta does include several features to let groups at geographically distributed sites share development; in particular, there's a tool for replicating sources between repositories.

    Vesta is probably a reasonable choice for those who wish to use the locking style of SCM, and its build system appears to be much easier to use than make. If groups of files tend to be "owned" by particular individuals who are typically the only ones who make changes to those files, Vesta may work quite well. However, I suspect many developers (who are used to the freedom of making arbitrary changes and merging later with help from their SCM tool) may find Vesta a little constricting. For some projects, Vesta may be a great choice; for others, it won't be.

  7. Codeville. Codeville is a decentralized system. It has some very interesting technical ideas for merging changes much more effectively. In particular, it has a clever way to eliminate unnecessary merge conflicts. Codeville creates an identifier for each change, and remembers the list of all changes which have been applied to each file and the last change which modified each line in each file. When there's a conflict, it checks to see if one of the two sides has already been applied to the other one, and if so makes the other side win automatically. If that doesn't work, it backs off to a CVS-like patch strategy. It also versions "spaces between the lines", for reasons they describe. Codeville is implemented in Python, which should speed development, and it's a relatively well-known language so it shouldn't have some of the challenges of Darcs (as I'll explain below). Currently it's immature, but it's growing.
  8. Superversion (GPL). Superversion 1.2 is a single-machine, single-developer SCM system. That can be useful, for example, to allow a developer to easily back out of an approach, or to see what changed when. One nifty thing is its built-in support for graphs showing the relationship between versions. However, I'm primarily interested in SCM systems that handle many developers, so I didn't find this one so interesting. As of April 2005, they have an upcoming version 2 that will support multiple users, and thus is more interesting from my point of view. Version 2 is designed to work as a centralized server with clients, so it appears to be designed to support centralized development; peer-to-peer development might be added later. It runs on at least Unix-like systems and Windows. It depends on Java; that may mean that it requires the use of the proprietary Sun JVM, which is an issue for many (for this perspective, see Free But Shackled - The Java Trap). As OSS/FS Java implementations become more capable this concern may go away.
  9. git and Cogito. Linus Torvalds and other Linux kernel developers abandoned BitKeeper, and decided to write their own distributed SCM system. Linus created a low-level system called "git", with the intention of having higher-level SCM services be built on top of it. The most popular higher-level service built specifically to run on top of git is Petr Baudis' "Cogito" (formerly known as git-pasky). The development of Cogito and git has moved very rapidly; as of the time of this writing it's still fast-changing and not very mature. git is specifically designed to support Linux kernel development (see this email by Linus Torvalds about git's design), but it's clear it could be used by at least some others as well.

    The primary focus of git is performing distributed development with extremely fast merging (about 1 "patch" per second) for large programs (e.g., the Linux kernel). The lower-level "git" is designed to simply store a large number of different static views of each version of a tree. It does this through the concepts of a "blob" (a versioned file), "tree" (a set of all files for a given version), and "commit" (a description of what changed between two trees). Each of these is referenced using its SHA-1 hash. It's presumed that disk space is not critical; each versioned file is stored as a separate compressed file, and not as a delta. This approach simplifies many tasks at the cost of some storage space, but this is viewed as a reasonable trade-off (there's ongoing work to add "deltification" as a localized option). It is presumed that some operations (such as identifying exactly who last modified every given line in a file) are not important; these are not implemented in the current implementation, and implementing them given the current approach may be quite resource-intensive.
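    git's object naming is simple enough to sketch in a few lines of Python (this mirrors the storage format as I understand it: an object is a short header plus the raw content, named by the SHA-1 of that combination, and stored zlib-compressed on disk):

```python
import hashlib
import zlib

def git_blob_id(content: bytes) -> str:
    """A blob's name in git is the SHA-1 of a small header ("blob",
    the content length, a NUL byte) followed by the raw file contents."""
    store = b"blob %d\x00" % len(content) + content
    return hashlib.sha1(store).hexdigest()

def git_blob_ondisk(content: bytes) -> bytes:
    """What actually lands under .git/objects/: the same header-plus-
    content, zlib-compressed."""
    store = b"blob %d\x00" % len(content) + content
    return zlib.compress(store)
```

    Because an object's name is a hash of its content, identical files are automatically stored once, and any tampering with stored history changes the names of everything built on top of it.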

    Cogito does not work on Windows natively (there are reports it works on top of Cygwin), primarily because much of it is implemented using bash shell scripts. I strongly suspect git won't work on Windows natively. However, the underlying file structure should work just fine on Windows. Making it work on Windows might simply require moving the shell code to something more portable (say Python or Perl), and since there's relatively little code that might not take too long. It's also conceivable that a port of bash and many other Unix tools might work too (short of Cygwin), though I know of no one who's tried that approach.

    Currently git-based tools handle renamed files and directories very poorly. Changes do not get applied correctly when one branch renames a file that another branch edits (this is in comparison to GNU Arch, Darcs, and many other systems). Torvalds has been very adamant that the git format not directly store information about file/directory renames, because he believes it should be possible to determine such information without it. This is technically true, and is especially true if in practice people carefully commit before and after any rename without changing the contents (and never move files with identical contents between commits). But the current tools don't try to handle this case, and so the results are very poor after renames.

    The git data format stores whether or not a file is executable, and of course the filenames and their data (there's actually an entire "mode", so you could store more information if it was important to you). It does not store the date/time stamp of individual files, only the date/timestamp of a commit (of an entire tree of files). Thus, very quickly date/time stamps of individual files are lost; this may not matter to you.

    Merges are currently implemented using the traditional 3-way merge algorithm. For Linux kernel development (and many others) this is actually quite sufficient. But this is known to have problems handling certain kinds of "criss-crossing" branches, so for some it will produce a lot of unnecessary rejects (requiring hand correction) as compared to some other merging implementations. git actually stores complete copies of all past versions and how they relate, so it should be possible to implement alternative merge algorithms in the future.
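    The essence of 3-way merging can be sketched at whole-file granularity (a deliberate simplification of my own; real tools apply the same base/ours/theirs rule hunk-by-hunk within each file):

```python
def merge3(base, ours, theirs):
    """Minimal three-way merge over {path: content} snapshots.  A side
    that didn't touch a path defers to the side that did; if both sides
    changed it differently, the merge reports a conflict for a human
    (or a cleverer algorithm) to resolve."""
    merged, conflicts = {}, []
    for path in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:            # both sides agree (including both deleted)
            result = o
        elif o == b:          # only "theirs" changed it
            result = t
        elif t == b:          # only "ours" changed it
            result = o
        else:                 # both changed it differently: conflict
            conflicts.append(path)
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts
```

    The criss-cross problem mentioned above arises because this rule consults only a single common ancestor; with merges flowing in both directions there may be several candidate ancestors, and a poor choice produces spurious conflicts.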

    Lots of functionality is missing from git and Cogito, though it's enough now to be used. One area of particular concern to me is that while tags can be signed, ordinary commits (even if exchanged between people) are not cryptographically signed. You want commits cryptographically signed, with the signatures stored in the database, so that they can be checked later on. In particular, this sort of precaution helps counter many kinds of attacks if (when) attackers take over a repository.

    Other SCM prototypes have been built on git, and various interfaces have been developed to other SCMs (in particular, there's a prototype git-to-Darcs interface, and GNU Arch's Tom Lord announced he was planning to switch to the git format, though it's not clear that will really occur). Since git is low-level, it's probably best to start with Cogito rather than the low-level git.

    A web interface to git repositories has been created, so you can see examples of git results by examining the kernel.org git repository. The mailing list is helpful, but there's a vast amount of traffic on it; Zack Brown's "git traffic" has lots of info on git and Cogito.

  10. Mercurial. Mercurial (whose commands begin with "hg") is a small SCM that's an offshoot from git and Cogito. git's low-level functions store whole files (compressed). Mercurial, instead, is designed to store files as changes. This makes tasks like identifying who did what, and when a given file was changed, simpler to do. It's a small Python program, and it lacks some functions compared to the others at this time, but it's an interesting development.

Darcs, in particular, is very interesting for its technology. From what I've seen, darcs is currently more of a prototype of some very innovative ideas for SCM, and maybe a tool for smaller projects, rather than a useful tool for large projects, though it can be used. Darcs is written in Haskell, which is both a strength and a weakness. Haskell is a high-level functional programming language, which probably helped the developer concentrate on abstract concepts. However, while Haskell is intriguing, in my experience programs written in it are generally slow, and possibly worse, its performance is unpredictable (jemfinch expresses somewhat similar concerns). Some have argued to me that Haskell isn't necessarily slow today, and maybe that's true, but darcs' developer admits that darcs has poor performance (which would cause trouble as a project gets large). In March 2004 the darcs developer said performance has gotten much better, so perhaps that's no longer a serious problem. However, since few developers truly grok functional programming, darcs is less likely to get other developers to help extend it. It does get contributions -- a few minor contributions by others have been reported to me -- but they're nothing compared to the scale of work by others in Subversion or GNU Arch. In March 2004 Darcs' website stated that it does not have an "abundance of features" and its "core may still be buggy" -- not exactly the words you want to hear when you let a program control your source code! The main developer does say that the website is out of date, that the program is no longer buggy, and that it supports more than basics (though it is still missing some features).

Darcs does have some innovative approaches, though, and perhaps darcs will leap past everyone else, or at least perhaps some of its ideas may slip into other SCM systems. For example, darcs can keep track of inter-patch dependencies so that bringing in just one patch can bring in "just the others needed", a clever capability not supported by other tools like GNU Arch. It is completely patch-oriented, and requires user input to help characterize exactly what changed. For example, it understands a "token replace patch", which makes it possible to create a patch which changes every instance of the variable ``stupidly_named_var'' with ``better_var_name'', while leaving ``other_stupidly_named_var'' untouched. As the author says, "When this patch is merged with any other patch involving the ``stupidly_named_var'', that instance will also be modified to ``better_var_name''. This is in contrast to a more conventional merging method which would not only fail to change new instances of the variable, but would also involve conflicts when merging with any patch that modifies lines containing the variable. By using additional information about the programmer's intent, darcs is thus able to make the process of changing a variable name the trivial task that it really is..." The advantage is that merge conflicts can suddenly disappear, or at least be far less likely, because the system has more information to work with. The disadvantage is that this requires more interaction with the developer, who already has a complicated problem. Whether or not this approach will catch on remains to be seen; I doubt it, myself, since systems which don't have it seem to be acceptable to most developers. But I can definitely see how that additional information could make an SCM system more powerful.
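The token-replace idea is easy to approximate with a regular expression (a sketch of the concept only, not darcs' actual implementation; darcs also lets you choose which characters count as part of a token):

```python
import re

def token_replace(text, old, new, token_chars=r"[A-Za-z0-9_]"):
    """darcs-style token replace: swap old for new only where it appears
    as a whole token, i.e. not bordered by other token characters."""
    pattern = r"(?<!%s)%s(?!%s)" % (token_chars, re.escape(old), token_chars)
    return re.sub(pattern, new, text)
```

Recorded as a patch rather than as a textual diff, such a rule can also rewrite uses of the old name introduced by later patches, which is how the variable rename can merge cleanly with patches that add new uses of it.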

Other Reviews of SCM or OSS/FS SCM Systems

There are many other SCM comparisons available. The better SCM initiative was established to encourage improved OSS/FS SCM systems by discussing and comparing them. Among other things, see their comparison file. The website revctrl.org is a nice starting point for comparing alternatives. Zooko has written a short review of OSS/FS SCM tools. Shlomi Fish's OnLamp.com article compares various CM systems, as does his Evolution of a Revision Control User. The arch folks have developed a comparison of arch with Subversion and CVS (obviously, they like arch). Another pro-arch discussion is Why the Future is Distributed. A pro-Subversion discussion is available at Dispelling Subversion FUD. Slashdot had a discussion when Subversion 1.0 was announced. Kernel Traffic posted a summary of a technical discussion about BitKeeper. Brad Appleton has collected lots of interesting SCM links. jemfinch has some interesting essays about SCMs (he uses the term VCS), including why he thinks the approach to branches used by Darcs, Arch, and Bazaar-ng is a poor one. A brief overview of SCM systems that can run on Linux is available. Whose Distributed VCS Is The Most Distributed? discusses distributed VCSs. Version Control System Shootout Redux (Mozilla - Mortal Kombat) describes Mozilla's decision process, which is really amusing because it uses Mortal Kombat images to describe the "shootout".

I've not discussed highly related issues like bug tracking (such as Bugzilla); that's outside the scope of this paper.

BitMover's BitKeeper

There are many proprietary SCM systems, such as BitKeeper, Perforce, and Rational ClearCase, but since they aren't OSS/FS they're really outside the scope of this paper. However, I can't omit BitKeeper entirely, because the Linux kernel developers' use of BitKeeper demonstrated how distributed SCM can work, and BitKeeper's association with such a well-known OSS/FS project makes it hard to ignore. Besides, the case of BitMover's BitKeeper is especially interesting, in part because it's very controversial.

BitKeeper is a proprietary SCM system that supports distributed SCM. Even though BitKeeper is proprietary, Linus Torvalds decided to use it to maintain the OSS/FS Linux kernel. The bargain was that the OSS/FS kernel developers got to use (at no charge) a good SCM tool, while the proprietary vendor got a great deal of free publicity and many helpful insights from highly intelligent users. The no-cost BitKeeper required that the source code being maintained be copied to the vendor; since few commercial developers wanted to do that, they were generally willing to buy the commercial license without that condition. The no-cost BitKeeper also forbade users from working on competing projects; indeed, there are reports that even purchasers of the for-pay product were forbidden to work on competing projects.

Some, such as Torvalds, found these conditions acceptable. Others did not believe using a proprietary SCM system was acceptable for working on an OSS/FS system (e.g., Richard Stallman believed this was fundamentally unacceptable). Others were concerned about the risks of depending on a single vendor with a proprietary format (what if the vendor changed their policies later?), or did not find the "cannot develop competing products" condition acceptable (this condition is very unusual and is clearly an attempt to prevent competition). BitMover released a no-cost source-available client for BitKeeper that allows people to extract current versions of data (programs) from BitKeeper repositories; it's not clear that this client is OSS/FS, and it has limited functionality, but it may be sufficient for some purposes.

In April 2005 things came to a head. Torvalds' employer (OSDL) also paid someone else who, in his own free time (not paid for by OSDL), was working on a competing product. BitMover's Larry McVoy complained that even this was unacceptable. After examining the difficulty of trying to keep the competing interests compatible, Torvalds decided he would have to switch to a different SCM program. The article No More Free BitKeeper gives the vendor's (BitMover's) side of the story. There's reason to hope that this decision will greatly increase the speed of development of OSS/FS distributed SCM tools; the licensing constraints of BitKeeper made it very difficult for some excellent developers to work on competing OSS/FS SCM systems, and with that constraint gone it's likely that development of some of them will accelerate.

Conclusions

The world of OSS/FS SCM systems is a better place than it was a few years ago; there are now several viable options. CVS, while it has its weaknesses, is still a workhorse able to do the basic job. Subversion is ready today for those who just want a better CVS, and it's probably the most common choice for those who want a centralized OSS/FS SCM system that's a little better than the aging CVS. There are other reasonable choices, too; Aegis seems to have a lot going for it, and I've had several reports that it's mature, so for large projects it would be a system worth examining.

But there are lots of other options, and it's going to be interesting to watch what happens in the future. A lot of people want a distributed SCM system; the Linux kernel developers have shown, through their use of BitKeeper, that distributed SCM can be extremely effective. The field of distributed SCM systems is currently crowded, with many people having developed early-stage systems that take significantly different approaches to the problem. GNU Arch is extremely capable if you're willing to work with the issues listed above (and I think it will get better), though it hasn't made as much progress in 2004 and 2005 as it should have, and thus it may lose its early momentum to other OSS/FS competitors. Monotone, Codeville, and Bazaar-NG in particular look to me like potentially strong contenders at the moment. I really like a lot of things about Bazaar-NG, though it's less mature, and it remains to be seen whether its promising start will result in a winning product.

In the end, the best approach is to look at your options, winnow down to a short list, and then try each of those top contenders. I hope you've found this brief tour helpful.

Feel free to also look at my paper on SCM security, or see my home page at http://www.dwheeler.com.