Counting Source Lines of Code (SLOC)

My latest size-estimation paper is ``More than a Gigabuck: Estimating GNU/Linux's Size'' (June 2001), which presents my latest GNU/Linux size estimates, approach, and analysis. Here are a few interesting facts quoted from the paper (which measures Red Hat Linux 7.1):

  1. It would cost over $1 billion (a Gigabuck) to develop this Linux distribution by conventional proprietary means in the U.S. (in year 2000 U.S. dollars).
  2. It includes over 30 million physical source lines of code (SLOC).
  3. It would have required about 8,000 person-years of development time, as determined using the widely-used basic COCOMO model.
  4. Red Hat Linux 7.1 represents over a 60% increase in size, effort, and traditional development costs over Red Hat Linux 6.2 (which was released about one year earlier).
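To give a feel for how a basic COCOMO estimate turns a SLOC count into effort and cost, here is a minimal Python sketch. It uses the standard organic-mode coefficients of basic COCOMO (effort = 2.4 × KSLOC^1.05 person-months, schedule = 2.5 × effort^0.38 months). The function name, the salary and overhead values, and the 100,000-SLOC example are illustrative assumptions, not figures taken from this page. The sketch estimates a single component and is not meant to reproduce the paper's totals.

```python
def basic_cocomo_organic(sloc, annual_salary=56_286.0, overhead=2.4):
    """Rough basic COCOMO (organic mode) estimate for ONE component.

    annual_salary and overhead are illustrative assumptions; adjust
    them to match whatever cost model you are using.
    """
    ksloc = sloc / 1000.0
    effort_pm = 2.4 * ksloc ** 1.05            # effort in person-months
    schedule_months = 2.5 * effort_pm ** 0.38  # estimated calendar time
    person_years = effort_pm / 12.0
    cost = person_years * annual_salary * overhead
    return {
        "person_months": effort_pm,
        "person_years": person_years,
        "schedule_months": schedule_months,
        "cost": cost,
    }


# Hypothetical example: a single 100,000-SLOC component.
estimate = basic_cocomo_organic(100_000)
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

Because the exponent is greater than 1, estimated effort grows faster than size: one large component is estimated to cost more than several small ones with the same total SLOC.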

Many other interesting statistics emerge; here are a few:

You can get:

  1. ``More than a Gigabuck: Estimating GNU/Linux's Size'', my latest SLOC analysis paper which analyzes Red Hat Linux 7.1. You can also get some of the supporting information (intended for those who want to do further analysis), such as the complete summary, summary SLOC analysis of the Linux 2.4 kernel, map of build directories to RPM spec files, spec summaries, counts of files, and detailed file-by-file SLOC counts. You can also get version 1.0, version 1.01, version 1.02, version 1.03, version 1.04 or version 1.05 of the paper.

  2. ``Estimating Linux's Size,'' the previous paper which analyzes Red Hat Linux 6.2. Various background files and previous editions are also available. You can see the ChangeLog, along with older versions of the paper (the original paper (version 1.0), version 1.01, version 1.02, version 1.03, and version 1.04). You can also see some of the summary data: SLOC sorted by size, filecounts, unsorted SLOC counts, unsorted SLOC counts with long lines, and SLOC counts formatted for computer processing (tab-separated data). For license information, you can see the licenses allocated to each build directory. If you want to know what a particular package does, you can find out briefly by looking at the package (specification file) descriptions.
  3. ``Linux Kernel 2.6: It's Worth More!'' does a deeper analysis of the effort for just the Linux kernel.

When referring to this information, please refer to the URL http://www.dwheeler.com/sloc. This is not a legal requirement; of course you are always allowed to deep link to anything you want to! This is just a friendly recommendation, since some of the other URLs may change, and I may add more measurements later.

If you want to get the tools I used, they're available. I call the set SLOCCount, and you can get SLOCCount at http://www.dwheeler.com/sloccount.

Here are some testimonials:

Others have been inspired by my paper More than a Gigabuck: Estimating GNU/Linux's Size to do more analysis, which is great:

  1. One group did an analysis of the Debian GNU/Linux distribution, using my tool sloccount. You can see their very interesting paper Counting Potatoes: The size of Debian 2.2 at http://people.debian.org/~jgb/debian-counting, or you can see an older version of it in Upgrade. They found that Debian 2.2 includes more than 55 million physical SLOC, and would have cost nearly $1.9 billion USD using over 14,000 person-years to develop using traditional proprietary techniques.
  2. In 2005 they measured Debian again, and reported results in Measuring Libre Software Using Debian 3.1 (Sarge) as A Case Study: Preliminary Results. Debian 3.1 ("Sarge") had grown to about 230 million source lines of code, with an estimated 60,000 person-years and $8 billion USD redevelopment cost. This was contained in 8,600 source packages, generating about 15,300 binary packages. Top languages were C (57%), C++ (16.8%), Shell (9%), LISP (3%), Perl (2.8%), Python (1.8%), Java (1.6%), FORTRAN (1.2%), PHP (0.93%), Pascal (0.62%), and Ada (0.61%). The largest programs (in order of size) were OpenOffice.org (1.1.3, mostly C++), the Linux kernel (2.6.8, mostly C), the web authoring system NVU (0.80, mostly C), internet suite Mozilla (1.7.7, mostly C++), compiler suite GCC (3.4.3, mostly C but significant amounts of Ada and C++), truetype font server XFS-XTT (1.4.1, mostly C), and XFree86 (4.3.0, mostly C).
  3. Another person analyzed Perl's CPAN library and determined it would have cost $677 million to develop; this CPAN analysis was a Slashdot article on July 30, 2004.
  4. The Linux Foundation re-performed the analysis in 2008 with Fedora 9, releasing "Estimating the Total Development Cost of a Linux Distribution". Here's their press release.
  5. Debian developer James Bromberger posted "Debian Wheezy: US$19 Billion. Your price... FREE!" in February 2012, where he determined that the newest Debian distribution ("Wheezy") would have cost US$19 billion to develop as proprietary software. This was picked up in the news article "Perth coder finds new Debian 'worth' $18 billion" by Liam Tung, IT News, February 14, 2012.

Comparative numbers are hard to find. Gary McGraw (of Cigital) has searched public information to find Windows SLOC size. According to his sources, Windows NT 5.0 (in 2000) was 20M SLOC, Windows 2000 (in 2001) was 35M SLOC, and Windows XP (in 2002) was 40M SLOC. (This information is from his briefing Building Secure Software: How to avoid security problems the right way.) Another source claims that Windows NT's original release (in 1992) contained 4 million lines, while NT 4.0 (released in 1996) expanded to 16.5 million lines ("Crash-Proof Computing" by Tom R. Halfhill, Byte, April 1998). "This Car Runs on Code" by Robert N. Charette (IEEE Spectrum, 2009-02-01) stated that "It takes dozens of microprocessors running 100 million lines of code to get a premium car out of the driveway, and this software is only going to get more complex". "Codebases" at Information is Beautiful creates an interesting visualization of various lines-of-code numbers. Lines of code is a Google doc spreadsheet of various sizes, with URLs to the information sources.

Palle Pedersen has done a rough-order-of-magnitude analysis of all Free-libre / open source software, starting with some extremely simplifying assumptions. "Assuming an average open source project is 35,000 lines of code and the average cost of a software developer is $30/hour (~$60,000/year), a simple COCOMO II calculator tells us that the average open source project costs $630,000 to develop. This cost translates into $18 per line of code. Extrapolating that to 1.7 billion lines of code gives us an estimated value of $30.6 billion/year... if the open source community was a country with a GDP of $30.6 billion, it would rank 77 right between Bulgaria and Lithuania... putting the open source community ahead of most countries in the world... Such an economic force should not be underestimated, and this is yet another indication that open source has become a significant part [of] the technology world." The specific number may be significantly off (no one knows), but I think the conclusion (OSS has become a significant part) is spot-on.
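The arithmetic behind that extrapolation is easy to check. The short Python sketch below takes the three input figures from the quote as given (they are Pedersen's assumptions, not measurements) and reproduces the per-line cost and the extrapolated total. The variable names are just illustrative.

```python
# Input figures as quoted above (assumptions of the original estimate).
avg_project_sloc = 35_000       # lines of code in an "average" project
avg_project_cost = 630_000      # dollars, from a COCOMO II calculator
total_sloc = 1_700_000_000      # estimated lines of code overall

# Cost per line implied by the average project.
cost_per_line = avg_project_cost / avg_project_sloc

# Linear extrapolation of that per-line cost to all lines.
total_value = cost_per_line * total_sloc

print(f"cost per line: ${cost_per_line:.2f}")
print(f"extrapolated value: ${total_value / 1e9:.1f} billion")
```

Note that this is a purely linear extrapolation: it applies one average cost per line to every line, which is one reason the result should be read as a rough order of magnitude.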

A post by agenaille on reddit claims that the web application healthcare.gov is roughly 3.7 million lines of code (including HTML, CSS, and XML that is arguably not code). I have not found a way to independently verify this.

"The Total Growth of Open Source" by Amit Deshpande and Dirk Riehle (2008) analyzed a set of over 5000 FLOSS projects, and found that they were growing at an exponential rate. Indeed, their 2008 results were that the "total amount of source code and the total number of projects double about every 14 months".
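To make the doubling claim concrete, a quantity that doubles every 14 months grows by a factor of 2^(t/14) after t months. The Python sketch below works that out. The function name is hypothetical, and the 14-month doubling time is simply taken from the quoted result.

```python
def growth_factor(months, doubling_months=14.0):
    """Multiplicative growth after `months`, given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


# Growth over one year, and over a few multiples of the doubling time.
print(f"after 12 months: x{growth_factor(12):.2f}")
for months in (14, 28, 42):
    print(f"after {months} months: x{growth_factor(months):.0f}")
```

A 14-month doubling time works out to growth of roughly 80% per year, which shows how quickly such a trend compounds if it continues.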

There are lots of related statistics. For example, the TIOBE Programming Community Index (TPCI) tracks the popularity of programming languages. Wikipedia: Size in volumes estimates the size of Wikipedia in volumes (hint: it's gigantic).

Remember, there's more to a program than how many lines of code it has, as the August 26, 2003 Dilbert strip shows.

You can also view my home page (http://www.dwheeler.com), or related pages such as my pages on "Why open source software / free software (OSS/FS)? Look at the Numbers!", my open source software / free software references, and how to write secure programs.

This site is hosted by Webframe.org.