Counting Source Lines of Code (SLOC)
My latest size-estimation paper is
``More than a Gigabuck: Estimating GNU/Linux's Size'' (June 2001),
which presents my latest GNU/Linux
size estimates, approach, and analysis.
Here are a few interesting facts quoted from the paper
(which measures Red Hat Linux 7.1):
- It would cost over $1 billion (a Gigabuck)
to develop this Linux distribution by conventional proprietary means
in the U.S. (in year 2000 U.S. dollars).
- It includes over 30 million physical source lines of code (SLOC).
- It would have required about 8,000 person-years of
development time, as determined using the widely-used basic COCOMO model.
- Red Hat Linux 7.1 represents over a 60% increase in
size, effort, and traditional development costs over Red Hat Linux 6.2
(which was released about one year earlier).
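As a rough illustration of how an effort figure like the one above can be derived, here is a minimal sketch of the basic COCOMO calculation in Python. The coefficients 2.4 and 1.05 are the standard published values for basic COCOMO's "organic" mode; the salary and overhead defaults, and the split into 30 equal components, are assumptions chosen only for illustration and are not taken from this page.

```python
# Basic COCOMO, organic mode (Boehm, 1981):
#   effort in person-months = 2.4 * (KSLOC ** 1.05)
ORGANIC_A = 2.4
ORGANIC_B = 1.05


def effort_person_months(sloc):
    """Estimated effort in person-months for a component of `sloc` physical SLOC."""
    return ORGANIC_A * (sloc / 1000.0) ** ORGANIC_B


def effort_person_years(sloc):
    """Same estimate, expressed in person-years."""
    return effort_person_months(sloc) / 12.0


def cost_usd(person_years, salary=56_286.0, overhead=2.4):
    # ASSUMPTION: illustrative annual salary and overhead multiplier.
    return person_years * salary * overhead


# Treating 30 million SLOC as one monolithic program:
whole = effort_person_years(30_000_000)

# HYPOTHETICAL split into 30 equal components of 1 million SLOC each,
# each estimated separately and then summed.
parts = sum(effort_person_years(1_000_000) for _ in range(30))

print(f"single program:    {whole:,.0f} person-years, ${cost_usd(whole):,.0f}")
print(f"sum of components: {parts:,.0f} person-years, ${cost_usd(parts):,.0f}")
```

Because the exponent is greater than 1, estimating components separately and summing them yields a lower total than treating all the code as one program, so the result depends on how the code is partitioned.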
Many other interesting statistics emerge; here are a few:
- The largest components (in order) were the
Linux kernel (including device drivers), Mozilla
(Netscape's open source web system including a web browser,
email client, and HTML editor),
the X window system (the infrastructure for the graphical user interface),
gcc (a compilation system),
gdb (for debugging),
basic binary tools,
emacs (a text editor and far more),
LAPACK (a large Fortran library for numerical linear algebra),
the Gimp (a bitmapped graphics editor), and
MySQL (a relational database system).
Note that some projects (in particular KDE and GNOME) are in aggregate
large enough to be one of the largest components, but because they are
developed and distributed as a large number of smaller components,
their totals don't appear in the list of largest components.
- The languages used, sorted by the most lines of code, were
C (71% - was 81%), C++ (15% - was 8%),
shell (including ksh),
Lisp, assembly, Perl, Fortran, Python, tcl, Java,
yacc/bison, expect, lex/flex, awk, Objective-C, Ada, C shell,
Pascal, and sed.
- The predominant software license is the GNU GPL.
Slightly over half of the software is simply licensed using the GPL,
and the software packages using the copylefting licenses (the GPL and LGPL),
at least in part or as an alternative, accounted for 63% of the code.
In all ways, the copylefting licenses (GPL and LGPL) are the dominant licenses
in this Linux distribution.
In contrast, only 0.2% of the software is public domain.
You can get:
- ``More than a Gigabuck: Estimating GNU/Linux's Size'',
my latest SLOC analysis paper which analyzes Red Hat Linux 7.1.
You can also get some of the supporting information
(intended for those who want to do further analysis), such as the
complete summary,
summary SLOC analysis of
the Linux 2.4 kernel,
map of build directories
to RPM spec files,
spec summaries,
counts of files, and
detailed file-by-file SLOC counts.
You can also get version 1.0,
version 1.01,
version 1.02,
version 1.03,
version 1.04
or
version 1.05
of the paper.
- ``Estimating Linux's Size,''
the previous paper which analyzes Red Hat Linux 6.2.
Various background files and previous editions are also available.
You can see the
ChangeLog, along with older
versions of the paper
(original paper (version 1.0),
version 1.01,
version 1.02,
version 1.03, and
version 1.04).
You can also see some of the summary data:
SLOC sorted by size,
filecounts,
unsorted SLOC counts,
unsorted SLOC counts with long lines,
and
SLOC counts formatted for computer processing (tab-separated data).
For license information, you can see
the licenses allocated
to each build directory.
If you want to know what a particular package does, you can find out
briefly by looking at the
package (specification
file) descriptions.
- Linux Kernel 2.6: It's Worth More! does a deeper analysis of effort of just the Linux kernel.
When referring to this information, please refer to the URL
http://www.dwheeler.com/sloc.
Some of the other URLs may change, and I may add more measurements later.
If you want to get the tools I used, they're available.
I call the set SLOCCount, and you can get SLOCCount at
http://www.dwheeler.com/sloccount.
Here are some testimonials:
- "This is a remarkable piece of work. I'm impressed, and expect to
get good use out of some of the statistics." - Eric S. Raymond
- "I have just read your paper on estimating GNU/Linux size.
BEAUTIFUL PAPER. WONDERFUL. My highest praise for your efforts.
This is really great work.
I enjoyed reading it." - Wesley Strawn
Others have been inspired by my paper
More than a Gigabuck: Estimating GNU/Linux's Size to
do more analysis, which is great:
- One group did an analysis of the Debian GNU/Linux distribution, using my tool
sloccount.
You can see their very interesting paper
Counting Potatoes: The size of Debian 2.2 at
http://people.debian.org/~jgb/debian-counting,
or you can see an older version of it in
Upgrade.
They found that Debian 2.2 includes more than 55 million physical SLOC, and
would have cost nearly $1.9 billion USD using over 14,000 person-years
to develop using traditional proprietary techniques.
- In 2005 they measured Debian again, and reported results in
Measuring Libre Software Using Debian 3.1 (Sarge)
as A Case Study: Preliminary Results.
Debian 3.1 ("Sarge") had grown to about 230 million source lines of code,
with an estimated 60,000 person-years and $8 billion USD redevelopment cost.
This was contained in 8,600 source packages, generating about
15,300 binary packages.
Top languages were C (57%), C++ (16.8%), Shell (9%), LISP (3%), Perl (2.8%),
Python (1.8%), Java (1.6%), FORTRAN (1.2%), PHP (0.93%), Pascal (0.62%),
and Ada (0.61%).
The largest programs (in order of size) were
OpenOffice.org (1.1.3, mostly C++),
the Linux kernel (2.6.8, mostly C),
the web authoring system NVU (0.80, mostly C),
internet suite Mozilla (1.7.7, mostly C++),
compiler suite GCC (3.4.3, mostly C but significant amounts of Ada and C++),
truetype font server XFS-XTT (1.4.1, mostly C),
and XFree86 (4.3.0, mostly C).
- Another person
analyzed Perl's CPAN library and determined it would have
cost $677 million to develop;
this CPAN analysis was a Slashdot article on July 30, 2004.
Comparative numbers are hard to find.
Gary McGraw (of Cigital) has searched public information to find
Windows SLOC size.
According to his sources, Windows NT 5.0 (in 2000) was 20M SLOC,
Windows 2000 (in 2001) was 35M SLOC, and Windows XP (in 2002) was 40M SLOC.
(This information is from his briefing
Building Secure Software: How to avoid security problems
the right way).
Palle Pedersen
has done a rough-order-of-magnitude analysis of all
Free-libre / open source software,
starting with some extremely simplifying assumptions.
"Assuming an average open source project is 35,000 lines
of code and the average cost of a software developer is
$30/hour (~$60,000/year), a simple COCOMO II calculator tells us
that the average open source project costs $630,000 to develop.
This cost translates into $18 per line of code.
Extrapolating that to 1.7 billion lines of code gives
us an estimated value of $30.6 billion/year...
if the open source community was a country with a GDP of $30.6 billion,
it would rank 77 right between Bulgaria and Lithuania...
putting the open source community ahead of most countries in the world...
Such an economic force should not be underestimated, and this is
yet another indication that open source has become a significant part
[of] the technology world."
The specific number may be significantly off (no one knows),
but I think the conclusion (OSS has become a significant part) is spot-on.
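The arithmetic in the quote above can be retraced with a few lines of Python. This is only a sketch that reuses the figures stated in the quote; the variable names are mine, and the implied project count and person-years per project are simply derived from those figures, not stated by Pedersen.

```python
# Figures stated in the quote above.
avg_project_sloc = 35_000        # lines of code in an "average" project
avg_project_cost = 630_000       # USD, per the quoted COCOMO II estimate
developer_cost_per_year = 60_000 # USD, roughly $30/hour
total_sloc = 1_700_000_000       # 1.7 billion lines of code

# Cost per line of code: project cost divided by project size.
cost_per_line = avg_project_cost / avg_project_sloc

# Extrapolated value of all the code at that per-line cost.
total_value = cost_per_line * total_sloc

# Derived quantities (not in the quote): how many average-sized projects
# 1.7 billion lines would be, and the developer-years per project.
implied_projects = total_sloc / avg_project_sloc
person_years_per_project = avg_project_cost / developer_cost_per_year

print(f"cost per line:            ${cost_per_line:.2f}")
print(f"extrapolated total value: ${total_value / 1e9:.1f} billion")
print(f"implied project count:    {implied_projects:,.0f}")
print(f"person-years per project: {person_years_per_project:.1f}")
```

The per-line cost and the extrapolated total match the $18 and $30.6 billion in the quote, so the result stands or falls with the simplifying assumptions rather than with the arithmetic.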
Remember,
there's more to a program than how many lines of code it has, as the August 26, 2003 Dilbert strip shows.
You can also view
my home page
(http://www.dwheeler.com), or related pages such as my pages on
"Why
open source software / free software (OSS/FS)? Look at the Numbers!", my
open source
software / free software references, and
how to write
secure programs.
This site is hosted by Webframe.org.