The Heartbleed vulnerability is a serious security vulnerability formally identified as CVE-2014-0160 [Heartbleed.com] and described in CERT Vulnerability Note VU#720951. Heartbleed is a vulnerability in OpenSSL, a widely-used toolkit that implements the cryptographic protocol Secure Sockets Layer (SSL) and its successor the Transport Layer Security (TLS). When you use a web browser with an “https://” URL you are using SSL/TLS, and in many cases at least one side (the client or server) uses OpenSSL. The XKCD cartoon Heartbleed Explanation is a great explanation that shows how the vulnerability can be exploited [XKCD], pointing out that it is remarkably easy to exploit.
The impact of the Heartbleed vulnerability was unusually large. Heartbleed affected a huge number of popular websites, including Google, YouTube, Yahoo!, Pinterest, Blogspot, Instagram, Tumblr, Reddit, Netflix, Stack Overflow, Slate, GitHub, Yelp, Etsy, the U.S. Postal Service (USPS), Blogger, Dropbox, Wikipedia, and the Washington Post. UK parenting site Mumsnet (with 1.5 million registered users) had several user accounts hijacked and its CEO was impersonated. A breach at Community Health Systems (CHS), initially via Heartbleed, led to an information compromise that affected an estimated 4.5 million patients [TrustedSec] [Ragan2014]. One paper stated that “Heartbleed’s severe risks, widespread impact, and costly global cleanup qualify it as a security disaster” [Durumeric2014]. Bruce Schneier put it succinctly: “On the scale of 1 to 10, this is an 11” [Schneier2014].
Google and Codenomicon independently found and reported this vulnerability at close to the same time. Rita Mailheau reports, based on work by Ben Grubb from the Sidney Morning Herald, that Neel Mehta and his team from Google Security discovered Heartbleed on 2014-03-21 during a source code review, and that engineers at Finnish company Codenomicon (Antti Karjalainen, Riku Hietamäki, and Matti Kamunen) separately discovered Heartbleed on 2014-04-02 using a new extension (called Safeguard) in their Defensics fuzz testing tool [Mailheau]. There is strong evidence that no attacker conducted widespread scanning for vulnerable servers before the public revelation of Heartbleed on 2014-04-07, since no Heartbleed-basd attacks were found before that date in four large network data taps (though it is possible that targeted attacks occurred before then) [Fisher2014] [Durumeric2014].
A key reason for Heartbleed’s large impact was that many widely-used tools and techniques for finding such defects did not find Heartbleed. This paper discusses specific tools and techniques that could have detected or countered Heartbleed, and vulnerabilities like it, ahead-of-time. I will first briefly examine why many tools and techniques did not find it, since it’s important to understand why many common techniques didn’t work. I will also briefly cover preconditions, impact reduction, applying these approaches, exemplars, and conclusions. This paper does not describe how to write secure software in general; for that, see my book Secure Programming for Linux and Unix HOWTO [Wheeler2004] or other such works. I think the most important approach for developing secure software is to simplify the code so it is obviously correct, including avoiding common weaknesses, and then limit privileges to reduce potential damage. However, here I will focus on ways to detect vulnerabilities, since even the best developers make mistakes that lead to vulnerabilities. This paper presumes you already understand how to develop software.
If you’re in a hurry, you can jump directly to the conclusions.
My goal is to help prevent similar vulnerabilities by helping projects improve how they develop secure software. As the fictional character Mazer Rackham says in Orson Scott Card’s Ender’s Game, “there is no teacher but the enemy... only the enemy shows you where you are weak”. Let’s learn from this vulnerability how we can avoid similar vulnerabilities in the future.
There are many detailed explanations of why the code was vulnerable, e.g., [Cassidy2014]. However, for our purposes we only need to focus on the broader technical reasons that this vulnerability existed and stayed undetected for so long.
This OpenSSL vulnerability was caused by well-known general weaknesses (a weakness is basically a type of potential vulnerability). The key weakness can be classified as a buffer over-read (CWE-126) in the heap, which could happen because of improper input validation (CWE-20) of a heartbeat request message. CWE-126 is a special case of an “out-of-bounds read” (CWE-125), which itself is a special case of “improper restriction of operations within the bounds of a memory buffer” aka “improper restriction” (CWE-119). These are really well-known weaknesses; many tools specifically look for improper restriction of operations within the bounds of a memory buffer. OpenSSL is routinely examined by many tools, too.
Kupsch and Miller specifically examined the Heartbleed vulnerability and identified several reasons this vulnerability was not found sooner, even though people and tools were specifically looking for vulnerabilities like it [Kupsch2014-May]. They even noted that, “Heartbleed created a significant challenge for current software assurance tools, and we do not know of any such tools that were able to discover the Heartbleed vulnerability at the time of announcement.” Here I will emphasize a few of its points and add a few points of my own.
Please note that I am focusing on the technical aspects here. “Of Money, Responsibility, and Pride” by Steve Marquess discusses how the OpenSSL work has been funded in the past (primarily through work-for-hire contracts). They did have some funding, and were even turning away money in a number of cases (see the essay for more). But at the time of Heartbleed there was only one person working on OpenSSL full-time, in spite of the importance of OpenSSL. Since that time, the Core Infrastructure Initiative (CII) has invested money in OpenSSL, and things have gotten better. Obviously money matters, and I'm not discounting that, but many other authors have discussed funding. In this essay, I'm focusing on the technical aspects of Heartbleed (and what we can learn from those aspects).
But first, a few quick comments on terminology:
Static analysis tools work without executing the program. The most commonly-discussed type of static analysis tools for finding vulnerabilities are variously called source code weakness analyzers, source code security analyzers, static application security testing (SAST) tools, static analysis code scanners, or code weakness analysis tools. A source code weakness analyzer searches for vulnerabilities using various kinds of pattern matches (e.g., they may do taint checking to track data from untrusted sources to see if they are sent to potentially-dangerous operations). There are various reports that evaluate these tools, e.g., [Hofer2010].
However, it’s known that many widely-used static analysis tools would not have found this vulnerability ahead-of-time:
The only static analysis tool I've found so far that existed at the time, and was able to find Heartbleed ahead-of-time without a non-standard or specialized configuration, is the FLOSS tool CQual++. CQual++ is a polymorphic whole-program dataflow analysis tool for C++, inspired by Jeff Foster's Cqual tool (indeed, it uses the same backend solver). Although CQual++ focuses on C++, it is also able to analyze C programs (including OpenSSL). CQual++ is the main tool provided by Oink/Elsa (Oink is a collaboration of C++ static analysis tools; Elsa is the front-end for Oink). Daniel S. Wilkerson reported in Oink documentation that "After the Heartbleed bug came out, someone at a government lab that will not let me use their name wrote me (initially on 18 April 2014), saying: Yes, you are interpreting me correctly. CQual++ found Heartbleed while the [proprietary] tools I tried did not." The paper "Large-Scale Analysis of Format String Vulnerabilities in Debian" by Karl Chen and David Wagner suggests that this toolsuite can be effective at detecting vulnerabilities. However, the same reporter may have also made it clear why CQual++ was not used to find the problem first: "I also applied CQual++ to an important internal project and found it very effective (though a bit difficult to run and interpret) at identifying places where sanitization routines weren't being called consistently."
A fundamental issue is that most of these tools do not guarantee to find all vulnerabilities; most do not even guarantee to find vulnerabilities of any particular kind. Sadly, the terminology about this is confusing, so I will first need to clarify the terminology.
In this paper I will call a software analysis tool incomplete if the tool does not necessarily find all vulnerabilities (of a given kind) in the software being analyzed. Previous versions of this paper, and many people, use the term unsound instead to describe tools that look for vulnerabilities (aka bug-finders) that do not claim to find all vulnerabilities. For example, Bessey et al discuss Coverity’s static analysis tool and say, “like the PREfix product, we were also unsound. Our product did not verify the absence of errors but rather tried to find as many of them as possible... Circa 2000, unsoundness was controversial in the research community, though it has since become almost a de facto tool bias for commercial products and many research projects...” [Bessey2010] This term unsound can cause confusion, because people who develop or use program checkers use the term unsound with a different meaning. One blog post explains why the same term seems to have two conflicting meanings: “most program checkers prove theorems about programs. In particular, most aim to prove programs correct in some respect (e.g. type safety). A theorem prover is sound [if and only if] all the theorems it proves are true... People in the program-checking field are accustomed to this, so they habitually think soundness [means] proving the absence of bugs. But a bug-finder doesn’t aim to prove correctness. Instead, it aims to prove incorrectness: to prove the presence of bugs. It’s sound [if and only if] all the bugs it reports are real bugs - that is, if it has no false positives. False negatives (overlooking bugs) are OK, because they don’t make its claims incorrect.” [ArcaneSentiment2014]
I have adopted the NIST SAMATE SATE V Ockham Sound Analysis Criteria [NIST-Sound] in this paper to eliminate this confusion. In the NIST SAMATE terminology, tools that do not guarantee to find all vulnerabilities (of any particular kind) are termed incomplete. Here’s how NIST differentiates between soundness and completeness: “a site is a location in code where a weakness might occur. A buggy site is one that has an instance of the weakness, that is, there is some input that will cause a violation. A non-buggy site is one that does not have an instance of the weakness, in other words, is safe or not vulnerable... A finding is a definitive report about a site. In other words, that the site has a specific weakness (is buggy) or that the site does not have a specific weakness (is not buggy)... Sound means every finding is correct. [A sound] tool need not produce a finding for every site; that is completeness” [NIST-Sound].
Why are so many source code weakness analyzers incomplete? First, most programming languages are not designed to be easy to analyze, making it more difficult to analyze programs in general. Second, most software is not written to make it easy for static analyzers to analyze them. As a result, complete analysis tools may require a lot of human help to apply to existing programs. In contrast, incomplete analysis tools can be applied immediately to existing programs. They manage this by using heuristics to help them identify likely vulnerabilities and complete their analysis within useful times. However, this presents a major caveat: incomplete source code weakness analyzers often miss vulnerabilities.
Clearly Heartbleed is one of those cases where these incomplete heuristics led to a failure by many static analysis tools to find an important vulnerability. The fundamental reason they all failed to find the vulnerability is that the OpenSSL code is extremely complex; it includes multiple levels of indirection and other issues that simply exceeded these tools’ abilities to find the vulnerability. Developers should simplify the code (e.g., through refactoring) to make it easier for tools and humans to analyze the program, as I discuss further later. A partial and deeper reason is that the programming languages C, C++, and Objective-C are notoriously difficult to statically analyze; constructs like pointers (and especially function pointers) can be difficult to statically manage.
This does not mean that static analyzers are useless. Static analyzers can examine how the software will behave under a large number of possible inputs (as compared to dynamic analysis), and the tool heuristics often limit the number of false positives (reports of vulnerabilities that are not vulnerabilities). But - and this is the important point - the heuristics used by incomplete static analysis tools sometimes result in a failure to detect important vulnerabilities.
Dynamic approaches involve running the program with specific inputs and trying to find vulnerabilities.
A limitation of dynamic approaches is that it’s impossible to fully test any program in human-relevant timetables. For example, a trivial program that adds two 64-bit integers has 2128 possible inputs. Testing all inputs (assuming a 4GHz processor and 5 cycles to test each input) would require 13.5 sextillion years (1.35 x 1022 years). Even massively-parallel computing does not really help. Real programs, of course, have far more complex inputs than this! Thus, dynamic approaches cannot show that a program is secure in a strong sense; all they can show is the absence of vulnerabilities with the tests that were used.
But this does not mean that dynamic approaches are useless. Dynamic approaches can be a very useful way to improve security, as long as their limitations are understood. Of course, dynamic approaches (aka software testing) is a useful approach for finding defects. A general introduction to software testing not specific to security is available in Introduction to Software Testing by Paul Ammann and Jeff Offutt [Ammann2008].
Let me discuss two areas that are widely used, but would fail to find Heartbleed: a mostly-positive test suite and traditionally-applied fuzzers.
One approach is to create a big automated test suite. Eric S. Raymond and some others have been discussing Heartbleed, and in our discussion he stated that, “I think a lot of people have an intuition that test suites don’t work very well... What I’ve learned since is that the gap is relatively narrow - pushing conventional methods hard enough can get you pretty close to never-break”. I completely agree with him that a good automated regression test suite is powerful, especially for non-security defects. If you don’t have one, create one, full stop, we agree.
However, whether or not a test suite would have found the Heartbleed vulnerability depends on how you create this test suite. The way many developers create test suites, which produce which I call “mostly-positive” test suites, would probably not have found Heartbleed. I will later discuss negative testing, a testing approach that would have worked, but we first need to understand why common testing approaches fail.
Many developers and organizations almost exclusively create tests for what should happen with correct input. This makes sense, when you think about it; normal users will complain if a program doesn’t produce the correct output when given correct input, and most users do not probe what the program does with incorrect input. If your sole goal is to quickly identify problems that users would complain about in everyday use, mostly-positive testing works. Besides, many software developers have a bias to focus on making the program work with correct input, and at most try to handle some error conditions they can easily foresee, so they have a natural tendency to create tests with correct input. Many developers simply don’t think about what happens when an attacker sends input that is carefully crafted to exploit a program.
I will call the approach of primarily creating tests for what should happen with correct input a mostly-positive test suite. Unfortunately, in many cases today’s software regression test suites are mostly-positive. Two widely-practiced test approaches typically focus on creating mostly-positive test suites:
Mostly-positive testing is practically useless for secure software. Mostly-positive testing generally isn’t testing for the right thing! In the Heartbleed attack, like most attacks, the attacker sends data in a form not sent in normal use. TDD and interoperability testing are good things... but you typically need to augment them if your goal is secure software.
Code coverage tools as typically used would not have helped either. Some developers may, in addition, run code coverage tools to see what wasn’t tested, and then add additional tests so that a larger percent of the code is covered by tests. Code coverage tools are actually hybrids of static and dynamic analysis, but for purposes of this paper we will discuss them here. The key questions with code coverage are (1) what specific code coverage measurement(s) are used, and (2) what are their minimum value(s)? These vary, but in practice many people are happy with a test suite that only tests 80%-90% of the code when measured as statements or branches (aka decisions). A very few might press all the way up to 100% coverage of statements or branches. Some, particularly those in the safety community, may use a slightly more rigorous coverage measure such as modified condition/decision coverage (MC/DC). Other coverage measures are possible, but these are the common ones. Test coverage tools for these common measures have some security value, for example, they can sometimes detect malicious software that is waiting for a trigger (since tests will often not include the trigger). They can also check if exception handlers seem to run correctly. But even 100% coverage, as measured by typical code coverage tools, would not have been enough to counter Heartbleed. The Heartbleed vulnerability involved the absence of proper input validation. Fundamentally, a code coverage tool as typically used cannot notice missing code; it can only notice existing statements or branches that are untested.
It is not clear that a less-common approach, program mutation testing, would have worked either. Mutation testing is a different coverage measure and way to develop new tests. As applied to programs, in mutation testing you “mutate” a source program, using a set of mutation operators, to create “mutants”. If a test can detect a difference between a mutant and the original, then the mutant is “killed”. This would not have detected the lack of input validation, because there was no input validation test to mutate. It is conceivable that it might have found the buffer over-read, and there has been some research work on using program mutation testing to detect vulnerabilities (e.g., [Shahriar2008]). You can also mutate input data structures and use those mutated inputs as tests. Mutating input data structures certainly could find the Heartbleed vulnerability. Two approaches I discuss later (thorough negative testing and fuzzing with address checking and standard memory allocator) can be viewed as ways to apply mutation testing to input data structures.
I should note that this is not unique to OpenSSL. CVE-2014-1266, aka the goto fail error in the Apple iOS implementation of SSL/TLS, demonstrated that its testing was also mostly-positive. In this vulnerability, the SSL/TLS library accepted valid certificates (which were tested). However, no one had tested to ensure that the library rejected certain kinds of invalid certificates. If you only check if valid data produces valid results, you are unlikely to find security vulnerabilities, since most attacks are based on invalid or unexpected inputs.
You can find this vulnerability (and similar ones) if you create the test suite using a different approach (negative testing) that I describe below. But first, let’s discuss fuzzing.
Fuzz testing is the process of generating pseudo-random inputs and then sending them to the program-under-test to see if something undesirable happens. The tools used to implement fuzz testing are called fuzzers.
Note that fuzz testing is different from traditional testing; in traditional testing, you have a given set of inputs, and you know what the expected output should be for each input. Traditional testing can be expensive as the number of tests grows, because you have to figure out the expected output. The mechanism that determines the expected output is called an Oracle. The costly problem of invoking an Oracle for a large number of test inputs is sometimes called the Oracle problem.
Fuzz testing approaches the Oracle problem differently, because it only tries to detect “something bad” like a program crash. Fuzz testing makes it easy to try many more input test cases in fuzz testing, by making the output checking much less precise. The fuzzing approach was originally developed by Barton Miller in 1988 at the University of Wisconsin. The “fuzz testing of application reliability” site at http://pages.cs.wisc.edu/~bart/fuzz/ has more information about fuzzing in general. For more about fuzzing, see [Takanen2008] and [Sutton2007].
Fuzzers are often used to help find security vulnerabilities, because they can test a huge number of unexpected inputs. In particular, fuzzers are often useful for finding input validation errors, and Heartbleed was fundamentally an input validation error. Yet typical fuzzers completely failed to find the Heartbleed vulnerability!
Fundamentally, the way fuzzers are typically applied would not have found Heartbleed. Heartbleed was a buffer over-read vulnerability, not a buffer over-write vulnerability. Most fuzzers just send lots of data and look for program crashes. However, while buffer over-writes can often lead to crashes, buffer over-reads typically do not crash in normal environments (my thanks to Mark Cornwell who pointed this out).
Several mechanisms are sometimes used to improve the likelihood of detecting or countering buffer over-write. But again, Heartbleed involved an over-read not an over-write, so some of these additional mechanisms would not help at all. For example, canary-based protection approaches (e.g., ProPolice) and non-executable stacks are designed to counter over-writes - not over-reads. GNU libc’s malloc() has the option MALLOC_CHECK_; this uses a less-efficient implementation that tolerates simple memory allocation errors (such as double-free) and tries to detect corruption (e.g., caused by writing past the end of an allocated block). The MALLOC_CHECK_ option is a helpful countermeasure against over-writes, but I have no evidence that it would have detected or countered an over-read like Heartbleed. Similarly, Dmalloc’s fence-post (bounds) checking “cannot notice when the program reads from these areas, only when it writes values.”
Fuzzers can find vulnerabilities like Heartbleed. However, to make that happen, we need to extend the error-detection capabilities that they use (beyond the simple approaches widely used today). One way to extend their error-detection capabilities is to use special address-checking tools that can detect memory problems like over-reads during fuzzing. These special address-checking tools (such as address sanitizer or a guard page system) turn subtle problems into something the fuzzer can detect, such as a crash. These special tools typically require that the program allocate and deallocate memory normally.
It is known that OpenSSL does not directly allocate and deallocate memory directly using standard calls. Instead, it used a caching freelist system internal to OpenSSL to reuse allocated memory. Since OpenSSL does not return (deallocate) memory back to the underlying system once it was done with it (in some cases), special tools could fail to detect some common weaknesses such as use-after-free or double-free that they would otherwise find. It also sometimes prevented some operating system and run-time mitigation mechanisms from working. In short, it is widely agreed that this OpenSSL memory allocation approach prevented many mitigation and weakness detection mechanisms from working.
There have been conflicting reports on whether or not these special tools could have specifically found Heartbleed without code changes in OpenSSL. Kupsch and Miller reported in the April 22, 2014 edition of their paper that OpenSSL uses a custom memory allocator, and that “to a dynamic analysis tool, it appears as if the library is allocating large memory buffers and not returning them, but in reality, it is subdividing these large blocks of memory and returning them for use” [Kupsch2014-April]. This kind of subdivision completely defeats the ability of these special tools to detect over-reads like the one in Heartbleed. Based on this information, older versions of this paper reported that fuzzers would have been unlikely to have found Heartbleed in unmodified OpenSSL, even if some of these special tools were used. However, Chris Rohlf and I have since independently investigated the OpenSSL code. On careful examination it appears that while OpenSSL does have a custom system, this memory subdivision does not occur in OpenSSL. The fact that it is so difficult to even determine what the allocator does is testimony that the memory allocation system itself is too complex! Based on this more recent information, it appears that fuzzing could have found Heartbleed, but only if special tools were used with fuzzers, and Kupsch and Miller have updated their paper [Kupsch2014-May]. This is, however, a minor point. Kupsch and Miller were and are correct that typical fuzz testing would not have found this vulnerability, and that the OpenSSL code countered many mitigation and defect-detection tools.
There has been some speculation that fuzzing hasn’t been done as rigorously for OpenSSL and other cryptographic libraries because encryption greatly reduces effectiveness of fuzz testing unless the fuzzer is given keys and is specially written to attack the library [Uberti]. That might be true. However, nothing prevents anyone from writing fuzzers that are given keys (for purposes of testing). Besides, the Heartbleed vulnerability can be found even without keys. Thus, it’s really the fact that it was an over-read that made traditional fuzzing ineffective.
Some fuzz testing systems are white box fuzz testing systems. These systems typically use static analysis to determine what parts of the program are not tested by earlier fuzz testing, and then develop new inputs to test those previously-untested portions. SAGE (Scalable, Automated, Guided Execution) is an example of a tool that takes this approach [Godefroid2008] [Bounimova2013].
To summarize: Traditionally-applied fuzzers and fuzz testing could not find Heartbleed. As I will soon describe, fuzzing can be effective if special address checking tools and a standard memory allocator are used.
Here is a partial list of tools and techniques that would have countered Heartbleed ahead-of-time (either with certainty or with very high confidence). I will specifically note some free / libre / open source software (FLOSS) where that makes sense to do so.
But first, some caveats:
So given those caveats, what specifically could have countered this vulnerability ahead-of-time? To make this especially useful, I have roughly ordered them in cost order, with the cheapest approaches listed first. It is a really rough order, and some are especially debatable; suggestions on how to improve it are welcome. In many cases the more expensive approaches are more general and can counter many other kinds of vulnerabilities, not just Heartbleed. The subheadings identify in parentheses which use dynamic analysis, static analysis, or a hybrid. The final subsection discusses other approaches that might have worked.
Negative testing is creating tests that should cause failures (e.g., rejections) instead of successes. For example, a system with a password login screen will typically have many positive regression tests to show that logins succeed if the system is given a valid username and credential (e.g., password). Negative testing would create many tests to show that invalid usernames, invalid passwords, and other invalid inputs will prevent a login. One book defines negative testing as “unexpected or semi-valid inputs or sequences of inputs... instead of the proper data expected by the... code” [Takanen2008 page 24]. There are many ways to do negative testing, including creating specific tests (the focus of this section) and creating semi-random test input (covered in a later section as fuzzing).
Thorough negative testing in test cases creates a set of tests that cover every type of input that should fail. I say every type of input, because you cannot test every input, as explained in the section on dynamic analysis. You should include invalid values in your regression test suite to test each input field (in number fields at least try smaller, larger, zero, and negative), each state/protocol transition, each specification rule (what happens when this rule is not obeyed?), and so on. This would have immediately found Heartbleed, since Heartbleed involved a data length value that was not correct according to the specification. It would also find other problems like CVE-2014-1266, the goto fail error in the Apple iOS implementation of SSL/TLS. In CVE-2014-1266, the problem was that iOS accepted invalid certificates. There were many tests with valid certificates... but clearly not enough tests to check what happened with invalid ones.
In most cases only negative tests, not positive tests, have any value for security. As I noted earlier, what matters about test suites is how you create them. This is probably obvious to many readers of this paper. In particular, I suspect Eric S. Raymond is including these kinds of tests when he discusses the advantages of testing. However, this is not obvious to many software developers. All too many developers and organizations only use a mostly-positive test suite instead. Many developers find it very difficult to think like an attacker, and simply fail to consider widespread testing of inputs that “should not happen”.
One great thing about thorough negative testing is that this can at least be partially automated. You can create tools that take machine-processable specifications and generate lots of tests to intentionally fail it... and then see if the implementation can handle it.
Another great thing about thorough negative testing is that if there’s a standard (which there is in this case), it’s possible to collaboratively develop a separate common test suite as a FLOSS project. Then it’s possible to quickly test all current and future implementations and prevent many problems from getting out to users. I would strongly encourage creating general-purpose test suites for protocols like SSL/TLS; that would reduce effort (people only need to create the test suite once), and it would help increase the security for all implementations (not just one). Individual implementations would still need to supplement the general tests with additional tests, but a common big test suite would be a big help.
Software testing is, in fact, an entire field. There are many different kinds of test approaches and test coverage criteria. I can only summarize testing in this paper. For more general information, again, see Introduction to Software Testing by Paul Ammann and Jeff Offutt [Ammann2008]. But the point still stands: testing with only valid input will fail to find many security-related problems, including Heartbleed.
I do not think that you should depend solely on thorough negative testing, or any other single technique, for security. Negative testing, in particular, will only find a relatively narrow range of vulnerabilities, such as especially poor input validation. Dynamic approaches, by their very nature, can only test an insignificant portion of the true input space anyway. But - and this is key - this approach can be very useful for finding security vulnerabilities before users have to deal with them.
Unfortunately traditional fuzz testing approaches were not helpful in this case. But there are simple lessons we can learn. Fuzzing would have been much more effective if a special tool called an address accessibility checker had also been used. These kinds of special tools can detect many out-of-bound reads in addition to out-of-bound writes during execution, and can often detect other memory problems as well. They are especially good at detecting when a read or write incrementally goes beyond the end of the buffer, and that is exactly the problem with Heartbleed.
There are a number of special tools that perform some sort of address accessibility checking; every tool has its pros and cons. However, if you haven’t used anything else, I strongly recommend that you check out address sanitizer (ASan).
Address sanitizer (ASan) was first released in 2012, and is now easily available; it’s just an extra flag (-fsanitize=address) built into the LLVM/clang and gcc compilers. Address sanitizer is nothing short of amazing; it does an excellent job at detecting nearly all buffer over-reads and over-writes (for global, stack, or heap values), use-after-free, and double-free. It can also detect use-after-return and memory leaks. It cannot find all memory problems (in particular, it cannot detect read-before-write), but that’s a pretty good list. Its performance overhead averages 73%, with a 2x-4x memory overhead. This performance overhead is usually fine for a test environment, and it’s remarkably small given how good it is at detecting these problems. Many other memory-detection mechanisms have a far larger speed and memory use penalty, and many guard page tools (described below) can only detect heap-based problems. The one big drawback with ASan is that in current implementations you have to recompile the software to use it; in many cases that is not a problem.
For more about ASan, see the USENIX 2012 paper [Serebryany2012] or the ASan website (http://code.google.com/p/address-sanitizer/). The test processes for both the Chromium and Firefox web browsers already include ASan.
Christopher T. Celi (of NIST) confirmed to me on 2014-07-10 that address sanitizer does detect Heartbleed if an attacking query is made against a vulnerable OpenSSL implementation. He ran OpenSSL version 1.0.1e (released in February 2013), which is known to be vulnerable to Heartbleed. He use gcc (version 4.8+) and its -fsanitize=address flag to invoke address sanitizer. As expected, a normal heartbeat request causes no trouble, but a malicious heartbeat request is detected by ASan, and ASan then immediately causes a crash with a memory trace. In his test suite ASan reported, in its error trace, that the there was an error when attempting a “READ of size 65535”. He comments that, “Though the output is a bit more cryptic than that of Valgrind, ASan is better for testing with a fuzzer as it crashes upon finding an error. Because of the output however, one would have to analyze the specific input that caused the crash a bit more heavily than with Valgrind.” As I note later, he also confirmed that Valgrind works.
Even more importantly, Hanno Boeck confirmed in 2015 that the fuzzing tool american fuzzy lop (afl), when combined with Address Sanitizer, does automatically find Heartbleed. It only took 6 hours on non-fancy hardware, and that is a short time for a fuzzer. To be fair, at the time afl was barely known, harder to use, and had trouble working with Address Sanitzer. He noted that, "A lot of other things have been improved in afl, so at the time Heartbleed was found american fuzzy lop probably wasn't in a state that would've allowed to find it in an easy, straightforward way." [Boeck2015] However, afl has become a remarkably powerful yet easy-to-use fuzzer. It tracks the branches that are taken and how often, then prefers using tests that cover the program differently when it evolves new tests. This is so successful that afl has pulled JPEGs out of thin air [lcamtuf2014]. This suggests that using afl, combined with Address Sanitizer or something similar, is worth considering today.
There are other tools that can detect memory access and allocation problems. These include binary simulators (e.g., valgrind), guard page systems (e.g., electric fence), and the CPU-specific bound checking mechanisms (such as Intel Memory Protection Extensions (MPX)). You can use several different tools (possibly on different fuzzer runs). For fuzzing to detect Heartbleed and vulnerabilities like it, the mechanism must be able to detect an over-read (not just an over-write) and eventually lead to a crash or other problem detectable by fuzzing. Some of these approaches have very significant performance overheads; where significant, these overheads can reduce the amount of fuzz testing that can be done in a fixed amount of time.
Binary simulators (such as valgrind and Dr. Memory) indirectly execute a program, while performing additional functions such as tracking memory accesses. A widely-used and widely-respected tool in this category is valgrind; valgrind’s memcheck plug-in can detect a variety of errors including over-reads on the heap. Valgrind works by creating a “synthetic processor” and monitoring execution. There are various plug-ins for valgrind; the memcheck tool tracks if memory is valid (if it has been initialized) and and if it can be accessed (e.g., if it has been allocated or not). Valgrind can be used on programs when you do not have the source code. However, valgrind greatly slows down the program, often 25-50 times, and often increases code size by a factor of 12 [Takanen2008, page 182], but this may be fine for testing. Valgrind’s memcheck is powerful for detecting heap-based vulnerabilities like Heartbleed, but it has an important limitation: Memcheck cannot do bounds checking on global or stack arrays. A tool that works in a similar way to valgrind, but focuses especially on memory access issues, is Dr. Memory. Both valgrind and Dr. Memory are FLOSS. These are very useful tools for finding memory-related errors, especially if you lack source code. However, ASan tends to be better if you have the source code and you want to do dynamic bounds-checking; ASan is much faster, takes less memory, and can do bounds-checking for heap, stack, and global data.
Christopher T. Celi (of NIST) confirmed to me on 2014-07-07 that Valgrind does detect Heartbleed if an attacking query is made against a vulnerable OpenSSL implementation. He ran OpenSSL version 1.0.1e (released in February 2013), which is known to be vulnerable to Heartbleed. In this configuration Valgrind detected an “invalid read” of a region that had been allocated by malloc. The invalid read occurred as expected inside the standard C function memcpy, which was called by tls1_process_heartbeat (which is responsible for receiving a heartbeat and processing a response), which was called by ssl3_read_bytes. Valgrind could also report that the memory was allocated by the standard C function malloc through OpenSSL CRYPTO_malloc, again, as expected. In this particular test he sent a message that was known to trigger the Heartbleed attack. He notes that to have detected this ahead-of-time with Valgrind, “Someone testing the code would likely have to use a fuzzer to assemble the proper bytes of hex to send to the server.” Note that he also confirmed that ASan works.
Many tools use guard pages to detect reads or writes that march over or under a buffer. In these systems, a guard page is added after and/or before the allocated memory; attempts to access the guard page region is trapped and specially responded to (e.g., it may lead to a crash). Often these tools are implemented by intercepting a few heap memory allocation calls (such as malloc). Tools that intercept heap allocations and add guard pages typically do not require source code, which is an advantage, but they can can only detect heap-based problems. Also, many guard page systems have a significant performance overhead in both speed and memory use. For example, Guard Malloc is report to increase execution time by a factor of 100 times or more [Takanen2008, page 181] in addition to a very large memory overhead. These tools primarily focus on detecting access of unallocated memory (including use after free), but they can sometimes detect use before initialization by filling un-initialized memory with unusual values. Examples of such tools include electric fence, Detect Unintended Memory Access (DUMA) (a fork of electric fence), guard malloc, and the OpenBSD malloc; all of these are FLOSS.
Some system memory allocators, such as the OpenBSD malloc, have a built-in guard page mechanism. These would have inhibited or stopped Heartbleed, depending on how it is implemented. In particular, OpenBSD’s malloc implementation supports guard pages. In OpenBSD, the “G” option causes “each page size or larger allocation is followed by a guard page that will cause a segmentation fault upon any access.” This can be combined with the “P” option (the default), which moves allocations within a page (“allocations larger than half a page but smaller than a page are aligned to the end of a page to catch buffer overruns in more cases.”) The OpenBSD mechanism can be enabled for a particular program or even enabled by default across the whole system, and this can protect many situations. The OpenBSD malloc approach is reported to have relatively moderate overhead and yet “caught serious bugs in lots of major software” [OpenBSD-Journal] [Felker2014].
There is a weakness in the OpenBSD malloc mechanism: Even with both G and P enabled, small allocations (half a page or less) are not immediately followed by a guard page. I think it would be even better if the OpenBSD guard page mechanism could insert a guard page immediately after even relatively small allocations, even though this would probably have a serious speed and memory size impact. But even as it is, enabling both G and P means that all allocations larger than half a page are immediately followed by a guard page (subject to alignment limits), and that allocations that are a half a page or less will at most leak half a page. That can be very significant reduction in leak size compared to the 64KiB of the original Heartbleed attack, depending on the page size (often 4KiB).
Memory allocations must be aligned, so guard pages may leak a few bytes at the end depending on the implementation. I suspect ASan would be faster than adding guard page on every allocation, but adding guard pages do not require a recompile in most programs, so there is an advantage to having it. Unfortunately, the popular GNU libc malloc does not include this kind of functionality at all.
Intel Memory Protection Extensions (Intel MPX) or other CPU-specific bounds checking mechanisms might help. MPX adds new registers called bound registers to hold bounds for pointers, and new instructions to manage and use the bounds. MPX is to be released as part of the Skylake architecture, but as of 2014 these CPUs are not available to the public. It will take longer for them to be widely available, and that does not necessarily help non-Intel systems (e.g., smartphones do not usually use an Intel chip).
There are other tools and approaches. The point is that many tools can detect memory over-reads, and using at least one of these tools can make fuzz testing more effective.
In general, when using fuzz testing you should turn on as many anomaly detectors as you can. The only detection mechanism used for the first fuzzer was “did the unchanged program crash/hang?” - and many fuzzers still only do that. You should at least enable program assertion checks and create as many assertions as you reasonably can. You might also do additional checking to ensure that the intermediate or final state is valid (for example, sanity-check outputs and examine what files are produced in what directories). But for the purposes of Heartbleed-like vulnerabilities, you should at least turn on invalid memory access detectors like ASan.
Many of these tools, including ASan and guard page based programs, require that the program under test allocate and deallocate memory normally. In particular, the program must not combine multiple allocations into one allocation request (e.g., as is done by a slab allocator or memory slicing implementation). At the least, the program should make it trivial to use a normal allocation approach instead for use in fuzz testing (and test that it works).
It it true that encryption libraries can create special issues for fuzzers [Uberti]. But these issues are easily addressed. As Paul Black has stated to me separately, “a tool based on mutated messages should mutate all parts of the message at all levels: individual bits, before encryption, after encryption, session creation, the whole handshake, [etc.]”. Or as Apostol Vassilev has stated to me separately, “a thorough fuzzer should exercise forbidden state machine transitions”.
The first fuzzers generated truly random data to be sent to a program. However, other methods for creating data can improve fuzzing effectiveness. Most fuzzers can be divided into three categories:
There are many other ways to categorize fuzzers, too. Template-based fuzzers use existing traces and fuzz parts of the recorded data. Block-based fuzzers break individual protocol messages down in static and variable parts and fuzz only the variable part. Dynamic Generation/Evolution-based fuzzers learn the protocol of the Target of Evaluation (TOE) by feeding the TOE with data and interpreting its responses, e.g. using evolutionary algorithms. Model-based fuzzers employ a model of the protocol. The model is executed on- or offline to generate complex interactions with the TOE. This enables fuzzing data after a point such as authentication, an important issue for SSL/TLS.
Fuzzing can also be combined with traditional tests, again, so that the fuzzing can go beyond a point like authentication. For more discussion about these fuzzer variations and the use of model-based fuzzers, see [Schieferdecker2012]. Codenomicon also discusses different fuzzing approaches and their coverage. Other relevant papers include “A Model-based Approach to Security Flaw Detection of Network Protocol Implementations” [Hsu2008] Whitebox-based fuzzers examine the program to improve what to fuzz (and thus are really hybrid analysis approaches); these can extend effectiveness, but they require more effort to implement and simply are not necessary to find vulnerabilities like Heartbleed. For more information on fuzz testing, see [Takanen2008] and [Sutton2007].
A lot of work has been going on to improve the coverage of code in fuzz testing. In general, as more code is covered by fuzz testing (as measured as statements or branches), the more likely that fuzz testing will detect a vulnerability if present. Thus, some fuzzers use information about the program being executed to improve fuzzing capabilities. Some tools, such as the FLOSS American fuzzy lop, instrument code to improve code coverage while fuzzing. Microsoft has had good experience with constraint-based whitebox fuzz testing, in which they leverage symbolic execution on binary traces and constraint solving to construct new inputs to a program [Bounimova2013] However, simply sending the data is not enough; a fuzzer would have to have detected that there was a problem, and out-of-bounds reads typically do not cause a crash or other easily-detected problem unless something else has been done.
It is possible that a mutation-based fuzzer could have found Heartbleed, once it is coupled with better fault detection, but a mutation-based fuzzer would probably only find Heartbleed if the starting test cases included a heartbeat message. A generation-based fuzzer, once coupled with better fault detection, would be highly likely to find Heartbleed... but only if (1) it included rules to generate a heartbeat, and (2) it fuzzed lengths as well. For example, the Sulley fuzzing framework automatically computes block lengths, but by default it does not fuzz the lengths to make them incompatible with the data being sent. If you use Sulley, you’ll probably need to set the “block sizers” to be “fuzzable=True”; this creates a more rigorous test (as is probably needed to detect Heartbleed) but it is not the default. Thus, Heartbleed shows that we need to be fuzzing lengths and not assuming that lengths are always being checked properly.
Oh, I should add a quick terminology note. People sometimes use the term “negative testing” to make it appear to be a synonym for fuzz testing [Takanen2008, page xix]. I do not use the term that way. Instead, I use the term negative testing in a broader sense. Still, fuzz testing is a useful approach for negative testing, so much so that I have listed it as a separate category.
It would be possible to send inputs like a traditional fuzzer, but examine the outputs more thoroughly. This approach is discussed later in the section on fuzzing with output examination.
It’s debatable whether or not fuzzers are more expensive than negative testing, but here is my reasoning. One advantage of negative testing is that it is really easy to get started; presuming you already have a test suite, you can just start adding negative tests. More importantly, though, negative tests rapidly give an unambiguous answer as to what caused the problem, and since they require little computing power (compared to fuzz testing) developers can easily re-run a test suite on every patch. In contrast, fuzz testing often requires more computing power and interpretation of results; computing power is cheap, but this factor still slows down feedback to developers. The potentially-faster feedback of negative testing could lead to faster developer detection and fixes. Today a key cost driver is developer time, not computing time; a mechanism that best reduces developer time is really helpful and tends to be less costly. Also, you can make a negative test suite once for a given protocol; you can then easily reuse the test suite on every implementation and every patch of each implementation. Of course, these are not in conflict; it is better to do both negative testing and fuzz testing.
What if you want to use a program right now, in situations where it’s really important to counter attackers from unknown potential vulnerabilities? It turns out there is at least one way that could have worked. In addition, it might have provided some early warning of exploitation (a rather late form of detection, but it is detection).
One approach is to use a mechanism that detects (at run-time) attempts to read past the end of an allocated memory region. In this approach, you’re not just changing how tests are run; the idea is that you actually use this version during operation! This requires that the program allocate and deallocate memory normally. In particular, the program must not combine multiple allocations into one allocation request (as is done by a slab allocator or memory slicing implementation).
There are several mechanisms that could detect such things at run-time. These are basically a subset of the detection mechanisms for fuzzing with address checking and standard memory allocator (dynamic analysis), with the additional challenge that speed and memory use are much more important. Here are few examples:
I suspect ASan would be faster than adding guard page on every allocation, but adding guard pages does not usually require a recompile, so there is an advantage to using it instead.
This is really a damage reduction approach, instead of an approach that eliminates the problem. From a security point of view this approach turns a loss of confidentiality into a loss of availability. In many cases, however, this is a good trade-off. Also, this approach makes the problem visible once the system is under attack; once a problem is visible it is usually easy to correct.
This approach can be easily combined with a honeypot or honeynet (my thanks to Vincent Legoll, who pointed this out to me on 2014-05-05). Set up these hardened implementations on honeypot/honeynet systems (systems that should not be used by non-attackers), basically to detect and trap attackers. If an attacker tries to break the software, the software would crash instead, and that could be logged and tracked as especially important. Forensics could then detect some specific zero-day exploitations. I think this could also be done by some logging systems combined with intrusion detection systems; again, if a crash occurs in a hardened crypto library, log it specially. This would make it much easier to detect widespread exploitation of a 0-day attack. Distributions, core infrastructure organizations, and other organizations could establish these across the Internet and help protect us all. This would a relatively late form of detection, but in some cases it would detect attacks before others were attacked.
While this approach doesn’t fully fix the problem, it does provide a powerful mitigation, and can be used as part of a larger detection approach. Some distributions or organizations might want to use these countermeasures in specific situations, or at least make these countermeasures easier to enable.
Changing the code doesn’t cost much effort, and recompiling is usually fairly simple also (when you have the source code). However, the performance loss would be really significant in many settings; it’s like losing part of the hardware performance you paid for. For example, using ASan you lose around half your (speed) performance. Thus, I’m counting this approach as a more expensive solution, to capture the loss of this hardware. In many situations the operational impact would be significant; on smartphones this would reduce speed and battery life, and on popular servers this could slow response and increase electrical power costs. If future CPUs add hardware support for ASan, the speed impact could be reduced significantly (the ASan paper estimates that the speed overhead would go from 73% to about 20%). I would love to see CPU manufacturers explore this.
The vulnerable code was reviewed by a human, so merely having a single human reviewer was obviously not enough.
However, a variation would have worked - requiring the human (manual) review to specifically check every field to ensure that every field was validated. Checklists sometimes get a bad name in computer security. I suspect one reason is that sometimes checklists are deployed to people who don’t know what they’re doing, who then can’t use them effectively. But expert airplane pilots routinely use checklists, even though they do know what they are doing. If patches are only accepted after they are reviewed using a checklist, and the checklist includes “must show that every untrusted data field is validated”, then it is likely that this vulnerability would have been countered.
I had originally included this approach as part of the approach thorough human review / audit, but this is a different and much lower-cost approach. However, it does require that the reviewer(s) apply it to every patch as they come in; it cannot easily help with a large body of pre-existing code.
Fuzz testing traditionally involves sending lots of input to a program and looking for grossly incorrect behavior such as crashes. In fuzzing with output examination, the fuzzing system also examines the TOE output, e.g., to determine if the output is expected, or if it has various anomalies that suggest vulnerabilities. To accomplish this, the fuzzing system is provided additional information about the expected response (e.g., as required by a specification) or lack thereof, e.g., some constraints on the expected TOE output. The TOE response can also be compared to patterns (typically based on heuristics) that suggest vulnerable behavior (such as evidence of a cross-site scripting vulnerability).
This is possible to do with generation-based fuzzers because these kinds of fuzzers are already provided information about the correct sequence of interface input (e.g., of a protocol). This approach simply extends this information to also describe the expected output. The description of expected output need not be exact. It becomes increasingly likely to find existing vulnerabilities by making the output description increasingly precise, but of course, more exacting descriptions require much more effort to create. In some sense this approach stretches fuzz testing back towards traditional thorough negative testing in test cases. Early fuzz testing gave up the idea of knowing exactly what the expected output is (to simplify creating test cases), while this approach re-introduces the idea of examining results more carefully for correct behavior.
This is the approach that Codenomicon used to find Heartbleed. In their approach, they developed an additional mechanism called “Safeguard” inside their Defensics tool. Safeguard analyzes the TOE responses to determine if they matched what was expected. More information about this (at a very high level) can be found in [Codenomicon-How], [Eadicicco], and [Chandrashekar]. Codenomicon originally added this to just one protocol suite, SSL/TLS, but based on their success with this approach they are adding this approach to several other interfaces. I understand they intend to add Safeguard to five more interfaces by the end of June 2014, which clearly indicates that Codenomicon thinks this approach has value.
Codenomicon (particularly Mikko Varpiola) provided more information about Safeguard and SSL/TLS in particular. Safeguard was inspired by examination of an earlier vulnerability, CVE-2012-2388. This vulnerability involved signature handling which was tied to user authentication. A Codenomicon engineer realized that this kind of vulnerability could be detected by a fuzzer if it could detect that certain stages of a protocol could be incorrectly skipped. They then began to develop additional checks to examine the TOE output more rigorously.
Safeguard (at least for SSL/TLS) implements four kinds of checks:
Mikko Varpiola told me that these seem to be surprisingly useful when fuzzing protocols that carry usernames. These are often processed by a SQL database, and this kind of fuzzing can help detect SQL injection vulnerabilities.
My thanks to the people from Codenomicon for providing information to me on how Safebuard works in Defensics: Steve Hayes, Josh Morin, Bob Sturm, and Mikko Varpiola. Mikko Varpiola, in particular, provided me with a lot of more detailed information on Safeguard. Any mistakes in my paraphrasing of their information are my own.
One advantage of this approach is that you only need to observe the output of the system. You do not need source code or the ability to manipulate the underlying platform, so this can be used to examine systems like routers as black boxes. That is really impressive, and is a significant contrast to fuzzing with address checking (which typically requires source code or at least the ability to manipulate the underlying platform).
A challenge with fuzzing with output examination is that someone must create this additional information about the expected output. Obviously it takes additional time to encode the information about the expected output (since this in addition to the interface information that is already necessary to generate input). Another problem is that determining this information is not easy. Specifications (such as IETF RFCs) are notorious for under-specifying what should happen with incorrect or barely-correct input. It is possible to start with a more rigorous requirement and then add various exceptions or allow more variations, but this can take many iterations involving detailed examinations of TOE output. It is also possible to weakly specify the results, but the more generous the specification, the less likely it is to find vulnerabilities.
I have identified this as somewhat more costly because it requires significant interface-specific analysis to determine what the output requirements should be, and then encode them. However, once this information is encoded it can be reused to test later versions or alternative implementations of the same interface.
Traditional source code weakness analyzers could not find Heartbleed, because they used general-purpose heuristics that simply didn’t work well enough in this case, in part because of the complexity of the code. It is always best simplify the code where you can, but there is always some minimal complexity based on what you are trying to accomplish, and real humans are unlikely to achieve perfect simplicity anyway. Coverity is developing some new heuristics that they think would detect Heartbleed [Chou2014] ... and good for them! At least one person has implemented similar heuristics using clang [Ruef2014] Indeed, I expect all source code weakness analyzers to improve over time, and thus find vulnerabilities that they didn’t find before. But generic heuristics can only go so far at any point in time; can you go beyond?
The answer is yes, and I call this a context-configured source code weakness analyzer. The basic idea is that you start with a source code weakness analyzer, but you then provide far more information about the program that you are analyzing.
This approach requires much more time than just running a source code weakness analyzer, and this additional information is typically tied to just one specific tool (tying you to that tool). However, if you provide more information about your program, the source code weakness analyzer can do a much better job.
Klocwork has shown that this approach definitely works for Heartbleed [Sarkar2014].
Now let’s talk about annotation systems. There are various ways to provide this additional information to static analysis tools; an annotation system adds this additional information as part of the program itself. One common way is to add an annotation system to the programming language, and then modify the program to use these annotations. These annotations can be added by directly changing the code (using new keywords), added as comments, or added in separate files. Examples of tools or annotation languages for C include Microsoft’s SAL, splint, Deputy, Oink/CQual++, cqual, and Frama-C ANSI/ISO C Specification Language (ACSL). Static analysis tools can check the information from the annotation system on every compilation, providing quick feedback once they are used, and they are not limited to specific input values (i.e., they are not limited by the problems of dynamic analysis). You could easily argue that adding this information (via annotation systems) is really a different technique.
Seriously using these additional annotations to counter vulnerabilities often requires a non-trivial amount of work if you are starting with existing code. There are also many different incompatible annotation systems for C, and there are no standards for them, which further impedes their use. After all, it takes work to add annotations, and those annotations lock you into to a specific tool. Microsoft SAL has additional problems; there is no FLOSS implementation and it is only available on Windows. I think that annotation systems would be much more widely used if there was a single widely-accepted standard annotation notation for each major programming language, including C. It would be hard to get that kind of agreement for languages like C when there isn’t already such a notation. Peter Gutmann has written a post on some of his experiences [Gutmann].
However, annotation systems have many advantages. Annotation systems can find vulnerabilities that simply are not countered by switching to a different language. Also, they are often cheaper than switching to a different language (because you are simply adding additional information to an existing program). Of course, these are not in conflict; you can switch languages and use a code annotation system for the new language.
Another approach that would probably have detected Heartbleed is 100% branch coverage of alternative implementations. As noted earlier, branch testing cannot detect when input validation code is missing in a particular program. Branch coverage can, however, detect existing untested branches in a different implementation. Striving for a test suite that gives full branch coverage of multiple implementations greatly increases the likelihood that missing validation code and missing exception handling would be detected. Stronger test coverage measures, such as modified condition/decision coverage (MC/DC), would work as well.
Like all coverage approaches, this is fundamentally a hybrid analysis technique. This uses dynamic analysis to run tests... and static analysis to determine which branches (or related coverage measures) have been left untested.
This approach is a somewhat specialized approach for finding vulnerabilities. The test suite must be applied across multiple implementations, all with 100% branch coverage, so it requires multiple implementations to be used at all. What’s more, the more different the implementations, the better. Also, this approach is probably less capable (by itself) at finding security vulnerabilities than other approaches. That’s because there may be many different inputs that follow the same path, yet only a small subset of them might trigger a vulnerability. It also only works if one of the other implementations implements the particular component under test (in SSL/TLS support for the heartbeat is optional) and implements the potentially-missing input validation code.
I have never seen this specific approach discussed in the literature; usually people discuss branch coverage of a single implementation (instead of multiple implementations). Still, it is fair to note that this approach can not only help improve quality, but it could also have found this particular vulnerability.
One trouble: these would not necessarily counter Heartbleed, because much depends on the configuration extensions or annotations used and how they are used. In particular, the output would need to be checked thoroughly enough to detect that a problem occurred. On the other hand, they do not depend on hitting exactly the right input; static analyzers can examine a large number of situations simultaneously.
Multi-implementation 100% branch coverage is more costly than thorough negative testing, primarily because if you have a poor test suite it can take a lot of time to work backwards from a missed branch to figure out how to trigger it. Also, missed branches are often specialized error-handling systems that can be difficult to trigger, or undocumented “can’t happen” branches used as part of defensive design. In addition, the test suite has to grow enough to cover multiple implementations at 100%; many organizations do not even try to grow a test suite to do 100% branch coverage of a single implementation, never mind 100% coverage of multiple implementations.
Software developers could aggressively insert and enable run-time assertions. There is speculation that this might have countered Heartbleed, so I will discuss this possibility here.
A software developer can assert that various value relationships or states must be true. These assertions can then be checked at run-time, at least while testing the software. Nearly all languages have a built-in assert mechanism (or equivalent) that can cause an exception or crash if a condition is not true at run-time at a specific program location. Several languages have more advanced built-in assertion mechanisms for specifying preconditions, postconditions, and invariants that can be checked at run-time (examples include Eiffel’s design-by-contract mechanisms and the Ada 2012 contracts). In some cases the language can optimize some of these assertions away, leaving the assertions it cannot optimize away at run-time. Indeed, an annotation system may be partly implemented statically, and partly implemented dynamically; see my previous comments about annotation systems for their static application.
Temporally Enhanced System Logic Assertions (TESLA) is an even more advanced research approach that allows temporal assertions. You can find further information at the website http://www.cl.cam.ac.uk/research/security/ctsrd/tesla/. Frama-C E-ACSL annotation language is a subset of ACSL; Frama-C can take E-ACSL annotations and cause run-time failures if the annotations are violated. E-ACSL support is in a preliminary state in Frama-C as of May 2014.
There is no doubt that assertions can be an excellent mechanism for detecting invalid states, and invalid states can sometimes be an indicator of a vulnerability.
However, this approach does have some weaknesses when it comes to countering Heartbleed. Neither the original developer nor its reviewer realized that checking the request packet length value was important; since the length check was not included, it is unlikely that the developer would have remembered to add assertions to check for it. This is also a problem for thorough negative testing, but negative testing is easily done by a group separate from those developing functional code, and it is much easier to ensure that (for example) all data fields are checked, so I think negative testing would be more likely to find this specific type of vulnerability. Thus, while aggressive annotations can be a very useful approach for countering vulnerabilities, it is somewhat speculative it would have worked for this particular case.
Note that aggressive run-time assertions work very well with the fuzz testing approach described earlier. Run-time assertions detect very specific problems in program state, and thus create more situations that a fuzzer can detect. Jesse Ruderman expressed the complementary relationship of fuzzers and assertions in a wonderfully pithy way:
Fuzzers make things go wrong.
Assertions make sure we find out.
I have placed this approach as a somewhat more expensive option. For this approach to have detected Heartbleed (without knowing about it ahead of time) would have required very aggressive use of assertions. Adding all those assertions would take significant development time and would typically also impose a significant run-time cost.
The underlying cause of Heartbleed is that the C programming language (used by OpenSSL) does not include any built-in detection or countermeasure for improper restriction of buffers (including buffer over-writes and over-reads). Improper restriction can often lead to catastrophic failures, so almost all other programming languages automatically counter improper restriction (e.g., by resizing data structures or by raising an exception when the buffer is exceeded).
If vulnerabilities in a given program can have catastrophic effects, then those choosing its programming language(s) should prefer the options that reduce the likelihood of vulnerabilities. The more catastrophic the effects, the stronger this preference should be. Most programming languages provide at least some direct protections against otherwise-dangerous vulnerabilities, such as improper restriction protection. Some programming languages also provide constructs that are less likely to be misused or are less likely to be incorrectly used. Ideally, a language would prevent all vulnerabilities. It is highly unlikely that a general-purpose language could ever prevent all vulnerabilities, but it is a worthwhile goal for language designers to strive for. There is no “perfectly safe” programming language; instead, there is a continuum, with some languages providing more vulnerability countermeasures than others.
The most dangerous widely-used languages for security-relevant software are C, C++, and Objective-C. All of these languages provide no built-in restrictions on buffer access, indeed, it takes a non-trivial effort to avoid problems like buffer over-reads and over-writes. Improper restrictions on buffer access continue to be a widely-used type of vulnerability that often have catastrophic effects. Using or switching to almost any other language (other than C, C++, or Objective-C) would completely eliminate buffer-related vulnerabilities, including Heartbleed. This is especially true for C, because it lacks many of the higher-level constructs that make it somewhat easier to avoid buffer-handling problems. Most languages also prevent memory deallocation errors that could lead to vulnerabilities (e.g., automatic garbage collection), and some languages are designed to counter additional vulnerabilities as well. One of the reasons there are so many vulnerabilities in modern systems is the overuse of the C, C++, and Objective-C languages. In fact, some people have proposed banning the use of these languages in security-sensitive code.
However, there are reasons that C, C++, and Objective-C are widely used. The TIOBE Programming Community index measures programming language popularity, and as of April 2014 these languages occupy three of the four top slots (the top 4 languages in order are C, Java, Objective-C, and C++). These reasons include higher performance (in speed and memory use), ease of interface, large libraries, platform preference, and familiarity. Also, switching languages for large existing programs (like OpenSSL) is usually a big effort. Let’s examine a few of these reasons.
One oft-cited issue is that programs in C, C++, and Objective-C tend to have noticeably higher speed than programs written in most other languages. In addition, most other languages lack the lower-level mechanisms that are needed if you need to directly interact with hardware (and this direct interaction can also increase speed). If you need the speed, and perhaps the direct interaction, the list of likely languages gets much shorter. And speed sometimes matters in a world of mobile devices (which have limited resources) and massive server farms (where poor performance would make them further heat their environment). The benchmarks game includes some speed analysis of various programs written in various languages [BenchmarksGame]. The posting “Approximate speed classes of programming languages” took that data and grouped languages into different tiers based on their approximate speed [Jplus2014]. No benchmark is perfect, and it is always best to measure performance for a specific situation. Still, I prefer measured numbers to big guesses, and this dataset is representative enough to get started. If performance (as measured as speed) is your most important criteria, and you do not want to write in assembly language, the other options according to that analysis are:
There are many other programming languages, especially if you’re willing to give up a little speed as determined by that benchmark. For example, Go (developed by Google) has good performance. (There’s even been some work on converting C to Go automatically, though currently that work is only focused on translating the compiler, not C in general.) Rust is another programming language you can consider. Java has reasonable performance on modern JITs, once it gets going, but there is a non-trivial startup time. Other languages that look promising by these benchmark metrics include Scala, Free Pascal, Lisp SBCL (Steel Bank Common Lisp), Haskell, C# on Mono, F# on Mono, and OCaml (depending on how you cut off the next tier). Neither the D programming language nor the Nimrod programming language are listed in that benchmark, but they are also designed for efficiency.
Of course, if speed is not critical, there are a huge number of languages available. At least one study suggests that there is no statistical difference in the number of vulnerabilities in programs written with .NET, Java, ASP, PHP, Cold Fusion, and Perl. I often use Python when speed is not important because it has a clean and easily-understood syntax. Other languages, such as Ruby and Clojure, have many fans. Scheme is powerful (and I think the readable Lisp extensions solve the readability problems often noted about Lisp-based languages like Scheme). All of these other languages are safer than C, C++, or Objective-C, in the sense that all of them protect against buffer over-reads by default.
There are just too many programming languages to list, so I’ll stop here. My goal is not to list all alternatives; my goal is to make it clear that there are alternatives.
Performance is not just about speed; memory management approaches can also be important. This is especially true on mobile devices like smartphones. C, C++, and Objective-C do not provide automated garbage collectors; many other languages include them. Developers are generally more productive (in terms of functionality over time) if they do not have to think about memory management, but in some environments that is unrealistic. Drew Crawford has a lengthy discussion about mobile device development, where he states that “automated garbage collectors work well if you have at least six times as much memory as needed, but efficiency can [be] greatly harmed if there is less than four times as much memory. iOS has formed a culture around doing most things manually and trying to make the compiler do some of the easy parts. Android has formed a culture around improving a garbage collector that they try very hard not to use in practice. But either way, everybody spends a lot of time thinking about memory management when they write mobile applications. There’s just no substitute for thinking about memory” [Crawford2013]. Automated garbage collection is deprecated in OS X Mountain Lion v10.8, and will be removed in a future version of OS X; Automatic Reference Counting (ARC) is the recommended approach instead for OS X and iOS [Apple2013]. Again, there are reasons people choose C, C++, and Objective-C.
So why are programs in C, C++, and Objective-C often higher-performance (in speed and memory management) than many alternatives? The answer is, in part, because the languages are designed to be that way. In particular, C is designed to make it possible to write programs that run quickly and work well with limited resources (e.g., little memory). The C rationale states that a key principle in C is “trust the programmer” and that “many operations are defined to be how the target machine’s hardware does it” (which impedes portability but helps performance). Also, C’s performance cost model is transparent, so a C or C++ developer can usually estimate the performance aspects of a construct before using it. Different programming languages provide different levels of abstraction, and languages with higher-level abstractions can sometimes make detailed control more difficult. Indeed, many developers have difficulties estimating the performance aspects of programs written in languages that are significantly higher-level than C. Humans do not always estimate correctly, of course, but it is often hard to achieve good performance if it is hard to estimate performance. Performance transparency is especially important in cryptography, because developers need to counter timing attacks and electrical power attacks (my thanks to Markus Armbruster for pointing out this link between cryptography and performance transparency). So merely obtaining high performance is not enough; it is sometimes necessary to ensure that timing or power variances are small, yet few tools provide these measures. Cryptographic libraries can be written in other languages, of course, but other complications can arise depending on what language is used. Of course, implementations greatly vary in their performance; heavily-optimized compilers and run-times can achieve great performance compared to compilers and run-times that are not as heavily optimized.
Many developers choose C, C++, or Objective-C to simplify interfacing with other components. Many useful utilities have C interfaces, and most language infrastructures can call libraries written in C. However, many programming language systems have easy ways to both call C routines and to be called by other systems using C interfaces. Thus, this isn’t as important a reason today to choose these languages.
Developers using C, C++, and Objective-C can reduce their risks in various ways, such as using less-risky library functions and using language subsets. These merely reduce the risk somewhat, not eliminate it; it is really easy to make a mistake even when using these facilities. Still, risk reduction can be valuable.
Creating library functions that reduce the likelihood of security vulnerabilities, and then using them, can help reduce the risk of vulnerabilities. This is especially true if these functions are part of the standard library for these languages, since these will tend to be widely-understood, portable, and well-supported. Here are a few examples for various languages:
You can also use a subset of some language (including C, C++, and Objective-C) with the goal of increasing security (e.g., by avoiding dangerous constructs). However, when designing subsets it is important to measure actual problems, since otherwise the subset may be unhelpful or even counter-productive. Les Hatton has developed EC-- (a safer subset of C) based on measurements of actual code [Hatton2003]. Les Hatton has also published devastating critiques of C subsets like MISRA C 1998 and MISRA C 2004, where he finds that “both versions of the MISRA C standard are far too noisy to be of any real use... [the] real to false positive ratio is not much better in MISRA C 2004 than it was in MISRA C 1998 and it is unacceptably low in both” [Hatton2005].
You can write insecure software in any language. For example, vulnerabilities to SQL injection are another common weakness, and this can occur in practically every language (including C, Java, and many others). However, most languages provide mechanisms (like prepared statements in a built-in or external library) that are easy to use and completely avoid common problems like SQL injections. Yes, you have to be careful to use these mechanisms... but that is true for C, Java, and most other languages.
Selecting safer languages does not avoid all problems that can occur in unsafe languages. Safer languages are typically implemented on top of infrastructures and libraries that are themselves written in unsafe languages like C, C++, or Objective-C. The run-time libraries of most language implementations today are written in C, many libraries are written in C, most applications eventually call down to the C run-time library, and practically all operating system kernels in wide use today are written in C. Still, it is a matter of degree; vulnerabilities specific to C, C++, or Objective-C can only occur in the parts written in those languages, so selecting safer languages can reducing overall risk. (I knew of this long before, but my thanks to David Ramos at Stanford University for encouraging me to add this information.)
Some languages (like C# and Ada) normally counter vulnerabilities but have escape mechanisms that let you temporarily disable protections (e.g., both can temporarily disable buffer over-read protection). But these escape mechanisms are easily found and are isolated in well-written code; they help reduce the overall risk by isolating unsafe actions to small, easily-reviewed portions. In most languages, people have to work to create buffer over-read vulnerabilities (if they are possible at all). In contrast, in C, C++, and Objective-C, you have to do extra work to avoid buffer over-read vulnerabilities.
It’s also true that other languages sometimes have other additional challenges when using them to create secure software. For example, I know of no way to securely erase data inside Java. This is because Java lacks functionality like .NET’s SecureString; Java structures (like a String, raw array, or StringBuffer) may end up copied several times in memory due to how memory allocation and garbage collector work. This is not unique to Java; it’s hard to securely erase data in many languages. This is in contrast to C; C developers often get it wrong (the naive approach allows C compilers to remove the erasure), but it is possible to securely erase data in C. However, in Java (and some other languages), it is relatively easy to solve this by creating a small non-Java module to securely erase a few values; the rest of the program is still protected against buffer over-reads and over-writes. Besides, exploiting additional memory copies requires a significant amount of access to a program or its environment. Secure erasures are often a useful damage-reduction measure (such as the damage caused through Heartbleed). In contrast, buffer over-reads and over-writes can sometimes be exploited directly with far less access, sometimes simply through a network connections. In general, failing to restrict buffers is far more dangerous than the other problems that most other languages add.
It is more difficult to write secure software in C, C++, and Objective-C. Most languages have built-in and complete protections against buffer over-reads and over-writes... but C, C++, and Objective-C are notable exceptions. On the other hand, it should be obvious why they are used.
Whenever a new security-relevant program is begun, the programming language should be carefully considered. Choosing a safer language, where reasonable, can automatically eliminate entire sets of potential vulnerabilities - including the buffer over-reads that permitted Heartbleed. Also, computers are much more powerful than they were historically; in many cases some performance can be traded away. What’s more, when starting to write a new program, using another programming language is nearly zero-cost. I believe there are cases where less-safe languages are appropriate, and rewriting code takes a lot of effort. However, using almost any language other than C, C++, or Objective-C will at least restrict operations to be within their boundaries, and vulnerabilities due to improper restriction often have extremely large impacts.
Choosing a safer language at the beginning of a project, especially a widely-used language with at least one FLOSS implementation, has very low or no cost. I have identified safer languages as a higher-cost approach because switching a non-trivial program (like OpenSSL) to a different safer language is a big effort.
A complete static analyzer, also sometimes called a sound static analyzer, is designed to find all vulnerabilities of a given category. Creating these kinds of tools is rather challenging for C, but it is possible. However, these analysis tools often have to limit what constructs they can use, and developers typically have to limit their programs to work within those constructs and/or provide additional annotations to provide additional information the tool needs. For example, Astrée is a static program analyzer that aims to prove the absence of Run Time Errors (RTE) in C programs, including out-of-bounds array indexing. However, Astrée requires that the program avoid dynamic memory allocation and recursion. Note that the current OpenSSL implementation strongly depends on dynamic memory allocation.
Tools that focus on finding everything will often report issues that are not vulnerabilities, which must then be analyzed to determine if they are actually vulnerabilities. However, if it is critical to counter all vulnerabilities, that trade-off may be worth making.
A thorough independent human review of software, specifically focused on ensuring security and finding vulnerabilities, is typically an excellent way to find vulnerabilities. These reviews, aka an audit, should presume the software is vulnerable until shown otherwise.
Neel Mehta of Google was one of the two discoverers of Heartbleed. According to comments on Y Combinator’s “Hacker News”, Neel Mehta found Heartbleed by auditing code [DrewHintz].
The notion that thorough human review is often better than tools that use heuristics to find vulnerabilities makes intuitive sense, but more importantly, experimental data confirms it. For example, Kupsch and Miller found that a human audit using their First Principles Vulnerability Assessment (FPVA) approach was far more comprehensive on a sample program than using Coverity Prevent and Fortify Source Code Analyzer (SCA) [Kupsch2009]. All human reviews can unintentionally fail to identify vulnerabilities, but such reviews can do quite well. In the Kupsch experiment, the FPVA human review found 15 serious vulnerabilities; Fortify found 6, Coverity found 1, and neither automated tool found a vulnerability not found by the human review.
But the downside of human review should also be obvious: It takes effort and expertise to do this kind of review, and changes also require review. Human review is simply not practical to do for all software, and even when it is, the challenge is finding a way to fund it.
A minor downside is that human review can only review a specific version, yet software changes over time. This, however, is not serious problem. A good review will not only specifically analyze the software to see if it is correct, but will also identify systemic changes that would greatly reduce the likelihood of vulnerabilities even as the software changes.
Note that this kind of audit is different than a typical simple review of a patch before acceptance. The addition that created the Heartbleed vulnerability was created by developer who was trying to avoid vulnerabilities, and it was accepted by another reviewer. However, patch reviews usually simultaneously review the functional improvement and look for vulnerabilities, so it is much easier for them to miss vulnerabilities. As noted earlier, a human review requiring validation of every field for each patch would probably have caught this... but now that the code exists, merely reviewing new patches will not be enough. Trying to review every patch separately at this point is probably not cost-effective. Also, patch-by-patch review can miss global problems that might be otherwise missed. A separate review, focusing on vulnerabilities across an entire system, is often more effective.
In many cases the software should be modified and simplified before this review/audit takes place, and I think that is true of OpenSSL especially. Programs that are excessively complicated are difficult for both tools and humans to properly evaluate. OpenSSL uses unnecessarily complex structures, which makes it harder to both humans and machines to review.
This kind of review really does happen. Indeed, around the same time as Heartbleed was announced, a security review of a key part of TrueCrypt was released [Junestam2014] [Ritter2014]. (TrueCrypt’s odd license is probably not FLOSS, and it abruptly shut down in May 2014 with a possibility of revival. It is still an independent audit of software with publicly-viewable source code.) More recently, the Linux Foundation’s Core Infrastructure Initiative will fund an audit of OpenSSL by the Open Crypto Audit Project (as well as provide enough money to the OpenSSL project to hire two full-time developers).
In short, thorough independent human review takes significant effort and expertise, but it can produce great results.
But what if you really want to be essentially certain that a program does exactly what it is supposed to do? There is a set of approaches called formal methods that can give far greater confidence than any of the techniques listed above. Formal methods involve the use of “mathematically rigorous techniques and tools for the specification, design and verification of software and hardware systems” [Butler]. Given the difficulties of applying formal methods, they are more likely to be small programs or modules at the moment, but it is definitely possible to apply formal methods. Besides, if you really want to have extremely high confidence that a program does or does not do something, formal methods are still the only way to achieve that confidence.
There are various ways to apply formal methods. Some people only use formal methods to create specifications, and do nothing more with formal methods (this is called level 0 or formal methods lite). Some people may go a little further, proving some statements about the specifications or refining the specification towards a more concrete model (aka level 1). Neither of these approaches would have found Heartbleed. For Heartbleed, formal methods would have to be applied to create proofs about the code itself (aka level 2), at the source or executable code level, so that is what I will focus on here.
In general, proving claims about code itself requires that you create a formal specification (describing what the code should do), and then a proof that the code meets that specification. Just about any specification would have sufficed to find Heartbleed, though. Trying to read from an out-of-bounds area (an “improper restriction”) is undefined in C, and in most systems you cannot prove anything if the program permits an undefined activity. Thus, trying to prove anything would force you to prove it could not read beyond... and you would not be able to do it. (Some systems can sometimes give you counter-examples; a counter-example would immediately reveal this problem.) The attraction of formal methods is that you can go far beyond this; with formal methods you can prove that the program always does something specific when it runs.
In practice proofs about programs typically involve systems that allow annotation of code. It’s worth noting that annotation systems can often be used in a variety of ways short of formal proof. For more information, see my previous comments about annotation systems.
If you’re interested in more, especially more information about the FLOSS tools available that support formal methods, see my presentation on formal methods from my class on developing secure software. An interesting suite for formal methods is Toccata (formerly ProVal), which combines Frama-C, Why3 (formerly Why), and many automated and interactive tools. By combining many different tools it potentially makes it possible to prove programs correct with far less effort than before. What’s more, they can handle a large subset of C (as well as other languages); most formal methods systems cannot. SPARK 2014 is based on Ada but lets you prove claims about programs, and they have recently connected it to Toccata. Recent advances in algorithms (e.g., SAT solvers) and greatly increased distributed hardware performance are gradually making these approaches easier to apply (not easy, but easier).
Examples of formally verified programs other than cryptographic libraries include seL4 (a proven implementation of the L4 microkernel interface), CompCert C (a compiler for a subset of C), cakeML (a compiler for a subset of ML), Tokeneer (an identity system that primarily serves as a complete demonstration of using Z and SPARK), and iFACTS (part of an air traffic control system using Z and SPARK).
The OpenSSL implementation of SHA-256 has been formally proved as a full formal machine-checked verification. The proof was created using the Coq proof assistant and the Verifiable C program logic (a separation logic for C). [Appel2014]
I have tried to create a complete list, but there may be other approaches that would have detected the Heartbleed vulnerability ahead-of-time. People are continuously developing new approaches, of course. In addition, sometimes it is not clear that a particular approach would have found Heartbleed ahead-of-time.
For example, is not at all clear to me that current concolic testing tools would have found the Heartbleed vulnerability. In at least some cases it seems doubtful. CREST, for example, is an automated test generation tool for C that uses concolic testing. CREST is FLOSS, though it depends on at least one non-FLOSS component. However, CREST currently only reasons symbolically about linear integer arithmetic, so it seems unlikely that it would have worked with OpenSSL. If anyone can confirm or disprove that a concolic tester would have found Heartbleed, please let me know.
It is possible that the results of the STONESOUP research program, such as the Preventing Exploits Against Software of Uncertain Provenance (PEASOUP) project, could have countered Heartbleed. However, I have not been able to confirm a STONESOUP project that (1) countered Heartbleed and (2) was ready and available for production use at the time instead of just research use. If anyone has more information that they would like to share with me, especially if they are willing to make it public, please let me know.
Many of these techniques have important preconditions; let’s talk further about each.
Many of the static techniques for countering Heartbleed-like defects, including manual review, were thwarted because the OpenSSL code is just too complex. Code that is security-sensitive needs to be “as simple as possible”.
Many secure software developers suspect that first using “software quality” tools to detect especially complicated structures, and then simplifying those structures, is likely to produce more secure software. The idea is that static analysis approaches (both automated tools and manual human review) tend to have problems with complex code; using tools to detect some of that complexity, and simplifying the code, may make the static analysis approaches more effective. I suspect that they are right, but I have not seen any rigorous data that supports it. Thus, this seems like a plausible idea, but I hope that someone will eventually create and publish some scientific research to support or refute this hypothesis.
In any case, simplifying code is more than running software quality tools. It is a mindset; there should be a continuous effort to simplify (refactor) the code, because otherwise just adding capabilities will slowly increase the software complexity. The code should be refactored over time to make it simple and clear, not just constantly add new features. Little things like code formatting matter, since badly-formatted code is much harder for humans to review. The goal should be code that is obviously right, as opposed to code that is so complicated that I can’t see any problems.
Overly-complex code often leads to vulnerabilities. In 2006 Debian accidentally broke the OpenSSL pseudo-random number generator by modifying the software to eliminate a valgrind warning. However, the person modifying the software did not really understand it. That person asked for help, but the very complexity of the OpenSSL code made it hard for others to realize that the change introduced a vulnerability. Cox reviews what happened and concludes: “Try not to write clever code. Try to write well-organized code. Inevitably, you will write clever, poorly-organized code. If someone comes along asking questions about it, use that as a sign that perhaps the code is probably too clever or not well enough organized. Rewrite it to be simpler and easier to understand.” [Cox2008].
The LibreSSL developers have taken the OpenSSL code and are specifically working to simplify the code. LibreSSL - An OpenSSL replacement (by Bob Beck) describes some problems of the OpenSSL codebase (from the viewpoint of the LibreSSL project fork). They are reformatting the code to make it much easier to understand. They are doing many sensible things too, such as removing code to support long-obsolete VAX VMS systems. The opensslrampage.org website captures some of the comments by the LibreSSL developers. Some of their comments are over-the-top (after all, they’re trying to justify why people should use their fork instead), but many of their comments are completely legitimate. For example, OpenSSL creates portable C code in poor ways (e.g., by creating a complex nest of #ifdef clauses and a set of functions that have similar names yet different semantics compared to standard functions). There are much better ways to create secure yet portable programs, e.g., see [Miller2005]. However, the LibreSSL developers are also removing code people care about. For example, they are also removing support for FIPS 140-2 validation, which is required for US government use and something many private companies want. There is always a conflict between making the code simple and making the code useful in many circumstances; the simplest code does nothing! Still, it is clear that many programs (including OpenSSL) could be made much simpler than they currently are.
Although it is slightly out-of-scope for this paper, a related problem is that application program interfaces (APIs) are often absurdly complex or difficult to use.
Most cryptographic libraries and data-transport libraries are absurdly complex. They often present to developers a “confusing array of settings and options”. As a result, a vast number of applications and higher-level libraries incorrectly use the cryptographic libraries, resulting in vulnerable systems. Most of these problems have been worked out in web browsers, but they continue to be a problem in all other code. For more information, see “The Most Dangerous Code in the World: Validating SSL Certificates in Non-Browser Software” [Georgiev2012].
Although this is technically not a vulnerability in the SSL/TLS implementation, that is irrelevant. The cryptographic library is the component that creates the overly-complex interface, so as a result, it is still the component at fault.
A related problem is that underlying libraries and systems that cryptographic libraries are built on have APIs that are often overly difficult to use. The C standard omits important functionality like asprintf and reallocarray (reallocarray is useful!). As a result, programmers have to work around these omissions, but their solutions often lead to bugs. Some of these bugs, unfortunately, lead to vulnerabilities.
Secure programs should allocate and deallocate memory normally, without special program-specific allocation systems or memory caching systems. At the very least it should easy to disable them, and testing should ensure that disabling them works. Some of the techniques for mitigating the effects of Heartbleed appear to have been thwarted because of the way that OpenSSL allocated memory.
The basic problem was that OpenSSL included an application-specific caching freelist of unallocated memory. Its purpose was to speed up allocations when the same size is repeatedly requested. By default OpenSSL did allocate memory normally (going through malloc), However, instead of deallocating when a memory region was no longer in use, in some cases it put the region into a freelist of unused regions, making it ready for immediate reuse. This cached freelist unfortunately subverted some mitigation mechanisms in some operating systems and C run-times, because they were not always notified when a memory region was no longer in use.
As Theo de Raadt noted, “years ago we added exploit mitigations counter measures to libc malloc and mmap, so that a variety of bugs can be exposed. Such memory accesses will cause an immediate crash, or even a core dump... [once the problem is revealed] then the bug can be analyzed, and fixed forever. Some other debugging toolkits get them too. To a large extent these come with almost no performance cost. But around that time OpenSSL adds a wrapper around malloc & free so that the library will cache memory on [its own].... OH, because SOME platforms have slow performance, it means even if you build protective technology into malloc() and free(), it will be ineffective. On ALL PLATFORMS, because that option is the default, and Ted’s tests show you can’t turn it off because they haven’t tested without it in ages. So then a bug shows up which leaks the content of memory mishandled by that layer. If the memory had been properly returned via free, it would likely have been handed to munmap, and triggered a daemon crash instead of leaking your keys.” [de Raadt] [Ted]
There seems to be a lot of confusion about what exactly went wrong with OpenSSL’s memory allocation approach, and Chris Rohlf has made a number of useful clarifications [Rohlf2014]. I think these clarifications are important, because we must first understand the problem before we can fix it. In particular, Rohlf points out that OpenSSL did use the standard malloc() C memory allocation routine (by default) when it wanted a whole new memory block. The issue is that once a memory block was allocated, OpenSSL itself managed that memory further. Rohlf also (correctly) points out that in most typical environments this use of a freelist is irrelevant; a freelist puts different memory allocations together, but many typical memory allocation systems also put different memory allocations together. However, while Rohlf is absolutely correct that typical memory allocation implementations do the same thing, the point is the OpenSSL implementation seriously impeded various mitigation measures. There is another confusion about the memory allocation system of OpenSSL, but first we need to go over some basics.
A common approach is to handle at least some memory allocations and deallocations specially. The idea is to cache and reuse some objects or memory regions when they are no longer in use; in some cases this approach can significantly improve performance. More specific examples of this approach include specially handling a common memory allocation size, and/or reusing objects or memory regions by holding a cache of unused ones. There are a number of specific techniques for doing this, including creating an object pool, as well as creating a slab allocator. The Glib library (the basic support for GTK+) includes a mechanism called a memory slice to improve memory allocation performance. Many graphical user interfaces (GUIs) and programs that are not security-sensitive use these approaches.
It turns out that some of these approaches can disable some detection tools like address sanitizer (ASan), electric fence, and valgrind. This is particularly a problem for fuzz testing; if these tools are disabled, then fuzz testing becomes much less effective. Indeed, fuzz testing may not be able to detect many out-of-range reads when these approaches are in use.
Earlier reports about OpenSSL seemed to suggest that OpenSSL was using an approach that self-managed a larger region of memory and then subdivided it further. This is what happens in a slab allocator or memory slicing implementation. This approach, intended to improve performance, would foil ASan and guard page systems, and thus, could inhibit detection of over-reads entirely in those cases. Older versions of this paper reported that this seemed to be what was happening. However, I have since delved more into the OpenSSL code, and this does not seem to be true in OpenSSL. That is good news of a sort for Heartbleed. Still, these kinds of allocation schemes are relatively common, and I know of no one who has warned anyone about the risks of these approaches.
Security-sensitive software should avoid using memory caching systems, and should never combine multiple allocations into one underlying allocation request (as is done by a slab allocator or memory slicing implementation). At the very least they should provide an easy well-documented mechanism to disable caching/combining memory allocations and include tests of that configuration in its regression test suite (to ensure it is tested). OpenSSL had a disabling mechanism, but it no longer worked (because no one tested it), and in any case few people understood that these memory allocation mechanisms could disable many capabilities of security analysis tools.
We also need to modify our educational material so that developers and tester will know that memory caching systems can seriously hamper security analysis. I have already done this in my presentation materials for my class on developing secure software; others need to do the same.
In the longer term, perhaps there should be some standard interfaces for caching freelists / slab allocators in languages such as C. If there were standard interfaces then tools could easily be modified to automatically adjust to them.
This is speculative on my part, but I believe that much more code review and many more contributions would occur if OpenSSL used a standard widely-used license. OpenSSL uses an odd variant license that is incompatible with the GPL and LGPL [McLoughlin2004] [GNU-Licenses]. This is weird, because the GPL is the single most common FLOSS license. This incompatibility is worked around through a license loophole in many cases, or by making an explicit license exception in the software that uses OpenSSL. However, this awkward licensing situation means that many people who prefer the GPL or LGPL will often not help develop or audit OpenSSL. Some of those who prefer less-restrictive licenses may also be less inclined to help, because again, it is not a standard license.
I do have some evidence that the non-standard licensing is a problem. A completely separate software package, GnuTLS, was initially specifically created so that software using a standard GPL license could easily use SSL/TLS. The LibreSSL fork of OpenSSL appears to be switching to the 2-clause BSD license (a more common license) when they write new code, in comparison to the OpenSSL license. The OpenSSL developers themselves appear to be aware of this problem; Google’s “BoringSSL” announcement mentions that “We have already relicensed some of our prior contributions to OpenSSL under an ISC license at their request and completely new code that we write will also be so licensed”. The ISC license is a much more common FLOSS license, and is functionally equivalent to the 2-clause BSD and MIT licenses.
It’s been known, for a long time, that using a widely-used FLOSS license is important for FLOSS projects. Bruce Perens noted back in 1999, “do not write a new license if it is possible to use one of the ones listed here. The propagation of many different and incompatible licenses works to the detriment of Open Source software because fragments of one program cannot be used in another program with an incompatible license” [Perens1999]. Later on the Open Source Initiative (OSI) created the License Proliferation Project, noting that many licenses “were legally incompatible with other free and open source licenses, seriously constraining the ways in which developers could innovate by combining rather than merely extending Open Source software”. A key result is that OSI now directly lists, in its Open Source Licenses page, only Popular Licenses, which are “popular, widely used, or have strong communities”.
Most FLOSS is released under the GPL, LGPL, MIT/X, Revised BSD (aka BSD-new or BSD 3-clause), BSD 2-Clause (aka simplified or FreeBSD), or Apache 2.0 licenses, and I recommend limiting new FLOSS programs to that list of licenses. Yes, you could add a few more; the OSI’s popular license list (as of 2014-05-01) includes those and a few more. However, the point here is that the OpenSSL license is simply not a common license. Most importantly, it is a non-standard license that is known to be incompatible with some of the most widely-used licenses. Where possible, it’s best to stick to common licenses instead.
So what would likely have reduced the impact of Heartbleed-like vulnerabilities, rather than eliminate it entirely? After all, when a vulnerability slips through, you would like to reduce its impact. Below are a few approaches.
Many systems include memory allocation mechanisms (by default or enable-able) that reduce damage, and some may sometimes detect problems too. This is no guarantee against problems, but it might reduce the impact. For example:
Programs should overwrite (destroy) critical information whenever they are done with it. Critical information includes passwords and private cryptographic keys. This reduces the impact of a vulnerability, since once information is destroyed it cannot be revealed.
Be sure this is not optimized away; most compilers will eliminate such overwrites if you don’t take steps to avoid it. This is such a common mistake that it has been assigned a weakness id of CWE-14 (Compiler Removal of Code to Clear Buffers). It is specifically identified as CERT C Coding standard recommendation MSC06-C (beware of compiler optimizations) and C++ Secure Coding Standard recommendation MSC06-CPP (Be aware of compiler optimization when dealing with sensitive data). My own book Secure Programming for Linux and Unix HOWTO includes a section specifically discussing how to Specially Protect Secrets (Passwords and Keys) in User Memory.
Let’s see why this is a problem. A naive programmer who wanted to erase memory in C might choose the standard C function memset(). However, if memset() is used to erase sensitive data, the program would normally not use that memory again. Why would it? That data has been erased! However, modern compilers will typically notice that the memory isn’t being used again, and will will silently remove the memory erasure code. After all, if the memory won’t be used again, it’s a waste of resources to erase it. This is not a compiler error; compilers are explicitly allowed to do this, and it is even enshrined in the C and C++ standards. The problem is that the compilers were given incorrect information. Software developers need to specifically tell the compiler that this erasure is not a no-op, and that therefore this optimization should not be done.
A cryptographic system has perfect forward secrecy (PFS), aka forward secrecy, when it non-deterministically generates new random public keys for each session. When PFS is not used, the compromise of a private key typically enables attackers to decrypt past communications that they have recorded. When PFS is enabled, messages are not necessarily exposed when some keys are exposed, because there is no single secret value used for all messages.
It can be helpful to separate critical cryptographic secrets from the rest of the code, so that even if even the rest of the program is subverted it cannot directly access secrets like private keys.
This applies the basic security principle that programs should be provided only most limited privilege necessary to their job. These approaches can reduce the likelihood or impact of a critical vulnerability by reducing the amount of software where a vulnerability would reveal critical information like a key. However, I am listing it as a risk reduction approach, not as a full countermeasure. Somewhere code must have access to privileged data and that code might be vulnerable. Here are a few additional notes:
The entire SSL/TLS infrastructure has a large number of serious problems, including a large number of often-untrustworthy root certificate authorities, poorly-vetted certificates, and a badly broken certificate revocation process [Goodin2011]. As a related point, Qualsys SSL labs has posted an SSL threat model. Moxie Marlinspike’s SSL And The Future Of Authenticity argues that “the current problems with the [SSL/TLS Certificate Authority] system can be reduced to a single missing property. I call this property trust agility” [Marlinspike2011]. “Assessing legal and technical solutions to secure HTTPS” outlines systemic vulnerabilities of HTTPS (from both a legal and technological perspective); they conclude that vulnerabilities will likely persist for years to come, due to problems such as perverse incentives [Arnbak2014].
For the moment, let’s specifically focus on the broken certificate revocation process, which is of special concern when trying to respond to Heartbleed. The Heartbleed vulnerability made it possible for attackers to exfiltrate the private keys for their certificates, which is really bad. In theory, vulnerable sites can solve private key exposure by deploying new certificates and revoking the old certificates (once the underlying cause, like Heartbleed, is fixed). Unfortunately, today’s certificate revocation mechanisms are fundamentally broken, and few people (until recently) have paid much attention to the problems. As a result, today attackers can often to cause web browsers to accept certificates even after they have been revoked. There is a critical need for a push to create security standards in this area that work far better than currently-available mechanisms, and then implement them by default everywhere. I personally think that there should be a determined push for X.509 OCSP must-staple, but whatever the solution turns out to be, we need one.
This is a complicated area; I will try to just summarize the issues. Here are a few certificate revocation mechanisms, and the problems with them:
Mistakes will happen; organizations must plan to quickly repair them when necessary. Yet many organizations are unprepared for the inevitable discovery of vulnerabilities in the systems that they sell, provide, or maintain. Shodon (an Internet-of-Things search engine) estimates that 200,000 devices connected to the open Internet were still vulnerable to Heartbleed over 18 months after it was publicly announced.
For example, most users of the older Android version 4.1.4 are currently vulnerable to Heartbleed. That’s because although Google fixed this version of Android quickly, the phone manufacturers are sometimes very slow at updating the phone, and then the phone service carriers typically delay updates for unconscionably long times (or fail to provide them at all). It may be time to talk about holding manufacturers and carriers liable for failing to repair phones they’ve sold in a timely way when there is a known fix for a security defect.
Similarly, I think the FIPS 140-2 validation process needs to be changed to allow for rapid update of libraries when a vulnerability is found. Steve Marquess argues in “The Immutability of FIPS” (updated 2014-03-29) that “the single most distinguishing (and IMHO deplorable) feature of FIPS 140-2 validation is the almost total prohibition of changes to validated modules.”
Of course, continue to store passwords as salted hashed values, and not as cleartext or as a reversible value. If you are storing passwords as cleartext or a reversible value, and you do not have a compelling reason, you should probably be fired.
I should at least mention that various general improvements would likely reduce the number of exploitable vulnerabilities in general, and thus should be applied. This includes secure software education/training (which should reduce the total number of vulnerabilities in software) and reducing attacker incentives (which should reduce the total number of unreported vulnerabilities that attackers have in their arsenals).
Many software developers still get little education or training in how to develop secure software. Yet nearly all software is either connected directly to a network, or at least receives data through a network. Many developers do not know how to design software in a way that resists attack, what the common weaknesses are, or how to correctly counter them. In fact, many developers do not even know the difference between security software and secure software. The material is available, indeed, I give away a book on how to develop secure software [Wheeler2004]. But developers have to learn and apply this knowledge.
We also need to reduce the economic incentives for attack, or at least, the economic incentives for finding vulnerabilities and selling them to attackers. In many cases people will not tell defenders about vulnerabilities because they can make more money selling information to attackers. After all, people might legally make over a million dollars selling information on a single vulnerability. Bug bounty values simply cannot keep up in the current environment. We need to investigate ways to reduce these incentives. For example, perhaps we should criminalize selling vulnerability information to anyone other than the supplier or the reporter’s government. Basically, treat vulnerability information like organ donation: intentionally eliminate the economic incentives in a specific area for a greater social good. I think it’s impossible to prevent a citizen from telling his country’s government about a software vulnerability; a citizen could easily see it as his duty. I also think no government would forbid buying such information for itself. But additional constraints might reduce the number of people actively looking for vulnerabilities to exploit instead of to fix. Obviously there are some people will do illegal things, but some people will avoid doing illegal things in principle, and others will avoid illegal activities because they fear getting caught. You don’t need to stop all possible cases, just enough to change the economics. Maybe there’s a better way. If so, please propose it! In any case, we need to find a way to make capitalism work for instead of against security [Wheeler2013].
As I’ve noted earlier, developing secure software requires a combination of tools and approaches. When there’s a failure, as clearly happened here, we need to learn why it failed and try to do better (based on what we learn).
This has not been a good time for cryptographic libraries. These libraries are vitally important, yet many serious problems have been found in 2014:
The Register ran the article Annus HORRIBILIS for TLS! ALL the bigguns now officially pwned in 2014.
Many people depend on cryptographic libraries, but evaluating them requires specialized expertise. The US government has established a process to evaluate cryptographic modules, called FIPS 140-2. (There’s a difference between a module and a library; stay tuned.) I think it is a good idea to have a broad process for evaluating these items that are vitally important and difficult to evaluate.
There is an important thing to understand, though: the FIPS 140-2 process does not currently examine cryptographic protocols, including implementations of SSL/TLS. Instead, the current FIPS 140-2 process only evaluates the underlying cryptographic algorithms and schemes in the cryptographic module it examines. I did not make that clear in older versions of this paper; I hope this clarifies things. As stated in Implementation Guidance for FIPS PUB 140-2 and the Cryptographic Module Validation Program (April 25, 2014) (referenced via http://csrc.nist.gov/groups/STM/cmvp/standards.html), “the cryptographic modules may implement various protocols known in the security industry. The examples of such protocols are IKE, TLS, SSH, SRTP, SNMP and TPM, listed in NIST SP 800-135rev1... FIPS 140-2 and its Annexes do not address protocols. Only the cryptographic algorithms (such as, for example, AES or DSA) and schemes (such as the key agreement schemes from NIST SP 800-56A or the RSA-based key encapsulation schemes) that are Approved and allowed may be used in the Approved mode of operations.”
In fact, OpenSSL can keep its FIPS 140-2 validation (#1747) even after it fixes the Heartbleed vulnerability, due to some odd technical details. The FIPS 140-2 process didn’t evaluate the normal OpenSSL code, but instead only evaluated a specially-crafted software module called the OpenSSL FIPS Object Module. The OpenSSL FIPS Object module (aka “FIPS module”) has the same interface, and is derived from the OpenSSL code, but it is a separate component. Users who enable “FIPS mode” end up running the FIPS module code instead of the regular OpenSSL code for certain cryptographic capabilities. Thus, the OpenSSL FIPS Object module can maintain its FIPS 140-2 validation, even after it was changed to fix Heartbleed. That’s because the FIPS 140-2 process does not evaluate protocols like SSL/TLS, and because the code that must be changed to fix it are not part of the FIPS module. Note that this is completely different than what happens in other cases. Normally any change at all to a cryptographic module can lead to loss of validation, and it can take a long time (say 6-12 months) and significant money to get a new validation. This risk of loss creates a perverse incentive: cryptographic module developers have a strong incentive to leave problems unfixed. If this is true, then I’m sure the OpenSSL developers and FIPS users are happy... it means that Heartbleed can be rapidly fixed while keeping FIPS compliance (for those who need it). I confirmed this understanding with NIST on 2014-05-09 (thank you very much!).
So by design, the FIPS 140-2 process did not examine the SSL/TLS implementation, or anything involving cryptographic protocols. I think it’s fair to ask, however, if the process should be looking at cryptographic protocols. Properly evaluating cryptographic protocols requires the same sort of specialized knowledge as with other cryptographic code. At the least, it’s fairly easy to create and run a big test suite against a standard protocol like SSL/TLS (that’s because there is a standard external interface). It’s also surprising that the validation process didn’t discover the failure of the random number generator (Dual EC DRBG). I think the FIPS 140-2 process should at least dynamically test every supported cryptographic random number generator to ensure that they are statistically random and that they avoid common mistakes (e.g., that they are not implementations of inappropriate algorithms like linear congruential generator or Mersenne twister). This would cost practically nothing and provide some confidence. If FIPS 140-2 process is not extended to review cryptographic protocols, then something should be established to review them. In general, we need to re-examine FIPS 140-2 to make it faster, less expensive, and more thorough. The current process makes it difficult to update libraries, for example, leading at least one person to conclude that you can be secure or compliant, not both.
No testing process can find all problems, so expecting perfection is not reasonable. Still, there are a variety of newer techniques that might increase the speed, reduce the cost, or improve the testing thoroughness when evaluating cryptographic code. Even more importantly, we all need to learn from the attackers, and improve current evaluation processes to counter the vulnerabilities that are currently slipping through. My goal is not to bash various existing evaluation processes; evaluation is a hard job. My goal is to figure out what can be done to improve things, because cryptography is important in today’s world.
The world needs rigorous and transparent processes for testing cryptographic protocols (including their supporting algorithms), one that allows for rapid updates given security flaws (when they happen) and openly updates its processes to prevent recurrences. I could easily see a FLOSS project to create a much more rigorous test suite that could be used by cryptographic library implementers and concerned users (including governments and their evaluation processes).
I should briefly note that in June 2014 OpenSSL released a roadmap on how they will improve things (Ars Technica commented on the OpenSSL roadmap). The OpenSSL developers plan to do some of the things noted in this paper, e.g., they plan to reduce library API complexity, make the coding style consistent, perform code review on all new commits, audit the current code base, and “regularly audit the code using appropriate analysis tools”. At this time the roadmap does not commit to fixing the OpenSSL licensing problems, which is unfortunate. I think some of the OpenSSL problems stem in part from their weird incompatible license; why would people want to help a project when many FLOSS projects would have trouble using the results?
I suppose I should briefly mention FLOSS in general. Some people have tried to claim that Heartbleed is somehow proof that FLOSS cannot create good software. This doesn’t make sense; legions of vulnerabilities have been found in proprietary software, and the 2013 Coverity Scan Report found that on average “open source code quality surpasses proprietary code quality in C/C++ projects [as measured by average defect density]” [Coverity2014]. In reality, some FLOSS programs are secure, while others are less so, just like proprietary software. FLOSS has some potential advantages for security, but they are only a potentiality. You still have to examine particular software to determine if it is appropriate for your needs... and that includes its security.
Some people have argued that access to source code does not help improve security, but Heartbleed is a good disproof of this. Heartbleed was initially found in exactly the way that FLOSS advocates say FLOSS security works. Heartbleed was first found by Google, who did not write the original code, using thorough human review / audit of the publicly-available OpenSSL source code. It took two years - longer than anyone would like - but Heartbleed is clear evidence that releasing software as FLOSS can lead to finding and fixing vulnerabilities.
It is true that Eric S. Raymond has claimed the following as “Linus’ Law”: “given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix will be obvious to someone” - or less formally - “given enough eyeballs, all bugs are shallow” [Raymond1999]. But this doesn’t mean what some people want it to mean. First, Raymond is actually contrasting different ways of developing FLOSS; he’s not talking about FLOSS vs. non-FLOSS. Second, notice that the more careful phrasing requires a “large enough co-developer” base; cryptography involves specialized issues that make it harder to get a larger developer base, and the non-standard OpenSSL license has probably also inhibited the number of co-developers. Third, note the text “almost every problem will be characterized quickly”; he never claims that all problems will be found, or that they will all be characterized quickly. Finally, as this paper hopefully makes clear, it turns out that most of the traditional techniques for finding or countering these problems (including dynamic and static analysis) didn’t work, even though a number of tools and approaches were used to analyze OpenSSL. Instead, there were systemic problems finding vulnerabilities like this, because different people started with similar assumptions. The good news is that, because this vulnerability and its causes can be openly discussed, we can identify and fix these systemic problems. In a larger setting, this paper represents the work of people to identify systemic problems, and fix them, so that vulnerabilities like it will be much more likely to be characterized and fixed quickly.
This paper focuses on how to respond technically. However, in many ways non-technical issues matter more. Summer Maynard’s “What Heartbleed Can Teach The OSS Community About Marketing” shows how the Heartbleed vulnerability was marketed (primarily because it “has a name, a logo, and a dedicated web presence”). That marketing was really important; Heartbleed was a bad vulnerability, but the excellent marketing markedly sped up the response and greatly reduced the damage. Projects to help fund critical components like these are also potentially valuable; the Core Infrastructure Initiative (housed at the Linux Foundation) is a multi-million dollar project to fund open source projects that are in the critical path for core computing functions, and was inspired by Heartbleed.
So are there any examples of projects doing well? After all, it’s easy to complain, but if no one can do well, then perhaps it is not possible. Besides, if there are no exemplars to copy from, it’s hard to learn how to do them.
I asked people to identify examples of robust FLOSS, either in terms of reliability or security. Of course, a program can be robust given expected inputs and insecure (because intelligent attackers can specifically rig inputs to cause undesired behavior). Also, even really good systems have occasional problems. Still, it’s a great idea to look at exemplars, because it is much easier to copy approaches once you can see them in action. Here are some of the projects that people identified:
This is certainly not a complete list, and vulnerabilities will probably be found in at least one of them. Still, it’s useful to point to projects that are seriously working to improve security and/or reliability, and how they are doing it, so that others can figure out what might be worth imitating.
The OpenSSL developers and reviewers employed a variety of widely accepted practices like multi-person review and multiple tools to reduce the number of vulnerabilities. However, recent vulnerabilities like Heartbleed suggest that software development projects (both FLOSS and proprietary) often have systemic problems in the way that they counter vulnerabilities.
A key lesson to be learned is that the static and dynamic analysis approaches often used by many projects today cannot find problems like Heartbleed. This includes mostly-positive automated test suites, common fuzz testing approaches, and typical statement or branch code coverage approaches. Several source code weakness analyzer developers are improving their tools to detect vulnerabilities very similar to Heartbleed, and that is good news. But it is obvious that this is not enough.
No tool or technique guarantees to find all possible vulnerabilities. However, there are several approaches that could have found Heartbleed, and vulnerabilities like it, before the vulnerable software was released. Projects that want to create secure software need to also add at least one of the following approaches (and preferably several of them):
For example, Google found Heartbleed using thorough human review / audit, while Codenomicon found it very soon afterwards using fuzzing with output examination.
Projects should make sure that they are easier to analyze. For example, they should simplify their code, simplify their Application Program Interface (API), allocate and deallocate memory normally, and (if FLOSS) use a standard FLOSS license that is compatible with widely-used FLOSS licenses. It would also be good to create a single widely-accepted standard annotation notation for each major programming language, including C, so that an annotation language would be easier to adopt. It would be hard to get that kind of agreement for languages like C (when there isn’t already such a notation), but I think more projects would use them if they were standardized.
There are also many ways to reduce the impact of Heartbleed-like vulnerabilities that should be considered:
Educational materials should be modified to add these points. In particular, we need to warn developers about the potential security problems caused by using memory caching systems in C, C++, or Objective-C; few materials do that today. Of course, the real problem is that few developers learn how to develop secure software, even though nearly all programs are under attack (because they connect to the Internet or take data from the Internet).
Now let’s talk specifically about SSL/TLS implementations. In the short term, I think that a FLOSS project should be established to create a thorough regression test suite for SSL/TLS, built using a set of techniques:
Such a test suite need not compete with other test suites, indeed, having multiple test suites for an important interface can be really valuable. Test suites like this can be created for any protocol or interface, especially those with multiple implementations. However, I think it is important to focus on SSL/TLS. A number of serious vulnerabilities have recently been found in a many different SSL/TLS implementations, all of which can be found through negative testing. Also, if any cryptographic library is vulnerable, that vulnerability is likely to expose data that was properly protected by other implementations. This test suite could be reused across all SSL/TLS projects, every time any developer makes any change, eliminating problems long before they show up on a user’s machine. This suite could also be used by potential users and governments. If used for validation, a supplier could be required to provide additional tests to be potentially added to the next version (making the test suite better over time and helping it keep up with functional additions). A common test suite would give us all better confidence in all SSL/TLS libraries (including OpenSSL, the libreSSL fork of OpenSSL, GnuTLS, Mozilla NSS, Google’s BoringSSL, and so on). It also has funding advantages; if you’re not sure if you should support OpenSSL, the LibreSSL fork, the Google BoringSSL, or anything else, that does not matter - such a test suite would help everyone. Individual projects would continue to have their own test suites, but clearly the current test suites are inadequate. Relying on any one technique to detect vulnerabilities is a bad idea, including creating a common rigorous test suite; we need to do more. But this seems to me to be a useful start.
The goal of this paper is not to ridicule the original developers. Instead, the goal of this paper is to help identify how to improve things, so that important projects like OpenSSL can prevent future similar vulnerabilities by changing how they develop and evaluate their software. More broadly, projects need to examine the lessons from Heartbleed (as listed above) and other vulnerabilities, and then make sure that they effectively counter similar problems. They also need to examine the projects that seem to be doing well, to see what approaches could be copied.
If you enjoyed this paper, you might also enjoy the entire suite of related papers in my essay suite Learning from Disaster, which includes my paper on Shellshock [Wheeler2014b], the POODLE attack on SSLv3, and the Apple goto fail vulnerability.
There is no magic bullet. However, there are important lessons that need to be learned. Projects need to aggressively use a suite of approaches so that vulnerabilities like Heartbleed almost never occur again.
URLs were current as 2014-05-28. I have formatted this paper so that people can find the key references from a printed copy.
[AdaCore 2014] AdaCore. “Spotlighting a GAP Member: Vermont Technical College (US): Vermont Tech’s CubeSat is in orbit and sending down photos and data”. GNAT Pro Insider. Spring/Summer 2014. page 2.
[Ammann2008] Ammann, Paul and Jeff Offutt. Introduction to Software Testing. 2008. University Press. ISBN-13: 978-0521880381. ISBN-10: 0521880386. http://cs.gmu.edu/~offutt/softwaretest/
[Anderson2014] Anderson, Paul. Finding Heartbleed with CodeSonar. May 1, 2014. http://www.grammatech.com/blog/finding-heartbleed-with-codesonar
[Appel2014] Appel, Andrew W. Verification of a Cryptographic Primitive: SHA-256 http://www.cs.princeton.edu/~appel/papers/verif-sha.pdf
[Apple2013] Apple. Apple Automatic Reference Counting (ARC). “Transitioning to ARC Release Notes.” August 2013. https://developer.apple.com/library/mac/releasenotes/ObjectiveC/RN-TransitioningToARC/Introduction/Introduction.html
[ArcaneSentiment2014] Arcane Sentiment. A sound bug finder is an unsound correctness prover. 2014. http://arcanesentiment.blogspot.se/2014/04/a-sound-bug-finder-is-unsound.html
[Arnbak2014] Arnbak, Axel et al. “Assessing legal and technical solutions to secure HTTPS”. ACM Queue, vol. 12, no. 8. http://queue.acm.org/detail.cfm?id=2673311
[Boeck2015] Boeck, Hanno. "How Heartbleed could've been found". Hanno's blog. 2014-04-07. https://blog.hboeck.de/archives/868-How-Heartbleed-couldve-been-found.html
[BAH2009] Booz Allen Hamilton. Software Security Assessment Tools Review. March 2, 2009. http://samate.nist.gov/docs/NAVSEA-Tools-Paper-2009-03-02.pdf
[BenchmarksGame] Benchmarks Game. http://benchmarksgame.alioth.debian.org/u32/which-programs-are-fastest.php
[Bessey2010] Bessey, Al, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, Scott McPeak, and Dawson Engler. A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World. Communications of the ACM, Vol. 53 No. 2, Pages 66-75. 2010. http://cacm.acm.org/magazines/2010/2/69354-a-few-billion-lines-of-code-later/fulltext
[Bounimova2013] Bounimova, Ella, Patrice Godefroid, and David Molnar. “Billions and Billions of Constraints: Whitebox Fuzz Testing in Production”. 2013. Bounimova2013
[Brandon2013] Brandon, Carl, and Peter Chapin. A SPARK/Ada CubeSat Control Program. Ada-Europe 2013, LNCS 7896, pp. 51-64. 2013. Springer-Verlag Berlin Heidelberg. http://www.cubesatlab.org/doc/brandon-chapin-AdaEurope2013.pdf
[Brumley] Brumley, David and Dawn Song. Privtrans: Automatically Partitioning Programs for Privilege Separation. http://users.ece.cmu.edu/~dawnsong/papers/privtrans.pdf
[Butler] Butler, Ricky W. “What is Formal Methods?” http://shemesh.larc.nasa.gov/fm/fm-what.html
[Cassidy2014] Cassidy, Sean. Diagnosis of the OpenSSL Heartbleed Bug. 2014-04-07. http://blog.existentialize.com/diagnosis-of-the-openssl-heartbleed-bug.html
[CASC2014]. Certificate Authority Security Council (CASC). CASC Heartbleed Response. 2014-05-08. https://casecurity.org/2014/05/08/casc-heartbleed-response/
[Chandrashekar] Chandrashekar, Keerthi. “How the Heartbleed Bug Was Discovered”. Latin Post. 2014-04-11. http://www.latinpost.com/articles/10440/20140411/heartbleed-bug-discovered.htm
[Chou2014] Chou, Andy. “On Detecting Heartbleed with Static Analysis”. April 13, 2014. http://security.coverity.com/blog/2014/Apr/on-detecting-heartbleed-with-static-analysis.html
[Codenomicon-How] Codenomicon. Heartbleed & SafeGuard: How We Found It. http://www.codenomicon.com/news/news/2014-05-20.shtml
[Coverity2014] Coverity. 2013 Coverity Scan Report. 2014. http://softwareintegrity.coverity.com/register-for-scan-report-2013.html?cs=pr
[Cox2008] Cox, Russ. Lessons from the Debian/OpenSSL Fiasco. May 21, 2008. http://research.swtch.com/openssl
[Crawford2013]. Crawford, Drew. “Why mobile web apps are slow.” July 2013. http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/
[de Raadt] de Raadt, Theo. “Re: FYA: http://heartbleed.com/”. Newsgroup gmane.os.openbsd.misc. http://article.gmane.org/gmane.os.openbsd.misc/211963
[DrewHintz] DrewHintz. Comment on “Neel Mehta donates Heartbleed bounty to Freedom of the Press Foundation”. https://news.ycombinator.com/item?id=7556826
[Durumeric2014] Durumeric, Zakir, James Kasten, David Adrian, J. Alex Halderman, Michael Bailey, Frank Li, Nicholas Weaver, Johanna Amann, Jethro Beekman, Mathias Payer, Vern Paxson. The Matter of Heartbleed. IMC 14, November 5-7, 2014, Vancouver, BC, Canada. https://jhalderm.com/pub/papers/heartbleed-imc14.pdf Available via archive.org.
[Eadicicco] Eadicicco, Lisa. “How A Group Of Engineers Uncovered The Biggest Bug The Internet Has Seen In Years”. Business Insider. 2014-04-10. http://www.businessinsider.com/heartbleed-bug-codenomicon-2014-4
[Felker2014]. Felker, Rich. “Re: Re: Removing sbrk and brk”. 2014-01-07. http://www.openwall.com/lists/musl/2014/01/07/3.
[Fisher2014] Fisher, Dennis. “Research Finds No Large Scale Heartbleed Exploit Attempts Before Vulnerability Disclosure” Threatpost. 2014-09-09. http://threatpost.com/research-finds-no-large-scale-heartbleed-exploit-attempts-before-vulnerability-disclosure/108161
[Georgiev2012]. Georgiev, Martin, Subodh Iyengar, Suman Jana, Rishita Anubhai, Dan Boneh, and Vitaly Shmatikov. “The most dangerous code in the world: validating SSL certificates in non-browser software”. ACM Conference on Computer and Communications Security. 2012. pp. 38-49.
[Gibson] Gibson, Steve. An Evaluation of the Effectiveness of Chrome’s CRLSets. 2014. https://www.grc.com/revocation/crlsets.htm
[Godefroid2008] Godefroid, Patrice, Michael Y. Levin, and David Berkeley. “Automated Whitebox Fuzz Testing”.
[Goodin2014a] Goodin, Dan. March 4, 2014. “Critical crypto bug leaves Linux, hundreds of apps open to eavesdropping”. Ars Technica. http://arstechnica.com/security/2014/03/critical-crypto-bug-leaves-linux-hundreds-of-apps-open-to-eavesdropping
[Goodin2014b]. Goodin, Dan. “How Heartbleed transformed HTTPS security into the stuff of absurdist theater: Certificate revocation checking in browsers is “useless,” crypto guru warns.” Ars Technica. 2014-04-21. http://arstechnica.com/security/2014/04/how-heartbleed-transformed-https-security-into-the-stuff-of-absurdist-theater/
[GNU-Licenses] GNU project. Various Licenses and Comments about Them. https://www.gnu.org/licenses/license-list.html#OpenSSL
[Godefroid2008] Godefroid, Patrice, Michael Y. Levin, and David Berkeley. Automated Whitebox Fuzz Testing. http://research.microsoft.com/en-us/projects/atg/ndss2008.pdf
[Goodin2011] Goodin, Dan. “How is SSL hopelessly broken? Let us count the ways: Blunders expose huge cracks in net’s trust foundation” The Register. 2011-04-11. http://www.theregister.co.uk/2011/04/11/state_of_ssl_analysis/
[Gutmann] Gutmann, Peter. Experiences with SAL/PREfast https://www.cs.auckland.ac.nz/~pgut001/pubs/sal.html
[Hatton2003]. Hatton, Les. “EC--, a measurement based safer subset of ISO C suitable for embedded system development.” 2003. Information and Software Technology, 47 (3) (2005), p. 181-187. http://www.leshatton.org/index_SA.html
[Hatton2005] Hatton, Les. Language subsetting in an industrial context: a comparison of MISRA C 1998 and MISRA C 2004. November 20, 2005. Information and Software Technology 49 (5), p. 475-482, May 2007. http://www.leshatton.org/index_SA.html
[Heartbleed.com] Heartbleed.com web site.
[Hofer2010] Hofer, Thomas. Evaluating Static Source Code Analysis Tools 2010. http://infoscience.epfl.ch/record/153107/files/ESSCAT-report
[Hsu2008] Hsu, Yating, Guoqiang Shu, and David Lee. “A Model-based Approach to Security Flaw Detection of Network Protocol Implementations” http://www.ieee-icnp.org/2008/papers/Index12.pdf
[Jplus2014] Jplus. 2014. Approximate speed classes of programming languages. XKCD Forums. http://forums.xkcd.com/viewtopic.php?f=11&t=108685
[Junestam2014] Junestam, Andreas and Nicolas Guigo (iSECpartners). Open Crypto Audit Project: TrueCrypt: Security Assessment. Prepared for the Open Crypto Audit Project. 2014. https://opencryptoaudit.org/reports
[Kupsch2009] Kupsch, J. A. and Barton P. Miller. “Manual vs. automated vulnerability assessment: A case study”. The First International Workshop on Managing Insider Security Threats, West Lafayette. 2009. http://research.cs.wisc.edu/mist/papers/ManVsAutoVulnAssessment.pdf
[Kupsch2014-April] Kupsch, James A., and Barton P. Miller. “Why do Software Assurance Tools Have Problems Finding Bugs Like Heartbleed?” https://continuousassurance.org/swamp/SWAMP-Heartbleed-White-aper-22Apr2014.pdf
[Kupsch2014-May] Kupsch, James A., and Barton P. Miller. “Why do Software Assurance Tools Have Problems Finding Bugs Like Heartbleed?” https://continuousassurance.org/swamp/SWAMP-Heartbleed.pdf
[Langley2014a] Langley, Adam. No, don’t enable revocation checking. 2014-04-19. https://www.imperialviolet.org/2014/04/19/revchecking.html
[Langley2014b] Langley, Adam. Revocation still doesn’t work. 2014-04-29. https://www.imperialviolet.org/2014/04/29/revocationagain.html
[lcamtuf2014] lcamtuf (Michal Zalewski). Pulling JPEGs out of thin air. http://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html
[Mailheau] Mailheau, Rita. “Getting to the Heart of the Heartbleed Bug”. 2014-05-27. http://ddosattackprotection.org/blog/heartbleed-bug/
[McLoughlin2004] Mark McLoughlin, Mark. June 22, 2004. https://people.gnome.org/~markmc/openssl-and-the-gpl.html
[Marlinspike2011] SSL And The Future Of Authenticity. April 11, 2011. http://www.thoughtcrime.org/blog/ssl-and-the-future-of-authenticity
[Miller2005] Miller, Damien. “Secure Portability”. AUUG 2005. October 2005. http://www.openbsd.org/papers/portability.pdf or http://www.mindrot.org/~djm/auug2005/
[Miller2007] Miller, Damien. “Security measures in OpenSSH”. Proceedings of the AsiaBSDCon 2007, Usenix. March 2007. http://www.openbsd.org/papers/openssh-measures-asiabsdcon2007.pdf
[NIST] NIST (SAMATE project). Tool Survey. http://samate.nist.gov/index.php/Tool_Survey.html
[NIST-Sound] NIST. NIST SAMATE SATE V Ockham Sound Analysis Criteria. http://samate.nist.gov/SATE5OckhamCriteria.html
[OpenBSD-Journal] Various authors (thread). “malloc() Guard Pages”. OpenBSD Journal. 2003. See, e.g., “At least 11 bugs were found so far.” by David Krause.
[Perens1999] Perens, Bruce. “The Open Source Definition” Open Sources: Voices from the Open Source Revolution 1st Edition. January 1999. 1-56592-582-3. http://oreilly.com/catalog/opensources/book/perens.html
[Provos2003] Provos, Niels, Markus Friedl, and Peter Honeyman. “Preventing privilege escalation”. Proceedings of the 12th USENIX Security Symposium. Pages 231-242. August 2003. https://www.usenix.org/events/sec03/tech/full_papers/provos_et_al/provos_et_al.pdf
[Ragan2014] Ragan, Steve. “Heartbleed to blame for Community Health Systems breach”. CSO. Aug 19, 2014 5:44 PM. [http://www.csoonline.com/article/2466726/data-protection/heartbleed-to-blame-for-community-health-systems-breach.html.
[Raymond1999] Raymond, Eric. “The Cathedral and the Bazaar”. The Cathedral and the Bazaar. http://www.catb.org/esr/writings/cathedral-bazaar/ - note that Linus’ law is at http://www.catb.org/esr/writings/cathedral-bazaar/cathedral-bazaar/ar01s04.html
[Ritter2014] Ritter, Tom. iSEC Completes TrueCrypt Audit: Is TrueCrypt Audited Yet? Yes, in part! April 24, 2014. https://isecpartners.github.io/news/2014/04/14/iSEC-Completes-Truecrypt-Audit.html
[Rohlf2014] Rohlf, Chris. 2014-04-11. My heart is ok, but my eyes are bleeding. Leaf Security Research. http://blog.leafsr.com/2014/04/11/my-heart-is-ok-but-my-eyes-are-bleeding/
[Ruef2014] Ruef, Andrew. Using Static Analysis And Clang To Find Heartbleed. April 27, 2014. http://blog.trailofbits.com/2014/04/27/using-static-analysis-and-clang-to-find-heartbleed/
[Rutkowska2013] Rutkowska, Joanna. Thoughts on Intel’s upcoming Software Guard Extensions (Part 1). August 30, 2013. http://theinvisiblethings.blogspot.com/2013/08/thoughts-on-intels-upcoming-software.html
[Sarkar2014] Sarkar, Roy. “Saving you from Heartbleed”. April 16, 2014. http://www.klocwork.com/blog/software-security/saving-you-from-heartbleed/
[Schieferdecker2012] Schieferdecker, Ina, Jürgen Großmann, and Martin Schneider. Model-Based Fuzzing for Security Testing. 2012-04-21. http://www.spacios.eu/sectest2012/pdfs/SecTestICST_Schieferdecker.pdf
[Schneier2014] Schneier, Bruce. Heartbleed. 2014-04-09. https://www.schneier.com/blog/archives/2014/04/heartbleed.html
[Serebryany2012] Serebryany, Konstantin, Derek Bruening, Alexander Potapenko, and Dmitry Vyukov. “Address Sanitizer: A Fast Address Sanity Checker”. USENIX ATC 2012. https://www.usenix.org/system/files/conference/atc12/atc12-final39.pdf
[Shahriar2008] Shahriar, Hossain. Mutation-based testing of buffer overflows, SQL injections, and format string bugs. 2008. http://qspace.library.queensu.ca/handle/1974/1359
[Spinellis2012] Spinellis, Diomidis, Vassilios Karakoidas, and Panagiotis Louridas. “Comparative language fuzz testing: Programming languages vs. fat fingers.” PLATEAU 2012: 4th Annual International Workshop on Evaluation and Usability of Programming Languages and Tools - Systems, Programming, Languages and Applications: Software for Humanity (SPLASH 2012). ACM, October 2012. (doi:10.1145/2414721.2414727) http://www.dmst.aueb.gr/dds/pubs/conf/2012-PLATEAU-Fuzzer/pub/html/fuzzer.html
[Sutton2007] Sutton, Michael, Adam Greene, and Pedram Amini. Fuzzing: Brute Force Vulnerability Discovery. 2007. Addison-Wesley. ISBN 0-321-44611-9.
[Takanen2008] Takanen, Ari, Jared D. Demott, and Charles Miller. Fuzzing for Software Security Testing and Quality Assurance. 2008. Norwood, MA: Artech House. ISBN-13 978-1-69693-214-2.
[Ted] Ted. Analysis of openssl freelist reuse. http://www.tedunangst.com/flak/post/analysis-of-openssl-freelist-reuse
[TrustedSec] CHS Hacked via Heartbleed Vulnerability. August 19, 2014. https://www.trustedsec.com/august-2014/chs-hacked-heartbleed-exclusive-trustedsec/
[Uberti] Uberti, Justin. Untitled blog post. https://plus.google.com/+JustinUberti/posts/Ah5Gwb9jF4q
[Wheeler2004] Wheeler, David A. Secure Programming for Linux and Unix HOWTO. 2004. http://www.dwheeler.com/secure-programs/
[Wheeler2013] Wheeler, David A. Vulnerability bidding wars and vulnerability economics. 2013-11-16. http://www.dwheeler.com/blog/2013/11/16/#vulnerability-economics
[Wheeler2014a] Wheeler, David A., and Rama S. Moorthy. “State-of-the-Art Resources (SOAR) for Software Vulnerability Detection, Test, and Evaluation,” Institute for Defense Analyses Report P-5061, July 2014. Main report and Appendix E (Software State-of-the-Art Resources (SOAR) Matrix). http://www.acq.osd.mil/se/initiatives/init_pp-sse.html. Alternate location (with embedded appendix E): https://www.ida.org/~/media/Corporate/Files/Publications/IDA_Documents/ITSD/2014/P-5061.ashx [Wheeler2014b]. Wheeler, David A. Shellshock. http://www.dwheeler.com/essays/shellshock.html.
[XKCD] Munroe, Randall. “Heartbleed Explanation”. XKCD. http://xkcd.com/1354/
My sincere thanks to the many people who interacted with me to provide useful feedback on either this paper or on issues relating to it. Alphabetically by last name these include Markus Armbruster, Paul E. Black (NIST), Bill Cheswick, Christopher T. Celi (NIST), Michael J Cooper (NIST), Andy Chou, Mark Cornwell, Daniel Kahn Gillmor, Peter Gutmann, Steve Hayes (Codenomicon), Walt Houser, Josh Morin (Codenomicon), Vincent Legoll, Doug Montgomery (NIST), Bart P. Miller (University of Wisconsin), Peter Neumann, Stephen Nightingale (NIST), Vadim Okun (NIST), Elaine R. Palmer, David Ramos, Eric S. Raymond, Chris Rohlf, Bob Sturm (Codenomicon), Mikko Varpiola (Codenomicon), Apostol Vassilev (NIST), David Wagner, and Jim Zemlin (Executive Director of The Linux Foundation). Walt Houser, in particular, sent me many careful edits and critiques that I am very thankful for. There are probably others; I thank them too.
A brief summary of this paper was published in IEEE Computer. The citation is: Wheeler, David. “Preventing Heartbleed”. IEEE Computer. Volume 47, Issue 8. August 2014. pp. 80-83.
You can see a video of me presenting a summary of this paper.
If you enjoyed this paper, you might also enjoy the entire suite of related papers in my essay suite Learning from Disaster. Feel free to see my home page at http://www.dwheeler.com. You may also want to look at my paper Why OSS/FS? Look at the Numbers! and my book on how to develop secure programs.
As always, these are my personal opinions, and not necessarily the opinions of my employer, government, or guinea pig. As usual with my personal writing, I use logical quotation, as do most hackers and Wikipedia (here is more on the advantages of logical quotation). This paper uses the standard prefixes for binary multiples; 1 KiB is exactly 2^10 bytes, while 1 KB is exactly 10^3 bytes.
(C) Copyright 2014 David A. Wheeler.