Cloud Security: Virtualization, Containers, and Related Issues

David A. Wheeler

2019-06-23 (originally 2014-12-02)

There seems to be a lot of confusion about security fundamentals of cloud computing (and other utility-based approaches). For example, many people erroneously think hardware virtualization is required for clouds (it is not), or that hardware virtualization and containerization are the same (they are not).

Here is a quick introduction to clouds, followed by a contrast of some security isolation mechanisms that could be used to implement them: physically separate machines, hardware virtualization, containerization (OS-level virtualization), and traditional multi-user accounts. I also discuss cloud supplier issues, especially supplier trustworthiness and vendor lock-in.

In this paper I will sometimes contrast the needs of systems with extremely strong security requirements against those with less critical requirements; people with different requirements often do not appreciate the needs of others, and are then surprised when someone with different needs accepts a different approach. I will list examples; there’s no way I can list them all, but examples help, and I’ll especially emphasize open source software implementations (you can try them out and examine them to your heart’s content). I also point out some origins and history; many people have no idea that much of this is decades old. This paper necessarily omits a lot; this is a big and active area. For example, the post "Serverless Security implications-from infra to OWASP" by Guy Podjarny compares serverless (what I call cloud) security implications from a broader perspective than just the different isolation mechanisms that I focus on here. Here, I just try to focus on some fundamentals. In cloud, things change all the time - but the fundamentals do not.

What is a cloud?

“The NIST Definition of Cloud Computing” officially defines what “cloud computing” means to the U.S. federal government [Mell2011], and this is the definition we’ll use here. Cloud computing is “a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” NIST identifies five essential characteristics: On-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. NIST identifies three service models:

Infrastructure as a Service (IaaS): “The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications....” IaaS typically provides you with computers, which may be virtual; you can choose an operating system to run on each one and do what you want with them. A popular example of IaaS is Amazon Web Services (AWS), including its Amazon Elastic Compute Cloud (EC2) that provides scalable virtual private servers using Xen. Other examples include Windows Azure, Google Compute Engine, and Rackspace Open Cloud. OpenStack is open source software that lets you implement your own IaaS.
Platform as a Service (PaaS): “The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications... The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly configuration settings for the application-hosting environment.” Here you are provided an operating system (and typically services on top of it); you then provide the application(s) to run. Examples include Google App Engine, Red Hat OpenShift (built on OpenShift Origin), Heroku, and Windows Azure Cloud Services. Amazon Web Services (AWS) also provides support as a PaaS. OpenShift Origin is open source software that lets you implement your own PaaS.
Software as a Service (SaaS): “The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure...”. Here you are just using (via a network) an application that can very rapidly scale. Examples include Salesforce and Google Docs.

The figure below illustrates this. The orange rectangles show the parts that the user is responsible for, while the blurred gray regions show the services that are provided by a cloud computing service. The parentheses give examples; examples of infrastructure are computer hardware (which may be virtualized), and examples of a platform include the operating system and middleware (runtimes are also potentially part of a platform). The SaaS figure shows that Data may even be provided by a service; users normally provide data in all cases, but a SaaS provider may also provide a significant amount of data (such as international tax tables or large datasets) or metadata (e.g., data schemas that implement the services it provides).

Figure 1. Comparison of traditional, IaaS, PaaS, and SaaS approaches. Each approach increases the number of activities performed by the service provider.

NIST identifies several deployment models: private cloud, community cloud, public cloud, and hybrid cloud. For example, in a hybrid cloud some services might depend on a private cloud while other services depend on a public cloud. (See my later discussion on who has access.)

Cloud computing is the natural evolution of utility-based computing ideas that have been around for a long time. In 1961, John McCarthy said, “Computing may someday be organized as a public utility just as the telephone system is a public utility... Each subscriber needs to pay only for the capacity he actually uses...” [Garfinkel2011]. Project Multiple Access Computing (MAC) was officially started on July 1, 1963, with the long-range objective to support “evolutionary development of a computer system easily and independently accessible to a large number of people and truly flexible and responsive to individual needs” [VanVleck]. Project MAC developed MULTICS, specifically to provide large-scale shared computing; MULTICS greatly influenced the later development of Unix and Linux. Another precursor is the idea of a shared network in the “intergalactic computer network” concept of J.C.R. Licklider; he pressed for the development of the ARPANET (Advanced Research Projects Agency Network), which began operating in 1969 and later became the Internet.

What is different today is the widespread availability of technology (including computing power) that makes these ideas much easier and cheaper to implement and apply [Garfinkel2011]. Anyone with a credit card can immediately get access to a lot of computing power. For example, in October 2014, “Databricks participated in the Sort Benchmark and set a new world record for sorting 100 terabytes (TB) of data, or 1 trillion 100-byte records. The team used Apache Spark on 207 EC2 virtual machines and sorted 100 TB of data in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines in a private data center and took 72 minutes” [Xin2015]. Both Apache Hadoop MapReduce and Apache Spark are open source software that greatly simplify software development for analyzing data using multiple machines. Since they are both designed to simplify developing distributed systems, in many cases both make it easier to use cloud-based systems to analyze large or complex datasets. "13 ways the cloud has changed (since last you looked)" by Peter Wayner (InfoWorld) describes some of the many variations in configurations and pricing; the good news is that there are many options, the challenge is to find the best option for a given purpose. The Open Guide to Amazon Web Services provides a nice survey of the many services in Amazon Web Services (AWS), including tips, notes about vendor lock-in, and providing names for alternative services. Bruce Schneier’s “Should Companies Do Most of Their Computing in the Cloud?” discusses many of the issues involved in cloud computing, including economic and legal issues, and emphasizes that you need to weigh the benefits against the risks.
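As a rough illustration of the sort benchmark figures quoted above, the following sketch compares the two runs in machine-minutes. It assumes the two runs sorted comparable data volumes and ignores differences in machine size and price, so the ratio is only a crude proxy for cost.

```python
# Figures quoted from [Xin2015]; machine-minutes is only a crude cost proxy.
spark_machines, spark_minutes = 207, 23
hadoop_machines, hadoop_minutes = 2100, 72

spark_machine_minutes = spark_machines * spark_minutes
hadoop_machine_minutes = hadoop_machines * hadoop_minutes

ratio = hadoop_machine_minutes / spark_machine_minutes
print(f"Spark run:  {spark_machine_minutes} machine-minutes")
print(f"Hadoop run: {hadoop_machine_minutes} machine-minutes")
print(f"Ratio: about {ratio:.1f}x")
```

On these numbers the rented EC2 run used roughly one thirtieth of the machine-minutes of the private data center run.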

Note that:

Hardware virtualization is not required for a cloud. The NIST definition never mentions hardware virtualization at all, for example. Hardware virtualization is a technology often used to create clouds, and vendors selling virtualization-based systems certainly emphasize hardware virtualization, but hardware virtualization is merely one way to get there. Indeed, NIST SP 800-146 mentions cloud security issues that arise “when [the] providers offer computing resources in the form of [virtual machines]” (section 9.4, virtual machines) - this text presumes virtual machines are not always involved in clouds, because it uses the word “when” and not “because”. The fact that clouds do not require virtualization was also noted in [Chandersekaran2011].
Clouds need not be public. Private clouds are perfectly acceptable. Private clouds often cost much more than public clouds (due to reduced sharing), but if you have very serious security requirements, private clouds may be worth investigating (see who has access).

Isolation mechanisms

For cloud computing (or any shared system) to work, you need to have some technology to implement it. Since workloads are using shared resources, one of the most important issues in implementing a cloud is the isolation mechanism used to separate the workloads.

There are many ways to isolate workloads; common ones (in order of decreasing security isolation) are air-gapped physically separate machines, physically separate machines, hardware virtualization, containerization (OS-level virtualization), and traditional multi-user accounts. Most of these are not mutually exclusive; you can use hardware virtualization on physically separate machines, or run containers inside hardware virtualization. That said, it is the strongest isolation mechanism in use that matters most. It is important to understand the basic mechanisms in use, because they affect the underlying security. In particular, it is important to understand the limitations of any isolation mechanism.
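The ordering above, and the point that the strongest mechanism in use matters most, can be sketched as follows. This is a minimal illustration; the names `Isolation` and `effective_isolation` are hypothetical and the numeric ranks only encode the relative order given in the text.

```python
from enum import IntEnum

class Isolation(IntEnum):
    # Higher value = stronger isolation, following the ordering in the text.
    MULTI_USER_ACCOUNTS = 1
    CONTAINER = 2
    HARDWARE_VIRTUALIZATION = 3
    SEPARATE_MACHINE = 4
    AIR_GAPPED_MACHINE = 5

def effective_isolation(layers):
    """Return the strongest mechanism that separates two workloads."""
    if not layers:
        raise ValueError("at least one isolation layer is required")
    return max(layers)

# Two tenants, each in a container, with the containers in different VMs:
stack = [Isolation.CONTAINER, Isolation.HARDWARE_VIRTUALIZATION]
print(effective_isolation(stack).name)
```

Only count a layer if it actually sits between the two workloads being compared; two containers inside the same virtual machine are separated only by the container mechanism.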

Air-gapped physically separate machines

A strong form of isolation is to not connect machines at all through any network. This has its advantages for security. It means that the vulnerabilities in a computer, or an operating system's networking stack, or the OS low-level services, cannot be exploited over a network. Many computers have mechanisms that will surprise you; for example, computers with Intel Active Management Technology (AMT) can be remotely controlled over a network, even if the computer is turned off (a computer that is "off" with AMT has a separate powered computer that obeys commands over a TCP/IP network).

There are many ways to leak data even with an air gap. Air-gapped systems can still have information exfiltrated from them using mechanisms such as power line use by computers, mobile phones to monitor radio frequencies, heat, fan speed, and low frequency magnetic radiation (this propagates through the air, and can be generated by a CPU by a normal user process). Indeed, radio emission issues have been known about for many decades (look up TEMPEST). Electromagnetic emissions can be shielded, and the other issues can be countered by larger physical separation, but note how hard an attacker has to work if the defender is serious about separation. In practice, air-gapped systems have to have occasional communication (e.g., by inserting a USB stick), and those provide rich opportunities for attack. Air gapping can be helpful when security is critical, but it's no guarantee by itself.

Besides, if the machine is air-gapped from all other machines, I think it doesn't really meet the definition of cloud computing. Thus we don't need to belabor this option (I'll basically ignore it from here on). It's possible to air-gap a network of machines from wider networks, and then apply cloud techniques, but network-based attacks will obviously work within that network. So let's look for other isolation techniques that provide some security with less pain.

Physically separate machines (for “bare metal clouds”)

Some systems may be the direct target of strong persistent attacks, e.g., from a nation state or other advanced persistent threat. In addition, in some systems, any security breach may directly lead to harsh consequences (such as the death of many people or the loss of hundreds of millions of dollars). In these situations you may need high assurance that a system will never have a breach. In these cases common cost reduction approaches to implement clouds may create unjustifiable and dangerous risks. These approaches include (1) depending on hardware virtualization to isolate multiple virtual machines running on a single real machine, and (2) multi-tenancy.

Not everyone is subject to these kinds of attacks or impacts. But part of the problem today is that people do not appreciate the situations other people are in, and want “one size fits all”. In general, any sharing (such as sharing of the same computer) is much more risky from a security point-of-view compared to using separate machines.

For example, if you don’t use physically separate machines, then you are depending (in part) on software to keep machines isolated. However, attackers have always found vulnerabilities in the isolation software (there is the possible exception of formally-proved separation software, but such software is really rare today). For some situations any break at all is a disaster; since software (other than formally-proven software) cannot make those kinds of guarantees, using software for separation is inappropriate when a break-in is a disaster.

Besides, some attacks are (in my opinion) impractical to counter without physically separate machines. Perhaps the most obvious problem is covert channels, a common problem in shared systems that is well-known by experts in computer security. A covert channel is the ability to transfer information between processes that are not supposed to be allowed to communicate. For example, imagine that a system has really sensitive data, and has been tricked into running malicious code. You might imagine that putting the system into an isolated virtual machine that cannot send network traffic would solve the problem, but you would be wrong. If you want to prevent a subverted system from leaking data at all, preventing its network traffic from escaping and having virtual machines that cannot talk directly to each other is not enough. A subverted system could try to transmit data to another supposedly-separate process on the same machine by computing hard (meaning a 1), or not computing hard (meaning a 0), to leak data to another virtual machine that can observe these shifts. Similarly, a subverted system could use the floating-point register (meaning 1), or not use it (meaning 0), and again leak data to another virtual machine that could observe it. Or it could use a memory region to keep it in cache (meaning 1), or ignore it so it will leave the cache and take longer to access (meaning 0). It is absurdly difficult for developers to even determine how much time things take on a modern CPU; x86 machine code is now a high-level language. You would think this couldn’t leak much information, but modern error-correction systems, compression algorithms, and high-performance systems mean that covert channels are disturbingly fast. In my opinion it is impractical today to build software that fully prevents covert channels on the same processor; there are just too many shared components (channels) to deal with.
It is much cheaper and easier to provide each process its own processor, and then devise your network so that networks have isolated channels (e.g., by using fixed allocations).
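The "compute hard means 1, idle means 0" channel described above can be sketched as a toy simulation. This does not measure a real CPU; the timing costs, the jitter bound, and all function names are assumptions chosen so that the example is deterministic. The receiver never talks to the sender and only observes how long its own work appears to take.

```python
import random

IDLE_COST = 1.0   # assumed receiver timing when the sender is idle
BUSY_COST = 2.0   # assumed receiver timing when the sender computes hard
NOISE = 0.2       # assumed bound on measurement jitter

def send(bits):
    """Sender: one time slot per bit; True means 'compute hard'."""
    return [bit == 1 for bit in bits]

def observe(slots, rng):
    """Receiver: time a fixed workload of its own in each slot."""
    return [(BUSY_COST if busy else IDLE_COST) + rng.uniform(-NOISE, NOISE)
            for busy in slots]

def decode(timings):
    """Classify each slot as 1 (slow) or 0 (fast)."""
    threshold = (IDLE_COST + BUSY_COST) / 2
    return [1 if t > threshold else 0 for t in timings]

def to_bits(data):
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]

def from_bits(bits):
    return bytes(int("".join(map(str, bits[i:i + 8])), 2)
                 for i in range(0, len(bits), 8))

rng = random.Random(0)
secret = b"key"
leaked = from_bits(decode(observe(send(to_bits(secret)), rng)))
print(leaked)
```

In a real system the noise is larger and the slots must be synchronized, which is where the error-correction techniques mentioned above come in.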

Another example is defects in hardware protection mechanisms. Computer hardware is complicated, and some of its defects are security-relevant. For example, Google’s Project Zero team posted in 2015 a description of how to exploit the DRAM Rowhammer bug to gain kernel privileges. In short, the Rowhammer attack enables programs to change the values in other programs (including the operating system kernel) on a number of systems. In 2016 it was found that even DDR4 memory, once considered safe, was vulnerable to Rowhammer. Google demonstrated that this could be exploited on a variety of systems. Other examples are Spectre and Meltdown. Systems that do not share hardware are immune to failures in the hardware sharing mechanism.

You can still build systems that meet the NIST definition of cloud computing, even if you cannot risk sharing computer hardware or software between different projects at the same time. Simply require that each executing machine be allocated, at any one time, to a single (dedicated) real machine. You can use hardware virtualization or containerization for some purposes (e.g., to simplify migration), without depending on their isolation properties to allow you to use a single computer to run multiple virtual machines or multiple containers. This approach is sometimes called a bare metal cloud. Obviously you still need to manage things, and you need to protect the management system; you could do this with a separate physically-isolated network, which reduces opportunities for access (just presume that all systems being managed are malicious). (If even that is too risky, perhaps a cloud isn’t appropriate for your problem at all.) Some people use this approach purely for performance reasons, in addition to or instead of its advantages for isolation.
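The allocation rule for a bare metal cloud can be sketched as a toy pool that never places two tenants on the same real machine. The class name `BareMetalPool` and the `wipe` hook are hypothetical; the hook stands in for whatever reprovisioning step an operator uses before a machine is handed to someone else, which this text does not specify.

```python
class BareMetalPool:
    """Toy allocator: each real machine serves at most one tenant at a time."""

    def __init__(self, machines):
        self.free = list(machines)
        self.assigned = {}  # machine name -> tenant name

    def allocate(self, tenant):
        if not self.free:
            raise RuntimeError("no dedicated machine available")
        machine = self.free.pop()
        self.assigned[machine] = tenant
        return machine

    def release(self, machine, wipe):
        # 'wipe' is a caller-supplied callable (an assumption of this sketch)
        # that must run before the machine can go back into the pool.
        wipe(machine)
        del self.assigned[machine]
        self.free.append(machine)

pool = BareMetalPool(["m1", "m2"])
a = pool.allocate("tenant-a")
b = pool.allocate("tenant-b")
print(a != b)
```

The pool still offers on-demand allocation and release, so it can meet the cloud definition; it simply refuses to share, so a request fails when every machine is already dedicated.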

This approach is typically more expensive than other isolation mechanisms, but in some cases this extra expense is worth it. The good news is that modern computer systems are much cheaper than decades ago, so you can simply buy (or have someone trusted buy) many computers with much less money than in the past. In a number of cases this extra cost is quite doable. An example of this kind of architecture is given in “High Assurance Challenges for Cloud Computing” [Chandersekaran2011], which won the award “Certificate of Merit for International Conference on Computer Science and Applications 2011”. I know the authors; Sekar Chandersekaran sadly passed away in May 2014, but I think his insight that clouds do not require hardware virtualization needs to be understood more widely.

From a security point of view, physically separate machines are better than depending on mechanisms like hardware virtualization to provide security isolation. You can still use hardware virtualization to ease migration, but that is a separate issue. On the other hand, they require far more resources than directly depending on hardware virtualization for security, so let us describe that next.

Hardware virtualization

In hardware virtualization (aka virtualization) you use software - called a “virtual machine manager” (VMM) or “hypervisor” - to create a “virtual machine” that acts like a real computer. You can then install an operating system on each virtual machine. Each virtual machine is called a guest, while the underlying system that enables virtual machines (including the hypervisor) is called the “host”. In many cases multiple virtual machines execute on a single real machine, and many VMMs support migration of a virtual machine to a different real machine when more resources are needed. This is an obvious way to implement an IaaS.

There are many ways to implement a host. You can implement the VMM:

Underneath the operating system, as if it were a tiny operating system itself. Examples include VMware vSphere (also called “ESXi”) and Xen.
As part of the operating system kernel. Linux’s Kernel-based Virtual Machine (KVM) is an example.
On top of the operating system as an application. VirtualBox is an example.

If you want to use these mechanisms at scale, you typically want management software to manage and scale them automatically. CloudForms, Red Hat Enterprise Virtualization, and even OpenStack provide mechanisms for managing hypervisors.

These are, of course, old ideas. The term “virtualization” stems from work by IBM in the 1960s, especially the CP-40 mainframe originally developed in 1964. However, CPUs and networks have become much more capable, making virtualization cost-effective in many more situations. Virtualization is often mentioned in wider society; for example, Dilbert of 2008-02-12 mentions it.

In many cases VMMs can share memory pages across VMs as an efficiency measure. When properly implemented these should not have many security implications, but of course, the trick is getting them “properly implemented”. A proper VMM implementation should ensure that pages always have useless data (usually zeros) before handing a new page to a virtual machine, and this eliminates most problems. Shared pages may be read-only, while others may be copy-on-write. Read-only pages are easier to implement correctly, so they have a lower implementation error risk. Shared memory pages create covert channels, but as I noted earlier, if you’re worried about covert channels in a VMM, you are probably using the wrong approach anyway (you should be using physically separate machines). (My thanks to Gunnar Hellekson, who suggested talking about VMMs sharing memory pages.)
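The page-scrubbing rule can be sketched with a toy page pool. This is not how any particular VMM is written; `PagePool` and the 4096-byte page size are assumptions for illustration. The point is only that a page is overwritten with zeros before a guest receives it, so data left by a previous guest is never visible.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch

class PagePool:
    def __init__(self, n_pages):
        self.free = [bytearray(PAGE_SIZE) for _ in range(n_pages)]

    def allocate(self):
        page = self.free.pop()
        page[:] = bytes(PAGE_SIZE)  # scrub before handing the page to a guest
        return page

    def release(self, page):
        # Old contents are still present here; they are scrubbed in allocate().
        self.free.append(page)

pool = PagePool(1)
page = pool.allocate()
page[:6] = b"secret"       # guest A writes sensitive data
pool.release(page)
reused = pool.allocate()   # guest B receives the same physical page
print(reused[:6] == bytes(6))
```

Scrubbing at allocation time rather than at release is a design choice of this sketch; what matters is that no code path hands out a page without scrubbing it.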

Hardware virtualization typically provides better isolation than OS-level containerization, but at higher costs in startup time and resources. Intel has been developing “Clear Containers” (with more detail on Clear Containers on LWN.net) that try to optimize hardware virtualization so thoroughly that their start-up times and resource use are similar to those of OS-level containerization. This is pretty clever; they use a tiny mini-hypervisor to boot straight into the Linux kernel, optimize Linux kernel boot time for a kernel designed for this use, and use various tricks to do zero-copy no-memory cost access for the operating system code and data.

From a security point of view, hardware virtualization is better (it provides more isolation) than containerization, but it provides less isolation than if you used physically separate hardware to perform isolation. The reason is that the attack surface is fundamentally different in each case.

Some people mistakenly think that hardware virtualization is somehow a security guarantee. It is not. If a vulnerability is found in the CPU or VMM (hypervisor), or a vulnerability is found that can be exploited through the VMM, then the system can be subverted. And yes, VMMs have vulnerabilities. For more about the kinds of vulnerabilities in VMMs, see [Perez-Botero2013]. Modern hypervisors are typically written to emphasize convenience and shared performance, and aren't designed to counter powerful attacks. For example, modern hypervisor systems generally have a huge 'trusted computing base' (TCB), which makes them very difficult to secure and makes many high security approaches (like formal methods) impractical to apply [Colp2011]. Rafal Wojtczuk states in his Black Hat 2014 presentation that "we make a fact based argument that many hypervisors aren't designed with security in mind"; the problem is that if the goal of a virtualization system is to maximize features, the attack surface grows. In addition, cryptographic side-channel attacks can allow a guest to derive a secret key belonging to a different guest via the CPU cache [Pratiba2015] or the CPU pipeline [D’Antoine2015]. On the other hand, the attack surface for a VMM is much less than for containers, because VMMs provide relatively few services to directly attack compared to a container.

Containerization (OS-level virtualization)

Containerization, also called operating system-level virtualization, occurs when the operating system kernel “allows for multiple isolated user space instances, instead of just one”. The instances are often called containers, and appear to be isolated separate systems (at least via the filesystem). Since containers share a single operating system kernel, and typically easily share files as well, they are typically much more efficient than full hardware virtualization. Note, however, that all systems use the same underlying kernel, so you cannot use different operating systems on the same underlying platform (without additional mechanisms). Containers provide less security isolation than hardware virtualization, as I will describe in a moment. Indeed, containers often don’t fully provide security isolation at all; depending on the technology and threat, you often need to combine them with other mechanisms if you want secure isolation. This is an obvious way to implement a PaaS, though there are limits to the security it can provide.

Historically, one of the most widespread mechanisms for containerization is chroot, which creates restricted views of the filesystem. Chroot was included in version 7 Unix, which was released by Bell Labs in 1979; it was added to BSD by Bill Joy in 1982, and today is in most Unix and Unix-like kernels including the Linux kernel. A short explanation may help for those not familiar with it. On Unix and Linux all files start from the root of a filesystem named “/”. The “chroot” privileged system call changes the directory referred to by “/”, making it possible to create separate filesystem views for a process and its descendants. The implementation of this is remarkably simple: every process includes a pointer to what it considers the root of the filesystem, and all processes created by that process inherit this value. The chroot mechanism creates an isolated instance that simply cannot see the files created outside of its chroot region. Many tools, including mock and pbuilder, are built on top of chroot. In short, chroot is very useful and widely used. However, chroot only isolates filesystems; it does not isolate networking, process lists, or many other system capabilities. Also, chroot cannot effectively isolate privileged users (in particular root); in most implementations a process with root privileges can trivially break out of a chroot environment.
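The "pointer to the root" idea can be sketched as a toy model in user space. This does not call the real chroot system call (which needs privileges); the `Process` class is hypothetical and only mimics how each process carries a root that its children inherit and against which every path is resolved.

```python
import posixpath

class Process:
    """Toy model of a process's filesystem view (not the real system call)."""

    def __init__(self, root="/", cwd="/"):
        self.root = root  # host path this process sees as "/"
        self.cwd = cwd    # current directory, expressed inside that view

    def fork(self):
        # Children inherit the parent's root pointer.
        return Process(self.root, self.cwd)

    def chroot(self, new_root):
        self.root = self.resolve(new_root)
        self.cwd = "/"  # simplification: the real call does not change cwd

    def resolve(self, path):
        """Map a path as seen by this process to a host path."""
        # Normalizing an absolute path discards ".." above "/", so lookups
        # cannot climb out of the process's root in this model.
        view = posixpath.normpath(posixpath.join(self.cwd, path))
        return posixpath.normpath(self.root + "/" + view)

init = Process()
jailed = init.fork()
jailed.chroot("/srv/jail")
print(jailed.resolve("/etc/passwd"))
print(jailed.resolve("../../etc/passwd"))
```

The model deliberately leaves out the break-out described above: a real process with root privileges can escape, partly because the real call leaves the current directory untouched, so this sketch shows the mechanism and not its security limits.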

Many operating systems have added more capable mechanisms than chroot to support containerization. These isolated regions are often called containers or jails. These mechanisms at least isolate filesystems, just like chroot, but they try to perform other functions as well. Typically they implement at least network isolation, some other kinds of namespace isolation (e.g., isolating processes so one container cannot see the processes in other containers), perform copy-on-write (an optimization that greatly speeds startup and reduces storage needs), and let organizations define quotas for each container (e.g., for storage or CPU). Some also attempt to limit what privileged users (in particular the “root” user) are allowed to do inside them, though this is harder to do than you might think. Because it is difficult to do, even systems that try to limit root privileges in containers have a non-trivial risk that mistakes could lead to vulnerabilities. Examples of these more capable containerization mechanisms include FreeBSD jail, Solaris containers, HP-UX containers, and Parallels Virtuozzo / OpenVZ containers. In 2008 the Linux kernel added support for Linux containers (LXC).

Docker has rapidly become popular as a way to simplify creating and sharing containers. Docker builds on various existing kernel mechanisms to make containers much easier to create and apply. Historically Docker used Linux Containers (LXC) as its lower-level mechanism; more recently it switched to runC (formerly libcontainer) to run containers. Project Atomic provides tools designed to deploy and manage Docker containers. Project Atomic Host, in particular, is a lightweight operating system assembled out of upstream RPM packages designed to run applications in Docker containers. Cockpit from Project Atomic is a kind of minimum container management tool that supports “atomic” update and rollback for applications and host.

Rocket is a more recently-created alternative container format and system developed by CoreOS. The CoreOS developers were originally involved in developing Docker, and Brandon Philips (co-founder/CTO of CoreOS) was a top Docker contributor. However, the CoreOS developers wanted a simple composable building block that “we can all agree on”. The CoreOS developers are concerned that Docker development is “creating a monolithic binary that runs primarily as root” (a design that has many security weaknesses and concerns). Rocket is an alternative container runtime, built by people who were originally Docker contributors, that focuses on “creating simple reusable container suite of tools designed to make it easy to create and use containers”. CoreOS plans to continue to ship Docker, but also to develop Rocket as a smaller and simpler alternative. Rocket is new (as is Docker), so it’s difficult to really compare them security-wise; simpler systems (like Rocket) are often easier to analyze, and thus potentially easier to secure, but you need to actually examine software for security weaknesses (and fix them) before the advantages of simplicity really pay off.

The Open Container Initiative (OCI) is a “lightweight, open governance structure, to be formed under the auspices of the Linux Foundation, for the express purpose of creating open industry standards around container formats and runtime.”

Docker and other container approaches have become very popular, but people often misunderstand their security capabilities. Dan Walsh has written a two-part series about the security of Docker containers, and he explains that “Some people make the mistake of thinking of containers as a better and faster way of running virtual machines. From a security point of view, containers are much weaker... [when using containers, you should] Drop privileges as quickly as possible; Run your services as non-root whenever possible; Treat root within a container as if it is root outside of the container... Only run containers from trusted parties.” Fundamentally, not everything in Linux is namespaced, so Linux does not provide the level of isolation that hardware virtualization does. There are definitely mechanisms that help isolate containers to reduce their risks, especially SELinux, and containers can make it easier to determine where to put the boundaries. Still, the risks are greater compared to hardware virtualization or physically separate machines [Walsh2014a] [Walsh2014b] [RedHat] [ProjectAtomic]. Dan Walsh has some simple summaries: “containers do not contain”; “docker pull is in the same class as yum install” (do not run untrusted Docker images); and “Treat a Docker image the same way you would treat other software you install on your machine” [Walsh2014c]. In addition, a “docker pull” both fetches and unpacks a container image in one step. There is no verification step. That means that malformed packages can compromise a system even if the container itself is never run. In addition, Jonathan Rudenberg’s post on 2014-12-23 explained why Docker’s “verification” of the time didn’t verify anything; untrusted input should not be processed without verifying its signature, yet Docker failed to do so. 
My personal opinion is that while Docker has lots of rich features, it’s been developed in a hurry as a large monolithic system; hurried monoliths are usually hideously insecure. Thus, I think that if you use Docker you should also use other mechanisms (like SELinux or AppArmor) for actually enforcing isolation that are specifically designed for security and have a longer track record, and you should be selective about what software is run in these containers. I think my opinion is shared by others.
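The principle that untrusted input should be verified before it is processed can be sketched as follows. This is not Docker's code or API; `verify_then_unpack` is a hypothetical helper. It also swaps in a plain SHA-256 digest check as a stand-in for signature verification: a digest only helps if the expected value itself arrived over a trusted, signed channel, which this sketch assumes.

```python
import hashlib
import hmac

def verify_then_unpack(blob, expected_sha256, unpack):
    """Refuse to process 'blob' unless its digest matches the trusted value."""
    actual = hashlib.sha256(blob).hexdigest()
    # Constant-time comparison; the blob is never parsed before this check.
    if not hmac.compare_digest(actual, expected_sha256):
        raise ValueError("digest mismatch: refusing to unpack")
    return unpack(blob)

image = b"layer data"
trusted_digest = hashlib.sha256(image).hexdigest()  # assumed to come from a signed manifest
print(verify_then_unpack(image, trusted_digest, len))
```

The ordering is the whole point: decompression and unpacking code only ever sees bytes that have already passed the check.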

Now to be fair, some Linux container mechanisms (e.g., in lxc and libvirt) can use user namespaces so that the “root” privileged user in a container is actually a different unprivileged user in the overall host system. My thanks to Serge E. Hallyn for pointing out this important clarification. That said, the user accounts have to be set up this way, and the programs they run have to work when configured this way (if the program is unprivileged in the larger context, it may be unable to perform its intended function). These sorts of details are important, but they also make it really hard to make simple overall statements about security and containers.
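
As a rough illustration of that mapping, the Linux kernel exposes each process's user namespace mapping in /proc/PID/uid_map as lines of three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The Python sketch below parses that format and translates a container user ID to the corresponding host user ID. The function names and the example mapping (container root mapped to host ID 100000) are assumptions chosen for illustration, not values taken from any particular system.

```python
from typing import NamedTuple, Optional


class IdRange(NamedTuple):
    inside_start: int   # first ID as seen inside the namespace
    outside_start: int  # first ID as seen outside the namespace
    length: int         # number of consecutive IDs mapped


def parse_id_map(text: str) -> list[IdRange]:
    """Parse the contents of a uid_map or gid_map file."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            ranges.append(IdRange(*(int(f) for f in fields)))
    return ranges


def to_host_id(ranges: list[IdRange], inside_id: int) -> Optional[int]:
    """Translate an in-container ID to a host ID, or None if it is unmapped."""
    for r in ranges:
        if r.inside_start <= inside_id < r.inside_start + r.length:
            return r.outside_start + (inside_id - r.inside_start)
    return None


# Hypothetical mapping: 65536 IDs starting at container ID 0 -> host ID 100000.
EXAMPLE_MAP = "         0     100000      65536\n"
```

With this example mapping, "root" (ID 0) in the container is host user 100000, which has no special privileges on the host.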

OpenVZ came before LXC in Linux, and is very featureful. However, the OpenVZ developers made the mistake of not getting their work into the mainline Linux kernel at an early date. As a result, LXC (built into the Linux kernel) quickly gained popularity. In December 2014 Parallels announced that it will be merging its open-source OpenVZ and proprietary Parallels Cloud Server projects into an open source “Virtuozzo Core”. That said, OpenVZ still has significant use, and the competition between them will probably be good for both.

"Abusing Privileged and Unprivileged Linux Containers" by Jesse Hertz (2016) examines some of the security mechanisms behind containers and shows how they can be exploited. Basically, it can be challenging to configure containers to be secure. That paper focuses on Linux, LXC, Docker, and AppArmor, but many of the techniques apply to any Linux container system.

From a security point of view, containers have the potential to be better than a simple multi-user shared server, because they can provide more isolation. They isolate the filesystem at least, and modern containerization systems isolate many other parts such as the network and process lists. In particular, using chroot (a weak form of containerization) is well-understood and has a long history of use as an isolation mechanism. If the design and code were absolutely and totally perfect, and containers were designed to fully contain privileged users, then containerization could be just as secure as hardware virtualization.

However, real-world code is not perfect. Containers share the operating system kernel, and thus any vulnerability in the operating system kernel can lead to subversion of the entire system (including all its containers). Since operating system kernels directly provide much more functionality than hypervisors (e.g., they implement filesystems, application process controls, and networking), container-based systems fundamentally have a much larger attack surface. In practice, modern full-featured kernels always have some vulnerabilities, and containerization directly exposes the kernel to a program, so containerization is fundamentally less able to isolate components than hardware virtualization is.

In practice, containerization is not even as strong as it could be theoretically; most containers have all sorts of namespace leaks. Some systems, such as Solaris, have undergone significant analysis of their container system, and as a result are much lower risk. However, many other systems have had relatively little security analysis of how well containers limit processes for security, especially for privileged accounts.

In particular, Linux containers (currently the most common kind) were not particularly designed to be used for security containment. Thus, when using Linux containers you often need to use other mechanisms that are specifically designed for security isolation (such as SELinux, seccomp, and separate user accounts), and you need to assume that privileged accounts are not isolated by containers. Gunnar Hellekson put it this way: “containers are a way to manage resources on a machine, and create a logical separation of workloads. Any security benefits are accidental and unintended.”
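
Since privileged accounts should be assumed to escape a container, a service that starts as root should give up that privilege as early as it can. The following Python sketch shows one common ordering for doing so on a Unix-like system; the function name drop_privileges is hypothetical, and the target user and group IDs are whatever unprivileged account you have set aside for the service. The order matters: supplementary groups and the group ID must be changed while the process is still privileged, and the user ID goes last.

```python
import os


def drop_privileges(uid: int, gid: int) -> None:
    """Permanently switch a root process to an unprivileged user and group."""
    os.setgroups([])  # drop supplementary groups first (requires privilege)
    os.setgid(gid)    # then the group ID, while still privileged
    os.setuid(uid)    # user ID last; root cannot be regained afterwards
    # Verify the change rather than assuming it worked.
    if os.getuid() != uid or os.getgid() != gid:
        raise RuntimeError("failed to drop privileges")
```

Called from a process that is not root, the first call fails with PermissionError, which is the safe outcome.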

That does not mean that containerization is useless! Containers are extremely useful, and currently a lot of work is going into improving their security. For example, see "Making Containers More Isolated: An Overview of Sandboxed Container Technologies" by Jay Chen (June 6, 2019). Chen says, "traditional containers such as Docker, Linux Containers (LXC), and Rocket (rkt) are not truly sandboxed as they share the host OS kernel. They are resource-efficient, but the attack surface and the potential impact of a breach are still large, especially in a multi-tenant cloud environment that co-locate containers belonging to different customers. The root of the problem is the weak separation between containers... [this] covers four unique projects from IBM, Google, Amazon, and OpenStack, respectively, that use different techniques to achieve the same goal, creating stronger isolation for containers."

If the containers are used to ease software management, instead of trying to separate different tenants in a multi-tenant system, then containers work very well today [Walsh2014a]. Google runs all their VMs inside containers so they can have a uniform mechanism for resource limits.

Containers do provide some security isolation, especially if care is taken to ensure that the processes in the container are not given any privileges, and if either (1) the containerization system is designed for security (as is true for Solaris containers) or (2) another mechanism that is designed for security is used with it (as is typically done with Linux containers). My point is just that containers have inherent limitations for security; in particular, they are always subject to kernel vulnerabilities. Containers do not provide as strong an isolation as hardware virtualization, even when supplemented with security isolation mechanisms. However, containers are far more efficient than full hardware virtualization; the lower cost from higher efficiency is compelling in many cases.

Traditional multi-user accounts

Traditional multi-user accounts also provide some isolation. Some people would not use the term “cloud” for systems using this isolation mechanism, but nothing in the NIST definition forbids it. In any case, most would be willing to call such systems “utilities”. Historically Unix systems were often shared systems that different people could simultaneously log in to and use. Different people had different user identifiers, every file was owned by a user, and owners could then set the privileges to determine what other users could do with their files. Multi-user account isolation is very well understood, baked into current designs, and heavily tested; as a result, this mechanism tends to be very strong for its intended purpose. It is easier for me to describe multi-user account access as separate from other mechanisms like containers, but this is really a false dichotomy; containers and multi-user accounts actually work well together.

Multi-user account mechanisms still work today, and they are in active use. Some shared web systems, for example, use multi-user accounts as their isolation mechanism. Android uses these mechanisms in an interesting way: each application on an Android device is assigned a different user id. Apple iOS technically does not use multi-user accounts to separate applications, but its mechanism for separating applications has many similarities. Some systems that use chroot also use a dedicated user, combining multi-user account isolation (which has had long analysis) with the file namespace isolation provided by chroot. Cloud-based systems, particularly those that are software as a service (SaaS), could also use these mechanisms to implement limited isolation... but they often will not.

These mechanisms do provide some separation, but they also potentially expose much more information to other users. It is easy to see what others are doing, and small mistakes can lead to the revelation of much more information to others. In most systems these access controls are discretionary; a malicious program or misconfiguration can leave the barn door open. Thus, in some sense they provide even less isolation than containers. So while this can be used as an isolation mechanism, other mechanisms will often make more sense when implementing a cloud, possibly in combination with this one.
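
Because these controls are discretionary, it helps to check for the "barn door" case directly. The Python sketch below walks a directory tree on a Unix-like system and reports regular files that any other user on the machine can read or write. The function names are hypothetical, and the demo builds its own temporary directory, so nothing here reflects a real system's layout.

```python
import os
import stat
import tempfile


def world_accessible(root: str) -> list[str]:
    """Return regular files under root that "other" users can read or write."""
    exposed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if stat.S_ISREG(mode) and mode & (stat.S_IROTH | stat.S_IWOTH):
                exposed.append(path)
    return sorted(exposed)


def demo() -> list[str]:
    """Create one private and one world-readable file, then audit them."""
    with tempfile.TemporaryDirectory() as root:
        for name, mode in (("private.txt", 0o600), ("exposed.txt", 0o644)):
            path = os.path.join(root, name)
            with open(path, "w") as handle:
                handle.write("data")
            os.chmod(path, mode)  # explicit mode, independent of the umask
        return [os.path.basename(p) for p in world_accessible(root)]
```

Only the file created with mode 0644 is reported; the 0600 file is visible to its owner alone.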

That does not make these kinds of mechanisms useless for security. However, where these mechanisms are used there are often other mitigations or environmental situations that reduce risk. Historically multi-user isolation mechanisms were often used when all users were in the same organization or at least had established business relationships (and thus were easier to track down if they intentionally performed an attack). On a smartphone users typically choose what applications to install (instead of just running arbitrary programs sent to them). Many modern systems (including smart phones and modern Linux systems) primarily install programs from distribution repositories and application stores, which also reduce the risk of installing and running malicious applications (since these intermediaries have a chance to analyze the program). I could imagine a SaaS provider using multi-user accounts as an isolation mechanism, because the system will only be running software selected or developed by the cloud supplier.

In contrast, IaaS or PaaS cloud suppliers must be able to run software supplied by different users, generally without prefiltering it. As a result they will almost certainly want stronger mechanisms than multi-user accounts, especially if they are public clouds.

Cloud supplier issues

There are many potential issues with the supplier of the cloud, including the trustworthiness of the supplier, who else has access (public vs. limited community vs. private cloud), and vendor lock-in. I feel I should at least mention cloud technical management as well.

Cloud supplier trustworthiness

The cloud supplier has direct access and control of all data and processes in their cloud. This introduces many possible risks. The supplier organization may be malicious, one or more of its employees may be malicious (perhaps because they are bribed or extorted), or the supplier’s system may be subverted.

In many cases there are easy mitigations that are adequate for the purpose. Backups sent outside the cloud, as always, can counter the risk of data loss; this is especially true if you can easily move elsewhere (see the text on vendor lock-in). Selecting reputable suppliers can help a great deal as well. Many cloud suppliers are willing to agree to certain stipulations and accreditations that reduce risk, too. For example, the US government’s Federal Risk and Authorization Management Program (FedRAMP) process was established with the purpose of creating “transparent standards and processes for security authorizations and allowing agencies to leverage security authorizations on a government-wide scale” and includes Third Party Assessment Organizations (3PAOs). Many cloud suppliers have a financial incentive to counter subversion, since that could risk their business, and good suppliers have teams focused on securing their systems.

You also have to realistically consider your alternatives: A good cloud provider (public or private) with a strong security team may be much better overall than an internal set of single-use servers that have little security investment due to inadequate budget.

There are stronger mitigations available as well. You can choose to store indexes to data, or hash values of data, instead of the data itself (e.g., store “employee numbers” but not the names and other information about the employee). You can also store encrypted values (without the keys), and then extract data from the cloud to use it in certain ways. Attackers cannot get what is not there.
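
One way to sketch the "store an index or hash, not the data" idea in Python is shown below. It replaces the identifying fields with a keyed hash (HMAC-SHA-256) before anything is sent to the cloud, and keeps the lookup table locally. Using a keyed hash rather than a plain hash is my assumption here: short identifiers such as employee numbers are easy to guess, so an unkeyed hash of them could be reversed by trying every candidate. The function names and record fields are hypothetical.

```python
import hashlib
import hmac


def pseudonym(secret_key: bytes, value: str) -> str:
    """Derive a stable token from a value; the key never leaves your premises."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def split_record(secret_key: bytes, record: dict) -> tuple[dict, dict]:
    """Split a record into a cloud-safe part and a locally kept lookup entry."""
    token = pseudonym(secret_key, record["employee_number"])
    # Only the token and non-identifying fields go to the cloud.
    cloud_part = {"token": token, "hours": record["hours"]}
    # The mapping back to a real person stays local.
    local_part = {
        token: {
            "employee_number": record["employee_number"],
            "name": record["name"],
        }
    }
    return cloud_part, local_part
```

The cloud copy can still be grouped and joined by token, but it contains neither the name nor the employee number.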

That said, using clouds - especially public clouds - is a trade-off, and sometimes the trade-off is inappropriate. That’s especially an issue if you have very high-value systems or information, such as many military systems. Tal Kopan stated in Politico (2015-02-26) that the “cyber line of the day award” went to Rep. Jim Cooper (Tennessee Democrat), who quipped during an Armed Services hearing that the computer cloud might as well stand for “the Chinese Love Our Uploaded Data.” “Cooper said he believes we’re ‘already in a cyber war’ and sought assurances from military CIOs that they’re taking proper steps to secure DoD networks.”

An intriguing new approach in development is homomorphic encryption, where computations can be directly applied to encrypted data and produce an encrypted result so that the decrypted result matches the result of the original operation on the unencrypted data. In this approach, a cloud provider could perform the computation yet be unable to reveal what the results meant. Fully homomorphic encryption can handle any computation, but all known solutions today are many orders of magnitude slower than computing on unencrypted data. I am somewhat skeptical that fully homomorphic encryption will be practical in anything other than extremely specialized uses, at least for the next decade. Partially homomorphic encryption can handle only some kinds of computation, but has a smaller performance impact. Current partially homomorphic encryption mechanisms still take substantially more computing resources than doing the computing directly, but they are much closer to being practical, especially if hardware assistance can be developed. There are some specialized applications where this approach seems especially promising. It remains to be seen how widely partially homomorphic encryption could be used; I am slightly skeptical, but intrigued, and I could imagine this becoming useful in some circumstances (especially as research continues and/or hardware to support it becomes more available).
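
A toy example may help show what "partially homomorphic" means. Textbook (unpadded) RSA happens to be multiplicatively homomorphic: multiplying two ciphertexts yields a ciphertext of the product of the plaintexts. The Python sketch below uses deliberately tiny primes and no padding, so it is completely insecure and only illustrates the property; the parameters and function names are chosen for illustration and do not come from any real system.

```python
# Toy textbook RSA with tiny primes: for illustration only, NOT secure.
P, Q = 61, 53
N = P * Q                          # public modulus
E = 17                             # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (Python 3.8+)


def encrypt(message: int) -> int:
    if not 0 <= message < N:
        raise ValueError("message out of range for this toy modulus")
    return pow(message, E, N)


def decrypt(ciphertext: int) -> int:
    return pow(ciphertext, D, N)


def multiply_encrypted(c1: int, c2: int) -> int:
    """What the cloud would do: combine ciphertexts without any key."""
    return (c1 * c2) % N
```

Here the party holding only encrypt(6) and encrypt(7) can compute a ciphertext that decrypts to 42 without ever learning 6, 7, or 42. The result is only correct while the true product stays below the modulus, and this scheme supports multiplication only, which is exactly the "some kinds of computation" limitation.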

Whether or not the remaining risks are acceptable today depends on your situation. In many cases you can find an acceptable level of risk; after all, you cannot eliminate all risks. In some cases, this is not enough, or the only way to make risks acceptable is to spend more money for a more specialized solution.

Who else has access (private vs. public)?

Whenever you use a cloud, you are sharing resources with others. So not only are you potentially at risk from the cloud supplier - you are also at risk from attacks via those others you share resources with (i.e., the other tenants).

A cloud supplier must support many tenants. The other organizations who share the cloud service have some kind of access... and may be able to exploit system vulnerabilities to get more (or total) access. It may not even be direct; if you share a cloud with another organization that develops a vulnerable service, a malicious organization can exploit those vulnerabilities to gain a foothold, and then use that foothold to attack your system. Public clouds are especially subject to this problem, since they can have practically any tenant. This is a typical rationale for choosing a private or limited community cloud, because reducing the sharing reduces where attacks can easily come from.

I’ve been focusing on technical means to limit access, but all technology has limits. Potential users of cloud computing must determine whether the protection mechanisms of their cloud computing supplier are enough, including physical protection and isolation mechanisms. The issue is determining the risk of the cloud system failing to adequately isolate your service from malicious attackers. Obviously, if the technical measures are adequate for your needs, that’s fine. The countermeasures discussed above regarding cloud suppliers may also adequately reduce the risks. All approaches have risks; the question is whether those risks are acceptable.

A non-technical approach is to simply not use cloud services that share resources with other organizations you don’t trust. Instead, you can focus on limited community clouds, private clouds, or simply no cloud at all. When you use a private or limited community cloud you can still get commercial support (for both proprietary and open source software solutions), but reducing what is shared can reduce the risk of some kinds of attacks. In some cases these approaches may be the best course... but they’re often much more expensive, so count the cost.

Vendor lock-in

A different potential security issue with cloud computing is the risk of vendor lock-in. Wikipedia defines vendor lock-in as a situation which “makes a customer dependent on a vendor for products and services, unable to use another vendor without substantial switching costs.” Some people don’t consider vendor lock-in a security issue, but I disagree. Security is all about preventing other people from controlling you, and vendor lock-in definitely allows others to control you. If you’re locked into a vendor, then you are powerless if the vendor decides to ignore security issues or stop supporting the service, or substantially raise the rents. Suppliers have strong financial incentives to eventually raise rates to the highest value you can accept, once you are locked in, and it is typically easy to tell if a customer is locked in.

Having suppliers is not the problem - we need suppliers - the problem is excessive dependency on suppliers.

There is a simple way to measure vendor lock-in: the switching cost. Continuously determine what it would cost to switch to another vendor. If the cost of switching to a different vendor is low, then you have no or little lock-in. If it is high, or will become high, then you are at risk; you should be investigating ways to reduce your switching costs. Using standards and open source software are excellent ways to reduce vendor lock-in, but these are means to an end; the key is to determine if your switching costs are high.

Lock-in is a general issue whenever you have suppliers, but it is even more important for services like cloud services. The cloud supplier can choose to stop providing services, raise rates arbitrarily, or delete your data at will - even if there are pieces of paper promising otherwise. Even reputable suppliers might be bought out, have a catastrophic failure, or in other ways stop providing services you depend on. In short, if your cloud provider said, “pay us ten times as much next month or we’ll erase everything,” would you have a workable alternative? Your answer should be “yes”. (For more discussion, see [Antipatterns].)

In addition, lock-in prevents the use of some security mechanisms that derive from having diverse heterogeneous suppliers. It is often useful to have multiple suppliers act as a check on each other, to more easily detect problems in one; my dissertation on countering the trusting trust attack is one of many papers that uses that strategy. It is also an ancient battle strategy to use constant movement to strengthen defense; Sun Tzu’s “Art of War” states that “you may retire and be safe from pursuit if your movements are more rapid than those of the enemy”. For example, in some situations you can create a shifting attack surface by constantly migrating ephemeral workloads between different suppliers.
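
As a very small sketch of suppliers acting as a check on each other, suppose the same deterministic job is run at several suppliers and the outputs are compared. The Python function below hashes each result and reports any supplier whose output differs from the most common one. The function name and supplier labels are hypothetical, and the sketch assumes the job really is deterministic; with a tie there is no meaningful majority, so a real system would need a better policy than this.

```python
import hashlib
from collections import Counter


def dissenting_suppliers(results: dict[str, bytes]) -> list[str]:
    """Return suppliers whose output differs from the most common output."""
    if not results:
        return []
    digests = {
        name: hashlib.sha256(data).hexdigest() for name, data in results.items()
    }
    # The most common digest is treated as the reference result.
    majority_digest, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, d in digests.items() if d != majority_digest)
```

This only detects a problem at a minority of suppliers; it says nothing if all of them are wrong in the same way, which is why diversity among the suppliers matters.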

I think SaaS is often an especially high risk, because there is often no effective way to move all the data and processes out to a different system. But any service is at risk, and there are many ways to mitigate those risks. For example, if you can extract all your data into an open standard, that greatly reduces your risk. If you can move your data to an open source software implementation (and thus control it yourself if necessary), that reduces your risk even further.

Handling security updates

In all of these situations, you need to make sure that necessary security updates (patches) are applied, and that secure configurations are used. In physically separate systems and hardware virtualization this is fundamentally the same as it has been historically.

One challenge is that many cloud providers want to provide the “same” image everywhere, or in the case of containerized systems, the image is “fixed”. The technology itself can support updates just fine, but there is a temptation to leave things unchanged even when they need to be changed. Thus, you should check that security updates will be applied in a timely way, or that you can force the issue.

Cloud technical management

Clouds work by pooling resources together so that they can be shared; this scale is what creates the opportunities. But of course this pooling creates a much stronger need for technical management of all these resources to serve the cloud provider’s customers.

So clearly you need tools to manage these resources. These can include tools like Chef and Puppet, which let you configure many different (virtual) systems. From a security point-of-view, what is especially key is that these tools let you rapidly apply security updates (patches) and configuration changes that harden systems against attack (or counter an ongoing attack). Of course, these rapid automation tools can also be used to rapidly attack and subvert systems, so these management tools themselves need to be hardened (including being hardened from attacks by the systems the tools are supposed to be controlling).

Frankly, this is a whole topic by itself, so I will stop here.

Other issues

Of course there are many other potential cloud supplier issues that need addressing. They include physical access (can you audit them?), data breach processes, and legal jurisdiction. Which ones matter depends on your circumstance; it’s best to identify the issues that matter to you, then determine which approaches adequately resolve them.

If you are interested, take a look at NIST SP 800-146, “Cloud Computing Synopsis and Recommendations”, which discusses various issues in cloud computing security. The U.S. Department of Defense (DoD) Cloud Computing Security Requirements Guide (SRG) (January 2015) and FedRAMP “Secure Your Cloud” information may also be of interest.

Conclusions

In conclusion, cloud computing is a model of how to share resources, and it can be implemented in many ways. You do not need to use hardware virtualization for cloud computing; that is just one way to do it. What makes sense depends on many factors, including your risk tolerance and the costs you are willing to bear. Obviously details and execution matter. A VMM has a smaller attack surface than a container system, but a poorly-maintained and poorly-monitored VMM will typically have a higher risk than anything that is well-maintained and monitored. Still, it’s important to understand the basics so you can understand the trade-offs involved.

Of course, this paper cannot possibly cover everything about cloud security. Really securing a cloud, just like securing any system, requires attention to a whole host of factors. But it is much easier to understand those factors once you understand the basics... and I hope this helps.

References

[Antipatterns] Vendor Lock-In. Antipatterns.com. http://www.antipatterns.com/vendorlockin.htm

[Chandersekaran2011] Chandersekaran, Coimbatore, William R. Simpson, and Ryan Wagner. “High Assurance Challenges for Cloud Computing”. Lecture Notes in Engineering and Computer Science, Proceedings World Congress on Engineering and Computer Science 2011, Volume I, pp. 61-66. Berkeley, CA. October 2011. http://www.iaeng.org/publication/WCECS2011/WCECS2011_pp61-66.pdf

[Garfinkel2011] Garfinkel, Simson. The Cloud Imperative. Technology Review. 2011-10-03. http://www.technologyreview.com/news/425623/the-cloud-imperative/

[Mell2011] Mell, Peter and Timothy Grance. The NIST Definition of Cloud Computing. National Institute of Standards and Technology (NIST). September 2011. NIST SP 800-145. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf

[Perez-Botero2013] Perez-Botero, Diego, Jakub Szefer, and Ruby B. Lee. “Characterizing Hypervisor Vulnerabilities in Cloud Computing Servers”. CloudComputing ‘13, Hangzhou, China. 2013-05-08. http://caslab.eng.yale.edu/people/jakub/papers/scc2013.pdf

[ProjectAtomic] Project Atomic. Docker and SELinux. http://www.projectatomic.io/docs/docker-and-selinux/

[RedHat] Red Hat. “Secure Containers with SELinux”. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Resource_Management_and_Linux_Containers_Guide/sec-Secure_Containers_with_SELinux.html

[VanVleck] Van Vleck, Tom. History of Project MAC. Multicians.org. http://www.multicians.org/project-mac.html

[Walsh2014a] Walsh, Daniel J. “Are Docker containers really secure?” Opensource.com. 2014-07-22. http://opensource.com/business/14/7/docker-security-selinux

[Walsh2014b] Walsh, Daniel J. “Bringing new security features to Docker”. Opensource.com. 2014-09-03. http://opensource.com/business/14/9/security-for-docker

[Walsh2014c] Walsh, Dan. Docker’s New Security Advisories and Untrusted Images. 2014-11-25.

[Xin2015] Xin, Reynold. “World Record set for 100 TB sort...”. Opensource.com. 2015-01-15. http://opensource.com/business/15/1/apache-spark-new-world-record

My thanks to those who provided me helpful feedback including Randy Simpson, Amy Henninger, Gunnar Hellekson, and Serge E. Hallyn. Errors are my own; please let me know of any.

Feel free to see my home page at https://dwheeler.com. You may also want to look at my paper Why OSS/FS? Look at the Numbers! and my book on how to develop secure programs. These are my personal opinions, and not endorsed by my employer, government, or guinea pig.