This essay describes what spam is, why it's a problem, ways you can counter it today, the bigger picture (how it will need to be countered long-term), and a few links to related information.
Spam is unsolicited bulk email (also called unsolicited mass email); it's the (automated) bulk nature of it that is so offensive, as discussed next. A few people try to limit the definition of spam to only cover commercial spam, but it doesn't matter if the spam is commercial or not, if the sheer volume of spam makes it impossible to use the equipment you purchased.
Spamming is another form of stealing and trespass; spammers make recipients pay for the spammers' messages without the recipients' permission. For example, spam steals a great deal of my time, uses up bandwidth and storage space without my permission, and makes it hard for legitimate email to reach me. And it hurts others; my web pages once made it easy for people to contact me (using "mailto:" links); the volume of spam I get has forced me to remove that capability, making it unnecessarily harder for others to contact me. Many organizations no longer post their email addresses, because their legitimate email is overwhelmed by spam. One estimate finds that spam costs 6 billion Euros a year and the cost is rising. Spam also uses up lots of bandwidth and disk space that the recipient, not the spammer, has to pay for.
A Washington Post article dated March 13, 2003 also discusses spam. They quote Brightmail Inc.'s report that roughly 40% of all e-mail traffic in the United States is spam (up from 8% in late 2001 and nearly doubling in the past six months), and Ferris Research Inc.'s report that spam will cost U.S. organizations more than $10 billion this year. They also quote Robert Mahowald, research manager for IDC; his firm estimates that for a company with 14,000 employees, the annual cost to fight spam is $245,000 and that "there's no end in sight."
A nice summary of the spam problem is Steve Atkins' Size and Cost of the Problem, one of the materials presented to the IRTF Anti-Spam Research Group (ASRG).
Trying to opt-out doesn't work; you're rarely removed, and in most cases you get more spam. See the U.S. Federal Trade Commission (FTC)'s comments that specifically note that "opting out" of spam results in more spam. See also Salon.com's "Remove me!" article.
Spam can even kill. There is a U.S. Secret Service Advisory on 419 schemes that particularly discusses the "Nigerian" letter (called the Nigerian Advance Fee Fraud Overview, a kind of "419" fraudulent scheme). They note that in June of 1995 an American was murdered in Lagos, Nigeria, while pursuing a 4-1-9 scam; that numerous other foreign nationals have been reported as missing; that this particular type of fraud grosses hundreds of millions of dollars annually; and that the monetary losses are continuing to escalate. This kind of fraud doesn't need email, but it's a common spam letter precisely because the criminals don't need to pay "up front" to send spam, so spam has enabled this deadly fraud to ensnare and hurt far more people than it could otherwise. Certainly, people shouldn't fall for such frauds, but since criminals are allowed to send spam to everyone on the planet, spam allows these criminals to exploit the naive and those who momentarily let their guard down. The notion that you can steal other people's resources to support scams is ludicrous.
The IETF (who develop Internet standards) describe why spam is a problem in documents such as RFC 2635. The IETF document RFC 2505 gives more information to mail administrators on how to deal with spam.
Spam has caused me, personally, loss of important data. All email sent to me during August 13-20, 2002, was lost forever due to a torrent of spam. Spam caused me to lose more email in November 2002. Again, spam effectively steals the ability to use email systems from their rightful users; it's way past time for legislators to understand that spam is theft.
Slate even argues that spam will soon end email as we know it. I think Slate overstates the case, since they presume that "laws have not worked," but in fact, laws to forbid spam have yet to be tried.
Since receivers pay the bulk of the costs for spam, spam use will continue to rise until effective technical and legal countermeasures are deployed, or until people can no longer use digital communications. I hope that legislatures in particular will realize the threat and help work to counteract it. In the meantime, technical approaches without legal help will need to be put in place.
Lots of people have various anti-spam suggestions, and I believe that defense in depth is a good idea. However, there are a few ideas I'm particularly fond of: email passwords, challenge-response passwords, and statistical analysis (e.g., Bayesian).
Email passwords simply ask the sender to include a password (say in the subject or body). The receiver checks, and if the password is included, it is scored as being much less likely to be spam. See my essay on email passwords for more information.
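As a rough illustration of the idea, here is a minimal Python sketch. The password value, the score numbers, and the function name `spam_score` are all hypothetical and chosen only for this example; a real filter would combine this signal with others.

```python
import re

# Hypothetical values: the password and the weights are illustrative only.
EMAIL_PASSWORD = "bluebird42"
BASE_SPAM_SCORE = 0.5   # neutral score for a message we know nothing about
PASSWORD_BONUS = 0.45   # how far a correct password lowers the score


def spam_score(subject: str, body: str, password: str = EMAIL_PASSWORD) -> float:
    """Return a rough spam score in [0, 1]; lower means less likely spam."""
    text = f"{subject}\n{body}"
    # Look for the password as a whole word, ignoring case.
    pattern = r"\b" + re.escape(password) + r"\b"
    if re.search(pattern, text, flags=re.IGNORECASE):
        return max(0.0, BASE_SPAM_SCORE - PASSWORD_BONUS)
    return BASE_SPAM_SCORE
```

Note that the password only lowers the score; a message without it is not rejected outright, it is simply left for other checks to judge.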
In challenge-response email password systems, email from senders on a "whitelist", or email that includes the receiver's "email password", is accepted; otherwise, the sender is sent a message telling them how to include an email password (say, in the subject line). Simple detection systems prevent email loops (constantly sending emails back and forth), and automatically adding users to the whitelist (e.g., if you send them a message or receive a password from them) would make the system easy to use. See my paper on guarded email for a challenge-response protocol that should be very effective. This is based on previous work such as that by Professor Timo Salsi.
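The decision logic described above can be sketched roughly as follows. This is not the guarded email protocol itself, only a simplified illustration; the class name `GuardedInbox`, its method names, and the "challenge each sender at most once" rule for avoiding loops are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class GuardedInbox:
    """Simplified challenge-response sketch (hypothetical names throughout)."""
    password: str
    whitelist: set = field(default_factory=set)
    challenged: set = field(default_factory=set)

    def note_outgoing(self, recipient: str) -> None:
        # Anyone we write to is automatically trusted for replies.
        self.whitelist.add(recipient.lower())

    def handle(self, sender: str, subject: str, body: str = "") -> str:
        sender = sender.lower()
        if sender in self.whitelist:
            return "accept"
        if self.password.lower() in f"{subject}\n{body}".lower():
            # A correct password also earns a whitelist entry.
            self.whitelist.add(sender)
            return "accept"
        if sender in self.challenged:
            # Already challenged once: hold the message instead of
            # sending another challenge, which avoids email loops.
            return "hold"
        self.challenged.add(sender)
        return "challenge"
```

For example, after `inbox.note_outgoing("friend@example.com")`, mail from that address is accepted with no password, while an unknown sender gets exactly one challenge.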
Statistical analysis trains on a set of spam and non-spam (ham) messages, and uses that to predict if the next message is spam or ham. Paul Graham's plan for spam discusses a naive Bayesian approach, in which each email user can run a program that creates a personal spam filter based on statistical analysis. Graham didn't invent the idea, but he did make it much more widely known. There's a great deal of study on the approach, including "An evaluation of Naive Bayesian anti-spam filtering", "An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages", "Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach", and information from lsi.upc.es and monmouth.edu; Slashdot has carried a discussion about it. Ifile implemented the idea many years ago - it claims a first release date of Aug 3 20:49:01 EDT 1996, and the author doesn't claim that this was the first time the idea had been implemented, either. CRM114 has an active filter using this technique, but can extend the probabilities to phrases and not just words (some studies have suggested that doing this doesn't help). Eric Raymond has developed bogofilter, a high-speed implementation.
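To make the train-then-predict idea concrete, here is a minimal sketch of a word-based naive Bayesian classifier in Python. It is a textbook version with add-one smoothing, not Graham's exact formula and not the code of any of the tools named above; the tokenizer and the class name `NaiveBayesFilter` are assumptions for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word-like tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class NaiveBayesFilter:
    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one message whose label is 'spam' or 'ham'."""
        self.messages[label] += 1
        self.words[label].update(tokenize(text))

    def spam_probability(self, text):
        """Estimate P(spam | text) under the naive independence assumption."""
        vocab = set(self.words["spam"]) | set(self.words["ham"])
        total_messages = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Log prior, smoothed so an untrained class is never impossible.
            score = math.log((self.messages[label] + 1) / (total_messages + 2))
            total_words = sum(self.words[label].values())
            for word in tokenize(text):
                count = self.words[label][word]
                # Add-one smoothing; the extra +1 reserves mass for unseen words.
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into a probability, clamped to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))
```

Because each user trains on their own mail, two people running the same code end up with different filters, which is part of what makes the approach hard for spammers to target.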
The SpamBayes project has researched how to improve classification of spam vs. ham messages when examining the message contents (as well as developing an implementation). They began with the naive Bayesian approach, but have been evaluating variations on the technique to improve it further. Their tests show that examining pairs of words is actually less effective than examining words individually, which is an interesting result. They have also developed a different algorithm for combining the probabilities of individual words, the chi-squared approach, which in their tests produces even better results. Here is their description in their own words:
The chi-squared approach produces two numbers - a "ham probability" ("*H*") and a "spam probability" ("*S*"). A typical spam will have a high *S* and low *H*, while a ham will have high *H* and low *S*. In the case where the message looks entirely unlike anything the system's been trained on, you can end up with a low *H* and low *S* - this is the code saying "I don't know what this message is". Some messages can even have both a high *H* and a high *S*, telling you basically that the message looks very much like ham, but also very much like spam. In this case spambayes is also unsure where the message should be classified, and the final score will be near 0.5. So at the end of the processing, you end up with three possible results - "Spam", "Ham", or "Unsure". It's possible to tweak the high and low cutoffs for the Unsure window - this trades off unsure messages vs possible false positives or negatives.
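The following Python sketch shows one way such a chi-squared combination can be computed, using Fisher's method of combining probabilities. It reflects my understanding of the technique rather than the SpamBayes source code; the function names and the cutoff values 0.2 and 0.9 are assumptions for illustration.

```python
import math


def chi2q(x2, df):
    """Return P(chi-squared with df degrees of freedom >= x2); df must be even."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    m = x2 / 2.0
    term = total = math.exp(-m)
    # Closed-form series for even df: exp(-m) * sum(m**i / i!) for i < df/2.
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(word_probs, ham_cutoff=0.2, spam_cutoff=0.9):
    """Combine per-word spam probabilities into (S, H, score, verdict)."""
    # Keep probabilities away from 0 and 1 so the logarithms stay finite.
    probs = [min(max(p, 1e-6), 1.0 - 1e-6) for p in word_probs]
    n = len(probs)
    if n == 0:
        return 0.0, 0.0, 0.5, "Unsure"
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    score = (s - h + 1.0) / 2.0
    if score >= spam_cutoff:
        verdict = "Spam"
    elif score <= ham_cutoff:
        verdict = "Ham"
    else:
        verdict = "Unsure"
    return s, h, score, verdict
```

Feeding it a mix of strongly spammy and strongly hammy words yields a high S and a high H together, so the score lands near 0.5 and the verdict is "Unsure", matching the behavior described in the quotation.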
A selected set from the newsgroup news.admin.net-abuse.sightings might be useful for initial training, though it can't be used directly (a spammer will try to fill it with useful messages to disable filtering). I really like this filtering approach and approaches like it.
I'd like to see mail browsers add a "SPAM" button that can do a number of configurable actions and has a useful default. I suggest as the default that it save the message in a "past spam" folder, and occasionally invoke a naive Bayesian statistical analysis program (as Graham describes) to create a filter for the future (then filter out email with a high probability of being spam). Perhaps it could optionally do other things, such as forwarding a copy to a list of email addresses (e.g., your local "abuse" account, the newsgroup news.admin.net-abuse.sightings, and email addresses of well-known spam killers), or calling on other spam killers like SpamAssassin to check it. It would be great if the spam message could be forwarded to an abuse account of the sending ISP, though determining what that ISP is could be difficult. Perhaps there could be a checkbox beside each action like "don't do it when you press SPAM", "do it when you press SPAM", or "confirm before doing it when you press SPAM" - that way, you could get rid of chain letters without sending them to net-abuse. The advantage of these first two approaches is that it doesn't matter if spammers know this is happening, and they can be implemented without requiring some Godlike and expensive central authority.
A nice supporting approach is "graylisting"; it slows spam down, and should make other anti-spam techniques more effective. See Evan Harris's paper on graylisting for more information.
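In outline, graylisting temporarily refuses mail from any combination of sending host, sender address, and recipient address it has not seen before, relying on the fact that legitimate mail servers retry later while much spam software does not. Here is a minimal Python sketch of that rule; the class name, the 300-second delay, and the in-memory dictionary are assumptions for the example, and a real deployment would also expire old entries.

```python
class Graylist:
    """Tempfail the first attempt from each (host, sender, recipient) triplet."""

    def __init__(self, min_delay_seconds=300):
        self.min_delay = min_delay_seconds
        self.first_seen = {}  # triplet -> time of the first delivery attempt

    def check(self, client_ip, sender, recipient, now):
        """Return 'tempfail' or 'accept'; `now` is a timestamp in seconds."""
        key = (client_ip, sender.lower(), recipient.lower())
        # Remember when this triplet first appeared (no change if already known).
        first = self.first_seen.setdefault(key, now)
        if now - first < self.min_delay:
            return "tempfail"  # ask the sending server to retry later
        return "accept"
```

The cost to legitimate senders is a short delay on the first message only; later messages from the same triplet pass straight through.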
Another approach is to use various blackhole lists, which identify locations that allow spam to be sent, and then summarily throw away all email from those locations. If the location is used for non-spam, eventually the location will have an incentive to stop the spam. Obviously, this has various risks, such as incorrectly identifying spam sources, getting a site off the list, and finding ways to maintain and distribute the list.
There are other ideas, too. Tools like SpamAssassin, while necessarily imperfect, do a reasonable job at detecting spam. There are tip-offs to spam that aren't always caught by generic tools. For example, if you only speak languages that use Latin characters (such as English), it's very likely that an email subject line in an Asian language is spam (and often reprehensible spam, too, like child pornography - why anyone stands up for the right to transmit child pornography spam is beyond my understanding). Various approaches to implement stamps (either money or computational time) make some sense, although they would require extremely widespread deployment to work. If S/MIME and PGP were more widely deployed, and keys were widely available, you could accept only encrypted email, which would at least make spamming slightly harder; see my article on how to easily distribute email keys. One group is even using haiku to counter spam.
Another simple approach is to sort email so that email that is probably ham (not spam) is sorted first or placed in a separate box. For example, email from people you've already sent email to (or in some other way identified as trusted) could be identified as ham. Of course, email sources can be forged, and spammers could use viruses to send spam email from trusted sources, but this makes a spammer's job a little harder. The general approach of defining a list of email addresses that are not spammers for you is called a "whitelist". A simple way to create a whitelist is to start with the contents of an addressbook and saved email messages. Another approach is to require a codeword to be placed in an email before you'll receive it, with the codeword placed on a website as a shrouded image (so that humans can read the codeword, but spammers' automated email address harvesters cannot).
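A whitelist-based sort of this kind takes only a few lines. The Python sketch below assumes each message is a dictionary with a "from" key holding a bare address; that representation and both function names are hypothetical.

```python
def build_whitelist(addressbook, saved_messages):
    """Seed a whitelist from an addressbook and from saved email messages."""
    trusted = {address.strip().lower() for address in addressbook}
    trusted.update(msg["from"].strip().lower() for msg in saved_messages)
    return trusted


def sort_probable_ham_first(inbox, whitelist):
    """Return the inbox with whitelisted senders first, order otherwise kept."""
    # sorted() is stable and False sorts before True, so trusted mail leads.
    return sorted(inbox, key=lambda msg: msg["from"].strip().lower() not in whitelist)
```

Nothing is deleted here; untrusted mail is merely pushed down the list, so a forged or mistaken sender costs you some attention rather than a lost message.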
I believe that email reading programs, such as Mozilla, will include stronger and stronger anti-spam technical measures over time using methods such as these.
As I noted before, an "opt-out" list is not really a great idea. Why should I have to sign up on a list just so I can use the email account I'm already paying for? However, it may be that legislatures will be unwilling to establish strong anti-spam legislation without one, and for the future I believe it'll be important to enact strong anti-spam legislation (I'll discuss why legislation is important in a moment).
If the world must have an "opt-out" list, there needs to be a single opt-out list that doesn't help spammers, and costs nothing for non-spammers (whose resources are, after all, being stolen in the first place). Such a list must also allow whole domains to opt-out, not just individual users, since some domains' connections have poor bandwidth or are very expensive - even some spam can take them down.
Here's the best way I can think of to handle a bad situation: a non-profit organization (with a .org address) or government agency without a conflict of interest would create and maintain a database of HASHES of email addresses that do NOT want spam (say MD5 and SHA-1 hashes of canonicalized email addresses, e.g., all lower case; an entire site could be represented by "@mycompany.com"). Anyone can download the database, for a fee. Anyone must be able to add or remove their email address from the list for FREE (and it must always be free); they just need to subscribe/unsubscribe, with a separate email to confirm that they really did make the request (for example, by emailing them a temporary password that they must send back; entire sites could require "root" or "postmaster" to represent them). The confirmation would need to be by email, since otherwise many spammers would simply forge messages to remove everyone from the list. It's critical that users can add or remove themselves for free; why should I pay an additional tax just to help the freeloaders who are exploiting others' email addresses? Then legislation can be enacted that gives serious penalties for any spam sent to addresses on the "no-spam" list. Capturing the database wouldn't do any good for a spammer; it would only provide hashes and date/time stamps.
Note that this requires almost no resources; adding/removing names can be done via the web, the "database" can be trivial (a text file listing timestamp, action (FORBID or PERMIT spam), and email hashes), and the implementation program can be trivial (a few hundred lines of code at most). Database download or query rights fees (say, $10,000 per 10 million email messages checked) could pay for the whole thing.
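To show how little code is involved, here is a minimal Python sketch of the lookup side. It assumes a text log with one whitespace-separated record per line (timestamp, FORBID or PERMIT, and a SHA-1 hash), with later records overriding earlier ones; the exact file layout, the function names, and the rule that a site-wide FORBID always wins are assumptions for the example.

```python
import hashlib


def canonicalize(address):
    """Canonical form used before hashing: trimmed and all lower case."""
    return address.strip().lower()


def address_hash(address):
    """SHA-1 hex digest of the canonicalized address (or '@domain' key)."""
    return hashlib.sha1(canonicalize(address).encode("utf-8")).hexdigest()


def domain_key(address):
    """Whole-site key such as '@mycompany.com' for a given address."""
    return "@" + canonicalize(address).rpartition("@")[2]


def load_state(log_lines):
    """Replay 'timestamp action hash' records; the latest action per hash wins."""
    state = {}
    for line in log_lines:
        if not line.strip():
            continue
        _timestamp, action, digest = line.split()
        state[digest] = action
    return state


def may_send(address, state):
    """False if the address or its whole domain is currently marked FORBID."""
    for key in (address, domain_key(address)):
        if state.get(address_hash(key)) == "FORBID":
            return False
    return True
```

A sender checking a mailing list would hash each address locally and call `may_send`; the list operator never has to reveal a plain-text address.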
This approach isn't foolproof; spammers can use password cracking techniques to figure out at least some of the database contents. More likely, many spammers will simply ignore the list, and find the names just like they do now. But stiff fines for ignoring opt-out lists might cut back a few spammers.
I will say that this sort of opt-out approach hasn't fared well in the past; Shut up and Eat Your Spam discusses the history and bad faith of spammers. To be practical, this requires legislation; spammers wouldn't voluntarily use an effective list (doing so would eliminate the point of spam). Of course, the whole notion that you have to sign into a database to prevent theft is wrong-headed in the first place.
In the bigger picture, I believe there needs to be both laws forbidding spam (all spam, not just commercial spam), as well as technical means to help enforce this. We need both law and technology; each needs the support of the other.
Laws don't solve the whole problem, but if spam were illegal, much more could be done to reduce spam to a much smaller level. Murder still happens - even though it's illegal - but the legal system acts as a deterrent, helping to reduce the occurrence of murder. Some spammers spam simply because it's legal where they are; if it were illegal, they would stop. The spammers who are willing to perform illegal acts will find it more difficult. Such laws must be international, but that's actually quite possible. Countries that fail to enact anti-spam laws could find their entire country blacklisted (no one else would accept their email), and that would act as a strong incentive to enact and enforce anti-spam legislation as well. If the top ten spammers were hit hard (say, taking all their assets the first time and throwing them in jail on a repeat offense), spam would go down remarkably - because the worst offenders would stop, and the others would not want to be next. This is already starting to happen through local laws; a Washington resident has been awarded $250,000 against a spammer, and spammers who filed a frivolous lawsuit to quiet opposition are finding that they may pay dearly for it.
I believe that the solution is quite workable: make unsolicited bulk email illegal. If you send unsolicited bulk email, you need to at least pay a fine for each unsolicited email sent. It's always complicated to define new laws, but it's clearly workable (several governments have already done so!). "Unsolicited" can be defined, e.g., as messages that you didn't request and that are not related to an ongoing or recent (within one year) business transaction or relationship. Customers must be able to opt out of messages from businesses they DO work with. The definition of "bulk" could simply be a large number, like at least 1000 recipients; nobody needs to send a message to that many individuals unsolicited. That means that opt-in mailing lists are fine, since you sign up for them. By "email" I mean any communication capable of communicating with many people, such as Internet email, cell phone text messages, and so on. Note that websites don't have a problem with such laws, since web users have to perform an action (such as clicking on a link) to see the data, and thus are requesting to see that particular data. Many governments are already moving this way. People who create spam viruses should be accountable for all the spam they generate, as well as for illegally using others' computers.
But what if governments are unable to do what I believe is the right thing? Is there a partial position that could help, temporarily, until they get more effective legislation enacted? At the least, laws could be put in place to require all unsolicited bulk email to have a standard marking that can be easily identified mechanically (e.g., the first characters in the "Subject" line, or its equivalent, must be exactly the four characters "ADV:") and to make forging "from" information for the purpose of spamming illegal. This would make it trivial to filter out spam, and would make it much easier to use "whitelists". Some U.S. state laws do at least one of these things (see Spamlaws), or at least do it for commercial and/or pornographic spam. Laws should cover all spam; what is "commercial" or "pornographic" can be subjective, and other spam still steals services, so there's no reason to treat different kinds of spam differently. Do you really want massive amounts of pro-Nazi spam, for example, even if it isn't commercial? Simply covering all unsolicited bulk email would be more appropriate and would be much easier to enforce. With those laws (required marking and valid return addresses), people could at least begin to throw away spam if they want to, which I suspect will be true for almost everyone. Perhaps everyone will choose to throw away spam - but if that's true, then that's a consumer choice, and spammers have no right to object to true consumer choice. Those who don't want spam from other countries where these laws aren't passed can simply throw away all email from those countries ("blackholing") - at least this gives users of email a choice!
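With a mandatory marking, the filter really is trivial. A minimal Python sketch, using the standard library's email parser and assuming the "ADV:" prefix rule described above (the function name is hypothetical):

```python
from email import message_from_string


def is_marked_bulk(raw_message):
    """True if the Subject header begins with exactly the characters 'ADV:'."""
    subject = message_from_string(raw_message)["Subject"] or ""
    return subject.startswith("ADV:")
```

A mail reader could then delete such messages, or file them in a separate folder, entirely at the recipient's choice.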
The U.S. Federal Trade Commission (FTC) has already begun suing spammers who disobey existing laws; these are generally fraudulent acts such as deceptive misuse of others' trademarks, false return addresses, and adding users to spam lists when they ask to be taken off the list. But until laws make spam illegal (or at least, unmarked spam illegal), the FTC cannot take legal actions against spammers in general.
Spam has become so obnoxious that even the Direct Marketing Association (a group whose purpose is to steal time and resources from others) agrees that commercial spam with forged "from" fields should be made illegal in the U.S. by federal law. Perhaps the DMA is finding that it's having trouble using email itself! For example, the DMA uses email addresses like email@example.com and Presiden@the-dma.org; either they get piles of spam, or they have to put a big spam filter on them. Of course, if the association for spam is having trouble using email because of spam, then it's clear that spam is out of control. As I noted above, some states already forbid at least commercial spam from forging its "from" addresses, as part of their anti-fraud laws, so enacting at least that requirement across the U.S. shouldn't be that hard. The DMA's acceptance of legislation against fraudulent spam would leave many problems unsolved; for example, the DMA still wants to stay with opt-out and doesn't want to include the ADV convention required by several states. In other words, the DMA assumes that everyone needs several billion emails a day unless they spend their lives sending opt-out messages individually to every organization on earth. After all, if the DMA has its way, every organization will be sending out cheap emails to everyone on Earth, repeatedly, preventing people from using their own email accounts. Ridiculous.
Other sites related to getting rid of spam include Spamhaus, spam.abuse.net, a list of purported ways to remove yourself from spam mailing lists, and info about SAFEeps (fatally flawed since it provides the email addresses - which will help you get MORE spam). The European Parliament requires opt-in (not opt-out) for email. Spamlaws identifies laws regarding spam.
This article was written by David A. Wheeler; you may freely redistribute it. You may also see his personal website at http://www.dwheeler.com.