Fixing Spam

Thursday - June 03, 2004 at 05:05 PM


Most technical solutions to the spam problem fail to address a key point: there are as many definitions of "spam" as there are bulging inboxes. For many antispam crusaders, any unsolicited commercial E-mail is automatically evil.

But for the average user, the definition of spam is probably more akin to "all this disgusting E-mail which I don't want and I can't make stop."

Many others have observed that a big part of the spam problem is that the Internet E-mail system was never designed to be traceable. The protocols for sending E-mail were designed in an era when the Internet was largely an academic network, and nobody expected to need to verify where an E-mail came from. As a result, it is very simple to forge an E-mail so that it appears to come from someone else, and forgery is a major tool of spammers.

SPF is a good start for making E-mail traceable (or, more correctly, for verifying that an E-mail really comes from the domain it claims to come from). It is only beginning to be adopted, though, and techniques like throwaway domains provide a big loophole.

Schemes like Mailblocks, which use a challenge/response system to block E-mail not sent by a human, are basically 100% effective at stopping spam. They're also 100% effective at stopping other automated E-mail, like your daily stock portfolio update, or notices that your flight will be delayed. They're also rude, especially when implemented badly.

Given the tenacity of spammers, there's clearly a financial incentive to keep it up. But what we really need is an anti-spam solution which gives people control of their inboxes without throwing the baby out with the bathwater. Keep in mind, too, that any business which has to be available to communicate with the public can't afford to use overly aggressive anti-spam technology. The cost of false positives (rejecting legitimate E-mail as spam) is simply too high.

No one solution will solve the problem. Spam is really a security issue, and like any security issue, multilayered defenses are required:

Layer 1 (technical): Test to make sure the E-mail meets all technical requirements for being well-formed. Reject anything which appears forged. Test to ensure that the purported sender actually exists. Test to ensure that the sending MTA (mail transfer agent--the server sending the message) is well-behaved--for example, by asking it to try again in a few minutes. If the E-mail is in HTML format, ensure that it is well-formed.
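The "try again in a few minutes" test is usually called greylisting. Here is a minimal sketch of the idea in Python, assuming the mail server can hand us the connecting IP, envelope sender, and recipient. The class name `Greylist`, the 300-second delay, and the in-memory dictionary are all illustrative assumptions, not part of any real MTA.

```python
import time


class Greylist:
    """Defer the first delivery attempt from an unseen (ip, sender, recipient)."""

    def __init__(self, min_delay_seconds=300, clock=time.time):
        self.min_delay = min_delay_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.first_seen = {}        # triplet -> time of first attempt

    def check(self, client_ip, sender, recipient):
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Remember the first time this triplet showed up.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        # A real server would answer with a temporary (4xx) SMTP failure here.
        return "defer"
```

A well-behaved MTA queues the message and retries later, at which point the triplet is old enough to be accepted. Much spam-sending software never retries.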

Layer 2 (validation): Attempt to validate the purported sender of the E-mail, and classify as valid/invalid/unsure. This can use SPF and other techniques (most E-mail, right now, will probably fall into the "unsure" category). Anything "invalid" gets thrown out, and "unsure" can be used as an input to other layers.
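As a rough sketch of the valid/invalid/unsure classification, the function below maps an SPF check result onto the three categories. It assumes the SPF lookup itself has already been done elsewhere and that its result arrives as a string such as "pass", "fail", "softfail", "neutral", or "none". The names `Verdict` and `classify_sender` are made up for illustration.

```python
from enum import Enum


class Verdict(Enum):
    VALID = "valid"
    INVALID = "invalid"
    UNSURE = "unsure"


def classify_sender(spf_result):
    """Map an SPF result string onto valid / invalid / unsure."""
    result = (spf_result or "none").strip().lower()
    if result == "pass":
        return Verdict.VALID
    if result == "fail":
        return Verdict.INVALID
    # softfail, neutral, none, and lookup errors all land here, which is
    # where most mail will fall while SPF adoption is still low.
    return Verdict.UNSURE
```

Treating everything except a clear pass or fail as "unsure" keeps this layer from producing false positives; the uncertainty is passed on rather than acted upon.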

Layer 3 (Illegal/dangerous content): Reject messages which contain dangerous attachment types--for example, in my company we reject any E-mail which contains an executable file or an archive. This provides us with very strong protection against E-mail viruses. This is also a good point to inspect HTML-based E-mail for things often associated with phishing schemes, like loading images from a domain other than the sender's, or intentionally obfuscated HTML source.
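The attachment check can be sketched with Python's standard `email` package. This version only looks at file extensions, which is an assumption made for brevity; the extension list is illustrative and is not the list my company actually uses.

```python
import os
from email.message import EmailMessage

# Illustrative list only: executables and archives.
BLOCKED_EXTENSIONS = {".exe", ".com", ".scr", ".pif", ".bat", ".zip", ".rar"}


def has_blocked_attachment(msg):
    """Return True if any MIME part carries a filename with a blocked extension."""
    for part in msg.walk():
        filename = part.get_filename()
        if not filename:
            continue
        extension = os.path.splitext(filename)[1].lower()
        if extension in BLOCKED_EXTENSIONS:
            return True
    return False


# Usage: build a small test message with an executable attached.
sample = EmailMessage()
sample.set_content("See attached.")
sample.add_attachment(b"MZ", maintype="application",
                      subtype="octet-stream", filename="invoice.EXE")
print(has_blocked_attachment(sample))  # prints True
```

A production filter would also want to sniff the actual file contents, since a filename is easy to lie about.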

Layer 4 (black/white/grey listing): Compare E-mail sender and domain to lists of known "good" or "bad" addresses. Some public blacklists get really carried away (which has probably hurt the whole idea of blacklisting), but what I have in mind is something driven by both the user and technical criteria. For example, SPF-validated E-mail from known legitimate domains (like AOL.com or Microsoft.com) gets whitelisted based on technical criteria, but the user can blacklist either the sender or the domain. Or, a user can whitelist particular senders (which would override any blacklist).
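The precedence rules described here (a user whitelist beats any blacklist, a user blacklist beats the technical whitelist) might look something like the following. The function name, its parameters, and the "undecided" result are hypothetical choices for this sketch.

```python
def list_decision(sender, spf_valid, user_whitelist, user_blacklist, trusted_domains):
    """Return 'accept', 'reject', or 'undecided' for one sender address."""
    sender = sender.lower()
    domain = sender.rpartition("@")[2]

    # 1. A sender the user explicitly trusts overrides any blacklist.
    if sender in user_whitelist:
        return "accept"
    # 2. The user can blacklist either a single sender or a whole domain.
    if sender in user_blacklist or domain in user_blacklist:
        return "reject"
    # 3. Technical whitelist: SPF-validated mail from a known legitimate domain.
    if spf_valid and domain in trusted_domains:
        return "accept"
    # 4. Otherwise leave the decision to later layers.
    return "undecided"
```

For example, with `user_blacklist={"aol.com"}` and `user_whitelist={"friend@aol.com"}`, mail from friend@aol.com is accepted while everything else from that domain is rejected, even if it passes SPF.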

Layer 5 (content-based filtering): Anything which makes it to this point is at least behaving (in a technical sense) like legitimate E-mail, and hasn't been either rejected or accepted. The hope is that most E-mail would never make it this far, either being rejected on purely technical grounds, or accepted as validated from a known legitimate sender. Bayesian filtering lives here. Information from other levels can be used as an input to content-based filters, too, since it may be relevant to know that (for example) the sending domain did not have an SPF record.
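To make the last point concrete, here is a toy naive Bayes scorer in which the SPF result is simply fed in as one more token next to the words of the message. It is a deliberately simplified sketch (whitespace tokenizing, add-one smoothing, only tokens present in the message are scored), not how any particular Bayesian filter is implemented, and all names are invented for the example.

```python
import math
from collections import Counter


def tokenize(body, spf_result):
    """Split the body into words and add the SPF result as a pseudo-token."""
    tokens = body.lower().split()
    tokens.append("meta:spf=" + spf_result)
    return tokens


class NaiveBayesFilter:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, label, tokens):
        # Count each token once per message.
        self.counts[label].update(set(tokens))
        self.docs[label] += 1

    def spam_probability(self, tokens):
        # Work in log-odds with add-one smoothing to avoid zero probabilities.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in set(tokens):
            p_spam = (self.counts["spam"][tok] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Numerically stable logistic function.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1 + e)
```

After training, a token like `meta:spf=none` picks up whatever weight the training mail justifies, so the filter learns for itself how much a missing SPF record should count.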

The idea is that it would be very difficult to craft an illegitimate E-mail which could pass all the different layers of testing, especially since things which might make it easy to pass one layer (such as forging a "from" address, or obfuscating HTML code) make it harder to pass a different layer. And in the higher layers, through black/white lists and content-based filtering, the user gets to control what s/he reads.
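One way to picture how the layers fit together is as a chain in which each layer either gives a definite verdict or passes the message along, leaving notes for later layers in a shared context. The sketch below uses two toy layers and a plain dictionary as the message; every name in it is hypothetical, and accepting by default when no layer objects is just one possible policy.

```python
def run_layers(message, layers):
    """Run layers in order; stop at the first definite verdict."""
    context = {}  # notes passed from earlier layers to later ones
    for name, layer in layers:
        verdict = layer(message, context)
        if verdict in ("accept", "reject"):
            return verdict, name
    # Nothing objected, so deliver the message.
    return "accept", "default"


def validation_layer(message, context):
    # Record the SPF result so that later layers can use it as an input.
    context["spf"] = message.get("spf", "none")
    return "reject" if context["spf"] == "fail" else "continue"


def content_layer(message, context):
    suspicious = "click here" in message.get("body", "").lower()
    # Suspicious wording is only fatal when the sender was not validated.
    if suspicious and context.get("spf") != "pass":
        return "reject"
    return "continue"


LAYERS = [("validation", validation_layer), ("content", content_layer)]
```

Here a forged sender is stopped by the first layer, while an unvalidated sender with suspicious content is stopped by the second, which is the "hard to pass every layer at once" effect in miniature.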

In my company, we implement at least parts of layers 2, 3, and 5. This stops 100% of E-mail viruses, and around 99.5% of spam, with almost no false positives.

I'd be interested to hear from anyone who actually runs all these different layers, and how it works.
