Bayesian Spam Filtering Still Proves Reliable

by Christopher on June 27, 2011

The first anti-spam filter to use Bayesian filtering techniques, Jason Rennie’s iFile program, was released in 1996. In 2002, programmer and venture capitalist Paul Graham tweaked the technology to greatly reduce the false positive rate, making Bayesian methods capable of standing on their own as an anti-spam filtration system. Now, 15 years after its debut, Bayesian filtering technology is still in use by leading anti-spam vendors.

Bayesian filters are based on Bayes’ theorem, devised by an 18th-century British minister and mathematician named Thomas Bayes. In short, Bayes’ theorem describes how to update the probability of a hypothesis as new evidence arrives. In spam filtering, the hypothesis is that a message is spam and the evidence is the words it contains: the filter reads the content of an email message and compares its words and phrases to the content of known spam and known legitimate mail. If a significant percentage of the words in an incoming message are common in spam, the new message is likely to be spam. If only a statistically insignificant number of words are typical of spam, the new email is probably legitimate.
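
To make the arithmetic concrete, here is a minimal Python sketch of that word-by-word reasoning. It is not any vendor's implementation: the function names are made up for illustration, and it assumes spam and legitimate mail are equally likely beforehand, that words are independent of one another (the "naive" assumption), and that add-one smoothing is used to avoid zero probabilities.

```python
import math

def word_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Estimate P(spam | word) from how many spam and legitimate ("ham")
    messages contained the word. Add-one smoothing (an assumption of this
    sketch) keeps the result strictly between 0 and 1."""
    p_word_given_spam = (spam_count + 1) / (n_spam + 2)
    p_word_given_ham = (ham_count + 1) / (n_ham + 2)
    # Bayes' theorem with equal prior odds of spam and ham.
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)

def combine(probabilities):
    """Combine per-word probabilities into one score for the message.
    Works in log-odds space so long messages do not underflow.
    Each probability must be strictly between 0 and 1."""
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probabilities)
    log_odds = max(-700.0, min(700.0, log_odds))  # keep math.exp in range
    return 1.0 / (1.0 + math.exp(-log_odds))

# A word seen in 9 of 10 spam messages and 0 of 10 legitimate ones.
p_word = word_spam_probability(9, 0, 10, 10)
# Two fairly spammy words reinforce each other; neutral words (0.5) change nothing.
score = combine([0.9, 0.9, 0.5])
```

A message would then be flagged when the combined score crosses a chosen threshold, for example 0.9.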

What gives Bayesian filters an edge, even in light of the endless ruses spammers devise to trick the system, is that they learn. The more spam and legitimate correspondence the filters see over time, the more data they have on which to base their probability estimates. Also, when email users correct the filter by identifying false positives or flagging spam that manages to get into the inbox, Bayesian filters pick up new relevant information.
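
As a rough illustration of that learning loop, the sketch below keeps running word counts and adjusts them when a user corrects a mistake. The class and method names are hypothetical, and the tokenizer (lowercasing and splitting on whitespace) is deliberately simplistic.

```python
from collections import Counter

class LearningFilter:
    """Toy filter that learns from labelled messages and user corrections."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}  # keyed by is_spam
        self.message_counts = {True: 0, False: 0}

    @staticmethod
    def _words(text):
        # Count each word once per message.
        return set(text.lower().split())

    def train(self, text, is_spam):
        self.word_counts[is_spam].update(self._words(text))
        self.message_counts[is_spam] += 1

    def untrain(self, text, is_spam):
        self.word_counts[is_spam].subtract(self._words(text))
        self.message_counts[is_spam] -= 1

    def correct(self, text, was_spam, is_spam):
        """The user says the filter's earlier label was wrong: move the counts."""
        self.untrain(text, was_spam)
        self.train(text, is_spam)

    def word_probability(self, word):
        """P(spam | word) with add-one smoothing and equal priors (assumptions)."""
        p_spam = (self.word_counts[True][word] + 1) / (self.message_counts[True] + 2)
        p_ham = (self.word_counts[False][word] + 1) / (self.message_counts[False] + 2)
        return p_spam / (p_spam + p_ham)

f = LearningFilter()
f.train("cheap pills now", is_spam=True)
f.train("lunch meeting now", is_spam=False)
before = f.word_probability("pills")
# The user rescues the first message from the spam folder (a false positive).
f.correct("cheap pills now", was_spam=True, is_spam=False)
after = f.word_probability("pills")
```

After the correction, "pills" counts as evidence of legitimate mail rather than spam, so its spam probability drops.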

Most importantly, Bayesian anti-spam filtering technology learns on an individual basis. Over time, a user’s email account receives a significant amount of correspondence from many of the same people, mentioning many of the same things. Similarly, individuals are prone to receiving particularly high quantities of certain types of unwanted bulk email as a side effect of their daily online activities. A Bayesian-based email filter continually obtains new statistical data specifically relevant to the user.
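
One way to picture per-user learning is to keep a separate set of counts for each mailbox, so the same word can point in opposite directions for different people. The sketch below assumes exactly that; the user names and helper functions are invented for the example.

```python
from collections import Counter, defaultdict

def new_profile():
    # Separate word and message counts for one user's mail.
    return {"spam": Counter(), "ham": Counter(), "n_spam": 0, "n_ham": 0}

profiles = defaultdict(new_profile)  # one profile per user

def train(user, text, is_spam):
    label = "spam" if is_spam else "ham"
    profile = profiles[user]
    profile[label].update(set(text.lower().split()))
    profile["n_" + label] += 1

def word_probability(user, word):
    """P(spam | word) for this user, with add-one smoothing (an assumption)."""
    profile = profiles[user]
    p_spam = (profile["spam"][word] + 1) / (profile["n_spam"] + 2)
    p_ham = (profile["ham"][word] + 1) / (profile["n_ham"] + 2)
    return p_spam / (p_spam + p_ham)

# A mortgage broker gets legitimate mail about rates; a student only sees it in spam.
train("broker", "mortgage rates update", is_spam=False)
train("student", "mortgage rates update", is_spam=True)

broker_view = word_probability("broker", "mortgage")
student_view = word_probability("student", "mortgage")
```

Here the word "mortgage" leans legitimate for the broker and leans spam for the student, even though both saw the same text.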

Of course, no system is perfect on its own. Bayesian filters, like other text-based methods, are susceptible to “poisoning.” For example, spammers attempt to fool the filter by pasting in a large number of words completely unrelated to their spam message. The text may be a copied portion of an online article, or even words pulled at random from an online dictionary. This dilutes the share of common spam terms and phrases, making them statistically insignificant to the filter.
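
The dilution effect is easy to reproduce. In the sketch below, four strongly spammy words are padded with 40 unfamiliar ones that each lean slightly legitimate. The scores are invented for illustration. The 0.4 score for unfamiliar words and the idea of scoring only the 15 most telling words follow the approach Paul Graham described in 2002, but treat both numbers as assumptions rather than settings from any particular product.

```python
import math

def combine(probabilities):
    """Naive Bayes combination of per-word spam probabilities (log-odds form)."""
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probabilities)
    log_odds = max(-700.0, min(700.0, log_odds))  # keep math.exp in range
    return 1.0 / (1.0 + math.exp(-log_odds))

def most_interesting(probabilities, limit=15):
    """Keep only the words whose scores sit farthest from the neutral 0.5."""
    return sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:limit]

spam_words = [0.99, 0.98, 0.97, 0.95]  # hypothetical scores for the sales pitch
padding = [0.4] * 40                   # unfamiliar words pasted in by the spammer

plain = combine(spam_words)                                 # no padding
diluted = combine(spam_words + padding)                     # every word counted
focused = combine(most_interesting(spam_words + padding))   # top 15 words only
```

With every word counted, the padding drags the score below 0.5 and the message slips through. Scoring only the most telling words keeps it well above 0.99 in this example, which is one reason padding alone is not always enough to beat a filter.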

Spammers have also had luck bypassing Bayesian anti-spam filters by misspelling words, inserting characters or spaces in the middle of words, and replacing letters with digits, as with a 1 in place of an I or a 3 in place of an E. Bayesian filters are continually adjusted to account for such tricks. The battle between spammers and spam filtering technology will undoubtedly go on indefinitely, with each side perpetually trying to outmaneuver the other.
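
One common countermeasure is to normalize the text before counting words. The following sketch is a simplified assumption about how such a step might look, not a description of any real product: the substitution table is a small made-up sample, and the regular expression only handles single characters separated by a space, dot, asterisk, or hyphen.

```python
import re

# Hypothetical look-alike table: digits and symbols mapped back to letters.
LOOKALIKES = str.maketrans({"1": "i", "3": "e", "0": "o", "4": "a",
                            "5": "s", "@": "a", "$": "s"})

# Three or more single characters split up by a separator, e.g. "c l i c k".
SPACED_OUT = re.compile(r"\b(?:[a-z0-9][ .*-]){2,}[a-z0-9]\b")

def normalize_token(token):
    """Undo look-alike substitutions in tokens that contain letters."""
    if not any(c.isalpha() for c in token):
        return token  # leave pure numbers such as "2011" alone
    return re.sub(r"[^a-z]", "", token.translate(LOOKALIKES))

def normalize(text):
    """Lowercase, rejoin spaced-out words, then clean each token."""
    text = text.lower()
    text = SPACED_OUT.sub(lambda m: re.sub(r"[ .*-]", "", m.group(0)), text)
    tokens = (normalize_token(t) for t in text.split())
    return [t for t in tokens if t]

normalize("FR3E V1AGRA")       # ['free', 'viagra']
normalize("c l i c k here")    # ['click', 'here']
```

A heuristic like this will sometimes mangle legitimate text, so in practice it would be tuned against real mail rather than used as written.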

The more effective of today’s anti-spam filters rely on multiple filtering technologies that, when used in conjunction, determine whether an incoming message is spam with very high accuracy. Bayesian methods are still earning their keep, though, and are incorporated into many of the leading anti-spam filters available today. Particular success has been found by combining Bayesian methods with IP address reputation filtering. Products based on this particular combination are widely regarded as among the strongest, smartest, most reliable anti-spam filters available.
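
How vendors actually blend these signals is not public, so the following is only a hypothetical sketch of one simple possibility: treat the Bayesian content score and an IP reputation score as two probabilities and add their log-odds, with optional weights. The function names, the weights, and the idea that reputation arrives as a probability are all assumptions.

```python
import math

def logit(p):
    """Convert a probability strictly between 0 and 1 to log-odds."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def combined_score(content_p, ip_p, content_weight=1.0, ip_weight=1.0):
    """Blend a Bayesian content score with an IP reputation score.
    Both inputs are 'probability of spam' values; the weights are
    illustrative knobs, not recommended settings."""
    return sigmoid(content_weight * logit(content_p) + ip_weight * logit(ip_p))

# Borderline content from a sender with a poor reputation tips toward spam...
bad_sender = combined_score(0.6, 0.9)
# ...while the same content from a well-regarded sender tips toward legitimate.
good_sender = combined_score(0.6, 0.1)
```

The appeal of a combination like this is that a message with ambiguous wording can still be judged sensibly by where it came from, and vice versa.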

