I’ll Have the Word Salad, Please

by Christopher on February 5, 2011

“Colorless green ideas sleep furiously.”

Sound like the subject or content of a recent spam email you received? This nonsensical statement is, in fact, a well-known example of “word salad,” or a string of words that don’t mean anything together. It was composed in 1957 by linguist Noam Chomsky to illustrate logical form with illogical semantics: proper grammar with no real meaning. As silly as the sentence sounds, it was key to Chomsky’s revolutionary challenge to the linguistic models that prevailed at the time.

At some point, you’ve undoubtedly paused and stared, perplexed, wondering why a spammer would send a random string of pointless, meaningless words and phrases. Like Chomsky’s composition, word salad in spam serves important purposes that are not necessarily obvious.

Word salad in spam email is an attempt to sneak past email security systems that block or redirect spam using Bayesian filtering. This anti-spam method was once a primary spam filtering technology, but today it is used in conjunction with other methods to make more well-rounded determinations about the validity of an incoming email.

Bayesian spam filtering takes its name from Bayes’ theorem, a result in probability theory attributed to the 18th-century mathematician Thomas Bayes. The technique estimates whether an incoming message is spam from how often its words and phrases have appeared in known spam compared with known legitimate mail.
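
In its simplest per-word form, the theorem can be written roughly as follows. This is a sketch of the standard textbook statement, not the formula of any particular product. Here w stands for a single word, and “ham” is the customary shorthand for legitimate mail (a label this post does not otherwise use).

```latex
P(\text{spam} \mid w) =
  \frac{P(w \mid \text{spam})\, P(\text{spam})}
       {P(w \mid \text{spam})\, P(\text{spam}) + P(w \mid \text{ham})\, P(\text{ham})}
```

The left-hand side reads as “the probability that a message is spam, given that it contains w.”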

A Bayesian filter notes content such as “sexy singles,” “online gambling,” “free no-risk trial,” “cheapest drugs,” and “enhance your partner’s satisfaction” as far more likely to appear in spam than in a legitimate message from someone you know. When an email is composed mostly of such high-probability spam terminology, a Bayesian filter classifies it as spam. And this is often an accurate determination; if not, perhaps you should consider making some new friends.
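
To make that concrete, here is a minimal Python sketch of how a filter might learn which words lean toward spam. Everything in it is an assumption made for illustration: the tiny training messages are invented, the function names are hypothetical, and the equal priors and add-one smoothing are simplifications rather than what any real product does.

```python
from collections import Counter

# Invented toy training data, not drawn from any real mail corpus.
SPAM_MESSAGES = [
    "sexy singles online gambling",
    "cheapest drugs free no-risk trial",
    "online gambling free trial",
]
HAM_MESSAGES = [
    "lunch meeting moved to friday",
    "please review the attached report",
    "see you at the meeting on friday",
]


def document_frequency(messages):
    """Count how many messages each word appears in."""
    counts = Counter()
    for message in messages:
        counts.update(set(message.lower().split()))
    return counts


def word_spam_probabilities(spam_msgs, ham_msgs, smoothing=1.0):
    """Estimate P(spam | word) for every word seen in training."""
    spam_df = document_frequency(spam_msgs)
    ham_df = document_frequency(ham_msgs)
    probabilities = {}
    for word in set(spam_df) | set(ham_df):
        # Smoothed estimates of P(word | spam) and P(word | ham).
        p_word_spam = (spam_df[word] + smoothing) / (len(spam_msgs) + 2 * smoothing)
        p_word_ham = (ham_df[word] + smoothing) / (len(ham_msgs) + 2 * smoothing)
        # Bayes' theorem with equal priors, which cancel out.
        probabilities[word] = p_word_spam / (p_word_spam + p_word_ham)
    return probabilities


probs = word_spam_probabilities(SPAM_MESSAGES, HAM_MESSAGES)
# Show the words the toy filter considers most spam-like.
for word, p in sorted(probs.items(), key=lambda item: -item[1])[:5]:
    print(f"{word}: {p:.2f}")
```

On this toy data, words such as “gambling” land well above 0.5 and words such as “meeting” land well below it, which is the kind of evidence a filter then weighs across a whole message.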

By pasting in common words and phrases, a spammer boosts the “legitimate” language of a spam email and dilutes the red-flag terminology. If the probable spam words and phrases become statistically insignificant relative to the overall content of an email, a Bayesian filter is likely to conclude the message is legitimate. Thus, the inclusion of word salad, no matter where it comes from or what it says, helps usher spam past certain email spam filters.
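
The dilution effect can be sketched in a few lines of Python. The per-word probabilities below are hypothetical values chosen by hand, and the combining rule is the simple “naive” one that treats every word as independent evidence; real filters typically weigh words more selectively, so take this as an illustration of the idea rather than a model of any actual product.

```python
import math

# Hypothetical per-word spam probabilities, invented for illustration.
WORD_SPAM_PROB = {
    "cheapest": 0.95, "drugs": 0.90,
    "meeting": 0.15, "report": 0.20, "friday": 0.20,
    "lunch": 0.25, "thanks": 0.10,
}


def combined_spam_probability(text, default=0.5):
    """Combine per-word probabilities, treating words as independent.

    Computes prod(p) / (prod(p) + prod(1 - p)) in log space.
    Words missing from the table count as neutral evidence (0.5).
    """
    log_spam = 0.0
    log_ham = 0.0
    for word in text.lower().split():
        p = WORD_SPAM_PROB.get(word, default)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


pitch = "cheapest drugs"
padded = pitch + " meeting report friday lunch thanks"

plain_score = combined_spam_probability(pitch)
padded_score = combined_spam_probability(padded)
print(f"pitch alone:      {plain_score:.3f}")
print(f"pitch plus salad: {padded_score:.3f}")
```

With these made-up numbers the bare pitch scores close to 1, while the padded version falls below 0.5, so a filter using 0.5 as its cutoff would let it through.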

There is a second purpose to word salad in spam as well. Bayesian filters adapt and evolve by learning from the user’s input. When an account holder with Bayesian filtering flags messages as spam, the filter gradually identifies new words and phrases as more likely to be contained in spam. As the user flags more and more spam that consists largely of word salad, the filter begins to associate those ordinary words with spam and eventually starts making incorrect determinations.
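
This poisoning effect can also be sketched in Python. The class below is a hypothetical toy (the name TinyBayesFilter, the training messages, the equal priors, and the add-one smoothing are all assumptions for illustration), but it shows how repeatedly flagging salad-padded spam can drag an ordinary message across the spam threshold.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """A toy word-count naive Bayes filter (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, text, label):
        """Record the words of a message the user labeled spam or ham."""
        self.counts[label].update(text.lower().split())

    def spam_probability(self, text):
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total_spam = sum(self.counts["spam"].values())
        total_ham = sum(self.counts["ham"].values())
        log_odds = 0.0  # equal priors assumed, so the prior term is zero
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            p_spam = (self.counts["spam"][word] + 1) / (total_spam + vocab)
            p_ham = (self.counts["ham"][word] + 1) / (total_ham + vocab)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))


spam_filter = TinyBayesFilter()
spam_filter.learn("cheapest drugs online gambling", "spam")
spam_filter.learn("free no-risk trial sexy singles", "spam")
spam_filter.learn("lunch meeting moved to friday", "ham")
spam_filter.learn("please review the quarterly report", "ham")

legit = "quarterly report meeting friday"
before = spam_filter.spam_probability(legit)

# The user dutifully flags five spam messages padded with ordinary words.
for _ in range(5):
    spam_filter.learn(
        "cheapest drugs quarterly report meeting friday lunch review", "spam"
    )
after = spam_filter.spam_probability(legit)

print(f"legitimate message before poisoning: {before:.3f}")
print(f"legitimate message after poisoning:  {after:.3f}")
```

In this toy run the legitimate message starts well under 0.5 and ends up above it, which is exactly the kind of false positive described next.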

The result is an ever-increasing number of false positives. Incorrectly identifying a legitimate email as spam is, of course, the most inconvenient and potentially damaging failure of spam filtering technology. If spammers can so distort the “understanding” of a Bayesian filter that its false positive rate continues to rise, the user will have to devote considerable time to carefully going through their spam folder. Many users will lower the protection settings on their accounts or simply deactivate filtering altogether.

Fortunately, spam protection has moved well beyond Bayesian filters alone, though most leading anti-spam products continue to use the method in conjunction with others. Bayesian filtering certainly has a place in spam filtering technology; when words and phrases in a message are determined to be high-probability spam content, and other, unrelated red flags are raised by an email too, spam filters can make accurate determinations. So don’t be surprised to still see apparently pointless gibberish like “colorless green ideas sleep furiously” in your spam, attempting to fool one part of your spam filter’s checklist. Only now, you’ll know it’s not nearly as pointless as it appears.

Posted in: Spam
