Checking for those random strings

Rick Mann rmann at latencyzero.com
Sun Dec 14 20:28:39 CET 2003


I've noticed that nearly all of the spam I receive (99.9%) contains 
strings of random characters. I'm not sure why they're there 
(presumably to fool really, really lame anti-spam tools). To me, these 
are *significant* indicators that a message is spam, but how can they 
be identified accurately?

I realize this goes against the theory that words should simply be 
counted and compared, but the problem is that these random strings are 
often seen only once, and adding a message that contains them to the 
corpus doesn't really help to identify future messages with similar 
(but not equal) random strings. (I also realize that people have 
probably already tried to key on these strings and failed.)

One approach would be to see if a string of characters (a word) exists 
in a comprehensive dictionary of commonly used words. This approach has 
many drawbacks, not the least of which are misspelled words and 
non-English words.

You can get around that to some degree by simply assigning a spammish 
score to words that don't appear in a language dictionary. But, it 
might not be enough to sufficiently affect the spam score in the cases 
when the random strings do indicate spam.
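
To make the dictionary idea concrete, here is a rough sketch in Python. 
The tiny KNOWN_WORDS set and the function names are placeholders I made 
up for illustration; a real check would load a full word list for each 
language of interest. It just reports which tokens are missing from the 
list and what fraction of the message they make up.

```python
import re

# Hypothetical stand-in for a real dictionary; a real one would be
# loaded from a much larger word list.
KNOWN_WORDS = {"buy", "cheap", "pills", "now",
               "the", "meeting", "is", "at", "noon"}

def unknown_tokens(text, known=KNOWN_WORDS):
    """Return the alphabetic tokens that are not in the word list."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() not in known]

def unknown_ratio(text, known=KNOWN_WORDS):
    """Fraction of alphabetic tokens missing from the word list."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    return len(unknown_tokens(text, known)) / len(tokens)
```

Misspellings and non-English words would show up in unknown_tokens() 
just like random strings do, which is exactly the drawback mentioned 
above.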

Alternatively, one could analyze the character pattern to see if the 
string is likely to be a word in any language. It might be as simple as 
noticing that there are four consonants in a row. One of the drawbacks 
here is that there are many acceptable acronyms that would likely get 
caught by this approach, applied blindly. Maybe strings longer than, 
say, six characters with a long run of consonants? Maybe an 
analysis of the order and frequency of classes of characters in the 
string, compared against averages for a particular language or group of 
languages?
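
As a sketch of the consonant-run idea (only a guess at how it might 
look, not something I've tested against real mail): the thresholds of 
seven characters and four consonants in a row come straight from the 
numbers above, while treating "y" as a vowel and the function names are 
my own assumptions.

```python
import re

# Runs of consonants; 'y' is treated as a vowel here (an assumption).
CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-xz]+")

def longest_consonant_run(word):
    """Length of the longest run of consecutive consonants."""
    runs = CONSONANT_RUN.findall(word.lower())
    return max((len(r) for r in runs), default=0)

def looks_random(word, min_length=7, min_run=4):
    """Flag words of min_length or more characters that contain at
    least min_run consonants in a row."""
    if len(word) < min_length:
        return False
    return longest_consonant_run(word) >= min_run
```

With these settings looks_random("xkqjzwv") is true and short acronyms 
like "HTTP" are skipped by the length check, but a real word such as 
"strengths" is flagged too, so the rule is clearly too blunt on its own.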

Is there some way to identify a given string of characters as "likely 
to be a real word" or "likely to be gobbledegook?"
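
One rough way to attempt this, along the lines of the order-and-
frequency idea above, is to count which letter pairs occur in a sample 
of real words and then score a candidate by how familiar its pairs are. 
The sketch below is only an illustration: SAMPLE_WORDS is a made-up toy 
list, the add-one smoothing is my choice, and any cutoff between "word" 
and "gobbledegook" would have to be tuned on real data.

```python
import math
from collections import Counter

# Toy training sample (hypothetical); a real model would need a large
# word list per language.
SAMPLE_WORDS = ["message", "filter", "random", "string", "dictionary",
                "language", "character", "pattern", "consonant",
                "identify", "mailing", "indicator", "english",
                "future", "similar"]

ALPHABET_SIZE = 27  # 26 letters plus the end-of-word marker

def train(words):
    """Count letter pairs, with '^' and '$' marking word boundaries."""
    pairs = Counter()
    firsts = Counter()
    for w in words:
        padded = "^" + w.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            pairs[(a, b)] += 1
            firsts[a] += 1
    return pairs, firsts

def avg_log_prob(word, pairs, firsts):
    """Average log-probability per letter pair, add-one smoothed.
    Higher (closer to zero) means more word-like."""
    padded = "^" + word.lower() + "$"
    total = 0.0
    steps = 0
    for a, b in zip(padded, padded[1:]):
        p = (pairs[(a, b)] + 1) / (firsts[a] + ALPHABET_SIZE)
        total += math.log(p)
        steps += 1
    return total / steps

pairs, firsts = train(SAMPLE_WORDS)
word_score = avg_log_prob("mailing", pairs, firsts)
junk_score = avg_log_prob("xkqjzwv", pairs, firsts)
```

Here word_score comes out higher than junk_score, since none of the 
letter pairs in "xkqjzwv" appear in the sample. Whether that separation 
holds up across several languages at once is an open question.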

This would help keep corpus dictionaries from growing with random 
strings, and probably improve identification of spam. At least, until 
the spammers stop inserting them.

-- 
Rick





More information about the Bogofilter mailing list