Checking for those random strings
Rick Mann
rmann at latencyzero.com
Sun Dec 14 20:28:39 CET 2003
I've noticed that nearly all of the spam I receive (99.9%) contains
strings of random characters. I'm not sure why they're there
(presumably to fool really, really lame anti-spam tools). To me, these
are *significant* indicators that a message is spam, but how can they
be identified accurately?
I realize this goes against the theory that words should simply be
counted and compared, but the problem is that these random strings are
often seen only once, so adding a message that contains them to the
corpus doesn't really help to identify future messages with similar
(but not identical) random strings. I also realize that people have
probably already tried to key on these strings and failed.
One approach would be to see if a string of characters (a word) exists
in a comprehensive dictionary of commonly used words. This approach has
many drawbacks, not the least of which are misspelled words and
non-English words.
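
As a rough illustration of the dictionary idea, here is a minimal Python
sketch. The tiny word list and the name unknown_tokens are made up for
the example; a real check would load a full dictionary file, and nothing
here reflects how Bogofilter itself tokenizes mail.

```python
import re

# Hypothetical word list, for illustration only; a real check would
# load a comprehensive dictionary instead.
KNOWN_WORDS = {"the", "majority", "of", "spam", "contains", "random", "strings"}

def unknown_tokens(text, known_words=KNOWN_WORDS):
    """Return the alphabetic tokens in text that are not in known_words."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() not in known_words]
```

With this word list, unknown_tokens("The majority of spam contains
xqzvkp strings") returns ["xqzvkp"], but unknown_tokens("teh spam")
returns ["teh"], which shows the misspelling drawback directly.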
You can get around that to some degree by simply assigning a spammish
score to words that don't appear in a language dictionary. But that
might not be enough to sufficiently affect the spam score in the cases
when the random strings do indicate spam.
Alternatively, one could analyze the character pattern to see if the
string is likely to be a word in any language. It might be as simple as
noticing that there are four consonants in a row. One of the drawbacks
here is that there are many acceptable acronyms that would likely get
caught by this approach if applied blindly. Maybe strings longer than,
say, six characters with a long run of consonants? Maybe an
analysis of the order and frequency of classes of characters in the
string, compared against averages for a particular language or group of
languages?
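
The consonant-run idea could be sketched in Python along these lines.
The function names and the treatment of "y" as a vowel are assumptions
made for the example, and the thresholds (longer than six characters,
four consonants in a row) are just the guesses mentioned above, not
tuned values.

```python
import re

# Assumption: "y" counts as a vowel, so it is left out of this class.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]+")

def longest_consonant_run(word):
    """Length of the longest run of consecutive consonants in word."""
    runs = CONSONANT_RUN.findall(word.lower())
    return max((len(r) for r in runs), default=0)

def looks_random(word, min_length=7, min_run=4):
    """Flag words longer than six characters with four or more consonants in a row."""
    return len(word) >= min_length and longest_consonant_run(word) >= min_run
```

The length cutoff lets short acronyms such as "HTTP" through, and
"xkqvbtrz" is flagged as expected. A real word like "strengths" is
flagged too, though, so this test alone is clearly too blunt.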
Is there some way to identify a given string of characters as "likely
to be a real word" or "likely to be gobbledegook?"
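
One possible angle on that question is to score a string by how common
its adjacent letter pairs are in a sample of known words. The Python
sketch below is only an assumption about how that might look: the
function names, the "^" and "$" boundary markers, and the add-one
smoothing are all choices made for the example, and any cutoff between
"word" and "gobbledegook" would have to be found by experiment.

```python
import math
from collections import Counter

def train_bigrams(words):
    """Count adjacent letter pairs, with ^ and $ marking word boundaries."""
    counts = Counter()
    for word in words:
        padded = "^" + word.lower() + "$"
        counts.update(zip(padded, padded[1:]))
    return counts

def wordlikeness(word, counts, alphabet_size=28):
    """Average log of the smoothed relative frequency of each letter pair."""
    total = sum(counts.values())
    padded = "^" + word.lower() + "$"
    pairs = list(zip(padded, padded[1:]))
    log_prob = sum(
        math.log((counts[p] + 1) / (total + alphabet_size ** 2)) for p in pairs
    )
    return log_prob / len(pairs)
```

Trained on a handful of ordinary words (say "message", "random",
"strings", "language"), wordlikeness gives "strange" a higher score
than "xqzvkp", because none of the letter pairs in "xqzvkp" occur in
the sample. Higher scores mean "more word-like"; the scores are only
meaningful relative to one another.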
This would help keep corpus dictionaries from growing with random
strings, and probably improve identification of spam. At least, until
the spammers stop inserting them.
--
Rick