OT: Chunking the cruft - random lettered words

Tom Anderson tanderso at oac-design.com
Tue Mar 16 05:04:39 CET 2004


On Mon, 2004-03-15 at 12:36, John McCain wrote:
> > Make sure your min_dev value is further from 0.5 than your robx.  This
> > way random words won't affect the classification.  The email will be
> > scored based on the non-random words.
> 
> So if I am using the default robx (.415?), and a mindev of .2, I'm cool.  At
> least according to the strict mathematical interpretation of what you said.
> The only problem is that if what you say is correct, decreasing minimum
> deviation seems counterintuitive.
> 
> Perhaps I didn't understand.  Are my numbers correct?
> 

Yes.  If bogofilter sees the word "random" for the first time, it will
score it as robx, 0.415.  However, only words whose scores differ from
0.5 by at least min_dev (in your case 0.7-1.0 and 0.0-0.3) will be used
in classifications.  Since 0.415 is only 0.085 away from 0.5, the word
"random" will not be used in the classification of the message in which
it is first seen.
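A rough Python sketch of that rule, in case it helps.  It assumes
Robinson's smoothing formula f(w) = (s*x + n*p(w)) / (s + n), where x is
robx and s is robs.  The function names, the robs value of 0.01, and the
exact handling of a score sitting right on the min_dev boundary are my
own assumptions for illustration, not bogofilter's code.

```python
def robinson_score(p_w, n, robx, robs):
    """Smoothed word score: pulls p_w toward robx when n (times seen) is small."""
    return (robs * robx + n * p_w) / (robs + n)

def is_used(score, min_dev):
    """A word counts only if its score is at least min_dev away from 0.5."""
    return abs(score - 0.5) >= min_dev

# A word never seen before has n = 0, so its score is just robx.
unseen = robinson_score(p_w=0.0, n=0, robx=0.415, robs=0.01)
print(unseen, is_used(unseen, min_dev=0.2))

# A word seen 20 times, almost always in spam, lands well outside the band.
spammy = robinson_score(p_w=0.98, n=20, robx=0.415, robs=0.01)
print(spammy, is_used(spammy, min_dev=0.2))
```

With n = 0 the p_w term drops out entirely, which is why the first-time
score depends only on robx.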

Following this logic, if you receive a message that contains five spammy
words and five hundred random words not seen before, only the five spammy
words are counted, and the message will be classified as spam, no doubt
about it.  Random words don't help the spammer at all.  Now, if you
register this spam, and you get another similar spam, the formerly
unknown words it repeats are in your database as spammy, so it may now
contain 400-500 spammy words, pinning the score squarely at 1.0.  In this
case, random words destroy the spammer.
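To put numbers on the five-versus-five-hundred case, here is a small
Python sketch.  It assumes Fisher-style chi-square combining as described
by Gary Robinson, in simplified form.  The 0.99 score for a spammy word
and all the function names are made up for illustration, so treat it as
an approximation of the idea rather than bogofilter's implementation.

```python
import math

def chi2q(x2, df):
    """Upper-tail chi-square probability, closed form for even df."""
    m = x2 / 2.0
    term = math.exp(-m)  # would underflow for very large m; fine for this demo
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def classify(scores, min_dev):
    """Combine word scores, ignoring any within min_dev of 0.5."""
    used = [p for p in scores if abs(p - 0.5) >= min_dev]
    if not used:
        return 0.5  # nothing to go on
    df = 2 * len(used)
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in used), df)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in used), df)
    return (1.0 + spam_evidence - ham_evidence) / 2.0

# Five spammy words plus five hundred unknown words scored at robx = 0.415.
message = [0.99] * 5 + [0.415] * 500
score = classify(message, min_dev=0.2)
print(score)
```

With min_dev at 0.2 the five hundred unknown words are dropped before the
combining step, so the result comes out very close to 1.0, the same as if
the message held only the five spammy words.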

The only way random words can help a spammer is if they have a pretty
good idea of which words are ranked hammy in your database, and insert
those randomly in the spam.  The problem is that everyone's database is
different, so this task is nearly impossible for a spammer to achieve
except for a very targeted audience (which is counter to the entire
concept of spam).  The best they can do is use "common" words, but they
will tend to be more neutral than hammy.

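Notice, though, that if you set your min_dev to 0.05 and your robx to
0.35, an unknown word scores 0.15 below 0.5, which is outside the ignored
band of 0.45-0.55, so it counts as hammy.  Random lettered words and
random new words would then work to the spammer's advantage, which is why
you always want your robx within the min_dev range, that is, closer to
0.5 than min_dev.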
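A quick way to check a robx/min_dev pair is sketched below in Python.
The helper name and its return strings are hypothetical, and the handling
of a score exactly on the boundary is an assumption.

```python
def unknown_word_effect(robx, min_dev):
    """Say how a never-seen word (scored at robx) influences a message."""
    deviation = robx - 0.5
    if deviation == 0 or abs(deviation) < min_dev:
        return "ignored"  # inside the min_dev band around 0.5
    return "spammy" if deviation > 0 else "hammy"

print(unknown_word_effect(robx=0.415, min_dev=0.2))
print(unknown_word_effect(robx=0.35, min_dev=0.05))
```

The first pair leaves unknown words out of the classification; the second
lets every unknown word pull the message toward ham.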

Tom
