OT: Chunking the cruft - random lettered words

Tom Anderson tanderso at oac-design.com
Tue Mar 16 04:49:05 CET 2004


On Mon, 2004-03-15 at 16:36, Eric Wood wrote:
> Are you saying "random words" or "random lettered words".  There is a
> difference.  Lots of unique words having no previous score would help pass
> the email.  If the email just had less than 20 such words, bogofilter would
> catch it.

> there baker we too find automobile swordfish
> weghj wsj fgwsgfh oashweg ogdhbsdo jsjfsfhl

Those two lines are equivalent to bogofilter if it hasn't ever seen any
of those words before.  Each of those words would get a score of robx
(let's say 0.415), and none would contribute to classification if
min_dev > 0.5-robx.  So, assuming a min_dev of 0.1, the above two lines
would both score 0.415 (since none of the tokens contribute, the message
scores as robx).  Now, assuming bogofilter had seen "too" before, but
none of the others, then the message would score roughly the same as
that word.  
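
As a rough illustration of the arithmetic above, here is a minimal sketch
in Python. It is not bogofilter's actual code: message_score is a
hypothetical helper, and it combines the contributing tokens with
Robinson's geometric-mean formula as a simplifying assumption. The robx
and min_dev values are the ones used in this message.

```python
import math

ROBX = 0.415    # score given to never-seen tokens (value used in the post)
MIN_DEV = 0.1   # tokens closer than this to 0.5 are ignored

def message_score(tokens, known, robx=ROBX, min_dev=MIN_DEV):
    """Hypothetical helper: `known` maps token -> score strictly between 0 and 1."""
    # Unknown tokens fall back to robx.
    scores = [known.get(t, robx) for t in tokens]
    # Keep only tokens that deviate from neutral (0.5) by at least min_dev.
    used = [p for p in scores if abs(p - 0.5) >= min_dev]
    if not used:
        # Nothing contributes, so the message scores as robx.
        return robx
    n = len(used)
    # Assumed combining rule: Robinson's geometric-mean method.
    P = 1.0 - math.exp(sum(math.log(1.0 - p) for p in used) / n)
    Q = 1.0 - math.exp(sum(math.log(p) for p in used) / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0

real_words = "there baker we too find automobile swordfish".split()
gibberish = "weghj wsj fgwsgfh oashweg ogdhbsdo jsjfsfhl".split()

print(message_score(real_words, {}))            # no token contributes
print(message_score(gibberish, {}))             # same situation
print(message_score(real_words, {"too": 0.2}))  # only "too" contributes
```

With an empty word list both lines fall back to robx. Once "too" has a
score of its own (0.2 here is an invented value), it is the only
contributing token, and the message score comes out close to that word's
score.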

So, while "random" words such as "we, too, find", etc., may slightly
contribute to a hammy score, you may have registered a spam that
contained "wsj, oashweg, baker, swordfish", in which case, these would
contribute to a spammy score.  In fact, if "we, too, find" are only
marginally hammy, while "baker, swordfish" are very spammy, then the
random word sequence may very well tip off bogofilter to the spamminess
irrespective of the actual payload of the spam, thus backfiring entirely
for the spammer.  And words like "wsj, oashweg" will almost certainly
not appear in any hams (except this one), and will thus be a tremendous
indicator of spam if seen again.

Due to this fact, I'm not at all concerned about random words, whether
real words or random letter combinations.  They tend to reveal spam
better than if they were left out.

My problem spams are those which contain a valid English diatribe as the
payload.  For instance, a long testimonial about how a high school kid
made millions with a chain letter and hid the cash in his closet will
tend to have lots of generic nouns, verbs, and prepositions often used
in normal speech with friends and business associates.  A few dozen
paragraphs of this will wash out the spammy tokens about mailing $5 to
each of the five addresses.  The best I can hope to do is register the
spam several times to make the hammy words a little more neutral so that
the spammy ones stand out more.
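
To see why registering the spam several times nudges a hammy word toward
neutral, here is a small sketch of a per-word score. It assumes
Robinson's smoothing formula f(w) = (s*x + n*p) / (s + n), with x playing
the role of robx. The function name word_score, the weight robs=1.0, and
all of the counts are made up for illustration and are not taken from
bogofilter's source or from any real word list.

```python
def word_score(spam_hits, spam_msgs, ham_hits, ham_msgs, robx=0.415, robs=1.0):
    """Hypothetical per-word score using Robinson's smoothing (an assumption)."""
    n = spam_hits + ham_hits
    if n == 0:
        # Never-seen word: the score is simply robx.
        return robx
    # Fraction of spam and ham messages that contained the word.
    spam_rate = spam_hits / spam_msgs if spam_msgs else 0.0
    ham_rate = ham_hits / ham_msgs if ham_msgs else 0.0
    p = spam_rate / (spam_rate + ham_rate)
    # Blend the raw estimate with robx, weighted by how often the word was seen.
    return (robs * robx + n * p) / (robs + n)

# Invented counts: a generic word seen in 40 of 1000 hams and 2 of 1000 spams.
before = word_score(2, 1000, 40, 1000)
# The same word after one spam containing it is registered three more times.
after = word_score(5, 1003, 40, 1000)

print(before, after)
```

With these made-up counts the word stays on the hammy side of 0.5, but
its score moves closer to neutral after the extra registrations, which
is the effect described above.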

Tom
