What is spam? (was: [bogofilter] ESF and redundancy)

David Relson relson at osagesoftware.com
Tue May 11 18:47:34 CEST 2004


On Tue, 11 May 2004 10:23:08 -0400
Tom Anderson wrote:

...[snip]...

> As you can see, some tokens such as "document" and "attached" are
> hammy; however, I doubt I've ever received a ham that said "Your
> document is attached."  And yet, some variation of this (e.g. "Your
> file is attached") is seen in these virus spams all the time.  With a
> Markovian filter, the 3-4 token phrase would be exponentially more
> relevant than the individual tokens.
> 
> Also of note, even though I've stripped out the non-standard headers
> with spamitarium, it's still largely the "administrative" tokens which
> make this email seem hammy.  Dates are especially frustrating... I
> wish bogofilter would ignore them.  I would strip them with
> spamitarium if they weren't a required part of the spec and used
> extensively by email clients for sorting and such.  Removing
> "X-Priority", "X-MSMail-Priority", "ESMTP", etc., has helped a bit. 
> Adding "helo-oac-design.com" and "as6478" helped a lot.  Without
> spamitarium, this email was scored at 0.067239.  Nonetheless, even at
> 0.468896, it still gets classified as "unsure".  I need something more
> to overcome the hamminess of the "mime:" tokens.  Perhaps simply
> registering this exhaustively until all of those tokens become neutral
> is the answer.  However, the Markovian method is also tempting.
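
To make the phrase idea above concrete, here is a minimal Python sketch of
the tokenization step only: every contiguous run of 1 to 4 words becomes its
own token, so "your document is attached" can be scored separately from
"document" and "attached".  This is not how bogofilter tokenizes, and it
leaves out the extra weight a Markovian filter would give to longer phrases.
The name phrase_tokens and the 4-word limit are assumptions made for
illustration.

```python
import re

def phrase_tokens(text, max_len=4):
    """Return every contiguous run of 1..max_len words as one token.

    Hypothetical helper for illustration; not part of bogofilter.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break  # no room left for a longer phrase at this position
            tokens.append(" ".join(words[start:start + length]))
    return tokens

tokens = phrase_tokens("Your document is attached.")
```

The single-word tokens are still produced, so a filter trained on this
output could weigh the hammy "document" against the full spammy phrase.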

Tom,

You've got the source for the lexer in file lexer_v3.l.  You can easily
modify it to discard "X-Whatever:" and dates.  It might make an
interesting experiment to see if that helps you.
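
As a rough way to try that experiment before touching lexer_v3.l, the Python
sketch below drops "X-Whatever:" headers and the "Date:" header from a copy of
a message, so the copy can be scored while the delivered message keeps its
headers.  It is a preprocessing stand-in, not a change to the lexer; the name
strip_headers and the drop list are assumptions, and it expects LF line
endings.

```python
import re

# Hypothetical drop list: any "X-..." header plus "Date".
DROP = re.compile(r"^(x-[^:\s]*|date):", re.IGNORECASE)

def strip_headers(raw):
    """Return the message text without X-* and Date header lines.

    Folded continuation lines (starting with space or tab) are dropped
    together with the header they belong to.  The body is left untouched.
    """
    head, sep, body = raw.partition("\n\n")
    kept = []
    dropping = False
    for line in head.split("\n"):
        if line[:1] in (" ", "\t"):
            # Continuation line: follows the fate of its header.
            if not dropping:
                kept.append(line)
            continue
        dropping = bool(DROP.match(line))
        if not dropping:
            kept.append(line)
    return "\n".join(kept) + sep + body
```

Comparing the score of the original message with the score of the stripped
copy would show how much those tokens contribute.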

Regards,

David


