What is spam? (was: [bogofilter] ESF and redundancy)
relson at osagesoftware.com
Tue May 11 12:47:34 EDT 2004
On Tue, 11 May 2004 10:23:08 -0400
Tom Anderson wrote:
> As you can see, some tokens such as "document" and "attached" are
> hammy; however, I doubt I've ever received a ham that said "Your
> document is attached." And yet some variation of this (e.g. "Your file
> is attached") is seen in these virus spams all the time. With a
> Markovian filter, the 3-4 token phrase would be exponentially more
> relevant than the individual tokens.
> Also of note, even though I've stripped out the non-standard headers
> with spamitarium, it's still largely the "administrative" tokens which
> make this email seem hammy. Dates are especially frustrating... I
> wish bogofilter would ignore them. I would strip them with
> spamitarium if they weren't a required part of the spec and used
> extensively by email clients for sorting and such. Removing
> "X-Priority", "X-MSMail-Priority", "ESMTP", etc., has helped a bit.
> Adding "helo-oac-design.com" and "as6478" helped a lot. Without
> spamitarium, this email was scored at 0.067239. Nonetheless, even at
> 0.468896, it still gets classified as "unsure". I need something more
> to overcome the hamminess of the "mime:" tokens. Perhaps simply
> registering this exhaustively until all of those tokens become neutral
> is the answer. However, the Markovian method is also tempting.
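To make the phrase idea above concrete, here is a rough Python sketch of
turning a word sequence into phrase tokens of up to four words. It is only
an illustration: the function name phrase_tokens and the max_len parameter
are made up for this example, it is not how bogofilter tokenizes, and it
leaves out the per-phrase weighting a real Markovian filter would apply.

```python
def phrase_tokens(words, max_len=4):
    """Return every run of 1..max_len consecutive words as one token.

    Hypothetical helper for illustration; no weighting is applied.
    """
    tokens = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break  # the run would extend past the end of the message
            tokens.append(" ".join(words[start:start + length]))
    return tokens


words = "your document is attached".split()
tokens = phrase_tokens(words)
```

With tokens like these, "your document is attached" gets its own spam
count, separate from the individually hammy "document" and "attached".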
You've got the source for the lexer in file lexer_v3.l. You can easily
modify it to discard "X-Whatever:" and dates. It might make an
interesting experiment to see if that helps you.
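If you want to try the idea before touching the flex source, something
like the Python sketch below approximates it by filtering a copy of the
message ahead of the tokenizer. It is not code from lexer_v3.l. The name
strip_headers and the pattern are assumptions made for this example, and
it only drops "Date:" and "X-...:" header lines (with their folded
continuations), not date strings inside other headers. The stored message
itself is left alone, so mail clients can still sort on the Date header.

```python
import re

# Assumed rule for this sketch: discard "Date:" and any "X-...:" header.
DROP = re.compile(r"^(date|x-[^:\s]*):", re.IGNORECASE)


def strip_headers(raw):
    """Return a copy of raw with Date and X-* headers removed.

    Only the header block (everything before the first blank line)
    is filtered; the body is passed through unchanged.
    """
    head, sep, body = raw.partition("\n\n")
    kept = []
    dropping = False
    for line in head.split("\n"):
        if line[:1] in (" ", "\t"):
            # Folded continuation line: shares the fate of its header.
            if not dropping:
                kept.append(line)
            continue
        dropping = bool(DROP.match(line))
        if not dropping:
            kept.append(line)
    return "\n".join(kept) + sep + body
```

You could then score the filtered copy and the original side by side to
see how much the discarded headers were contributing.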