Idea for improving the learning stage

David Relson relson at osagesoftware.com
Sat Sep 8 14:06:19 CEST 2007


On Sat, 08 Sep 2007 12:58:48 +0200
Matthias Andree wrote:

> Andrew <aremo at ngi.it> writes:
> 
> > I've been thinking about separate databases, but I've come to the 
> > conclusion that we wouldn't really need them: words that looked
> > "spammy" in a subject would still look spammy in the body, and
> > vice-versa. So, in my opinion, only one database would still be the
> > way to go.
> 
> No, body and header are orthogonal, since bogofilter tags header
> tokens before registering them, and registering partial messages
> would only skew the individual token probabilities by
> skewing .MSG_COUNT. Try bogolexer or bogoutil dumps...
> 
> Suppose we're registering a header: we'll bump .MSG_COUNT but not
> register any body tokens, so the significance of all body tokens
> will slowly decrease... so I wonder if we need .BODY_COUNT
> and .HEADER_COUNT or something like that to replace .MSG_COUNT. That
> being an incompatible change, it cannot become part of bogofilter
> 1.0.X.
> 
> I understand what you're aiming at, and I'm not saying it's not useful
> -- it's just that there's more to the solution than just registering
> partial messages.
> 
> -- 
> Matthias Andree
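
To put rough numbers on the .MSG_COUNT concern, here is a small sketch.
It uses a simplified frequency-ratio estimate for a single token, which
I'm assuming is close enough for illustration; the real calculation
also applies smoothing that is not modelled here.  The function name
and the counts are made up for the example.

```python
def p_spam(spam_hits, spam_msgs, ham_hits, ham_msgs):
    """Simplified per-token spam estimate (assumption: no smoothing)."""
    bad = spam_hits / spam_msgs    # fraction of spam messages with the token
    good = ham_hits / ham_msgs     # fraction of ham messages with the token
    return bad / (bad + good)

# A body token seen in 50 of 100 spam and 5 of 100 ham messages.
before = p_spam(50, 100, 5, 100)

# Register 100 more spam *headers only*: the spam message count doubles,
# but the body token's hit count does not move.
after = p_spam(50, 200, 5, 100)

print(f"before: {before:.4f}  after: {after:.4f}")
```

The token's estimate drifts toward 0.5 even though nothing new was
learned about it, which is the skew described above.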

With a bit of scripting one _could_ separate header and body (think
"formail") and run bogofilter twice, once against a header database and
once against a body database.  Unfortunately that would give two scores
which might conflict (header is spammish and body is hammish, etc.), and
what to do with that isn't obvious.  Additionally, as an earlier message
mentioned, there are still the problems of multiple MIME parts, forwarded
messages, etc.
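
A minimal sketch of that kind of script, in Python rather than formail.
The function names and the two wordlist directories are hypothetical.
The bogofilter options used (-d for the wordlist directory, -T for
terse "X 0.123456" output) should be checked against the man page for
your version.  Sending the body with a leading blank line, so that it
isn't parsed as header text, is an assumption about the lexer and not
something I have tested.

```python
import subprocess

def split_header_body(raw):
    """Split a raw message at the first blank line (LF line endings only)."""
    head, _, body = raw.partition("\n\n")
    return head + "\n", body

def bogo_score(text, wordlist_dir):
    """Run bogofilter on text and return the spamicity it reports."""
    # No check=True: bogofilter uses its exit code to report the class.
    result = subprocess.run(
        ["bogofilter", "-d", wordlist_dir, "-T"],
        input=text, capture_output=True, text=True,
    )
    return float(result.stdout.split()[1])

def score_both(raw, header_dir, body_dir):
    """Return (header_score, body_score) from two separate databases."""
    head, body = split_header_body(raw)
    return bogo_score(head, header_dir), bogo_score("\n" + body, body_dir)
```

This only covers the split and the two runs; it does nothing about
MIME parts or forwarded messages.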

An idea from several years ago is to generate separate scores for the
header and each MIME part.  One could then select the highest score, or
the lowest score, or the score furthest from 0.5, or ... whatever ... and
call that the message's score.
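
The selection rules just listed could be sketched as follows.  The
function and strategy names are made up for illustration, and the
scores are assumed to be spamicities between 0 and 1.

```python
def combine(scores, strategy="extreme"):
    """Pick one score to stand for the whole message."""
    if not scores:
        raise ValueError("need at least one score")
    if strategy == "max":
        return max(scores)
    if strategy == "min":
        return min(scores)
    if strategy == "extreme":
        # The score furthest from 0.5; on a tie, the first one wins.
        return max(scores, key=lambda s: abs(s - 0.5))
    raise ValueError("unknown strategy: " + strategy)

part_scores = [0.05, 0.9, 0.55]   # header, first part, second part
print(combine(part_scores, "max"), combine(part_scores, "extreme"))
```

With these made-up scores "max" calls the message spam while "extreme"
calls it ham, so the choice of rule matters.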

Such a strategy _might_ help.  However, as is often pointed out, bogofilter
is already identifying a huge percentage of the spam.  FWIW, this month
(September) my mail server's bogofilter has classified 6800 spam as
spam, classified 1 spam as unsure, and classified 1 (autogenerated) ham
message as spam.  Me?  I'm totally satisfied with these percentages and
don't see a need to invest significant time trying to improve them.

Regards,

David



More information about the bogofilter-dev mailing list