Idea for improving the learning stage

Sat Sep 8 12:58:48 CEST 2007

Andrew <aremo at ngi.it> writes:

> I've been thinking about separate databases, but I've come to the 
> conclusion that we wouldn't really need them: words that looked "spammy" 
> in a subject would still look spammy in the body, and vice-versa. So, in 
> my opinion, only one database would still be the way to go.

No, body and header are orthogonal: bogofilter tags header tokens
before registering them, so a word in the subject and the same word in
the body end up as different tokens in the database. Registering
partial messages would only skew the individual token probabilities by
skewing .MSG_COUNT. Try bogolexer or bogoutil dumps to see this.
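
To illustrate the tagging point, here is a minimal Python sketch, not
bogofilter's actual lexer. The function name tag_header_tokens and the
tag table are assumptions made up for this example; check bogolexer
output for the tags bogofilter really emits.

```python
def tag_header_tokens(field, words):
    """Prefix each word with a tag derived from the header field name."""
    # Hypothetical tag table for illustration; real tags may differ.
    tags = {"Subject": "subj:", "From": "from:", "To": "to:"}
    prefix = tags.get(field, "head:")
    return [prefix + w.lower() for w in words]


# The same words, once in the Subject: line and once in the body.
subject_tokens = tag_header_tokens("Subject", ["Cheap", "pills"])
body_tokens = [w.lower() for w in ["cheap", "pills", "today"]]

# Tagged header tokens never collide with untagged body tokens,
# so the two sets share no database keys.
shared = set(subject_tokens) & set(body_tokens)
```

Because the intersection is empty, counts learned for a word in the
subject say nothing about the same word in the body.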

Suppose we're registering only a header: we'll bump .MSG_COUNT but not
register any body tokens, so the significance of all body tokens will
slowly decrease. So I wonder if we need .BODY_COUNT and .HEADER_COUNT
or something like that to replace .MSG_COUNT. That would be an
incompatible change, so it cannot become part of bogofilter 1.0.X.
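
The skew can be shown with a small Python sketch. It uses a simplified
relative-frequency estimate, not bogofilter's exact computation; the
function spamicity and all the numbers are assumptions chosen for
illustration, and the "split" case only mimics what a hypothetical
.BODY_COUNT might do.

```python
def spamicity(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Simplified per-token spam estimate from relative frequencies."""
    spam_freq = spam_hits / spam_msgs
    ham_freq = ham_hits / ham_msgs
    return spam_freq / (spam_freq + ham_freq)


# A body token seen in 40 of 100 spam and 5 of 100 ham messages.
before = spamicity(40, 5, 100, 100)        # about 0.89

# Register 100 spam headers only: the shared message count doubles,
# but the body token's hit count cannot grow.
after_shared = spamicity(40, 5, 200, 100)  # about 0.80

# With a hypothetical separate body count, header-only registrations
# leave the body token's normalization untouched.
after_split = spamicity(40, 5, 100, 100)
```

With a single shared count, the body token's estimate drifts downward
although no new evidence about it was seen; with separate counts it
stays where it was.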

I understand what you're aiming at, and I'm not saying it's not useful
-- but there's more to the solution than just registering partial
messages.

-- 
Matthias Andree