md5 sums [was: TODO for 1.0]

Greg Louis glouis at dynamicro.on.ca
Mon Jan 13 23:01:32 CET 2003


On 20030113 (Mon) at 1547:20 -0500, David Relson wrote:
> At 03:14 PM 1/13/03, Chris Wilkes wrote:
> 
> >Does re-adding a message over and over help out with training?  I
> >suppose it does as those couple of keywords get weighed more heavily.
> >
> I don't honestly know how much repeated training helps as I haven't done 
> it.

It's a tradeoff.  Registering a message twice or more is a quick and
dirty way to get its words into bogofilter's training db.  Dirty,
because it makes us violate more badly a fundamental assumption of the
Bayesian statistics we're doing: namely, that the probability estimates
we calculate for individual tokens are independent of one another.
This is also the main reason for using a MAX_REPEAT of 1 instead of
Graham's 4 when registering messages.
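
To make the MAX_REPEAT point concrete, here is a minimal Python sketch.
It is not bogofilter's code: the helper count_tokens, the whitespace
tokenizing and the sample message are all made up for illustration.
The only idea taken from the above is that a token's contribution from
one message is capped at max_repeat.

```python
from collections import Counter

def count_tokens(message, max_repeat=1):
    """Count tokens in one message, capping each token at max_repeat.

    Hypothetical helper for illustration; splitting on whitespace is
    far cruder than a real spam filter's lexer.
    """
    raw = Counter(message.lower().split())
    return {token: min(n, max_repeat) for token, n in raw.items()}

# Made-up message in which one token repeats five times.
msg = "free free free free free offer"

capped_1 = count_tokens(msg, max_repeat=1)   # cap of 1
capped_4 = count_tokens(msg, max_repeat=4)   # cap of 4, as in Graham's scheme

print(capped_1)
print(capped_4)
```

With a cap of 4 this one message adds four counts for "free"; with a
cap of 1 it adds one, so a single message cannot stack up several
correlated counts for the same token.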

I've done repeated registration for short messages I know are bad spam,
to boost quickly the counts of their characteristic tokens in
spamlist.db, but it's not really a Good Thing to do on a regular basis.
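
A rough sketch of what repeated registration does to one token's
estimate, in Python.  This is a simplified relative-frequency
calculation with invented counts, not the formula bogofilter actually
uses; it assumes each registration pass adds 1 to the token's spam
count and 1 to the number of spam messages.

```python
def spam_estimate(spam_count, n_spam, ham_count, n_ham):
    """Relative-frequency estimate that a token indicates spam.

    Simplified illustration only: no smoothing, clamping or priors.
    """
    bad = spam_count / n_spam
    good = ham_count / n_ham
    return bad / (bad + good)

# Invented starting point: the token is in 2 of 10 spams and 2 of 10 hams.
spam_count, n_spam, ham_count, n_ham = 2, 10, 2, 10
before = spam_estimate(spam_count, n_spam, ham_count, n_ham)

# Register the same spam message 5 times (assumption: each pass adds 1 to
# the token's spam count and 1 to the spam message total).
repeats = 5
after = spam_estimate(spam_count + repeats, n_spam + repeats,
                      ham_count, n_ham)

print(f"before: {before:.2f}  after {repeats} registrations: {after:.2f}")
```

With these numbers the estimate moves from 0.50 to 0.70 on the strength
of a single message seen five times, which is the quick boost and also
the reason not to make a habit of it.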

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |



