md5 sums [was: TODO for 1.0]
Greg Louis
glouis at dynamicro.on.ca
Mon Jan 13 23:01:32 CET 2003
On 20030113 (Mon) at 1547:20 -0500, David Relson wrote:
> At 03:14 PM 1/13/03, Chris Wilkes wrote:
>
> >Does re-adding a message over and over help out with training? I
> >suppose it does as those couple of keywords get weighed more heavily.
> >
> I don't honestly know how much repeated training helps as I haven't done
> it.
It's a tradeoff. Running a message twice or more is a quick and dirty
way to get a collection of words into bogofilter's training db. Dirty,
because doing that worsens the degree to which we violate a fundamental
assumption of the Bayesian statistics we're doing: namely, that the
probability estimates we calculate for individual tokens are
independent of one another. This is also the main reason for using a
MAX_REPEAT of 1 instead of Graham's 4 when registering messages.
I've done repeated registration for short messages I know are bad spam,
in order to boost the counts of the characterizing tokens in spamlist.db
quickly, but it's not really a Good Thing to do on a regular basis.
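
As a rough illustration of the tradeoff above, here is a minimal Python
sketch. It is not bogofilter's code: the whitespace tokenizer, the
in-memory Counter standing in for spamlist.db, and the helper names
tokens_to_register and register are all assumptions made for the example.
Only the name MAX_REPEAT and the values 1 and 4 come from the discussion.

```python
from collections import Counter

# Cap on how often one token may count within a single message.
# The value 1 is the setting discussed above; Graham's value is 4.
MAX_REPEAT = 1


def tokens_to_register(message, max_repeat=MAX_REPEAT):
    """Count tokens in one message, capping each token at max_repeat.

    Hypothetical helper: splits on whitespace only, which is far
    simpler than a real mail tokenizer.
    """
    counts = Counter(message.lower().split())
    return Counter({tok: min(n, max_repeat) for tok, n in counts.items()})


def register(db, message, max_repeat=MAX_REPEAT):
    """Add one message's capped token counts to the word list db."""
    db.update(tokens_to_register(message, max_repeat))
    return db


# Stand-in for spamlist.db (an assumption; the real db is on disk).
spam_db = Counter()
msg = "viagra cheap viagra viagra now"

register(spam_db, msg)
once = dict(spam_db)    # counts after registering the message once

register(spam_db, msg)
twice = dict(spam_db)   # counts after registering the same message again

print("once: ", once)
print("twice:", twice)
print("one pass with a cap of 4:", dict(tokens_to_register(msg, 4)))
```

Registering the same message a second time doubles every one of its
token counts together, so those counts no longer look like independent
observations. A higher cap has a similar effect inside a single message,
which is why the lower cap is the more conservative choice.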
--
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg at bgl.nu |
| Help free our mailboxes. Include |
| http://wecanstopspam.org in your signature. |