md5 sums [was: TODO for 1.0]
David Relson
relson at osagesoftware.com
Mon Jan 13 21:47:20 CET 2003
At 03:14 PM 1/13/03, Chris Wilkes wrote:
>I originally was doing your suggestion of running it through a filter
>first to then decide what to do with it in BF, but thought, since we're
>processing the entire email anyway in BF, why not do it in there? Send
>BF a "-m" switch (for "md5") and you'll get the MD5 hash of the body.
>
>It doesn't add very easily into BF's current design as you can't plug in
>a simple checker routine to see what to do with this MD5 hash. You'll
>have to hard code it into main.c or something like that.
>
>Does re-adding a message over and over help out with training? I
>suppose it does, as those couple of keywords get weighted more heavily.
>
>Chris
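For anyone following along, here is a rough sketch of what "the MD5 hash
of the body" could look like outside of bogofilter. It is not Chris's
patch and not bogofilter code; the function name body_md5 is made up, and
treating "the body" as the decoded leaf parts joined together is an
assumption on my part.

```python
import hashlib
from email import message_from_string


def body_md5(raw_message: str) -> str:
    """Return the hex MD5 digest of a message body, ignoring headers.

    Assumption: for multipart mail, "the body" means the decoded
    payloads of all leaf parts, concatenated in order.
    """
    msg = message_from_string(raw_message)
    if msg.is_multipart():
        chunks = [
            part.get_payload(decode=True) or b""
            for part in msg.walk()
            if not part.is_multipart()
        ]
        body = b"".join(chunks)
    else:
        body = msg.get_payload(decode=True) or b""
    return hashlib.md5(body).hexdigest()
```

Two copies of the same message that differ only in their headers (say,
Received: lines) hash to the same value, which is what makes the digest
usable for spotting a message that has already been seen.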
Chris,
I honestly don't know how much repeated training helps, as I haven't done
it. A few weeks ago I decided to rebuild my wordlists and used the script
contrib/randomtrain to do it. The script takes spam and ham mailboxes as
input and feeds their messages, in random order, to bogofilter. Bogofilter
correctly classifies most messages after the first few. Any message that
bogofilter gets wrong is used to train it. The end result is wordlists
based on a fraction of the total (I think it was about 1600 of 20,000
messages for my test). I've found it useful to run randomtrain a second
time, as an additional (smaller) group of messages will be added to the
wordlists.
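
The train-on-error idea described above can be sketched roughly like
this. It is not the randomtrain script itself; the function name and the
classify/train callbacks are hypothetical stand-ins for whatever actually
calls bogofilter.

```python
import random
from typing import Callable, Iterable, List, Optional, Tuple


def train_on_error(
    spam: Iterable[str],
    ham: Iterable[str],
    classify: Callable[[str], bool],     # hypothetical: True means "spam"
    train: Callable[[str, bool], None],  # hypothetical: register a message
    seed: Optional[int] = None,
) -> List[Tuple[str, bool]]:
    """Feed messages in random order; train only on misclassified ones.

    Returns the (message, is_spam) pairs that were used for training.
    """
    labelled = [(m, True) for m in spam] + [(m, False) for m in ham]
    random.Random(seed).shuffle(labelled)

    used = []
    for message, is_spam in labelled:
        if classify(message) != is_spam:
            # The classifier got this one wrong, so teach it.
            train(message, is_spam)
            used.append((message, is_spam))
    return used
```

Calling the function a second time over the same mailboxes corresponds to
the second run: only messages that are still misclassified get added.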
David