md5 sums [was: TODO for 1.0]

David Relson relson at osagesoftware.com
Mon Jan 13 21:47:20 CET 2003


At 03:14 PM 1/13/03, Chris Wilkes wrote:

>I originally was doing your suggestion of running it through a filter
>first to then decide what to do with it in BF, but thought since we're
>processing the entire email anyway in BF why not do it in there.  Send
>BF a "-m" switch (for "md5") and you'll get the MD5 hash of the body.
>
>It doesn't add very easily into BF's current design as you can't plug in
>a simple checker routine to see what to do with this MD5 hash.  You'll
>have to hard code it into main.c or something like that.
>
>Does re-adding a message over and over help out with training?  I
>suppose it does as those couple of keywords get weighted more heavily.
>
>Chris
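
For reference, the body hash Chris describes could be computed outside
bogofilter along these lines.  This is only a rough sketch in Python, not
bogofilter code: the function name body_md5 is made up for illustration,
and hashing each decoded leaf part in order is my assumption about what
"the MD5 hash of the body" would mean for a multipart message.

```python
import hashlib
from email import message_from_string


def body_md5(raw_message: str) -> str:
    """Return the hex MD5 digest of a message body, ignoring its headers.

    Hypothetical helper: for multipart mail, every leaf part's decoded
    payload is fed to the digest in order (an assumption, see above).
    """
    msg = message_from_string(raw_message)
    digest = hashlib.md5()
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        digest.update(payload)
    return digest.hexdigest()


# Two copies of the same body with different headers hash identically,
# so a set of digests is enough to spot a message that was already seen.
first = "Subject: hello\n\nBuy now\n"
second = "Subject: hello again\nX-Loop: 1\n\nBuy now\n"
seen = {body_md5(first)}
print(body_md5(second) in seen)
```

A caller could keep such a set of digests and skip any message whose body
hash is already present, which is the filter-first approach done outside BF.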

Chris,

I honestly don't know how much repeated training helps, as I haven't tried
it.  A few weeks ago I decided to rebuild my wordlists and used the
contrib/randomtrain script to do it.  The script takes spam and ham
mailboxes as inputs and feeds their messages, in random order, to
bogofilter.  Bogofilter correctly classifies most messages after the first
few.  Any messages that bogofilter gets wrong are used to train it.  The
end result is wordlists based on a fraction of the total (I think it was
about 1600 of 20,000 messages for my test).  I've found it useful to run
randomtrain a second time, as an additional (smaller) group of messages
will be added to the wordlists.
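
The train-on-error loop described above can be sketched roughly as follows.
This is an illustration in Python, not the randomtrain script itself: the
names train_on_error and ToyFilter are made up, and the toy word counter
only stands in for bogofilter so the example is self-contained.

```python
import random
from collections import Counter


def train_on_error(spam, ham, classify, train, seed=None):
    """Feed messages in random order; train only on misclassified ones.

    classify(message) -> bool (True means spam) and
    train(message, is_spam) are hypothetical stand-ins for bogofilter.
    Returns the number of messages that were used for training.
    """
    labelled = [(m, True) for m in spam] + [(m, False) for m in ham]
    random.Random(seed).shuffle(labelled)
    trained = 0
    for message, is_spam in labelled:
        if classify(message) != is_spam:
            train(message, is_spam)
            trained += 1
    return trained


class ToyFilter:
    """Crude word counter, NOT bogofilter's algorithm; demo only."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def classify(self, message):
        words = message.split()
        spam_score = sum(self.spam_counts[w] for w in words)
        ham_score = sum(self.ham_counts[w] for w in words)
        return spam_score > ham_score

    def train(self, message, is_spam):
        target = self.spam_counts if is_spam else self.ham_counts
        target.update(message.split())


toy = ToyFilter()
spam_box = ["buy pills"] * 5
ham_box = ["lunch meeting"] * 5
first_pass = train_on_error(spam_box, ham_box, toy.classify, toy.train, seed=1)
second_pass = train_on_error(spam_box, ham_box, toy.classify, toy.train, seed=2)
print(first_pass, second_pass)
```

With real mail a second pass would normally pick up a few more messages, as
noted above; the toy data here is too uniform to show that.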

David




