Fisher, exim, word pairs and thanks

Greg Louis glouis at dynamicro.on.ca
Fri Jan 3 20:02:20 CET 2003


On 20030103 (Fri) at 1308:45 -0500, David Relson wrote:
> At 12:57 PM 1/3/03, Karl Schmidt wrote:

> >I have a hunch there are lots of folks just running on a personal
> >machine. Anyway, as an optional setting down the road I bet it will work
> >very well.  My thanks again for sharing your hard work.
> >
> >On another tangent - once in operation, is it better to add all emails to 
> >the db - even the ones correctly sorted or just the ones missed?
> 
> Karl,
> 
> Some say yes, some say no.  I use '-u' for my small (5 user) domain, so all 
> emails go into the wordlists.  Greg is involved with a larger domain and 
> only feeds in errors.

An attempt to address this question may be found at
http://www.bgl.nu/bogofilter/training2.html
  
The bottom line seems to be that once one has built up a training db of
adequate size (ca. 10,000 each of spam and nonspam), training on error
suffices.  Training on error from the very beginning gets to the same
discrimination capability in the end, but takes longer.

To be precise, I train my bogofilters (both home, 2-user, and work,    
80-odd-user) on errors and uncertains, not just errors.
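
For anyone who wants the idea spelled out, here is a rough sketch of
training on errors and uncertains.  It is only an illustration: ToyFilter,
its crude scoring, and the two cutoff values are all made up for the
example and have nothing to do with bogofilter's actual code or options.

```python
from collections import Counter

# Assumed cutoffs for this sketch only; not taken from bogofilter.
SPAM_CUTOFF = 0.9
HAM_CUTOFF = 0.1


class ToyFilter:
    """Hypothetical stand-in for a real filter: per-word spam/ham counts."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def score(self, text):
        # Average the per-word spam ratios; unseen words count as 0.5.
        words = text.lower().split()
        if not words:
            return 0.5
        total = 0.0
        for w in words:
            s, h = self.spam[w], self.ham[w]
            total += 0.5 if s + h == 0 else s / (s + h)
        return total / len(words)

    def train(self, text, is_spam):
        # Register the message's words in the matching wordlist.
        target = self.spam if is_spam else self.ham
        target.update(text.lower().split())


def classify(score):
    # Three-way verdict: spam, ham, or unsure in between the cutoffs.
    if score >= SPAM_CUTOFF:
        return "spam"
    if score <= HAM_CUTOFF:
        return "ham"
    return "unsure"


def train_on_error(filt, messages):
    """Train only on messages the filter got wrong or was unsure about.

    messages is an iterable of (text, is_spam) pairs with known labels.
    Returns how many messages were actually fed into the wordlists.
    """
    trained = 0
    for text, is_spam in messages:
        verdict = classify(filt.score(text))
        correct = "spam" if is_spam else "ham"
        if verdict != correct:  # covers both errors and uncertains
            filt.train(text, is_spam)
            trained += 1
    return trained
```

With an empty filter every message starts out "unsure", so the early
messages all get trained on; as the wordlists grow, fewer and fewer
messages need to be registered.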
   
Back on performance for a sec: an implementation fast enough for the
big installations is not going to be "too fast" for the smaller ones.
If we don't pay some attention to scalability all along, however, an
implementation too slow for the big installations may well result.
I don't think that it's either necessary or desirable to let ourselves
be satisfied with something that won't scale.  This does _not_ mean I
think big installations will have to do without phrases for the sake of
speed; only that we need to tune for speed as we add complexity to the
discrimination algorithms.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |



