religion

Greg Louis glouis at dynamicro.on.ca
Tue Jan 21 23:16:02 CET 2003


People, bogofilter is only a few months old and already the emotional
climate gets hot and heavy if someone hints that a favourite option or
training method is suboptimal.  We still have a lot to learn about how
to use Bayesian filtering effectively, and those who advocate The One
True Path -- including me -- are probably full of it, ok?

Nevertheless, I think there are sound arguments against what -u does,
unless (1) you use it in binary mode, as Boris wants, and (2) you go in
and correct its mistakes very frequently with -S and -N as appropriate.

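In binary mode, you enter every message into the training database.
That's ok.  But you must correct the errors.  If you don't, each
misclassified message stays in the database under the wrong label and
makes further mistakes more likely, so your discrimination will
deteriorate at an exponentially increasing rate.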

In ternary mode, -u discards the messages that are most valuable for
training, and you train only on those messages that bogofilter already
gets right (or gets drastically wrong, but that's _really_ rare once
your training database gets up to ten thousand messages or so).  Why are
the unsures most valuable for training?  Because they contain all, or
almost all, of the mistakes.  You and I learn much faster by getting
something wrong and being told so than we do by getting something
right and being told so.  The Bayesian algorithm works like that too.
I haven't done actual testing, but the Spambayes folks have; one of
these days I'll look up those threads in their archives and post a
summary.  Bottom line was that training on unsures and errors is as
effective as (and less laborious than) training by manually classifying
every message.  If you find otherwise, by all means continue to do
otherwise -- I'm presenting suggestions, not issuing rules.
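
To make the difference concrete, here is a rough sketch in Python of the
two selection rules described above.  It is an illustration only, not
bogofilter code: the cutoffs (0.9 and 0.1), the function names and the
sample scores are all made up for the example.

```python
# Hypothetical cutoffs for illustration; not bogofilter's defaults.
SPAM_CUTOFF = 0.9
HAM_CUTOFF = 0.1


def classify(score):
    """Ternary classification of a spamicity score in [0, 1]."""
    if score >= SPAM_CUTOFF:
        return "spam"
    if score <= HAM_CUTOFF:
        return "ham"
    return "unsure"


def auto_update_trains(score):
    """Ternary auto-update as described above: train on whatever the
    filter classified with confidence, and skip the unsures."""
    return classify(score) != "unsure"


def unsure_or_error_trains(score, true_label):
    """Train only when the filter was unsure or wrong."""
    return classify(score) != true_label


# (score, true label) pairs, invented for the example.
messages = [
    (0.99, "spam"),  # confidently right
    (0.02, "ham"),   # confidently right
    (0.55, "spam"),  # unsure
    (0.40, "ham"),   # unsure
    (0.95, "ham"),   # confidently wrong
]

auto = [m for m in messages if auto_update_trains(m[0])]
selective = [m for m in messages if unsure_or_error_trains(*m)]
```

With these sample messages, the auto-update rule picks the three
confidently classified ones (including the one it got wrong, which it
would register under the wrong label), while the second rule picks the
two unsures and the one error.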

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |



