bogofilter's default algorithm

Tue Jan 21 07:16:24 CET 2003

At 04:21 PM 2003-01-20 -0500, David Relson wrote:

>Greetings,
>
>One of the big questions amongst the bogofilter developers is:
>
>         What algorithms are people using with bogofilter?

I have no use for a "border case".  I have two buckets: probably spam, or
probably not.  If mail were classified into a third bucket, I would look at
what went there and decide which was more likely based on what I saw.  I
would rather have two buckets and a line I could adjust if there were too
much misclassification, which so far there is not.  I am still getting a
significant number of false negatives (1-2 per day), but they are not a
point or two below the line; they are 10 points below it.  A small move of
the cutoff would not help; all that will help is retraining.  I fix all
errors, and the same mistake is generally not made twice.

I see no problems with having multiple states, as many as are desired, as 
an optional case.  The default case should be the simplest for an end user 
to handle, and that is "yes" or "no", as a best guess.

>The initial implementation of bogofilter used the Graham algorithm and 
>that remained the default for months.
>
>The Robinson algorithm became available with version 0.7.6 in October and 
>became the default algorithm as of version 0.9.1 at the end of 
>November.  It's the default algorithm in the current stable version of 
>bogofilter, i.e. 0.9.1.2.

That is what I am using.

>The Robinson-Fisher algorithm was implemented during November and was 
>released as part of 0.9.1.  It's an improvement over Robinson with its 
>ternary result, i.e.
>
>         it's spam - and I (bogofilter) am sure of it
>         it's ham  - and I'm sure of that
>         it's not clear and I can't tell with any certainty.
>
>I think it's time to promote Robinson-Fisher to bogofilter's standard
>(default) algorithm.  Version 0.10.0 has been released, though it hasn't 
>achieved "stable" status as yet.  I'm planning on the algorithm change 
>once 0.10.0 is stable.  Note that bogofilter will continue to support the 
>older algorithms.  They will still be selectable by command line switch or 
>config file option.

I think (which means that I am about to speak authoritatively about what I
believe; if I am wrong, I should be corrected) that adding a third state to
the output by default makes the output much harder for the simple user
(like me) to use.  As I understand it, Robinson-Fisher basically takes the
number output by Robinson and puts it into three piles: a high number, a
low number, and a middle range.  That is, instead of

S > cutoff >= H

The same algorithm is run, and the same number is computed, but the result is:

S > cutoff1 >= Unsure > cutoff2 >= H

If cutoff1 and cutoff2 are the same, you have effectively turned
Robinson-Fisher into Robinson.
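
To make my understanding concrete, here is a minimal Python sketch of that
thresholding.  It is not bogofilter code: the function classify, its
parameter names, and the scores used are all made up for illustration, and
it says nothing about how the score itself is computed.

```python
def classify(score, spam_cutoff, ham_cutoff):
    """Map a score to a label using two cutoffs.

    Hypothetical helper, not taken from bogofilter. It only mirrors
    the inequality  S > cutoff1 >= Unsure > cutoff2 >= H  from the text,
    with spam_cutoff playing cutoff1 and ham_cutoff playing cutoff2.
    """
    if ham_cutoff > spam_cutoff:
        raise ValueError("ham_cutoff must not exceed spam_cutoff")
    if score > spam_cutoff:
        return "spam"
    if score > ham_cutoff:
        return "unsure"
    return "ham"


# Three piles: a score between the cutoffs lands in the middle range.
print(classify(0.5, spam_cutoff=0.9, ham_cutoff=0.1))   # unsure

# Equal cutoffs: the middle range is empty, so only two answers remain.
print(classify(0.5, spam_cutoff=0.5, ham_cutoff=0.5))   # ham
print(classify(0.6, spam_cutoff=0.5, ham_cutoff=0.5))   # spam
```

With the two cutoffs set equal, the "unsure" branch can never be taken,
which is the collapse to a plain yes/no answer described above.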

It is purely a matter of complexity in dealing with the output.  With two
states, I have four cases to handle:

"classified as ham, is ham", "classified as ham, is spam",
"classified as spam, is spam", "classified as spam, is ham".

With three states, even if I put them into two buckets, I have eight.  In
the ham bucket:

"classified as ham, is ham", "classified as ham, is spam", "classified as
??, is ham", "classified as ??, is spam".

And in the spam bucket:

"classified as spam, is spam", "classified as spam, is ham", "classified as
??, is ham", "classified as ??, is spam".

The retraining and sorting program has to become twice as complex.  Believe 
me, it was hard enough to deal with four states.
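
The counting can be checked with a short Python sketch.  The bucket and
verdict names below are my own labels, and I am assuming that "unsure" mail
can end up in either bucket; this only enumerates the combinations, it is
not a sorting program.

```python
from itertools import product

# What a message can really be, regardless of what the filter said.
TRUTHS = ("ham", "spam")


def cases(verdicts_by_bucket):
    """List every (bucket, verdict, truth) combination a sorter must handle."""
    return [
        (bucket, verdict, truth)
        for bucket, verdicts in verdicts_by_bucket.items()
        for verdict, truth in product(verdicts, TRUTHS)
    ]


# Two states: each bucket only ever holds one kind of verdict.
two_state = {"ham": ("ham",), "spam": ("spam",)}

# Three states in two buckets: "unsure" mail may land in either bucket.
three_state = {"ham": ("ham", "unsure"), "spam": ("spam", "unsure")}

print(len(cases(two_state)))    # 4
print(len(cases(three_state)))  # 8
```

Four cases become eight, which is where "twice as complex" comes from.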

So, I really would rather not have an indeterminate state.  In fact, if I 
could not get a version that did not have an indeterminate state, I would 
not upgrade, eschewing other features, so as to be able to remain determinate.

Furthermore, I think that this is a case where simplicity should control
the default.  Adding a third state would make the output twice as hard for
me to deal with, so I presume it would be the same for most people.  I have
no problem with adding it to the program as an option.  Making it the
default seems wrongheaded.  Also, if someone wants to switch, they can just
add the option that selects that algorithm.  Their database is just fine
and all set for the transition, no?

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc.  But if it is not all three of Unsolicited,
Bulk, and E-mail, it simply is not spam. Misusing the term plays into the
hands of the spammers, since it causes confusion, and spammers thrive on
confusion.  If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!