Is bogofilter classifying algorithm "symmetrical"

Thu May 26 00:51:39 CEST 2005

On Wed, 25 May 2005 17:54:43 +0200
Marek Zachara wrote:

> I'll try to explain my question using an example:
> 
> let's assume I have two sets of data, i.e. HAMD and SPAMD, and two
> installations of bogofilter (testbeds):
> 
> testbed1:
> I train bogofilter, supplying the HAMD set of data as ham and SPAMD as spam
> (trivial case).
> 
> testbed2:
> this time the reverse: I train bogofilter, supplying HAMD as spam and
> SPAMD as ham.
> 
> now the question is: after training the two instances as above, if I supply
> any further message for classification to both testbeds, will the following
> always be true:
> spamicity(testbed1) = 1-spamicity(testbed2) ?
> 
> I'm asking because I know that scrapping a "ham" message is much
> worse than letting a spam through, so bogofilter may have some mechanism
> that, if in doubt, favours classification as "ham".
> 
> I will really be grateful for the answer.
> Marek

H'lo Marek,

The algorithm is _almost_ symmetrical.  There are some special scoring
factors that break the symmetry.  Initially the goal was to bias
messages towards ham scores (on the theory that false negatives are
preferable to false positives).  

A while back, we ran several hundred thousand messages through bogotune
to determine high-quality parameters for bogofilter to use.  Running
"bogofilter -Q" will display those parameters for you.

If you create 2 separate wordlist directories, train them oppositely,
then score some messages using the two wordlists, adding the two scores
for each message will give you 1 or very close to it.

As a test, I took the test mailboxes (files good.mbx and spam.mbx in
directory src/tests/inputs), divided them into separate messages and
ran script test.symmetry.sh (below):

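#!/bin/sh
# Train two wordlists with opposite labels, then score every message with both.
# Flags: -C skip config files, -v verbose, -d wordlist directory,
#        -n register as ham, -s register as spam, -B bulk mode (files as args).

rm -rf gs sg

# gs: good messages registered as ham, spam messages registered as spam.
# The glob msg.?.[1-9]?.txt matches the two-digit message numbers 10-99.
mkdir gs
bogofilter -C -v -d gs -n -B good.d/msg.?.[1-9]?.txt
bogofilter -C -v -d gs -s -B spam.d/msg.?.[1-9]?.txt
bogoutil -d gs/wordlist.db | tee wordlist.gs | wc -l   # dump and count entries

# sg: the reverse, good messages as spam and spam messages as ham.
mkdir sg
bogofilter -C -v -d sg -s -B good.d/msg.?.[1-9]?.txt
bogofilter -C -v -d sg -n -B spam.d/msg.?.[1-9]?.txt
bogoutil -d sg/wordlist.db | tee wordlist.sg | wc -l   # dump and count entries

# Score all messages against each wordlist.
bogofilter -C -v -d gs -B good.d/* spam.d/* > gs.out
bogofilter -C -v -d sg -B good.d/* spam.d/* > sg.out

# Show the first ten lines of each result file.
head gs.out sg.out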

### Here's the output (slightly edited for shorter lines) ####

# 3291 words, 39 messages
# 1436 words, 12 messages
4304
# 3291 words, 39 messages
# 1436 words, 12 messages
4304
==> gs.out <==
good.d/msg.n.01.txt X-Bogosity: Ham ... spamicity=0.000000 ...
good.d/msg.n.02.txt X-Bogosity: Ham ... spamicity=0.000000 ...
good.d/msg.n.03.txt X-Bogosity: Ham ... spamicity=0.000000 ...
good.d/msg.n.04.txt X-Bogosity: Ham ... spamicity=0.000000 ...
good.d/msg.n.05.txt X-Bogosity: Ham ... spamicity=0.000000 ...
good.d/msg.n.06.txt X-Bogosity: Ham ... spamicity=0.000052 ...
good.d/msg.n.07.txt X-Bogosity: Ham ... spamicity=0.000000 ...
good.d/msg.n.08.txt X-Bogosity: Ham ... spamicity=0.000000 ...
good.d/msg.n.09.txt X-Bogosity: Ham ... spamicity=0.000000 ...
good.d/msg.n.10.txt X-Bogosity: Ham ... spamicity=0.000000 ...

==> sg.out <==
good.d/msg.n.01.txt X-Bogosity: Spam ... spamicity=1.000000 ...
good.d/msg.n.02.txt X-Bogosity: Spam ... spamicity=1.000000 ...
good.d/msg.n.03.txt X-Bogosity: Spam ... spamicity=1.000000 ...
good.d/msg.n.04.txt X-Bogosity: Spam ... spamicity=1.000000 ...
good.d/msg.n.05.txt X-Bogosity: Spam ... spamicity=1.000000 ...
good.d/msg.n.06.txt X-Bogosity: Spam ... spamicity=0.999961 ...
good.d/msg.n.07.txt X-Bogosity: Spam ... spamicity=1.000000 ...
good.d/msg.n.08.txt X-Bogosity: Spam ... spamicity=1.000000 ...
good.d/msg.n.09.txt X-Bogosity: Spam ... spamicity=1.000000 ...
good.d/msg.n.10.txt X-Bogosity: Spam ... spamicity=1.000000 ...

As you can see, the 0.000000 messages rescored at 1.000000.  The one
exception scored at 0.000052 and 0.999961, which add up to 1.000013.
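
If you'd rather check the whole of gs.out and sg.out than eyeball the
first ten lines, something along these lines should do it.  This is
only a sketch: the function name symmetry_check is made up for
illustration, and it assumes every line starts with the file name and
contains a "spamicity=<number>" field, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of bogofilter): pair up the scores from two
# bulk-mode result files and print, per message, both scores and their sum.
# Assumes each line begins with the file name and contains "spamicity=<number>".
symmetry_check() {
    awk '
        # Return the number that follows "spamicity=" on the line, or -1.
        function score(line) {
            if (match(line, /spamicity=[0-9.]+/))
                return substr(line, RSTART + 10, RLENGTH - 10) + 0
            return -1
        }
        # First file: remember the score for each file name.
        NR == FNR { first[$1] = score($0); next }
        # Second file: print name, first score, second score, and the sum.
        ($1 in first) {
            s = score($0)
            printf "%s %.6f %.6f %.6f\n", $1, first[$1], s, first[$1] + s
        }
    ' "$1" "$2"
}
```

Call it as "symmetry_check gs.out sg.out".  For a perfectly symmetrical
pair the last column is 1.000000; for the msg.n.06 lines shown above it
comes out as 1.000013.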