Training oddity

Tue Aug 19 13:21:43 CEST 2003

Hi!

I have run into something strange. Rather than describe it at
length, here is the transcript:

$ bogofilter -o 0.701,0.201 -d .bogofilter -T <t1
U 0.255139
$ bogofilter -o 0.701,0.201 -d .bogofilter -T <t2
H 0.168237
$ bogofilter -o 0.701,0.201 -d .bogofilter -n <t1
$ bogofilter -o 0.701,0.201 -d .bogofilter -T <t2
U 0.204493

So at first, message t2 (t1 and t2 contain one message each)
is rated as ham. After just adding another message to the
database as ham, t2 suddenly looks more spammish. How can
this be?
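
The H/U flip itself is just the score crossing the lower of the
two -o values. A minimal Python sketch of that decision, assuming
the first -o value is the spam cutoff, the second the ham cutoff,
and that a score exactly on a cutoff counts as inside it (the
function name classify is made up for illustration; this is not
bogofilter's own code):

```python
# Sketch of a three-way cutoff decision (assumed semantics, see above).
def classify(score, spam_cutoff=0.701, ham_cutoff=0.201):
    """Return 'S', 'H' or 'U' for a score in [0, 1]."""
    if score >= spam_cutoff:
        return "S"  # at or above the spam cutoff
    if score <= ham_cutoff:
        return "H"  # at or below the ham cutoff
    return "U"      # in between: unsure


# Scores taken from the transcript above.
for label, score in [("t1", 0.255139),
                     ("t2 before", 0.168237),
                     ("t2 after", 0.204493)]:
    print(label, classify(score), score)
```

With these cutoffs, 0.168237 lies below 0.201 and 0.204493 lies
just above it, which matches the H and U in the transcript.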

Well, I used -vvv. The reason is that one more message in the
database changes the probability for each word. That should
be a small effect, I thought. But it turned out that one
word was now removed from the list of included words:
t2.after:  "groups"  22  0.049451  0.005384  0.100088 -
t2.before: "groups"  22  0.049587  0.005384  0.099846 +
So this one additional message (before: 364 messages, after:
365 messages) was just enough to take the token "groups" over
the edge.
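
To make "over the edge" concrete: a token only counts if its
value (the last number on each line) is far enough from the
neutral 0.5. A small Python sketch of such a test, assuming a
minimum deviation of 0.4. That 0.4 is inferred from the two
values above and is not stated anywhere in the output, and the
names MIN_DEV and is_included are made up for illustration:

```python
# Sketch of a minimum-deviation inclusion test.
# MIN_DEV = 0.4 is an assumption inferred from the two "groups" lines.
MIN_DEV = 0.4


def is_included(fw, min_dev=MIN_DEV):
    """True if the token value is at least min_dev away from 0.5."""
    return abs(fw - 0.5) >= min_dev


fw_before = 0.099846  # "groups" before registering t1 as ham
fw_after = 0.100088   # "groups" after registering t1 as ham

for label, fw in [("before", fw_before), ("after", fw_after)]:
    mark = "+" if is_included(fw) else "-"
    print(label, fw, round(abs(fw - 0.5), 6), mark)
```

Under that assumption the deviation goes from 0.400154 to
0.399912, so a strongly hammish token drops out of the
calculation and the remaining tokens leave t2 with a higher
score.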

I had never thought about the possibility of that effect.

The lesson learned: Adding messages can spoil the database
-- even in the unexpected direction, where registering more
ham makes another message look more spammish. This will be
highly visible for full training.

pi