program - example of libgmime-1.90.6 use and new algorithm to test

Wed Dec 11 22:12:13 CET 2002

Hi all,

  Included in this message is a program derived from an ancient version of bogofilter
(version 0.7).  By the time I realized how much bogofilter had been updated, I had
already made substantial changes, so this program has some additional features but
lacks others found in the standard bogofilter releases.

  I'm posting this here for two reasons.  First, this program uses libgmime-1.90.6 to
parse the email messages, which should allow us to test a full implementation using
libgmime-1.90.6 before we commit to using or not using libgmime.  I haven't had a
chance to look into the output message stuff yet, so this program still spews some
warning messages onto stderr.

  The second reason is that this program uses a new algorithm which I think is better.
It certainly seems to work better than the one I last gave Greg to test (at least for
me).  I've reached the point where I can't really test it well locally, because the
classification accuracy is too good on the relatively small email corpus I have.

Notes on running:

- This program requires that the first message you train on be spam; otherwise the exit
code will end up reversed.  Alternatively, you can train/untrain on a message or add
"spam" to the first line of ~/.bogofilter/classes.txt to get the same effect.
- This program will spew some messages from libgmime to stderr.
- This program requires libgmime-1.90.6 which is the latest version.
- I expect this program to take about twice as long to classify as bogofilter usually
does, because it usually makes twice as many calls to get_count.

Greg,
  Could you please test this program on your email corpus?  If it hangs, stalls, or fails
on any email messages, please let me know and, if you can, send me a message that
causes the problem so I can see if it is fixable.

Btw, after looking at the code for the Robinson method in more detail, I've determined that
it is the same as Naive Bayes for the 2 class case with an appropriate setting of the
threshold to account for the relative frequency of spam/ham.  The only difference is in
the method of smoothing applied.  One thing I did notice, though: the effect of the smoothing
is strongest when there are more good messages and fewer spam messages.  The more good
messages you get, the stronger the prior.  This is because the good message count gets
scaled to match the number of spam messages, effectively ending up like so:

cnt(w,spam) + robs * robx
---------------------------------------------------
cnt(w,spam) + cnt(w,ham)*cnt(spam)/cnt(ham) + robs

It might be interesting to try:

cnt(w,spam)*cnt(ham) + robs * robx
----------------------------------------------------
cnt(w,spam)*cnt(ham) + cnt(w,ham)*cnt(spam) + robs

which would restore the symmetry between ham and spam.
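
For anyone who wants to play with the two expressions numerically, here is a minimal
Python sketch of both.  It is not taken from bogofilter or from the attached program; the
function names, the argument names, and the robs/robx values are only illustrative
assumptions.  It also checks one reading of "symmetry": if you swap the roles of spam and
ham (and replace robx by 1 - robx), the two estimates should add up to 1.

```python
def robinson_scaled(w_spam, w_ham, n_spam, n_ham, robs, robx):
    """First expression: the ham word count is rescaled by cnt(spam)/cnt(ham).

    w_spam, w_ham: cnt(w,spam), cnt(w,ham); n_spam, n_ham: cnt(spam), cnt(ham).
    n_ham must be nonzero.
    """
    return (w_spam + robs * robx) / (w_spam + w_ham * n_spam / n_ham + robs)


def robinson_symmetric(w_spam, w_ham, n_spam, n_ham, robs, robx):
    """Second expression: each word count is weighted by the other class's message count."""
    numerator = w_spam * n_ham + robs * robx
    denominator = w_spam * n_ham + w_ham * n_spam + robs
    return numerator / denominator


def swapped_sum(estimate, w_spam, w_ham, n_spam, n_ham, robs, robx):
    """Estimate for spam plus the estimate with spam and ham swapped.

    Swapping the classes also swaps the prior: robx becomes 1 - robx.
    A sum of 1 means the expression treats the two classes symmetrically.
    """
    as_spam = estimate(w_spam, w_ham, n_spam, n_ham, robs, robx)
    as_ham = estimate(w_ham, w_spam, n_ham, n_spam, robs, 1.0 - robx)
    return as_spam + as_ham


if __name__ == "__main__":
    # Illustrative values only (not bogofilter defaults): the word was seen once
    # in each class, with 10 spam messages and 100 ham messages trained.
    args = (1, 1, 10, 100, 1.0, 0.5)
    print("scaled:    ", robinson_scaled(*args), swapped_sum(robinson_scaled, *args))
    print("symmetric: ", robinson_symmetric(*args), swapped_sum(robinson_symmetric, *args))
```

With these made-up counts the swapped sum for the second expression comes out to 1, while
for the first it does not.  Note that in the second expression robs is no longer on the
same scale as the counts, since they are multiplied by message totals, so it would probably
need retuning.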

- Scott

-------------- next part --------------
A non-text attachment was scrubbed...
Name: bogofilter_srl.tar.bz2
Type: application/octet-stream
Size: 23433 bytes
Desc: bogofilter_srl.tar.bz2
URL: <https://www.bogofilter.org/pipermail/bogofilter-dev/attachments/20021211/7f06b349/attachment.obj>