spam and bogofilter

Greg Louis glouis at dynamicro.on.ca
Sat Nov 23 13:30:32 CET 2002


On 20021122 (Fri) at 2336:17 -0500, David Relson wrote:
> It sounds like you've done quite a bit of work with bogofilter.  The 0.7 
> version you have is quite old, as I think you realize.  Development has 
> progressed to 0.9(beta) which was put up on SourceForge today.  Changes 
> between 0.7 and 0.9 have been many and far reaching.  If you download 0.9 
> and look at the NEWS file, you'll get a sense of all that has changed.
> 
> Some of the changes parallel things you've done.  The Judy arrays were 
> replaced by a wordhash, for size and speed reasons.  Algorithmically, the 
> original Graham algorithm is one of three available - with the other two 
> being the Robinson algorithm and the Fisher modification of the Robinson 
> algorithm.  I'm not enough of a mathematician/statistician to comment on 
> how those changes relate to Naive Bayes, which you mention.
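For anyone on the list who hasn't read "A Plan for Spam", Graham's
combining rule looks roughly like the sketch below (an illustration of
the published rule, with Graham's 0.01/0.99 clamping constants, not
bogofilter's actual source):

```python
def graham_combine(probs):
    """Combine per-token spam probabilities p_i into an overall score
    P = prod(p) / (prod(p) + prod(1 - p)) -- Graham's combining rule."""
    num = 1.0
    den = 1.0
    for p in probs:
        p = min(max(p, 0.01), 0.99)  # clamp extreme spamicities
        num *= p
        den *= 1.0 - p
    return num / (num + den)
```

A message whose tokens are all neutral (p = 0.5) scores exactly 0.5;
a few strongly spammy tokens drive the score toward 1 very quickly,
which is why Graham's scheme tends to produce scores near 0 or near 1.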

I would _really_ like the chance to compare naive Bayes with the
simplified approach.  Our test corpora tend to be rather larger than
what is described below, and although it's not been adopted as a
general standard, I do have a procedure for doing comparison tests. 
See http://www.bgl.nu/~glouis/bogofilter/fisher.html for an example.

> I think it'd be great to have your libgmime and lexer 
> work merged into bogofilter.

Hear, hear!

> P.S.  I expect that Greg Louis, our statistical expert, will have a 
> response to your query about testing methods and corpuses (sp?)
(Seems most people use the Latin plural, corpora.  Not sure if US rules
are the same, but in English English the other would be spelled
corpusses, to rhyme with busses, rather than corpuses, which would
rhyme with confuses ;)

> At 09:46 PM 11/22/02, Scott Lenser wrote:
> 

> >I modified it to be a Naive Bayes text classifier.  I also changed
> >the treatment of unseen words to add a prior to each word in the
> >form of virtual examples as is often done in using Naive Bayes for
> >text classification.

We bogofilter folks would welcome a bit of explanation of the "virtual
examples" concept.  AFAIK, none of us is familiar enough with Naive
Bayes to know what it entails.
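If "virtual examples" means what it usually does in the Naive Bayes
literature - an add-m (Laplace) prior on the per-word counts, so unseen
words get a small nonzero probability instead of zero - a minimal
sketch might look like this.  The function name and the default m = 1
are illustrative assumptions on my part, not anything from Scott's
code:

```python
from math import log

def word_log_prob(count_in_class, total_words_in_class, vocab_size, m=1):
    """log P(word | class), with m 'virtual examples' of every
    vocabulary word added to the training counts (add-m smoothing).
    A word never seen in the class still gets probability
    m / (total + m * vocab_size) rather than zero."""
    return log((count_in_class + m) / (total_words_in_class + m * vocab_size))
```

The practical effect is that a single unseen token can no longer drive
the whole product of probabilities to zero, which matters a lot for
short messages.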

> >I have tested this on my personal email corpus.  I have 365 spam
> >messages, 995 messages from family members, 1230 messages pertaining
> >to the research group I am in, and 3061 messages pertaining to the
> >research project I work on.

We can give it a somewhat stronger workout if you like.  I can collect
about 1200 messages a day to use in testing; my training database has
over 10,000 spam and 10,000 nonspam this morning.  If you can provide
me with a binary that runs on linux / glibc, or source from which to
build one, I'd be delighted to run a comparison with our
Robinson-Fisher method of classification (which seems to be the most
successful to date), or even with all three classification methods that
we've tried.
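For list members who haven't seen it spelled out, the Robinson-Fisher
combining referred to here can be sketched roughly as below.  This
follows Gary Robinson's published description (Fisher's method with an
inverse chi-square tail probability); it is an illustration, not
bogofilter's source, and the 1e-6 clamp is my own guard against log(0):

```python
from math import exp, log

def chi2q(x2, df):
    """Tail probability that a chi-square variate with df degrees of
    freedom (df must be even) exceeds x2."""
    m = x2 / 2.0
    term = prob = exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        prob += term
    return min(prob, 1.0)

def fisher_combine(probs):
    """Combine per-token spam probabilities into an indicator in
    [0, 1]: near 1 means spam, near 0 means ham, near 0.5 unsure."""
    probs = [min(max(p, 1e-6), 1.0 - 1e-6) for p in probs]  # avoid log(0)
    n = len(probs)
    s = chi2q(-2.0 * sum(log(p) for p in probs), 2 * n)        # spamminess
    h = chi2q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)  # hamminess
    return (1.0 + s - h) / 2.0
```

Unlike Graham's product, this gives a middle ground: a message whose
tokens are genuinely mixed lands near 0.5, which is what makes a
tunable spam cutoff meaningful.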

> >I was wondering if you had a test corpus and procedure that I could
> >use to compare the binary that I have versus the current performance
> >of bogofilter.

You'd be welcome to download my spams if you want to run tests
yourself, but I can't give you the nonspams, as they include many
confidential messages.  Sorry for that.

> >- correctly handle headers/mime messages/transfer encodings by using 
> >  libgmime
> >- split lexer into text/html and text/* parts
> >- minor change: I changed the command line interface so that -H and -S 
> >  simply remove from the ham or spam corpus respectively so that a
> >  message can be removed completely from the corpus.  I use -sH and
> >  -hS to switch a message from ham to spam and vice versa.

These sound as though they would be very useful additions to mainstream
bogofilter.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |


