spam and bogofilter

David Relson relson at osagesoftware.com
Sat Nov 23 05:36:17 CET 2002


Hello Scott,

Thanks for getting in touch.  We welcome all who are interested in bogofilter.

It sounds like you've done quite a bit of work with bogofilter.  The 0.7 
version you have is quite old, as I think you realize.  Development has 
progressed to 0.9 (beta), which was put up on SourceForge today.  Changes 
between 0.7 and 0.9 have been many and far-reaching.  If you download 0.9 
and look at the NEWS file, you'll get a sense of all that has changed.

Some of the changes parallel things you've done.  The Judy arrays were 
replaced by a wordhash, for size and speed reasons.  Algorithmically, the 
original Graham algorithm is one of three available, the other two 
being the Robinson algorithm and the Fisher modification of the Robinson 
algorithm.  I'm not enough of a mathematician/statistician to comment on how 
those changes relate to Naive Bayes, which you mention.
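For reference, Gary Robinson's combining rule takes the geometric mean of the 
per-word spam probabilities and of their complements, then folds the two into 
a single score.  A rough sketch, assuming every probability lies strictly 
between 0 and 1 (illustrative only, not bogofilter's actual code):

```cpp
#include <cmath>
#include <vector>

// Sketch of Robinson's geometric-mean combining rule.  Each p in probs
// is a per-word spam probability in (0, 1); the result is near 1.0 for
// spam and near 0.0 for ham.  Illustrative only.
double robinson_s(const std::vector<double>& probs) {
    double n = static_cast<double>(probs.size());
    double log_p = 0.0, log_q = 0.0;
    for (double p : probs) {
        log_p += std::log(1.0 - p);   // accumulate in log space
        log_q += std::log(p);         // to avoid underflow
    }
    double P = 1.0 - std::exp(log_p / n);    // evidence of spamminess
    double Q = 1.0 - std::exp(log_q / n);    // evidence of hamminess
    return (1.0 + (P - Q) / (P + Q)) / 2.0;  // combined score in [0, 1]
}
```

The Fisher modification keeps the per-word probabilities but combines them 
with a chi-square test instead of the geometric mean.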

libgmime has been discussed, since the need to support MIME, base64, and 
quoted-printable is recognized.  We haven't gotten to it as yet.  Likewise, 
there has been talk of extending the lexer to handle header tokens, but no 
development.  For a variety of reasons, we've decided to stay with C.
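Decoding the transfer encodings themselves is the easy part; quoted-printable 
(RFC 2045), for instance, takes only a few lines.  A toy decoder, sketched in 
C++ for compactness (this is a hand-rolled illustration, not libgmime's API, 
and libgmime would of course handle the rest of MIME as well):

```cpp
#include <cctype>
#include <string>

// Toy quoted-printable decoder (RFC 2045).  Handles =XX hex escapes and
// soft line breaks ('=' at end of line); not production code.
std::string qp_decode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '=') { out += in[i]; continue; }
        if (i + 2 < in.size() && in[i+1] == '\r' && in[i+2] == '\n') {
            i += 2;                       // soft break: "=\r\n" is dropped
        } else if (i + 1 < in.size() && in[i+1] == '\n') {
            i += 1;                       // soft break with bare LF
        } else if (i + 2 < in.size() &&
                   std::isxdigit(static_cast<unsigned char>(in[i+1])) &&
                   std::isxdigit(static_cast<unsigned char>(in[i+2]))) {
            out += static_cast<char>(
                std::stoi(in.substr(i + 1, 2), nullptr, 16));
            i += 2;                       // "=41" decodes to 'A'
        } else {
            out += in[i];                 // malformed escape: pass through
        }
    }
    return out;
}
```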

I, personally, would like to see your code and invite you to take a look at 
bogofilter 0.9.0.  I think it'd be great to have your libgmime and lexer 
work merged into bogofilter.

May I suggest that you subscribe to the bogofilter and/or bogofilter-dev 
mailing lists?  They may be found at bogofilter at aotto.com and 
bogofilter-dev at aotto.com, respectively.  I'm sure there will be additional 
comments from other members of the group.

David

P.S.  I expect that Greg Louis, our statistical expert, will have a 
response to your query about testing methods and corpora.

At 09:46 PM 11/22/02, Scott Lenser wrote:


>Hi,
>
>   I'm a Ph.D. student at Carnegie Mellon University working mostly with
>robotics and AI.  I downloaded bogofilter-0.7 a while back and looked at
>using it to filter my mail.  I wasn't satisfied with the results and felt
>it could be improved using my AI background.  I modified it to be a Naive
>Bayes text classifier.  I also changed the treatment of unseen words to
>add a prior to each word in the form of virtual examples, as is often
>done when using Naive Bayes for text classification.  I then started
>using it to filter my email.  I also modified it to be partially C++ and
>changed from Judy arrays to STL hash_maps (which sped it up
>considerably).  I just recently enhanced it to use libgmime to parse a
>mail message into body and headers and such.  I grab some of the features
>that make sense to grab from the headers and ignore the rest.  libgmime
>converts the encoding to 8bit, so email encoded in quoted-printable,
>base64, and such is all handled correctly.  I only grab features from the
>parts encoded in text/* formats.  I use one lexer for text/html and one
>for all other text/* parts.  The text/html lexer and associated parsing
>routines ignore all html tags (everything between '<' and '>').  All of
>these go into the Naive Bayes classifier.
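The virtual-example prior described above amounts to add-k (Laplace) 
smoothing of the per-word counts.  A minimal sketch of the idea, using 
std::unordered_map (the standardized descendant of the SGI hash_map) and 
names and constants of my own choosing, not anything from the actual patch:

```cpp
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of multinomial Naive Bayes over word counts, with unseen
// words handled by k "virtual examples" per vocabulary word (add-k
// smoothing).  Class priors are omitted for simplicity.
struct NaiveBayes {
    std::unordered_map<std::string, double> spam_counts, ham_counts;
    double spam_total = 0, ham_total = 0;
    double k = 1.0;  // virtual examples per word (Laplace smoothing)

    void train(const std::vector<std::string>& words, bool spam) {
        for (const auto& w : words) {
            if (spam) { spam_counts[w] += 1; spam_total += 1; }
            else      { ham_counts[w]  += 1; ham_total  += 1; }
        }
    }

    // Log-odds that the message is spam; > 0 means "looks like spam".
    // vocab is an estimate of the vocabulary size.
    double log_odds(const std::vector<std::string>& words,
                    double vocab) const {
        double score = 0.0;
        for (const auto& w : words) {
            auto s = spam_counts.find(w), h = ham_counts.find(w);
            double ps = ((s != spam_counts.end() ? s->second : 0) + k)
                        / (spam_total + k * vocab);
            double ph = ((h != ham_counts.end() ? h->second : 0) + k)
                        / (ham_total + k * vocab);
            score += std::log(ps) - std::log(ph);
        }
        return score;
    }
};
```

The smoothing term keeps a never-seen word from driving a probability to 
zero, which is exactly the failure mode the virtual examples address.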
>
>I have tested this on my personal email corpus.  I have 365 spam
>messages, 995 messages from family members, 1230 messages pertaining to
>the research group I am in, and 3061 messages pertaining to the research
>project I work on.  I tested the spam filter as follows.  The 995 family
>messages and 1230 general lab messages were labeled as ham.  The 365 spam
>messages were labeled as spam.  I tested the resulting filter on all of
>the mentioned messages with no further training of any kind.  I got the
>following results:
>
>5/995 (.5%) family messages labeled as SPAM [many of these messages are
>   in html format and are often produced by windows tools]
>6/3061 (.2%) research project messages labeled as SPAM
>0/1230 (.0%) research group messages labeled as SPAM
>(~97%) of SPAM messages labeled as SPAM (I misplaced the exact number)
>
>Most of the misclassified ham messages were emails from companies that I
>had ordered stuff from that were giving me the tracking numbers and such.
>Most of the spam messages that got through were either really short or
>encoded in Asian characters.
>
>I'm interested in seeing these improvements make their way into the GPL
>community.  I was wondering if you had a test corpus and procedure that I
>could use to compare the binary that I have against the current
>performance of bogofilter.  I was also wondering what you think would be
>the best way to proceed in making these improvements available to the
>community.
>
>The main changes I've made are:
>
>- change algorithm to Naive Bayes classifier
>- correctly handle headers/mime messages/transfer encodings by using
>  libgmime
>- split lexer into text/html and text/* parts
>- switch to C++ version of lexer (produced by flex++)
>- replace Judy arrays with STL hash_map
>- minor change: I changed the command line interface so that -H and -S
>  simply remove from the ham or spam corpus respectively, so that a
>  message can be removed completely from the corpus.  I use -sH and -hS
>  to switch a message from ham to spam and vice versa.
>
>I look forward to hearing from you.  Please forward this to any parties
>you think would be interested.
>
>- Scott Lenser




