spam and bogofilter
relson at osagesoftware.com
Fri Nov 22 23:36:17 EST 2002
Thanks for getting in touch. We welcome all who are interested in bogofilter.
It sounds like you've done quite a bit of work with bogofilter. The 0.7
version you have is quite old, as I think you realize. Development has
progressed to 0.9 (beta), which was put up on SourceForge today. Changes
between 0.7 and 0.9 have been many and far-reaching. If you download 0.9
and look at the NEWS file, you'll get a sense of all that has changed.
Some of the changes parallel things you've done. The Judy arrays were
replaced by a wordhash, for size and speed reasons. Algorithmically, the
original Graham algorithm is one of three available, the other two being
the Robinson algorithm and the Fisher modification of the Robinson
algorithm. I'm not enough of a mathematician/statistician to comment on how
those changes relate to Naive Bayes, which you mention.
libgmime has been discussed, as the need for supporting MIME, base64, and
quoted-printable is recognized. We haven't gotten to it yet. Likewise
there has been talk of extending the lexer to handle header tokens, but no
development. For a variety of reasons, we've decided to stay with C.
I, personally, would like to see your code and invite you to take a look at
bogofilter 0.9.0. I think it'd be great to have your libgmime and lexer
work merged into bogofilter.
May I suggest that you subscribe to the bogofilter and/or bogofilter-dev
mailing lists? They may be found at bogofilter at aotto.com and
bogofilter-dev at aotto.com, respectively. I'm sure there will be additional
comments from other members of the group.
P.S. I expect that Greg Louis, our statistical expert, will have a
response to your query about testing methods and corpora.
At 09:46 PM 11/22/02, Scott Lenser wrote:
> I'm a Ph.D. student at Carnegie Mellon University working mostly with
> robotics and
>AI. I downloaded bogofilter-0.7 a while back and looked at using it to
>filter my mail. I wasn't satisfied with the results and felt it could be
>improved using my AI background. I modified it to be a Naive Bayes text
>classifier. I also changed the treatment of unseen words to add a prior to
>each word in the form of examples, as is often done in using Naive Bayes
>for text classification. I started using it to filter my email. I also
>partially rewrote it, changing from Judy arrays to STL hash_maps (which
>sped it up considerably). I just
>recently enhanced it to use libgmime to parse a mail message into body and
>headers and such. I grab some of the features that make sense to grab from
>the headers and ignore the rest. libgmime converts the encoding to 8bit so
>email encoded in quoted-printable, base64, and such is all handled
>correctly. I only grab text from the parts encoded in text/* formats. I
>use one lexer for text/html and another for all other text/* parts. The
>text/html lexer and associated parsing ignore all html tags (everything
>between '<' and '>'). All of these tokens go into the Naive Bayes
>classifier.
>I have tested this on my personal email corpus. I have 365 spam messages,
>995 messages from family members, 1230 messages pertaining to the research
>group I am in, and 3061 messages pertaining to the research project I work
>on. I tested the filter as follows. The 995 family messages and 1230
>general lab messages were labeled as ham. The 365 spam messages were
>labeled as spam. I tested the resulting filter on all of the mentioned
>messages with no further training of any kind. I got the following
>results:
>5/995 (0.5%) family messages labeled as SPAM [many of these messages are
>in html format and are often produced by Windows tools]
>6/3061 (0.2%) research project messages labeled as SPAM
>0/1230 (0.0%) research group messages labeled as SPAM
>~97% of SPAM messages labeled as SPAM (I misplaced the exact number)
>Most of the misclassified ham messages were emails from companies that I
>bought stuff from that were giving me tracking numbers and such. Most of
>the spam messages that got through were either really short or encoded in
>Asian characters.
>I'm interested in seeing these improvements make their way into the GPL
>version. I was wondering if you had a test corpus and procedure that I
>could use to compare the binary that I have versus the current performance
>of bogofilter. I was also wondering what you think would be the best way
>to proceed in getting this work available to the community.
>The main changes I've made are:
>- change algorithm to Naive Bayes classifier
>- correctly handle headers/mime messages/transfer encodings by using libgmime
>- split lexer into text/html and text/* parts
>- switch to C++ version of lexer (produced by flex++)
>- replace Judy arrays with STL hash_map
>- minor change: I changed the command line interface so that -H and -S
>  remove a message from the ham or spam corpus respectively, so that a
>  message can be removed completely from the corpus. I use -sH and -hS to
>  switch a message from ham to spam and vice versa.
>I look forward to hearing from you. Please forward this to any parties
>you think appropriate.
>- Scott Lenser