spam and bogofilter

Sat Nov 23 23:22:54 CET 2002

<html>
<br>
Scott,<br>
<br>
Thanks for your message. As David mentioned in his response, the best way to contribute to the GPL community is to join our development mailing list (email bogofilter-dev-subscribe at aotto.com) and offer your guidance and advice. Furthermore, we welcome you to join the bogofilter development team on SourceForge. It appears that you are already part of the community and have contributed to the Scwm project. If you are interested in joining the bogofilter effort (even if you can only contribute a small amount of your time), let me or David know, and we will add you to the project.<br>
<br>
We would like to roll in your message classification changes and the code that allows analysis of MIME message components. It looks like some of our efforts have overlapped, so we would likely have the greatest impact by working together.<br>
<br>
Have you considered expanding the context of the text classification to include more than just the prior word? Perhaps considering more of the surrounding context would offer better results. Are you aware of any studies or experiments done that indicate whether or not this is a good approach?<br>
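<br>
As a rough illustration of what wider context could mean here, the sketch below emits each token on its own and also joined with up to max_context preceding tokens, so the classifier sees short phrases as features. The function name context_features and the max_context parameter are hypothetical and are not part of bogofilter or of Scott's code.<br>

```python
def context_features(tokens, max_context=2):
    """Return each token alone and joined with up to max_context preceding tokens.

    Hypothetical helper for illustration only; max_context=0 yields plain
    single-word features.
    """
    features = []
    for i, _ in enumerate(tokens):
        for n in range(max_context + 1):
            if i - n < 0:
                break  # not enough preceding tokens for this context width
            features.append(" ".join(tokens[i - n:i + 1]))
    return features


if __name__ == "__main__":
    print(context_features(["buy", "cheap", "pills"], max_context=2))
```

Wider context multiplies the number of distinct features, so each one is seen less often in training; whether that helps accuracy is exactly the open question above.<br>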
<br>
Thanks,<br>
<br>
Adrian<br>
<br>
<br>
On Friday, November 22, 2002, at 06:46 PM, Scott Lenser wrote:<br>
<br>
<br>
<blockquote type=cite cite>Hi,<br>
<br>
  I'm a Ph.D. student at Carnegie Mellon University working mostly with robotics and<br>
AI.  I downloaded bogofilter-0.7 a while back and looked at using it to filter my<br>
mail.  I wasn't satisfied with the results and felt it could be improved using my<br>
AI background.  I modified it to be a Naive Bayes text classifier.  I also changed<br>
the treatment of unseen words to add a prior to each word in the form of virtual<br>
examples as is often done in using Naive Bayes for text classification.  I then<br>
started using it to filter my email.  I also modified it to be partially C++ and<br>
changed from Judy arrays to STL hash_maps (which sped it up considerably).  I just<br>
recently enhanced it to use libgmime to parse a mail message into body and headers<br>
and such.  I grab some of the features that make sense to grab from the headers and<br>
ignore the rest.  libgmime converts the encoding to 8bit, so email encoded in<br>
quoted-printable, base64, and such is handled correctly.  I only grab features<br>
from the parts encoded in text/* formats.  I use one lexer for text/html and one<br>
for all other text/* parts.  The text/html lexer and associated parsing routines<br>
ignore all HTML tags (everything between '<' and '>').  All of these features go into the<br>
Naive Bayes classifier.<br>
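<br>
A minimal sketch of the approach described above, not the actual modified bogofilter code: a two-class Naive Bayes classifier in which every word receives virtual_count imaginary occurrences per class, so an unseen word never drives a probability to zero, plus a tokenizer that drops everything between '<' and '>'. The names NaiveBayes, html_tokens, and virtual_count, the lowercasing, and the token pattern are all assumptions made for illustration.<br>

```python
import math
import re
from collections import Counter


def html_tokens(text):
    """Drop everything between '<' and '>' and split the rest into lowercase tokens."""
    stripped = re.sub(r"<[^>]*>", " ", text)
    return re.findall(r"[a-z0-9$'-]+", stripped.lower())


class NaiveBayes:
    """Two-class Naive Bayes with a prior expressed as virtual word examples."""

    LABELS = ("ham", "spam")

    def __init__(self, virtual_count=1.0):
        self.virtual_count = virtual_count  # assumed prior strength per word
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.msg_counts = {label: 0 for label in self.LABELS}

    def train(self, label, tokens):
        self.word_counts[label].update(tokens)
        self.msg_counts[label] += 1

    def log_score(self, label, tokens):
        counts = self.word_counts[label]
        vocab = set(self.word_counts["ham"]) | set(self.word_counts["spam"])
        # One extra vocabulary slot stands in for words never seen in training.
        denom = sum(counts.values()) + self.virtual_count * (len(vocab) + 1)
        total_msgs = sum(self.msg_counts.values())
        score = math.log((self.msg_counts[label] + 1) / (total_msgs + 2))
        for tok in tokens:
            score += math.log((counts[tok] + self.virtual_count) / denom)
        return score

    def classify(self, tokens):
        return max(self.LABELS, key=lambda label: self.log_score(label, tokens))


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("spam", html_tokens("<b>Cheap</b> pills, buy now"))
    nb.train("ham", html_tokens("Robot lab meeting moved to Friday"))
    print(nb.classify(html_tokens("<p>cheap pills</p>")))
```

Working in log space avoids floating-point underflow on long messages.<br>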
<br>
I have tested this on my personal email corpus.  I have 365 spam messages, 995 messages<br>
from family members, 1230 messages pertaining to the research group I am in, and<br>
3061 messages pertaining to the research project I work on.  I tested the spam filter<br>
as follows.  The 995 family messages and 1230 general lab messages were labelled as ham.<br>
The 365 spam messages were labelled as spam.  I tested the resulting filter on all of<br>
the mentioned messages with no further training of any kind.  I got the following results:<br>
<br>
5/995 (.5%) family messages labelled as SPAM [many of these messages are in HTML format<br>
  and are often produced by Windows tools]<br>
6/3061 (.2%) research project messages labelled as SPAM<br>
0/1230 (.0%) research group messages labelled as SPAM<br>
(~97%) of SPAM messages labelled as SPAM (I misplaced the exact number)<br>
<br>
Most of the misclassified ham messages were emails from companies that I had ordered<br>
stuff from that were giving me the tracking numbers and such.  Most of the spam messages<br>
that got through were either really short or encoded in Asian characters.<br>
<br>
I'm interested in seeing these improvements make their way into the GPL community.<br>
I was wondering if you had a test corpus and procedure that I could use to compare<br>
the binary that I have versus the current performance of bogofilter.  I was also<br>
wondering what you think would be the best way to proceed in getting these improvements<br>
available to the community.<br>
<br>
The main changes I've made are:<br>
<br>
- change algorithm to Naive Bayes classifier<br>
- correctly handle headers/MIME messages/transfer encodings by using libgmime<br>
- split lexer into text/html and text/* parts<br>
- switch to C++ version of lexer (produced by flex++)<br>
- replace Judy arrays with STL hash_map<br>
- minor change: I changed the command line interface so that -H and -S simply remove<br>
  from the ham or spam corpus respectively so that a message can be removed completely<br>
  from the corpus.  I use -sH and -hS to switch a message from ham to spam and vice versa.<br>
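<br>
The flag semantics in the last item can be sketched as word-count updates. This is an illustration only, not the actual command-line code: the function apply_flags and the corpus layout are hypothetical, and the reading that lowercase s and h add a message to the spam and ham counts is inferred from the -sH and -hS combinations described above.<br>

```python
from collections import Counter


def apply_flags(corpus, flags, tokens):
    """Update word counts in place for a flag string such as "H", "S", "sH" or "hS".

    corpus is a dict {"ham": Counter, "spam": Counter} (hypothetical layout).
    Assumed meaning: s/h add the message to spam/ham, S/H remove it.
    """
    for flag in flags:
        if flag == "s":
            corpus["spam"].update(tokens)
        elif flag == "h":
            corpus["ham"].update(tokens)
        elif flag == "S":
            corpus["spam"].subtract(tokens)
        elif flag == "H":
            corpus["ham"].subtract(tokens)
        else:
            raise ValueError(f"unknown flag: {flag!r}")
    for label in corpus:
        corpus[label] = +corpus[label]  # unary plus drops zero and negative counts


if __name__ == "__main__":
    corpus = {"ham": Counter(["tracking", "number"]), "spam": Counter()}
    apply_flags(corpus, "sH", ["tracking", "number"])  # switch ham to spam
    print(corpus)
```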
<br>
Look forward to hearing from you.  Please forward this to any parties you think would<br>
be interested.<br>
<br>
- Scott Lenser<br>
<br>
</blockquote><br>
<br>
</html>