spam and bogofilter

Scott Lenser slenser at cs.cmu.edu
Sun Nov 24 00:16:38 CET 2002



I should be on the mailing lists now.

> Scott,
> 
> We would like to roll in your message classification code and the code that 
> allows analysis of MIME message components. It looks like some of our 
> efforts have overlapped, so we might make the best positive impact by 
> working together.
> 

Sounds reasonable.  At this point, my version is working well enough that
I'm not looking to put much more time into it.  I would like to see it
support more than two classes.  I'm less inclined to work on a C version,
simply because I don't see any benefit to sticking with C and do see
benefits to using C++ (such as access to the STL).  I'd also like it to
accept a batch of message files as command-line arguments instead of
reading them from stdin.
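
For what it's worth, the command-line handling I have in mind is roughly
the following sketch.  The classify_file function is only a placeholder
for whatever the real per-message entry point turns out to be, not code
from either tree.

#include <iostream>
#include <string>

// Placeholder: score one message file; real tokenizing/scoring would go here.
double classify_file(const std::string &path)
{
    (void)path;  // unused in this sketch
    return 0.0;
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        std::cerr << "usage: classifier message-file ..." << std::endl;
        return 1;
    }
    // Treat every argument as a message file instead of reading stdin.
    for (int i = 1; i < argc; ++i) {
        std::string file(argv[i]);
        std::cout << file << ": " << classify_file(file) << std::endl;
    }
    return 0;
}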

> Have you considered expanding the context of the text classification to 
> include more than just the prior word? Perhaps considering more of the 
> surrounding context would offer better results. Are you aware of any 
> studies or experiments done that indicate whether or not this is a good 
> approach?
> 

There have been attempts to do this in the AI text classification community.
The main problem is that it requires much more data to get reasonable
estimates of the probabilities involved.  I think most researchers find
that Naive Bayes works about equally well.  Most researchers who use
multiple-word features use sets of either two or three words.  These are
called n-grams in general, and digrams and trigrams for the n=2 and n=3
cases, respectively.  I think a more promising approach would be to try
Bayes nets.  Learning the structure of a Bayes net is still an open, active
research problem, though, so the structure would probably have to be hand
coded somehow.
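
To make the digram idea concrete, counting adjacent word pairs with the STL
looks roughly like this (just a sketch; the tokenization is deliberately
naive).  The table is keyed on word pairs rather than single words, which
is exactly why the counts get sparse and the probability estimates need so
much more training data.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::map<std::pair<std::string, std::string>, int> DigramCounts;

// Count adjacent word pairs (digrams) in one message body.
DigramCounts count_digrams(const std::string &text)
{
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w)
        words.push_back(w);

    DigramCounts counts;
    for (size_t i = 0; i + 1 < words.size(); ++i)
        ++counts[std::make_pair(words[i], words[i + 1])];
    return counts;
}

int main()
{
    DigramCounts c = count_digrams("buy cheap pills buy cheap watches");
    for (DigramCounts::const_iterator it = c.begin(); it != c.end(); ++it)
        std::cout << it->first.first << " " << it->first.second
                  << ": " << it->second << std::endl;
    return 0;
}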

- Scott


