[bogofilter] Test sets, accuracy and other things

Jonathan Buzzard jonathan at buzzard.org.uk
Tue Sep 10 16:25:36 CEST 2002


[I apologize in advance that this email is a bit long, and for any spelling
 mistakes - I'm dyslexic ]

In a rather fortunate move I have historically kept a copy of all the email
I receive, including all the spam. I have been using this over the last
week or so to test bogofilter, to see how good it really is and which
changes lead to an improvement in classification.

What I have done is gone through my mail archive (for the record it is all
in mh format) and made some collections of mail. Firstly I have made a folder
of every email I received during the month of August this year and have
hand classified the lot (1204 messages) into three groups: good (931), spam
(202) and virus (68). There are three messages I am not sure how to classify.
Classifying the mail showed up the fact that what is and what is not spam
can be quite woolly at the edges. For example, I chose to classify an
email from Xilinx telling me that some of their software I have downloaded
is incompatible with Windows 2000 SP3 as not spam. Although it is clearly
bulk and unsolicited email, it is highly useful information to have
(not that I would install SP3 anyway, but if I were...)

I am using this as a test set for measuring the accuracy of bogofilter.
I have also made a number of other folders, all containing collections of
mail I received in 2002, prior to 1st of August. These are all viruses,
all spam, all personal email (from friends and family), all email directed
at me but not personal (things like eBay bid notices, online receipts
and dispatch notices), one for all questions on Toshiba laptops and
scanners I get asked, and one for each of the mailing lists I am subscribed to.
These I use to train bogofilter before testing it on the August email.

While my email may not be entirely representative of the average person,
it does allow me to make quantitative comparisons of various modifications
to the bogofilter algorithms, and thus make at least suggestions as to
what needs changing to improve bogofilter and probably patches as well.
Clearly I will not be making my test sets available for downloading as
they contain confidential and personal information.

Now to the interesting bits, and my hope is that these will stimulate
discussion of the underlying algorithms in bogofilter and hopefully lead
to improvements in them. At this point I will note that I do have some
experience in Bayesian statistics, having worked on a European project
on the forecasting of sales using Bayesian statistics at the University
of Sunderland. A bit different to classification of email, but the underlying
mathematics is all the same.

The first thing I am going to say is that ignoring the contents of MIME
attachments is a sure-fire way to let spam through. In tests, if I take
any plain-text spam message from August and put it in a quoted-printable
plain-text attachment, it passes straight through bogofilter. Admittedly,
in the 2500 spams I have received I don't think any such spam exists, but
if bogofilter ever becomes ubiquitous I am sure they will start appearing
quickly. Therefore my first suggestion is that bogofilter examine the
contents of at least quoted-printable MIME attachments.
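
To illustrate, here is a minimal sketch (in Python, using only the
standard library - bogofilter itself is C, so this is purely
illustrative) of decoding quoted-printable text parts before tokenising:

```python
import email

def decoded_text_parts(raw_message: bytes):
    """Yield the decoded text of every text/* MIME part.
    get_payload(decode=True) undoes the quoted-printable (and base64)
    transfer encodings, so the tokenizer sees the real words."""
    msg = email.message_from_bytes(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        body = part.get_payload(decode=True)
        if body is not None:
            charset = part.get_content_charset() or "latin-1"
            yield body.decode(charset, errors="replace")
```

Feeding each yielded string to the existing tokenizer would be enough to
close the loophole for quoted-printable plain-text attachments.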

Second off the mark is the occurrence of common words in the wordlists
and the effect these have on the accuracy of classification. First, at
least in bogofilter 0.5, words such as "from" and "subject" get added as
high-scoring words to both the good and bad lists, and have clearly come
from the mail headers. Second is the occurrence of words like "the",
"are", "can", "has", "with", "you", etc. in both the good and bad lists.
Clearly such words tell you little about whether an email is spam or not.
However, they appear high in the word lists, and although they cancel
each other out under Bayes' rule, I am finding that as many as 2/3 of
the words used to calculate the spam probability are these common words.
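
The cancellation is easy to see in the naive-Bayes combining formula.
A Python sketch (Graham-style combination, purely illustrative):

```python
from math import prod

def combined_spam_prob(probs):
    """Combine per-word spam probabilities p(w) the naive-Bayes way:
    P = prod(p) / (prod(p) + prod(1 - p))."""
    s = prod(probs)
    h = prod(1.0 - p for p in probs)
    return s / (s + h)

telling = [0.99, 0.98, 0.95]   # genuinely discriminating words
neutral = [0.5] * 10           # common words like "the", "with", "you"

# The neutral words cancel exactly - they multiply both products by the
# same factor - but if only the top 15 words are kept, they crowd out
# words that actually discriminate.
assert abs(combined_spam_prob(telling)
           - combined_spam_prob(telling + neutral)) < 1e-12
```

So the common words do no direct harm to the arithmetic; the harm is that
they occupy slots that discriminating words should be filling.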

As a test I took a list of 200 common English words from vmspell (a
Pascal spell checker for VMS that I once entertained the idea of
porting to C and Unix). I removed these words from the good and bad
lists and ran the classification on the August email again. There was
a significant improvement in the classification accuracy (I can produce
detailed figures if needed).
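
For reference, the experiment amounts to nothing more than filtering the
wordlists before scoring. A sketch (Python; the stopword set is an
abbreviated stand-in for the ~200-word vmspell list, and the names are
illustrative, not bogofilter's API):

```python
# Abbreviated stand-in for the ~200 common English words from vmspell.
STOPWORDS = {"the", "are", "can", "has", "with", "you", "from", "subject"}

def strip_common_words(wordlist):
    """Remove language-specific common words from a token -> count map,
    as used for the good and bad lists."""
    return {w: n for w, n in wordlist.items() if w.lower() not in STOPWORDS}
```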

Now I am reluctant to suggest using word lists to remove common words
because this ties bogofilter to a particular language and requires the
production of a list for each language of email bogofilter processes.

Therefore I have given this problem quite a bit of thought and have
come up with the following algorithmic way of removing common words
that contribute equally to spam and good from the final calculation.
The outline is below, though note the figures are just suggestions at
the moment. They would obviously need some degree of tweaking.

   1. Collect the top 30 words for spam/good
   2. For all words that appear in both lists calculate the contribution
      they will have to the total probability.
   3. If the combined contribution is below some threshold remove the words
      from both lists.
   4. Calculate the probability that the email is spam based on the 10
      best remaining words in each list.

I have no code for this at the moment, which is why the figures will need
tweaking. However if this replicates the effect of removing common words
from the lists (incidentally I reduced the number of words used to
calculate the probability from 15 to 10 for this test), then it would make
bogofilter significantly better at classifying email into spam/non spam.

This sort of filtering is perfectly justifiable from the Bayesian statistical
point of view.
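
One possible reading of steps 1-4, sketched in Python (the figures 30/10
and the neutral threshold are the tentative ones above and would need
tuning; treating a word's contribution as its distance from 0.5 is my
interpretation of step 2):

```python
from math import prod

def spam_probability(token_probs, top_n=30, keep_n=10, threshold=0.1):
    """token_probs maps each token in the message to its estimated
    spam probability p(w).
    1. take the top_n most extreme tokens on each side;
    2-3. drop tokens whose contribution |p - 0.5| is below threshold
         (the common words that would otherwise just cancel out);
    4. combine the keep_n best survivors from each side."""
    spammy = sorted((p for p in token_probs.values() if p > 0.5),
                    reverse=True)[:top_n]
    goodish = sorted(p for p in token_probs.values() if p <= 0.5)[:top_n]
    survivors = [p for p in spammy + goodish if abs(p - 0.5) >= threshold]
    survivors.sort(key=lambda p: abs(p - 0.5), reverse=True)
    chosen = ([p for p in survivors if p > 0.5][:keep_n] +
              [p for p in survivors if p <= 0.5][:keep_n])
    if not chosen:
        return 0.5          # nothing discriminating: undecided
    s = prod(chosen)
    h = prod(1.0 - p for p in chosen)
    return s / (s + h)
```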

Finally, for the time being at least, an idea about return status and
mails with MIME attachments. Some time ago, getting sick of virus
emails in my inbox (not that they have any effect under Linux/exmh, of
course), I decided to write a scanner that rejects emails based on two
rules:

    1. Attachments with executable extensions in the filename, but the
       mime-type does not match.
    2. Attachments with double extensions in the filename and the second
       one is executable.
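
Those two rules are simple enough to sketch (Python; the extension and
MIME-type sets are abbreviated illustrations, not the full lists my
scanner uses):

```python
import os

# Abbreviated stand-ins for the real lists.
EXECUTABLE_EXTS = {".exe", ".com", ".bat", ".pif", ".scr", ".vbs", ".js"}
EXECUTABLE_MIME = {"application/octet-stream", "application/x-msdownload"}

def looks_like_virus(filename, declared_mime):
    """Apply the two filename rules to one attachment."""
    root, ext = os.path.splitext(filename.lower())
    if ext not in EXECUTABLE_EXTS:
        return False
    # Rule 1: executable extension, but the MIME type doesn't match.
    if declared_mime.lower() not in EXECUTABLE_MIME:
        return True
    # Rule 2: double extension where the second one is executable.
    return os.path.splitext(root)[1] != ""
```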

Very effective it is too, missing only two viruses of the 68 I received in
August. The volume can be variable; earlier this year I got 40+ in one day,
all of which it caught.

What relevance does this have to bogofilter you might ask? Well apart from
the 68 viruses I had only another 11 emails with attachments. Now if
bogofilter returned a different exit code for emails that although not
spam contained a MIME attachment I could restrict the scanning for virus
to those 6.5% of emails that could possibly contain a virus. As bogofilter
is scanning all the emails in the first place and already does 99% of the
work required to return that different status code, this would lead to
a dramatic saving of processing power if you are scanning for viruses
as well as spam. I did previously propose to Eric that bogofilter do
the actual scanning based on the above rules, which he rightly rejected.
However I think my modified proposal is perfectly acceptable and a
justified extension of bogofilter. What do you all think?
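
As a sketch of the proposed interface (Python; the exit-code values and
the is_spam hook are invented for illustration, not bogofilter's actual
codes):

```python
import email

# Hypothetical statuses - bogofilter's real exit codes differ.
SPAM, HAM, HAM_WITH_ATTACHMENT = 0, 1, 2

def classify_status(raw_message, is_spam):
    """Return SPAM/HAM as now, but distinguish non-spam that carries an
    attachment, so a virus scanner need only look at those messages."""
    msg = email.message_from_bytes(raw_message)
    if is_spam(msg):
        return SPAM
    if any(part.get_filename() for part in msg.walk()):
        return HAM_WITH_ATTACHMENT
    return HAM
```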

JAB.

-- 
Jonathan A. Buzzard                 Email: jonathan at buzzard.org.uk
Northumberland, United Kingdom.       Tel: +44(0)1661-832195
