train on error

Fri Sep 5 10:53:11 CEST 2003

Unknown wrote:

> Now I ask: does the train-on-error method work well? Or do I need to receive
> hundreds of thousands of trillions of billions of emails before it begins to
> be accurate? :)
> 
> What do you think of this method? What method do you currently use, and
> are you satisfied with its accuracy?
> 

I use "train on unsure".

All emails above a certain spamicity are *deleted* on my mail server 
machine. Any remaining spams are downloaded onto my client machine.
If these really are spam, they are posted as attachments to a special 
account on the mail server that adds the email to the spamlist database. 
(I use a single shared bogofilter database for all user accounts on the 
mail server box).

I also update the spamlist database from a "honeytrap" account that is a 
source of pure spam. Here again I delete the high spamicity messages and 
only use the borderline spam for updating the shared database.

Why do I do this? Well, there are disk quota restrictions on the mail server 
machine, so I don't want the database to get too big, and I want the 
database additions to make a difference. The idea is that by focusing on 
the borderline spam cases, I improve discrimination where it counts. It 
might drop the scores of "high spam" emails a bit, but not enough to 
matter.
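
As a rough illustration of the selection rule described above (not the
actual server setup), the sketch below picks which confirmed spams would be
added to the database. The function name, the tuple layout, and the choice
to treat a score equal to the cutoff as deletable are all assumptions; the
0.65 cutoff is the one quoted at the end of this message.

```python
# Hypothetical sketch of "train on unsure": only confirmed spam that
# scored below the delete cutoff is used to update the database.
DELETE_CUTOFF = 0.65  # cutoff quoted in this message


def select_for_training(messages, delete_cutoff=DELETE_CUTOFF):
    """Return the ids of messages worth adding to the spamlist.

    `messages` is an iterable of (msg_id, score, confirmed_spam) tuples,
    where `score` is the spamicity in [0, 1] and `confirmed_spam` is True
    when a human (or the honeytrap account) has identified it as spam.
    """
    return [
        msg_id
        for msg_id, score, confirmed_spam in messages
        # High-scoring spam is deleted outright and never trained on.
        if confirmed_spam and score < delete_cutoff
    ]
```

High-scoring spam is skipped because the filter already catches it, so
adding it would grow the database without improving discrimination.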

I have been doing this for 6 months. The stats are:

Database size:
spams: 4000
hams: 1220

Update rate:
4 spams/day from honeytrap
1 spam/day manual posting of spam to special account

Performance (April):
Spam deleted: 90%
Spam downloaded: 10%
False negatives: 1%
False positives: 0.18%

Performance (Now):
Spam deleted: 97%
Spam downloaded: 3%
False negatives: 0.15%
False positives: 0.15%

The results might also be affected by an update to bogofilter (0.9.1.2 to 
13.6.2), but the filter is running in much the same mode (using 
case-insensitive tokens).

Just in case you are wondering, I do maintain a log of deleted spams,
and (so far) it has all been junk mail (no real messages deleted out of 
4000 received). The same is true of other users (deletions are also logged).

Of course, it all depends on where you draw the line between "real spam"
that is deleted and "possible spam" that is not. I use the Robinson 
algorithm, which tends to give a fairly linear spamicity scale. With this 
measure my current settings are:

0 ..  ham .. 0.54 .. possible spam .. 0.65  ... deletable spam  1.0
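
The three zones on that scale can be sketched as a small function. This is
only an illustration: the function name is made up, and which zone a score
exactly on a boundary falls into is an assumption, since the settings above
do not say.

```python
def zone(score, ham_cutoff=0.54, spam_cutoff=0.65):
    """Map a spamicity score in [0, 1] to one of the three zones.

    The cutoffs are the settings quoted above. Treating a score equal to
    a cutoff as belonging to the higher zone is an assumption.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("spamicity must be between 0 and 1")
    if score < ham_cutoff:
        return "ham"
    if score < spam_cutoff:
        return "possible spam"  # downloaded, candidate for training
    return "deletable spam"     # deleted on the server


print(zone(0.10))  # ham
print(zone(0.60))  # possible spam
print(zone(0.90))  # deletable spam
```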

-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk