train on error
Peter Bishop
pgb at adelard.com
Fri Sep 5 10:53:11 CEST 2003
Unknown wrote:
> Now I ask: the train-on-error method works well? Or do I need to receive
> hundreds of thousands of trillions of billions of emails to it begin to
> be accurate? :)
>
> What do you think of this method, and what method you currently use, and
> is satisfied with it's accuracy?
>
I use "train on unsure".
All emails above a certain spamicity are *deleted* on my mail server
machine. Any remaining spams are downloaded on to my client machine.
If these really are spam, they are posted as attachments to a special
account on the mail server that adds the email to the spamlist database.
(I use a single shared bogofilter database for all user accounts on the
mail server box).
I also update the spamlist database from a "honeytrap" account that is a
source of pure spam. Here again I delete the high spamicity messages and
only use the borderline spam for updating the shared database.
Why do I do this? Well, there are disk quota restrictions on the mail server
machine, so I don't want the database to get too big, and I want the
database additions to make a difference. The idea is that by focusing on
the borderline spam cases, I improve discrimination where it counts. It
might drop the scores of "high spam" emails a bit - but not enough to
matter.
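
As a rough illustration of the selection rule described above, here is a
minimal Python sketch. It is not part of bogofilter; the function names
and the (message id, spamicity) input format are hypothetical, and it
assumes the delete cutoff of 0.65 quoted later in this message, treated
as inclusive.

```python
# Hypothetical helpers, not part of bogofilter. They only decide which
# confirmed spams are worth adding to the shared spamlist database.

DELETE_CUTOFF = 0.65  # assumed: scores at or above this are deleted on the server


def should_train_on_spam(spamicity, delete_cutoff=DELETE_CUTOFF):
    """Return True if a confirmed spam scored below the delete cutoff.

    High-scoring spam is already caught, so adding it would grow the
    database without improving discrimination near the boundary.
    """
    if not 0.0 <= spamicity <= 1.0:
        raise ValueError("spamicity must be between 0 and 1")
    return spamicity < delete_cutoff


def select_for_training(scored_spams, delete_cutoff=DELETE_CUTOFF):
    """Keep only borderline spams from an iterable of (message_id, spamicity)."""
    return [
        msg_id
        for msg_id, score in scored_spams
        if should_train_on_spam(score, delete_cutoff)
    ]
```

For example, given honeytrap messages scored 0.99, 0.60 and 0.30, only
the last two would be selected for the database update.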
I have been doing this for 6 months. The stats are:
Database size:
    spams: 4000
    hams:  1220
Update rate:
    4 spams/day from honeytrap
    1 spam/day from manual posting of spam to the special account
Performance (April):
    Spam deleted:    90%
    Spam downloaded: 10%
    False negatives: 1%
    False positives: 0.18%
Performance (now):
    Spam deleted:    97%
    Spam downloaded: 3%
    False negatives: 0.15%
    False positives: 0.15%
The results might also be affected by an update to bogofilter (0.9.1.2 to
13.6.2), but the filter is running in much the same mode (using
case-insensitive tokens).
Just in case you are wondering, I do maintain a log of deleted spams
and (so far) it has all been junk mail (no real messages deleted out of
4000 received). The same is true of other users (their deletions are
also logged).
Of course, it all depends on where you draw the line between "real spam"
that is deleted and "possible spam" that is not. I use the Robinson
algorithm, which tends to give a fairly linear spamicity scale. With this
measure my current settings are:
0 .. ham .. 0.54 .. possible spam .. 0.65 .. deletable spam .. 1.0
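
To make the three bands concrete, here is a small Python sketch of the
mapping from score to action. The function and constant names are
hypothetical, and the handling of scores that fall exactly on 0.54 or
0.65 is an assumption (each boundary value is placed in the higher band).

```python
# Hypothetical sketch of the three-band scale; not bogofilter code.

HAM_CUTOFF = 0.54   # below this: treated as ham
SPAM_CUTOFF = 0.65  # at or above this: deleted on the server (assumed inclusive)


def band(spamicity, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spamicity score in [0, 1] to one of the three bands."""
    if not 0.0 <= spamicity <= 1.0:
        raise ValueError("spamicity must be between 0 and 1")
    if spamicity < ham_cutoff:
        return "ham"
    if spamicity < spam_cutoff:
        return "possible spam"
    return "deletable spam"
```

Only messages in the middle band reach the client flagged as possible
spam, and those are the ones worth feeding back into the database.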
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk