Cats and dogs

Elijah Saxon elijah at riseup.net
Fri Jul 4 22:18:28 CEST 2003


On Fri, 4 Jul 2003, Peter Bishop wrote:

> It occurs to me that I could use the train-on-error idea in a different
> way.
>
> I currently use a common database (for multiple accounts)
> and I use a "spamtrap" account to update the spamlist database.
>
> The spamlist is getting pretty big now (>2000 messages), so
> rather than cutting off the spam feed completely, I could:
>
> 1) first classify the spamtrap email using bogofilter
> 2) if the test results in No or Unsure, add the message
>     to the spamlist database.
>
> I think this would give most of the benefit of the full spam feed
> without expanding the database too much.

hey, that is exactly what I am doing now. Once I have better statistics, I
will report on how it goes. One worry is that the mail the spamtrap gets
is not necessarily the same kind of spam that users get. I have also
started 'training to extinction' because I like to live dangerously and
it sounds edgy.
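
A rough sketch of the two quoted steps (classify first, register only on
No or Unsure), in case it helps. Everything here is a stand-in: the
classify and register_spam callbacks are hypothetical hooks for whatever
wraps the real filter, and the toy word-set "filter" is only there so the
example runs on its own. It is not how bogofilter scores mail.

```python
from typing import Callable, Iterable, List


def spamtrap_train_on_error(
    messages: Iterable[str],
    classify: Callable[[str], str],
    register_spam: Callable[[str], None],
) -> List[str]:
    """Register only the spamtrap messages the filter did not call spam."""
    trained = []
    for msg in messages:
        verdict = classify(msg)          # step 1: classify first
        if verdict in ("No", "Unsure"):  # step 2: train only on misses
            register_spam(msg)
            trained.append(msg)
    return trained


# Toy stand-ins (assumptions, not a real filter): a message counts as
# spam if it shares any word with the set of words seen in spam so far.
known_spam_words = {"viagra"}


def toy_classify(msg: str) -> str:
    return "Yes" if known_spam_words & set(msg.lower().split()) else "No"


def toy_register_spam(msg: str) -> None:
    known_spam_words.update(msg.lower().split())


trained = spamtrap_train_on_error(
    ["cheap viagra", "refinance your mortgage", "mortgage rates"],
    toy_classify,
    toy_register_spam,
)
print(trained)
```

Only the message the toy filter missed gets registered; the other two are
already caught and add nothing to the database.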

i think a lot of the tests people have run on whether training to
extinction is an acceptable trade-off between database size and accuracy
have used extreme data sets. what i am trying is: start with a corpus,
then train on error to extinction, and see what happens.
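
For anyone unsure what 'train on error to extinction' means in practice,
here is a minimal sketch: keep making train-on-error passes over the
corpus until one full pass produces no misclassifications. The ToyFilter
class, its raw word-count scoring, and the max_passes cap are all
assumptions made up for illustration; bogofilter's actual scoring is
different.

```python
from collections import Counter
from typing import List, Tuple


class ToyFilter:
    """Hypothetical word-count filter, used only to make the loop runnable."""

    def __init__(self) -> None:
        self.spam: Counter = Counter()
        self.ham: Counter = Counter()

    def is_spam(self, msg: str) -> bool:
        words = msg.lower().split()
        spam_score = sum(self.spam[w] for w in words)
        ham_score = sum(self.ham[w] for w in words)
        return spam_score > ham_score  # ties count as ham

    def train(self, msg: str, spam: bool) -> None:
        (self.spam if spam else self.ham).update(msg.lower().split())


def train_to_extinction(
    filt: ToyFilter,
    corpus: List[Tuple[str, bool]],
    max_passes: int = 20,
) -> int:
    """Repeat train-on-error passes until a pass makes no errors.

    Returns the number of passes run. max_passes is a safety cap, since
    nothing guarantees the errors ever die out on arbitrary data.
    """
    for passes in range(1, max_passes + 1):
        errors = 0
        for msg, spam in corpus:
            if filt.is_spam(msg) != spam:  # train only on mistakes
                filt.train(msg, spam)
                errors += 1
        if errors == 0:
            return passes
    return max_passes


corpus = [
    ("buy cheap pills", True),
    ("lunch at noon", False),
    ("cheap lunch deal", False),
    ("buy pills now", True),
]
filt = ToyFilter()
passes = train_to_extinction(filt, corpus)
print(passes, sum(filt.spam.values()) + sum(filt.ham.values()))
```

On this tiny corpus only two of the four messages ever get trained, which
is the size benefit; whether accuracy holds up on real mail is exactly the
open question.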

initially, i have modest goals: 80% of spam caught for thousands of users
with *very* few false positives. i don't care about 99% accuracy: if users
want that, they can install bogofilter themselves. so far, i am doing a
lot better than 80%, but until i have more data it is hard to tell.

about training to extinction: i like to think of it as affirmative action
for minority spam--it seeks to create a diverse word list.

back to cats and dogs: training to extinction might overtrain on Great
Danes, but when it encounters that chihuahua it will first call it a cat
and then really study that chihuahua until it is convinced that it is not
a cat. or something like that. over time, we will see.

-elijah