Cats and dogs

Greg Louis glouis at dynamicro.on.ca
Fri Jul 4 22:48:38 CEST 2003


On 20030704 (Fri) at 1318:28 -0700, Elijah Saxon wrote:

> i think that a lot of the tests which people have run as to whether
> training to extinction is an acceptable trade off in terms of size and
> accuracy have used extreme data sets. i guess i am trying 'start with a
> corpus, then train on error to extinction and see what happens'.

s/extreme/tiny/g

Training to extinction doesn't make enough difference to be worthwhile
once you have your training db up to 20,000 spams and 20,000 nonspams.
In the very early stages (hundreds instead of thousands), it may give
an illusion of better performance.  I got suckered into using it for a
while.
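
For readers unfamiliar with the term, here is a minimal sketch of what
"train on error to extinction" means. ToyFilter is a hypothetical
token-count scorer invented for this illustration; it is not
bogofilter's algorithm, and the names, the 0.5 cutoff and the smoothing
are all assumptions.

```python
from collections import Counter


class ToyFilter:
    """Toy token-count scorer for illustration; NOT bogofilter's algorithm."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Count each distinct token once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Mean per-token spamminess with add-one smoothing; 0.5 = no evidence.
        toks = set(tokens)
        if not toks:
            return 0.5
        probs = [(self.spam[t] + 1) / (self.spam[t] + self.ham[t] + 2)
                 for t in toks]
        return sum(probs) / len(probs)

    def is_spam(self, tokens, cutoff=0.5):
        return self.score(tokens) > cutoff


def train_to_extinction(filt, corpus, max_passes=50):
    """Repeat train-on-error passes until one pass makes no mistakes.

    corpus is a list of (tokens, is_spam) pairs. Returns the number of
    passes made, or max_passes if the errors never die out.
    """
    for passes in range(1, max_passes + 1):
        errors = 0
        for tokens, label in corpus:
            if filt.is_spam(tokens) != label:
                filt.train(tokens, label)  # train only on misclassified mail
                errors += 1
        if errors == 0:
            return passes
    return max_passes
```

Normal training would instead call train() once for every message in
the corpus, whether or not it was classified correctly.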

> initially, i have modest goals: 80% spam caught for thousands of users
> with *very* few false positives.

There's a commercial product that claims 15% spam delivered but only
one fp in a million nonspams.  That sort of balance is probably
achievable with bogofilter.  I wouldn't know; I don't get enough email
to measure one fp in a million accurately.  But you can certainly edge
your spam cutoff up till your fp rate seems satisfactory.
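
Edging the cutoff up can be sketched as below, assuming you have saved
the scores your filter gave to known nonspam and known spam. The
function name, the 0.01 step and the rule "spam if score >= cutoff" are
assumptions made for this illustration, not bogofilter settings.

```python
def edge_cutoff_up(ham_scores, spam_scores, max_fp_rate,
                   start=0.50, step=0.01):
    """Raise the cutoff from start until the fp rate is acceptable.

    Returns (cutoff, fp_rate, spam_caught_rate), or None if no cutoff
    up to 1.0 meets max_fp_rate. Assumes score >= cutoff means spam.
    """
    steps = round((1.0 - start) / step)
    for i in range(steps + 1):
        cutoff = round(start + i * step, 6)  # avoid float drift
        fp_rate = sum(s >= cutoff for s in ham_scores) / len(ham_scores)
        if fp_rate <= max_fp_rate:
            caught = sum(s >= cutoff for s in spam_scores) / len(spam_scores)
            return cutoff, fp_rate, caught
    return None
```

An observed fp rate of zero only bounds the true rate loosely. By the
usual "rule of three", zero fps in n nonspams gives a 95% upper bound of
roughly 3/n, so backing up "one in a million" would take on the order of
three million nonspams.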

> back to cats and dogs: training to extinction might over train on great
> danes, but when it encounters that chihuahua it will first call it a cat
> and then really study that chihuahua until it is convinced that it is not
> a cat. or something like that. over time, we will see.
> 
arf a miaow is better than none -- or am I just barking?

Training normally will probably get you pugs for free, or nearly so,
once you've learned chihuahuas and Great Danes; training to extinction
probably won't.

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



