[cvs] Potential for error?
David Relson
relson at osagesoftware.com
Mon Oct 21 18:18:53 CEST 2002
Tom,
A better place for this message is the regular bogofilter discussion
list. <bogofilter-cvs at lists.sourceforge.net> is the wrong place for it.
At 11:55 AM 10/21/02, Allison, Thomas wrote:
>I was looking through the data that I've collected in the bogofilter file
>using bogoutil (thanks Carl for the tip) and looking through the data I've
>put together using some spam files, perl, and postgres that I played with
>over the weekend and realized a few things that we might consider for
>bogofilter.
>
>Bogofilter's special powers come from its adaptive method of determining
>spam. This is what makes it better than SpamAssassin: it has dynamic
>criteria. For a while, anyway.
>
>My question is: if the MSG_COUNT becomes very high, how does this affect
>bogofilter's ability to learn about spam within a small number of emails?
>From what I've read, bogofilter seems to be reasonably trained within ~1000
>emails.
>
>If you have a MSG_COUNT on the order of 10,000 or higher, how many emails
>will it take to learn about a new pattern in spam jargon?
>
>Can it become sluggish in its ability to adapt to jargon?
These are possible, though I'm not worried about it. There are a LOT of
different words in my live word lists. As of noon today, the spam list has
106,742 tokens (words) from 6,014 messages and the good list has 286,662
tokens from 29,065 messages. There are 337,985 _different_ tokens, which
indicates approx 51,000 tokens are unique to the spam list.
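The figure for tokens unique to the spam list follows from set arithmetic:
every distinct token appears in at least one of the two lists, so the count
unique to one list is the distinct total minus the size of the other list. A
quick check with the numbers above:

```python
spam_tokens = 106_742   # tokens in the spam list
good_tokens = 286_662   # tokens in the good list
distinct    = 337_985   # distinct tokens across both lists

# A token absent from the good list must be unique to the spam list,
# and vice versa.
unique_to_spam = distinct - good_tokens   # -> 51,323
unique_to_good = distinct - spam_tokens   # -> 231,243
```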
There are two main reasons for bogofilter mis-classifying a spam
message. One is that the message has too few spammy words. Feeding that
message back into the word lists helps cure that. With all the token
differences between the two lists, I think the effect of the corrective
action is soon felt. Of course, this can be tested ...
Also, there are different algorithms for calculating spamicity from the
message and the word lists. ESR implemented Paul Graham's algorithm and
that's what bogofilter is currently using. Gary Robinson analyzed Graham's
approach and described a different algorithm which Greg Louis has
implemented. Greg likes the Robinson algorithm so well that he's using
it. I've done some testing and it seems to do a better job of catching
spam. The code is in the cvs repository. It is activated by the '-r'
switch. As a caveat, it uses a different MAX_REPEATS value, so the
wordlists should be rebuilt for best results.
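For reference, the two combining rules look roughly like this. This is a
Python sketch of the published formulas (Graham's "A Plan for Spam" and
Robinson's follow-up essay), not bogofilter's actual C code; the per-token
spam probabilities are assumed to have been computed already from the
wordlist counts:

```python
from math import prod

def graham_spamicity(token_probs, n_extreme=15):
    """Graham-style combining: keep the tokens whose probability is
    furthest from the neutral 0.5, then apply the naive-Bayes rule."""
    extreme = sorted(token_probs, key=lambda p: abs(p - 0.5),
                     reverse=True)[:n_extreme]
    p = prod(extreme)                     # product of spam probabilities
    q = prod(1.0 - x for x in extreme)    # product of complements
    return p / (p + q)

def robinson_spamicity(token_probs):
    """Robinson-style combining: geometric means of the evidence for
    and against spam, folded into a single symmetric score."""
    n = len(token_probs)
    P = 1.0 - prod(1.0 - p for p in token_probs) ** (1.0 / n)
    Q = 1.0 - prod(token_probs) ** (1.0 / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0
```

One practical difference: Graham's rule looks only at the most extreme
tokens, so a few very spammy words dominate, while Robinson's rule weighs
every token, which is part of why it behaves differently on borderline
messages.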
>Also, I noticed that there were a lot of words in my lists that weren't
>words. Things like ab34af127 would be listed, but only once. Based on
>this, eventually the list files will bloat to infinity.
I've got lots of ip addresses, numbers, money amounts, version numbers,
(unreadable) korean, and other stuff.
>Would it be possible to roll-off records which haven't been seen in a long
>time (one year) as a maintenance/utility?
A date-last-modified field could be implemented (perhaps as a config file
option). If done, bogoutil ought to have a corresponding maintenance
mode/operation. Are you up for the task?
Similarly, one could periodically discard any tokens whose good+spam count
is 1.
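A minimal sketch of what such a maintenance pass might look like, assuming
a hypothetical last_seen field; bogofilter's real wordlists live in a
Berkeley DB and record no timestamps today, so this uses a plain dict as a
stand-in:

```python
import time

def prune(wordlist, max_age_days=365, min_count=2, now=None):
    """Drop tokens not seen within max_age_days, and tokens whose
    combined good+spam count falls below min_count.

    wordlist maps token -> {"good": int, "spam": int, "last_seen": epoch}.
    The "last_seen" field is the proposed addition, not an existing one.
    """
    now = now or time.time()
    cutoff = now - max_age_days * 86400
    return {
        tok: rec
        for tok, rec in wordlist.items()
        if rec["last_seen"] >= cutoff
        and rec["good"] + rec["spam"] >= min_count
    }
```

Run periodically (e.g. from cron via a bogoutil maintenance mode), this
would roll off both stale tokens and the one-shot junk like ab34af127.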