[cvs] Potential for error?
David Relson
relson at osagesoftware.com
Mon Oct 21 18:18:53 CEST 2002
Tom,
A better place for this message is the regular bogofilter discussion
list. <bogofilter-cvs at lists.sourceforge.net> is the wrong place for it.
At 11:55 AM 10/21/02, Allison, Thomas wrote:
>I was looking through the data that I've collected in the bogofilter file
>using bogoutil (thanks Carl for the tip) and looking through the data I've
>put together using some spam files, perl, and postgres that I played with
>over the weekend and realized a few things that we might consider for
>bogofilter.
>
>Bogofilter's special powers come from its adaptive method of determining
>spam. This is what makes it better than SpamAssassin: it has dynamic
>criteria. For a while, anyway.
>
>My question is: if the MSG_COUNT becomes very high, how does this affect
>bogofilter's ability to learn about spam within a small number of emails?
>From what I've read, bogofilter seems to be reasonably trained within ~1000
>emails.
>
>If you have a MSG_COUNT on the order of 10,000 or higher, how many emails
>will it take to learn about a new pattern in spam jargon?
>
>Can it become sluggish in its ability to adapt to jargon?
These are possible, though I'm not worried about it. There are a LOT of
different words in my live word lists. As of noon today, the spam list has
106,742 tokens (words) from 6,014 messages and the good list has 286,662
tokens from 29,065 messages. There are 337,985 _different_ tokens, which
indicates approx 51,000 tokens are unique to the spam list.
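The figure for tokens unique to the spam list follows from set arithmetic:
every distinct token appears in at least one of the two lists, so the count
unique to one list is the distinct total minus the size of the other list. A
quick check with the numbers above:

```python
spam_tokens = 106_742   # tokens in the spam list
good_tokens = 286_662   # tokens in the good list
distinct    = 337_985   # distinct tokens across both lists

# A token absent from the good list must be unique to the spam list,
# and vice versa.
unique_to_spam = distinct - good_tokens   # -> 51,323
unique_to_good = distinct - spam_tokens   # -> 231,243
```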
There are two main reasons for bogofilter mis-classifying a spam
message. One is that the message has too few spammy words. Feeding that
message back into the word lists helps cure that. With all the token
differences between the two lists, I think the effect of the corrective
action is soon felt. Of course, this can be tested ...
Also, there are different algorithms for calculating spamicity from the
message and the word lists. ESR implemented Paul Graham's algorithm and
that's what bogofilter is currently using. Gary Robinson analyzed Graham's
approach and described a different algorithm which Greg Louis has
implemented. Greg likes the Robinson algorithm so well that he's using
it. I've done some testing and it seems to do a better job of catching
spam. The code is in the cvs repository. It is activated by the '-r'
switch. As a caveat, it uses a different MAX_REPEATS value, so the
wordlists should be rebuilt for best results.
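For reference, the two combining rules look roughly like this. This is a
Python sketch of the published formulas (Graham's "A Plan for Spam" and
Robinson's follow-up essay), not bogofilter's actual C code; the per-token
spam probabilities are assumed to have been computed already from the
wordlist counts:

```python
from math import prod

def graham_spamicity(token_probs, n_extreme=15):
    """Graham-style combining: keep the tokens whose probability is
    furthest from the neutral 0.5, then apply the naive-Bayes rule."""
    extreme = sorted(token_probs, key=lambda p: abs(p - 0.5),
                     reverse=True)[:n_extreme]
    p = prod(extreme)                     # product of spam probabilities
    q = prod(1.0 - x for x in extreme)    # product of complements
    return p / (p + q)

def robinson_spamicity(token_probs):
    """Robinson-style combining: geometric means of the evidence for
    and against spam, folded into a single symmetric score."""
    n = len(token_probs)
    P = 1.0 - prod(1.0 - p for p in token_probs) ** (1.0 / n)
    Q = 1.0 - prod(token_probs) ** (1.0 / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0
```

One practical difference: Graham's rule looks only at the most extreme
tokens, so a few very spammy words dominate, while Robinson's rule weighs
every token, which is part of why it behaves differently on borderline
messages.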
>Also, I noticed that there were a lot of words in my lists that weren't
>words. Things like ab34af127 would be listed, but only once. Based on
>this, eventually the list files will bloat to infinity.
I've got lots of ip addresses, numbers, money amounts, version numbers,
(unreadable) korean, and other stuff.
>Would it be possible to roll-off records which haven't been seen in a long
>time (one year) as a maintenance/utility?
A date-last-modified field could be implemented (perhaps as a config file
option). If done, bogoutil ought to have a corresponding maintenance
mode/operation. Are you up for the task?
Similarly, one could periodically discard any tokens whose good+spam count
is 1.
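A minimal sketch of what such a maintenance pass might look like, assuming
a hypothetical last_seen field; bogofilter's real wordlists live in a
Berkeley DB and record no timestamps today, so this uses a plain dict as a
stand-in:

```python
import time

def prune(wordlist, max_age_days=365, min_count=2, now=None):
    """Drop tokens not seen within max_age_days, and tokens whose
    combined good+spam count falls below min_count.

    wordlist maps token -> {"good": int, "spam": int, "last_seen": epoch}.
    The "last_seen" field is the proposed addition, not an existing one.
    """
    now = now or time.time()
    cutoff = now - max_age_days * 86400
    return {
        tok: rec
        for tok, rec in wordlist.items()
        if rec["last_seen"] >= cutoff
        and rec["good"] + rec["spam"] >= min_count
    }
```

Run periodically (e.g. from cron via a bogoutil maintenance mode), this
would roll off both stale tokens and the one-shot junk like ab34af127.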