[cvs] Potential for error?
Tom Allison
tallison at tacocat.net
Mon Oct 21 23:50:16 CEST 2002
David Relson wrote:
> At 04:41 PM 10/21/02, Graham Wilson wrote:
>
>> On Mon, Oct 21, 2002 at 12:18:53PM -0400, David Relson wrote:
>> > >Would it be possible to roll-off records which haven't been seen in a
>> > >long time (one year) as a maintenance/utility?
>> >
>> > A date-last-modified field could be implemented (perhaps as a config
>> > file option). If done, bogoutil ought to have a corresponding
>> > maintenance mode/operation. Are you up for the task?
>> >
>> > Similarly, one could periodically discard any tokens whose good+spam
>> > count is 1.
>>
>> why would you do this?
>
>
> Graham,
>
> If word list size became a major issue - too much disk space or searches
> taking too long - it might be worth while to trim it. Exactly how one
> could/should/would do this is open for debate.
>
> David
>
>
I would not remove tokens that have an even count/probability. After all,
isn't knowing which words to ignore part of the process?
But I think that a date-last-modified (or date-last-accessed) field would be
useful for limiting the size of the files.
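A rough sketch of what such a roll-off could look like, in Python rather
than anything bogoutil actually does. The in-memory dict, the "last_seen"
field, and the prune_stale() helper are all hypothetical, made up for
illustration; the real word list has no such field today.

```python
from datetime import date, timedelta

def prune_stale(wordlist, today, max_age_days=365):
    """Return a copy of wordlist without tokens last seen before the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return {tok: rec for tok, rec in wordlist.items()
            if rec["last_seen"] >= cutoff}

# Hypothetical word list: token -> counts plus a last-seen date.
wordlist = {
    "linux": {"good": 1000, "spam": 3, "last_seen": date(2002, 10, 20)},
    "zzyzx": {"good": 1, "spam": 0, "last_seen": date(2001, 1, 5)},
}

kept = prune_stale(wordlist, today=date(2002, 10, 21))
print(sorted(kept))
```

Only the age of a token decides whether it stays, so an evenly balanced
token that is still being seen would be kept.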
It might also help avoid a sluggish response to new spam jargon. For
example, if we started getting spammed with "Linux" as a key word, it could
take a very long time to bring this token to a 50/50 probability, given a
base of thousands of good emails tipping the scale.
I haven't proven this potential effect; I only thought of it today.
If I get the chance, I'll try some tests with it later on.
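To put a number on the Linux example, here is a back-of-the-envelope sketch
in Python. It assumes a token's score is just the plain ratio
spam / (good + spam), which is a simplification and not bogofilter's actual
calculation; spam_ratio() and spams_needed_for_even() are hypothetical
helpers.

```python
def spam_ratio(good, spam):
    """Simplified token score: fraction of sightings that were spam."""
    total = good + spam
    return spam / total if total else 0.5

def spams_needed_for_even(good, spam=0):
    """Count how many more spam sightings it takes to reach a 50/50 ratio."""
    n = 0
    while spam_ratio(good, spam + n) < 0.5:
        n += 1
    return n

print(spams_needed_for_even(1000))  # token seen in 1000 good emails
print(spams_needed_for_even(50))    # same token with a much smaller base
```

Under that assumption the token needs as many spam sightings as it has good
ones before it reads as neutral, so a smaller (or aged-out) base of good
counts would respond faster.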
--
Humpty Dumpty was pushed.