[cvs] Potential for error?

Tom Allison tallison at tacocat.net
Mon Oct 21 23:50:16 CEST 2002


David Relson wrote:
> At 04:41 PM 10/21/02, Graham Wilson wrote:
> 
>> On Mon, Oct 21, 2002 at 12:18:53PM -0400, David Relson wrote:
>> > >Would it be possible to roll-off records which haven't been seen in a
>> > >long time (one year) as a maintenance/utility?
>> >
>> > A date-last-modified field could be implemented (perhaps as a config
>> > file option).  If done, bogoutil ought to have a corresponding
>> > maintenance mode/operation.  Are you up for the task?
>> >
>> > Similarly, one could periodically discard any tokens whose good+spam
>> > count is 1.
>>
>> why would you do this?
> 
> 
> Graham,
> 
> If word list size became a major issue - too much disk space or searches 
> taking too long - it might be worth while to trim it.  Exactly how one 
> could/should/would do this is open for debate.
> 
> David
> 
> 

I would not remove tokens that have an even count/probability.  After all, 
isn't knowing which words to ignore part of the process?

But I think that a date-last-modified (or date-last-accessed) field would be 
useful for limiting the size of the files.
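
A rough sketch of what such a roll-off pass could look like, in Python rather
than bogofilter's own C.  Everything here is an assumption for illustration:
the in-memory word list, the (good, spam, last_seen) record layout, and the
function name are hypothetical and are not taken from bogofilter or bogoutil.

```python
from datetime import date, timedelta


def prune_stale_tokens(wordlist, today, max_age_days=365):
    """Return a copy of wordlist without tokens not seen for max_age_days.

    wordlist maps token -> (good_count, spam_count, last_seen), where
    last_seen is a datetime.date.  This layout is hypothetical.
    """
    cutoff = today - timedelta(days=max_age_days)
    # Keep only records whose last-seen date is on or after the cutoff.
    return {tok: rec for tok, rec in wordlist.items() if rec[2] >= cutoff}


# Made-up example data: one recently seen token, one stale token.
wordlist = {
    "linux": (1000, 3, date(2002, 10, 20)),
    "fnord": (1, 0, date(2001, 3, 1)),
}
pruned = prune_stale_tokens(wordlist, today=date(2002, 10, 21))
```

With a one-year cutoff, "linux" is kept and "fnord" is dropped.  Note that
this trims by age only; it does not look at the counts, so tokens with an
even count/probability survive as long as they are still being seen.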

A date field might also help limit another potential effect: a sluggish 
response to new spam jargon.  For example, if we started getting spammed with 
"Linux" as a keyword, it might take a very long time to move this token to a 
50/50 probability, given a base of thousands of good emails tipping the scale.
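
To put rough numbers on that, here is a small Python sketch.  It assumes a
naive estimate, spam / (good + spam), which is a simplification chosen for
illustration and not necessarily the calculation bogofilter performs; the
function names are hypothetical.

```python
def naive_spam_probability(good_count, spam_count):
    """Naive estimate: fraction of sightings that were in spam (assumption)."""
    total = good_count + spam_count
    if total == 0:
        return 0.5  # no evidence either way
    return spam_count / total


def spam_sightings_to_neutral(good_count, spam_count=0, target=0.5):
    """Count how many further spam sightings are needed to reach target.

    target must be below 1.0, otherwise the loop never terminates.
    """
    sightings = 0
    while naive_spam_probability(good_count, spam_count + sightings) < target:
        sightings += 1
    return sightings


full_history = spam_sightings_to_neutral(1000)  # 1000 good sightings on file
aged_history = spam_sightings_to_neutral(100)   # only 100 good sightings kept
```

Under this naive estimate, a token with 1000 good sightings needs 1000 spam
sightings to reach 50/50, while one with only 100 good sightings on file needs
100.  Whether aging out old records is the right way to shrink that base is
still untested.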

I haven't proven this potential effect; I only thought of it today.

If I get the chance I'll try some tests with it later on.
-- 
Humpty Dumpty was pushed.




