RDMS

Tom Allison tom at tacocat.net
Fri Nov 24 04:12:54 CET 2006


>> I know SpamAssassin did this with their Bayes wordlists and found that initially 
>> it was very painful, but as the number of inserts decays over time the 
>> performance comes back.  But I am violently opposed to the hacked up methods of 
>> SA.  They have large lists of exclusions in their bayesian filter objects that 
>> now have to be maintained along with everything else.  Lots of work.
> 
> I think that ignore lists wouldn't be all that bad for bogofilter
> either. For instance, if you have several inbound paths for messages (=
> accounts), you may not want to "penalize" messages towards spam just
> because they've taken a route with less efficient or nonexistent
> pre-filtering upstream, IOW, because they've taken a route that is
> haunted by more spam. In such cases, headers specific to that route
> might be eliminated so as not to contribute to the spamicity/bogosity
> (the metric).
> 

Perhaps I haven't been paying close attention to the conversations about
exclusion lists, but I haven't seen anything to suggest that exclusion lists
improve the accuracy of a statistical filter.  In fact, I would expect a
statistical filter to be more effective without them: if a token didn't
matter, its counts would not mark it as significant, and if it did matter,
even if it was only a routing path, the counts would show that.
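To illustrate that argument, here is a minimal sketch of a Robinson-style smoothed token score. The function names, the parameters s, x and min_dev, and their values are illustrative assumptions, not bogofilter's actual code or configuration. The point is only that a token seen equally often in good and bad mail scores near 0.5 and can be ignored without any exclusion list.

```python
def token_score(good, bad, n_good_msgs, n_bad_msgs, s=1.0, x=0.5):
    """Smoothed estimate that a message containing the token is spam.

    good/bad are the number of good/bad messages containing the token;
    s (prior strength) and x (prior score) are assumed example values.
    """
    g = good / n_good_msgs if n_good_msgs else 0.0
    b = bad / n_bad_msgs if n_bad_msgs else 0.0
    n = good + bad
    # Raw estimate from relative frequencies; fall back to the prior
    # when the token has never been seen.
    p = b / (g + b) if (g + b) > 0 else x
    # Blend the raw estimate with the prior, weighted by evidence n.
    return (s * x + n * p) / (s + n)


def is_informative(score, min_dev=0.1):
    """True if the score is far enough from neutral (0.5) to count."""
    return abs(score - 0.5) >= min_dev


# A routing header seen equally often in good and bad mail is neutral.
neutral = token_score(good=50, bad=50, n_good_msgs=100, n_bad_msgs=100)
# A routing header seen mostly in bad mail carries real signal.
spammy = token_score(good=5, bad=60, n_good_msgs=100, n_bad_msgs=100)
```

Under these assumptions the neutral header drops out of the calculation on its own, while the header from the spam-heavy route contributes because the counts say it should.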

What I've worked out is essentially three tables:

  user         who the recipient is on the local system
  user_tokens  many-to-many table associating each token with a known user;
               the counts of good/bad instances for each token are stored here
  tokens       words seen...

There might be some optimizations that can be done, but this is where I would
start for a first set of normalized tables.
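The three-table layout described above could be sketched as follows, using Python's sqlite3 module purely for illustration. The column names, the plural table name users (chosen because user is a reserved word in some databases), and the helpers open_db, register and counts are all assumptions, as is counting each word at most once per message.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE tokens (
    id   INTEGER PRIMARY KEY,
    word TEXT NOT NULL UNIQUE
);
CREATE TABLE user_tokens (
    user_id  INTEGER NOT NULL REFERENCES users(id),
    token_id INTEGER NOT NULL REFERENCES tokens(id),
    good     INTEGER NOT NULL DEFAULT 0,
    bad      INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id, token_id)
);
"""


def open_db(path=":memory:"):
    """Open a database and create the three tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def register(conn, user, words, is_spam):
    """Record one message for a user, bumping good or bad per word."""
    col = "bad" if is_spam else "good"  # fixed choice, never user input
    cur = conn.cursor()
    cur.execute("INSERT OR IGNORE INTO users(name) VALUES (?)", (user,))
    uid = cur.execute(
        "SELECT id FROM users WHERE name = ?", (user,)).fetchone()[0]
    for word in set(words):  # assumption: count a word once per message
        cur.execute("INSERT OR IGNORE INTO tokens(word) VALUES (?)", (word,))
        tid = cur.execute(
            "SELECT id FROM tokens WHERE word = ?", (word,)).fetchone()[0]
        cur.execute(
            "INSERT OR IGNORE INTO user_tokens(user_id, token_id) "
            "VALUES (?, ?)", (uid, tid))
        cur.execute(
            f"UPDATE user_tokens SET {col} = {col} + 1 "
            "WHERE user_id = ? AND token_id = ?", (uid, tid))
    conn.commit()


def counts(conn, user, word):
    """Return (good, bad) for a user's token, or (0, 0) if unseen."""
    row = conn.execute(
        "SELECT ut.good, ut.bad FROM user_tokens ut "
        "JOIN users u ON u.id = ut.user_id "
        "JOIN tokens t ON t.id = ut.token_id "
        "WHERE u.name = ? AND t.word = ?", (user, word)).fetchone()
    return row if row else (0, 0)
```

Each word is stored once in tokens no matter how many users have seen it, and the per-user counts live only in user_tokens. The cost is an extra lookup or join on every token, which is the complexity trade-off in question.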

It would probably add some complexity to the process, but it might also be 
worthwhile.  It's probably a matter of speed of operation versus maintenance 
time/speed...


