RDBMS

Matthias Andree matthias.andree at gmx.de
Fri Nov 24 08:59:34 CET 2006


Tom Allison <tom at tacocat.net> writes:

>>> I know SpamAssassin did this with their Bayes wordlists and found that initially 
>>> it was very painful, but as the number of inserts decays over time the 
>>> performance comes back.  But I am violently opposed to the hacked up methods of 
>>> SA.  They have large lists of exclusions in their bayesian filter objects that 
>>> now have to be maintained along with everything else.  Lots of work.
>> 
>> I think that ignore lists wouldn't be all that bad for bogofilter
>> either. For instance, if you have several inbound paths for messages (=
>> accounts), you may not want to "penalize" messages towards spam just
>> because they've taken a route with less efficient or nonexisitng
>> pre-filtering upstream, IOW, because they've taken a route that is
>> haunted by more spam. In such cases, headers specific to that route
>> might be eliminated so as not to contribute to the spamicity/bogosity
>> (the metric).
>> 
>
> Perhaps I haven't been paying much attention to the conversations about 
> exclusion lists, but I haven't heard anything in the past to say that exclusion 
> lists are going to help the success of a statistical filter.  In fact, I would 
> think that the statistical filter would be more effective without any exclusion 
> lists simply because if it didn't matter then it wouldn't show up as important. 
>   If it did matter, even if it was a routing path, then it would show
> up.

Well, I have several inbound paths that take 95% spam and 5%
ham. However, just because so much spam comes in that way, bogofilter
has _learned_ to bias messages coming that way as spam. That's its job,
and the manual lesson we can teach it with ignore/exclusion lists is
"disregard the path information, it means nothing". After all, we want
to score content (although that is indeed getting harder with all the
garbage one-GIF, OCR-fending kind of spam that carries hammish tokens
in garbled order).
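
To make that concrete, here is a minimal sketch of what such an ignore
list could look like in an SQL-backed wordlist. Everything below is
hypothetical; for simplicity it assumes one flat table
tokens(token, good_count, spam_count), not bogofilter's actual storage.

-- Hypothetical: route-specific header tokens the scorer should disregard.
CREATE TABLE ignored_tokens (
    token TEXT PRIMARY KEY            -- e.g. 'rcvd:mx2.example.net'
);

-- Scoring lookup that leaves ignored tokens out entirely, so they
-- contribute nothing to the spamicity/bogosity:
SELECT t.token, t.good_count, t.spam_count
  FROM tokens t
 WHERE t.token IN ('money', 'rcvd:mx2.example.net', 'meeting')
   AND t.token NOT IN (SELECT token FROM ignored_tokens);

The counts stay in the database, so the ignore list can be edited later
without retraining anything.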

> What I've worked out on the tables is essentially three tables:
> user (who the recipient is on the local system)

That might just as well be a passwd file or an LDAP directory, except
that you already assume SQL, right?

> user_tokens (many to many table associating each token to a known user.  It is 
> here that the counts of good/bad instances would be stored for each
> token)

Hm. How would you model the relation between these and...

> tokens (words seen...)

...these? Is that actually more efficient than simply keeping one
table, or one column, per user?
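
For the sake of discussion, a minimal sketch of the three-table layout
as I read your description; every name below is made up for
illustration:

-- users: who the recipient is on the local system
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    login   TEXT NOT NULL UNIQUE
);

-- tokens: words seen, shared across all users
CREATE TABLE tokens (
    token_id INTEGER PRIMARY KEY,
    token    TEXT NOT NULL UNIQUE
);

-- user_tokens: the many-to-many link carrying the per-user counts
CREATE TABLE user_tokens (
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    token_id   INTEGER NOT NULL REFERENCES tokens(token_id),
    good_count INTEGER NOT NULL DEFAULT 0,
    spam_count INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id, token_id)
);

-- Every per-user lookup then pays for a join:
SELECT t.token, ut.good_count, ut.spam_count
  FROM user_tokens ut
  JOIN tokens t ON t.token_id = ut.token_id
 WHERE ut.user_id = 42
   AND t.token IN ('money', 'meeting');

The shared tokens table saves storing each token string once per user,
but it adds an indirection to every lookup and every insert, which is
exactly the trade-off I am asking about.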

> It would probably add some complexity to the process, but it might also be 
> worthwhile.  It's probably a matter of speed of operation versus maintenance 
> time/speed...

The query protocol would then ideally be adjustable, so that you can
design the database schema any way you want and experiment with it even
if you (or your DBA) speak just SQL, but not C.
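
For instance, if the lookup statement itself were read from site
configuration (hypothetical, bogofilter has no such hook today), the
DBA could swap the schema above for one flat table per user without
touching any C code:

-- Alternative schema, plugged in purely by changing the configured query:
SELECT good_count, spam_count
  FROM wordlist_tom                 -- one table per user, no join
 WHERE token = :token;              -- :token bound by the filter at runtime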

Best regards,

-- 
Matthias Andree


