RDBMS

Tom Allison tom at tacocat.net
Fri Nov 24 22:39:06 CET 2006


Matthias Andree wrote:
>> What I've worked out on the tables is essentially three tables:
>> user (who the recipient is on the local system)
> 
> That might also be a passwd file or LDAP directory were it not that you
> already assume SQL, right?

The name of the user wouldn't be determined by the server (LDAP or passwd) -- 
it's not an authentication process here.

The name in this case is the local recipient (or a recipient your mail server 
accepts for delivery handling).  And it is that recipient that the data tree 
hinges on...

> 
>> user_tokens (many to many table associating each token to a known user.  It is 
>> here that the counts of good/bad instances would be stored for each
>> token)
> 
> Hm. How would you model the relation between these and...
> 
>> tokens (words seen...)
> 
> ...these? Is that actually more efficient than just stuffing one table
> or column per user?

First normal form would dictate that you have three tables, something like 
this:

users
	username (string)
	ham_msgs_seen (integer)
	spam_msgs_seen (integer)

tokens
	token (string)
	last_seen (date)

user_token
	username
	token
	ham_count
	spam_count

Where the count of times each word has been seen in ham/spam is stored in the 
user_token table, not the token table.
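A minimal sketch of that three-table layout as actual DDL, here run through SQLite for illustration (the column types, keys, and defaults are my assumptions -- the post above gives only a plain-text outline, and the real deployment would presumably be postgres):

```python
import sqlite3

# Sketch of the three tables described above; names come from the post,
# types and key constraints are assumptions.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (
    username       TEXT PRIMARY KEY,
    ham_msgs_seen  INTEGER DEFAULT 0,
    spam_msgs_seen INTEGER DEFAULT 0
);
CREATE TABLE tokens (
    token     TEXT PRIMARY KEY,
    last_seen DATE
);
CREATE TABLE user_token (
    username   TEXT REFERENCES users(username),
    token      TEXT REFERENCES tokens(token),
    ham_count  INTEGER DEFAULT 0,
    spam_count INTEGER DEFAULT 0,
    PRIMARY KEY (username, token)
);
""")

# The per-user ham/spam counts live in user_token, not in tokens.
db.execute("INSERT INTO users (username) VALUES ('tom')")
db.execute("INSERT INTO tokens VALUES ('viagra', DATE('now'))")
db.execute("INSERT INTO user_token VALUES ('tom', 'viagra', 0, 3)")
row = db.execute(
    "SELECT spam_count FROM user_token"
    " WHERE username = 'tom' AND token = 'viagra'"
).fetchone()
print(row[0])  # per-user spam count for this token
```

The composite primary key on (username, token) is what makes user_token a many-to-many join table between users and tokens.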

I've been playing with this schema a bit, using some basic parsing of email 
messages to de-MIME them and break them up into words, then loading all of 
that into these tables.
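That parsing step might look roughly like this -- a sketch only, not the code actually used; the tokenizer here is deliberately naive (whitespace split on text/plain parts) and the sample message is made up:

```python
import email
from email import policy

# Hypothetical sample message, just to have something to parse.
RAW = b"""\
From: alice@example.com
Subject: lunch
Content-Type: text/plain

Pizza tomorrow at noon?
"""

def tokenize(raw_bytes):
    # De-MIME: walk the message tree and keep only text/plain parts.
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    words = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # Naive whitespace split; refining this step is exactly
            # the "better job of refining the token process" below.
            words.extend(part.get_content().split())
    return words

print(tokenize(RAW))
```

Each word coming out of this would then get an upsert against tokens and user_token for the recipient in question.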

One of the advantages you can get with something like postgres is that these 
tables will be cached in memory; if they're sufficiently small (in database 
terms, something under a million rows...) they can remain entirely in memory 
and become a shared data space between the different accessing threads.


>> It would probably add some complexity to the process, but it might also be 
>> worthwhile.  It's probably a matter of speed of operation versus maintenance 
>> time/speed...
> 
> The query protocol would then ideally be adjustable, so that you can
> design the database schema all the way you want and experiment with it
> if you (or your DBA) can speak just SQL, but not C.

For grins I've gotten a lot of this done in terms of the SQL, but I'm not 
making any progress on the math.  And it's still in the wrong language...  :)

But for what it's worth.

I've added the email contents of 5 users on my database.
The total number of messages is 7826.

The breakdown of tokens by the number of users sharing them is:
  c | count
---+--------
  5 |   3658  (2%)
  4 |   2904  (2%)
  3 |   4522  (3%)
  2 |  10180  (6%)
  1 | 146409  (87%)
where 'c' is the number of users sharing a given token.
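A breakdown like that presumably comes from a grouping query along these lines (a sketch against the assumed user_token table, shown here with a tiny made-up data set rather than the real 167,673 tokens):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE user_token (username TEXT, token TEXT)")
db.executemany("INSERT INTO user_token VALUES (?, ?)", [
    ("a", "free"), ("b", "free"), ("c", "free"),  # token shared by 3 users
    ("a", "lunch"), ("b", "lunch"),               # shared by 2 users
    ("a", "xyzzy"),                               # unique to 1 user
])

# Inner query: for each token, how many distinct users hold it.
# Outer query: how many tokens fall into each sharing bucket 'c'.
rows = db.execute("""
    SELECT c, COUNT(*) FROM (
        SELECT COUNT(DISTINCT username) AS c
        FROM user_token GROUP BY token
    ) GROUP BY c ORDER BY c DESC
""").fetchall()
print(rows)  # [(3, 1), (2, 1), (1, 1)]
```

Scaled up, the outer counts are the second column of the table above.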

I suspect that with a better job of refining the token process, the percentage 
of shared tokens would increase.  While this job was running, most of the 
tokens were matching at ~20% instead of the final 13%.  I know the job run 
wasn't perfect, but I present this information as it is.

I do suspect that, at least for human-readable linguistics (the email body), 
the overlap will be significantly higher.  But how each person interprets a 
given token will make the difference.



More information about the Bogofilter mailing list