RDMS

Tom Allison tom at tacocat.net
Sat Nov 25 19:10:51 CET 2006


Tom Allison wrote:
> For grins I've gotten a lot of this done in terms of the SQL but I'm not making 
> any progress on the math.  And it's still in the wrong language...  :)
> 
> But for what it's worth.
> 
> I've added the email contents of 5 users on my database.
> The total number of messages is 7826.
> 
> The breakdown of tokens by the number of users sharing them is:
>   c | count
> ----+--------
>   5 |   3658  (2%)
>   4 |   2904  (2%)
>   3 |   4522  (3%)
>   2 |  10180  (6%)
>   1 | 146409  (87%)
> where 'c' is the number of users sharing a given token.
> 
> I suspect that a better job of refining the tokenizing process would increase
> the percentage of shared tokens.  While this job was running, about 20% of the
> tokens were shared, rather than the final 13%.  I know the job run wasn't
> perfect, but I present this information as it is.
> 
> I do suspect that at least in terms of human readable linguistics (email body) 
> the overlap will be significantly higher.  But how each person interprets that 
> token will make the difference.
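
For what it's worth, a breakdown like the one quoted above could be computed
roughly as follows.  This is a minimal Python sketch, not the SQL that was
actually run; the function name, the dict-of-lists input, and the sample data
are all made up for illustration.

```python
from collections import Counter

def sharing_breakdown(user_tokens):
    """Map c (number of users sharing a token) to (token count, percent).

    user_tokens is a hypothetical {user: [token, ...]} mapping.
    """
    users_per_token = Counter()
    for tokens in user_tokens.values():
        # set() so a token counts once per user, however often it appears
        for tok in set(tokens):
            users_per_token[tok] += 1

    # Histogram: how many distinct tokens are shared by exactly c users
    hist = Counter(users_per_token.values())
    total = sum(hist.values())
    return {c: (n, round(100 * n / total))
            for c, n in sorted(hist.items(), reverse=True)}

# Made-up sample data, only to show the shape of the result
sample = {
    "alice": ["free", "hello", "meeting"],
    "bob": ["free", "hello"],
    "carol": ["free", "viagra"],
}
breakdown = sharing_breakdown(sample)
```

The percentages here are shares of distinct tokens, which is how the quoted
figures add up to 100%.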

I was bored, so I wrote a pgsql rendition of bogofilter in Perl.
It's far from complete, but the scoring works, which for me is the most 
interesting part.  My goodness, the mathematics is complicated in bogofilter, 
much more so than in SA (SpamAssassin) or some of the others.  I think this is 
in part why bogofilter (for me) has always seemed to be an order of magnitude 
better.
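
For readers who haven't looked at the math: the scoring is usually described as
Robinson's per-token probability f(w) combined with Fisher's inverse chi-square
method.  The Python sketch below illustrates that technique as published by
Gary Robinson.  It is not bogofilter's source, nor the Perl code mentioned
here.  The function names are made up, and the s=1.0 and x=0.5 defaults are
illustrative assumptions rather than bogofilter's tuned values.

```python
import math

def chi2q(x2, dof):
    """Upper-tail chi-square probability; closed form valid for even dof only."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)          # may underflow to 0.0 for very long messages
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def token_prob(spam_count, ham_count, n_spam, n_ham, s=1.0, x=0.5):
    """Robinson's f(w): observed spamminess pulled toward the prior x."""
    spam_freq = spam_count / n_spam if n_spam else 0.0
    ham_freq = ham_count / n_ham if n_ham else 0.0
    if spam_freq + ham_freq == 0:
        return x                 # token never seen: fall back to the prior
    p = spam_freq / (spam_freq + ham_freq)
    n = spam_count + ham_count
    return (s * x + n * p) / (s + n)

def fisher_score(probs):
    """Combine token probabilities; near 1.0 is spammy, near 0.0 is hammy."""
    if not probs:
        return 0.5
    eps = 1e-12                  # keep log() away from 0 (assumed clamp)
    probs = [min(max(f, eps), 1.0 - eps) for f in probs]
    n = len(probs)
    h = chi2q(-2.0 * sum(math.log(f) for f in probs), 2 * n)
    s = chi2q(-2.0 * sum(math.log(1.0 - f) for f in probs), 2 * n)
    return (1.0 + h - s) / 2.0
```

A message whose tokens all sit at 0.5 scores exactly 0.5, and strongly spammy
tokens push the score toward 1.0, so the uncertain middle band falls out of
the formula on its own.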

It runs at a rate of 0.01 to 1.0 seconds per message.  I really don't know if 
the variation comes from pgsql or from my selection of MIME decoding tools 
(MIME is a pig).  I don't think any of this really pertains to anything in 
bogofilter other than this: putting it into a database might not suck too badly.

I should probably try and profile it.  But I'll have to get bored again to set 
that up.
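
One cheap way to start, short of a real profiler, is to accumulate wall-clock
time per stage so the MIME decoding and the database lookups can be compared.
The sketch below is in Python and only illustrates the idea.  The stage names
and helpers are made up, and lookup_tokens is a stand-in for the actual
database query, which isn't shown in this post.

```python
import time
from collections import defaultdict
from email import message_from_string

timings = defaultdict(float)     # stage name -> accumulated seconds

def timed(stage, func, *args):
    """Run func(*args) and add its wall-clock time to timings[stage]."""
    start = time.perf_counter()
    try:
        return func(*args)
    finally:
        timings[stage] += time.perf_counter() - start

def decode_body(raw):
    """Collect the decoded text parts of a message using the stdlib parser."""
    msg = message_from_string(raw)
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True)
            if payload:
                parts.append(payload.decode("utf-8", errors="replace"))
    return " ".join(parts)

def lookup_tokens(tokens):
    """Hypothetical stand-in for the per-token database lookup."""
    return {tok: 0.5 for tok in tokens}

raw = "Subject: test\n\nhello cheap pills hello\n"
body = timed("mime", decode_body, raw)
probs = timed("db", lookup_tokens, set(body.split()))
```

Printing timings after a batch of messages would show which stage dominates
and whether the spread comes from a few slow messages.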

It starts out with marginal performance, but after about 4 messages it really 
speeds up as the database starts caching the data.


