A case for Markovian

michael at optusnet.com.au michael at optusnet.com.au
Fri May 14 01:32:44 CEST 2004


David Relson <relson at osagesoftware.com> writes:
> On 13 May 2004 12:26:53 +1000
> michael at optusnet.com.au wrote:
[..]
> > Noting that this is particular true given the VERY inefficent way that
> > the bogofilter data is currently stored!
> > 
> > Michael.
> 
> Hi Michael,
> 
> It's true.  Bogofilter's storing of tokens as plain text takes space and
> storing of markovian chains as plain text will use gobs and gobs of
> space.  At least one of the other spam filters computes a hash value for
> each token and stores that.  Of course hashed values have pros and cons.
> 
> Pros:  having a fixed amount of storage per token is undoubtedly a
> big disk space savings.  A smaller database is quicker to search.  Also,
> fixed size database keys (the tokens) may be faster to match.

Note that the match should be much much faster. I.e. a pair of
integer comparisions v a loop around series of byte comparisions.
 
> Cons:  computing the hash costs time.

This is probably lower than you think. On modern CPUs the wall clock
overhead of this is similar to just finding the length of the
string. (the time is dominated by the cache misses).

It's also something that you do only once at parse time, and you get
the time back the first time you need to do a comparision.

> Hashes create possibilities of collisions.

It does indeed. I think the overall probabilty of collision
is low. (a 64 bit hash over 4 billion tokens would give a
~50% of a single collision). I'm of the opinion that that's 
not worth worrying about. :)

michael.




More information about the Bogofilter mailing list