A case for Markovian

David Relson relson at osagesoftware.com
Thu May 13 13:24:09 CEST 2004


On 13 May 2004 12:26:53 +1000
michael at optusnet.com.au wrote:

> Tom Anderson <tanderso at oac-design.com> writes:
> [...] 
> > Relative to what?  Anyone using the current version of bogofilter
> > has already decided that using up a little disk space is a fair
> > exchange for the filtering provided.  It naturally follows that many
> > bogofilter users would like to exchange a little more disk space for
> > even better filtering.  And we'll develop the necessary space-saving
> > features as time goes on... fear of using disk space is not a reason
> > to abandon implementing a method which has shown superior accuracy. 
> > Moreover, like I said before, as long as the phrase size is
> > configurable, then each user may decide for themselves the size vs
> > accuracy tradeoff they're willing to make.
> 
> Noting that this is particularly true given the VERY inefficient way
> that the bogofilter data is currently stored!
> 
> Michael.

Hi Michael,

It's true.  Bogofilter stores its tokens as plain text, which takes
space, and storing Markovian chains (multi-word phrases) as plain text
will use gobs and gobs more.  At least one of the other spam filters
computes a hash value for each token and stores that instead.  Of course
hashed values have pros and cons.

Pros:  a fixed amount of storage per token is a big disk space savings
whenever the hash is shorter than the typical token, which it will be
for multi-word chains.  A smaller database is quicker to search.  Also,
fixed-size database keys (the tokens) may be faster to match.
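
To make the fixed-size key idea concrete, here is a rough sketch in
Python.  It is not how bogofilter or any other filter actually does it:
the helper name token_key, the choice of BLAKE2b, and the 8-byte key
size are all assumptions picked for illustration.

```python
import hashlib

def token_key(token: str, size: int = 8) -> bytes:
    """Hypothetical helper: map a token of any length to a fixed-size key.

    BLAKE2b and the 8-byte default are illustrative assumptions only.
    """
    return hashlib.blake2b(token.encode("utf-8"), digest_size=size).digest()

# A single word and a multi-word chain both end up as 8-byte keys.
for token in ["computer", "refinance your mortgage today"]:
    print(len(token.encode("utf-8")), "bytes as text ->",
          len(token_key(token)), "bytes as key")
```

The longer the phrase, the bigger the saving, which is why hashing looks
more attractive for chains than for single words.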

Cons:  computing the hash costs time.  Hashes create the possibility of
collisions.  A collision can cause incorrect results: if, for example,
both "computer" and "refinance" give the same hash code, their counts
get lumped together in a single database entry.
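
How likely a collision is depends on the hash width and the number of
distinct tokens.  The sketch below uses the standard birthday
approximation and assumes the hash spreads tokens uniformly; the
function name and the token count are made up for illustration, not
measured from a real wordlist.

```python
import math

def collision_probability(n_tokens: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_tokens share a hash value.

    Birthday approximation: p ~= 1 - exp(-n*(n-1) / (2 * 2**bits)),
    assuming a uniformly distributed hash.
    """
    space = 2.0 ** hash_bits
    return -math.expm1(-n_tokens * (n_tokens - 1) / (2.0 * space))

# Hypothetical wordlist of one million distinct tokens.
for bits in (32, 64):
    print(bits, "bit hash:", collision_probability(1_000_000, bits))
```

Under those assumptions a 32-bit hash is all but certain to have at
least one collision among a million tokens, while a 64-bit hash makes
one very unlikely.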

Anyhow, it's something to think about.

Regards,

David


