A case for Markovian

Tom Anderson tanderso at oac-design.com
Wed May 12 13:12:42 CEST 2004


On Tue, 2004-05-11 at 19:03, Tom Allison wrote:
> Consider that each word has variations that can still be understood:
> Viagra
> vi at gr@
> v.i.a....

This is an argument against single-token filtering, and it generally
fails: if we force all spammers to write in gibberish, their response
rate will drop below the break-even point, and we'll still be able to
filter the gibberish itself.
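
To make the point concrete, here's a minimal sketch of why gibberish
doesn't escape a token filter.  This is plain Graham-style spamicity
with made-up counts, not bogofilter's actual Robinson-Fisher scoring,
and the obfuscated variants are hypothetical.  Each variant is just
another token, and once it turns up in trained spam it scores at least
as spammy as the original word:

    # Graham-style spamicity sketch; counts are hypothetical, and
    # bogofilter's real scoring (Robinson-Fisher) differs.
    def spamicity(token, spam_counts, ham_counts, n_spam, n_ham):
        b = spam_counts.get(token, 0) / n_spam  # frequency in spam
        g = ham_counts.get(token, 0) / n_ham    # frequency in ham
        if b + g == 0:
            return 0.5                          # unseen token: neutral
        return b / (b + g)

    spam = {"viagra": 40, "v.i.a.g.r.a": 25, "v1agra": 10}
    ham  = {"viagra": 1}
    for t in ("viagra", "v.i.a.g.r.a", "v1agra"):
        print(t, round(spamicity(t, spam, ham, 1000, 1000), 3))

The variants never appear in ham, so they score a flat 1.0, which is
worse for the spammer than the word they were trying to hide.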

> Now multiply that by the variations_that_become_ava.ila.ble to_some 1 else
> and you don't have a problem with just the 30,000 words squared, but the 
> spelling variations of each word.  If you just guess at 5 variations plus 
> the original you hit 32,400,000,000 instead of the 900,000,000 that we 
> originally remarked upon.  That's a big database for only two words.
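
(Six forms per word gives 30,000 * 6 = 180,000 distinct tokens, and
180,000^2 = 32,400,000,000 ordered two-token pairs, versus 30,000^2 =
900,000,000 for the plain vocabulary, so the arithmetic itself is
right.)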

Relative to what?  Anyone using the current version of bogofilter has
already decided that using up a little disk space is a fair exchange
for the filtering provided.  It naturally follows that many bogofilter
users would trade a little more disk space for even better filtering.
And we'll develop the necessary space-saving features as time goes
on... fear of using disk space is not a reason to abandon a method
which has shown superior accuracy.  Moreover, as I said before, as
long as the phrase size is configurable, each user can decide for
themselves the size-vs-accuracy tradeoff they're willing to make.
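
As a sketch of what that configurable knob might look like
(hypothetical code, not bogofilter's actual lexer), here's a tokenizer
that emits every word sequence up to a user-chosen max_phrase_len;
setting it to 1 gives today's single-token behavior, and larger values
trade database size for context:

    # Hypothetical phrase tokenizer; max_phrase_len is the user-tunable
    # size-vs-accuracy knob discussed above.
    def phrase_tokens(words, max_phrase_len=2):
        for i in range(len(words)):
            for n in range(1, max_phrase_len + 1):
                if i + n <= len(words):
                    yield " ".join(words[i:i + n])

    print(list(phrase_tokens("buy cheap viagra now".split(), 2)))
    # ['buy', 'buy cheap', 'cheap', 'cheap viagra',
    #  'viagra', 'viagra now', 'now']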

Tom