A case for Markovian

Tom Anderson tanderso at oac-design.com
Wed May 12 13:12:42 CEST 2004


On Tue, 2004-05-11 at 19:03, Tom Allison wrote:
> Consider that each word has variations that can still be understood:
> Viagra
> vi at gr@
> v.i.a....

This is an argument against single-token filtering, and it generally
fails: if we force all spammers to write in gibberish, their response
rate will drop below the break-even point, and we'll still be able to
filter the gibberish itself.
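
To make the point concrete, here's a minimal sketch of why gibberish
doesn't escape a token filter.  This is plain Graham-style spamicity
with made-up counts, not bogofilter's actual Robinson-Fisher scoring,
and the obfuscated variants are hypothetical.  Each variant is just
another token, and once it turns up in trained spam it scores at least
as spammy as the original word:

    # Graham-style spamicity sketch; counts are hypothetical, and
    # bogofilter's real scoring (Robinson-Fisher) differs.
    def spamicity(token, spam_counts, ham_counts, n_spam, n_ham):
        b = spam_counts.get(token, 0) / n_spam  # frequency in spam
        g = ham_counts.get(token, 0) / n_ham    # frequency in ham
        if b + g == 0:
            return 0.5                          # unseen token: neutral
        return b / (b + g)

    spam = {"viagra": 40, "v.i.a.g.r.a": 25, "v1agra": 10}
    ham  = {"viagra": 1}
    for t in ("viagra", "v.i.a.g.r.a", "v1agra"):
        print(t, round(spamicity(t, spam, ham, 1000, 1000), 3))

The variants never appear in ham, so they score a flat 1.0, which is
worse for the spammer than the word they were trying to hide.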

> Now multiply that by the variations_that_become_ava.ila.ble to_some 1 else
> and you don't have a problem with just the 30,000 words squared, but the 
> spelling variations of each word.  If you just guess at 5 variations plus 
> the original you hit 32,400,000,000 instead of the 900,000,000 that we 
> originally remarked upon.  That's a big database for only two words.
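
(Six forms per word gives 30,000 * 6 = 180,000 distinct tokens, and
180,000^2 = 32,400,000,000 ordered two-token pairs, versus 30,000^2 =
900,000,000 for the plain vocabulary, so the arithmetic itself is
right.)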

Relative to what?  Anyone using the current version of bogofilter has
already decided that using up a little disk space is a fair exchange
for the filtering provided.  It naturally follows that many bogofilter
users would trade a little more disk space for even better filtering.
And we'll develop the necessary space-saving features as time goes
on... fear of using disk space is not a reason to abandon a method
which has shown superior accuracy.  Moreover, as I said before, as
long as the phrase size is configurable, each user can decide for
themselves the size-vs-accuracy tradeoff they're willing to make.
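
As a sketch of what that configurable knob might look like
(hypothetical code, not bogofilter's actual lexer), here's a tokenizer
that emits every word sequence up to a user-chosen max_phrase_len;
setting it to 1 gives today's single-token behavior, and larger values
trade database size for context:

    # Hypothetical phrase tokenizer; max_phrase_len is the user-tunable
    # size-vs-accuracy knob discussed above.
    def phrase_tokens(words, max_phrase_len=2):
        for i in range(len(words)):
            for n in range(1, max_phrase_len + 1):
                if i + n <= len(words):
                    yield " ".join(words[i:i + n])

    print(list(phrase_tokens("buy cheap viagra now".split(), 2)))
    # ['buy', 'buy cheap', 'cheap', 'cheap viagra',
    #  'viagra', 'viagra now', 'now']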

Tom