A case for Markovian

Wed May 12 13:21:03 CEST 2004

Tom Anderson wrote:
> On Tue, 2004-05-11 at 19:03, Tom Allison wrote:
> 
>>Consider that each word has variations that can still be understood:
>>Viagra
>>vi@gr@
>>v.i.a....
> 
> 
> This is an argument against single-token filtering, and it generally
> fails.  If we can force all spammers to write in gibberish, then their
> response rate is going to drop below the break-even point.  And we'll
> still be able to filter them.
> 
> 
>>Now multiply that by the variations_that_become_ava.ila.ble to_some 1 else
>>and you don't have a problem with just the 30,000 words squared, but the 
>>spelling variations of each word.  If you just guess at 5 variations plus 
>>the original you hit 32,400,000,000 instead of the 900,000,000 that we 
>>originally remarked upon.  That's a big database for only two words.
> 
> 
> Relative to what?  Anyone using the current version of bogofilter has
> already decided that using up a little disk space is a fair exchange for
> the filtering provided.  It naturally follows that many bogofilter users
> would like to exchange a little more disk space for even better
> filtering.  And we'll develop the necessary space-saving features as
> time goes on... fear of using disk space is not a reason to abandon
> implementing a method which has shown superior accuracy.  Moreover, like
> I said before, as long as the phrase size is configurable, then each
> user may decide for themselves the size vs accuracy tradeoff they're
> willing to make.
> 
> Tom
> 

I'm not worried about the disk space, yet.
But we are looking at far more than 10^9 entries in a system.
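
For reference, the arithmetic behind the figures quoted above can be sketched as follows. This is only a rough upper bound: it assumes every ordered pair of tokens actually shows up in mail, which it won't, and the function name is just mine for illustration.

```python
# Back-of-the-envelope count of two-word phrase entries.
# The inputs are the figures from the quoted message: a 30,000-word
# vocabulary, and 5 guessed spelling variations plus the original (6 forms).

def pair_entries(vocabulary: int, forms_per_word: int = 1) -> int:
    """Upper bound on ordered two-token pairs (hypothetical helper).

    Assumes every possible pair occurs, so real databases should be smaller.
    """
    tokens = vocabulary * forms_per_word
    return tokens ** 2

plain = pair_entries(30_000)                            # originals only
with_variants = pair_entries(30_000, forms_per_word=6)  # 5 variations + original

print(f"{plain:,}")          # 900,000,000
print(f"{with_variants:,}")  # 32,400,000,000
```

So six forms per word multiplies the pair count by 36, not by 6.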

Does Berkeley DB manage those numbers effectively?
(I really don't know.)