A case for Markovian
tallison at tacocat.net
Wed May 12 07:21:03 EDT 2004
Tom Anderson wrote:
> On Tue, 2004-05-11 at 19:03, Tom Allison wrote:
>>Consider that each word has variations that can still be understood:
>>vi@gr@
> This is an argument against single-token filtering, and it generally
> fails. If we can force all spammers to write in gibberish, then their
> response rate is going to drop below the break-even point. And we'll
> still be able to filter them.
>>Now multiply that by the variations_that_become_ava.ila.ble to_some 1 else
>>and the problem is no longer just the 30,000 words squared, but the
>>spelling variations of each word as well. If you guess at 5 variations plus
>>the original, you hit 32,400,000,000 instead of the 900,000,000 that we
>>originally remarked upon. That's a big database for only two words.
> Relative to what? Anyone using the current version of bogofilter has
> already decided that using up a little disk space is a fair exchange for
> the filtering provided. It naturally follows that many bogofilter users
> would like to exchange a little more disk space for even better
> filtering. And we'll develop the necessary space-saving features as
> time goes on... fear of using disk space is not a reason to abandon
> implementing a method which has shown superior accuracy. Moreover, like
> I said before, as long as the phrase size is configurable, then each
> user may decide for themselves the size vs accuracy tradeoff they're
> willing to make.
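For what it's worth, the counts quoted above do check out. A quick sketch, using the figures from the earlier message (a 30,000-word vocabulary, and a guessed 5 obfuscated spellings per word on top of the original):

```python
# Back-of-envelope check of the two-token phrase counts discussed above.
# Both input figures (30,000 words, 5 variations per word) come from the
# thread; nothing here is measured from a real corpus.

vocabulary = 30_000
variants_per_word = 5 + 1  # five guessed variations plus the original spelling

plain_pairs = vocabulary ** 2                       # phrases with no obfuscation
obfuscated_tokens = vocabulary * variants_per_word  # 180,000 distinct tokens
obfuscated_pairs = obfuscated_tokens ** 2

print(f"{plain_pairs:,}")       # 900,000,000
print(f"{obfuscated_pairs:,}")  # 32,400,000,000
```

So obfuscation multiplies the worst-case pair count by 36, not by 6 — each variation applies to both positions in the phrase.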
I'm not worried about the disk space, yet.
But we are looking at >> 10^9 entries in a single database.
Does Berkeley DB manage those numbers effectively?
(I really don't know.)
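To put a rough number on it: here is a sketch of the on-disk cost at that scale. The per-entry sizes below are illustrative assumptions, not measured bogofilter or Berkeley DB figures.

```python
# Rough storage estimate for a token-pair database with >10^9 entries.
# All three per-entry sizes are assumptions chosen for illustration only.

entries = 1_000_000_000   # lower bound from the discussion above

avg_key_bytes = 16        # assumed average length of a two-token key
value_bytes = 8           # assumed packed spam/ham counts per entry
overhead_bytes = 26       # assumed per-entry B-tree/page overhead

total_bytes = entries * (avg_key_bytes + value_bytes + overhead_bytes)
print(f"{total_bytes / 2**30:.1f} GiB")  # 46.6 GiB under these assumptions
```

So even before asking whether Berkeley DB's B-tree stays fast at that depth, we'd be carrying tens of gigabytes per user under these guesses, which makes the pruning/space-saving features mentioned above more than a nicety.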