A case for Markovian

Pavel Kankovsky peak at argo.troja.mff.cuni.cz
Fri May 14 16:15:37 CEST 2004


On Tue, 11 May 2004, Tom Allison wrote:

> Now multiply that by the variations that become available to someone else
> and you don't have a problem with just the 30,000 words squared, but the
> spelling variations of each word.  If you just guess at 5 variations plus
> the original, you hit 32,400,000,000 instead of the 900,000,000 that we
> originally remarked upon.  That's a big database for only two words.

You'd need to process at least 3.24E10 + 1 tokens, i.e. no less than
1.296E11 bytes of text, to collect 3.24E10 different token pairs --
assuming the shortest possible tokens (3 characters each) plus 1
character to separate them. I'd call that quite a lot of data.
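For what it's worth, the arithmetic behind both figures can be checked in
a few lines (a sketch: the 30,000-word vocabulary and the 5-variations
guess are from Tom's message, the 3-character tokens plus 1 separator are
the minimum assumed above):

```python
# Back-of-the-envelope check of the figures in this thread.
vocab = 30_000                # base vocabulary (Tom's figure)
pairs = vocab ** 2            # distinct ordered word pairs: 9.0E8

variants = 6                  # each word plus 5 spelling variations
varied_pairs = (vocab * variants) ** 2   # 3.24E10 distinct pairs

# A stream of N tokens yields N - 1 consecutive pairs, so observing
# 3.24E10 distinct pairs takes at least 3.24E10 + 1 tokens; at
# 3 characters per token plus a 1-character separator, that's:
min_bytes = (varied_pairs + 1) * 4       # roughly 1.296E11 bytes

print(f"{pairs:.2e} plain pairs, {varied_pairs:.3e} varied pairs")
print(f"at least {min_bytes:.3e} bytes of text needed")
```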

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."

More information about the Bogofilter mailing list