A case for Markovian
peak at argo.troja.mff.cuni.cz
Fri May 14 10:15:37 EDT 2004
On Tue, 11 May 2004, Tom Allison wrote:
> Now multiply that by the variations that become available to someone else
> and you don't have a problem with just the 30,000 words squared, but the
> spelling variations of each word. If you just guess at 5 variations plus
> the original you hit 32,400,000,000 instead of the 900,000,000 that we
> originally remarked upon. That's a big database for only two words.
You'd need to process at least 3.24E10 + 1 tokens, i.e. no less than
1.296E11 bytes of text, to collect 3.24E10 different token pairs --
assuming the shortest possible tokens (3 characters) plus 1 character
to separate them. I'd call that quite a lot of data.
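The arithmetic in this thread can be checked directly. A minimal sketch, assuming the figures quoted above (30,000 base words, 5 spelling variations plus the original, minimum 3-character tokens with a 1-byte separator):

```python
# Back-of-envelope check of the numbers discussed in the thread.
# Assumptions (from the thread, not measured): 30,000 base words,
# 5 spelling variations per word plus the original.
base_words = 30_000
variants_per_word = 6  # original + 5 variations

# Distinct ordered two-word pairs, without and with variations.
pairs_base = base_words ** 2                            # 9E8
pairs_varied = (base_words * variants_per_word) ** 2    # 3.24E10

# Minimum corpus size to even encounter that many distinct pairs:
# one token per pair plus one, each at least 3 characters plus a
# 1-character separator, i.e. 4 bytes per token.
tokens_needed = pairs_varied + 1
bytes_needed = tokens_needed * 4                        # ~1.296E11

print(pairs_base)      # 900,000,000
print(pairs_varied)    # 32,400,000,000
print(bytes_needed)    # roughly 1.296E11 bytes
```

For comparison, 1.296E11 bytes is on the order of 120 GB of pure text, which supports the point that a full two-token Markovian database over spelling variants is impractical to populate.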
--Pavel Kankovsky aka Peak [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."