Radical lexers
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Thu Jan 22 09:51:17 CET 2004
Tom Anderson <tanderso at oac-design.com> wrote:
>> I'd warrant that your wordlists have a lot of hapaxes (tokens that have
>> occurred once and only once) taking up space. This seems contrary to
>> your efforts to minimize wordlist size :-(
>
>IMHO, more hapaxes for more accuracy is a good trade-off. However,
>wordlist size is definitely important. Therefore, someone should write
>a hapax stripping script that can be run in a cronjob. If I had time,
>I'd do it...
There is already something in the FAQ which almost does it.
Anyhow, there are far more effective methods than that for keeping the
database small. Mine is 3.7M and works great.
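
For reference, a minimal sketch of what a hapax stripper could look like.
It is not the FAQ recipe. It assumes the wordlist has already been dumped
to text with one token per line in the form "token spam_count good_count
[date]", and that tokens starting with a dot are bookkeeping entries to be
kept. Both are assumptions; check them against your own dump. The function
name and the sample lines are made up for illustration.

```python
def strip_hapaxes(lines, min_count=2):
    """Yield dump lines whose spam + good count is at least min_count.

    Assumed line format: token spam_count good_count [date]
    Lines that do not parse are passed through unchanged, so nothing
    is dropped silently.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            yield line
            continue
        try:
            total = int(fields[1]) + int(fields[2])
        except ValueError:
            yield line
            continue
        # Assumption: dot-prefixed tokens are meta entries, always keep them.
        if fields[0].startswith(".") or total >= min_count:
            yield line


# Made-up sample data, only to show the filtering.
sample = [
    "viagra 5 0 20040101\n",
    "xyzzy 1 0 20040102\n",      # hapax, seen once in spam
    "hello 0 1 20040103\n",      # hapax, seen once in ham
    ".MSG_COUNT 10 20 20040104\n",
    "both 1 1 20040105\n",       # seen twice in total, kept
]
kept = list(strip_hapaxes(sample))
print("".join(kept), end="")
```

In a cronjob the same function would be fed the dump on standard input and
its output loaded into a fresh wordlist. Note that this throws away young
tokens that may still become useful, which is the accuracy trade-off
discussed above.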
pi