Radical lexers

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Thu Jan 22 09:51:17 CET 2004


Tom Anderson <tanderso at oac-design.com> wrote:

>> I'd warrant that your wordlists have a lot of hapaxes (tokens that have
>> occurred once and only once) taking up space.  This seems contrary to
>> your efforts to minimize wordlist size :-(
>
>IMHO, more hapaxes for more accuracy is a good trade-off.  However,
>wordlist size is definitely important.  Therefore, someone should write
>a hapax stripping script that can be run in a cronjob.  If I had time,
>I'd do it...

There is already something in the FAQ which almost does it.
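
For illustration only, here is a minimal sketch of what such a hapax
stripper could look like. It is not the FAQ recipe. It assumes the
wordlist has first been dumped to text with one token per line in the
form "token spam_count good_count [date]", and that tokens starting
with "." are bookkeeping entries that must be kept. Both the layout
and the "." convention are assumptions here, and the names
strip_hapaxes and keep_prefix are made up for the example.

```python
def strip_hapaxes(lines, keep_prefix="."):
    """Yield dump lines whose token has been seen more than once.

    Assumed line layout: token spam_count good_count [date]
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        token = fields[0]
        spam, good = int(fields[1]), int(fields[2])
        # A hapax has a combined count of exactly one; drop those,
        # but keep entries assumed to be bookkeeping (keep_prefix).
        if token.startswith(keep_prefix) or spam + good > 1:
            yield line.rstrip("\n")
```

A cron job could feed the text dump through this function and rebuild
the wordlist from what comes out. How to dump and reload depends on
your setup and is not shown.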

Anyhow, there are far more effective ways than that to keep the
database small. Mine is 3.7M and works great.

pi
More information about the Bogofilter mailing list