mailing lists and hapaxes

Bob Friesenhahn bfriesen at simple.dallas.tx.us
Thu Sep 25 02:45:39 CEST 2003


On Wed, 24 Sep 2003, David Relson wrote:

> Greetings,
>
> As part of another test, I grepped my wordlist for my userid and was
> surprised to find 31,400 tokens containing it.  Checking further, I
> found the culprit to be mailing lists that have serial numbered return
> addresses, for example:
>
> bogofilter-return-1000-relson 0 1
> bogofilter-return-1001-relson 0 1
> bogofilter-return-1002-relson 0 1
> bogofilter-return-1003-relson 0 1
> ...
>
> Of the 31,400 tokens, all except 238 are hapaxes, i.e. tokens that have
> appeared exactly once.
>
> This could be a reason to _not_ use '-u' (auto-update).  It could also
> be a reason to periodically delete hapaxes.
>
> Has anybody else noticed this phenomenon?  Any thoughts on how best to
> deal with it?

If your example is a common case, then treating dashes and other common
delimiters the same way you treat spaces when parsing would produce
simpler words (e.g. bogofilter, return, and relson) that repeat far
more often.  Purely numeric values can be summarily discarded.
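
As a rough illustration of that idea (a sketch only, not bogofilter's
actual lexer), the following Python splits the example tokens on
delimiters and drops purely numeric pieces.  The delimiter set and the
name split_token are assumptions made for this example; only dashes
are named above.

```python
import re
from collections import Counter

# Assumed delimiter set: dashes, underscores, dots and whitespace.
# Only dashes are mentioned explicitly in the post.
DELIMITERS = re.compile(r"[-_.\s]+")


def split_token(token):
    """Split a token on delimiters; drop empty and purely numeric pieces."""
    parts = DELIMITERS.split(token)
    return [p for p in parts if p and not p.isdigit()]


# The serial-numbered return addresses from the quoted wordlist.
addresses = [
    "bogofilter-return-1000-relson",
    "bogofilter-return-1001-relson",
    "bogofilter-return-1002-relson",
    "bogofilter-return-1003-relson",
]

# Count how often each simplified word appears across all addresses.
counts = Counter()
for address in addresses:
    counts.update(split_token(address))

print(counts)
```

With this splitting, the four distinct one-off tokens collapse into
three words (bogofilter, return, relson), each counted four times, so
none of them remains a hapax.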

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us
http://www.simplesystems.org/users/bfriesen




