mailing lists and hapaxes

David Relson relson at osagesoftware.com
Thu Sep 25 02:56:28 CEST 2003


On Wed, 24 Sep 2003 19:45:39 -0500 (CDT)
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Wed, 24 Sep 2003, David Relson wrote:
> 
> > Greetings,
> >
> > As part of another test, I grepped my wordlist for my userid and was
> > surprised to find 31,400 tokens containing it.  Checking further, I
> > found the culprit to be mailing lists that have serial numbered
> > return addresses, for example:
> >
> > bogofilter-return-1000-relson 0 1
> > bogofilter-return-1001-relson 0 1
> > bogofilter-return-1002-relson 0 1
> > bogofilter-return-1003-relson 0 1
> > ...
> >
> > Of the 31,400 tokens, all except 238 are hapaxes, i.e. tokens that
> > have appeared exactly once.
> >
> > This could be a reason to _not_ use '-u' (auto-update).  It could
> > also be a reason to periodically delete hapaxes.
> >
> > Has anybody else noticed this phenomenon?  Any thoughts on how best
> > to deal with it?
> 
> If your example is a common case, then treating dashes or other common
> delimiters the same way you treat spaces when parsing would create
> simpler words (e.g. bogofilter, return, & relson), but with much
> more redundancy.  Purely numeric values can be summarily discarded.
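
The splitting idea can be sketched roughly as below.  This is only an
illustration in Python, not bogofilter's lexer; the delimiter set
(whitespace plus dash) and the name split_tokens are assumptions made
for the example.

```python
import re

# Hypothetical delimiter set: whitespace plus dash.
# This is an assumption for illustration, not bogofilter's actual lexer rules.
DELIMITERS = re.compile(r"[\s\-]+")


def split_tokens(text):
    """Split on whitespace and dashes, dropping empty and purely numeric pieces."""
    pieces = DELIMITERS.split(text)
    return [p for p in pieces if p and not p.isdigit()]


print(split_tokens("bogofilter-return-1000-relson"))
# prints ['bogofilter', 'return', 'relson']
```

Every serial-numbered address from one list then yields the same three
words, so no new hapax is created per message.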

It looks like a common mailing list practice is
"listname-return-seqno-userid".  A quick look found five such lists:

advisornews-return-10-relson 0 1 20030601
announce-return-101-relson 0 1 20030601
bogofilter-return-1304-relson 0 1 20030601
mdlug-return-10217-relson 0 1 20030601
spamfilt-return-309-relson 0 1 20030903
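
One possible way to recognize that form and drop the serial number is
sketched here.  The regular expression and the name
normalize_return_token are assumptions for illustration; nothing like
this exists in bogofilter as far as this message shows.

```python
import re

# Assumed shape: listname-return-seqno-userid, with seqno purely numeric.
RETURN_TOKEN = re.compile(r"^(?P<list>.+)-return-(?P<seqno>\d+)-(?P<user>.+)$")


def normalize_return_token(token):
    """Remove the serial number from a matching token; leave others unchanged."""
    match = RETURN_TOKEN.match(token)
    if match is None:
        return token
    return "{}-return-{}".format(match.group("list"), match.group("user"))


for tok in ("bogofilter-return-1304-relson", "mdlug-return-10217-relson"):
    print(normalize_return_token(tok))
```

With this, all messages from one list map to a single token such as
bogofilter-return-relson instead of one token per message.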

As well as another form:

ACUCCAFJTALABGMunity7.relson 0 1 20030601
AMUCCA31MADAAKYunity6.relson 0 1 20030601
AYDBDBLKPAJACIYunity6.relson 0 1 20030601

One thought would be special treatment for the userid portion of the
address in "Received: return-address at domain.com" and similar statements.

David

P.S.  In my original message I deleted the timestamps; I am showing
them here.
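
For anyone wanting to count hapaxes in a dump like the one above, here
is a rough Python sketch.  It assumes each line is "token count count
[date]" and that a hapax is a token whose two counts sum to one; the
function names are made up for the example.

```python
def parse_line(line):
    """Parse 'token count count [date]'; the date column is optional."""
    fields = line.split()
    token = fields[0]
    count_a, count_b = int(fields[1]), int(fields[2])
    date = fields[3] if len(fields) > 3 else None
    return token, count_a, count_b, date


def split_hapaxes(lines):
    """Return (hapaxes, others), where a hapax has a total count of exactly 1."""
    hapaxes, others = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        token, count_a, count_b, _date = parse_line(line)
        if count_a + count_b == 1:
            hapaxes.append(token)
        else:
            others.append(token)
    return hapaxes, others


sample = [
    "bogofilter-return-1304-relson 0 1 20030601",
    "spamfilt-return-309-relson 0 1 20030903",
    "example-token 3 120 20030903",  # made-up non-hapax line, for illustration only
]
hapaxes, others = split_hapaxes(sample)
print(len(hapaxes), len(others))
```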



