mailing lists and hapaxes
David Relson
relson at osagesoftware.com
Thu Sep 25 02:56:28 CEST 2003
On Wed, 24 Sep 2003 19:45:39 -0500 (CDT)
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Wed, 24 Sep 2003, David Relson wrote:
>
> > Greetings,
> >
> > As part of another test, I grepped my wordlist for my userid and was
> > surprised to find 31,400 tokens containing it. Checking further, I
> > found the culprit to be mailing lists that have serial numbered
> > return addresses, for example:
> >
> > bogofilter-return-1000-relson 0 1
> > bogofilter-return-1001-relson 0 1
> > bogofilter-return-1002-relson 0 1
> > bogofilter-return-1003-relson 0 1
> > ...
> >
> > Of the 31,400 tokens, all except 238 are hapaxes, i.e. tokens that
> > have appeared exactly once.
> >
> > This could be a reason to _not_ use '-u' (auto-update). It could
> > also be a reason to periodically delete hapaxes.
> >
> > Has anybody else noticed this phenomenon? Any thoughts on how best
> > to deal with it?
>
> If your example is a common case, then treating dashes or other common
> delimiters the same way you treat spaces when parsing would create
> simpler words (e.g. bogofilter, return, & relson), but with much
> more redundancy. Purely numeric values can be summarily discarded.
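
To make that suggestion concrete, here is a minimal Python sketch. It is
only an illustration, not how bogofilter's lexer works; the function name
and the delimiter set (whitespace, dash, dot, underscore) are assumptions.

```python
import re

# Assumed delimiter set: whitespace plus dash, dot, and underscore.
DELIMS = re.compile(r"[\s\-._]+")

def split_token(token):
    """Split a token on common delimiters and drop purely numeric pieces."""
    parts = [p for p in DELIMS.split(token) if p]
    return [p for p in parts if not p.isdigit()]

print(split_token("bogofilter-return-1000-relson"))
# prints ['bogofilter', 'return', 'relson']
```

Every serial-numbered address then collapses to the same three words, so
no new token is created per message.
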
It looks like a common mailing list practice is
"listname-return-seqno-userid". A quick look found five such lists:
advisornews-return-10-relson 0 1 20030601
announce-return-101-relson 0 1 20030601
bogofilter-return-1304-relson 0 1 20030601
mdlug-return-10217-relson 0 1 20030601
spamfilt-return-309-relson 0 1 20030903
There is also another form:
ACUCCAFJTALABGMunity7.relson 0 1 20030601
AMUCCA31MADAAKYunity6.relson 0 1 20030601
AYDBDBLKPAJACIYunity6.relson 0 1 20030601
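
For anyone who wants to count hapaxes in their own wordlist, a rough
Python sketch follows. It assumes each line looks like the ones above,
i.e. a token, two counts, and an optional YYYYMMDD timestamp; the
function name is made up for the example.

```python
def count_hapaxes(lines):
    """Return (total, hapaxes) for lines of the form 'token n1 n2 [date]'."""
    total = hapaxes = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        total += 1
        # A hapax is a token whose two counts add up to exactly one.
        if int(fields[1]) + int(fields[2]) == 1:
            hapaxes += 1
    return total, hapaxes

sample = [
    "bogofilter-return-1304-relson 0 1 20030601",
    "mdlug-return-10217-relson 0 1 20030601",
    "ACUCCAFJTALABGMunity7.relson 0 1 20030601",
]
print(count_hapaxes(sample))  # prints (3, 3)
```
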
One thought would be special treatment for the userid portion of the
address in "Received: return-address at domain.com" and similar statements.
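
One possible shape for that special treatment, sketched in Python. The
pattern only covers the "listname-return-seqno-userid" form and leaves
the second form alone; the regex, the "#" placeholder, and the function
name are assumptions for illustration, not anything bogofilter does today.

```python
import re

# Assumed layout: listname-return-seqno-userid
SERIAL = re.compile(
    r"^(?P<list>[A-Za-z0-9_.]+)-return-\d+-(?P<user>[A-Za-z0-9_.]+)$"
)

def normalize(token):
    """Replace the serial number with a fixed placeholder, if present."""
    m = SERIAL.match(token)
    if m is None:
        return token  # not a serial-numbered return address
    return "{}-return-#-{}".format(m.group("list"), m.group("user"))

print(normalize("bogofilter-return-1304-relson"))
# prints bogofilter-return-#-relson
```

With this, each list contributes one token per subscriber instead of one
per message.
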
David
P.S. In my original message I deleted timestamps and am showing them
here.