spam conference report

Gyepi SAM gyepi at praxis-sw.com
Mon Jan 20 02:06:07 CET 2003


On Sun, Jan 19, 2003 at 07:25:48PM -0500, David Relson wrote:
> At 07:18 PM 1/19/03, Gyepi SAM wrote:
> >On Mon, Jan 20, 2003 at 12:05:12AM +0100, Matthias Andree wrote:
> >> How would you define a stem? the pure-alphanumerical part?
> >
> >that is probably the simplest way.
> >

> Any use of "stems" cuts the number of distinct words recognized.  Is that 
> desirable?  Me?  I don't know.

The stem is just a key whose associated value is a list of all the
'related' words and their counts. The method would actually increase
the number of distinct words recognized. The 'stemming' is only there to
find a common 'root' for all the words on the list.  I am being purposefully
vague here because we are not really stemming, per se, just reducing, I suppose.
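
As a rough illustration of that layout (a sketch only, not bogofilter code):
the reduction rule below is the "pure alphanumeric part" definition suggested
above, and the names `reduce_token` and `StemIndex` are hypothetical.

```python
import re
from collections import defaultdict


def reduce_token(token):
    """Hypothetical reduction: lowercase, then keep only ASCII letters and digits."""
    return re.sub(r"[^0-9a-z]", "", token.lower())


class StemIndex:
    """Maps each reduced key to the exact tokens seen for it, with counts."""

    def __init__(self):
        # reduced key -> {exact token -> count}
        self.table = defaultdict(lambda: defaultdict(int))

    def add(self, token, count=1):
        # The exact token is kept as-is, so no detail is thrown away.
        self.table[reduce_token(token)][token] += count

    def related(self, token):
        # All tokens that share this token's reduced key (empty if none).
        return dict(self.table.get(reduce_token(token), {}))


index = StemIndex()
for tok in ["free", "FREE", "f.r.e.e", "fr*ee", "freedom"]:
    index.add(tok)

# A token never seen before can still be matched against its relatives.
print(index.related("F-R-E-E"))
```

Every exact spelling keeps its own count, so the number of distinct words stored
does not shrink; the key only groups them for lookup.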

> If the direction is towards fewer distinct tokens, two techniques that are 
> available are hash coded and soundex.

Hash coding would work if the hash keys are not too distinct, i.e. if
related words still land on the same key. Soundex, IIRC, would map too
many words to each key, and the list of words would get very long.
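
To make the Soundex concern concrete, here is a minimal sketch of the classic
four-character American Soundex code. It is a simplified implementation written
for this illustration, not taken from bogofilter or any library; the function
name `soundex` and the sample word list are made up.

```python
from collections import defaultdict

# Consonant groups of American Soundex; vowels, h, w and y carry no digit.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no ASCII letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate two equal codes
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (first.upper() + "".join(digits) + "000")[:4]


# Unrelated words sharing one key is exactly the problem described above.
buckets = defaultdict(list)
for w in ["money", "many", "mean", "mine", "moon", "menu"]:
    buckets[soundex(w)].append(w)

print(dict(buckets))  # all six words fall under the single key 'M500'
```

With a real vocabulary, each such key would carry a long list of mostly
unrelated words.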
 
> I _have_ seen arguments against _all_ techniques that lower the level of 
> detail stored.  Do we want to go in that direction?

Not at all.

-Gyepi




More information about the bogofilter-dev mailing list