spam conference report
David Relson
relson at osagesoftware.com
Mon Jan 20 01:25:48 CET 2003
At 07:18 PM 1/19/03, Gyepi SAM wrote:
>On Mon, Jan 20, 2003 at 12:05:12AM +0100, Matthias Andree wrote:
> > Gyepi SAM <gyepi at praxis-sw.com> writes:
> >
> > > One way I can think of is to cluster all proximate words into a
> single database value
> > > whose key is some root of all the words, perhaps the stem. Presuming
> that the list is not too
> > > long, we still maintain order n log n. So to look up a word:
> >
> > "stem" pretty much sounds like "get aware of the language" and clearly
> > heads the artificial intelligence direction.
> >
> > How would you define a stem? the pure-alphanumerical part?
>
>That is probably the simplest way.
>
> >
> > > 1. compute the word's stem
> > > 2. use the stem as a lookup into the database
> > > 3. get back a list of words and their counts.
> > > 4. walk the list, looking for an exact match
> > > 5. perhaps if no exact match is found, use the first word's count.
> >
> > Other than that, this gets difficult if someone deliberately misspells a
> > word early.
>
>Well, yes, but deliberate misspellings as an attempt to evade filters
>is a separate issue. Likely, the misspelled word just becomes a spam
>indicator.
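The five quoted lookup steps could be sketched as follows. This is illustrative only, not bogofilter's actual code: the database is a plain dict, and `stem()` is a toy suffix-stripper standing in for a real stemmer (e.g. Porter's).

```python
def stem(word):
    """Toy stem: lowercase, keep only alphanumerics, strip a few suffixes.
    A real implementation would use a proper stemming algorithm."""
    w = "".join(c for c in word.lower() if c.isalnum())
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

# Hypothetical database: each stem keys a list of (word, count) pairs,
# i.e. the cluster of proximate words stored under one database value.
db = {
    "win": [("win", 10), ("wins", 4)],
}

def lookup(word):
    cluster = db.get(stem(word), [])   # steps 1-2: compute stem, fetch entry
    for w, count in cluster:           # steps 3-4: walk list for exact match
        if w == word:
            return count
    if cluster:                        # step 5: no exact match -> first word's count
        return cluster[0][1]
    return 0
```

With this sketch, `lookup("wins")` finds an exact match in the cluster, while `lookup("wining")` stems to the same key, misses the exact-match scan, and falls back to the first entry's count per step 5.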
Any use of "stems" cuts the number of distinct words recognized. Is that
desirable? Me? I don't know.
If the direction is towards fewer distinct tokens, two available techniques are
hash coding and soundex.
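Soundex, unlike stemming, is language-independent and fully specified. A compact Python version (illustrative, not proposed bogofilter code) shows how it collapses similar-looking tokens onto one key:

```python
def soundex(word):
    """American Soundex: first letter plus three digits for the
    consonant classes that follow, padded with zeros."""
    codes = {
        **dict.fromkeys("bfpv", "1"),
        **dict.fromkeys("cgjkqsxz", "2"),
        **dict.fromkeys("dt", "3"),
        "l": "4",
        **dict.fromkeys("mn", "5"),
        "r": "6",
    }
    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for c in word[1:]:
        code = codes.get(c, "")
        if code and code != prev:   # adjacent same-class letters collapse
            result += code
        if c not in "hw":           # h and w do not separate same-class letters
            prev = code
    return (result + "000")[:4]
```

Note how this bears on the misspelling question raised above: "viagra" and "v1agra" both reduce to `V260` (the digit is discarded as non-alphabetic), so the deliberate misspelling lands in the same bucket as the original.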
I _have_ seen arguments against _all_ techniques that lower the level of
detail stored. Do we want to go in that direction?
More information about the bogofilter-dev
mailing list