spam conference report
David Relson
relson at osagesoftware.com
Mon Jan 20 01:25:48 CET 2003
At 07:18 PM 1/19/03, Gyepi SAM wrote:
>On Mon, Jan 20, 2003 at 12:05:12AM +0100, Matthias Andree wrote:
> > Gyepi SAM <gyepi at praxis-sw.com> writes:
> >
> > > One way I can think of is to cluster all proximate words into a
> single database value
> > > whose key is some root of all the words, perhaps the stem. Presuming
> that the list is not too
> > > long, we still maintain order n log n. So to look up a word:
> >
> > "stem" pretty much sounds like "get aware of the language" and clearly
> > heads the artificial intelligence direction.
> >
> > How would you define a stem? the pure-alphanumerical part?
>
>That is probably the simplest way.
>
> >
> > > 1. compute the word's stem
> > > 2. use the stem as a lookup into the database
> > > 3. get back a list of words and their counts.
> > > 4. walk the list, looking for an exact match
> > > 5. perhaps if no exact match is found, use the first word's count.
> >
> > Other than that, this gets difficult if someone deliberately misspells a
> > word early.
>
>Well, yes, but deliberate misspellings as an attempt to evade filters
>is a separate issue. Likely, the misspelled word just becomes a spam
>indicator.
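The five quoted lookup steps could be sketched as follows. This is illustrative only, not bogofilter's actual code: the database is a plain dict, and `stem()` is a toy suffix-stripper standing in for a real stemmer (e.g. Porter's).

```python
def stem(word):
    """Toy stem: lowercase, keep only alphanumerics, strip a few suffixes.
    A real implementation would use a proper stemming algorithm."""
    w = "".join(c for c in word.lower() if c.isalnum())
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

# Hypothetical database: each stem keys a list of (word, count) pairs,
# i.e. the cluster of proximate words stored under one database value.
db = {
    "win": [("win", 10), ("wins", 4)],
}

def lookup(word):
    cluster = db.get(stem(word), [])   # steps 1-2: compute stem, fetch entry
    for w, count in cluster:           # steps 3-4: walk list for exact match
        if w == word:
            return count
    if cluster:                        # step 5: no exact match -> first word's count
        return cluster[0][1]
    return 0
```

With this sketch, `lookup("wins")` finds an exact match in the cluster, while `lookup("wining")` stems to the same key, misses the exact-match scan, and falls back to the first entry's count per step 5.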
Any use of "stems" cuts the number of distinct words recognized. Is that
desirable? Me? I don't know.
If the direction is towards fewer distinct tokens, two available techniques are
hash coding and soundex.
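Soundex, unlike stemming, is language-independent and fully specified. A compact Python version (illustrative, not proposed bogofilter code) shows how it collapses similar-looking tokens onto one key:

```python
def soundex(word):
    """American Soundex: first letter plus three digits for the
    consonant classes that follow, padded with zeros."""
    codes = {
        **dict.fromkeys("bfpv", "1"),
        **dict.fromkeys("cgjkqsxz", "2"),
        **dict.fromkeys("dt", "3"),
        "l": "4",
        **dict.fromkeys("mn", "5"),
        "r": "6",
    }
    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for c in word[1:]:
        code = codes.get(c, "")
        if code and code != prev:   # adjacent same-class letters collapse
            result += code
        if c not in "hw":           # h and w do not separate same-class letters
            prev = code
    return (result + "000")[:4]
```

Note how this bears on the misspelling question raised above: "viagra" and "v1agra" both reduce to `V260` (the digit is discarded as non-alphabetic), so the deliberate misspelling lands in the same bucket as the original.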
I _have_ seen arguments against _all_ techniques that lower the level of
detail stored. Do we want to go in that direction?
More information about the bogofilter-dev
mailing list