spam conference report

Mon Jan 20 01:18:45 CET 2003

On Mon, Jan 20, 2003 at 12:05:12AM +0100, Matthias Andree wrote:
> Gyepi SAM <gyepi at praxis-sw.com> writes:
> 
> > One way I can think of is to cluster all proximate words into a single database value
> > whose key is some root of all the words, perhaps the stem. Presuming that the list is not too
> > long, we still maintain order n log n. So to look up a word:
> 
> "stem" pretty much sounds like "get aware of the language" and clearly
> heads in the artificial intelligence direction.
> 
> How would you define a stem? The purely alphanumeric part?

That is probably the simplest way.

> 
> > 1. compute the word's stem
> > 2. use the stem as a lookup into the database
> > 3. get back a list of words and their counts.
> > 4. walk the list, looking for an exact match
> > 5. perhaps if no exact match is found, use the first word's count.
> 
> Other than that, this gets difficult if someone deliberately misspells a
> word early.

Well, yes, but deliberate misspellings as an attempt to evade filters
are a separate issue. Most likely, the misspelled word itself just becomes
a spam indicator.
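
To make the five lookup steps quoted above concrete, here is a rough
Python sketch. It is only an illustration under stated assumptions, not
code from any existing filter: the names stem, StemDatabase, add and
lookup are hypothetical, the "stem" is taken to be the lowercased ASCII
alphanumeric part of the word (the lowercasing is my own assumption),
and an in-memory dict stands in for the real database.

```python
import re
from collections import defaultdict


def stem(word):
    """Hypothetical stem: the lowercased ASCII alphanumeric part of the word."""
    return re.sub(r"[^0-9a-z]", "", word.lower())


class StemDatabase:
    """In-memory stand-in for a database keyed by stem (an assumption)."""

    def __init__(self):
        # stem -> list of [word, count] entries, in insertion order
        self._buckets = defaultdict(list)

    def add(self, word, count=1):
        """Record `count` more occurrences of the exact spelling `word`."""
        bucket = self._buckets[stem(word)]
        for entry in bucket:
            if entry[0] == word:
                entry[1] += count
                return
        bucket.append([word, count])

    def lookup(self, word):
        """Return the count for `word`, following steps 1 to 5."""
        # Steps 1-3: compute the stem and fetch the list stored under it.
        bucket = self._buckets.get(stem(word))
        if not bucket:
            return 0  # unknown stem: nothing recorded yet
        # Step 4: walk the list, looking for an exact match.
        for candidate, count in bucket:
            if candidate == word:
                return count
        # Step 5: no exact match, so use the first word's count.
        return bucket[0][1]


db = StemDatabase()
db.add("free", 3)
db.add("f-r-e-e", 5)
print(db.lookup("f-r-e-e"))  # exact match: 5
print(db.lookup("f.r.e.e"))  # no exact match, first word's count: 3
```

Note that with this definition of a stem, obfuscated spellings such as
"f-r-e-e" land in the same list as "free" while still keeping their own
counts, so a deliberately mangled spelling can end up as a strong
indicator in its own right.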

-Gyepi