spam conference report
Gyepi SAM
gyepi at praxis-sw.com
Mon Jan 20 01:18:45 CET 2003
On Mon, Jan 20, 2003 at 12:05:12AM +0100, Matthias Andree wrote:
> Gyepi SAM <gyepi at praxis-sw.com> writes:
>
> > One way I can think of is to cluster all proximate words into a single database value
> > whose key is some root of all the words, perhaps the stem. Presuming that the list is not too
> > long, we still maintain order n log n. So to look up a word:
>
> "stem" pretty much sounds like "get aware of the language" and clearly
> heads in the artificial intelligence direction.
>
> How would you define a stem? The purely alphanumeric part?
That is probably the simplest way.
>
> > 1. compute the word's stem
> > 2. use the stem as a lookup into the database
> > 3. get back a list of words and their counts.
> > 4. walk the list, looking for an exact match
> > 5. perhaps if no exact match is found, use the first word's count.
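
The five steps above could be sketched roughly as follows. This is only an
illustration in Python, not bogofilter code: the names stem, ClusteredCounts,
add and lookup are hypothetical, an in-memory dict stands in for the
database, and lowercasing inside the stem is an extra assumption beyond
"the alphanumeric part".

```python
import re
from collections import defaultdict


def stem(word):
    # Assumption: the stem is the ASCII alphanumeric part of the word,
    # lowercased so that case variants land in the same cluster.
    return re.sub(r"[^0-9a-z]", "", word.lower())


class ClusteredCounts:
    """Hypothetical stand-in for the database: stem -> list of [word, count]."""

    def __init__(self):
        self.db = defaultdict(list)

    def add(self, word, n=1):
        # Store the exact spelling inside the cluster keyed by its stem.
        bucket = self.db[stem(word)]
        for entry in bucket:
            if entry[0] == word:
                entry[1] += n
                return
        bucket.append([word, n])

    def lookup(self, word):
        # Steps 1-3: compute the stem and fetch the cluster for it.
        bucket = self.db.get(stem(word))
        if not bucket:
            return 0
        # Step 4: walk the list, looking for an exact match.
        for stored_word, count in bucket:
            if stored_word == word:
                return count
        # Step 5: no exact match, so fall back to the first word's count.
        return bucket[0][1]
```

With "Free!!!" stored with a count of 3 and "free" with a count of 5, a
lookup of "free" returns 5, while a lookup of the unseen spelling "FREE"
falls back to the first entry in the cluster and returns 3.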
>
> Other than that, this gets difficult if someone deliberately misspells a
> word early.
Well, yes, but deliberate misspelling as an attempt to evade filters
is a separate issue. Most likely, the misspelled word itself just becomes a spam indicator.
-Gyepi