charset implementation progress

Matt Armstrong matt at lickey.com
Tue Nov 26 21:23:31 CET 2002


David Relson <relson at osagesoftware.com> writes:

> My spamlist has approx 20,000 korean words in it,

Are you sure these are Korean "words" rather than whole Korean phrases?
Since Asian languages don't necessarily separate words with spaces, a
lexer oriented toward Western Romance languages would count an entire
sentence as a single "word."  True?  (I'm not an expert here, but this
is my impression.)


> none of which are readable.  One of the ideas that the spambayes
> folks use is to convert unreadable characters to question marks.

Of course, this makes an assumption that may not be acceptable to
every bogofilter user -- namely, one who receives a lot of legitimate
mail in one Asian charset or another.  For such a user, this scheme
may make both "buy cheap dvd" and "hi this is your mom" produce the
same bogofilter token.
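To make the collision concrete, here is a minimal sketch; the munging
rule and the Korean sample phrases are my own stand-ins, not what
spambayes literally does:

```python
# A minimal sketch of the "convert unreadable characters to question
# marks" idea.  The munging rule and the Korean phrases below are my
# own stand-ins for illustration, not spambayes' actual code.
def munge(raw: bytes) -> str:
    """Replace every byte outside printable ASCII with '?'."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else "?" for b in raw)

# Two unrelated phrases, each shaped "two syllables, space, three
# syllables".  EUC-KR encodes each Hangul syllable as two high bytes,
# so both collapse to the identical token "???? ??????".
spam_ish = "싸다 디비디".encode("euc-kr")   # rough stand-in for "cheap dvd"
ham_ish = "엄마 편지요".encode("euc-kr")    # rough stand-in for a note from mom
assert munge(spam_ish) == munge(ham_ish)
print(munge(spam_ish))  # ???? ??????
```

Any two phrases with the same syllable layout become indistinguishable
tokens, which is exactly the problem for a user whose ham is Korean.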

Munging Asian charsets into more tokenizable strings on the grounds
that they can all be considered SPAM is fine, but only for folks happy
with that assumption.

For folks happy with that assumption, why bother with all the costly
calculations?  Why not mark the message as SPAM as soon as
"charset=euc-kr" is seen in the Content-Type header?
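That shortcut could look something like this; the charset list and the
function are hypothetical, not existing bogofilter code:

```python
import re

# Hypothetical sketch of the shortcut suggested above: treat the mere
# presence of a Korean charset in the Content-Type header as spam and
# skip token scoring entirely.  The charset list is an assumption.
SPAM_CHARSETS = {"euc-kr", "ks_c_5601-1987", "iso-2022-kr"}

def charset_is_spam(content_type: str) -> bool:
    """Return True if the Content-Type names a listed charset."""
    m = re.search(r'charset="?([\w.-]+)"?', content_type, re.IGNORECASE)
    return bool(m) and m.group(1).lower() in SPAM_CHARSETS

print(charset_is_spam("text/plain; charset=euc-kr"))  # True
print(charset_is_spam("text/plain; charset=us-ascii"))  # False
```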


> And the last thing to mention is that the Robinson calculation uses
> all tokens in the message (with spamicities outside of min_dev), so
> all the korean symbols contribute to the score.  This makes any
> individual token less significant, even one as obvious as "euc-kr"

Agreed.  Whether the charset token gets counted becomes a relatively
minor matter with this approach.




More information about the bogofilter-dev mailing list