charset implementation progress

Tue Nov 26 20:33:09 CET 2002

At 02:15 PM 11/26/02, Matt Armstrong wrote:

>David Relson <relson at osagesoftware.com> writes:
>
> > With these routines in place, the regression test results have changed
> > a little bit.  Since "iso-8859-1", "us-ascii", etc. are now processed by
> > the got_charset() routine and are not passed on as tokens...
>
>Is it possible to pass them on as tokens too?  The actual charset of
>the message is a reliable SPAM indicator for me.
>E.g. charset="ks_c_5601-1987", charset=euc-kr.  They often showed up
>as the tokens chosen for calculation in the original Graham method.
>I'd hate to lose them.

Matt,

A good question.  Let me comment (at length).

Yes, I can pass it on as a token, and yes, it is a good indicator.  The
code for passing it on is ugly, so I had intended to leave it out.

My spamlist has approximately 20,000 Korean words in it, none of which are
readable.  One of the ideas that the spambayes folks use is to convert
unreadable characters to question marks.  For charset=korean, this could be
done to characters above 0x80.  In normal usage, '?' is processed as a word
separator by the lexer.  Doing this conversion as a type of "case folding"
wouldn't affect the lexer, but would pass a much smaller set of tokens to
the spamicity calculation.  Bogofilter quickly trains on tokens like
"?????ab??" and will correctly classify the message as spam.
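
To make the folding idea concrete, here is a minimal sketch in Python
(bogofilter itself is C, and this is not its code).  The function name
fold_high_bytes is hypothetical, and treating every byte with the high
bit set (0x80 and up) as unreadable is my assumption about where the
cutoff would go.  It is applied to a token after the lexer has already
split the text, so the '?' never acts as a separator.

```python
def fold_high_bytes(token: bytes) -> str:
    """Fold one already-lexed token: keep ASCII bytes, and replace
    every byte with the high bit set by '?'.

    Hypothetical helper; the 0x80 cutoff is an assumption.
    """
    return "".join("?" if b >= 0x80 else chr(b) for b in token)


# Two tokens with different high-bit bytes but the same shape
# collapse to the same folded token.
first = fold_high_bytes(b"\xc7\xd1\xb1\xdbab\xb0\xa1")
second = fold_high_bytes(b"\xb1\xa4\xb0\xedab\xc0\xcc")
print(first, second)
```

Because many distinct byte sequences fold to the same pattern, the
wordlist only has to store the patterns.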

Remember my mention of 20,000 Korean tokens?  With the question mark
technique, the number drops to about 500.  This is a big win for saving
space.

There is an additional wrinkle to passing charset=... as a token.  If no
normalizing is done, then the wordlist will likely contain upper and lower
case versions of the token, as well as versions with and without the
(optional) double quotes.  Also, names sometimes use dashes and sometimes
use underscores.  Summary: "charset=euc-kr" may appear in the wordlist
under several different spellings.
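
A rough sketch of the normalizing that would be needed, again in Python
for illustration only.  The name normalize_charset is hypothetical, and
mapping underscores to dashes is an assumption (it merges "euc_kr" with
"euc-kr", but it also rewrites names such as "ks_c_5601-1987").

```python
def normalize_charset(value: str) -> str:
    """Return one canonical token for a charset parameter value.

    Hypothetical helper: strips optional double quotes, lowercases,
    and (by assumption) treats '_' and '-' as the same character.
    """
    name = value.strip().strip('"').strip()
    name = name.lower().replace("_", "-")
    return "charset=" + name


# Four spellings that would otherwise be four wordlist entries.
variants = ["EUC-KR", '"euc-kr"', "euc_kr", '"EUC_KR"']
tokens = {normalize_charset(v) for v in variants}
print(tokens)
```

With this in place the four variants above produce a single token.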

And the last thing to mention is that the Robinson calculation uses all
tokens in the message (with spamicities outside of min_dev), so all the
Korean tokens contribute to the score.  This makes any individual token
less significant, even one as obvious as "euc-kr".
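
The dilution can be seen with a small sketch of Robinson's
geometric-mean combining as I understand it.  This is not bogofilter's
code: the function name robinson_score, the min_dev value of 0.1, and
the token spamicities are all made up for illustration, and every
spamicity is assumed to lie strictly between 0 and 1.

```python
import math


def robinson_score(probs, min_dev=0.1):
    """Combine per-token spamicities with Robinson's geometric-mean
    method, ignoring tokens within min_dev of the neutral 0.5.

    Illustrative sketch only; assumes 0 < p < 1 for every token.
    """
    used = [p for p in probs if abs(p - 0.5) >= min_dev]
    if not used:
        return 0.5  # nothing informative, stay neutral
    n = len(used)
    # P = 1 - (product of (1 - p)) ** (1/n), computed via logs
    P = 1.0 - math.exp(sum(math.log(1.0 - p) for p in used) / n)
    # Q = 1 - (product of p) ** (1/n)
    Q = 1.0 - math.exp(sum(math.log(p) for p in used) / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0


body = [0.7] * 50                            # many moderately spammy tokens
alone = robinson_score([0.99])               # the strong token by itself
without = robinson_score(body)               # body tokens only
with_strong = robinson_score(body + [0.99])  # body plus the strong token
print(alone, without, with_strong)
```

Among fifty other tokens, the one strong token moves the combined score
only slightly, which is the effect described above.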