charset implementation progress

Matthias Andree matthias.andree at gmx.de
Wed Nov 27 02:05:24 CET 2002


David Relson <relson at osagesoftware.com> writes:

> My spamlist has approx 20,000 Korean words in it, none of which are
> readable.  One of the ideas that the spambayes folks use is to convert
> unreadable characters to question marks.  For charset=korean, this
> could be done to characters above 0x80.  In normal usage, '?' is
> processed as a word separator by the lexer.  Doing this conversion as
> a type of "case folding" wouldn't affect the lexer, but would pass a
> much smaller set of tokens to the spamicity calculation.  Bogofilter
> quickly trains on tokens like "?????ab??" and will correctly classify
> the message as spam.
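
(If I read you correctly, that folding would be something like the
sketch below; fold_high_bytes is my name for it, not actual bogofilter
code:)

    #include <stddef.h>

    /* Fold every byte above 0x7F to '?'; the lexer then treats '?'
     * as a word separator, so runs of unreadable characters collapse
     * into a few short tokens. */
    static void fold_high_bytes(unsigned char *buf, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            if (buf[i] > 0x7F)
                buf[i] = '?';
    }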

That's the American way of doing things. Don't take this personally,
but my experience tells me that many Americans simply are not aware of
the needs of languages beyond US-ASCII. :-( Consider yourself lucky to
get by with 94 printable characters.

How does bogofilter know what character sets the user can read? How can
you tell which UTF-8 subset I can read? And why should we go to this
length at all? Let's use some existing library to canonicalize our
stuff to Unicode, register everything in Unicode, and be done with it.
The user's training of bogofilter will work things out in the end.
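
With iconv(3), for instance, the canonicalization is a few lines; a
sketch only, with error handling abbreviated (to_utf8 is a made-up
name):

    #include <iconv.h>

    /* Convert a buffer from the message's declared charset to UTF-8.
     * Returns the number of bytes written to "out", or (size_t)-1 on
     * failure. */
    static size_t to_utf8(const char *charset,
                          char *in, size_t inlen,
                          char *out, size_t outlen)
    {
        iconv_t cd = iconv_open("UTF-8", charset);
        size_t left = outlen;

        if (cd == (iconv_t)-1)
            return (size_t)-1;      /* charset unknown to iconv */
        if (iconv(cd, &in, &inlen, &out, &left) == (size_t)-1) {
            iconv_close(cd);
            return (size_t)-1;      /* bad input, or output too small */
        }
        iconv_close(cd);
        return outlen - left;       /* bytes of UTF-8 produced */
    }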

> Remember my mention of 20,000 Korean tokens?  With the question mark
> technique, the number drops to 500 or so.  This is a big win for
> saving space.

OTOH, the charset name alone, when emitted as a token, is going to be
quite indicative where Far East character sets are concerned.

Do we really get that many? Even if so, a language like Chinese may
have a vast alphabet, but how many characters are actually used? Some
tokens like "click" will be there, even if in a different language, and
we might rather consider adding a time stamp to our database and
weeding out tokens that occur infrequently and have not appeared for,
say, 90 days.
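
Hypothetically, something like this for the record layout (the field
names and thresholds below are made up, not bogofilter's):

    #include <time.h>

    /* Per-token record with a last-seen timestamp alongside the
     * counts; last_seen is updated whenever the token occurs. */
    struct token_rec {
        unsigned long good_count;
        unsigned long spam_count;
        time_t        last_seen;
    };

    #define PRUNE_AGE   (90 * 24 * 3600L)  /* say, 90 days          */
    #define PRUNE_COUNT 5UL                /* "occurs infrequently" */

    /* Weed out a token that is both rare and stale. */
    static int should_prune(const struct token_rec *r, time_t now)
    {
        return r->good_count + r->spam_count < PRUNE_COUNT
            && now - r->last_seen > PRUNE_AGE;
    }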

> There is an additional wrinkle to passing charset=... as a token.  If no
> normalizing is done, then the wordlist will likely contain upper and
> lower case versions of the symbol, as well as versions with and without
> the (optional) double quotes.

That's trivial to fix.
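Roughly this, as a sketch (lower-case the token and drop the optional
quotes in one pass):

    #include <ctype.h>

    /* Canonicalize a charset token in place: fold to lower case and
     * strip double quotes. */
    static void normalize_charset(char *s)
    {
        char *p, *q;

        for (p = q = s; *p != '\0'; p++)
            if (*p != '"')
                *q++ = (char)tolower((unsigned char)*p);
        *q = '\0';
    }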

> Also, names sometimes use dashes and
> sometimes use underscores.  Summary: "charset=euc-kr" may appear
> multiple times in the wordlist.

Never mind. The IANA list is there, and if spammers deviate from it,
the display of their message is suboptimal anyway.

> And the last thing to mention is that the Robinson calculation uses all
> tokens in the message (with spamicities outside of min_dev), so all the
> korean symbols contribute to the score.  This makes any individual token
> less significant, even one as obvious as "euc-kr".

Which, I believe, is the key to why a Graham -> Robinson switch will
give some people lots of false negatives.
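
(For reference, the Robinson calculation as I understand it; a sketch,
not bogofilter's actual code, and it assumes the spamicities are
already clamped away from 0 and 1:)

    #include <math.h>

    #define MIN_DEV 0.1    /* example value */

    /* Combine per-token spamicities p[0..n-1] Robinson-style:
     * geometric means over every token farther than MIN_DEV from the
     * neutral 0.5 -- so all those Korean tokens contribute, and any
     * single token matters less. */
    static double robinson_score(const double *p, int n)
    {
        double lnP = 0.0, lnQ = 0.0, P, Q;
        int i, used = 0;

        for (i = 0; i < n; i++) {
            if (fabs(p[i] - 0.5) < MIN_DEV)
                continue;               /* too close to neutral */
            lnP += log(1.0 - p[i]);
            lnQ += log(p[i]);
            used++;
        }
        if (used == 0)
            return 0.5;                 /* nothing decisive */
        P = 1.0 - exp(lnP / used);      /* spam evidence */
        Q = 1.0 - exp(lnQ / used);      /* ham evidence  */
        return (1.0 + (P - Q) / (P + Q)) / 2.0;
    }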



A scientific paper I wrote in 2000 showed me that a descriptive
algorithm (which Graham's undoubtedly is) may prove more effective than
the high arts.

The work dealt with content-based image stabilization (more precisely,
compensating global inter-frame motion of the image). The
"descriptive" algorithm modeled a non-linear physical ropes-and-springs
device, and I was to contrast that with a Kalman filter and a
Rauch-Tung-Striebel smoother, which are rather mathematical models. I
could go into detail here, but it would require that you know some
control theory.

Anyway, it turned out that the descriptive algorithm offered better
compensation than the Rauch-Tung-Striebel smoother over a wide range of
the sequence, while the RTS smoother was more precise for about five
frames near the start and the end of the sequence.

-- 
Matthias Andree


