chinese-korean-non_latin spam

Fri Mar 7 18:46:00 CET 2003

Hello,

Yesterday I installed bogofilter, and in general I am quite satisfied
with its performance. Unfortunately, I get a lot of spam from Asia,
encoded in some East Asian language. Bogofilter has no way to break up
the stream of characters into words/glyphs and therefore often fails
miserably on that kind of spam.

To solve this problem, a specialized lexer would be needed for
each major encoding (say Latin/Chinese/Japanese/Russian). In addition,
an estimator for the encoding is needed to choose the right lexer
(looking at the Content-Type is not enough).
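
To make the idea of one lexer per script concrete, here is a minimal
sketch in Python. It is not bogofilter code: the names lex_latin,
lex_cjk, LEXERS and tokenize are hypothetical, and using overlapping
character bigrams for Chinese is my own assumption, just one common way
to get tokens out of text that has no spaces between words.

```python
import re

def lex_latin(text):
    """Hypothetical Latin lexer: lower-cased runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def lex_cjk(text):
    """Hypothetical CJK lexer: overlapping bigrams of non-space characters.

    Assumption: there is no word segmentation available, so pairs of
    adjacent characters stand in for words.
    """
    chars = [c for c in text if not c.isspace()]
    return [a + b for a, b in zip(chars, chars[1:])]

# Dispatch table from a guessed script label to the lexer to use.
LEXERS = {"latin": lex_latin, "chinese": lex_cjk}

def tokenize(text, script):
    """Run the lexer for the guessed script; fall back to the Latin lexer."""
    return LEXERS.get(script, lex_latin)(text)
```

With this, tokenize("Buy now", "latin") yields the two words, while a
run of four Chinese characters yields three overlapping two-character
tokens.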

Differentiating between Latin and Chinese can be done by a simple
statistical analysis of the content characters, but I am not sure
about the other encodings. Unfortunately, this would mean that one has
to process the message twice: first for the character analysis and
then for the word count analysis (unless all lexers are run
simultaneously and the right one is chosen afterwards).
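
As a rough illustration of such a statistical analysis, the sketch
below looks at the fraction of bytes with the high bit set. This rests
on the assumption that the common Chinese encodings (GB2312, Big5) are
double-byte with high-bit lead bytes, while Latin text is mostly plain
ASCII. The function names and the threshold of 0.3 are made up for the
example and would need tuning on real mail.

```python
def high_bit_ratio(raw: bytes) -> float:
    """Fraction of bytes in the message body that are >= 0x80."""
    if not raw:
        return 0.0
    return sum(1 for b in raw if b >= 0x80) / len(raw)

def guess_script(raw: bytes, threshold: float = 0.3) -> str:
    """Guess 'chinese' or 'latin' from raw bytes.

    Assumption: the threshold is an illustrative value, not a measured one.
    Caveat: single-byte Cyrillic encodings such as KOI8-R also have a high
    ratio, so this test alone cannot tell Russian from Chinese.
    """
    return "chinese" if high_bit_ratio(raw) > threshold else "latin"
```

If the whole message body is buffered in memory, this first pass is a
single cheap scan over the bytes before the chosen lexer does the word
count pass.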

Olaf

Random Google links on encoding guessing:
http://www.mandarintools.com/codeguess.html
http://lingua.mtsu.edu/chinese-computing/statistics/
http://www.mashke.org/Conv/

-- 
+----------------------------------------------------------------------+
I Dr. Olaf Rogalsky                         Institut f. Theo. Physik I I
I Tel.: 09131 8528440                       Univ. Erlangen-Nuernberg   I
I Fax.: 09131 8528444                       Staudtstrasse 7 B3         I
I rogalsky at theorie1.physik.uni-erlangen.de  D-91058 Erlangen           I
+----------------------------------------------------------------------+