chinese-korean-non_latin spam
Olaf Rogalsky
Olaf.Rogalsky at physik.uni-erlangen.de
Fri Mar 7 18:46:00 CET 2003
Hello,
yesterday I installed bogofilter and in general I am quite satisfied
with its performance. Unfortunately, I get a lot of spam from Asia,
written in East Asian languages and encodings. Bogofilter has no way
to break such a stream of characters into words or glyphs, and
therefore often fails miserably on that kind of spam.
To solve this problem, a specialized lexer would be needed for each
major encoding family (say Latin/Chinese/Japanese/Russian). In
addition, an estimator for the encoding is needed to choose the right
lexer (looking at the Content-Type is not enough).
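A rough sketch of what "one lexer per script, chosen by an estimator"
could look like. This is only an illustration in Python, not
bogofilter code: the names lex_latin, lex_cjk, LEXERS and tokenize are
made up, the text is assumed to be already decoded, and using
overlapping character pairs as "words" for Chinese is just one
possible assumption about how such a lexer might work.

```python
import re

def lex_latin(text):
    # Split on runs of non-word characters, as a whitespace-oriented lexer would.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def lex_cjk(text):
    # Assumption: with no word boundaries available, overlapping
    # character pairs stand in for words.
    chars = [c for c in text if not c.isspace()]
    return [a + b for a, b in zip(chars, chars[1:])]

# Hypothetical table mapping the estimator's verdict to a lexer.
LEXERS = {"latin": lex_latin, "cjk": lex_cjk}

def tokenize(text, script):
    # `script` is whatever the encoding estimator decided;
    # unknown values fall back to the Latin lexer.
    return LEXERS.get(script, lex_latin)(text)
```

The fallback to the Latin lexer means a wrong or missing estimate
degrades to the current behaviour instead of failing.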
Differentiating between Latin and Chinese can be done by a simple
statistical analysis of the content characters, but I am not sure
about the other encodings. Unfortunately, this would mean that the
message has to be processed twice: first for the character analysis
and then for the word count analysis (unless all lexers are run
simultaneously and the right one is chosen afterwards).
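One very simple form of such a character analysis, sketched in Python
under my own assumptions: count how many bytes of the body have the
high bit set. The function names and the 0.5 threshold are invented
for illustration and have not been tuned on real mail.

```python
def high_bit_ratio(body: bytes) -> float:
    # Fraction of bytes with the top bit set, ignoring ASCII whitespace.
    data = [b for b in body if b not in b" \t\r\n"]
    if not data:
        return 0.0
    return sum(1 for b in data if b >= 0x80) / len(data)

def guess_script(body: bytes, threshold: float = 0.5) -> str:
    # Assumed rule: mostly high-bit bytes suggests a multi-byte
    # East Asian encoding, otherwise treat the body as Latin.
    return "cjk" if high_bit_ratio(body) > threshold else "latin"
```

This would be the first of the two passes. Note that a single-byte
Cyrillic encoding such as KOI8-R also consists mostly of high-bit
bytes, so this test alone cannot tell Russian from Chinese.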
Olaf
Random Google links on encoding guessing:
http://www.mandarintools.com/codeguess.html
http://lingua.mtsu.edu/chinese-computing/statistics/
http://www.mashke.org/Conv/
--
+----------------------------------------------------------------------+
I Dr. Olaf Rogalsky Institut f. Theo. Physik I I
I Tel.: 09131 8528440 Univ. Erlangen-Nuernberg I
I Fax.: 09131 8528444 Staudtstrasse 7 B3 I
I rogalsky at theorie1.physik.uni-erlangen.de D-91058 Erlangen I
+----------------------------------------------------------------------+