Can bogofilter be used as a Chinese spam filter?
matthias.andree at gmx.de
Tue Mar 27 18:43:53 EDT 2007
> I'm currently working on my senior project, which is to design a spam filter for Chinese email. In bogofilter-faq.html, the section "What can I do about Asian spam?" seems to suggest that bogofilter does not support Chinese. Am I right? Could you give me some suggestions if I want to use bogofilter in a Chinese-language environment, that is, to filter Chinese spam from Chinese mail?
Zhang Jing, your "Subject" header line is not properly encoded in your
character set and displays random characters here, not your Chinese name.
Anyway: the problem with Chinese is that, unlike the Indo-European
languages we know, written Chinese has no spaces between words; it just
concatenates them until a full stop. Bogofilter is not programmed to
handle that and will instead parse full sentences, unless they contain
more than 30 words. So it may catch common spam phrases, but
unfortunately not individual words.
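To illustrate the effect, here is a rough Python sketch. It is not
bogofilter's lexer, just a plain whitespace splitter standing in for any
tokenizer that relies on spaces; the function name and the sample strings
are made up for the example.

```python
import re

def whitespace_tokens(text):
    # Split on runs of whitespace, as a space-delimited tokenizer would.
    return [t for t in re.split(r"\s+", text) if t]

english = "buy cheap watches now"
chinese = "现在购买便宜手表"  # roughly the same phrase, written without spaces

# The English phrase yields four separate word tokens.
print(whitespace_tokens(english))
# The Chinese phrase comes back as a single token covering the whole phrase.
print(whitespace_tokens(chinese))
```

The whole Chinese phrase ends up as one token, so only mails repeating the
exact same phrase would ever match it.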
As David suggested, help with enhancing lexer.l to properly emit Chinese
words as single tokens is most welcome.
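One commonly used stand-in for real word segmentation is to emit
overlapping two-character tokens (bigrams) for each run of Chinese
characters. The Python sketch below only illustrates that idea; it is an
assumption on my part, not what bogofilter does, and an actual change
would have to be written as flex rules in lexer.l. The helper name and the
character range (the basic CJK Unified Ideographs block only) are
hypothetical choices for the example.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block (U+4E00..U+9FFF).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def cjk_bigrams(text):
    """Return overlapping two-character tokens for each run of CJK characters."""
    tokens = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            # A lone character is kept as its own token.
            tokens.append(run)
        else:
            # Slide a two-character window over the run.
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

print(cjk_bigrams("现在购买便宜手表"))
```

Bigrams do not recover true word boundaries, but they give the classifier
short, repeatable tokens instead of whole sentences.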
Hope that helps.