Can bogofilter be used as a Chinese spam filter?

Tue Mar 27 13:23:04 CEST 2007

On Tue, 27 Mar 2007 10:21:25 +0800 (CST)
Zhang Jing wrote:

> 
> I've read a lot of papers about ending spam, as well as Mr. Graham's
> "A Plan for Spam", but I have a problem and was wondering if anyone
> can point me in the right direction.
> 
> For my senior project I am designing a spam filter for Chinese
> email. In bogofilter-faq.html, the section "What can I do about
> Asian spam?" seems to suggest that bogofilter does not support the
> Chinese language. Am I right? Could you give me some suggestions for
> using bogofilter in a Chinese-language environment, that is, to
> filter Chinese spam out of Chinese mail?
> 
> Best Regards!
> 
> Yours sincerely,
> Zhang Jing

Hello Zhang Jing,

About 2 years ago, Unicode support was implemented in bogofilter.  This
provides a standardized character set for use in the wordlist and for
processing messages.  How well this works with Chinese is not clear.

Also, bogofilter's parser is based on a flex grammar (see file
src/lexer_v3.l).  The parser recognizes standard email headers (such as
From:, Subject:, etc.), multipart MIME messages, etc.  As these
headers are defined by RFC standards, they apply regardless of the
language of the email, e.g. English, German, Chinese, Hebrew, etc.
However, the grammar tokenizes the message body using rules that
approximate words, for example: whitespace, then a letter, then 2 or
more letters or digits, then whitespace.
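
To make that example rule concrete, here is a minimal Python sketch of
it.  It is only an ASCII approximation of the rule as worded above, not
the actual pattern in lexer_v3.l, and the names WORD and tokenize are
made up for illustration.  It shows why such a rule finds nothing in
text that has no Roman letters and no spaces between words.

```python
import re

# ASCII-only approximation of the rule described above (NOT the real
# flex rule): a letter, then 2 or more letters or digits, with
# whitespace (or the start/end of the text) on both sides.
WORD = re.compile(r"(?<!\S)[A-Za-z][A-Za-z0-9]{2,}(?!\S)")

def tokenize(body: str) -> list[str]:
    """Return the word-like tokens this approximate rule would emit."""
    return WORD.findall(body)

english = "Buy cheap meds now 100% free"
chinese = "免费获取优惠券"  # no Roman letters, no spaces between words

print(tokenize(english))
print(tokenize(chinese))
```

With this approximation the English sample yields five tokens and the
Chinese sample yields none.  The real lexer behaves differently on
Chinese input, so this illustrates the principle only.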

Flex's parsing is based on the Roman alphabet, so the rules are too.
The grammar processes Unicode-encoded Chinese without any complaints.
Unfortunately, I can't say whether the tokens emitted for Chinese are
meaningful or are just arbitrary sequences of Unicode characters.

An interesting experiment would be for you (a native speaker of
Chinese) to create a wordlist (trained on both ham and spam messages in
Chinese) and test how well bogofilter scores new messages (ones that
were not used to build the wordlist).  I'd be very interested to hear
what you find out.
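
One possible way to script such an experiment is sketched below in
Python.  It assumes bogofilter is on the PATH, that -s and -n register
a message as spam or non-spam, that -d selects the wordlist directory,
and that the exit status is 0 for spam, 1 for ham and 2 for unsure, as
I understand the man page; please check these against your installed
version.  The function names are hypothetical.

```python
import subprocess
from collections import Counter

# Exit statuses as I understand them from the bogofilter man page
# (an assumption to verify locally).
EXIT_LABELS = {0: "spam", 1: "ham", 2: "unsure"}

def train(path, is_spam, wordlist_dir):
    """Register one message file as spam (-s) or ham (-n)."""
    flag = "-s" if is_spam else "-n"
    with open(path, "rb") as msg:
        subprocess.run(["bogofilter", "-d", wordlist_dir, flag],
                       stdin=msg, check=True)

def classify(path, wordlist_dir):
    """Score one held-out message file and map the exit status to a label."""
    with open(path, "rb") as msg:
        status = subprocess.run(["bogofilter", "-d", wordlist_dir],
                                stdin=msg).returncode
    return EXIT_LABELS.get(status, "error")

def summarize(results):
    """results: iterable of (true_label, predicted_label) pairs."""
    counts = Counter(results)
    return {
        "false_positives": counts[("ham", "spam")],
        "false_negatives": counts[("spam", "ham")],
        "unsure": sum(n for (_, pred), n in counts.items()
                      if pred == "unsure"),
        "total": sum(counts.values()),
    }
```

Train on one set of Chinese ham and spam files, call classify on a
separate set, and pass the (true, predicted) pairs to summarize to see
how many messages were misfiled.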

It's also possible that, with a different grammar, flex would do a good
job of identifying Chinese words.  As you have the complete source code
for bogofilter, you can experiment with flex and lexer_v3.l.  If you
develop a good parser, it could be added to bogofilter.  Again, I'd be
interested in your results.
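
As a starting point for such experiments, one commonly used way to
tokenize Chinese without a word dictionary is to emit overlapping
character pairs (bigrams).  The Python sketch below illustrates that
idea only; it is not part of bogofilter, it is not a flex grammar, and
it checks just the basic CJK Unified Ideographs block.  The function
names are made up for illustration.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"

def cjk_bigrams(text: str) -> list[str]:
    """Emit overlapping character pairs for each run of CJK characters.

    A run of length 1 is emitted as a single-character token.
    Non-CJK text is ignored here; it would go through the normal rules.
    """
    tokens = []
    run = []
    for ch in text + " ":  # trailing space flushes the final run
        if is_cjk(ch):
            run.append(ch)
            continue
        if len(run) == 1:
            tokens.append(run[0])
        tokens.extend(a + b for a, b in zip(run, run[1:]))
        run = []
    return tokens

print(cjk_bigrams("免费获取"))
```

For the four-character sample this produces three overlapping pairs.
Whether such tokens separate Chinese ham from spam well is exactly what
the experiment would have to show.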

Regards,

David