Problems with Asian Spam

David Relson relson at osagesoftware.com
Tue Nov 21 13:08:49 CET 2006


On Tue, 21 Nov 2006 09:46:17 +0100
'Stefan Geißler' wrote:

> Hi,
> 
> I have Bogofilter successfully integrated into a procmail script. 
> European and Russian spam is detected well, but Asian spam,
> especially Chinese spam, is not detected. Japanese spam is detected
> more often, but still not very well. Training with Asian spam results
> in "unsure". Spam mails similar to the ones used for training are not
> detected as spam. The problem is that I may receive Asian ham mails,
> so I cannot simply delete them through procmail (as suggested in
> the FAQ; besides, many of the Asian spam mails state they use the
> US-ASCII charset).
> 
> My Bogofilter setup uses the default configuration like unicode=no
> and charset_default=iso-8859-1. I wonder what would happen if I
> change to unicode=yes and charset_default=yes. What would the
> wordlist database think about this?
> 
> Any suggestions and help are welcome.
> Thank you
> 
> Stefan

Hello Stefan,

What version (and distro) of bogofilter are you using?  Bogofilter's
default mode is unicode=yes.  If you've not got that set, I'd recommend
using it. 

Also, the charset_default option takes the name of a charset, e.g.
utf-8, iso-8859-1, etc.  charset_default is not a boolean (yes/no)
option like unicode.
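
As a rough sketch, the two options might sit together in a config file
like the one below.  The file location (~/.bogofilter.cf) and the choice
of iso-8859-1 as the fallback charset are assumptions for illustration
only; use whichever charset name fits the mail you receive without a
declared charset.

```
# ~/.bogofilter.cf  (assumed per-user config file location)

# Boolean option: takes yes or no.
unicode=yes

# Takes a charset name, not yes/no.  iso-8859-1 is only an example value.
charset_default=iso-8859-1
```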

Using unicode=yes will increase the size of your wordlist but will help
your accuracy.  If, indeed, you are presently operating with
unicode=no, it would be smart to start with a brand new wordlist when
you switch to unicode.

Running "bogofilter -Q" will show your current settings.  "bogofilter
-C -Q" will show bogofilter's default settings (unaffected by any
config file, but affected by your wordlist).

HTH,

Regards,

David


