further advice for asian spam and spam assassin text

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Sep 24 17:13:48 CEST 2003


p at dirac.org wrote:

> 1. if i "bogofilter -s" this email, will the spam assassin text pollute my
>    wordlist?  for instance, i don't want the word "spam" to enter the
>    list of bad words since most spam doesn't say "spam".

Since you also train with ham, the word will show up there
too (as in this mail or your mail), so this is not a problem.

> 2. i'm still debating whether i'm going to take the FAQ's option 1 (let
>    bogofilter add the tokens) or option 2 (don't train bogofilter on
>    asian spam at all).

I finally decided to let bogofilter add the tokens, and it
works very nicely. Only very few "Chinese" mails are used for
training, and that is enough.

>    the FAQ says that option 1 can be expensive. 

That depends on the training method. When the FAQ entry was
written, full training was the way to go, so you could end up
adding lots of words to the database. If you just train on
error (as with bogominitrain.pl), this is not the case.
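
To illustrate the difference in database growth, here is a toy
sketch in Python. The ToyFilter class, its scoring rule and the
sample messages are all made up for illustration. This is not
bogofilter's algorithm and not necessarily what bogominitrain.pl
does; it only shows the general idea that training on error
registers a message only when the current wordlist gets it wrong.

```python
from collections import Counter


class ToyFilter:
    """A deliberately tiny token counter (hypothetical, not bogofilter)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    @staticmethod
    def tokens(text):
        return set(text.lower().split())

    def is_spam(self, text):
        # Positive score: the tokens were seen more often in spam than in ham.
        score = sum(self.spam[t] - self.ham[t] for t in self.tokens(text))
        return score > 0

    def train(self, text, spam):
        (self.spam if spam else self.ham).update(self.tokens(text))

    def size(self):
        # Number of distinct tokens stored, a stand-in for database size.
        return len(set(self.spam) | set(self.ham))


def full_training(messages):
    f = ToyFilter()
    for text, spam in messages:
        f.train(text, spam)          # every message is registered
    return f


def train_on_error(messages):
    f = ToyFilter()
    for text, spam in messages:
        if f.is_spam(text) != spam:  # register only misclassified messages
            f.train(text, spam)
    return f


# Made-up sample corpus: (text, is_spam)
MESSAGES = [
    ("cheap pills buy now", True),
    ("meeting notes attached", False),
    ("buy cheap watches now", True),
    ("lunch meeting tomorrow", False),
    ("cheap pills and watches", True),
]

full = full_training(MESSAGES)
toe = train_on_error(MESSAGES)
print("full training: ", full.size(), "tokens")
print("train on error:", toe.size(), "tokens")
```

On this tiny corpus the train-on-error wordlist stays much
smaller, because most messages are already classified correctly
and never get registered.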

>    by expensive, are we talking about my machine is going to swap
>    everytime bogofilter runs?  or are we talking about "it's expensive
>    for a a 100+ person system"?

Expensive mainly in terms of database size, and whatever that
does to your system.


>       :0 HB:
>       * charset=.*(koi8|windows-125[01345678]|big-?5)
>       etc/UNREADABLE
> 
>    the bogofilter FAQ recommends:

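Your recipe above would also catch your message and mine: it
scans the body as well (the B flag), and both mails quote a
matching charset= string. Not what you want, I guess. I hope
you find this mail anyway.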
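
A rough way to check this claim outside procmail is sketched
below in Python. The regular expression is copied from the
quoted recipe; the sample header and body, and the helper
recipe_matches(), are made up for illustration. Python's re
module is only an approximation of procmail's matching (procmail
is case-insensitive unless the D flag is given, hence
re.IGNORECASE).

```python
import re

# Condition copied from the quoted recipe.
CONDITION = re.compile(
    r"charset=.*(koi8|windows-125[01345678]|big-?5)", re.IGNORECASE
)


def recipe_matches(header: str, body: str, flags: str = "HB") -> bool:
    """Approximate procmail's H/B flags: search header, body, or both."""
    searched = []
    if "H" in flags:
        searched.append(header)
    if "B" in flags:
        searched.append(body)
    return any(CONDITION.search(part) for part in searched)


# Hypothetical header of an ordinary list mail.
HEADER = (
    "Subject: further advice\n"
    "Content-Type: text/plain; charset=us-ascii\n"
)
# A body that merely quotes the recipe.
BODY = (
    ">       * charset=.*(koi8|windows-125[01345678]|big-?5)\n"
    "That would also catch this message.\n"
)

print(recipe_matches(HEADER, BODY, "HB"))  # body quotes the pattern
print(recipe_matches(HEADER, BODY, "H"))   # header alone
```

With the body included the condition matches the quoted recipe
text itself; restricted to the header it does not.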

>     UNREADABLE='[^?"]*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987'
> 
>     :0:
>     * 1^0 $ ^Subject:.*=\?($UNREADABLE)
>     * 1^0 $ ^Content-Type:.*charset="?($UNREADABLE)
>     spam-unreadable
> 
>     :0:
>     * ^Content-Type:.*multipart
>     * B ?? $ ^Content-Type:.*^?.*charset="?($UNREADABLE)
>     spam-unreadable
> 
>    but it looks like neither of these recipes would've worked anyway, since
>    the Content-Type headers are:
> 
>       Content-Type: text/plain;
>       Content-Transfer-Encoding: 7bit

This is surprising. That is very broken. With headers like
these you are usually not able to make any of these messages
readable.
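
As a small illustration of why a missing charset makes such
mail unreadable, the Python sketch below parses headers like
the ones quoted and then decodes some non-ASCII bytes without
knowing their encoding. The message body and the KOI8-R sample
text are made up; only the two header lines come from the
quoted mail.

```python
from email import message_from_string

# Headers as quoted above: a Content-Type with no charset parameter.
RAW = (
    "Content-Type: text/plain;\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "body\n"
)

msg = message_from_string(RAW)
charset = msg.get_content_charset()
print(charset)  # no charset parameter, so there is nothing to go on

# Made-up sample: six Cyrillic letters stored as KOI8-R bytes.
raw_bytes = "Привет".encode("koi8-r")

# Without a declared charset a reader can only guess. Falling back to
# ASCII turns every byte into a replacement character.
guess = raw_bytes.decode("ascii", errors="replace")
print(guess)

# With the right charset the text is recoverable.
print(raw_bytes.decode("koi8-r"))
```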

>    is there a better, more sure way of catching non-english spam? 

No, but you can guess, for example by looking for a number of
non-ASCII characters in a row. This is dangerous, though.
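
A minimal Python sketch of such a guess follows. The function
names, the threshold of eight bytes and the sample subjects are
assumptions for illustration, not anything bogofilter or
procmail provides. It works on raw bytes, so multi-byte
encodings will produce different run lengths than single-byte
ones.

```python
def longest_non_ascii_run(data: bytes) -> int:
    """Length of the longest run of consecutive bytes >= 0x80."""
    longest = current = 0
    for byte in data:
        if byte >= 0x80:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def looks_unreadable(data: bytes, threshold: int = 8) -> bool:
    """Guess: flag the line if it has a long enough non-ASCII run."""
    return longest_non_ascii_run(data) >= threshold


# Made-up sample subjects.
english = b"Subject: meeting tomorrow"
german = "Subject: Grüße aus Köln".encode("latin-1")
russian = "Subject: Здравствуйте".encode("koi8-r")

for name, line in (("english", english), ("german", german), ("russian", russian)):
    print(name, longest_non_ascii_run(line), looks_unreadable(line))
```

The Russian subject is flagged although it could be perfectly
legitimate mail, which is the danger: the guess punishes the
script a message is written in, not whether it is spam.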


>    i imagine it's the responsibility of the spammer's MTA (or MUA??) to
>    write that header.

MUA

>    and if i were a spammer, i'd surpress any header
>    that can be used to flag my email as spam.

But you want readers to be able to read.

> ----- Forwarded message from XAEvxzl at iris.seed.net.tw -----

Please do *not* forward spam to this list. It pollutes the
database.

> Subject: *****SPAM***** ¥þ³¡¥X²M
                          ~~~~~~~~

This could be used (eight characters in a row, most of them
non-ASCII).

> ?M???S??(????30??1000??)????????
[...]

This is pretty much what happens without a charset declaration.

pi




