further advice for asian spam and spam assassin text

p at dirac.org
Wed Sep 24 16:58:45 CEST 2003


dear bogofilter users,

here's spam that spam assassin caught, but bogofilter didn't.  note that
i run bogofilter without the -u option:

   :0fw
   | bogofilter -f -p -l -e -v -3

instead, i train bogofilter by hand using some mutt macros:

   macro pager S "<pipe-entry>bogofilter -s"
   macro index S "<pipe-entry>bogofilter -s"
   macro pager N "<pipe-entry>bogofilter -n"
   macro index N "<pipe-entry>bogofilter -n"

so bogofilter hasn't been trained on this email yet.
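
since those macros hand the message to bogofilter exactly as it sits in
my mailbox, any spam assassin markup goes along with it.  here's a
minimal sketch of a filter that could sit in front of "bogofilter -s".
the script name, the "-" argument and the idea that the markup is limited
to X-Spam-* headers, the "*****SPAM*****" subject tag and "SPAM:" body
lines are all my own assumptions, based only on the sample message
below.  i don't know whether bogofilter already copes with this on its
own, so treat it as a sketch rather than a recommendation.

```python
#!/usr/bin/env python3
"""Sketch: strip SpamAssassin 2.x style markup from one message.

Assumes LF line endings and the markup seen in the forwarded sample:
X-Spam-* headers, a "*****SPAM*****" subject tag, "SPAM:" report lines.
"""
import re
import sys

# The subject tag SpamAssassin put in front of the original subject.
SUBJECT_TAG = re.compile(rb"^(Subject:[ \t]*)\*{5}SPAM\*{5}[ \t]*", re.IGNORECASE)


def strip_sa_markup(raw: bytes) -> bytes:
    """Return the message without X-Spam-* headers, subject tag or SPAM: lines."""
    head, sep, body = raw.partition(b"\n\n")
    kept = []
    skipping = False
    for line in head.split(b"\n"):
        if line[:1] in (b" ", b"\t"):
            # Continuation line: shares the fate of the header it belongs to.
            if not skipping:
                kept.append(line)
            continue
        skipping = line.lower().startswith(b"x-spam-")
        if not skipping:
            kept.append(SUBJECT_TAG.sub(rb"\1", line))
    # Crude: drops every body line starting with "SPAM:", not only the report.
    body_lines = [l for l in body.split(b"\n") if not l.startswith(b"SPAM:")]
    return b"\n".join(kept) + sep + b"\n".join(body_lines)


if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    # Filter mode: read one message on stdin, write the cleaned one to stdout.
    sys.stdout.buffer.write(strip_sa_markup(sys.stdin.buffer.read()))
```

with that saved under a hypothetical name like strip_sa_markup.py, the
spam macro could become something like:

   macro index S "<pipe-entry>strip_sa_markup.py - | bogofilter -s"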


i've read the bogofilter FAQ entry "What can I do about Asian spam",
and i'd like a little more guidance:

1. if i "bogofilter -s" this email, will the spam assassin text pollute my
   wordlist?  for instance, i don't want the word "spam" to enter the
   list of bad words since most spam doesn't say "spam".

2. i'm still debating whether i'm going to take the FAQ's option 1 (let
   bogofilter add the tokens) or option 2 (don't train bogofilter on
   asian spam at all).

   the FAQ says that option 1 can be expensive.  can someone give me an
   idea of how expensive?  i'm on a two-person home linux box, a dual
   celeron 333.  ignoring viruses, i get between 10 and 30 spams a day,
   about a third of which are asian spam.  my wife says she gets 1
   or 2 a week (i have no idea how...).

   by expensive, do we mean that my machine is going to swap
   every time bogofilter runs?  or do we mean "it's expensive
   for a 100+ person system"?

3. i've temporarily disabled my non-english procmail filter in an attempt
   to reacquaint myself with the type and frequency of spam that's
   coming in.  my recipe looks like:

      # Non English filter.  koi8=Ukrainian, win1250=central/eastern europe
      # win1251=cyrillic, win1253=greek, win1254=turk, win1255=hebrew,
      # win1256=arab
      # win1257=baltic, win1258=vietnamese
      #
      :0 HB:
      * charset=.*(koi8|windows-125[01345678]|big-?5)
      etc/UNREADABLE

   the bogofilter FAQ recommends:


    UNREADABLE='[^?"]*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987'

    :0:
    * 1^0 $ ^Subject:.*=\?($UNREADABLE)
    * 1^0 $ ^Content-Type:.*charset="?($UNREADABLE)
    spam-unreadable

    :0:
    * ^Content-Type:.*multipart
    * B ?? $ ^Content-Type:.*^?.*charset="?($UNREADABLE)
    spam-unreadable

   but it looks like neither of these recipes would've worked anyway, since
   the relevant headers are:

      Content-Type: text/plain;
      Content-Transfer-Encoding: 7bit

   neither one declares a charset, and the Subject is raw 8-bit text
   rather than a MIME encoded word, so there's nothing for the above
   recipes to match.

   is there a better, more reliable way of catching non-english spam?  i
   imagine it's the responsibility of the spammer's MTA (or MUA??) to
   write that header.  and if i were a spammer, i'd suppress any header
   that can be used to flag my email as spam.
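
to make question 3 concrete: one thing that doesn't depend on any
charset header is simply counting raw 8-bit bytes in the Subject and the
body.  below is a minimal sketch of that idea.  the function names, the
"-" argument and the 0.3 threshold are my own guesses, not anything from
the FAQ.  it only helps when the message arrives as raw 8-bit text like
the sample below (base64 or quoted-printable bodies would slip through),
it would also flag legitimate mail written in those languages, and any
spam assassin report text in the body dilutes the ratio.

```python
#!/usr/bin/env python3
"""Sketch: flag messages whose Subject or body is mostly 8-bit bytes."""
import sys


def high_bit_ratio(data: bytes) -> float:
    """Fraction of non-whitespace bytes with the high bit set."""
    visible = [b for b in data if b not in b" \t\r\n"]
    if not visible:
        return 0.0
    return sum(b >= 0x80 for b in visible) / len(visible)


def looks_unreadable(raw: bytes, threshold: float = 0.3) -> bool:
    """True if the Subject or the body exceeds the (guessed) threshold."""
    head, _, body = raw.partition(b"\n\n")
    subject = b""
    for line in head.split(b"\n"):
        if line.lower().startswith(b"subject:"):
            subject = line[len(b"subject:"):]
            break
    return (high_bit_ratio(subject) >= threshold
            or high_bit_ratio(body) >= threshold)


if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    # Exit status 0 means "looks unreadable", 1 means it doesn't.
    sys.exit(0 if looks_unreadable(sys.stdin.buffer.read()) else 1)
```

the threshold is well below 1 on purpose: in double-byte encodings like
big5 the second byte of a character can fall in the plain ASCII range,
so even pure chinese text won't reach a ratio of 1.  how best to hook the
exit status into procmail i'll leave open.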


note that this email arrived at my account at school and was
dot-forwarded to my main workstation, which is connected to the internet
via cable modem.


thanks, and sorry for the long post.

pete




----- Forwarded message from XAEvxzl at iris.seed.net.tw -----

Return-path: psalzman at lifshitz.ucdavis.edu
Envelope-to: p at dirac.org
Delivery-date: Wed, 24 Sep 2003 07:01:34 -0700
Received: from lifshitz.ucdavis.edu ([169.237.42.72])
	by gabriel.localdomain with esmtp (Exim 3.36 #1 (Debian))
	id 1A2ACz-0001gJ-00
	for <p at dirac.org>; Wed, 24 Sep 2003 07:01:33 -0700
Received: (from psalzman at localhost)
	by lifshitz.ucdavis.edu (8.11.6/8.11.6) id h8ODwEO10626
	for p at dirac.org; Wed, 24 Sep 2003 06:58:14 -0700
Received: from A ([61.20.224.166])
	by lifshitz.ucdavis.edu (8.11.6/8.11.6) with SMTP id h8ODwAo10618
	for <psalzman at landau.ucdavis.edu>; Wed, 24 Sep 2003 06:58:10 -0700
Date: Wed, 24 Sep 2003 06:58:10 -0700
Received: from pavo
	by ibm.com with SMTP id qaMATjUicVYctJWB6szAYS9LU25;
	Mon, 24 Sep 2001 22:04:45 +0800
Message-ID: <qUs4xh6d at seed.net.tw>
From: XAEvxzl at iris.seed.net.tw
To: 9zTs at tpts8.seed.net.tw
Subject: *****SPAM***** ¥þ³¡¥X²M
X-Mailer: Dhy60reftIFDxJG85AKi8ohxgCxX
Content-Type: text/plain;
X-Priority: 3
X-MSMail-Priority: Normal
Content-Transfer-Encoding: 7bit
X-MIME-Autoconverted: from Quoted-Printable to 8bit by lifshitz.ucdavis.edu id h8ODwAo10618
X-Spam-Status: Yes, hits=14.2 required=5.0
	tests=HEADER_8BITS,MAILTO_TO_SPAM_ADDR,MISSING_MIMEOLE,
	      MISSING_OUTLOOK_NAME,NO_REAL_NAME,RATWARE_HASH_2,
	      SPAM_PHRASE_00_01,SUBJ_FULL_OF_8BITS,UPPERCASE_25_50
	version=2.43
X-Spam-Flag: YES
X-Spam-Level: **************
X-Spam-Checker-Version: SpamAssassin 2.43 (1.115.2.20-2002-10-15-exp)
X-Spam-Prev-Content-Transfer-Encoding: 8bit
X-Bogosity: No, tests=bogofilter, spamicity=0.500634, version=0.14.2.cvs.20030804
   int  cnt   prob  spamicity histogram
  0.00   13 0.001914 0.000488 #############
  0.10    4 0.175398 0.014667 ####
  0.20    0 0.000000 0.014667 
  0.30    1 0.339654 0.021950 #
  0.40    0 0.000000 0.021950 
  0.50    0 0.000000 0.021950 
  0.60   12 0.642097 0.184099 ############
  0.70    8 0.738586 0.272007 ########
  0.80   23 0.851632 0.443272 #######################
  0.90   15 0.945870 0.517926 ###############
Content-Length: 2038
Lines: 43

SPAM: -------------------- Start SpamAssassin results ----------------------
SPAM: This mail is probably spam.  The original message has been altered
SPAM: so you can recognise or block similar unwanted mail in future.
SPAM: See http://spamassassin.org/tag/ for more details.
SPAM: 
SPAM: Content analysis details:   (14.20 hits, 5 required)
SPAM: NO_REAL_NAME       (2.5 points)  From: does not include a real name
SPAM: HEADER_8BITS       (2.4 points)  Headers include 3 consecutive 8-bit characters
SPAM: RATWARE_HASH_2     (1.1 points)  Bulk email software fingerprint (hash 2) found in headers
SPAM: SPAM_PHRASE_00_01  (0.8 points)  BODY: Spam phrases score is 00 to 01 (low)
SPAM:                    [score: 0]
SPAM: MAILTO_TO_SPAM_ADDR (0.7 points)  URI: Includes a link to a likely spammer email address
SPAM: SUBJ_FULL_OF_8BITS (3.8 points)  Subject is full of 8-bit characters
SPAM: MISSING_MIMEOLE    (0.5 points)  Message has X-MSMail-Priority, but no X-MimeOLE
SPAM: MISSING_OUTLOOK_NAME (1.1 points)  Message looks like Outlook, but isn't
SPAM: UPPERCASE_25_50    (1.3 points)  message body is 25-50% uppercase
SPAM: 
SPAM: -------------------- End of SpamAssassin results ---------------------

?M???S??(????30??1000??)????????
???|???A ?????@??  ?????K???w?Q???????t??10-20???@??????

?C???B?{???B?L?XAV????????
   ?w?????[,?C??NT$30??
   http://ndf.imess.net   http://xnoodler.yoll.net
   ???h???Z  ????????
?J?????????X???I--????--?s?X--?c???????Y?i
?s?W???h????2?M<1-2> ?C?M18??600??  ?O???D??
?m?W?G?????g?u???m?W
?E ?l???????G?????????g
?E ???????}?G?????g?????a?}?????????l?F?H?c
?E ?q???G?????????g(?????d?U???????X)
?E ?q?l?H?c?G?????????g?A?_?h?L?k?T?{?z???q??
???N?H?W???r?????????????????s?P?q??????
?l?H???q???H?c(??2???H?c???H?o???~???????^
?@
 xnoodler at yahoo.com.tw   topfile2002 at yahoo.com.tw 
?q??????:
?@.???l???N???f??????????,???l???H???N?F???e????,?z?A???L???Y?i.
?G.?????????c?????q???t??,???????H???}????,???q????,??email???H???????U?z?H?c
?T.????30?????@?????????s??.?t?[150???l??.????3000???K?l??



----- End forwarded message -----

-- 
GPG Instructions: http://www.dirac.org/linux/gpg
GPG Fingerprint: B9F1 6CF3 47C4 7CD8 D33E 70A9 A3B9 1945 67EA 951D



