histogram of wordlist.db

Matthias Andree matthias.andree at gmx.de
Sat Jan 3 16:01:25 CET 2004


On Sat, 03 Jan 2004, David Relson wrote:

> Greetings,
> 
> Have you ever wondered what it would look like if you had a histogram of
> the spamicity scores of all the tokens in your wordlist?  Mine looks
> like:

The "bath tub curve".

What worries me more is the increasing amount of spam that bogofilter
cannot figure out, which uses random tokens from a dictionary: it's
always multipart/alternative with utter junk in the text/plain part and
a bit of "usual spam" with deliberate misspellings ("vigara" and the
like).  Until now, all of these spams have had a URL in common, a web
address, which I stuffed into my Postfix body_checks, but once they
figure out how to create web aliases, that will no longer work.

I'm willing to forward some of the spam off-list.

The histogram of such spam looks like this:

X-Bogosity: No, tests=bogofilter, spamicity=0.500000,
    version=0.16.0.cvs.CVStime_20040102_163533
   int  cnt   prob  spamicity histogram
  0.00   95 0.006758 0.002128 ################################
  0.10   10 0.159268 0.007608 ####
  0.20   26 0.260562 0.030739 #########
  0.30   42 0.356437 0.078117 ###############
  0.40    0 0.000000 0.078117 
  0.50    0 0.000000 0.078117 
  0.60   33 0.642097 0.149531 ############
  0.70   33 0.740073 0.220945 ############
  0.80   24 0.847833 0.276591 #########
  0.90  143 0.980940 0.515557 ################################################

Granted, after training, the spamicity is 1.0, but the next time such a
message arrives, the same problem shows up again because there are once
more about 1,000 unrecognized tokens in the message.  Anyway, this is
the histogram after training:

X-Bogosity: Yes, tests=bogofilter, spamicity=1.000000,
    version=0.16.0.cvs.CVStime_20040102_163533
   int  cnt   prob  spamicity histogram
  0.00   22 0.019095 0.003752 ##
  0.10   11 0.152382 0.016692 #
  0.20   26 0.261234 0.058540 ##
  0.30   47 0.354514 0.130559 ####
  0.40    0 0.000000 0.130559 
  0.50    0 0.000000 0.130559 
  0.60   35 0.644277 0.232486 ###
  0.70   72 0.734859 0.393187 ######
  0.80   31 0.852838 0.458085 ###
  0.90  678 0.991621 0.791740 ################################################

This is the kind of Bayes-evasion spam that fills your database with
junk you'll never see again (but which looks like natural-language
tokens) and that isn't recognized the first time such a spam arrives.

Oh, and it's about the only spam that makes it past
bogofilter+spamassassin.  I've tried to update SA, but its self-test
fails, so I'm not going to install it.

I wonder if the "unknown token" score, currently 0.415, should be
raised as more messages accumulate in the database, or if this can be
coerced into the model somehow, since "I haven't yet seen this token in
ten thousand mails" certainly is information we don't handle gently at
the moment.

I'm on the verge of bouncing all multipart/alternative mail sent to my
accounts.



