Filter breakers

David Relson relson at osagesoftware.com
Fri Apr 4 05:33:29 CEST 2008


Hello Stephen,

Several things come to mind:

First, an ignore list can be built to keep particular tokens from
affecting the results.

Second, it's surprising that head:Apr is strongly ham.  In my
experience common tokens (like months) occur in comparable numbers of
ham and spam messages, i.e. roughly 1/12 of all ham and 1/12 of all
spam arrives in Jan, in Feb, etc., which leads to near-neutral scores
for such tokens.  Here are the numbers for my monthly tokens:

## echo Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec | bogofilter -C -vvv

X-Bogosity: Unsure, tests=bogofilter, spamicity=0.520000, version=1.1.6
                    n    pgood     pbad      fw     U
  "head:Aug"    39554  0.057297  0.034561  0.376244 -
  "head:Oct"    90032  0.128219  0.079064  0.381431 -
  "head:Jul"    39883  0.046522  0.036883  0.442220 -
  "head:Sep"    74158  0.078445  0.070038  0.471692 -
  "head:May"    64285  0.065717  0.061127  0.481906 -
  "head:Nov"   100958  0.100641  0.096462  0.489399 -
  "head:Dec"   116720  0.111309  0.112435  0.502515 -
  "head:Jun"    58219  0.054415  0.056281  0.508433 -
  "head:Apr"    94728  0.087386  0.091784  0.512273 -
  "head:Jan"   115224  0.102550  0.112320  0.522734 -
  "head:Feb"   110755  0.076473  0.111961  0.594166 -
  "head:Mar"   147972  0.094211  0.151023  0.615831 -
  N_P_Q_S_s_x_md    0  0.000000  0.000000  0.520000
                       0.017800  0.520000  0.375000
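As a rough illustration of why these tokens come out near neutral, the
fw column can be approximated from the other columns using Robinson's
f(w) formula.  This is only a sketch, not bogofilter's code: it assumes
that n is the token's total count, that the last line of the output
gives s (0.0178), x (0.52) and the minimum deviation md (0.375), and
that "-" in the U column means the token was left out of the final
score.  The function names are made up for the example.

```python
# Sketch of Robinson's f(w), using values read off the output above.
# Assumption: the last output line lists s, x and md (min_dev).
ROBS = 0.0178     # s: weight given to the prior
ROBX = 0.52       # x: score assumed for a token never seen before
MIN_DEV = 0.375   # md: minimum distance from 0.5 for a token to count


def token_score(n, pgood, pbad, s=ROBS, x=ROBX):
    """Approximate fw from a token's count and its ham/spam proportions."""
    total = pgood + pbad
    if total == 0:
        return x  # no evidence at all, so fall back to the prior
    p = pbad / total
    return (s * x + n * p) / (s + n)


def is_used(fw, min_dev=MIN_DEV):
    """Assumed rule: a token counts only if it is far enough from 0.5."""
    return abs(fw - 0.5) >= min_dev


# Three rows copied from the table above: (n, pgood, pbad).
rows = {
    "head:Aug": (39554, 0.057297, 0.034561),
    "head:Apr": (94728, 0.087386, 0.091784),
    "head:Mar": (147972, 0.094211, 0.151023),
}

for token, (n, pgood, pbad) in rows.items():
    fw = token_score(n, pgood, pbad)
    print(f"{token:10s} fw={fw:.6f} used={is_used(fw)}")
```

With counts this large the s and x terms barely matter, so fw is
essentially pbad / (pgood + pbad).  None of the month tokens gets
anywhere near 0.375 away from 0.5, which is consistent with every row
showing "-" in the last column.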

Also, your count for head:Apr is only 161.  This indicates that only
161 messages from Apr have been registered in your wordlist.  That
seems like a small count for a wordlist as large as yours.

Lastly, your sample shows a score of 0.5 being classified as Ham.  With
default parameters, bogofilter classifies scores from 0.45 to 0.99 as
Unsure.  Are you using binary (ham/spam) or ternary (ham/spam/unsure)
classification?  Perhaps three-state classification with different
ham/spam cutoff values would help.
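To make the cutoff question concrete, here is a minimal sketch of the
three-state decision using the default values mentioned above (0.45 and
0.99).  It is not bogofilter's code.  The exact handling of scores that
land right on a cutoff, and the convention that a ham cutoff of 0 means
two-state classification, are my assumptions here; classify() is a
made-up name.

```python
# Sketch of ham/spam/unsure classification with the default cutoffs.
HAM_CUTOFF = 0.45    # default ham cutoff quoted above
SPAM_CUTOFF = 0.99   # default spam cutoff quoted above


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Return "Spam", "Ham" or "Unsure" for a spamicity score.

    Assumption: a ham_cutoff of 0 turns the Unsure band off, so that
    everything below spam_cutoff is reported as Ham (two-state mode).
    """
    if score >= spam_cutoff:
        return "Spam"
    if ham_cutoff == 0 or score <= ham_cutoff:
        return "Ham"
    return "Unsure"


for score in (0.10, 0.50, 0.52, 0.995):
    print(f"{score:.3f} -> {classify(score)}")

# The same 0.5 score with the Unsure band switched off.
print(f"0.500 two-state -> {classify(0.5, ham_cutoff=0.0)}")
```

Under these assumptions a score of 0.5 is Unsure with the defaults and
only becomes Ham in a two-state setup, which is why the cutoff settings
are worth checking.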

HTH,

David

On Fri, 4 Apr 2008 12:01:28 +0930
Stephen Davies wrote:

> I am still getting too many "obvious" spams slipping through my
> bogofilter setup.
> 
> The more I investigate, the more it seems that quite innocuous
> headers are at least part of my problem.
> 
> The following bogoutil output is quite common. The obviously spam
> components are outweighed by quite harmless header tokens - one of
> the most commonly appearing being the current month header (head:Apr).
> 
> Is there any way to push such header tokens out of the picture?
> (In the example below for example, the to:anonymous token is ignored
> even though the word counts are quite skewed: 23351 to 388.)
> 
> My database is some 200Mb with 3.5 million tokens.
> 
> TIA,
> Stephen Davies
> 
> X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000, version=1.1.5

...[snip]...



