Filter breakers

Stephen Davies scldad at sdc.com.au
Fri Apr 4 10:40:13 CEST 2008


G'day David.

I hadn't checked the month total frequencies. I get:

X-Bogosity: Ham, tests=bogofilter, spamicity=0.000006, version=1.1.5
                                        n    pgood     pbad      fw     U
  "head:Oct"                          366  0.015698  0.000350  0.021853 +
  "head:May"                          133  0.005385  0.000148  0.026900 +
  "head:Aug"                          292  0.011515  0.000346  0.029240 +
  "head:Mar"                          431  0.015526  0.000609  0.037777 +
  "head:Sep"                          247  0.008880  0.000350  0.037986 +
  "head:Nov"                          479  0.017072  0.000689  0.038819 +
  "head:Apr"                          168  0.003896  0.000381  0.089081 +
  "head:Dec"                          632  0.013750  0.001493  0.097935 +
  "head:Jan"                          798  0.015354  0.002018  0.116175 +
  "head:Feb"                         1119  0.018906  0.003004  0.137121 -
  "head:Jul"                          120  0.001719  0.000343  0.166291 -
  "head:Jun"                          163  0.000917  0.000560  0.379134 -
  N_P_Q_S_s_x_md                        9  1.000000  0.000011  0.000006
                                           0.017800  0.520000  0.375000

As you point out, these are pretty meaningless in a database the size of mine.
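
For reference, the fw column in output like the above appears to follow
Robinson's smoothing of each token's spam share. The sketch below assumes that
the three numbers on the last line of the table are the smoothing strength
(0.0178), the assumed prior (0.52) and the minimum deviation from 0.5 (0.375)
that earns a "+" in the U column. The names token_weight, robs, robx and
min_dev are illustrative only; this is not bogofilter's own code.

```python
def token_weight(n, pgood, pbad, robs=0.0178, robx=0.52):
    """Smoothed spam probability for one token (Robinson-style)."""
    p = pbad / (pgood + pbad)            # raw share of the token's mass that is spam
    return (robs * robx + n * p) / (robs + n)


def is_used(fw, min_dev=0.375):
    """Assumed rule: a token counts only if it is far enough from neutral 0.5."""
    return abs(fw - 0.5) >= min_dev


# (n, pgood, pbad) copied from two rows of the table above.
rows = {
    "head:Oct": (366, 0.015698, 0.000350),
    "head:Jun": (163, 0.000917, 0.000560),
}

for token, (n, pgood, pbad) in rows.items():
    fw = token_weight(n, pgood, pbad)
    print(f"{token}  fw ~ {fw:.4f}  used: {is_used(fw)}")
```

Because the inputs are the rounded pgood and pbad values, the results agree
with the reported fw only to about four decimal places. A token with equal
pgood and pbad comes out very close to 0.5 once n is large, which is the
neutral case.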

When did headers start being included? Probably the bulk of my database is 
several years old. Do you think something like bogoutil -m wordlist.db -a 
20050101 or bogoutil -m wordlist.db -c 500 might help?
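
To get a feel for what pruning by count or by date would discard before
touching the real wordlist, a text dump can be filtered offline. The sketch
below assumes a dump with one token per line in the form
"token spamcount goodcount YYYYMMDD". That layout, the sample lines and the
helper names are assumptions for illustration, and the exact semantics of
-a and -c should be checked against the bogoutil man page.

```python
def parse_dump_line(line):
    """Split one dump line into (token, spam_count, good_count, date)."""
    token, spam, good, date = line.rsplit(None, 3)
    return token, int(spam), int(good), date


def survives_count(line, min_count):
    """True if the token's combined count is at least min_count."""
    _, spam, good, _ = parse_dump_line(line)
    return spam + good >= min_count


def survives_date(line, oldest):
    """True if the token was last updated on or after oldest (YYYYMMDD)."""
    *_, date = parse_dump_line(line)
    return date >= oldest  # YYYYMMDD strings sort chronologically


# Hypothetical dump lines, not taken from any real wordlist.
sample = [
    "head:Apr 20 148 20080404",
    "to:anonymous 20000 400 20080403",
    "example-token 912 3 20040917",
]

by_count = [line for line in sample if survives_count(line, 500)]
by_date = [line for line in sample if survives_date(line, "20050101")]
print(len(by_count), "of", len(sample), "tokens pass the count filter")
print(len(by_date), "of", len(sample), "tokens pass the date filter")
```

In this made-up sample a count threshold of 500 drops head:Apr itself, so a
count-based prune may remove far more than just the stale entries.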

Cheers and thanks,
Stephen

On Friday 04 April 2008 13:03, David Relson wrote:
> Hello Stephen,
>
> Several things come to mind:
>
> First, an ignore list can be built to keep particular tokens from
> affecting the results.
>
> Second, it's surprising that head:Apr is strongly ham.  In my
> experience common tokens (like months) occur in comparable numbers of
> ham and spam messages, i.e. 1/12 of all ham and 1/12 of all spam occurs
> in Jan, in Feb, etc, which leads to neutral scores for such tokens.
> Here are the numbers for my monthly tokens:
>
> ## echo Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec | bogofilter -C -vvv
>
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.520000, version=1.1.6
>                     n    pgood     pbad      fw     U
>   "head:Aug"    39554  0.057297  0.034561  0.376244 -
>   "head:Oct"    90032  0.128219  0.079064  0.381431 -
>   "head:Jul"    39883  0.046522  0.036883  0.442220 -
>   "head:Sep"    74158  0.078445  0.070038  0.471692 -
>   "head:May"    64285  0.065717  0.061127  0.481906 -
>   "head:Nov"   100958  0.100641  0.096462  0.489399 -
>   "head:Dec"   116720  0.111309  0.112435  0.502515 -
>   "head:Jun"    58219  0.054415  0.056281  0.508433 -
>   "head:Apr"    94728  0.087386  0.091784  0.512273 -
>   "head:Jan"   115224  0.102550  0.112320  0.522734 -
>   "head:Feb"   110755  0.076473  0.111961  0.594166 -
>   "head:Mar"   147972  0.094211  0.151023  0.615831 -
>   N_P_Q_S_s_x_md    0  0.000000  0.000000  0.520000
>                        0.017800  0.520000  0.375000
>
> Also, your count for head:Apr is only 161.  This indicates that only
> 161 messages from Apr have been registered in your wordlist.  It seems
> like a small count for a wordlist as large as yours.
>
> Lastly, your sample shows a score of 0.5 being classified as Ham.  With
> default parameters, bogofilter classifies scores from 0.45 to 0.99 as
> Unsure.  Are you using binary (ham/spam) or ternary (ham/spam/unsure)
> classification?  Perhaps 3 state classification with different ham/spam
> cutoff values would help.
>
> HTH,
>
> David
>
> On Fri, 4 Apr 2008 12:01:28 +0930 Stephen Davies wrote:
> > I am still getting too many "obvious" spams slipping through my
> > bogofilter setup.
> >
> > The more I investigate, the more it seems that quite innocuous
> > headers are at least part of my problem.
> >
> > The following bogoutil output is quite common. The obviously spam
> > components are outweighed by quite harmless header tokens - one of
> > the most commonly appearing being the current month header (head:Apr).
> >
> > Is there any way to push such header tokens out of the picture?
> > (In the example below, for instance, the to:anonymous token is ignored
> > even though the word counts are quite skewed: 23351 to 388.)
> >
> > My database is some 200 MB with 3.5 million tokens.
> >
> > TIA,
> > Stephen Davies
> >
> > X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000, version=1.1.5
>
> ...[snip]...
>
> > ========================================================================
> > This email is for the person(s) identified above, and is confidential
> > to the sender and the person(s).  No one else is authorised to use or
> > disseminate this email or its contents.
> >
> > Stephen Davies Consulting                            Voice: 08-8177 1595
> > Adelaide, South Australia.                             Fax: 08-8177 0133
> > Computing & Network solutions.                       Mobile:0403 0405 83
> > _______________________________________________
> > Bogofilter mailing list
> > Bogofilter at bogofilter.org
> > http://www.bogofilter.org/mailman/listinfo/bogofilter



