Filter breakers
David Relson
relson at osagesoftware.com
Fri Apr 4 05:33:29 CEST 2008
Hello Stephen,
Several things come to mind:
First, an ignore list can be built to keep particular tokens from
affecting the results.
Second, it's surprising that head:Apr is strongly ham. In my
experience, common tokens (like month names) occur in comparable numbers
of ham and spam messages, i.e. roughly 1/12 of all ham and 1/12 of all
spam arrive in Jan, in Feb, etc., which leads to near-neutral scores for
such tokens.
Here are the numbers for my monthly tokens:
## echo Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec | bogofilter -C -vvv
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.520000, version=1.1.6
                      n     pgood      pbad        fw  U
"head:Aug"        39554  0.057297  0.034561  0.376244  -
"head:Oct"        90032  0.128219  0.079064  0.381431  -
"head:Jul"        39883  0.046522  0.036883  0.442220  -
"head:Sep"        74158  0.078445  0.070038  0.471692  -
"head:May"        64285  0.065717  0.061127  0.481906  -
"head:Nov"       100958  0.100641  0.096462  0.489399  -
"head:Dec"       116720  0.111309  0.112435  0.502515  -
"head:Jun"        58219  0.054415  0.056281  0.508433  -
"head:Apr"        94728  0.087386  0.091784  0.512273  -
"head:Jan"       115224  0.102550  0.112320  0.522734  -
"head:Feb"       110755  0.076473  0.111961  0.594166  -
"head:Mar"       147972  0.094211  0.151023  0.615831  -
N_P_Q_S_s_x_md        0  0.000000  0.000000  0.520000
                         0.017800  0.520000  0.375000
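
To make the fw column easier to read, here is a small Python sketch that recomputes it from pgood and pbad. It assumes the usual Robinson formula, fw = (s*x + n*p) / (s + n) with p = pbad / (pgood + pbad), and it assumes that the three values on the last output line are s (robs), x (robx) and the minimum deviation, in that order. The function names are made up for illustration and are not part of bogofilter.

```python
# Rows copied from the bogofilter -vvv output above:
# (token, n, pgood, pbad, fw as reported by bogofilter)
ROWS = [
    ("head:Aug", 39554, 0.057297, 0.034561, 0.376244),
    ("head:Oct", 90032, 0.128219, 0.079064, 0.381431),
    ("head:Jul", 39883, 0.046522, 0.036883, 0.442220),
    ("head:Sep", 74158, 0.078445, 0.070038, 0.471692),
    ("head:May", 64285, 0.065717, 0.061127, 0.481906),
    ("head:Nov", 100958, 0.100641, 0.096462, 0.489399),
    ("head:Dec", 116720, 0.111309, 0.112435, 0.502515),
    ("head:Jun", 58219, 0.054415, 0.056281, 0.508433),
    ("head:Apr", 94728, 0.087386, 0.091784, 0.512273),
    ("head:Jan", 115224, 0.102550, 0.112320, 0.522734),
    ("head:Feb", 110755, 0.076473, 0.111961, 0.594166),
    ("head:Mar", 147972, 0.094211, 0.151023, 0.615831),
]

# Assumed meaning of the last output line: s (robs), x (robx), min_dev.
ROBS, ROBX, MIN_DEV = 0.017800, 0.520000, 0.375000


def token_score(n, pgood, pbad, s=ROBS, x=ROBX):
    """Robinson-style token score (assumed formula, see lead-in)."""
    p = pbad / (pgood + pbad)
    return (s * x + n * p) / (s + n)


def is_used(fw, min_dev=MIN_DEV):
    """A token only counts if its score is far enough from neutral 0.5."""
    return abs(fw - 0.5) >= min_dev


for token, n, pgood, pbad, reported in ROWS:
    fw = token_score(n, pgood, pbad)
    flag = "+" if is_used(fw) else "-"
    print(f"{token}  computed={fw:.6f}  reported={reported:.6f}  U={flag}")
```

Under these assumptions the computed values agree with the reported fw column, and none of the month tokens deviates from 0.5 by as much as 0.375, which is consistent with the "-" in the U column for every row.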
Also, your count for head:Apr is only 161. This indicates that only
161 messages from Apr have been registered in your wordlist. It seems
like a small count for a wordlist as large as yours.
Lastly, your sample shows a score of 0.5 being classified as Ham. With
default parameters, bogofilter classifies scores from 0.45 to 0.99 as
Unsure. Are you using binary (ham/spam) or ternary (ham/spam/unsure)
classification? Perhaps three-state classification with different
ham/spam cutoff values would help.
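
As a rough illustration of the two modes, here is a minimal Python sketch of the cutoff logic. It assumes the defaults mentioned above (ham cutoff 0.45, spam cutoff 0.99) and assumes that a ham cutoff of 0 means two-state classification. The function is hypothetical and the handling of scores exactly on a cutoff is a guess, not taken from the bogofilter source.

```python
def classify(score, ham_cutoff=0.45, spam_cutoff=0.99):
    """Map a spamicity score to Ham / Unsure / Spam.

    Assumption: ham_cutoff == 0 selects binary (ham/spam) mode,
    anything else selects ternary (ham/spam/unsure) mode.
    """
    if score >= spam_cutoff:
        return "Spam"
    if ham_cutoff <= 0:
        # Binary mode: everything below the spam cutoff is ham.
        return "Ham"
    if score <= ham_cutoff:
        return "Ham"
    return "Unsure"


# Ternary mode with the default cutoffs.
print(classify(0.50))                  # between the cutoffs
print(classify(0.52))                  # between the cutoffs

# Binary mode: the same score is reported as ham.
print(classify(0.50, ham_cutoff=0.0))
```

With the default cutoffs a score of 0.5 falls between the two cutoffs and comes out as Unsure; it only comes out as Ham in the binary case, which would match the X-Bogosity line in your sample.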
HTH,
David
On Fri, 4 Apr 2008 12:01:28 +0930
Stephen Davies wrote:
> I am still getting too many "obvious" spams slipping through my
> bogofilter setup.
>
> The more I investigate, the more it seems that quite innocuous
> headers are at least part of my problem.
>
> The following bogoutil output is quite common. The obviously spam
> components are outweighed by quite harmless header tokens - one of
> the most commonly appearing being the current month header (head:Apr).
>
> Is there any way to push such header tokens out of the picture?
> (In the example below, for instance, the to:anonymous token is ignored
> even though the word counts are quite skewed: 23351 to 388.)
>
> My database is some 200Mb with 3.5 million tokens.
>
> TIA,
> Stephen Davies
>
> X-Bogosity: Ham, tests=bogofilter, spamicity=0.500000, version=1.1.5
...[snip]...
> ========================================================================
> This email is for the person(s) identified above, and is confidential
> to the sender and the person(s). No one else is authorised to use or
> disseminate this email or its contents.
>
> Stephen Davies Consulting            Voice:  08-8177 1595
> Adelaide, South Australia.           Fax:    08-8177 0133
> Computing & Network solutions.       Mobile: 0403 0405 83