cannot filter virus letters

David Relson relson at osagesoftware.com
Thu Jan 29 13:34:37 CET 2009


On Thu, 29 Jan 2009 15:03:37 +0300
Dmitry wrote:

>  On Saturday 24 January 2009, Tom Anderson wrote:
> > Dmitry wrote:
> > > After training with the command `bogofilter -s < virus-letter`
> > > spamicity is still too low to be identified as spam. I repeat
> > > training with similar letters (different subject, different
> > > document name in the attachment), but nothing helps to stop this
> > > kind of spam.
> > >
> > > This is the output of the command `bogofilter -vvv`:
> > >
> > > X-Bogosity: Unsure, tests=bogofilter, spamicity=0.519097, version=1.1.5
> > >                        n     pgood      pbad        fw U
> > > "document"             2  0.021739  0.000065  0.007563 +
> > > "rcvd:lovepresent.ru" 90  0.500000  0.004387  0.008798 +
> >
> > Looks like you've got too many friends over at lovepresent.ru!
> >
> > I prefer to do "exhaustive" training, which means to keep training
> > the same spam over and over again until it classifies as spammy.
> > Then you'll be assured not to receive one too similar again.
> 
> Sorry, exhaustive training doesn't change anything in my case.
> Spamicity value is still less than 0.52. Tuning robx/robs gives me
> strange results. Some good letters become spammy after that. I think
> the algorithm has to be changed somehow for small letters with a few
> words in the message body. Otherwise, hammy headers always get a
> greater value and never let the spamicity score get high enough.
> 
> -- 
> Dmitry

Hello Dmitry,

In its earliest days, bogofilter scored a message based on the 15 tokens
with the most extreme scores, i.e. scores furthest from 0.5.  Years ago
bogofilter's scoring algorithm was changed, and it now uses the tokens
whose scores are at least 0.375 away from 0.5 (0.375 being the default
value of the 'min_dev' parameter).

I've known for quite a while that using min_dev can cause the spamicity
score to be computed based on very few tokens.  It seems that your
message is being scored based on a single token.
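
To make the effect concrete, here is a minimal Python sketch of
min_dev-style token selection.  It is an illustration only, not
bogofilter's actual code: the function name, the token scores, and the
exact comparison (>= rather than >) are all assumptions.

```python
# Sketch only: illustrates min_dev-style selection, not bogofilter's real code.
MIN_DEV = 0.375  # default min_dev value mentioned above


def select_tokens(scores, min_dev=MIN_DEV):
    """Keep only tokens whose score lies at least min_dev away from 0.5."""
    return {tok: fw for tok, fw in scores.items() if abs(fw - 0.5) >= min_dev}


# Hypothetical token scores for a short message (made up for illustration).
scores = {
    "viagra": 0.99,
    "hello": 0.45,
    "meeting": 0.10,
    "the": 0.52,
    "invoice": 0.80,
}

selected = select_tokens(scores)
print(f"{len(selected)} of {len(scores)} tokens used for scoring: {sorted(selected)}")
```

With a short message, only a handful of tokens may pass the cutoff, so
one or two extreme tokens can decide the final score.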

I see two things that can be done.

First, you can change the value of min_dev in your bogofilter
configuration file.

Second, bogofilter can be modified so that the number of tokens used
for scoring a message can be set by the user.  The following parameters
come to mind:

   token_count=n -- always score using 'n' tokens
   min_token_count=n -- use at least 'n' tokens for scoring
   max_token_count=n -- use at most 'n' tokens for scoring.

token_count isn't strictly necessary as using the same value for min
and max would have the same effect.  Also token_count wouldn't be
allowed with the min or max counts.
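
One way such parameters could interact with min_dev is sketched below in
Python.  This is purely hypothetical: none of it exists in bogofilter,
and the function name, the padding rule, and the sample scores are
assumptions made for illustration.

```python
# Hypothetical sketch of the proposed parameters; nothing here exists in bogofilter.
def choose_tokens(scores, min_dev=0.375, min_token_count=0, max_token_count=None):
    """Pick the tokens to score with, most extreme (furthest from 0.5) first."""
    # Rank all tokens by distance from 0.5, largest distance first.
    ranked = sorted(scores.items(), key=lambda kv: abs(kv[1] - 0.5), reverse=True)
    # Start with the tokens that pass the min_dev cutoff.
    chosen = [kv for kv in ranked if abs(kv[1] - 0.5) >= min_dev]
    # Too few tokens passed: pad with the next most extreme ones.
    if len(chosen) < min_token_count:
        chosen = ranked[:min_token_count]
    # Too many tokens passed: keep only the most extreme ones.
    if max_token_count is not None:
        chosen = chosen[:max_token_count]
    return [tok for tok, _ in chosen]


# Made-up scores, for illustration only.
scores = {"viagra": 0.99, "hello": 0.45, "meeting": 0.10, "the": 0.52, "invoice": 0.80}

print(choose_tokens(scores))                      # min_dev cutoff only
print(choose_tokens(scores, min_token_count=4))   # padded up to four tokens
print(choose_tokens(scores, max_token_count=1))   # capped at one token
```

Setting min_token_count and max_token_count to the same value gives the
effect of token_count=n, which is why a separate token_count parameter
would be redundant.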

Parameters like these are something I've thought about, but haven't
seen much need for adding.  It would help you, so perhaps I'll find
time to implement it.

Regards,

David

