cannot filter virus letters
David Relson
relson at osagesoftware.com
Thu Jan 29 13:34:37 CET 2009
On Thu, 29 Jan 2009 15:03:37 +0300
Dmitry wrote:
> On Saturday 24 January 2009, Tom Anderson wrote:
> > Dmitry wrote:
> > > After training with the command `bogofilter -s < virus-letter`, the
> > > spamicity is still too low for the letter to be identified as spam.
> > > I repeated the training with similar letters (different subject,
> > > different document name in the attachment), but nothing helps to
> > > stop this kind of spam.
> > >
> > > This is the output of the command `bogofilter -vvv`:
> > >
> > > X-Bogosity: Unsure, tests=bogofilter, spamicity=0.519097, version=1.1.5
> > >                         n  pgood     pbad      fw        U
> > > "document"              2  0.021739  0.000065  0.007563  +
> > > "rcvd:lovepresent.ru"  90  0.500000  0.004387  0.008798  +
> >
> > Looks like you've got too many friends over at lovepresent.ru!
> >
> > I prefer to do "exhaustive" training, which means to keep training
> > the same spam over and over again until it classifies as spammy.
> > Then you'll be assured not to receive one too similar again.
>
> Sorry, exhaustive training doesn't change anything in my case.
> The spamicity value is still less than 0.52. Tuning robx/robs gives me
> strange results: some good letters become spammy after that. I think
> the algorithm has to be changed somehow for short letters with only a
> few words in the message body. Otherwise, hammy headers always get
> greater weight and never let the spamicity score get high enough.
>
> --
> Dmitry
Hello Dmitry,
In its earliest days, bogofilter scored a message based on the 15 tokens
with the most extreme scores, i.e. the scores furthest from 0.5. Years ago
bogofilter's scoring algorithm was changed, and it now uses every token
whose score is at least 0.375 away from 0.5 (0.375 being the default
value of the 'min_dev' parameter).
I've known for quite a while that using min_dev can cause the spamicity
score to be computed based on very few tokens. It seems that your
message is being scored based on a single token.
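To illustrate how min_dev can leave very few tokens, here is a minimal Python sketch. It is not bogofilter's actual code (bogofilter is written in C); the function name select_tokens, the token names, and the scores are all made up for the example, and the exact boundary comparison is an assumption.

```python
# Illustrative sketch only; NOT bogofilter's real implementation.
EVEN_ODDS = 0.5  # a score of 0.5 says nothing about ham vs. spam


def select_tokens(scores, min_dev=0.375):
    """Keep only tokens whose score is at least min_dev away from 0.5.

    'scores' maps token -> score in [0, 1]. The '>=' comparison is an
    assumption made for this sketch.
    """
    return {tok: fw for tok, fw in scores.items()
            if abs(fw - EVEN_ODDS) >= min_dev}


# Hypothetical scores for a short message (made-up values).
scores = {
    "subject:invoice": 0.62,
    "attached": 0.55,
    "document": 0.80,
    "rcvd:example.invalid": 0.01,
    "please": 0.45,
}

used_default = select_tokens(scores)        # min_dev = 0.375
used_relaxed = select_tokens(scores, 0.1)   # a smaller min_dev

print(sorted(used_default))
print(sorted(used_relaxed))
```

With these made-up scores, the default min_dev keeps only the single hammy "rcvd:" token, while a min_dev of 0.1 lets three tokens take part in scoring.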
I see two things that can be done.
First, you can lower the value of min_dev in your bogofilter
configuration file, so that more tokens qualify for scoring.
Second, bogofilter can be modified so that the number of tokens used
for scoring a message can be set by the user. The following parameters
come to mind:
  token_count=n -- always score using 'n' tokens
  min_token_count=n -- use at least 'n' tokens for scoring
  max_token_count=n -- use at most 'n' tokens for scoring
token_count isn't strictly necessary, as using the same value for
min_token_count and max_token_count would have the same effect. Also,
token_count couldn't be combined with the min or max counts.
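One possible reading of these proposed parameters is sketched below in Python. This is purely hypothetical: none of these parameters exist in bogofilter, the function name pick_scoring_tokens is invented, and the way the counts interact with min_dev (min_dev is applied first, then the minimum, then the maximum) is an assumption, not a design decision.

```python
# Hypothetical sketch of the PROPOSED parameters; nothing here exists
# in bogofilter, and the interaction with min_dev is an assumption.
def pick_scoring_tokens(scores, min_dev=0.375,
                        min_token_count=0, max_token_count=None):
    """Return (token, score) pairs to score with, most extreme first."""
    # Rank tokens by distance from 0.5, most extreme first.
    ranked = sorted(scores.items(),
                    key=lambda kv: abs(kv[1] - 0.5), reverse=True)
    # Start with the tokens that pass the min_dev test.
    chosen = [kv for kv in ranked if abs(kv[1] - 0.5) >= min_dev]
    # Too few tokens: top up with the next most extreme ones.
    if len(chosen) < min_token_count:
        chosen = ranked[:min_token_count]
    # Too many tokens: keep only the most extreme ones.
    if max_token_count is not None:
        chosen = chosen[:max_token_count]
    return chosen


# Made-up scores for a short message.
scores = {
    "subject:invoice": 0.62,
    "attached": 0.55,
    "document": 0.80,
    "rcvd:example.invalid": 0.01,
    "please": 0.45,
}

only_min_dev = pick_scoring_tokens(scores)
at_least_three = pick_scoring_tokens(scores, min_token_count=3)
# token_count=2 expressed as min_token_count=max_token_count=2
exactly_two = pick_scoring_tokens(scores, min_token_count=2,
                                  max_token_count=2)

print([tok for tok, _ in only_min_dev])
print([tok for tok, _ in at_least_three])
print([tok for tok, _ in exactly_two])
```

In this sketch, setting min_token_count=3 pulls in the two next most extreme tokens even though they fail the min_dev test, which is the behaviour that would help with short messages.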
Parameters like these are something I've thought about, but I haven't
seen much need to add them. They would help you, so perhaps I'll find
time to implement them.
Regards,
David