Training question

Thu May 14 21:11:54 CEST 2009

It sounds as though your method is fine, but you just need to follow it 
to its logical conclusion... train with "Ns" until the token is 
recognized as spammy.  I do this using bfproxy with the "Nsx" options... 
it automatically keeps registering until the emails are correctly 
classified or until it has tried a user-defined number of times.

http://orderamidchaos.com/bogofilter/bfproxy

E.g. I just registered one of these Acai spams this way, and here was 
the output (with the "v" option):

subject: Treats blood pressure right!
original spamicity: 0.055994
user classification: spam
command: bogofilter -Ns
words: 88
new spamicity: 0.121036
new spamicity: 0.939520

It registered the spam the first time and tested it again to find that 
it was still in the hammy range.  Therefore, it registered it again, 
this time pushing it well into the spammy range, so it stopped at that 
point.  My "rmax" limit is 50, so if it wasn't making any headway, it 
would stop after 50 times to prevent an infinite loop.  Rarely does it 
need to repeat more than a few times though.
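
For anyone who wants the same behaviour without bfproxy, the loop described above can be sketched roughly as follows. This is only an illustration and not bfproxy's actual code: the function names are made up for the example, repeating "-Ns" on every pass (rather than "-Ns" once and "-s" afterwards) is an assumption, and so is relying on bogofilter's exit status 0 to mean "spam".

```python
import subprocess


def train_until_spam(message, register, is_spam, rmax=50):
    """Register `message` as spam until `is_spam` accepts it.

    Returns the number of registrations performed, or None if `rmax`
    registrations were not enough (this cap prevents an infinite loop).
    """
    for attempt in range(1, rmax + 1):
        register(message)
        if is_spam(message):
            return attempt
    return None


def bogofilter_register(message: bytes) -> None:
    # "bogofilter -Ns" is the command shown in the output above.
    # Assumption: the same command is repeated on every pass.
    subprocess.run(["bogofilter", "-Ns"], input=message, check=True)


def bogofilter_is_spam(message: bytes) -> bool:
    # Assumption: bogofilter exits with status 0 when it classifies
    # the message on stdin as spam.
    return subprocess.run(["bogofilter"], input=message).returncode == 0
```

Calling train_until_spam(raw_message, bogofilter_register, bogofilter_is_spam) would then register and re-test the message until it scores as spam or the limit is reached. Passing the two helpers in as arguments keeps the loop itself easy to try out without touching a real wordlist.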

Since I started doing this, I no longer have the problem of having to 
receive and correct similar spams many times.  If I know that something 
is a spam, I want bogofilter to recognize it as such the very first time 
I see it.  This exhaustive training method ensures that it does.

Tom

Stephen Davies wrote:
> The "good" numbers came from a period of a couple of days when my -Ns proc was 
> broken and, as I asked, I don't know how to get rid of them.
> 
> I do not use -u at all.
> 
> I "retrain" by running each undetected spam through bogofilter -Ns once and 
> then through bogofilter -s five times. I would expect - and the -w numbers 
> seem to confirm - that this stacks the stats against these texts.
> 
> Why does this not work?
> 
> Stephen
> 
>  On Monday 11 May 2009 19:01:34 Matthias Andree wrote:
>> On 11.05.2009 at 07:15, Stephen Davies <scldad at sdc.com.au> wrote:
>>> One of the very common types of spam recently is weight loss by taking
>>> Acai berries.
>>>
>>> I have received thousands of spams with this in the subject and/or body
>>> and have fed them all into bogofilter as spam (after first reversing the
>>> initial ham entry).
>>>
>>> My word list now includes:
>>>                                  spam   good
>>> Acai                            16084    321
>>>                                  spam   good
>>> subj:Acai                        5464    352
>>>
>>>
>>> Despite this, I still see:
>>> -bash-3.2# bogofilter -vvv < spam1 | grep Acai
>>> "subj:Acai"                        5816  0.029983  0.015939  0.347094 -
>>> "Acai"                            16406  0.027416  0.046919  0.631186 -
>>>
>>> What do I have to do to get these (and similar) words recognised as
>>> definitely spam?
>> How come more than 300 of these have been scored as good?
>>
>> If you are using bogofilter with "-u", be sure to THOROUGHLY retrain all
>> unsures and mis-classified messages. If you cannot or do not want to do
>> that, do not run bogofilter in "-u" mode.
>>
>> HTH