Training question

Matthias Andree matthias.andree at gmx.de
Tue May 12 09:23:25 CEST 2009


On 11.05.2009 at 15:36, Stephen Davies <scldad at sdc.com.au> wrote:

> The "good" numbers came from a period of a couple of days when my -Ns proc
> was broken and, as I asked, I don't know how to get rid of them.
>
> I do not use -u at all.
>
> I "retrain" by running each undetected spam through bogofilter -Ns once  
> and then through bogofilter -s five times. I would expect - and the -w  
> numbers seem to confirm - that this stacks the stats against these texts.
>
> Why does this not work?

Hi Stephen,

The frequencies in your -vvv output are relative to the total counts of
spam and good messages (stored in the special .MSG_COUNT token), which you
can query with:

     bogoutil -w ~/.bogofilter/wordlist.db .MSG_COUNT

The spamicity of this token has been polluted through bogofilter -Sn or
bogofilter -n, so bogofilter does not treat it as a clear indicator either
way. In subject lines, it sees the token in about 3% of your good messages
and 1.6% of your spam; in message bodies, in about 2.7% of good messages
and 4.7% of spam.
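
To make the arithmetic concrete, here is a minimal sketch of how the last
column of the -vvv output can be reproduced from the two frequency columns.
It assumes Robinson's f(w) formula and the parameter values robs=0.0178,
robx=0.52 and min_dev=0.375; those values are my assumption of the stock
defaults and are not taken from this thread, so check your bogofilter.cf.
The function names are made up for illustration.

```python
# Assumed defaults (not stated in this thread); verify against bogofilter.cf.
ROBS = 0.0178    # weight given to the prior for rarely seen tokens
ROBX = 0.52      # prior score for a token never seen before
MIN_DEV = 0.375  # tokens scoring closer than this to 0.5 are ignored


def token_score(pgood, pbad, n, robs=ROBS, robx=ROBX):
    """Robinson's f(w), computed from per-class frequencies and total count n."""
    if pgood + pbad == 0:
        return robx  # token never seen: fall back to the prior
    p = pbad / (pgood + pbad)
    return (robs * robx + n * p) / (robs + n)


def is_used(fw, min_dev=MIN_DEV):
    """True if the token is far enough from 0.5 to influence the verdict."""
    return abs(fw - 0.5) >= min_dev


# (n, pgood, pbad) copied from the -vvv output quoted in this thread.
tokens = {
    "subj:Acai": (5816, 0.029983, 0.015939),
    "Acai": (16406, 0.027416, 0.046919),
}

for name, (n, pgood, pbad) in tokens.items():
    fw = token_score(pgood, pbad, n)
    print(f"{name:10s} fw={fw:.4f} used={is_used(fw)}")
```

Under those assumptions both tokens land at roughly 0.347 and 0.631, which
agrees with the quoted output, and neither is far enough from 0.5 to count.
Because each frequency is divided by the message total for its class, a
large raw spam count can still produce a middling score.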

I wonder if it would be worthwhile for you to back up your ~/.bogofilter
directory (just in case) and start retraining from scratch.

If that's not possible, but you have clean mailboxes of spam and ham
where there are no uncorrected misfilings, you may have some success with
letting bogotune figure out other parameters.

There are also tools such as randomtrain that can help you build your
database from such mailboxes, again assuming that you have spam and good
messages sorted apart properly.

HTH
Matthias

>
> Stephen
>
>  On Monday 11 May 2009 19:01:34 Matthias Andree wrote:
>> On 11.05.2009 at 07:15, Stephen Davies <scldad at sdc.com.au> wrote:
>> > One of the very common types of spam recently is weight loss by taking
>> > Acai berries.
>> >
>> > I have received thousands of spams with this in the subject and/or body
>> > and have fed them all into bogofilter as spam (after first reversing the
>> > initial ham entry).
>> >
>> > My word list now includes:
>> >                                  spam   good
>> > Acai                            16084    321
>> >                                  spam   good
>> > subj:Acai                        5464    352
>> >
>> >
>> > Despite this, I still see:
>> > -bash-3.2# bogofilter -vvv < spam1 | grep Acai
>> > "subj:Acai"                        5816  0.029983  0.015939  0.347094  -
>> > "Acai"                            16406  0.027416  0.046919  0.631186  -
>> >
>> > What do I have to do to get these (and similar) words recognised as
>> > definitely spam?
>>
>> How come that >300 of these have been scored as good?
>>
>> If you are using bogofilter with "-u", be sure to THOROUGHLY retrain all
>> unsures and mis-classified messages. If you cannot or do not want to do
>> that, do not run bogofilter in "-u" mode.
>>
>> HTH
>
>
>



-- 
Matthias Andree


