Training question
Matthias Andree
matthias.andree at gmx.de
Tue May 12 09:23:25 CEST 2009
On 11.05.2009 at 15:36, Stephen Davies <scldad at sdc.com.au> wrote:
> The "good" numbers came from a period of a couple of days when my -Ns
> proc was broken and, as I asked, I don't know how to get rid of them.
>
> I do not use -u at all.
>
> I "retrain" by running each undetected spam through bogofilter -Ns once
> and then through bogofilter -s five times. I would expect - and the -w
> numbers seem to confirm - that this stacks the stats against these texts.
>
> Why does this not work?
Hi Stephen,
These figures are relative to the total message count of unsolicited and
good messages (reflected in the .MSG_COUNT special token) that you can
query with:
bogoutil -w ~/.bogofilter/wordlist.db .MSG_COUNT
The spamicity of this token has been polluted through bogofilter -Sn or
bogofilter -n, so bogofilter doesn't treat it as a clear indicator either
way. In the subject, it sees a 3% probability that it's ham and 1.6% that
it's spam; in message bodies, 2.7% good and 4.7% bad.
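
As a rough cross-check of these percentages, the sketch below recomputes the
last column of the quoted bogofilter -vvv output from the two middle columns.
It assumes the middle columns are the fractions of good and of spam messages
containing the token, and that the last column is approximately
bad / (good + bad); any smoothing bogofilter applies is ignored here. The
message totals it prints are inferred from the quoted numbers, not read from
the actual wordlist, and the function names are made up for this illustration.

```python
# Numbers copied from the quoted wordlist dump and `bogofilter -vvv` output.
tokens = {
    "subj:Acai": {"spam": 5464, "good": 352,
                  "pgood": 0.029983, "pbad": 0.015939, "fw": 0.347094},
    "Acai": {"spam": 16084, "good": 321,
             "pgood": 0.027416, "pbad": 0.046919, "fw": 0.631186},
}


def raw_spamicity(pgood, pbad):
    """Share of the 'bad' fraction in the sum of both fractions (no smoothing)."""
    return pbad / (pgood + pbad)


def implied_total(count, fraction):
    """Message total implied by a token count and its per-class fraction."""
    return count / fraction


for name, t in tokens.items():
    estimate = raw_spamicity(t["pgood"], t["pbad"])
    print(f"{name}: estimated {estimate:.4f}, reported {t['fw']:.4f}")

# Inferred (not measured) totals, based on the subj:Acai line only.
subj = tokens["subj:Acai"]
print("implied good messages:", round(implied_total(subj["good"], subj["pgood"])))
print("implied spam messages:", round(implied_total(subj["spam"], subj["pbad"])))
```

If these inferred totals are about right, the wordlist holds roughly 30 times
more spam than ham, so each ham registration of a token weighs about as much
as 30 spam registrations. That would explain why training each spam a few
more times moves the score so little.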
I wonder if it would be worthwhile for you to back up your ~/.bogofilter
directory (just in case) and start retraining from scratch.
If that's not possible, but you have clean mailboxes of spam and ham
with no uncorrected misfilings, you may have some success with
letting bogotune figure out other parameters.
There are also tools such as randomtrain that can help you build your
database from such mailboxes, again assuming that your spam and good
messages are properly sorted apart.
HTH
Matthias
>
> Stephen
>
> On Monday 11 May 2009 19:01:34 Matthias Andree wrote:
>> On 11.05.2009 at 07:15, Stephen Davies <scldad at sdc.com.au> wrote:
>> > One of the very common types of spam recently is weight loss by taking
>> > Acai berries.
>> >
>> > I have received thousands of spams with this in the subject and/or
>> > body and have fed them all into bogofilter as spam (after first
>> > reversing the initial ham entry).
>> >
>> > My word list now includes:
>> >              spam   good
>> > Acai        16084    321
>> >              spam   good
>> > subj:Acai    5464    352
>> >
>> >
>> > Despite this, I still see:
>> > -bash-3.2# bogofilter -vvv < spam1 | grep Acai
>> > "subj:Acai" 5816 0.029983 0.015939 0.347094 -
>> > "Acai" 16406 0.027416 0.046919 0.631186 -
>> >
>> > What do I have to do to get these (and similar) words recognised as
>> > definitely spam?
>>
>> How come that >300 of these have been scored as good?
>>
>> If you are using bogofilter with "-u", be sure to THOROUGHLY retrain all
>> unsures and mis-classified messages. If you cannot or do not want to do
>> that, do not run bogofilter in "-u" mode.
>>
>> HTH
>
>
>
--
Matthias Andree