Some naive questions

David Relson relson at osagesoftware.com
Fri Jan 3 17:48:32 CET 2003


Greetings Adriano,

It sounds like you've been reading the mailing list for a while.  That's 
great :-)

At 11:06 AM 1/3/03, Adriano Nagelschmidt Rodrigues wrote:

>Hello,
>
>I've recently started using bogofilter (version 0.9.1.2). So far, It only
>recognizes a tiny percentage of the spam I receive (I guess 20% at most).
>
>I'm running it from procmail as in:
>
>:0HB:
>* ? bogofilter -l -u
>spam

I'll let someone else comment on this, as I'm a bit weak on understanding 
procmail recipes.  (I do have one that works for me and can send it to 
you, if you wish.)

>and correcting the misclassifications manually via mutt macros.

Sounds right.  I presume you're using '-N' and '-S' ??

>Some data:
>
>[chianti:~] $ bogoutil -w .bogofilter/ .MSG_COUNT
>                        spam   good
>.MSG_COUNT             4539    795

It's unusual to have so many more spam than ham messages.

>Although I admit I haven't done much research, I would appreciate if someone
>could comment on the stupid things I have done/thought:
>
>* I fed bogofilter 7.1 MB worth of spam from spamarchive.org in a desperate
>   attempt to make it learn.

bogofilter learns best from the spam and ham that _you_ receive.  Using 
spam from spamarchive.org will help if it has words in common with your 
spam.  I can't say ...

>* I sometimes wonder about what the '-u' switch really buys you.

So do others :-)  Personally, I use it and am glad of it.  Other people, 
whose opinion I respect, do not use it.

What it buys is expansion of wordlists.  When a message is classified, each 
word is scored.  Depending on the ratio of ham words to spam words, the 
message will be scored as spam (or ham).  Some words of each message have 
strong scores, some have weak scores, and some have never before been 
seen by bogofilter.  Using '-u', all the words of each message are added to 
the spam (or ham) wordlist.  The result is to add new words to the lists 
(depending on the context in which they are encountered) and to increase 
the counts for the words already in the lists.
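
To make the effect of '-u' concrete, here is a rough Python sketch of the 
bookkeeping described above.  It is not bogofilter's code: the tokenizer is 
a crude stand-in for the real lexer, the function names are made up for 
illustration, and counting each word once per message is an assumption.

```python
from collections import Counter

def tokenize(text):
    # Crude stand-in for bogofilter's lexer (assumption): lowercase,
    # split on whitespace, and keep each word once per message.
    return set(text.lower().split())

def auto_update(text, is_spam, spam_list, ham_list):
    """Add every word of a classified message to the matching wordlist."""
    target = spam_list if is_spam else ham_list
    for word in tokenize(text):
        target[word] += 1  # new words start at 1, known words grow
    return target

# Hypothetical starting wordlists: word -> number of messages seen.
spam_list = Counter({"viagra": 12, "free": 30})
ham_list = Counter({"meeting": 8, "free": 3})

# A message the filter has just classified as spam.
auto_update("FREE mortgage offer", True, spam_list, ham_list)

print(spam_list["free"], spam_list["mortgage"], ham_list["free"])
```

Here "free" was already known and its spam count goes up, while "mortgage" 
and "offer" are new words that enter the spam list.  The ham list is left 
alone.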

>* I never tried the '-f' switch.

Using the '-f' switch divides the mail into three groups - ham, spam, and 
unsure.  Using it, I've found the ham and spam classifying to be nearly 
perfect.  The number of messages that get the "unsure" rating is a fairly 
small percentage of all that arrive and I pass them back to bogofilter 
using '-s' and '-n' (as bogofilter's '-u' switch won't update the wordlists 
when it's unsure).
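
The three-way split can be pictured with a small Python sketch.  The two 
cutoff values below are hypothetical numbers chosen for illustration, not 
bogofilter's defaults, and the score is assumed to run from 0.0 (ham) to 
1.0 (spam).

```python
# Hypothetical cutoffs, for illustration only.
HAM_CUTOFF = 0.10
SPAM_CUTOFF = 0.95

def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a score in [0.0, 1.0] to 'ham', 'spam', or 'unsure'."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"

def should_auto_update(label):
    # As described above, '-u' leaves the wordlists alone when the
    # message is rated unsure; those are registered by hand afterwards.
    return label != "unsure"

for score in (0.02, 0.50, 0.99):
    label = classify(score)
    print(score, label, should_auto_update(label))
```

Only the messages in the middle band need manual attention, which is why 
that group stays small once the wordlists are in good shape.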


>* I could feed it more ham.

I'd suggest taking all _your_ email and dividing it into ham and spam.  Then 
create totally new wordlists and use them.  Then correct (via your mutt 
macros) _all_ errors.  It will take a few days (maybe longer) to get 
bogofilter trained, but it will be worth it.
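
As a rough picture of that workflow, the Python sketch below builds fresh 
wordlists from hand-sorted mail and then corrects one mistake.  It is an 
illustration under stated assumptions, not bogofilter's implementation: the 
helper names are made up, the messages are toy data, and a correction is 
assumed to mean moving the message's words from the wrong list to the 
right one.

```python
from collections import Counter

def words(text):
    # Crude tokenizer (assumption): lowercase, one count per message.
    return set(text.lower().split())

def train(messages):
    """Build a brand-new wordlist from a pile of hand-sorted messages."""
    wordlist = Counter()
    for text in messages:
        wordlist.update(words(text))
    return wordlist

def correct(text, wrong_list, right_list):
    """Move a misfiled message's words from the wrong list to the right one."""
    for word in words(text):
        if wrong_list[word] > 0:
            wrong_list[word] -= 1
        right_list[word] += 1

# Toy stand-ins for _your_ sorted mail.
my_ham = ["lunch on friday", "patch for the lexer"]
my_spam = ["cheap pills online", "cheap loans today"]
ham_list, spam_list = train(my_ham), train(my_spam)

# A good message gets registered as spam by mistake ...
msg = "cheap flights for the conference"
spam_list.update(words(msg))

# ... and is then corrected by hand.
correct(msg, wrong_list=spam_list, right_list=ham_list)

print(spam_list["cheap"], ham_list["cheap"], spam_list["conference"])
```

After the correction "cheap" counts for both lists, so it no longer points 
as strongly toward spam.  Leaving such errors uncorrected keeps good words 
in the spam list.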

>Thanks a lot,

You're very welcome.

David




