Some naive questions
David Relson
relson at osagesoftware.com
Fri Jan 3 17:48:32 CET 2003
Greetings Adriano,
It sounds like you've been reading the mailing list for a while. That's
great :-)
At 11:06 AM 1/3/03, Adriano Nagelschmidt Rodrigues wrote:
>Hello,
>
>I've recently started using bogofilter (version 0.9.1.2). So far, it only
>recognizes a tiny percentage of the spam I receive (I guess 20% at most).
>
>I'm running it from procmail as in:
>
>:0HB:
>* ? bogofilter -l -u
>spam
I'll let someone else comment on this, as I'm a bit weak on understanding
procmail recipes. (I do have one that works for me and can send it to
you, if you wish.)
>and correcting the misclassifications manually via mutt macros.
Sounds right. I presume you're using '-N' and '-S' ??
>Some data:
>
>[chianti:~] $ bogoutil -w .bogofilter/ .MSG_COUNT
>                 spam   good
>.MSG_COUNT       4539    795
It's unusual to have so many more spam than ham messages.
>Although I admit I haven't done much research, I would appreciate if someone
>could comment on the stupid things I have done/thought:
>
>* I fed bogofilter 7.1 MB worth of spam from spamarchive.org in a desperate
> attempt to make it learn.
bogofilter learns best from the spam and ham that _you_ receive. Using
spam from spamarchive.org will help if it has words in common with your
spam. I can't say ...
>* I sometimes wonder about what the '-u' switch really buys you.
So do others :-) Personally, I use it and am glad of it. Other people,
whose opinion I respect, do not use it.
What it buys is expansion of the wordlists. When a message is classified,
each word is scored. Depending on the ratio of ham words to spam words, the
message will be scored as spam (or ham). Some words of each message have
strong scores, some have weak scores, and some have never before been
seen by bogofilter. Using '-u', all the words of each message are added to
the spam (or ham) wordlist, according to how the message was classified.
The result is to add new words to the lists (depending on the context in
which they are encountered) and to increase the counts for the words
already in the lists.
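
To make the effect of '-u' concrete, here is a rough sketch in Python. It
is not bogofilter's actual code or scoring formula: the tokenizer, the
per-word estimate, the plain averaging, and the names classify_and_update
and cutoff are all assumptions made for illustration. The only point is
that, after classifying, every word of the message is added to whichever
list the message was assigned to.

```python
from collections import Counter
import re


def tokenize(text):
    """Split a message into lowercase word tokens (a simplification)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def spamicity(word, spam, ham, n_spam, n_ham):
    """Estimate how spammy one word is; 0.5 means 'never seen before'."""
    s = spam[word] / max(n_spam, 1)
    h = ham[word] / max(n_ham, 1)
    if s + h == 0:
        return 0.5
    return s / (s + h)


def classify_and_update(text, spam, ham, counts, cutoff=0.5):
    """Score a message, then register its words in the matching list.

    Averaging the per-word scores is an assumption for illustration only;
    the real program combines word scores differently.
    """
    words = set(tokenize(text))
    scores = [spamicity(w, spam, ham, counts["spam"], counts["ham"])
              for w in words]
    score = sum(scores) / len(scores) if scores else 0.5
    is_spam = score > cutoff
    # The '-u' idea: new words get added, known words get higher counts.
    (spam if is_spam else ham).update(words)
    counts["spam" if is_spam else "ham"] += 1
    return is_spam, score
```

Note the flip side: if the classification was wrong, the wrong list grows,
which is why correcting the errors afterwards matters.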
>* I never tried the '-f' switch.
Using the '-f' switch divides the mail into three groups - ham, spam, and
unsure. Using it, I've found the ham and spam classification to be nearly
perfect. The messages that get the "unsure" rating are a fairly small
percentage of all that arrive, and I pass them back to bogofilter
using '-s' and '-n' (as bogofilter's '-u' switch won't update the wordlists
when it's unsure).
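
As a sketch of the three-way split, assuming a message score between 0 and
1: the cutoff values 0.1 and 0.9 and the function names below are
hypothetical, chosen only to show the shape of the logic, and are not
bogofilter's defaults.

```python
def three_way(score, ham_cutoff=0.1, spam_cutoff=0.9):
    """Map a score in [0, 1] to 'ham', 'spam', or 'unsure'.

    Both cutoffs are illustrative assumptions.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


def should_auto_update(label):
    """Auto-update only on a confident verdict; 'unsure' mail is left for
    manual registration as spam or ham."""
    return label != "unsure"
```

Anything landing in the middle band is what I register by hand.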
>* I could feed it more ham.
I'd suggest taking all _your_ email and dividing it into ham and spam. Then
create totally new wordlists and use them. Then correct (via your mutt
macros) _all_ errors. It will take a few days (maybe longer) to get
bogofilter trained, but it will be worth it.
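
In outline, that retraining amounts to something like the following Python
sketch. It only models the bookkeeping and is not how bogofilter stores
its wordlists; build_wordlists and correct are hypothetical names, and
correct assumes the message was previously registered in the wrong list.

```python
from collections import Counter
import re


def tokens(text):
    """Unique lowercase word tokens of one message (a simplification)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def build_wordlists(ham_messages, spam_messages):
    """Start from empty lists and register your own sorted mail."""
    ham, spam = Counter(), Counter()
    for msg in ham_messages:
        ham.update(tokens(msg))
    for msg in spam_messages:
        spam.update(tokens(msg))
    counts = {"ham": len(ham_messages), "spam": len(spam_messages)}
    return ham, spam, counts


def correct(text, wrong, right, counts, wrong_label, right_label):
    """Move a misclassified message from the wrong list to the right one."""
    for w in tokens(text):
        if wrong[w] > 0:
            wrong[w] -= 1
        right[w] += 1
    counts[wrong_label] = max(counts[wrong_label] - 1, 0)
    counts[right_label] += 1
```

Every correction shifts counts in both lists, which is why fixing _all_
errors during the first few days pays off.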
>Thanks a lot,
You're very welcome.
David