Some naive questions

Adriano Nagelschmidt Rodrigues anr at estadao.com.br
Fri Jan 3 20:43:42 CET 2003


David Relson writes:
> Greetings Adriano,

Thanks for answering, David.

> It sounds like you've been reading the mailing list for a while.  That's 
> great :-)

I (re)subscribed recently...

> >and correcting the misclassifications manually via mutt macros.
> 
> Sounds right.  I presume you're using '-N' and '-S' ??

Yes.

> bogofilter best learns from the spam and ham that _you_ receive.  Using 
> spam from spamarchive.org will help if it has words in common with your 
> spam.  I can't say ...

Subjectively, I feel I receive spam from all over. It is true that most of
it is in English or Portuguese (the email address in question ends in .br
and has been around since '93).

But I think spammers are, by definition, unselective (apart from the very
basics, e.g. trying to use the target's language).

> >* I sometimes wonder about what the '-u' switch really buys you.
> 
> So do others :-)  Personally, I use it and am glad of it.  Other people, 
> whose opinion I respect, do not use it.
> 
> What it buys is expansion of wordlists.  When a message is classified, each 
> word is scored.  Depending on the ratio of ham words to spam words, the 
> message will be scored as spam (or ham).  Some words of each message have 
> strong scores, some have weak scores, and some have never before been 
> seen by bogofilter.  Using '-u', all the words of each message are added to 
> the spam (or ham) wordlist.  The result is to add new words to the lists 
> (depending on the context in which they are encountered) and to increase 
> the counts for the words already in the lists.
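
To make the quoted description of '-u' concrete, here is a rough sketch of
the idea in Python. It is not bogofilter's actual code or scoring formula:
the in-memory counters, the function names, the simple averaged ratio, and
the 0.5 cutoff are all assumptions chosen only to show how classifying a
message and then registering its words under the verdict grows the wordlists.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the spam and ham wordlists.
spam_words = Counter()
ham_words = Counter()

def register(text, is_spam):
    """Add every word of a message to the spam or the ham wordlist."""
    target = spam_words if is_spam else ham_words
    target.update(text.lower().split())

def score(text):
    """Toy score: average per-word spam ratio; unseen words are neutral (0.5)."""
    ratios = []
    for word in text.lower().split():
        s, h = spam_words[word], ham_words[word]
        ratios.append(s / (s + h) if s + h else 0.5)
    return sum(ratios) / len(ratios) if ratios else 0.5

def classify_and_update(text, cutoff=0.5):
    """Classify a message, then register its words under that verdict."""
    is_spam = score(text) > cutoff
    register(text, is_spam)  # the autoupdate step: new words enter the list
    return is_spam
```

Note that a wrong verdict is registered just like a right one, which is why
misclassifications still have to be corrected by hand.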

Yeah, I grant you that it sounds very adaptive. But then, bogofilter never
stops (auto) learning. I think there are many twists to this argument.

> >* I never tried the '-f' switch.
> 
> Using the '-f' switch divides the mail into three groups - ham, spam, and 
> unsure.  Using it, I've found the ham and spam classifying to be nearly 
> perfect.  The number of messages that get the "unsure" rating is a fairly 
> small percentage of all that arrive and I pass them back to bogofilter 
> using '-s' and '-n' (as bogofilter's '-u' switch won't update the wordlists 
> when it's unsure).
> 
> 
> >* I could feed it more ham.
> 
> I'd suggest taking all _your_ email and dividing it into ham and spam.  Then 
> create totally new wordlists and use them.  Then correct (via your mutt 
> macros) _all_ errors.  It will take a few days (maybe longer) to get 
> bogofilter trained, but it will be worth it.
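
The three-way split described above for '-f', and the remark that '-u' leaves
the wordlists alone when the verdict is "unsure", can be sketched as follows.
This is an illustration only: the cutoff values, function names, and counters
are hypothetical and are not taken from bogofilter itself.

```python
from collections import Counter

def three_way(score, ham_cutoff=0.1, spam_cutoff=0.9):
    """Map a spam score in [0, 1] to 'ham', 'spam', or 'unsure'.

    Both cutoffs are made-up example values.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"

def autoupdate(words, verdict, spam_words, ham_words):
    """Register words only for a confident verdict; return True if updated."""
    if verdict == "unsure":
        return False  # left for the user to register manually
    target = spam_words if verdict == "spam" else ham_words
    target.update(words)
    return True
```

Messages that come back "unsure" are the ones that would then be fed back by
hand, as described in the quoted text.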

Ok, I'll start from scratch with -f & -u. Unfortunately, I threw away my spam
corpus in disgust :-(  Well, it'll build up quickly.

Let's see how things go.

Regards,

--
Adriano



