bogofilter howto question

p at dirac.org
Tue Sep 23 18:35:49 CEST 2003


On Tue 23 Sep 03, 12:08 PM, Greg Louis <glouis at dynamicro.on.ca> said:
> On 20030923 (Tue) at 1125:31 -0400, Peter Jay Salzman wrote:
> > hi greg,
> > 
> > are you accepting questions to put into the HOWTO you've written?  :)
> 
> If they're asked often enough ;-)  Sure, we welcome feedback!
> 
> > if so, i'd like to know what kinds of email to use as nonspam.  for
> > instance, what if someone sends me email with...
> > 
> > a MIME encoded image file attachment?
> > a sound file?
> > how about spamcop's "go to this link to complain about spam" emails?
> > a tarball containing .tex files?
> > 
> > the README file goes into what kinds of spam not to train with (like
> > email in asian languages; it suggests filtering them out with
> > procmail), but i haven't seen what kinds of good email to train with.
> 
> That assumes (wrongly, at least in the case of my users) that
> bogofilter users don't correspond in Asian languages.  Since in fact we
> do correspond in Korean and Chinese and so on, we naturally want to
> train on valid examples.
> 
> The rule is really easy (to state, anyhow, maybe not to follow): you
> should train (at first, anyway) with a subset that, as closely as
> possible, resembles the whole of your population of incoming valid
> messages.  (You could let bogofilter classify, then manually verify,
> and then have bogofilter register everything you get -- so-called "full
> training".)  The "at first" is because after one gets to somewhere
> around 10,000 spam and 10,000 nonspam in the training db, switching to
> training only on errors and unsures gives good results with less
> consumption of disk storage.  That's been my experience, anyhow.
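
A rough sketch of the rule above in Python, for anyone who wants to see
it spelled out.  It does not call bogofilter at all; it only does the
bookkeeping.  The function names, the (kind, id) message labels, and the
use of 10,000 as a hard cutoff are all assumptions made for illustration
(the reply says "somewhere around 10,000" from experience, not a fixed
limit).

```python
import random
from collections import defaultdict

# Assumed cutoff, taken from the "around 10,000 spam and 10,000 nonspam"
# figure in the reply; treat it as a rule of thumb, not a constant.
FULL_TRAINING_LIMIT = 10_000


def training_mode(spam_count, nonspam_count, limit=FULL_TRAINING_LIMIT):
    """Return 'full' until both counts reach the limit,
    then 'errors-and-unsures'."""
    if spam_count >= limit and nonspam_count >= limit:
        return "errors-and-unsures"
    return "full"


def representative_sample(messages, fraction, seed=0):
    """Pick about `fraction` of each kind of message, so the sample
    resembles the whole population of incoming valid mail.

    `messages` is a list of (kind, message_id) pairs.  The kind labels
    ('plain', 'image', 'tarball', ...) are hypothetical; use whatever
    grouping describes your own mail.
    """
    rng = random.Random(seed)
    by_kind = defaultdict(list)
    for kind, msg_id in messages:
        by_kind[kind].append(msg_id)

    sample = []
    for kind, ids in by_kind.items():
        # Keep each kind at roughly the same share it has in the inbox.
        k = min(len(ids), round(len(ids) * fraction))
        sample.extend((kind, msg_id) for msg_id in rng.sample(ids, k))
    return sample
```

Calling training_mode(500, 800) gives 'full', and it only changes to
'errors-and-unsures' once both counts have reached the limit.
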
> 
> I agree with you that beginning bogofilterers could benefit from a
> statement to that effect, although the FAQ might be a better place to
> put it than the tuning HOWTO that I wrote.  I hope you don't mind my
> copying this reply to the bogofilter list

sure, no problem!

> and we'd be happy to have you join it, if you haven't already.

already done - the confirm message arrived a minute ago.  :)

so from your email, if i only occasionally get tarballs or MIME encoded
image/sound files, these would *not* be good emails to train with.  is
that about right?

pete

-- 
GPG Instructions: http://www.dirac.org/linux/gpg
GPG Fingerprint: B9F1 6CF3 47C4 7CD8 D33E 70A9 A3B9 1945 67EA 951D



