bogofilter howto question
p at dirac.org
Tue Sep 23 18:35:49 CEST 2003
On Tue 23 Sep 03, 12:08 PM, Greg Louis <glouis at dynamicro.on.ca> said:
> On 20030923 (Tue) at 1125:31 -0400, Peter Jay Salzman wrote:
> > hi greg,
> >
> > are you accepting questions to put into the HOWTO you've written? :)
>
> If they're asked often enough ;-) Sure, we welcome feedback!
>
> > if so, i'd like to know what kinds of email to use as nonspam. for
> > instance, what if someone sends me email with...
> >
> > a MIME encoded image file attachment?
> > a sound file?
> > how about spamcop's "go to this link to complain about spam" emails?
> > a tarball containing .tex files?
> >
> > the README file goes into what kinds of spam not to train with (like
> > email in asian languages. it suggests filtering them out with
> > procmail). but i haven't seen what kinds of good email to train with.
>
> That assumes (wrongly, at least in the case of my users) that
> bogofilter users don't correspond in Asian languages. Since in fact we
> do correspond in Korean and Chinese and so on, we naturally want to
> train on valid examples.
>
> The rule is really easy (to state, anyhow, maybe not to follow): you
> should train (at first, anyway) with a subset that, as closely as
> possible, resembles the whole of your population of incoming valid
> messages. (You could let bogofilter classify, then manually verify,
> and then have bogofilter register everything you get -- so-called "full
> training".) The "at first" is because after one gets to somewhere
> around 10,000 spam and 10,000 nonspam in the training db, switching to
> training only on errors and unsures gives good results with less
> consumption of disk storage. That's been my experience, anyhow.
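
To make the rule above concrete, here is a minimal sketch of the decision it describes: register everything until the db holds roughly 10,000 spam and 10,000 nonspam, then register only errors and unsures. It does not call bogofilter itself. The names `TrainingDb`, `should_register`, `register` and `FULL_TRAINING_TARGET` are made up for illustration, and treating the 10,000 figure as a fixed cutoff is an assumption; the text only says "somewhere around".

```python
from dataclasses import dataclass

# Per-class figure quoted above. Assumption: used here as a hard cutoff,
# although the original advice gives it only as a rough guide.
FULL_TRAINING_TARGET = 10_000


@dataclass
class TrainingDb:
    """Hypothetical stand-in for the message counts in the training db."""
    spam: int = 0
    nonspam: int = 0


def should_register(db: TrainingDb, predicted: str, actual: str) -> bool:
    """Decide whether a manually verified message should be registered.

    predicted: the filter's verdict, "spam", "nonspam" or "unsure"
    actual:    the verified label, "spam" or "nonspam"
    """
    if db.spam < FULL_TRAINING_TARGET or db.nonspam < FULL_TRAINING_TARGET:
        # Full training phase: register everything that arrives.
        return True
    # Later phase: register only errors and unsures. An "unsure" verdict
    # never equals a verified label, so it always qualifies.
    return predicted != actual


def register(db: TrainingDb, actual: str) -> None:
    """Record one registered message in the counts."""
    if actual == "spam":
        db.spam += 1
    else:
        db.nonspam += 1
```

In a real setup the counts would come from the filter's own database and `register` would hand the message to the filter; both are left out here so the policy stays visible on its own.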
>
> I agree with you that beginning bogofilterers could benefit from a
> statement to that effect, although the FAQ might be a better place to
> put it than the tuning HOWTO that I wrote. I hope you don't mind my
> copying this reply to the bogofilter list
sure, no problem!
> and we'd be happy to have you join it, if you haven't already.
already done - the confirm message arrived a minute ago. :)
so from your email, if i only occasionally get tarballs or mime encoded
image/sound files, these would *not* be good emails to train with. is
that about right?
pete
--
GPG Instructions: http://www.dirac.org/linux/gpg
GPG Fingerprint: B9F1 6CF3 47C4 7CD8 D33E 70A9 A3B9 1945 67EA 951D