bogofilter howto question

Greg Louis glouis at dynamicro.on.ca
Tue Sep 23 18:08:04 CEST 2003


On 20030923 (Tue) at 1125:31 -0400, Peter Jay Salzman wrote:
> hi greg,
> 
> are you accepting questions to put into the HOWTO you've written?  :)

If they're asked often enough ;-)  Sure, we welcome feedback!

> if so, i'd like to know what kinds of email to use as nonspam.  for
> instance, what if someone sends me email with...
> 
> a MIME encoded image file attachment?
> a sound file?
> how about spamcop's "go to this link to complain about spam" emails?
> a tarball containing .tex files?
> 
> the README file goes into what kinds of spam not to train with (like
> email in asian languages.  it suggests filtering them out with
> procmail).  but i haven't seen what kinds of good email to train with.

That advice assumes (wrongly, at least in the case of my users) that
bogofilter users don't correspond in Asian languages.  Since in fact we
do correspond in Korean, Chinese and so on, we naturally want to
train on valid examples.

The rule is really easy (to state, anyhow, maybe not to follow): at
first, you should train with a subset that resembles, as closely as
possible, the whole population of your incoming valid messages.  (You
could let bogofilter classify, then verify manually, and then have
bogofilter register everything you get -- so-called "full training".)
The "at first" is because once the training db holds somewhere around
10,000 spam and 10,000 nonspam, switching to training only on errors
and unsures gives good results while using less disk storage.  That's
been my experience, anyhow.
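
A minimal sketch of that rule in Python, for illustration only: it is
not part of bogofilter, and the function name, the labels and the exact
cutoff are assumptions (10,000 is the rough figure given above, and
requiring both classes to reach it is one reading of it).  Given the
current counts in the training db, bogofilter's verdict and your
manually verified label, it says whether to register the message.

```python
# Hypothetical helper, not a bogofilter API.
# Assumption: full training continues until BOTH classes reach the cutoff.
FULL_TRAINING_TARGET = 10_000  # "somewhere around 10,000" per class


def should_register(n_spam, n_nonspam, verdict, truth,
                    target=FULL_TRAINING_TARGET):
    """Decide whether a manually verified message should be registered.

    n_spam, n_nonspam -- messages of each class already in the training db
    verdict           -- what the filter said: "spam", "nonspam" or "unsure"
    truth             -- what you decided by hand: "spam" or "nonspam"
    """
    if verdict not in ("spam", "nonspam", "unsure"):
        raise ValueError("unknown verdict: %r" % (verdict,))
    if truth not in ("spam", "nonspam"):
        raise ValueError("unknown truth label: %r" % (truth,))

    # Phase 1, full training: register every verified message.
    if n_spam < target or n_nonspam < target:
        return True

    # Phase 2, train on errors and unsures only.
    # "unsure" never equals a truth label, so unsures are always registered.
    return verdict != truth
```

With a fresh db everything gets registered; once both counts pass the
cutoff, only misclassified and unsure messages do.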

I agree with you that beginning bogofilterers could benefit from a
statement to that effect, although the FAQ might be a better place to
put it than the tuning HOWTO that I wrote.  I hope you don't mind my
copying this reply to the bogofilter list -- and we'd be happy to have
you join it, if you haven't already.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



