New version

Greg Louis glouis at dynamicro.on.ca
Wed Mar 17 00:43:51 CET 2004


On 20040316 (Tue) at 1756:26 -0500, Tom Allison wrote:

> > I agree.  There are situations -- a small training database is one of
> > them -- where it makes sense not to consider unknowns, and making sure
> > min_dev excludes them is a way to avoid swamping your valid tokens with
> > priors.  But some of us find that a very small min_dev works really
> > well (bogotune will tell you, if you have enough messages to run it).

> Yes this is exactly my point.
> 
> I am not in a position to run bogotune and so I have to fiddle manually.

Then I recommend you fiddle with the spam cutoff _only_, keeping the
rest of the default settings as they are, or changing things very
gingerly with the aid of a test corpus.  That's what I used to do in
the early days, too.
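To make "fiddle with the spam cutoff against a test corpus" concrete, here's a minimal sketch of what that manual loop looks like. It assumes you've already run bogofilter over a corpus of known ham and known spam and recorded the scores; the function and score values are illustrative, not part of bogofilter itself.

```python
# Hypothetical sketch: sweep candidate spam_cutoff values over a scored
# test corpus. Assumes you have bogofilter scores (0..1) for messages
# whose true class you already know; the sample scores below are made up.

def sweep_cutoff(ham_scores, spam_scores, cutoffs):
    """Return (cutoff, false_positives, false_negatives) for each cutoff."""
    results = []
    for c in cutoffs:
        fp = sum(1 for s in ham_scores if s >= c)   # ham wrongly flagged
        fn = sum(1 for s in spam_scores if s < c)   # spam that slips through
        results.append((c, fp, fn))
    return results

ham = [0.02, 0.10, 0.31, 0.48]        # illustrative scores only
spam = [0.55, 0.83, 0.91, 0.97, 0.99]
for c, fp, fn in sweep_cutoff(ham, spam, [0.40, 0.50, 0.60, 0.90]):
    print(f"cutoff={c:.2f}  false positives={fp}  false negatives={fn}")
```

Pick the lowest cutoff whose false-positive count you can live with; that's essentially what bogotune automates when you have enough messages.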

> While the bogofilter.cf file provides me with the means to readily
> fiddle to my heart's delight, I do not have sufficient emails to run
> bogotune or to do such inherently risky settings as you have (as
> supported by other comments on this list).

Do I come across as contemptuous or something?  (pi and I have been
at loggerheads for over a year, but there are good theoretical reasons
for our disputes.)  Tom Anderson raised a superficially good but
practically dubious objection to my current settings, I tried to
explain what was wrong with his assumptions, and here you are throwing
his mistaken criticism in my teeth as if you were desperate to have
something to reproach me with.  The _fact_ is that a slavish
determination to ignore unknowns can, in the long run, leave you
running bogofilter less effectively than is possible.  That's why I
recommend we tell people to do so _in_the_beginning_ but not forever. 

> I can't possibly dispute the effectiveness of your settings, however I
> think as a general guideline it might be a good idea to keep robx
> within the 0.5 +/- min_dev range just as it's a good idea to not set
> spam_cutoff to 0.40 right away.

Please read Gary Robinson's article explaining what robx is for.  It's at
http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
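What that article describes is a degree-of-belief smoothing: robx is the assumed spam probability for a token with no data, and robs is the weight that assumption carries against real observations. A minimal sketch of the formula f(w) = (s·x + n·p(w)) / (s + n), with illustrative parameter values (your bogofilter defaults may differ):

```python
# Sketch of the smoothing from Robinson's article: x (robx) is the assumed
# probability for an unseen token, s (robs) is how much weight that prior
# carries against n actual observations of the token. The default values
# here are illustrative, not necessarily bogofilter's shipped defaults.

def robinson_f(p, n, x=0.415, s=1.0):
    """Degree-of-belief estimate f(w) = (s*x + n*p) / (s + n)."""
    return (s * x + n * p) / (s + n)

# An unseen token (n = 0) scores exactly robx; as n grows, f(w)
# converges on the observed probability p(w).
print(robinson_f(p=0.9, n=0))    # 0.415  (pure prior)
print(robinson_f(p=0.9, n=100))  # ~0.895 (data dominates)
```

This is why robx is not just "a value near 0.5": it is the score every unknown token gets, so placing it inside the min_dev dead zone decides whether unknowns count at all.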

> On a side note, IIRC most people have a diction of <30,000 words with a
> common set of 10,000 in their daily language.  Considering that my one
> month old wordlist exceeds this by almost 4x it's no surprise that
> (according to 'bogoutil -H')
> hapaxes:  ham   11899 ( 8.35%), spam   52460 (36.83%)
>    pure:  ham   24179 (16.98%), spam  107558 (75.52%)
> 
> My 'pure ham' approximates the diction limit of a reasonably educated
> person.  In my case there's probably a lot of tokens that aren't
> linguistically significant (eg: head:AntiVirus 35 110 20040310) but
> there's a limit from a language perspective that you approach as you
> collect ham.

English has nearly a million words.  Most other Indo-European languages
have about a quarter of that.  Sure, few of us know or use more than a
few myriad, but then most of us these days can't spell worth a tinker's
damn (politically correct version of the expression, y'unnerstan'), and
half of us can't type either.  That bumps up the number of tokens in
nonspams significantly.

> The spam content, using random letters and mis-spellings for variations
> will far exceed any typical language on the planet.
> Given that, setting robx within 0.5 +/- min_dev effectively negates
> all the random values they use for "spin control" of their email.

Ignoring unknowns, if spammers generate lots more unknowns than
nonspammers, is throwing away information that can be used -- like the
presence or absence of any known token -- to indicate likelihood of
spam.  Why is that not obvious?  Why should you think it dangerous to
take advantage of that information?
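The mechanics of "throwing away information" are simple to see. This is an illustrative sketch, not bogofilter's actual code: min_dev drops any token whose probability sits within min_dev of the neutral 0.5 before the scores are combined, so if robx falls in that dead zone, every unknown token is silently discarded.

```python
# Illustrative sketch (not bogofilter's implementation): min_dev excludes
# tokens whose spam probability is within min_dev of neutral (0.5) from
# the combined score. Per-token probabilities below are made up.

def tokens_used(probs, min_dev):
    """Keep only tokens whose probability deviates from 0.5 by > min_dev."""
    return [p for p in probs if abs(p - 0.5) > min_dev]

probs = [0.01, 0.46, 0.52, 0.99]         # made-up per-token probabilities
print(tokens_used(probs, min_dev=0.1))   # near-neutral tokens dropped
print(tokens_used(probs, min_dev=0.01))  # all four contribute
```

With a large min_dev and robx inside the excluded band, a spam full of random gibberish contributes nothing near-neutral to its own score; with a very small min_dev, each unknown token nudges the score toward robx, which is exactly the information at issue here.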

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |
