religion

Wed Jan 22 23:18:20 CET 2003

On 20030122 (Wed) at 0240:25 -0500, Nick Simicich wrote:
> At 08:23 PM 2003-01-21 -0500, Greg Louis wrote:
> 
> >> In fact, training on unsures is more work than just training on errors,
> >> at least in the short term. (I'm not saying it's not worthwhile.)
> >
> >Training on just errors -- misdelivered spams and undelivered nonspams
> >-- is a bit less work per batch of email, but you're right in referring
> >to the short term: it'll take longer to develop your training db so
> >you'll end up processing more batches.  I suspect the tradeoff may be
> >quite even.
> 
> I read the above.  Several times. If I am training with all mail, and 
> reclassifying all errors, why is that in any way inferior to ternary 
> classifying where I then train with all the mail I was unsure of (other 
> than that it is way simpler because I am not having to deal with at least 
> half of the unsure mail)?  Specifically, why would it take any longer to 
> build a training DB?

"Training on unsures is more work than just training on errors," is what
we were discussing.  If you are training on all mail, and reclassifying
all errors, you're not "just training on errors."  And if you're doing
that, it will definitely not take longer to build a training db -- what
I meant is that it would if you were training on errors only.

As far as I know, training on all mail and reclassifying _all_ errors
is inferior only in that there are recurring periods of exposure,
between the initial training and the reclassification, during which the
database has bad data.  I would add that the risk of missing a
reclassification and leaving bad data in the training db is another
drawback -- it's more harmful to leave wrong stuff in occasionally than
it is to leave right stuff out occasionally.

> >No doubt it's my fault: I maintain that -u _is_ inherently harmful, in
> >both binary _and_ ternary mode, unless accompanied by manual correction
> >and, in ternary mode, also by manual training on unsures.  I may not
> >have made those qualifiers clear enough in earlier comments, and I
> >appreciate your contribution to getting them clarified.  (I also think
> >-u involves unnecessary risk, as described in your next paragraph.)
> 
> Is the assertion that not using -u once you have a trained database, at 
> least leaves you with a situation no worse than you were in when you start 
> -- it won't misclassify and missort a few spams, and then start cascading 
> into more and more misclassified spams?

Not sure I understand the part before the -- there.  I've never used -u
and I'd advise against doing so, right from the start, and my reason is
that the messages have to be reviewed by a human anyway, and since
that's the case, it's better to review first and train afterward than
to make training errors first and hope to catch them all afterward. 
And I certainly agree that "misclassify and missort a few spams, and
then start cascading into more and more misclassified spams" is an
excellent description of the risk involved in using -u imperfectly.

> >(What I do is use bogofilter to sieve
> >the emails into good and bad -- or now good/bad/unsure -- mbox files,
> >and then use my mua to read the subject lines, correct any
> >misclassifications, and separate out unsures that are actually spam.
> >Then I train on errors and unsures -- to train on everything only takes
> >another couple of minutes and the abovementioned cpu cycles,
> >obviously.)
> 
> I don't understand.  You do not train unsures that are actually hams?  Why 
> not?

Indeed I do.  Oh, I see -- ok, bad locution on my part, sorry.  "I
separate out unsures that are actually spam" and train on both the
unsures that were spams and the remaining unsures that were nonspams. 
That's the whole point of "training on unsures," actually.  After
separation I run "bogofilter -v -n <unsure.nonspam" and
"bogofilter -v -s <unsure.spam" (and train on any other errors as well).
All I exclude from the training are the messages that were correctly
classified in the first place (which, happily, is usually at least four
fifths of the total).
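
For anyone wanting to script the review-first, train-afterward step, here
is a minimal sketch, not a tested recipe.  The two bogofilter invocations
are the ones quoted above; everything else is an assumption made for
illustration: the file names missed.spam and false.positive (hand-sorted
mboxes holding the other errors), the function names, and the BOGOFILTER
variable that lets you point the script at a stand-in for a dry run.  It
also assumes -u was never used, so a misclassified message has not been
registered anywhere and a plain -s or -n is enough.

```shell
#!/bin/sh
# Sketch: train only on messages a human has already reviewed and sorted.
# BOGOFILTER is a hypothetical override (defaults to the real binary).
BOGOFILTER="${BOGOFILTER:-bogofilter}"

# train FLAG FILE
#   FLAG is -s (register as spam) or -n (register as nonspam).
#   FILE is an mbox; missing or empty files are skipped silently.
train() {
    flag=$1
    file=$2
    if [ -s "$file" ]; then
        "$BOGOFILTER" -v "$flag" <"$file"
    fi
}

# Train on unsures and on errors; correctly classified mail is left out.
train_reviewed() {
    # Unsures, separated by hand (file names as in the message above).
    train -n unsure.nonspam
    train -s unsure.spam
    # Other errors (hypothetical file names, assumed hand-sorted mboxes).
    train -n false.positive   # nonspam that was filed as spam
    train -s missed.spam      # spam that was filed as nonspam
}
```

Calling train_reviewed from the directory that holds the sorted mbox files
runs one bogofilter pass per non-empty file.  Nothing is trained until the
review is done, which is the point: no wrong data gets into the db in the
first place.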

Hope that helps..........
-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |