religion

Nick Simicich njs at scifi.squawk.com
Wed Jan 22 08:40:25 CET 2003


At 08:23 PM 2003-01-21 -0500, Greg Louis wrote:

> > In fact, training on unsures is more work than just training on errors, at
> > least in the short term. (I'm not saying it's not worthwhile.)
>
>Training on just errors -- misdelivered spams and undelivered nonspams
>-- is a bit less work per batch of email, but you're right in referring
>to the short term: it'll take longer to develop your training db so
>you'll end up processing more batches.  I suspect the tradeoff may be
>quite even.

I read the above several times.  If I am training with all mail, and
reclassifying all errors, why is that in any way inferior to ternary
classifying, where I then train with all the mail I was unsure of (other
than that it is way simpler, because I am not having to deal with at least
half of the unsure mail)?  Specifically, why would it take any longer to
build a training DB?
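
To make the comparison concrete, here is a minimal sketch of which messages
each strategy ends up training on.  It is not bogofilter code: the Msg type,
the function names, and the six-message batch are all made up for
illustration, and "truth" stands for whatever the human reader decides.

```python
from typing import NamedTuple


class Msg(NamedTuple):
    ident: str
    truth: str    # "spam" or "ham", as judged by the human reader (assumed)
    verdict: str  # "spam", "ham", or "unsure", as returned by the filter


def errors_only(msgs):
    """Train only on definite verdicts that turned out to be wrong."""
    return {m.ident for m in msgs
            if m.verdict != "unsure" and m.verdict != m.truth}


def errors_and_unsures(msgs):
    """Ternary mode: train on wrong verdicts plus every unsure."""
    return {m.ident for m in msgs if m.verdict != m.truth}


def everything_corrected(msgs):
    """Train on all mail, each message under its corrected label."""
    return {m.ident for m in msgs}


# Hypothetical batch: two correct verdicts, two errors, two unsures.
batch = [
    Msg("a", "spam", "spam"),
    Msg("b", "ham", "ham"),
    Msg("c", "spam", "ham"),     # false negative
    Msg("d", "ham", "spam"),     # false positive
    Msg("e", "spam", "unsure"),
    Msg("f", "ham", "unsure"),
]

for strategy in (errors_only, errors_and_unsures, everything_corrected):
    print(strategy.__name__, sorted(strategy(batch)))
```

In this toy batch, every message the ternary strategy trains on is also
trained on, under the same corrected label, when training on all mail, which
is why I do not see how the all-mail database could be slower to build.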

> > In summary, I don't understand why you feel -u is inherently harmful with
> > ternary mode, and I have been quite confused by such comments in the past.
>
>No doubt it's my fault: I maintain that -u _is_ inherently harmful, in
>both binary _and_ ternary mode, unless accompanied by manual correction
>and, in ternary mode, also by manual training on unsures.  I may have
>not made those qualifiers clear enough in earlier comments, and I
>appreciate your contribution to getting them clarified.  (I also think
>-u involves unnecessary risk, as described in your next paragraph.)

Is the assertion that not using -u, once you have a trained database, at
least leaves you in a situation no worse than the one you started in
-- that it won't misclassify and missort a few spams, and then start
cascading into more and more misclassified spams?

> > The only problem I've personally had with -u is that false positives AND
> > false negatives MUST be corrected, or things will get worse, but that has
> > nothing to do with whether one is using binary or ternary mode.
>
>I would hypothesize that the effort required to sniff out fp and fn
>after the fact is no less than the effort of doing the whole
>classification semi-manually in either binary or ternary mode, with the
>additional advantage that not using -u avoids putting bad data into the
>training db in the first place.  (What I do is use bogofilter to sieve
>the emails into good and bad -- or now good/bad/unsure -- mbox files,
>and then use my mua to read the subject lines, correct any
>misclassifications, and separate out unsures that are actually spam.
>Then I train on errors and unsures -- to train on everything only takes
>another couple of minutes and the abovementioned cpu cycles,
>obviously.)

I don't understand.  You do not train on unsures that are actually hams?  Why not?

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc.  But if it is not all three of Unsolicited,
Bulk, and E-mail, it simply is not spam. Misusing the term plays into the
hands of the spammers, since it causes confusion, and spammers thrive on
confusion.  If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


