religion

Greg Louis glouis at dynamicro.on.ca
Wed Jan 22 02:23:44 CET 2003


On 20030121 (Tue) at 1504:56 -0800, Barry Gould wrote:

> However, you seem to be repeatedly saying that using -u is _inherently_ BAD 
> in a ternary system.
> 
> I don't see any reason that -u would be inherently harmful in a ternary 
> system, especially if manual training on Unsures is also done.

Good point, especially if you remove the word especially ;-)  Otherwise
my concern with -u ignoring (better word than discarding) unsures
remains.

> Furthermore, I have some users who do not give me their uncaught spam, much 
> less their Unsures. Therefore, I _cannot_ train on all Unsures unless I 
> decide to cc them all to myself, which would be an invasion of everyone's 
> privacy.

I'm luckier: our corporate information security policy, of which all
employees receive a copy and about which we give them a two-hour
presentation, explicitly says that we may archive emails for a time and
privacy cannot be guaranteed.  I tell them very firmly that any email
has the privacy and security of a postcard, and that they _must_ use
encryption if they're concerned about some sysadmin or packet sniffer
somewhere reading their mail.  In fact it's almost never necessary to
go beyond the subject header in order to classify the mail -- good
thing, too, because I certainly haven't time to read email bodies.

> I don't understand what you mean by "discards".

Does not use for training.

> With -p, I can still see the Unsure status in my MUA, and use those 
> messages for manual training. Therefore, it hasn't discarded anything in 
> any sense. Maybe you're not using -p?

No, but I don't grok your reasoning here.  Using -u does nothing to
help with unsures; -u trains on recognized spams and nonspams only,
which I believe has little training value.  The fact that you can
supplement it with manual training on unsures is irrelevant: doing that
without using -u at all is very nearly as valuable.
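
To make the point about unsures concrete, here is a minimal sketch of which messages an auto-update pass would and would not learn from. It is a toy model, not bogofilter's code: the function name `auto_update_pass`, the verdict strings, and the message ids are all assumptions made for illustration.

```python
# Hypothetical model of an auto-update pass; not bogofilter's actual code.
def auto_update_pass(verdicts):
    """Split (msg_id, verdict) pairs into what auto-update would register
    and what it would skip. verdict is 'spam', 'ham', or 'unsure'."""
    registered = {"spam": [], "ham": []}
    skipped = []
    for msg_id, verdict in verdicts:
        if verdict in registered:
            # Registered under the filter's own verdict, right or wrong.
            registered[verdict].append(msg_id)
        else:
            # Unsures contribute nothing to the training db.
            skipped.append(msg_id)
    return registered, skipped


batch = [("m1", "spam"), ("m2", "ham"), ("m3", "unsure"), ("m4", "spam")]
registered, skipped = auto_update_pass(batch)
print(registered)
print(skipped)
```

In this model the only message the filter was uncertain about, `m3`, is exactly the one that never reaches the training db.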

> Training on unsures and errors is just as laborious as anything else.

It can be.  It would be possible to design a full-training procedure
that's as little work as my unsures-and-errors procedure -- in fact, I
have one that just uses a few gazillion more cpu cycles.  But it doesn't
use -u, and I doubt that using -u would decrease the effort.

> In fact, training on unsures is more work than just training on errors, at 
> least in the short term. (I'm not saying it's not worthwhile.)

Training on just errors -- misdelivered spams and undelivered nonspams
-- is a bit less work per batch of email, but you're right in referring
to the short term: it'll take longer to develop your training db so
you'll end up processing more batches.  I suspect the tradeoff may be
quite even.

> In summary, I don't understand why you feel -u is inherently harmful with 
> ternary mode, and I have been quite confused by such comments in the past.

No doubt it's my fault: I maintain that -u _is_ inherently harmful, in
both binary _and_ ternary mode, unless accompanied by manual correction
and, in ternary mode, also by manual training on unsures.  I may not
have made those qualifiers clear enough in earlier comments, and I
appreciate your contribution to getting them clarified.  (I also think
-u involves unnecessary risk, as described in your next paragraph.)

> The only problem I've personally had with -u is that false positives AND 
> false negatives MUST be corrected, or things will get worse, but that has 
> nothing to do with whether one is using binary or ternary mode.

I would hypothesize that the effort required to sniff out fp and fn
after the fact is no less than the effort of doing the whole
classification semi-manually in either binary or ternary mode, with the
additional advantage that not using -u avoids putting bad data into the
training db in the first place.  (What I do is use bogofilter to sieve
the emails into good and bad -- or now good/bad/unsure -- mbox files,
and then use my mua to read the subject lines, correct any
misclassifications, and separate out unsures that are actually spam. 
Then I train on errors and unsures -- to train on everything only takes
another couple of minutes and the abovementioned cpu cycles,
obviously.)
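
The selection step in that procedure can be sketched roughly as follows, assuming each message already carries the filter's verdict and the label assigned during manual review. The name `select_for_training`, the tuple layout, and the sample data are hypothetical; the sketch only decides what to train on and does not call bogofilter.

```python
# Hypothetical sketch of "train on errors and unsures"; names are illustrative.
def select_for_training(reviewed):
    """reviewed: list of (msg_id, filter_verdict, human_label) tuples, where
    filter_verdict is 'spam', 'ham', or 'unsure' and human_label is 'spam'
    or 'ham'. Returns the ids to train on, grouped by the human label."""
    to_train = {"spam": [], "ham": []}
    for msg_id, verdict, label in reviewed:
        # A mismatch covers both cases: the filter was wrong (an error)
        # or the filter declined to decide (an unsure).
        if verdict != label:
            to_train[label].append(msg_id)
    return to_train


reviewed = [
    ("m1", "spam", "spam"),    # correct, not retrained
    ("m2", "ham", "spam"),     # misdelivered spam (error)
    ("m3", "unsure", "ham"),   # unsure that is really nonspam
    ("m4", "unsure", "spam"),  # unsure that is really spam
    ("m5", "ham", "ham"),      # correct, not retrained
]
to_train = select_for_training(reviewed)
print(to_train)
```

Training on everything instead would simply mean dropping the `verdict != label` test and registering every reviewed message under its human label.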

Yes, this is hard to do if you don't have access to the whole stream of
emails.  No, -u is not a solution in that case either, because then you
will not have the opportunity of finding and correcting the errors.

Hope that helps eliminate the confusion, which I'm sorry to have caused
in the first place!  Note, also, that David uses -u with post facto
review, and is as happy with his way of doing things as I am with mine. 
Circumstances and people differ.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |



