Comment re http://www.bgl.nu/~glouis/bogofilter/scale.html

Greg Louis glouis at dynamicro.on.ca
Wed Jan 22 13:50:14 CET 2003


On 20030122 (Wed) at 1258:51 +0100, Johan Almqvist wrote:
> Hello!
> 
> > We haven't yet addressed the point (it's discussed in Graham's original
> > paper, though) that Bayesian spam filtering is likely to be most
> > efficient when it's done on an individual basis.  Nonspam emails for a
> > large user population are, taken as a whole, less dissimilar from spam
> > than any individual's nonspam email is likely to be.  I don't know
> > whether this effect would be significant at the 100-user, the 1000-user
> > or the 10000-user level.  It would have to be tested.
> 
> Just a quick comment on that. I have noted that even for just myself,
> bogofilter (original esr version, read about yours first time today...)
> is more effective with different databases for different sub-accounts
> (mainly due to the fact that I use some addresses mainly for Swedish
> communications, other addresses for German and English communications,
> respectively).

Thank you for your comment and for your interest!

A difference of that kind would be significant with small training
databases, but I think it would likely disappear as training progressed.
It would be very interesting if you were to combine the three training
databases once the totals reach around 10,000 spams and 10,000
nonspams, and compare the efficacy of discrimination with the merged
database against that with the separate ones.
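
As a rough sketch of what combining the per-account training data
could look like, the Python below sums per-token counts from three
databases.  Everything here is an assumption for illustration: the
function name merge_counts, the token counts, and the use of in-memory
Counter objects.  It does not reflect bogofilter's actual on-disk
database format or any of its tools.

```python
from collections import Counter

def merge_counts(*databases):
    """Add up per-token counts from several training databases."""
    merged = Counter()
    for db in databases:
        # Counter.update adds the incoming counts to the existing ones.
        merged.update(db)
    return merged

# Hypothetical spam-token counts for three sub-accounts (made-up numbers).
swedish_spam = Counter({"viagra": 4, "gratis": 7})
german_spam = Counter({"viagra": 2, "kostenlos": 5})
english_spam = Counter({"viagra": 9, "free": 12})

all_spam = merge_counts(swedish_spam, german_spam, english_spam)
print(all_spam.most_common())
```

The same merge would be applied to the nonspam counts, after which the
same test messages can be scored against the merged and the separate
databases and the error rates compared.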

Since writing the paragraph you quote, I've confirmed by my own
experience that it's much harder to get good filtering for 80 users
with heterogeneous interests (Marketing, Purchasing, Engineering
people) than for one user (myself) with relatively broad interests and
email in English, French and German.  For myself, I'm now getting about
half a percent false negatives, with one false positive every three or
four thousand messages; at work, false negatives are still running at
ten percent or so, because we have to be more lenient to keep the false
positives down (also around one in three or four thousand).
Fortunately my users are mostly delighted at getting far less spam and
don't expect perfection, but I still think we can do a lot better, e.g.
with smarter tokenization (Paul Graham's new paper at
http://paulgraham.com/better.html is interesting on that subject).
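
The trade-off described above (accepting more false negatives in order
to hold false positives to a fixed budget) can be illustrated with a
minimal sketch.  It assumes each message has already been given a
spamicity score between 0 and 1 and is called spam when the score is at
or above a cutoff; the function names and the scores are hypothetical,
and this is not how bogofilter itself chooses its cutoff.

```python
def error_rates(spam_scores, ham_scores, cutoff):
    """Return (false_negative_rate, false_positive_rate) at a cutoff.

    Both score lists must be non-empty.
    """
    fn = sum(1 for s in spam_scores if s < cutoff)   # spam let through
    fp = sum(1 for s in ham_scores if s >= cutoff)   # nonspam rejected
    return fn / len(spam_scores), fp / len(ham_scores)

def lowest_cutoff_within_fp_budget(spam_scores, ham_scores, max_fp_rate):
    """Pick the smallest candidate cutoff whose fp rate fits the budget.

    False negatives can only grow as the cutoff rises, so the smallest
    acceptable cutoff also gives the fewest false negatives.
    """
    # Infinity is a sentinel meaning "classify nothing as spam".
    candidates = sorted(set(spam_scores) | set(ham_scores)) + [float("inf")]
    for cutoff in candidates:
        fn_rate, fp_rate = error_rates(spam_scores, ham_scores, cutoff)
        if fp_rate <= max_fp_rate:
            return cutoff, fn_rate, fp_rate

# Made-up scores for a tiny test corpus.
spam_scores = [0.99, 0.95, 0.6, 0.4]
ham_scores = [0.01, 0.1, 0.5, 0.2]

strict = lowest_cutoff_within_fp_budget(spam_scores, ham_scores, 0.0)
lenient = lowest_cutoff_within_fp_budget(spam_scores, ham_scores, 0.25)
print(strict, lenient)
```

With a real corpus the budget would be something like one in three
thousand, and a population whose nonspam scores overlap more with spam
scores forces a higher cutoff and hence more false negatives.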

As your comment seems generally relevant and not overly personal or
private, I've taken the liberty of copying this to the bogofilter list
(bogofilter-subscribe at aotto.com if you're interested in joining us),
where there will be other interested readers.  Hope you don't mind.

Regards.............
-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |



