Bogofilter with a specified wordlist.db

Fri Apr 7 17:51:45 CEST 2006

mouss wrote:
> As of today, I am not aware of any mathematical proofs that support 
> bayesian filtering. while it has been shown that the independence 
> conditions are not necessary for Bayes formula, we are still left with 
> the fact that practice is the only "proof".

http://edukalibre.org/documentation/bayesian_decision_making.pdf

> The idea is to first make bogo somewhat learn SA, until the user's db is 
> large enough. then user feedback will be required. but this should 
> happen less than if the user starts with just bogo. of course, I have no 
> theoretical nor practical proof of this. It may be just wrong. and even 
> if "theoretically true", it may be the wrong approach.

I cannot see any advantage for Bogofilter to learn from SpamAssassin.
It only takes training a few messages to get meaningful filtering
results.  If user feedback is going to be required at some point anyway,
you might as well start at the beginning.  Starting with SA training
would mean that you'd have to unteach Bogofilter some bad rules.  I
start my users with no training at all.  All of their messages arrive as
"unsure".  When they train with one or two messages, they might start
getting some filtered as spam or ham, but most are still unsure.  As
they continue training, their results are further refined until very few
arrive as unsure.  This is easy for people to grasp, whereas being
thrown into a situation where they are correcting errors right off the
bat can be confusing.
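
To make that progression concrete, here is a rough sketch in Python.  It
is a toy token scorer, not Bogofilter's actual algorithm, and the class
name, the tokenizer and both cutoffs are assumptions chosen only for
illustration.  It shows a blank filter answering "unsure" for everything,
then starting to commit once a message or two has been trained.

```python
import math
import re
from collections import Counter

# Illustrative cutoffs (assumptions, not taken from any real configuration).
SPAM_CUTOFF = 0.90
HAM_CUTOFF = 0.10


class ToyFilter:
    """Toy three-way classifier: spam / ham / unsure."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.ham_tokens = Counter()   # token -> number of ham messages containing it

    @staticmethod
    def tokens(text):
        # Crude tokenizer: distinct lowercase words, digits and apostrophes.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, is_spam):
        target = self.spam_tokens if is_spam else self.ham_tokens
        target.update(self.tokens(text))

    def score(self, text):
        # Sum per-token log-odds with add-one smoothing.
        log_odds = 0.0
        for tok in self.tokens(text):
            s = self.spam_tokens[tok]
            h = self.ham_tokens[tok]
            if s + h == 0:
                continue  # a never-seen token carries no evidence
            log_odds += math.log((s + 1) / (h + 1))
        return 1 / (1 + math.exp(-log_odds))

    def classify(self, text):
        p = self.score(text)
        if p >= SPAM_CUTOFF:
            return "spam"
        if p <= HAM_CUTOFF:
            return "ham"
        return "unsure"
```

With no training every score is exactly 0.5, so every message is
"unsure".  After training a single spam, a message sharing most of its
words with it crosses the spam cutoff, while one that shares only a
single word stays "unsure".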

> The recipient _is_ a _filter_. This is what I meant by "human filter". 
> The recipient looks at the mail, and classifies it. and this human 
> filter doesn't have a perfect accuracy. Some people will classify as 
> spam mail from lists they subscribed to. others will classify as spam a 
> mail they don't understand. some people will believe hoaxes and 
> phishes... etc. Even "good" human classifiers get tired, or they may 
> review too quickly....

No, the recipient is not a "human filter".  A "human filter" would be a
secretary, screening your emails.  The recipient's classification is not
filtering, but rather the final determination of value.  Either you want
the email or you don't.  The recipient cannot be wrong in this regard.
They can forget that they subscribed to lists, or mistakenly throw away
something they may have actually wanted, or fall for a phishing scam,
but their choice is definitive nonetheless.  The recipient can only act
as a filter if the email has some further use beyond their own viewing
of it.  In other words, they are not really the recipient.  For
instance, if it needs to be archived, then the archive is the final
recipient and the rules of archiving would be the determinant of value.

But if the buck stops with you, you are the recipient, and you do not
filter, but rather determine value.  If you say it's spam, it's spam.
If you say it's ham, it's ham.  Even if you later decide otherwise.

> So from a theoretical standpoint, users feedback/classification is a 
> filter. I am not saying it could be replaced by an automaton. but I am 
> just trying to see if it is feasible to improve the accuracy of the 
> filter by replacing "bad" human filters (people who don't train) with a 
> procedural filter. I still believe recipient feedback is the way to go 
> ... when possible!

A person who does not train a statistical filter is not themselves a bad
filter, but rather they've chosen to have no filter at all.  The
statistical filter is not the recipient.  For instance, consider the
case of an executive and a secretary.  If the secretary doesn't know
which calls are important and which aren't, she'll send them all
through.  If the executive doesn't inform her when a call is important
or unimportant, he has neglected to train his filter and he must screen
all of his own calls.  If the secretary buys a book about rules for
determining importance of calls and follows it religiously, the book is
making the decisions, and you could essentially program these rules into
the phone system itself rather than having a secretary.  She can be no
more accurate than the book unless informed by the executive about his
own personal determination of value.  However, the executive does not
screen calls for the secretary, but rather teaches her or doesn't.

You cannot improve the accuracy of a statistical filter by training it
with a procedural filter beyond the accuracy of the procedural filter
itself, and thus there is no need for the statistical filter.  If that's
all you're going to do, just use the procedural filter exclusively.  A
statistical filter can only be accurate if it is trained by something or
someone even more accurate.  The recipient is 100% accurate, as they
ultimately decide the final value of an email, even if they are
inconsistent in doing so.  You may want to receive emails from Bob this
month, but then not receive his emails next month.  This decision may
even be irrational or later deemed a mistake.  You're still 100%
accurate at the time you make that decision, and you inform your filter
to adjust to your wishes.

If that hypothetical executive told his secretary to screen out all
calls from Bob, but she sent them through anyway, he'd probably be
pretty pissed.  If he didn't mention the Bob situation to his secretary,
that's his failing, but not bad filtering.

> I agree that people should be responsible for their accuracy, and should 
> thus train the filter with their mail. but there are two cases that 
> cause me trouble here:
> - new users. waiting until the filter matures is not acceptable. using a 
> reference corpus or a reference wordlist is feasible. but I am not 
> certain this is the way to go. what if their mail is "too different"? 
> (more on this below).

Bogofilter "matures" very quickly.  I recommend starting with a blank
slate -- no reference corpus.  Users may have to train a few dozen spams
the first day as they arrive, but it'll be a one-time inconvenience.
Within a few days, it should be >90% accurate.  Within two weeks it
should be >99% accurate.

> - lazy (to stay polite:) users: I mean people who either don't train, or 
> those who train inconsistently (which may pollute their db)...

Inconsistent training may be perfectly acceptable.  Consider the case of
Bob above.  But even if it is wrong (e.g. an employee filters mail from
the boss), the user is ultimately responsible.  Telling the filter to
filter the wrong stuff is just as irresponsible as receiving an email
and ignoring it or deleting it.  There is no technical problem here.  If
this is an employee, perhaps they should be fired.  If it is a client,
then their complaints must be answered in kind.

If a client determines that they want filtering but do not want to
train, then you have little choice but to offer them a procedural filter
or some other automated method.  But the results will be less
impressive, and they should know that they'll get more errors.  I
personally have a client who does not want to train, and they are happy
with the results of using DNSBLs exclusively.  They would rather
personally screen 10-20% more spam than train a filter occasionally.
Not a choice I agree with, but it's their choice to make.

> (*) I personally have multiple accounts, used for different purposes. 
> and while a lot of spam is common, some accounts "attract" spam that 
> other accounts don't get. and ham is completely different (for some 
> addresses, a non french message is almost certainly spam. for others, a 
> french mail is almost certainly spam... etc). In this particular case (I 
> admit it is particular), global corpuses don't seem to be the right 
> choice. but I may be wrong.

That's why I suggested you could use individual wordlists trained with
each user's own outgoing mail.  You could also create multiple honeypots,
each posted to specific sites more likely to attract the type of spam
the recipient would probably get.
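
A minimal sketch of the per-user idea, in Python.  The directory layout
under WORDLIST_ROOT and all three function names are hypothetical; the
only Bogofilter specifics relied on are the -d option (wordlist
directory) and the -n option (register the message as non-spam).

```python
import subprocess
from pathlib import Path

# Hypothetical layout: one wordlist directory per user under this root.
WORDLIST_ROOT = Path("/var/spool/bogofilter")


def wordlist_dir(user, root=WORDLIST_ROOT):
    """Return the per-user wordlist directory, rejecting unsafe names."""
    if not user or "/" in user or user.startswith("."):
        raise ValueError(f"unsafe user name: {user!r}")
    return Path(root) / user


def ham_training_command(user, root=WORDLIST_ROOT):
    """Build the bogofilter call that registers stdin as ham for one user.

    -d selects the wordlist directory, -n registers the message as non-spam.
    """
    return ["bogofilter", "-d", str(wordlist_dir(user, root)), "-n"]


def train_outgoing(user, raw_message, root=WORDLIST_ROOT):
    """Feed one outgoing message (bytes) to the sender's own wordlist."""
    subprocess.run(ham_training_command(user, root), input=raw_message, check=True)
```

How the submission service hands each outgoing message and its
authenticated sender to train_outgoing() is deliberately left out; that
part depends on the mail server setup.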

> I use postfix with 587 as the submission port, so this is trivial to do. 
> Now I fear "automated replies" and undetected viruses. Assuming that 
> these are marginal, can I just ignore them?

You would certainly want to scan for viruses if you don't trust your
users to have virus scanners on their machines or you don't trust them
to use them properly or consistently.  However, if you use individual
wordlists, only this user's ham scores would be affected.  And if you
use a global wordlist, you'd probably get just as many spams as hams,
because virus emails are both sent and received, and thus they would be
labelled as unsure.  In any event, you could always tweak those tokens
manually if people complained.  Remember that the automated approach has
its drawbacks.  And you may even want to prefilter a little bit in this
case with some rules of your own (e.g. for automated replies).  However, I
can't see classifying automated replies as ham being a bad thing.
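
The arithmetic behind the "unsure" outcome can be sketched as follows.
This uses Robinson-style smoothing of a single token's spam probability;
it is a simplified illustration rather than Bogofilter's actual code, and
the function name and the values of s and x are assumptions.

```python
def token_spamicity(spam_count, ham_count, spam_msgs, ham_msgs, s=1.0, x=0.5):
    """Smoothed spam probability for one token.

    spam_count / ham_count: spam / ham messages containing the token.
    spam_msgs / ham_msgs:   total spam / ham messages trained.
    s (weight of the prior) and x (score assumed for an unseen token)
    are illustrative values only.
    """
    spam_rate = spam_count / spam_msgs if spam_msgs else 0.0
    ham_rate = ham_count / ham_msgs if ham_msgs else 0.0
    if spam_rate + ham_rate == 0:
        return x  # token never seen: fall back to the prior
    p = spam_rate / (spam_rate + ham_rate)  # raw estimate from the counts
    n = spam_count + ham_count              # how much evidence backs it
    return (s * x + n * p) / (s + n)        # pull p toward x when n is small
```

A virus token seen in 20 of 1000 spams and 20 of 1000 hams comes out at
exactly 0.5, which pushes a message in neither direction.  The same
token seen only on the ham side, as in a per-user wordlist trained from
outgoing mail, comes out close to 0 instead.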

> What would be the easiest way to set this up? I am thinking of 
> duplicating an account (or more) so that the results may be compared.

Sounds good to me.  Make sure to post your results to the list!

Tom