Tricky

Tom Anderson tanderso at oac-design.com
Sun Feb 1 19:29:22 CET 2004


On Sun, 2004-02-01 at 08:32, David Relson wrote:
> With bogofilter's use of many tokens from each email in scoring, I've
> yet to see a problem caused by one or two misclassified tokens.

The original "Tricky" email was scored 0.0 and added to my database with
-u and I haven't seen any altered behavior from bogofilter because of
it.  I highly doubt the occasional spam on this list will have much if
any impact on anyone's database.  Simply training on the real spams
(also -u in my case) and errors you receive far outweighs the occasional
misclassification.  Sure, zipping a spam would probably be better, but
it's such a hassle and mostly unnecessary.

If I had received the same spam as the one quoted in "Tricky", probably
only that one message would be misclassified, and then only as unsure.
Training bogofilter (-s) on that spam would then completely counteract
the sway introduced by the quoted copy.  Therefore, no problems.  Train
on it twice if you really want to push it the other way.
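
A rough sketch of the arithmetic behind that, assuming Robinson-style
per-token smoothing.  The counts, the corpus sizes, and the s and x values
are invented for illustration and are not bogofilter's configured defaults;
token_score and replay are hypothetical names, not bogofilter functions.

```python
def token_score(spam_count, ham_count, n_spam, n_ham, s=1.0, x=0.5):
    """Robinson-style smoothed estimate of how spammy a token is.

    s (strength of the prior) and x (score of an unseen token) are
    illustrative values only.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never-seen token: fall back to the prior
    bad = spam_count / n_spam
    good = ham_count / n_ham
    p = bad / (bad + good)
    return (s * x + n * p) / (s + n)


def replay(spam_count, ham_count, n_spam=1000, n_ham=1000):
    """Score one token at three points in time:

    before   - the wordlist as it was
    swayed   - after a list message quoting the spam is registered as ham
    restored - after the real spam is then registered as spam
    """
    before = token_score(spam_count, ham_count, n_spam, n_ham)
    swayed = token_score(spam_count, ham_count + 1, n_spam, n_ham + 1)
    restored = token_score(spam_count + 1, ham_count + 1,
                           n_spam + 1, n_ham + 1)
    return before, swayed, restored


if __name__ == "__main__":
    examples = {"fresh token": (0, 0), "established token": (40, 0)}
    for name, counts in examples.items():
        before, swayed, restored = replay(*counts)
        print(f"{name}: before={before:.3f} "
              f"swayed={swayed:.3f} restored={restored:.3f}")
```

Under these assumptions a token seen only in that one spam returns to its
starting score after a single spam registration, while a token that already
has a long history barely moves in either direction.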

> I apologize if I've corrupted your wordlist.  There was no such

This talk of "corruption" or "damage" is a complete misnomer.  Such
words immediately bring to mind corrupt or damaged RDBMSs or filesystems
where the actual integrity of the data is at risk, which would in fact
be a problem.  But that's not the case here, as the database is
perfectly intact and functioning properly.  A better word might be
"sway", as in "quoting spams on the list tends to sway my
classifications toward false negatives".  No corruption, no damage.  You
can easily "sway" it back the other way through the normal training
process.

> Assuming you're using procmail, maildrop, etc, you could whitelist the
> mailing list with a simple test.  That'd keep list messages out of your
> wordlist.

While a whitelist wouldn't be a bad thing, I believe it is completely
unnecessary for the problem at hand.  If you're completely paranoid,
then by all means do it, but otherwise just use training to smooth out
any bumps caused by the list.  I prefer to adhere to the KISS philosophy
and not add any extra complexity to my setup.  I think others would
probably be well-advised to do the same.
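
For anyone who does want the whitelist test David describes, here is a
minimal sketch of the idea in Python rather than in procmail or maildrop
syntax.  The List-Id fragment is a guess, so check the headers of a real
list message first; is_list_mail is a hypothetical helper name.

```python
from email import message_from_string

# Hypothetical identifier; compare against the List-Id header of a
# message actually received from the list.
LIST_ID_FRAGMENT = "bogofilter"


def is_list_mail(raw_message, fragment=LIST_ID_FRAGMENT):
    """Return True if the message carries a List-Id header containing
    the given fragment (case-insensitive)."""
    msg = message_from_string(raw_message)
    list_id = str(msg.get("List-Id", ""))
    return fragment.lower() in list_id.lower()
```

Messages for which this returns True would be delivered directly and never
handed to bogofilter, so nothing from the list reaches the wordlist.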

Tom


