Dealing with wordlist mails
relson at osagesoftware.com
Wed Jan 28 07:27:36 EST 2004
On Wed, 28 Jan 2004 13:13:18 +0100
Lars Clausen wrote:
> I saw on my run-through of bogofiltered mail today that a huge number
> of mails had a bunch of random (but not nonsense) words attached.
> Many of these had bogosity of 0.50000, which is a bad sign, as some
> ham mails come over that.
> Thinking back to the original of bogofilter, is it not that only ham
> mails are likely to contain words that are specific to you? When
> spammers send out wordlist spams, they put in a lot of words that are
> not known at all, so I'm guessing they are marked as
> neither-ham-nor-spam, thus tilting the mail towards the middle.
> Shouldn't unknown words be considered slightly spammish, as they have
> never appeared in your ham? Not a lot, as you'd want your friends to
> be able to introduce new words to you, but slightly? Or is that just
> one of those tweakings that give poorer results?
Good thoughts, but ...
Would you rather lose an important message because your spam filter
classified it wrong or would you rather have a few spam messages in your
inbox? The default score that bogofilter assigns to unknown and rarely
seen words is 0.415, which causes it to favor delivery of spam (rather
than loss of ham). Bogofilter also has a min_dev value so that it will
ignore words that score close to 0.5. Min_dev's default value is 0.1,
so bogofilter will ignore word scores between 0.4 and 0.6. Since 0.415
falls inside that range, unknown words are ignored by default.
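
To make the interaction of the two defaults concrete, here is a rough
Python sketch. It is not bogofilter's code: the function names, the
word list, and the per-word scores are invented for illustration. Only
the values 0.415 and 0.1 come from the paragraph above.

```python
# Illustrative sketch only; not bogofilter's actual implementation.
UNKNOWN_SCORE = 0.415  # default score for unknown words (from the text above)
MIN_DEV = 0.1          # default min_dev (from the text above)


def score_words(words, known_scores, unknown_score=UNKNOWN_SCORE):
    """Look up each word's score, falling back to the unknown-word default."""
    return {w: known_scores.get(w, unknown_score) for w in words}


def significant_scores(word_scores, min_dev=MIN_DEV):
    """Keep only words whose score is at least min_dev away from 0.5."""
    return {w: s for w, s in word_scores.items() if abs(s - 0.5) >= min_dev}


# Hypothetical wordlist: two words with strong scores, one near-neutral word.
known = {"viagra": 0.99, "bogofilter": 0.02, "the": 0.52}

# "zygote" and "quixotic" stand in for random words padded into a spam.
message = ["viagra", "zygote", "quixotic", "the", "bogofilter"]

scores = score_words(message, known)
kept = significant_scores(scores)
print(sorted(kept))  # ['bogofilter', 'viagra']
```

In this sketch the padded words get 0.415, land inside the ignored
range, and contribute nothing either way.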
In practice, random words in spam messages have little effect. If you
want more detail on how bogofilter classified a 0.500000 message, run it
with flags "-vv" and "-vvv". The FAQ has info on the output generated
with those flag settings.