Dealing with wordlist mails

Greg Louis glouis at dynamicro.on.ca
Wed Jan 28 13:36:40 CET 2004


On 20040128 (Wed) at 1313:18 +0100, Lars Clausen wrote:
> I saw on my run-through of bogofiltered mail today that a huge number of
> mails had a bunch of random (but not nonsense) words attached.  Many of
> these had bogosity of 0.50000, which is a bad sign, as some ham mails
> come over that.  
> 
> Thinking back to the original of bogofilter, is it not that only ham
> mails are likely to contain words that are specific to you?  When
> spammers send out wordlist spams, they put in a lot of words that are
> not known at all, so I'm guessing they are marked as
> neither-ham-nor-spam, thus tilting the mail towards the middle. 
> Shouldn't unknown words be considered slightly spammish, as they have
> never appeared in your ham?  Not a lot, as you'd want your friends to be
> able to introduce new words to you, but slightly?  Or is that just one
> of those tweakings that give poorer results?

David addressed this in a message to the list a couple of days ago. 
Briefly, most users have the min_dev parameter set to some value other
than zero, so a token's likelihood must differ from 0.5 by at least
that value to be included in scoring.  Unknown tokens are assigned the
value of robx, which for many of us lies within min_dev of 0.5, so
such tokens are excluded from scoring.

However, in a bogotune run I did just yesterday, I found that for my
recent mail a robx value of 0.55 and a min_dev of 0.02 worked best.
This matches your suggestion that unknowns be treated as "slightly
spammish" -- but not everyone will see the same effect.  A lot of
users find that very high min_dev values (which block not only
unknowns but most moderately-scored tokens) work well for them.
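
To make the interaction between robx and min_dev concrete, here is a
minimal Python sketch of the exclusion rule described above.  It is
not bogofilter's code: the function names are invented for
illustration, and the use of ">=" at the boundary is an assumption
taken from the "at least that value" wording.  The first pair of
values is the bogotune result quoted above; the second pair is
hypothetical, chosen only to show an unknown token being dropped.

```python
def token_is_scored(prob, min_dev):
    """Return True if a token's probability is far enough from 0.5 to count.

    Assumption: a token is kept when |prob - 0.5| >= min_dev.
    """
    return abs(prob - 0.5) >= min_dev


def unknown_token_is_scored(robx, min_dev):
    """An unknown token is assigned robx, so apply the same test to robx."""
    return token_is_scored(robx, min_dev)


# Values from the bogotune run: unknowns sit 0.05 above 0.5, which exceeds
# min_dev, so they take part in scoring as slightly spammish tokens.
print(unknown_token_is_scored(robx=0.55, min_dev=0.02))

# Hypothetical values: robx is only 0.02 from 0.5, well inside min_dev,
# so unknown tokens are ignored entirely.
print(unknown_token_is_scored(robx=0.52, min_dev=0.1))
```

With the first pair, a wordlist spam full of never-seen words is
nudged toward spam; with the second, those words have no influence on
the score at all.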

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



