spam with random words

David Relson relson at osagesoftware.com
Mon Jan 12 19:40:06 CET 2004


On Mon, 12 Jan 2004 18:20:30 -0000
pgb at adelard.com wrote:

> On 12 Jan 2004 at 12:42, Boris 'pi' Piwinger wrote:
> 
> > No. Those random words will show up in many messages (good
> > and bad), so they are only moved slightly to spammish. But
> > that is the whole idea about statistics.
> > 
> 
> Also some (most?) random words will be extremely unlikely for a given 
> user context, e.g. words like "embroidery" so these will be a good 
> indicator of spam if they are ever used again. 
> 
> Previously unseen random words will be ignored  if mindev is set to a 
> suitable value (e.g. 0.15 to 0.2)

Bogofilter gives a score of 0.415 (the ROBX value) to words it hasn't
seen before.  That is only 0.085 away from the neutral 0.5, which is
less than the default min_dev of 0.1, so such words are ignored when
the message is scored.  After a spam message with new words is
registered as spam, the next time any of those new words is seen,
bogofilter will recognize them as spam related.  Think of it as a free
pass the first time and a red flag the second time :-)
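
To make the arithmetic concrete, here is a rough Python sketch of the
idea, not bogofilter's actual code.  The 0.415 and 0.1 come from the
paragraph above; the ROBS value, the function names, the >= comparison
and the simplified spam ratio (which assumes equally sized spam and
good corpora) are my assumptions for illustration only.

```python
ROBX = 0.415    # score given to a never-seen word (from the text above)
MIN_DEV = 0.1   # default min_dev (from the text above)
ROBS = 0.01     # assumed weight of the ROBX prior; illustrative only


def word_score(spam_count, good_count, robx=ROBX, robs=ROBS):
    """Robinson-style smoothed score for a single word.

    Simplification (assumption): the raw spam ratio uses plain counts,
    which only makes sense if the spam and good corpora are the same size.
    """
    n = spam_count + good_count
    if n == 0:
        # Never seen before: nothing but the prior to go on.
        return robx
    p = spam_count / n
    return (robs * robx + n * p) / (robs + n)


def is_counted(score, min_dev=MIN_DEV):
    """True if the word is far enough from the neutral 0.5 to be used."""
    return abs(score - 0.5) >= min_dev


first_time = word_score(0, 0)    # word not in the wordlist yet
second_time = word_score(1, 0)   # after one spam containing it was registered

print(first_time, is_counted(first_time))
print(second_time, is_counted(second_time))
```

With these assumed values, the unseen word stays inside the min_dev band
and is skipped, while the same word after a single spam registration
lands close to 1.0 and is counted.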

Truly random words will never be seen again, so they're nothing to worry
about.  Spelling errors intended to fool rule-based filters may work
once, but training lets bogofilter catch them in later messages.

Moral:  don't worry about really short messages that just have URLs.
The messages also have lots of info in their headers and, with a bit of
training, bogofilter will start classifying them correctly (using the
clues from the headers).



