The significance of word placement

David Relson relson at osagesoftware.com
Fri Oct 25 01:55:25 CEST 2002


At 05:47 PM 10/24/02, Boris 'pi' Piwinger wrote:

>Hi!
>
>Most of the spam not caught by bogofilter for me is in
>German. A significant portion of that is "women" asking for
>"dates". Strangely enough the subject is "Betreff" (which
>means: subject or topic). So this is a 100 percenter in my
>filter. Bogofilter often misses it, though.
>
>If it were significant whether a word shows up in the header
>or even the subject, bogofilter would know better. I have
>virtually hundreds of those messages with that exact
>subject. None of them is ham.
>
>pi


pi,

If I remember right, Mark Hoffman is working on some advanced tokenizing 
features.  One part of the project is to generate compound tokens like 
subject:betreff, from:xyz, etc.

An idea that just occurred to me is that the prefixes (like subject: or 
from:) could be recognized and bogofilter could apply different weights 
(importances) to such tokens.  I'm going to think out loud here for a minute.

Suppose the special tokens were very easy to recognize, perhaps by a special 
first character, like a colon.  With a colon, a special token would 
look like :subject:betreff.  In the config file, there could be a list of 
prefixes for special tokens and the weights to assign to such 
tokens.  An entry might look like "special_token:  subject  3", which would 
mean to give tokens like ":subject:betreff" triple importance when figuring 
out the spamicity.
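
To make the config idea concrete, here is a rough sketch in C of how such an 
entry could be parsed and looked up.  Nothing here exists in bogofilter: the 
"special_token" directive is only the proposal above, and the names 
parse_special_token(), token_weight(), and the fixed-size table are made up 
for illustration.  How the returned weight would enter the spamicity 
calculation is left open.

```c
#include <stdio.h>
#include <string.h>

#define MAX_SPECIAL 16   /* hypothetical limit on config entries */
#define MAX_PREFIX  32   /* hypothetical limit on prefix length  */

/* One hypothetical "special_token" config entry: a prefix and its weight. */
struct special_token {
    char   prefix[MAX_PREFIX];
    double weight;
};

static struct special_token special_tokens[MAX_SPECIAL];
static int special_count = 0;

/* Parse a line such as "special_token:  subject  3".
 * Returns 1 if the entry was stored, 0 if the line is not a valid entry
 * or the table is full. */
int parse_special_token(const char *line)
{
    char prefix[MAX_PREFIX];
    double weight;

    if (special_count >= MAX_SPECIAL)
        return 0;
    /* Whitespace in the format matches any run of blanks in the input. */
    if (sscanf(line, " special_token: %31s %lf", prefix, &weight) != 2)
        return 0;
    strcpy(special_tokens[special_count].prefix, prefix);
    special_tokens[special_count].weight = weight;
    special_count++;
    return 1;
}

/* Return the configured weight if the token has the form
 * ":<prefix>:<word>" with a listed prefix, otherwise 1.0. */
double token_weight(const char *token)
{
    const char *end;
    size_t len;
    int i;

    if (token[0] != ':')
        return 1.0;                     /* ordinary token */
    end = strchr(token + 1, ':');
    if (end == NULL)
        return 1.0;                     /* no second colon: ordinary */
    len = (size_t)(end - (token + 1));
    for (i = 0; i < special_count; i++) {
        if (strlen(special_tokens[i].prefix) == len &&
            strncmp(special_tokens[i].prefix, token + 1, len) == 0)
            return special_tokens[i].weight;
    }
    return 1.0;                         /* prefix not listed in config */
}
```

After parse_special_token("special_token:  subject  3"), a call to 
token_weight(":subject:betreff") returns 3.0, while an ordinary token like 
"betreff" or an unlisted prefix like ":from:xyz" gets the default of 1.0.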

The special first character would allow the query_needs_special_handling() 
function to very quickly determine whether a token _might_ or _might_not_ 
need further checking for special handling.  Storing the weights in the 
config file would allow each bogofilter site to select which header fields 
are of interest and how much importance to give them.  It would even be 
possible for the config file entry to have two weights - one to apply to 
probabilities from the spam word list and one to apply to probabilities from 
the non-spam list.
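
A rough sketch of those two pieces in C follows.  The signature of 
query_needs_special_handling() is a guess, and weighted_spamicity() is a 
made-up name.  The formula is only one possible reading of "apply a weight 
to each list's probability" (scale each side, then take the spam share of 
the total); it is not how bogofilter computes spamicity today.

```c
#include <stddef.h>

/* Cheap first-character test: only tokens that start with ':' can be
 * special, so every other token skips the prefix lookup entirely.
 * A result of 1 means "might be special", not "is special". */
int query_needs_special_handling(const char *token)
{
    return token != NULL && token[0] == ':';
}

/* Assumed combination rule: scale the spam-list and non-spam-list
 * probabilities by their own weights, then return the spam share.
 * With both weights at 1.0 this reduces to p_spam / (p_spam + p_good). */
double weighted_spamicity(double p_spam, double p_good,
                          double w_spam, double w_good)
{
    double s = w_spam * p_spam;
    double g = w_good * p_good;

    if (s + g == 0.0)
        return 0.5;          /* no evidence either way: stay neutral */
    return s / (s + g);
}
```

For a token seen equally often in both lists (say 0.25 and 0.25), weights of 
1 and 1 give 0.5, while a spam weight of 3 and a non-spam weight of 1 give 
0.75, so the same evidence pushes harder toward spam.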

Anyhow, the above ideas are what your message brings to my mind.

Comments anyone?

David





